Organizations building on disconnected information are building on sand. Those leveraging ontology-driven knowledge graphs are building on rock—with the intelligence advantage to prove it.

We built an ontology-driven knowledge graph creation system in our intelligent content management platform, Iron Mountain InSight® DXP, which transforms disconnected documents into an intelligent network that understands your business. Its key components, covered below, are ontology-guided extraction, intelligent deduplication, and scalable parallel processing.
Bottom line: Knowledge graphs turn "find me information" into "show me the complete picture"—answering complex business questions that require connecting dots across dozens of documents in seconds, not days.
The revelation: Your documents already contain the answers. Knowledge graphs simply unveil the secret connections you couldn't see before.
Every organization stores millions of documents—contracts, HR files, compliance records, policies—each holding valuable knowledge. But that knowledge remains locked inside static files, disconnected from context.
Traditional document management can tell you where a file is. What it can't tell you is how those files relate to each other.
When users ask: "Which contracts are governed by this regulation, and which employees are on the affected projects?" teams launch days of manual research.
We faced this exact problem. Vast archives without structure. No semantic context. Just isolated files.
The realization hit hard: We didn't have a search problem. We had a “connection” problem.
Knowledge graphs change this by connecting entities (people, projects, contracts) through relationships (reports_to, governed_by, assigned_to). The result is a network where a question like the one above becomes a short graph traversal, as the sketch below illustrates.
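Here is a minimal, illustrative traversal using NetworkX (the analytical library described later in this post). The graph data, the contract and regulation names, and the covers relationship are invented for the example; only reports_to, governed_by, and assigned_to come from the ontology examples above.

```python
import networkx as nx

# Build a tiny illustrative graph. Entity names and the "covers" edge are made up.
G = nx.DiGraph()
G.add_edge("Contract-042", "GDPR", relation="governed_by")
G.add_edge("Contract-042", "Project Apollo", relation="covers")
G.add_edge("Maria Rodriguez", "Project Apollo", relation="assigned_to")
G.add_edge("Maria Rodriguez", "David Chen", relation="reports_to")

# "Which contracts are governed by this regulation, and which employees
# are on the affected projects?" becomes a short traversal:
contracts = [u for u, v, d in G.edges(data=True)
             if v == "GDPR" and d["relation"] == "governed_by"]
projects = [v for c in contracts
            for _, v, d in G.out_edges(c, data=True) if d["relation"] == "covers"]
employees = [u for p in projects
             for u, _, d in G.in_edges(p, data=True) if d["relation"] == "assigned_to"]
print(contracts, employees)  # ['Contract-042'] ['Maria Rodriguez']
```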
We needed to combine ontology-guided extraction, intelligent deduplication, and scalable processing to make knowledge graphs work at enterprise scale.
An ontology defines the business concepts, entities, and relationships that matter most. It's the "instruction manual" that tells AI how to think in your language.
Our ontology framework captures these business concepts, entity classes, and relationship types in a machine-readable form.
We store this as a W3C-compliant JSON-LD (OWL 2) schema, ensuring standard interoperability.
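To make that concrete, here is a minimal fragment in the same JSON-LD (OWL 2) style, written as a Python dict for readability. The namespace, class, and property names are illustrative, not our production schema.

```python
import json

# Illustrative JSON-LD (OWL 2) ontology fragment. Names are examples only.
ontology = {
    "@context": {
        "owl": "http://www.w3.org/2002/07/owl#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "ex": "http://example.com/ontology#",
    },
    "@graph": [
        {"@id": "ex:Employee", "@type": "owl:Class"},
        {"@id": "ex:Project", "@type": "owl:Class"},
        {
            "@id": "ex:reports_to",
            "@type": "owl:ObjectProperty",
            "rdfs:domain": {"@id": "ex:Employee"},  # allowed source type
            "rdfs:range": {"@id": "ex:Employee"},   # allowed target type
        },
        {
            "@id": "ex:assigned_to",
            "@type": "owl:ObjectProperty",
            "rdfs:domain": {"@id": "ex:Employee"},
            "rdfs:range": {"@id": "ex:Project"},
        },
    ],
}

print(json.dumps(ontology, indent=2))
```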
When processing documents, we retrieve the ontology and leverage it to guide LLM extraction.
This ensures every node and every edge aligns with your business vocabulary, drastically improving precision and consistency across millions of documents.
Example: When we read "Maria Rodriguez reports to David Chen, Senior Engineering Manager", the ontology ensures that both people are extracted as Employee entities and that reports_to is captured as a defined relationship between them.
Without ontology, this is just text. With ontology, it becomes structured knowledge—a reusable, queryable relationship inside the enterprise graph.
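As a rough sketch of what ontology-guided extraction can look like in practice, the prompt below exposes only the entity and relationship types the schema allows, so the model cannot emit anything outside the business vocabulary. The ontology dict is the fragment sketched earlier, and call_llm is a hypothetical stand-in for whichever model client is used.

```python
import json

def build_extraction_prompt(chunk_text: str, ontology: dict) -> str:
    """Constrain the LLM to the entity and relationship types the ontology defines."""
    classes = [n["@id"] for n in ontology["@graph"] if n.get("@type") == "owl:Class"]
    properties = [n["@id"] for n in ontology["@graph"] if n.get("@type") == "owl:ObjectProperty"]
    return (
        "Extract entities and relationships from the text below.\n"
        f"Allowed entity types: {', '.join(classes)}\n"
        f"Allowed relationship types: {', '.join(properties)}\n"
        'Return JSON of the form {"entities": [...], "relationships": [...]}.\n\n'
        f"Text: {chunk_text}"
    )

# Hypothetical usage (call_llm is a placeholder for the real model client):
# prompt = build_extraction_prompt(
#     "Maria Rodriguez reports to David Chen, Senior Engineering Manager.", ontology)
# result = json.loads(call_llm(prompt))
# -> {"entities": [{"name": "Maria Rodriguez", "type": "ex:Employee"}, ...],
#     "relationships": [{"source": "Maria Rodriguez", "type": "ex:reports_to",
#                        "target": "David Chen"}]}
```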
We process documents through a structured pipeline designed to preserve context at every step.
Document chunking: Each document gets divided into meaningful text segments.
Entity & relationship extraction: Large Language Models extract entities and relationships from each chunk, guided by the ontology.
Embedding generation: Vector embeddings represent semantic meaning for fast similarity matching.
The critical detail—metadata preservation: throughout this process, we preserve tracking information for every extraction: which document it came from, where exactly in that document it appears, how confident the extraction was, and a unique document identifier. This ensures every entity and relationship traces back to its origin for complete auditability.
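A minimal sketch of the kind of provenance record this implies; the field names and values are illustrative, not our internal schema.

```python
from dataclasses import dataclass

@dataclass
class ExtractedFact:
    """Provenance attached to every extracted entity and relationship (illustrative)."""
    document_id: str   # unique identifier of the source document
    chunk_index: int   # which chunk of the document produced the fact
    char_start: int    # exact character offsets inside that chunk
    char_end: int
    confidence: float  # extraction confidence reported by the model
    source_text: str   # the exact passage that created the connection

fact = ExtractedFact(
    document_id="doc-2024-001873",
    chunk_index=4,
    char_start=120,
    char_end=188,
    confidence=0.93,
    source_text="Maria Rodriguez reports to David Chen, Senior Engineering Manager.",
)
```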
Why this matters: When a customer questions a connection in the graph, we can show the exact paragraph in the source document that created it. No black boxes. No guessing.
Duplicate detection is where most knowledge graph systems fail. Without it, the same person appearing in ten documents becomes ten separate nodes—fragmenting insights and creating noise.
We built a multi-stage, ontology-aware deduplication system that balances precision with completeness.
The challenge: Recognize when two references describe the same real-world entity—while ensuring unrelated items never merge incorrectly. This isn't just matching strings; it's understanding semantic equivalence across varying document formats and naming conventions.
Our approach: If ten HR documents reference Maria Rodriguez working in Engineering, we unify them into one complete employee node—retaining every document reference and relationship connection.
The result: A clean, self-healing graph that stays accurate and connected as it scales—eliminating redundancy, consolidating context, and preserving every source link for full traceability.
The outcome: a single, accurate representation of reality—free of duplication and rich in context.
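The sketch below shows the general shape of such a pipeline under simplifying assumptions: candidates are first blocked by ontology type (an Employee is never compared with a Contract), then compared by embedding similarity, and merged nodes keep the union of their document references. The threshold and field names are illustrative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_merge_candidates(entities: list[dict], threshold: float = 0.9) -> list[tuple]:
    """Propose merges only within the same ontology type and above a similarity threshold."""
    candidates = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if a["type"] != b["type"]:
                continue  # ontology-aware blocking: different types never merge
            if cosine(a["embedding"], b["embedding"]) >= threshold:
                candidates.append((a["id"], b["id"]))
    return candidates

def merge(primary: dict, duplicate: dict) -> dict:
    """Consolidate context while preserving every source link for traceability."""
    primary["source_documents"] = sorted(
        set(primary["source_documents"]) | set(duplicate["source_documents"])
    )
    return primary
```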
After extraction and merging, we validate relationships against the ontology, checking that each relationship is of a defined type and connects entity types the schema allows.
This continuous validation ensures semantic integrity across the entire graph. Invalid relationships get caught before they pollute downstream analysis.
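One simple way to express that kind of check, assuming the JSON-LD fragment sketched earlier: derive the allowed domain and range for each relationship type and reject edges that violate them. This illustrates the idea; it is not our production validator.

```python
def allowed_edges(ontology: dict) -> dict:
    """Build {relationship_type: (domain, range)} from the ontology fragment."""
    rules = {}
    for node in ontology["@graph"]:
        if node.get("@type") == "owl:ObjectProperty":
            rules[node["@id"]] = (node["rdfs:domain"]["@id"], node["rdfs:range"]["@id"])
    return rules

def is_valid(rel: dict, entity_types: dict, rules: dict) -> bool:
    """Reject relationships whose type is undefined or whose endpoints break domain/range."""
    if rel["type"] not in rules:
        return False
    domain, rng = rules[rel["type"]]
    return entity_types[rel["source"]] == domain and entity_types[rel["target"]] == rng

# e.g. ex:reports_to must connect Employee -> Employee; an edge of that type
# pointing at a Project node would be caught here before entering the graph.
```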
Each finalized graph lives in two complementary formats:
Apache AGE (PostgreSQL): Our production-grade graph database supporting real-time traversal and Cypher-compatible queries. It's optimized for scalability, ensuring fast and reliable query performance across millions of entities and relationships.
NetworkX Serialized Graph (Pickle): Used by developers for rapid iteration, algorithmic experimentation, and advanced analysis.
Why dual storage? Operational teams need instant answers. Data scientists need to run complex graph algorithms without impacting production performance. Together, these enable both real-time reasoning and deep analysis, without duplication or reprocessing.
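An illustrative sketch of the two access paths, assuming a PostgreSQL instance with the AGE extension loaded and a serialized NetworkX graph on disk. The connection string, graph name, labels, and file path are placeholders.

```python
import pickle
import networkx as nx
import psycopg2  # assumes the Apache AGE extension is installed in PostgreSQL

# Operational path: Cypher-compatible query through Apache AGE.
conn = psycopg2.connect("dbname=insight user=graph")
with conn.cursor() as cur:
    cur.execute("LOAD 'age';")
    cur.execute('SET search_path = ag_catalog, "$user", public;')
    cur.execute("""
        SELECT * FROM cypher('knowledge_graph', $$
            MATCH (e:Employee)-[:reports_to]->(m:Employee {name: 'David Chen'})
            RETURN e.name
        $$) AS (employee agtype);
    """)
    print(cur.fetchall())

# Analytical path: the same graph serialized for offline experimentation.
with open("knowledge_graph.pkl", "rb") as fh:
    G: nx.DiGraph = pickle.load(fh)
print(nx.pagerank(G))  # e.g. rank entities by connectivity for exploratory analysis
```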
Processing massive document sets demands an architecture built for performance. We deliver this through multi-level parallel processing:
Asset-level parallelism: Multiple documents processed simultaneously
Chunk-level parallelism: Within each document, chunks processed in parallel
Operation-level parallelism: Concurrent extraction, deduplication, and graph creation
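A simplified sketch of the first two levels using Python's asyncio; the per-chunk function is a placeholder for the real extraction and embedding calls, and the concurrency limit is an illustrative knob.

```python
import asyncio

async def process_chunk(chunk: str) -> dict:
    """Placeholder for extraction + embedding of a single chunk."""
    await asyncio.sleep(0)  # stands in for model and database calls
    return {"chunk": chunk, "entities": []}

async def process_document(chunks: list[str]) -> list[dict]:
    # Chunk-level parallelism: every chunk of a document in flight at once.
    return await asyncio.gather(*(process_chunk(c) for c in chunks))

async def process_corpus(documents: dict[str, list[str]], max_parallel_docs: int = 8) -> dict:
    # Asset-level parallelism: several documents at a time, bounded by a
    # semaphore so model quotas and database connections are not exhausted.
    sem = asyncio.Semaphore(max_parallel_docs)

    async def bounded(doc_id: str, chunks: list[str]):
        async with sem:
            return doc_id, await process_document(chunks)

    results = await asyncio.gather(*(bounded(d, c) for d, c in documents.items()))
    return dict(results)

# asyncio.run(process_corpus({"doc-1": ["chunk a", "chunk b"], "doc-2": ["chunk c"]}))
```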
We also use batch & asynchronous database operations to reduce latency. Connection pooling and schema caching prevent bottlenecks. Vector indexing enables fast similarity detection across embeddings.
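For the database side, a connection pool plus batched writes is one common way to get this effect; the sketch below uses psycopg2's pool and execute_batch with placeholder connection details, table, and columns.

```python
from psycopg2.pool import ThreadedConnectionPool
from psycopg2.extras import execute_batch

# Illustrative only: a shared pool avoids per-request connection setup, and
# batched inserts cut round trips. The DSN, table, and columns are placeholders.
pool = ThreadedConnectionPool(minconn=2, maxconn=16, dsn="dbname=insight user=graph")

def write_entities(rows: list[tuple]) -> None:
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            execute_batch(
                cur,
                "INSERT INTO staging_entities (entity_id, name, type) VALUES (%s, %s, %s)",
                rows,
                page_size=500,  # rows per round trip
            )
        conn.commit()
    finally:
        pool.putconn(conn)
```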
This design ensures both high throughput and semantic precision—scaling linearly from thousands to millions of nodes while maintaining accuracy and consistency.
Moving to this ontology-driven approach transformed how we work with enterprise information.
From manual to instant: Users identify issues instantly instead of manually searching through thousands of documents.
Self-maintaining accuracy: Built-in deduplication and validation ensure accuracy as data grows. The graph cleans itself.
Complete audit trail: Every connection is backed by document metadata—which source document, exact location within that document, and unique tracking identifiers. Zero "source unclear" findings in external audits. When questioned, we show the exact source paragraph.
AI-ready foundation: Ontology-guided data provides reliable inputs for downstream AI and analytics. Structured, connected knowledge prevents hallucination and enables intelligent agents.
Cross-domain intelligence: The same architectural advantages apply across HR, compliance, contracts, records management, and customer relationships.
Ontology-driven precision: Domain frameworks teach AI your business language, transforming extraction from "pretty good" to "business ready."
Self-maintaining quality: Multi-stage deduplication keeps graphs clean and consolidated even as they scale to millions of entities.
Enterprise-scale architecture: Parallel processing and intelligent resource management handle document volumes from thousands to millions.
Strategic flexibility: Dual storage approach supports both real-time operational queries and deep analytical processing.
Complete traceability: Every connection is explainable, making audit readiness and compliance transparency automatic.
Get a FREE consultation today!
