From document chaos to connected intelligence: how Iron Mountain's ontology-driven knowledge graphs transform unstructured information into enterprise insight



Sushant Tiwari
Senior Machine Learning Engineer, Digital Business Unit, Iron Mountain
December 16, 2025 | 7 mins

We built an ontology-driven knowledge graph creation system in our intelligent content management platform, Iron Mountain InSight® DXP, which transforms disconnected documents into an intelligent network that understands your business. Key components include:

  • Ontology-guided intelligence: Domain-specific frameworks teach AI to understand your business language, entities, and relationships—dramatically improving accuracy
  • Self-cleaning architecture: Multi-stage deduplication prevents information bloat, ensuring your graph stays clean even as it scales
  • Enterprise-scale performance: Parallel processing architecture handles massive document volumes without compromising speed
  • Flexible query infrastructure: Dual storage approach enables both real-time graph traversal and advanced analytical processing

Bottom line: Knowledge graphs turn "find me information" into "show me the complete picture"—answering complex business questions that require connecting dots across dozens of documents in seconds, not days.

The revelation: Your documents already contain the answers. Knowledge graphs simply unveil the secret connections you couldn't see before.

 

The problem we couldn't ignore

Every organization stores millions of documents—contracts, HR files, compliance records, policies—each holding valuable knowledge. But that knowledge remains locked inside static files, disconnected from context.

Traditional document management can tell you where a file is. What it can't tell you is how those files relate to each other.

When users ask: "Which contracts are governed by this regulation, and which employees are on the affected projects?" teams launch days of manual research.

We faced this exact problem. Vast archives without structure. No semantic context. Just isolated files.

The realization hit hard: We didn't have a search problem. We had a “connection” problem.

Knowledge graphs change this by connecting entities (people, projects, contracts) and relationships (reports_to, governed_by, assigned_to). The result:

  • Faster, more accurate answers to complex, cross-document questions
  • Transparent reasoning paths for auditability and compliance
  • AI-ready structure that grounds automation and analytics
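
To make this concrete, here is a toy sketch of how the earlier question ("Which contracts are governed by this regulation, and which employees are on the affected projects?") collapses into a single traversal. The entity names, relationship labels, and the use of NetworkX here are purely illustrative, not our production implementation:

```python
# A toy knowledge graph with hypothetical entities and relationships.
import networkx as nx

g = nx.DiGraph()
# Edges that would normally be extracted from many separate documents.
g.add_edge("Contract-042", "GDPR", rel="governed_by")
g.add_edge("Contract-042", "Project-Apollo", rel="covers")
g.add_edge("Maria Rodriguez", "Project-Apollo", rel="assigned_to")
g.add_edge("David Chen", "Project-Apollo", rel="assigned_to")

# "Which contracts are governed by this regulation, and which
# employees are on the affected projects?"
contracts = [u for u, v, d in g.edges(data=True)
             if v == "GDPR" and d["rel"] == "governed_by"]
projects = [v for c in contracts
            for _, v, d in g.out_edges(c, data=True) if d["rel"] == "covers"]
employees = [u for p in projects
             for u, _, d in g.in_edges(p, data=True) if d["rel"] == "assigned_to"]

print(contracts, projects, employees)
# ['Contract-042'] ['Project-Apollo'] ['Maria Rodriguez', 'David Chen']
```

What takes days of manual cross-referencing becomes a few hops across connected nodes.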

We needed to combine ontology-guided extraction, intelligent deduplication, and scalable processing to make knowledge graphs work at enterprise scale.

 

The foundation: teaching AI our business language

An ontology defines the business concepts, entities, and relationships that matter most. It's the "instruction manual" that tells AI how to think in your language.

Our ontology framework includes:

  • Entity classes & attributes: Employee (employee_id, hire_date), Contract (contract_id, expiration_date)
  • Relationship patterns: Employee works_in Department, Contract governed_by Policy
  • Source document type links: e.g., the "state_of_residence" attribute appears in the "Driver License" document type

We store this as a W3C-compliant JSON-LD (OWL 2) schema, ensuring standard interoperability.
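
For illustration, here is a minimal fragment of what such a JSON-LD (OWL 2) schema can look like, expressed as a Python dict. The namespace and terms are hypothetical, not our actual schema:

```python
# Illustrative JSON-LD/OWL 2 fragment: two entity classes, one
# relationship, one attribute. "ex:" is a hypothetical namespace.
ontology = {
    "@context": {
        "owl": "http://www.w3.org/2002/07/owl#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "ex": "http://example.com/ontology#",
    },
    "@graph": [
        {"@id": "ex:Employee", "@type": "owl:Class"},
        {"@id": "ex:Department", "@type": "owl:Class"},
        {   # Relationship pattern: Employee works_in Department.
            "@id": "ex:works_in",
            "@type": "owl:ObjectProperty",
            "rdfs:domain": {"@id": "ex:Employee"},
            "rdfs:range": {"@id": "ex:Department"},
        },
        {   # Entity attribute: employee_id on Employee.
            "@id": "ex:employee_id",
            "@type": "owl:DatatypeProperty",
            "rdfs:domain": {"@id": "ex:Employee"},
        },
    ],
}
```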

Ontology as the filter

When processing documents, we retrieve the ontology and leverage it to guide LLM extraction:

  • The LLM identifies only entities and relationships defined in the ontology—reducing noise and false positives
  • It uses precise ontology keys for labeling to standardize naming
  • It recognizes relationship types like reports_to, works_on, and governed_by from domain-specific mappings

This ensures every node and every edge aligns with your business vocabulary, drastically improving precision and consistency across millions of documents.
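
A simplified sketch of what ontology-guided prompting can look like follows; the function, prompt wording, and vocabulary lists are illustrative assumptions, not our production prompt:

```python
# Inject the ontology's allowed classes and relationships into the
# extraction prompt so the LLM only extracts in-vocabulary facts.
def build_extraction_prompt(chunk_text, entity_classes, relationship_types):
    return (
        "Extract entities and relationships from the text below.\n"
        f"Only use these entity classes: {', '.join(entity_classes)}.\n"
        f"Only use these relationship types: {', '.join(relationship_types)}.\n"
        "Ignore anything outside this vocabulary. "
        "Return JSON with 'entities' and 'relationships'.\n\n"
        f"Text: {chunk_text}"
    )

prompt = build_extraction_prompt(
    "Maria Rodriguez reports to David Chen, Senior Engineering Manager.",
    entity_classes=["Employee", "Department", "Role"],
    relationship_types=["reports_to", "works_in", "has_role"],
)
```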

Example: When we read "Maria Rodriguez reports to David Chen, Senior Engineering Manager", the ontology ensures:

  • Maria Rodriguez → Employee class
  • David Chen → Employee class
  • reports_to → valid relationship type between Employee nodes
  • Senior Engineering Manager → recognized as a Role entity linked via has_role

Without ontology, this is just text. With ontology, it becomes structured knowledge—a reusable, queryable relationship inside the enterprise graph.
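
For illustration, the same sentence rendered as structured knowledge; the identifiers and field names here are hypothetical:

```python
# Entities resolved to ontology classes, with hypothetical IDs.
entities = [
    {"id": "emp:maria_rodriguez", "class": "Employee", "name": "Maria Rodriguez"},
    {"id": "emp:david_chen", "class": "Employee", "name": "David Chen"},
    {"id": "role:senior_eng_mgr", "class": "Role", "name": "Senior Engineering Manager"},
]
# Relationships expressed as (subject, relationship_type, object) triples.
relationships = [
    ("emp:maria_rodriguez", "reports_to", "emp:david_chen"),
    ("emp:david_chen", "has_role", "role:senior_eng_mgr"),
]
```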

 

Deep dive 1: extraction and the metadata trail

We process documents through a structured pipeline designed to preserve context at every step.

Document chunking: Each document gets divided into meaningful text segments.

Entity & relationship extraction: Large Language Models extract entities and relationships from each chunk, guided by the ontology.

Embedding generation: Vector embeddings represent semantic meaning for fast similarity matching.

The critical detail—metadata preservation: Throughout this process, we preserve key tracking information: which document each fact came from, where exactly in that document it appears, how confident the extraction was, and a unique document identifier. This ensures every entity and relationship traces back to its origin for complete auditability.

Why this matters: When a customer questions a connection in the graph, we can show the exact paragraph in the source document that created it. No black boxes. No guessing.
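
In practice, every extracted triple travels with its provenance. A minimal sketch of what such a record can look like; the field names and values are assumptions, not the platform's actual schema:

```python
# One extracted relationship with its full provenance trail attached.
extraction = {
    "subject": "emp:maria_rodriguez",
    "predicate": "reports_to",
    "object": "emp:david_chen",
    "provenance": {
        "document_id": "doc-8f3a",   # unique document identifier
        "chunk_index": 12,           # which segment of the document
        "char_span": [1048, 1113],   # exact location of the evidence
        "confidence": 0.94,          # extraction confidence score
    },
}
```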

 

Deep dive 2: the self-cleaning process

Duplicate detection is where most knowledge graph systems fail. Without it, the same person appearing in ten documents becomes ten separate nodes—fragmenting insights and creating noise.

We built a multi-stage, ontology-aware deduplication system that balances precision with completeness.

The challenge: Recognize when two references describe the same real-world entity—while ensuring unrelated items never merge incorrectly. This isn't just matching strings; it's understanding semantic equivalence across varying document formats and naming conventions.

Our approach: If ten HR documents reference Maria Rodriguez working in Engineering, we unify them into one complete employee node—retaining every document reference and relationship connection.

The result: A clean, self-healing graph that stays accurate and connected as it scales—eliminating redundancy, consolidating context, and preserving every source link for full traceability.

The outcome: a single, accurate representation of reality—free of duplication and rich in context.
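
A simplified sketch of a staged entity-resolution check: the stages mirror the approach described above, but the threshold, field names, and ordering are illustrative assumptions, not the production values:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_entity(e1, e2, threshold=0.92):
    # Stage 1: entities of different ontology classes never merge.
    if e1["class"] != e2["class"]:
        return False
    # Stage 2: cheap normalized-name matching catches the easy cases.
    if e1["name"].strip().lower() == e2["name"].strip().lower():
        return True
    # Stage 3: fall back to semantic similarity of the embeddings.
    return cosine(e1["embedding"], e2["embedding"]) >= threshold

def merge(e1, e2):
    # Merging keeps every source reference, so traceability survives.
    e1["sources"] = e1.get("sources", []) + e2.get("sources", [])
    return e1
```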

 

Deep dive 3: validation and the ontology safety net

After extraction and merging, we validate relationships using the ontology:

  • Confirms that each edge (e.g., Employee → works_in → Department) matches an allowed pattern
  • Flags invalid or incomplete relationships automatically

This continuous validation ensures semantic integrity across the entire graph. Invalid relationships get caught before they pollute downstream analysis.
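
A minimal sketch of pattern-based edge validation; the allowed patterns and function are illustrative:

```python
# Every edge must match an allowed (subject_class, relationship,
# object_class) pattern drawn from the ontology.
ALLOWED = {
    ("Employee", "works_in", "Department"),
    ("Employee", "reports_to", "Employee"),
    ("Contract", "governed_by", "Policy"),
}

def validate_edge(subj_class, rel, obj_class):
    if (subj_class, rel, obj_class) not in ALLOWED:
        # Flagged for review instead of entering the graph.
        return False, f"invalid pattern: {subj_class} -{rel}-> {obj_class}"
    return True, "ok"

print(validate_edge("Employee", "works_in", "Department"))  # (True, 'ok')
print(validate_edge("Employee", "governed_by", "Policy"))   # flagged
```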

 

Deep dive 4: dual storage—operational speed and analytical depth

Each finalized graph lives in two complementary formats:

Apache AGE (PostgreSQL): Our production-grade graph database supporting real-time traversal and Cypher-compatible queries. It's optimized for scalability, ensuring fast and reliable query performance across millions of entities and relationships.

NetworkX Serialized Graph (Pickle): Used by developers for rapid iteration, algorithmic experimentation, and advanced analysis.

Why dual storage? Operational teams need instant answers. Data scientists need to run complex graph algorithms without impacting production performance. Together, these enable both real-time reasoning and deep analysis, without duplication or reprocessing.
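
A sketch of the two access paths side by side; the graph name, connection string, and file path are hypothetical, and the Apache AGE call follows its documented cypher()-over-SQL pattern:

```python
import pickle
import psycopg2  # assumes PostgreSQL with the Apache AGE extension

# Operational path: real-time Cypher over Apache AGE.
conn = psycopg2.connect("dbname=graphdb user=app")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("LOAD 'age';")
    cur.execute('SET search_path = ag_catalog, "$user", public;')
    cur.execute("""
        SELECT * FROM cypher('enterprise_graph', $$
            MATCH (e:Employee)-[:works_in]->(d:Department {name: 'Engineering'})
            RETURN e.name
        $$) AS (employee agtype);
    """)
    rows = cur.fetchall()

# Analytical path: the same graph as a serialized NetworkX object,
# free for expensive offline algorithms without touching production.
with open("enterprise_graph.pkl", "rb") as f:  # hypothetical path
    g = pickle.load(f)
```

Because AGE runs Cypher inside ordinary SQL, the operational path stays a plain PostgreSQL workload, while the pickled NetworkX copy gives data scientists an isolated sandbox.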

 

Built for scale: the parallel processing architecture

Processing massive document sets demands an architecture built for performance. We deliver this through multi-level parallel processing:

Asset-level parallelism: Multiple documents processed simultaneously

Chunk-level parallelism: Within each document, chunks processed in parallel

Operation-level parallelism: Concurrent extraction, deduplication, and graph creation

We also use batched and asynchronous database operations to reduce latency. Connection pooling and schema caching prevent bottlenecks. Vector indexing enables fast similarity detection across embeddings. A simplified sketch of the asset- and chunk-level parallelism follows; the pool sizes, chunking, and stub corpus are illustrative:
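
```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(doc: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking as a stand-in for semantic segmentation.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def process_chunk(chunk: str) -> dict:
    # Placeholder for LLM extraction + embedding of one chunk.
    return {"chunk": chunk[:30], "entities": []}

def process_document(doc: str) -> list[dict]:
    # Chunk-level parallelism within a single document.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(process_chunk, split_into_chunks(doc)))

documents = ["contract text ...", "hr file text ..."]  # stand-in corpus
# Asset-level parallelism across documents.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_document, documents))
```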

This design ensures both high throughput and semantic precision—scaling linearly from thousands to millions of nodes while maintaining accuracy and consistency.

 

The payoff: what we can see now

Moving to this ontology-driven approach transformed how we work with enterprise information.

From manual to instant: Users identify issues instantly instead of manually searching through thousands of documents.

Self-maintaining accuracy: Built-in deduplication and validation ensure accuracy as data grows. The graph cleans itself.

Complete audit trail: Every connection is backed by document metadata—which source document, exact location within that document, and unique tracking identifiers. Zero "source unclear" findings in external audits. When questioned, we show the exact source paragraph.

AI-ready foundation: Ontology-guided data provides reliable inputs for downstream AI and analytics. Structured, connected knowledge prevents hallucination and enables intelligent agents.

Cross-domain intelligence: The same architectural advantages apply across HR, compliance, contracts, records management, and customer relationships.

 

Key takeaways

Ontology-driven precision: Domain frameworks teach AI your business language, transforming extraction from "pretty good" to "business ready."

Self-maintaining quality: Multi-stage deduplication keeps graphs clean and consolidated even as they scale to millions of entities.

Enterprise-scale architecture: Parallel processing and intelligent resource management handle document volumes from thousands to millions.

Strategic flexibility: Dual storage approach supports both real-time operational queries and deep analytical processing.

Complete traceability: Every connection is explainable, making audit readiness and compliance transparency automatic.

 

The bottom line

As AI agents and intelligent automation move from experimental to essential, knowledge graphs shift from optional enhancement to critical infrastructure.
 
Organizations building on disconnected information are building on sand. Those leveraging ontology-driven knowledge graphs are building on rock—with the intelligence advantage to prove it.
 
Your documents already contain the answers. Knowledge graphs simply unveil the secret connections you couldn't see before.

 
