Documents as Data: Why Structure Is Meaning (Updated 2026)
A PDF is not text. It is a knowledge graph waiting to be read correctly.
**Article Id:** 2
**Series Position:** 2
**Series Name:** Verifiable AI Architecture
**Title:** Documents as Data: Why Structure Is Meaning
**Slug:** documents-as-data-structure-is-meaning
**Status:** Published
**Published:** 2026-04
**Description:**
A PDF is not text. It is a knowledge graph waiting to be read correctly. Learn the four stages of document understanding and why skipping any of them causes AI to fail.
**Summary:**
Document structure is meaning. Stripping it causes hallucination. Learn why structure-aware parsing with Docling enables trustworthy AI.
**Author:** Elyas Karbouch
**Author Url:** https://karbouch.substack.com
**Og Title:** Your AI can't read a table. Here's why structure is meaning.
**Og Description:**
OCR gives you letters. Layout gives you regions. Logical structure gives you meaning. Without all four stages, your AI is reading noise.
**Canonical:** https://karbouch.substack.com/p/documents-as-data-structure-is-meaning
**Twitter Creator:** @elyaskarbouch
**Tags:**
- document-intelligence
- document-structure
- knowledge-graphs
- pdf-parsing
- verifiable-ai
**Keywords Primary:** document structure AI meaning
**Keywords Secondary:**
- document intelligence four stages OCR
- structured vs unstructured data LLM
- documents as knowledge graphs
- JSON vs Markdown AI pipeline
- document parsing hallucination prevention
**Word Count:** ~6500
**Reading Time Minutes:** 22
**GitHub Repo:** https://github.com/elyas-karbouch/docling-examples
**External Links:**
- [Docling on GitHub](https://github.com/DS4SD/docling)
**Previous Article:**
- [Docling: Structure-Aware PDF Parsing](https://karbouch.substack.com/p/docling-structure-aware-pdf-parsing)
**Next Article:**
- [TOON Format: Cut LLM Token Costs by 60%](https://karbouch.substack.com/p/toon-format-token-efficient-serialization)
Documents as Data: Why Structure Is Meaning
Here is a number: $1,100,000.
What does it represent? You don’t know. Neither does your AI.
Without context — without the table header above it, the column label to its left, the document section it lives in — that number is trivia. It could be revenue. It could be debt. It could be a fine. Handed to an LLM as raw text, it is nothing more than a character sequence that looks like money.
Now here is the same number in its original context:
Document: Acme Annual Report 2023
Section: Financial Highlights
Table: Year-over-Year Performance
Metric       │ 2023          │ 2022
─────────────┼───────────────┼───────────────
Revenue      │ $5,200,000    │ $4,600,000
Net Profit   │ $1,100,000    │ $950,000
Now the number means something. It is the 2023 net profit figure for Acme Inc., representing a 15.8% increase over the prior year, found in the Financial Highlights section of their annual report.
That transformation — from character sequence to structured knowledge — is what this article is about. And the gap between those two representations is exactly where AI systems fail.
The Problem Nobody Talks About
Most AI pipelines treat document parsing as a solved problem. You upload a PDF, a library extracts the text, you send that text to an LLM. Simple.
But this is the equivalent of scanning a book, running it through OCR, feeding the output to a researcher, and wondering why they keep getting the answers wrong. The researcher is doing their best — but you handed them character soup instead of a book.
What gets lost in naive text extraction:
┌─────────────────────────────────────────────────────────────────────┐
│ WHAT NAIVE EXTRACTION LOSES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ TABLE STRUCTURE Rows, columns, headers, cell relationships │
│ → $1.1M net profit becomes a floating number │
│ │
│ HEADING HIERARCHY Which section each paragraph belongs to │
│ → Clause 4.3 disconnected from "Termination" │
│ │
│ READING ORDER Multi-column layouts, sidebars, callouts │
│ → Text from column 2 interrupts column 1 │
│ │
│ ELEMENT ROLES What type of content each block is │
│ → Footnotes confused with body text │
│ │
│ CROSS-REFERENCES "See Appendix G" — now a dead end │
│ → The answer exists, but the path is broken │
│ │
│ SEMANTIC SCOPE Which facts belong to which entities │
│ → Revenue in 2023 vs revenue in 2022 │
│ mixed into the same embedding │
└─────────────────────────────────────────────────────────────────────┘
These are not minor inconveniences. Each one is a vector for hallucination. And every one of them is caused by the same mistake: treating a document as text when it is actually a structured artifact.
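To make the first item in that list concrete, here is a minimal sketch of how a table's shape disappears once it is flattened. The two grids and the `flatten` helper are illustrative inventions, not the behaviour of any particular extraction library.

```python
# Two hypothetical grids that hold the same cells in the same reading order
# but with different row/column boundaries.
table_2x3 = [
    ["Revenue", "5200000", "4600000"],
    ["Net Profit", "1100000", "950000"],
]
table_3x2 = [
    ["Revenue", "5200000"],
    ["4600000", "Net Profit"],
    ["1100000", "950000"],
]


def flatten(table: list[list[str]]) -> str:
    """Join every cell in reading order, as a naive text extractor would."""
    return " ".join(cell for row in table for cell in row)


flat_a = flatten(table_2x3)
flat_b = flatten(table_3x2)

# Different structures, identical text: the boundaries are gone.
print(flat_a == flat_b)  # True
print(flat_a)            # Revenue 5200000 4600000 Net Profit 1100000 950000
```

Once the text is all you have, nothing in it says which of the two tables it came from.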
What a Document Really Is
A document is not a sequence of words. It is a hierarchical knowledge structure — a tree of semantically typed elements, each with a role, a position in a hierarchy, and relationships to other elements.
A well-parsed annual report looks like this in its true form:
Document: Acme_Annual_Report_2023.pdf
│
├── Section: Executive Summary  [heading level 1]
│   ├── Paragraph: "2023 was a strong year..."  [narrative]
│   └── Paragraph: "APAC expansion drove..."  [narrative]
│
├── Section: Financial Highlights  [heading level 1]
│   ├── Table: Year-over-Year Performance  [structured fact]
│   │   ├── Row: ["Revenue", "$5,200,000", "$4,600,000"]
│   │   └── Row: ["Net Profit", "$1,100,000", "$950,000"]
│   └── Paragraph: "Revenue growth driven by..."  [narrative]
│
├── Section: Risk Factors  [heading level 1]
│   ├── Sub-section: Market Risk  [heading level 2]
│   │   └── Paragraph: "Exposure to interest rate..."  [narrative]
│   └── Sub-section: Operational Risk  [heading level 2]
│       └── Paragraph: "Supply chain dependencies..."  [narrative]
│
└── Section: Appendix  [heading level 1]
    └── Table: Quarterly Breakdown  [structured fact]
Every element in this tree has:
A type (heading, paragraph, table, footnote, caption, list item)
A level in the hierarchy (which section it belongs to)
A semantic role (is this narrative explanation or factual data?)
Relationships to its siblings and parent sections
This is what Docling produces. This is what a traditional text extractor destroys.
The principle that follows from this is the foundation of all trustworthy AI pipelines:
Structure is meaning. Discarding it causes hallucination.
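One way to picture that tree in code is a small recursive data structure. This is a sketch only: the `Element` class and `find_path` helper are hypothetical names for illustration and are not Docling's own data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """One node in a document tree (hypothetical model, for illustration)."""
    type: str   # "document", "section", "paragraph", "table", ...
    text: str
    children: list["Element"] = field(default_factory=list)


def find_path(node: Element, target: str,
              trail: tuple[str, ...] = ()) -> Optional[tuple[str, ...]]:
    """Return the chain of ancestor titles down to the first node whose text matches."""
    here = trail + (node.text,)
    if node.text == target:
        return here
    for child in node.children:
        found = find_path(child, target, here)
        if found is not None:
            return found
    return None


report = Element("document", "Acme_Annual_Report_2023.pdf", [
    Element("section", "Financial Highlights", [
        Element("table", "Year-over-Year Performance"),
    ]),
    Element("section", "Risk Factors", [
        Element("section", "Market Risk", [
            Element("paragraph", "Exposure to interest rate..."),
        ]),
    ]),
])

print(" > ".join(find_path(report, "Exposure to interest rate...")))
# Acme_Annual_Report_2023.pdf > Risk Factors > Market Risk > Exposure to interest rate...
```

The paragraph never has to repeat "Risk Factors" in its own text. Its position in the tree carries that context.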
The Four Stages of Document Understanding
There is a complete stack of operations that transforms a visual document into machine-readable knowledge. Most tools stop early. Where you stop determines what your AI can know.
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 1 · OCR │
│ "What letters are on this page?" │
│ │
│ Input: Scanned image or native PDF │
│ Output: Raw character sequences │
│ Limit: No understanding of where text is or what it means │
│ │
│ Most pipelines stop here ◄────────────────────────────────────── │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 2 · LAYOUT ANALYSIS │
│ "Where are the text blocks, and how are they arranged?" │
│ │
│ Input: Character positions + page geometry │
│ Output: Regions identified: header, footer, columns, figures │
│ Limit: Knows WHERE, not WHAT — a heading looks like a paragraph │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 3 · LOGICAL STRUCTURE RECOGNITION │
│ "What is the PURPOSE of each block?" │
│ │
│ Input: Layout regions + visual signals │
│ Output: Typed elements — heading, paragraph, table, footnote │
│ Docling reaches here ──────────────────────────────────────────── │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 4 · CONTENT ENRICHMENT │
│ "What does this content MEAN in context?" │
│ │
│ Input: Typed, structured elements │
│ Output: Named entities, financial terms, cross-references resolved │
│ Docling + PageIndex reach here ────────────────────────────────── │
└──────────────────────────────────────────────────────────────────────┘
The stage at which you stop parsing is the ceiling of what your AI can know. Stage 1 gives you characters. Stage 4 gives you knowledge.
This is not an incremental improvement — it is a qualitative difference in what the system can reason about.
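The jump from Stage 2 to Stage 3 can be sketched with a deliberately crude rule set. Everything here is an assumption made for illustration: the `Region` fields, the thresholds, and the footnote sample text are invented, and real parsers such as Docling rely on trained models rather than hand-written rules like these.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A Stage 2 layout region: text plus visual signals (hypothetical fields)."""
    text: str
    font_size: float
    is_bold: bool


def classify(region: Region, body_size: float = 10.0) -> str:
    """Toy Stage 3 rule: assign a logical role from visual signals alone."""
    if region.font_size >= body_size * 1.3 or (region.is_bold and len(region.text) < 60):
        return "heading"
    if region.font_size < body_size * 0.9:
        return "footnote"
    return "paragraph"


regions = [
    Region("Financial Highlights", 16.0, True),
    Region("Revenue growth was driven by expansion in APAC markets...", 10.0, False),
    Region("1 Figures are unaudited.", 8.0, False),  # invented sample footnote
]

for r in regions:
    print(f"{classify(r):<10} {r.text}")
```

Stage 2 knows only the three boxes and their geometry. Stage 3 is the step that turns them into a heading, a paragraph, and a footnote.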
Markdown vs JSON: An Architectural Decision
When Docling processes a document, you choose how to export the result. Markdown and JSON are the two export formats that matter here, and the choice is not cosmetic. It is an architectural decision that determines what your downstream pipeline can and cannot do.
Markdown output
## Financial Highlights
| Metric | 2023 | 2022 |
|------------|--------------|--------------|
| Revenue | $5,200,000 | $4,600,000 |
| Net Profit | $1,100,000 | $950,000 |
Revenue growth was driven by expansion in APAC markets...
Markdown output is human-readable and preserves visual structure. It is excellent for:
Display to human readers
Heading-based chunking in RAG pipelines
Cases where a human will review the output
What it cannot do: a program reading this Markdown cannot reliably tell whether $1,100,000 is a table cell, a heading, body text, or a footnote without parsing the Markdown syntax itself — which reintroduces the structural ambiguity you were trying to eliminate.
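A short sketch of that ambiguity: splitting a Markdown row on the pipe character works until a cell contains a pipe of its own. The `naive_cells` helper and the "Profit \| Loss" row are made up for this example.

```python
def naive_cells(markdown_row: str) -> list[str]:
    """Split a Markdown table row on '|' the way a quick script might."""
    return [cell.strip() for cell in markdown_row.strip().strip("|").split("|")]


clean_row = "| Net Profit | $1,100,000 | $950,000 |"
tricky_row = r"| Profit \| Loss | $1,100,000 | $950,000 |"  # escaped pipe inside a cell

print(naive_cells(clean_row))   # 3 cells, as intended
print(naive_cells(tricky_row))  # 4 cells: the label was cut in two
```

A full Markdown parser handles the escape, but at that point your pipeline is reconstructing structure from syntax instead of reading it from declared fields.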
JSON output
{
  "name": "Acme_Annual_Report_2023.pdf",
  "confidence": 0.96,
  "quality": "excellent",
  "items": [
    {
      "type": "heading",
      "level": 2,
      "text": "Financial Highlights"
    },
    {
      "type": "table",
      "data": {
        "num_rows": 3,
        "num_cols": 3,
        "table_cells": [
          {"row": 0, "col": 0, "text": "Metric", "is_header": true},
          {"row": 0, "col": 1, "text": "2023", "is_header": true},
          {"row": 0, "col": 2, "text": "2022", "is_header": true},
          {"row": 1, "col": 0, "text": "Revenue"},
          {"row": 1, "col": 1, "text": "$5,200,000"},
          {"row": 1, "col": 2, "text": "$4,600,000"},
          {"row": 2, "col": 0, "text": "Net Profit"},
          {"row": 2, "col": 1, "text": "$1,100,000"},
          {"row": 2, "col": 2, "text": "$950,000"}
        ]
      }
    },
    {
      "type": "paragraph",
      "text": "Revenue growth was driven by expansion in APAC markets..."
    }
  ]
}
JSON output is unambiguous. Every element has an explicit type. The program downstream does not need to guess whether something is a table header or body text — it is declared. The $1,100,000 figure is reachable as a structured object: items[1].data.table_cells[row=2, col=1].text.
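The path notation above is shorthand rather than valid Python. A minimal sketch of the same lookup, using a trimmed copy of the table element from the JSON above and a hypothetical `find_cell` helper:

```python
from typing import Optional

# Trimmed copy of the table element shown in the JSON above.
table = {
    "type": "table",
    "data": {
        "table_cells": [
            {"row": 0, "col": 1, "text": "2023", "is_header": True},
            {"row": 2, "col": 0, "text": "Net Profit"},
            {"row": 2, "col": 1, "text": "$1,100,000"},
        ]
    },
}


def find_cell(table_item: dict, row: int, col: int) -> Optional[dict]:
    """Return the cell at (row, col), or None if the table has no such cell."""
    for cell in table_item["data"]["table_cells"]:
        if cell["row"] == row and cell["col"] == col:
            return cell
    return None


cell = find_cell(table, row=2, col=1)
print(cell["text"])  # $1,100,000
```

The lookup is addressed by coordinates, so it returns the declared cell or nothing. It never returns a near match.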
┌────────────────────────────────────────────────────────────────────┐
│                MARKDOWN vs JSON: WHAT EACH ENABLES                 │
├────────────────────────────────┬───────────────────────────────────┤
│ MARKDOWN                       │ JSON                              │
├────────────────────────────────┼───────────────────────────────────┤
│ Human-readable                 │ Machine-readable                  │
│ Preserves visual formatting    │ Preserves semantic structure      │
│ Good for display               │ Good for programmatic processing  │
│ Heading-based chunking         │ Fact extraction by element type   │
│ Context visible at a glance    │ Every field explicitly typed      │
│ Ambiguous for programs         │ Unambiguous for programs          │
│ Cannot route elements          │ Routes narrative vs facts         │
│ Cannot score quality           │ Confidence score per document     │
└────────────────────────────────┴───────────────────────────────────┘
For RAG pipelines: use Markdown
For VAA pipelines: use JSON  ←── this is the right answer for
                                 trustworthy, verifiable AI
The VAA pipeline uses JSON. The reason is simple: you cannot route what you cannot identify. And you cannot build trustworthy AI on top of data you cannot reason about programmatically.
Documents as Knowledge Graphs
Once you have a structured document tree, something important becomes possible: you can reason across it, not just retrieve from it.
A traditional RAG system treats a document as a bag of chunks — discrete text segments that either match a query or don’t. A document understood as a knowledge graph is something fundamentally different: a network of facts, entities, and relationships where each node knows its context.
Consider a contract document processed into a knowledge graph:
┌─────────────────────────────────────────────────────────────────────┐
│ CONTRACT AS KNOWLEDGE GRAPH │
│ │
│ [Vendor Services Agreement] │
│ │ │
│ ┌────────────────┼─────────────────┐ │
│ │ │ │ │
│ [Party A] [Party B] [Term] │
│ Acme Inc. CloudSecure LLC 2 years │
│ │ │ │ │
│ │ ┌──────┴──────┐ │ │
│ │ [SLA Clause] [Penalty] [Renewal] │
│ │ 99.9% uptime $50K/breach Auto-renew │
│ │ │ │
│ └─────────┤ RELATIONSHIP: │
│ │ "CloudSecure is liable for $50K per breach │
│ │ of the 99.9% uptime SLA with Acme Inc." │
│ │ │
│ [Data Breach — 2024] │
│ 3 incidents │
└─────────────────────────────────────────────────────────────────────┘
Now a question like “What is Acme’s total exposure from CloudSecure’s 2024 breach record?” is answerable — not by finding the most similar text chunk, but by traversing a graph: Party → SLA Clause → Penalty Clause × Breach Count.
That is not retrieval. That is reasoning. And it requires the document to have been understood as a knowledge graph in the first place.
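As a rough sketch of that traversal, here is the contract graph as a plain dictionary. The relationship names (`HAS_VENDOR`, `BOUND_BY`, and so on) are invented for illustration. A real deployment would hold this in a graph database and express the walk as a query.

```python
# Hypothetical in-memory graph: node -> {relationship: target node or value}.
graph = {
    "Acme Inc.":       {"HAS_VENDOR": "CloudSecure LLC"},
    "CloudSecure LLC": {"BOUND_BY": "SLA Clause", "BREACHES_2024": 3},
    "SLA Clause":      {"UPTIME": "99.9%", "ENFORCED_BY": "Penalty Clause"},
    "Penalty Clause":  {"AMOUNT_PER_BREACH_USD": 50_000},
}


def total_exposure(graph: dict, customer: str) -> int:
    """Walk Party -> SLA Clause -> Penalty Clause, then multiply by breach count."""
    vendor = graph[customer]["HAS_VENDOR"]
    sla = graph[vendor]["BOUND_BY"]
    penalty = graph[sla]["ENFORCED_BY"]
    per_breach = graph[penalty]["AMOUNT_PER_BREACH_USD"]
    breaches = graph[vendor]["BREACHES_2024"]
    return per_breach * breaches


print(total_exposure(graph, "Acme Inc."))  # 150000
```

Every hop is an explicit edge, so the $150,000 figure can be traced back to the clause and the breach count that produced it.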
We cover the full implementation of this in Hybrid RAG Architecture — including how to route facts to graph databases and narrative to vector databases. The point here is architectural: before you can reason across documents, you need to have understood them as structured knowledge, not flat text.
The Practical Difference
Here is a concrete comparison. Same question, same document, two different parsing approaches.
Question: “What was Acme’s net profit growth from 2022 to 2023?”
With naive text extraction
The parser produces something like this:
"Financial Highlights Revenue 2023 2022 5200000 4600000 Net
Profit 1100000 950000 Revenue growth was driven by expansion
in APAC markets and new enterprise contracts signed in Q3..."
The LLM receives this string. It tries to find numbers near the words “net profit.” It may correctly identify 1,100,000 and 950,000. Or it may confuse them with revenue figures. Or it may hallucinate a growth percentage that is slightly off because it cannot verify which number belongs to which year. There is no way to know from the output whether the answer is grounded or invented.
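You can see the problem without involving a model at all. A quick sketch that pulls every number out of the flat string above:

```python
import re

flat = (
    "Financial Highlights Revenue 2023 2022 5200000 4600000 Net "
    "Profit 1100000 950000 Revenue growth was driven by expansion "
    "in APAC markets and new enterprise contracts signed in Q3..."
)

numbers = re.findall(r"\d+", flat)
print(numbers)
# ['2023', '2022', '5200000', '4600000', '1100000', '950000', '3']
```

Seven numbers come back, and nothing separates years from amounts or the quarter from either. Nothing ties 1100000 to "Net Profit" or to 2023. Any pairing is an inference from word order.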
With structure-aware parsing
The parser produces the JSON tree shown earlier. The pipeline knows:
items[1] is a table with type explicitly declared
Row 2 contains "Net Profit" in col 0, "$1,100,000" in col 1 (2023), "$950,000" in col 2 (2022)
The table is a child of the "Financial Highlights" heading
The answer is computed, not guessed:
net_profit_2023 = 1_100_000
net_profit_2022 = 950_000
growth = (net_profit_2023 - net_profit_2022) / net_profit_2022
# → 0.1579, i.e. 15.79%

output = {
    "answer": "Net profit grew 15.8% from $950K (2022) to $1.1M (2023)",
    "source": {
        "document": "Acme_Annual_Report_2023.pdf",
        "section": "Financial Highlights",
        "element": "table",
        "page": 4,
        "confidence": 0.96
    }
}
The answer is verifiable. The source is cited. The confidence is known. There is nothing to hallucinate because the value was never embedded — it was looked up.
This is the practical consequence of treating documents as structured data.
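The snippet above hard-codes the two values. A small sketch of going from the raw cell text to the percentage, with hypothetical `parse_money` and `growth_pct` helpers and `Decimal` used to keep the arithmetic exact:

```python
from decimal import Decimal


def parse_money(text: str) -> Decimal:
    """Convert a cell like '$1,100,000' into an exact Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


def growth_pct(current: str, previous: str) -> Decimal:
    """Percentage change from previous to current, rounded to one decimal place."""
    cur, prev = parse_money(current), parse_money(previous)
    return ((cur - prev) / prev * 100).quantize(Decimal("0.1"))


print(growth_pct("$1,100,000", "$950,000"))  # 15.8
```

The inputs are the exact strings from the table cells, so the result can be recomputed by anyone holding the same document.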
What This Means for Hallucination
Hallucination is usually framed as a model problem. It is not. It is an ingestion problem.
When an LLM hallucinates a financial figure, it is almost never because the model invented a number from nothing. It is because:
The correct number was in the document
The number was stripped of its structural context during parsing
The embedding placed it near similar-looking numbers from other rows, columns, or years
The model, unable to verify, generated a plausible-sounding answer from blended context
Every step in that chain is an ingestion failure. The fix is upstream — at the parsing layer, not the model layer.
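The third step in that chain is easy to underestimate. As a loose stand-in for embedding similarity, here is plain string similarity between two line-item labels that mean very different things. The labels are illustrative, and `difflib` is not an embedding model. It only shows how little surface text separates the two.

```python
from difflib import SequenceMatcher

a = "Net cash from operating activities"
b = "Net cash from investing activities"

# ratio() is 1.0 for identical strings and 0.0 for strings with nothing in common.
similarity = SequenceMatcher(None, a, b).ratio()
print(f"{similarity:.2f}")
```

The score comes out high even though the two labels refer to different sections of a cash-flow statement. Any retrieval method that leans on surface similarity has very little to tell them apart.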
┌─────────────────────────────────────────────────────────────────────┐
│ WHERE HALLUCINATION ACTUALLY COMES FROM │
│ │
│ WRONG DIAGNOSIS: Model made something up │
│ │
│ CORRECT DIAGNOSIS: │
│ │
│ 1. Document parsed as flat text ──────► Structure destroyed │
│ │
│ 2. Text chunked arbitrarily ──────────► Context severed │
│ │
│ 3. Numbers embedded as text ──────────► Fuzzy matching enabled │
│ ("operating" ≈ "investing")│
│ │
│ 4. kNN retrieval ─────────────────────► Wrong context returned │
│ │
│ 5. LLM generates from blended context ► Confident wrong answer │
│ │
│ THE FIX: Solve step 1. Everything else follows. │
└─────────────────────────────────────────────────────────────────────┘
Structure-aware parsing eliminates steps 1 and 2. Routing facts to structured stores instead of vector databases eliminates step 3. The hallucination never gets a chance to form.
Implementation: Reading a Document as Structured Data
Here is the complete workflow from raw PDF to structured knowledge tree, using Docling:
Step 1: Parse and inspect
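from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("acme_annual_report_2023.pdf")

# Get the structured JSON tree
doc = result.document.export_to_dict()

# NOTE: "items", "quality", and "confidence" follow the simplified schema
# shown earlier in this article. The raw export_to_dict() output may name
# these fields differently depending on your Docling version, so map it to
# this shape first. The .get() fallbacks keep this snippet from crashing.
print(f"Document: {doc.get('name', 'unknown')}")
print(f"Quality: {doc.get('quality', 'n/a')} (confidence: {doc.get('confidence', 0.0):.2f})")
print(f"Elements: {len(doc.get('items', []))}")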
Step 2: Walk the tree and classify each element
def walk_document_tree(doc: dict) -> None:
    """
    Walk every element in the document tree.
    Print type, hierarchy level, and a preview of the content.
    """
    for item in doc.get("items", []):
        element_type = item.get("type", "unknown")

        if element_type == "heading":
            level = item.get("level", 1)
            indent = "  " * (level - 1)
            print(f"{indent}[H{level}] {item['text']}")

        elif element_type == "paragraph":
            text = item["text"]
            preview = (text[:80] + "...") if len(text) > 80 else text
            print(f"  [PARA] {preview}")

        elif element_type == "table":
            rows = item["data"]["num_rows"]
            cols = item["data"]["num_cols"]
            print(f"  [TABLE] {rows}×{cols} - extracting cells...")
            for cell in item["data"]["table_cells"]:
                if cell.get("is_header"):
                    print(f"    Header col {cell['col']}: {cell['text']}")
                else:
                    print(f"    [{cell['row']},{cell['col']}]: {cell['text']}")

# Usage
walk_document_tree(doc)
Step 3: Extract facts from tables with full provenance
def extract_table_facts(doc: dict) -> list[dict]:
    """
    Extract all table cells with their structural context.
    Returns facts ready for graph database ingestion.
    """
    facts = []
    current_section = "Unknown"

    for item in doc.get("items", []):
        # Track the current section heading for context
        if item.get("type") == "heading":
            current_section = item["text"]

        # Extract every non-header table cell as a fact
        elif item.get("type") == "table":
            cells = item["data"]["table_cells"]

            # Build header map: col index → header label
            headers = {
                cell["col"]: cell["text"]
                for cell in cells
                if cell.get("is_header")
            }

            # Build row-label map: row index → label in column 0
            row_labels = {}
            for cell in cells:
                if not cell.get("is_header") and cell["col"] == 0:
                    row_labels[cell["row"]] = cell["text"]

            # Extract data cells with full context
            for cell in cells:
                if cell.get("is_header") or cell["col"] == 0:
                    continue  # Skip headers and row labels
                facts.append({
                    "value": cell["text"],
                    "metric": row_labels.get(cell["row"], ""),
                    "period": headers.get(cell["col"], ""),
                    "section": current_section,
                    "document": doc["name"],
                    "confidence": doc["confidence"],
                })

    return facts

# Usage: extract and inspect facts
facts = extract_table_facts(doc)
for f in facts:
    print(f"{f['metric']} ({f['period']}) = {f['value']}")
    print(f"  Source: {f['document']} → {f['section']} (confidence: {f['confidence']:.2f})")
    print()

# Output:
# Revenue (2023) = $5,200,000
#   Source: Acme_Annual_Report_2023.pdf → Financial Highlights (confidence: 0.96)
#
# Revenue (2022) = $4,600,000
#   Source: Acme_Annual_Report_2023.pdf → Financial Highlights (confidence: 0.96)
#
# Net Profit (2023) = $1,100,000
#   Source: Acme_Annual_Report_2023.pdf → Financial Highlights (confidence: 0.96)
#
# Net Profit (2022) = $950,000
#   Source: Acme_Annual_Report_2023.pdf → Financial Highlights (confidence: 0.96)
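Before any graph database enters the picture, the idea of exact lookup can be sketched with a dictionary keyed by metric and period. The `facts` list below is a hand-written sample in the same shape as the Step 3 output, and `lookup` is a hypothetical helper.

```python
from typing import Optional

# Hand-written sample in the shape produced by Step 3.
facts = [
    {"metric": "Revenue", "period": "2023", "value": "$5,200,000"},
    {"metric": "Revenue", "period": "2022", "value": "$4,600,000"},
    {"metric": "Net Profit", "period": "2023", "value": "$1,100,000"},
    {"metric": "Net Profit", "period": "2022", "value": "$950,000"},
]

# Index by (metric, period) so retrieval is a key lookup, not a similarity search.
index = {(f["metric"], f["period"]): f for f in facts}


def lookup(metric: str, period: str) -> Optional[dict]:
    """Return the matching fact, or None when the document never stated it."""
    return index.get((metric, period))


print(lookup("Net Profit", "2023")["value"])  # $1,100,000
print(lookup("EBITDA", "2023"))               # None
```

The second call is the important one. When the fact is absent, the system returns nothing instead of the closest-looking number.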
Step 4: Separate narrative from facts
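def extract_table_facts_for_element(table: dict, section, doc: dict) -> list[dict]:
    """Turn one table element into facts (same logic as Step 3, for a single table)."""
    cells = table["data"]["table_cells"]
    headers = {c["col"]: c["text"] for c in cells if c.get("is_header")}
    row_labels = {
        c["row"]: c["text"]
        for c in cells
        if not c.get("is_header") and c["col"] == 0
    }
    return [
        {
            "value": c["text"],
            "metric": row_labels.get(c["row"], ""),
            "period": headers.get(c["col"], ""),
            "section": section,
            "document": doc["name"],
            "confidence": doc["confidence"],
        }
        for c in cells
        if not c.get("is_header") and c["col"] != 0
    ]


def separate_narrative_and_facts(doc: dict) -> dict:
    """
    Route document elements to their correct downstream store.
    Narrative  → vector DB (embed for semantic search)
    Facts      → graph DB (store for exact lookup)
    Everything → doc store (audit trail)
    """
    narrative = []  # → vector DB (Qdrant, Milvus)
    facts = []      # → graph DB (Neo4j)
    current_section = None

    for item in doc.get("items", []):
        element_type = item.get("type")

        if element_type == "heading":
            current_section = item["text"]

        elif element_type in ("paragraph", "list_item", "caption"):
            narrative.append({
                "text": item["text"],
                "section": current_section,
                "document": doc["name"],
                "type": element_type,
            })

        elif element_type == "table":
            facts.extend(
                extract_table_facts_for_element(item, current_section, doc)
            )

    return {
        "narrative": narrative,   # → embed and store in vector DB
        "facts": facts,           # → store as nodes in graph DB
        "source_document": doc,   # → store in document store (always)
    }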
Full pipeline code including Qdrant ingestion, Neo4j fact storage, and query routing is in the docling-examples repository.
Why This Changes Everything Downstream
Once your documents are parsed into structured trees, a chain of possibilities opens up that was simply not available before:
┌─────────────────────────────────────────────────────────────────────┐
│ WHAT STRUCTURE-AWARE PARSING ENABLES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ HYBRID STORAGE Route narrative to vector DB. │
│ (Article 3) Route facts to graph DB. │
│ Never mix them. Never embed numbers. │
│ │
│ PAGEINDEX TREES Build a navigable tree from the document │
│ (Article 8) hierarchy. Let an LLM reason its way to │
│ the right section — not guess via cosine │
│ similarity. │
│ │
│ TOON SERIALIZATION Convert JSON tree to TOON format before │
│ (Article 6) sending to LLM. Save 30–60% in token costs. │
│ │
│ COMPLIANCE Every answer cites a source: │
│ (Article 9) document name, section, page, confidence. │
│ Matters for GDPR Art. 22 and the EU AI Act (Aug 2026). │
│ │
│ VERIFIABLE AI The difference between "the AI said so" │
│ (All articles) and "here is the page, the table, the cell." │
└─────────────────────────────────────────────────────────────────────┘
None of these capabilities are available if you started with flat text.
FAQ
Q: Does this only apply to financial documents?
No. The principle that structure carries meaning applies to any document where layout encodes information: legal contracts (clause hierarchy matters), medical records (a dosage in a footnote means something different from a dosage in a body-text recommendation), scientific papers (results tables, methods sections, appendices), policy documents, and compliance frameworks. In short, any domain where getting a fact wrong has consequences.
Q: Can’t a good enough LLM figure out the structure from raw text?
Sometimes, partially. But this is exactly the wrong approach. You are asking the most expensive, most error-prone step in the pipeline to compensate for a failure at the cheapest, most reliable step. Docling parsing is deterministic and fast. LLM reasoning is probabilistic and expensive. Fix the problem at the source.
Q: What about documents that are truly unstructured, like emails or chat logs?
Some documents genuinely have minimal structure. For those, the four-stage model still applies; you just stop earlier. But most enterprise documents (contracts, reports, manuals, filings) are highly structured. The opportunity is largest exactly where the stakes are highest.
Q: How does this connect to hallucination prevention in practice?
Directly. When a fact is extracted from a table cell and stored in a graph database with its row label, column header, and section heading, the LLM never has to guess or approximate it. It looks it up. There is nothing to hallucinate. The hallucination was a symptom of the model working from degraded context: fix the context, fix the hallucination.
Q: Is this a lot of engineering overhead?
Less than you think, because Docling does the heavy lifting. The ingestion classification and routing logic shown above is around 50 lines of Python. The payoff (verifiable, cited, auditable AI outputs) is substantial in any regulated or high-stakes context.
What Comes Next
You now have a structured document tree. Every element has a type, a hierarchy position, a semantic role. Facts are separated from narrative.
The next question is: where does each of these go?
The answer is the core of the Verifiable AI Architecture. Narrative belongs in a vector database — embedded and searchable by meaning. Facts belong in a graph database — stored as nodes and relationships, retrievable by exact lookup. The full document belongs in a document store as an audit trail.
That three-store pattern, and why mixing them causes the problems you have already seen, is the subject of Hybrid RAG Architecture.
Before that, if you want to understand the mechanics of the three database types — what embeddings actually are, why graph databases handle facts better than vector similarity, and how to decide which store to use for which type of data — AI Memory Systems covers that ground.
The foundation is here. Structure is meaning. Now it is time to store it correctly.
Elyas Karbouch builds Verifiable AI Architecture — AI systems that show their work. Code for all pipeline components lives in the verifiable-ai-architecture GitHub organization.

