Hybrid RAG Architecture: Beyond Vector Search 2026 Edition

Part 5 of the Verifiable AI Architecture series — By Elyas Karbouch

Nov 19, 2025

article_id: 5
series_position: 5
series_name: "Verifiable AI Architecture"
title: "Hybrid RAG Architecture: Beyond Vector Search"
slug: "hybrid-rag-architecture-vector-graph-document-stores"
status: "Published"
published: "2026-04"
description: "Most RAG pipelines embed everything and retrieve by similarity. Learn why hybrid architecture — vector + graph + document stores — prevents hallucination and enables verifiable AI."
summary: "The three-store architecture that stops hallucination: narrative goes to vectors, facts go to graphs, everything goes to document store. Here's the complete implementation."
author: "Elyas Karbouch"
author_url: "https://karbouch.substack.com"
og_title: "Your RAG pipeline embeds facts. Stop doing that."
og_description: "Vector search finds similar text. It does not retrieve exact figures. Here's the architecture that separates narrative from facts and makes RAG reliable."
canonical: "https://karbouch.substack.com/p/hybrid-rag-architecture-vector-graph-document-stores"
twitter_creator: "@elyaskarbouch"
tags:
  - hybrid-rag
  - vector-database
  - graph-database
  - document-store
  - rag-architecture
  - qdrant
  - neo4j
  - verifiable-ai
keywords_primary: "hybrid RAG architecture vector graph"
keywords_secondary:
  - "RAG without hallucination"
  - "vector search vs graph database RAG"
  - "document store architecture AI"
  - "verifiable RAG pipeline"
  - "facts vs narrative AI retrieval"
word_count: "~8500"
reading_time_minutes: 28
github_repo: "https://github.com/elyas-karbouch/vaa-core"
external_links:
  - title: "vaa-core Repository"
    url: "https://github.com/elyas-karbouch/vaa-core"
  - title: "Qdrant Documentation"
    url: "https://qdrant.tech"
  - title: "Neo4j Graph Database"
    url: "https://neo4j.com"
series_prev:
  article_id: 4
  title: "AI Memory Systems: How Databases Power Trustworthy RAG"
  slug: "ai-memory-systems-databases-trustworthy-rag"
  url: "https://karbouch.substack.com/p/ai-memory-systems-databases-trustworthy-rag"
series_next:
  article_id: 6
  title: "Why Traditional RAG Fails in High-Stakes Domains"
  slug: "why-traditional-rag-fails"
  url: "https://karbouch.substack.com/p/why-traditional-rag-fails"

Here is a query a legal team submitted to their RAG system: “What is the penalty if our vendor has a data breach?”

The system retrieved three paragraphs about data breach penalties from contracts across their document library. The LLM synthesised them into a confident answer: “The penalty is €150,000.”

The actual penalty in their active vendor contract was €1,500,000.

The system had retrieved semantically similar content from older contracts with lower penalty thresholds. The answer was fluent, confident, and wrong by a factor of ten.

This is not an unusual failure. It is the default failure mode of retrieval-augmented generation built on a single vector store. The system retrieved what felt relevant — paragraphs that were semantically similar to the query. It did not retrieve what was correct — the exact clause in the active contract.

The distinction is not subtle. And the fix requires rethinking the entire storage layer.

Hybrid RAG Architecture is the thesis of this entire series: structure-aware ingestion, followed by type-aware storage across three specialised stores, followed by query-routing that retrieves from the right store for each type of question. The result is an AI that does not just sound authoritative — it is verifiable.

This article defines the architecture, explains why it works, and gives you the complete implementation. Part 1 covers ingestion — how documents are read and routed into three stores. Part 2 covers retrieval — how queries navigate those stores to produce answers that can be traced back to their source.

All code is in the vaa-core repository.

Part 1: Read and Store

1.1 Why flat text fails at the ingestion layer

The failure starts before the query. It starts when you load a document.

As covered in detail in Documents as Data, naive text extraction discards the structural information that gives facts their meaning. A table cell containing €1,500,000 has no meaning without the column header that says Penalty, the row header that says Data Breach, and the document context that says Vendor Contract 0847 (Active). Strip the structure, and you have a number that floats free of its meaning.

Docling solves this at the parsing stage: it preserves tables, headings, hierarchy, and semantic element types in a structured JSON output. But parsing correctly is only the first step. What you do with the structured output is what determines whether your pipeline is trustworthy.

Most pipelines take Docling’s structured output and immediately chunk it back into flat text for embedding. They have solved the parsing problem and then discarded the solution. The structure was recovered, and then thrown away at the next step.

The ingestion architecture that follows keeps structure throughout.

1.2 Two types of content — and why this matters

Every document contains two fundamentally different types of content. They have different retrieval requirements. They should never be stored the same way.

Narrative content — paragraphs, management commentary, summaries, contextual descriptions — carries meaning. When you query narrative content, you are asking “what does this document say about this topic?” The retrieval requirement is semantic: find content that is about the question, even if it uses different words.

Factual content — numbers, dates, legal clauses, identifiers, thresholds, named entities — carries precision. When you query factual content, you are asking “what is the exact value of this specific thing?” The retrieval requirement is exact: return the precise value with zero tolerance for approximation.

These two retrieval requirements are incompatible with a single storage layer. Vector search is designed for semantic retrieval. On its own it cannot guarantee exact retrieval: it ranks by similarity, and similarity is not the same as identity. A structured database is designed for exact lookup, but it cannot match paraphrases or synonyms.

The solution is obvious once stated: store each content type in the system that is actually designed for it.

┌─────────────────────────────────────────────────────────────────────┐
│              TWO CONTENT TYPES — TWO RETRIEVAL MODES                │
├──────────────────────┬──────────────────────────────────────────────┤
│ NARRATIVE            │ FACTUAL                                      │
│                      │                                              │
│ "Revenue growth was  │ Revenue: $5,200,000                          │
│  driven by expansion │ Period: 2023                                  │
│  in APAC markets,    │ Document: Acme Annual Report 2023            │
│  where enterprise    │ Section: Financial Highlights                │
│  adoption accelerated│ Confidence: 0.96                             │
│  significantly in    │                                              │
│  Q3 and Q4."         │                                              │
│                      │                                              │
│ → Vector store       │ → Relational / structured store              │
│ → Semantic search    │ → Exact lookup                               │
│ → Ranked results     │ → Single precise value                       │
└──────────────────────┴──────────────────────────────────────────────┘

1.3 Embed words. Store facts.

This is the foundational design rule of the Verifiable AI Architecture. Four words. One architectural decision that eliminates the most common class of RAG hallucination.

Embed words. Text that carries meaning — narrative, context, description, explanation — gets encoded as an embedding and stored in the vector database. Queries for this content use semantic similarity search.

Store facts. Text that carries precision — numbers, dates, identifiers, clauses, thresholds — gets stored as structured data in a relational or graph database. Queries for this content use exact lookup.

The implication: never embed a number. A number encoded as an embedding becomes a mathematical representation of a concept — “a large dollar amount in a financial context.” That representation is semantically similar to other large dollar amounts. A vector search for “revenue 2023” will find $5,200,000 and $52,000,000 with nearly equal similarity scores. The model cannot distinguish them from the embedding alone. It guesses.

A structured store containing {metric: "Revenue", period: "2023", value: "$5,200,000"} returns the exact value. No guessing. No approximation. No hallucination.

This rule sounds simple. Applying it requires a classification layer at ingestion time that decides, for each extracted element, which store it belongs in.

1.4 The three-store architecture

The full Hybrid RAG architecture routes content from a single parsed document into three specialised stores simultaneously:

┌─────────────────────────────────────────────────────────────────────┐
│                    HYBRID RAG INGESTION PIPELINE                    │
│                                                                     │
│  Raw Document (PDF, DOCX, PPTX)                                     │
│       │                                                             │
│       ▼                                                             │
│  ┌──────────────┐                                                   │
│  │   DOCLING    │  Structure-aware parsing                          │
│  │              │  Tables, headings, hierarchy preserved            │
│  └──────┬───────┘                                                   │
│         │  Structured JSON (element types + content + metadata)     │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │ CLASSIFIER   │  Routes each element by content type              │
│  └──┬───────┬───┘                                                   │
│     │       │         │                                             │
│     ▼       ▼         ▼                                             │
│  ┌──────┐ ┌──────┐ ┌──────┐                                        │
│  │VECTOR│ │STRUC │ │GRAPH │                                        │
│  │  DB  │ │TURED │ │  DB  │                                        │
│  │      │ │STORE │ │      │                                        │
│  │Qdrant│ │Post- │ │Neo4j │                                        │
│  │      │ │greSQL│ │      │                                        │
│  └──────┘ └──────┘ └──────┘                                        │
│  Narrative  Facts  Relations                                        │
│  Semantic   Exact   Graph                                           │
│  search     lookup  traversal                                       │
└─────────────────────────────────────────────────────────────────────┘

Every document is routed across all three stores in a single ingestion pass. Nothing is discarded. The narrative sections of a vendor contract go to the vector store, for queries about what the contract says. The penalty clauses go to the structured store, for queries about exact values. The entity relationships (which vendor is party to which contract, which clause references which regulation) go to the graph store.

At retrieval time, a query router inspects the question and determines which store or combination of stores to query. “What does our vendor contract say about data handling?” → vector store. “What is the exact penalty threshold?” → structured store. “Which contracts reference GDPR Article 83?” → graph store.

The AI Memory Systems article covers each store in depth — its design, its implementation, and when to use it. This article focuses on the ingestion pipeline that populates all three.

1.5 One ingestion flow, three outputs

Here is the complete ingestion pipeline. A single function parses a document with Docling and routes every element to the correct store:

from docling.document_converter import DocumentConverter
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
from sentence_transformers import SentenceTransformer
import psycopg2
from neo4j import GraphDatabase
import uuid
import re


# ── Clients ──────────────────────────────────────────────────────────

encoder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient(host="localhost", port=6333)
pg_conn = psycopg2.connect("postgresql://localhost/vaa_db")
neo4j = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

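VECTOR_COLLECTION = "vaa_narratives"
VECTOR_SIZE = 384  # embedding dimension of all-MiniLM-L6-v2

# Create the collection on first run; upsert fails if it does not exist
if not qdrant.collection_exists(VECTOR_COLLECTION):
    qdrant.create_collection(
        collection_name=VECTOR_COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )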


# ── Classifier ───────────────────────────────────────────────────────

def classify_element(element: dict) -> str:
    """
    Determine which store an element belongs in.

    Returns one of: 'narrative', 'fact', 'relationship', 'skip'

    Classification rules:
    - Tables → 'fact'  (cells contain exact values)
    - Paragraphs and list items of 50+ characters → 'narrative'
      (this simple version does not split out embedded values; see 1.6)
    - Headings → 'relationship' (section structure for graph)
    - Shorter fragments and other element types → 'skip'
    """
    element_type = element.get("type", "")

    if element_type == "table":
        return "fact"

    if element_type == "heading":
        return "relationship"

    if element_type in ("paragraph", "list_item"):
        text = element.get("text", "")
        if len(text) < 50:
            return "skip"
        return "narrative"

    return "skip"


# ── Structured store (facts) ──────────────────────────────────────────

def store_table_facts(table_element: dict, section: str, doc_path: str) -> None:
    """
    Extract structured facts from a Docling table element
    and insert them into PostgreSQL.

    Each non-header cell becomes a fact row with:
    - metric (row header)
    - period or dimension (column header)
    - value (cell content)
    - provenance (section, source document)
    """
    data = table_element.get("data", {})
    cells = data.get("table_cells", [])
    num_cols = data.get("num_cols", 0)
    num_rows = data.get("num_rows", 0)

    # Build grid
    grid = {(c["row"], c["col"]): c["text"] for c in cells}

    # Extract column headers from row 0
    col_headers = [grid.get((0, col), f"col_{col}") for col in range(num_cols)]

    facts = []
    for row in range(1, num_rows):
        metric = grid.get((row, 0), "")
        for col in range(1, num_cols):
            value = grid.get((row, col), "")
            if value:
                facts.append({
                    "source_doc": doc_path,
                    "section": section,
                    "metric": metric,
                    "dimension": col_headers[col],
                    "value": value,
                })

    if not facts:
        return

    with pg_conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO facts
                (source_doc, section, metric, dimension, value)
            VALUES
                (%(source_doc)s, %(section)s, %(metric)s,
                 %(dimension)s, %(value)s)
            ON CONFLICT DO NOTHING
            """,
            facts,
        )
    pg_conn.commit()


# ── Vector store (narrative) ──────────────────────────────────────────

def store_narrative(text: str, section: str, doc_path: str, confidence: float) -> None:
    """
    Encode a narrative paragraph and store in Qdrant.
    Only paragraph/list content goes here — never tables, never raw numbers.
    """
    vector = encoder.encode(text).tolist()
    point = PointStruct(
        id=str(uuid.uuid4()),
        vector=vector,
        payload={
            "text": text,
            "section": section,
            "source_doc": doc_path,
            "confidence": confidence,
        },
    )
    qdrant.upsert(collection_name=VECTOR_COLLECTION, points=[point])


# ── Graph store (relationships) ───────────────────────────────────────

def store_section_node(section: str, doc_path: str) -> None:
    """Create a Section node in Neo4j and link it to its Document."""
    with neo4j.session() as session:
        session.run(
            """
            MERGE (d:Document {path: $doc})
            MERGE (s:Section {title: $section, doc: $doc})
            MERGE (d)-[:CONTAINS]->(s)
            """,
            doc=doc_path,
            section=section,
        )


def store_entity_mentions(text: str, section: str, doc_path: str) -> None:
    """
    Extract named entities from text and create MENTIONS edges
    from Section to Entity nodes in the graph.
    """
    # Pattern-based extraction (production: use spaCy or fine-tuned NER)
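    # Amounts: allow any number of decimals so "$5.2 million" is not cut to "$5"
    amount_pattern = r'[€$][\d,]+(?:\.\d+)?(?:\s*(?:million|billion))?'
    date_pattern   = r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}|\b\d{4}-\d{2}-\d{2}\b'
    # Clauses: numbered references such as "Clause 14.3" or "Article 83(4)", or a
    # single-letter annex; avoids trailing full stops and matches like "Section T"
    clause_pattern = r'\b(?:Article|Clause|Section|Annex)\s+(?:\d+(?:\.\d+)*(?:\(\d+\))?|[A-Z]\b)'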

    entities = []
    for match in re.finditer(amount_pattern, text):
        entities.append(("Amount", match.group().strip()))
    for match in re.finditer(date_pattern, text):
        entities.append(("Date", match.group().strip()))
    for match in re.finditer(clause_pattern, text):
        entities.append(("Clause", match.group().strip()))

    if not entities:
        return

    with neo4j.session() as session:
        for entity_type, entity_value in entities:
            session.run(
                """
                MERGE (e:Entity {type: $type, value: $value})
                MERGE (s:Section {title: $section, doc: $doc})
                MERGE (s)-[:MENTIONS]->(e)
                """,
                type=entity_type,
                value=entity_value,
                section=section,
                doc=doc_path,
            )


# ── Main ingestion pipeline ───────────────────────────────────────────

def ingest_document(doc_path: str) -> dict:
    """
    Full hybrid ingestion pipeline.

    Parses a document with Docling and routes every element
    to the correct store based on content type:
    - Tables → structured store (exact facts)
    - Paragraphs → vector store (semantic narrative)
    - Headings + entity mentions → graph store (relationships)

    Args:
        doc_path: Path to the document (PDF, DOCX, PPTX)

    Returns:
        Ingestion summary with counts per store
    """
    converter = DocumentConverter()
    result = converter.convert(doc_path)

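    # Reject low-confidence parses before they corrupt the stores.
    # Assumption: a recent Docling release that exposes a confidence report on
    # the conversion result. The exported document dict has no "confidence"
    # key, so reading it from there would reject every document.
    report = getattr(result, "confidence", None)
    confidence = float(getattr(report, "mean_score", 0.0) or 0.0)
    if not confidence >= 0.7:  # also rejects NaN scores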
        return {
            "status": "rejected",
            "reason": f"confidence {confidence:.2f} below threshold 0.70",
            "doc": doc_path,
        }

    doc = result.document.export_to_dict()
    current_section = "Document Root"
    counts = {"narrative": 0, "fact": 0, "relationship": 0, "skip": 0}

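    # Assumption: elements have been normalised to dicts with "type", "text"
    # and (for tables) "data" keys under "items". Docling's native export
    # groups content under keys such as "texts" and "tables", so an adapter
    # is needed between the two.
    for element in doc.get("items", []):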
        route = classify_element(element)
        counts[route] += 1

        if route == "relationship":
            current_section = element.get("text", current_section)
            store_section_node(current_section, doc_path)

        elif route == "fact":
            store_table_facts(element, current_section, doc_path)

        elif route == "narrative":
            text = element.get("text", "")
            store_narrative(text, current_section, doc_path, confidence)
            # Also extract entity mentions from narrative for the graph
            store_entity_mentions(text, current_section, doc_path)

    return {
        "status": "ok",
        "doc": doc_path,
        "confidence": confidence,
        "counts": counts,
    }


# Example usage
if __name__ == "__main__":
    result = ingest_document("Vendor_Contract_0847.pdf")
    print(result)
    # → {"status": "ok", "doc": "Vendor_Contract_0847.pdf",
    #    "confidence": 0.94,
    #    "counts": {"narrative": 38, "fact": 12, "relationship": 9, "skip": 4}}
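
One detail the listing leaves implicit: `ON CONFLICT DO NOTHING` only deduplicates if the `facts` table carries a unique constraint. The schema is not shown in this article, so the sketch below is an assumption about what it could look like. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server; the column names mirror the INSERT statement above, and the choice of unique key is hypothetical.

```python
import sqlite3

# Assumed schema: one row per (document, section, metric, dimension).
SCHEMA = """
CREATE TABLE IF NOT EXISTS facts (
    id         INTEGER PRIMARY KEY,
    source_doc TEXT NOT NULL,
    section    TEXT NOT NULL,
    metric     TEXT NOT NULL,
    dimension  TEXT NOT NULL,
    value      TEXT NOT NULL,
    UNIQUE (source_doc, section, metric, dimension)
)
"""

INSERT = """
INSERT INTO facts (source_doc, section, metric, dimension, value)
VALUES (:source_doc, :section, :metric, :dimension, :value)
ON CONFLICT DO NOTHING
"""


def insert_facts(conn: sqlite3.Connection, facts: list[dict]) -> None:
    """Insert fact rows; rows that hit the unique constraint are ignored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT, facts)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

fact = {
    "source_doc": "Vendor_Contract_0847.pdf",
    "section": "Liability",          # illustrative section name
    "metric": "Data Breach",
    "dimension": "Penalty",
    "value": "€1,500,000",
}
insert_facts(conn, [fact])
insert_facts(conn, [fact])  # ingesting the same document again adds nothing

row_count = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
print(row_count)
```

With this key, re-ingesting a document is idempotent, but a corrected value for the same cell is also ignored. If amended documents should overwrite earlier values, an update-on-conflict clause or a document version column in the key would be needed instead.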

1.6 The classification decision in detail

The classifier is the most consequential component of the ingestion pipeline. A misclassification at ingestion time propagates errors to every downstream query.

The rules above are correct for the majority of document content. Three edge cases require additional handling in production:

Paragraphs that contain numbers. A sentence like “Revenue increased 13% to 5.2 million in fiscal 2023, driven by enterprise adoption in APAC” contains both narrative (growth story, regional driver) and factual content (5.2 million, 13%, fiscal 2023). The correct approach is dual-routing: the full paragraph goes to the vector store as narrative context, and the extracted values go to the structured store as facts. The extraction uses regex or NER; the narrative stores the full sentence.

Legal clauses. A clause like “Penalty for data breach: €1,500,000 per incident” reads syntactically like a paragraph: it is short text, not a table. But it carries exact factual content. At 48 characters it also falls under the 50-character cutoff in classify_element above, so the simple classifier would skip it entirely. The classifier should recognise clause patterns (short paragraphs with explicit value assertions) and route them to the structured store, not the vector store.

Document metadata. Document-level information — title, author, date, version — belongs in both the graph store (as document node properties) and the structured store (for exact lookup). It should not go to the vector store at all.
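
The first two edge cases can be sketched as a small routing function. This is an illustration under stated assumptions, not the classifier from the vaa-core repository: the regular expressions, the 200-character clause limit, and the names `extract_values` and `route_paragraph` are all hypothetical, and a production system would likely replace the patterns with NER as noted above. Document metadata is left out because it depends on how the parser exposes it.

```python
import re

NUM = r"\d+(?:,\d{3})*(?:\.\d+)?"
PERCENT = re.compile(r"\d+(?:\.\d+)?%")
AMOUNT = re.compile(
    rf"[€$]\s?{NUM}(?:\s*(?:million|billion))?"   # "€1,500,000", "$5.2 million"
    rf"|{NUM}\s*(?:million|billion)"               # "5.2 million"
)
FISCAL_YEAR = re.compile(r"\bfiscal\s+(\d{4})\b", re.IGNORECASE)
# "Label: value" assertions, e.g. "Penalty for data breach: €1,500,000 ..."
CLAUSE = re.compile(r"^[^:]{3,80}:\s*[€$]?\s?\d")


def extract_values(text: str) -> list[tuple[str, str]]:
    """Pull out exact values that belong in the structured store."""
    values = [("percentage", m.group()) for m in PERCENT.finditer(text)]
    values += [("amount", m.group()) for m in AMOUNT.finditer(text)]
    values += [("fiscal_year", m.group(1)) for m in FISCAL_YEAR.finditer(text)]
    return values


def route_paragraph(text: str) -> dict:
    """Decide which stores a paragraph feeds, and with which values."""
    text = text.strip()
    values = extract_values(text)
    if CLAUSE.match(text) and len(text) < 200:
        # Short value assertion: a fact, even though it looks like prose.
        return {"routes": ["fact"], "values": values}
    if values:
        # Dual-routing: full sentence to vectors, values to structured store.
        return {"routes": ["narrative", "fact"], "values": values}
    return {"routes": ["narrative"], "values": []}


mixed = route_paragraph(
    "Revenue increased 13% to 5.2 million in fiscal 2023, "
    "driven by enterprise adoption in APAC"
)
clause = route_paragraph("Penalty for data breach: €1,500,000 per incident")
print(mixed)
print(clause)
```

Note that the clause check runs before any length filter, so a short clause is captured as a fact instead of being discarded as a fragment.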

The vaa-core repository contains the full classifier with all edge-case handling, tested against a corpus of financial reports, legal contracts, and clinical documents.

Part 2: Connect and Verify

2.1 The limit of single-document AI

The three-store ingestion pipeline solves precision. It does not, by itself, solve the hardest class of enterprise queries — the ones that require connecting information across multiple documents.

Consider what a legal team actually needs to do when evaluating vendor risk. They need to:

1. Find the active vendor contract (from a library of hundreds)
2. Identify the relevant penalty clause in that contract
3. Check whether that clause references an external regulation
4. Determine whether recent regulatory updates have changed the applicable threshold
5. Verify that the vendor’s data handling practices, described in their compliance documentation, meet the contractual requirements

Each of these steps touches a different document. A single-document RAG system can answer questions within a document. It cannot reason across them. It cannot find a vendor by name, retrieve their specific contract, navigate to the penalty section, cross-reference the regulation, and synthesise an answer that traces back to each source.

This is cross-document reasoning. It requires the graph store to track entity relationships across documents, and it requires a retrieval strategy that chains queries rather than executing a single similarity search.

2.2 The CloudSecure example

CloudSecure is the vendor. They process customer payment data on behalf of a European financial institution. In March 2025, CloudSecure reports a data breach affecting 12,000 customer records.

The legal team queries their RAG system: “What are our options and what is the maximum penalty CloudSecure is liable for?”

Here is how a vanilla RAG system handles this, and here is how the Hybrid RAG architecture handles it.

Vanilla RAG — what goes wrong:

The system embeds the query and searches its vector index. It retrieves:

Three paragraphs about data breach penalties from older contracts with lower thresholds
A summary of GDPR Article 83 (general rule, not the specific contracted amount)
A paragraph from CloudSecure’s marketing materials about their “industry-leading security”

The synthesised answer: “CloudSecure may be liable for penalties up to €150,000 under GDPR Article 83, and you may have grounds to terminate the contract.”

The actual maximum liability: €1,500,000, as specified in Clause 14.3 of the active Vendor Agreement v3.2, executed January 2024.

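The system was right that GDPR Article 83 is relevant, but €150,000 is not a GDPR figure: Article 83 caps administrative fines at €10 million or €20 million (or 2% or 4% of worldwide annual turnover, whichever is higher), and the €150,000 came from the older contracts. More importantly, the answer missed the contracted amount, which is what governs the vendor's liability here. The system retrieved general information about the topic instead of the specific clause in the specific active document.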

Hybrid RAG — the correct retrieval chain:

QUERY: "What are our options and what is the maximum penalty 
        CloudSecure is liable for?"

STEP 1 — Identify the entity
  Graph query: find Document nodes where vendor = "CloudSecure"
               AND status = "active"
  Result: Vendor_Agreement_CloudSecure_v3.2_Jan2024.pdf

STEP 2 — Locate the relevant clause
  Structured query: SELECT * FROM facts
                    WHERE source_doc = 'Vendor_Agreement_CloudSecure_v3.2...'
                    AND metric LIKE '%penalty%'
  Result: {metric: "Data Breach Penalty", value: "€1,500,000 per incident",
           section: "Clause 14.3 — Liability"}

STEP 3 — Retrieve narrative context around the clause
  Vector query: semantic search within source_doc = 'Vendor_Agreement_CloudSecure...'
                for "data breach liability options"
  Result: Clause 14.3 full text, Clause 14.4 (termination rights),
          Clause 8.2 (notification obligations)

STEP 4 — Cross-reference the regulation
  Graph traversal: Clause 14.3 node → MENTIONS → Entity{type: "Regulation"}
  Result: Entity{value: "GDPR Article 83(4)"} — the specific tier referenced

STEP 5 — Synthesise with provenance
  Answer: "Under Clause 14.3 of the active Vendor Agreement v3.2 (executed
           January 2024), CloudSecure is liable for €1,500,000 per incident
           in the event of a data breach. Clause 14.3 references GDPR Article
           83(4); this figure is the contracted maximum. Clause 14.4 grants
           you termination rights within 30 days of confirmed breach.
           Sources: Vendor_Agreement_CloudSecure_v3.2_Jan2024.pdf,
           Clause 14.3 (p.8), Clause 14.4 (p.9)."

Every number is exact. Every claim is sourced to a specific clause and page. The answer is not just plausible — it is verifiable. Anyone on the legal team can open the contract to page 8 and confirm the figure.

Here is the query routing implementation:

from enum import Enum
from dataclasses import dataclass


class QueryType(Enum):
    FACTUAL    = "factual"     # Requires exact value → structured store
    SEMANTIC   = "semantic"    # Requires meaning search → vector store
    RELATIONAL = "relational"  # Requires traversal → graph store
    COMPOUND   = "compound"    # Requires multiple stores


@dataclass
class RetrievalResult:
    answer_components: list[dict]  # Each component with source + store
    stores_used: list[str]
    provenance: list[dict]         # Full source trail


def classify_query(query: str) -> QueryType:
    """
    Determine which store(s) a query requires.

    Signals for factual queries: specific values, "exactly", "what is the",
    numbers, dates, "how much", "what percentage"

    Signals for relational queries: "which documents", "find all", "connected",
    "related to", "across", entity names with "contract" / "agreement"

    Signals for semantic queries: "what does", "explain", "describe",
    "what are the implications", open-ended interpretation questions
    """
    query_lower = query.lower()

    factual_signals    = ["what is the", "how much", "exact", "penalty",
                          "threshold", "percentage", "maximum", "minimum",
                          "how many", "what date", "what amount"]
    relational_signals = ["which documents", "find all", "connected to",
                          "across", "related to", "who is"]
    semantic_signals   = ["what does", "explain", "describe", "implications",
                          "what are", "how does", "why"]

    has_factual    = any(s in query_lower for s in factual_signals)
    has_relational = any(s in query_lower for s in relational_signals)
    has_semantic   = any(s in query_lower for s in semantic_signals)

    # Two or more signal families means more than one store is needed
    if sum((has_factual, has_relational, has_semantic)) >= 2:
        return QueryType.COMPOUND
    if has_factual:
        return QueryType.FACTUAL
    if has_relational:
        return QueryType.RELATIONAL
    return QueryType.SEMANTIC


def route_query(query: str, entity_filter: str = None) -> RetrievalResult:
    """
    Route a query to the appropriate store(s) and return results
    with full provenance.

    Args:
        query: Natural language query
        entity_filter: Optional entity name to scope the search
                       (e.g., vendor name, document name)

    Returns:
        RetrievalResult with components from each store queried
    """
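    # The three query_* helpers are store-specific and not shown here. Each is
    # expected to return a list of dicts with at least "source" and "store" keys.
    query_type = classify_query(query)
    components = []
    stores_used = []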

    if query_type in (QueryType.FACTUAL, QueryType.COMPOUND):
        # Structured store: exact lookup
        facts = query_structured_store(query, entity_filter)
        if facts:
            components.extend(facts)
            stores_used.append("structured")

    if query_type in (QueryType.SEMANTIC, QueryType.COMPOUND):
        # Vector store: semantic search
        narratives = query_vector_store(query, entity_filter)
        if narratives:
            components.extend(narratives)
            stores_used.append("vector")

    if query_type in (QueryType.RELATIONAL, QueryType.COMPOUND):
        # Graph store: entity traversal
        relations = query_graph_store(query, entity_filter)
        if relations:
            components.extend(relations)
            stores_used.append("graph")

    provenance = [
        {
            "source": c["source"],
            "section": c.get("section"),
            "store": c["store"],
            "confidence": c.get("confidence"),
        }
        for c in components
    ]

    return RetrievalResult(
        answer_components=components,
        stores_used=stores_used,
        provenance=provenance,
    )
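
The provenance step reads `c["source"]` and `c["store"]` from every component, while the ingestion code writes `source_doc` (facts and vector payloads), `doc` (Section nodes), and `path` (Document nodes). One way to bridge the two is a small normaliser that each store helper calls on its raw hits. The sketch below is an assumption about how that could look; `to_component` and the `content` key are hypothetical names, not part of the code above.

```python
from typing import Any

KNOWN_STORES = {"structured", "vector", "graph"}


def to_component(store: str, record: dict[str, Any]) -> dict[str, Any]:
    """Map a raw hit from any store onto the keys route_query reads."""
    if store not in KNOWN_STORES:
        raise ValueError(f"unknown store: {store!r}")

    # Each store names the source document differently at ingestion time.
    source = record.get("source_doc") or record.get("doc") or record.get("path")
    if source is None:
        # A hit without a source cannot be verified, so it is not returned.
        raise ValueError("hit has no source document")

    return {
        "store": store,
        "source": source,
        "section": record.get("section"),
        "confidence": record.get("confidence"),
        "content": record.get("value") or record.get("text") or record.get("title"),
    }


fact_hit = to_component(
    "structured",
    {
        "source_doc": "Vendor_Agreement_CloudSecure_v3.2_Jan2024.pdf",
        "section": "Clause 14.3",
        "metric": "Data Breach Penalty",
        "value": "€1,500,000 per incident",
    },
)
print(fact_hit)
```

Rejecting hits that lack a source keeps the provenance list complete: every component that reaches the LLM can be traced to a document.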

2.3 Canonical entities — the normalisation problem

Cross-document reasoning requires one more ingredient: the ability to recognise that “CloudSecure”, “CloudSecure GmbH”, “CloudSecure Group”, and “CS GmbH” all refer to the same vendor.

Without entity normalisation, a graph traversal looking for documents related to “CloudSecure” will miss all documents that use a variation of the name. The vendor breach is in a document that says “CloudSecure GmbH.” The penalty clause is in a contract that says “CloudSecure.” A naive graph query finds one but not the other.

Canonical entities solve this by creating a single authoritative node for each real-world entity — Entity{canonical: "CloudSecure", aliases: ["CloudSecure GmbH", "CloudSecure Group", "CS GmbH"]} — and routing all mentions, regardless of surface form, to that canonical node.

import re

# Canonical entity registry — extend as your domain grows
CANONICAL_ENTITIES = {
    "CloudSecure":   ["CloudSecure GmbH", "CloudSecure Group", "CS GmbH", "CloudSecure Inc"],
    "GDPR":          ["General Data Protection Regulation", "EU 2016/679", "Regulation (EU) 2016/679"],
    "Acme Inc":      ["Acme", "Acme Corporation", "Acme Corp", "ACME"],
}

# Reverse lookup: alias → canonical name
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical
    for canonical, aliases in CANONICAL_ENTITIES.items()
    for alias in [canonical] + aliases
}


def resolve_canonical(entity_value: str) -> str:
    """
    Resolve any entity surface form to its canonical name.

    Args:
        entity_value: Raw entity string from document extraction

    Returns:
        Canonical entity name, or the original if no mapping found
    """
    # Strip stray whitespace from extraction output before the case-insensitive lookup
    return ALIAS_TO_CANONICAL.get(entity_value.strip().lower(), entity_value)


def store_canonical_entity(entity_type: str, entity_value: str, section: str, doc: str) -> None:
    """
    Store an entity mention using its canonical form.
    All aliases point to the same canonical node in the graph.
    """
    canonical = resolve_canonical(entity_value)

    with neo4j.session() as session:
        session.run(
            """
            MERGE (e:Entity {canonical: $canonical, type: $type})
            ON CREATE SET e.aliases = [$raw]
            ON MATCH SET e.aliases = CASE
                WHEN NOT $raw IN e.aliases
                THEN e.aliases + $raw
                ELSE e.aliases
            END
            WITH e
            MERGE (s:Section {title: $section, doc: $doc})
            MERGE (s)-[:MENTIONS]->(e)
            """,
            canonical=canonical,
            type=entity_type,
            raw=entity_value,
            section=section,
            doc=doc,
        )

With canonical entities in place, a graph query for “all documents mentioning CloudSecure” correctly returns documents that use any surface form of the name — regardless of how the vendor is referenced in each individual document.
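The lookup described above can be sketched as follows. This is a minimal illustration, not the code from vaa-core: the function name `find_documents_mentioning`, the trimmed `ALIASES` map, and the query text are assumptions that reuse the `Section`, `Entity`, and `MENTIONS` names from the storage code above. The `session` argument is assumed to behave like a Neo4j driver session, with a `run(query, **params)` method that yields records indexable by key.

```python
from typing import Any

# Hypothetical, trimmed-down alias map for illustration only;
# in the article this role is played by ALIAS_TO_CANONICAL.
ALIASES = {
    "cloudsecure": "CloudSecure",
    "cloudsecure gmbh": "CloudSecure",
    "cloudsecure group": "CloudSecure",
    "cs gmbh": "CloudSecure",
}

# Matches every section linked to the canonical node, whatever surface
# form the original document used, and returns the distinct documents.
FIND_DOCS_QUERY = """
MATCH (s:Section)-[:MENTIONS]->(e:Entity {canonical: $canonical})
RETURN DISTINCT s.doc AS doc
ORDER BY doc
"""


def find_documents_mentioning(session: Any, name: str) -> list[str]:
    """Resolve `name` to its canonical form, then list documents that mention it."""
    canonical = ALIASES.get(name.strip().lower(), name)
    result = session.run(FIND_DOCS_QUERY, canonical=canonical)
    return [record["doc"] for record in result]
```

Because the alias is resolved before the query runs, a caller can pass "CS GmbH" or "CloudSecure Group" and still hit the same node. The graph query itself never needs to know about surface forms.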

2.4 Multi-hop reasoning — connecting the dots

The CloudSecure example above used four sequential queries across three stores. Each query used the output of the previous query as input. This is multi-hop reasoning: a chain of retrieval steps where each step narrows the search space and enriches the context.

Multi-hop reasoning is not possible with a single vector search. A vector search executes one operation: find chunks similar to the query. It cannot condition on the result of a previous retrieval, navigate relationships, or combine exact values with semantic context.

The Hybrid RAG architecture enables multi-hop by design: the graph store tracks which entities appear in which documents, the structured store provides exact values for those entities, and the vector store provides the narrative context surrounding them. The query router chains these stores in sequence, building a complete picture from components that individually would be insufficient.
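To make the chaining concrete, here is a toy sketch that uses in-memory dictionaries as stand-ins for the three stores. Everything in it is an assumption for illustration: the `multi_hop` function, the dictionary layouts, and the "Incident Report" document name are hypothetical, and the vector hop is reduced to a plain lookup rather than a real similarity search. The point is only the control flow, where each hop is restricted by the output of the hop before it.

```python
# Toy stand-ins for the three stores. All contents are illustrative.
GRAPH = {  # entity -> documents that mention it
    "CloudSecure": ["Vendor Agreement v3.2", "Incident Report"],
}
FACTS = {  # (document, fact key) -> exact value plus its location
    ("Vendor Agreement v3.2", "penalty_per_incident"): {
        "value": "€1,500,000",
        "section": "Clause 14.3",
    },
}
NARRATIVE = {  # document -> narrative chunk (a real system would rank by similarity)
    "Incident Report": "Narrative description of the vendor breach.",
}


def multi_hop(entity: str, fact_key: str) -> dict:
    """Chain three retrieval steps, each one scoped by the previous result."""
    # Hop 1 (graph): which documents mention the entity?
    docs = GRAPH.get(entity, [])

    # Hop 2 (structured): exact values, only from the documents found in hop 1.
    facts = [
        {"doc": d, **FACTS[(d, fact_key)]}
        for d in docs
        if (d, fact_key) in FACTS
    ]

    # Hop 3 (vector stand-in): narrative context, again scoped to hop 1's documents.
    context = [{"doc": d, "text": NARRATIVE[d]} for d in docs if d in NARRATIVE]

    return {"entity": entity, "documents": docs, "facts": facts, "context": context}
```

A single vector search has no equivalent of the `docs` variable: nothing carries the result of one retrieval into the next.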

The complete multi-hop implementation is in vaa-core/src/retrieval/multi_hop.py in the repository.

2.5 Traceability — why it matters beyond accuracy

The Hybrid RAG architecture produces answers with full provenance: every claim traces back to a specific store, a specific document, a specific section, a specific value. This is not merely an accuracy feature. It is a trust feature.

Consider the difference between two answers to the CloudSecure query:

Without provenance: “CloudSecure is liable for €1,500,000 per incident under the data breach penalty clause.”

With provenance: “CloudSecure is liable for €1,500,000 per incident. Source: Vendor Agreement v3.2, Clause 14.3, page 8. Retrieved from structured store (confidence: 0.94). Cross-referenced against graph node Entity{CloudSecure, active contracts}.”

The first answer is a claim. The second answer is a verifiable statement. Anyone who doubts the first can only accept or reject it. Anyone who doubts the second can open the contract to page 8.

For enterprise use cases — legal, compliance, medical, financial — the difference between a claim and a verifiable statement is the difference between a tool a professional will rely on and one they will not. Traceability is not a nice-to-have. It is the feature that determines adoption.

def format_answer_with_provenance(
    llm_response: str,
    retrieval_result: RetrievalResult,
) -> dict:
    """
    Package an LLM response with its complete retrieval provenance.

    Every answer surfaced to the user includes:
    - The response text
    - Which stores were queried
    - Full source list with document, section, store, confidence

    This structure enables downstream verification: any claim in the
    response can be traced to a specific source in the provenance list.
    """
    return {
        "answer": llm_response,
        "stores_queried": retrieval_result.stores_used,
        "sources": [
            {
                "document": p["source"],
                "section":  p["section"],
                "store":    p["store"],
                "confidence": p["confidence"],
            }
            for p in retrieval_result.provenance
        ],
        # Only claim verifiability when at least one source backs the answer
        "verifiable": bool(retrieval_result.provenance),
    }

2.6 The limit of Hybrid RAG — and where PageIndex fits

The three-store architecture solves precision and cross-document reasoning. It has one remaining limitation: it requires that documents be classified and chunked during ingestion. The vector store receives paragraph-level chunks — which means a long document is split into fragments, and the relationship between those fragments is only partially preserved in the graph.

For documents with deep hierarchy — a 200-page clinical trial report, a complex regulatory filing, a multi-section contract with nested annexes — chunking loses the structural context that determines which fact belongs to which section of which argument. Two paragraphs that were adjacent in the original document may be stored in different vector chunks with no link between them. Retrieval that reaches one may miss the other, even when both are necessary to answer the question correctly.

PageIndex solves this by replacing chunking entirely with tree-based indexing — preserving the full document hierarchy as a navigable structure and using an LLM to reason its way to the relevant sections rather than relying on vector proximity. Future articles will cover PageIndex architecture in detail.
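As a rough intuition for tree-based navigation, the sketch below walks a section tree from the root to a leaf, asking a chooser function which child to descend into at each level. This is not PageIndex's actual API or data model: `SectionNode`, `navigate`, and `keyword_chooser` are hypothetical names, and the keyword-overlap chooser is a deliberately crude stand-in for the LLM reasoning step.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SectionNode:
    """One node of the document hierarchy: a section and its subsections."""
    title: str
    text: str = ""
    children: list["SectionNode"] = field(default_factory=list)


Chooser = Callable[[str, list[SectionNode]], SectionNode]


def navigate(root: SectionNode, question: str, choose: Chooser) -> tuple[list[str], SectionNode]:
    """Descend from the root to a leaf, recording the path of section titles."""
    path = [root.title]
    node = root
    while node.children:
        node = choose(question, node.children)
        path.append(node.title)
    return path, node


def keyword_chooser(question: str, candidates: list[SectionNode]) -> SectionNode:
    """Stand-in for the LLM: pick the child whose title shares the most words with the question."""
    words = set(question.lower().split())
    return max(candidates, key=lambda c: len(words & set(c.title.lower().split())))
```

The returned path doubles as provenance: it records which section of which annex the answer came from, which is the structural context that flat chunking discards.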

Hybrid RAG and PageIndex are complementary. For most enterprise documents, Hybrid RAG is sufficient. For document-heavy pipelines where hierarchical context is essential — healthcare, law, financial regulation — PageIndex extends the architecture’s retrieval capability to match the structural complexity of the documents.

FAQ

Does Hybrid RAG require all three stores from day one? No. Start with vector and structured — they eliminate numeric hallucination and handle the majority of queries. Add graph when you encounter cross-document reasoning requirements. The architecture is additive.

What is the performance overhead of querying three stores? For most queries, only one or two stores are queried: the router classifies the query type and targets the appropriate store. Compound queries that touch all three can issue their independent lookups in parallel, which typically adds 20–80ms depending on dataset size. For enterprise use cases, that overhead is well within acceptable latency given the accuracy improvement. Multi-hop chains are the exception: each step consumes the previous step's output, so those steps run in sequence.

How does TOON format fit into this architecture? When the structured store returns facts, or when the graph store returns relationship data, that data is serialised as TOON before being injected into the LLM prompt. As covered in Part 3 on TOON format, this reduces token usage by 30–60% compared to JSON serialisation — directly reducing cost and context window pressure. The three stores produce structured output; TOON is how that output is efficiently handed to the model.

Can this architecture handle real-time document updates? Yes, with two caveats. New documents can be ingested incrementally — the pipeline appends to each store without reprocessing existing documents. Updated documents require re-ingestion of the changed document and cleanup of stale facts in the structured and graph stores. The vaa-core repository includes an update handler for both cases.

Is this architecture suitable for smaller teams without dedicated infrastructure? Yes, with appropriate store selection. For smaller pipelines, PostgreSQL can serve as both the relational store and a lightweight graph store (using recursive CTEs for relationship queries). Qdrant runs locally on a single machine. The three-store separation is an architectural pattern, not a requirement for three separate infrastructure components from day one.
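The recursive-CTE approach mentioned in the last answer can be sketched as follows. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The `WITH RECURSIVE` query has the same shape in both, though a PostgreSQL driver would use its own placeholder style and may need an explicit cast on the seed value. The `relationships` table, its columns, and the function names are assumptions for illustration, not the schema used in vaa-core.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a minimal edge table: one row per directed relationship."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relationships (source TEXT, target TEXT, kind TEXT)"
    )


def related_entities(conn: sqlite3.Connection, start: str, max_depth: int = 3) -> list[tuple]:
    """Return (name, depth) for everything reachable from `start` within `max_depth` hops."""
    return conn.execute(
        """
        WITH RECURSIVE reachable(name, depth) AS (
            SELECT ?, 0                          -- seed row: the starting entity
            UNION
            SELECT r.target, reachable.depth + 1 -- follow one more outgoing edge
            FROM relationships AS r
            JOIN reachable ON r.source = reachable.name
            WHERE reachable.depth < ?            -- depth bound also guards against cycles
        )
        SELECT name, MIN(depth) AS depth
        FROM reachable
        WHERE depth > 0
        GROUP BY name
        ORDER BY depth, name
        """,
        (start, max_depth),
    ).fetchall()
```

Edges here are followed in one direction only, from `source` to `target`. For a small corpus this covers the "which documents and clauses relate to this vendor" style of query, and the same table can migrate to a dedicated graph store later.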

What’s Next

This article has defined the full ingestion and retrieval architecture — the three stores, the classification logic, the query router, canonical entities, and cross-document reasoning. The next article applies this architecture to the highest-stakes domain where it has been deployed: healthcare.

In healthcare, a single retrieval error can lead to a dangerous recommendation. Traceability is not optional there; it is essential when answers matter.

All code from this article is in vaa-core: the full ingestion pipeline, the classifier with edge-case handling, the query router, canonical entity resolution, and the multi-hop retrieval chain.

Elyas Karbouch writes the Verifiable AI Architecture series about building AI pipelines that are trustworthy, auditable, and accurate. Subscribe to get each article when it publishes.
