RAG Architecture Patterns: Beyond Basic Document Q&A
Every team building with LLMs eventually arrives at the same place: the model needs access to private data, and fine-tuning is either too expensive, too slow, or too rigid. Retrieval-Augmented Generation (RAG) is the answer most reach for. The problem is that most RAG implementations plateau at “works in demos, fails in production.” The gap between a basic document Q&A prototype and a system that actually handles ambiguous queries, multi-hop reasoning, and noisy corpora is enormous.
At Harbor Software, we have built RAG pipelines for clients across legal, healthcare, and e-commerce. Every engagement taught us something about the architectural patterns that separate toy demos from production systems. This post catalogs those patterns, with specific trade-offs and implementation guidance for each one.
The Naive RAG Baseline (And Why It Breaks)
The standard RAG tutorial looks like this: chunk documents into 500-token blocks, embed them with text-embedding-ada-002, store in a vector database, retrieve top-k results by cosine similarity, stuff them into a prompt, and call the LLM. This works for about 40% of real queries. Here is why it falls apart for the other 60%:
- Chunking destroys context. A 500-token chunk from the middle of a legal contract loses the definitions section, the parties involved, and the governing clause. The LLM hallucinates because the retrieved context is incomplete.
- Embedding similarity is not relevance. “What is our refund policy?” and “How do I get my money back?” are semantically similar but might retrieve different chunks depending on how the source documents phrase things. Worse, embedding models can score passages about refund policies in competitor analyses just as highly as your actual policy document.
- Top-k retrieval is a blunt instrument. Retrieving 5 chunks when the answer spans 2 documents and 8 relevant paragraphs means you either miss critical context or stuff the prompt with noise.
- No metadata awareness. The vector search treats a draft document from 2019 identically to the current approved version. It treats an internal memo the same as an official policy. Without metadata filtering, your RAG system cannot distinguish authoritative from incidental mentions.
Understanding these failure modes is the starting point for every advanced pattern below. Each pattern addresses one or more of these failures, and the art is in combining them correctly for your specific use case.
Pattern 1: Hierarchical Chunking with Parent-Child Retrieval
Instead of flat chunking, create a two-level hierarchy. The child chunks are small (200-300 tokens) for precise embedding similarity. The parent chunks are large (1000-1500 tokens) for complete context. When a child chunk matches, you retrieve and pass the parent chunk to the LLM.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size is measured by length_function, which defaults to len()
# (characters). Supply a token-counting length_function for token budgets.

# Child chunks for retrieval precision
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

# Parent chunks for context completeness
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200,
    separators=["\n\n", "\n"]
)

# vector_store and embed() are assumed to be defined elsewhere.
def index_document(doc):
    parents = parent_splitter.split_text(doc.text)
    for i, parent in enumerate(parents):
        # Split each parent again so every child knows which parent it came from
        children = child_splitter.split_text(parent)
        for j, child in enumerate(children):
            vector_store.upsert({
                "id": f"{doc.id}_p{i}_c{j}",
                "text": child,
                "embedding": embed(child),
                "metadata": {
                    "parent_id": f"{doc.id}_p{i}",
                    "parent_text": parent,
                    "source": doc.source,
                    "doc_type": doc.document_type,
                    "effective_date": doc.effective_date
                }
            })
The key insight is that retrieval operates on the child level, where matches are more precise because a short passage's embedding is not diluted by unrelated sentences, while the LLM receives the parent level, which carries the surrounding paragraphs. In our benchmarks across 3 client projects, this pattern improved answer accuracy by 23-31% compared to flat chunking at the same token budget.
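The indexing code above covers only the write path. As a minimal sketch of the read path, assuming child hits come back as dicts carrying the metadata written at index time plus a similarity "score" field (the field name and the function name collapse_to_parents are illustrative, not part of any library), the child-to-parent step could look like this:

```python
def collapse_to_parents(child_hits, max_parents=3):
    """Map ranked child hits to unique parent chunks, best child first.

    child_hits: list of dicts shaped like the records written at index time,
    plus a similarity "score" (hypothetical field name).
    """
    best = {}  # parent_id -> (best child score, parent_text)
    for hit in child_hits:
        meta = hit["metadata"]
        pid = meta["parent_id"]
        # Keep the highest child score seen for each parent.
        if pid not in best or hit["score"] > best[pid][0]:
            best[pid] = (hit["score"], meta["parent_text"])
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [
        {"parent_id": pid, "score": score, "text": text}
        for pid, (score, text) in ranked[:max_parents]
    ]


hits = [
    {"score": 0.91, "metadata": {"parent_id": "doc1_p0", "parent_text": "Section 1 ..."}},
    {"score": 0.88, "metadata": {"parent_id": "doc1_p0", "parent_text": "Section 1 ..."}},
    {"score": 0.80, "metadata": {"parent_id": "doc1_p2", "parent_text": "Section 3 ..."}},
]
parents = collapse_to_parents(hits)
```

Collapsing by parent_id matters because several children of the same parent often match the same query; without it the prompt would contain the same parent text more than once.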
The implementation details matter. The child chunk size should be small enough that a single chunk contains one idea or statement. If your chunks contain multiple ideas, the embedding becomes a blurred average that matches too many queries. The parent chunk size should be large enough to contain the full context needed to understand any child within it. For legal documents, this usually means the full section or subsection. For technical documentation, it is the full page or topic block.
We also store the document structure (section headings, table of contents position) as metadata on each chunk. This lets us reconstruct the reading order and provide breadcrumb context (“This excerpt is from Section 4.2: Termination Clauses of the Master Services Agreement”) which dramatically improves the LLM’s ability to interpret the retrieved content correctly.
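A small sketch of how such a breadcrumb might be assembled. The metadata keys "doc_title" and "section_path" are hypothetical; substitute whatever structural fields your indexer actually stores.

```python
def breadcrumb(meta):
    """Build a one-line provenance header from chunk metadata.

    The keys "doc_title" and "section_path" are hypothetical; use whatever
    structural fields your indexer stores.
    """
    path = meta.get("section_path", [])
    title = meta.get("doc_title", "an unknown document")
    if not path:
        return f"This excerpt is from {title}."
    return f"This excerpt is from {' > '.join(path)} of {title}."


def with_breadcrumb(chunk_text, meta):
    # Prepend the header so the LLM sees provenance before the content.
    return f"[{breadcrumb(meta)}]\n{chunk_text}"


example = with_breadcrumb(
    "Either party may terminate with written notice.",
    {
        "doc_title": "the Master Services Agreement",
        "section_path": ["Section 4: Term", "Section 4.2: Termination Clauses"],
    },
)
```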
When to use it
Hierarchical chunking shines when your documents have dense, interconnected content: legal contracts, technical specifications, medical records, financial reports. It adds indexing complexity and roughly doubles your storage requirements (more if you copy the parent text into every child's metadata, as the snippet above does, rather than storing each parent once and looking it up by parent_id), so skip it if your documents are naturally self-contained (like FAQ entries or product descriptions where each item is independent).
Pattern 2: Query Decomposition and Multi-Step Retrieval
Many real user queries are compound: “Compare our Q3 and Q4 revenue and explain why the churn rate increased.” A single embedding lookup cannot handle this. The query embeds into a single vector that is a blurred average of all its sub-questions, matching nothing well. You need to decompose the query into sub-queries, retrieve for each, then synthesize.
import openai
import json

def decompose_query(query: str) -> list[str]:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Break the user's question into independent
sub-questions that can each be answered by searching a document
collection. Return a JSON object with a "questions" key containing
an array of strings. If the question is already atomic, return
it as a single-element array."""
        }, {
            "role": "user",
            "content": query
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)["questions"]

def multi_step_retrieve(query: str, k: int = 5) -> list[dict]:
    sub_queries = decompose_query(query)
    all_results = []
    seen_ids = set()
    for sq in sub_queries:
        results = vector_store.similarity_search(sq, k=k)
        for r in results:
            if r.id not in seen_ids:
                all_results.append(r)
                seen_ids.add(r.id)
    # Re-rank combined results by relevance to original query
    return rerank(query, all_results, top_n=k)

# Example decomposition:
# Input: "Compare Q3 and Q4 revenue and explain churn increase"
# Output: [
#   "What was the Q3 revenue?",
#   "What was the Q4 revenue?",
#   "What caused the increase in churn rate?"
# ]
The decomposition step adds one LLM call (about 200ms with GPT-4o, 80ms with GPT-4o-mini) but dramatically improves recall for complex queries. We measured a 45% improvement in recall@10 on a legal document corpus where most queries involved comparing clauses across multiple contracts.
There is a subtlety in how you handle the decomposed results. Simply concatenating all retrieved chunks and passing them to the LLM often exceeds the context window or buries relevant results in noise. Our approach is to retrieve generously (top-10 per sub-query, rather than the k=5 default in the snippet above), deduplicate by chunk ID, then re-rank the combined pool against the original full query using a cross-encoder. This ensures the final context set is both comprehensive (covering all sub-queries) and focused (ranked by overall relevance).
The re-ranking step matters
After gathering results from multiple sub-queries, you have a pool of candidates that may be relevant to different parts of the original question. A cross-encoder re-ranker like cross-encoder/ms-marco-MiniLM-L-12-v2 or Cohere’s rerank endpoint scores each candidate against the full original query, producing much better ordering than raw embedding similarity. Bi-encoder (embedding) search is fast but approximate. Cross-encoder re-ranking is slower (it processes query-document pairs rather than comparing pre-computed vectors) but far more accurate. This step costs about 50ms for 20 candidates and is almost always worth it. In our testing, re-ranking improved NDCG@5 by 18% on average across four different corpora.
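The rerank call in the earlier snippet is not defined there. A minimal sketch of what such a helper could look like, assuming candidates are dicts with a "text" key and the scorer is passed in as a function. The token-overlap scorer is only a toy stand-in so the example runs without a model; with sentence-transformers you could pass CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2").predict instead.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Order candidates by a pairwise relevance score against the full query.

    score_fn takes a list of (query, text) pairs and returns one score per
    pair. With sentence-transformers this could be
    CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2").predict.
    """
    if not candidates:
        return []
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]


def overlap_score(pairs):
    """Toy stand-in scorer: fraction of query tokens found in the text."""
    out = []
    for query, text in pairs:
        q = set(query.lower().split())
        t = set(text.lower().split())
        out.append(len(q & t) / len(q) if q else 0.0)
    return out


pool = [
    {"id": "a", "text": "Q3 revenue was flat"},
    {"id": "b", "text": "churn rate increased in Q4 after the price change"},
    {"id": "c", "text": "office relocation schedule"},
]
top = rerank("why did churn rate increase in Q4", pool, overlap_score, top_n=2)
```

Scoring every pair in one batch call, rather than one call per candidate, is what keeps cross-encoder latency manageable for a pool of 20 or so candidates.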
We have also experimented with iterative retrieval, where the LLM evaluates the first round of results and generates follow-up queries to fill gaps. This works well for research-style questions but adds 2-3 seconds of latency per iteration, so we only use it in async workflows where the user is not waiting for an immediate response.
Pattern 3: Metadata Filtering and Hybrid Search
Pure vector similarity search treats all documents as equally relevant regardless of source, date, or category. In practice, metadata filtering eliminates entire classes of retrieval errors. This is the most underutilized pattern we see: teams spend weeks tuning embedding models when adding three metadata filters would solve their problems immediately.
# Instead of just: vector_store.search(query, k=5)
# Use metadata filters to constrain the search space
results = vector_store.search(
    query=query_embedding,
    k=5,
    filter={
        "source_type": {"$in": ["policy", "handbook"]},
        "effective_date": {"$gte": "2024-01-01"},
        "department": {"$eq": user_department},
        "status": {"$eq": "approved"}
    }
)

# Hybrid search: combine vector similarity with BM25 keyword matching
from rank_bm25 import BM25Okapi

# bm25_index = BM25Okapi(tokenized_corpus), built once at index time

def hybrid_search(query: str, k: int = 5, alpha: float = 0.7):
    # Vector search (semantic understanding)
    vec_results = vector_store.search(embed(query), k=k*2)
    # BM25 keyword search (exact term matching)
    tokenized_query = query.lower().split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    bm25_top = sorted(
        enumerate(bm25_scores),
        key=lambda x: x[1],
        reverse=True
    )[:k*2]
    # Reciprocal rank fusion. reciprocal_rank_fusion is a project helper
    # (not shown); it is assumed here to weight the vector ranking by alpha
    # and the BM25 ranking by 1 - alpha.
    combined = reciprocal_rank_fusion(
        vec_results, bm25_top, k=60, alpha=alpha  # k is the RRF constant
    )
    return combined[:k]
Hybrid search catches cases that pure vector search misses. Product codes like “SKU-7742-B”, version numbers like “v2.3.1”, acronyms like “HIPAA” or “SOC2”, and proper nouns like “Johnson v. State of California” are often mangled by embedding models. These models compress text into semantic meaning, which is great for concepts but terrible for exact identifiers. BM25 handles these well because it matches literal tokens, provided your tokenizer keeps identifiers intact. In one e-commerce project, hybrid search improved retrieval precision for product-specific queries from 62% to 89%.
The alpha parameter in hybrid search controls the balance between semantic and keyword matching. We default to 0.7 (70% vector, 30% BM25) for general-purpose RAG. For technical documentation with lots of code references and identifiers, we shift to 0.5. For creative or conceptual queries, we use 0.85. We have built a simple classifier that detects query type and adjusts alpha automatically, though honestly the default 0.7 works well enough for most use cases that the classifier is rarely worth the added complexity.
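Standard reciprocal rank fusion is unweighted, so alpha has to enter somewhere. One way to do it, sketched here as an assumption rather than as the helper used in the snippet above, is to scale each list's contribution: the vector ranking by alpha and the BM25 ranking by 1 - alpha. The function name weighted_rrf and the plain ID lists are illustrative.

```python
def weighted_rrf(vec_ids, bm25_ids, alpha=0.7, k=60):
    """Fuse two ranked ID lists with a weighted reciprocal rank fusion.

    Each list contributes weight / (k + rank), with rank starting at 1.
    alpha weights the vector list and (1 - alpha) the BM25 list.
    """
    scores = {}
    for weight, ranking in ((alpha, vec_ids), (1.0 - alpha, bm25_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vec_ranked = ["d1", "d2", "d3"]
bm25_ranked = ["d3", "d4", "d1"]
fused = weighted_rrf(vec_ranked, bm25_ranked, alpha=0.7)
```

Because the fusion works on ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on very different scales.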
Pattern 4: Contextual Compression and Prompt Engineering
Even with perfect retrieval, you can waste your context window on irrelevant sentences within retrieved chunks. A 1500-token chunk retrieved because one sentence matches the query still contains 1400 tokens of noise. Contextual compression extracts only the sentences relevant to the query from each retrieved chunk before passing them to the LLM.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# The extractor asks the LLM to keep only query-relevant text from each chunk
compressor = LLMChainExtractor.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(
        search_kwargs={"k": 10}
    )
)
This is expensive (one LLM call per retrieved chunk) but powerful. We use it selectively: for high-stakes queries (legal, medical) where accuracy matters more than latency, and in pipelines where the final LLM call uses a model with a smaller context window. For cost-sensitive applications, gpt-4o-mini handles compression well at roughly $0.0003 per chunk. On a 10-chunk retrieval, that is $0.003 per query for compression alone.
An alternative to LLM-based compression is extractive compression using sentence embeddings. Embed each sentence in the chunk, compute cosine similarity with the query embedding, and keep only sentences above a 0.7 similarity threshold. If the sentence embeddings are precomputed at index time, this adds no per-query model calls and runs in under 1ms per chunk; otherwise you pay an extra embedding call for the sentences. It is less accurate than LLM-based compression because it scores each sentence in isolation and handles heavy paraphrasing and implicit relevance poorly.
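A minimal sketch of the extractive approach. The bag-of-words toy_embed function is a stand-in so the example runs without a model; in practice you would pass your real sentence-embedding function and a matching cosine implementation. All function names here are illustrative.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def compress_chunk(query, chunk, embed=toy_embed, threshold=0.7):
    """Keep only sentences whose similarity to the query meets the threshold."""
    q_vec = embed(query)
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    return [s for s in sentences if cosine(q_vec, embed(s)) >= threshold]


chunk = (
    "Refunds are issued within 14 days. "
    "Our office is closed on public holidays. "
    "Refunds require the original receipt."
)
# The toy embedding yields low scores, so the demo uses a lower threshold
# than the 0.7 suggested for real sentence embeddings.
kept = compress_chunk("refunds receipt", chunk, threshold=0.25)
```

The right threshold depends heavily on the embedding model, so it is worth calibrating it against a handful of labeled chunks before relying on any fixed value.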
Prompt structure for RAG
The prompt template matters more than most teams realize. We have settled on a structure that consistently outperforms the “here are some documents, answer the question” approach. The system message establishes the role, constraints, and citation format. Retrieved context is presented with clear source labels, document dates, and visual separators between chunks. The user query comes last, with an explicit instruction to cite sources by their labels. A final instruction reminds the model to say “I don’t have enough information to answer this” when the context does not support an answer, and to distinguish between what the sources say and what the model might know independently.
That final instruction is critical. Without it, the model will hallucinate plausible-sounding answers using its parametric knowledge, defeating the entire purpose of RAG. We have seen this cause real harm in legal and medical applications where the model confidently stated facts that sounded authoritative but were not in the source documents. Adding the explicit “only use the provided context” instruction reduced unsupported claims from 12% to under 3% in our testing.
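To make the structure concrete, here is an illustrative sketch of a prompt builder following that layout. The wording of the system prompt, the [S1]-style labels, and the chunk field names ("text", "source", "date") are assumptions for the example, not the exact template we ship.

```python
SYSTEM_PROMPT = (
    "You answer questions using only the provided sources. "
    "Cite sources by their labels, for example [S1]. "
    "If the sources do not support an answer, reply exactly: "
    "\"I don't have enough information to answer this.\" "
    "Do not add facts from your own knowledge."
)


def build_messages(query, chunks):
    """Assemble chat messages: system rules, labeled context, then the query.

    chunks: list of dicts with "text", "source" and "date" keys
    (hypothetical field names).
    """
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[S{i}] {chunk['source']} (dated {chunk['date']})"
        blocks.append(f"{header}\n{chunk['text']}")
    # Visual separators keep chunk boundaries unambiguous for the model.
    context = "\n\n-----\n\n".join(blocks)
    user = (
        f"Sources:\n\n{context}\n\n-----\n\n"
        f"Question: {query}\n"
        "Cite the source label for every claim."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "How much notice is required to terminate?",
    [{"text": "Either party may terminate with reasonable notice.",
      "source": "Master Services Agreement", "date": "2024-03-01"}],
)
```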
Pattern 5: Agentic RAG with Tool Use
The most advanced pattern replaces the fixed retrieve-then-generate pipeline with an agent that decides when and how to retrieve. The agent can reformulate queries, retrieve multiple times, query different data sources, and validate its own answers before presenting them to the user.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search the company knowledge base for policies, procedures, and documentation. Use specific keywords and phrases.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "source_filter": {
                        "type": "string",
                        "enum": ["policies", "technical", "financial", "hr", "all"],
                        "description": "Filter by document category"
                    },
                    "date_after": {
                        "type": "string",
                        "format": "date",
                        "description": "Only return docs updated after this date"
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_sql_database",
            "description": "Query structured business data (revenue, headcount, metrics). Use for numerical and analytical questions.",
            "parameters": {
                "type": "object",
                "properties": {
                    "sql_query": {"type": "string", "description": "SQL query to execute"}
                },
                "required": ["sql_query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "verify_claim",
            "description": "Verify a specific factual claim against the knowledge base before including it in the response.",
            "parameters": {
                "type": "object",
                "properties": {
                    "claim": {"type": "string"},
                    "source_context": {"type": "string"}
                },
                "required": ["claim"]
            }
        }
    }
]
Agentic RAG handles the hardest class of queries: those requiring multiple retrieval steps, cross-referencing structured and unstructured data, and iterative refinement. The agent might first search for a policy, find that it references another policy, then search for that one. Or it might retrieve financial data from a SQL database and then search for the narrative explanation in a board report.
The cost is latency (3-8 seconds per query vs. 1-2 seconds for single-step RAG) and complexity (you need robust error handling, loop detection to prevent infinite retrieval cycles, and token budget management to ensure the agent does not consume all available context before generating the answer). We also track the agent’s tool usage patterns: if more than 40% of queries result in 3+ retrieval steps, it usually means the knowledge base organization needs improvement rather than more sophisticated retrieval.
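A minimal sketch of the step cap and loop detection described above. The next_action and execute_tool callables are hypothetical stand-ins for the LLM call and the tool layer; token budget management is left out to keep the example short.

```python
import json


def run_agent(next_action, execute_tool, max_steps=6):
    """Drive a tool-using agent with a step cap and repeated-call detection.

    next_action(history) -> {"tool": name, "args": {...}} or {"answer": text}
    execute_tool(name, args) -> result string
    Both callables are hypothetical stand-ins for the LLM and tool layer.
    """
    history = []
    seen_calls = set()
    for _ in range(max_steps):
        action = next_action(history)
        if "answer" in action:
            return {"answer": action["answer"], "steps": len(history), "stopped": None}
        # Canonical key so identical calls are recognized regardless of arg order.
        key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        if key in seen_calls:
            return {"answer": None, "steps": len(history), "stopped": "repeated_call"}
        seen_calls.add(key)
        result = execute_tool(action["tool"], action["args"])
        history.append({"call": action, "result": result})
    return {"answer": None, "steps": len(history), "stopped": "max_steps"}


def looping_policy(history):
    # Toy policy that keeps issuing the same search, to exercise the guard.
    return {"tool": "search_knowledge_base", "args": {"query": "travel policy"}}


outcome = run_agent(looping_policy, lambda name, args: "no results")
```

Returning the step count alongside the answer also gives you the raw data for the tool-usage metric mentioned above: log it per query and watch the share of queries that need three or more steps.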
Choosing the Right Pattern
Not every project needs agentic RAG. Here is our decision framework based on deploying these patterns across 15+ client projects:
- Simple FAQ / single-document lookup: Naive RAG with good chunking is fine. Invest your time in chunk quality and metadata, not architecture. This handles internal knowledge bases, product documentation, and customer support scenarios where questions map to specific documents.
- Multi-document, single-topic queries: Hierarchical chunking + hybrid search. This covers 70% of enterprise use cases. Legal document review, compliance checking, technical troubleshooting all fall here.
- Complex, multi-hop queries: Query decomposition + re-ranking. Add contextual compression if your context window is tight or your documents are verbose. Research analysis, competitive intelligence, and audit queries live here.
- Cross-source, analytical queries: Agentic RAG. Reserve this for cases where users genuinely need to combine structured and unstructured data, or where the retrieval path cannot be determined in advance. Executive decision support and ad-hoc business analysis are the primary use cases.
The most common mistake we see is teams jumping to agentic RAG for problems that hierarchical chunking and metadata filtering would solve. The simpler the architecture, the easier it is to debug, monitor, and improve. Start simple, measure where retrieval fails, and add complexity only where the data demands it.
Evaluation: The Part Everyone Skips
You cannot improve what you do not measure. We build evaluation datasets for every RAG project: 100-200 question-answer pairs with source references, manually curated by domain experts. The curation process takes 2-3 days and is the single highest-ROI investment in the entire project. We track three metrics:
- Retrieval recall@k: Does the correct source document appear in the top-k results? This isolates retrieval quality from generation quality. If recall@5 is below 80%, no amount of prompt engineering will fix your answers. The retrieval layer needs work first.
- Answer correctness: Judged by GPT-4 against reference answers on a 1-5 scale, with the judge itself validated against human judgment. We find 92% agreement between the LLM judge and human reviewers at the binary correct/incorrect level. For ambiguous cases, we use two independent human judges and take the majority vote.
- Faithfulness: Does the generated answer only use information from the retrieved context? We use a dedicated prompt that decomposes the answer into individual claims and checks each one against the provided sources. This catches the subtle hallucinations that manual review often misses: the model stating a policy “requires 30 days notice” when the source says “reasonable notice” without specifying a timeframe.
These metrics run in CI on every change to the retrieval pipeline, the prompt templates, or the chunking strategy. If recall drops below our threshold (typically 85% recall@5), the pull request does not merge. If faithfulness drops below 93%, we investigate before deploying. This discipline is what separates RAG systems that improve over time from ones that silently degrade as the document corpus evolves.
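A minimal sketch of the recall@k check that such a CI gate could run. The evaluation-set shape and the retrieve callable are assumptions for the example; the 0.85 default mirrors the threshold mentioned above.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected source appears in the top-k results.

    eval_set: list of {"question": str, "source_id": str}
    retrieve(question, k) -> list of source IDs, best first (hypothetical).
    """
    if not eval_set:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for ex in eval_set if ex["source_id"] in retrieve(ex["question"], k)[:k]
    )
    return hits / len(eval_set)


def check_gate(recall, threshold=0.85):
    """Return True when recall meets the threshold; a CI job would fail otherwise."""
    return recall >= threshold


eval_set = [
    {"question": "refund window?", "source_id": "policy-12"},
    {"question": "notice period?", "source_id": "msa-4.2"},
]
fake_index = {"refund window?": ["policy-12", "faq-3"], "notice period?": ["faq-9"]}
recall = recall_at_k(eval_set, lambda q, k: fake_index[q], k=5)
```

In a CI job, a failing check_gate would translate into a non-zero exit code so the pull request is blocked automatically.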
One final note: evaluation datasets need maintenance too. As your document corpus changes, some questions become outdated and new edge cases emerge. We schedule quarterly reviews of the evaluation set, adding 20-30 new question-answer pairs that reflect real user queries that failed or produced low-confidence answers in production. This keeps the evaluation set representative of actual usage patterns rather than the idealized scenarios you imagined at project kickoff.
RAG is not a single architecture. It is a spectrum of patterns, each with specific trade-offs in accuracy, latency, cost, and complexity. The teams that succeed are the ones that match the pattern to their data and queries, measure relentlessly, and resist the urge to over-engineer before they have the data to justify the complexity.