An Introduction to Retrieval-Augmented Generation
Large language models are remarkably capable and fundamentally limited. They can generate fluent, contextually appropriate text on almost any topic, but they cannot access information beyond their training data cutoff. Ask GPT-3 about a document you wrote last week, and it will either confess ignorance or—worse—hallucinate a plausible-sounding answer with fabricated details. Retrieval-Augmented Generation (RAG) addresses this limitation by combining a retrieval system (which finds relevant documents from your data) with a generative model (which produces answers grounded in those documents). The result is an AI system that answers questions about your data instead of relying solely on what it memorized during training. This post explains how RAG works, walks through building a production implementation, and catalogs the failure modes that teams encounter in practice.
The Architecture
A RAG system executes three steps for every user query:
- Retrieve — Search a document corpus to find passages relevant to the user’s question. This is typically vector search (embed the query, find nearest neighbors) but can also use keyword search or a hybrid of both.
- Augment — Insert the retrieved passages into a prompt template along with the original question. This gives the language model the context it needs to answer accurately.
- Generate — Send the augmented prompt to a large language model, which produces an answer based on the provided context rather than its parametric knowledge.
The simplicity of this architecture is deceptive. Each step has significant design decisions that affect answer quality, and the steps interact in ways that compound good and bad choices. Let me show the complete pipeline in code before diving into each step:
```python
from dataclasses import dataclass
import time

import openai


@dataclass
class RAGResponse:
    answer: str
    sources: list[dict]
    confidence: float
    latency_ms: float


class RAGPipeline:
    def __init__(self, retriever, llm_model: str = "gpt-3.5-turbo"):
        # The retriever is assumed to expose search(question, top_k) and to
        # return objects with .text, .score and .doc_id attributes.
        self.retriever = retriever
        self.llm_model = llm_model

    def query(self, question: str, top_k: int = 5) -> RAGResponse:
        start = time.perf_counter()

        # Step 1: Retrieve relevant documents
        documents = self.retriever.search(question, top_k=top_k)

        # Step 2: Build the augmented prompt
        context = self._build_context(documents)
        prompt = self._build_prompt(question, context)

        # Step 3: Generate answer
        # Note: openai.ChatCompletion is the pre-1.0 SDK interface. With
        # openai>=1.0 the equivalent call is client.chat.completions.create(...).
        response = openai.ChatCompletion.create(
            model=self.llm_model,
            messages=[
                {"role": "system", "content": self._system_prompt()},
                {"role": "user", "content": prompt},
            ],
            temperature=0.1,
            max_tokens=500,
        )

        latency = (time.perf_counter() - start) * 1000
        return RAGResponse(
            answer=response.choices[0].message.content,
            sources=[
                {"text": d.text[:200], "score": d.score, "doc_id": d.doc_id}
                for d in documents
            ],
            confidence=self._estimate_confidence(documents),
            latency_ms=latency,
        )

    def _system_prompt(self) -> str:
        return (
            "You are a helpful assistant that answers questions based only on the "
            "provided context. If the context does not contain enough information to "
            "answer the question fully, say so explicitly. Do not use prior knowledge "
            "to fill gaps. Cite your sources using [Source N] notation."
        )

    def _build_context(self, documents: list) -> str:
        # One labelled block per retrieved passage, separated by blank lines
        blocks = []
        for i, doc in enumerate(documents, 1):
            blocks.append(f"[Source {i}] (relevance: {doc.score:.2f})\n{doc.text}")
        return "\n\n".join(blocks)

    def _build_prompt(self, question: str, context: str) -> str:
        return (
            "Based on the following sources, answer the question.\n\n"
            f"Sources:\n{context}\n\n"
            f"Question: {question}\n\n"
            "Provide a clear, concise answer with source citations:"
        )

    def _estimate_confidence(self, documents: list) -> float:
        # Heuristic based on retrieval scores only; assumes results are sorted
        # by descending score.
        if not documents:
            return 0.0
        top_score = documents[0].score
        score_spread = documents[0].score - documents[-1].score if len(documents) > 1 else 0
        # High confidence when top result is strong AND there's clear separation from lower results
        if top_score > 0.85 and score_spread > 0.15:
            return 0.9
        elif top_score > 0.7:
            return 0.7
        elif top_score > 0.5:
            return 0.5
        return 0.3
```
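The pipeline only assumes that the retriever exposes `search(question, top_k)` and returns objects carrying a text, a score, and a document ID. As a minimal sketch of that contract, here is a brute-force in-memory retriever. The names `RetrievedDocument`, `InMemoryVectorRetriever`, and `toy_embed` are hypothetical, the attribute name `doc_id` is an assumption, and the hashing embedding is a stand-in for a real embedding model, useful only for wiring up and testing the pipeline.

```python
import zlib
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class RetrievedDocument:
    doc_id: str
    text: str
    score: float


class InMemoryVectorRetriever:
    """Brute-force cosine-similarity search over a small corpus (illustrative only)."""

    def __init__(self, embed: Callable[[str], np.ndarray]):
        self.embed = embed
        self.doc_ids: list[str] = []
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, doc_id: str, text: str) -> None:
        self.doc_ids.append(doc_id)
        self.texts.append(text)
        self.vectors.append(self._normalize(self.embed(text)))

    def search(self, query: str, top_k: int = 5) -> list[RetrievedDocument]:
        if not self.vectors:
            return []
        q = self._normalize(self.embed(query))
        # Dot product of unit vectors is the cosine similarity
        scores = np.stack(self.vectors) @ q
        order = np.argsort(-scores)[:top_k]  # highest score first
        return [
            RetrievedDocument(self.doc_ids[i], self.texts[i], float(scores[i]))
            for i in order
        ]

    @staticmethod
    def _normalize(v: np.ndarray) -> np.ndarray:
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v


def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words hashing embedding. Not a substitute for a real model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


retriever = InMemoryVectorRetriever(embed=toy_embed)
retriever.add("refund", "the refund policy allows returns within 30 days")
retriever.add("pricing", "the enterprise plan costs 750 per month")
results = retriever.search("refund policy returns", top_k=2)
```

In a real deployment the `embed` callable would wrap an embedding model and the brute-force scan would be replaced by a vector index; the interface seen by the pipeline stays the same.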
Why RAG Works Better Than Fine-Tuning for Knowledge
The alternative to RAG for incorporating custom knowledge into an LLM is fine-tuning—training the model on your specific data so it memorizes the information. Fine-tuning has three fundamental problems that RAG avoids:
- Staleness. A fine-tuned model’s knowledge is frozen at training time. When your documentation changes (which, in a software company, means daily), you must re-fine-tune. Each fine-tuning run costs $50–$200, takes 2–4 hours, and requires assembling a training dataset. A RAG system uses live data: update a document in your vector index, and the next query uses the updated version. There is no retraining and no delay, and the only cost is re-embedding the changed document.
- Opacity. A fine-tuned model cannot cite its sources. When it says “the refund policy allows returns within 30 days,” you cannot trace that claim back to a specific document. Is it from the current policy? An outdated version? A hallucination? You have no way to know. A RAG system returns the retrieved documents alongside the answer, enabling users to verify every claim and trace it to its source.
- Cost and scale. Fine-tuning GPT-3.5 on 10,000 documents costs $50–$200 per training run. If your corpus changes weekly, that is $2,600–$10,400 per year just for retraining. RAG requires embedding generation ($1–$5 for the same corpus, one-time cost) plus per-query inference costs (which you pay with fine-tuning too). At our usage levels, RAG costs roughly 80% less than a comparable fine-tuning approach.
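The annual retraining figure above is simple arithmetic, sketched here so you can plug in your own numbers. The function name `annual_retraining_cost` is hypothetical; the per-run costs and the weekly cadence are the ones quoted in the text.

```python
def annual_retraining_cost(cost_per_run: float, runs_per_year: int = 52) -> float:
    """Yearly cost of re-fine-tuning at a fixed cadence (52 = weekly)."""
    return cost_per_run * runs_per_year


low = annual_retraining_cost(50)    # cheapest run quoted in the text
high = annual_retraining_cost(200)  # most expensive run quoted in the text
print(f"Annual retraining cost: ${low:,.0f} to ${high:,.0f}")
# prints "Annual retraining cost: $2,600 to $10,400"
```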
Fine-tuning still has its place. It is better for teaching the model a specific style (writing tone, output format, domain-specific terminology usage). If you want the model to write like a medical professional or format responses as structured JSON, fine-tuning is appropriate. But for grounding answers in factual data—which is the most common enterprise use case—RAG is almost always the better approach.
The Retrieval-Generation Coupling Problem
The most subtle and important challenge in RAG is the tight coupling between retrieval quality and generation quality. The generator can only work with what the retriever gives it. If the retriever returns irrelevant documents, the generator will do one of three things—all bad:
- Hallucinate. The model ignores the useless context and falls back on parametric knowledge, producing a fluent answer that sounds authoritative but is not grounded in your data. This is the most dangerous failure mode because the answer looks correct.
- Refuse to answer. If the system prompt is strict enough, the model says “I don’t have enough information” even though the answer exists in your corpus—it just was not retrieved. This is a frustrating user experience.
- Produce confused output. The model tries to synthesize an answer from marginally relevant context, producing hedging, contradictions, or non-sequiturs. Users lose trust quickly.
This coupling means RAG quality is bottlenecked by retrieval quality. You can use the most capable model available (GPT-4 at the time of writing) and your system will still produce bad answers if the retriever returns the wrong passages. In our experience across four RAG deployments, approximately 80% of quality issues trace back to the retriever, not the generator. The practical implication: invest most of your optimization effort in retrieval.
Specific retrieval failure modes we have encountered:
- Wrong granularity. Chunks too large (500+ words) dilute the relevant sentence in a sea of context, making the similarity score lower and the retrieved chunk less useful. Chunks too small (under 100 words) lack sufficient surrounding context for the generator to produce a complete answer. We settled on 200–300 word chunks with 50-word overlap after testing five chunk sizes against our evaluation set.
- Vocabulary mismatch. A question about “the Q3 pricing change” fails if the relevant document says “we adjusted our Enterprise plan from $500/mo to $750/mo in August” without using the words “pricing” or “Q3.” Semantic search helps (embeddings capture the conceptual similarity between “pricing change” and “adjusted plan cost”) but it is not perfect. Short queries with domain-specific jargon are the most vulnerable to this failure.
- Ambiguity without clarification. “What is our policy?” retrieves random policy documents because the embedding for “our policy” is equally close to the refund policy, the privacy policy, the cancellation policy, and the employment policy. The system needs either query clarification (“Which policy are you asking about?”) or session context to disambiguate.
- Temporal confusion. “What is the current pricing?” might retrieve an outdated pricing page that is still in the index because documents are rarely deleted. Add a last_updated field to your document metadata and boost recent documents in ranking, or implement a document lifecycle process that removes or archives outdated content.
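The chunking setup described under "Wrong granularity" (200 to 300 word chunks with a 50-word overlap) can be sketched as a sliding window over words. This is a minimal illustration, not the exact chunker we run: the function name `chunk_words` is hypothetical, and splitting on whitespace ignores sentence and heading boundaries that a production chunker would probably respect.

```python
def chunk_words(text: str, chunk_size: int = 250, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words, each sharing
    `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(sample, chunk_size=250, overlap=50)
```

With 600 words, a chunk size of 250, and an overlap of 50, the window starts at words 0, 200, and 400, so the sample yields three chunks and the last 50 words of each chunk reappear at the start of the next.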
Prompt Engineering for RAG
The system prompt and user prompt template have outsized impact on answer quality. Through systematic testing with 200 annotated questions, we identified three high-impact prompt patterns and measured their effect on hallucination rate and answer quality:
Pattern 1: Explicit grounding instruction. Adding “Answer ONLY based on the provided context” to the system prompt reduced hallucination rate from 23% to 8% in our tests. Without this instruction, GPT-3.5 supplements retrieved context with parametric knowledge approximately one-quarter of the time. The instruction to say “I don’t have enough information” is critical—it gives the model an acceptable alternative to hallucination when the context is insufficient.
Pattern 2: Source attribution. Asking the model to cite sources using [Source N] notation does two things: it forces the model to ground each claim in a specific retrieved passage (which mechanically reduces hallucination), and it provides users with a way to verify the answer. In our evaluation, source attribution reduced hallucination rate by an additional 5 percentage points (from 8% to 3%) compared to the grounding instruction alone.
Pattern 3: Structured output format. Asking for a specific answer structure (“First provide a direct answer, then supporting details with citations”) improves both readability and accuracy. The model is less likely to ramble or combine information from unrelated sources when given a clear output structure to follow.
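To measure patterns like these independently, it helps to assemble the system prompt from switchable parts so each variant can be run against the same question set. The sketch below is one way to do that; the function name `build_system_prompt` and the exact wording of each fragment are illustrative assumptions, not the prompts used in our tests.

```python
GROUNDING = (
    "Answer ONLY based on the provided context. If the context does not contain "
    "enough information, say \"I don't have enough information.\""
)
ATTRIBUTION = "Cite the passage that supports each claim using [Source N] notation."
STRUCTURE = "First provide a direct answer, then supporting details with citations."


def build_system_prompt(
    grounding: bool = True,
    attribution: bool = True,
    structured: bool = True,
) -> str:
    """Compose a system prompt from the three patterns, each individually switchable."""
    parts = ["You are a helpful assistant that answers questions about the provided sources."]
    if grounding:
        parts.append(GROUNDING)
    if attribution:
        parts.append(ATTRIBUTION)
    if structured:
        parts.append(STRUCTURE)
    return " ".join(parts)


# Variants for an ablation: baseline, grounding only, all three patterns
variants = {
    "baseline": build_system_prompt(False, False, False),
    "grounding": build_system_prompt(True, False, False),
    "full": build_system_prompt(),
}
```

Running every variant over the same annotated questions keeps the comparison fair: any change in hallucination rate can then be attributed to the prompt rather than to a different question mix.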
Hybrid Retrieval: Combining Vector and Keyword Search
Pure vector search has a specific weakness: it can miss exact matches. If a user searches for error code “ERR_AUTH_042,” vector search might return documents about authentication errors in general because the embedding captures the concept of authentication errors without privileging the specific string. Keyword search (BM25) excels at exact matches but misses semantic similarity entirely.
Hybrid retrieval combines both approaches and produces better results than either alone. We use Reciprocal Rank Fusion (RRF) to combine results from both retrievers:
```python
class HybridRetriever:
    def __init__(self, vector_retriever, keyword_retriever, k: int = 60):
        self.vector_retriever = vector_retriever
        self.keyword_retriever = keyword_retriever
        self.k = k  # RRF constant (standard value: 60)

    def search(self, query: str, top_k: int = 10) -> list:
        # Get results from both retrievers (over-fetch for merging)
        vector_results = self.vector_retriever.search(query, top_k=top_k * 3)
        keyword_results = self.keyword_retriever.search(query, top_k=top_k * 3)

        # Reciprocal Rank Fusion: each list contributes 1 / (k + rank),
        # with rank starting at 1 for the top result
        rrf_scores = {}
        docs_by_id = {}
        for results in (vector_results, keyword_results):
            for rank, result in enumerate(results, start=1):
                doc_id = result.doc_id
                rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (self.k + rank)
                # Keep the first result object seen for each ID so the fused
                # list can be returned without a separate document lookup.
                docs_by_id.setdefault(doc_id, result)

        # Sort by combined RRF score. The returned objects still carry the
        # score from their original retriever, not the fused RRF score.
        sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
        return [docs_by_id[doc_id] for doc_id, _ in sorted_docs[:top_k]]
```
RRF is simpler and more robust than weighted score combination because it does not require score normalization—different retrievers produce scores on different scales, and normalizing them introduces assumptions about the score distribution that are often wrong. RRF only uses rank positions, which are comparable across any retrieval method.
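To make the rank-only property concrete, here is a small worked example on two hand-written rankings. The function `rrf_fuse` and the document IDs are hypothetical; the formula is the standard RRF sum of 1 / (k + rank) with rank starting at 1.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; raw retriever scores are never used."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["doc_a", "doc_b", "doc_c"]   # e.g. ordered by cosine similarity
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # e.g. ordered by BM25 score
fused = rrf_fuse([vector_ranking, keyword_ranking])

# doc_a: 1/61 + 1/62 ≈ 0.03252   (rank 1 and rank 2)
# doc_c: 1/63 + 1/61 ≈ 0.03227   (rank 3 and rank 1)
# doc_b: 1/62        ≈ 0.01613   (vector list only)
# doc_d: 1/63        ≈ 0.01587   (keyword list only)
```

Documents that appear in both lists (doc_a, doc_c) land above documents that appear in only one, regardless of how large or small the underlying cosine or BM25 scores were.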
In our evaluation, hybrid retrieval (RRF) improved Recall@5 from 0.84 (vector only) and 0.71 (keyword only) to 0.91. The 7-point improvement over vector-only search comes primarily from exact match queries (error codes, product names, specific dates) where keyword search adds value that vector search misses.
Evaluation Framework
RAG systems are harder to evaluate than traditional search because you are evaluating two coupled systems: retrieval quality (did we find the right documents?) and generation quality (did we produce the right answer from those documents?). We measure both independently and together:
Retrieval metrics:
- Recall@5: What fraction of relevant documents appear in the top 5 retrieved passages?
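- MRR (Mean Reciprocal Rank): How high does the first relevant passage appear in the ranking? Computed as the average over queries of 1/rank of the first relevant passage.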
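Both retrieval metrics are straightforward to compute once each evaluation question has a labelled set of relevant document IDs. The sketch below uses the standard definitions; the function names and the shape of the evaluation data (a list of retrieved IDs plus a set of relevant IDs per question) are assumptions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"b"}),  # first relevant hit at rank 2 -> 0.5
    (["x", "y"], {"x"}),       # first relevant hit at rank 1 -> 1.0
]
mrr = mean_reciprocal_rank(runs)  # (0.5 + 1.0) / 2 = 0.75
```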
Generation metrics:
- Answer correctness: Human-judged on a 1–5 scale (1 = wrong, 5 = perfect)
- Hallucination rate: Fraction of answers containing claims not supported by any retrieved document
- Attribution accuracy: Fraction of cited [Source N] references that actually support the cited claim
- Completeness: Does the answer address all aspects of the question?
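Once human annotators have labelled a batch of answers, the generation metrics reduce to simple aggregates. The following is a minimal sketch under an assumed annotation schema: the `AnnotatedAnswer` fields and the `summarize` function are hypothetical, and the sample annotations are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedAnswer:
    correctness: int          # human score, 1 (wrong) to 5 (perfect)
    unsupported_claims: int   # claims not backed by any retrieved document
    citations_total: int      # number of [Source N] references in the answer
    citations_supported: int  # references that actually support their claim


def summarize(annotations: list[AnnotatedAnswer]) -> dict[str, float]:
    """Aggregate per-answer annotations into the generation metrics."""
    if not annotations:
        raise ValueError("need at least one annotated answer")
    n = len(annotations)
    total_citations = sum(a.citations_total for a in annotations)
    supported = sum(a.citations_supported for a in annotations)
    return {
        "answer_correctness": sum(a.correctness for a in annotations) / n,
        # An answer counts as hallucinated if it has at least one unsupported claim
        "hallucination_rate": sum(a.unsupported_claims > 0 for a in annotations) / n,
        "attribution_accuracy": supported / total_citations if total_citations else 0.0,
    }


sample_annotations = [
    AnnotatedAnswer(correctness=5, unsupported_claims=0, citations_total=2, citations_supported=2),
    AnnotatedAnswer(correctness=4, unsupported_claims=0, citations_total=3, citations_supported=3),
    AnnotatedAnswer(correctness=2, unsupported_claims=1, citations_total=2, citations_supported=1),
    AnnotatedAnswer(correctness=5, unsupported_claims=0, citations_total=1, citations_supported=1),
]
summary = summarize(sample_annotations)
```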
End-to-end metrics:
- User satisfaction: Thumbs up/down on answers in the production application
- Escalation rate: How often do users escalate to human support after receiving a RAG answer?
Our current production numbers after three months of iteration: Recall@5 = 0.91, answer correctness = 4.2/5, hallucination rate = 3%, attribution accuracy = 94%, escalation rate = 12% (down from 35% before RAG). The hallucination rate is our primary improvement target. We are experimenting with re-ranking retrieved passages before sending them to the generator—preliminary results show an additional 1–2 percentage point reduction by filtering out marginally relevant passages that confuse the generator.
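The re-ranking step can be sketched as "score every retrieved passage against the question, drop the weak ones, keep the best few." The code below illustrates that shape only and is not our production re-ranker: `rerank_and_filter` is a hypothetical name, the thresholds are placeholders, and `word_overlap` is a toy scorer standing in for a proper relevance model such as a cross-encoder.

```python
from typing import Callable


def rerank_and_filter(
    question: str,
    passages: list[str],
    score_fn: Callable[[str, str], float],
    min_score: float = 0.5,
    max_passages: int = 5,
) -> list[tuple[str, float]]:
    """Re-score passages, drop those below min_score, return the best first."""
    scored = [(p, score_fn(question, p)) for p in passages]
    kept = [(p, s) for p, s in scored if s >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_passages]


def word_overlap(question: str, passage: str) -> float:
    """Toy relevance score: fraction of question words found in the passage."""
    q_words = set(question.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(passage.lower().split())) / len(q_words)


passages = [
    "our privacy policy covers data retention",
    "the refund policy allows returns within 30 days",
    "office hours are 9 to 5",
]
kept = rerank_and_filter("refund policy days", passages, word_overlap)
```

Here only the refund passage survives: the privacy passage matches one of three question words (score about 0.33) and the office-hours passage matches none, so both fall below the 0.5 threshold and never reach the generator.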
Conclusion
RAG is the most practical way to build AI systems that answer questions about your specific, current data. The architecture is straightforward (retrieve, augment, generate), but quality depends heavily on retrieval tuning, prompt engineering, and systematic evaluation with real user questions. Start with a simple pipeline—vector search plus a strong system prompt with grounding instructions and source attribution—and iterate based on evaluation data. The biggest quality gains come from improving retrieval, not from switching to a larger language model. Invest in your chunking strategy, experiment with hybrid retrieval, measure hallucination rates explicitly, and build an evaluation set of at least 100 questions from real users. RAG done well reduces support tickets, improves customer self-service, and builds trust in AI-powered features. RAG done poorly erodes trust faster than no AI at all, because confidently wrong answers are worse than no answer.
Implementation Checklist for Your First RAG System
If you are building RAG for the first time, this checklist captures the decisions and validations we wish we had followed from day one instead of learning through production incidents:
- Document preparation. Clean your source documents before chunking: remove duplicate content, strip boilerplate (headers, footers, navigation text), normalize whitespace, and handle special characters. We spent a week debugging low retrieval quality before realizing that 15% of our chunks contained navigation menu text from web-scraped pages, which diluted embedding quality across the entire corpus.
- Chunking validation. After chunking, manually review 20–30 random chunks. Are they coherent on their own? Do they contain enough context for a reader (or an LLM) to understand them without the surrounding document? If a chunk starts mid-sentence or ends mid-thought, your chunk boundaries need adjustment.
- Embedding model evaluation. Test at least 3 models against your evaluation set before committing. The model that ranks first on public benchmarks may not rank first on your data.
- System prompt testing. Test your system prompt with at least 50 questions, including edge cases: questions with no answer in the corpus, ambiguous questions, questions that require combining information from multiple documents, and adversarial questions designed to trigger hallucination. Measure hallucination rate for each prompt variant.
- Confidence calibration. Your confidence scoring should correlate with actual answer quality. If the system reports high confidence on wrong answers, users will lose trust faster than if you report no confidence at all. Validate that low-scoring retrieval results actually correspond to low-quality answers.
- Monitoring in production. Track: query volume, average retrieval scores, token generation latency, user feedback (thumbs up/down), and error rates. Set up alerts for retrieval score degradation (could indicate embedding drift, index corruption, or data quality issues) and hallucination spikes (could indicate prompt regression or model behavior changes after provider updates).
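The confidence calibration item can be checked with a small script: group evaluated answers by the confidence the system reported and compare accuracy across groups. This is a minimal sketch; the function names and the record format (reported confidence plus a human correct/incorrect label) are assumptions, and the sample records are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_confidence(records: list[tuple[float, bool]]) -> dict[float, float]:
    """Map each reported confidence level to the fraction of correct answers."""
    buckets: dict[float, list[bool]] = defaultdict(list)
    for confidence, is_correct in records:
        buckets[confidence].append(is_correct)
    # Sorted by confidence so the result reads from least to most confident
    return {c: sum(v) / len(v) for c, v in sorted(buckets.items())}


def is_calibrated_in_order(accuracy: dict[float, float]) -> bool:
    """True if accuracy never decreases as reported confidence increases."""
    values = list(accuracy.values())
    return all(a <= b for a, b in zip(values, values[1:]))


records = [
    (0.9, True), (0.9, True), (0.9, False),
    (0.7, True), (0.7, False),
    (0.3, False), (0.3, False),
]
accuracy = accuracy_by_confidence(records)
ordered = is_calibrated_in_order(accuracy)
```

Grouping by exact value works when confidence takes a handful of discrete levels; a continuous score would need binning into ranges first. If accuracy in a high-confidence group falls below a lower one, the scoring heuristic needs rework before it is shown to users.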
This checklist is not exhaustive, but it covers the issues that caused our first three production incidents. If we had validated each of these items before launch, those incidents would have been caught in testing instead of discovered by users.