RAG
Retrieval-Augmented Generation
Query → document retrieval → LLM context injection → answer generation.
A user asks a question. First, relevant documents are retrieved (retrieval), then those documents are added to the LLM context (augmentation), and the LLM produces an answer (generation). The standard way to add fresh or private knowledge to an LLM without retraining.
A typical pipeline:
1. Documents are split into chunks and embedded.
2. The vectors are loaded into a vector DB.
3. At query time, the question is embedded the same way.
4. The vector DB returns the top-K (usually 5–20) closest chunks.
5. Those chunks are placed into the prompt and sent to the LLM.
6. The LLM produces an answer grounded in that context.
Modern variants: hybrid search (vector + keyword), reranking (a cross-encoder reorders the retrieved chunks), graph RAG (retrieval over a knowledge graph), agentic RAG (the LLM writes and refines its own retrieval queries).
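For instance, the reranking step usually takes the candidates returned by a first-pass vector search and rescores each (query, chunk) pair with a cross-encoder. A minimal sketch, assuming the sentence-transformers package and its public ms-marco cross-encoder checkpoint; the candidate strings are made up for illustration:

from sentence_transformers import CrossEncoder

query = "How does the API rate limit work?"
# Candidates already returned by a first-pass vector search (top-20 in practice).
candidates = [
    "Pro plan: 1000 req/min + priority support.",
    "Free plan: 100 req/min rate limit.",
    "If exceeded, returns 429; resets after 60s.",
]

# A cross-encoder reads query and chunk together, which is slower but more
# accurate than comparing two independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])

# Keep the highest-scoring chunks for the prompt.
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]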
Instead of memorizing all books before an exam, you keep relevant pages open during the exam. RAG gives the model a "cheat sheet" — it relies on current documents, not its own memory. Think open-book exam.
A SaaS product's docs change constantly. A user asks "what's the API rate limit?". Asked directly, ChatGPT might cite an old limit or make one up (hallucination).
With RAG:
1. All 200 doc pages are chunked and stored in Pinecone.
2. The query is embedded; the top-5 chunks come back: "Rate limit: 100 req/min on free, 1000 req/min on pro…"
3. Those chunks are passed to Claude as context.
4. Claude reads the context and answers specifically.
Result: always-current, sourced, low-hallucination answer. When docs change, you just re-embed the affected chunks.
from openai import OpenAI
import numpy as np
client = OpenAI()
# 1. SETUP — embed docs once
docs = [
"Free plan: 100 req/min rate limit.",
"Pro plan: 1000 req/min + priority support.",
"If exceeded, returns 429; resets after 60s.",
]
def embed(text):
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(r.data[0].embedding)
doc_vecs = [embed(d) for d in docs]
# 2. QUERY — find top-2 nearest docs
query = "How does the API rate limit work?"
q_vec = embed(query)
sims = [float(np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v))) for v in doc_vecs]
top = sorted(zip(sims, docs), reverse=True)[:2]
context = "\n".join(d for _, d in top)
# 3. GENERATE — feed context to the LLM
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Answer only from the provided context."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
],
)
print(resp.choices[0].message.content)
When RAG helps:
- Private/internal knowledge (company docs, customer data)
- Continuously changing info (news, prices, inventory)
- High-stakes domains where hallucinations are dangerous (legal, medical, finance)
- Large knowledge base + query mode (>10K docs)
- When auditing/citing sources matters — RAG can cite
When RAG doesn't help:
- General-knowledge questions (the LLM already knows the answer)
- Very few documents (< 50): just stuff them all into the context (a quick sketch follows this list)
- One-shot creative content (poems, blog drafts) — RAG doesn't help
- Structured queries (a DB query is more direct)
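For the "very few documents" case above, a minimal no-RAG sketch (reusing client, docs, and query from the example code) is simply to concatenate everything into the prompt:

# With a handful of short documents, skip retrieval and pass them all as context.
all_context = "\n".join(docs)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{all_context}\n\nQuestion: {query}"},
    ],
)
print(resp.choices[0].message.content)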
Retrieval quality is everything
Before tweaking the model, fix retrieval. Wrong chunks → even the best LLM produces nonsense. Recall@5 evaluation is mandatory.
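A minimal sketch of such a check: treat Recall@5 as the share of questions for which at least one relevant chunk appears in the top 5, measured over a small hand-labelled evaluation set. The eval_set below and its relevant chunk IDs are made-up assumptions; embed, doc_vecs, and docs come from the example code above (with only 3 toy chunks the score is trivially perfect; the point is the harness):

# Each item pairs a question with the IDs of the chunks that actually answer it.
eval_set = [
    {"query": "What is the free-tier rate limit?", "relevant_ids": {0}},
    {"query": "What happens when I exceed the limit?", "relevant_ids": {2}},
]

def retrieve(question, k):
    # Same cosine-similarity search as the example above, returning chunk IDs.
    q = embed(question)
    sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:k]

def recall_at_k(eval_set, k=5):
    hits = 0
    for item in eval_set:
        retrieved_ids = set(retrieve(item["query"], k))
        if item["relevant_ids"] & retrieved_ids:  # at least one relevant chunk retrieved
            hits += 1
    return hits / len(eval_set)

print(f"Recall@5: {recall_at_k(eval_set, k=5):.2f}")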
Chunks lack context
Raw chunks often reference 'as mentioned earlier' or 'this section'. Add document title and heading hierarchy to chunk metadata.
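A minimal sketch of that kind of chunk enrichment; the Chunk structure, the separator format, and the "Acme API Docs" title are illustrative assumptions (embed comes from the example code above):

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_title: str
    heading_path: str  # e.g. "API Reference > Rate Limits"

def contextualized_text(chunk: Chunk) -> str:
    # Prepending the title and heading hierarchy supplies the context that
    # phrases like "as mentioned earlier" silently depend on.
    return f"{chunk.doc_title} | {chunk.heading_path}\n{chunk.text}"

chunk = Chunk(
    text="If exceeded, returns 429; resets after 60s.",
    doc_title="Acme API Docs",
    heading_path="API Reference > Rate Limits",
)
vec = embed(contextualized_text(chunk))  # index this instead of the bare chunk embedding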
Leaving RAG bare
RAG + reranking + hybrid search + query expansion = full solution. Pure vector search alone usually plateaus around 60% accuracy in production.
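A minimal sketch of the hybrid-search half, fusing a keyword ranking with the vector ranking via reciprocal rank fusion (RRF). The term-overlap scorer is a crude stand-in for BM25 or a full-text index, and k=60 is just the commonly used RRF constant; embed, doc_vecs, docs, query, and np come from the example code above:

# Keyword leg: crude term overlap standing in for BM25 / full-text search.
def keyword_rank(question, corpus):
    q_terms = set(question.lower().split())
    scores = [len(q_terms & set(d.lower().split())) for d in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

# Vector leg: the cosine-similarity ranking from the example above.
def vector_rank(question):
    q = embed(question)
    sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)

# Reciprocal rank fusion: each ranking votes 1 / (k + rank) for every chunk.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([keyword_rank(query, docs), vector_rank(query)])
top_chunks = [docs[i] for i in fused[:2]]  # feed these to the prompt as before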