RAG
Retrieval-Augmented Generation
Query → document retrieval → LLM context injection → answer generation.
A user asks a question. First, relevant documents are retrieved (retrieval), then those documents are added to the LLM context (augmentation), and the LLM produces an answer (generation). The standard way to add fresh or private knowledge to an LLM without retraining.
A typical pipeline:
1. Documents are split into chunks and embedded.
2. The vectors are loaded into a vector DB.
3. At query time, the question is embedded the same way.
4. The vector DB returns the top-K (usually 5–20) closest chunks.
5. Those chunks are placed into the prompt and sent to the LLM.
6. The LLM produces an answer grounded in that context.
Modern variants: hybrid search (vector + keyword), reranking (a cross-encoder reorders the retrieved chunks), graph RAG (retrieval over a knowledge graph), agentic RAG (the LLM writes and refines its own retrieval queries).
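For instance, the reranking step usually takes the candidates returned by a first-pass vector search and rescores each (query, chunk) pair with a cross-encoder. A minimal sketch, assuming the sentence-transformers package and its public ms-marco cross-encoder checkpoint; the candidate strings are made up for illustration:

from sentence_transformers import CrossEncoder

query = "How does the API rate limit work?"
# Candidates already returned by a first-pass vector search (top-20 in practice).
candidates = [
    "Pro plan: 1000 req/min + priority support.",
    "Free plan: 100 req/min rate limit.",
    "If exceeded, returns 429; resets after 60s.",
]

# A cross-encoder reads query and chunk together, which is slower but more
# accurate than comparing two independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])

# Keep the highest-scoring chunks for the prompt.
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]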
Instead of memorizing all books before an exam, you keep relevant pages open during the exam. RAG gives the model a "cheat sheet" — it relies on current documents, not its own memory. Think open-book exam.
A SaaS product's docs change constantly. A user asks "what's the API rate limit?". Asked directly, ChatGPT might cite an old limit or make one up (hallucination).
With RAG:
1. All 200 doc pages are chunked and stored in Pinecone.
2. The query is embedded; the top-5 chunks come back: "Rate limit: 100 req/min on free, 1000 req/min on pro…"
3. Those chunks are passed to Claude as context.
4. Claude reads the context and answers specifically.
Result: always-current, sourced, low-hallucination answer. When docs change, you just re-embed the affected chunks.
from openai import OpenAI
import numpy as np
client = OpenAI()
# 1. SETUP — embed docs once
docs = [
"Free plan: 100 req/min rate limit.",
"Pro plan: 1000 req/min + priority support.",
"If exceeded, returns 429; resets after 60s.",
]
def embed(text):
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(r.data[0].embedding)
doc_vecs = [embed(d) for d in docs]
# 2. QUERY — find top-2 nearest docs
query = "How does the API rate limit work?"
q_vec = embed(query)
sims = [float(np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v))) for v in doc_vecs]
top = sorted(zip(sims, docs), reverse=True)[:2]
context = "\n".join(d for _, d in top)
# 3. GENERATE — feed context to the LLM
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Answer only from the provided context."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
],
)
print(resp.choices[0].message.content)
When RAG helps:
- Private/internal knowledge (company docs, customer data)
- Continuously changing info (news, prices, inventory)
- High-stakes domains where hallucinations are dangerous (legal, medical, finance)
- Large knowledge base + query mode (>10K docs)
- When auditing/citing sources matters — RAG can cite
When RAG doesn't help:
- General-knowledge questions (the LLM already knows the answer)
- Very few documents (< 50): just stuff them all into the context (a quick sketch follows this list)
- One-shot creative content (poems, blog drafts) — RAG doesn't help
- Structured queries (a DB query is more direct)
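For the "very few documents" case above, a minimal no-RAG sketch (reusing client, docs, and query from the example code) is simply to concatenate everything into the prompt:

# With a handful of short documents, skip retrieval and pass them all as context.
all_context = "\n".join(docs)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{all_context}\n\nQuestion: {query}"},
    ],
)
print(resp.choices[0].message.content)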
Retrieval quality is everything
Before tweaking the model, fix retrieval. Wrong chunks → even the best LLM produces nonsense. Recall@5 evaluation is mandatory.
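A minimal sketch of such a check: treat Recall@5 as the share of questions for which at least one relevant chunk appears in the top 5, measured over a small hand-labelled evaluation set. The eval_set below and its relevant chunk IDs are made-up assumptions; embed, doc_vecs, and docs come from the example code above (with only 3 toy chunks the score is trivially perfect; the point is the harness):

# Each item pairs a question with the IDs of the chunks that actually answer it.
eval_set = [
    {"query": "What is the free-tier rate limit?", "relevant_ids": {0}},
    {"query": "What happens when I exceed the limit?", "relevant_ids": {2}},
]

def retrieve(question, k):
    # Same cosine-similarity search as the example above, returning chunk IDs.
    q = embed(question)
    sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:k]

def recall_at_k(eval_set, k=5):
    hits = 0
    for item in eval_set:
        retrieved_ids = set(retrieve(item["query"], k))
        if item["relevant_ids"] & retrieved_ids:  # at least one relevant chunk retrieved
            hits += 1
    return hits / len(eval_set)

print(f"Recall@5: {recall_at_k(eval_set, k=5):.2f}")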
Chunks lack context
Raw chunks often reference 'as mentioned earlier' or 'this section'. Add document title and heading hierarchy to chunk metadata.
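A minimal sketch of that kind of chunk enrichment; the Chunk structure, the separator format, and the "Acme API Docs" title are illustrative assumptions (embed comes from the example code above):

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_title: str
    heading_path: str  # e.g. "API Reference > Rate Limits"

def contextualized_text(chunk: Chunk) -> str:
    # Prepending the title and heading hierarchy supplies the context that
    # phrases like "as mentioned earlier" silently depend on.
    return f"{chunk.doc_title} | {chunk.heading_path}\n{chunk.text}"

chunk = Chunk(
    text="If exceeded, returns 429; resets after 60s.",
    doc_title="Acme API Docs",
    heading_path="API Reference > Rate Limits",
)
vec = embed(contextualized_text(chunk))  # index this instead of the bare chunk embedding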
Leaving RAG bare
RAG + reranking + hybrid search + query expansion = full solution. Pure vector search alone usually plateaus around 60% accuracy in production.
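A minimal sketch of the hybrid-search half, fusing a keyword ranking with the vector ranking via reciprocal rank fusion (RRF). The term-overlap scorer is a crude stand-in for BM25 or a full-text index, and k=60 is just the commonly used RRF constant; embed, doc_vecs, docs, query, and np come from the example code above:

# Keyword leg: crude term overlap standing in for BM25 / full-text search.
def keyword_rank(question, corpus):
    q_terms = set(question.lower().split())
    scores = [len(q_terms & set(d.lower().split())) for d in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

# Vector leg: the cosine-similarity ranking from the example above.
def vector_rank(question):
    q = embed(question)
    sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)

# Reciprocal rank fusion: each ranking votes 1 / (k + rank) for every chunk.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([keyword_rank(query, docs), vector_rank(query)])
top_chunks = [docs[i] for i in fused[:2]]  # feed these to the prompt as before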