RAG Explained: How AI Apps Answer Questions on Your Data

Why LLMs Lie — And What RAG Does About It

You build an AI chatbot. You ask it about your product docs. It confidently gives you an answer that sounds right but is completely made up. That's not a bug in your implementation — that's how LLMs behave when they don't have the information they need. They fill the gap with plausible-sounding text.

Retrieval-Augmented Generation (RAG) solves this. Instead of asking the model to remember your data, you retrieve the relevant pieces at query time and hand them directly to the model as context. The model stops guessing and starts answering from actual sources.

This post explains how RAG works under the hood, when you need it, and how to build it correctly — including the failure modes most tutorials skip.

🎯 Quick Answer (30-Second Read)

What it is: RAG retrieves relevant documents from your data store and injects them into the LLM prompt before generation
When to use it: Any AI feature that needs to answer questions about data the model was not trained on
Main benefit: Dramatically reduces hallucination on domain-specific queries
Main limitation: Retrieval quality determines answer quality — garbage in, garbage out
Recommendation: Use RAG for any product knowledge base, documentation search, or customer-facing Q&A feature

How RAG Actually Works

Most explanations of RAG describe what it does at a high level. Here is what happens at each step.

User Query
↓
Embed the query into a vector
↓
Search your vector database for similar chunks
↓
Retrieve top-K most relevant chunks
↓
Inject chunks into the LLM prompt as context
↓
LLM generates answer grounded in retrieved context
↓
Return answer (with optional source citations)

The key insight is that the LLM is not doing the retrieval — it is doing the generation. A separate system (your vector search layer) handles finding the right information. The model's job is to synthesize and articulate what was retrieved.

This is why RAG is called retrieval-augmented generation and not retrieval-replaced generation. The model still does the hard linguistic work. You just give it better inputs.
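To make that division of labor concrete, here is a minimal sketch of the flow in Python. The function names (answer_with_rag, fake_embed, fake_search, fake_generate) are hypothetical stand-ins, not any library's API. In a real app the three callables would wrap your embedding model, your vector search layer, and your LLM client.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def answer_with_rag(
    query: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Run one query through the retrieve-then-generate flow."""
    query_vector = embed(query)            # embed the query into a vector
    chunks = search(query_vector, top_k)   # search and keep the top-K chunks
    context = "\n\n".join(chunks)          # inject the chunks as context
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)                # the LLM only does the generation


# Stand-in components so the sketch runs without any external service.
def fake_embed(text: str) -> Vector:
    return [float(len(text))]


def fake_search(vector: Vector, k: int) -> List[str]:
    docs = ["Refunds are processed within 5 days.", "Support is open 9 to 5."]
    return docs[:k]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


reply = answer_with_rag(
    "How long do refunds take?", fake_embed, fake_search, fake_generate
)
print(reply)
```

Because retrieval and generation are passed in separately, you can swap or test either half without touching the other.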

The Embedding Step

Before you can retrieve anything, your documents need to be embedded. An embedding model (like OpenAI's text-embedding-3-small or a local model via Ollama) converts text into a high-dimensional vector — a list of numbers that represents the semantic meaning of that text.

Documents with similar meaning end up with vectors that are close together in this space. When a user asks a question, you embed the query the same way, then find the document vectors nearest to it. That proximity is semantic similarity — not keyword matching.
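A small sketch of what "close together" means in practice, using cosine similarity on hand-made toy vectors. The three-dimensional vectors and the document labels are invented for illustration. Real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): pretend an embedding model produced these.
doc_vectors = {
    "reset your password": [0.9, 0.1, 0.0],
    "change login credentials": [0.8, 0.2, 0.1],
    "quarterly revenue report": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # stands in for "I forgot my password"

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda doc: cosine_similarity(query_vector, doc_vectors[doc]),
    reverse=True,
)
print(ranked)
```

The two account-related documents share no keywords with each other, yet both land far closer to the query than the revenue report does.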

The Chunking Step

You do not embed whole documents. You split them into chunks — typically 256 to 512 tokens each — with some overlap between chunks so context does not get cut off at boundaries. The chunk size is one of the most consequential decisions in a RAG pipeline. Too small and individual chunks lose context. Too large and you waste the model's context window on irrelevant content.
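Here is a rough sketch of fixed-size chunking with overlap. It counts whitespace-separated words as a stand-in for tokens, which is an assumption. A production pipeline would count tokens with the tokenizer that matches its embedding model. The function name chunk_with_overlap is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_with_overlap(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one, so a sentence that straddles a boundary is still seen whole in at least one chunk when the overlap is large enough.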

The Vector Database

Chunks and their embeddings get stored in a vector database. Popular options include Pinecone, Weaviate, Qdrant, and pgvector (Postgres extension). At small scale, even an in-memory FAISS index works. The database handles approximate nearest-neighbor search — finding the top-K most semantically similar chunks to your query vector fast.
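To show what the database is doing for you, here is a tiny in-memory stand-in. The TinyVectorStore class is hypothetical and does exact brute-force search, not the approximate nearest-neighbor indexing that the databases above use to stay fast at scale. It is only meant to illustrate the add-then-search contract.

```python
import heapq
import math
from typing import List, Sequence, Tuple


class TinyVectorStore:
    """Exact nearest-neighbor search over unit-normalized vectors."""

    def __init__(self) -> None:
        self._rows: List[Tuple[List[float], str]] = []

    @staticmethod
    def _normalize(vector: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else list(vector)

    def add(self, vector: Sequence[float], chunk: str) -> None:
        # Store normalized vectors so a dot product equals cosine similarity.
        self._rows.append((self._normalize(vector), chunk))

    def search(self, query: Sequence[float], top_k: int = 3) -> List[Tuple[float, str]]:
        q = self._normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for vec, chunk in self._rows
        )
        # nlargest returns the k best-scoring pairs, highest score first.
        return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


store = TinyVectorStore()
store.add([1.0, 0.0], "Pricing is billed per seat.")
store.add([0.9, 0.1], "Annual plans get a discount.")
store.add([0.0, 1.0], "The API uses OAuth tokens.")

results = store.search([1.0, 0.05], top_k=2)
for score, chunk in results:
    print(round(score, 3), chunk)
```

Brute force compares the query against every stored vector, which is fine for a few thousand chunks and is the cost that approximate indexes are designed to avoid.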

The Right Way vs The Wrong Way to Build RAG

The right way starts with chunking strategy. Most developers who get poor RAG results are not using the wrong model — they are chunking their documents badly. Chunk at semantic boundaries (paragraphs, sections) not at arbitrary character counts. Add metadata to each chunk: document title, section heading, date. At retrieval time, use that metadata to filter before ranking.
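A minimal sketch of that idea, assuming plain-text documents where paragraphs are separated by blank lines. The Chunk fields and both function names are hypothetical. In a real vector database the metadata filter would run inside the query itself rather than in Python.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    title: str
    section: str
    date: str  # ISO format such as "2024-05-01", so string comparison orders correctly


def chunk_by_paragraph(document: str, title: str, section: str, date: str) -> List[Chunk]:
    """Split on blank lines so every chunk is a whole paragraph with metadata attached."""
    paragraphs = [p.strip() for p in document.split("\n\n")]
    return [Chunk(p, title, section, date) for p in paragraphs if p]


def filter_chunks(chunks: List[Chunk], section: str, not_before: str) -> List[Chunk]:
    """Narrow the candidate set by metadata before any similarity ranking runs."""
    return [c for c in chunks if c.section == section and c.date >= not_before]


doc = "Install with the CLI.\n\nThen run the setup wizard."
chunks = chunk_by_paragraph(doc, title="Getting Started", section="install", date="2024-05-01")
chunks += chunk_by_paragraph("Old billing flow.", title="Billing", section="billing", date="2022-01-10")

candidates = filter_chunks(chunks, section="install", not_before="2024-01-01")
print([c.text for c in candidates])
```

Filtering first means the similarity ranking only ever sees chunks that could plausibly be relevant, such as the current version of the right section.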

Use hybrid search where possible. Pure vector search misses exact keyword matches. Pure BM25 keyword search misses semantic similarity. Combining both — called hybrid retrieval — gives measurably better recall. Tools like Weaviate and Qdrant support this natively.
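One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below. This is an illustration of the general technique, not a description of how Weaviate or Qdrant implement hybrid search internally. The document ids are made up, and k=60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_ranking = ["doc_a", "doc_b", "doc_c"]   # from semantic search
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # from BM25 keyword search

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers rank well rise to the top, while a document found by only one retriever still stays in the list. RRF uses ranks rather than raw scores, so you do not need to reconcile cosine similarity with BM25 scores.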

The wrong way is treating RAG as a single fetch-and-paste operation. The most common mistake is retrieving five chunks, dumping them all into the prompt regardless of relevance scores, and hoping the model figures it out. Often it does not. Contradictory or irrelevant context can confuse the model, and the answers can end up worse than they would be with no context at all.

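Set a relevance threshold. If your top chunk has a similarity score below a cutoff such as 0.75, do not pass it to the model. Tell the user you don't have enough information to answer. The right cutoff depends on your embedding model and similarity metric, so tune it against real queries. A confident "I don't know" is better than a hallucinated answer.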
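A sketch of that gate in Python. The names select_context, respond, and FALLBACK are hypothetical, and 0.75 is only a starting value that should be tuned for your own embedding model and similarity metric.

```python
from typing import Callable, List, Optional, Tuple

FALLBACK = "I don't have enough information to answer that."


def select_context(
    scored_chunks: List[Tuple[float, str]], threshold: float = 0.75
) -> Optional[List[str]]:
    """Return the chunks at or above the threshold, or None if nothing qualifies."""
    kept = [chunk for score, chunk in scored_chunks if score >= threshold]
    return kept or None


def respond(
    scored_chunks: List[Tuple[float, str]],
    generate: Callable[[str], str],
    threshold: float = 0.75,
) -> str:
    context = select_context(scored_chunks, threshold)
    if context is None:
        return FALLBACK  # skip the model call entirely
    return generate("\n\n".join(context))


def fake_generate(context: str) -> str:
    # Stand-in for a real LLM call.
    return f"Answer based on: {context}"


print(respond([(0.62, "Unrelated chunk")], fake_generate))
print(respond([(0.81, "Refunds take 5 days."), (0.40, "Office hours")], fake_generate))
```

The same filter also drops weak chunks from an otherwise good result set, so the model never sees the low-scoring ones.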

My Take

The deep reason RAG works is that LLMs are extraordinary reasoning engines but terrible databases. Asking a model to memorize your 500-page product documentation during fine-tuning is like asking a senior engineer to memorize every Jira ticket: it is the wrong kind of effort for the task.

The best outcome with RAG is a system where the retrieval layer is treated as a first-class engineering problem: monitored, evaluated, and improved on real query logs. The worst outcome is a RAG pipeline that was working fine at demo time and slowly degrades in production because nobody set up retrieval quality metrics. Right now, most teams underinvest in the retrieval half and overinvest in model selection, picking between GPT-4o and Claude when the actual bottleneck is that their chunks are 2,000 tokens and their embeddings are six months stale.

Where this is heading: agentic retrieval, where the model itself decides what to retrieve and when, is going to replace static RAG pipelines for complex queries. Whether most teams are ready to evaluate that is a different question.

Comparison Table

Approach	Handles New Data	Hallucination Risk	Setup Cost	Best For
Base LLM (no RAG)	No	High	None	General chat, known domains
Fine-tuning	Partially	Medium	High	Style, tone, task format
RAG	Yes	Low	Medium	Domain Q&A, docs, knowledge base
RAG + Fine-tuning	Yes	Very low	Very high	Production AI products at scale

Real Developer Use Case

ThoughtStream (thoughtstream.anuragwagh.me) uses a RAG pipeline to let users query across their own notes and entries. Each note is chunked into paragraphs, embedded with text-embedding-3-small, and stored in a pgvector table on Supabase. When a user asks a question, the query is embedded, top-5 chunks are retrieved with a similarity threshold of 0.78, and the chunks are passed to Claude with a prompt that explicitly says: answer only from the provided context, cite which note each point comes from.

The result is answers that feel genuinely useful — because they are grounded in what the user actually wrote, not in what the model thinks they probably meant. Without RAG, the same queries return plausible but generic responses with no connection to the user's actual data.
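The exact prompt ThoughtStream uses is not shown here, so the following is only a hypothetical sketch of how a prompt with those two instructions (answer only from context, cite the note) might be assembled. The function name, the numbering scheme, and the sample notes are all assumptions.

```python
from typing import List, Tuple


def build_grounded_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Build a prompt from (note_title, chunk_text) pairs with numbered sources."""
    sources = "\n\n".join(
        f"[{i}] Note: {title}\n{text}"
        for i, (title, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer only from the provided context. "
        "Cite which note each point comes from using its [number].\n\n"
        f"Context:\n{sources}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "What did I decide about the launch date?",
    [
        ("Planning notes", "We agreed to move the launch to the first week of June."),
        ("Retro", "The team wanted more time for QA before launch."),
    ],
)
print(prompt)
```

Labeling each chunk with its note title gives the model something concrete to cite, and it lets you map a citation in the answer back to the note it came from.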

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

Fine-tuning bakes knowledge into the model's weights. It changes how the model behaves but cannot easily be updated with new data. RAG keeps knowledge external and retrievable: you update your documents, re-embed the changed chunks, and the model has access to the new information on the next query. For most product use cases, RAG is faster to build and easier to maintain.

What vector database should I use for RAG?

For early-stage products, pgvector on Supabase is the easiest starting point — you already have a database, no new infrastructure required. For production scale with high query volume, Qdrant or Pinecone offer better performance and filtering. Do not over-engineer this early.

How many chunks should I retrieve per query?

Start with top-3 to top-5. More is not always better — adding irrelevant chunks increases noise and degrades answer quality. Set a similarity threshold and retrieve fewer high-quality chunks rather than more low-quality ones.

Does RAG work with any LLM?

Yes. RAG is model-agnostic: it is a pattern around the model, not inside it. It works with GPT-4o, Claude, Mistral, Llama, or any other model that can take the retrieved text in its prompt. The model only sees the retrieved chunks as part of the prompt.

How do I evaluate RAG quality?

Track retrieval recall (did the right chunk get retrieved?), answer faithfulness (did the answer stick to the context?), and answer relevance (did it actually answer the question?). Tools like Ragas and LangSmith provide automated metrics for this. Set these up before you ship, not after users complain.
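Retrieval recall is the easiest of the three to measure yourself. The sketch below assumes a small hand-labeled set where each test query has exactly one known-relevant chunk id, so the number it reports is a hit rate at k. The function name and the sample ids are hypothetical, and this is not the Ragas or LangSmith API.

```python
from typing import Dict, List


def recall_at_k(
    retrieved: Dict[str, List[str]], relevant: Dict[str, str], k: int = 5
) -> float:
    """Fraction of labeled queries whose relevant chunk id appears in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, chunk_id in relevant.items()
        if chunk_id in retrieved.get(query, [])[:k]
    )
    return hits / len(relevant)


# Hand-labeled ground truth: query -> id of the chunk that answers it.
relevant = {
    "how do refunds work": "billing-03",
    "reset password": "account-01",
    "api rate limits": "api-07",
    "export my data": "account-09",
}
# What the retriever actually returned, best match first.
retrieved = {
    "how do refunds work": ["billing-03", "billing-01"],
    "reset password": ["account-04", "account-01"],
    "api rate limits": ["api-02", "api-05"],
    "export my data": ["account-09"],
}

print(recall_at_k(retrieved, relevant, k=2))  # 0.75
print(recall_at_k(retrieved, relevant, k=1))  # 0.5
```

Tracking this number on real query logs over time is one way to catch retrieval quality slipping before it shows up in the answers.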

Conclusion

RAG is the right architecture for any AI feature that needs to answer questions about data the model was not trained on — product docs, user data, knowledge bases, internal wikis. The pattern is straightforward: embed your documents, store them in a vector database, retrieve relevant chunks at query time, and pass them to the model as context.

The developers who build RAG well treat retrieval as a real engineering layer with monitoring and quality metrics — not a one-time setup. If your AI app needs to be accurate and trustworthy on domain-specific questions, RAG is not optional.
