The gap between a RAG demo and a production RAG system is roughly the same as the gap between a prototype airplane and one you'd actually fly in. Both have wings. Only one won't kill you.
After building RAG systems that serve thousands of daily users in a B2B context, here are the lessons that aren't in the tutorials.
Lesson 1: Retrieval Quality Is Everything
The LLM is only as good as what you feed it. We spent 60% of our engineering time on retrieval — chunking strategies, embedding model selection, re-ranking — and 10% on the generation prompt. This ratio felt wrong at first. It was exactly right.
Chunking Strategy
Naive chunking (split every 512 tokens, regardless of document structure) cuts sentences and sections mid-thought, and retrieval quality suffers for it. Instead:
- Semantic chunking: Split at paragraph or section boundaries
- Overlap: 10-15% overlap between adjacent chunks reduces context loss at boundaries
- Metadata enrichment: Attach section headers, document title, and page numbers to each chunk
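To make the three ideas above concrete, here is a minimal sketch, not our production chunker. It assumes paragraphs are separated by blank lines, uses word counts as a rough stand-in for tokens, and attaches only a title and chunk index as metadata. The names `Chunk`, `chunk_document`, `max_words`, and `overlap_ratio` are made up for illustration; section headers and page numbers would be attached the same way if your parser exposes them.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_document(text, title, max_words=200, overlap_ratio=0.15):
    """Pack whole paragraphs into chunks, carrying a tail of words forward as overlap.

    Assumption: word count approximates token count. A single paragraph longer
    than max_words is kept whole and yields an oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []  # words in the chunk being built
    for para in paragraphs:
        words = para.split()
        # Close the current chunk if this paragraph would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            keep = int(len(current) * overlap_ratio)
            current = current[-keep:] if keep > 0 else []
        current.extend(words)
    if current:
        chunks.append(current)
    return [
        Chunk(text=" ".join(words), metadata={"title": title, "chunk_index": i})
        for i, words in enumerate(chunks)
    ]
```

Splitting only at paragraph boundaries means chunk sizes vary, which is usually an acceptable trade for keeping each chunk readable on its own.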
Lesson 2: Validate Everything
We implemented a three-layer validation system:
- Retrieval scoring — Drop chunks below a relevance threshold before they reach the LLM
- Grounded generation — Constrain the LLM to cite its sources explicitly
- Post-generation verification — A second pass checks every claim against source documents
Together, these three layers reduced our hallucination rate from 12% to 0.4%.
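The three layers above might look roughly like this in code. Treat it as a sketch under assumptions: the 0.75 threshold is arbitrary, the function names are hypothetical, and the third function only checks that cited source numbers exist. Verifying each claim against the source text needs a second model pass, which is not shown here.

```python
import re


def filter_chunks(scored_chunks, min_score=0.75):
    """Layer 1: keep only (text, score) pairs at or above the threshold."""
    return [(text, score) for text, score in scored_chunks if score >= min_score]


def build_grounded_prompt(question, chunks):
    """Layer 2: number each source and instruct the model to cite by number."""
    sources = "\n".join(
        f"[{i}] {text}" for i, (text, _) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below. "
        "Cite a source number like [1] after every claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def find_invalid_citations(answer, num_sources):
    """Layer 3 (partial): return cited source numbers that do not exist."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)
```

An answer that cites `[3]` when only two sources were supplied is a cheap, reliable signal that something went wrong, and it costs no extra model call to detect.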
Lesson 3: Latency Is a Feature
Users will tolerate 3-5 seconds for a high-quality response. They will not tolerate 30 seconds, even if the response is perfect. We optimized aggressively:
- Parallel retrieval across multiple indices
- Streaming responses (show partial output immediately)
- Caching frequent queries
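Parallel retrieval and caching are easy to sketch with `asyncio`. In the example below, `search_index` is a hypothetical stand-in for a real vector-store or keyword-search call, and the in-memory dict is the simplest possible cache (no eviction, no expiry), so read it as an illustration rather than a production design.

```python
import asyncio


async def search_index(index_name, query):
    """Hypothetical stand-in for a real index lookup."""
    await asyncio.sleep(0.05)  # simulate network latency
    return [f"{index_name}: result for {query!r}"]


_cache = {}  # (normalized query, index names) -> merged results


async def retrieve(query, index_names):
    # Normalize case and whitespace so trivially different queries share an entry.
    key = (" ".join(query.lower().split()), tuple(index_names))
    if key in _cache:
        return _cache[key]
    # Query all indices concurrently; latency is roughly that of the slowest one.
    per_index = await asyncio.gather(
        *(search_index(name, query) for name in index_names)
    )
    merged = [hit for hits in per_index for hit in hits]
    _cache[key] = merged
    return merged
```

Calling `asyncio.run(retrieve("refund policy", ["docs", "tickets", "wiki"]))` waits for about one simulated lookup instead of three, and a repeat of the same query returns from the cache without touching any index.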
Lesson 4: Evaluation Is Hard
How do you measure RAG quality at scale? We built a custom evaluation framework:
- Faithfulness: Does the response only contain information from retrieved documents?
- Relevance: Does the response actually answer the question?
- Completeness: Did it miss important information that was available?
Automated evaluation using a judge LLM gets you 80% of the way. Human evaluation covers the remaining 20%.
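One way to combine the two kinds of evaluation is to score everything automatically and send only the low scorers to a person. The sketch below assumes a `judge(example, metric)` callable that returns a score between 0 and 1; that callable is hypothetical and would wrap a judge LLM call in practice. The 0.7 review threshold and the routing rule are illustrative assumptions, not a description of our framework.

```python
from statistics import mean

METRICS = ("faithfulness", "relevance", "completeness")


def evaluate(examples, judge, review_threshold=0.7):
    """Average judge scores per metric and flag weak examples for human review.

    `judge(example, metric)` is a hypothetical callable returning a float in [0, 1].
    Each example is a dict with at least an "id" key.
    """
    scores = {metric: [] for metric in METRICS}
    needs_review = []
    for example in examples:
        example_scores = {metric: judge(example, metric) for metric in METRICS}
        for metric, score in example_scores.items():
            scores[metric].append(score)
        # Route the example to a human if any single metric scores low.
        if min(example_scores.values()) < review_threshold:
            needs_review.append(example["id"])
    summary = {m: mean(v) for m, v in scores.items()} if examples else {}
    return summary, needs_review
```

Flagging on the minimum score rather than the average keeps a response that is relevant and complete but unfaithful from slipping through.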
The Bottom Line
RAG is not a solved problem. It's an engineering discipline that requires the same rigor as any other production system. The magic isn't in the model — it's in the plumbing.