RAG Pipelines in Production: Lessons Learned

2025-03-01

The gap between a RAG demo and a production RAG system is roughly the same as the gap between a prototype airplane and one you'd actually fly in. Both have wings. Only one won't kill you.

After building RAG systems that serve thousands of daily users in a B2B context, here are the lessons that aren't in the tutorials.

Lesson 1: Retrieval Quality Is Everything

The LLM is only as good as what you feed it. We spent 60% of our engineering time on retrieval — chunking strategies, embedding model selection, re-ranking — and 10% on the generation prompt. This ratio felt wrong at first. It was exactly right.

Chunking Strategy

Naive chunking (split every 512 tokens) produces garbage retrieval. Instead:
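One common alternative, sketched here as an assumption rather than the exact strategy we shipped, is to split on paragraph boundaries and pack whole paragraphs into a chunk until a size budget is reached, carrying the last paragraph forward as overlap. The function name `chunk_by_paragraph`, the blank-line paragraph delimiter, and the use of whitespace-separated words as a stand-in for real tokenizer counts are all illustrative choices.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 512, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens.

    Assumptions: paragraphs are separated by blank lines, and "tokens" are
    approximated by whitespace-separated words (swap in a real tokenizer).
    A single paragraph longer than max_tokens is kept whole, and carried-over
    overlap can push a chunk slightly past the budget.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the last `overlap` paragraphs forward so context
            # is not lost at the chunk boundary.
            current = current[-overlap:] if overlap > 0 else []
            current_len = sum(len(p.split()) for p in current)
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Unlike a fixed 512-token split, this never cuts a paragraph in half, so each retrieved chunk is at least a complete thought.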

Lesson 2: Validate Everything

We implemented a three-layer validation system:

  1. Retrieval scoring — Drop chunks below a relevance threshold before they reach the LLM
  2. Grounded generation — Constrain the LLM to cite its sources explicitly
  3. Post-generation verification — A second pass checks every claim against source documents

This reduced our hallucination rate from 12% to 0.4%.
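
The first two layers can be sketched in a few lines. This is a minimal illustration, not our production code: the `ScoredChunk` type, the `0.5` threshold, the `max_chunks` cap, and the prompt wording are all assumptions, and a sensible threshold depends entirely on the retriever or re-ranker producing the scores.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    doc_id: str
    text: str
    score: float  # relevance score from the retriever or re-ranker; higher is better


def filter_chunks(chunks, min_score=0.5, max_chunks=5):
    """Layer 1: drop low-relevance chunks before they reach the LLM."""
    kept = [c for c in chunks if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:max_chunks]


def build_grounded_prompt(question, chunks):
    """Layer 2: number the sources and instruct the model to cite them.

    Returns None when nothing survived filtering, so the caller can
    decline to answer instead of asking the LLM to improvise.
    """
    if not chunks:
        return None
    sources = "\n".join(
        f"[{i}] ({c.doc_id}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below. "
        "Cite a source number like [1] after every claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

Numbered sources also make the third layer easier: a verification pass can look up each cited number and check the claim against that specific chunk.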

Lesson 3: Latency Is a Feature

Users will tolerate 3-5 seconds for a high-quality response. They will not tolerate 30 seconds, even if the response is perfect. We optimized aggressively:
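
Optimization starts with knowing where the time goes. As a hedged sketch (the `LatencyBudget` class and its method names are hypothetical, and the 5-second default simply mirrors the upper end of the tolerance above), per-stage timing against a total budget can look like this:

```python
import time
from contextlib import contextmanager


class LatencyBudget:
    """Track per-stage wall-clock time against a total request budget."""

    def __init__(self, total_seconds=5.0, clock=time.perf_counter):
        self.total_seconds = total_seconds
        self._clock = clock  # injectable so the class can be tested deterministically
        self._start = clock()
        self.stages = {}

    @contextmanager
    def stage(self, name):
        """Record how long the wrapped block took under `name`."""
        started = self._clock()
        try:
            yield
        finally:
            self.stages[name] = self._clock() - started

    def remaining(self):
        """Seconds left before the request exceeds its budget."""
        return self.total_seconds - (self._clock() - self._start)

    def exceeded(self):
        return self.remaining() < 0
```

Wrapping retrieval, re-ranking, and generation each in `with budget.stage("..."):` gives a per-request breakdown, and `remaining()` lets a later stage decide whether to skip optional work such as a re-ranking pass.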

Lesson 4: Evaluation Is Hard

How do you measure RAG quality at scale? We built a custom evaluation framework:

Automated evaluation using a judge LLM gets you 80% of the way. Human evaluation covers the remaining 20%.
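
The split between automated and human evaluation can be sketched as a harness that runs a judge over every example and queues failures, plus a random sample of passes, for human review. This is an illustration under stated assumptions, not our framework: `evaluate`, the example dictionary keys, and `human_sample_rate` are hypothetical, and `judge` is any callable you supply (in practice a wrapper around an LLM call).

```python
import random
from typing import Callable


def evaluate(
    examples: list[dict],
    judge: Callable[[str, str, str], bool],
    human_sample_rate: float = 0.1,
    seed: int = 0,
) -> dict:
    """Score examples with an automated judge and build a human review queue.

    Each example is a dict with "question", "answer", and "context" keys.
    `judge(question, answer, context)` returns True if the answer is acceptable.
    """
    rng = random.Random(seed)  # seeded so the sampled queue is reproducible
    passed = 0
    human_queue = []

    for ex in examples:
        ok = judge(ex["question"], ex["answer"], ex["context"])
        passed += int(ok)
        # Humans see every failure plus a random sample of passes,
        # which is how judge mistakes in both directions get caught.
        if not ok or rng.random() < human_sample_rate:
            human_queue.append(ex)

    pass_rate = passed / len(examples) if examples else 0.0
    return {"pass_rate": pass_rate, "human_queue": human_queue}
```

Sampling passes as well as failures matters: without it, a judge that is too lenient would never be noticed.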

The Bottom Line

RAG is not a solved problem. It's an engineering discipline that requires the same rigor as any other production system. The magic isn't in the model — it's in the plumbing.