Retrieval Quality > Model Size: RAG Levers That Matter
Most teams reach for a bigger model when answers feel off, but retrieval failures are the more common culprit. Here are the levers I’ve seen move the needle fastest.
1. Chunk Strategy
Fixed-size overlapping chunks are the default but rarely optimal. I start with semantic chunking for docs over 3K tokens and enforce a hard upper bound per chunk (so transformer attention isn’t wasted on boilerplate). Add overlap only when context fragmentation shows up in eval samples.
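A minimal sketch of the hard upper bound, splitting at paragraph boundaries. The whitespace token count is a stand-in for a real tokenizer, and `bounded_chunks` is a hypothetical helper name; paragraphs longer than the cap would need a further split in practice.

```python
def bounded_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Group paragraphs into chunks, flushing before the token cap is hit.
    Token counting is whitespace-based here; swap in your tokenizer."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        tokens = para.split()
        if count + len(tokens) > max_tokens and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(tokens)
        count += len(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Semantic chunkers refine where the boundaries fall, but the hard cap stays the same: no chunk should exceed what the retriever and context window use well.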
2. Embedding Freshness
Content changes silently over time. I track a reindex_lag_hours metric (last updated timestamp vs last embedding time). Spikes predict relevance drop before users complain.
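The metric itself is one line. A sketch, assuming you can pull `(updated_at, embedded_at)` timestamp pairs per document; the function name is mine, not a standard API.

```python
from datetime import datetime


def max_reindex_lag_hours(docs) -> float:
    """Worst-case staleness across the corpus.
    docs: iterable of (updated_at, embedded_at) datetime pairs.
    Positive lag means the stored embedding predates the content."""
    return max((u - e).total_seconds() / 3600.0 for u, e in docs)
```

Alert on the max (or p95), not the mean: one stale high-traffic doc can drag relevance down while the average looks healthy.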
3. Negative Sampling
Adding representative “confuser” docs to evaluation sets prevents overfitting to trivial matches. Quick win: include near-duplicate docs with different accepted answers.
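One cheap way to mine confusers is lexical overlap: docs similar enough to the gold doc to fool retrieval, but not outright duplicates. A sketch using Jaccard similarity over token sets; the band thresholds are illustrative, not tuned values.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two docs, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def mine_confusers(gold_doc: str, corpus: list[str],
                   lo: float = 0.5, hi: float = 0.9) -> list[str]:
    """Docs in the 'confuser band': close to the gold doc (>= lo)
    but below the near-duplicate cutoff (< hi)."""
    return [d for d in corpus if lo <= jaccard(gold_doc, d) < hi]
```

Embedding-space nearest neighbors work too, but a lexical pass is a fast first cut and needs no index.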
4. Retrieval Fan-out Budget
I cap the initial k (e.g., 8–12) and expand only when answer confidence scoring is low. Blindly increasing k usually adds noise tokens.
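The expand-on-low-confidence loop can be sketched as below. `search_fn` and `confidence_fn` are hypothetical callables standing in for your retriever and scoring step; the 0.6 threshold is illustrative.

```python
def retrieve_with_budget(search_fn, confidence_fn, query: str,
                         k: int = 8, max_k: int = 32,
                         threshold: float = 0.6):
    """Start with a small fan-out; double k only while confidence
    stays below threshold, up to a hard cap."""
    while True:
        chunks = search_fn(query, k)
        if confidence_fn(chunks) >= threshold or k >= max_k:
            return chunks
        k = min(k * 2, max_k)
```

The cap matters: without `max_k`, a genuinely unanswerable query just inflates the prompt until something hallucinates.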
5. Reranking Layer
Lightweight cross-encoder reranking (applied to the top 20 results) outperforms prematurely swapping the base LLM. Add it when raw retrieval precision drops below 0.7 on the eval set.
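The wiring is model-agnostic: rescore only the head of the list, leave the tail alone. A sketch where `score_fn` stands in for the cross-encoder call (e.g., a sentence-transformers `CrossEncoder.predict` on query–chunk pairs); `rerank_top` is my name for it.

```python
def rerank_top(query: str, chunks: list[str], score_fn,
               top_n: int = 20) -> list[str]:
    """Rerank the first top_n retrieved chunks by cross-encoder score;
    keep the tail in its original retrieval order."""
    head, tail = chunks[:top_n], chunks[top_n:]
    head = sorted(head, key=lambda c: score_fn(query, c), reverse=True)
    return head + tail
```

Capping at top 20 keeps latency bounded: cross-encoders score one (query, chunk) pair per forward pass, so cost is linear in `top_n`.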
6. Answerability Detection
Return “I don’t know” gracefully if no retrieved chunk passes a relevance threshold. Better than hallucinated prose. Track answerable_rate.
7. Freshness Layering
I maintain two stores: long-term embeddings plus hot recent docs (e.g., last 48h) reindexed aggressively. Merge results with a simple recency boost when query intent suggests it.
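The merge step can be as simple as score fusion with an additive boost. A sketch assuming both stores return `(doc_id, score)` pairs on a comparable scale; the 0.2 boost and the `recency_bias` flag (set by your query-intent classifier) are illustrative.

```python
def merge_with_recency(cold, hot, boost: float = 0.2,
                       recency_bias: bool = False):
    """Fuse cold-store and hot-store results. Docs in both stores keep
    their best score; hot hits get an additive boost when the query
    looks time-sensitive. Returns (doc_id, score) sorted by score."""
    scores: dict[str, float] = {}
    for doc_id, s in cold:
        scores[doc_id] = max(scores.get(doc_id, 0.0), s)
    for doc_id, s in hot:
        if recency_bias:
            s += boost
        scores[doc_id] = max(scores.get(doc_id, 0.0), s)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Keeping the boost additive (not a filter) means stale-but-authoritative docs still surface when the hot store has nothing better.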
8. Eval Harness
I collect real user queries (anonymized) + expected answers weekly. Manual labeling of just 50–80 samples surfaces drift early. Track precision@k, answerability, and latency.
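Of the three metrics, precision@k is the one teams most often compute inconsistently. A minimal reference implementation, assuming retrieved results are an ordered list of doc ids and labels are a set of relevant ids:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved doc ids that are labeled relevant.
    Divides by k (not by hits available), so a short result list
    is penalized rather than excused."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k
```

Run it per query over the 50–80 labeled samples and track the mean week over week; a drop here before answer quality complaints is the early drift signal.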
Upgrading model size is the last resort—not the first lever.
Takeaways
- Optimize retrieval before swapping models.
- Track freshness lag and answerability.
- Use semantic + bounded chunking.
- Add reranking only when needed.
- Build a tiny evolving eval set.
Better retrieval = cheaper, faster, clearer answers. Start there.