Production RAG Cost & Latency
The Production Cost Stack
The Production Cost Stack
A naive RAG pipeline at scale becomes expensive fast. The cost breakdown per query at 1000 QPS:
| Component | Cost Source | Optimization | |-----------|-------------|-------------| | Embedding | API calls or GPU | Matryoshka truncation, self-host at scale | | Vector search | DB compute | HNSW index, pgvector is free | | Re-ranking | API calls | Conditional skip for simple queries | | LLM generation | Tokens in/out | Model routing, caching, shorter contexts |
Semantic caching is the #1 lever: store (query_embedding โ response) in Redis. Cache hit = zero LLM cost. Production hit rates: 30-70% depending on query diversity.
Model routing: route simple factual queries to GPT-4o-mini ($0.15/MTok), reserve GPT-4o ($2.50/MTok) for complex reasoning. A classifier adds 5ms and saves 94% on 70% of queries.