Re-ranking
Diagnose the p95 latency problem in this RAG system
A customer support AI met its 1s p95 latency SLA until the team added a cross-encoder re-ranker to every query, which pushed p95 to 1.8s. The fix was conditional re-ranking: skip the re-ranker when bi-encoder confidence is above 0.85. 63% of queries were simple enough to skip, which brought p95 back down to 950ms.
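The skip rule in this scenario can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: it assumes "bi-encoder confidence" means the top candidate's similarity score, and the names `conditional_rerank`, `toy_rerank`, and `skip_threshold` are hypothetical.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (document text, bi-encoder similarity score)


def conditional_rerank(
    query: str,
    candidates: List[Candidate],
    rerank: Callable[[str, List[str]], List[float]],
    skip_threshold: float = 0.85,  # threshold from the scenario above
    top_k: int = 10,
) -> Tuple[List[str], bool]:
    """Return (top_k documents, whether the re-ranker was called)."""
    # Order candidates by bi-encoder score, best first.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ordered:
        return [], False

    # Assumption: confidence is the best bi-encoder score.
    # If it clears the threshold, keep the bi-encoder order and skip the slow model.
    if ordered[0][1] > skip_threshold:
        return [doc for doc, _ in ordered[:top_k]], False

    # Otherwise pay for the re-ranker and sort by its scores instead.
    docs = [doc for doc, _ in ordered]
    scores = rerank(query, docs)
    reranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in reranked[:top_k]], True


def toy_rerank(query: str, docs: List[str]) -> List[float]:
    # Stand-in for a real cross-encoder: counts words shared with the query.
    q = set(query.lower().split())
    return [float(len(q & set(d.lower().split()))) for d in docs]
```

Returning the boolean alongside the results makes it easy to log the skip rate, which is the number (63% in the scenario) that determines how much latency the rule recovers.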
Retrieve → Re-rank → Generate Pipeline
Bi-Encoder vs Cross-Encoder: The Speed-Accuracy Tradeoff
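Bi-encoders (like text-embedding-3-small) encode the query and each document independently. They are fast, because document embeddings are computed once at index time and compared to the query embedding with cosine similarity, but less precise: the model never sees the query and document together.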
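To make the "encode once, compare via cosine" step concrete, here is a small sketch of the comparison side only. The three-dimensional vectors in `DOC_VECS` and `QUERY_VEC` are made-up stand-ins for real embeddings, and `retrieve` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    top_k: int = 100,
) -> List[Tuple[str, float]]:
    """Rank pre-computed document vectors against one query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (illustrative values, not produced by a real model).
DOC_VECS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "password-reset": [0.1, 0.9, 0.2],
    "shipping": [0.0, 0.2, 0.9],
}
QUERY_VEC = [0.2, 0.8, 0.1]

top = retrieve(QUERY_VEC, DOC_VECS, top_k=2)
```

The document vectors never depend on the query, which is why they can be stored in an index ahead of time and why the per-query cost is one embedding call plus cheap vector math.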
Cross-encoders (like Cohere Rerank, Jina Reranker v3) take the query and a document as a single concatenated input and output a relevance score. They are 100-1000x slower but roughly 33-40% more accurate, because they can model fine-grained interactions between query and document tokens.
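The production pattern: use the bi-encoder for fast initial retrieval (top-100), then run the cross-encoder over only those candidates and keep the best of them (top-10). Because the cross-encoder scores 100 documents rather than the whole corpus, you get cross-encoder accuracy on the final results at a cost much closer to bi-encoder retrieval.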
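A rough sketch of this two-stage pattern, under the assumption that both models are exposed as simple scoring functions. `retrieve_then_rerank`, `toy_cheap_score`, and `toy_pair_score` are hypothetical names, and the toy scorers only imitate the cheap/expensive split; they are not real models.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,   # stands in for bi-encoder similarity
    pair_score: Scorer,    # stands in for the cross-encoder
    retrieve_k: int = 100,
    final_k: int = 10,
) -> List[str]:
    # Stage 1: cheap score over the whole corpus, keep a shortlist.
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    shortlist = shortlist[:retrieve_k]
    # Stage 2: expensive score over the shortlist only.
    reranked = sorted(shortlist, key=lambda d: pair_score(query, d), reverse=True)
    return reranked[:final_k]


pair_calls: List[str] = []  # records how often the expensive scorer runs


def toy_cheap_score(query: str, doc: str) -> float:
    # Bag-of-words overlap: ignores word order, like an imprecise first pass.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_pair_score(query: str, doc: str) -> float:
    # Sees query and document together: rewards the exact phrase.
    pair_calls.append(doc)
    if query.lower() in doc.lower():
        return 1.0
    return 0.1 * toy_cheap_score(query, doc)


corpus = [
    "password policy never reset without approval",
    "click reset password on the login page",
    "shipping times and fees",
    "reset your router",
]
result = retrieve_then_rerank(
    "reset password", corpus, toy_cheap_score, toy_pair_score,
    retrieve_k=3, final_k=2,
)
```

In this toy run the first stage ties the two password documents, and the second stage promotes the one that actually contains the phrase, while `pair_calls` shows the expensive scorer ran on the shortlist only.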
ColBERT (Contextualized Late Interaction over BERT) pre-computes per-token embeddings for all documents at index time. At query time, it computes MaxSim scores between query tokens and document tokens: for each query token it takes the highest similarity to any document token, then sums those maxima. This is 2-5x faster than a full cross-encoder while maintaining comparable accuracy. It is used in production at Databricks for semantic search over Databricks documentation (millions of docs). If you need the accuracy of a cross-encoder without the full latency penalty, ColBERT is the pattern to reach for.
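The MaxSim scoring step can be sketched in a few lines. This shows only the late-interaction score, assuming token embeddings are already computed and L2-normalized so that a dot product acts as cosine similarity; the two-dimensional vectors are invented for illustration and `maxsim_score` is a hypothetical name.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best match among document tokens."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (illustrative values only).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # matches both query tokens
doc_b_tokens = [[1.0, 0.0], [0.6, 0.8]]              # matches one exactly, one partly

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
```

The document token embeddings are the part that can be stored at index time; only the query tokens are encoded per request, and the remaining work is dot products and a max, which is where the speed advantage over a full cross-encoder pass comes from.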