Evaluation & CI/CD

Fix the 3 critical bugs in this RAG quality gate pipeline

A team shipped a RAG update that improved answer fluency scores by 12%, but the hallucination rate jumped from 8% to 23%. Their CI/CD pipeline checked only answer relevancy, not faithfulness, so the regression reached production and the rollout caused a compliance incident. A faithfulness score below 0.8 should be a hard deployment block, not a warning.
— Level 7 · Production RAG Pipeline
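
The difference between a hard block and a warning can be made concrete with a small gate function. This is a minimal sketch, not the pipeline from the exercise: the function names, the score dictionary keys, and the 0.7 relevancy warning threshold are assumptions chosen for illustration. Only the 0.8 faithfulness threshold comes from the scenario above.

```python
FAITHFULNESS_MIN = 0.8       # hard block threshold, from the scenario above
ANSWER_RELEVANCY_WARN = 0.7  # assumed soft threshold, illustrative only


def gate(scores):
    """Return (deploy_allowed, messages) for a dict of metric scores."""
    messages = []

    faithfulness = scores.get("faithfulness")
    if faithfulness is None:
        # Fail closed: a release whose faithfulness was never measured
        # must not pass the gate by default.
        return False, ["BLOCK: faithfulness was not measured"]
    if faithfulness < FAITHFULNESS_MIN:
        return False, [
            f"BLOCK: faithfulness {faithfulness:.2f} < {FAITHFULNESS_MIN}"
        ]

    # Answer relevancy only produces a warning in this sketch.
    relevancy = scores.get("answer_relevancy")
    if relevancy is not None and relevancy < ANSWER_RELEVANCY_WARN:
        messages.append(
            f"WARN: answer_relevancy {relevancy:.2f} < {ANSWER_RELEVANCY_WARN}"
        )
    return True, messages


def exit_code(scores):
    """Map the gate decision to a process exit code (non-zero blocks)."""
    allowed, messages = gate(scores)
    for message in messages:
        print(message)
    return 0 if allowed else 1
```

A CI step would typically end with something like sys.exit(exit_code(scores)), so that a blocked release fails the job instead of printing a message that nobody reads.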


RAGAS: The 4 Metrics That Matter

The RAGAS framework defines four core metrics for RAG evaluation:

Faithfulness: Does the answer contain only claims supported by the retrieved context? The score is the fraction of claims in the answer that the context supports, so a score of 0.6 means 40% of the claims are unsupported. This is the hallucination detector.
Answer Relevancy: Does the answer actually address the user's question?
Context Precision: Of the chunks retrieved, what fraction are relevant?
Context Recall: Of all the chunks that contain the answer, what fraction did we retrieve?
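
The ratios behind three of these metrics can be sketched as plain arithmetic. This is a simplified illustration of the definitions in the list above, not the RAGAS library's implementation, which relies on an LLM judge to extract and verify claims and may differ in detail. The function names and the boolean-flag inputs are assumptions.

```python
def faithfulness(claim_supported):
    """Fraction of answer claims that the retrieved context supports.

    claim_supported holds one boolean per claim extracted from the answer.
    """
    if not claim_supported:
        raise ValueError("answer contains no claims to evaluate")
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_relevant):
    """Fraction of retrieved chunks that are relevant (unranked version)."""
    if not chunk_relevant:
        raise ValueError("no chunks were retrieved")
    return sum(chunk_relevant) / len(chunk_relevant)


def context_recall(relevant_retrieved, relevant_total):
    """Fraction of all relevant chunks that were actually retrieved."""
    if relevant_total <= 0:
        raise ValueError("relevant_total must be positive")
    return relevant_retrieved / relevant_total


# Five claims in the answer, three supported by the context:
# faithfulness is 3 / 5, so two claims in five are unsupported.
score = faithfulness([True, True, True, False, False])
```

Answer relevancy is left out because it has no simple counting form: it needs a model or embedding comparison between the question and the answer.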

Faithfulness should always be the primary gate. Never use the same model family as both judge and generator: a judge tends to rate output from its own family more favorably, and this judge-generator bias inflates scores by 15-25%.
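
The judge-generator rule can be enforced before any evaluation runs. The sketch below assumes a hypothetical naming convention in which the model family is the part of the model name before the first hyphen; the model names are placeholders, and a real pipeline would need an explicit mapping from model identifiers to families.

```python
def model_family(model_name):
    """Derive a family label from a model name.

    Assumption: the family is the prefix before the first hyphen.
    """
    return model_name.split("-", 1)[0].lower()


def check_judge_independence(generator_model, judge_model):
    """Raise if the judge and the generator share a model family."""
    generator_family = model_family(generator_model)
    judge_family = model_family(judge_model)
    if generator_family == judge_family:
        raise ValueError(
            f"judge '{judge_model}' and generator '{generator_model}' "
            f"share the family '{generator_family}'; pick a different judge"
        )


# Placeholder model names from two different families: passes silently.
check_judge_independence("alpha-large", "beta-judge")
```

Running this check as the first step of the evaluation job makes a misconfigured judge fail fast rather than quietly producing inflated scores.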
