Embedding Models
Hit the cost target with the right model + quantization combo
A fintech team spent $800/month on OpenAI embeddings for their compliance search system. A 2-hour migration to self-hosted Qwen3-Embedding on a single $20/month GPU instance cut costs by about 97%, with better MTEB scores. The decision framework: if your monthly API spend (cost per query × monthly queries) exceeds the roughly $20/month cost of a self-hosted GPU instance, self-host.
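The break-even rule can be sketched in a few lines. This is a minimal illustration, not a full cost model: the function names are hypothetical, the $20/month default comes from the example above, and engineering time and operations overhead are deliberately ignored.

```python
def monthly_api_cost(cost_per_query: float, monthly_queries: int) -> float:
    """Estimated monthly API spend in dollars."""
    return cost_per_query * monthly_queries


def should_self_host(cost_per_query: float,
                     monthly_queries: int,
                     self_host_monthly: float = 20.0) -> bool:
    """Apply the rule: self-host when API spend exceeds the GPU instance cost.

    self_host_monthly defaults to the $20/month figure from the example;
    real instances and workloads will differ.
    """
    return monthly_api_cost(cost_per_query, monthly_queries) > self_host_monthly


def savings_fraction(before: float, after: float) -> float:
    """Fraction of the original monthly spend that is saved."""
    return (before - after) / before


# The fintech example: $800/month on the API versus $20/month self-hosted.
example_savings = savings_fraction(800.0, 20.0)
print(f"Savings: {example_savings:.1%}")
```

For the figures in the example, $800 down to $20 is a 97.5% reduction, which matches the "about 97%" quoted.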
Sparse vs Dense Embeddings
How Embeddings Capture Meaning
Embeddings convert text into high-dimensional vectors (arrays of numbers) where similar meanings cluster together in vector space. The sentences 'How do I return a product?' and 'What is the refund process?' produce vectors that are close together, even though they share few words.
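"Close together" is usually measured with cosine similarity. The sketch below uses made-up 4-dimensional vectors as stand-ins for real model output (real embeddings have hundreds or thousands of dimensions); the variable names and values are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT produced by any real embedding model.
return_product = [0.8, 0.6, 0.1, 0.0]   # 'How do I return a product?'
refund_process = [0.7, 0.7, 0.2, 0.1]   # 'What is the refund process?'
weather_today = [0.0, 0.1, 0.9, 0.4]    # an unrelated sentence

sim_related = cosine_similarity(return_product, refund_process)
sim_unrelated = cosine_similarity(return_product, weather_today)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The two support questions score much higher against each other than against the unrelated sentence, which is the property a retrieval system relies on.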
Matryoshka Representation Learning (MRL) makes embeddings truncatable: the model is trained so that the first N dimensions of the vector retain most of the semantic meaning on their own. With an MRL-trained model you can shrink a 1536-dim vector to 256 dims and keep about 98% of the quality, at roughly 83% less storage (256/1536 ≈ 17% of the original size).
Enterprise Skills Bridge: Think of MRL as near-lossless compression. The analogy is gzip on structured JSON: you preserve the signal and discard the redundancy, although unlike gzip, truncation is slightly lossy. The first 256 dimensions capture the 'shape' of meaning; the remaining 1280 add fine-grained nuance you rarely need for production retrieval.
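Mechanically, Matryoshka truncation is just slicing off the leading dimensions. The sketch below assumes float32 storage (4 bytes per value) and a corpus of one million vectors, both illustrative choices, and uses a synthetic vector rather than real model output. It also re-normalizes to unit length so that dot-product search still behaves like cosine similarity. Truncation only preserves quality for models trained with MRL.

```python
import math


def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid dividing by zero on an all-zero vector
    return [x / norm for x in head]


def storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value


FULL_DIMS, SHORT_DIMS = 1536, 256

# Synthetic stand-in for a 1536-dim embedding.
full_vec = [math.sin(i + 1) for i in range(FULL_DIMS)]
short_vec = truncate_and_normalize(full_vec, SHORT_DIMS)
short_norm = math.sqrt(sum(x * x for x in short_vec))

full_size = storage_bytes(1_000_000, FULL_DIMS)
short_size = storage_bytes(1_000_000, SHORT_DIMS)
saving = 1 - short_size / full_size

print(f"{len(short_vec)} dims, norm {short_norm:.3f}")
print(f"{full_size / 1e9:.2f} GB -> {short_size / 1e9:.2f} GB ({saving:.0%} less)")
```

Some APIs and libraries can return the shortened vector directly, which avoids storing or transferring the full vector in the first place.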
< 1M queries/month: text-embedding-3-small at $0.02/MTok — best quality-per-dollar, zero ops.
> 5K queries/day: Self-host BGE-M3 on a $20/mo GPU for comparable MTEB performance at a fraction of the API cost (about 97% less in the $800/month example above).
Privacy requirement: Self-host — your documents never leave your infrastructure.
Always: Use 256-512 dims via Matryoshka truncation when the model was trained for it. Expect about 98% quality at roughly 17-33% of full-dim (1536) storage cost.
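The rules above can be folded into a small helper. This is a sketch under stated assumptions: the function and field names are hypothetical, privacy is checked first, and the daily-volume rule takes precedence over the monthly one, because the two volume thresholds overlap (5K queries/day is about 150K/month). Adjust the precedence to your own situation.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    deployment: str  # "api" or "self-host"
    model: str
    reason: str


def recommend(queries_per_day: int, privacy_required: bool = False) -> Recommendation:
    """Encode the lesson's rules; the precedence order is an assumption."""
    if privacy_required:
        return Recommendation("self-host", "BGE-M3",
                              "documents must stay on your infrastructure")
    if queries_per_day > 5_000:
        return Recommendation("self-host", "BGE-M3",
                              "volume is high enough to justify a GPU instance")
    return Recommendation("api", "text-embedding-3-small",
                          "low volume, zero ops")


for qpd, private in [(500, False), (20_000, False), (500, True)]:
    rec = recommend(qpd, private)
    print(f"{qpd:>6}/day private={private}: {rec.deployment} ({rec.model})")
```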