TL;DR
- Dense retrieval encodes both queries and documents into fixed-dimensional vectors with a bi-encoder model; relevance is computed as cosine similarity or inner product.
- Replaces lexical overlap with semantic similarity; it underpins most modern RAG systems and much of the enterprise search rebuilt since 2022.
- Bi-encoders (one tower per side) enable offline document encoding and ANN search; cross-encoders (one tower over the joint pair) are more accurate but only feasible as rerankers.
- Quality is bounded by the embedding model — DPR (2020), E5, BGE, Nomic, and OpenAI text-embedding-3 are the canonical milestones.
Bi-Encoders and the Two-Tower Pattern
Dense Passage Retrieval (Karpukhin et al., 2020) established the production pattern: two BERT-style encoders, one for queries and one for passages, trained with contrastive loss on pairs of (question, relevant passage) plus in-batch negatives. At index time, the passage encoder runs once per document and stores the vector; at query time, the query encoder runs once per request, and the system returns the top-k passages by inner product. Because the two encoders run independently, the expensive part of the work (document encoding) happens offline.
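The offline / online split described above can be sketched in a few lines. This is a minimal illustration, not DPR itself: `toy_encode` is a hypothetical stand-in (a hashed bag-of-words) for a trained encoder, and `build_index`, `search`, and `DIM` are names invented for this example. A real system would call a trained model and an ANN index instead of a linear scan.

```python
import math
import zlib

DIM = 1024  # toy dimensionality, chosen large so hash collisions are unlikely


def toy_encode(text: str) -> list[float]:
    """Hypothetical stand-in for a trained encoder: hashed bag-of-words, unit-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_index(passages: list[str]) -> list[list[float]]:
    """Index time: the encoder runs once per passage and the vector is stored."""
    return [toy_encode(p) for p in passages]


def search(query: str, index: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Query time: encode the query once, then rank stored vectors by inner product."""
    q = toy_encode(query)
    scored = [(i, inner_product(q, v)) for i, v in enumerate(index)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


passages = [
    "the eiffel tower is in paris",
    "python is a programming language",
    "paris is the capital of france",
]
index = build_index(passages)                  # offline: done once per corpus
results = search("capital of france", index)   # online: done once per request
```

Note that only `search` touches the query; the passage vectors are never recomputed at request time, which is what makes the pattern cheap to serve.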
Modern embedding models have largely moved past the original two-tower split. Most production encoders today are a single shared-weight model that uses a short text prefix or instruction to signal whether the input is a query or a passage. E5 and Nomic prefix both sides, while BGE adds an instruction to the query only. This simplifies serving, since there is one model to deploy, while preserving the same offline / online split.
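As a rough sketch of the prefix pattern, the helper below prepends a role marker before text is handed to a single encoder. The prefix strings reflect what the E5 and Nomic model cards document at the time of writing; treat them as assumptions and check the card for the exact checkpoint you deploy. `PREFIXES` and `with_prefix` are hypothetical names for this example.

```python
# Prefix conventions as documented on the respective model cards; verify them
# against the specific checkpoint before relying on them.
PREFIXES = {
    "e5": {"query": "query: ", "passage": "passage: "},
    "nomic": {"query": "search_query: ", "passage": "search_document: "},
}


def with_prefix(text: str, role: str, family: str = "e5") -> str:
    """Return the string that would be passed to the single shared encoder."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return PREFIXES[family][role] + text


query_input = with_prefix("how do bi-encoders work", "query")
passage_input = with_prefix("Bi-encoders embed each side independently.", "passage")
```

Using the wrong prefix, or none at all, usually still produces a vector, so this kind of mistake tends to show up as silently degraded recall rather than an error.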
Why Inner Product, Not Distance
Most embedding models are trained to produce normalised vectors and are scored with inner product, which equals cosine similarity when both vectors have unit norm. Approximate nearest-neighbour indices that optimise inner product (HNSW with IP metric, IVF with IP) are therefore typically used in preference to L2-distance indices. For unit-norm vectors the two give the same ranking, because squared L2 distance equals 2 minus twice the inner product; the choice only changes results when vectors are not normalised. The exception is an embedding model explicitly trained on Euclidean distance, which is rare among modern open-weight embedders.
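The relationship between inner product and L2 distance on unit-norm vectors can be checked numerically. The sketch below uses random vectors rather than real embeddings (an assumption made purely for illustration) and verifies that ‖q − d‖² = 2 − 2⟨q, d⟩, so ranking by descending inner product matches ranking by ascending distance.

```python
import math
import random


def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the run is repeatable
dim, n_docs = 32, 50
query = normalise([rng.gauss(0, 1) for _ in range(dim)])
docs = [normalise([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(n_docs)]

# Identity for unit vectors: ||q - d||^2 = 2 - 2 * <q, d>
max_gap = max(abs(squared_l2(query, d) - (2 - 2 * dot(query, d))) for d in docs)

# Highest inner product first vs. smallest distance first
by_ip = sorted(range(n_docs), key=lambda i: dot(query, docs[i]), reverse=True)
by_l2 = sorted(range(n_docs), key=lambda i: squared_l2(query, docs[i]))
same_ranking = by_ip == by_l2
```

If the `normalise` calls are removed, the identity no longer holds and the two rankings can diverge, which is the case where the metric choice actually matters.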
Embedding Model Generations
| Year | Model family | Notable property |
|---|---|---|
| 2020 | DPR | First widely-used dense retriever; BERT-base bi-encoder |
| 2022 | E5 (Microsoft) | Weakly supervised on web text pairs; strong zero-shot |
| 2023 | BGE (BAAI) | Mixed instruction tuning; topped MTEB for months |
| 2024 | Nomic / GTE / Jina v3 | Matryoshka embeddings; truncatable dimensions |
| 2024 | OpenAI text-embedding-3 | Native Matryoshka; 256-3072 dimensions |
| 2025+ | Cohere Embed v4, Voyage 3, multimodal encoders | Multilingual, image + text, longer contexts |
Matryoshka Representation Learning
Kusupati et al. (2022) introduced Matryoshka Representation Learning: training an embedding so that leading prefixes of the vector are themselves usable embeddings at lower dimensionality. A model trained with Matryoshka loss at 1536 dimensions can be truncated to 768, 512, 256 or even 64 dimensions (re-normalising the truncated vector if the index expects unit norm) with graceful quality degradation, which substantially reduces storage and ANN search cost. Many embedding models released since 2024 support this out of the box.
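A minimal sketch of the truncation step and its storage effect follows. `truncate_embedding` and `storage_bytes` are hypothetical helpers, the input is a placeholder unit vector rather than a real embedding, and float32 storage is assumed. Truncation like this is only meaningful for models actually trained with a Matryoshka-style objective.

```python
import math


def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and re-normalise to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # degenerate all-zero prefix; nothing to normalise
    return [x / norm for x in prefix]


def storage_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, ignoring index overhead; float32 assumed by default."""
    return n_vectors * dim * bytes_per_value


full = [1.0 / math.sqrt(1536)] * 1536   # placeholder unit vector, not a real embedding
small = truncate_embedding(full, 256)

full_cost = storage_bytes(1_000_000, 1536)    # 6_144_000_000 bytes
small_cost = storage_bytes(1_000_000, 256)    # 1_024_000_000 bytes
```

For one million vectors, going from 1536 to 256 dimensions cuts raw vector storage by a factor of six, before any quantisation or index overhead is considered.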
Limitations
- Rare terms and proper nouns the encoder has not seen are weakly represented — the lexical blind spot hybrid search exists to cover.
- Long documents are usually chunked before encoding because most bi-encoders cap at 512-8192 tokens; long-context embedders are emerging but still rare.
- Domain shift hurts more than for BM25: an encoder trained on web text underperforms on legal or biomedical corpora until fine-tuned.
- Scores are not calibrated: a cosine of 0.7 means different things on different models, so absolute thresholds rarely transfer.
If you swap embedding models in production, you must re-encode the entire corpus. There is no transfer between embedding spaces of different models.
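The chunking step mentioned in the limitations above is often implemented as a sliding window with overlap, so that text cut at a boundary still appears whole in one chunk. The sketch below is one simple way to do it, under stated assumptions: `chunk_tokens` is a hypothetical helper, the window and overlap sizes are illustrative, and the stand-in token list takes the place of the encoder's own tokenizer.

```python
def chunk_tokens(tokens: list[str], max_tokens: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of at most `max_tokens`, sharing `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for a tokenized document; a real pipeline would use the model's tokenizer.
tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens, max_tokens=512, overlap=64)
```

Each chunk is then encoded and indexed as its own passage, usually with a pointer back to the source document so results can be grouped at query time.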
References
- Dense Passage Retrieval for Open-Domain Question Answering · arXiv (Karpukhin et al., 2020)
- Matryoshka Representation Learning · arXiv (Kusupati et al., 2022)
- MTEB: Massive Text Embedding Benchmark · arXiv (Muennighoff et al., 2022)