TL;DR
- HNSW was introduced by Yury Malkov and Dmitry Yashunin in 'Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs' (arXiv:1603.09320, March 2016). The paper combined the navigable-small-world graph of NSW with a probabilistic skip-list-style hierarchy, producing the first graph index that beat tree- and LSH-based ANN on both recall and latency.
- Builds a multi-layer proximity graph. Each node is assigned a maximum layer drawn from an exponential distribution; layer 0 contains every node, each higher layer contains a geometric subsample. Search starts at a fixed entry point in the top layer, greedily walks to a local minimum, drops one layer, repeats. Long jumps happen at the top; fine local search at the bottom.
- Three knobs dominate. M (typically 16-48, with 16 the standard default) is the maximum out-degree at higher layers. efConstruction (typically 100-500, default 200) is the candidate-pool size during build — higher values mean a slower build and a higher-recall final graph. efSearch is the only query-time knob and trades latency directly for recall.
- Recall@10 above 95% at sub-millisecond query latency on tens to hundreds of millions of vectors with logarithmic search complexity in practice. Build is O(N log N) and embarrassingly parallel; deletion is the algorithm's weakest point and is handled in practice by tombstoning plus periodic rebuilds.
- Default index in pgvector (since 0.5), Qdrant, Weaviate, Milvus, Elasticsearch, OpenSearch, Lucene, FAISS, Vespa, Vald, MongoDB Atlas Search and Pinecone (serverless and pod-based). HNSW won the ANN-algorithms competition; the interesting decisions in 2026 are tuning knobs, quantisation choices, and disk-spill variants.
Overview#
HNSW is the approximate-nearest-neighbour algorithm that gave production vector search its standard floor for recall and ceiling for latency. Before HNSW, ANN was a noisy mix of tree-based methods (KD-tree, ball-tree, annoy), locality-sensitive hashing (LSH), and inverted-file methods with product quantisation (FAISS IVFPQ). None scaled cleanly to high-dimensional embedding vectors with both >95% recall and sub-millisecond latency. HNSW achieved both, and the production world consolidated on it within five years of the paper.
The construction is at heart simple. Build a navigable graph in which each node has a small number of neighbours that include both close points (for refining a candidate) and a few much further points (for escaping local minima). Search the graph greedily, always moving to the neighbour closer to the query. Stack a logarithmic hierarchy on top so that long-distance traversal does not have to happen one short hop at a time. The result is an index whose query cost is O(log N) in practice, whose recall is tunable through one query-time knob, and whose build is straightforward to parallelise.
By 2026 the algorithm is ubiquitous. pgvector adopted HNSW as a first-class index in 0.5 and most Postgres-based RAG runs on it. Qdrant, Weaviate, Milvus and Vespa all default to HNSW for production collections. Elasticsearch, OpenSearch and MongoDB Atlas Search use HNSW under their vector query APIs. Pinecone serverless and pod indexes are HNSW variants under the hood. Lucene shipped a native HNSW implementation in version 9 (2022). The reference open-source implementation, hnswlib by the original authors, remains the cleanest place to study the algorithm.
This entry helps you decide when HNSW is the right index for your RAG workload, how to set M, efConstruction and efSearch on a real corpus, how to plan memory and ingestion, and when one of the alternatives (IVFPQ at billion scale, DiskANN when memory runs out, ScaNN when memory is the constraint, GPU CAGRA when batch throughput dominates) is the better call. Yobitel MediQuery's vector store runs on HNSW; Yobibyte's reference RAG recipes default to pgvector HNSW for the same reason most of the industry has — it is the algorithm that almost never makes the wrong default choice.
How it works: the small-world hierarchy#
Navigable small-world (NSW) graphs, introduced by Malkov and colleagues in 2011-2014, are graphs whose nodes have both close neighbours and a few long-range neighbours. Greedy search — at every step, jump to the neighbour closest to the query — converges to a near-neighbour in O(log N) hops on average, provided the long-range edges are correctly distributed. Pure NSW has a well-known weakness: every search begins with short hops, so reaching a far-away neighbourhood requires many initial steps that contribute little.
HNSW fixes this with a probabilistic hierarchy inspired by the skip list. Each inserted node is assigned a maximum layer level drawn from the distribution P(level = l) = (1 - p) * p^l for some probability p, where p is set so that the expected number of nodes halves with each layer. The bottom layer (layer 0) contains every node and is densely connected; each higher layer contains roughly half the nodes of the layer below; the top layer has only a handful of nodes connected by very long edges.
Search begins at a fixed entry point in the top layer. The algorithm greedily walks toward the query, dropping into the next layer down whenever it reaches a local minimum at the current layer. By the time the search arrives at layer 0, it is already in the correct broad region of the space, and only fine-grained local search remains. The number of hops at each upper layer is roughly constant, and the number of layers grows as log N — so the overall cost is O(log N) hops, each touching M neighbours.
Construction inserts nodes one at a time. Each new node runs the same search-from-top algorithm to find efConstruction candidate neighbours at each layer it appears in, then attaches itself to the M closest at higher layers and M_max0 closest at layer 0. A heuristic pruning step (the 'neighbour selection heuristic') keeps the neighbour list diverse rather than letting it collapse onto a tight cluster — this is the detail that gives HNSW its recall advantage over earlier graph methods.
HNSW's neighbour selection heuristic — favouring a diverse neighbour set over a closest-neighbour set — is the unsung hero of the algorithm. It is what stops the graph from getting stuck in local minima during search and is the main reason HNSW dominates the ann-benchmarks leaderboard at high recall.
Key parameters and how to set them#
Three parameters dominate. M controls the maximum out-degree at higher layers — typically 16 to 48, with 16 the standard pgvector / Qdrant / Weaviate default. Higher M means denser graphs, higher recall, more memory, slower build. efConstruction is the candidate-pool size during construction — typically 100 to 500, default 200 in most implementations. Higher efConstruction means a slower build and a higher-recall final graph; below efConstruction = 100 the graph quality degrades visibly. efSearch is the only knob you can change at query time without rebuilding — it is the candidate-pool size during search, typically 50 to 400 in production. Pushing efSearch up adds latency and recall; pushing it down drops both.
Two derived parameters also matter. M_max0 is the maximum out-degree at layer 0, almost always set to 2 * M because layer 0 must absorb the highest density of edges. mL is the layer-assignment scaling factor; the canonical setting is 1 / ln(M) and almost no production system changes it. The full layer assignment for a new node is level = floor(-ln(uniform_random()) * mL).
Practical tuning recipe. Start with the implementation defaults (M = 16, efConstruction = 200) and measure Recall@10 against your labelled eval set at several efSearch values (50, 100, 200, 400). Pick the lowest efSearch that meets your recall target. If even efSearch = 400 cannot reach the target, raise M to 32 or 48 and rebuild — but rebuild cost is real, so do this once and freeze the build-time settings. Almost every production HNSW deployment in 2026 lives within (M = 16-32, efConstruction = 200, efSearch = 100-200).
| Parameter | Role | Typical range | Default | Effect of increasing |
|---|---|---|---|---|
| M | Max out-degree at higher layers | 16-48 | 16 | Higher recall, more memory, slower build |
| M_max0 | Max out-degree at layer 0 | 2 x M | 32 | Higher recall, more memory, slower build |
| efConstruction | Candidate-pool size during build | 100-500 | 200 | Higher recall, much slower build |
| efSearch (ef) | Candidate-pool size during query | 50-400 | 100 | Higher recall, linearly higher latency |
| mL | Layer-assignment scaling factor | 1 / ln(M) | 1 / ln(M) | Rarely tuned |
Memory footprint and the quantisation trade-off#
HNSW expects to live in RAM. Every vector occupies its raw size (dim * 4 bytes for float32) plus a neighbour list of pointers — roughly M * 8 bytes at each higher layer and M_max0 * 8 bytes at layer 0. For 1 million 768-dimensional float32 vectors with M = 32, the raw vectors are 1M * 768 * 4 = ~3 GB and the graph adds roughly 200-400 MB on top. For 100 million 1024-dimensional vectors, the raw vectors alone are 400 GB and the graph adds another 25-40 GB — comfortably above the RAM of a single commodity instance and into Yobitel NeoCloud bare-metal-server territory.
Quantisation is the standard mitigation. Scalar quantisation (float32 to int8) compresses 4x with essentially no recall loss when the vectors are well-distributed; this is the default in Qdrant, Weaviate and pgvector and is almost free in 2026. Product quantisation (PQ) compresses 8-32x with measurable recall loss that is recovered by a rescoring pass over full-precision vectors. Binary quantisation (1 bit per dimension) compresses 32x and is workable on high-dimensional embeddings (1024+) with a rescoring pass — Cohere Embed v4 and the latest OpenAI text-embedding-3-large were trained to remain useful at binary precision.
The graph itself is harder to compress than the vectors. Edge lists are short integers; there is no easy quantisation. This is why disk-aware variants (DiskANN, Vamana) re-engineer the algorithm rather than just compressing HNSW — the algorithm assumes random access to neighbour lists at speeds that NVMe cannot match, so disk-spill HNSW degrades sharply where DiskANN does not.
HNSW indices are typically not memory-mappable from cold storage without huge latency hits. Plan to size RAM to the index — including 1.5-2x headroom for fragmentation, build buffers and concurrent query traffic — or pick a disk-aware variant. Yobitel NeoCloud HBM-rich SKUs (H100 SXM5 / H200 / B200) are the natural fit when the index outgrows commodity-cloud RAM tiers.
Inserts, updates and deletions#
Insertion is fast and lossless. Each new vector runs the same greedy-search-and-attach procedure as construction; the operation cost is O(log N * M * efConstruction) and is embarrassingly parallel across nodes, so most implementations expose multi-threaded ingest. Throughput on commodity hardware reaches tens of thousands of vectors per second at M = 16, efConstruction = 200.
Updates are insert-after-delete in almost every implementation. The vector is removed (typically tombstoned), the new version is inserted with a fresh ID, and the application layer maintains an ID alias if it needs stable identifiers. In-place vector mutation is not supported by the algorithm — neighbour lists computed against the old vector would still point at it.
Deletions are HNSW's weakest point. The canonical paper offers no deletion algorithm; the standard practice is tombstoning — mark the node deleted, skip it during search, leave the neighbour edges in place. Tombstoned nodes still consume memory and still pull search traffic through their neighbour lists. Long-running indices with high churn slowly accumulate dead edges and lose recall. The standard remediation is a periodic full rebuild: typical cadence is weekly to monthly, depending on churn rate. pgvector 0.7+ added incremental rebuild support; Qdrant supports background segment compaction with deleted-vector cleanup; Milvus uses background segment merges. The operational pattern across all of them is the same — accept tombstoning at write time, repay it at rebuild time.
- Insert cost — O(log N * M * efConstruction); multi-threaded, tens of thousands of inserts per second on commodity CPUs.
- Update cost — delete + insert; no in-place mutation.
- Delete cost — tombstone at query time; rebuild at scheduled cadence.
- Rebuild cadence — weekly for high-churn, monthly for slow-changing corpora; budget for it.
- Tombstoned nodes consume memory until rebuild — track 'live vs tombstoned' as a first-class metric.
Where HNSW is the right index, and where it loses#
HNSW dominates one specific regime — medium-scale (millions to hundreds of millions of vectors), in-memory, high-recall (>95% Recall@10), low-latency (single-digit milliseconds). That regime covers the overwhelming majority of production RAG workloads. Yobitel MediQuery's clinical vector store, the Yobibyte reference RAG recipes, and almost every customer RAG pipeline running on Yobitel NeoCloud sit firmly inside it.
Outside that regime other algorithms win. At billion-vector scale with tight memory budgets, IVFPQ on FAISS GPU is cheaper and faster on batch query workloads (see faiss). When the index is large enough that RAM cannot hold it but NVMe is available, DiskANN's Vamana graph maintains performance where HNSW collapses. At fixed memory budgets with high recall targets, ScaNN's anisotropic quantisation often beats HNSW on the GLOVE and BigANN benchmarks. For GPU-resident batch query workloads where launch overhead is amortised, NVIDIA CAGRA (and its Milvus integration as GPU_CAGRA) delivers very high QPS.
Hybrid search adds another consideration. HNSW with filterable variants (Qdrant filterable HNSW, Weaviate filtered HNSW, Milvus bitmap-pre-filter, pgvector 0.7+ filter pushdown) handles metadata-filtered queries efficiently. The naive 'filter then HNSW' or 'HNSW then filter' patterns both degrade sharply when filters are selective; filterable HNSW threads the filter into the graph walk and is the right choice for any multi-tenant or ACL-aware RAG workload.
| Workload shape | Best index | Why HNSW is or is not the right choice |
|---|---|---|
| Millions to ~100M vectors, in-memory, RAG | HNSW | Sweet spot — default for Yobibyte reference recipes |
| Billion-scale, memory-constrained, batch | IVFPQ on FAISS GPU | HNSW graph and raw vectors do not fit in RAM |
| Hundreds of millions, NVMe-backed, low memory | DiskANN (Vamana) | HNSW degrades sharply on disk spill |
| High recall at very tight RAM budget | ScaNN | Anisotropic PQ beats HNSW at fixed memory on GLOVE/BigANN |
| GPU-resident batch query, high QPS | CAGRA (Milvus GPU_CAGRA) | GPU-tuned graph layout beats CPU HNSW on batch throughput |
| Heavily filtered RAG (per tenant, per ACL) | Filterable HNSW (Qdrant / Weaviate / pgvector 0.7+) | Naive HNSW-then-filter loses recall; filterable variants thread the filter |
| Tiny corpus (< 100K vectors) | Flat (brute force) | HNSW overhead does not pay back on small corpora |
Practical implementation notes#
Pick your engine first, then tune. pgvector is the default if you already run Postgres — single operational story, SQL joins onto metadata, transactional ACID, and HNSW since 0.5. Qdrant is the default for a dedicated engine where filtered queries matter. Weaviate for the opinionated module-driven story. Milvus for billion-scale where Kubernetes ops are unobjectionable. FAISS for embedded in-process retrieval. Pinecone for fully managed when sovereignty is not a constraint.
Start with the implementation defaults — M = 16, efConstruction = 200, efSearch = 100, inner-product metric on normalised vectors. Build a labelled eval set of at least 100 (query, relevant-chunk) pairs. Measure Recall@10 and p95 query latency at several efSearch values. The optimal operating point is the lowest efSearch that meets your recall target — pushing higher only buys diminishing returns at linear latency cost.
Use the right metric. Almost every modern embedding model — BGE, E5, Nomic, OpenAI text-embedding-3, Cohere Embed v4, Voyage 3 — produces unit-norm vectors and is trained on inner-product objectives. Build the HNSW index with inner-product metric (cosine on normalised vectors is mathematically identical and faster). L2 distance is rarely the right choice in 2026 unless your embedding model was explicitly trained on it.
Plan ingestion. Bulk index loads at construction time are 5-10x faster than incremental inserts because the graph can be built in parallel without locking. For first ingest of a multi-million-chunk corpus, use a bulk-build API (pgvector's CREATE INDEX, Qdrant's optimisers, Weaviate's bulk import) rather than upserting one at a time. Plan for the index build to take minutes to hours at scale and design ingest pipelines to surface progress.
Plan for rebuild. Tombstoned-node accumulation is non-negotiable on any churning corpus. Pick a weekly or monthly rebuild cadence at the start; treat it as a first-class operational task. Most production HNSW deployments rebuild their indices on a fixed cadence that is well below the point where tombstones materially affect recall.
Filter-aware HNSW is the right default for any multi-tenant or ACL-driven workload. Qdrant filterable HNSW, Weaviate filtered HNSW, Milvus pre-filtered HNSW and pgvector 0.7+ filter pushdown are all production-grade. Do not retrofit unfiltered HNSW with post-filtering — recall collapses on selective filters.
- Default starting point — pgvector HNSW with M = 16, efConstruction = 200, efSearch = 100, inner-product metric, scalar quantisation on if memory matters.
- Recall target — measure on your labelled eval set; do not trust public benchmarks to transfer.
- Bulk-build for first ingest; incremental insert thereafter.
- Tombstone metric — track 'live vs tombstoned vectors' and schedule rebuilds.
- Filterable HNSW variants are mandatory for multi-tenant RAG; do not retrofit post-filtering.
- Yobibyte reference RAG recipes ship with pgvector HNSW as the default; Yobitel NeoCloud HBM-rich SKUs cover the scale tier where the index outgrows commodity RAM.
Where HNSW fits in the Yobitel stack#
HNSW is the vector index underneath Yobitel MediQuery's clinical retrieval layer — PubMed abstracts, NICE guidance, internal hospital protocols and drug-interaction data all live in an HNSW index on top of the embedding store, with filterable HNSW used to enforce per-hospital tenant isolation. The choice is unremarkable on purpose — the algorithm has won the production ANN argument and MediQuery's job is to be a clinical product, not an ANN-algorithm research project.
Yobibyte's reference RAG recipes default to pgvector HNSW for a single-database story with full SQL joins onto metadata. Customers who outgrow pgvector — or whose workloads need dedicated filtered-search performance — are guided toward Qdrant or Weaviate on the same platform, both also HNSW-based. The platform exposes managed embedding endpoints, managed cross-encoder rerankers and managed vector stores as composable primitives; the customer owns the orchestration, the platform owns the index ops, the index itself is HNSW.
Yobitel NeoCloud's HBM-rich SKU tier (H100 SXM5, H200, B200) is the natural fit when the index outgrows commodity-cloud RAM tiers. Yobitel InferenceBench publishes HNSW-relevant numbers alongside raw inference figures — build throughput at fixed M / efConstruction, query latency at fixed efSearch, recall under quantisation — so customers can size their HNSW deployment against measured Yobitel-class hardware rather than vendor-datasheet numbers.
References
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs · arXiv (Malkov & Yashunin, 2016)
- hnswlib reference implementation · GitHub
- ann-benchmarks: reproducible ANN comparison suite · GitHub
- pgvector HNSW Index Documentation · pgvector
- Qdrant HNSW Index Reference · Qdrant Docs
- Lucene 9.0 HNSW Vector Search · Apache Lucene