TL;DR
- FAISS (Facebook AI Similarity Search) is an open-source C++ library with Python bindings from Meta AI Research, first released in 2017 and now distributed under the MIT licence. It implements almost every published approximate-nearest-neighbour index (Flat, IVFFlat, IVFPQ, HNSW, NSG, RaBitQ, PQ, OPQ, ScaNN-style) behind a single unified API.
- The Index Factory is the distinctive ergonomics: a small DSL where short strings ('IVF4096,PQ64', 'OPQ64_256,IVF65536,PQ64', 'HNSW32') compose pre-processors, coarse quantisers, encoders and index structures into a working composite index in one line.
- First-class GPU support. The 2017 paper 'Billion-scale similarity search with GPUs' (Johnson, Douze, Jegou, arXiv:1702.08734) demonstrated billion-vector ANN on commodity GPU hardware for the first time and remains the basis for the GpuIndexFlat, GpuIndexIVFFlat and GpuIndexIVFPQ classes. Multi-GPU sharding is built in.
- FAISS is a library, not a database. No network API, no authentication, no replication, no metadata filtering beyond ID selectors, no schema. Persistence is single-file write_index / read_index of the whole index. Milvus, OpenSearch's k-NN plugin and a long tail of in-house systems wrap FAISS as their ANN kernel.
- Default tool for batch billion-scale embedding-index construction; the dominant choice for offline scoring, batch retrieval pipelines, and research benchmarking. Yobitel NeoCloud's billion-scale batch-scoring pipelines use FAISS GPU; Yobibyte's data-prep tooling wraps FAISS for one-shot embedding-index construction before the index is handed off to a serving layer.
Overview
FAISS is the C++ library Meta AI Research released in 2017 to make billion-scale similarity search tractable on commodity hardware. It is the most complete catalogue of approximate-nearest-neighbour index types in any single library — Flat (brute force), IVFFlat (inverted file with full vectors), IVFPQ (inverted file with product-quantised codes), IVFSQ (scalar quantised), HNSW, NSG, RaBitQ (binary), PQ, OPQ — all behind a unified API. The same library hosts the original CUDA implementations that made GPU ANN search a real production option, and it remains the canonical reference for almost every published ANN algorithm.
FAISS is a library, not a database. There is no built-in network protocol, no authentication, no schema, no replication, no metadata filtering beyond ID selectors. You construct an index in memory, train it (for index types that require training), add vectors, optionally persist to a single file, and query, all in-process. This is the source of FAISS's power for engineering work and the reason production deployments wrap it (Milvus, OpenSearch's k-NN plugin, a long tail of in-house systems) rather than ship it directly to end users.
In 2026 FAISS occupies two stable roles. First, it is the batch-billion-scale workhorse — when a single corpus of hundreds of millions to billions of embeddings needs to be indexed once and queried offline (batch scoring, similarity joins, candidate generation for recommendation), FAISS GPU is the cheapest and fastest tool. Second, it is the embedded retrieval engine for any pipeline whose performance-critical loop fits inside a single process — research codebases, large batch pipelines, ML training pipelines that need k-NN for in-batch negative mining or for retrieval-augmented training. Yobitel NeoCloud's batch-scoring pipelines lean on FAISS GPU for billion-scale similarity work; Yobibyte's data-prep tooling wraps FAISS for the one-shot 'build the index, hand it to the serving layer' phase before the index is loaded into a dedicated vector database for query traffic.
This entry helps you decide whether FAISS is the right kernel for your retrieval problem, which index type to compose with the index factory, how to size GPU vs CPU FAISS, and where the library's library-not-database posture forces decisions in the wrapping system. It is aimed at engineers building either embedded retrieval pipelines or the FAISS-backed core of a larger system.
Quick start: build, train, query, persist
FAISS's Python API is a thin idiomatic wrapper over the C++ library — most production code uses it directly. The pattern is always the same: construct an index, train it on a representative vector sample if the index type needs training, add the corpus, query, persist. The Index Factory string-builder removes most of the verbosity from index construction.
```python
# pip install faiss-cpu (or faiss-gpu-cuXX for GPU)
import faiss
import numpy as np

d = 768          # embedding dimensions
N = 1_000_000    # corpus size
rng = np.random.default_rng(0)
xb = rng.random((N, d), dtype=np.float32)   # random stand-in for real embeddings
faiss.normalize_L2(xb)                      # normalise in place for inner-product search

# Index Factory composes a billion-scale-friendly index in one line:
# IVF with 4096 Voronoi cells + Product Quantisation to 64-byte codes.
index = faiss.index_factory(d, "IVF4096,PQ64", faiss.METRIC_INNER_PRODUCT)

# IVFPQ requires training: it learns the coarse quantiser centroids
# (k-means on a sample) and the PQ codebooks (k-means per sub-quantiser).
index.train(xb[:200_000])   # 200K vectors is usually plenty for 4096 cells
index.add(xb)               # full corpus

# Query: nprobe controls the recall/latency trade-off at query time.
index.nprobe = 16
xq = rng.random((1, d), dtype=np.float32)
faiss.normalize_L2(xq)
D, I = index.search(xq, k=10)   # top-10 by inner product

# Persistence: single file, whole-index snapshot, no incremental writes.
faiss.write_index(index, "corpus.ivfpq.index")
restored = faiss.read_index("corpus.ivfpq.index")

# GPU search on a single device (needs a faiss-gpu build): same index, GPU-resident copy.
res = faiss.StandardGpuResources()
co = faiss.GpuClonerOptions()
co.useFloat16 = True   # 64-byte PQ codes may need float16 lookup tables to fit in GPU shared memory
gpu_index = faiss.index_cpu_to_gpu(res, 0, index, co)
D, I = gpu_index.search(xq, k=10)
```

Train IVFPQ on a representative sample, not on the full corpus. The useful sample size scales with nlist: FAISS's k-means warns below roughly 39 training points per centroid and, by default, subsamples to at most 256 per centroid, so vectors beyond that cap cost time without improving the centroids. For the IVF4096 index above, a few hundred thousand vectors is enough; an IVF65536 index needs a few million. Yobibyte's data-prep tooling exposes a train-on-sample convenience wrapper precisely because almost every team gets this wrong on the first try.
How it works: the FAISS architecture
FAISS organises its catalogue around three orthogonal concerns — pre-processors (linear transforms applied to vectors before indexing), coarse quantisers (partitioning the space, almost always with k-means), and encoders (lossless or lossy representations of each vector). Almost every FAISS index is a composition of (pre-processor, coarse quantiser, encoder), and the Index Factory exposes that composition as a string.
Pre-processors transform vectors before they reach the index, so that the chosen distance metric behaves sensibly and the encoder loses less information. Common pre-processors include L2 normalisation (for cosine / inner-product search), PCA dimensionality reduction (to compress without losing too much signal), and OPQ rotation (an orthogonal transform that aligns variance with the PQ subvector boundaries so PQ codes capture more information).
Coarse quantisers partition the embedding space. IVF (inverted file) is the canonical coarse quantiser — k-means produces nlist centroids, each vector is assigned to the inverted list of its nearest centroid, and at query time the top nprobe lists are searched. IMI (inverted multi-index) is a higher-resolution variant. HNSW can itself act as a coarse quantiser inside an IVF-HNSW composite index, accelerating the centroid lookup.
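The IVF mechanics described above can be sketched in plain NumPy. This is a toy illustration under stated assumptions, not FAISS's implementation: the `ToyIVF` class, the `kmeans` and `nearest_centroids` helpers, and all sizes are hypothetical names chosen for the example, and the k-means is a bare Lloyd's loop.

```python
import numpy as np

def nearest_centroids(x, centroids, n):
    """Indices of the n closest centroids (squared L2) for every row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :n]

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest_centroids(x, centroids, 1)[:, 0]
        for c in range(k):
            members = x[assign == c]
            if len(members):                 # empty clusters keep their old centroid
                centroids[c] = members.mean(axis=0)
    return centroids

class ToyIVF:
    """Inverted file: one list of vector ids per k-means centroid."""

    def __init__(self, x, nlist, seed=0):
        self.x = x
        self.centroids = kmeans(x, nlist, seed=seed)
        assign = nearest_centroids(x, self.centroids, 1)[:, 0]
        self.lists = [np.flatnonzero(assign == c) for c in range(nlist)]

    def search(self, q, k, nprobe):
        # Visit only the nprobe lists whose centroids are closest to the query.
        probe = nearest_centroids(q[None, :], self.centroids, nprobe)[0]
        ids = np.concatenate([self.lists[c] for c in probe])
        d2 = ((self.x[ids] - q) ** 2).sum(axis=1)
        order = np.argsort(d2)[:k]
        return ids[order], d2[order]

rng = np.random.default_rng(1)
x = rng.random((2000, 16)).astype("float32")
q = rng.random(16).astype("float32")
ivf = ToyIVF(x, nlist=16)
approx_ids, _ = ivf.search(q, k=5, nprobe=2)     # scans 2 of 16 lists
full_ids, _ = ivf.search(q, k=5, nprobe=16)      # scans every list: exact
exact_ids = np.argsort(((x - q) ** 2).sum(axis=1))[:5]
```

Probing every list degenerates to brute force, which is why `nprobe` is the knob that trades recall against the fraction of the corpus scanned.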
Encoders compress vectors after assignment. Flat encoders keep the full float32 vector (highest fidelity, largest footprint). SQ8 encoders quantise each dimension to 8 bits for 4x compression and near-zero recall loss. PQ encoders split each vector into M sub-vectors and replace each with the ID of its nearest centroid in a learned codebook, giving an M-byte code at the default 8 bits per sub-vector (typical compression 8-32x, and more on high-dimensional embeddings, with measurable recall loss recovered by rescoring). OPQ + PQ adds the rotation before quantising. RaBitQ encodes to one bit per dimension for 32x compression with rescoring.
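The footprint arithmetic behind those compression figures is easy to check by hand. The sketch below assumes 768-dimensional float32 embeddings and counts code bytes only (no ids, centroids or per-vector correction factors); `code_size_bytes` and the `"1bit"` label are hypothetical names for the example, not FAISS API.

```python
def code_size_bytes(d, encoder):
    """Bytes stored per vector for a few encoders (codes only, no ids or overhead)."""
    if encoder == "Flat":
        return 4 * d                 # one float32 per dimension
    if encoder == "SQ8":
        return d                     # one byte per dimension
    if encoder.startswith("PQ"):
        return int(encoder[2:])      # M sub-quantisers at 8 bits each = M bytes
    if encoder == "1bit":
        return (d + 7) // 8          # one bit per dimension, rounded up to whole bytes
    raise ValueError(f"unknown encoder: {encoder}")

d = 768
n_vectors = 1_000_000_000
flat_bytes = code_size_bytes(d, "Flat")
for name in ("Flat", "SQ8", "PQ64", "1bit"):
    size = code_size_bytes(d, name)
    total_gb = size * n_vectors / 1e9
    print(f"{name:5s} {size:5d} B/vector  {flat_bytes / size:5.1f}x  {total_gb:8.0f} GB per billion vectors")
```

At this dimension a billion Flat vectors need about 3 TB, while PQ64 codes fit in 64 GB, which is what makes single-node billion-scale indexes practical.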
The composition rules are simple: pre-processors come first, then a coarse quantiser, then an encoder. 'OPQ64_256,IVF65536,PQ64' reads left to right: apply an OPQ rotation optimised for 64 sub-vectors that also reduces the vector to 256 dimensions, partition into 65536 IVF cells, encode each vector as a 64-byte PQ code. This composability is what makes FAISS the canonical tool for benchmarking ANN designs.
The Index Factory in detail
The Index Factory accepts a string description and returns the constructed index. The grammar is comma-separated components, each component a short name with optional integer parameters. Knowing the common idioms covers most production FAISS usage.
- Pick the encoder for the memory budget — Flat for fidelity, SQ8 for cheap 4x, PQ for aggressive billion-scale compression.
- Pick the coarse quantiser for the corpus size — IVF for almost everything, IMI when nlist would need to exceed 65536, HNSW-as-coarse-quantiser when centroid search itself becomes the bottleneck.
- Pre-processors are optional but worth measuring — OPQ before PQ is essentially free recall, PCA before IVF compresses storage and speeds up training on very high-dim embeddings.
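- Train on a sample sized to nlist (FAISS's k-means uses at most 256 points per centroid by default): a few hundred thousand vectors is plenty for IVF4096, while IVF65536 needs a few million.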
- Use METRIC_INNER_PRODUCT on L2-normalised vectors unless your embedding model was trained on Euclidean distance — almost no modern embedder is.
| Factory string | What it builds | Best fit |
|---|---|---|
| 'Flat' | Brute-force exact search (L2 by default; inner product with METRIC_INNER_PRODUCT) | Small corpora (< 100K), ground-truth eval |
| 'IVF4096,Flat' | IVF coarse quantiser + full-precision vectors in each list | Up to ~10M vectors, plenty of RAM |
| 'IVF4096,SQ8' | IVF + 1-byte scalar quantisation | Up to ~50M vectors, modest RAM |
| 'IVF65536,PQ64' | IVF + 64-byte PQ codes | 100M-1B vectors, RAM-constrained |
| 'OPQ64_256,IVF65536,PQ64' | OPQ rotation with reduction to 256 dim + IVF + 64-byte PQ codes: the billion-scale workhorse | Billion-scale on FAISS GPU |
| 'HNSW32' | Pure HNSW with M = 32 | Million-scale, in-memory, high recall |
| 'HNSW32,SQ8' | HNSW with int8 quantisation per vector | Tens of millions, in-memory, RAM-constrained |
| 'IVF1024_HNSW32,Flat' | IVF with HNSW as the coarse quantiser | Fast centroid lookup at very large nlist |
| 'PCA64,IVF1024,PQ32' | PCA pre-projection to 64 dim + IVF + PQ | High-dimensional embeddings (>2048) |
GPU FAISS
FAISS's GPU implementation was introduced in the 2017 paper 'Billion-scale similarity search with GPUs' (Johnson, Douze, Jegou, arXiv:1702.08734) and remains the basis for the GpuIndexFlat, GpuIndexIVFFlat, GpuIndexIVFPQ and GpuIndexIVFScalarQuantizer classes. The library moves a CPU-built index to GPU memory with `index_cpu_to_gpu(res, device_id, cpu_index)`, after which all `search()` calls run on the device. Multi-GPU sharding is exposed via `index_cpu_to_all_gpus()`, which replicates or shards the inverted lists across all available devices and merges per-query results.
The original paper's technical contribution, and still its main practical advantage, is a fast GPU k-selection design fused with the per-list distance computation, so candidates are scored and the top-k heap is updated in a single GPU launch, plus careful memory-layout choices that maximise coalesced reads. On an 8 x A100 or 8 x H100 node, a well-tuned IVFPQ index of one billion 768-dimensional vectors achieves sub-millisecond amortised per-query time when queries are submitted in large batches, which is the threshold that made billion-scale ANN economically feasible.
GPU FAISS is the right tool when batch query throughput dominates the workload: offline scoring jobs, recommendation candidate generation, similarity joins, retrieval-augmented training where every step in the training loop fetches k-NN. It is the wrong tool when single-query interactive latency matters, because the fixed per-call GPU overhead (kernel launches, host-device copies and synchronisation, typically 50-150 microseconds in total) dominates a single short search. Yobitel NeoCloud's H100 SXM5 and H200 SKUs are the natural fit for FAISS GPU workloads; the same hardware feeds Yobibyte's data-prep tooling when an embedding index must be constructed in one batch sweep before serving begins.
FAISS GPU and HNSW are not competing for the same workload. GPU FAISS wins on batch throughput; HNSW on commodity CPU wins on single-query p95 latency. Most production stacks use both — FAISS GPU to build the index in one pass at billion scale, HNSW (or a wrapper that hosts the FAISS HNSW index) to serve queries.
Variants, alternatives and what FAISS is not
FAISS is one library in a small ecosystem of ANN engines. Knowing what it does and does not do clarifies when to reach for an alternative.
FAISS competes with hnswlib on pure HNSW workloads: hnswlib is the original-author reference implementation and is often slightly faster on single-threaded queries; FAISS's HNSW is good enough for almost every production use and benefits from the library's surrounding tooling. ScaNN (Google) often beats FAISS at fixed-memory high-recall settings on the GloVe / BigANN benchmarks because its anisotropic quantisation loss preserves the inner product more precisely than vanilla PQ. NVIDIA CAGRA (in RAPIDS RAFT / cuVS and in Milvus as GPU_CAGRA) is the modern GPU-native graph index; its memory access patterns are tuned for coalesced GPU reads, whereas FAISS's own GPU indexes are Flat and IVF variants and its HNSW runs on CPU only. DiskANN handles disk-backed billion-scale better than any FAISS index does.
FAISS is not a database. There is no network API, no auth, no schema, no replication, no metadata filtering beyond simple ID selectors, no per-tenant ACL, no incremental persistence (write_index dumps the whole index). For production query traffic with all the operational expectations of a database, wrap FAISS in Milvus or a custom service, or pick a vector database (Qdrant, Weaviate, Milvus, pgvector) where FAISS-equivalent algorithms are already wrapped with the database layer.
FAISS does support on-disk operation for the largest workloads (OnDiskInvertedLists keeps IVF inverted lists in a memory-mapped file, and read_index accepts the IO_FLAG_MMAP flag), but the default Python idioms assume an in-memory index. Production deployments that need durable storage with low-RAM serving typically prefer DiskANN or a database wrapper.
When to use FAISS directly vs wrap it
Use FAISS directly when the retrieval lives inside a single process — ML training pipelines doing in-batch negative mining or kNN-augmented training, large batch scoring jobs, similarity joins, research benchmarking, custom retrieval pipelines that do not fit a generic vector database, embedded retrieval inside an application binary. Yobibyte's data-prep tooling wraps FAISS this way for the one-shot embedding-index construction phase — the index is built in a single batch sweep on Yobitel NeoCloud GPU SKUs and then handed off to the serving layer.
Use a FAISS-backed database (Milvus, or OpenSearch with its Faiss engine) when you need FAISS-quality ANN behind a network API, with persistence, replication, sharding, multi-tenant access, metadata filtering, and the operational expectations a database brings. Milvus is the most prominent FAISS-derived production database; its catalogue of index types mirrors FAISS closely and its GPU_CAGRA option goes beyond FAISS's classic Flat and IVF GPU indexes.
Pick a non-FAISS vector database (Qdrant, Weaviate, pgvector, Pinecone) when the algorithmic catalogue matters less than the operational story — most of these engines re-implement HNSW (and increasingly IVFPQ) in their own codebases, with first-class metadata filtering and hybrid search. The wins are operational simplicity, not algorithmic novelty.
- Use FAISS directly — batch billion-scale index construction, ML training pipelines, similarity joins, research benchmarking.
- Use Milvus (a FAISS-derived database) — production query traffic at billion-scale with persistence and sharding.
- Use Qdrant / Weaviate / pgvector — production query traffic at million- to hundred-million-scale with first-class hybrid search and metadata filters.
- Use Pinecone — fully managed without sovereignty constraints.
- Yobitel guidance — FAISS GPU for batch index construction on NeoCloud; Qdrant / Weaviate / pgvector or Yobibyte's managed vector store for serving traffic.
Practical implementation notes
Pick the index type for the corpus size and memory budget first; tune second. For sub-million vector corpora, Flat (brute force) is the right default — recall is exact, latency is acceptable, and the operational complexity of training and tuning is zero. For 1M-50M, IVFFlat with 1024-4096 cells or HNSW with M = 16-32 is the standard choice depending on whether memory or build time matters more. For 50M-1B, IVFPQ with OPQ pre-processing is the canonical recipe. For multi-billion, IVFPQ on FAISS GPU is the standard.
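The sizing rules in that paragraph can be written down as a small lookup. This is a hypothetical helper, not part of FAISS: the thresholds are the rules of thumb stated above, the returned strings come from the factory table, and the OPQ recipe assumes embeddings of at least 256 dimensions.

```python
def suggest_factory_string(n_vectors, low_memory=False):
    """Return a starting-point factory string for a corpus of n_vectors.

    Hypothetical helper: the thresholds are this article's rules of thumb,
    not anything FAISS itself enforces.
    """
    if n_vectors < 1_000_000:
        return "Flat"                          # exact search, nothing to train or tune
    if n_vectors <= 50_000_000:
        # IVFFlat is lighter on memory and build time; HNSW favours recall and latency.
        return "IVF4096,Flat" if low_memory else "HNSW32"
    # 50M and beyond: OPQ + IVFPQ; at multi-billion scale the same recipe runs on FAISS GPU.
    return "OPQ64_256,IVF65536,PQ64"

for n in (200_000, 20_000_000, 800_000_000):
    print(n, suggest_factory_string(n))
```

Treat the result as the first candidate to benchmark, then tune nlist, M and nprobe against measured recall.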
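Train on a representative sample. IVFPQ training is k-means on a vector sample, and the useful sample size scales with nlist: FAISS's k-means warns below roughly 39 training points per centroid and, by default, subsamples to at most 256 per centroid. For IVF4096, a few hundred thousand vectors saturates the centroids on most distributions; IVF65536 needs a few million. Training on the full corpus rarely improves recall measurably and slows construction substantially. Sample uniformly at random across the corpus rather than taking the first N vectors, because corpora with clustered ingest order otherwise produce poorly distributed centroids.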
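A uniform random training sample takes only a few lines of NumPy. The sketch below is one way to do it; `training_sample` and its `points_per_centroid` default of 64 are assumptions made for the example (chosen to sit between the roughly 39 and 256 points per centroid that FAISS's default clustering parameters use as lower and upper bounds).

```python
import numpy as np

def training_sample(xb, nlist, points_per_centroid=64, seed=0):
    """Uniform random sample of rows, sized in proportion to the number of IVF cells."""
    rng = np.random.default_rng(seed)
    n_train = min(len(xb), nlist * points_per_centroid)
    rows = rng.choice(len(xb), size=n_train, replace=False)
    rows.sort()                  # ascending rows read a memory-mapped corpus sequentially
    return xb[rows]

# Toy corpus whose first column identifies the row, to make the sampling visible.
xb = np.arange(10_000, dtype="float32").reshape(5_000, 2)
sample = training_sample(xb, nlist=16)
print(sample.shape)
```

The result is what gets passed to `index.train(...)`; the full corpus still goes through `index.add(...)` afterwards.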
Plan nprobe at query time. The single most important query-time tuning knob in IVF indices is nprobe, the number of cells the query searches. Higher nprobe raises recall with diminishing returns, while latency grows roughly in proportion to the number of cells scanned. The standard tuning recipe is to sweep nprobe over (1, 4, 16, 64, 256), measure Recall@10 and latency, and pick the lowest nprobe that meets your recall target.
Watch the persistence story. FAISS's write_index dumps the whole index to a file in a custom binary format; reading is the inverse. There is no incremental persistence, no write-ahead log, no point-in-time snapshot. Production deployments that need durability typically rebuild the index from a vector store (Postgres, S3, MinIO) at startup or run FAISS behind a database wrapper that handles WAL externally. Yobibyte's data-prep tooling treats FAISS indices as one-shot artefacts written to object storage and reloaded by the serving layer at startup.
Pick the metric correctly. METRIC_INNER_PRODUCT on L2-normalised vectors is the right default for modern embedding models (BGE, E5, Nomic, OpenAI text-embedding-3, Cohere Embed v4, Voyage 3). METRIC_L2 is rarely right unless your embedder was trained on Euclidean distance. METRIC_Lp and METRIC_Jaccard exist for niche workloads.
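For unit-length vectors the two main metrics are tied together by the identity ||a - b||^2 = 2 - 2 (a . b), so inner product on normalised vectors is cosine similarity and ranks neighbours in the same order as L2. A quick NumPy check on random data (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 64)).astype("float32")
q = rng.standard_normal(64).astype("float32")

# L2-normalise, as faiss.normalize_L2 would do in place.
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
q /= np.linalg.norm(q)

ip = xb @ q                               # inner product = cosine on unit vectors
l2_sq = ((xb - q) ** 2).sum(axis=1)       # squared Euclidean distance

top_by_ip = np.argsort(-ip)[:10]          # largest inner product first
top_by_l2 = np.argsort(l2_sq)[:10]        # smallest distance first
print(np.abs(l2_sq - (2 - 2 * ip)).max()) # tiny float32 rounding error only
```

The practical consequence is that forgetting to normalise, not the choice between the two metrics, is the usual source of wrong rankings.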
Multi-threaded ingest and OpenMP. FAISS parallelises construction and batched search across CPU cores with OpenMP, often near-linearly. Set the standard OMP_NUM_THREADS environment variable or call `faiss.omp_set_num_threads(N)`. Default behaviour depends on the OpenMP runtime; for predictable performance, set the value explicitly.
Where FAISS fits in the Yobitel stack
Yobitel NeoCloud's billion-scale batch-scoring pipelines run on FAISS GPU. The workload shape — building or scoring against a one-billion-vector embedding index in a single batch sweep, on hardware with high HBM bandwidth (H100 SXM5, H200, B200) — is exactly what the 2017 Johnson-Douze-Jegou paper made possible, and FAISS GPU remains the cheapest and fastest tool for it in 2026. Customers running similarity-join, recommendation candidate-generation or retrieval-augmented-training workloads on NeoCloud lean on FAISS for the batch index construction phase.
Yobibyte's data-prep tooling wraps FAISS for the one-shot embedding-index construction step that precedes serving. Customers bring a corpus, embed it through the platform's managed embedding endpoints, build the FAISS index in batch on Yobitel NeoCloud GPU SKUs, and hand the resulting artefact off to a serving layer (pgvector HNSW, Qdrant, Weaviate, or Yobibyte's managed vector store) for query traffic. The split — FAISS for batch construction, dedicated vector engine for serving — matches the production reality that GPU launch overhead makes FAISS the wrong choice for single-query interactive latency but the right choice for batch throughput.
Yobitel InferenceBench publishes FAISS-relevant numbers alongside raw inference figures: IVFPQ index build throughput at billion scale, GpuIndexIVFPQ batch-query throughput across NeoCloud SKU tiers, and recall under varying nprobe. The data lets customers size FAISS GPU workloads against measured Yobitel hardware rather than the 2017 Titan X-era figures reported in the original paper.
References
- Billion-scale similarity search with GPUs · arXiv (Johnson, Douze, Jegou, 2017)
- The Faiss library · arXiv (Douze et al., 2024)
- facebookresearch/faiss · GitHub
- FAISS Wiki — Indexing methods · GitHub Wiki
- FAISS Wiki — The index factory · GitHub Wiki
- Product Quantization for Nearest Neighbor Search · Jegou, Douze, Schmid (PAMI 2011)