TL;DR
- Milvus is an open-source vector database under Apache 2.0, developed primarily by Zilliz and hosted as a graduated project of the LF AI & Data Foundation, part of the Linux Foundation.
- Cloud-native architecture separates compute from storage: object storage (S3, GCS, MinIO) handles persistence, and a message queue (Pulsar or Kafka) provides write-path durability.
- Pluggable index layer, chosen per collection: HNSW, IVF_FLAT, IVF_PQ, IVF_SQ8, DiskANN, SCANN, GPU_CAGRA. Designed to scale to tens of billions of vectors.
- Trade-off: significantly more operationally complex than Qdrant or Weaviate; rewards teams that need its scale, frustrates teams that do not.
Disaggregated Architecture
Milvus splits responsibilities across several service tiers (proxy, query nodes, data nodes, index nodes, and coordinator services) that communicate through a message queue (Pulsar by default, Kafka supported). Durable data lives in object storage and metadata in etcd, so the worker tiers hold no durable state of their own and each tier scales independently. The advantage is cloud-native elasticity; the cost is many moving parts to operate.
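
To make the division of labour concrete, here is a toy, in-memory sketch of the flow described above. It is not Milvus code: `ToyCluster`, `proxy_insert`, `data_node_flush`, and `query_node_search` are hypothetical names, the deque and dicts merely stand in for the message queue, object storage, and etcd, and the brute-force search replaces any real index.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """In-memory stand-ins for the shared services named in the text."""
    log: deque = field(default_factory=deque)         # stands in for Pulsar/Kafka
    object_store: dict = field(default_factory=dict)  # stands in for S3/GCS/MinIO
    metadata: dict = field(default_factory=dict)      # stands in for etcd


def proxy_insert(cluster, row_id, vector):
    # The proxy keeps no data itself: it only appends the write to the shared log.
    cluster.log.append((row_id, vector))


def data_node_flush(cluster, segment_name):
    # A data node drains the log and persists the rows as one segment.
    rows = {}
    while cluster.log:
        row_id, vector = cluster.log.popleft()
        rows[row_id] = vector
    cluster.object_store[segment_name] = rows
    cluster.metadata[segment_name] = {"rows": len(rows)}
    return len(rows)


def query_node_search(cluster, query, top_k=1):
    # A query node builds its view purely from shared storage, so query nodes
    # can be added or replaced without moving durable state.
    # Simplification: this toy only sees flushed segments.
    scored = []
    for segment in cluster.object_store.values():
        for row_id, vector in segment.items():
            dist = sum((a - b) ** 2 for a, b in zip(query, vector))
            scored.append((dist, row_id))
    return [row_id for _, row_id in sorted(scored)[:top_k]]


cluster = ToyCluster()
proxy_insert(cluster, "a", [0.0, 0.0])
proxy_insert(cluster, "b", [1.0, 1.0])
data_node_flush(cluster, "segment-0")
nearest = query_node_search(cluster, [0.9, 0.9])  # nearest row id is "b"
```

The point of the sketch is only that every function takes the shared services as input and owns nothing durable itself, which is what lets each tier scale on its own.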
For smaller deployments there are Milvus Lite (an embedded variant that runs in-process as a Python library) and Milvus Standalone (all components on a single machine, typically in one container). Production at scale generally means Milvus Distributed on Kubernetes.
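
From the client's side, the three deployment modes differ mainly in what you connect to. The helper below is a minimal sketch: `milvus_uri` is a hypothetical function, and the host, port 19530, and the `./milvus_demo.db` path are assumed defaults rather than values from this article. It builds only the connection string, so it runs without Milvus installed.

```python
def milvus_uri(mode, host="localhost", port=19530, lite_path="./milvus_demo.db"):
    """Return a connection URI for the given deployment mode (illustrative only)."""
    if mode == "lite":
        # Milvus Lite runs in-process and is addressed by a local file path.
        return lite_path
    if mode in ("standalone", "distributed"):
        # Standalone and Distributed are both reached over the network;
        # for Distributed the host would be the proxy's service address.
        return f"http://{host}:{port}"
    raise ValueError(f"unknown deployment mode: {mode!r}")


# With pymilvus installed, the URI would be handed to the client, e.g.:
#   from pymilvus import MilvusClient
#   client = MilvusClient(uri=milvus_uri("lite"))
```

Because only the URI changes, application code written against Milvus Lite can usually be pointed at a Standalone or Distributed cluster later without restructuring.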
Index Backends
| Index | Profile | Best for |
|---|---|---|
| HNSW | In-memory graph, high recall | Sub-100M vectors, latency-critical |
| IVF_FLAT | Uncompressed vectors, exact distances within probed IVF cells | Medium corpus, simple ops |
| IVF_PQ / IVF_SQ8 | Quantized, memory-efficient | Hundreds of millions of vectors |
| DiskANN | NVMe-backed, low memory | Billion-scale on commodity disks |
| GPU_CAGRA | NVIDIA GPU | Highest QPS, GPU-resident workloads |
| SCANN | Anisotropic PQ | Memory-constrained high-recall |
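
The table can be read as a rough decision rule. The sketch below encodes it as a hypothetical helper, `suggest_index`, that returns an index-parameter dictionary in the shape Milvus clients generally accept (`index_type`, `metric_type`, `params`). The size cut-offs come from the "Best for" column; the parameter values (`M`, `efConstruction`, `nlist`, `m`, `nbits`, graph degrees) are illustrative assumptions, not tuned recommendations, and IVF_FLAT and IVF_SQ8 are left out for brevity.

```python
def suggest_index(num_vectors, gpu_resident=False, memory_constrained=False,
                  metric_type="L2"):
    """Map corpus size and constraints to an index config (heuristic sketch)."""
    if gpu_resident:
        index_type = "GPU_CAGRA"
        params = {"intermediate_graph_degree": 64, "graph_degree": 32}
    elif num_vectors >= 1_000_000_000:
        index_type = "DISKANN"  # NVMe-backed, low memory
        params = {}
    elif num_vectors >= 100_000_000:
        index_type = "IVF_PQ"   # m must evenly divide the vector dimension
        params = {"nlist": 4096, "m": 16, "nbits": 8}
    elif memory_constrained:
        index_type = "SCANN"
        params = {"nlist": 1024}
    else:
        index_type = "HNSW"     # sub-100M, latency-critical default
        params = {"M": 16, "efConstruction": 200}
    return {"index_type": index_type, "metric_type": metric_type, "params": params}


index_params = suggest_index(5_000_000)
# index_params["index_type"] is "HNSW" for a 5M-vector corpus with no constraints
```

With pymilvus, a dictionary like this would typically be passed as `index_params` when creating an index on the vector field; treat the helper as a starting point for benchmarking rather than a rule.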
GPU Indexing with CAGRA
CAGRA (CUDA ANN GRAph-based) was introduced by NVIDIA in 2023 as part of RAPIDS RAFT and integrated into Milvus shortly after. It is a graph-based ANN algorithm designed for GPU memory access patterns: instead of HNSW's variable-degree multi-layer graph, CAGRA uses a single-layer, fixed-degree graph that maps cleanly to coalesced GPU reads. On H100-class hardware, GPU_CAGRA delivers very high QPS at 95%+ recall when the index fits in GPU memory.
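
Whether "the index fits in GPU memory" can be estimated on the back of an envelope, and the fixed-degree graph makes the arithmetic simple: every vector stores the same number of neighbour ids. The sketch below is a rough lower bound under stated assumptions: float32 vectors, 4-byte neighbour ids, a graph degree of 64, an 80 GiB device, and 20% headroom for working buffers. The function names are hypothetical and none of the figures are measured.

```python
def cagra_footprint_gib(num_vectors, dim, graph_degree=64, bytes_per_value=4):
    """Approximate GPU memory for raw vectors plus a fixed-degree graph."""
    vector_bytes = num_vectors * dim * bytes_per_value
    graph_bytes = num_vectors * graph_degree * 4  # one 4-byte id per edge (assumed)
    return (vector_bytes + graph_bytes) / 2**30


def fits_on_gpu(num_vectors, dim, gpu_mem_gib=80.0, headroom=0.8, graph_degree=64):
    """True if the estimate stays within the usable share of device memory."""
    needed = cagra_footprint_gib(num_vectors, dim, graph_degree)
    return needed <= gpu_mem_gib * headroom


small = fits_on_gpu(10_000_000, 768)    # about 31 GiB needed, so it fits
large = fits_on_gpu(100_000_000, 768)   # about 310 GiB needed, so it does not
```

Under these assumptions a single 80 GiB GPU comfortably holds tens of millions of 768-dimensional vectors but not hundreds of millions, which is where sharding or a CPU or disk index comes back into the picture.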
GPU indexes shine on batch query workloads but are rarely the right answer for low-latency interactive RAG: the fixed overhead of launching GPU work for a single query means CPU HNSW often wins on P95 latency.
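
The batch-versus-single-query trade-off follows from a simple cost model: each batch pays a fixed overhead once, plus a marginal cost per query. The sketch below uses that model with invented numbers chosen only to show the shape of the curve. They are not benchmarks of Milvus, CAGRA, or HNSW, and the function names are hypothetical.

```python
def batch_latency_ms(batch_size, fixed_overhead_ms, marginal_ms_per_query):
    """Time until a whole batch completes; every query in it waits this long."""
    return fixed_overhead_ms + batch_size * marginal_ms_per_query


def throughput_qps(batch_size, fixed_overhead_ms, marginal_ms_per_query):
    """Queries completed per second at a given batch size."""
    total_ms = batch_latency_ms(batch_size, fixed_overhead_ms, marginal_ms_per_query)
    return batch_size / total_ms * 1000.0


# Hypothetical cost profiles (illustrative assumptions, not measurements).
GPU = {"fixed_overhead_ms": 1.0, "marginal_ms_per_query": 0.01}
CPU = {"fixed_overhead_ms": 0.0, "marginal_ms_per_query": 0.5}

single_gpu = batch_latency_ms(1, **GPU)   # overhead dominates a lone query
single_cpu = batch_latency_ms(1, **CPU)
bulk_gpu = throughput_qps(1000, **GPU)    # overhead is amortized over the batch
bulk_cpu = throughput_qps(1000, **CPU)
```

With these made-up profiles the CPU answers a single query sooner, while the GPU completes far more queries per second once a thousand are batched together. Real crossover points depend on hardware, index parameters, and vector dimension, so they need to be measured.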
Operational Reality
Milvus's billion-scale story is real — production deployments at e-commerce, recommendation, and security companies push into the tens of billions of vectors. The cost is operational complexity: a full Milvus Distributed cluster requires Kubernetes, etcd, Pulsar or Kafka, MinIO or S3, and several service tiers. For teams that already operate Kubernetes at scale this is unremarkable; for smaller teams it is heavy.
Zilliz Cloud is the managed Milvus offering and removes this burden, at the cost of vendor lock-in similar to Pinecone.
When to Pick Milvus
Pick Milvus when you genuinely need billion-scale, when you want pluggable index backends including GPU options, or when vendor-neutral foundation governance (Milvus is an LF AI & Data Foundation project under the Linux Foundation) matters to your procurement. Pick something lighter (Qdrant, Weaviate, pgvector) when your corpus fits comfortably on a single machine.
References
- Milvus Documentation · Milvus
- milvus-io/milvus on GitHub · GitHub
- CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs · arXiv (Ootomo et al., 2023)