Milvus

TL;DR

Milvus is an open-source vector database under Apache 2.0, developed primarily by Zilliz. It is hosted by the LF AI & Data Foundation (a Linux Foundation umbrella), where it became a graduated project in 2021.
The cloud-native architecture separates compute from storage: object storage (S3, GCS, MinIO) provides persistence, and a message queue (Pulsar or Kafka) provides write-path durability.
Pluggable index layer, chosen per collection: HNSW, IVF_FLAT, IVF_PQ, IVF_SQ8, DiskANN, SCANN, and GPU_CAGRA. Designed to scale to tens of billions of vectors.
Trade-off: significantly more operationally complex than Qdrant or Weaviate; rewards teams that need its scale, frustrates teams that do not.

Disaggregated Architecture

Milvus splits responsibilities across several stateless service tiers: an access proxy, query nodes, data nodes, index nodes, and coordinator services. The tiers communicate through a message queue (Pulsar by default, Kafka supported). Persistent data lives in object storage and metadata in etcd, so each tier scales independently. The advantage is cloud-native elasticity; the cost is many moving parts to operate.
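The write path described above can be pictured with a toy model. The sketch below is not Milvus code: `ToyCluster`, `proxy_insert`, `data_node_poll`, and `flush_threshold` are hypothetical names, a Python list stands in for the message queue, and a dict stands in for object storage. It only illustrates the idea that a write is durable once it is in the log, while a separate tier later seals batches into segments in object storage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model of a log-backed write path (illustrative, not Milvus code)."""

    flush_threshold: int = 3
    log: list = field(default_factory=list)           # stands in for Pulsar/Kafka
    object_store: dict = field(default_factory=dict)  # stands in for S3/MinIO
    _flushed_upto: int = 0                            # data node's consumer offset

    def proxy_insert(self, vector_id, vector):
        # The proxy only appends to the log and returns the log offset.
        self.log.append((vector_id, vector))
        return len(self.log) - 1

    def data_node_poll(self):
        # The data node consumes unflushed entries and seals full batches
        # into named segments in the object store.
        sealed = []
        while len(self.log) - self._flushed_upto >= self.flush_threshold:
            start = self._flushed_upto
            end = start + self.flush_threshold
            name = f"segment-{start}-{end - 1}"
            self.object_store[name] = self.log[start:end]
            self._flushed_upto = end
            sealed.append(name)
        return sealed


cluster = ToyCluster(flush_threshold=2)
for i in range(5):
    cluster.proxy_insert(i, [float(i)])
print(cluster.data_node_poll())  # two sealed segments; one entry stays in the log
```

Because the proxy and the data node share only the log and the object store, either side could be scaled or restarted without the other holding state, which is the property the tiered design is after.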

For smaller deployments, Milvus Lite (an embedded variant that runs inside the application process as a Python library) and Milvus Standalone (everything in one container) exist. Production at scale generally means Milvus Distributed on Kubernetes.
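From the client's side, the deployment modes differ mainly in the URI handed to the client. The sketch below assumes the `pymilvus` package and its `MilvusClient` class; `milvus_uri`, `connect`, and the distributed hostname are hypothetical, and the file-path and port conventions should be checked against the pymilvus version in use.

```python
def milvus_uri(mode: str) -> str:
    """Return a connection URI for a deployment mode (illustrative values)."""
    uris = {
        # Milvus Lite: a local file path, data stored in a single .db file.
        "lite": "./milvus_demo.db",
        # Standalone: one container, default gRPC port 19530.
        "standalone": "http://localhost:19530",
        # Distributed: the proxy service of the cluster (hypothetical hostname).
        "distributed": "http://milvus-proxy.example.internal:19530",
    }
    try:
        return uris[mode]
    except KeyError:
        raise ValueError(f"unknown deployment mode: {mode!r}") from None


def connect(mode: str):
    # Imported lazily so the URI helper works without pymilvus installed.
    from pymilvus import MilvusClient

    return MilvusClient(uri=milvus_uri(mode))
```

Keeping the mode behind one helper means application code written against Milvus Lite can later point at a Standalone or Distributed endpoint without other changes, assuming the features used are available in each mode.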

Index Backends

Index	Profile	Best for
HNSW	In-memory graph, high recall	Sub-100M vectors, latency-critical
IVF_FLAT	Exact distances within probed IVF cells	Medium corpus, simple ops
IVF_PQ / IVF_SQ8	Quantized, memory-efficient	Hundreds of millions of vectors
DiskANN	NVMe-backed, low memory	Billion-scale on commodity disks
GPU_CAGRA	NVIDIA GPU graph index	Highest QPS, GPU-resident workloads
SCANN	Anisotropic PQ	Memory-constrained high-recall
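The table's guidance can be roughed out in code. The sketch below is a simplification under stated assumptions: `raw_vector_gb` and `suggest_index` are hypothetical helpers, the 100M and 1B cutoffs are read loosely from the "Best for" column, and the memory figure counts only vector payload (float32 at 4 bytes per component, or 1 byte per component for 8-bit scalar quantization), ignoring graph links, codebooks, and other overhead.

```python
def raw_vector_gb(num_vectors: int, dim: int, bytes_per_component: int = 4) -> float:
    """Vector payload size in decimal gigabytes (index overhead not included)."""
    return num_vectors * dim * bytes_per_component / 1e9


def suggest_index(num_vectors: int, *, gpu_batch_workload: bool = False) -> str:
    """Rough first guess at an index type, following the table above."""
    if gpu_batch_workload:
        return "GPU_CAGRA"      # highest QPS when the index fits in GPU memory
    if num_vectors < 100_000_000:
        return "HNSW"           # sub-100M, latency-critical
    if num_vectors < 1_000_000_000:
        return "IVF_SQ8"        # hundreds of millions, memory-efficient
    return "DiskANN"            # billion-scale, NVMe-backed


# 100M vectors of dimension 768:
print(raw_vector_gb(100_000_000, 768))     # float32 payload
print(raw_vector_gb(100_000_000, 768, 1))  # 8-bit scalar-quantized payload
print(suggest_index(300_000_000))
```

At 100M vectors of dimension 768, the float32 payload alone is about 307 GB against about 77 GB at one byte per component, which is why the quantized IVF variants take over as the corpus grows.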

GPU Indexing with CAGRA

CAGRA (CUDA ANNS GRAph-based) was introduced by NVIDIA in 2023 as part of RAPIDS RAFT and integrated into Milvus shortly after. It is a graph-based ANN algorithm designed specifically for GPU memory access patterns: instead of HNSW's variable-degree, multi-layer graph, CAGRA uses a single-layer, fixed-degree graph that maps cleanly to coalesced GPU reads. On H100-class hardware, GPU_CAGRA delivers very high QPS at 95%+ recall when the index fits in GPU memory.
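The layout consequence of a fixed-degree graph can be shown in a few lines. The sketch below is not CAGRA's construction or search and is not taken from any NVIDIA library; `FixedDegreeGraph` is a hypothetical class that only demonstrates why a fixed degree allows neighbor lists to live in one flat buffer at predictable offsets.

```python
from array import array


class FixedDegreeGraph:
    """Adjacency stored as one flat buffer: node i's neighbors start at i * degree."""

    def __init__(self, num_nodes: int, degree: int):
        self.degree = degree
        # One contiguous block of unsigned ints, num_nodes * degree long.
        self.neighbors = array("I", [0]) * (num_nodes * degree)

    def set_neighbors(self, node: int, ids: list) -> None:
        if len(ids) != self.degree:
            raise ValueError(f"expected exactly {self.degree} neighbors")
        start = node * self.degree
        self.neighbors[start:start + self.degree] = array("I", ids)

    def neighbors_of(self, node: int) -> array:
        # No per-node length lookup or pointer chase: the offset is arithmetic.
        start = node * self.degree
        return self.neighbors[start:start + self.degree]


graph = FixedDegreeGraph(num_nodes=4, degree=2)
graph.set_neighbors(2, [0, 3])
print(list(graph.neighbors_of(2)))
```

A variable-degree graph needs either a list per node or an offsets table, so neighboring threads would read from scattered addresses; with a fixed degree every node's row has the same stride, which is the property that suits coalesced reads.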

GPU indexes deliver dramatic speedups on batched query workloads but are rarely the right answer for low-latency interactive RAG. The fixed per-request cost of launching GPU work means a CPU HNSW index often wins on P95 latency when queries arrive one at a time.
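A simple cost model makes the batch-versus-single-query trade-off concrete. Everything numeric below is a made-up placeholder rather than a measurement, and `batch_latency_ms`, `amortized_ms`, `GPU`, and `CPU` are hypothetical names; the model assumes latency is a fixed per-request overhead plus a per-query cost, which is a deliberate simplification.

```python
def batch_latency_ms(batch_size: int, fixed_overhead_ms: float, per_query_ms: float) -> float:
    """Time to answer one batch: fixed overhead plus per-query work."""
    return fixed_overhead_ms + batch_size * per_query_ms


def amortized_ms(batch_size: int, fixed_overhead_ms: float, per_query_ms: float) -> float:
    """Average cost per query once the overhead is spread over the batch."""
    return batch_latency_ms(batch_size, fixed_overhead_ms, per_query_ms) / batch_size


# Placeholder parameters, NOT benchmarks: large fixed cost and tiny marginal
# cost for the GPU, no fixed cost and larger marginal cost for the CPU.
GPU = {"fixed_overhead_ms": 1.0, "per_query_ms": 0.01}
CPU = {"fixed_overhead_ms": 0.0, "per_query_ms": 0.5}

for batch in (1, 1000):
    print(batch, amortized_ms(batch, **GPU), amortized_ms(batch, **CPU))
```

Under these assumed parameters the CPU answers a single query sooner, while at a batch of 1000 the GPU's fixed cost is spread thin and its per-query cost is far lower. Real crossover points depend on hardware, index parameters, and recall targets.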

Operational Reality

Milvus's billion-scale story is real: production deployments at e-commerce, recommendation, and security companies push into the tens of billions of vectors. The cost is operational complexity. A full Milvus Distributed cluster requires Kubernetes, etcd, Pulsar or Kafka, MinIO or S3, and several service tiers. For teams that already operate Kubernetes at scale this is unremarkable; for smaller teams it is heavy.

Zilliz Cloud is the managed Milvus offering and removes this burden, at the cost of vendor lock-in similar to Pinecone.

When to Pick Milvus

Pick Milvus when you genuinely need billion-scale capacity, when you want pluggable index backends including GPU options, or when vendor-neutral foundation governance (LF AI & Data, under the Linux Foundation) matters to your procurement. Pick something lighter (Qdrant, Weaviate, pgvector) when your corpus fits comfortably on a single machine.

References

Milvus Documentation · Milvus
milvus-io/milvus on GitHub · GitHub
CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs · arXiv (Ootomo et al., 2023)
