TL;DR
- Open-source C/C++ inference engine for LLMs, started by Georgi Gerganov in March 2023.
- Native author of the GGML tensor library and its successor GGUF model format, which together became the de facto standard for quantised local LLMs.
- Runs efficiently on Apple Silicon (Metal), x86 with AVX/AVX-512, ARM, NVIDIA CUDA, AMD ROCm, Vulkan and SYCL — the broadest hardware reach of any LLM runtime.
- Powers Ollama, LM Studio, Jan, GPT4All and most other consumer local-LLM applications.
Overview#
llama.cpp began as a weekend port of Meta's original LLaMA model to plain C/C++, with the goal of running 7B-class models on a MacBook. Within a few weeks the project had attracted a large contributor base and grown into a general-purpose LLM runtime supporting most open-weight architectures.
Two design choices defined its trajectory: it depends on no Python or framework runtime, and it uses heavy weight quantisation so that consumer-grade RAM is enough. The combination made local LLMs practical for the first time outside of GPU-rich workstations.
GGML and GGUF#
GGML is the tensor library underneath llama.cpp — a small set of operators with quantised data types and CPU-first execution. GGUF is the model file format that succeeded the earlier GGML and GGJT formats in 2023; it is a single binary file containing weights, tokeniser, architecture description and arbitrary metadata. Most quantised open-weight models published on Hugging Face today ship a GGUF variant.
Quantisation formats supported include Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q4_0, Q3_K_S and even Q2_K, providing a granular trade-off between size and quality. The K-quant family (denoted by `_K`) uses block-wise scales and is typically preferred over the legacy Q4_0 family on the same bit budget.
Hardware Backends#
- Apple Silicon — Metal Performance Shaders, native and highly tuned.
- x86 CPU — AVX2, AVX-512, AMX on Sapphire Rapids and Granite Rapids.
- ARM CPU — NEON, SVE, dotprod and i8mm extensions.
- NVIDIA — CUDA backend with cuBLAS / cuBLASLt and custom kernels.
- AMD — HIP/ROCm backend.
- Vulkan — portable GPU backend that runs on AMD, NVIDIA, Intel, ARM Mali.
- SYCL — Intel Arc and Data Center GPU Max.
Server Mode#
Beyond the original CLI, llama.cpp ships an HTTP server (`llama-server`) that exposes both a native API and an OpenAI-compatible endpoint, with continuous batching, slot-based scheduling, prompt caching and parallel decoding. Single-host throughput on a well-quantised model can rival much heavier runtimes, particularly on Apple Silicon where Metal acceleration is mature.
# Run a quantised Llama 3.1 8B on Apple Silicon with Metal
./llama-server \
-m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 8192 \
--parallel 4 \
--cont-batchingEcosystem#
llama.cpp is the runtime inside Ollama, LM Studio, Jan, GPT4All, KoboldCPP and many smaller projects. The `llama-cpp-python` bindings expose it to Python code; the C API is wrapped by Rust, Go, Node.js and Swift bindings. Embedded uses include on-device assistants on iOS, Android, Raspberry Pi and edge gateways.
When to Use#
Use llama.cpp for any on-device or edge deployment, for any setup that needs to run without an NVIDIA GPU, and as the format target when distributing quantised open-weight models. For cloud-scale GPU serving the picture flips — vLLM, TensorRT-LLM and SGLang are the right tools — but for the laptop, desktop, phone or single-board computer, llama.cpp remains the dominant runtime.
References
- llama.cpp on GitHub · GitHub
- GGUF Format Specification · GitHub (ggml)
- Ollama · Ollama