llama.cpp

TL;DR

Open-source C/C++ inference engine for LLMs, started by Georgi Gerganov in March 2023.
Native author of the GGML tensor library and its successor GGUF model format, which together became the de facto standard for quantised local LLMs.
Runs efficiently on Apple Silicon (Metal), x86 with AVX/AVX-512, ARM, NVIDIA CUDA, AMD ROCm, Vulkan and SYCL — the broadest hardware reach of any LLM runtime.
Powers Ollama, LM Studio, Jan, GPT4All and most other consumer local-LLM applications.

Overview#

llama.cpp began as a weekend port of Meta's original LLaMA model to plain C/C++, with the goal of running 7B-class models on a MacBook. Within a few weeks the project had attracted a large contributor base and grown into a general-purpose LLM runtime supporting most open-weight architectures.

Two design choices defined its trajectory: it depends on no Python or framework runtime, and it uses heavy weight quantisation so that consumer-grade RAM is enough. The combination made local LLMs practical for the first time outside of GPU-rich workstations.

GGML and GGUF#

GGML is the tensor library underneath llama.cpp — a small set of operators with quantised data types and CPU-first execution. GGUF is the model file format that succeeded the earlier GGML and GGJT formats in 2023; it is a single binary file containing weights, tokeniser, architecture description and arbitrary metadata. Most quantised open-weight models published on Hugging Face today ship a GGUF variant.

Quantisation formats supported include Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q4_0, Q3_K_S and even Q2_K, providing a granular trade-off between size and quality. The K-quant family (denoted by `_K`) uses block-wise scales and is typically preferred over the legacy Q4_0 family on the same bit budget.

Hardware Backends#

Apple Silicon — Metal Performance Shaders, native and highly tuned.
x86 CPU — AVX2, AVX-512, AMX on Sapphire Rapids and Granite Rapids.
ARM CPU — NEON, SVE, dotprod and i8mm extensions.
NVIDIA — CUDA backend with cuBLAS / cuBLASLt and custom kernels.
AMD — HIP/ROCm backend.
Vulkan — portable GPU backend that runs on AMD, NVIDIA, Intel, ARM Mali.
SYCL — Intel Arc and Data Center GPU Max.

Server Mode#

Beyond the original CLI, llama.cpp ships an HTTP server (`llama-server`) that exposes both a native API and an OpenAI-compatible endpoint, with continuous batching, slot-based scheduling, prompt caching and parallel decoding. Single-host throughput on a well-quantised model can rival much heavier runtimes, particularly on Apple Silicon where Metal acceleration is mature.

bash

# Run a quantised Llama 3.1 8B on Apple Silicon with Metal
./llama-server \
    -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --port 8080 \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --parallel 4 \
    --cont-batching

Ecosystem#

llama.cpp is the runtime inside Ollama, LM Studio, Jan, GPT4All, KoboldCPP and many smaller projects. The `llama-cpp-python` bindings expose it to Python code; the C API is wrapped by Rust, Go, Node.js and Swift bindings. Embedded uses include on-device assistants on iOS, Android, Raspberry Pi and edge gateways.

When to Use#

Use llama.cpp for any on-device or edge deployment, for any setup that needs to run without an NVIDIA GPU, and as the format target when distributing quantised open-weight models. For cloud-scale GPU serving the picture flips — vLLM, TensorRT-LLM and SGLang are the right tools — but for the laptop, desktop, phone or single-board computer, llama.cpp remains the dominant runtime.

References

llama.cpp on GitHub · GitHub
GGUF Format Specification · GitHub (ggml)
Ollama · Ollama

Overview#

GGML and GGUF#

Hardware Backends#

Apple Silicon — Metal Performance Shaders, native and highly tuned.

x86 CPU — AVX2, AVX-512, AMX on Sapphire Rapids and Granite Rapids.

ARM CPU — NEON, SVE, dotprod and i8mm extensions.

NVIDIA — CUDA backend with cuBLAS / cuBLASLt and custom kernels.

AMD — HIP/ROCm backend.

Vulkan — portable GPU backend that runs on AMD, NVIDIA, Intel, ARM Mali.

SYCL — Intel Arc and Data Center GPU Max.

Server Mode#

bash

# Run a quantised Llama 3.1 8B on Apple Silicon with Metal
./llama-server \
    -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --port 8080 \
    --n-gpu-layers 99 \
    --ctx-size 8192 \
    --parallel 4 \
    --cont-batching

Ecosystem#

When to Use#

llama.cpp

Overview#

GGML and GGUF#

Hardware Backends#

Server Mode#

Ecosystem#

When to Use#

References

Browse all entries

Deploy on Yobitel

llama.cpp

Overview#

GGML and GGUF#

Hardware Backends#

Server Mode#

Ecosystem#

When to Use#

References

Browse all entries

Deploy on Yobitel