GGUF Format

TL;DR

Single-file binary model format introduced by Georgi Gerganov and the llama.cpp community on 22 August 2023 (PR #2398) as the successor to the earlier GGML and GGJT formats; designed for forward compatibility, extensibility through arbitrary key-value metadata, and single-file portability.
One file contains everything needed to run inference — weights, tokeniser vocabulary, architecture metadata, RoPE / attention parameters and any number of custom metadata entries — eliminating the multi-file pain that GGML and GGJT inherited from PyTorch checkpoints.
Supports a wide quantisation family: F32, F16, BF16 baselines; Q8_0 / Q6_K / Q5_K_M / Q4_K_M / Q4_0 / Q3_K_M / Q2_K legacy and K-quants; IQ4_NL, IQ3_M, IQ2_S, IQ1_S and IQ1_M I-quants (importance-aware, 2024+); per-tensor scheme selection so embeddings and output projections can use Q6_K while the bulk of weights use Q4_K_M.
Consumed by llama.cpp and every project built on it — Ollama, LM Studio, Jan, GPT4All, KoboldCPP, llamafile, llama-cpp-python, plus partial readers in Candle and Apple MLX; published GGUF checkpoints dominate the Hugging Face Hub for local-LLM use cases.
Yobitel uses GGUF as the edge-targeted distribution format: the Yobibyte Marketplace catalogues GGUF recipes for the Llama, Qwen, Mistral and Phi families specifically for edge deployment, and Yobitel Edge AI ships llama.cpp on Jetson and x86 SKUs consuming GGUF artefacts directly.

Overview

GGUF — GPT-Generated Unified Format — is the binary file format used by the llama.cpp project to package quantised LLMs for portable inference. It replaced two earlier formats: GGML (the original 2022 tensor-and-quant format) and GGJT (a 2023 interim format that added more structure). Both predecessors had grown brittle as model architectures diversified through 2023 — every new tokeniser, every new positional-encoding scheme, every new attention variant forced a breaking format change because the metadata channel was fixed and small. GGUF was designed by the llama.cpp community as a forward-compatible answer.

Three design goals shaped the format. First, single-file portability: no separate tokeniser file, no separate config file, no sharded weights — one file with everything needed to load and run inference. Second, forward compatibility: new model architectures should not break old readers, achieved through versioned metadata and unknown-key-ignoring semantics. Third, extensibility: arbitrary key-value metadata so any future runtime can attach its own annotations without spec changes.

By mid-2026 GGUF is the lingua franca of local-LLM distribution. The Hugging Face Hub hosts tens of thousands of GGUF checkpoints, with the bulk of community-distributed quantisations published in GGUF format. The format powers Ollama (which uses GGUF internally), LM Studio, Jan, GPT4All, llamafile (a single-binary Cosmopolitan-compiled llama.cpp), and the various editor and IDE LLM integrations. Hugging Face surfaces GGUF metadata natively in the model viewer and supports direct streaming load of GGUF files into local runtimes.

This entry helps you understand the GGUF format well enough to produce GGUF checkpoints from HuggingFace weights, choose the right per-tensor quantisation scheme for an edge deployment, and consume GGUF artefacts through llama.cpp or its ecosystem. It also positions GGUF in the broader quantisation landscape — when GGUF Q4_K_M is the right answer versus AWQ or GPTQ on the same hardware — and explains where the format fits in Yobitel's edge-targeted product lines on the Yobibyte Marketplace and Yobitel Edge AI.

Quick start: convert and quantise a Hugging Face model to GGUF

The shortest path to a GGUF checkpoint is the llama.cpp toolchain. Conversion from HuggingFace safetensors to F16 GGUF uses `convert_hf_to_gguf.py`; quantisation from F16 GGUF to a target scheme (Q4_K_M is the de facto default for general use) uses the `llama-quantize` binary. The snippet below produces an 8B Llama checkpoint at Q4_K_M, suitable for consumer-GPU or Apple-M-series local inference, and runs it through `llama-cli` to confirm the artefact loads.

```bash
# Build llama.cpp from source (or install a packaged build, e.g. `brew install llama.cpp`)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
# GGML_CUDA replaced the older LLAMA_CUDA option; drop it on machines without CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release

# Python dependencies for the conversion script (includes the gguf package)
pip install -r requirements.txt

# 1. Convert HF safetensors to F16 GGUF (single file; tokeniser + config embedded)
python convert_hf_to_gguf.py \
    ./Meta-Llama-3.1-8B-Instruct \
    --outfile llama3-8b.f16.gguf \
    --outtype f16

# 2. Quantise F16 GGUF to Q4_K_M (the de facto default scheme)
./build/bin/llama-quantize \
    llama3-8b.f16.gguf \
    llama3-8b.Q4_K_M.gguf \
    Q4_K_M

# 3. Sanity-check the artefact
./build/bin/llama-cli \
    -m llama3-8b.Q4_K_M.gguf \
    -p "Summarise GGUF in three sentences." \
    -n 128 --temp 0.7

# 4. (Optional) Inspect metadata with gguf-dump from the gguf Python package
gguf-dump --no-tensors llama3-8b.Q4_K_M.gguf
# Expect: general.architecture = llama, llama.vocab_size = 128256, llama.rope.freq_base = 500000, ...
```

Q4_K_M is the recommended starting point for any new GGUF deployment. It is widely treated as the best quality-to-size trade-off among the K-quants for typical models, and every llama.cpp-derived runtime supports it. Move to Q5_K_M only when quality matters more than memory; move to Q3_K_M or Q2_K only when memory pressure leaves no choice.

Format layout: how the bytes are arranged

A GGUF file is a four-section binary blob with a fixed magic-bytes header, a self-describing metadata key-value section, a tensor-info table and the raw tensor data. The format is little-endian, version-stamped, and uses length-prefixed strings throughout so that any reader can skip unknown fields without losing alignment.

Strings are length-prefixed UTF-8 (u64 length followed by bytes); no null terminator.
Types in the metadata section: u8/u16/u32/u64/i8/i16/i32/i64/f32/f64/bool/string/array (recursive).
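Tensor types (ggml_type): F32, F16, BF16, Q8_0/Q8_1/Q8_K, Q6_K, Q5_0/Q5_1/Q5_K, Q4_0/Q4_1/Q4_K, Q3_K, Q2_K, IQ4_NL/IQ4_XS, IQ3_XXS/IQ3_S, IQ2_XXS/IQ2_XS/IQ2_S, IQ1_S/IQ1_M. Names such as Q4_K_M, Q5_K_M and IQ3_M are file-level presets that mix several of these tensor types, not tensor types themselves.
Per-tensor quantisation type: each tensor can use a different ggml_type, which is the mechanism behind the mixed-precision presets (for example Q6_K for the embeddings and output projection, Q4_K for most of the body).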
Forward compatibility: unknown metadata keys are skipped silently; unknown tensor types fail with a clear error rather than misinterpret.

Section	Contents	Purpose
Header	Magic 'GGUF' (4 bytes) + version (u32) + tensor_count (u64) + metadata_kv_count (u64)	Identify file, version-gate parsers, size subsequent sections
Metadata KV	metadata_kv_count entries of (string key, type tag, value)	Architecture name, tokeniser vocab, special tokens, RoPE base/scale, attention head counts, custom annotations
Tensor info	tensor_count entries of (name, n_dims, shape[u64;n_dims], ggml_type, offset)	Locate every tensor without parsing its bytes; lazy-load to RAM or mmap
Tensor data	Raw tensor bytes at the offsets given above, each aligned to the file's `general.alignment` metadata value (32 bytes by default)	The actual quantised weights
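The header and metadata layout in the table can be sketched in a few lines of Python. This is a minimal illustration, not a full reader: it builds a tiny in-memory header with two metadata entries and parses it back. The type tags (4 for u32, 8 for string) and version number 3 are taken from the GGUF specification as I understand it; the helper names and the demo values are hypothetical.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
# Metadata value type tags (subset handled by this sketch).
T_UINT32, T_STRING = 4, 8


def write_string(buf, text):
    """Write a GGUF string: u64 length prefix, UTF-8 bytes, no terminator."""
    data = text.encode("utf-8")
    buf.write(struct.pack("<Q", len(data)))
    buf.write(data)


def read_string(buf):
    (length,) = struct.unpack("<Q", buf.read(8))
    return buf.read(length).decode("utf-8")


def build_demo_header():
    """Build a tiny in-memory header with two metadata entries and no tensors."""
    buf = io.BytesIO()
    buf.write(GGUF_MAGIC)
    buf.write(struct.pack("<IQQ", 3, 0, 2))  # version, tensor_count, metadata_kv_count
    write_string(buf, "general.architecture")
    buf.write(struct.pack("<I", T_STRING))
    write_string(buf, "llama")
    write_string(buf, "llama.block_count")
    buf.write(struct.pack("<II", T_UINT32, 32))
    buf.seek(0)
    return buf


def parse_header(buf):
    """Parse the fixed header and the metadata key-value section."""
    if buf.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", buf.read(20))
    metadata = {}
    for _ in range(kv_count):
        key = read_string(buf)
        (tag,) = struct.unpack("<I", buf.read(4))
        if tag == T_STRING:
            metadata[key] = read_string(buf)
        elif tag == T_UINT32:
            (metadata[key],) = struct.unpack("<I", buf.read(4))
        else:
            raise NotImplementedError(f"type tag {tag} not handled in this sketch")
    return version, tensor_count, metadata


version, tensor_count, metadata = parse_header(build_demo_header())
print(version, tensor_count, metadata)
# 3 0 {'general.architecture': 'llama', 'llama.block_count': 32}
```

Because every value carries a type tag and every string a length prefix, a reader that does not recognise a key can still compute how many bytes to skip, which is what makes the unknown-key-ignoring behaviour possible.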

Quantisation schemes: choosing per-tensor types

GGUF supports a wide family of quantisation types organised into three generations: legacy quants (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0), K-quants (Q2_K through Q6_K, introduced mid-2023), and I-quants (IQ1_S through IQ4_NL, introduced 2024 with importance-aware codebooks). Each tensor in a GGUF file can use a different scheme — the standard practice is to keep embeddings and output projections at Q6_K (where precision matters most) and quantise the bulk of weights at Q4_K_M.
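As a concrete picture of what a legacy quant stores, the sketch below quantises one block of 32 weights in the style of Q8_0: one scale per block plus 32 signed 8-bit integers. It is a simplified illustration under stated assumptions: ggml stores the scale as a 16-bit float, whereas this sketch keeps it as a Python float, and the sample weights are made up.

```python
def quantize_q8_0(block):
    """Quantise 32 floats to (scale, 32 integers in [-127, 127])."""
    if len(block) != 32:
        raise ValueError("Q8_0 blocks hold exactly 32 weights")
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0] * 32
    scale = amax / 127.0
    return scale, [round(x / scale) for x in block]


def dequantize_q8_0(scale, quants):
    """Reconstruct approximate weights from the scale and the integers."""
    return [scale * q for q in quants]


# Hypothetical weights in roughly [-0.64, 0.62].
weights = [((i * 37) % 64 - 32) / 50.0 for i in range(32)]
scale, quants = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

The rounding error per weight is bounded by half the scale, which is why a block with one large outlier quantises its small weights less precisely. K-quants and I-quants refine this basic idea with nested scales and non-uniform value tables.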

The naming convention Q<bits>_<variant> encodes the target bit width and a variant suffix. The legacy suffixes _0 and _1 differ in what each block stores: _0 blocks store only a scale (symmetric quantisation around zero), while _1 blocks store a scale plus a minimum, i.e. an offset. K-quant suffixes _S, _M and _L are size variants of the file-level preset: K_M is the standard, K_S is smaller and slightly lower quality, K_L is larger and slightly higher quality. I-quant names use the IQ prefix with a similar size ladder (XXS, XS, S, M); the NL in IQ4_NL stands for non-linear.
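The fractional bits-per-weight figures follow directly from the block layouts: each block stores its quantised values plus per-block scale data. The sketch below does that arithmetic for a few tensor types. The block sizes are my reading of the ggml struct definitions and should be treated as assumptions to verify against the source.

```python
# (bytes per block, weights per block) for a few ggml tensor types.
BLOCK_LAYOUTS = {
    "Q8_0": (2 + 32, 32),      # f16 scale + 32 int8 values
    "Q4_0": (2 + 16, 32),      # f16 scale + 32 packed 4-bit values
    "Q4_1": (2 + 2 + 16, 32),  # f16 scale + f16 minimum + 32 packed 4-bit values
    "Q4_K": (144, 256),        # super-block of 256 weights with nested scales
    "Q6_K": (210, 256),
}


def bits_per_weight(name):
    """Storage cost per weight, including the per-block scale overhead."""
    block_bytes, weights_per_block = BLOCK_LAYOUTS[name]
    return block_bytes * 8 / weights_per_block


for name in BLOCK_LAYOUTS:
    print(f"{name}: {bits_per_weight(name):.4f} bits/weight")
# Q8_0: 8.5000 bits/weight
# Q4_0: 4.5000 bits/weight
# Q4_1: 5.0000 bits/weight
# Q4_K: 4.5000 bits/weight
# Q6_K: 6.5625 bits/weight
```

A Q4_K_M file averages more than the 4.5 bits of a pure Q4_K tensor because the preset keeps some tensors at higher-precision types.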

Scheme	Bits/weight	Quality (Llama 8B vs F16)	Use when
F32	32	Identical (reference)	Calibration / debugging only
F16	16	~Identical (reference)	Source for re-quantisation
BF16	16	~Identical (reference)	Source on hardware with BF16 native support
Q8_0	8.5	Within 0.1 perplexity	Near-lossless; ~2x smaller than F16
Q6_K	6.6	Within 0.2 perplexity	Recommended for embeddings + output projection layers
Q5_K_M	5.7	Within 0.3-0.4 perplexity	High-quality general use; mid-size memory
Q4_K_M	4.8	Within 0.5-0.7 perplexity	DEFAULT — best quality-to-size trade-off
Q4_0	4.5	Within 0.7-1.0 perplexity	Legacy compatibility only; prefer Q4_K_M
IQ4_NL	4.5	Within 0.5 perplexity	Smaller than Q4_K_M with comparable quality; needs newer llama.cpp
Q3_K_M	3.9	1.5-2.5 perplexity worse	Aggressive: noticeable quality drop, roughly 20% smaller than Q4_K_M
IQ3_M	3.7	1.0-1.5 perplexity worse	Better than Q3_K_M at similar size; importance-aware
Q2_K	2.6	5-10 perplexity worse	Last-resort: degrades reasoning and coding noticeably
IQ2_S	2.5	3-5 perplexity worse	Better than Q2_K at similar size
IQ1_M / IQ1_S	1.6-1.7	10-20 perplexity worse	Experimental: large quality loss; only for memory-critical edge

I-quants (IQ-series, 2024+) use non-uniform codebooks and are normally quantised with an importance matrix computed from calibration text. At the same bit width they generally give better quality than the corresponding K-quant, but they require newer llama.cpp versions and have slightly slower kernel paths. For new edge deployments, IQ4_NL is worth benchmarking against Q4_K_M before settling on a default.
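The bits-per-weight column in the table above translates directly into a weights-only file size, which is usually the first number needed when sizing an edge device. The sketch below is a rough estimator under stated assumptions: the parameter count is a round hypothetical figure, the bits-per-weight values are copied from the table, and KV cache and runtime overhead are ignored.

```python
# Approximate bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6,
}


def estimate_size_gb(n_params, bits_per_weight):
    """Weights-only size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def pick_scheme(n_params, budget_gb):
    """Return the highest-precision scheme whose weights fit the budget, or None."""
    for name, bpw in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        if estimate_size_gb(n_params, bpw) <= budget_gb:
            return name
    return None


N_PARAMS_8B = 8.0e9  # assumed round figure for an 8B-class model

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gb(N_PARAMS_8B, bpw):.1f} GB")

print(pick_scheme(N_PARAMS_8B, 6.0))  # Q5_K_M
print(pick_scheme(N_PARAMS_8B, 4.0))  # Q3_K_M
print(pick_scheme(N_PARAMS_8B, 2.0))  # None
```

In practice, leave headroom beyond the weights for the KV cache and activations; a model whose weights barely fit will not leave room for a useful context window.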

Where it is used today: the llama.cpp ecosystem and edge inference

GGUF is the distribution format of the entire llama.cpp ecosystem. llama.cpp itself is consumed directly by power users; the broader population reaches it through Ollama (which embeds llama.cpp and pulls GGUF artefacts from a registry), LM Studio (a desktop GUI), Jan (an open-source desktop GUI), GPT4All (a packaged desktop app), KoboldCPP (a fork focused on creative writing), llamafile (a single-binary build that runs anywhere via Cosmopolitan libc), and llama-cpp-python (Python bindings used by LangChain, LlamaIndex, and a long tail of agent frameworks).

Partial readers exist outside the llama.cpp lineage. Candle (Hugging Face's Rust ML framework) reads GGUF for selected architectures. Apple MLX reads GGUF for Llama-family models on M-series silicon. These are intentionally narrow implementations — full architecture coverage remains the privilege of llama.cpp itself.

The format dominates one specific deployment niche: local, edge and on-device LLM inference. Anywhere the constraints are memory-tight, latency-critical, network-disconnected, or single-user — Jetson Orin Nano running a 3B model on 8 GB, a developer's M3 MacBook serving a coding assistant locally, an industrial gateway running a maintenance-prediction model in an air-gapped environment — GGUF is what gets shipped. For server-side, multi-tenant, GPU-rich cloud serving with continuous batching and PagedAttention, the native formats of vLLM (safetensors with AWQ / GPTQ / FP8), TensorRT-LLM (engine plan files) and SGLang are far better suited.

Yobitel's Yobibyte Marketplace catalogues GGUF recipes specifically for the edge-targeted deployment shape. The same Llama 3.1 8B model appears in the Marketplace with multiple recipes — AWQ INT4 for cloud GPU serving, FP8 for Hopper / Blackwell tenancies, and GGUF Q4_K_M / IQ4_NL for edge — and customers pick by deployment target. Yobitel Edge AI ships llama.cpp as the on-device runtime on its Jetson Orin and x86 micro-server SKUs, consuming GGUF artefacts directly with no intermediate conversion. The two product lines share the same Marketplace catalogue and the same model-card metadata, with the runtime selection driven by the target.

Trade-offs and known limitations

GGUF is optimised for portability and single-file simplicity at the cost of large-scale serving efficiency. By default a model is one file, so a 70B-class model in F16 GGUF is a single 140 GB artefact, which is awkward to distribute and slow to load. llama.cpp mitigates this in two ways: mmap-based loading (only the pages actually touched by inference are read from disk) and the `llama-gguf-split` tool, which splits a GGUF into numbered shards that the loader reassembles. Even so, natively sharded formats like safetensors handle 200B+ models more gracefully.
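The mmap behaviour mentioned above can be illustrated with Python's standard library. This is a sketch of the general technique, not llama.cpp's loader: it maps a small stand-in file (hypothetical contents: the 4-byte magic plus 1 MiB of padding) and reads only the first and last few bytes, leaving the operating system to page in just the regions that are touched.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a large GGUF file: magic bytes plus 1 MiB of zero padding.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + bytes(1 << 20))
    path = tmp.name

try:
    with open(path, "rb") as f:
        # Length 0 maps the whole file; no bytes are copied until a slice is read.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            magic = mapped[:4]    # touches only the first page
            tail = mapped[-16:]   # touches only the last page
            size = len(mapped)
finally:
    os.remove(path)

print(magic, size)
# b'GGUF' 1048580
```

The same principle lets a runtime start answering before every tensor has been read, and lets several processes share one copy of the weights in the page cache.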

The quantisation kernels in llama.cpp are CPU-and-edge-focused. They are highly optimised for Apple Metal, x86 AVX2/AVX512, ARM NEON and CUDA on consumer SKUs (RTX 30/40/50 series), but they are not competitive with vLLM's PagedAttention plus continuous batching on data-centre H100 SXM5. A single 70B Q4_K_M GGUF on llama.cpp serves perhaps 30-50 tokens/sec on a single H100; the equivalent AWQ INT4 on vLLM serves 1,500+ tokens/sec aggregated across hundreds of concurrent sequences. The format is not the bottleneck; the kernel and scheduler stack is.

Multi-tenant scheduling is limited. llama.cpp's `llama-server` does support continuous batching across a fixed number of parallel slots (the `--parallel` and `--cont-batching` options), but it lacks the PagedAttention-style KV-cache management and the scheduler maturity of data-centre runtimes. For single-user or low-concurrency workloads this is fine; for multi-tenant serving with hundreds of concurrent sequences it is a structural limitation that no kernel tuning fixes.
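To make the difference between request-level batching and iteration-level scheduling concrete, here is a toy simulation. It is a deliberately simplified model, not a description of any runtime's scheduler: each request needs a fixed number of decode steps, a batch has a fixed number of slots, and one step advances every active sequence by one token. The request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Request-level batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), slots):
        steps += max(lengths[start:start + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Iteration-level scheduling: a freed slot is refilled before the next step."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each active sequence emits one token; finished sequences leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # decode steps needed per request
print(static_batching_steps(lengths, slots=4))      # 16
print(continuous_batching_steps(lengths, slots=4))  # 10
```

With static batching the short requests wait for the long one in their batch; with iteration-level scheduling their slots are reused immediately, so the same work finishes in fewer steps. The gap widens as request lengths become more varied.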

Tokeniser coverage lags new model releases by days to weeks. A new tokeniser variant (e.g., the Gemma 2 SentencePiece variant, or DeepSeek-V3's mixed BPE) typically needs an explicit llama.cpp patch before convert_hf_to_gguf.py produces a working artefact. The same model converted to GGUF before the patch may load but produce garbage output.

Forward compatibility covers metadata, not new architectures. Unknown keys are ignored by older readers, but a Llama 4 architectural change (hypothetical: a new positional encoding scheme) still requires a llama.cpp code patch and a re-quantised GGUF. Old GGUFs of old architectures keep working; new architectures need new GGUFs.

Do not benchmark GGUF on H100 or H200 to compare against AWQ / FP8 quantised models in vLLM. The comparison is between runtime stacks (llama.cpp versus vLLM), not between quant formats, and llama.cpp is structurally not competitive on multi-tenant data-centre serving. Benchmark GGUF on the deployment shape it is built for: Jetson, M-series, single-user consumer GPU.

Practical implementation notes

Conversion from Hugging Face safetensors to F16 GGUF is handled by the script `convert_hf_to_gguf.py` in the llama.cpp tree; it covers the bulk of Hugging Face architectures (Llama, Mistral, Qwen, Phi, Gemma, DeepSeek, Falcon, MPT, GPT-NeoX, Baichuan, Yi, ChatGLM, InternLM, StarCoder, CodeLlama). Architectures not yet supported produce a clear error rather than a silently broken file. New architectures typically land within days to weeks of the original model release.

Quantisation from F16 GGUF to a target scheme is handled by the `llama-quantize` binary. It supports importance-matrix calibration via the `--imatrix` flag. An imatrix is computed with the separate `llama-imatrix` tool from a representative calibration dataset (a few hundred KB of text); it records which weights matter most for typical activations, so the quantiser can place its rounding error where it hurts least. For I-quants (IQ-series) and aggressive K-quants (Q3 and below), imatrix calibration noticeably improves output quality at the cost of a few minutes of calibration time.

Per-tensor scheme overrides during quantisation use the `--token-embedding-type` and `--output-tensor-type` flags. The recommended pattern for production GGUF artefacts is `--token-embedding-type Q6_K --output-tensor-type Q6_K` with the body at Q4_K_M or IQ4_NL. This keeps higher precision on the two tensors that matter most for output quality, at minimal storage cost.
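For scripted pipelines, the calibration and quantisation steps can be assembled as argument lists and handed to `subprocess.run`. The sketch below only builds and prints the commands; it does not execute them. The file names and the `bin_dir` default are hypothetical, the `llama-imatrix` options (`-m`, `-f`, `-o`) are my understanding of that tool's interface, and the exact spelling of type names accepted by the override flags should be checked against `llama-quantize --help` for the build in use.

```python
import shlex


def imatrix_command(model_f16, calibration_txt, imatrix_out, bin_dir="./build/bin"):
    """Argument list for computing an importance matrix from calibration text."""
    return [f"{bin_dir}/llama-imatrix", "-m", model_f16,
            "-f", calibration_txt, "-o", imatrix_out]


def quantize_command(model_f16, model_out, scheme, imatrix=None,
                     embedding_type=None, output_type=None, bin_dir="./build/bin"):
    """Argument list for llama-quantize with optional imatrix and per-tensor overrides."""
    cmd = [f"{bin_dir}/llama-quantize"]
    if imatrix:
        cmd += ["--imatrix", imatrix]
    if embedding_type:
        cmd += ["--token-embedding-type", embedding_type]
    if output_type:
        cmd += ["--output-tensor-type", output_type]
    # Positional arguments come last: input file, output file, target scheme.
    cmd += [model_f16, model_out, scheme]
    return cmd


calibrate = imatrix_command("model.f16.gguf", "calibration.txt", "imatrix.dat")
quantize = quantize_command("model.f16.gguf", "model.IQ4_NL.gguf", "IQ4_NL",
                            imatrix="imatrix.dat",
                            embedding_type="Q6_K", output_type="Q6_K")
print(shlex.join(calibrate))
print(shlex.join(quantize))
```

To run the steps, pass each list to `subprocess.run(cmd, check=True)`; building lists rather than shell strings avoids quoting problems with paths that contain spaces.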

Loading a GGUF in Python uses `llama-cpp-python`: `from llama_cpp import Llama; llm = Llama(model_path="./model.gguf", n_ctx=8192, n_gpu_layers=-1)` loads the model with all layers offloaded to GPU if a CUDA / Metal / Vulkan build is in use. `n_gpu_layers=-1` means 'offload every layer'; it does not check that the layers fit, so for memory-constrained edge deployments set it to a smaller integer to keep some layers on CPU.

```python
# Production GGUF inference from Python via llama-cpp-python
# pip install llama-cpp-python
# For a GPU build, set CMAKE_ARGS at install time, e.g.
#   CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python   (or -DGGML_METAL=on)
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_ctx=8192,             # context window
    n_gpu_layers=-1,        # offload all layers to GPU if available
    n_threads=8,            # CPU threads for layers not on GPU
    n_batch=512,            # prompt processing batch size
    use_mmap=True,          # memory-map the file; lazy-load pages
    verbose=False,
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Summarise GGUF in three sentences."}
    ],
    max_tokens=256,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])

# Inspect the loaded model's GGUF metadata (a dict of string keys to string values)
print(llm.metadata)
# Typical entries: 'general.architecture': 'llama', 'llama.context_length': '131072',
#                  'llama.embedding_length': '4096', 'tokenizer.ggml.model': 'gpt2', ...
```

Where GGUF fits in the Yobitel stack

Yobitel uses GGUF as the edge-targeted distribution format across two product lines. On the Yobibyte Marketplace, GGUF appears as a recipe variant alongside AWQ / GPTQ / FP8 for the same model — same model-card metadata, same evaluation results, different artefact for a different deployment target. Customers selecting a Yobibyte Edge tier consume GGUF artefacts; customers selecting cloud GPU serving consume the AWQ / FP8 variants. The Marketplace UX hides the format complexity: the customer picks a model and a tier, and the platform delivers the right artefact.

Yobitel Edge AI is the direct consumer of GGUF artefacts. The product ships llama.cpp (CUDA-built for Jetson Orin, x86 AVX2-built for industrial micro-servers) as the on-device inference runtime, consuming GGUF artefacts pulled from the Yobibyte Marketplace at provisioning time. Air-gapped deployments — common in industrial and OFFICIAL-sensitive UK government settings — pre-stage the GGUF artefact during commissioning and run entirely offline thereafter; the single-file portability of GGUF is what makes that operationally sane.

Yobitel's InferenceBench includes a separate edge-class benchmark suite that measures GGUF performance on Jetson Orin Nano, Jetson Orin NX, Apple M3 Pro and consumer RTX 4090 across the Q4_K_M and IQ4_NL quants for the Llama, Qwen and Phi families. The reported metrics — tokens-per-second, watts-per-token, memory footprint — are the operational numbers that matter for edge deployment selection. For teams choosing between Yobitel Edge AI managed delivery and a self-built edge stack, InferenceBench is the empirical reference point.

