QLoRA

TL;DR

QLoRA (Dettmers et al., NeurIPS 2023, arXiv:2305.14314) is a fine-tuning method that loads the frozen base model in 4-bit NF4 quantisation, attaches BF16 LoRA adapters to every linear layer, and trains the adapters while dequantising base weights on the fly — using 3-4x less VRAM than BF16 LoRA at near-identical quality.
Three ingredients make it work: the NormalFloat-4 (NF4) data type whose quantisation levels are placed at the quantiles of a unit normal (information-theoretically optimal for the near-Gaussian distribution of pretrained LLM weights); double quantisation of the per-block scaling constants (saves another ~0.4 bits/param); and paged AdamW 8-bit optimiser state that spills to CPU on transient VRAM spikes.
Headline result from the 2023 paper: a 65B LLaMA model fine-tuned on a single 48 GB GPU matched 16-bit fine-tuning quality on the Vicuna and MMLU evaluations. By 2026 the same recipe routinely fine-tunes 70B models on a single 80 GB H100, 13B on a consumer 24 GB RTX 4090, and (with offloading) frontier 100B+ models on a single H200.
Standard 2026 recipe: NF4 + double quantisation base, BF16 compute dtype, LoRA r=16-64 with alpha=2*r on every linear layer, paged AdamW 8-bit, cosine LR 1e-4 to 3e-4, gradient checkpointing on, FlashAttention 2/3 enabled — supported one-line in PEFT, Axolotl, Unsloth (2x faster) and LLaMA-Factory.
Trade-offs vs BF16 LoRA: 20-40% slower per step (dequant kernel runs on every forward pass), <0.5 point quality cost on most instruction-tune workloads, but 3-4x less VRAM. The pragmatic conclusion: use BF16 LoRA when the base fits, QLoRA when it does not — which is most single-GPU fine-tuning of 30B+ models in 2026.

Overview

QLoRA (Quantised Low-Rank Adaptation) is the fine-tuning technique that broke the single-GPU memory wall for large model adaptation. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman and Luke Zettlemoyer published it in May 2023 (arXiv:2305.14314, NeurIPS 2023) with a result that surprised the field: a 65 billion parameter LLM, instruction-tuned to match 16-bit fine-tunes on the Vicuna and MMLU benchmarks, trained on a single 48 GB GPU (a workstation or data-centre card rather than a consumer one). Before QLoRA, fine-tuning a 65B model meant a multi-GPU, often multi-node, setup with hundreds of gigabytes of HBM, accessible mainly to well-funded labs. After QLoRA, a single researcher with one 48-80 GB GPU could do it in about a day.

QLoRA is not a single new algorithm. It is a careful combination of a quantised frozen base and LoRA adapters, wired together so that the precision loss from quantisation is recovered by the adapter. Step one: load the pretrained base model with weights quantised to 4-bit NF4 and keep them frozen; the full base is never materialised at higher precision in HBM. Step two: attach standard LoRA adapters (see the LoRA entry) in BF16 to the target linear layers. Step three: during the forward pass, each frozen 4-bit weight matrix is dequantised on the fly to BF16 inside a fused CUDA kernel, multiplied with the input, summed with the LoRA contribution, and the intermediate BF16 weight is discarded. The dequantisation cost is moderate (QLoRA typically runs 20-40% slower per step than BF16 LoRA, as covered under trade-offs), the memory saving is roughly 4x on the base weights, and the LoRA adapter, trained in BF16 with gradients flowing only through it, absorbs whatever precision the quantisation lost.

The economic consequence is the entire reason QLoRA mattered. A 70B BF16 model needs about 140 GB just for weights; with optimiser state and gradients for full fine-tuning, the total exceeds 500 GB and requires a multi-node DeepSpeed ZeRO-3 or FSDP cluster. With BF16 LoRA (see LoRA entry) the gradient and optimiser memory shrinks to the adapter size, but the 140 GB of frozen base weights still requires 2x 80 GB H100s at minimum. With QLoRA the base shrinks to ~35 GB at NF4 + double quantisation, gradient and optimiser memory shrinks to the adapter size, paged AdamW 8-bit absorbs the activation spikes, and the whole 70B fine-tune fits comfortably on a single 80 GB H100. For a 13B base the same recipe lands on a 24 GB RTX 4090 — making serious LLM customisation a hobbyist-budget activity for the first time.

This entry is the conceptual reference for the engineer who needs to reason about QLoRA: what each of the three ingredients does, which trade-offs are real and which are folklore, when to choose QLoRA over plain BF16 LoRA, and the variants that have appeared since 2023 (LoftQ initialisation, 2-bit and 3-bit extensions, QDoRA). This entry helps you decide whether QLoRA fits your fine-tune budget and how to run it on Yobibyte or your own GPU. Yobibyte's FineTune resource exposes QLoRA (alongside LoRA) as a first-class method — customers configure rank, alpha and target modules and Yobibyte runs the job on their behalf on Yobitel-managed H100 / H200 capacity in UK and EU regions with NCSC OFFICIAL alignment, which is what makes single-GPU 70B fine-tuning a credible product at hobbyist-budget price points.

How it works: NF4, double quantisation, paged optimisers and the forward pass

The first ingredient is the NF4 data type. Standard 4-bit integer quantisation (INT4) spaces its 16 quantisation levels uniformly across the range [-1, 1] after rescaling — wasting representational budget on values that pretrained LLM weights almost never take. Pretrained LLM weights are well approximated by a normal distribution with mean zero, so most of the weight mass concentrates near zero with thin tails far from it. NF4 is information-theoretically optimal for this distribution: its 16 quantisation levels are placed at the quantiles of a unit normal, so each level is equally likely to be used by a randomly drawn weight. On real LLM weights, NF4 measurably outperforms both INT4 and FP4 at the same 4-bit budget — typically by 0.5-1 point on downstream benchmarks. Modern Hopper (H100, H200) and Blackwell (B100, B200) GPUs do not have native NF4 hardware; the dequantisation happens in fused CUDA / Triton kernels shipped by bitsandbytes and Unsloth, which translate NF4 indices into BF16 values inline with the GEMM. The overhead is small enough that QLoRA training runs at roughly 60-80% of the throughput of BF16 LoRA on the same GPU.
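
To make the quantile idea concrete, here is a minimal sketch in plain Python. It is a simplification and not the bitsandbytes codebook: the real NF4 levels come from an asymmetric construction that guarantees an exact zero, whereas this version simply takes 16 evenly spaced quantiles of a unit normal and rescales them to [-1, 1]. All function names are illustrative.

```python
from statistics import NormalDist


def normal_quantile_levels(bits: int = 4) -> list[float]:
    """Simplified NF4-style codebook: evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    unit_normal = NormalDist()
    raw = [unit_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def quantise_block(block: list[float], levels: list[float]) -> tuple[list[int], float]:
    """Absmax-scale one block of weights and map each value to its nearest level index."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero for an all-zero block
    indices = [
        min(range(len(levels)), key=lambda j: abs(levels[j] - w / absmax))
        for w in block
    ]
    return indices, absmax


def dequantise_block(indices: list[int], absmax: float, levels: list[float]) -> list[float]:
    """Look up each 4-bit index and undo the absmax scaling."""
    return [levels[i] * absmax for i in indices]


levels = normal_quantile_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]
print("gap next to zero:", round(min(gaps), 3), " gap at the tail:", round(max(gaps), 3))

block = [0.02, -0.15, 0.4, -0.9, 0.0, 0.07]
indices, absmax = quantise_block(block, levels)
print(indices, [round(w, 3) for w in dequantise_block(indices, absmax, levels)])
```

The printed gaps show the design choice: levels crowd together near zero, where most weights live, and spread out in the tails. A uniform INT4 grid would use the same gap everywhere.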

The second ingredient is double quantisation. Block-wise quantisation stores one FP32 scaling constant per block of weights, typically blocks of 64. That is 32 / 64 = 0.5 bits per parameter of overhead, so for a 70B model the scaling constants alone consume around 4.4 GB of HBM. Double quantisation quantises the scaling constants themselves: it stores them at 8-bit with a single FP32 super-constant per super-block of 256 scaling constants. The arithmetic: 4 bits per weight + 8 bits per 64 weights (sub-constants) + 32 bits per 64*256 weights (super-constants) = roughly 4.13 bits per parameter on average. For a 70B model that is about 35-36 GB versus 140 GB in BF16, the roughly 4x reduction that turns single-H100 70B fine-tuning from impossible to routine. Double quantisation contributes about 0.37 bits/param of saving (around 3 GB at 70B) on top of NF4 with negligible quality cost.
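
The bits-per-parameter arithmetic can be checked in a few lines. This is a back-of-envelope sketch: the block size of 64 and super-block size of 256 are the values quoted in this entry, and the 70B parameter count is a round number rather than any specific model's exact size.

```python
def bits_per_param(block_size: int = 64, superblock_size: int = 256,
                   double_quant: bool = True) -> float:
    """Average storage cost of one 4-bit weight, including its share of scaling constants."""
    weight_bits = 4.0
    if not double_quant:
        return weight_bits + 32 / block_size                    # one FP32 constant per block
    sub_constant_bits = 8 / block_size                          # one 8-bit constant per block
    super_constant_bits = 32 / (block_size * superblock_size)   # one FP32 per super-block
    return weight_bits + sub_constant_bits + super_constant_bits


def base_size_gb(n_params: float, bits: float) -> float:
    """Convert bits/param into decimal gigabytes for the frozen base."""
    return n_params * bits / 8 / 1e9


plain = bits_per_param(double_quant=False)
double = bits_per_param(double_quant=True)
print(f"NF4 only:        {plain:.3f} bits/param")
print(f"NF4 + double q.: {double:.3f} bits/param (saves {plain - double:.3f})")
print(f"70B base:        {base_size_gb(70e9, double):.1f} GB vs {base_size_gb(70e9, 16):.0f} GB in BF16")
```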

The third ingredient is paged optimisers. Long-context batches produce sporadic VRAM spikes from activation memory — the gradient of attention scales as O(seq_len * batch) and can briefly demand many gigabytes more than the steady-state working set. Paged optimisers use NVIDIA's unified memory feature to page the AdamW state (first and second moments, kept in 8-bit precision via Dettmers' earlier bitsandbytes work) out to CPU memory on demand and bring it back when needed. In practice paging is rare — the dynamic working set rarely exceeds available HBM during steady-state training — but when it happens, the run gracefully degrades to a slower step instead of OOM-crashing. This is what makes QLoRA usable on small VRAM budgets like 24 GB cards where transient spikes would otherwise abort training.

The forward pass mathematics is the LoRA forward (see LoRA entry) with one twist: the frozen W is stored quantised, so it must be dequantised before the matmul. For each linear layer with target weight W_q (quantised, frozen), trainable LoRA matrices A (r, d_in) and B (d_out, r), and input x, the forward computes y = dequantise(W_q) @ x + (alpha/r) * B @ A @ x. The dequantise call is fused with the matmul into a single CUDA kernel — bitsandbytes ships `bnb.matmul_4bit(x, W_q.t(), state)` which does exactly this. Gradients flow only through A and B; W_q is `requires_grad=False` and never has gradient buffers allocated. Optimiser state (paged AdamW 8-bit) is maintained only for A and B. The result is that training memory for a 70B QLoRA run is approximately 35 GB (NF4 base) + ~1 GB (LoRA adapter + 8-bit Adam state) + activation memory (5-25 GB depending on seq_len and gradient checkpointing).
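
A toy version of that forward pass, written with plain Python lists so it runs anywhere. It sketches the arithmetic only: the uniform 16-level codebook stands in for NF4, one absmax scale covers the whole matrix instead of per-block scales, and `qlora_forward` is a hypothetical name rather than a bitsandbytes or PEFT function.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def qlora_forward(x, w_indices, absmax, levels, A, B, alpha, r):
    """y = dequantise(W_q) @ x + (alpha / r) * B @ (A @ x)."""
    # Dequantise the frozen base; this higher-precision copy is transient and dropped on return.
    w = [[levels[i] * absmax for i in row] for row in w_indices]
    base_out = matvec(w, x)
    # Low-rank path: project down to rank r with A, then back up with B.
    lora_out = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base_out, lora_out)]


levels = [-1 + 2 * i / 15 for i in range(16)]    # stand-in 4-bit codebook (not real NF4)
w_indices = [[15, 0, 8], [7, 7, 15]]             # d_out=2, d_in=3, stored as 4-bit indices
x = [1.0, 2.0, 3.0]
A = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]           # shape (r, d_in) with r=2
B_zero = [[0.0, 0.0], [0.0, 0.0]]                # LoRA initialises B to zero
B_trained = [[1.0, 0.0], [0.0, 1.0]]             # pretend values after some training

print(qlora_forward(x, w_indices, 0.5, levels, A, B_zero, alpha=4, r=2))
print(qlora_forward(x, w_indices, 0.5, levels, A, B_trained, alpha=4, r=2))
```

With `B_zero` the output is exactly the quantised base model's output, which is why a freshly attached adapter does not change behaviour. Only A and B would receive gradients in a real run; the index matrix is never updated.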

NF4 data type: 16 quantisation levels at quantiles of unit normal; near-optimal for pretrained LLM weights.
Double quantisation: per-block FP32 scaling constants quantised to 8-bit with FP32 super-constants. Saves ~0.4 bits/param.
Total storage: ~4.1 bits/param for the base, ~4x compression vs BF16 (16 bits/param).
Paged AdamW 8-bit: 8-bit optimiser state with CPU spill on memory spikes — prevents OOM under transient activation pressure.
Forward pass: fused dequant + matmul (bitsandbytes `matmul_4bit`); LoRA contribution added in BF16; intermediate BF16 weight discarded.
Gradients: flow only through LoRA matrices A, B; base weights are frozen and have no gradient buffers.

python

# qlora_minimal.py — runs with: pip install transformers peft bitsandbytes accelerate trl datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

BASE = "meta-llama/Meta-Llama-3.1-8B"

# 1. NF4 + double quantisation config — the QLoRA recipe.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                 # NormalFloat 4 (vs "fp4")
    bnb_4bit_use_double_quant=True,            # double quantisation of scaling consts
    bnb_4bit_compute_dtype=torch.bfloat16,     # compute dtype after dequant
)

# 2. Load the base in 4-bit. ~5 GB for an 8B model; ~35 GB for a 70B.
tokenizer = AutoTokenizer.from_pretrained(BASE)
model     = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")

# 3. Prepare the model: cast layer norms to FP32, enable input-grad propagation, etc.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# 4. Standard LoRA on top — exactly like the LoRA entry's code.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: ~42M (0.5% of 8B base), with 8B base in 4-bit (~5 GB HBM).

# 5. Train with TRL SFTTrainer + paged AdamW 8-bit.
ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(10_000))
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,                # this argument was named `tokenizer=` in TRL releases before 0.12
    train_dataset=ds,
    args=SFTConfig(
        output_dir="./out-llama3-qlora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",              # paged AdamW = the third QLoRA ingredient
        logging_steps=10,
        save_steps=200,
    ),
)
trainer.train()

# 6. Save the adapter only (~150 MB). Do NOT merge into the 4-bit base — see warning below.
model.save_pretrained("./llama3-qlora-adapter")

Do NOT merge a QLoRA adapter directly into a 4-bit base for serving. The 4-bit base lacks the precision to absorb the adapter cleanly and the merged model loses quality sharply. Correct workflow: (1) dequantise the base to BF16, (2) merge the adapter into the BF16 base, (3) optionally re-quantise the merged BF16 model with AWQ or GPTQ for serving. Or skip merging entirely and run the adapter alongside the 4-bit base via PEFT's runtime path.
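
The merge path in that warning can be sketched with Transformers and PEFT as follows. This is an outline under assumptions, not a tested deployment script: `merge_adapter_into_bf16_base` and the output path are illustrative names, the base is assumed to fit in memory at BF16, and step (1) is done by reloading the original checkpoint at BF16 rather than dequantising the NF4 copy (which of the two is preferable is a judgement call this sketch does not settle). The heavy imports sit inside the function so the file can be imported on a machine without those libraries.

```python
def merge_adapter_into_bf16_base(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Reload the base in BF16 (no 4-bit config), fold the LoRA adapter in, and save the result."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # 1. Load the original checkpoint at BF16: note there is no quantization_config here.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

    # 2. Attach the trained adapter, then merge it into the BF16 weights.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

    # 3. Save a plain Transformers checkpoint; re-quantise it separately for serving if needed.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


# Example call (downloads the base model, so it is left commented out):
# merge_adapter_into_bf16_base(
#     "meta-llama/Meta-Llama-3.1-8B",   # same base as the training example
#     "./llama3-qlora-adapter",          # adapter directory saved by the training script
#     "./llama3-qlora-merged-bf16",      # hypothetical output directory
# )
```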

Variants and architectural choices: the QLoRA family in 2026

The 2023 QLoRA recipe has been refined into a small family of variants. The table below lists the variants that had library support by mid-2026; pick by what you are trying to fix.

Variant	Year	Key change	What it improves	Library support
QLoRA (original)	2023	NF4 base + double quant + BF16 LoRA + paged AdamW 8-bit	Baseline — 4x VRAM saving	bitsandbytes + PEFT, Axolotl, Unsloth, LLaMA-Factory
QLoRA with FP4	2023	FP4 instead of NF4 (less optimal for normal weights)	Slightly faster on some GPUs	bitsandbytes (`bnb_4bit_quant_type='fp4'`)
LoftQ initialisation	2024	Initialise LoRA A, B to compensate for quantisation error	Recovers ~0.3-0.5 pt quality lost to NF4	PEFT (`init_lora_weights='loftq'`)
QDoRA	2024	QLoRA combined with DoRA decomposition	Closes residual gap to full FT	PEFT (`use_dora=True` + 4-bit base)
HQQ + LoRA	2024	HQQ quantisation (calibration-free, fast) + LoRA	Faster quantisation step; comparable serving quality	hqq-org/hqq + PEFT
2-bit / 3-bit QLoRA	2024-2025	AQLM / QuIP# at 2 bits/param + LoRA	Fits 100B+ on single GPU; ~1-2 pt quality cost	PEFT + AQLM / QuIP#
QLoRA + FSDP	2024	QLoRA across multiple GPUs with FSDP wrapping	Multi-GPU QLoRA training for larger contexts	Answer.AI / PEFT (`fsdp_qlora`)

In 2026 the default-of-defaults remains the original QLoRA recipe (NF4 + double quant + paged AdamW 8-bit), with LoftQ initialisation added when the quality gap matters and QDoRA when DoRA's directional decomposition helps. The 2-bit variants (AQLM, QuIP#) are still cutting-edge and worth using only when the model truly does not fit at 4 bits.

Where it is used today: the open-source fine-tuning ecosystem

By 2026, QLoRA is the dominant single-GPU recipe for LLM fine-tuning across the open-source ecosystem. Hugging Face's PEFT library treats it as a first-class option through the `BitsAndBytesConfig` integration — every PEFT-based training run can switch to QLoRA with three lines of config. Axolotl (axolotl-ai-cloud/axolotl) exposes it with `adapter: qlora` and `load_in_4bit: true` in YAML; the example configs at `axolotl/examples/llama-3/` ship pre-tuned QLoRA recipes for 7B through 70B Llama variants. Unsloth (unslothai/unsloth) optimises the QLoRA forward and backward kernels with hand-written Triton code and delivers roughly 2x throughput plus 50-70% less peak VRAM vs the bitsandbytes baseline on supported model families. LLaMA-Factory (hiyouga/LLaMA-Factory) adds a Gradio web UI on top of PEFT-style QLoRA with 100+ pre-registered model templates.

The community release pattern that defines the era looks like this: a lab releases a 70B-class open-weights model (Llama 3.1 70B, Qwen2.5 72B) and within days dozens of QLoRA fine-tunes appear on Hugging Face (domain specialisations, role-play variants, language adaptations, instruction-tune refinements), each costing a few hundred dollars of single-GPU compute to produce. Without QLoRA this layer of the ecosystem would be far thinner; full fine-tuning of 70B models would remain a privileged-lab activity. The 'every 70B model gets a 100-strong fine-tune family within a month' phenomenon is QLoRA's most visible cultural impact.

Commercial fine-tune services use QLoRA-style recipes under the hood for the same memory-economy reason. Together AI's fine-tune product, Replicate's fine-tunes, Fireworks AI's fine-tunes and AWS Bedrock's custom model offerings all rely on QLoRA-equivalent quantised-base + adapter recipes to serve fine-tuning at price points (typically $1-10 per million training tokens) that would be impossible with full BF16 training. The customer interface hides the quantisation; what they see is a fine-tune job that produces a small adapter at a fraction of dense fine-tune cost.

Sizing guidance for a planning conversation: the rough VRAM rules-of-thumb for QLoRA fine-tuning in 2026, assuming r=32 with rsLoRA scaling, sequence length 4,096, gradient checkpointing on, paged AdamW 8-bit and FlashAttention 2/3 enabled. These are working-set estimates and include base, adapter, activation and optimiser-state contributions.

Model size	Base (NF4)	Activation + adapter	Working VRAM	Fits on
7B (Mistral, Llama 3.1 8B)	~4 GB	~6-10 GB	~12-15 GB	RTX 4090 24 GB (head-room)
13B (Llama 2 13B, Qwen 14B)	~7 GB	~7-12 GB	~15-20 GB	RTX 4090 24 GB, A100 40 GB
34B (Yi 34B, CodeLlama 34B)	~17 GB	~10-15 GB	~28-35 GB	A100 40 GB tight, A100 80 GB
70B (Llama 3.1 70B, Qwen2.5 72B)	~35 GB	~15-25 GB	~55-70 GB	H100 80 GB, H200 141 GB
141B (Mixtral 8x22B MoE)	~70 GB	~20-30 GB	~95-110 GB	H200 141 GB, 2x H100 80 GB
405B (Llama 3.1 405B)	~200 GB	~30-50 GB	~250-280 GB	FSDP-QLoRA across 4x H100 / 2x H200
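
The base and adapter contributions in the table can be approximated from first principles. The sketch below counts LoRA parameters for a Llama-style decoder and converts them to memory. The layer dimensions are Llama 3.1 8B's published configuration (hidden size 4096, 32 layers, KV projection width 1024, MLP width 14336). The byte counts per trainable parameter (BF16 weight, BF16 gradient, two 8-bit Adam moments) are simplifying assumptions, unquantised embeddings and norms are ignored, and activation memory is left out because it depends on sequence length and batch size.

```python
def lora_param_count(hidden: int, kv_dim: int, mlp_dim: int, layers: int, r: int) -> int:
    """Trainable parameters when LoRA targets q, k, v, o, gate, up and down projections."""
    # Each adapted linear with input width d_in and output width d_out adds r * (d_in + d_out).
    shapes = [
        (hidden, hidden),   # q_proj
        (hidden, kv_dim),   # k_proj
        (hidden, kv_dim),   # v_proj
        (hidden, hidden),   # o_proj
        (hidden, mlp_dim),  # gate_proj
        (hidden, mlp_dim),  # up_proj
        (mlp_dim, hidden),  # down_proj
    ]
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)


def static_memory_gb(base_params: float, adapter_params: int, base_bits: float = 4.127) -> dict:
    """Memory that does not depend on batch or sequence length, in decimal GB."""
    base = base_params * base_bits / 8 / 1e9
    # Assumed 2 bytes weight + 2 bytes gradient + 2 x 1 byte 8-bit Adam moments per parameter.
    adapter = adapter_params * (2 + 2 + 2) / 1e9
    return {"base_gb": round(base, 2), "adapter_gb": round(adapter, 2)}


n_lora = lora_param_count(hidden=4096, kv_dim=1024, mlp_dim=14336, layers=32, r=16)
print(f"{n_lora:,} trainable parameters")
print(static_memory_gb(8e9, n_lora))
```

For r=16 on an 8B model the adapter and its optimiser state are a rounding error next to the base; the gap between these static numbers and the table's working VRAM is almost entirely activations.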

Trade-offs and known limitations

QLoRA trades training throughput and a small amount of final quality for a large drop in VRAM consumption. The dequantisation kernel runs on every forward pass; depending on hardware, sequence length and batch size, QLoRA training is typically 20-40% slower per step than BF16 LoRA on the same GPU. Unsloth's hand-written Triton kernels close roughly half this gap on supported model families. The throughput cost is rarely the deciding factor in 2026 — most teams adopt QLoRA precisely because BF16 LoRA does not fit at all, so the comparison is 'slow training' vs 'no training'.

Quality cost on most instruction-tuning workloads is small (typically around half a point or less on standard benchmarks such as MMLU, IFEval and MT-Bench) and is generally smaller than the quality cost of having to drop to a smaller base model because the larger one did not fit at BF16. The 2023 QLoRA paper demonstrated that 65B QLoRA matched 16-bit full fine-tuning on Vicuna; subsequent work has reproduced and extended this finding across most workloads. LoftQ initialisation recovers most of the residual gap by initialising LoRA matrices to compensate for the NF4 quantisation error; QDoRA goes further by adding DoRA's magnitude-direction decomposition. The pragmatic conclusion: for the kinds of fine-tuning most teams actually run (instruction tuning, domain specialisation, RAG-improvement fine-tunes), QLoRA quality is indistinguishable from BF16 LoRA in normal evaluation conditions.

Merge-and-serve workflows need care. Do not merge a QLoRA adapter directly into a 4-bit quantised base — the 4-bit base lacks the precision headroom to absorb the adapter cleanly and the merged model loses quality sharply. The correct serving workflow is: dequantise the base to BF16, merge the adapter into the BF16 base (single matrix add per layer), then optionally re-quantise the merged BF16 model with a serving-optimised quantisation scheme (AWQ, GPTQ, FP8). Alternatively, skip merging entirely and run the adapter alongside the 4-bit base using PEFT's runtime path or vLLM's multi-LoRA support, which handles the adapter on the fly. The merge mistake is one of the most common QLoRA pitfalls in production.
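
A toy calculation shows the mechanism behind the merge pitfall. The sketch below uses a uniform 16-level grid as a stand-in for NF4 and a single made-up weight and adapter delta; the point is only that an update smaller than half the quantisation step is rounded away when the merged weight is squeezed back onto the 4-bit grid.

```python
LEVELS = [-1 + 2 * i / 15 for i in range(16)]   # stand-in 4-bit grid, step of about 0.133


def quantise(value: float, absmax: float = 1.0) -> float:
    """Round a weight to the nearest representable 4-bit level (after absmax scaling)."""
    return min(LEVELS, key=lambda level: abs(level - value / absmax)) * absmax


w_bf16 = 0.47          # hypothetical base weight at full precision
delta = 0.02           # hypothetical (alpha / r) * (B @ A) entry learned by the adapter

# Wrong: dequantise, add the adapter delta, then squeeze back onto the 4-bit grid.
merged_4bit = quantise(quantise(w_bf16) + delta)

# Right: merge at BF16 precision, where the small delta survives.
merged_bf16 = w_bf16 + delta

print("4-bit base alone:      ", round(quantise(w_bf16), 4))
print("merged into 4-bit base:", round(merged_4bit, 4))
print("merged at BF16:        ", round(merged_bf16, 4))
```

In this example the first two printed values are identical: the adapter's contribution has vanished. Across millions of real weights some deltas round up and others disappear, which distorts the learned update rather than applying it.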

Context-length scaling has a hidden cost. The QLoRA paper's memory analysis assumed moderate context (2k-4k tokens). At long contexts (32k, 128k, 1M), activation memory grows as O(seq_len * batch) and quickly dominates the base-weight saving. For long-context QLoRA fine-tuning, FlashAttention 2/3 is mandatory (turns O(seq_len^2) attention memory into O(seq_len)), gradient checkpointing is mandatory (trades recompute time for activation memory), and small per-device batch sizes with high gradient accumulation become the norm. Even with these mitigations, a 70B 128k-context QLoRA fine-tune is closer to 'tight on a single H100' than 'comfortable'.
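
The quadratic term is easy to underestimate. The following back-of-envelope sketch computes the size of the attention score matrix that a naive (non-FlashAttention) implementation would materialise for one layer. The head count of 64 matches Llama-style 70B models; BF16 scores and batch size 1 are assumptions, and real frameworks add further buffers on top.

```python
def naive_attention_scores_gb(seq_len: int, n_heads: int = 64, batch: int = 1,
                              bytes_per_value: int = 2) -> float:
    """Memory for one layer's (batch, heads, seq, seq) score matrix, in decimal GB."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value / 1e9


for seq_len in (4_096, 32_768, 131_072):
    gb = naive_attention_scores_gb(seq_len)
    print(f"seq_len={seq_len:>7,}: {gb:,.1f} GB per layer if scores are materialised")
```

At 4k tokens the score matrix is a couple of gigabytes per layer; at 32k it already exceeds an 80 GB card on its own, which is why a kernel that never materialises it is treated as mandatory for long-context runs.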

QLoRA-specific quirks worth knowing. (1) Layer norm and embedding layers are typically kept in FP32 / BF16 even when the base is quantised — `prepare_model_for_kbit_training()` handles this automatically. (2) Mixed-precision training requires BF16 compute dtype (`bnb_4bit_compute_dtype=torch.bfloat16`); FP16 compute dtype causes occasional NaN issues on Ampere and earlier GPUs. (3) Gradient checkpointing is effectively mandatory at >7B base size to keep activation memory in check. (4) Some model architectures (notably very recent or custom architectures) may not be fully supported by bitsandbytes' 4-bit path; check the supported-model list before committing. (5) The Llama / Mistral / Qwen / Gemma / Phi / DeepSeek families are all well-supported; exotic architectures may need testing first.

Pro: 3-4x less VRAM than BF16 LoRA, with 70B fine-tuning on a single 80 GB H100.
Pro: <0.5 point quality cost on most workloads vs BF16 LoRA; LoftQ / QDoRA recover most of the residual.
Pro: composes with every PEFT / Axolotl / Unsloth recipe — drop-in adoption.
Pro: makes single-engineer fine-tuning of frontier-scale open models economically viable.
Con: 20-40% slower training step than BF16 LoRA (Unsloth narrows to 10-20%).
Con: Merging directly into 4-bit base degrades quality — must dequant → merge → requant.
Con: Activation memory at long context (32k+) dominates the saving; needs FlashAttention + grad checkpointing.
Con: Limited to bitsandbytes-supported model architectures (broad but not universal).

Practical implementation notes

Libraries that implement QLoRA well in 2026: bitsandbytes (bitsandbytes-foundation/bitsandbytes) is the canonical 4-bit and 8-bit CUDA + Triton kernel library, and nearly every QLoRA implementation in the open-source ecosystem depends on it. Hugging Face PEFT works on top of bitsandbytes-quantised models and exposes QLoRA through `BitsAndBytesConfig` (imported from Transformers) + `prepare_model_for_kbit_training()` + standard `LoraConfig`. Unsloth (unslothai/unsloth) adds custom Triton kernels around the 4-bit path and is the throughput leader for single-GPU QLoRA on supported model families (Llama, Mistral, Gemma, Qwen, Phi, DeepSeek). Axolotl provides YAML-driven QLoRA configs; LLaMA-Factory adds a Gradio UI. For multi-GPU QLoRA, the Answer.AI fsdp_qlora project (whose approach was later integrated into the Hugging Face stack) supports FSDP-wrapped QLoRA across multiple GPUs and is the path for fine-tuning 405B-class models that do not fit on a single GPU even at 4 bits.

Hyperparameter defaults that work for QLoRA fine-tuning in 2026: NF4 quantisation type (not FP4), double quantisation enabled, BF16 compute dtype (not FP16), LoRA r=16-64 (sweep to find optimum), alpha = 2 * r, target every linear in attention + MLP (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj), LoRA dropout 0.05, paged AdamW 8-bit optimiser, learning rate 2e-4 with cosine decay to 1e-5 and 3% warmup, effective batch size 32-128, 1-3 epochs, gradient checkpointing on, FlashAttention 2/3 enabled. For longer context (>16k tokens), reduce per-device batch size and increase gradient accumulation. For long QLoRA runs on consumer cards (RTX 4090), expect to use micro_batch_size=1 and grad_accum=32 or higher.
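
The schedule and batch-size defaults above reduce to two small formulas. This sketch assumes linear warmup from zero followed by cosine decay to a floor; `lr_at` and `grad_accum_steps` are illustrative helpers, not functions from Transformers or TRL, and the plain `cosine` scheduler in those libraries decays towards zero rather than to a floor unless configured otherwise.

```python
import math


def lr_at(step: int, total_steps: int, peak: float = 2e-4, floor: float = 1e-5,
          warmup_ratio: float = 0.03) -> float:
    """Linear warmup to `peak`, then cosine decay down to `floor`."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


def grad_accum_steps(effective_batch: int, micro_batch: int, n_gpus: int = 1) -> int:
    """Accumulation steps needed to reach the target effective batch size."""
    return math.ceil(effective_batch / (micro_batch * n_gpus))


total = 1_000
for step in (0, 30, 500, 1_000):
    print(step, f"{lr_at(step, total):.2e}")
print("micro-batch 1, effective batch 32 ->", grad_accum_steps(32, 1), "accumulation steps")
```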

Common failure modes and their fixes. (1) `RuntimeError: Could not load library libbitsandbytes_cuda...`: bitsandbytes built for the wrong CUDA version. Reinstall against the right CUDA toolkit, or use the prebuilt wheel matching your CUDA. (2) Sudden NaN in loss after a few hundred steps: usually an FP16 compute dtype range problem (overflow or underflow); switch to BF16 (`bnb_4bit_compute_dtype=torch.bfloat16`). (3) OOM at start of training despite plenty of headroom: forgot to call `prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)`. (4) Adapter merge produces a model whose responses are nonsense: see the warning above; do not merge directly into 4-bit base. (5) Throughput much lower than expected: check that FlashAttention is actually enabled (the attention backend is chosen by Transformers, not bitsandbytes, and the model uses SDPA or eager attention unless `attn_implementation='flash_attention_2'` is requested at load time with the flash-attn package installed) and that `optim='paged_adamw_8bit'` is set (the default `adamw_torch` keeps FP32 optimiser state and uses more memory). (6) Quality unexpectedly worse than equivalent BF16 LoRA on a specific task: try LoftQ initialisation (`init_lora_weights='loftq'`) to compensate for quantisation error, or try QDoRA (`use_dora=True`).

Sizing recipes by hardware. Single RTX 4090 (24 GB): comfortable for 7-13B QLoRA at 4-8k context; 30B-class models fit only at short context with batch size 1 and aggressive checkpointing (the sizing table above puts 34B at ~28-35 GB for 4k context). Single A100 80 GB or H100 80 GB: comfortable for 70B QLoRA at 4-8k context; reaches 16-32k context with care. Single H200 (141 GB): comfortable for 70B at 64k+ context, or 100B+ models like Mixtral 8x22B. 4x H100 with FSDP-QLoRA: covers 405B-class fine-tuning at moderate context. For everything larger, full multi-node distributed training (not QLoRA) is the right tool.

Teams that prefer not to manage bitsandbytes, kernel versions and quantisation-merge plumbing themselves can submit the same fine-tune as a Yobibyte FineTune job — Yobibyte runs the QLoRA recipe on Yobitel-managed H100 / H200 capacity and serves the resulting adapter through its multi-LoRA inference surface, with customers paying only for tokens generated rather than dedicated training and serving capacity.

Unsloth's QLoRA path is roughly 2x faster than the bitsandbytes baseline for single-GPU runs on Llama/Mistral/Gemma/Qwen/Phi/DeepSeek architectures, with no quality cost. If your QLoRA workload fits Unsloth's supported-model list, the throughput win is essentially free.

Where QLoRA fits in the Yobitel stack

QLoRA is the engine room behind Yobibyte's fine-tune economics for mid-sized open-weights models. A customer submitting a FineTune job for a 70B-class base (Llama 3.1 70B, Qwen2.5 72B, etc.) does not need to know that QLoRA is the recipe underneath: they see a job spec with rank, learning rate and epochs, a job that completes in a few hours on a single H100 or H200, and an adapter artefact they own. The 4x VRAM saving is what makes single-GPU fine-tuning of 70B models a credible product at hobbyist-budget price points.

Yobibyte's multi-LoRA serving surface accepts QLoRA-trained adapters identically to BF16 LoRA adapters: an adapter is either merged into a BF16 copy of the base and served as a dedicated replica (per the merge warning above), or kept unmerged and hot-swapped alongside dozens of other adapters on a shared base. Customers pay only for tokens generated, not for dedicated capacity for their adapter, regardless of whether the adapter was trained via BF16 LoRA or QLoRA.

InferenceBench evaluates QLoRA-trained adapters alongside BF16-LoRA-trained adapters and full fine-tunes, so customers can confirm empirically the textbook claim that QLoRA quality matches BF16 LoRA within a fraction of a point on common evaluation suites. The data informs Yobibyte's default fine-tune recipe selection: QLoRA when the base is 30B or larger (memory-economy win is decisive), BF16 LoRA when the base fits comfortably (slight quality and throughput edge), full FT only when explicitly requested with the appropriate cluster budget.

References

QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) · arXiv / NeurIPS 2023
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) · arXiv
LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models (Li et al., 2023) · arXiv
bitsandbytes — 8-bit and 4-bit CUDA functions · GitHub
PEFT quantisation guide (Hugging Face) · Hugging Face docs
Answer.AI FSDP-QLoRA — multi-GPU QLoRA training · Answer.AI
Unsloth — accelerated LoRA / QLoRA kernels · GitHub

TL;DR

QLoRA (Dettmers et al., NeurIPS 2023, arXiv:2305.14314) is a fine-tuning method that loads the frozen base model in 4-bit NF4 quantisation, attaches BF16 LoRA adapters to every linear layer, and trains the adapters while dequantising base weights on the fly — using 3-4x less VRAM than BF16 LoRA at near-identical quality.
Three ingredients make it work: the NormalFloat-4 (NF4) data type whose quantisation levels are placed at the quantiles of a unit normal (information-theoretically optimal for the near-Gaussian distribution of pretrained LLM weights); double quantisation of the per-block scaling constants (saves another ~0.4 bits/param); and paged AdamW 8-bit optimiser state that spills to CPU on transient VRAM spikes.
Headline result from the 2023 paper: matched 16-bit full-fine-tune quality on Vicuna and Llama-65B benchmarks using a single 48 GB GPU. By 2026 the same recipe routinely fine-tunes 70B models on a single 80 GB H100, 13B on a consumer 24 GB RTX 4090, and (with offloading) frontier 100B+ models on a single H200.
Standard 2026 recipe: NF4 + double quantisation base, BF16 compute dtype, LoRA r=16-64 with alpha=2*r on every linear layer, paged AdamW 8-bit, cosine LR 1e-4 to 3e-4, gradient checkpointing on, FlashAttention 2/3 enabled — supported one-line in PEFT, Axolotl, Unsloth (2x faster) and LLaMA-Factory.
Trade-offs vs BF16 LoRA: 20-40% slower per step (dequant kernel runs on every forward pass), <0.5 point quality cost on most instruction-tune workloads, but 3-4x less VRAM. The pragmatic conclusion: use BF16 LoRA when the base fits, QLoRA when it does not — which is most single-GPU fine-tuning of 30B+ models in 2026.

Overview#

How it works: NF4, double quantisation, paged optimisers and the forward pass#

NF4 data type: 16 quantisation levels at quantiles of unit normal; near-optimal for pretrained LLM weights.
Double quantisation: per-block FP32 scaling constants quantised to 8-bit with FP32 super-constants. Saves ~0.4 bits/param.
Total storage: ~4.1 bits/param for the base, ~4x compression vs BF16 (16 bits/param).
Paged AdamW 8-bit: 8-bit optimiser state with CPU spill on memory spikes — prevents OOM under transient activation pressure.
Forward pass: fused dequant + matmul (bitsandbytes `matmul_4bit`); LoRA contribution added in BF16; intermediate BF16 weight discarded.
Gradients: flow only through LoRA matrices A, B; base weights are frozen and have no gradient buffers.

python

# qlora_minimal.py — runs with: pip install transformers peft bitsandbytes accelerate trl datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

BASE = "meta-llama/Meta-Llama-3.1-8B"

# 1. NF4 + double quantisation config — the QLoRA recipe.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                 # NormalFloat 4 (vs "fp4")
    bnb_4bit_use_double_quant=True,            # double quantisation of scaling consts
    bnb_4bit_compute_dtype=torch.bfloat16,     # compute dtype after dequant
)

# 2. Load the base in 4-bit. ~5 GB for an 8B model; ~35 GB for a 70B.
tokenizer = AutoTokenizer.from_pretrained(BASE)
model     = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")

# 3. Prepare the model: cast layer norms to FP32, enable input-grad propagation, etc.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# 4. Standard LoRA on top — exactly like the LoRA entry's code.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: ~42M (0.5% of 8B base), with 8B base in 4-bit (~5 GB HBM).

# 5. Train with TRL SFTTrainer + paged AdamW 8-bit.
ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(10_000))
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    args=SFTConfig(
        output_dir="./out-llama3-qlora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",              # paged AdamW = the third QLoRA ingredient
        logging_steps=10,
        save_steps=200,
    ),
)
trainer.train()

# 6. Save the adapter only (~150 MB). Do NOT merge into the 4-bit base — see warning below.
model.save_pretrained("./llama3-qlora-adapter")
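
The trainable-parameter figure printed in step 4 can be checked by hand. The sketch below assumes the publicly documented Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, 32 layers); these constants and the helper name `lora_trainable_params` are illustrative and are not taken from the listing above.

```python
# Assumed Llama 3.1 8B dimensions; verify against the model's config.json.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 8 * 128, 32

# (in_features, out_features) for each projection named in target_modules.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r, shapes=SHAPES, layers=LAYERS):
    """Count LoRA weights: each adapted layer adds A (r x in) and B (out x r)."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * layers


n = lora_trainable_params(16)
print(n)                      # 41943040
print(f"{n * 4 / 1e6:.0f} MB in FP32")
```

With r=16 this gives 41,943,040 adapter weights, which matches the ~42M figure. Stored in FP32 that is about 168 MB, or half that in BF16. The count is linear in r, so r=64 would be four times larger.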

Variants and architectural choices: the QLoRA family in 2026#

The 2023 QLoRA recipe has been refined into a small family of variants. The table below lists the variants with library support as of mid-2026; choose by the problem you are trying to fix.

Variant	Year	Key change	What it improves	Library support
QLoRA (original)	2023	NF4 base + double quant + BF16 LoRA + paged AdamW 8-bit	Baseline: 3-4x VRAM saving vs BF16 LoRA	bitsandbytes + PEFT, Axolotl, Unsloth, LLaMA-Factory
QLoRA with FP4	2023	FP4 instead of NF4 (less optimal for normal weights)	Slightly faster on some GPUs	bitsandbytes (`bnb_4bit_quant_type='fp4'`)
LoftQ initialisation	2024	Initialise LoRA A, B to compensate for quantisation error	Recovers ~0.3-0.5 pt quality lost to NF4	PEFT (`init_lora_weights='loftq'`)
QDoRA	2024	QLoRA combined with DoRA decomposition	Closes residual gap to full FT	PEFT (`use_dora=True` + 4-bit base)
HQQ + LoRA	2024	HQQ quantisation (calibration-free, fast) + LoRA	Faster quantisation step; comparable serving quality	mobiusml/hqq + PEFT
2-bit / 3-bit QLoRA	2024-2025	AQLM / QuIP# at 2-3 bits/param + LoRA	Fits 100B+ on a single GPU; ~1-2 pt quality cost	PEFT + AQLM / QuIP#
QLoRA + FSDP	2024	QLoRA across multiple GPUs with FSDP wrapping	Multi-GPU QLoRA training for larger contexts	Answer.AI / PEFT (`fsdp_qlora`)
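
Several rows of the table differ from the baseline recipe only in how the `LoraConfig` is built. As a rough sketch, assuming a recent PEFT release that exposes `use_dora`, `init_lora_weights="loftq"` and `LoftQConfig`, the hypothetical helper `build_variant_config` below maps a variant name to a config. The hyperparameters are copied from the baseline listing and are not tuned per variant.

```python
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj"]
VARIANTS = ("qlora", "qdora", "loftq")


def build_variant_config(variant="qlora", r=16):
    """Return a peft.LoraConfig for one table row (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    # Imported lazily so the argument check above works without PEFT installed.
    from peft import LoftQConfig, LoraConfig

    common = dict(task_type="CAUSAL_LM", r=r, lora_alpha=2 * r,
                  lora_dropout=0.05, target_modules=TARGETS, bias="none")
    if variant == "qdora":
        # DoRA splits each weight into magnitude and direction;
        # the low-rank update is applied to the direction.
        return LoraConfig(use_dora=True, **common)
    if variant == "loftq":
        # LoftQ initialises A and B to absorb part of the 4-bit quantisation error.
        return LoraConfig(init_lora_weights="loftq",
                          loftq_config=LoftQConfig(loftq_bits=4), **common)
    return LoraConfig(**common)
```

The FP4 row is a quantisation setting rather than an adapter setting: it is selected with `bnb_4bit_quant_type="fp4"` in `BitsAndBytesConfig`. For LoftQ, check the PEFT documentation for how the base model has to be loaded before initialisation, since that requirement can differ from the plain QLoRA path.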

Where it is used today: the open-source fine-tuning ecosystem#

Model size	Base (NF4)	Activation + adapter	Working VRAM	Fits on
7B-8B (Mistral 7B, Llama 3.1 8B)	~4 GB	~6-10 GB	~12-15 GB	RTX 4090 24 GB (with headroom)
13B (Llama 2 13B, Qwen 14B)	~7 GB	~7-12 GB	~15-20 GB	RTX 4090 24 GB, A100 40 GB
34B (Yi 34B, CodeLlama 34B)	~17 GB	~10-15 GB	~28-35 GB	A100 40 GB tight, A100 80 GB
70B (Llama 3.1 70B, Qwen2.5 72B)	~35 GB	~15-25 GB	~55-70 GB	H100 80 GB, H200 141 GB
141B (Mixtral 8x22B MoE)	~70 GB	~20-30 GB	~95-110 GB	H200 141 GB, 2x H100 80 GB
405B (Llama 3.1 405B)	~200 GB	~30-50 GB	~250-280 GB	FSDP-QLoRA across 4x H100 / 2x H200
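
The "Base (NF4)" column can be approximated from the storage format alone. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per first-level block, 8-bit quantised scales, one FP32 scale per 256 first-level blocks) and treats every parameter as quantised. That slightly understates real footprints, because embeddings and norms usually stay in 16-bit. The function names are illustrative.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """Effective storage bits per weight for NF4 with per-block absmax scales."""
    if double_quant:
        # One 8-bit scale per block, plus one FP32 scale per dq_block_size blocks.
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One FP32 scale per block.
        overhead = 32 / block_size
    return 4 + overhead


def nf4_base_gb(n_params, **kwargs):
    """Approximate base-model footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * nf4_bits_per_param(**kwargs) / 8 / 1e9


print(f"{nf4_bits_per_param():.3f} bits/param")              # 4.127 bits/param
print(f"{nf4_base_gb(70e9):.1f} GB for 70B")                 # 36.1 GB for 70B
print(f"{nf4_base_gb(70e9, double_quant=False):.1f} GB without double quant")
```

For 70B parameters this gives about 36 GB, close to the table's rounded ~35 GB. Without double quantisation the scale overhead rises from about 0.127 to 0.5 bits per parameter, which is the roughly 0.4 bits/param saving quoted for double quantisation.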

Trade-offs and known limitations#

Pro: 3-4x less VRAM than BF16 LoRA, with 70B fine-tuning on a single 80 GB H100.
Pro: <0.5 point quality cost on most workloads vs BF16 LoRA; LoftQ / QDoRA recover most of the residual.
Pro: composes with every PEFT / Axolotl / Unsloth recipe — drop-in adoption.
Pro: makes single-engineer fine-tuning of frontier-scale open models economically viable.
Con: 20-40% slower training step than BF16 LoRA (Unsloth narrows to 10-20%).
Con: Merging the adapter directly into the 4-bit base degrades quality; the safe path is dequant → merge in 16-bit → requant.
Con: At long context (32k+), activation memory dominates total VRAM and erodes the relative saving; FlashAttention and gradient checkpointing are needed.
Con: Limited to bitsandbytes-supported model architectures (broad but not universal).
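
The merge limitation in the list above is usually handled by merging in 16-bit. The sketch below is one common way to do that with PEFT's `PeftModel.from_pretrained` and `merge_and_unload`. Note that it reloads the original 16-bit checkpoint rather than dequantising the NF4 weights, which is a simplification of the dequant step. The function name and the paths passed to it are hypothetical.

```python
def merge_qlora_adapter(base_id, adapter_dir, out_dir):
    """Merge a LoRA adapter into a BF16 copy of the base and save the result."""
    # Local imports keep this module importable on machines without the GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # 1. Load the base in BF16 instead of 4-bit (needs ~2 bytes per parameter).
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # 2. Attach the trained adapter and fold its low-rank update into the weights.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    # 3. Save the merged 16-bit checkpoint together with the tokenizer.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
    return out_dir


# Example call (hypothetical paths):
# merge_qlora_adapter("meta-llama/Meta-Llama-3.1-8B",
#                     "./llama3-qlora-adapter", "./llama3-merged-bf16")
```

The saved checkpoint can then be requantised for serving, for example by loading it again with the same `BitsAndBytesConfig` used for training. Because step 1 holds the full 16-bit model, the merge is often run on CPU or on a larger machine than the one used for training.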

Practical implementation notes#

Where QLoRA fits in the Yobitel stack#

References

QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) · arXiv / NeurIPS 2023
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) · arXiv
LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models (Li et al., 2023) · arXiv
bitsandbytes — 8-bit and 4-bit CUDA functions · GitHub
PEFT quantisation guide (Hugging Face) · Hugging Face docs
Answer.AI FSDP-QLoRA — multi-GPU QLoRA training · Answer.AI
Unsloth — accelerated LoRA / QLoRA kernels · GitHub
