LoRA (Low-Rank Adaptation)

TL;DR

LoRA (Hu et al., Microsoft Research, 2021, arXiv:2106.09685) freezes a pretrained weight matrix W and learns a low-rank update delta_W = (alpha/r) * B @ A, where A is (r x d_in), B is (d_out x r), and r is typically 8-64, cutting trainable parameter count by roughly 100-10,000x depending on rank and target modules.
B is initialised to zero so the adapter contributes nothing at step 0; training starts at exactly the pretrained behaviour, and optimiser state is allocated only for A and B, not the base. That is why adapting a 70B model at r=16 needs memory for little more than the base weights: a single 80 GB GPU with a 4-bit base (QLoRA), or a small multi-GPU node with a BF16 base.
Quality is typically 95-99% of a full fine-tune on instruction-tuning and task-specific workloads at a fraction of the cost; the gap shrinks with higher rank and narrows further with the DoRA, PiSSA or LoRA+ refinements (see Variants).
Modern recipe (2026): rank 16-32, alpha = 2 * rank, target every linear layer (q/k/v/o_proj plus gate/up/down_proj), LR 1e-4 to 3e-4, 1-3 epochs; merge into the base for zero-overhead inference, or hot-swap as an adapter under multi-LoRA serving in vLLM / TensorRT-LLM / SGLang.
LoRA or a LoRA-style adapter is the default behind most commercial fine-tune offerings shipping in 2026, including OpenAI fine-tune, Anthropic fine-tune (Claude 3.5 Haiku), Mistral fine-tune, Together fine-tune, Replicate and AWS Bedrock custom models, although several vendors do not document their internals. It is also the substrate that QLoRA, DoRA, AdaLoRA, rsLoRA and PiSSA all build on.

Overview

LoRA (Low-Rank Adaptation) is the parameter-efficient fine-tuning technique that turned LLM adaptation from a privileged-lab activity into something a single engineer could run on one GPU. Edward Hu and colleagues at Microsoft Research published it in June 2021 (arXiv:2106.09685, ICLR 2022) with a memorable headline: a 175B GPT-3 fine-tune that updated 10,000x fewer parameters than full fine-tuning, cut the GPU memory requirement roughly threefold, and performed on par with or better than full fine-tuning on the paper's benchmarks. Five years later, almost every commercial fine-tune product is LoRA underneath; every open-source training framework supports it as a one-line config; and a small ecosystem of refinements (QLoRA, DoRA, AdaLoRA, PiSSA, rsLoRA, LoRA+) has built on the original mechanism.

The intuition is structural rather than computational. The change a fine-tune induces in a pretrained weight matrix has a very low intrinsic rank: the pretrained model already knows most of what the task needs, and adaptation lives in a small subspace defined by which capabilities to up-weight, which to down-weight and which directions to add. If that is true, you do not need to learn a full-rank update; you can learn a low-rank factorisation of it. LoRA writes delta_W as the product of two small matrices B @ A, where A projects the input into an r-dimensional bottleneck and B projects it back. The pretrained W stays frozen — never receives a gradient, never has optimiser state allocated, never gets updated — and only the adapter trains. Once training finishes, delta_W can be folded into W (a single matrix add) so inference runs with no adapter overhead at all, or the adapter can be kept separate and swapped per request, which is the basis of multi-LoRA serving.

The economic consequence is the entire reason the technique took over. Full fine-tuning a 70B model in BF16 needs around 140 GB for weights, 140 GB for gradients and 280-560 GB for AdamW optimiser state (BF16 or FP32 moments): well over 500 GB of HBM before activations, requiring a multi-node DeepSpeed ZeRO-3 or FSDP cluster. The same 70B model with LoRA at r=16 trains roughly 30M parameters when only the attention query and value projections are targeted and roughly 200M when every linear layer is; optimiser state shrinks from hundreds of gigabytes to somewhere between a few hundred megabytes and about 2 GB; gradient memory tracks only the adapter; and weights stay at 140 GB in BF16, or about 35 GB with QLoRA's 4-bit base (see the QLoRA entry). With the 4-bit base the whole run fits on a single 80 GB H100; with a BF16 base the weights alone exceed one 80 GB card and must be sharded across at least two. Training time drops 3-5x. Adapters are small artefacts, typically 10-500 MB, that version cleanly in Git LFS, A/B test cheaply and load per-request alongside one shared base model.
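The memory arithmetic in this paragraph can be reproduced in a few lines. The sketch below is a back-of-envelope budget only: the byte counts (BF16 weights and gradients, FP32 AdamW moments, 0.5 bytes per parameter for a 4-bit base) and the 50M adapter size are assumptions chosen for illustration, and activations are ignored.

```python
# Back-of-envelope training-memory budget, excluding activations.
# Assumed byte counts: BF16 weights and gradients (2 bytes each),
# FP32 AdamW moments (2 x 4 bytes per trainable parameter).

GB = 1e9

def training_memory_gb(n_base, n_trainable, weight_bytes=2.0, grad_bytes=2.0, adam_bytes=8.0):
    """Return (weights, gradients, optimiser state) in GB."""
    weights = n_base * weight_bytes / GB
    grads = n_trainable * grad_bytes / GB
    optim = n_trainable * adam_bytes / GB
    return weights, grads, optim

N_BASE = 70e9

full_ft = training_memory_gb(N_BASE, n_trainable=N_BASE)
# Hypothetical adapter size; the real count depends on rank and target modules.
lora = training_memory_gb(N_BASE, n_trainable=50e6)
# 4-bit base approximated as 0.5 bytes/param (quantisation constants ignored).
qlora = training_memory_gb(N_BASE, n_trainable=50e6, weight_bytes=0.5)

for name, (w, g, o) in [("full FT", full_ft), ("LoRA", lora), ("QLoRA", qlora)]:
    print(f"{name:8s} weights {w:6.1f} GB  grads {g:6.2f} GB  "
          f"optimiser {o:6.2f} GB  total {w + g + o:6.1f} GB")
```

Under these assumptions the adapter's gradients and optimiser state are a rounding error next to the base weights, so the precision of the frozen base decides whether the run fits on one 80 GB card.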

This entry is the conceptual reference for the operator who needs to reason about LoRA as a technique: how the maths works, which variants exist and when each one helps, which hyperparameters matter, how the choice composes with quantisation and multi-tenant serving, and where the limits are when LoRA stops being the right answer. It should also help you decide whether LoRA fits your fine-tune budget and how to run it on Yobibyte or your own GPU. Yobibyte's FineTune resource exposes LoRA (and QLoRA) as first-class methods: customers configure rank, alpha and target modules, and Yobibyte runs the job on their behalf on Yobitel-managed H100 / H200 capacity in UK and EU regions with NCSC OFFICIAL alignment.

How it works: the rank-r decomposition, end-to-end

For each target weight matrix W of shape (d_out, d_in), LoRA introduces two trainable matrices: A of shape (r, d_in) and B of shape (d_out, r), where r — the rank — is a small integer (typically 4, 8, 16, 32 or 64). The original linear layer's forward pass changes from y = Wx to y = Wx + (alpha / r) * B @ A @ x, with W frozen for the entire run. The scalar (alpha / r) is the LoRA scaling factor; it decouples the choice of rank from the effective learning-rate magnitude of the adapter, so you can vary r without re-tuning everything else.

Initialisation is asymmetric and deliberate. A is drawn from a Kaiming-uniform (or small-Gaussian) distribution, B is initialised to zero. The product B @ A is therefore the zero matrix at step 0, which means the LoRA term contributes nothing to the forward pass and training starts at exactly the pretrained behaviour. From step 1 onward, gradients flow through both A and B and the adapter begins to specialise. Crucially only A and B receive gradients: W is wrapped in `requires_grad=False` and the autograd graph never allocates gradient buffers for it. Adam (or any adaptive optimiser) maintains state — first and second moments — only for A and B, which is where almost all the memory saving comes from.

The forward pass mathematics is straightforward but worth writing out because it makes the inference path obvious. During training, the layer computes Wx and (alpha/r) * B @ A @ x as two separate matmul paths and sums the results. On the backward pass, with g the gradient of the loss with respect to the layer output, the gradient for B is (alpha/r) * g @ (A @ x)^T and the gradient for A is (alpha/r) * B^T @ g @ x^T. The frozen W matmul still has to happen on the forward pass (it is what produces the bulk of the output), and the backward pass still multiplies by W^T to propagate gradients to earlier layers; what is skipped is the weight-gradient matmul for W and the buffer that would hold it, which removes about half of the backward matmul cost for the largest matrices in the model. After training, you can either keep the adapter separate (W and B @ A both live in memory; the runtime adds the LoRA contribution per forward pass) or merge it once: W_merged = W + (alpha/r) * B @ A, which is a single matrix add per layer. The merged model has identical structure to the base and runs with zero inference overhead.
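For readers who want the backward path spelled out, the gradients follow from the chain rule applied to the forward expression. The shorthand s, h and g is introduced here for convenience and is not notation from the original paper.

```latex
\begin{aligned}
s &= \alpha / r, \qquad h = A x, \qquad y = W x + s\, B h, \qquad g = \frac{\partial \mathcal{L}}{\partial y} \\
\frac{\partial \mathcal{L}}{\partial B} &= s\, g\, h^{\top} = s\, g\, (A x)^{\top}
  && \in \mathbb{R}^{d_{\text{out}} \times r} \\
\frac{\partial \mathcal{L}}{\partial A} &= s\, B^{\top} g\, x^{\top}
  && \in \mathbb{R}^{r \times d_{\text{in}}} \\
\frac{\partial \mathcal{L}}{\partial x} &= W^{\top} g + s\, A^{\top} B^{\top} g
  && \in \mathbb{R}^{d_{\text{in}}}
\end{aligned}
```

One consequence worth noticing: because B is zero at initialisation, the gradient for A is exactly zero on the very first step, so only B moves at first and A starts to receive signal once B is non-zero.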

The capacity claim is that delta_W is well-approximated by a rank-r matrix even for large W. Empirically this is true for the kinds of adaptation people actually want: instruction-following, domain specialisation, style transfer, single-task fine-tunes. It is less true for adaptations that genuinely shift the model's prior at scale — learning a new language family from scratch, or radically restructuring reasoning behaviour — where a higher rank or full fine-tuning is needed. The 2021 paper provides intuition through the 'intrinsic dimensionality' literature (Li et al., 2018; Aghajanyan et al., 2020) showing that the fine-tuning of pretrained language models has low effective dimensionality even when the parameter space is enormous; LoRA exploits that observation directly.

Parameter count: trainable params = sum over target layers of r * (d_in + d_out). For Llama-3-70B (hidden size 8192, MLP width 28672, 8 KV heads, 80 blocks) with r=16 on all 7 linear projections per block, this is roughly 207M trainable params vs 70B base (about 0.3%); targeting only q_proj and v_proj gives roughly 33M (under 0.05%).
Memory: optimiser state for AdamW at FP32 is 8 bytes/param. 207M trainable params = about 1.7 GB of optimiser state vs 560 GB for full FT of a 70B base, a reduction of roughly 340x.
Initialisation: A ~ Kaiming-uniform, B = 0. B @ A = 0 at step 0, so the model starts at the pretrained behaviour exactly.
Scaling: forward adds (alpha/r) * B @ A @ x to Wx. Common convention alpha = 2 * r (so effective scaling is 2x regardless of rank), making rank an orthogonal capacity dial.
Inference: either merge (W += (alpha/r) * B @ A, zero overhead) or keep separate (multi-LoRA serving, sub-1% latency cost per adapter).
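The parameter-count formula above is easy to evaluate directly. The sketch below assumes the published Llama-3-70B dimensions (hidden size 8192, 8 KV heads of size 128, MLP width 28672, 80 blocks); if your checkpoint differs, substitute its shapes.

```python
# Count LoRA parameters: r * (d_in + d_out) per target matrix, summed over blocks.
# Shapes are assumed Llama-3-70B dimensions, listed as (d_in, d_out).

LLAMA3_70B = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),      # grouped-query attention: 8 KV heads x 128
    "v_proj": (8192, 1024),
    "o_proj": (8192, 8192),
    "gate_proj": (8192, 28672),
    "up_proj": (8192, 28672),
    "down_proj": (28672, 8192),
}
N_BLOCKS = 80

def lora_params(shapes, targets, r, n_blocks):
    """Total trainable adapter parameters for the given target modules."""
    per_block = sum(r * (d_in + d_out)
                    for name, (d_in, d_out) in shapes.items() if name in targets)
    return per_block * n_blocks

all_linear = lora_params(LLAMA3_70B, set(LLAMA3_70B), r=16, n_blocks=N_BLOCKS)
qv_only = lora_params(LLAMA3_70B, {"q_proj", "v_proj"}, r=16, n_blocks=N_BLOCKS)

print(f"all linear: {all_linear:,}")   # 207,093,760
print(f"q/v only:   {qv_only:,}")      # 32,768,000
print(f"AdamW state at 8 bytes/param: {all_linear * 8 / 1e9:.2f} GB")
```

Under these assumed shapes the count is very sensitive to the target-module list: the three MLP projections account for roughly two-thirds of the all-linear total.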

python

# lora_minimal.py — runs with: pip install torch && python lora_minimal.py
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class LoRALinear(nn.Module):
    """Drop-in replacement for nn.Linear with a frozen base + rank-r adapter."""
    def __init__(self, in_features: int, out_features: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.in_features  = in_features
        self.out_features = out_features
        self.r            = r
        self.scaling      = alpha / r

        # Base weight: frozen.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

        # LoRA matrices: trainable. A is Kaiming, B is zero.
        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = Wx + (alpha/r) * B @ A @ x
        return F.linear(x, self.weight) + self.scaling * F.linear(F.linear(x, self.lora_A), self.lora_B)

    def merged_weight(self) -> torch.Tensor:
        """Return W + (alpha/r) * B @ A for zero-overhead inference."""
        return self.weight + self.scaling * (self.lora_B @ self.lora_A)

# Smoke test: confirm the adapter starts at zero and learns something.
layer = LoRALinear(in_features=128, out_features=256, r=8, alpha=16)
x      = torch.randn(4, 128)
target = torch.randn(4, 256)

# Step 0: LoRA output equals base output (B=0).
with torch.no_grad():
    base_only = F.linear(x, layer.weight)
    full      = layer(x)
    assert torch.allclose(base_only, full), "B should be zero at init"

opt = torch.optim.AdamW([layer.lora_A, layer.lora_B], lr=2e-4)
for step in range(500):
    out  = layer(x)
    loss = F.mse_loss(out, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        n_trainable = layer.lora_A.numel() + layer.lora_B.numel()
        n_total     = n_trainable + layer.weight.numel()
        print(f"step {step:>3} loss {loss.item():.4f} trainable {n_trainable}/{n_total} ({100*n_trainable/n_total:.2f}%)")
# Expect: trainable 3072/35840 (8.57% of this one small layer); the printed loss should decrease across steps.

The implementation above is illustrative. Production use should go through huggingface/peft (`LoraConfig` + `get_peft_model`), which injects LoRA into every module matched by `target_modules`, exposes a `bias` option for training bias terms, supports `merge_and_unload()` and integrates with Trainer / SFTTrainer / DPOTrainer without code changes.
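The claim that merging leaves the forward pass unchanged can be checked by hand on a tiny example. The sketch below uses plain Python lists and arbitrary illustrative values rather than a real layer; the helper names `matmul` and `add` are made up for this example.

```python
# Pure-Python check that merging preserves the forward pass:
#   W x + s * B (A x)  ==  (W + s * B A) x
# All matrix values are arbitrary illustrative numbers.

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def add(P, Q, scale=1.0):
    """Return P + scale * Q, elementwise."""
    return [[p + scale * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]   # frozen base, shape (d_out=2, d_in=3)
A = [[1.0, 2.0, -1.0]]                      # shape (r=1, d_in=3)
B = [[0.5], [-2.0]]                         # shape (d_out=2, r=1)
x = [[1.0], [2.0], [3.0]]                   # input as a column vector
alpha, r = 2, 1
s = alpha / r

unmerged = add(matmul(W, x), matmul(B, matmul(A, x)), scale=s)  # adapter kept separate
W_merged = add(W, matmul(B, A), scale=s)                        # one matrix add per layer
merged = matmul(W_merged, x)

print(unmerged, merged)  # [[6.5], [-8.0]] [[6.5], [-8.0]]
```

The values here are chosen so the arithmetic is exact; with real BF16 or FP16 weights the two paths agree only up to rounding, which is worth remembering when comparing merged and unmerged outputs.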

Variants and architectural choices: the LoRA family in 2026

The original 2021 LoRA recipe has been refined into a small family of variants, each addressing a specific shortcoming. The table below lists the variants with library support as of mid-2026; pick by what you are trying to fix.

Variant	Year	Key change	What it improves	Library support
LoRA (original)	2021	Two trainable matrices A, B added to frozen W	Baseline — sets the standard	PEFT, Axolotl, Unsloth, every framework
AdaLoRA	2023	Dynamically allocates rank budget across layers via importance scoring	Uses fewer total params for the same quality	PEFT (TaskType + AdaLoraConfig)
LoRA+	2024	Asymmetric learning rates: LR_B = 16 * LR_A	Faster convergence, ~1-2% quality bump	PEFT (loraplus_lr_ratio), Axolotl
rsLoRA	2024	Scale = alpha / sqrt(r) instead of alpha / r	Stable training at high rank (r=128, r=256)	PEFT (use_rslora=True), Axolotl
DoRA	2024	Decompose W into a magnitude vector m and a direction matrix; apply LoRA only to the direction	Closes LoRA-vs-full-FT quality gap; ~1-2 pts on most tasks	PEFT (use_dora=True), Unsloth, Axolotl
PiSSA	2024	Init A, B from SVD of W (principal components first)	Faster convergence; better at low rank (r=4, r=8)	PEFT (init_lora_weights='pissa')
LoftQ	2024	Init A, B to compensate for quantisation error in QLoRA	Recovers quality lost to 4-bit quantisation	PEFT (init_lora_weights='loftq')
VeRA	2024	Share random A, B across layers; train only per-layer scaling vectors	10x smaller adapter file than LoRA	PEFT (VeraConfig)
QLoRA	2023	4-bit NF4 base + LoRA in BF16 on top (see QLoRA entry)	Fits 70B fine-tune on single 80GB GPU	PEFT + bitsandbytes, Unsloth, Axolotl

In 2026 the default-of-defaults is LoRA + rsLoRA scaling + LoRA+ asymmetric LRs, optionally with DoRA on top when the quality gap to full FT matters. PiSSA is the right initialisation when you must work at very low rank (r=4-8) for adapter-size reasons. AdaLoRA is rarely worth the complexity unless you are running an extensive sweep across model families.
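To make the rsLoRA row concrete, the sketch below compares the two scaling rules from the table. The function name `lora_scale` and the fixed alpha of 16 are illustrative choices, not library API.

```python
import math

def lora_scale(alpha, r, rslora=False):
    """Adapter scaling factor: alpha/r (original) or alpha/sqrt(r) (rsLoRA)."""
    return alpha / math.sqrt(r) if rslora else alpha / r

FIXED_ALPHA = 16  # hypothetical fixed alpha, to show how each rule reacts to rank
for r in (8, 16, 64, 256):
    print(f"r={r:<3d} alpha/r={lora_scale(FIXED_ALPHA, r):.4f}  "
          f"alpha/sqrt(r)={lora_scale(FIXED_ALPHA, r, rslora=True):.4f}")

# With the alpha = 2 * r convention the original rule is constant at 2.0,
# while the rsLoRA rule grows as 2 * sqrt(r).
for r in (16, 32, 64):
    print(r, lora_scale(2 * r, r), round(lora_scale(2 * r, r, rslora=True), 2))
```

The second loop shows why the two conventions should not be combined blindly: switching on rsLoRA while keeping alpha = 2 * r multiplies the effective scale (8.0 at r=16, about 11.3 at r=32), so alpha or the learning rate may need revisiting.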

Where it is used today: the commercial and open-source landscape

By 2026, LoRA is the substrate behind essentially every commercial LLM fine-tuning offering. OpenAI's fine-tune API for GPT-4o, GPT-4o mini and o-series models is LoRA underneath (the API surface is opaque but the artefact size, training time and serving economics all match a LoRA-style implementation). Anthropic's Claude 3.5 Haiku fine-tune service is similarly LoRA-style. Mistral's fine-tune API exposes LoRA directly. Together's fine-tune product, Replicate's fine-tune, Fireworks' fine-tune, AWS Bedrock's custom model service, Google Vertex's tuning endpoints — all LoRA or LoRA-adjacent. The reason is uniformly economic: only LoRA's per-customer-adapter cost economics make multi-tenant fine-tuning viable at the price points these services advertise (typically $1-10 per million training tokens).

On the open-source side, every major fine-tuning framework treats LoRA as the default config path. Hugging Face PEFT is the reference implementation and the substrate the rest of the ecosystem builds on. Axolotl exposes LoRA, QLoRA and DoRA as one-line YAML config (`adapter: lora`, `lora_r: 32`, `lora_target_modules: [...]`). Unsloth wraps PEFT with custom Triton kernels for 2x throughput on single-GPU LoRA runs. LLaMA-Factory adds a web UI and 100+ pre-registered model templates with sensible LoRA defaults per model. NVIDIA NeMo and Microsoft DeepSpeed both ship native LoRA support for distributed training. The frontier closed labs use LoRA internally for safety fine-tunes, customer-specific adapters and rapid iteration on instruction recipes; the public open-weights releases (Llama 3.1 Instruct, Mistral Large Instruct, Qwen3 Instruct) are typically full fine-tunes from a base, but the instruction variants commonly ship with documented LoRA fine-tuning recipes for downstream specialisation.

Multi-LoRA serving, the operational pattern that makes per-customer adaptation economic, is the third place LoRA shows up. vLLM has shipped multi-LoRA support since 0.3 with the `--enable-lora` flag and a per-request `lora_request` parameter; the base model loads once, a bounded set of adapters stays resident on the GPU while hundreds more wait in CPU memory, and each request in a batch is routed to its own adapter, with the low-rank matmuls applied alongside every targeted linear layer by batched kernels. TensorRT-LLM exposes the same pattern through its `LoraConfig` build option and runtime API. SGLang adds it via `--lora-paths`. The deployment shape is one replica (a multi-GPU node for a BF16 70B base, or a single H100 when the base is quantised) hosting dozens of LoRA adapters concurrently, each serving a customer or a task at a fraction of the cost of running independent fine-tuned model copies; this is the architectural basis for how providers such as OpenAI, Anthropic and Mistral can offer fine-tuning as a service at sub-cent-per-1k-tokens prices.
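The routing idea can be illustrated without any serving framework. The toy below is a sketch, not how vLLM or any other runtime is implemented: the adapter names and values are hypothetical, and real servers run the per-adapter work in batched GPU kernels.

```python
# Toy multi-LoRA forward: one shared base matrix, per-request adapter lookup.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]  # shared frozen base (identity, for readability)

ADAPTERS = {
    # name: (A of shape (r, d_in), B of shape (d_out, r), scale = alpha / r)
    "customer-a": ([[1.0, 1.0]], [[1.0], [0.0]], 2.0),
    "customer-b": ([[1.0, -1.0]], [[0.0], [1.0]], 0.5),
}

def forward(x, adapter=None):
    y = matvec(W, x)                      # base path, identical for every request
    if adapter is not None:
        A, B, scale = ADAPTERS[adapter]
        delta = matvec(B, matvec(A, x))   # two small matmuls: down-project, up-project
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y

batch = [([1.0, 2.0], "customer-a"), ([1.0, 2.0], "customer-b"), ([1.0, 2.0], None)]
for x, name in batch:
    print(name, forward(x, name))
```

The same input produces [7.0, 2.0], [1.0, 1.5] and [1.0, 2.0] for the three requests: the expensive base matmul is shared, and only the two small adapter matmuls differ per request.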

The illustrative end-to-end fine-tune below uses Hugging Face PEFT + TRL on a single H100; a run of this shape takes a few hours for an 8B base on a 50k-example dataset, and the snippet itself trains on a 10k-example subset to keep it short. Production teams typically wrap this in Axolotl or LLaMA-Factory for the YAML/UX, but the underlying call shape is the same.

python

# lora_finetune.py — runs with: pip install peft trl transformers datasets bitsandbytes accelerate
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

BASE = "meta-llama/Meta-Llama-3.1-8B"

# 1. Load base in BF16 (use QLoRA = load_in_4bit if VRAM-constrained).
tokenizer = AutoTokenizer.from_pretrained(BASE)
model     = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="bfloat16", device_map="auto")

# 2. Wrap with LoRA. Target every linear in attention and MLP.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,                                   # rank
    lora_alpha=64,                          # alpha = 2 * r convention
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,                        # scale by alpha/sqrt(r) (about 11.3 here) instead of alpha/r (2.0)
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: ~84M (1.0% of 8B base)

# 3. Train with TRL SFTTrainer.
ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(10_000))
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,             # TRL >= 0.12; older releases take tokenizer=tokenizer
    train_dataset=ds,
    args=SFTConfig(
        output_dir="./out-llama3-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        gradient_checkpointing=True,
        logging_steps=10,
        save_steps=200,
    ),
)
trainer.train()

# 4. Merge for zero-overhead inference, OR push the adapter separately for multi-LoRA serving.
merged = model.merge_and_unload()
merged.save_pretrained("./llama3-merged")
# Alternative: model.save_pretrained("./llama3-adapter")  # ~330MB adapter only

Trade-offs and known limitations

LoRA is not free quality. On tasks that genuinely require shifting the model's prior at scale — learning a new language from scratch, restructuring complex reasoning behaviour, large domain shifts where the pretrained distribution barely overlaps the target — full fine-tuning still wins. The gap is small on most workloads (typically within 1-2 points on standard benchmarks like MT-Bench, IFEval, MMLU) and shrinks further at higher ranks, but it exists. DoRA closes most of the remaining gap; PiSSA helps at low rank; QLoRA gives up another fraction of a point in exchange for a 4x memory saving on the base. The pragmatic 2026 default is: LoRA first, QLoRA when memory-constrained, DoRA when LoRA underperforms the bar, full FT only when you have hard evidence neither works and the budget to run a multi-node cluster.

Rank selection is the most consequential hyperparameter and the one most often mis-set. Lower rank (r=4-8) underfits on harder tasks and gives a measurable quality gap; higher rank (r=64-256) adds compute and training time without much benefit on standard instruction-tuning workloads, though it helps for continued pretraining and large domain shifts. The default that works for almost every task is r=16 or r=32 with alpha=2*r. If you are doing rapid iteration across hundreds of customer adapters, r=8 often suffices; if you are pushing for maximum quality on one specific task, r=64-128 with rsLoRA scaling is worth trying. Always sweep at least three ranks before committing.

Target-module choice is the second most consequential. The original 2021 paper targeted only the attention Q and V projections, a recipe that leaves measurable quality on the table for modern LLMs. The 2026 default is every linear layer in both attention (q_proj, k_proj, v_proj, o_proj) and MLP (gate_proj, up_proj, down_proj for SwiGLU-based models). Restricting to attention only removes roughly 60-70% of adapter parameters on Llama-style models, because the three MLP projections are the widest matrices in each block, but loses 1-3 points on most benchmarks; the saving is rarely worth it. Note that the embedding layer and the LM head are typically NOT targeted by LoRA: they are huge (vocab x d_model), they generally do not need to change much for instruction tuning, and excluding them keeps the adapter file small.

Multi-LoRA serving has a quiet cost: when adapters are kept separate (not merged), each forward pass pays an extra 2 small matmuls per layer per adapter. vLLM's implementation handles this efficiently with batched adapter computation, but it is not free. Latency increase per adapter is typically <1% at small rank and adapter-batched configurations; it grows roughly linearly with the number of distinct adapters in the batch. For workloads where one customer = one adapter and the adapter is hot, merging into the base for that customer and serving a dedicated replica is cheaper at scale. For workloads with thousands of low-traffic adapters, multi-LoRA on a shared base wins comfortably.

Quality cannot be recovered from aggressive base-model quantisation through LoRA. If you quantise the base to INT4 with a poorly calibrated scheme and the activations exceed the calibration range, LoRA on top cannot compensate: the base is already producing degraded representations and the adapter has nowhere useful to push them. QLoRA works because NF4 is information-theoretically near-optimal for normally distributed weights AND because training runs the forward pass through the dequantised 4-bit weights with compute in BF16, so the adapter learns to compensate for the quantisation error in context. Naive 'quantise then LoRA on top' workflows (e.g. INT8 base + BF16 LoRA) typically work poorly compared to the QLoRA-specific recipe. Always use the documented QLoRA recipe (NF4 + double quantisation + paged AdamW 8-bit) rather than mixing-and-matching quantisation schemes.

Pro: roughly 100-10,000x fewer trainable parameters than full FT, depending on rank and target modules, with optimiser-state memory shrinking by the same factor.
Pro: adapters are portable, mergeable and hot-swappable; one base can serve hundreds of LoRAs.
Pro: composes cleanly with quantisation (QLoRA) and weight decomposition (DoRA).
Pro: every major commercial fine-tune service uses LoRA under the hood — battle-tested at scale.
Con: 1-2 point quality gap vs full FT on hard or large-distribution-shift tasks; mitigated by DoRA.
Con: requires empirical tuning of rank and target modules; defaults work but optima vary by task.
Con: cannot recover quality lost to mis-calibrated base-model quantisation outside QLoRA recipe.
Con: embedding and LM-head changes (vocabulary expansion, new special tokens) need separate handling — LoRA targets linears only by default.

Practical implementation notes

Libraries that implement LoRA well in 2026: Hugging Face PEFT (huggingface/peft) is the canonical implementation and the substrate everything else builds on; start here unless you have a specific reason not to. Unsloth (unslothai/unsloth) wraps PEFT with custom Triton kernels and delivers roughly 2x throughput on single-GPU LoRA runs for supported model families (Llama, Mistral, Gemma, Qwen, Phi, DeepSeek). Axolotl (axolotl-ai-cloud/axolotl) provides YAML-driven LoRA configs and is the recipe-of-choice for many open-model release teams. LLaMA-Factory (hiyouga/LLaMA-Factory) adds a Gradio UI and broad model coverage. NVIDIA NeMo, Microsoft DeepSpeed and TRL all support PEFT-format LoRA out of the box. For serving, vLLM (`--enable-lora`), TensorRT-LLM (`--lora_dir`) and SGLang (`--lora-paths`) cover multi-LoRA at production scale.

Hyperparameter defaults that work for instruction-tuning a 7-70B model in 2026: rank r=16 or r=32 (sweep [8, 16, 32, 64] if you have budget); alpha = 2 * r (so 32 or 64); dropout 0.05 on small datasets (<10k examples) or 0 on large; target every linear in attention + MLP; learning rate 2e-4 with cosine decay to 1e-5 and 3% warmup; effective batch size 32-128 via gradient accumulation; 1-3 epochs (more epochs over-fit on small SFT datasets); bf16 throughout; gradient checkpointing on. Enable rsLoRA (`use_rslora=True`) if you push rank above 64; add LoRA+ (`loraplus_lr_ratio=16`) for marginally faster convergence; add DoRA (`use_dora=True`) when LoRA at sensible ranks plateaus short of the quality bar.
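The schedule and batch-size arithmetic in these defaults can be sketched as follows. This is a hand-rolled illustration of "3% warmup, cosine decay to a floor", not the scheduler any particular trainer uses; the function name `lr_at` and the run dimensions are hypothetical.

```python
import math

def lr_at(step, total_steps, peak=2e-4, floor=1e-5, warmup_ratio=0.03):
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 50,000 examples, 1 epoch, effective batch size 64.
examples, epochs = 50_000, 1
per_device_batch, grad_accum = 4, 16
effective_batch = per_device_batch * grad_accum
total_steps = math.ceil(examples * epochs / effective_batch)

print("effective batch:", effective_batch, "optimiser steps:", total_steps)
for step in (0, 22, 23, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With these numbers the run has 782 optimiser steps, 23 of them warmup, so a single epoch over 50k examples is short enough that the warmup fraction and the decay floor both matter.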

Adapter portability is one of LoRA's underrated wins but has discipline requirements. Always pin both the PEFT version and the Transformers version at training and serving time: the two libraries co-evolve quickly and adapter checkpoints occasionally need format migrations between minor versions. Save adapters with `model.save_pretrained()` (safetensors format, ~10-500 MB for a typical 7B-70B LoRA), which produces a directory containing `adapter_config.json` + `adapter_model.safetensors`. The config references the base model by Hugging Face ID; loading requires the same base. For multi-LoRA serving, place adapter directories under a shared path and register them with the serving runtime. For version control, adapters below about 100 MB can be committed to a plain Git repository; larger ones belong in Git LFS.
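A cheap guard against loading an adapter onto the wrong base is to read `adapter_config.json` before serving. The sketch below assumes the field names PEFT writes at the time of writing (`base_model_name_or_path`, `r`, `lora_alpha`, `target_modules`); the helper `check_adapter` and the sample values are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def check_adapter(adapter_dir, expected_base):
    """Read adapter_config.json and confirm it was trained against `expected_base`."""
    cfg = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    base = cfg.get("base_model_name_or_path")
    if base != expected_base:
        raise ValueError(f"adapter expects base {base!r}, serving {expected_base!r}")
    return {"r": cfg.get("r"), "lora_alpha": cfg.get("lora_alpha"),
            "target_modules": sorted(cfg.get("target_modules", []))}

# Demo with a hand-written config; the values are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "base_model_name_or_path": "meta-llama/Meta-Llama-3.1-8B",
        "r": 32,
        "lora_alpha": 64,
        "target_modules": ["v_proj", "q_proj"],
    }
    (Path(tmp) / "adapter_config.json").write_text(json.dumps(sample))
    summary = check_adapter(tmp, "meta-llama/Meta-Llama-3.1-8B")
    print(summary)
    try:
        check_adapter(tmp, "some-other/base")
    except ValueError as err:
        rejected = str(err)
        print("rejected:", rejected)
```

Running the same check in CI when an adapter is published, and again at server start-up, catches most base-mismatch mistakes before they show up as unexplained quality drops.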

Common failure modes and their fixes:
(1) Adapter does nothing measurable in evaluation: check that the target_modules names actually exist in the model. The naming convention varies between architectures (Llama uses q_proj/k_proj, GPT-NeoX uses query_key_value, Mistral matches Llama). PEFT raises an error when none of the names match, but a partial mismatch can leave you with a smaller adapter than intended.
(2) Training loss decreases but eval quality drops: epochs too high (try 1 epoch first), learning rate too high (drop to 1e-4), or the dataset is over-represented on one task type.
(3) Adapter merge produces a model that responds wildly differently from the un-merged adapter: scaling factor calculation bug; confirm (alpha/r) is applied during merge.
(4) Multi-LoRA serving slows down dramatically with more adapters: adapters are being fetched from CPU memory per request; use vLLM's `max_loras` setting to keep the hot set in GPU memory.
(5) QLoRA-trained adapter quality drops sharply when merged back into a BF16 base: see the QLoRA-specific merge caveat in the QLoRA entry; dequantise the base to BF16 before merging.
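The first failure mode is easy to catch before training starts. The sketch below checks each target name against a list of module names using suffix matching; the helper `unmatched_targets` and the GPT-NeoX-style names are illustrative, and on a real model the list would come from `[name for name, _ in model.named_modules()]`.

```python
def unmatched_targets(module_names, target_modules):
    """Return target names that match no module, using suffix matching."""
    def matches(name, target):
        return name == target or name.endswith("." + target)
    return [t for t in target_modules if not any(matches(n, t) for n in module_names)]

# Hypothetical module names in the style of a GPT-NeoX checkpoint.
neox_style = [
    "gpt_neox.layers.0.attention.query_key_value",
    "gpt_neox.layers.0.attention.dense",
    "gpt_neox.layers.0.mlp.dense_h_to_4h",
    "gpt_neox.layers.0.mlp.dense_4h_to_h",
]

print(unmatched_targets(neox_style, ["q_proj", "v_proj"]))          # ['q_proj', 'v_proj']
print(unmatched_targets(neox_style, ["query_key_value", "dense"]))  # []
```

Failing the job when this list is non-empty is cheaper than discovering after a full run that half the intended modules were never adapted.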

Teams that prefer not to run this plumbing themselves can submit the same fine-tune as a Yobibyte FineTune job — base, dataset, rank, alpha, target modules, epochs — and consume the resulting adapter through Yobibyte's multi-LoRA inference surface, paying only for tokens generated rather than dedicated training and serving capacity.

The single most common LoRA mis-configuration is targeting only `q_proj` and `v_proj` — the original 2021 paper recipe — on a modern LLM. That recipe was tuned for the GPT-2/GPT-3 architectures of the era and leaves measurable quality on the table for SwiGLU-based Llama/Mistral/Qwen architectures. The 2026 default is every linear in both attention and MLP: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'].

Where LoRA fits in the Yobitel stack

Customers fine-tuning models on Yobibyte work in the LoRA paradigm by default. A FineTune job in the customer-facing API takes a base model, a dataset reference and a short set of hyperparameters (rank, learning rate, epochs) and returns a portable adapter the customer owns. The underlying training runs on industry-standard PEFT plumbing on Yobitel-managed H100 / H200 capacity, with the orchestration, dataset I/O and adapter storage handled transparently — the customer sees a job that takes minutes to a few hours and produces a small adapter artefact.

Multi-LoRA serving is the deployment shape Yobibyte's inference surface uses to make per-customer adaptation economic. One base model (e.g. Llama 3.1 70B or Mixtral 8x22B) loads once on a shared replica; customer adapters are loaded on demand from object storage, kept hot in GPU memory while they are receiving traffic, and aged out when they go cold. Customers see an OpenAI-compatible `model` parameter that resolves to base + their adapter; they pay only for the tokens they generate, not for dedicated capacity for their adapter. The multi-LoRA recipe is what allows Yobibyte to advertise fine-tuned inference at near-base-model rates.
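The "kept hot, aged out when cold" behaviour described here is essentially a least-recently-used cache keyed by adapter. The sketch below illustrates that idea only; it is not a description of Yobibyte's or any serving runtime's implementation, and `AdapterCache` and `load_fn` are hypothetical names.

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most `capacity` adapters resident; evict the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader, e.g. fetch from object storage
        self.resident = OrderedDict()
        self.loads = 0

    def get(self, adapter_id):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)   # mark as most recently used
            return self.resident[adapter_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)       # evict the coldest adapter
        self.resident[adapter_id] = self.load_fn(adapter_id)
        self.loads += 1
        return self.resident[adapter_id]

cache = AdapterCache(capacity=2, load_fn=lambda name: f"weights:{name}")
for request in ["a", "b", "a", "c", "b"]:
    cache.get(request)

print(list(cache.resident), cache.loads)  # ['c', 'b'] 4
```

Five requests over three adapters with room for two trigger four loads, which shows why the size of the resident set relative to the number of concurrently active adapters dominates tail latency.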

InferenceBench measures fine-tuned adapter quality alongside base-model quality on its evaluation suites, so customers can compare the empirical quality of LoRA-adapted models against the dense bases they were derived from. The data confirms the textbook claim above: LoRA at r=16-32 consistently lands within 1-2 points of full FT on instruction-tuning benchmarks, at a tiny fraction of the training cost and with hot-swap serving economics dense fine-tunes cannot match.

References

TL;DR

LoRA (Hu et al., Microsoft Research, 2021, arXiv:2106.09685) freezes a pretrained weight matrix W and learns a low-rank update delta_W = (alpha/r) * B @ A, where A is (r x d_in), B is (d_out x r), and r is typically 8-64 — cutting trainable parameter count by 1,000-10,000x.
B is initialised to zero so the adapter contributes nothing at step 0; training starts at the pretrained behaviour exactly, and optimiser state is allocated only for A and B, not the base — which is why a 70B model with r=16 fits in single-GPU memory for adaptation work.
Quality is typically 95-99% of a full fine-tune on instruction-tuning and task-specific workloads at a fraction of the cost; the gap shrinks with higher rank and disappears almost entirely with DoRA, PiSSA or LoRA+ refinements (see Variants).
Modern recipe (2026): rank 16-32, alpha = 2 * rank, target every linear layer (q/k/v/o_proj plus gate/up/down_proj), LR 1e-4 to 3e-4, 1-3 epochs; merge into the base for zero-overhead inference, or hot-swap as an adapter under multi-LoRA serving in vLLM / TensorRT-LLM / SGLang.
LoRA is the default behind every commercial fine-tune offering shipping in 2026 — OpenAI fine-tune, Anthropic fine-tune (Claude 3.5 Haiku), Mistral fine-tune, Together fine-tune, Replicate, AWS Bedrock custom models — and the substrate that QLoRA, DoRA, AdaLoRA, rsLoRA and PiSSA all build on.

Overview#

How it works: the rank-r decomposition, end-to-end#

Parameter count: trainable params = sum over target layers of r * (d_in + d_out). For Llama-3-70B with r=16 on all 7 linear projections per block, this is roughly 35-50M trainable params vs 70B base (under 0.1%).
Memory: optimiser state for AdamW at FP32 is 8 bytes/param. 50M trainable params = 400 MB optimiser state vs 560 GB for full FT of a 70B base — a 1,400x reduction.
Initialisation: A ~ Kaiming-uniform, B = 0. B @ A = 0 at step 0, so the model starts at the pretrained behaviour exactly.
Scaling: forward adds (alpha/r) * B @ A @ x to Wx. Common convention alpha = 2 * r (so effective scaling is 2x regardless of rank), making rank an orthogonal capacity dial.
Inference: either merge (W += (alpha/r) * B @ A, zero overhead) or keep separate (multi-LoRA serving, sub-1% latency cost per adapter).

python

# lora_minimal.py — runs with: pip install torch && python lora_minimal.py
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class LoRALinear(nn.Module):
    """Drop-in replacement for nn.Linear with a frozen base + rank-r adapter."""
    def __init__(self, in_features: int, out_features: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.in_features  = in_features
        self.out_features = out_features
        self.r            = r
        self.scaling      = alpha / r

        # Base weight: frozen.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

        # LoRA matrices: trainable. A is Kaiming, B is zero.
        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = Wx + (alpha/r) * B @ A @ x
        return F.linear(x, self.weight) + self.scaling * F.linear(F.linear(x, self.lora_A), self.lora_B)

    def merged_weight(self) -> torch.Tensor:
        """Return W + (alpha/r) * B @ A for zero-overhead inference."""
        return self.weight + self.scaling * (self.lora_B @ self.lora_A)

# Smoke test: confirm the adapter starts at zero and learns something.
layer = LoRALinear(in_features=128, out_features=256, r=8, alpha=16)
x      = torch.randn(4, 128)
target = torch.randn(4, 256)

# Step 0: LoRA output equals base output (B=0).
with torch.no_grad():
    base_only = F.linear(x, layer.weight)
    full      = layer(x)
    assert torch.allclose(base_only, full), "B should be zero at init"

opt = torch.optim.AdamW([layer.lora_A, layer.lora_B], lr=2e-4)
for step in range(500):
    out  = layer(x)
    loss = F.mse_loss(out, target)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        n_trainable = layer.lora_A.numel() + layer.lora_B.numel()
        n_total     = n_trainable + layer.weight.numel()
        print(f"step {step:>3} loss {loss.item():.4f} trainable {n_trainable}/{n_total} ({100*n_trainable/n_total:.2f}%)")
# Expect: trainable 8.57% of total (3072/35840); loss should fall steadily over the run.
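
The merge path can be sanity-checked numerically. The sketch below is self-contained and uses random stand-ins for trained A and B (an assumption made purely for illustration); it confirms that folding (alpha/r) * B @ A into W gives the same outputs as running the adapter branch separately, and that the update has rank at most r.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 64, 32, 4, 8
scaling = alpha / r

W = torch.randn(d_out, d_in)      # frozen base weight
A = torch.randn(r, d_in) * 0.1    # stand-in for a trained A (r x d_in)
B = torch.randn(d_out, r) * 0.1   # stand-in for a trained B (d_out x r)
x = torch.randn(5, d_in)

# Unmerged path: base output plus the scaled low-rank branch.
y_adapter = x @ W.T + scaling * (x @ A.T @ B.T)

# Merged path: fold the update into one dense matrix, then a single matmul.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

max_diff = (y_adapter - y_merged).abs().max().item()
delta_rank = torch.linalg.matrix_rank(B @ A).item()
print(f"max abs difference: {max_diff:.2e}, rank of B @ A: {delta_rank}")
```

The two paths agree up to floating-point rounding, which is why merging adds no inference overhead: the merged layer is an ordinary dense matmul of the original shape.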

Variants and architectural choices: the LoRA family in 2026#

Variant	Year	Key change	What it improves	Library support
LoRA (original)	2021	Two trainable matrices A, B added to frozen W	Baseline — sets the standard	PEFT, Axolotl, Unsloth, every framework
AdaLoRA	2023	Dynamically allocates rank budget across layers via importance scoring	Uses fewer total params for the same quality	PEFT (TaskType + AdaLoraConfig)
LoRA+	2024	Asymmetric learning rates: LR_B = 16 * LR_A	Faster convergence, ~1-2% quality bump	PEFT (loraplus_lr_ratio), Axolotl
rsLoRA	2023	Scale = alpha / sqrt(r) instead of alpha / r	Stable training at high rank (r=128, r=256)	PEFT (use_rslora=True), Axolotl
DoRA	2024	Decompose W into a magnitude vector m and a direction matrix; apply LoRA only to the direction	Closes LoRA-vs-full-FT quality gap; ~1-2 pts on most tasks	PEFT (use_dora=True), Unsloth, Axolotl
PiSSA	2024	Init A, B from SVD of W (principal components first)	Faster convergence; better at low rank (r=4, r=8)	PEFT (init_lora_weights='pissa')
LoftQ	2023	Init A, B to compensate for quantisation error in QLoRA	Recovers quality lost to 4-bit quantisation	PEFT (init_lora_weights='loftq')
VeRA	2023	Share random A, B across layers; train only per-layer scaling vectors	10x smaller adapter file than LoRA	PEFT (VeraConfig)
QLoRA	2023	4-bit NF4 base + LoRA in BF16 on top (see QLoRA entry)	Fits 70B fine-tune on single 80GB GPU	PEFT + bitsandbytes, Unsloth, Axolotl
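
To make the rsLoRA row concrete, the sketch below compares the two scaling rules as rank grows. `lora_scale` is a hypothetical helper written for this illustration, not a library function; it only evaluates the two formulas from the table.

```python
import math

def lora_scale(alpha, r, rank_stabilised=False):
    """Adapter scale: alpha / r (original LoRA) or alpha / sqrt(r) (rsLoRA)."""
    return alpha / math.sqrt(r) if rank_stabilised else alpha / r

# Hold alpha fixed to see how each rule reacts as rank grows.
ALPHA = 16
for r in (8, 32, 128, 256):
    print(f"r={r:>3}  alpha/r={lora_scale(ALPHA, r):.4f}  "
          f"alpha/sqrt(r)={lora_scale(ALPHA, r, rank_stabilised=True):.4f}")

# The alpha = 2 * r convention under each rule, for r = 32.
print(lora_scale(64, 32), round(lora_scale(64, 32, rank_stabilised=True), 2))
```

With alpha held fixed, the original rule shrinks the update in proportion to 1/r, while the rank-stabilised rule shrinks it only as 1/sqrt(r). Note that combining rsLoRA with alpha = 2 * r no longer yields a scale of 2: at r = 32 it is about 11.3, so alpha may need retuning when the flag is switched on.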

Where it is used today: the commercial and open-source landscape#

python

# lora_finetune.py — runs with: pip install peft trl transformers datasets bitsandbytes accelerate
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

BASE = "meta-llama/Meta-Llama-3.1-8B"

# 1. Load base in BF16 (use QLoRA = load_in_4bit if VRAM-constrained).
tokenizer = AutoTokenizer.from_pretrained(BASE)
model     = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="bfloat16", device_map="auto")

# 2. Wrap with LoRA. Target every linear in attention and MLP.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,                                   # rank
    lora_alpha=64,                          # alpha = 2 * r convention
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,                        # rsLoRA: scale is alpha / sqrt(r) (~11.3 here), not alpha / r
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: ~84M (1.0% of 8B base)

# 3. Train with TRL SFTTrainer.
ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(10_000))
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,             # this argument was named `tokenizer=` before TRL 0.12
    train_dataset=ds,
    args=SFTConfig(
        output_dir="./out-llama3-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        gradient_checkpointing=True,
        logging_steps=10,
        save_steps=200,
    ),
)
trainer.train()

# 4. Merge for zero-overhead inference, OR push the adapter separately for multi-LoRA serving.
merged = model.merge_and_unload()
merged.save_pretrained("./llama3-merged")
# Alternative: model.save_pretrained("./llama3-adapter")  # ~330MB adapter only

Trade-offs and known limitations#

Pro: roughly 100-10,000x fewer trainable parameters than full FT, depending on rank and target modules, with a matching cut in optimiser-state memory.
Pro: adapters are portable, mergeable and hot-swappable; one base can serve hundreds of LoRAs.
Pro: composes cleanly with quantisation (QLoRA) and weight decomposition (DoRA).
Pro: every major commercial fine-tune service uses LoRA under the hood — battle-tested at scale.
Con: 1-2 point quality gap vs full FT on hard or large-distribution-shift tasks; mitigated by DoRA.
Con: requires empirical tuning of rank and target modules; defaults work but optima vary by task.
Con: cannot recover quality lost to poorly calibrated base-model quantisation when used outside the QLoRA recipe.
Con: embedding and LM-head changes (vocabulary expansion, new special tokens) need separate handling — LoRA targets linears only by default.
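
The hot-swap point in the list above can be illustrated with a toy layer: one frozen base weight shared by several adapters, with the adapter chosen per call. This is a minimal sketch under stated assumptions. The adapter names and the `forward` helper are hypothetical, the factors are random stand-ins for trained weights, and production servers batch requests across adapters far more efficiently than this.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 512, 512, 4, 8
scaling = alpha / r

W = torch.randn(d_out, d_in)  # one shared, frozen base weight

# Hypothetical adapter registry: name -> (A, B).
adapters = {
    name: (torch.randn(r, d_in) * 0.1, torch.randn(d_out, r) * 0.1)
    for name in ("support-bot", "sql-gen")
}

def forward(x, adapter_name=None):
    """Base matmul plus, optionally, the selected adapter's low-rank branch."""
    y = x @ W.T
    if adapter_name is not None:
        A, B = adapters[adapter_name]
        y = y + scaling * (x @ A.T @ B.T)
    return y

x = torch.randn(2, d_in)
y_base = forward(x)
y_support = forward(x, "support-bot")
y_sql = forward(x, "sql-gen")

base_params = W.numel()
adapter_params = sum(A.numel() + B.numel() for A, B in adapters.values())
print(f"base: {base_params:,} params, both adapters together: {adapter_params:,} params")
```

Switching adapters only changes which small (A, B) pair is read; the base weight is never copied or modified, which is what lets a single deployed base serve many fine-tunes.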
