Unsloth

TL;DR

Unsloth (unslothai/unsloth, founded by Daniel and Michael Han, Apache 2.0) is a fine-tuning library that replaces the standard Hugging Face training path with hand-written Triton kernels, manually-derived backward passes, fused operations and aggressive activation recomputation — delivering roughly 2x faster training and 50-70% less peak VRAM for LoRA and QLoRA workloads at zero quality cost.
Supported families in mid-2026: Llama 1/2/3/3.1/3.2/3.3, Mistral 7B / Nemo / Large, Gemma 1/2/3, Qwen 1.5 / 2 / 2.5 / 3, Phi-2/3/3.5, DeepSeek-V2 / V3, Mixtral 8x7B / 8x22B, plus the multi-modal Llama 3.2 Vision and Qwen2-VL families. Kernels are architecture-specific, so unsupported models either fall back to baseline HF performance or fail to load.
Standard entry point is `from unsloth import FastLanguageModel`, then `FastLanguageModel.from_pretrained(model_name='unsloth/Meta-Llama-3.1-8B-bnb-4bit', max_seq_length=4096, load_in_4bit=True)` followed by `FastLanguageModel.get_peft_model(model, r=32, lora_alpha=64, target_modules=[...])`. The returned model plugs straight into TRL's SFTTrainer, DPOTrainer, GRPOTrainer and KTOTrainer with no other code changes.
Headline performance on a single H100 80 GB at sequence length 4k: Llama 3.1 8B QLoRA at ~6,500 tokens/sec (vs ~3,200 for vanilla HF + PEFT), Mistral 7B QLoRA at ~7,200 tokens/sec (vs ~3,800), Gemma 2 9B QLoRA at ~5,800 (vs ~2,900), Llama 3.1 70B QLoRA fits at sequence length 4k where the vanilla path OOMs.
OSS edition is fully featured for single-GPU; the commercial Unsloth Pro / Enterprise tier adds multi-GPU DDP / FSDP wrappers and proprietary kernel optimisations. Yobibyte's single-GPU FineTune profile defaults to Unsloth kernels for supported model families — it is the lowest-cost path to a high-quality adapter at the 7-13B base size that dominates customer fine-tunes.

Overview

Unsloth is the throughput leader for single-GPU LLM fine-tuning in 2026 and the de facto default tool for solo researchers, indie ML teams and anyone working under a one-GPU constraint. Founded by Daniel and Michael Han in late 2023 and built on top of the standard Hugging Face Transformers + PEFT + TRL stack, the library replaces the operations in the LoRA and QLoRA forward and backward passes where the generic PyTorch path leaves performance on the table (RMSNorm, RoPE, SwiGLU, attention QKV projection, cross-entropy loss, the LoRA A/B matmul) with hand-written Triton kernels. The backward pass is derived analytically rather than left to autograd, so intermediate gradient buffers are never materialised; activation recomputation is selective rather than always-on. The net effect is the headline result: a Llama 3.1 8B QLoRA run that takes 7 hours on a single H100 with the vanilla HF stack completes in roughly 3.5 hours with Unsloth, peak VRAM drops from ~22 GB to ~10 GB, and the final adapter is equivalent in evaluation quality.

The architectural bet behind Unsloth is that for the specific shapes that show up in LoRA / QLoRA fine-tuning — small adapter ranks (r=8 to r=128), well-known model architectures (Llama-family, Mistral-family, Gemma-family decoder-only blocks), standard sequence lengths (4k-32k tokens) — the kernel space is small enough to hand-write profitably. The kernels are specialised to model-family geometry; supporting a new family requires roughly a week of kernel work per family. This is why Unsloth's supported-model list updates within days of every major open-weights release but lags slightly behind on novel architectures. The commercial Pro / Enterprise tier extends this to multi-GPU (DDP and FSDP wrappers around the Unsloth kernels) and ships proprietary optimisations for production-scale deployment.

By mid-2026 Unsloth is the substrate behind a large fraction of public Hugging Face fine-tune cards — the README footer 'Trained with Unsloth' is one of the most common sights in the open-weights ecosystem. The library is used by Cognitive Computations (the Dolphin model family), several Mistral and Llama community releases, and the bulk of the small-team commercial fine-tune services that rent single-GPU capacity. Yobitel's Yobibyte FineTune resource defaults to Unsloth kernels for any fine-tune job targeting a supported model family on a single GPU — the throughput win is what makes single-H100 fine-tuning a credible product at hobbyist-budget price points.

This entry helps you decide whether Unsloth is the right tool for your fine-tune job, write code that uses it correctly, and understand its boundaries (single-GPU constraint, architecture-specific kernels) versus Axolotl, LLaMA-Factory and Yobibyte FineTune's managed alternative. Yobitel's Yobibyte FineTune resource uses Unsloth as the default execution backend for sub-30B fine-tunes on a single GPU, with Axolotl + DeepSpeed taking over for multi-GPU and >30B workloads.

Quick start: 8B QLoRA in a Jupyter cell

The shortest path from `pip install` to a trained adapter on a single H100 or RTX 4090 is the Unsloth snippet below. The example loads a pre-quantised Llama 3.1 8B base from the Unsloth Hugging Face org (skipping the 60-120 second NF4 quantisation step at load time), wraps it with LoRA r=32, and trains via TRL's SFTTrainer for one epoch on a 10,000-example subset of Alpaca.

python

# unsloth_quickstart.py — runs on a single H100 / A100 / RTX 4090
# pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" trl datasets

from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# 1. Load a pre-quantised base (skip the NF4 quant step on every run).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantised NF4 mirror
    max_seq_length=4096,
    dtype=None,                       # auto-detect: BF16 on Ampere and newer, FP16 otherwise
    load_in_4bit=True,
)

# 2. Wrap with LoRA. Target every linear in attention + MLP.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0,                   # 0 is faster; Unsloth recommends 0 unless data is small
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's tuned variant — saves another ~30% VRAM
    random_state=42,
    use_rslora=True,                  # rank-stable scaling for r >= 32
    loftq_config=None,
)

# 3. Train via TRL — the model is a normal PeftModel underneath.
ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(10_000))

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",
    max_seq_length=4096,
    args=SFTConfig(
        output_dir="./outputs/llama3-8b-unsloth",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=is_bfloat16_supported(),
        fp16=not is_bfloat16_supported(),
        logging_steps=10,
        save_steps=200,
        optim="adamw_8bit",           # bitsandbytes 8-bit AdamW ("paged_adamw_8bit" is the paged variant)
        lr_scheduler_type="cosine",
        seed=42,
    ),
)
trainer.train()

# 4. Save adapter (~300 MB) or merge for serving.
model.save_pretrained("./llama3-8b-unsloth-adapter")
# Merge to BF16 (dequant base first — critical for QLoRA-trained adapters):
model.save_pretrained_merged(
    "./llama3-8b-unsloth-merged",
    tokenizer,
    save_method="merged_16bit",       # dequant → merge → save in BF16
)
# Or push to GGUF for llama.cpp serving:
# model.save_pretrained_gguf("./llama3-8b-gguf", tokenizer, quantization_method="q4_k_m")

Use `unsloth/<model>-bnb-4bit` rather than the original repo when available — Unsloth's pre-quantised mirrors skip the NF4 quantisation step on every load, save 60-120 seconds of wall time per run, and are bit-identical to a fresh quantise. The mirrors cover every supported model family.

How it works: where the speed-up actually lives

Unsloth's performance edge comes from four orthogonal optimisations, each addressing a specific inefficiency in the vanilla Hugging Face training path. Understanding them is the difference between treating Unsloth as a black box and being able to debug when it underperforms expectations.

Optimisation 1 — Hand-written Triton kernels for the hot path. The baseline HF + PEFT training path executes a long sequence of generic PyTorch ops for each Transformer block forward: RMSNorm (LayerNorm-without-mean variant), QKV linear projections, RoPE rotation, scaled dot-product attention, output projection, residual add, RMSNorm again, gate projection, up projection, SiLU activation, down projection, residual add. Each op allocates intermediate tensors, launches a CUDA kernel and writes back to HBM. Even with torch.compile the kernel fusion is conservative, because PyTorch does not know which fusions are safe. Unsloth replaces RMSNorm, RoPE, SwiGLU (silu(gate) * up, computed in a single kernel) and the LoRA matmul with hand-tuned Triton kernels that fuse the operations into a single launch per group, eliminating intermediate HBM round-trips. The cross-entropy loss kernel is similarly fused with the LM-head logits projection, which is the single most memory-intensive op in the standard backward.

Optimisation 2 — Manually-derived backward pass. PyTorch autograd builds the backward graph by recording each op in the forward and replaying it. For LoRA, where most weights are frozen, this records gigabytes of gradient buffers that will never be used. Unsloth derives the backward analytically for the supported architectures: it knows the gradient w.r.t. LoRA A, LoRA B, and the input only — no gradient buffers for base weights are allocated. The result is dramatic activation-memory savings (the 50-70% figure quoted in the headline numbers) and faster backward execution because there is no autograd graph traversal.
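The manual backward described above can be illustrated on a single LoRA-augmented linear layer. The sketch below is a plain-Python toy, not Unsloth's Triton implementation: the layout `y = x W + s (x A) B`, the helper names and the tiny shapes are all assumptions chosen for readability. It derives the three gradients by hand (for A, B and the input), computes nothing for the frozen W, and checks one entry against a finite difference.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scaled(m, s):
    return [[s * v for v in row] for row in m]

def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_forward(x, W, A, B, s):
    """y = x W + s (x A) B. W is frozen; only A and B train."""
    xA = matmul(x, A)  # (n, r); x and xA are all the backward needs to keep
    y = add(matmul(x, W), scaled(matmul(xA, B), s))
    return y, xA

def lora_backward(x, W, A, B, s, xA, grad_y):
    """Hand-derived gradients. No gradient is computed or stored for frozen W."""
    gBt = matmul(grad_y, transpose(B))                      # grad_y B^T, shape (n, r)
    grad_B = scaled(matmul(transpose(xA), grad_y), s)       # (r, d_out)
    grad_A = scaled(matmul(transpose(x), gBt), s)           # (d_in, r)
    grad_x = add(matmul(grad_y, transpose(W)),
                 scaled(matmul(gBt, transpose(A)), s))      # (n, d_in)
    return grad_A, grad_B, grad_x

def scalar_loss(x, W, A, B, s, grad_y):
    """A loss whose gradient with respect to y is exactly grad_y (for checking)."""
    y, _ = lora_forward(x, W, A, B, s)
    return sum(v * g for ry, rg in zip(y, grad_y) for v, g in zip(ry, rg))

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

n, d_in, d_out, r, s = 3, 4, 5, 2, 0.5
x, W, A, B = rand(n, d_in), rand(d_in, d_out), rand(d_in, r), rand(r, d_out)
grad_y = rand(n, d_out)

_, xA = lora_forward(x, W, A, B, s)
grad_A, grad_B, grad_x = lora_backward(x, W, A, B, s, xA, grad_y)

# Finite-difference check on one entry of A.
eps = 1e-6
A_pert = [row[:] for row in A]
A_pert[1][0] += eps
numeric = (scalar_loss(x, W, A_pert, B, s, grad_y)
           - scalar_loss(x, W, A, B, s, grad_y)) / eps
print("analytic matches numeric:", abs(numeric - grad_A[1][0]) < 1e-4)
```

The point of the exercise is the shape of `lora_backward`: its outputs are sized by the adapter rank and the input, never by the full weight matrix.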

Optimisation 3 — Selective activation recomputation (`use_gradient_checkpointing='unsloth'`). Standard gradient checkpointing recomputes the entire forward pass on backward, trading roughly 30% extra compute for half the activation memory. Unsloth's variant selectively recomputes only the cheap-to-recompute portions (RMSNorm, SiLU, residuals) while caching the expensive attention output, giving roughly the same memory saving for ~10% extra compute instead of 30%. The result is the headline 'gradient checkpointing on, no throughput cost' claim.

Optimisation 4 — Fused cross-entropy loss. The LM-head logits projection produces a tensor of shape (batch * seq_len, vocab_size) — for Llama 3.1 8B at 4k context with batch 4, that is 4 * 4096 * 128256 * 4 bytes = ~8 GB of activations, all held in HBM until the loss backward runs. Unsloth fuses the logits projection with the cross-entropy loss into a single chunked kernel that never materialises the full logits tensor — peak VRAM at the loss step drops by 5-10 GB, which is often the difference between OOM and fitting on a 24 GB consumer card.
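The arithmetic and the chunking idea above can be checked in a few lines. This is a minimal sketch in plain Python, assuming FP32 logits and an arbitrary chunk size; the function names are hypothetical and the real kernel is fused Triton code, not a Python loop. It first reproduces the size of the full logits tensor, then shows that computing the loss chunk by chunk gives the same value while only ever holding one chunk of logits.

```python
import math
import random

def logits_bytes(batch, seq_len, vocab_size, bytes_per_elem=4):
    """Size of a fully materialised logits tensor (FP32 by default)."""
    return batch * seq_len * vocab_size * bytes_per_elem

full_gb = logits_bytes(4, 4096, 128_256) / 1e9
print(f"full logits tensor: {full_gb:.1f} GB")

def row_loss(logits_row, target):
    """Cross-entropy for one position via a numerically stable log-sum-exp."""
    peak = max(logits_row)
    lse = peak + math.log(sum(math.exp(v - peak) for v in logits_row))
    return lse - logits_row[target]

def project(rows, lm_head_cols):
    """Hidden states times the LM head, one logits row per position."""
    return [[sum(h * w for h, w in zip(row, col)) for col in lm_head_cols]
            for row in rows]

def ce_full(hidden, lm_head, targets):
    """Reference path: build every logits row, then average the losses."""
    logits = project(hidden, list(zip(*lm_head)))
    return sum(row_loss(l, t) for l, t in zip(logits, targets)) / len(targets)

def ce_chunked(hidden, lm_head, targets, chunk_rows):
    """Same loss, but only chunk_rows x vocab logits exist at any moment."""
    cols = list(zip(*lm_head))
    total = 0.0
    for start in range(0, len(hidden), chunk_rows):
        stop = start + chunk_rows
        logits = project(hidden[start:stop], cols)
        total += sum(row_loss(l, t) for l, t in zip(logits, targets[start:stop]))
    return total / len(targets)

rng = random.Random(0)
n_rows, d_model, vocab = 10, 8, 50   # toy sizes, not Llama's
hidden = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_rows)]
lm_head = [[rng.gauss(0, 1) for _ in range(vocab)] for _ in range(d_model)]
targets = [rng.randrange(vocab) for _ in range(n_rows)]

full = ce_full(hidden, lm_head, targets)
chunked = ce_chunked(hidden, lm_head, targets, chunk_rows=3)
print("chunked loss matches full loss:", abs(full - chunked) < 1e-9)
```

A real fused kernel also has to produce gradients chunk by chunk; the sketch covers only the forward loss, which is enough to see why the full tensor is never needed.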

What this gets you in practice: vanilla HF + PEFT runs Llama 3.1 8B QLoRA at sequence length 4k with peak VRAM ~22 GB and throughput ~3,200 tokens/sec on a single H100. Unsloth runs the same workload at peak VRAM ~10 GB and throughput ~6,500 tokens/sec — a 2x speed-up and a 55% memory reduction with bit-identical adapter quality. The savings are family-specific: Llama 3.1 and Mistral get the largest wins (2x throughput, 50-55% VRAM saving); Gemma 2 sees ~1.8x throughput and 55-60% VRAM; Qwen 2.5 sees ~1.7x and 50%; Phi-3 sees ~1.5x and 45%.

Triton kernel fusion: fused RMSNorm and RoPE kernels, SwiGLU (silu(gate) * up) in one kernel, LoRA A/B matmul fused into the linear.
Manual backward: no autograd buffers for frozen base; gradient only for A, B and the input. Saves both compute and memory.
Selective gradient checkpointing: recompute cheap ops only; cache expensive attention output. ~10% compute overhead vs 30% for standard checkpointing.
Fused cross-entropy + LM-head: never materialise the full (batch * seq_len, vocab_size) logits tensor. Saves 5-10 GB peak VRAM.
Pre-quantised model mirrors at `unsloth/<model>-bnb-4bit`: skip the 60-120 second NF4 quantisation step on every load.
Architecture support is per-family: Llama, Mistral, Gemma, Qwen, Phi, DeepSeek covered. New architectures take ~1 week of kernel work per family.

Reference: FastLanguageModel API surface

Unsloth exposes two primary classes: FastLanguageModel for text models and FastVisionModel for the supported multi-modal families. The table below is a reference for the calls and arguments that appear in most Unsloth runs.

Call / argument	Type	Default	What it does
FastLanguageModel.from_pretrained()	classmethod	—	Load a base model with Unsloth optimisations attached
model_name	string	—	HF model ID; prefer `unsloth/<model>-bnb-4bit` for pre-quantised
max_seq_length	int	—	Maximum context for training; affects RoPE scaling and kernel choice
dtype	torch.dtype or None	None (auto)	Auto-detects BF16 on Ampere and newer GPUs (A100, RTX 30/40 series, Hopper); falls back to FP16 on older GPUs
load_in_4bit	bool	True	Enable QLoRA NF4 quantisation (on by default; the quick start passes it explicitly)
load_in_8bit	bool	False	8-bit quantisation (rarely used in 2026)
token	string	None	HF token for gated models
device_map	string	'sequential'	Distribution across GPUs; 'sequential' for single-GPU
rope_scaling	dict	None	Override RoPE scaling for context extension
fix_tokenizer	bool	True	Patch tokeniser pad token / chat template if broken
FastLanguageModel.get_peft_model()	classmethod	—	Attach LoRA with Unsloth kernels
r	int	16	LoRA rank
lora_alpha	int	16	LoRA alpha (Unsloth recommends alpha = r)
lora_dropout	float	0	Unsloth recommends 0 for speed; 0.05 for very small datasets
target_modules	list[string]	['q_proj','k_proj','v_proj','o_proj','gate_proj','up_proj','down_proj']	Linear layers that receive LoRA adapters; the default covers attention and MLP
bias	string	'none'	'none', 'all', or 'lora_only'
use_gradient_checkpointing	string	'unsloth'	'unsloth' (recommended), True (HF default), or False
random_state	int	3407	Seed for reproducibility
use_rslora	bool	False	Rank-stable scaling for r >= 32
loftq_config	dict or None	None	LoftQ initialisation for QLoRA quality recovery
modules_to_save	list[string]	None	Layers to fully fine-tune alongside LoRA (e.g. embed_tokens, lm_head for vocab expansion)
FastLanguageModel.for_inference()	classmethod	—	Switch model to inference mode (disables checkpointing, enables fast generate)
model.save_pretrained_merged()	method	—	Merge adapter into base and save BF16 model
save_method	string	'merged_16bit'	'merged_16bit' (BF16), 'merged_4bit' (re-quantise), or 'lora' (adapter only)
model.save_pretrained_gguf()	method	—	Export to GGUF for llama.cpp serving
quantization_method	string	'q4_k_m'	GGUF quant level: f16, q8_0, q6_k, q5_k_m, q4_k_m, q3_k_m, q2_k

Unsloth deliberately recommends `lora_alpha = r` (not `2 * r`, the common LoRA convention) when paired with `use_rslora=True`: rank-stabilised scaling divides alpha by sqrt(r) rather than by r, so the same alpha produces a larger effective multiplier. If you import an Axolotl or PEFT-native config with alpha = 2 * r and rsLoRA off, expect a slightly hotter learning rate; consider dropping LR by ~25%.
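To make the alpha recommendation concrete, the sketch below computes the multiplier applied to the low-rank update under both conventions. It assumes the standard definitions (`lora_alpha / r` for plain LoRA, `lora_alpha / sqrt(r)` for rsLoRA); the helper name `lora_scale` is hypothetical and not part of any library.

```python
import math

def lora_scale(r, lora_alpha, use_rslora=False):
    """Multiplier applied to the low-rank update before it is added to the frozen weight."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)   # rank-stabilised LoRA
    return lora_alpha / r                  # classic LoRA

configs = [
    ("classic, alpha = 2r", 32, 64, False),
    ("rsLoRA,  alpha = r ", 32, 32, True),
    ("rsLoRA,  alpha = 2r", 32, 64, True),
]
for label, r, alpha, rs in configs:
    print(f"{label}: scale = {lora_scale(r, alpha, rs):.2f}")
```

At r=32 the rsLoRA multiplier is several times larger than the classic one for the same alpha, so a config that carries alpha = 2r over to rsLoRA is effectively training with a much larger update scale.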

Workload patterns

Five patterns cover essentially every real Unsloth use in 2026. Each pattern has a typical hardware profile and a known throughput envelope.

Pattern 1 — Single-GPU QLoRA on a 7-13B base. Dominant workload. RTX 4090 24 GB, RTX A6000 48 GB, L40S 48 GB or H100/H200. Sequence length 4k-8k, micro batch 2-4 with grad_accum 4-8, single epoch over 10-50k examples completes in 1-3 hours. This is the workload Unsloth is most aggressively optimised for and where the 2x speed-up is largest. Yobibyte FineTune's default profile for sub-30B targets.
Pattern 2 — Single-GPU QLoRA on a 30-70B base. H100 80 GB or H200 141 GB. Sequence length 4k (8k on H200), micro batch 1 with grad_accum 8-16. 70B QLoRA fits on a single H100 80 GB with Unsloth where it OOMs with vanilla HF + PEFT; this is the largest practical 'one GPU is enough' base size in 2026.
Pattern 3 — Single-GPU preference training (DPO, ORPO, KTO, GRPO). Same base sizes as Pattern 1 / 2; replace `SFTTrainer` with `DPOTrainer` / `ORPOTrainer` / `KTOTrainer` / `GRPOTrainer`. Unsloth's kernels accelerate the preference forward pass in the same way as the SFT forward; the speed-up is similar (~1.8-2x). Standard recipe is to layer the preference stage on top of a Pattern-1 SFT adapter.
Pattern 4 — Continued pretraining on a domain corpus. Full fine-tuning or LoRA on raw text. Sequence length 8k-32k. Less common but supported.
Pattern 5 — Multi-modal fine-tuning via FastVisionModel. Llama 3.2 Vision, Qwen2-VL, Pixtral. API is parallel to FastLanguageModel but the loader handles image processors and the chat template encodes image tokens. Sequence lengths 8k-16k typical because of image token expansion.

If your job does not fit Patterns 1-5 — multi-GPU, novel architecture, or a custom training loss that diverges from the supported preference methods — Unsloth will either fall back to baseline performance or refuse to start. Switch to Axolotl + DeepSpeed for multi-GPU and to a hand-written TRL loop for novel losses.
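The routing rule in the paragraph above can be written down as a small decision function. This is a sketch only: `pick_backend`, the backend labels and the two lookup sets are hypothetical names, with the family and loss lists taken from this entry rather than from any Unsloth API.

```python
# Families and losses as listed in this entry; treat both sets as assumptions
# that need refreshing against the live supported-model list.
SUPPORTED_FAMILIES = {"llama", "mistral", "gemma", "qwen", "phi", "deepseek"}
TRL_LOSSES = {"sft", "dpo", "orpo", "kto", "grpo", "cpo", "ipo"}

def pick_backend(n_gpus, family, loss):
    """Return a label for the training stack that fits the job."""
    if loss.lower() not in TRL_LOSSES:
        return "custom-trl-loop"        # novel loss: hand-written training loop
    if n_gpus > 1:
        return "axolotl+deepspeed"      # multi-GPU is outside the OSS edition
    if family.lower() not in SUPPORTED_FAMILIES:
        return "axolotl"                # no architecture-specific kernels available
    return "unsloth"

print(pick_backend(1, "Llama", "sft"))
print(pick_backend(8, "Llama", "dpo"))
print(pick_backend(1, "Mamba", "sft"))
print(pick_backend(1, "Qwen", "my_custom_loss"))
```

Checking the loss first is a design choice of this sketch: a novel loss rules out every off-the-shelf trainer regardless of GPU count.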

Sizing and capacity planning

Sizing guidance for Unsloth fine-tuning in 2026, assuming `use_gradient_checkpointing='unsloth'`, `load_in_4bit=True`, sample packing on and standard chat templates.

Base size	Method	Seq len	Peak VRAM (Unsloth)	Peak VRAM (vanilla HF)	Fits on
Mistral 7B	QLoRA r=32	4k	~9 GB	~18 GB	RTX 4090 24 GB
Llama 3.1 8B	QLoRA r=32	4k	~10 GB	~22 GB	RTX 4090 24 GB
Llama 3.1 8B	QLoRA r=32	8k	~14 GB	~30 GB (OOM 24 GB)	RTX 4090 24 GB / L40S
Llama 3.1 8B	QLoRA r=32	16k	~20 GB	~50 GB	L40S 48 GB / H100 80 GB
Gemma 2 9B	QLoRA r=32	4k	~12 GB	~26 GB	RTX 4090 24 GB
Qwen 2.5 14B	QLoRA r=32	4k	~14 GB	~30 GB	RTX 4090 24 GB (tight) / L40S
Mixtral 8x7B	QLoRA r=32	4k	~22 GB	OOM 24 GB	L40S 48 GB / A100 80 GB
Llama 3.1 70B	QLoRA r=32	4k	~48 GB	~75 GB	H100 80 GB / H200 141 GB
Llama 3.1 70B	QLoRA r=32	8k	~58 GB	OOM 80 GB	H100 80 GB / H200 141 GB
Llama 3.1 70B	QLoRA r=32	16k	~80 GB	OOM 80 GB	H200 141 GB
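For capacity planning it can help to turn the table into a lookup. The sketch below copies the Unsloth peak-VRAM column and picks the smallest listed GPU that still leaves headroom. The 25% headroom default and the function name `smallest_gpu` are assumptions of this example, not figures from Unsloth.

```python
# Peak VRAM in GB with Unsloth, copied from the sizing table (QLoRA r=32).
PEAK_VRAM_GB = {
    ("Mistral 7B", 4096): 9,
    ("Llama 3.1 8B", 4096): 10,
    ("Llama 3.1 8B", 8192): 14,
    ("Llama 3.1 8B", 16384): 20,
    ("Gemma 2 9B", 4096): 12,
    ("Qwen 2.5 14B", 4096): 14,
    ("Mixtral 8x7B", 4096): 22,
    ("Llama 3.1 70B", 4096): 48,
    ("Llama 3.1 70B", 8192): 58,
    ("Llama 3.1 70B", 16384): 80,
}

# GPUs ordered from smallest to largest memory.
GPUS_GB = [("RTX 4090", 24), ("L40S", 48), ("H100", 80), ("H200", 141)]

def smallest_gpu(model, seq_len, headroom=0.25):
    """Smallest GPU whose memory covers peak VRAM plus a safety margin."""
    need = PEAK_VRAM_GB[(model, seq_len)] * (1 + headroom)
    for name, capacity in GPUS_GB:
        if need <= capacity:
            return name
    return None  # nothing in the list is large enough

for key in PEAK_VRAM_GB:
    print(key, "->", smallest_gpu(*key))
```

Lowering `headroom` makes the recommendation more aggressive; the margin exists because fragmentation and batch-size sweeps can push real usage above the tabulated peak.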

Limits and quotas

Unsloth itself has no hard quotas — it is a library, not a service. The practical ceilings are architectural and operational.

Limit	Practical ceiling (2026)	Notes
GPU count (OSS edition)	1	Multi-GPU requires Unsloth Pro / Enterprise
Max base model size (single GPU)	70B on H100 80 GB; ~141B on H200	Above this, multi-GPU is mandatory
Max sequence length	131,072+ (Llama 3.1)	Activation memory limits the effective ceiling; sweep `per_device_train_batch_size`
Max LoRA rank	1024+	Quality plateaus at r=64-128 for most workloads
Supported model families	~10 families covering 50+ models	See unsloth.ai/docs for live list
Custom architectures	Falls back to baseline HF perf or fails	Wait for upstream kernel support or use Axolotl
Custom loss functions	TRL-supported only (SFT, DPO, ORPO, KTO, GRPO, CPO, IPO)	Novel losses require hand-written TRL loop

Unsloth's most aggressive kernels are tied to specific model architectures. Custom or recently released models may not be supported until kernel updates land (typically 1-2 weeks after major releases) — check the supported-model list at unsloth.ai/docs before committing to Unsloth for a production project on a brand-new architecture.

Observability

Unsloth uses TRL's standard logging surface: Weights & Biases, MLflow, TensorBoard and stdout, selected via the `report_to` field of `SFTConfig` / `TrainingArguments`. The signals worth watching during a run:

`train/loss`: falls steadily. A sudden plateau or spike usually means the chat template is wrong (check the output of `tokenizer.apply_chat_template`).
`train/learning_rate`: confirms warmup completed and cosine decay engaged.
`train/grad_norm`: healthy range 0.3-2.0 for SFT, 0.1-0.5 for DPO. Spikes to 10+ indicate LR too high.
Throughput in tokens/sec: the SFTTrainer progress bar reports iterations per second, so multiply by tokens per optimiser step to get tokens/sec. On H100 80 GB at 4k context expect 6,000-7,500 tokens/sec for 7-9B QLoRA; <4,000 means kernels did not load (check Unsloth version and model family support).
Peak GPU memory: `torch.cuda.max_memory_allocated() / 1e9` in GB. Should match the Sizing table within 10%.
`unsloth.__version__` and `is_bfloat16_supported()`: log at run start; they record which kernel version and precision trained the adapter, for reproducibility.
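The throughput check can be done by hand from the step time. The sketch below assumes every sequence is filled to `max_seq_length` (true with sample packing, an upper bound otherwise) and uses a hypothetical helper name; the 4,000 tokens/sec threshold is the one quoted in the list above.

```python
def tokens_per_second(per_device_batch, grad_accum, seq_len, seconds_per_step):
    """Tokens processed per second, given the wall time of one optimiser step."""
    tokens_per_step = per_device_batch * grad_accum * seq_len
    return tokens_per_step / seconds_per_step

# Quick-start settings: batch 2, grad_accum 4, 4k context, with a measured
# step time of 5.0 s (an illustrative number, not a benchmark).
tps = tokens_per_second(2, 4, 4096, seconds_per_step=5.0)
print(f"{tps:,.0f} tokens/sec")

KERNELS_SUSPECT_BELOW = 4_000  # threshold from the observability list
if tps < KERNELS_SUSPECT_BELOW:
    print("throughput is low: check Unsloth version and model family support")
```

Skip the first step when measuring, since Triton kernels are compiled on the first iteration.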

python

# At the top of every Unsloth run — capture kernel version + GPU context.
import unsloth, torch
print(f"unsloth: {unsloth.__version__}")
print(f"torch:   {torch.__version__}")
print(f"cuda:    {torch.version.cuda}")
print(f"gpu:     {torch.cuda.get_device_name(0)}")
print(f"vram:    {torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB")

# After training — record peak memory for capacity planning.
print(f"peak vram: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")

# Optional: enable W&B / MLflow via SFTConfig(report_to=["wandb"], run_name="llama3-8b-unsloth-r32").
# Set the project before the trainer is constructed so the W&B callback picks it up.
import os
os.environ["WANDB_PROJECT"] = "yobitel-finetune"

Cost and FinOps

Unsloth's primary cost win is the 2x throughput multiplier: every Unsloth-accelerated workload completes in roughly half the GPU-hours of the vanilla HF + PEFT equivalent. Indicative 2026 costs in USD computed at Yobitel NeoCloud reference pricing ($2.60/H100/hr, $3.20/H200/hr).

Workload	Vanilla HF GPU-hours	Unsloth GPU-hours	NeoCloud cost (Unsloth)
Mistral 7B QLoRA, 10k examples, 1 epoch	~2	~1	~$2.60
Llama 3.1 8B QLoRA, 50k examples, 3 epochs	~30	~15	~$39
Gemma 2 9B QLoRA, 50k examples, 3 epochs	~32	~17	~$44
Qwen 2.5 14B QLoRA, 30k examples, 2 epochs	~24	~13	~$34
Mixtral 8x7B QLoRA, 30k examples, 2 epochs	~36	~20	~$52
Llama 3.1 70B QLoRA, 30k examples, 1 epoch	~50	~28	~$90 (at the H200 rate)
Llama 3.1 8B DPO (after SFT), 20k pairs	~4	~2	~$5

Yobibyte FineTune's per-token training price for sub-30B bases reflects the Unsloth throughput multiplier — single-GPU Unsloth is what makes the price work. If you self-host on rented NeoCloud H100s, the cost above is what you pay; the managed Yobibyte FineTune wraps this with adapter storage, multi-LoRA serving and OIDC-bound RBAC at a competitive markup over raw GPU-hours.
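The cost column is GPU-hours multiplied by the hourly rate, which is easy to reproduce for your own workloads. A minimal sketch, using the reference prices quoted above; `run_cost` is a hypothetical helper, and the choice of GPU per row is an assumption you should set to match your job.

```python
# Reference hourly prices quoted in this section (USD per GPU-hour).
RATES_USD_PER_HOUR = {"H100": 2.60, "H200": 3.20}

def run_cost(gpu_hours, gpu="H100"):
    """Cost of a single-GPU run at the reference rate."""
    return gpu_hours * RATES_USD_PER_HOUR[gpu]

# Llama 3.1 8B QLoRA, 50k examples, 3 epochs: ~30 h vanilla vs ~15 h Unsloth.
vanilla = run_cost(30)
unsloth = run_cost(15)
print(f"vanilla ${vanilla:.2f}, Unsloth ${unsloth:.2f}, saved ${vanilla - unsloth:.2f}")

# Llama 3.1 70B QLoRA, ~28 h, priced on an H200.
print(f"70B on H200: ${run_cost(28, 'H200'):.2f}")
```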

Security and compliance

Unsloth runs entirely on your machine — no telemetry, no model uploads, no inference calls during training. The security posture mirrors any Python library executing in your venv: audit the install source, pin versions and isolate sensitive workloads.

Install source: `pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"` pulls directly from the upstream repo; pin a commit SHA for production reproducibility.
Pre-quantised mirrors at `unsloth/<model>-bnb-4bit` are hosted on Hugging Face Hub under the official Unsloth org — same trust boundary as any HF model download. Verify hashes against the original repo for compliance-grade evidence.
Offline operation: set `HF_HUB_OFFLINE=1` once the base, tokeniser and dataset are pre-staged; required for OFFICIAL-SENSITIVE workloads with no outbound network.
Adapter artefacts: standard PEFT format, safetensors-only by default — no executable payload risk on load.
Audit logging: TRL's standard W&B / MLflow integration captures every run; sufficient for SOC 2 CC6 / ISO 27001 A.12.4 when shipped to your tenancy log store.
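Reproducibility: pin exact versions of `unsloth`, `torch`, `transformers`, `peft`, `trl` and `bitsandbytes` (for example `unsloth==2025.6`, `torch==2.5.0`, `transformers==4.45.0`, `peft==0.13.0`, `trl==0.11.0`, `bitsandbytes==0.43.0`) in a lockfile alongside the dataset SHA and base model SHA. Resolve the set together rather than copying these numbers: GRPOTrainer, for instance, only exists in TRL releases newer than 0.11. The Unsloth version is the most consequential pin, because kernel changes between versions occasionally affect bit-equivalence.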
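One way to capture the reproducibility evidence from the list above is a small manifest written at the start of each run. The sketch below uses only the standard library; the function names, the manifest layout and the sample file are assumptions of this example. Packages that are not installed are recorded as `None` rather than raising.

```python
import hashlib
import json
import os
import tempfile
from importlib import metadata

PACKAGES = ["unsloth", "torch", "transformers", "peft", "trl", "bitsandbytes"]

def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def build_manifest(dataset_path):
    """Collect package versions and the dataset hash for the run record."""
    return {
        "packages": installed_versions(),
        "dataset_sha256": sha256_of(dataset_path),
    }

# Demonstration on a throwaway file standing in for the training dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "train.jsonl")
    with open(sample, "wb") as fh:
        fh.write(b"abc")
    manifest = build_manifest(sample)

print(json.dumps(manifest, indent=2))
```

Store the manifest next to the adapter so a later audit can tie the weights back to the exact library versions and data.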

Migration and alternatives

Unsloth, Axolotl, LLaMA-Factory and managed Yobibyte FineTune occupy adjacent positions in the fine-tune toolchain.

Tool	Strength	Weakness	When to pick
Unsloth	2x throughput + 50-70% less VRAM on single GPU; bit-identical quality	Single-GPU OSS; architecture-specific kernels	Single GPU, supported family, throughput-bound
Axolotl	Most flexible; multi-GPU native; full TRL surface	No throughput multiplier vs vanilla HF without Unsloth integration	Multi-GPU; novel recipes; preference training at scale
Axolotl + Unsloth	Axolotl's flexibility + Unsloth's kernels (`unsloth_lora_mlp: true`)	Only on Axolotl-supported model families that overlap with Unsloth's	Highest single-GPU throughput in a YAML-driven workflow
LLaMA-Factory	Gradio web UI; broad family coverage	Less throughput-optimised than Unsloth	UI-driven workflow; rapid model exploration
Yobibyte FineTune (managed)	Wraps Unsloth + Axolotl behind an API; integrated multi-LoRA serving	Less granular than self-hosted Unsloth	Teams that want fine-tuning as a service on Yobitel-managed H100/H200

Axolotl + Unsloth composition (`unsloth_lora_mlp: true`, `unsloth_lora_qkv: true`, `unsloth_lora_o: true`, `unsloth_cross_entropy_loss: true` in an Axolotl YAML) is the highest-throughput single-GPU recipe in 2026 — Axolotl's flexibility plus Unsloth's kernel speed at no flexibility cost on supported families.

Troubleshooting

Failure modes that bite real Unsloth users.

Symptom	Most likely cause	Fix
Throughput same as vanilla HF (no speed-up)	Unsloth kernels did not load — unsupported architecture	Verify model family on unsloth.ai/docs; fall back to Axolotl
`RuntimeError: Unsloth: ... not supported`	Model architecture not in supported list	Use the original HF + PEFT path or wait for kernel update
NaN loss from step 1	FP16 with LoRA (use BF16)	Set `dtype=torch.bfloat16` or rely on `is_bfloat16_supported()`
OOM despite Sizing table fitting	`use_gradient_checkpointing=False`	Set `use_gradient_checkpointing='unsloth'`
Slow first step, fast afterwards	Triton kernel JIT compilation on first iteration	Normal; no action needed
`Could not find unsloth/... model`	Pre-quantised mirror not yet published	Use the original repo + `load_in_4bit=True` (re-quantises locally)
Merged model produces garbage	Used `save_pretrained_merged(save_method='merged_4bit')` on QLoRA adapter	Use `save_method='merged_16bit'` — dequantises base before merge
Adapter loads in vLLM but quality is wrong	Tokeniser changed (added special tokens) without `modules_to_save`	Re-train with `modules_to_save=['embed_tokens','lm_head']`
`ImportError: cannot import name 'FastLanguageModel'`	Old Unsloth version (<2024.5)	Upgrade: `pip install --upgrade unsloth`
DPO loss not decreasing	Reference model not frozen / wrong beta	Confirm `beta=0.1` and `force_use_ref_model=True` if needed
GGUF export fails	llama.cpp not installed or model family unsupported by GGUF converter	Install `llama.cpp` via the Unsloth docs; check GGUF support per family

Where Unsloth fits in the Yobitel stack

Yobibyte FineTune's single-GPU profile uses Unsloth as the default execution backend for any fine-tune targeting a supported model family on a single GPU — Llama, Mistral, Gemma, Qwen, Phi, DeepSeek bases at 7B-70B. The customer sees an API: submit a job spec, pay per token, receive an adapter directory. Underneath, Yobibyte resolves the spec into an Unsloth FastLanguageModel call on Yobitel-managed H100 or H200 capacity in UK and EU NeoCloud regions with NCSC OFFICIAL alignment, runs the job, and registers the resulting adapter with Yobibyte's multi-LoRA inference surface — the customer can call their fine-tuned model through an OpenAI-compatible endpoint within minutes of training completing.

For workloads outside Unsloth's coverage — multi-GPU, full FT, novel architectures, custom losses — Yobibyte FineTune automatically routes to Axolotl + DeepSpeed on the equivalent NeoCloud capacity. The handoff is transparent to the customer; the rule of thumb is 'Unsloth wins for single-GPU SFT/DPO on supported families, Axolotl wins for everything else'.

For self-hosted teams, Yobitel NeoCloud rents H100 and H200 SXM5 capacity by the hour, with the same NCSC OFFICIAL alignment the managed Yobibyte service uses. An Unsloth fine-tune script runs identically on rented NeoCloud GPUs; the choice between managed Yobibyte FineTune and self-managed Unsloth-on-NeoCloud is the standard build-vs-buy axis. InferenceBench evaluates the resulting adapters alongside base models so customers can confirm empirical quality lift before production rollout.

TL;DR

Unsloth (unslothai/unsloth, founded by Daniel and Michael Han, Apache 2.0) is a fine-tuning library that replaces the standard Hugging Face training path with hand-written Triton kernels, manually-derived backward passes, fused operations and aggressive activation recomputation — delivering roughly 2x faster training and 50-70% less peak VRAM for LoRA and QLoRA workloads at zero quality cost.
Supported families in mid-2026: Llama 1/2/3/3.1/3.2/3.3, Mistral 7B / Nemo / Large, Gemma 1/2/3, Qwen 1.5 / 2 / 2.5 / 3, Phi-2/3/3.5, DeepSeek-V2 / V3, Mixtral 8x7B / 8x22B, plus the multi-modal Llama 3.2 Vision and Qwen2-VL families. Architecture-specific kernels mean unsupported models silently fall back to baseline HF performance.
Standard entry point is `from unsloth import FastLanguageModel`, then `FastLanguageModel.from_pretrained(model_name='unsloth/llama-3.1-8b-bnb-4bit', max_seq_length=4096, load_in_4bit=True)` followed by `FastLanguageModel.get_peft_model(model, r=32, lora_alpha=64, target_modules=[...])`. The returned model plugs straight into TRL's SFTTrainer, DPOTrainer, GRPOTrainer and KTOTrainer — no other code changes.
Headline performance on a single H100 80 GB at sequence length 4k: Llama 3.1 8B QLoRA at ~6,500 tokens/sec (vs ~3,200 for vanilla HF + PEFT), Mistral 7B QLoRA at ~7,200 tokens/sec (vs ~3,800), Gemma 2 9B QLoRA at ~5,800 (vs ~2,900), Llama 3.1 70B QLoRA fits at sequence length 4k where the vanilla path OOMs.
OSS edition is fully featured for single-GPU; the commercial Unsloth Pro / Enterprise tier adds multi-GPU DDP / FSDP wrappers and proprietary kernel optimisations. Yobibyte's single-GPU FineTune profile defaults to Unsloth kernels for supported model families — it is the lowest-cost path to a high-quality adapter at the 7-13B base size that dominates customer fine-tunes.

Overview#

Quick start: 8B QLoRA in a Jupyter cell#

python

# unsloth_quickstart.py — runs on a single H100 / A100 / RTX 4090
# pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" trl datasets

from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# 1. Load a pre-quantised base (skip the NF4 quant step on every run).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantised NF4 mirror
    max_seq_length=4096,
    dtype=None,                       # auto-detect BF16 on Hopper / Ada
    load_in_4bit=True,
)

# 2. Wrap with LoRA. Target every linear in attention + MLP.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0,                   # 0 is faster; Unsloth recommends 0 unless data is small
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's tuned variant — saves another ~30% VRAM
    random_state=42,
    use_rslora=True,                  # rank-stable scaling for r >= 32
    loftq_config=None,
)

# 3. Train via TRL — the model is a normal PeftModel underneath.
ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(10_000))

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",
    max_seq_length=4096,
    args=SFTConfig(
        output_dir="./outputs/llama3-8b-unsloth",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=is_bfloat16_supported(),
        fp16=not is_bfloat16_supported(),
        logging_steps=10,
        save_steps=200,
        optim="adamw_8bit",           # 8-bit AdamW via bitsandbytes (paged variant: "paged_adamw_8bit")
        lr_scheduler_type="cosine",
        seed=42,
    ),
)
trainer.train()

# 4. Save adapter (~300 MB) or merge for serving.
model.save_pretrained("./llama3-8b-unsloth-adapter")
# Merge to BF16 (dequant base first — critical for QLoRA-trained adapters):
model.save_pretrained_merged(
    "./llama3-8b-unsloth-merged",
    tokenizer,
    save_method="merged_16bit",       # dequant → merge → save in BF16
)
# Or push to GGUF for llama.cpp serving:
# model.save_pretrained_gguf("./llama3-8b-gguf", tokenizer, quantization_method="q4_k_m")
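
The ~300 MB adapter size quoted in step 4 can be sanity-checked from the LoRA shapes. The sketch below is a back-of-envelope estimate, not part of Unsloth: it assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 1024-wide K/V projections from grouped-query attention), and the helper name `lora_param_count` is illustrative.

```python
# Back-of-envelope LoRA adapter size for the quick-start settings (r=32, all
# seven projections). Dimensions are assumed from the Llama 3.1 8B config.
HIDDEN, MLP, KV_DIM, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each targeted linear layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """A is (rank x in) and B is (out x rank), so each linear adds rank * (in + out)."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_block * LAYERS


params = lora_param_count(32)
print(f"trainable LoRA params: {params:,}")
print(f"FP32 adapter: {params * 4 / 1e6:.0f} MB")
print(f"16-bit adapter: {params * 2 / 1e6:.0f} MB")
```

Under these assumptions the FP32 figure lands close to the ~300 MB quoted above; an adapter saved in 16-bit would be roughly half that.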

How it works: where the speed-up actually lives#

Triton kernel fusion: RMSNorm and RoPE each run as a single kernel, SwiGLU (silu(gate) * up) in one kernel, LoRA A/B matmul fused into the linear.
Manual backward: no autograd buffers for frozen base; gradient only for A, B and the input. Saves both compute and memory.
Selective gradient checkpointing: recompute cheap ops only; cache expensive attention output. ~10% compute overhead vs 30% for standard checkpointing.
Fused cross-entropy + LM-head: never materialise the full (batch * seq_len, vocab_size) logits tensor. Saves 5-10 GB peak VRAM.
Pre-quantised model mirrors at `unsloth/<model>-bnb-4bit`: skip the 60-120 second NF4 quantisation step on every load.
Architecture support is per-family: Llama, Mistral, Gemma, Qwen, Phi, DeepSeek covered. New architectures take ~1 week of kernel work per family.
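
To see why avoiding the dense logits tensor matters, the arithmetic can be worked through for the quick-start shape. This is a rough sketch that assumes micro batch 2, a 4k context and the Llama 3.1 vocabulary of 128,256 tokens; `logits_bytes` is a hypothetical helper, and the real saving depends on dtype and on what the unfused path keeps alive.

```python
def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_value: int) -> int:
    """Size of a dense (batch * seq_len, vocab) logits tensor in bytes."""
    return batch * seq_len * vocab * bytes_per_value


# Quick-start shape: micro batch 2, 4k context, Llama 3.1 vocabulary.
fp32 = logits_bytes(2, 4096, 128_256, 4)
bf16 = logits_bytes(2, 4096, 128_256, 2)
print(f"FP32 logits: {fp32 / 1e9:.1f} GB")
print(f"BF16 logits: {bf16 / 1e9:.1f} GB")

# The gradient with respect to the logits has the same shape, so an unfused
# loss can hold roughly twice the tensor at its peak.
print(f"FP32 logits + gradient: {2 * fp32 / 1e9:.1f} GB")
```

The results are of the same order as the 5-10 GB range quoted in the list above.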

Reference: FastLanguageModel API surface#

Call / argument	Type	Default	What it does
FastLanguageModel.from_pretrained()	classmethod	—	Load a base model with Unsloth optimisations attached
model_name	string	—	HF model ID; prefer `unsloth/<model>-bnb-4bit` for pre-quantised
max_seq_length	int	—	Maximum context for training; affects RoPE scaling and kernel choice
dtype	torch.dtype or None	None (auto)	BF16 on Ampere and newer (A100, RTX 30/40 series, H100); FP16 on older GPUs such as T4 / V100
load_in_4bit	bool	True	Load the base in 4-bit NF4 for QLoRA; set False for 16-bit LoRA
load_in_8bit	bool	False	8-bit quantisation (rarely used in 2026)
token	string	None	HF token for gated models
device_map	string	'sequential'	Distribution across GPUs; 'sequential' for single-GPU
rope_scaling	dict	None	Override RoPE scaling for context extension
fix_tokenizer	bool	True	Patch tokeniser pad token / chat template if broken
FastLanguageModel.get_peft_model()	classmethod	—	Attach LoRA with Unsloth kernels
r	int	16	LoRA rank
lora_alpha	int	16	LoRA alpha (Unsloth recommends alpha = r or 2r; the quick start uses 2r)
lora_dropout	float	0	Unsloth recommends 0 for speed; 0.05 for very small datasets
target_modules	list[string]	['q_proj','k_proj','v_proj','o_proj','gate_proj','up_proj','down_proj']	Covers attention and MLP by default; narrow to the attention projections to trade quality for VRAM
bias	string	'none'	'none', 'all', or 'lora_only'
use_gradient_checkpointing	string	'unsloth'	'unsloth' (recommended), True (HF default), or False
random_state	int	3407	Seed for reproducibility
use_rslora	bool	False	Rank-stable scaling for r >= 32
loftq_config	dict or None	None	LoftQ initialisation for QLoRA quality recovery
modules_to_save	list[string]	None	Layers to fully fine-tune alongside LoRA (e.g. embed_tokens, lm_head for vocab expansion)
FastLanguageModel.for_inference()	classmethod	—	Switch model to inference mode (disables checkpointing, enables fast generate)
model.save_pretrained_merged()	method	—	Merge adapter into base and save BF16 model
save_method	string	'merged_16bit'	'merged_16bit' (BF16), 'merged_4bit' (re-quantise), or 'lora' (adapter only)
model.save_pretrained_gguf()	method	—	Export to GGUF for llama.cpp serving
quantization_method	string	'q4_k_m'	GGUF quant level: f16, q8_0, q6_k, q5_k_m, q4_k_m, q3_k_m, q2_k
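
The interaction between `r`, `lora_alpha` and `use_rslora` is easiest to see as a number. The sketch below uses the scaling conventions from the LoRA and rsLoRA papers (alpha / r and alpha / sqrt(r) respectively), which is also how PEFT applies them as far as I know; `lora_scale` is an illustrative helper, not an Unsloth function.

```python
import math


def lora_scale(r: int, lora_alpha: float, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update before it is added to the base output."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


# (r, lora_alpha): the table defaults, then the quick-start values.
for r, alpha in [(16, 16), (32, 64)]:
    plain = lora_scale(r, alpha)
    rank_stable = lora_scale(r, alpha, use_rslora=True)
    print(f"r={r}, alpha={alpha}: standard={plain:.2f}, rsLoRA={rank_stable:.2f}")
```

With the quick-start values the rank-stable multiplier is several times larger than the standard one, so it may be worth revisiting `lora_alpha` or the learning rate when switching `use_rslora` on.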

Workload patterns#

Five patterns cover essentially every real Unsloth use in 2026. Each pattern has a typical hardware profile and a known throughput envelope.

Pattern 1 — Single-GPU QLoRA on a 7-13B base. Dominant workload. RTX 4090 24 GB, RTX A6000 48 GB, L40S 48 GB or H100/H200. Sequence length 4k-8k, micro batch 2-4 with grad_accum 4-8, single epoch over 10-50k examples completes in 1-3 hours. This is the workload Unsloth is most aggressively optimised for and where the 2x speed-up is largest. Yobibyte FineTune's default profile for sub-30B targets.
Pattern 2 — Single-GPU QLoRA on a 30-70B base. H100 80 GB or H200 141 GB. Sequence length 4k (8k on H200), micro batch 1 with grad_accum 8-16. 70B QLoRA fits on a single H100 80 GB with Unsloth where it OOMs with vanilla HF + PEFT; this is the largest practical 'one GPU is enough' base size in 2026.
Pattern 3 — Single-GPU preference training (DPO, ORPO, KTO, GRPO). Same base sizes as Pattern 1 / 2; replace `SFTTrainer` with `DPOTrainer` / `ORPOTrainer` / `KTOTrainer` / `GRPOTrainer`. Unsloth's kernels accelerate the preference forward pass in the same way as the SFT forward; the speed-up is similar (~1.8-2x). Standard recipe is to layer the preference stage on top of a Pattern-1 SFT adapter.
Pattern 4 — Continued pretraining on a domain corpus. Full FT (`adapter:` empty) or LoRA on raw text. Sequence length 8k-32k. Less common but supported.
Pattern 5 — Multi-modal fine-tuning via FastVisionModel. Llama 3.2 Vision, Qwen2-VL, Pixtral. API is parallel to FastLanguageModel but the loader handles image processors and the chat template encodes image tokens. Sequence lengths 8k-16k typical because of image token expansion.
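
The micro batch and grad_accum ranges in Patterns 1 and 2 translate into an effective batch size and a step count. A minimal sketch, assuming one GPU and no sample packing (packing reduces the number of sequences and therefore the steps); `optimizer_steps` is a hypothetical helper and the example sizes are picked from the ranges above.

```python
import math


def optimizer_steps(num_examples: int, micro_batch: int, grad_accum: int, epochs: int = 1) -> int:
    """Approximate optimizer updates for a single-GPU run without sample packing."""
    effective_batch = micro_batch * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs


# Pattern 1 envelope: micro batch 2-4, grad_accum 4-8, 10-50k examples.
print(optimizer_steps(10_000, micro_batch=2, grad_accum=4))
print(optimizer_steps(50_000, micro_batch=4, grad_accum=8))

# Pattern 2 envelope: micro batch 1, grad_accum 8-16.
print(optimizer_steps(30_000, micro_batch=1, grad_accum=16))
```

The step count is what `warmup_ratio`, `save_steps` and `logging_steps` are measured against, so it is worth computing before picking those values.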

Sizing and capacity planning#

Sizing guidance for Unsloth fine-tuning in 2026, assuming `use_gradient_checkpointing='unsloth'`, `load_in_4bit=True`, sample packing on and standard chat templates.

Base size	Method	Seq len	Peak VRAM (Unsloth)	Peak VRAM (vanilla HF)	Fits on
Mistral 7B	QLoRA r=32	4k	~9 GB	~18 GB	RTX 4090 24 GB
Llama 3.1 8B	QLoRA r=32	4k	~10 GB	~22 GB	RTX 4090 24 GB
Llama 3.1 8B	QLoRA r=32	8k	~14 GB	~30 GB (OOM 24 GB)	RTX 4090 24 GB / L40S
Llama 3.1 8B	QLoRA r=32	16k	~20 GB	~50 GB	L40S 48 GB / H100 80 GB
Gemma 2 9B	QLoRA r=32	4k	~12 GB	~26 GB	RTX 4090 24 GB
Qwen 2.5 14B	QLoRA r=32	4k	~14 GB	~30 GB	RTX 4090 24 GB (tight) / L40S
Mixtral 8x7B	QLoRA r=32	4k	~22 GB	OOM 24 GB	L40S 48 GB / A100 80 GB
Llama 3.1 70B	QLoRA r=32	4k	~48 GB	~75 GB	H100 80 GB / H200 141 GB
Llama 3.1 70B	QLoRA r=32	8k	~58 GB	OOM 80 GB	H100 80 GB / H200 141 GB
Llama 3.1 70B	QLoRA r=32	16k	~80 GB	OOM 80 GB	H200 141 GB
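
A quick way to read the table is to separate the fixed cost of the 4-bit weights from everything else. The sketch below assumes 0.5 byte per parameter as a lower bound (quantisation constants add a little on top) and takes the parameter counts from the model names; `nf4_weight_gb` is an illustrative helper and the peak figures are copied from the Unsloth column at 4k context.

```python
def nf4_weight_gb(params_billion: float) -> float:
    """Lower bound for 4-bit weights: 0.5 byte per parameter, in GB."""
    return params_billion * 0.5


# model name -> (nominal size in billions of parameters, peak VRAM in GB from the table)
table_rows = {
    "Llama 3.1 8B": (8, 10),
    "Qwen 2.5 14B": (14, 14),
    "Llama 3.1 70B": (70, 48),
}

for name, (size_b, peak_gb) in table_rows.items():
    weights = nf4_weight_gb(size_b)
    print(f"{name}: weights >= {weights:.1f} GB, "
          f"~{peak_gb - weights:.1f} GB left for adapter, optimiser state and activations")
```

The weight floor does not change with sequence length, which is why the longer-context rows grow only by the activation share.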

Limits and quotas#

Unsloth itself has no hard quotas — it is a library, not a service. The practical ceilings are architectural and operational.

Limit	Practical ceiling (2026)	Notes
GPU count (OSS edition)	1	Multi-GPU requires Unsloth Pro / Enterprise
Max base model size (single GPU)	70B on H100 80 GB; ~141B on H200	Above this, multi-GPU is mandatory
Max sequence length	131,072+ (Llama 3.1)	Activation memory sets the effective ceiling; sweep `per_device_train_batch_size` (the micro batch) to find what fits
Max LoRA rank	1024+	Quality plateaus at r=64-128 for most workloads
Supported model families	~10 families covering 50+ models	See unsloth.ai/docs for live list
Custom architectures	Falls back to baseline HF perf or fails	Wait for upstream kernel support or use Axolotl
Custom loss functions	TRL-supported only (SFT, DPO, ORPO, KTO, GRPO, CPO, IPO)	Novel losses require hand-written TRL loop

Observability#

`train/loss`: should fall steadily. A sudden plateau or spike usually means the chat template is wrong (check the output of `tokenizer.apply_chat_template`).
`train/learning_rate`: confirms warmup completed and cosine decay engaged.
`train/grad_norm`: healthy range 0.3-2.0 for SFT, 0.1-0.5 for DPO. Spikes to 10+ indicate the learning rate is too high.
Throughput in tokens/sec: the SFTTrainer progress bar reports the step rate (it/s or s/it), so multiply by tokens per optimizer step to get tokens/sec. On H100 80 GB at 4k context expect 6,000-7,500 tokens/sec for 7-9B QLoRA; below 4,000 means the kernels did not load (check the Unsloth version and model family support).
Peak GPU memory: `torch.cuda.max_memory_allocated() / 1e9` gives GB. Should match the Sizing table within 10%.
`unsloth.__version__` and `is_bfloat16_supported()`: log both at run start so the kernel version and precision that trained the adapter are on record for reproducibility.

python

# At the top of every Unsloth run — capture kernel version + GPU context.
import unsloth, torch
print(f"unsloth: {unsloth.__version__}")
print(f"torch:   {torch.__version__}")
print(f"cuda:    {torch.version.cuda}")
print(f"gpu:     {torch.cuda.get_device_name(0)}")
print(f"vram:    {torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB")

# After training — record peak memory for capacity planning.
print(f"peak vram: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")

# Optional: W&B logging. Set these before constructing the trainer and pass
# report_to=["wandb"] (or ["mlflow"]) in SFTConfig.
import os
os.environ["WANDB_PROJECT"] = "yobitel-finetune"
os.environ["WANDB_NAME"] = "llama3-8b-unsloth-r32"  # W&B reads the run name from WANDB_NAME
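
When only a seconds-per-step reading is at hand, the tokens/sec figure used in the list above can be derived from the batch shape. A minimal sketch, assuming sample packing fills every sequence to the full length (without packing the real figure is lower); `tokens_per_second` is a hypothetical helper and the 5.0 s/step value is a made-up example, not a measurement.

```python
def tokens_per_second(micro_batch: int, grad_accum: int, seq_len: int,
                      seconds_per_step: float) -> float:
    """Upper-bound throughput: tokens in one optimizer step divided by its duration."""
    return micro_batch * grad_accum * seq_len / seconds_per_step


# Quick-start settings: micro batch 2, grad_accum 4, 4k context.
tps = tokens_per_second(2, 4, 4096, seconds_per_step=5.0)
print(f"{tps:,.0f} tokens/sec")

# Threshold taken from the observability list above.
if tps < 4000:
    print("below 4,000 tokens/sec: check that the Unsloth kernels loaded")
```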

Cost and FinOps#

Workload	Vanilla HF GPU-hours	Unsloth GPU-hours	NeoCloud cost (Unsloth)
Mistral 7B QLoRA, 10k examples, 1 epoch	~2	~1	~$2.60
Llama 3.1 8B QLoRA, 50k examples, 3 epochs	~30	~15	~$39
Gemma 2 9B QLoRA, 50k examples, 3 epochs	~32	~17	~$44
Qwen 2.5 14B QLoRA, 30k examples, 2 epochs	~24	~13	~$34
Mixtral 8x7B QLoRA, 30k examples, 2 epochs	~36	~20	~$52
Llama 3.1 70B QLoRA, 30k examples, 1 epoch	~50	~28	~$90
Llama 3.1 8B DPO (after SFT), 20k pairs	~4	~2	~$5
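
The table does not state the hourly rate behind the cost column, but it can be backed out from the rows. The sketch below divides each quoted cost by its Unsloth GPU-hours; the row labels are shortened and the figures are copied from the table as-is.

```python
# (Unsloth GPU-hours, quoted cost in USD) copied from the table above.
rows = {
    "Mistral 7B QLoRA": (1, 2.60),
    "Llama 3.1 8B QLoRA": (15, 39),
    "Gemma 2 9B QLoRA": (17, 44),
    "Qwen 2.5 14B QLoRA": (13, 34),
    "Mixtral 8x7B QLoRA": (20, 52),
    "Llama 3.1 70B QLoRA": (28, 90),
    "Llama 3.1 8B DPO": (2, 5),
}

implied_rate = {name: cost / hours for name, (hours, cost) in rows.items()}
for name, rate in implied_rate.items():
    print(f"{name}: ~${rate:.2f} per GPU-hour")
```

Most rows imply roughly $2.60 per GPU-hour. The 70B row implies a higher rate, which would be consistent with a larger GPU class, though the table does not say which GPU each row was priced on.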

Security and compliance#

Install source: `pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"` pulls directly from the upstream repo; pin a commit SHA for production reproducibility.
Pre-quantised mirrors at `unsloth/<model>-bnb-4bit` are hosted on Hugging Face Hub under the official Unsloth org — same trust boundary as any HF model download. Verify hashes against the original repo for compliance-grade evidence.
Offline operation: set `HF_HUB_OFFLINE=1` once the base, tokeniser and dataset are pre-staged; required for OFFICIAL-SENSITIVE workloads with no outbound network.
Adapter artefacts: standard PEFT format, safetensors-only by default — no executable payload risk on load.
Audit logging: TRL's standard W&B / MLflow integration captures every run; sufficient for SOC 2 CC6 / ISO 27001 A.12.4 when shipped to your tenancy log store.
Reproducibility: pin `unsloth==2025.6`, `torch==2.5.0`, `transformers==4.45.0`, `peft==0.13.0`, `trl==0.11.0`, `bitsandbytes==0.43.0` in a lockfile alongside the dataset SHA and base model SHA. The Unsloth version is the most consequential pin — kernel changes between versions occasionally affect bit-equivalence.
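
One way to enforce the pins at run start is to compare them against the installed packages before training begins. This is a hedged sketch using only the standard library; the pin values are copied from the reproducibility note above, and `find_pin_mismatches` is a hypothetical helper rather than anything Unsloth ships.

```python
from importlib import metadata

# Pins copied from the reproducibility note above.
PINS = {
    "unsloth": "2025.6",
    "torch": "2.5.0",
    "transformers": "4.45.0",
    "peft": "0.13.0",
    "trl": "0.11.0",
    "bitsandbytes": "0.43.0",
}


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu121" when comparing.
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for package, (expected, installed) in find_pin_mismatches(PINS).items():
        print(f"{package}: expected {expected}, found {installed}")
```

Failing the run when the returned dictionary is non-empty keeps an accidental upgrade from silently changing the kernels that trained the adapter.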

Migration and alternatives#

Unsloth, Axolotl, LLaMA-Factory and managed Yobibyte FineTune occupy adjacent positions in the fine-tune toolchain.

Tool	Strength	Weakness	When to pick
Unsloth	2x throughput + 50-70% less VRAM on single GPU; bit-identical quality	Single-GPU OSS; architecture-specific kernels	Single GPU, supported family, throughput-bound
Axolotl	Most flexible; multi-GPU native; full TRL surface	No throughput multiplier vs vanilla HF without Unsloth integration	Multi-GPU; novel recipes; preference training at scale
Axolotl + Unsloth	Axolotl's flexibility + Unsloth's kernels (`unsloth_lora_mlp: true`)	Only on Axolotl-supported model families that overlap with Unsloth's	Highest single-GPU throughput in a YAML-driven workflow
LLaMA-Factory	Gradio web UI; broad family coverage	Less throughput-optimised than Unsloth	UI-driven workflow; rapid model exploration
Yobibyte FineTune (managed)	Wraps Unsloth + Axolotl behind an API; integrated multi-LoRA serving	Less granular than self-hosted Unsloth	Teams that want fine-tuning as a service on Yobitel-managed H100/H200

Troubleshooting#

Failure modes that bite real Unsloth users.

Symptom	Most likely cause	Fix
Throughput same as vanilla HF (no speed-up)	Unsloth kernels did not load — unsupported architecture	Verify model family on unsloth.ai/docs; fall back to Axolotl
`RuntimeError: Unsloth: ... not supported`	Model architecture not in supported list	Use the original HF + PEFT path or wait for kernel update
NaN loss from step 1	FP16 with LoRA (use BF16)	Set `dtype=torch.bfloat16` or rely on `is_bfloat16_supported()`
OOM despite Sizing table fitting	`use_gradient_checkpointing=False`	Set `use_gradient_checkpointing='unsloth'`
Slow first step then fast — fine	Triton kernel JIT compilation on first iteration	Normal; subsequent steps are fast
`Could not find unsloth/... model`	Pre-quantised mirror not yet published	Use the original repo + `load_in_4bit=True` (re-quantises locally)
Merged model produces garbage	Used `save_pretrained_merged(save_method='merged_4bit')` on QLoRA adapter	Use `save_method='merged_16bit'` — dequantises base before merge
Adapter loads in vLLM but quality is wrong	Tokeniser changed (added special tokens) without `modules_to_save`	Re-train with `modules_to_save=['embed_tokens','lm_head']`
`ImportError: cannot import name 'FastLanguageModel'`	Old Unsloth version (<2024.5)	Upgrade: `pip install --upgrade unsloth`
DPO loss not decreasing	Reference model not frozen / wrong beta	Confirm `beta=0.1` and `force_use_ref_model=True` if needed
GGUF export fails	llama.cpp not installed or model family unsupported by GGUF converter	Install `llama.cpp` via the Unsloth docs; check GGUF support per family

Where Unsloth fits in the Yobitel stack#

References

Unsloth on GitHub · GitHub
Unsloth documentation · Unsloth
Unsloth pre-quantised model mirrors · Hugging Face
TRL — Transformer Reinforcement Learning · GitHub
