Liger Kernel

TL;DR

Open-sourced by LinkedIn in 2024, Liger Kernel is a collection of efficient Triton kernels for LLM training — RMSNorm, RoPE, SwiGLU, fused cross-entropy, fused linear + cross-entropy, and more.
Drop-in replacements: monkey-patch HuggingFace model classes with `apply_liger_kernel_to_llama()` and similar one-liners.
Reported ~20 % higher multi-GPU training throughput and ~60 % lower GPU memory usage on Llama-architecture training, with no model code changes.

Overview

Liger Kernel is LinkedIn's contribution to the Triton-based LLM-kernel ecosystem. Released in August 2024, it bundles fused implementations of the operations that dominate transformer training time outside attention: RMSNorm, RoPE, SwiGLU, GeGLU, cross-entropy, and the fused-linear-cross-entropy 'lm head' operation that eliminates the huge logits tensor.

Its value proposition is integration ergonomics: one line of code applies the patches to standard HuggingFace model classes. The kernels themselves are written in OpenAI Triton, which makes them portable across NVIDIA and AMD GPUs without rewriting in CUDA.
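As a rough illustration of the one-line integration described above, the sketch below patches the HuggingFace Llama classes before a model is built. It assumes the `liger-kernel` package is installed and that `apply_liger_kernel_to_llama` is importable from `liger_kernel.transformers`, as the project README shows. The wrapper `enable_liger_for_llama` is a hypothetical helper added here so the snippet degrades gracefully when the package is absent.

```python
def enable_liger_for_llama() -> bool:
    """Try to monkey-patch the HuggingFace Llama classes with Liger kernels.

    Returns True if the patch was applied, False if liger-kernel is not
    installed. Hypothetical helper for illustration; not part of Liger.
    """
    try:
        # Import path as shown in the project README; assumed unchanged
        # in the version you have installed.
        from liger_kernel.transformers import apply_liger_kernel_to_llama
    except ImportError:
        return False
    # Call this before instantiating the model, so that the patched
    # classes are the ones that get constructed.
    apply_liger_kernel_to_llama()
    return True


if __name__ == "__main__":
    print("Liger patch applied:", enable_liger_for_llama())
```

The ordering matters: the patch swaps module-level classes and functions, so a model created before the call keeps its original layers unless the library is given the model instance explicitly.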

What Liger Includes

RMSNorm: fused forward + backward, reported ~7× faster than the unfused PyTorch implementation.
Rotary Position Embedding (RoPE): in-place rotation, lower memory.
SwiGLU and GeGLU: fused elementwise activation-and-multiply (for SwiGLU, SiLU(gate) × up); the surrounding projection matmuls stay as ordinary linear layers.
Cross-entropy: online-softmax computation that writes the gradient in place over the logits instead of allocating further full-size intermediates.
Fused Linear + Cross-Entropy: the headline kernel. It combines the lm-head matmul with the cross-entropy step and processes tokens in chunks, so the full [batch × seq × vocab] logits tensor is never materialised.
GroupNorm, LayerNorm, KL divergence, JSD: secondary kernels for fine-tuning and distillation.
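To make concrete what two of these kernels compute, here is an unfused NumPy reference for RMSNorm and for the elementwise part of SwiGLU. It is only a sketch of the math, not Liger's Triton code: the function names and the `eps` default are illustrative assumptions. In eager PyTorch each arithmetic step below would typically be its own kernel launch with its own intermediate tensor, which is the overhead a fused kernel removes.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    """Unfused RMSNorm: divide each row by its root-mean-square, then scale."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight


def silu_mul(gate, up):
    """Elementwise part of SwiGLU: SiLU(gate) * up, with SiLU(z) = z * sigmoid(z)."""
    return gate / (1.0 + np.exp(-gate)) * up


if __name__ == "__main__":
    x = np.array([[3.0, 4.0]])
    y = rms_norm(x, weight=np.ones(2))
    # After normalisation each row has a mean square of approximately 1.
    print(y, np.mean(y * y))
```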

Performance Characteristics

On the project's published Llama 3 8B training benchmarks, Liger reports ~20 % higher throughput and ~60 % lower peak GPU memory. Much of the memory win comes from the Fused Linear Cross-Entropy kernel: with Llama 3's vocabulary of roughly 128k tokens, the standard logits tensor runs to several GB at long context, and Liger's chunked computation never materialises it in full.
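The "several GB" figure can be checked with simple arithmetic. The sketch below assumes a vocabulary of 128,256 entries (the Llama 3 tokenizer size), a batch size of 1, and dense storage in bf16 (2 bytes) or fp32 (4 bytes); `logits_bytes` is an illustrative helper, not a Liger function.

```python
def logits_bytes(batch, seq_len, vocab_size, bytes_per_element):
    """Size in bytes of a dense [batch, seq_len, vocab_size] logits tensor."""
    return batch * seq_len * vocab_size * bytes_per_element


GIB = 1024 ** 3
VOCAB = 128_256  # Llama 3 tokenizer vocabulary size (assumed for this example)

if __name__ == "__main__":
    for seq_len in (2_048, 8_192, 32_768):
        bf16 = logits_bytes(1, seq_len, VOCAB, 2) / GIB
        fp32 = logits_bytes(1, seq_len, VOCAB, 4) / GIB
        print(f"seq_len={seq_len:>6}: bf16 {bf16:.2f} GiB, fp32 {fp32:.2f} GiB")
```

At batch size 1 and 8,192 tokens this comes to just under 2 GiB in bf16 and just under 4 GiB in fp32, before counting the gradient tensor of the same shape.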

Fused Linear Cross Entropy is the single biggest win for long-context fine-tuning. If you only enable one Liger kernel, enable that one.
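The idea behind the fused kernel can be sketched in plain NumPy: walk over the tokens in chunks, compute the logits for one chunk, fold them into the loss and gradients immediately, then discard them. This is a simplified illustration of the chunking technique under assumed names and a mean-reduced loss, not Liger's Triton implementation. A non-chunked baseline is included for comparison.

```python
import numpy as np


def full_linear_cross_entropy(hidden, weight, targets):
    """Baseline: materialise all [tokens, vocab] logits, then mean cross-entropy."""
    logits = hidden @ weight.T
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=4):
    """Chunked variant: only a [chunk_size, vocab] slice of logits exists at once.

    Returns (mean loss, gradient w.r.t. hidden, gradient w.r.t. weight).
    """
    n_tokens = hidden.shape[0]
    total_loss = 0.0
    grad_hidden = np.zeros_like(hidden)
    grad_weight = np.zeros_like(weight)
    for start in range(0, n_tokens, chunk_size):
        stop = min(start + chunk_size, n_tokens)
        h = hidden[start:stop]
        t = targets[start:stop]
        rows = np.arange(stop - start)
        # Logits for this chunk only, shifted for numerical stability.
        logits = h @ weight.T
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        total_loss += -np.log(probs[rows, t]).sum()
        # d(loss)/d(logits) = (softmax - one_hot) / n_tokens, reusing the buffer.
        probs[rows, t] -= 1.0
        probs /= n_tokens
        grad_hidden[start:stop] = probs @ weight
        grad_weight += probs.T @ h
    return total_loss / n_tokens, grad_hidden, grad_weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(10, 8))    # [tokens, hidden_dim]
    weight = rng.normal(size=(16, 8))    # [vocab, hidden_dim]
    targets = rng.integers(0, 16, size=10)
    loss, _, _ = chunked_linear_cross_entropy(hidden, weight, targets)
    print(loss, full_linear_cross_entropy(hidden, weight, targets))
```

Computing the gradients inside the same loop is what lets the chunk's logits be thrown away: nothing of shape [tokens, vocab] has to be kept for the backward pass.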

When to Use

Use Liger when fine-tuning or training HuggingFace-style transformer models: the Llama, Mistral, Mixtral, Qwen, Gemma, and Phi families are all covered. It composes with FSDP, DeepSpeed, and torchtune, and is available as an opt-in optimisation in HuggingFace TRL and Accelerate. For Megatron / NeMo training, the native fused kernels already cover most of Liger's surface area.

Pitfalls

Coverage is HuggingFace-architecture-shaped — non-standard architectures may not be patched automatically.
Triton compilation cost on first run can be noticeable; warm up before benchmarking.
Some kernels assume specific shapes — very small batch sizes or unusual head dimensions may fall back to PyTorch.
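For the compilation pitfall, a timing helper along the following lines keeps one-off JIT cost out of the measurement. It is a minimal sketch using only the standard library: `benchmark`, `warmup`, `iters`, and `sync` are names chosen for this example, and on a GPU you would typically pass `torch.cuda.synchronize` as `sync` so that queued kernels are included in the timed interval.

```python
import time


def benchmark(step, warmup=3, iters=10, sync=None):
    """Return mean seconds per call of `step`, excluding `warmup` untimed calls.

    `sync` is an optional callable run before each clock read, for example
    torch.cuda.synchronize when `step` launches asynchronous GPU work.
    """
    for _ in range(warmup):
        step()  # absorbs one-off costs such as Triton JIT compilation
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    mean_s = benchmark(lambda: sum(range(10_000)))
    print(f"{mean_s * 1e6:.1f} microseconds per call")
```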

Software

github.com/linkedin/Liger-Kernel — main repository, BSD-2-Clause.
Built into HuggingFace TRL via `use_liger_kernel=True`.
Compatible with HuggingFace Accelerate, FSDP, DeepSpeed, torchtune.
Written in OpenAI Triton — runs on NVIDIA and AMD (ROCm Triton) backends.
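The `use_liger_kernel=True` switch mentioned above might be used roughly as follows. This sketch assumes `trl` and `liger-kernel` are both installed and that `SFTConfig` accepts the flag (to the best of our knowledge it inherits it from the Transformers `TrainingArguments` class; check the version you have). `build_sft_config` is a hypothetical wrapper for illustration only.

```python
def build_sft_config(output_dir: str):
    """Build a TRL SFTConfig with Liger kernels switched on.

    Requires `trl` and `liger-kernel`; the import is done lazily so that
    this module still loads on machines without them.
    """
    from trl import SFTConfig

    # The flag asks the trainer to apply the Liger patches to the model
    # class before training starts.
    return SFTConfig(output_dir=output_dir, use_liger_kernel=True)
```

The resulting config would then be passed to the trainer in the usual way; no changes to the model code are needed.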

References

Liger Kernel: Efficient Triton Kernels for LLM Training · arXiv (Hsu et al., 2024)
Liger Kernel on GitHub · GitHub (LinkedIn)
Liger Kernel launch blog · LinkedIn Engineering
