INT4 Marlin Kernel

TL;DR

Open-source CUDA kernel by IST-DASLab (Frantar et al., 2024) for INT4-weight, FP16-activation GEMM.
Designed for INT4 weight-only quantised LLMs (GPTQ, AWQ); delivers 2-3x speedup over earlier dequant-then-matmul kernels.
Tuned for NVIDIA tensor cores on Ampere (A100) and newer GPUs, including Hopper (H100, H200); weights are dequantised to FP16 in registers and fed to FP16 tensor-core instructions.
Integrated into vLLM, TensorRT-LLM, TGI and SGLang; usually selected automatically for INT4 weights on supported hardware.

Overview

Marlin (short for 'Mixed Auto-Regressive Linear') is a CUDA kernel for the linear layers that dominate LLM inference cost when weights are quantised to INT4 and activations remain in FP16. Earlier kernels dequantised INT4 weights to FP16 in shared memory and then ran a standard FP16 GEMM. Marlin instead fuses dequantisation into the GEMM main loop: packed weights are loaded asynchronously from global memory, unpacked to FP16 in registers, and fed straight to the tensor cores, with memory layouts and pipelining designed to keep the weight loads and the math overlapped.
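The arithmetic that a W4A16 kernel has to reproduce can be sketched in a few lines of plain Python. This is a reference for the semantics only, not Marlin's implementation: the two-per-byte packing order, the fixed zero point of 8, the tiny `GROUP_SIZE` and all function names are assumptions chosen for illustration, and Marlin itself uses a reshuffled layout tailored to tensor-core fragments.

```python
from typing import List

GROUP_SIZE = 4  # hypothetical tiny group; real checkpoints commonly use larger groups


def pack_int4(values: List[int]) -> bytes:
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (0 <= lo <= 15 and 0 <= hi <= 15):
            raise ValueError("values must fit in 4 bits")
        out.append(lo | (hi << 4))
    return bytes(out)


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: recover the 4-bit values in their original order."""
    out: List[int] = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble was stored first
        out.append(byte >> 4)    # high nibble second
    return out


def dequantise(packed: bytes, scales: List[float], zero_point: int = 8) -> List[float]:
    """Apply w = scale[group] * (q - zero_point) with one scale per group."""
    q = unpack_int4(packed)
    return [scales[i // GROUP_SIZE] * (v - zero_point) for i, v in enumerate(q)]


def dot_w4a16(packed_row: bytes, scales: List[float], activations: List[float]) -> float:
    """One output element of the GEMM: dequantised weight row times activations."""
    weights = dequantise(packed_row, scales)
    return sum(w * a for w, a in zip(weights, activations))
```

A fused kernel computes the same result as `dot_w4a16`, but never materialises the full FP16 weight row in memory; only the packed bytes and the per-group scales are read from DRAM.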

The result is a step-change in INT4 LLM throughput: the realised speedup over unquantised FP16 inference rises from the 1.3-1.5x typical of dequant-then-matmul implementations to 2.5x or more on H100 for decode-dominant workloads.

Why It Matters

INT4 quantisation halves weight memory traffic compared to FP8 (and quarters it compared to FP16), which matters because decode at small batch sizes is memory-bound.
Without an efficient kernel, that bandwidth advantage is squandered on dequantisation overhead.
Marlin pushes INT4 GEMM efficiency close to the memory-bandwidth ceiling.
On H100, decode throughput on a 70B AWQ INT4 model with Marlin frequently beats the FP8 path at the same batch size.
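The bandwidth argument above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that every weight is read once per decode step, ignores scales, zero points, KV cache and activations, and uses an assumed peak HBM bandwidth of 3.35 TB/s for an H100 SXM part; the function names are hypothetical and the figures are lower bounds, not measurements.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_weight / 8


def min_decode_step_ms(n_params: float, bits_per_weight: float,
                       bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if all weights are streamed once."""
    return weight_bytes(n_params, bits_per_weight) / bandwidth_bytes_per_s * 1e3


N_PARAMS = 70e9          # 70B-parameter model, as in the example above
HBM_BANDWIDTH = 3.35e12  # assumed peak bandwidth in bytes/s (H100 SXM)

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    gb = weight_bytes(N_PARAMS, bits) / 1e9
    ms = min_decode_step_ms(N_PARAMS, bits, HBM_BANDWIDTH)
    print(f"{name}: {gb:.0f} GB of weights, at least {ms:.1f} ms per decode step")
```

Under these assumptions the INT4 bound is exactly half the FP8 bound; how much of that gap a real deployment recovers depends on how close the kernel gets to peak bandwidth.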

Variants

Subsequent kernels in the same family, sometimes shipped under the same name and sometimes as 'Marlin-MoE' or 'Sparse Marlin', extend the approach to MoE routing and 2:4 sparse INT4 weights. Most LLM runtimes expose them through a Marlin-named quantisation option (the exact name varies by runtime) and pick the appropriate variant for the model.
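The 2:4 pattern mentioned above means that every contiguous group of four weights holds at most two non-zero values. A minimal check for that property might look like the following; the function name is hypothetical and the sketch says nothing about how Sparse Marlin stores or indexes the surviving values.

```python
from typing import Sequence


def is_2_4_sparse(row: Sequence[float]) -> bool:
    """Return True if each group of four consecutive values has at most two non-zeros."""
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )
```

For example, `[1, 0, 0, 2, 0, 0, 3, 0]` satisfies the pattern, while `[1, 2, 3, 0]` does not because its single group contains three non-zero values.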

Integration

When a runtime detects an AWQ or GPTQ INT4 checkpoint on Ampere or Hopper hardware, it typically routes through Marlin automatically. Explicit options exist in vLLM (`--quantization gptq_marlin`, `--quantization awq_marlin`) and TensorRT-LLM (Marlin plugin in `gemm_plugin`). On older hardware without the required tensor-core support, runtimes fall back to dequant-then-matmul kernels.

If INT4 inference feels slow, check which kernel was selected. A `marlin` kernel in the logs is the fast path; anything else is leaving performance on the table.
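The routing described in this section can be sketched as a small decision function. This is an illustration under stated assumptions, not vLLM's actual selection code: `pick_vllm_quantization` and `MARLIN_MIN_CAPABILITY` are hypothetical names, and a real runtime also checks properties such as bit width and group size before choosing Marlin. The returned strings are the vLLM `--quantization` values named above, plus the plain `gptq` and `awq` fallbacks.

```python
from typing import Tuple

# Ampere corresponds to CUDA compute capability 8.0.
MARLIN_MIN_CAPABILITY: Tuple[int, int] = (8, 0)


def pick_vllm_quantization(method: str, capability: Tuple[int, int]) -> str:
    """Choose a quantisation option from the checkpoint method and GPU capability."""
    method = method.lower()
    if method not in ("gptq", "awq"):
        raise ValueError(f"unsupported INT4 method: {method!r}")
    if capability >= MARLIN_MIN_CAPABILITY:
        return f"{method}_marlin"  # fast path on Ampere or newer
    return method                  # fall back to the non-Marlin kernel
```

On a machine with PyTorch installed, the capability tuple can be obtained from `torch.cuda.get_device_capability()`, and the result passed as the `--quantization` value when launching vLLM.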

Limits

Marlin requires Ampere or newer NVIDIA tensor cores (compute capability 8.0 or higher). It does not run on Volta, Turing or non-NVIDIA hardware. For AMD MI300 and Intel Gaudi, equivalent kernels exist in the vendor stacks but are not Marlin per se.

References

Marlin on GitHub · GitHub (IST-DASLab)
GPTQ Paper · arXiv (Frantar et al., 2022)
vLLM Quantisation Documentation · vLLM
