Model Pruning

TL;DR

Compression family that removes parameters from a trained model to reduce inference cost.
Three granularities: unstructured (individual weights), semi-structured (2:4 sparsity patterns) and structured (attention heads, MLP channels, whole layers).
Semi-structured 2:4 sparsity is hardware-accelerated on NVIDIA Ampere and later GPUs: sparse tensor cores offer up to 2x the peak throughput of the equivalent dense GEMM.
Structured pruning (depth pruning, width pruning) is most popular in 2026 because it composes with quantisation and distillation cleanly.

Overview#

Model pruning removes parameters from an already-trained network with the goal of cutting inference cost while preserving as much quality as possible. The intuition, documented since the Optimal Brain Damage work of around 1990, is that trained networks contain substantial redundancy, so many parameters can be removed with only a minor effect on outputs.

For modern LLMs the question is not whether pruning works in isolation (it does) but whether the produced sparsity pattern can be exploited efficiently by inference hardware. The answer depends sharply on the pruning granularity.

Unstructured Pruning#

Individual weights below a magnitude threshold are zeroed out. The resulting tensor is sparse but irregular; without special hardware support, GEMM kernels still treat it as dense, so wall-clock speedup is limited. Memory footprint can shrink with sparse storage, but only at the cost of more complex kernels.
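
As a minimal sketch of the magnitude criterion described above, the following pure-Python function zeroes the smallest-magnitude fraction of a small weight matrix. The function name `magnitude_prune` and the `sparsity` parameter are illustrative assumptions, not part of any library.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a 2-D list of weights."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # number of weights to drop
    if k == 0:
        return [row[:] for row in weights]
    # Largest magnitude that still gets pruned; ties at the threshold
    # are pruned too, so slightly more than k weights may be zeroed.
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


W = [[0.9, -0.05, 0.3, 0.01],
     [-0.7, 0.2, -0.02, 0.6]]
pruned = magnitude_prune(W, sparsity=0.5)
```

The pruned matrix has the same shape as the original and the zeros fall in irregular positions, which is why a dense GEMM kernel gains nothing from them.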

Semi-Structured (2:4) Sparsity#

NVIDIA Ampere introduced sparse tensor cores that accept a 2:4 pattern: in every group of four consecutive weights, at most two may be non-zero. The compressed representation stores only the two kept values per group plus a small amount of position metadata, so weight storage drops to roughly half and peak matrix-multiply throughput is up to 2x that of the dense equivalent. SparseGPT and Wanda are the dominant methods for producing 2:4 sparse LLM weights with acceptable quality loss.

2:4 sparsity is available on every NVIDIA tensor core from Ampere onward, but adoption in production LLM serving is patchy because integrating it with INT4 / FP8 quantisation and continuous batching is non-trivial.
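
The 2:4 constraint can be illustrated with a simple magnitude rule: keep the two largest-magnitude weights in each group of four. This is only a sketch of the pattern. SparseGPT and Wanda use more informed selection criteria, and the `values` / `positions` layout below is a simplified stand-in for the real hardware format, which is not described here. The function names are hypothetical.

```python
def top2_indices(group):
    """Indices (ascending) of the two largest-magnitude entries in a group of 4."""
    ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
    return sorted(ranked[:2])


def prune_2_of_4(row):
    """Return (pruned_row, kept_values, kept_positions) for a 2:4 pattern."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned, values, positions = [], [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = top2_indices(group)
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
        values.extend(group[i] for i in keep)  # two values per group
        positions.extend(keep)                 # each position fits in 2 bits
    return pruned, values, positions


row = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2, -0.02, 0.6]
pruned_row, values, positions = prune_2_of_4(row)
```

Here eight weights compress to four stored values plus four 2-bit positions, which is where the roughly halved storage comes from.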

Structured Pruning#

Width pruning — remove attention heads or MLP intermediate channels.
Depth pruning — remove whole Transformer blocks based on importance scores.
Layer fusion — merge consecutive blocks via low-rank factorisation.
Structured pruning leaves smaller but still dense weight matrices, so any inference runtime runs the pruned model faster without needing sparse kernels.
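
Width pruning of MLP intermediate channels can be sketched as follows. Removing channel j means deleting row j of the up-projection and column j of the down-projection, so both matrices stay dense. The importance score used here (product of the two L2 norms) is an assumed heuristic chosen for illustration, and `prune_mlp_channels` is a hypothetical helper, not a library function.

```python
import math


def prune_mlp_channels(w_up, w_down, keep):
    """Keep the `keep` highest-scoring intermediate channels of an MLP.

    w_up:   [hidden][d_model], one row per intermediate channel
    w_down: [d_model][hidden], one column per intermediate channel
    """
    hidden = len(w_up)

    def score(j):
        # Assumed heuristic: norm of the incoming row times norm of the outgoing column.
        up_norm = math.sqrt(sum(v * v for v in w_up[j]))
        down_norm = math.sqrt(sum(r[j] * r[j] for r in w_down))
        return up_norm * down_norm

    kept = sorted(sorted(range(hidden), key=score, reverse=True)[:keep])
    new_up = [w_up[j][:] for j in kept]                 # drop rows
    new_down = [[r[j] for j in kept] for r in w_down]   # drop columns
    return new_up, new_down, kept


w_up = [[1.0, 0.0], [0.01, 0.02], [0.0, 2.0], [0.1, 0.1]]
w_down = [[0.5, 0.1, 0.3, 0.01],
          [0.2, 0.1, 0.4, 0.02]]
new_up, new_down, kept = prune_mlp_channels(w_up, w_down, keep=2)
```

The result is a 2-channel MLP with ordinary dense matrices, half the size of the original 4-channel one. Depth pruning is analogous at a coarser level: entire blocks are dropped from the list of layers.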

Pipeline#

A typical 2026 production pipeline combines depth or width pruning with distillation (the pruned model is fine-tuned to match the original's outputs) and then quantisation. NVIDIA's Minitron family is a canonical example: a 15B-parameter base model pruned and distilled to 8B and 4B variants that recover most of the original's quality.
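
The distillation step can be sketched as a loss that pushes the pruned (student) model's output distribution toward the original (teacher) model's. The version below is a generic logit-distillation loss, KL divergence with an optional temperature, computed for a single token position. It is an illustrative assumption and is not claimed to be the exact objective used for Minitron or any other specific model.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In training it would be averaged over tokens and minimised with respect to the student's weights.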

Trade-offs#

Pruning is cheap at inference time but expensive in engineering time: producing a good pruned model requires retraining, careful evaluation and benchmarking. For most teams, picking a smaller off-the-shelf model from an existing family (e.g. Llama 3.1 8B rather than 70B) is more cost-effective than custom pruning. Custom pruning is most worth the effort when a specific deployment constraint (latency budget, memory budget) is not met by any off-the-shelf model size.
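
A quick way to check the memory-budget side of this decision is to estimate weight memory from parameter count and bits per weight. The sketch below does that arithmetic. The `overhead_fraction` default of 0.2 (headroom for KV cache and activations) is a placeholder assumption, not a measured figure, and both function names are hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def fits_budget(num_params, bits_per_weight, budget_gib, overhead_fraction=0.2):
    """True if weights plus an assumed overhead fit within budget_gib."""
    needed = weight_memory_gib(num_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= budget_gib


# Example: a 24 GiB accelerator.
small_fp16_ok = fits_budget(8e9, 16, budget_gib=24)   # 8B params at 16 bits
large_int4_ok = fits_budget(70e9, 4, budget_gib=24)   # 70B params at 4 bits
```

If no off-the-shelf size passes a check like this while still meeting the quality target, that is the situation in which custom pruning starts to pay off.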