TL;DR
- Post-training INT4 weight-only quantisation by Frantar, Ashkboos, Hoefler and Alistarh (arXiv:2210.17323, 'GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers', ICLR 2023).
- Algorithm: column-by-column quantisation within each linear layer, using a Cholesky-decomposed inverse Hessian to propagate the error introduced at column i into the still-unquantised columns i+1..n.
- Weight-only (W4A16): INT4 weights with per-group FP16 scales (group 128 default); activations stay in FP16/BF16; matmuls are mixed-precision through fused INT4xFP16 GEMM kernels.
- First method to demonstrate INT3 and INT4 quantisation of OPT-175B and BLOOM-176B in approximately four GPU-hours per model with single-digit-percent perplexity loss; established the recipe later refined by AWQ.
- Standard runtime support in vLLM (`--quantization gptq` / `gptq_marlin`), TensorRT-LLM, Hugging Face TGI, SGLang, ExLlamaV2; available as an alternative Yobibyte Marketplace recipe alongside AWQ for teams with existing GPTQ checkpoints or specific Marlin-kernel preferences.
Overview
GPTQ was published in October 2022 by Frantar, Ashkboos, Hoefler and Alistarh (IST Austria and ETH Zurich). It was the first method to demonstrate that the largest Transformers of the era, OPT-175B and BLOOM-176B, could be compressed to INT4 or even INT3 weights in approximately four GPU-hours per model with minimal perplexity loss. The technique extends the older OBQ (Optimal Brain Quantisation) framework, which was prohibitively expensive at LLM scale, through three engineering choices: processing columns in an arbitrary fixed order, lazy batched updates, and Cholesky-based numerical stabilisation.
Three and a half years on, GPTQ remains one of the two dominant INT4 paths for open-weights LLM serving. AWQ has overtaken it as the most common default on the Hugging Face Hub, but GPTQ retains a meaningful ecosystem because (a) many published quantised checkpoints predate AWQ's release in mid-2023, (b) the Marlin INT4 GEMM kernel (IST-DASLab, 2024) raised GPTQ kernel performance enough to close the gap with AWQ on Ampere and Hopper, and (c) for some workloads GPTQ's per-column error correction produces slightly better perplexity than AWQ's per-channel scaling.
Through mid-2026, GPTQ INT4 stays within approximately 1.5 percent of BF16 perplexity on standard benchmarks (WikiText-2, C4) for Llama 3.1 70B at group 128. That is typically 0.05-0.15 perplexity points behind AWQ at the same bit budget, an empirical gap that depends on the model family and the calibration set. On 8B-class and smaller models the difference between the two methods is within run-to-run noise.
This entry helps you decide whether GPTQ is the right INT4 method for your deployment, how to produce a GPTQ checkpoint with AutoGPTQ for self-hosted serving, and when to choose a Yobitel Yobibyte GPTQ recipe over the platform's default AWQ recipe. After reading you should be able to reason about GPTQ throughput, quality and ecosystem trade-offs without running a benchmark sweep.
How it works: the Hessian-based error correction
Naive round-to-nearest INT4 quantisation introduces an error e_i in each weight column i. If those errors are simply accepted, they accumulate across the layer's output and damage downstream quality. GPTQ asks: given that we have already quantised columns 1..i-1 with errors e_1..e_{i-1}, how should we adjust the still-unquantised columns i..n to compensate?
The OBQ framework answers this with the inverse Hessian of the layer's reconstruction loss. The Hessian is H = 2 * X * X^T (where X is the layer's input activation matrix), and its inverse H^-1 tells you how to redistribute the error of column i across the remaining columns to minimise the L2 error in the layer's output. The classic OBQ algorithm costs O(d_row * d_col^3) per weight matrix, roughly quartic in the layer width. That is practical for models of around a hundred million parameters and far too slow at LLM scale.
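Written out, the objective and the update rule described above look roughly as follows. This is a sketch in the notation of this entry: `quant` denotes rounding to the nearest grid point, F is the set of weights in a row that are still unquantised, and q is the weight being quantised. These symbol names are chosen here for illustration.

```latex
% Layer-wise objective for one linear layer with weights W and calibration inputs X
\hat{W} = \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_2^2 ,
\qquad H = 2\, X X^{\top}

% OBQ step for one row w: quantise entry q, then correct the remaining weights F
\delta_F = - \frac{w_q - \operatorname{quant}(w_q)}{[H_F^{-1}]_{qq}} \; (H_F^{-1})_{:,q}
```

The second line is the same quantity that Step 4 below subtracts from the remaining columns: the rounding error of the quantised weight, scaled by the matching row of the inverse Hessian and normalised by its diagonal entry.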
GPTQ makes three changes that turn the algorithm into something that runs on a 70B-parameter model in tens of minutes. First, columns are processed in an arbitrary fixed order (left-to-right) rather than greedily picking the next-best weight to quantise in each row. Frantar et al. showed this loses essentially no quality on large layers, and because every row now shares the same order, the inverse-Hessian updates are computed once per column instead of once per weight. Second, error propagation is batched lazily: updates to columns i+1..n are accumulated and applied in groups rather than after every column. Third, the inverse-Hessian information is precomputed once via Cholesky decomposition, which is numerically stable for the kind of poorly conditioned Hessians that LLM activations produce.
The combined algorithm processes one linear layer in O(d^3) time where d is the layer width — under 10 seconds per layer on an H100 for a 70B model. Total quantisation cost is dominated by the one forward pass over the calibration set needed to materialise X * X^T, which takes roughly 30 minutes on a single H100.
- Step 1: collect a small calibration dataset (128 samples is the standard, drawn from C4 or in-domain text).
- Step 2: for each linear layer in topological order, run a forward pass to capture the input activation matrix X and compute the Hessian H = 2 * X * X^T.
- Step 3: Cholesky-decompose H + lambda * I (where lambda is a small damping factor for numerical stability) to obtain a numerically stable inverse-Hessian L^-T L^-1.
- Step 4: for each column i (left to right), quantise w_i to INT4 producing error e_i, then update columns i+1..n by subtracting e_i * (H^-1)[i, i+1..n] / (H^-1)[i, i] — the OBQ closed-form error redistribution.
- Step 5: apply lazy batched updates to amortise memory traffic; process columns in groups of 128 so the per-column update cost amortises across the group.
- Step 6: store INT4 weights, per-group FP16 scales, and optionally per-group zero points; serialise the GPTQ-format checkpoint.
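Steps 2 to 4 can be sketched in NumPy as below. This is a minimal illustration and not the AutoGPTQ implementation: the function name `gptq_quantize` and its arguments are hypothetical, the lazy batching of Step 5 and act_order are left out for brevity, and the function returns dequantised weights instead of packed INT4 so the output error is easy to measure.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, group_size=128, damp=0.01):
    """Quantise W (rows x cols) column by column; X is (cols x samples).

    Hypothetical helper for illustration. Returns dequantised weights.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    maxq = 2**bits - 1

    # Step 2: Hessian of the layer-wise squared error.
    H = 2.0 * X @ X.T
    # Step 3: damp the diagonal, invert, and take the upper Cholesky
    # factor U of the inverse, so that H^-1 = U^T U.
    H[np.diag_indices(cols)] += damp * np.mean(np.diag(H))
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(cols):
        if i % group_size == 0:
            # New group: per-row scale and zero point from the current weights.
            block = W[:, i:i + group_size]
            wmin = np.minimum(block.min(axis=1), 0.0)
            wmax = np.maximum(block.max(axis=1), 0.0)
            scale = (wmax - wmin) / maxq
            scale[scale == 0] = 1.0
            zero = np.round(-wmin / scale)
        # Step 4a: round column i onto the 2^bits-level grid.
        codes = np.clip(np.round(W[:, i] / scale) + zero, 0, maxq)
        Q[:, i] = scale * (codes - zero)
        # Step 4b: spread the rounding error over the unquantised columns.
        # Row i of U equals the matching inverse-Hessian row divided by the
        # square root of its diagonal, which gives the OBQ update.
        err = (W[:, i] - Q[:, i]) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q
```

Passing an identity matrix as `X` makes the Hessian diagonal, so no error is propagated and the function reduces to plain round-to-nearest. That gives a convenient baseline for checking that the error correction actually lowers the layer's output error on correlated inputs.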
The 'act_order' / 'desc_act' flag in AutoGPTQ orders columns by descending Hessian diagonal, a proxy for activation magnitude, rather than the default left-to-right order. It produces a 0.1-0.3 perplexity point win on most models but requires the runtime to support reordered indices; vLLM's gptq_marlin kernel handles it transparently in v0.5+.
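The reordering can be illustrated with a small sketch. The helper name `act_order_indices` is hypothetical; the idea is to sort columns by the Hessian diagonal, quantise in that order, and keep the inverse permutation plus a per-column group index so the runtime can find the right scale for each original column.

```python
import numpy as np

def act_order_indices(H, group_size=128):
    """Return (perm, invperm, g_idx) for activation-ordered quantisation.

    perm    : column order used during quantisation (largest diagonal first)
    invperm : maps the quantised columns back to their original positions
    g_idx   : quantisation group of each original column
    """
    perm = np.argsort(-np.diag(H), kind="stable")
    invperm = np.argsort(perm)
    g_idx = invperm // group_size
    return perm, invperm, g_idx
```

In use, the weights and Hessian would be permuted with `W[:, perm]` and `H[perm][:, perm]` before quantisation, and the result restored with `Q[:, invperm]`. Because neighbouring original columns can now belong to different groups, the kernel needs the group index array, which is why runtime support matters.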
Variants and architectural choices
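Four quantisation knobs shape every GPTQ deployment: the bit width, the group size, the column ordering (act_order on or off), and whether per-group zero points are stored. The table also lists three operational settings: the damping factor, the kernel format and the calibration set size. The defaults below follow common practice for published Hugging Face GPTQ checkpoints, but AutoGPTQ's own BaseQuantizeConfig defaults can differ (notably for group_size and sym), so set each knob explicitly.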
| Knob | Default | Effect | When to deviate |
|---|---|---|---|
| Bit width | INT4 (w4) | ~4x memory reduction, 1-2 percent perplexity loss | INT3 only when memory is critically tight; quality drop is visible |
| Group size | 128 | Industry standard; balanced quality vs storage | Group 64 for tighter quality at +5 percent storage overhead |
| Column ordering | act_order=True (desc_act) | 0.1-0.3 perplexity win over left-to-right | Set False only when targeting a runtime without reorder support |
| Zero point | True (asymmetric) | Per-group zero shift recovers 0.1-0.2 perplexity points | Symmetric only on specialised kernels (rare in 2026) |
| Damping factor | 0.01 | Hessian regularisation; prevents Cholesky failure | Raise to 0.1 if Cholesky fails on outlier-heavy layers |
| Kernel format | gptq_marlin (vLLM v0.5+) | Fastest decode + prefill on Ampere/Hopper | exllama-v2 kernel on consumer GPUs without Marlin support |
| Calibration set | ~128 samples | Cheap, robust | 512+ samples for domain shift (code, multilingual, math) |
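The storage side of the group-size trade-off is simple arithmetic. The sketch below assumes one FP16 scale and one 4-bit zero point per group, which is a simplification of the on-disk format; the function name is hypothetical.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average bits stored per weight, including per-group metadata.

    Assumes one scale and one zero point per group of `group_size` weights.
    """
    return bits + (scale_bits + zero_bits) / group_size

g128 = effective_bits_per_weight(group_size=128)   # 4 + 20/128
g64 = effective_bits_per_weight(group_size=64)     # 4 + 20/64
overhead = g64 / g128 - 1                          # relative cost of group 64
```

Under these assumptions group 128 costs about 4.16 bits per weight and group 64 about 4.31, so halving the group size adds a little under 4 percent of storage, in line with the rough single-digit figure in the table.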
When to use GPTQ versus AWQ, FP8 or FP4
GPTQ and AWQ both produce W4A16 INT4 checkpoints with broadly similar quality and throughput. The choice between them is rarely about quality (typically within 0.1-0.2 perplexity points) and usually about ecosystem fit and operational preference.
Choose GPTQ when (a) the model you need is already published on Hugging Face as a GPTQ checkpoint and you want to avoid the re-quantisation cost, (b) the runtime stack on which you serve has stronger GPTQ kernel support than AWQ — true historically for ExLlamaV2-based stacks but no longer the case in vLLM where awq_marlin and gptq_marlin both ship, or (c) you have a specific perplexity-sensitive workload where GPTQ's per-column error correction has measured slightly better on your evaluation harness.
Choose AWQ when starting fresh, when the workload is throughput-dominated (AWQ's simpler layout is 5-15 percent faster on small-batch decode on Ampere), or when the published checkpoint you need is AWQ-only.
Choose FP8 W8A8 on Hopper (H100/H200) and Blackwell (B200/B300) when FP8 Tensor Core hardware is available and the workload is prefill-heavy or large-batch — FP8 paths avoid the INT4 dequant step entirely and are denser on FLOP/byte. Choose FP4 on Blackwell when memory pressure justifies it; FP4 quality is close to INT4 in 2026 but the kernel ecosystem is younger.
Yobitel's Yobibyte Marketplace supports both GPTQ and AWQ as Marketplace recipes for the same model — customers select by recipe name in their workspace and Yobibyte handles the runtime selection, kernel path and routing to NeoCloud capacity. The platform's default for new Llama, Qwen and Mistral deployments is AWQ INT4 because TTFT and tokens-per-second dominate the typical chat SLO; GPTQ remains available for teams with an existing GPTQ checkpoint they want to consume unchanged.
Trade-offs and known limitations
GPTQ inherits the structural limitations of all weight-only post-training quantisation. Activations stay in FP16/BF16, so the inference path is mixed-precision; the dequant step on each forward pass costs measurable bandwidth that W8A8 FP8 paths avoid. On Hopper and Blackwell with FP8 hardware available, FP8 W8A8 typically beats GPTQ INT4 on throughput at equal or better quality.
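To make the dequant step concrete, here is a NumPy sketch of a W4A16 matmul. It is an illustration only: the two-codes-per-byte layout and the function names are assumptions made for this example, while real GPTQ checkpoints and Marlin kernels use their own packed layouts and fuse the dequantisation into the GEMM.

```python
import numpy as np

def pack_int4(codes):
    """Pack pairs of 4-bit codes (values 0..15) into one uint8 each."""
    codes = codes.astype(np.uint8)
    return codes[:, 0::2] | (codes[:, 1::2] << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: recover two 4-bit codes from every byte."""
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = packed & 0x0F
    codes[:, 1::2] = packed >> 4
    return codes

def w4a16_matmul(packed, scales, zeros, x, group_size=128):
    """Dequantise INT4 weights to FP16 per group, then multiply by FP16 inputs.

    scales and zeros have shape (rows, cols // group_size).
    """
    codes = unpack_int4(packed).astype(np.float16)
    s = np.repeat(scales, group_size, axis=1)
    z = np.repeat(zeros, group_size, axis=1)
    w = (codes - z) * s  # FP16 weights rebuilt on every call
    # Accumulate in FP32, as mixed-precision GEMMs typically do.
    return w.astype(np.float32) @ x.astype(np.float32)
```

The unpack, subtract and multiply inside `w4a16_matmul` run on every forward pass. That work is the bandwidth and compute cost that a W8A8 FP8 path does not pay.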
Calibration drift is the most common operational pitfall. The Hessian H = 2 * X * X^T depends on the activation statistics of the calibration set; if that set is drawn from English Wikipedia but the production workload is multilingual code completion, the resulting Hessian misallocates error budget and perplexity on the deployed workload is worse than the published number. Recalibrate with 512-1024 in-domain samples when domain shift is large.
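One cheap way to spot drift before committing to a checkpoint is to compare Hessian statistics from the calibration set with those from a sample of production traffic. The sketch below is a hypothetical diagnostic, not part of AutoGPTQ: it accumulates H over batches of layer inputs and compares the diagonals by cosine similarity.

```python
import numpy as np

def accumulate_hessian(batches):
    """Streaming estimate of H = 2 * X X^T / n over batches shaped (features, tokens)."""
    H, n = None, 0
    for X in batches:
        X = X.astype(np.float64)
        if H is None:
            H = np.zeros((X.shape[0], X.shape[0]))
        H += 2.0 * X @ X.T
        n += X.shape[1]
    return H / n

def diag_similarity(H_a, H_b):
    """Cosine similarity between two Hessian diagonals (1.0 means same profile)."""
    a, b = np.diag(H_a), np.diag(H_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A similarity close to 1 suggests the calibration set weights the input channels much as production does. A low value is a hint, not a proof, that the error budget is being spent on the wrong columns and that recalibrating on in-domain samples is worth the cost.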
The Cholesky decomposition can fail on activation distributions with extreme outliers, most commonly on the down_proj layer of MoE models. The damping factor (default 0.01) addresses this; raising it to 0.1 typically resolves Cholesky failures at the cost of slightly worse error redistribution. AutoGPTQ surfaces a clear error in this case.
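The effect of damping can be reproduced in a few lines. The sketch assumes the damping term is the chosen fraction of the mean Hessian diagonal added to every diagonal entry; `damped_cholesky` is a hypothetical helper name. A dead input channel is used here as a simple deterministic way to make the undamped factorisation fail, standing in for the outlier-driven failures described above.

```python
import numpy as np

def damped_cholesky(H, damp_percent=0.01):
    """Cholesky factor of H + lambda * I, with lambda = damp_percent * mean(diag(H))."""
    H = H.astype(np.float64).copy()
    H[np.diag_indices_from(H)] += damp_percent * np.mean(np.diag(H))
    return np.linalg.cholesky(H)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 32))
X[3, :] = 0.0                 # dead input channel: zero row and column in H
H = 2.0 * X @ X.T

try:
    np.linalg.cholesky(H)     # undamped: H is singular, so this raises
    undamped_ok = True
except np.linalg.LinAlgError:
    undamped_ok = False

L = damped_cholesky(H)        # damped: factorisation succeeds
```

A larger `damp_percent` makes the factorisation more robust but pulls H towards a scaled identity, which is why heavier damping slightly weakens the error redistribution.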
Long-context generation quality is generally preserved, but reasoning, coding and mathematical benchmarks (GSM8K, HumanEval, MATH) show 3-5 percent absolute degradation versus BF16 for Llama 3.1 70B, comparable to AWQ at the same bit budget. For reasoning-heavy workloads, FP8 W8A8 or selective FP16 retention on the most quantisation-sensitive layers is preferable.
Very small models (1B-3B class) can show 1-3 percent perplexity loss that becomes visible on downstream tasks. For 7B and up, GPTQ INT4 quality is essentially within noise of AWQ INT4 and the choice is operational.
Do not compare 'GPTQ perplexity' numbers across papers without checking the calibration set, group size, act_order setting and evaluation harness. A 'GPTQ INT4 at 5.2 perplexity' paper might be using act_order=False and group 32 while a 'GPTQ INT4 at 5.6 perplexity' paper might be act_order=True and group 128 — neither is wrong, but they are not the same configuration.
Practical implementation notes
AutoGPTQ (AutoGPTQ/AutoGPTQ) is the canonical production conversion path; it wraps the original IST Austria reference implementation in a HuggingFace-compatible API and exposes the standard BaseQuantizeConfig knobs. Conversion of a 70B Llama-class model on a single H100 takes 30-60 minutes depending on calibration set size and group size. The snippet below covers the standard recipe: load BF16 weights, set bits=4 with group 128 and act_order=True, calibrate on 128 C4 samples, save the GPTQ checkpoint.
Serving the checkpoint in vLLM is a single flag — `--quantization gptq_marlin` selects the Marlin INT4 kernel path on Ampere and Hopper, which is the only kernel choice worth running in production in 2026. The Yobibyte managed alternative skips conversion entirely: customers select a GPTQ recipe by name in their workspace and Yobitel handles the checkpoint, the kernel selection and the routing to NeoCloud capacity. Conversion of net-new models for the Marketplace is a Yobitel operations task, not a customer responsibility.
```python
# Producing a GPTQ INT4 checkpoint with AutoGPTQ
# pip install "auto-gptq>=0.7" transformers datasets
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"
OUT_DIR = "./llama-3.1-70b-gptq-int4"

quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit INT4 weights
    group_size=128,     # Industry standard
    desc_act=True,      # act_order: 0.1-0.3 perplexity win
    damp_percent=0.01,  # Hessian damping; raise to 0.1 if Cholesky fails
    sym=False,          # Asymmetric (zero_point on)
)

model = AutoGPTQForCausalLM.from_pretrained(
    MODEL_ID, quantize_config=quantize_config
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# 128-sample calibration from C4; pass an in-domain set when domain shifts
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
calib = []
for i, row in enumerate(ds):
    if i >= 128:
        break
    calib.append(tokenizer(row["text"], truncation=True, max_length=2048,
                           return_tensors="pt"))

model.quantize(calib)
model.save_quantized(OUT_DIR, use_safetensors=True)
tokenizer.save_pretrained(OUT_DIR)

# Serve the GPTQ checkpoint with vLLM (self-hosted; Yobibyte handles this end-to-end)
# vllm serve ./llama-3.1-70b-gptq-int4 \
#   --quantization gptq_marlin \
#   --tensor-parallel-size 2 \
#   --max-model-len 32768 \
#   --enable-prefix-caching \
#   --enable-chunked-prefill
```
Where GPTQ fits in the Yobitel stack
Yobitel's Yobibyte Marketplace catalogues GPTQ INT4 as an alternative INT4 recipe alongside the default AWQ INT4 recipe for the open-weights Llama, Qwen and Mistral families. The two recipes are interchangeable from the customer's perspective — same OpenAI-compatible endpoint, same workspace controls, same observability surface — and the choice is exposed as a recipe-level selection. Teams arriving on Yobibyte with an existing GPTQ checkpoint they have validated on their evaluation harness can select the GPTQ recipe and consume their checkpoint without re-quantising.
Yobitel NeoCloud's H100 SXM5 and H200 SXM5 SKUs run GPTQ checkpoints through the gptq_marlin kernel path, which matches awq_marlin throughput to within 5 percent on most workloads. The 80 GB / 141 GB HBM headroom on those SKUs lets a single GPU host a 70B GPTQ model with comfortable KV cache budget for chat-shaped workloads.
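The headroom claim can be sanity-checked with back-of-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a sizing tool: it treats all 70 billion parameters as quantised at group 128 (embeddings and the output head usually stay in 16-bit, so the real footprint is somewhat larger), takes the 80-layer, 8-KV-head, 128-dimension attention shape from the public Llama 3.1 70B configuration, uses an FP16 KV cache, and reserves a hypothetical 10 percent of HBM for activations and runtime overhead.

```python
def weight_bytes(n_params, bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Approximate checkpoint size with per-group scale and zero point."""
    return n_params * (bits + (scale_bits + zero_bits) / group_size) / 8

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """K and V entries stored per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_token_budget(hbm_bytes, n_params, n_layers, n_kv_heads, head_dim,
                    reserve_fraction=0.10):
    """Tokens of KV cache that fit after weights and a reserved overhead."""
    free = hbm_bytes * (1 - reserve_fraction) - weight_bytes(n_params)
    return int(free // kv_bytes_per_token(n_layers, n_kv_heads, head_dim))

llama70b = dict(n_params=70e9, n_layers=80, n_kv_heads=8, head_dim=128)
h100_tokens = kv_token_budget(80e9, **llama70b)
h200_tokens = kv_token_budget(141e9, **llama70b)
```

Under these assumptions the weights take roughly 36 GB and each cached token costs 327,680 bytes, so the remaining HBM on either SKU translates into a KV budget shared across all concurrent sequences.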
Yobitel's InferenceBench publishes side-by-side measurements of GPTQ INT4, AWQ INT4 and FP8 W8A8 on the same model, GPU and runtime — tokens-per-second, time-to-first-token, p99 latency and cost-per-million-tokens, with the configurations reproducible. For teams choosing between GPTQ and AWQ, InferenceBench is the empirical complement to the structural reasoning in this entry; for most readers the two are within run-to-run noise and the choice should be driven by ecosystem fit, not benchmark sweep results.
References
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · arXiv (Frantar et al., 2022)
- AutoGPTQ on GitHub (production conversion) · GitHub
- Marlin: Mixed-Precision Auto-Regressive Parallel Inference of LLMs · GitHub (IST-DASLab)
- vLLM GPTQ Documentation · vLLM
- Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning (introduces OBQ; Frantar, Singh & Alistarh, 2022) · arXiv