TL;DR
- Lion (EvoLved Sign Momentum) — Chen et al., 2023 (arXiv:2302.06675) — was found by an evolutionary search over symbolic optimiser programs and uses only the sign of an EMA of gradients.
- It tracks one EMA (the first moment) instead of two, halving optimiser-state memory compared to AdamW.
- Lion is competitive with AdamW on Transformer pretraining, ViT, diffusion, and contrastive learning, typically with a learning rate 3-10× smaller.
- It has not replaced AdamW for frontier LLM pretraining but has seen real adoption for fine-tuning, vision and constrained-memory training.
The Algorithm in Six Lines
Lion uses two exponential moving averages of the gradient with different betas (β_1 ≈ 0.9 for the update, β_2 ≈ 0.99 for the accumulator carried into the next step), but keeps only one piece of state: the moving average m. The update direction is the sign of an interpolation between m and the current gradient. The step size for every coordinate is set by the learning rate alone, with no per-parameter scaling.
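Written as equations, the update described above takes the following form, as formulated by Chen et al. The symbols g_t (gradient at step t), θ (parameters), η (learning rate), λ (weight decay) and c_t (the interpolated direction) are notation introduced here for illustration; m, β_1 and β_2 are as in the text.

```latex
\begin{aligned}
c_t      &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
\theta_t &= \theta_{t-1} - \eta \left( \operatorname{sign}(c_t) + \lambda\, \theta_{t-1} \right) \\
m_t      &= \beta_2\, m_{t-1} + (1 - \beta_2)\, g_t
\end{aligned}
```

Applying the decay factor (1 − ηλ) before or after the sign step changes the result only by a term of order η²λ, so implementations that differ in that ordering behave almost identically.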
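# Lion update for one parameter tensor (PyTorch in-place ops); note the sign() and the single moment m.
param.mul_(1 - lr * weight_decay)             # decoupled weight decay on the pre-update weights
update = beta1 * m + (1 - beta1) * grad       # interpolate the EMA and the current gradient
param.add_(update.sign(), alpha=-lr)          # each coordinate moves by lr against the sign
m.mul_(beta2).add_(grad, alpha=1 - beta2)     # refresh the EMA with the slower beta2
Where Lion Came From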
Chen et al. ran an evolutionary program search over symbolic optimiser programs built from a small set of operations (multiply, add, sign, EMA). After 8,000 iterations of mutation and evaluation across thousands of small training runs, Lion was the surviving champion.
The discovery is interesting because no human had proposed this optimiser, yet it has clear connections to known ideas: sign descent (signSGD) for the update direction, Polyak momentum for smoothing, and decoupled weight decay (as in AdamW).
Memory Advantage
AdamW stores two buffers, m and v, per parameter (typically in FP32), so its optimiser state is 2× the parameter count. Lion stores only m, or 1× the parameter count. For a 70B-parameter model with FP32 optimiser state, that is 280 GB for Lion versus 560 GB for AdamW. Under ZeRO/FSDP sharding the state is split across devices, so the 2× ratio carries over to each device's share.
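The arithmetic behind these figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch, assuming 1 GB = 1e9 bytes and counting only the optimiser buffers (not weights, gradients, or activations); the function name and the 64-device world size are hypothetical choices for illustration.

```python
def optimizer_state_gb(n_params: float, n_states: int, bytes_per_value: int = 4) -> float:
    """Optimiser-state size in GB (1 GB = 1e9 bytes).

    n_states is the number of per-parameter buffers (2 for AdamW, 1 for Lion);
    bytes_per_value is 4 for FP32 state and 2 for BF16 state.
    """
    return n_params * n_states * bytes_per_value / 1e9


N_PARAMS = 70e9  # a 70B-parameter model

adamw_fp32 = optimizer_state_gb(N_PARAMS, n_states=2)                     # m and v
lion_fp32 = optimizer_state_gb(N_PARAMS, n_states=1)                      # m only
lion_bf16 = optimizer_state_gb(N_PARAMS, n_states=1, bytes_per_value=2)   # m kept in BF16

WORLD_SIZE = 64  # hypothetical number of devices sharing fully sharded state
adamw_per_device = adamw_fp32 / WORLD_SIZE
lion_per_device = lion_fp32 / WORLD_SIZE

print(f"AdamW, FP32 state: {adamw_fp32:.0f} GB total, {adamw_per_device:.2f} GB per device")
print(f"Lion,  FP32 state: {lion_fp32:.0f} GB total, {lion_per_device:.2f} GB per device")
print(f"Lion,  BF16 state: {lion_bf16:.0f} GB total")
```

The 280 GB and 560 GB figures correspond to FP32 state; keeping the single Lion buffer in BF16 would halve it again, at the cost of lower precision in the moving average.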
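The memory savings are particularly attractive for fine-tuning on memory-constrained hardware, where a full-parameter run with Lion can fit a larger model on a single GPU than the equivalent AdamW configuration. With QLoRA-style adapter training the gain is smaller, because optimiser state is kept only for the adapter weights.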
Empirical Behaviour
Across the original paper and reproductions:
- Image classification (ViT, CLIP, ConvNeXt) — Lion matches or beats AdamW at iso-step, with 3-10× smaller learning rate.
- Language model pretraining — Lion is competitive at small to medium scale (under 7B). At 70B+, results are mixed; some teams report it being slightly worse at iso-step, others report neutral.
- Diffusion model training — Lion is widely used; the smaller optimiser state frees memory in image-generation pipelines where activations already take most of the budget.
- Reinforcement learning and contrastive pretraining — competitive or slightly better.
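If you swap AdamW for Lion, divide the learning rate by 3-5 as a starting point and tune within the 3-10× range from there. The sign-based update moves every coordinate by the full learning rate, so its effective step is larger than AdamW's at the same nominal rate.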
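The conversion rule above can be written as a small helper. This is a sketch rather than a recipe: `lion_starting_point` and its `ratio` argument are hypothetical names, and the default of 4 is just a value inside the suggested 3-5 range. The helper also scales weight decay up by the same factor, following the Lion paper's recommendation to keep lr × weight_decay roughly constant.

```python
def lion_starting_point(adamw_lr: float, adamw_wd: float, ratio: float = 4.0) -> tuple[float, float]:
    """Hypothetical helper: turn AdamW hyperparameters into a Lion starting point.

    The learning rate is divided by `ratio` and the weight decay is multiplied
    by `ratio`, so the product lr * weight_decay (the effective decay strength
    under decoupled weight decay) is unchanged.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return adamw_lr / ratio, adamw_wd * ratio


# Example: an AdamW configuration with lr=1e-3 and weight_decay=0.1.
lion_lr, lion_wd = lion_starting_point(adamw_lr=1e-3, adamw_wd=0.1, ratio=4.0)
print(f"Lion starting point: lr={lion_lr:g}, weight_decay={lion_wd:g}")
```

Treat the output as the first point of a sweep, not a final setting; the best ratio varies by task and batch size.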
Why It Has Not Displaced AdamW for Frontier LLMs
Lion's competitive results are strongest at small-to-medium scale. At frontier scale (hundreds of billions of parameters, trillions of tokens), AdamW's per-parameter adaptive scaling appears to handle the diverse gradient landscape more robustly. The sign-only update discards magnitude information that matters when scales differ across layers.
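A toy example makes the loss of magnitude information concrete. The sketch below uses plain Python lists instead of tensors, omits weight decay, and starts from zero momentum; `lion_delta` and the two "layer" gradients are hypothetical and chosen only to differ in scale by a factor of one million. It illustrates the Lion side only and does not model AdamW.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0 or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def lion_delta(m, grad, lr=1e-4, beta1=0.9):
    """Per-coordinate parameter change from one Lion step (weight decay omitted)."""
    return [-lr * sign(beta1 * mi + (1 - beta1) * gi) for mi, gi in zip(m, grad)]


# Two hypothetical layers whose gradients have the same signs but very different scales.
small_layer_grad = [2e-6, -5e-6, 1e-6]
large_layer_grad = [2.0, -5.0, 1.0]
zero_momentum = [0.0, 0.0, 0.0]

delta_small = lion_delta(zero_momentum, small_layer_grad)
delta_large = lion_delta(zero_momentum, large_layer_grad)

print(delta_small)
print(delta_large)
print("identical updates:", delta_small == delta_large)
```

Both layers receive the same update, each coordinate moving by exactly the learning rate, so any difference in scale between them has to be handled by the learning rate or a per-layer schedule.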
There is also a risk-aversion effect: a frontier pretraining run costs tens of millions of dollars and a failed run from optimiser instability is unacceptable. AdamW has a decade of debugged failure modes; Lion does not.