TL;DR
- GaLore (Zhao et al., 2024, arXiv:2403.03507) is a memory-efficient training technique that keeps all parameters trainable — unlike LoRA — but stores the optimiser state in a low-rank projection of the gradients.
- By periodically refreshing the projection basis from an SVD of the current gradient, GaLore recovers most of the optimiser-state memory savings of LoRA without freezing any weights.
- Headline result: pretraining a 7B model on a single 24 GB consumer GPU. For fine-tuning, GaLore offers an alternative when LoRA's quality gap is unacceptable but full optimiser state would not fit.
- Trades training throughput for memory — GaLore-AdamW is 20-50% slower per step than plain AdamW.
How GaLore Differs from LoRA
LoRA reduces memory by reducing the number of trainable parameters — the optimiser only allocates state for the small adapter. GaLore keeps all parameters trainable but reduces the dimensionality of the optimiser state itself. Every parameter still receives a gradient and is still updated; what changes is the representation Adam (or any adaptive optimiser) uses to track its moments.
Concretely, for each weight matrix W of shape (m, n), GaLore periodically computes the SVD of the gradient G and keeps the top-r left singular vectors as a projection matrix P of shape (m, r). The gradient is projected as G̃ = Pᵀ G, which has shape (r, n). Adam's moments are maintained in that smaller space and the Adam update is computed there; the resulting low-rank update is then mapped back to shape (m, n) by left-multiplying with P before it is applied to W.
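The projection and back-projection can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the helper names (compute_projection, project, project_back), the matrix sizes, and the synthetic rank-r gradient are all assumptions made for the example.

```python
import numpy as np

def compute_projection(grad: np.ndarray, rank: int) -> np.ndarray:
    """Return P of shape (m, rank): the top-`rank` left singular vectors of grad."""
    u, _s, _vt = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]

def project(p: np.ndarray, grad: np.ndarray) -> np.ndarray:
    """Project a gradient of shape (m, n) down to shape (rank, n)."""
    return p.T @ grad

def project_back(p: np.ndarray, low_rank_update: np.ndarray) -> np.ndarray:
    """Map a low-rank update of shape (rank, n) back to shape (m, n)."""
    return p @ low_rank_update

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4  # hypothetical sizes, chosen only for illustration

# Synthetic gradient with exact rank r, so the projection loses nothing here.
grad = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

P = compute_projection(grad, r)
g_low = project(P, grad)         # shape (r, n): what the optimiser state tracks
g_back = project_back(P, g_low)  # shape (m, n): what gets applied to W
```

Because the synthetic gradient has exact rank r, g_back reproduces grad up to floating-point error. A real gradient is only approximately low-rank, so the round trip discards the component outside the span of P.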
Why It Works
The intuition is similar to LoRA's: the gradient of a weight matrix during fine-tuning has low effective rank. Empirically, the top few singular directions of the gradient capture most of its variance. Projecting Adam's state onto those directions preserves nearly all of the optimiser's signal at a fraction of the memory.
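One way to check the low-rank claim on a given gradient is to measure how much of its squared Frobenius norm lies in the top-r singular directions. The sketch below does this on a synthetic matrix (low-rank signal plus small noise); the function name energy_captured and the test data are assumptions for illustration, and the numbers say nothing about gradients from a real model.

```python
import numpy as np

def energy_captured(grad: np.ndarray, rank: int) -> float:
    """Fraction of the squared Frobenius norm held by the top-`rank` singular values."""
    s = np.linalg.svd(grad, compute_uv=False)  # singular values, descending
    energy = s ** 2
    return float(energy[:rank].sum() / energy.sum())

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4  # hypothetical sizes

# Synthetic "gradient": a rank-r signal plus a small full-rank perturbation.
signal = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
noise = 0.01 * rng.standard_normal((m, n))
grad = signal + noise

ratio_top_r = energy_captured(grad, r)
ratio_full = energy_captured(grad, min(m, n))
```

Running the same function on gradients logged from an actual training run is a quick way to judge whether a candidate rank is plausible before committing to it.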
Because the basis is recomputed every few hundred steps, GaLore tracks the gradient subspace as it drifts during training. This is the key difference from naive 'low-rank Adam' approaches, which fix the basis up front and lose accuracy as training progresses.
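The periodic refresh can be sketched as an Adam-style step for a single weight matrix. This is a simplified illustration under stated assumptions, not the reference optimiser: the class name GaLoreAdamSketch and the update_gap parameter are hypothetical, weight decay and any update scaling factor are omitted, and the moments are simply carried over unchanged when the basis is refreshed.

```python
import numpy as np

class GaLoreAdamSketch:
    """Adam moments kept in a rank-r subspace that is refreshed every `update_gap` steps."""

    def __init__(self, shape, rank, update_gap=200, lr=1e-3,
                 betas=(0.9, 0.999), eps=1e-8):
        _m, n = shape
        self.rank, self.update_gap, self.lr, self.eps = rank, update_gap, lr, eps
        self.beta1, self.beta2 = betas
        self.step_count = 0
        self.refreshes = 0
        self.P = None                          # projection basis, shape (m, rank)
        self.exp_avg = np.zeros((rank, n))     # first moment, low-rank space
        self.exp_avg_sq = np.zeros((rank, n))  # second moment, low-rank space

    def step(self, weight, grad):
        # Refresh the basis from the current gradient's SVD every update_gap steps.
        if self.step_count % self.update_gap == 0:
            u, _s, _vt = np.linalg.svd(grad, full_matrices=False)
            self.P = u[:, :self.rank]
            self.refreshes += 1
        self.step_count += 1

        g_low = self.P.T @ grad  # project the gradient to shape (rank, n)
        self.exp_avg = self.beta1 * self.exp_avg + (1 - self.beta1) * g_low
        self.exp_avg_sq = self.beta2 * self.exp_avg_sq + (1 - self.beta2) * g_low ** 2
        m_hat = self.exp_avg / (1 - self.beta1 ** self.step_count)
        v_hat = self.exp_avg_sq / (1 - self.beta2 ** self.step_count)
        update_low = m_hat / (np.sqrt(v_hat) + self.eps)

        # Map the low-rank update back to full shape and apply it.
        return weight - self.lr * (self.P @ update_low)

# Toy problem: pull W towards a fixed target under the loss 0.5 * ||W - target||^2.
rng = np.random.default_rng(0)
target = rng.standard_normal((16, 8))
W = np.zeros((16, 8))
opt = GaLoreAdamSketch(W.shape, rank=4, update_gap=50, lr=0.01)

initial_loss = 0.5 * np.sum((W - target) ** 2)
for _ in range(200):
    W = opt.step(W, W - target)  # the gradient of the toy loss is W - target
final_loss = 0.5 * np.sum((W - target) ** 2)
```

The toy gradient here is full-rank, so each basis only addresses part of the error. The refresh lets later steps pick up directions that the earlier basis did not cover.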
Memory Savings
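Standard AdamW maintains two state tensors per weight tensor, the first- and second-moment estimates, assumed here to be stored at the same precision as the parameters. For a 7B model in BF16 (2 bytes per value), that is roughly 28 GB of optimiser state on top of 14 GB of weights and 14 GB of gradients: about 56 GB in total, far too much for a 24 GB card.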
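GaLore with rank r = 128 reduces the optimiser state of each projected matrix by roughly an order of magnitude. Optimiser state alone does not close the gap, however, since 14 GB of weights plus 14 GB of gradients already exceed 24 GB. The paper's headline result, 7B pretraining on a single 24 GB GPU such as an RTX 4090, additionally relies on 8-bit optimiser states and per-layer weight updates, which avoid holding all gradients in memory at once. For fine-tuning, the savings are smaller in absolute terms but still significant on multi-billion-parameter models where the optimiser state dominates.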
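The arithmetic behind these figures can be reproduced directly. The sketch below assumes 2 bytes per value (BF16) throughout, counts GaLore's state as two (r, n) moment tensors plus the (m, r) projection matrix, and uses a hypothetical 4096 x 4096 layer to illustrate the per-matrix ratio. The function names are illustrative only.

```python
BYTES_BF16 = 2
GB = 1e9

def tensor_gb(num_values: int, bytes_per_value: int = BYTES_BF16) -> float:
    """Memory in GB for a given number of stored values."""
    return num_values * bytes_per_value / GB

def adam_state_elements(m: int, n: int) -> int:
    """Two full-size moment tensors for an (m, n) weight matrix."""
    return 2 * m * n

def galore_state_elements(m: int, n: int, rank: int) -> int:
    """Two (rank, n) moment tensors plus the (m, rank) projection matrix."""
    return 2 * rank * n + m * rank

num_params = 7_000_000_000
weights_gb = tensor_gb(num_params)      # 14 GB of BF16 weights
grads_gb = tensor_gb(num_params)        # 14 GB of BF16 gradients
adamw_gb = tensor_gb(2 * num_params)    # 28 GB for the two AdamW moments
total_gb = weights_gb + grads_gb + adamw_gb

# Per-matrix comparison for a hypothetical 4096 x 4096 layer at rank 128.
m, n, rank = 4096, 4096, 128
galore_elems = galore_state_elements(m, n, rank)
adam_elems = adam_state_elements(m, n)
ratio = galore_elems / adam_elems
```

For this layer shape the GaLore state is under 5% of the AdamW state. Parameters that are not projected, such as embeddings or normalisation weights, would keep their full-size state, so the whole-model saving is smaller than the per-matrix ratio suggests.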
Trade-offs
- Pro: every parameter remains trainable — no LoRA-style quality ceiling.
- Pro: stronger quality than LoRA on pretraining and continued pretraining workloads.
- Pro: composes with 8-bit Adam for further savings.
- Con: SVD recomputation every few hundred steps adds wall-clock cost.
- Con: more sensitive to learning rate and projection-update interval than LoRA.
- Con: less battle-tested than LoRA for adapter portability and serving.
GaLore is best understood as a memory-efficient alternative to full fine-tuning, not as a competitor to LoRA. For pure adaptation tasks LoRA is usually still the right call; for continued pretraining or large domain shifts GaLore is worth evaluating.
When to Use GaLore
Reach for GaLore when (a) you need full-parameter training but cannot fit AdamW state, and (b) the workload is large enough that LoRA's quality ceiling is a real constraint — typically continued pretraining, domain pre-adaptation, or large-scale alignment runs. For ordinary instruction tuning or task-specific adaptation, LoRA or QLoRA remain the simpler choice.
References
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection · arXiv (Zhao et al., 2024)
- GaLore reference repository · GitHub