TL;DR
- DoRA (Liu et al., 2024, arXiv:2402.09353) decomposes each pretrained weight matrix into a magnitude vector and a direction matrix, then fine-tunes the magnitude directly while applying LoRA to the direction.
- The decomposition matches the structure of how full fine-tuning actually moves weights — empirically, magnitude and direction update differently — and the explicit split lets LoRA spend its capacity where it helps.
- Quality is consistently better than vanilla LoRA at the same rank and often matches full fine-tuning, with only a small compute overhead and the same memory profile.
- Supported in Hugging Face PEFT since v0.9 via a single flag (`use_dora=True` on `LoraConfig`), with bitsandbytes-quantized layers supported from v0.10. Often combined with QLoRA as QDoRA.
The Insight
When a model is fully fine-tuned, the magnitude of each weight column tends to change by a different amount than its direction. Vanilla LoRA does not separate the two: its low-rank update moves both at once. Analysing this, Liu et al. observed that LoRA's magnitude and direction changes are positively correlated, so it tends to shift both proportionally, whereas full fine-tuning can pair a small directional change with a large magnitude change, or the reverse. LoRA therefore struggles to make the fine-grained directional adjustments that carry much of the new behaviour.
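The magnitude-versus-direction comparison can be made concrete with a small NumPy sketch. It measures the mean absolute change in per-column norm and the mean cosine distance between matching columns, which mirrors the kind of analysis described above. The function name `magnitude_direction_change` and the toy matrices are illustrative assumptions, not code from the paper.

```python
import numpy as np

def magnitude_direction_change(w_before, w_after):
    """Return (delta_m, delta_d) for two weights of shape (d_out, d_in).

    delta_m: mean absolute change in per-column L2 norm.
    delta_d: mean (1 - cosine similarity) between matching columns.
    """
    m_before = np.linalg.norm(w_before, axis=0)
    m_after = np.linalg.norm(w_after, axis=0)
    delta_m = np.mean(np.abs(m_after - m_before))
    cosine = np.sum(w_before * w_after, axis=0) / (m_before * m_after)
    delta_d = np.mean(1.0 - cosine)
    return float(delta_m), float(delta_d)

rng = np.random.default_rng(0)
w0 = rng.normal(size=(6, 4))  # toy stand-in for a pretrained weight

# Case 1: pure magnitude change (every column doubled, direction untouched).
scaled = 2.0 * w0
dm_scaled, dd_scaled = magnitude_direction_change(w0, scaled)

# Case 2: pure direction change (columns perturbed, then rescaled to the old norms).
perturbed = w0 + 0.5 * rng.normal(size=w0.shape)
rotated = perturbed * np.linalg.norm(w0, axis=0) / np.linalg.norm(perturbed, axis=0)
dm_rotated, dd_rotated = magnitude_direction_change(w0, rotated)

print(f"scaled : delta_m={dm_scaled:.3f}, delta_d={dd_scaled:.3f}")
print(f"rotated: delta_m={dm_rotated:.3f}, delta_d={dd_rotated:.3f}")
```

Running the same function on a real checkpoint pair (pretrained versus fine-tuned) would show how a given fine-tuning method distributes its change between the two components.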
DoRA fixes this by decomposing W = m · (V / ||V||) where m is a per-column magnitude vector and V/||V|| is the normalised direction. m is trained directly (a small number of additional parameters), and V is adapted with standard LoRA. The LoRA budget is then spent entirely on direction.
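Written out, the decomposition is an exact reparameterisation at initialisation. The block below uses ‖·‖_c for the column-wise L2 norm and d_out, d_in for the output and input dimensions of W; this notation follows the paper and is an addition to the inline formula above.

```latex
W \;=\; m \,\frac{V}{\lVert V \rVert_c}
  \;=\; \lVert W \rVert_c \,\frac{W}{\lVert W \rVert_c},
\qquad
m \in \mathbb{R}^{1 \times d_{\text{in}}},\quad
V \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}
```

Choosing m = ‖W‖_c and V = W reproduces the pretrained weight exactly, so training starts from the original model.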
Mechanism
For a pretrained weight W of shape (d_out, d_in), DoRA computes the per-column L2 norm m_0 (shape (1, d_in)) and initialises m = m_0. During training, m is a fully trainable vector. The direction starts as W / m_0; W itself stays frozen, and a trainable LoRA update BA is added before renormalising per column: V' = (W + BA) / ||W + BA||. The forward pass uses the weight m · V'. Because B is initialised to zero, m · V' equals W at the start of training.
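The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the helper names `column_norm` and `dora_weight` are hypothetical, the LoRA scaling factor α / r is omitted for brevity, and a random matrix stands in for a trained B.

```python
import numpy as np

def column_norm(w):
    """Per-column L2 norm of a (d_out, d_in) matrix, shape (1, d_in)."""
    return np.linalg.norm(w, axis=0, keepdims=True)

def dora_weight(w0, lora_b, lora_a, m):
    """Effective DoRA weight: m * (W + BA) / ||W + BA|| per column."""
    v = w0 + lora_b @ lora_a
    return m * v / column_norm(v)

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2

w0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
lora_a = rng.normal(size=(r, d_in)) * 0.01   # LoRA A, small random init
lora_b = np.zeros((d_out, r))                # LoRA B, zero init
m = column_norm(w0)                          # trainable magnitude, starts at m_0

# At initialisation BA = 0, so the effective weight is exactly W.
w_init = dora_weight(w0, lora_b, lora_a, m)

# Stand-in for a trained B: direction changes, column norms stay equal to m.
lora_b_trained = rng.normal(size=(d_out, r))
w_adapted = dora_weight(w0, lora_b_trained, lora_a, m)

# Forward pass for a batch of 3 inputs, using the y = x W^T convention.
x = rng.normal(size=(3, d_in))
y = x @ w_adapted.T
```

The key property to notice is that the LoRA factors alone can never change a column's norm: after renormalisation, magnitude is controlled only by m.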
The extra trainable count is small: a 70B model with DoRA adds only the magnitude vectors on top of the LoRA adapters, on the order of a few million additional parameters. The normalisation costs roughly 10-30% more wall-clock per step than vanilla LoRA, an overhead that is usually acceptable given the quality gain.
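A back-of-the-envelope count shows where "a few million" comes from. The layer shapes below are assumptions chosen to resemble a 70B Llama-style decoder (80 layers, hidden size 8192, MLP size 28672, key/value projection width 1024); they are illustrative, not taken from the article. The magnitude vector is counted with one entry per input column, matching the convention used above.

```python
# Assumed, illustrative shapes for a 70B Llama-style decoder (not from the article).
NUM_LAYERS = 80
HIDDEN = 8192
MLP = 28672
KV_DIM = 1024
RANK = 16

# (d_out, d_in) for every adapted linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (KV_DIM, HIDDEN),
    "v_proj": (KV_DIM, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (MLP, HIDDEN),
    "up_proj": (MLP, HIDDEN),
    "down_proj": (HIDDEN, MLP),
}

# LoRA adds B (d_out x r) and A (r x d_in) per layer.
lora_per_block = sum(RANK * (d_out + d_in) for d_out, d_in in LINEAR_SHAPES.values())
# DoRA adds one magnitude entry per column, i.e. d_in values per layer.
magnitude_per_block = sum(d_in for _, d_in in LINEAR_SHAPES.values())

lora_params = NUM_LAYERS * lora_per_block
magnitude_params = NUM_LAYERS * magnitude_per_block

print(f"LoRA parameters (r={RANK}): {lora_params:,}")
print(f"DoRA magnitude parameters: {magnitude_params:,}")
print(f"Magnitude share of trainable total: {magnitude_params / (lora_params + magnitude_params):.1%}")
```

Under these assumptions the magnitude vectors add about 6.2 million parameters on top of roughly 207 million LoRA parameters, a small fraction of the trainable total.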
Hyperparameters
| Hyperparameter | Typical value |
|---|---|
| Rank r | 4 - 32 (often lower than LoRA needs) |
| Alpha α | 2 × rank |
| Target modules | All linear layers |
| Magnitude learning rate | Same as LoRA LR |
| Magnitude initialisation | Per-column L2 norm of W |
DoRA often reaches the quality of LoRA at r=32 with r=8 or r=16, so the adapter can be smaller, which offsets part of the per-step overhead.
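As a sketch of how these settings might map onto Hugging Face PEFT: `LoraConfig`, `use_dora`, and the `"all-linear"` shortcut for `target_modules` are part of PEFT, while the helper name `build_dora_config`, the default rank of 16, and the causal-LM task type are assumptions made for illustration. The import is done lazily so the helper can be defined without PEFT installed.

```python
def build_dora_config(rank=16):
    """Build a PEFT LoraConfig with DoRA enabled (hypothetical helper).

    Follows the table above: alpha = 2 x rank, all linear layers targeted.
    """
    # Imported lazily; calling this function requires `pip install peft`.
    from peft import LoraConfig

    return LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        target_modules="all-linear",
        use_dora=True,          # the single flag that switches LoRA to DoRA
        task_type="CAUSAL_LM",  # assumption: decoder-only language model
    )

# Typical usage, assuming `base_model` is an already loaded transformers model:
#   from peft import get_peft_model
#   model = get_peft_model(base_model, build_dora_config(rank=8))
```

With a single optimizer and learning rate over the adapter parameters, the magnitude vectors train at the same rate as the LoRA factors, which matches the "Same as LoRA LR" row.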
Trade-offs
- Pro: consistently closes or eliminates the LoRA vs full fine-tune quality gap.
- Pro: same memory footprint as LoRA — the magnitude vectors are negligible.
- Pro: composes with QLoRA (commonly called QDoRA).
- Con: 10-30% slower per step than LoRA due to the column-norm computation.
- Con: the magnitude vector is harder to merge cleanly than a pure LoRA delta; some serving stacks do not support multi-DoRA.
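The merging point can be illustrated numerically. Folding a trained DoRA adapter into one dense matrix is a direct computation, but the resulting change to W is generally not low-rank, unlike a pure LoRA delta BA. That is one plausible reason why serving stacks that swap low-rank deltas per request cannot treat DoRA adapters the same way. The sketch below uses random stand-ins for trained parameters; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 8, 6, 2

w0 = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
lora_b = rng.normal(size=(d_out, r))  # stand-in for trained LoRA B
lora_a = rng.normal(size=(r, d_in))   # stand-in for trained LoRA A

# Stand-in for a trained magnitude: initial column norms, nudged by about 10%.
m0 = np.linalg.norm(w0, axis=0, keepdims=True)
m = m0 * (1.0 + 0.1 * rng.normal(size=m0.shape))

# Merge: fold magnitude, direction, and LoRA update into one dense matrix.
v = w0 + lora_b @ lora_a
w_merged = m * v / np.linalg.norm(v, axis=0, keepdims=True)

# The merged matrix serves like any ordinary linear layer.
x = rng.normal(size=(3, d_in))
y_merged = x @ w_merged.T

# Compare the rank of the weight change for plain LoRA versus DoRA.
lora_delta_rank = np.linalg.matrix_rank(lora_b @ lora_a)
dora_delta_rank = np.linalg.matrix_rank(w_merged - w0)
print("rank of LoRA delta:", lora_delta_rank)
print("rank of DoRA delta:", dora_delta_rank)
```

The per-column rescaling multiplies W itself, so the difference between the merged and pretrained weights picks up a full-rank term even though only a rank-r factor pair and one vector were trained.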
When to Use DoRA
Use DoRA when LoRA at sensible ranks is underperforming on the target task and full fine-tuning is out of reach. The cleanest signal is a LoRA run whose validation loss plateaus above what you need — DoRA usually closes that gap. For straightforward instruction tuning on a strong base, vanilla LoRA is often enough.
References
- DoRA: Weight-Decomposed Low-Rank Adaptation · arXiv (Liu et al., 2024)
- NVIDIA DoRA reference repository · GitHub
- PEFT DoRA documentation · Hugging Face docs