DoRA (Weight-Decomposed Low-Rank Adaptation)

TL;DR

DoRA (Liu et al., 2024, arXiv:2402.09353) decomposes each pretrained weight matrix into a magnitude vector and a direction matrix, then fine-tunes the magnitude directly while applying LoRA to the direction.
The split mirrors how full fine-tuning moves weights: in the paper's analysis, magnitude and direction change in different patterns under full fine-tuning than under LoRA, and separating them lets the low-rank update concentrate on direction.
In the paper's experiments, DoRA outperforms vanilla LoRA at the same rank and narrows the gap to full fine-tuning, at the cost of a modest per-step compute overhead and a nearly identical trainable-parameter count.
Supported in Hugging Face PEFT since v0.10 with a single flag (use_dora=True in LoraConfig). Often combined with QLoRA as QDoRA.

The Insight

When a model is fully fine-tuned, the magnitude and the direction of each weight column do not change in lockstep. Liu et al. decompose the weight updates of full fine-tuning and of LoRA into these two components and report different patterns: under LoRA, magnitude and direction changes tend to rise and fall together, while full fine-tuning can pair a large directional change with a small magnitude change, or the reverse. Vanilla LoRA does not separate the two. Its low-rank update BA has to express both at once, which limits how finely it can adjust one without the other.

DoRA addresses this by decomposing W = m · (V / ||V||), where m is a per-column magnitude vector, ||V|| is the column-wise L2 norm of V, and V / ||V|| is the normalised direction. m is trained directly, adding one parameter per column, and V is adapted with standard LoRA, so the low-rank update only has to model directional change.
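The decomposition can be checked numerically. The following is a minimal NumPy sketch, not code from the paper; the function name decompose and the toy shapes are illustrative assumptions.

```python
import numpy as np

def decompose(W):
    """Split W (d_out, d_in) into a per-column magnitude and a unit-norm direction."""
    m = np.linalg.norm(W, axis=0, keepdims=True)  # shape (1, d_in), one value per column
    direction = W / m                             # every column now has L2 norm 1
    return m, direction

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))       # toy weight: d_out=4, d_in=3 (hypothetical sizes)
m, direction = decompose(W)
W_rebuilt = m * direction         # broadcasting rescales each column by its magnitude
```

Multiplying the direction back by m recovers W, so the split loses no information; it only changes which part of the weight each trainable parameter controls.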

Mechanism

For a pretrained weight W of shape (d_out, d_in), DoRA computes the per-column L2 norm m_0 = ||W|| (shape (1, d_in)) and initialises m = m_0. During training, m is a fully trainable vector. The pretrained weight W stays frozen and serves as the base direction; a LoRA update BA is added to it and the sum is renormalised per column: V' = (W + BA) / ||W + BA||. The effective weight in the forward pass is W' = m · V'.
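As a rough illustration of the forward pass described above, here is a NumPy sketch. The helper name dora_weight, the toy dimensions, and the zero initialisation of B (the usual LoRA convention, assumed here rather than stated in this entry) are all assumptions.

```python
import numpy as np

def dora_weight(W0, m, B, A):
    """Effective weight m * (W0 + B A) / ||W0 + B A||, with the norm taken per column."""
    V = W0 + B @ A
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # shape (1, d_in)
    return m * (V / col_norm)

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2                         # hypothetical toy sizes
W0 = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
m = np.linalg.norm(W0, axis=0, keepdims=True)    # magnitude initialised to m_0
A = rng.normal(size=(r, d_in)) * 0.01            # LoRA down-projection
B = np.zeros((d_out, r))                         # assumed zero init, so B A = 0 at step 0

W_init = dora_weight(W0, m, B, A)                # reproduces W0 before any training
x = rng.normal(size=(3, d_in))                   # batch of 3 inputs
y = x @ W_init.T                                 # linear layer output, shape (3, d_out)
```

With B at zero the adapted layer starts out identical to the pretrained one. Once B moves away from zero, the column norms of the effective weight are still set by m alone, which is what keeps magnitude and direction separately controllable.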

The extra trainable count is small: on top of the LoRA adapters, DoRA adds one magnitude value per column of each adapted matrix, which for a 70B model comes to perhaps a few million parameters. The normalisation adds compute, roughly 10-30% more wall-clock per step than vanilla LoRA, a cost that is usually acceptable given the quality gain.
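The parameter arithmetic for a single adapted matrix can be sketched as follows. The layer shape and rank are hypothetical values chosen for illustration, not figures from the paper.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters in one LoRA adapter: B is (d_out, r), A is (r, d_in)."""
    return r * (d_out + d_in)

def dora_extra_params(d_in):
    """DoRA adds one magnitude value per column, i.e. a vector of shape (1, d_in)."""
    return d_in

# Hypothetical square projection layer at rank 16
d_out, d_in, r = 8192, 8192, 16
lora = lora_params(d_out, d_in, r)      # 16 * (8192 + 8192) = 262144
extra = dora_extra_params(d_in)         # 8192
ratio = extra / lora                    # 0.03125, i.e. about 3% on top of LoRA
```

Because the magnitude vector scales with d_in while the adapter scales with r × (d_out + d_in), the relative overhead shrinks further as the rank grows.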

Hyperparameters

Hyperparameter	Typical value
Rank r	4 - 32 (often lower than LoRA needs)
Alpha α	2 × rank
Target modules	All linear layers
Magnitude learning rate	Same as LoRA LR
Magnitude initialisation	Per-column L2 norm of W

DoRA often reaches the quality of LoRA at r=32 with r=8 or r=16, so a lower rank can offset part of the per-step overhead.
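In Hugging Face PEFT, the values in the table map onto a LoraConfig roughly as below. This is a configuration sketch that assumes a PEFT release with DoRA support; the rank, the task type, and the choice of the "all-linear" shortcut are illustrative picks, not recommendations from the paper.

```python
from peft import LoraConfig

# Assumed example values; tune rank and alpha for the task at hand.
config = LoraConfig(
    r=16,                         # rank, within the 4 - 32 range in the table
    lora_alpha=32,                # 2 x rank
    target_modules="all-linear",  # adapt every linear layer
    use_dora=True,                # switches the adapter from LoRA to DoRA
    task_type="CAUSAL_LM",
)
```

The config is then passed to get_peft_model together with the base model, exactly as for a plain LoRA run. With this setup the magnitude vectors are ordinary trainable parameters, so they share the adapter learning rate unless the optimizer is given separate parameter groups.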

Trade-offs

Pro: often narrows or closes the quality gap between LoRA and full fine-tuning.
Pro: nearly the same trainable-parameter footprint as LoRA, since the magnitude vectors are negligible.
Pro: composes with QLoRA (commonly called QDoRA).
Con: 10-30% slower per step than LoRA due to the column-norm computation.
Con: a DoRA adapter can be merged into the base weight, but the merged change is not a pure low-rank delta, so it is harder to swap in and out than a LoRA adapter; some serving stacks do not support multi-DoRA.
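One way to see why adapter-swapping stacks treat DoRA differently is to compare the rank of the weight change each method produces after merging. The NumPy sketch below uses random matrices as stand-ins for trained values; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2                      # hypothetical toy sizes
W0 = rng.normal(size=(d_out, d_in))           # frozen base weight
B = rng.normal(size=(d_out, r))               # stand-in for a trained LoRA B
A = rng.normal(size=(r, d_in))                # stand-in for a trained LoRA A

# Stand-in for a trained magnitude: the initial column norms, perturbed per column
m = np.linalg.norm(W0, axis=0, keepdims=True) * rng.uniform(0.5, 1.5, size=(1, d_in))

# Merged DoRA weight: renormalise W0 + B A per column, then rescale by m
V = W0 + B @ A
W_dora = m * V / np.linalg.norm(V, axis=0, keepdims=True)

lora_delta_rank = np.linalg.matrix_rank(B @ A)        # change made by plain LoRA
dora_delta_rank = np.linalg.matrix_rank(W_dora - W0)  # change made by merged DoRA
```

The LoRA delta has rank r, so a server can keep one base weight and apply many small B, A pairs on the fly. The DoRA delta includes a per-column rescaling of the base weight itself, which generally pushes its rank above r, so each adapter needs either its own merged copy or explicit kernel support for the normalise-and-scale step.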

When to Use DoRA

Use DoRA when LoRA at sensible ranks is underperforming on the target task and full fine-tuning is out of reach. The cleanest signal is a LoRA run whose validation loss plateaus above what you need; DoRA often closes that gap. For straightforward instruction tuning on a strong base, vanilla LoRA is often enough.

References

DoRA: Weight-Decomposed Low-Rank Adaptation · arXiv (Liu et al., 2024)
NVIDIA DoRA reference repository · GitHub
PEFT DoRA documentation · Hugging Face docs
