TL;DR
- State Space Models (SSMs) compute outputs via a recurrence h_t = A · h_{t-1} + B · x_t, y_t = C · h_t — linear in sequence length, no quadratic attention.
- Mamba (Gu & Dao, 2023, arXiv:2312.00752) makes the SSM parameters B, C and the step size Δ input-dependent (selective), recovering the content-routing flexibility that earlier SSMs lacked.
- Mamba matches Transformer quality at small-to-medium scale with linear-time inference and a fixed-size state whose memory does not grow with context, which makes it attractive for very long contexts and edge deployment.
- Hybrid Transformer-SSM architectures (Jamba, Zamba) interleave Mamba blocks with attention blocks, capturing both global mixing and linear-time long-context behaviour.
The State-Space Recurrence
A linear time-invariant state-space model is defined by four matrices A, B, C, D and the recurrence h_t = A · h_{t-1} + B · x_t, y_t = C · h_t + D · x_t. Structurally this is a linear RNN: the state update has no nonlinearity. What distinguishes SSMs is that A is parameterised in a specific way (HiPPO matrices) so that h_t represents a compressed memory of the entire input history with provable approximation properties.
Computationally, SSMs can be evaluated either as a recurrence (sequential) or as a convolution (parallel) — Gu et al.'s 2021-2022 work showed that for time-invariant A, B, C the recurrence is equivalent to convolving the input with a precomputed kernel. That made SSMs efficient to train on GPUs without the parallelism limits of RNNs.
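The equivalence between the two views can be checked on a toy model. The sketch below assumes a diagonal A and made-up parameter values, and uses a direct O(n²) convolution for readability; it illustrates the identity K_k = C · A^k · B rather than the FFT-based kernels used in practice.

```python
# Toy LTI SSM with a diagonal A, in pure Python.
# All parameter values passed in are illustrative, not trained weights.

def ssm_recurrent(a, b, c, d, xs):
    """Sequential form: h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t."""
    h = [0.0] * len(a)  # h_{-1} = 0
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)) + d * x)
    return ys

def ssm_kernel(a, b, c, length):
    """Precomputed kernel K_k = C A^k B for k = 0 .. length-1."""
    return [sum(c_i * (a_i ** k) * b_i for a_i, b_i, c_i in zip(a, b, c))
            for k in range(length)]

def ssm_convolutional(a, b, c, d, xs):
    """Convolutional form: y_t = sum_k K_k x_{t-k} + D x_t."""
    kernel = ssm_kernel(a, b, c, len(xs))
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) + d * xs[t]
            for t in range(len(xs))]
```

Both functions return the same outputs up to floating-point rounding, because unrolling the recurrence gives h_t = Σ_k A^k · B · x_{t-k}. The kernel depends only on A, B, C, which is why it can be computed once and reused for every position.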
Why Pure Time-Invariant SSMs Plateau
S4 (Gu et al., 2022) showed strong performance on long-range benchmarks but did not match Transformers on language modelling. The reason was diagnosed by Gu & Dao: a time-invariant A applies the same state-update rule to every token, regardless of content. The model can compress history well, but it cannot decide what to compress conditionally on the current input. Attention, by contrast, gives every token explicit control over what it attends to.
Mamba's Selective Mechanism
Mamba makes the SSM matrices B, C and the discretisation step Δ depend on the input x_t. The recurrence becomes h_t = Ā(x_t) · h_{t-1} + B̄(x_t) · x_t, where Ā and B̄ are produced by small input-dependent functions and Δ controls how aggressively the state is updated at this step.
Selectivity restores content routing: a token whose Δ is large effectively 'resets' the state and stores new information; one whose Δ is small mostly preserves prior state. The model can choose to remember, forget or transform information per token, the way attention chooses what to attend to.
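A minimal one-channel sketch of the selective recurrence follows. It assumes a diagonal A with negative entries, Ā = exp(Δ · A), and the simplified discretisation B̄ = Δ · B; the scalar weights `w_delta`, `bias_delta`, `w_b` and `w_c` are hypothetical stand-ins for the learned projections, not names from the Mamba codebase.

```python
import math

def softplus(z):
    """Smooth positive function, used here to keep the step size Δ > 0."""
    return math.log1p(math.exp(z))

def selective_step(h, x, a, delta, b, c):
    """One step of h_t = Ā h_{t-1} + B̄ x_t, y_t = C · h_t (diagonal A < 0)."""
    a_bar = [math.exp(delta * a_i) for a_i in a]  # Ā = exp(ΔA)
    b_bar = [delta * b_i for b_i in b]            # B̄ = ΔB (simplifying assumption)
    h = [ab * h_i + bb * x for ab, h_i, bb in zip(a_bar, h, b_bar)]
    y = sum(c_i * h_i for c_i, h_i in zip(c, h))
    return h, y

def selective_scan(xs, a, w_delta, bias_delta, w_b, w_c):
    """Run the recurrence with Δ, B and C recomputed from each input x_t."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + bias_delta)  # input-dependent step size
        b = [w * x for w in w_b]                    # input-dependent B_t
        c = [w * x for w in w_c]                    # input-dependent C_t
        h, y = selective_step(h, x, a, delta, b, c)
        ys.append(y)
    return ys, h
```

With A negative, a large Δ drives exp(Δ · A) towards 0, so the previous state is mostly overwritten; a small Δ keeps exp(Δ · A) near 1, so the state passes through almost unchanged. That is the remember-or-forget behaviour described above.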
Because Ā and B̄ now vary per token, the convolutional view of SSMs no longer applies: there is no single kernel to precompute. Mamba's contribution includes a hardware-aware parallel scan algorithm that recovers GPU efficiency despite this.
Hardware-Aware Parallel Scan
A naive selective SSM is a sequential recurrence, which maps poorly onto GPUs. Mamba instead implements a parallel scan with careful tiling: parameters are loaded from HBM into SRAM, the discretisation and the associative scan run on chip, and only the outputs are written back to HBM. The full expanded state tensor is never materialised in HBM; intermediate states are recomputed in the backward pass rather than stored.
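A scan applies here because each step h → a_t · h + b_t (with a_t = Ā_t and b_t = B̄_t · x_t) composes associatively. The sketch below shows that algebra on scalar states using a Hillis-Steele scan, chosen for brevity. It is not Mamba's CUDA kernel and says nothing about tiling or memory placement.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(steps):
    """Reference: run h_t = a_t * h_{t-1} + b_t one step at a time."""
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out

def parallel_style_scan(steps):
    """Hillis-Steele inclusive scan: about log2(n) rounds, and every
    element within a round could be computed in parallel."""
    acc = list(steps)
    offset = 1
    while offset < len(acc):
        # Each new entry is built from the previous round's values only.
        acc = [acc[i] if i < offset else combine(acc[i - offset], acc[i])
               for i in range(len(acc))]
        offset *= 2
    # With h_{-1} = 0, the state after step t is the b component of the prefix.
    return [b for _, b in acc]
```

Associativity is what lets chunks of the sequence be reduced independently and merged afterwards; the per-token coefficients can differ freely, which is why selectivity does not prevent a parallel scan.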
On Hopper GPUs the Mamba scan kernel achieves throughput comparable to attention at small batch sizes and dominates at very long contexts (>32k tokens). On Blackwell it scales further with the larger SRAM and improved L2.
Inference Properties
At inference, Mamba is a pure recurrence: each new token costs O(1) compute given the state, and the state size is fixed (independent of context length). This is the structural advantage Transformers lack — Transformer decoding cost scales with the KV cache, which grows linearly with context.
For a million-token context, Mamba's per-token decode cost is roughly the same as at the first token. In a Transformer, the attention work per decoded token grows linearly with the number of cached tokens, so the millionth token reads roughly 1000× more keys and values than the thousandth, before KV-cache optimisations.
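The memory side of this comparison is simple arithmetic. The sketch below uses hypothetical model shapes (layer counts, head sizes, state size) picked only to make the numbers concrete, assumes 2 bytes per value, and ignores Mamba's small convolution state.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values cached for every past token in every attention layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """Recurrent state per layer: one d_state-sized vector per inner channel."""
    return n_layers * d_inner * d_state * bytes_per_value

# Hypothetical shapes for illustration; not taken from any released model.
for context_len in (1_000, 1_000_000):
    kv = kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=64, d_inner=5120, d_state=16)
    print(f"{context_len:>9} tokens: KV cache {kv / 2**20:,.0f} MiB, "
          f"SSM state {ssm / 2**20:,.0f} MiB")
```

Under these assumed shapes the KV cache grows by exactly 1000× between the two context lengths, while the SSM state is the same size in both rows.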
Hybrid Architectures
Pure Mamba models (Mamba-1.4B, Mamba-2.8B) are competitive at small scale, but pure-SSM models have remained slightly behind Transformers on standard benchmarks as scale grows. The 2024 consensus has been hybrid: interleave Mamba blocks with attention blocks. Notable 2024 releases include both hybrids and pure-Mamba models:
- Jamba (AI21, 2024) — 1 attention block per 7 Mamba blocks plus MoE, 52B total parameters.
- Zamba (Zyphra, 2024) — Mamba blocks with a single shared attention block applied periodically.
- Falcon Mamba 7B (TII, 2024) — a pure, attention-free Mamba model released with open weights.
- Codestral Mamba (Mistral, 2024) — a Mamba-based model for code generation that exploits the long-context advantage on full repositories.
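The interleaving used by the hybrids can be pictured as a layer schedule. The helper below is a hypothetical sketch of the 1-attention-per-7-Mamba ratio quoted for Jamba above; placing the attention block last in each group of eight is an assumption made for illustration, not the published layout.

```python
def hybrid_schedule(n_layers, mamba_per_attention=7):
    """Return block types with one attention block per `mamba_per_attention`
    Mamba blocks. Attention is placed last in each group (an assumption)."""
    period = mamba_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "mamba"
            for i in range(n_layers)]

# A 32-layer stack under this scheme has 4 attention and 28 Mamba blocks.
schedule = hybrid_schedule(32)
```

Only the attention blocks keep a KV cache, so under this ratio the cache is about one eighth the size it would be in an all-attention stack of the same depth.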
Standing in 2026
Mamba and its successors have not displaced the Transformer for frontier general-purpose LLMs. They have established a real niche: extreme long context (1M+ tokens), edge deployment (constant memory per token), and code repositories (where the full project context matters more than peak intelligence).
Mamba-2 (Dao & Gu, 2024) tightened the connection between SSMs and attention by showing structured state-space duality — useful theoretical clarity, though it has not produced a dramatic capability jump yet.