TL;DR
- Cerebras's third-generation Wafer-Scale Engine (WSE-3) launched in March 2024, built on TSMC 5 nm with 4 trillion transistors.
- Single chip occupies an entire wafer: 900,000 AI cores, 44 GB on-die SRAM, 21 PB/s memory bandwidth.
- Designed for training and inference workloads where memory bandwidth, not compute, is the limit on GPU architectures.
- Sold as the CS-3 system; multiple CS-3 systems compose into clusters via SwarmX fabric.
Overview
Cerebras WSE-3 is the largest chip in production. Where conventional GPUs are built from dies cut out of a wafer, Cerebras keeps the wafer whole, yielding a single chip the size of a dinner plate with 4 trillion transistors, 900,000 AI cores and 44 GB of on-die SRAM. The architectural bet is that for transformer workloads the dominant cost is not FLOPS but the bandwidth needed to feed weights to compute, and that a wafer-scale SRAM pool can deliver weights at bandwidths orders of magnitude beyond HBM.
The product unit is the CS-3 system. Multiple CS-3s connect via the SwarmX fabric and the MemoryX external weight store, allowing training of trillion-parameter models without per-device parameter sharding.
Specifications
| Metric | WSE-3 |
|---|---|
| Process | TSMC 5 nm |
| Transistors | 4 trillion |
| AI cores | 900,000 |
| On-die SRAM | 44 GB |
| Memory bandwidth | ~21 PB/s on-die |
| Fabric bandwidth | ~214 Pb/s (petabits per second) on-chip |
| Sparse FP16 throughput | 125 PFLOPS |
| System | CS-3 |
WSE-3's defining number is bandwidth, not FLOPS. 21 PB/s on-die is roughly 7,000× a single H100's HBM bandwidth — the metric the architecture optimises.
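The ratio above is easy to check by hand. The sketch below is a minimal back-of-envelope calculation, assuming an H100 SXM HBM bandwidth of 3.35 TB/s; that figure is an assumption and does not come from this article. The exact multiple depends on which GPU baseline is chosen: the assumed 3.35 TB/s gives roughly 6,300×, while a rounded 3 TB/s baseline gives 7,000×.

```python
# Bandwidth figures in bytes per second.
WSE3_SRAM_BW = 21e15    # 21 PB/s on-die, from the specifications table
H100_HBM_BW = 3.35e12   # assumption: H100 SXM figure, not from this article


def bandwidth_ratio(wse_bw: float, gpu_bw: float) -> float:
    """Return how many times larger wse_bw is than gpu_bw."""
    return wse_bw / gpu_bw


ratio = bandwidth_ratio(WSE3_SRAM_BW, H100_HBM_BW)
print(f"WSE-3 on-die bandwidth is about {ratio:,.0f}x the assumed H100 figure")
```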
Why a Whole Wafer
Transformer training and inference are increasingly memory-bandwidth bound. Each layer requires streaming the full weight tensor through the compute units, and HBM bandwidth has scaled more slowly than FLOPS, leaving GPUs under-fed whenever little compute is done per byte fetched, as in small-batch decoding and long-sequence inference.
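One way to make "memory-bandwidth bound" concrete is a roofline-style comparison between the FLOPs a layer performs per weight byte and the FLOPs a device can execute per byte fetched. The sketch below is a simplified model under stated assumptions: it counts weight traffic only (ignoring activations and KV-cache traffic, which grow with batch size and sequence length), assumes 2-byte parameters, and uses assumed H100 SXM figures that are not from this article. All function names are hypothetical.

```python
def machine_balance(peak_flops: float, mem_bw: float) -> float:
    """FLOPs a device can execute per byte it can fetch from memory."""
    return peak_flops / mem_bw


def weight_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per weight byte in one pass over a dense layer.

    Each parameter is read once and used for one multiply-add (2 FLOPs)
    per sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


def is_bandwidth_bound(batch_size: int, peak_flops: float, mem_bw: float) -> bool:
    """True when weight fetches, not arithmetic, limit throughput."""
    return weight_intensity(batch_size) < machine_balance(peak_flops, mem_bw)


# Assumed figures for a single H100 SXM (not from this article).
GPU_PEAK_FLOPS = 989e12   # dense FP16 FLOPs per second
GPU_MEM_BW = 3.35e12      # HBM bytes per second

for batch in (1, 32, 512):
    bound = "bandwidth" if is_bandwidth_bound(batch, GPU_PEAK_FLOPS, GPU_MEM_BW) else "compute"
    print(f"batch={batch:>3}: {bound}-bound")
```

Under these assumptions the device needs roughly 295 FLOPs per byte to stay busy, so weight reads dominate until the work done per fetched weight reaches that level.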
Cerebras's bet is that placing weights in on-die SRAM eliminates the bottleneck. 44 GB is modest by HBM capacity standards, but bandwidth is 21 PB/s, nearly four orders of magnitude beyond a single GPU's HBM3, and the on-chip mesh fabric lets bandwidth scale with compute instead of being capped by an off-chip interface.
Weights too large for 44 GB sit in MemoryX, an external store that streams parameters to the WSE on demand. This model — 'weight streaming' — replaces the parameter sharding strategies used on GPU clusters.
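The idea of weight streaming can be illustrated with a toy forward pass in which only one layer's weights are resident at a time. This is a conceptual sketch, not Cerebras's implementation or API: `stream_layers`, `forward_streamed` and `fits_on_die` are hypothetical names, a Python list stands in for the external store, and 44 GB is treated as 44e9 bytes with 2-byte parameters by assumption.

```python
from typing import Iterator, List

Matrix = List[List[float]]


def fits_on_die(n_params: float, bytes_per_param: int = 2, sram_bytes: float = 44e9) -> bool:
    """Check whether all weights would fit in on-die SRAM (assumed 44e9 bytes)."""
    return n_params * bytes_per_param <= sram_bytes


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def stream_layers(store: List[Matrix]) -> Iterator[Matrix]:
    """Yield one layer's weights at a time, standing in for an external store."""
    for w in store:
        yield w


def forward_streamed(store: List[Matrix], x: List[float]) -> List[float]:
    """Run a ReLU MLP while holding only the current layer's weights."""
    for w in stream_layers(store):
        x = [max(0.0, v) for v in matvec(w, x)]
    return x


# Two tiny layers: an identity, then a scale-and-negate.
store = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, -1.0]],
]
out = forward_streamed(store, [1.0, 2.0])
print(out)
print(fits_on_die(7e9), fits_on_die(70e9))
```

In this toy version the activations stay in place while weights arrive layer by layer, so the compute side never needs to hold or partition the full parameter set.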
When to Pick Cerebras
- Training and inference workloads dominated by memory-bandwidth bottlenecks.
- Very-long-context inference where attention's quadratic memory traffic punishes GPUs.
- Workloads benefiting from the weight-streaming abstraction over parameter sharding.
- Customers willing to accept a custom software stack and a single-vendor commitment.
- Pick GPU clusters instead when CUDA ecosystem reach or commodity supply matters more.
Pitfalls
- Single-vendor stack: Cerebras provides its own compiler, runtime and training framework. Portability is non-trivial.
- Capital cost per system is substantial; the unit of investment is much larger than a GPU node.
- Software ecosystem reach is narrow; popular inference serving frameworks (vLLM, TensorRT-LLM, SGLang) do not target Cerebras.
- Power and facilities requirements per CS-3 are non-trivial — installation typically requires bespoke planning.
Software Notes
Cerebras provides its own software stack, including PyTorch-compatible APIs through a custom XLA-style compiler. Reference recipes target Llama, GPT, BERT and other open-weight families. Inference offerings (Cerebras Inference) provide low-latency LLM serving directly on WSE-3 systems.