Cerebras WSE-3 (Wafer-Scale Engine 3)

TL;DR

Cerebras's third-generation wafer-scale engine, launched in March 2024: built on TSMC 5 nm with 4 trillion transistors.
Single chip occupies an entire wafer: 900,000 AI cores, 44 GB on-die SRAM, 21 PB/s memory bandwidth.
Designed for training and inference workloads where memory bandwidth, not compute, limits GPU throughput.
Sold as the CS-3 system; multiple CS-3 systems compose into clusters via SwarmX fabric.

Overview

Cerebras WSE-3 is the largest chip in production. Where conventional GPU makers dice each wafer into many dies, Cerebras keeps the wafer whole, yielding a single dinner-plate-sized chip with 4 trillion transistors, 900,000 AI cores and 44 GB of on-die SRAM. The architectural bet is that for transformer workloads the dominant cost is not FLOPS but the bandwidth needed to feed weights to compute, and that a wafer-scale SRAM pool can deliver weights at bandwidths orders of magnitude beyond HBM.

The product unit is the CS-3 system. Multiple CS-3s connect via the SwarmX fabric and the MemoryX external weight store, allowing training of trillion-parameter models without per-device parameter sharding.

Specifications

Metric	WSE-3
Process	TSMC 5 nm
Transistors	4 trillion
AI cores	900,000
On-die SRAM	44 GB
Memory bandwidth	~21 PB/s on-die
Fabric bandwidth	~214 Pb/s on-chip
Sparse FP16 throughput	125 PFLOPS
System	CS-3

WSE-3's defining number is bandwidth, not FLOPS: memory bandwidth is the metric the architecture optimises. At 21 PB/s on-die, it is roughly 7,000× a single H100's HBM bandwidth (about 3 TB/s).
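The ratio can be checked with a few lines of arithmetic. The sketch below is a back-of-envelope calculation, not a benchmark. The H100 figures are assumptions that do not come from this entry: about 3.35 TB/s is the commonly cited HBM3 figure for the SXM part, and 3.0 TB/s is a rounded value that reproduces the 7,000× headline.

```python
# Back-of-envelope check of the WSE-3 vs H100 memory-bandwidth ratio.
WSE3_SRAM_BW = 21e15  # bytes/s, the ~21 PB/s on-die figure from the table

# Assumed H100 HBM bandwidths (not taken from this entry).
H100_HBM_BW = {
    "datasheet_sxm": 3.35e12,  # bytes/s, commonly cited SXM HBM3 figure
    "rounded": 3.0e12,         # bytes/s, rounded value behind "7,000x"
}


def bandwidth_ratio(wse_bw: float, gpu_bw: float) -> float:
    """How many times larger the on-die SRAM bandwidth is than GPU HBM."""
    return wse_bw / gpu_bw


ratios = {name: bandwidth_ratio(WSE3_SRAM_BW, bw) for name, bw in H100_HBM_BW.items()}

for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:,.0f}x")
```

With the rounded figure the ratio is 7,000×; with the 3.35 TB/s figure it is closer to 6,300×. Either way the gap is between three and four orders of magnitude.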

Why a Whole Wafer

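Transformer training and inference are increasingly memory-bandwidth bound. Each layer requires streaming its full weight tensor through the compute units, and HBM bandwidth has scaled more slowly than FLOPS. GPUs are therefore under-fed whenever arithmetic intensity is low: in small-batch decoding, where each weight is used only a few times per load, and at long sequences, where KV-cache reads grow with context length.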
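A rough roofline-style check shows when a weight matrix multiply is limited by bandwidth rather than FLOPS. This is a minimal sketch under simplifying assumptions: it counts only weight traffic (activation and KV-cache traffic are ignored), assumes FP16 weights, and uses assumed GPU figures of about 989 TFLOPS dense FP16 and 3.35 TB/s, which are not taken from this entry.

```python
def arithmetic_intensity(batch_tokens: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for y = x @ W.

    For a d_in x d_out weight matrix:
      FLOPs        = 2 * batch_tokens * d_in * d_out
      weight bytes = bytes_per_weight * d_in * d_out
    The matrix dimensions cancel, leaving 2 * batch_tokens / bytes_per_weight.
    """
    return 2.0 * batch_tokens / bytes_per_weight


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Intensity (FLOP/byte) below which the device is bandwidth-bound."""
    return peak_flops / mem_bw


def is_bandwidth_bound(batch_tokens: int, peak_flops: float, mem_bw: float) -> bool:
    return arithmetic_intensity(batch_tokens) < ridge_point(peak_flops, mem_bw)


# Assumed GPU figures (hypothetical inputs for illustration).
GPU_PEAK_FLOPS = 989e12  # FLOP/s, dense FP16
GPU_MEM_BW = 3.35e12     # bytes/s, HBM3

for batch in (1, 8, 64, 512):
    bound = is_bandwidth_bound(batch, GPU_PEAK_FLOPS, GPU_MEM_BW)
    label = "bandwidth-bound" if bound else "compute-bound"
    print(f"batch={batch:4d}: {label}")
```

Under these assumptions the ridge point sits near 295 FLOP/byte, so a weight matmul only becomes compute-bound once a few hundred tokens share each weight load. Raising memory bandwidth lowers that threshold, which is the lever the wafer-scale design pulls.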

Cerebras's bet is that placing weights in on-die SRAM removes this bottleneck. At 44 GB the capacity is modest by HBM standards (a single H100 carries 80 GB), but the bandwidth is 21 PB/s, between three and four orders of magnitude beyond a single GPU's HBM3. Because the SRAM is distributed across the on-chip mesh alongside the cores, bandwidth scales with compute instead of being capped at an off-chip boundary.
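The capacity-versus-bandwidth trade-off can be made concrete with two small helpers. This is a sketch with assumed inputs: the model sizes (8B and 70B parameters in FP16) and the 3.35 TB/s GPU figure are illustrative and not from this entry, and the fit check ignores activations, KV cache and any compiler overhead. The "passes per second" value is a bandwidth-only ceiling on how often the full weight set could be read, not a throughput prediction.

```python
WSE3_SRAM_BYTES = 44e9   # on-die SRAM capacity from the table
WSE3_SRAM_BW = 21e15     # bytes/s, on-die memory bandwidth from the table
GPU_HBM_BW = 3.35e12     # bytes/s, assumed single-GPU HBM3 figure


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Total bytes occupied by the weights (FP16 by default)."""
    return n_params * bytes_per_param


def fits_on_die(n_params: float, bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in on-die SRAM (ignores activations)."""
    return weight_bytes(n_params, bytes_per_param) <= WSE3_SRAM_BYTES


def max_weight_passes_per_second(n_params: float, mem_bw: float,
                                 bytes_per_param: int = 2) -> float:
    """Upper bound on full reads of the weight set per second."""
    return mem_bw / weight_bytes(n_params, bytes_per_param)


for n_params in (8e9, 70e9):
    print(f"{n_params / 1e9:.0f}B params, fits on die: {fits_on_die(n_params)}")

gpu_ceiling = max_weight_passes_per_second(8e9, GPU_HBM_BW)
wse_ceiling = max_weight_passes_per_second(8e9, WSE3_SRAM_BW)
print(f"8B FP16 ceiling: GPU ~{gpu_ceiling:,.0f}/s, WSE-3 ~{wse_ceiling:,.0f}/s")
```

For the assumed 8B-parameter model the weights (16 GB) fit on die, and the bandwidth-only ceiling is about 209 passes per second on the assumed GPU versus about 1.3 million on WSE-3. The assumed 70B model (140 GB) does not fit in 44 GB and needs an external weight store.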

Weights too large for 44 GB sit in MemoryX, an external store that streams parameters to the WSE on demand. This model — 'weight streaming' — replaces the parameter sharding strategies used on GPU clusters.
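The idea behind weight streaming can be illustrated with a toy forward pass that fetches one layer's weights at a time from an external store. This is a conceptual sketch only: `ExternalWeightStore` and `forward_streamed` are hypothetical names invented for illustration, not Cerebras or MemoryX APIs, and the real system's scheduling and overlap of transfer with compute are not modelled.

```python
from typing import Dict, Iterator, List, Tuple

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product, standing in for a layer's compute."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class ExternalWeightStore:
    """Hypothetical stand-in for an off-chip parameter store."""

    def __init__(self, layers: Dict[int, Matrix]) -> None:
        self._layers = layers
        self.fetches = 0  # number of layers streamed so far

    def stream(self) -> Iterator[Tuple[int, Matrix]]:
        """Yield one layer's weights at a time, in layer order."""
        for idx in sorted(self._layers):
            self.fetches += 1
            yield idx, self._layers[idx]


def forward_streamed(store: ExternalWeightStore, x: List[float]) -> List[float]:
    """Run the layers in order; only the current layer's weights are in use."""
    for _, w in store.stream():
        x = matvec(w, x)
    return x


store = ExternalWeightStore({
    0: [[1.0, 0.0], [0.0, 2.0]],
    1: [[0.5, 0.5], [1.0, -1.0]],
})
out = forward_streamed(store, [2.0, 3.0])
print(out, store.fetches)
```

The point of the pattern is that the compute device never needs to hold the whole model: activations stay in place while weights flow past them layer by layer, which is the opposite of sharding parameters across many devices.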

When to Pick Cerebras

Training and inference workloads dominated by memory-bandwidth bottlenecks.
Very-long-context inference where attention's quadratic memory traffic punishes GPUs.
Workloads benefiting from the weight-streaming abstraction over parameter sharding.
Customers willing to accept a custom software stack and a single-vendor commitment.
Pick GPU clusters when CUDA ecosystem reach or commodity supply dominate.
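For the long-context case in the list above, a quick estimate of KV-cache size shows where the memory traffic comes from. This is a minimal sketch: the model shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumed Llama-3-8B-like configuration chosen for illustration, and the traffic model assumes every decoded token rereads the whole cache once.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Bytes held by the KV cache: keys and values for every layer and token."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def decode_read_traffic_bytes(n_new_tokens: int, prompt_len: int,
                              per_token_bytes: int) -> int:
    """Total cache bytes read while decoding, assuming each new token
    rereads the full cache (prompt plus tokens generated so far)."""
    return sum((prompt_len + t) * per_token_bytes for t in range(n_new_tokens))


# Assumed Llama-3-8B-like shape (hypothetical inputs for illustration).
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **SHAPE)
full_context = kv_cache_bytes(131_072, **SHAPE)

print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"128k-token context: {full_context / 2**30:.0f} GiB")
```

Under these assumptions the cache costs 128 KiB per token and 16 GiB at a 131,072-token context. Because each decoded token rereads a cache that keeps growing, total read traffic over a generation grows roughly with the square of its length, which is why long-context decoding leans so heavily on memory bandwidth.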

Pitfalls

Single-vendor stack: Cerebras provides its own compiler, runtime and training framework. Portability is non-trivial.
Capital cost per system is substantial; the unit of investment is much larger than a GPU node.
Software ecosystem reach is narrow; popular inference-serving frameworks (vLLM, TensorRT-LLM, SGLang) do not target Cerebras.
Power and facilities requirements per CS-3 are non-trivial — installation typically requires bespoke planning.

Software Notes

Cerebras provides its own software stack, including PyTorch-compatible APIs through a custom XLA-style compiler. Reference recipes target Llama, GPT, BERT and other open-weight families. Inference offerings (Cerebras Inference) provide low-latency LLM serving directly on WSE-3 systems.

