TL;DR
- Google's fourth-generation TPU, deployed internally from 2020 and made available on Google Cloud in 2022.
- 275 TFLOPS BF16 per chip, 32 GB HBM per chip, organised in 4,096-chip 3D-torus pods linked by optical circuit switches.
- First TPU generation to use OCS, a reconfigurable optical fabric that lets Google compose pod slices on demand.
- Used to train PaLM, PaLM 2 and much of Google's early generative-AI work; now largely succeeded by v5e/v5p/Trillium.
Overview
TPU v4 is the generation that made Google's training infrastructure publicly visible. Internal use started in 2020; Google Cloud exposed it as a managed service in 2022. Each chip pairs a systolic-array matrix engine with 32 GB of HBM; four chips sit on a 'tray', and trays scale into 4,096-chip pods.
What made v4 distinctive was not the chip but the fabric. Optical circuit switches (OCS) let Google reconfigure the interconnect topology on demand, slicing the pod into sub-pods of various shapes for different jobs. PaLM (540B) was trained on TPU v4 pods; the TPU v4 paper (Jouppi et al., ISCA 2023) documents the OCS-based topology in detail.
Specifications
| Metric | TPU v4 |
|---|---|
| Peak BF16 compute (per chip) | 275 TFLOPS |
| Peak INT8 compute (per chip) | 275 TOPS |
| Memory (per chip) | 32 GB HBM |
| Memory bandwidth (per chip) | 1.2 TB/s |
| Inter-chip link | ICI, ~50 GB/s per link |
| TDP (per chip) | ~200 W |
| Pod scale | 4,096 chips |
| Fabric | OCS-reconfigured 3D torus |
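The per-chip figures in the table can be rolled up into pod-level numbers with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the inputs are the peak values quoted in the table, and sustained throughput on real workloads will be lower.

```python
# Back-of-the-envelope figures derived from the per-chip table.
# Peak numbers only; these are not measured results.

BF16_TFLOPS_PER_CHIP = 275   # from the table
HBM_GB_PER_CHIP = 32         # from the table
HBM_TBPS_PER_CHIP = 1.2      # from the table, TB/s
TDP_W_PER_CHIP = 200         # approximate, from the table
CHIPS_PER_POD = 4096         # from the table


def pod_totals(chips=CHIPS_PER_POD):
    """Aggregate peak figures for a pod (or slice) of `chips` chips."""
    return {
        "bf16_exaflops": chips * BF16_TFLOPS_PER_CHIP / 1e6,  # TFLOPS -> EFLOPS
        "hbm_tb": chips * HBM_GB_PER_CHIP / 1000,             # GB -> TB (decimal)
        "chip_power_kw": chips * TDP_W_PER_CHIP / 1000,       # W -> kW, chips only
    }


def ridge_point_flops_per_byte():
    """Roofline ridge point: BF16 FLOPs per HBM byte needed to be compute-bound."""
    return (BF16_TFLOPS_PER_CHIP * 1e12) / (HBM_TBPS_PER_CHIP * 1e12)
```

For a full 4,096-chip pod this works out to roughly 1.13 exaFLOPS of peak BF16 compute and about 131 TB of HBM. The ridge point of roughly 229 FLOPs per byte suggests that kernels with lower arithmetic intensity will tend to be limited by HBM bandwidth rather than by the matrix units.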
Architecture and Pod Design
Each TPU v4 chip contains systolic-array matrix multiply units (MXUs), a vector unit, and HBM. Four chips share a tray with all-to-all on-tray connectivity. Trays connect into a 3D torus topology that is reconfigured by optical circuit switches in the surrounding fabric.
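To make the 3D torus concrete, here is a minimal sketch of how neighbours and hop counts work in such a topology. The 16 x 16 x 16 shape is an assumption chosen because it multiplies out to 4,096 chips; the coordinate scheme and function names are hypothetical and do not reflect how Google addresses chips.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

POD_SHAPE: Coord = (16, 16, 16)  # assumed shape: 16 * 16 * 16 = 4,096 chips


def torus_neighbors(coord: Coord, shape: Coord) -> List[Coord]:
    """Return the six wraparound neighbours of `coord` in a 3D torus.

    The six are distinct only when every dimension has at least 3 chips.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + step) % shape[axis]  # wrap at the edge
            neighbors.append(tuple(nxt))
    return neighbors


def hop_distance(a: Coord, b: Coord, shape: Coord) -> int:
    """Minimum hops between two chips, going the shorter way round each ring."""
    return sum(min((x - y) % n, (y - x) % n) for x, y, n in zip(a, b, shape))
```

Because every ring wraps around, a chip at coordinate 0 is one hop from the chip at coordinate 15 on the same axis, and the farthest chip in this assumed shape is 24 hops away (8 per axis).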
The OCS layer is the architectural innovation. Conventional electrical interconnects fix the topology at build time; OCS lets Google re-route the torus links per job, without manual recabling, so the topology can match the parallelism pattern the workload needs. This improves utilisation relative to a fixed topology, particularly for irregular jobs.
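Why the shape of a slice matters can be illustrated by comparing candidate torus shapes for the same chip count. The sketch below is a simplified model, not Google's scheduler: the allowed side lengths (4, 8, 16) and the function names are assumptions for illustration, and the metrics are textbook torus properties.

```python
from itertools import combinations_with_replacement
from math import prod


def slice_shapes(chips, sides=(4, 8, 16)):
    """Shapes (x <= y <= z) drawn from `sides` that hold exactly `chips` chips."""
    return [s for s in combinations_with_replacement(sides, 3) if prod(s) == chips]


def diameter(shape):
    """Worst-case hop count in a torus: half-way round each ring."""
    return sum(n // 2 for n in shape)


def bisection_links(shape):
    """Links cut when the torus is halved across its longest dimension.

    Each ring along that dimension is cut in two places.
    """
    return 2 * prod(shape) // max(shape)


def compare(chips):
    """Map each candidate shape to its (diameter, bisection link count)."""
    return {s: (diameter(s), bisection_links(s)) for s in slice_shapes(chips)}
```

Under these assumptions, a 512-chip job could be laid out as 8 x 8 x 8 (diameter 12, 128 bisection links) or 4 x 8 x 16 (diameter 14, 64 bisection links). A workload dominated by all-to-all traffic would likely prefer the cubic shape, while one with a single long pipeline axis might prefer the elongated one.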
When TPUs Matter
- Workloads on Google Cloud where the GCE TPU integration removes operational overhead.
- Frameworks already targeting JAX or TensorFlow with XLA backends.
- Very large training runs where pod-scale ICI bandwidth and the OCS fabric can outperform InfiniBand-based clusters.
- Not applicable outside Google Cloud: workloads there cannot use TPUs directly, and GPUs are the default.
Pitfalls
- TPUs are exclusive to Google Cloud; there is no on-prem or other-cloud option.
- JAX is the most productive framework; PyTorch on TPU is supported but with rough edges.
- XLA compilation can be slow, so the edit-run-debug cycle differs from CUDA's.
- Custom kernels require Pallas (JAX's kernel language, roughly the TPU counterpart to Triton) and a different mental model from CUDA.
Software Ecosystem
JAX with the XLA backend is the production path. TensorFlow continues to work but JAX dominates new work. PyTorch/XLA exists but typically lags JAX in throughput and feature support. Pallas exposes low-level TPU programming for advanced users.
References
- TPU v4 Paper (Jouppi et al., ISCA 2023) · arXiv
- Google Cloud TPU Documentation · Google Cloud