Google TPU v4

TL;DR

Google's fourth-generation TPU, deployed internally from 2020 and made available on Google Cloud in 2022.
275 TFLOPS BF16 per chip, 32 GB HBM per chip, organised in 4,096-chip 3D-torus pods linked by optical circuit switches.
First TPU generation to use OCS, a reconfigurable optical fabric that lets Google compose pod slices on demand.
Trained PaLM, PaLM 2 and much of Google's early generative-AI work; now largely succeeded by v5e, v5p and Trillium.

Overview

TPU v4 is the generation that made Google's training infrastructure publicly visible. Internal use started in 2020; Google Cloud exposed it as a managed service in 2022. Each chip pairs a systolic-array matrix engine with 32 GB of HBM; four chips sit on a tray, and trays scale into 4,096-chip pods.

What made v4 distinctive was not the chip but the fabric. Optical circuit switches (OCS) let Google reconfigure the interconnect topology on demand, slicing the pod into sub-pods of various shapes for different jobs. PaLM (540B) was trained on TPU v4 pods; the TPU v4 paper (Jouppi et al., ISCA 2023) documents the OCS-based topology in detail.

Specifications

Metric	TPU v4 (per chip)
BF16	275 TFLOPS
INT8	275 TOPS
Memory	32 GB HBM
Memory bandwidth	1.2 TB/s
Inter-chip link	ICI, ~50 GB/s per link
Pod scale	4,096 chips
TDP	~200 W per chip
Fabric	OCS-reconfigured 3D torus
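
The per-chip figures in the table can be scaled up to pod level with simple arithmetic. The following is a minimal sketch, not an official figure: it multiplies the table values by the pod size and derives a roofline ridge point (peak FLOPs divided by HBM bandwidth). The variable names are illustrative, and the results are theoretical peaks that ignore any real-world efficiency loss.

```python
# Per-chip figures copied from the specification table above.
BF16_TFLOPS = 275        # peak BF16 throughput per chip
HBM_GB = 32              # HBM capacity per chip
HBM_TBPS = 1.2           # HBM bandwidth per chip, TB/s
POD_CHIPS = 4096         # chips in a full pod

# Pod-level peaks: plain multiplication, no efficiency loss modelled.
pod_exaflops = BF16_TFLOPS * POD_CHIPS / 1e6   # TFLOPS -> EFLOPS
pod_hbm_tb = HBM_GB * POD_CHIPS / 1024         # GB -> TB, treating 1 TB as 1024 GB

# Roofline ridge point: FLOPs needed per byte of HBM traffic to be compute-bound.
ridge_flops_per_byte = (BF16_TFLOPS * 1e12) / (HBM_TBPS * 1e12)

print(f"pod peak BF16: {pod_exaflops:.3f} EFLOPS")
print(f"pod HBM:       {pod_hbm_tb:.0f} TB")
print(f"ridge point:   {ridge_flops_per_byte:.0f} FLOPs/byte")
```

Under these numbers the ridge point is roughly 229 FLOPs per byte, so kernels with lower arithmetic intensity than that are limited by HBM bandwidth rather than by the matrix units.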

Architecture and Pod Design

Each TPU v4 chip contains two TensorCores, each with four systolic-array matrix multiply units (MXUs), a vector unit and a scalar unit, backed by 32 GB of HBM. Four chips share a tray with all-to-all on-tray connectivity. Trays connect into a 3D torus topology that is reconfigured by optical circuit switches in the surrounding fabric.
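
To make the 3D torus concrete, here is a small sketch of how neighbours wrap around in such a topology. It assumes a full pod laid out as 16 x 16 x 16 (which multiplies out to 4,096 chips) and one ICI link per neighbour; the coordinate scheme and the function name `torus_neighbours` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

def torus_neighbours(chip: Coord, shape: Coord) -> List[Coord]:
    """Return the six wrap-around neighbours of `chip` in a 3D torus of `shape`."""
    neighbours = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(chip)
            # The modulo implements the wrap-around link that closes each ring.
            nxt[axis] = (nxt[axis] + step) % shape[axis]
            neighbours.append(tuple(nxt))
    return neighbours

POD_SHAPE = (16, 16, 16)      # ASSUMED full-pod layout: 16 * 16 * 16 = 4,096 chips
ICI_GBPS_PER_LINK = 50        # approximate per-link figure from the specification table

corner = torus_neighbours((0, 0, 0), POD_SHAPE)
# ASSUMPTION: one ICI link per torus neighbour.
aggregate_ici_gbps = len(corner) * ICI_GBPS_PER_LINK

print(corner)
print("aggregate ICI per chip:", aggregate_ici_gbps, "GB/s")
```

Even a chip at a "corner" coordinate has six neighbours, because every axis closes into a ring. That is what distinguishes a torus from a plain mesh.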

The OCS layer is the architectural innovation. A conventional electrical interconnect fixes the topology at build time; OCS lets Google re-route the optical links per job, so the torus shape can match the parallelism pattern the workload needs. Compared with a fixed topology, this improves utilisation, particularly for irregular jobs that do not fit a single predetermined shape.
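
The effect of choosing a torus shape per job can be sketched numerically. The example below assumes slices are assembled from 4 x 4 x 4 blocks, so every edge is a multiple of 4, and that each axis wraps around. It lists the shapes available for a given chip count and compares their worst-case hop count. The helper names are hypothetical, and hop count is only one of several metrics a real scheduler would weigh.

```python
from itertools import product
from typing import List, Tuple

BLOCK = 4  # ASSUMED building-block edge: slices are composed of 4x4x4 cubes

def slice_shapes(chips: int, max_edge: int = 16) -> List[Tuple[int, int, int]]:
    """List x <= y <= z torus shapes with `chips` chips and edges that are multiples of BLOCK."""
    edges = range(BLOCK, max_edge + 1, BLOCK)
    return [s for s in product(edges, repeat=3)
            if s[0] <= s[1] <= s[2] and s[0] * s[1] * s[2] == chips]

def torus_diameter(shape: Tuple[int, int, int]) -> int:
    """Worst-case hop count between two chips: half of each ring, summed over the axes."""
    return sum(edge // 2 for edge in shape)

for shape in slice_shapes(512):
    print(shape, "diameter:", torus_diameter(shape))
```

For 512 chips this yields two candidate shapes, 4 x 8 x 16 and 8 x 8 x 8. The cube has the shorter worst-case path (12 hops against 14), while an elongated shape may suit a job whose parallelism is dominated by one axis.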

When TPUs Matter

Workloads on Google Cloud where the GCE TPU integration removes operational overhead.
Frameworks already targeting JAX or TensorFlow with XLA backends.
Very large training jobs where pod-scale ICI bandwidth and the OCS fabric can outperform InfiniBand-based GPU clusters.
Workloads outside Google Cloud cannot directly use TPUs — GPU is the default elsewhere.
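
To see why pod scale matters for very large training runs, a back-of-the-envelope estimate helps. The sketch below uses the common rule of thumb that training costs about 6 x parameters x tokens FLOPs. The token count and utilisation values are assumptions picked for illustration, not figures from this article, and the function name is hypothetical.

```python
PARAMS = 540e9        # model size, from the PaLM (540B) example in this article
TOKENS = 780e9        # ASSUMED training-token count, for illustration only
MFU = 0.45            # ASSUMED fraction of peak FLOPs actually sustained
CHIPS = 4096          # one full TPU v4 pod
PEAK_FLOPS = 275e12   # peak BF16 FLOP/s per chip, from the specification table

def training_days(params: float, tokens: float, chips: int,
                  peak_flops: float, mfu: float) -> float:
    """Estimate wall-clock days using the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    sustained_flops_per_s = chips * peak_flops * mfu
    return total_flops / sustained_flops_per_s / 86_400   # seconds -> days

days = training_days(PARAMS, TOKENS, CHIPS, PEAK_FLOPS, MFU)
print(f"estimated training time: {days:.1f} days")
```

Under these assumptions a single pod needs on the order of two months. Halving the chip count roughly doubles that, which is why interconnect scale becomes the deciding factor for models of this size.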

Pitfalls

TPUs are Google Cloud exclusive — no on-prem or other-cloud option.
JAX is the most productive framework; PyTorch on TPU is supported but with rough edges.
XLA compilation can be slow, and a new input shape typically triggers a recompile, so the edit-run-debug cycle feels different from CUDA.
Custom kernels require Pallas (JAX's kernel language, roughly the TPU counterpart to Triton) and a different mental model from CUDA.
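
One common way to soften the compilation cost is to limit how many distinct input shapes the compiler sees, since a jit-compiled JAX function is typically recompiled for each new shape. The sketch below shows the idea in plain Python: sequence lengths are padded up to power-of-two buckets. The function name, the minimum bucket size and the sample lengths are all hypothetical.

```python
def bucket_length(n: int, min_bucket: int = 128) -> int:
    """Round a sequence length up to the next power-of-two bucket (at least min_bucket)."""
    bucket = min_bucket
    while bucket < n:
        bucket *= 2
    return bucket

# Hypothetical raw sequence lengths seen during a debugging session.
raw_lengths = [97, 130, 200, 255, 256, 300, 511, 700]
padded = [bucket_length(n) for n in raw_lengths]

# Each distinct shape would correspond to one compilation.
print("distinct raw shapes:   ", len(set(raw_lengths)))
print("distinct padded shapes:", len(set(padded)))
```

Here eight distinct lengths collapse into four buckets, so the compiler would be invoked four times instead of eight. The trade-off is wasted compute on the padding.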

Software Ecosystem

JAX with the XLA backend is the production path. TensorFlow continues to work but JAX dominates new work. PyTorch/XLA exists but typically lags JAX in throughput and feature support. Pallas exposes low-level TPU programming for advanced users.

References

TPU v4 Paper (Jouppi et al., ISCA 2023) · arXiv
Google Cloud TPU Documentation · Google Cloud
