TL;DR
- Single-slot, 70 W Turing card launched in September 2018 and among the most widely deployed AI inference accelerators of 2018-2022.
- 16 GB GDDR6 at 320 GB/s with second-generation (Turing) Tensor Cores supporting FP16, INT8 and INT4.
- Standard across AWS g4dn, GCP nvidia-tesla-t4, and Azure NCasT4_v3, making it the canonical cloud inference card.
- Superseded by L4 in new deployments; CUDA support continues in current releases.
Overview
T4 is the GPU that made GPU inference cheap at cloud scale. Launched in September 2018 on the Turing architecture, it packed second-generation Tensor Cores into a 70 W single-slot card priced low enough for hyperscalers to offer GPU inference instances in volume. AWS g4dn, GCP T4 instances, and many on-prem appliances standardised on it.
As of 2026, T4 has largely been succeeded by L4 (Ada Lovelace) in new deployments. It remains widely available second-hand, is still supported by current CUDA releases, and is still useful for lightweight inference, video transcoding and CV workloads.
Specifications
| Metric | T4 |
|---|---|
| Architecture | Turing (TU104) |
| Process | TSMC 12 nm FFN |
| FP32 | 8.1 TFLOPS |
| FP16 (Tensor) | 65 TFLOPS |
| INT8 (Tensor) | 130 TOPS |
| INT4 (Tensor) | 260 TOPS |
| Memory | 16 GB GDDR6 |
| Memory bandwidth | 320 GB/s |
| TDP | 70 W |
| Form factor | PCIe Gen3 x16, single-slot low-profile |
| NVENC / NVDEC | 1 / 2 |
| NVLink | Not supported |
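
The peak figures in the table can be combined into a rough roofline estimate. The sketch below uses only the 65 TFLOPS and 320 GB/s numbers listed above; the helper names and the 7-billion-parameter example model are illustrative assumptions. It computes the arithmetic intensity at which FP16 Tensor Core work stops being limited by memory bandwidth, and a ceiling on batch-1 decode speed for a model whose weights are read once per generated token. Peak figures are rarely reached in practice, so treat the results as upper bounds rather than predictions.

```python
# Rough roofline arithmetic from the datasheet peaks in the table above.
# Helper names are illustrative and not part of any NVIDIA tool.

T4_FP16_TENSOR_FLOPS = 65e12   # 65 TFLOPS peak (FP16 Tensor)
T4_MEM_BANDWIDTH_BPS = 320e9   # 320 GB/s peak memory bandwidth


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs per byte above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def max_decode_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ceiling on batch-1 decode rate if every token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    ridge = ridge_point(T4_FP16_TENSOR_FLOPS, T4_MEM_BANDWIDTH_BPS)
    print(f"FP16 ridge point: {ridge:.0f} FLOP/byte")

    # Hypothetical 7-billion-parameter model with 4-bit weights (0.5 byte each).
    weights = 7e9 * 0.5
    ceiling = max_decode_tokens_per_s(weights, T4_MEM_BANDWIDTH_BPS)
    print(f"Decode ceiling: {ceiling:.0f} tokens/s")
```

A ridge point of roughly 200 FLOP/byte means that most small-batch inference kernels on T4 are bound by memory bandwidth rather than by Tensor Core throughput.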
When T4 Still Makes Sense
- Existing deployments where TCO is well-amortised and workloads have not outgrown 16 GB.
- Small-model inference (BERT-base, CNNs, traditional CV) where Turing throughput is adequate.
- Video transcoding at modest density (one NVENC engine, two NVDEC engines).
- Educational and prototyping use cases on cheap second-hand cards.
- For new builds, pick L4 instead: same single-slot low-profile form factor, much higher throughput per watt, and FP8 support.
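
Whether a workload has outgrown 16 GB comes down to simple arithmetic on weight size plus working memory. The following sketch estimates the weight footprint for a given parameter count and precision. The 20% headroom reserved for activations, KV cache and runtime overhead is an assumed placeholder rather than a measured figure, and the function names are hypothetical.

```python
# Back-of-envelope check of whether a model's weights fit in T4's 16 GB.
# The headroom fraction is an assumption; measure your own workload.

T4_MEMORY_BYTES = 16e9  # 16 GB, taken as 16e9 bytes for a conservative estimate

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(num_params: float, precision: str) -> float:
    """Bytes needed to store the weights at the given precision."""
    return num_params * BYTES_PER_PARAM[precision]


def fits_on_t4(num_params: float, precision: str, headroom: float = 0.20) -> bool:
    """True if the weights fit after reserving `headroom` of memory for everything else."""
    budget = T4_MEMORY_BYTES * (1.0 - headroom)
    return weight_bytes(num_params, precision) <= budget


if __name__ == "__main__":
    # BERT-base has roughly 110 million parameters.
    print("BERT-base, FP16:", fits_on_t4(110e6, "fp16"))
    # Hypothetical 7-billion-parameter model at two precisions.
    print("7B, FP16:", fits_on_t4(7e9, "fp16"))
    print("7B, INT8:", fits_on_t4(7e9, "int8"))
```

Under these assumptions a BERT-base-sized model fits with ample room, while a 7-billion-parameter model only fits once its weights are quantised.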
Pitfalls
- Turing Tensor Cores lack BF16; mixed-precision training paths need FP16 with loss scaling.
- No FP8; FP8-based LLM quantisation paths are unavailable on T4.
- PCIe Gen3 x16 provides roughly 16 GB/s per direction, half of Gen4, which can bottleneck host-to-device transfers in modern servers.
- Driver lifecycle: T4 remains supported but newer CUDA features increasingly skip Turing.
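
The loss-scaling requirement in the first pitfall follows from FP16's narrow range: gradients smaller than about 6e-8 round to zero. The sketch below illustrates the idea with Python's standard `struct` module, which can round-trip IEEE 754 half-precision values. The gradient value and the static scale of 2^16 are assumptions chosen for illustration; training frameworks typically adjust the scale dynamically instead.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 65536.0  # 2**16, an assumed static loss scale

# Hypothetical gradient smaller than FP16's smallest subnormal (about 6e-8).
tiny_grad = 1e-8

# Without scaling, the gradient underflows to zero when stored as FP16.
unscaled = to_fp16(tiny_grad)

# Scaling the loss scales every gradient, lifting it into FP16's normal range.
scaled = to_fp16(tiny_grad * LOSS_SCALE)

# Unscaling happens in higher precision before the weight update.
recovered = scaled / LOSS_SCALE

if __name__ == "__main__":
    print("stored without scaling:", unscaled)
    print("recovered after scaling:", recovered)
```

The unscaled value is lost entirely, while the scaled path recovers the gradient to within FP16's rounding error.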
Software Notes
T4 is supported in current CUDA releases (through CUDA 13 at time of writing) and runs the major NVIDIA-targeting inference stacks: Triton Inference Server, TensorRT and ONNX Runtime. vLLM runs on T4, typically with quantised weights, at lower throughput than on newer GPUs. FP8 and FP4 paths require newer architectures and skip T4 entirely.
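
Software that targets mixed fleets usually picks a precision from the GPU's compute capability. A minimal sketch of that check follows. It assumes the commonly documented thresholds (FP16 Tensor Cores from 7.0, BF16 from 8.0, FP8 from 8.9) and uses T4's compute capability of 7.5 and L4's of 8.9; the table and function names are hypothetical. In PyTorch the capability tuple could be obtained from `torch.cuda.get_device_capability()`.

```python
# Choose usable Tensor Core precisions from a CUDA compute capability tuple.
# Thresholds are assumptions based on NVIDIA's published capability tables.

MIN_CAPABILITY = {
    "fp16": (7, 0),  # Volta and later
    "bf16": (8, 0),  # Ampere and later
    "fp8": (8, 9),   # Ada Lovelace and later
}

T4_CAPABILITY = (7, 5)
L4_CAPABILITY = (8, 9)


def supported_precisions(capability: tuple[int, int]) -> list[str]:
    """Precisions whose minimum compute capability the device meets."""
    return [name for name, minimum in MIN_CAPABILITY.items() if capability >= minimum]


if __name__ == "__main__":
    print("T4:", supported_precisions(T4_CAPABILITY))
    print("L4:", supported_precisions(L4_CAPABILITY))
```

With these thresholds T4 resolves to FP16 only, which is why BF16 and FP8 code paths need an FP16 or integer fallback on Turing.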
References
- NVIDIA T4 Datasheet · NVIDIA
- Turing Architecture Whitepaper · NVIDIA