NVIDIA A100 Tensor Core GPU

TL;DR

Ampere-architecture data centre GPU (GA100, TSMC 7N, 54 billion transistors) launched at GTC May 2020: the silicon behind Megatron-LM, BLOOM, Llama 1/2 and the first generation of Stable Diffusion. Still the dominant non-Hopper SKU on hyperscalers and neoclouds in 2026 because its software stack is among the deepest available.
Two memory tiers: 40 GB HBM2 at 1.55 TB/s (launch) and 80 GB HBM2e at 2.0 TB/s (Q4 2020 refresh). Two form factors: SXM4 (400 W, NVLink-attached, fills HGX-A100 baseboards and AWS p4d/p4de, GCP a2-ultragpu) and PCIe Gen4 (250-300 W, drop-in for retrofit servers).
Third-generation Tensor Core, peak figures with 2:4 structured sparsity: 312 TFLOPS TF32, 624 TFLOPS BF16/FP16, 1,248 TOPS INT8 (dense throughput is half of each). No FP8 (Ada/Hopper and later), no FP4 (Blackwell and later). MIG generation 1 partitions a single card into up to 7 hardware-isolated slices with dedicated SMs, L2 and HBM bandwidth.
NVLink 3.0 at 600 GB/s per GPU and six NVSwitch chips per HGX baseboard give an 8-GPU non-blocking all-to-all fabric — but A100 predates NVLink Switch System, so multi-node training crosses InfiniBand HDR/NDR and looks like discrete 8-GPU islands.
Pricing in 2026: on-demand $1.80-$2.25 / GPU-hr at hyperscalers, $1.10-$1.40 1-year reserved, $0.85-$1.10 3-year, $0.65-$0.90 spot. Cost-per-million-output-tokens for Llama 3.1 8B BF16: roughly $0.11 at $1.50/GPU-hr and 3,800 TPS sustained, which is competitive with L40S and cheaper than H100 on small-model serving.

Overview

The A100 is the GPU that catalysed the modern LLM era. Announced at GTC in May 2020 and shipping in volume through 2024, it pairs the Ampere GA100 die (54 billion transistors on a custom TSMC 7nm process that NVIDIA markets as 7N) with HBM2/HBM2e memory and the third-generation Tensor Core. Many of the named models on the early frontier were trained on A100 clusters: Megatron-LM, BLOOM, the first two generations of Llama and the original Stable Diffusion. By the time Hopper arrived in volume in 2023, an entire industry had been built on Ampere assumptions.

Six years on, A100 has aged into the workhorse middle tier of the AI compute stack. It is no longer the right choice for frontier training (no FP8 Transformer Engine, no NVLink Switch System scale-out beyond 8 GPUs) or for memory-pressured inference (80 GB and 2.0 TB/s have been outclassed by H200's 141 GB and 4.8 TB/s, and B200's 192 GB at 8 TB/s). What it does have is the deepest software stack of any AI accelerator NVIDIA ships, a fully amortised capital base across every hyperscaler and neocloud, and a hardware MIG model that makes it the cleanest multi-tenant inference card under 70B in production today.

This entry is the 2026 reference for teams operating A100 at scale or deciding whether to stay on A100, migrate to H100/H200 or skip a generation to L40S. It covers the GA100 architecture, the full per-SKU spec sheet, the NVLink/NVSwitch topology, sizing tables for the workloads A100 still wins on, current cost ranges in USD, and a migration matrix in both directions. Yobitel NeoCloud offers A100 SXM4 80 GB and 40 GB capacity broadly across UK and EU regions with NCSC OFFICIAL alignment — A100 is one of the price-leading SKUs on the platform for sub-30B inference and 7B-13B QLoRA fine-tunes. This entry helps you decide when A100 is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.

How it works: the GA100 die and Ampere's third-generation Tensor Core

Ampere shipped two distinct die families: GA100 for the data centre and the GA10x series for everything else (GA102 powers the RTX 3090, RTX A6000, A40 and A10G). A100 is the GA100-only card, with 108 active SMs (out of 128 physical), 432 Tensor Cores, 40 MB L2 cache, and 192 KB combined L1/SMEM per SM: a bigger, HBM-attached, ECC-everywhere relative of consumer Ampere.

Three architectural changes made A100 the era-defining card. First, TF32 (TensorFloat-32) replaced FP32 in training silently: same 8-bit exponent as FP32, mantissa narrowed to 10 bits like FP16, and the Tensor Core executed it transparently whenever cuBLAS saw an FP32 matmul. Existing codebases got roughly 3x throughput on the same hardware with no source change. BF16 also arrived as a first-class Tensor Core type — the format Google's TPUs had been using for years, and the format Llama, Mistral and almost every open-weight model trained in by 2023.
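The precision cost of TF32 is easier to judge with numbers. The sketch below emulates the format in software by keeping FP32's 8-bit exponent and only the top 10 mantissa bits. The function name `to_tf32` is hypothetical, and the rounding mode (round to nearest, ties away from zero) is an assumption about how the conversion behaves, not a statement about every code path on the hardware.

```python
import math
import struct

def to_tf32(x: float) -> float:
    """Round a value to TF32 precision (8-bit exponent, 10 explicit mantissa bits)."""
    if not math.isfinite(x):
        return x  # leave inf and NaN untouched in this sketch
    # Reinterpret the value as its 32-bit IEEE-754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 stores 23 mantissa bits; TF32 keeps the top 10, so 13 low bits go.
    # Adding half of the dropped range first gives round-to-nearest (assumed mode).
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

if __name__ == "__main__":
    for value in (1.0, 0.1, 3.14159265):
        rounded = to_tf32(value)
        print(f"{value!r} -> {rounded!r} (relative error {abs(rounded - value) / value:.2e})")
```

With 10 mantissa bits the relative rounding error stays below 2^-11, about 0.05 %, which is the same mantissa precision as FP16 but with FP32's dynamic range.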

Second, structural sparsity made its debut. The third-generation Tensor Core can skip half the weights when they are pruned in a fixed 2-of-4 pattern, doubling effective throughput. Most production models never adopted the explicit pruning step — the published sparse FLOPS numbers are achievable, but only after a structured-sparsity fine-tune — so the 312 TFLOPS BF16 dense figure (not the 624 TFLOPS sparse figure) is what most workloads see.
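As an illustration of what a 2-of-4 pattern means, the sketch below zeroes the two smallest-magnitude weights in every group of four. Magnitude-based selection is an assumption here (it is one common heuristic); the function name `prune_2_of_4` is hypothetical, and a real workflow would follow the pruning step with the fine-tune mentioned above.

```python
def prune_2_of_4(weights):
    """Return a copy of `weights` with exactly two zeros in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

if __name__ == "__main__":
    print(prune_2_of_4([0.1, -0.9, 0.5, 0.05, 2.0, 0.0, -0.3, 0.4]))
```

Because every group of four holds exactly two non-zero values, the hardware can store the survivors compactly and skip the zeros, which is where the doubled peak figure comes from.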

Third, Multi-Instance GPU (MIG) appeared. A single A100 can be partitioned into up to seven hardware-isolated slices, each with its own SM block, L2 carve-out and HBM partition; larger profiles also receive dedicated NVDEC engines (A100 has no NVENC). Slice sizes are quantised; on the 80 GB card the profiles are 1g.10gb (1/7 SMs, 10 GB HBM), 2g.20gb, 3g.40gb (the 'half-card' slice), 4g.40gb and 7g.80gb (the full card). MIG is what made multi-tenant inference on GPU economically viable at cloud scale: AWS, GCP and most neoclouds price MIG slices independently, and the cheapest practical A100 line-item on most public clouds is a single 1g.10gb slice.
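A quick way to reason about which slice combinations can share one card is to count compute and memory slices. The sketch below assumes an A100 80 GB exposes 7 compute slices and 8 memory slices, and that the profiles consume them as listed in the table; these counts, and the 4g.40gb profile, come from NVIDIA's MIG documentation as understood here and should be checked against `nvidia-smi mig -lgip` on the target driver. The check is a necessary condition only: real placement rules are stricter.

```python
# (compute slices, memory slices) per profile on an A100 80 GB (assumed values).
MIG_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}

def fits_on_one_a100(requested):
    """Budget check: do the requested profiles fit within 7 compute and 8 memory slices?"""
    compute = sum(MIG_PROFILES[name][0] for name in requested)
    memory = sum(MIG_PROFILES[name][1] for name in requested)
    return compute <= 7 and memory <= 8

if __name__ == "__main__":
    print(fits_on_one_a100(["3g.40gb", "3g.40gb"]))   # two half-card slices
    print(fits_on_one_a100(["1g.10gb"] * 7))          # seven small slices
    print(fits_on_one_a100(["4g.40gb", "4g.40gb"]))   # exceeds the compute budget
```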

GA100 die: 108 active SMs (128 physical, harvested), 432 third-generation Tensor Cores, 40 MB L2 cache, 192 KB L1/SMEM per SM, 6,912 FP32 CUDA cores total.
Compute capability sm_80 (sm_86 is GA102, not A100) — important when compiling Triton or CUTLASS kernels.
Memory subsystem: 5 HBM2e stacks x 16 GB = 80 GB on the refreshed SKU at 2.0 TB/s; the original 40 GB SKU had 5 x 8 GB HBM2 at 1.55 TB/s.
MIG generation 1: spatial isolation only — no inter-instance bandwidth guarantees beyond the partitioned HBM channels; H100/H200 added per-instance bandwidth quotas and confidential-compute boundaries.
No FP8, no FP4, no Transformer Engine, no Tensor Memory Accelerator (TMA), no Thread Block Clusters. All of those are Hopper or later.

Reference: full specification sheet

Authoritative per-SKU figures across the four A100 variants you will actually encounter in 2026. Sparse Tensor figures assume 2:4 structured sparsity; dense throughput is half the listed sparse figure. The 40 GB SKUs are still common in older cloud regions and on cost-sensitive neoclouds — verify which SKU your instance type maps to before sizing.

Metric	A100 SXM4 80 GB	A100 PCIe 80 GB	A100 SXM4 40 GB	A100 PCIe 40 GB
Architecture	Ampere GA100	Ampere GA100	Ampere GA100	Ampere GA100
Process	TSMC 7N	TSMC 7N	TSMC 7N	TSMC 7N
Transistors	54 billion	54 billion	54 billion	54 billion
Active SMs	108	108	108	108
Tensor cores	432	432	432	432
Compute capability	sm_80	sm_80	sm_80	sm_80
FP64	9.7 TFLOPS	9.7 TFLOPS	9.7 TFLOPS	9.7 TFLOPS
FP64 (Tensor)	19.5 TFLOPS	19.5 TFLOPS	19.5 TFLOPS	19.5 TFLOPS
FP32	19.5 TFLOPS	19.5 TFLOPS	19.5 TFLOPS	19.5 TFLOPS
TF32 (Tensor, sparse)	312 TFLOPS	312 TFLOPS	312 TFLOPS	312 TFLOPS
BF16 / FP16 (Tensor, sparse)	624 TFLOPS	624 TFLOPS	624 TFLOPS	624 TFLOPS
BF16 / FP16 (Tensor, dense)	312 TFLOPS	312 TFLOPS	312 TFLOPS	312 TFLOPS
INT8 (Tensor, sparse)	1,248 TOPS	1,248 TOPS	1,248 TOPS	1,248 TOPS
FP8	Not supported	Not supported	Not supported	Not supported
Memory	80 GB HBM2e	80 GB HBM2e	40 GB HBM2	40 GB HBM2
Memory bandwidth	2.0 TB/s	1.94 TB/s	1.55 TB/s	1.55 TB/s
L2 cache	40 MB	40 MB	40 MB	40 MB
NVLink	600 GB/s (NVLink 3.0, 12 ports)	600 GB/s (bridge, optional)	600 GB/s (NVLink 3.0)	600 GB/s (bridge)
PCIe	Gen4 x16 (64 GB/s)	Gen4 x16 (64 GB/s)	Gen4 x16 (64 GB/s)	Gen4 x16 (64 GB/s)
TDP	400 W	300 W (250 W variants exist)	400 W	250 W
MIG instances	Up to 7	Up to 7	Up to 7	Up to 7
Form factor	SXM4 mezzanine	FHFL dual-slot PCIe	SXM4 mezzanine	FHFL dual-slot PCIe
Minimum driver	R450	R450	R450	R450
Recommended driver (2026)	R535+ (R570 stable)	R535+	R535+	R535+
Minimum CUDA	11.0	11.0	11.0	11.0
Maximum CUDA (sm_80 path)	13.x	13.x	13.x	13.x

FP8 is not supported on Ampere. Any production path that has standardised on FP8 weights or activations (TensorRT-LLM FP8 engines, vLLM `--quantization fp8`, Transformer Engine FP8 training) needs Hopper or newer. INT8 PTQ and AWQ/GPTQ INT4 inference remain available paths if A100 is the only option.

Interconnect: NVLink 3.0 and the HGX-A100 baseboard

NVLink 3.0 provides 600 GB/s per GPU — 12 ports at 50 GB/s each — and is exposed on every SXM4 module. On the HGX-A100 baseboard, 8 GPUs are wired through 6 second-generation NVSwitch ASICs into a fully non-blocking all-to-all fabric: any GPU can DMA into any other GPU's HBM at the full 600 GB/s bidirectional bandwidth with no fabric contention.

The critical limit is what comes next. A100 predates the NVLink Switch System (introduced with H100). Beyond the 8-GPU baseboard, scale-out runs over InfiniBand HDR (200 Gb/s, the original A100 cluster fabric) or NDR (400 Gb/s, common on 2022+ A100 clusters). All-reduce latency at the cluster boundary jumps sharply, typically 5-10x, and the cluster looks like a collection of discrete 8-GPU islands rather than a coherent shared-memory fabric. This is the single biggest architectural reason frontier training moved off A100 to Hopper from 2023 onward.

PCIe A100 cards can be paired with the 600 GB/s NVLink bridge, but unlike SXM4 there is no NVSwitch path: bridges link cards in pairs only, so a 4-card server is two NVLink pairs whose cross-pair traffic falls back to PCIe Gen4. For multi-GPU training on PCIe A100, sizing assumes effective NVLink bandwidth closer to 200-400 GB/s.

Per-GPU NVLink 3.0: 600 GB/s bidirectional (12 ports x 50 GB/s).
Per-baseboard NVSwitch bandwidth: 4.8 TB/s aggregate across 8 GPUs (8 x 600 GB/s); bisection bandwidth is half that, 2.4 TB/s.
NVLink-domain ceiling: 8 GPUs (one HGX-A100 baseboard). No multi-baseboard NVLink fabric.
Cluster scale-out: InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s); RoCE v2 increasingly common on neoclouds.
Cross-baseboard collective latency: 5-10x intra-baseboard; size pipeline-parallel rather than tensor-parallel for cross-node splits.
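The gap between the NVLink domain and the cluster fabric can be put into rough numbers with the standard bandwidth-only model of a ring all-reduce, in which each GPU sends and receives 2(N-1)/N times the message size. This is a simplified sketch: it ignores latency, protocol overhead and NCCL's choice of algorithm, and the per-GPU link rates used in the example (300 GB/s per direction for NVLink 3.0, 25 GB/s for one HDR port) are assumptions derived from the headline figures above.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only estimate of ring all-reduce time for one message."""
    if num_gpus < 2:
        return 0.0
    # Each GPU moves 2 * (N - 1) / N of the message over its link.
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / link_bytes_per_s

if __name__ == "__main__":
    gradient_bytes = 1e9  # 1 GB of gradients, an illustrative size
    nvlink = ring_allreduce_seconds(gradient_bytes, 8, 300e9)   # inside one baseboard
    hdr = ring_allreduce_seconds(gradient_bytes, 16, 25e9)      # two baseboards over HDR
    print(f"intra-baseboard: {nvlink * 1e3:.2f} ms, cross-node: {hdr * 1e3:.2f} ms")
```

The model only captures the bandwidth term, so it is best used to compare placements rather than to predict absolute step times.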

Sizing and capacity planning

Sizing tables for the workloads A100 still wins or holds its own on in 2026. All figures assume A100 SXM4 80 GB, BF16 weights (no FP8 path on Ampere), vLLM 0.6+ with paged KV cache and prefix caching, and a healthy NVLink-local placement. Throughput is given in output tokens per second per replica at moderate concurrency (16-32 sessions); verify against your own traffic shape before committing capacity.

Single A100 80 GB ceiling for BF16 inference: weights + KV cache + activations + cuBLAS scratch must fit under ~76 GB; above that, OOMs even with paged KV.
70B BF16 on a single A100 80 GB is infeasible (140 GB weights alone); use INT4 quantisation or 2-4 card TP.
MIG slice mapping: 1g.10gb fits 7B INT4, 3g.40gb fits 7B BF16 or 13B INT4, 7g.80gb (full card) fits 34B BF16 or 70B INT4.
Training rule of thumb: 1 trillion tokens x 13B parameters at BF16 takes roughly 90-130 days of wall-clock time on a 64-GPU HGX-A100 cluster with Megatron-LM, which is about 5,800-8,300 A100-days of compute.
AllReduce overhead at TP=8 inside one HGX-A100: ~12-18 % of step time for 70B BF16 (vs ~6-9 % on H100); cross-baseboard splits push this past 35 %.
Spot/preemptible A100 capacity is broadly available at 50-60 % below on-demand with 5-10 % daily eviction — suitable for fine-tunes, not for inference SLAs.
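The training rule of thumb in the list above can be sanity-checked with the widely used approximation that dense transformer training costs about 6 x parameters x tokens FLOPs. The sketch below is a back-of-envelope estimate, not a benchmark: the 45 % model FLOPs utilisation (`mfu`) is an assumed value, and the 312 TFLOPS peak is the dense BF16 figure from the specification sheet.

```python
def training_days(params, tokens, num_gpus, peak_flops=312e12, mfu=0.45):
    """Estimate wall-clock days for a dense transformer training run."""
    total_flops = 6 * params * tokens            # 6*N*D approximation
    sustained_per_gpu = peak_flops * mfu         # assumed utilisation of dense BF16 peak
    seconds = total_flops / (sustained_per_gpu * num_gpus)
    return seconds / 86_400

if __name__ == "__main__":
    days = training_days(params=13e9, tokens=1e12, num_gpus=64)
    print(f"wall-clock: {days:.0f} days, compute: {days * 64:.0f} GPU-days")
```

For 13B parameters and 1 trillion tokens on 64 GPUs this gives roughly 100 days of wall-clock time, or about 6,400 GPU-days; lower utilisation pushes the estimate up proportionally.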

Model	Precision	Context	GPUs per replica	TP / PP	Approx output TPS	VRAM headroom
Llama 3.1 8B	BF16	8K	1x A100 80 GB	1 / 1	3,200-4,200	55 GB free
Llama 3.1 8B	AWQ INT4	32K	1x A100 80 GB	1 / 1	4,500-5,500	60 GB free
Mistral 7B / Qwen 7B	BF16	8K	1x A100 80 GB	1 / 1	3,400-4,400	55 GB free
Codestral 22B / Yi 34B	BF16	8K	1x A100 80 GB	1 / 1	1,100-1,500	10 GB free
Codestral 22B	AWQ INT4	32K	1x A100 80 GB	1 / 1	1,800-2,400	40 GB free
Llama 3 70B	BF16	4K	2x A100 80 GB	2 / 1	550-750	8 GB free per rank
Llama 3 70B	AWQ INT4	8K	1x A100 80 GB	1 / 1	350-500	20 GB free
Llama 3 70B	BF16	8K	4x A100 80 GB	4 / 1	750-950	30 GB free per rank
Mixtral 8x7B (MoE 47B)	BF16	32K	2x A100 80 GB	2 / 1	1,200-1,600	10 GB free per rank
SDXL 1.0 (1024x1024)	BF16	n/a	1x A100 80 GB	1 / 1	1.0-1.3 images/s	60 GB free
Whisper Large v3 batch	BF16	30s clip	1x A100 (MIG 3g.40gb slice)	1 / 1	40-50 RTF	n/a
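A first-pass fit check for any row in the table is the weight footprint alone: parameters times bytes per parameter, divided across tensor-parallel ranks. The sketch below uses the ~76 GB usable ceiling quoted earlier and treats INT4 as exactly 0.5 bytes per parameter, which is an assumption that ignores quantisation scales and zero-points. It reports headroom before KV cache and activations, so its results are an upper bound and will not match the table's headroom column, which reflects a loaded server.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion, precision):
    """Weight footprint in GB (1e9 bytes) for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[precision]

def headroom_gb(params_billion, precision, num_gpus=1, usable_gb_per_gpu=76.0):
    """GB left per GPU after weights, before KV cache and activations."""
    return usable_gb_per_gpu - weight_gb(params_billion, precision) / num_gpus

if __name__ == "__main__":
    for name, size, precision, gpus in [
        ("Llama 3.1 8B", 8, "bf16", 1),
        ("Llama 3 70B", 70, "bf16", 1),
        ("Llama 3 70B", 70, "bf16", 2),
        ("Llama 3 70B", 70, "int4", 1),
    ]:
        print(f"{name} {precision} on {gpus} GPU(s): {headroom_gb(size, precision, gpus):+.0f} GB")
```

A negative result means the weights alone do not fit, which is the 70B BF16 single-card case called out above.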

Cost and TCO

A100 pricing has compressed continuously since H100 supply caught up in 2024. In 2026 the public ranges below are typical across hyperscalers and Tier-1/Tier-2 neoclouds. The cost case for A100 is now made on three axes: (1) cost-per-token for sub-30B inference, where the A100 80 GB beats both H100 (overkill) and L40S (lower memory bandwidth) on $/token-second; (2) fine-tuning of 7B-13B models, where 80 GB HBM2e fits QLoRA comfortably and the per-GPU-hour cost is half H100's; (3) ecosystem maturity, since most kernels, quantisation schemes and published recipes target A100 first.

Cost-per-million-output-tokens on Llama 3.1 8B BF16 at $1.50/GPU-hr and 3,800 TPS sustained: roughly $0.11 per million tokens before margin.
Cost-per-million-output-tokens on Llama 3 70B BF16 (2x A100) at $1.50/GPU-hr and 650 TPS sustained: roughly $1.28 per million — competitive with H100 70B FP8 ($0.50) only at high utilisation discounts.
Switching from BF16 to AWQ INT4 yields +1.4-1.6x throughput on most sub-30B models with <2 % quality regression; the only realistic precision lever on Ampere.
3-year reservation cuts effective $/GPU-hr by 45-55 % versus on-demand; only commit when steady-state utilisation exceeds 65 % across the term.
MIG slices are the dominant cost lever for small-model inference fleets: serving 7B INT4 on 1g.10gb slices is typically 3-5x cheaper per request than full-card serving.
Egress and inter-region data movement still frequently account for 8-12 % of the total A100 bill at hyperscalers; co-locate model artefacts with compute.
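The two cost-per-million-token figures in the list above follow from simple arithmetic: hourly spend divided by tokens produced per hour. A minimal sketch, with a hypothetical function name, that reproduces both numbers from the stated prices and throughputs:

```python
def cost_per_million_tokens(gpu_hour_usd, num_gpus, tokens_per_second):
    """USD per million output tokens for one replica at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * num_gpus / tokens_per_hour * 1_000_000

if __name__ == "__main__":
    # Llama 3.1 8B BF16 on one A100 at $1.50/GPU-hr and 3,800 TPS.
    print(round(cost_per_million_tokens(1.50, 1, 3800), 2))
    # Llama 3 70B BF16 on two A100s at $1.50/GPU-hr and 650 TPS.
    print(round(cost_per_million_tokens(1.50, 2, 650), 2))
```

The result scales inversely with sustained throughput, so a replica running at half the assumed TPS costs twice as much per token.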

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Spot
Hyperscaler (AWS/GCP/Azure)	A100 SXM4 80 GB	$2.00-$2.25	$1.35-$1.70	$1.10-$1.40	$0.70-$0.90
Hyperscaler	A100 SXM4 40 GB	$1.70-$2.00	$1.20-$1.50	$0.95-$1.20	$0.55-$0.75
Hyperscaler	A100 PCIe 80 GB	$1.80-$2.10	$1.25-$1.55	$1.00-$1.25	$0.60-$0.80
Tier-1 neocloud	A100 SXM4 80 GB	$1.50-$1.90	$1.10-$1.40	$0.90-$1.15	$0.55-$0.75
Tier-2 neocloud	A100 SXM4 80 GB	$1.10-$1.50	$0.90-$1.20	$0.75-$1.00	$0.40-$0.60
Spot/preemptible (hyperscaler)	A100 SXM4 80 GB	$0.65-$0.90	n/a	n/a	8-15 %/day eviction
MIG 1g.10gb slice (1/7 card)	A100 80 GB	$0.30-$0.45	$0.22-$0.32	$0.18-$0.26	n/a
Yobitel NeoCloud (UK + EU)	A100 SXM4 80 GB	$1.40-$1.80	$1.05-$1.35	$0.85-$1.10	n/a
Yobitel Omniscient Compute	A100 SXM4 80 GB multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	n/a

Cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel Omniscient Compute: ServiceName=`AcceleratorCompute`, ChargeCategory=`Usage`, SkuId=`gpu.a100.sxm4.80gb`. This is what enables cross-provider arbitrage and cost attribution at workspace granularity.
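To show how those columns support attribution, the sketch below filters FOCUS-style rows on the three values named above and totals cost per provider. `ServiceName`, `ChargeCategory`, `SkuId`, `BilledCost` and `ProviderName` are FOCUS column names; the rows, provider names and amounts are hypothetical sample data, and a real export would be read from the billing file rather than defined inline.

```python
from collections import defaultdict

def a100_usage_cost(rows, sku_id="gpu.a100.sxm4.80gb"):
    """Sum BilledCost per provider for A100 usage charges."""
    totals = defaultdict(float)
    for row in rows:
        if (row["ServiceName"] == "AcceleratorCompute"
                and row["ChargeCategory"] == "Usage"
                and row["SkuId"] == sku_id):
            totals[row["ProviderName"]] += row["BilledCost"]
    return dict(totals)

if __name__ == "__main__":
    # Hypothetical sample rows for illustration only.
    sample = [
        {"ProviderName": "provider-a", "ServiceName": "AcceleratorCompute",
         "ChargeCategory": "Usage", "SkuId": "gpu.a100.sxm4.80gb", "BilledCost": 36.0},
        {"ProviderName": "provider-a", "ServiceName": "AcceleratorCompute",
         "ChargeCategory": "Usage", "SkuId": "gpu.a100.sxm4.80gb", "BilledCost": 12.5},
        {"ProviderName": "provider-b", "ServiceName": "AcceleratorCompute",
         "ChargeCategory": "Tax", "SkuId": "gpu.a100.sxm4.80gb", "BilledCost": 4.0},
    ]
    print(a100_usage_cost(sample))
```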

Software ecosystem

A100 has the deepest software stack of any AI accelerator after H100. CUDA 11.0 through 13.x supports sm_80 as a first-class target; every cuDNN release back to 8.0, every NCCL release back to 2.7, every Triton release, and the full Hugging Face training stack (transformers, peft, accelerate, trl) treat A100 as the reference platform. vLLM, SGLang, TGI and Triton Inference Server all serve on A100 with FP16/BF16 weights and INT8/INT4 quantisation paths (AWQ, GPTQ, bitsandbytes); the only thing they cannot do on A100 is FP8.

Training framework support is equally complete: Megatron-LM, DeepSpeed, FSDP (PyTorch native), FSDP-2, Megatron-Core, NeMo and Axolotl all have A100-tuned recipes, and most published academic results from 2020-2024 are reproducible on A100 without modification. The R570 driver branch (CUDA 12.8; CUDA 13.x needs R580 or newer) supports A100 to the same depth as Hopper apart from sm_90-specific kernels.

The operational pattern that matters: Hopper-tuned kernels (Flash Attention 3, CUTLASS 3.x Hopper-only paths, Triton's Hopper backend, the TMA-using CUTLASS kernels in vLLM 0.6+) silently fall back to sm_80 paths on A100 — they continue to run, but at a fraction of Hopper throughput. When benchmarking, verify the kernel path with `cuobjdump --dump-elf-symbols` to make sure you are not measuring an unintended fallback.

Migration and alternatives

When A100 is the right choice and when it isn't. The table maps the practical migration paths in both directions — A100 is increasingly the 'from' card rather than the 'to' card, but it remains the 'to' card for teams stepping down from over-spec H100 fleets to right-size inference economics.

Heuristic 1: if your inference fleet runs 7B-34B models at moderate context (< 16K) and FP8 is not in the roadmap, A100 80 GB at $/token usually beats H100 — verify on InferenceBench.
Heuristic 2: if you are training above 13B parameters with > 64 GPUs, the lack of NVLink Switch System scale-out makes A100 measurably slower than H100 per FLOP-second despite the lower per-GPU cost.
Heuristic 3: never migrate A100 -> L40S without first benchmarking the long-context tail latency — L40S GDDR6 bandwidth (864 GB/s) is less than half A100's HBM2e (2.0 TB/s) and KV-cache-heavy decodes regress meaningfully.

From / to	When it pays	Migration effort	Key incompatibility
V100 -> A100	Need BF16, TF32, MIG, or 80 GB HBM	Low (CUDA upgrade)	Compute capability sm_70 -> sm_80; kernel recompile only
A100 -> H100	Need FP8 or NVLink Switch System scale-out	Low (drop-in CUDA upgrade)	FP8 calibration; sm_90 kernels not on Ampere
A100 -> H200	KV cache memory-bound on long context	Low (same software stack as H100)	Same as A100 -> H100
A100 -> L40S	7B-34B inference, NVLink not needed, $/token priority	Low (GDDR6 not HBM — verify latency tail)	No NVLink; no MIG on L40S
A100 PCIe -> A100 SXM4	NVLink-bound multi-GPU training	Medium (chassis change)	Cooling envelope; baseboard required
A100 80 GB -> A100 40 GB	Cost-sensitive MIG inference fleets	Trivial (same software)	Sizing tables shift — 40 GB limits at 7B BF16
A100 -> MI300X	Need 192 GB HBM3 per GPU and can give up A100's software lead	High (CUDA -> ROCm rewrite)	CUDA kernels not portable; vLLM ROCm path lags
A100 -> Inferentia 2	Inference-only, AWS-resident, simple model coverage	High (Neuron compiler)	Limited model coverage; recompile required

Pitfalls and operational notes

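FP8 is unsupported in hardware: Hopper-tuned vLLM/TensorRT-LLM commands that pass `--quantization fp8` will either fail on A100 or, depending on the framework version, fall back to a weight-only path with no FP8 Tensor Core speedup. Standardise on BF16 or AWQ INT4 in your A100 deployment manifests.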
Two memory tiers in the wild — 40 GB HBM2 (1.55 TB/s) versus 80 GB HBM2e (2.0 TB/s). Cloud instance names rarely disambiguate; verify with `nvidia-smi --query-gpu=name,memory.total --format=csv` before sizing.
PCIe Gen4 is half the bandwidth of Gen5 — in mixed clusters with H100 hosts, A100 nodes often become dataloader-bound on training. Pin large parquet shards to local NVMe and use pinned-memory dataloaders.
MIG slice boundaries are static — switching MIG profile or disabling MIG requires draining the GPU and a `nvidia-smi mig -cgi` reconfiguration; rolling MIG changes break long-running workloads.
Secondary-market A100s are a real risk — many ex-mining or ex-crypto cards have degraded HBM. Burn-in for at least 72 hours with DCGM ECC monitoring before production placement.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL drops on A100 SXM4 below ~580 GB/s indicate a single NVLink port down; reseat the mezzanine before considering RMA.
Hopper-tuned kernels fall back silently to sm_80 — verify with `cuobjdump` that benchmarks are exercising the intended path; published H100 numbers do not scale linearly down to A100.
Confidential Compute is not available on A100 (Hopper-and-later only). Sovereign deployments requiring attested-boot GPU isolation must target H100/H200.
Driver R570 (CUDA 12.8) is the recommended 2026 baseline, with R580 or newer required for CUDA 13.x; older R450/R470 builds are missing critical NCCL and DCGM fixes.
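The memory-tier check in the list above is easy to automate. The sketch below parses the CSV that `nvidia-smi --query-gpu=name,memory.total --format=csv` prints and labels each A100 as 40 GB or 80 GB. The sample text is illustrative rather than captured from a real host (exact MiB values vary by driver), and the 60,000 MiB threshold and function name are assumptions chosen for this example.

```python
import csv
import io

def classify_a100(csv_text, threshold_mib=60_000):
    """Return (name, tier) pairs from nvidia-smi CSV output."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the header row: name, memory.total [MiB]
    result = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        name, memory = row[0], row[1]
        mib = int(memory.split()[0])  # "81920 MiB" -> 81920
        if "A100" not in name:
            tier = "not an A100"
        else:
            tier = "80 GB" if mib > threshold_mib else "40 GB"
        result.append((name, tier))
    return result

if __name__ == "__main__":
    # Illustrative output; on a real host, capture it with subprocess.run(...).
    sample = (
        "name, memory.total [MiB]\n"
        "NVIDIA A100-SXM4-80GB, 81920 MiB\n"
        "NVIDIA A100-SXM4-40GB, 40960 MiB\n"
    )
    for name, tier in classify_a100(sample):
        print(f"{name}: {tier}")
```

Running the check at node admission time keeps 40 GB cards out of pools that were sized for 80 GB.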

Where this fits in the Yobitel stack

A100 remains a first-class target in the Yobitel stack through 2026. Yobibyte — our AI-native platform — schedules inference replicas and fine-tune jobs on A100 pools whenever the workload fits inside A100's BF16/INT4 envelope and the cost case beats H100; the platform's placement layer is aware of NVLink topology, MIG profiles and HBM2 versus HBM2e variants and tags every replica with the SKU it landed on.

Omniscient Compute — our cross-cloud capacity broker — indexes A100 80 GB and 40 GB SKUs across every connected hyperscaler and Tier-1/Tier-2 neocloud and arbitrages workloads to the cheapest region that meets the workspace's residency posture. A100 is frequently the price-leading SKU on the broker for sub-30B inference because supply is broad and depreciation is well past breakeven for most operators.

InferenceBench, our public, reproducible benchmarking harness, publishes A100 throughput, latency and cost-per-token numbers for every major open-weight model under 70B on vLLM, TensorRT-LLM (no FP8 paths on A100), SGLang and TGI. The A100 sizing tables in this entry are anchored on InferenceBench runs; production numbers your team will see in steady state are typically within 10 % of the published figures.

References

NVIDIA A100 Datasheet · NVIDIA
NVIDIA Ampere Architecture Whitepaper · NVIDIA
Multi-Instance GPU User Guide · NVIDIA
NCCL on NVLink 3.0 — performance guide · NVIDIA
vLLM on Ampere — quantisation paths · vLLM
FinOps Foundation FOCUS billing specification · FinOps Foundation
