TL;DR
- Ampere-architecture data centre GPU (GA100, TSMC 7N, 54 billion transistors) launched at GTC in May 2020: the silicon behind Megatron-LM at scale, BLOOM, Llama 1/2 and the first generation of Stable Diffusion. Still the dominant non-Hopper SKU on hyperscalers and neoclouds in 2026 because its software stack is among the most mature available.
- Two memory tiers: 40 GB HBM2 at 1.55 TB/s (launch) and 80 GB HBM2e at 2.0 TB/s (SXM4 refresh announced in November 2020, PCIe 80 GB following in 2021). Two form factors: SXM4 (400 W, NVLink-attached, fills HGX-A100 baseboards and AWS p4d/p4de, GCP a2-ultragpu) and PCIe Gen4 (250-300 W, drop-in for retrofit servers).
- Third-generation Tensor Core: 312 TFLOPS TF32, 624 TFLOPS BF16/FP16 and 1,248 TOPS INT8, all quoted with 2:4 structured sparsity (dense throughput is half: 156, 312 and 624 respectively). No FP8 (arrives with Hopper), no FP4 (arrives with Blackwell). MIG generation 1 partitions a single card into up to 7 hardware-isolated slices with dedicated SMs, L2 and HBM bandwidth.
- NVLink 3.0 at 600 GB/s per GPU and six NVSwitch chips per HGX baseboard give an 8-GPU non-blocking all-to-all fabric — but A100 predates NVLink Switch System, so multi-node training crosses InfiniBand HDR/NDR and looks like discrete 8-GPU islands.
- Pricing in 2026 for A100 SXM4 80 GB at hyperscalers: on-demand $2.00-$2.25 / GPU-hr, $1.35-$1.70 1-year reserved, $1.10-$1.40 3-year, $0.70-$0.90 spot; neoclouds run lower. Cost-per-million-output-tokens for Llama 3.1 8B BF16: roughly $0.11 at $1.50/GPU-hr and 3,800 TPS sustained, which is competitive with L40S and cheaper than H100 on small-model serving.
Overview
The A100 is the GPU that catalysed the modern LLM era. Announced at GTC in May 2020 and shipping in volume through 2024, it pairs the Ampere GA100 die (54 billion transistors on a TSMC 7 nm process customised for NVIDIA, marketed as 7N) with HBM2/HBM2e memory and the third-generation Tensor Core. Many of the named models on the early frontier (Megatron-LM at scale, BLOOM, the first two generations of Llama, the original Stable Diffusion, and early internal training runs at labs such as Anthropic and OpenAI) were trained on A100 clusters. By the time Hopper arrived in volume in 2023, an entire industry had been built on Ampere assumptions.
Six years on, A100 has aged into the workhorse middle tier of the AI compute stack. It is no longer the right choice for frontier training (no FP8 Transformer Engine, no NVLink Switch System scale-out beyond 8 GPUs) or for memory-pressured inference (80 GB and 2.0 TB/s have been outclassed by H200's 141 GB and 4.8 TB/s, and B200's 192 GB at 8 TB/s). What it does have is the deepest software stack of any AI accelerator NVIDIA ships, a fully amortised capital base across every hyperscaler and neocloud, and a hardware MIG model that makes it the cleanest multi-tenant inference card under 70B in production today.
This entry is the 2026 reference for teams operating A100 at scale or deciding whether to stay on A100, migrate to H100/H200, or move sideways to L40S. It covers the GA100 architecture, the full per-SKU spec sheet, the NVLink/NVSwitch topology, sizing tables for the workloads A100 still wins on, current cost ranges in USD, and a migration matrix in both directions. Yobitel NeoCloud offers A100 SXM4 80 GB and 40 GB capacity broadly across UK and EU regions with NCSC OFFICIAL alignment; A100 is one of the price-leading SKUs on the platform for sub-30B inference and 7B-13B QLoRA fine-tunes.
How it works: the GA100 die and Ampere's third-generation Tensor Core
Ampere shipped two distinct die families: GA100 for the data centre and GA10x, led by GA102, for everything else (RTX 3090, RTX A6000, A40, A10G). A100 is the GA100-only card, with 108 SMs (out of 128 physical), 432 Tensor Cores, 40 MB L2 cache, and 192 KB combined L1/SMEM per SM: a strictly bigger, HBM-attached, ECC-everywhere version of consumer Ampere.
Three architectural changes made A100 the era-defining card. First, TF32 (TensorFloat-32) quietly took over FP32 matmuls in training: it keeps FP32's 8-bit exponent, narrows the mantissa to 10 bits like FP16, and accumulates in FP32. At launch, frameworks such as PyTorch and TensorFlow routed FP32 matmuls and convolutions through TF32 Tensor Cores by default, so existing codebases got roughly 3x throughput with no source change (PyTorch 1.12 and later make TF32 matmuls opt-in via `torch.backends.cuda.matmul.allow_tf32`). BF16 also arrived as a first-class Tensor Core type: the format Google's TPUs had been using for years, and the format in which Llama, Mistral and almost every open-weight model was trained by 2023.
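To make the TF32 format concrete, the sketch below emulates it in plain Python by clearing the 13 low mantissa bits of an FP32 value. It is an illustration only: the helper name `to_tf32` is hypothetical, and the truncation used here is a simplification, since hardware rounding behaviour may differ.

```python
import struct

def to_tf32(x: float) -> float:
    """Truncate an FP32 value to TF32 precision (1 sign, 8 exponent, 10 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # reinterpret FP32 as uint32
    bits &= 0xFFFFE000  # clear the 13 low mantissa bits, keeping the top 10
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# FP32-sized exponent: large values survive (FP16 would overflow past 65504).
print(to_tf32(1.0e30))
# FP16-sized mantissa: a step of 2**-10 above 1.0 is kept, 2**-11 is lost.
print(to_tf32(1.0 + 2**-10), to_tf32(1.0 + 2**-11))
```

The point of the format is visible in the two prints: range behaves like FP32, precision behaves like FP16.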
Second, structural sparsity made its debut. The third-generation Tensor Core can skip half the weights when they are pruned in a fixed 2-of-4 pattern, doubling effective throughput. Most production models never adopted the explicit pruning step — the published sparse FLOPS numbers are achievable, but only after a structured-sparsity fine-tune — so the 312 TFLOPS BF16 dense figure (not the 624 TFLOPS sparse figure) is what most workloads see.
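The 2-of-4 pattern can be sketched as a magnitude-based pruning pass over groups of four weights. This is a minimal illustration, assuming simple magnitude selection; the function name `prune_2_of_4` is hypothetical, and production flows pair the pruning with a fine-tune to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = set(sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

row = [0.1, -0.9, 0.5, 0.2, 0.0, 0.3, -0.4, 0.05]
print(prune_2_of_4(row))
```

Every group of four ends up with exactly two non-zero slots, which is the fixed structure the sparse Tensor Core path relies on.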
Third, Multi-Instance GPU (MIG) appeared. A single A100 can be partitioned into up to seven hardware-isolated slices, each with its own SM block, L2 carve-out and HBM partition, plus a share of the card's NVDEC decoders (A100 has no NVENC encoder). Slice sizes are quantised; on the 80 GB card the profiles are 1g.10gb (1/7 of the SMs, 10 GB HBM), 2g.20gb, 3g.40gb (the 'half-card' slice), 4g.40gb and 7g.80gb (the full card). MIG is what made multi-tenant inference on GPU economically viable at cloud scale: AWS, GCP and most neoclouds price MIG slices independently, and the cheapest practical A100 line-item on most public clouds is a single 1g.10gb slice.
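A quick way to reason about MIG mixes is to count slices. The sketch below assumes an A100 80 GB exposes 7 compute slices and 8 memory slices of 10 GB each, and that each profile consumes the amounts listed; the table and the name `mig_mix_fits` are illustrative. Passing this check is necessary but not sufficient, because the driver also enforces placement rules that are not modelled here.

```python
# (compute slices, memory slices) consumed per profile on an A100 80 GB.
MIG_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}

def mig_mix_fits(profiles, compute_slices=7, memory_slices=8):
    """Return True if the requested profiles fit the card's slice budget."""
    compute = sum(MIG_PROFILES[p][0] for p in profiles)
    memory = sum(MIG_PROFILES[p][1] for p in profiles)
    return compute <= compute_slices and memory <= memory_slices

print(mig_mix_fits(["1g.10gb"] * 7))                    # seven small tenants
print(mig_mix_fits(["3g.40gb", "3g.40gb", "1g.10gb"]))  # runs out of memory slices
```

The second mix shows why memory slices, not SMs, are often the binding constraint: two 3g.40gb instances already consume all eight memory slices.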
- GA100 die: 108 active SMs (128 physical, harvested), 432 third-generation Tensor Cores, 40 MB L2 cache, 192 KB L1/SMEM per SM, 6,912 FP32 CUDA cores total.
- Compute capability sm_80 (sm_86 is GA102, not A100) — important when compiling Triton or CUTLASS kernels.
- Memory subsystem: 5 HBM2e stacks x 16 GB = 80 GB on the refreshed SKU at 2.0 TB/s; the original 40 GB SKU had 5 x 8 GB HBM2 at 1.55 TB/s.
- MIG generation 1: spatial isolation only — no inter-instance bandwidth guarantees beyond the partitioned HBM channels; H100/H200 added per-instance bandwidth quotas and confidential-compute boundaries.
- No FP8, no FP4, no Transformer Engine, no Tensor Memory Accelerator (TMA), no Thread Block Clusters. All of those are Hopper or later.
Reference: full specification sheet
Authoritative per-SKU figures across the four A100 variants you will actually encounter in 2026. Sparse Tensor figures assume 2:4 structured sparsity; dense throughput is half the listed sparse figure. The 40 GB SKUs are still common in older cloud regions and on cost-sensitive neoclouds — verify which SKU your instance type maps to before sizing.
| Metric | A100 SXM4 80 GB | A100 PCIe 80 GB | A100 SXM4 40 GB | A100 PCIe 40 GB |
|---|---|---|---|---|
| Architecture | Ampere GA100 | Ampere GA100 | Ampere GA100 | Ampere GA100 |
| Process | TSMC 7N | TSMC 7N | TSMC 7N | TSMC 7N |
| Transistors | 54 billion | 54 billion | 54 billion | 54 billion |
| Active SMs | 108 | 108 | 108 | 108 |
| Tensor cores | 432 | 432 | 432 | 432 |
| Compute capability | sm_80 | sm_80 | sm_80 | sm_80 |
| FP64 | 9.7 TFLOPS | 9.7 TFLOPS | 9.7 TFLOPS | 9.7 TFLOPS |
| FP64 (Tensor) | 19.5 TFLOPS | 19.5 TFLOPS | 19.5 TFLOPS | 19.5 TFLOPS |
| FP32 | 19.5 TFLOPS | 19.5 TFLOPS | 19.5 TFLOPS | 19.5 TFLOPS |
| TF32 (Tensor, sparse) | 312 TFLOPS | 312 TFLOPS | 312 TFLOPS | 312 TFLOPS |
| BF16 / FP16 (Tensor, sparse) | 624 TFLOPS | 624 TFLOPS | 624 TFLOPS | 624 TFLOPS |
| BF16 / FP16 (Tensor, dense) | 312 TFLOPS | 312 TFLOPS | 312 TFLOPS | 312 TFLOPS |
| INT8 (Tensor, sparse) | 1,248 TOPS | 1,248 TOPS | 1,248 TOPS | 1,248 TOPS |
| FP8 | Not supported | Not supported | Not supported | Not supported |
| Memory | 80 GB HBM2e | 80 GB HBM2e | 40 GB HBM2 | 40 GB HBM2 |
| Memory bandwidth | 2.0 TB/s | 1.94 TB/s | 1.55 TB/s | 1.55 TB/s |
| L2 cache | 40 MB | 40 MB | 40 MB | 40 MB |
| NVLink | 600 GB/s (NVLink 3.0, 12 ports) | 600 GB/s (bridge, optional) | 600 GB/s (NVLink 3.0) | 600 GB/s (bridge) |
| PCIe | Gen4 x16 (64 GB/s) | Gen4 x16 (64 GB/s) | Gen4 x16 (64 GB/s) | Gen4 x16 (64 GB/s) |
| TDP | 400 W | 300 W (250 W variants exist) | 400 W | 250 W |
| MIG instances | Up to 7 | Up to 7 | Up to 7 | Up to 7 |
| Form factor | SXM4 mezzanine | FHFL dual-slot PCIe | SXM4 mezzanine | FHFL dual-slot PCIe |
| Minimum driver | R450 | R450 | R450 | R450 |
| Recommended driver (2026) | R535+ (R570 stable) | R535+ | R535+ | R535+ |
| Minimum CUDA | 11.0 | 11.0 | 11.0 | 11.0 |
| Maximum CUDA (sm_80 path) | 13.x | 13.x | 13.x | 13.x |
FP8 is not supported on Ampere. Any production path that has standardised on FP8 weights or activations (TensorRT-LLM FP8 engines, vLLM `--quantization fp8`, Transformer Engine FP8 training) needs Hopper or newer. INT8 PTQ and AWQ/GPTQ INT4 inference remain available paths if A100 is the only option.
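A deployment script can encode this rule by branching on compute capability. The sketch below is a hedged illustration: the function names and the returned labels are hypothetical, and the only hardware fact it relies on is that FP8 Tensor Core math begins at compute capability 8.9 (Ada) and 9.0 (Hopper), while BF16 begins at 8.0.

```python
def supports_fp8(compute_capability):
    """FP8 Tensor Core math starts at compute capability 8.9; A100 is 8.0."""
    return tuple(compute_capability) >= (8, 9)

def pick_serving_precision(compute_capability, int4_checkpoint_available=False):
    """Choose a precision label for a serving manifest (labels are illustrative)."""
    cc = tuple(compute_capability)
    if supports_fp8(cc):
        return "fp8"
    if cc < (8, 0):
        return "fp16"  # pre-Ampere parts have no BF16 Tensor Core path
    return "awq-int4" if int4_checkpoint_available else "bf16"

# In PyTorch the tuple would come from torch.cuda.get_device_capability().
print(pick_serving_precision((8, 0)))        # A100
print(pick_serving_precision((8, 0), True))  # A100 with an INT4 checkpoint
print(pick_serving_precision((9, 0)))        # H100
```

Keeping the decision in one function makes it harder for a Hopper-tuned manifest to reach an A100 pool unchanged.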
Interconnect: NVLink 3.0 and the HGX-A100 baseboard
NVLink 3.0 provides 600 GB/s per GPU — 12 ports at 50 GB/s each — and is exposed on every SXM4 module. On the HGX-A100 baseboard, 8 GPUs are wired through 6 second-generation NVSwitch ASICs into a fully non-blocking all-to-all fabric: any GPU can DMA into any other GPU's HBM at the full 600 GB/s bidirectional bandwidth with no fabric contention.
The critical limit is what comes next. A100 predates the NVLink Switch System (introduced with H100). Beyond the 8-GPU baseboard, scale-out runs over InfiniBand HDR (200 Gb/s, the original A100 cluster fabric) or NDR (400 Gb/s, common on 2022+ A100 clusters). All-reduce latency at the cluster boundary rises sharply, typically 5-10x, and the cluster looks like a collection of discrete 8-GPU islands rather than a coherent shared-memory fabric. This is the single biggest architectural reason frontier training moved off A100 to Hopper from 2023 onward.
PCIe A100 cards can be paired with the 600 GB/s NVLink bridge, but unlike SXM4 there is no NVSwitch path: bridges link cards in pairs only, so a 4-card server is two NVLink pairs with cross-pair traffic falling back to PCIe Gen4. For multi-GPU training on PCIe A100, size for effective GPU-to-GPU bandwidth closer to 200-400 GB/s.
- Per-GPU NVLink 3.0: 600 GB/s bidirectional (12 ports x 50 GB/s).
- Per-baseboard NVSwitch bandwidth: 4.8 TB/s aggregate across 8 GPUs (8 x 600 GB/s); bisection bandwidth is half that, 2.4 TB/s.
- NVLink-domain ceiling: 8 GPUs (one HGX-A100 baseboard). No multi-baseboard NVLink fabric.
- Cluster scale-out: InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s); RoCE v2 increasingly common on neoclouds.
- Cross-baseboard collective latency: 5-10x intra-baseboard; size pipeline-parallel rather than tensor-parallel for cross-node splits.
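The gap between the NVLink island and the scale-out fabric can be approximated with the standard bandwidth-only ring all-reduce estimate, in which each GPU sends 2(N-1)/N of the message. The sketch below is a rough model under stated assumptions: one HDR port per GPU, no latency term, and no NCCL tree or hierarchical algorithms, so measured ratios will differ from the raw link ratio it prints.

```python
def ring_allreduce_seconds(message_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth-only ring all-reduce time: 2*(N-1)/N * size / per-direction bandwidth."""
    return 2 * (n_gpus - 1) / n_gpus * message_bytes / link_bytes_per_s

GB = 1e9
nvlink_one_way = 12 * 50 * GB / 2  # 600 GB/s bidirectional -> 300 GB/s per direction
hdr_one_way = 200e9 / 8            # one HDR port: 200 Gb/s -> 25 GB/s (assumed one NIC per GPU)

message = 1 * GB
intra = ring_allreduce_seconds(message, 8, nvlink_one_way)
inter = ring_allreduce_seconds(message, 8, hdr_one_way)
print(f"NVLink: {intra * 1e3:.2f} ms, HDR: {inter * 1e3:.2f} ms")
```

Because the cross-node term dominates, the usual design choice is to keep tensor-parallel collectives inside one baseboard and let only pipeline or data-parallel traffic cross InfiniBand.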
Sizing and capacity planning
Sizing tables for the workloads A100 still wins or holds its own on in 2026. All figures assume A100 SXM4 80 GB, BF16 weights (no FP8 path on Ampere), vLLM 0.6+ with paged KV cache and prefix caching, and a healthy NVLink-local placement. Throughput is given in output tokens per second per replica at moderate concurrency (16-32 sessions); verify against your own traffic shape before committing capacity.
- Single A100 80 GB ceiling for BF16 inference: weights + KV cache + activations + cuBLAS scratch must fit under ~76 GB; above that, OOMs even with paged KV.
- 70B BF16 on a single A100 80 GB is infeasible (140 GB weights alone); use INT4 quantisation or 2-4 card TP.
- MIG slice mapping: 1g.10gb fits 7B INT4, 3g.40gb fits 7B BF16 or 13B INT4, 7g.80gb (full card) fits 34B BF16 or 70B INT4.
- Training rule of thumb: 13B parameters x 1 trillion tokens at BF16 takes roughly 90-130 days of wall-clock time on a 64-GPU HGX-A100 cluster with Megatron-LM, i.e. about 5,800-8,300 A100-days at 35-50 % of peak.
- AllReduce overhead at TP=8 inside one HGX-A100: ~12-18 % of step time for 70B BF16 (vs ~6-9 % on H100); cross-baseboard splits push this past 35 %.
- Spot/preemptible A100 capacity is broadly available at 50-60 % below on-demand with 5-10 % daily eviction — suitable for fine-tunes, not for inference SLAs.
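The training rule of thumb in the list above can be sanity-checked with the common approximation of 6 FLOPs per parameter per token. The sketch assumes the 312 TFLOPS dense BF16 peak and a model FLOPs utilisation (MFU) of 40 %, which is an illustrative value rather than a measured one; the function name is hypothetical.

```python
def training_estimate(params, tokens, n_gpus, peak_tflops=312.0, mfu=0.40):
    """Return (GPU-days, wall-clock days) using the 6 * params * tokens approximation."""
    total_flops = 6 * params * tokens
    effective_flops_per_gpu = peak_tflops * 1e12 * mfu  # sustained FLOP/s per GPU
    gpu_days = total_flops / effective_flops_per_gpu / 86_400
    return gpu_days, gpu_days / n_gpus

gpu_days, wall_days = training_estimate(params=13e9, tokens=1e12, n_gpus=64)
print(f"{gpu_days:,.0f} A100-days, {wall_days:.0f} days on 64 GPUs")
```

Varying `mfu` between 0.35 and 0.50 is a simple way to bracket the estimate before committing reserved capacity.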
| Model | Precision | Context | GPUs per replica | TP / PP | Approx output TPS | VRAM headroom |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | BF16 | 8K | 1x A100 80 GB | 1 / 1 | 3,200-4,200 | 55 GB free |
| Llama 3.1 8B | AWQ INT4 | 32K | 1x A100 80 GB | 1 / 1 | 4,500-5,500 | 60 GB free |
| Mistral 7B / Qwen 7B | BF16 | 8K | 1x A100 80 GB | 1 / 1 | 3,400-4,400 | 55 GB free |
| Codestral 22B / Yi 34B | BF16 | 8K | 1x A100 80 GB | 1 / 1 | 1,100-1,500 | 10 GB free |
| Codestral 22B | AWQ INT4 | 32K | 1x A100 80 GB | 1 / 1 | 1,800-2,400 | 40 GB free |
| Llama 3 70B | BF16 | 4K | 2x A100 80 GB | 2 / 1 | 550-750 | 8 GB free per rank |
| Llama 3 70B | AWQ INT4 | 8K | 1x A100 80 GB | 1 / 1 | 350-500 | 20 GB free |
| Llama 3 70B | BF16 | 8K | 4x A100 80 GB | 4 / 1 | 750-950 | 30 GB free per rank |
| Mixtral 8x7B (MoE 47B) | BF16 | 32K | 2x A100 80 GB | 2 / 1 | 1,200-1,600 | 10 GB free per rank |
| SDXL 1.0 (1024x1024) | BF16 | n/a | 1x A100 80 GB | 1 / 1 | 1.0-1.3 images/s | 60 GB free |
| Whisper Large v3 batch | BF16 | 30s clip | 1x A100 (MIG 3g.40gb slice) | 1 / 1 | 40-50x real time | n/a |
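The VRAM headroom column follows from simple arithmetic: weights take parameters times bytes per parameter, and each cached token costs 2 x layers x KV heads x head dimension x bytes per value. The sketch below assumes the ~76 GB usable ceiling quoted above and model shapes from the public Llama 3 configs (8B: 32 layers, 8 KV heads, head dimension 128); it ignores activations and cuBLAS scratch, so treat the result as an upper bound.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def max_cached_tokens(params, bytes_per_param, kv_per_token, usable_bytes=76e9):
    """Tokens of KV cache that fit after the weights are loaded (0 if weights do not fit)."""
    free = usable_bytes - params * bytes_per_param
    return max(0, int(free // kv_per_token))

llama8b_kv = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(llama8b_kv)                                # bytes per cached token
print(max_cached_tokens(8.03e9, 2, llama8b_kv))  # BF16 weights on one A100 80 GB

llama70b_kv = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(max_cached_tokens(70.6e9, 2, llama70b_kv))  # BF16 70B does not fit on one card
```

Dividing the token budget by the target context length gives a first estimate of how many concurrent sessions a replica can hold.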
Cost and TCO
A100 pricing has compressed continuously since H100 supply caught up in 2024. In 2026 the public ranges below are typical across hyperscalers and Tier-1/Tier-2 neoclouds. The cost case for A100 is now made on three axes: (1) cost-per-token for sub-30B inference, where the A100 80 GB beats both H100 (overkill) and L40S (GDDR6 with less memory bandwidth) on $/token; (2) fine-tuning of 7B-13B models, where 80 GB HBM2e fits QLoRA comfortably and the per-GPU-hour cost is roughly half H100's; (3) ecosystem maturity, since every kernel, every quantisation scheme and every published recipe targets A100 first.
- Cost-per-million-output-tokens on Llama 3.1 8B BF16 at $1.50/GPU-hr and 3,800 TPS sustained: roughly $0.11 per million tokens before margin.
- Cost-per-million-output-tokens on Llama 3 70B BF16 (2x A100) at $1.50/GPU-hr and 650 TPS sustained: roughly $1.28 per million, which is competitive with H100 70B FP8 ($0.50) only when A100 capacity is heavily discounted.
- Switching from BF16 to AWQ INT4 yields +1.4-1.6x throughput on most sub-30B models with <2 % quality regression; the only realistic precision lever on Ampere.
- 3-year reservation cuts effective $/GPU-hr by 45-55 % versus on-demand; only commit when steady-state utilisation exceeds 65 % across the term.
- MIG slices are the dominant cost lever for small-model inference fleets: serving 7B INT4 on 1g.10gb slices is typically 3-5x cheaper per request than full-card serving.
- Egress and inter-region data movement still frequently exceed 8-12 % of total A100 bill at hyperscalers — collocate model artefacts with compute.
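The per-token figures in the list above are straightforward to reproduce. The sketch below recomputes them from GPU-hour price and sustained throughput, and adds the raw break-even utilisation for a reservation discount; both function names are illustrative, and the break-even model assumes the alternative is paying on-demand only for busy hours.

```python
def cost_per_million_tokens(gpu_hour_usd, n_gpus, tokens_per_second):
    """USD per million output tokens for one replica at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * n_gpus / tokens_per_hour * 1e6

def reservation_breakeven_utilisation(discount):
    """Utilisation above which a reservation beats on-demand paid only when busy."""
    return 1.0 - discount

print(round(cost_per_million_tokens(1.50, 1, 3800), 2))  # Llama 3.1 8B BF16, 1x A100
print(round(cost_per_million_tokens(1.50, 2, 650), 2))   # Llama 3 70B BF16, 2x A100
print(round(reservation_breakeven_utilisation(0.45), 2)) # 45 % discount
```

A 45-55 % discount breaks even at 45-55 % utilisation, so the 65 % threshold above leaves a margin for demand forecasting error.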
| Provider class | SKU | On-demand $/GPU-hr | 1y reserved | 3y reserved | Spot |
|---|---|---|---|---|---|
| Hyperscaler (AWS/GCP/Azure) | A100 SXM4 80 GB | $2.00-$2.25 | $1.35-$1.70 | $1.10-$1.40 | $0.70-$0.90 |
| Hyperscaler | A100 SXM4 40 GB | $1.70-$2.00 | $1.20-$1.50 | $0.95-$1.20 | $0.55-$0.75 |
| Hyperscaler | A100 PCIe 80 GB | $1.80-$2.10 | $1.25-$1.55 | $1.00-$1.25 | $0.60-$0.80 |
| Tier-1 neocloud | A100 SXM4 80 GB | $1.50-$1.90 | $1.10-$1.40 | $0.90-$1.15 | $0.55-$0.75 |
| Tier-2 neocloud | A100 SXM4 80 GB | $1.10-$1.50 | $0.90-$1.20 | $0.75-$1.00 | $0.40-$0.60 |
| Spot/preemptible (hyperscaler) | A100 SXM4 80 GB | n/a | n/a | n/a | $0.65-$0.90 (8-15 %/day eviction) |
| MIG 1g.10gb slice (1/7 card) | A100 80 GB | $0.30-$0.45 | $0.22-$0.32 | $0.18-$0.26 | n/a |
| Yobitel NeoCloud (UK + EU) | A100 SXM4 80 GB | $1.40-$1.80 | $1.05-$1.35 | $0.85-$1.10 | n/a |
| Yobitel Omniscient Compute | A100 SXM4 80 GB multi-cloud | Market-clearing | Commit-discounted | Commit-discounted | n/a |
When consumed via Yobitel Omniscient Compute, cost figures are reported in the FinOps Foundation FOCUS billing format: ServiceName=`AcceleratorCompute`, ChargeCategory=`Usage`, SkuId=`gpu.a100.sxm4.80gb`. This is what enables cross-provider arbitrage and cost attribution at workspace granularity.
Software ecosystem
A100 has the deepest software stack of any AI accelerator after H100. CUDA 11.0 through 13.x supports sm_80 as a first-class target; every cuDNN release back to 8.0, every NCCL release back to 2.7, every Triton release, and the full Hugging Face training stack (transformers, peft, accelerate, trl) treat A100 as the reference platform. vLLM, SGLang, TGI and Triton Inference Server all serve on A100 with FP16/BF16 weights and INT8/INT4 quantisation paths (AWQ, GPTQ, bitsandbytes); the only thing they cannot do on A100 is FP8.
Training framework support is equally complete: Megatron-LM, DeepSpeed, FSDP (PyTorch native), FSDP-2, Megatron-Core, NeMo and Axolotl all have A100-tuned recipes, and most published academic results from 2020-2024 are reproducible on A100 without modification. The R570 driver branch (CUDA 12.8) and the R580 branch that CUDA 13.x requires both support A100 to the same depth as Hopper apart from sm_90-specific kernels.
The operational pattern that matters: stacks built around Hopper-tuned kernels (Flash Attention 3, CUTLASS 3.x Hopper-only paths, Triton's Hopper backend, the TMA-using CUTLASS kernels in vLLM 0.6+) silently fall back to sm_80 paths on A100. They continue to run, but at a fraction of Hopper throughput. When benchmarking, check which SM targets a binary actually ships with `cuobjdump --list-elf` or `cuobjdump --dump-elf-symbols`, and confirm in a profiler such as Nsight Systems which kernels ran, so that you are not measuring an unintended fallback.
Migration and alternatives
When A100 is the right choice and when it isn't. The table maps the practical migration paths in both directions — A100 is increasingly the 'from' card rather than the 'to' card, but it remains the 'to' card for teams stepping down from over-spec H100 fleets to right-size inference economics.
- Heuristic 1: if your inference fleet runs 7B-34B models at moderate context (< 16K) and FP8 is not in the roadmap, A100 80 GB usually beats H100 on $/token; verify on InferenceBench.
- Heuristic 2: if you are training above 13B parameters with > 64 GPUs, the lack of NVLink Switch System scale-out makes A100 scale measurably worse than H100, which can outweigh the lower per-GPU cost.
- Heuristic 3: never migrate A100 -> L40S without first benchmarking the long-context tail latency — L40S GDDR6 bandwidth (864 GB/s) is less than half A100's HBM2e (2.0 TB/s) and KV-cache-heavy decodes regress meaningfully.
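The bandwidth gap in Heuristic 3 translates directly into a decode ceiling. For single-sequence decoding, a common roofline approximation is that every generated token must stream all weights from memory once, so tokens per second cannot exceed bandwidth divided by weight bytes. The sketch below applies that bound to an 8B BF16 model (about 16 GB of weights); it ignores KV-cache reads and batching, so it is an upper bound on per-sequence speed, not a throughput prediction.

```python
def decode_tokens_per_s_bound(mem_bandwidth_bytes_per_s, bytes_read_per_token):
    """Upper bound on single-sequence decode speed for a memory-bound model."""
    return mem_bandwidth_bytes_per_s / bytes_read_per_token

weights_bytes = 16e9  # ~8B parameters at 2 bytes each
a100 = decode_tokens_per_s_bound(2.0e12, weights_bytes)  # A100 80 GB HBM2e
l40s = decode_tokens_per_s_bound(864e9, weights_bytes)   # L40S GDDR6
print(f"A100 bound: {a100:.0f} tok/s, L40S bound: {l40s:.0f} tok/s")
```

Long contexts add KV-cache reads to the bytes-per-token term, which widens the gap further and explains the tail-latency regression on L40S.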
| From / to | When it pays | Migration effort | Key incompatibility |
|---|---|---|---|
| V100 -> A100 | Need BF16, TF32, MIG, or 80 GB HBM | Low (CUDA upgrade) | Compute capability sm_70 -> sm_80; kernel recompile only |
| A100 -> H100 | Need FP8 or NVLink Switch System scale-out | Low (drop-in CUDA upgrade) | FP8 calibration; sm_90 kernels not on Ampere |
| A100 -> H200 | KV cache memory-bound on long context | Low (same software stack as H100) | Same as A100 -> H100 |
| A100 -> L40S | 7B-34B inference, NVLink not needed, $/token priority | Low (GDDR6 not HBM — verify latency tail) | No NVLink; no MIG on L40S |
| A100 PCIe -> A100 SXM4 | NVLink-bound multi-GPU training | Medium (chassis change) | Cooling envelope; baseboard required |
| A100 80 GB -> A100 40 GB | Cost-sensitive MIG inference fleets | Trivial (same software) | Sizing tables shift — 40 GB limits at 7B BF16 |
| A100 -> MI300X | Need 192 GB HBM3 per GPU and can give up A100's software lead | High (CUDA -> ROCm rewrite) | CUDA kernels not portable; vLLM ROCm path lags |
| A100 -> Inferentia 2 | Inference-only, AWS-resident, simple model coverage | High (Neuron compiler) | Limited model coverage; recompile required |
Pitfalls and operational notes
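- FP8 has no hardware path on A100: Hopper-tuned vLLM/TensorRT-LLM commands that pass `--quantization fp8` will either error out or, in newer vLLM builds, run FP8 checkpoints weight-only without any FP8 Tensor Core speed-up. Standardise on BF16 or AWQ INT4 in your A100 deployment manifests.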
- Two memory tiers in the wild — 40 GB HBM2 (1.55 TB/s) versus 80 GB HBM2e (2.0 TB/s). Cloud instance names rarely disambiguate; verify with `nvidia-smi --query-gpu=name,memory.total --format=csv` before sizing.
- PCIe Gen4 is half the bandwidth of Gen5 — in mixed clusters with H100 hosts, A100 nodes often become dataloader-bound on training. Pin large parquet shards to local NVMe and use pinned-memory dataloaders.
- MIG slice boundaries are static — switching MIG profile or disabling MIG requires draining the GPU and a `nvidia-smi mig -cgi` reconfiguration; rolling MIG changes break long-running workloads.
- Secondary-market A100s are a real risk: many ex-mining cards have degraded HBM. Burn-in for at least 72 hours with DCGM ECC monitoring before production placement.
- DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL readings on A100 SXM4 that drop below ~580 GB/s indicate a single NVLink port is down; reseat the mezzanine before considering RMA.
- Hopper-tuned kernels fall back silently to sm_80 — verify with `cuobjdump` that benchmarks are exercising the intended path; published H100 numbers do not scale linearly down to A100.
- Confidential Compute is not available on A100 (Hopper-and-later only). Sovereign deployments requiring attested-boot GPU isolation must target H100/H200.
- Driver R570 (CUDA 12.8) is the recommended 2026 baseline, with R580 or newer where CUDA 13.x is required; older R450/R470 builds are missing critical NCCL and DCGM fixes.
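The memory-tier check from the list above is easy to automate. The sketch below parses the CSV that `nvidia-smi --query-gpu=name,memory.total --format=csv` prints and labels each GPU by tier. The sample text and the 60,000 MiB threshold are illustrative assumptions; in practice the text would come from running the command, for example with `subprocess.run(..., capture_output=True, text=True).stdout`.

```python
import csv
import io

def classify_a100(csv_text, threshold_mib=60_000):
    """Map each GPU row to its memory tier based on reported memory.total."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(rows)  # skip the header line
    result = []
    for row in rows:
        if not row:
            continue  # tolerate blank lines
        name, memory = row
        mib = int(memory.split()[0])  # "81920 MiB" -> 81920
        tier = "80 GB HBM2e" if mib > threshold_mib else "40 GB HBM2"
        result.append((name, tier))
    return result

# Illustrative sample of the command's output, not captured from a real host.
sample = """name, memory.total [MiB]
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-40GB, 40960 MiB
"""
print(classify_a100(sample))
```

Running this at node admission time and tagging the node accordingly keeps 40 GB cards out of pools sized for the 80 GB tables.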
Where this fits in the Yobitel stack
A100 remains a first-class target in the Yobitel stack through 2026. Yobibyte — our AI-native platform — schedules inference replicas and fine-tune jobs on A100 pools whenever the workload fits inside A100's BF16/INT4 envelope and the cost case beats H100; the platform's placement layer is aware of NVLink topology, MIG profiles and HBM2 versus HBM2e variants and tags every replica with the SKU it landed on.
Omniscient Compute — our cross-cloud capacity broker — indexes A100 80 GB and 40 GB SKUs across every connected hyperscaler and Tier-1/Tier-2 neocloud and arbitrages workloads to the cheapest region that meets the workspace's residency posture. A100 is frequently the price-leading SKU on the broker for sub-30B inference because supply is broad and depreciation is well past breakeven for most operators.
InferenceBench, our public, reproducible benchmarking harness, publishes A100 throughput, latency and cost-per-token numbers for every major open-weight model under 70B on vLLM, TensorRT-LLM (no FP8 paths on A100), SGLang and TGI. The A100 sizing tables in this entry are anchored on InferenceBench runs; production numbers your team will see in steady state are typically within 10 % of the published figures.