TL;DR
- Single-slot, low-profile, passive Ada Lovelace card (AD104) at 72 W TDP — the most power-efficient mainstream data-centre GPU NVIDIA ships in 2026.
- 24 GB GDDR6 at 300 GB/s, 121 TFLOPS BF16 (242 sparse), 242 TFLOPS FP8 (485 sparse), 242 TOPS INT8 (485 sparse); fourth-generation Tensor Cores with E4M3/E5M2 FP8 support.
- Sweet spot: 7B-class LLM serving (Qwen2.5-7B, Llama 3.1-8B, Mistral-7B), text-embedding-3 / BGE-M3 embeddings, Whisper Large v3 transcription, SDXL inference, OCR, and high-density AV1 transcode.
- Cost: $0.40-0.55/GPU/hr on-demand at hyperscalers (GCP G2, AWS G6), $0.30-0.45/GPU/hr on Tier-1/2 neoclouds. That is typically 4-6x cheaper per hour than H100 SXM5 and, on the worked 7B chat figures in the Cost and TCO section, roughly 1.3x cheaper per million output tokens.
- Default routing target for Yobibyte's 7B-class chat and embedding endpoints on Yobitel NeoCloud UK + EU L4 capacity; InferenceBench publishes L4-vs-H100-vs-L40S cost-per-token tables so the routing decision is auditable.
Overview
L4 is NVIDIA's Ada Lovelace successor to T4 and the densest mainstream inference card the company ships in 2026. Single-slot, low-profile, passive, 72 W TDP — the form factor that lets you fit 8 L4s in a 1U server or 16 in a 2U without exotic cooling, and the part you reach for whenever the binding constraint is rack power or replica count per dollar rather than peak throughput per GPU.
Despite the modest envelope, L4 carries the full Ada Lovelace feature set: fourth-generation Tensor Cores with FP8 (E4M3 / E5M2 — the same formats as Hopper, the same Transformer Engine code path), Ada media engines with AV1 encode and decode, and third-generation RT cores. The combination makes it genuinely versatile: 7B-class LLM serving, dense embedding generation, Whisper Large v3 transcription, OCR (TrOCR, PaddleOCR), SDXL inference at moderate latency, and very high-density AV1 video transcoding sit on the same SKU.
This entry is the reference for teams choosing inference SKUs in 2026. It covers the full spec sheet, the InferenceBench-anchored sizing tables we use to fleet-plan 7B-class chat, the cost-per-million-output-tokens maths against H100 / L40S / T4, the workloads where L4 wins and where it loses to H100, and the migration paths to and from L40S, A10G, T4 and H100. Yobitel NeoCloud offers L4 capacity in UK and EU regions with NCSC OFFICIAL alignment; Yobibyte's router treats L4 as the default placement for 7B chat and embedding endpoints because the cost-per-token maths favour it across the typical request distribution.
Specifications
Authoritative figures. All Tensor numbers are quoted both dense and with 2:4 structured sparsity (sparse = 2x dense). Production workloads rarely sustain sparsity; plan with dense figures.
| Metric | L4 |
|---|---|
| Architecture | Ada Lovelace (AD104) |
| Process | TSMC 4N |
| Transistors | 35.8 billion |
| SMs / CUDA cores | 58 / 7,424 |
| Tensor cores (gen 4) | 232 |
| RT cores (gen 3) | 58 |
| Compute capability | sm_89 |
| FP32 | 30.3 TFLOPS |
| TF32 (Tensor, dense / sparse) | 60 / 120 TFLOPS |
| BF16 / FP16 (Tensor, dense / sparse) | 121 / 242 TFLOPS |
| FP8 (Tensor, dense / sparse) | 242 / 485 TFLOPS |
| INT8 (Tensor, dense / sparse) | 242 / 485 TOPS |
| Memory | 24 GB GDDR6 |
| Memory bandwidth | 300 GB/s |
| L2 cache | 48 MB |
| PCIe | Gen4 x16 (32 GB/s) |
| TDP | 72 W (passive, single-slot) |
| Form factor | Low-profile single-slot, 168 mm length |
| NVENC / NVDEC | 2 NVENC / 4 NVDEC (AV1 encode + decode) |
| NVLink | Not supported (PCIe-only) |
| MIG | Not supported |
| Confidential Compute | Not supported |
| Minimum driver | R525 (R550+ recommended) |
| Minimum CUDA | 12.0 (12.4+ for full TE) |
| Launched | March 2023 (GTC) |
L4 memory bandwidth (300 GB/s) is half of A10G/A10 (600 GB/s) and one-eleventh of H100 SXM5 (3.35 TB/s). For bandwidth-bound LLM decode, single-replica TPS scales with bandwidth — L4 single-replica throughput on Llama 3 8B is 30-40 % lower than A10G at the same batch size. The L4 win is density (8-16 cards per server) and dollars-per-token, not raw single-replica speed.
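The bandwidth argument above can be sketched as a back-of-envelope ceiling. This is a minimal sketch, assuming each decode step streams the full weight set from memory once and ignoring KV-cache reads, compute and kernel overheads. The function name is hypothetical. The 9 GB weight footprint is an approximate figure for an 8B-class model in FP8, held constant across cards only to isolate the effect of bandwidth.

```python
def decode_steps_ceiling(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode steps per second for one replica.

    Assumes every step reads all weights once, so the ceiling is simply
    memory bandwidth divided by weight footprint.
    """
    return bandwidth_gb_s / weight_gb


# Bandwidth figures are the ones quoted in the paragraph above.
CARDS = [("L4", 300.0), ("A10G", 600.0), ("H100 SXM5", 3350.0)]

for name, bandwidth in CARDS:
    ceiling = decode_steps_ceiling(bandwidth, weight_gb=9.0)
    print(f"{name}: at most {ceiling:.0f} decode steps/s per replica")
```

Each decode step emits one token per sequence in the batch, so aggregate output TPS is roughly steps per second times batch size until KV-cache traffic or compute becomes the limit. Measured numbers will sit below this ceiling.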
Architecture: what Ada brings to inference
Ada Lovelace's fourth-generation Tensor Core adds FP8 — the same E4M3 forward / E5M2 backward formats Hopper introduced, the same Transformer Engine library, the same cuBLAS / cuDNN integration. For inference this is the headline feature: enabling FP8 weights and FP8 KV cache roughly doubles throughput vs FP16 and halves the memory pressure on the 24 GB framebuffer, letting L4 host a 7B-class model with a meaningful KV-cache budget.
The Ada media engines support AV1 encode and decode in addition to H.264/H.265, at roughly 40 % higher NVENC throughput-per-watt than the Ampere generation. For video-AI pipelines (ingesting livestreams, running detection or transcription, re-encoding to AV1 for delivery) this is what makes L4 a single-SKU solution where Ampere required separate transcode and compute cards.
What Ada does NOT bring: NVLink (L4 is PCIe-only), MIG (no hardware multi-tenancy), Confidential Compute, HBM (GDDR6 is fine for inference but caps bandwidth at 300 GB/s vs HBM3's 3+ TB/s). Multi-GPU L4 configurations exist but rely entirely on PCIe Gen4 x16 — fine for data-parallel replicas, not for tensor-parallel inference of large models.
- Fourth-gen Tensor Core: FP8 (E4M3 + E5M2), BF16, FP16, TF32, INT8 — Transformer Engine compatible.
- Ada media engines: 2 NVENC + 4 NVDEC with full AV1 encode/decode at up to 4K60 HDR per pipeline.
- Third-gen RT cores: relevant for graphics-adjacent AI (NeRF inference, real-time ray-traced auxiliary visualisation), not for typical LLM/CV workloads.
- No MIG: a single inference replica owns the whole card. Multi-tenancy is per-process / per-container, not silicon-isolated.
- No NVLink: rules out tensor-parallel inference of models that do not fit on a single L4 (effectively, anything above ~13B in FP8).
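To make the difference between the two FP8 formats concrete, the sketch below computes the largest finite value each can represent. It assumes the common FP8 convention in which E5M2 reserves its top exponent code for infinities and NaN (IEEE-like), while E4M3 keeps the top exponent code for normal numbers and reserves only the all-ones mantissa there for NaN. The function name and its `ieee_like` flag are illustrative, not part of any library.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a small float format under the stated convention."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2-style).
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Top exponent code is usable; only mantissa all-ones is NaN (E4M3-style).
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2 ** man_bits)


print("E4M3 max finite:", fp8_max_finite(4, 3, ieee_like=False))
print("E5M2 max finite:", fp8_max_finite(5, 2, ieee_like=True))
```

E4M3 trades range for an extra mantissa bit, which is why it is the usual choice for weights and activations, while E5M2's wider range suits gradients.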
Form factor, power and thermal
The L4 design brief was 'maximum density in standard server chassis'. The result is a single-slot, low-profile PCIe card at 72 W TDP — the same envelope as a high-end NIC.
- Form factor: PCIe Gen4 x16, single-slot, low-profile (LP), 168 mm length, passive cooling. Fits standard 1U / 2U server chassis without modification.
- TDP: 72 W (passive). Cards rely on chassis airflow; server vendors specify minimum airflow CFM in their L4-certified configurations.
- Density: 8x L4 in a 1U server is a common reference design (Supermicro SYS-211GT-HNTR, Dell PowerEdge XR8000, HPE ProLiant DL380a) — 576 W of GPU TDP plus chassis. 16x L4 in 2U is achievable.
- Power efficiency: roughly 1.7 TFLOPS BF16 dense per watt (121 TFLOPS / 72 W; about 3.4 with 2:4 sparsity) vs ~1.4 TFLOPS BF16 dense per watt on H100 SXM5, which puts L4 among the most power-efficient NVIDIA data-centre inference SKUs through 2026.
- Operating temperature: 0-50 C inlet for L4-certified servers; thermal throttle threshold is around 85 C die. Less margin than H100 (83 C throttle on a 700 W board), but the absolute power dissipation makes thermal headroom non-binding in well-designed chassis.
An 8x L4 1U at ~600 W of GPU TDP draws less power than a single H100 SXM5 and serves more concurrent 7B-class chat sessions. For dense inference fleets, L4 1U sleds are the most power- and rack-efficient configuration NVIDIA enables in 2026.
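The density and efficiency arithmetic in this section can be reproduced with a few lines. This is a minimal sketch using only the TDP and BF16 figures from the specification table; the helper names are hypothetical, and chassis overhead (CPUs, fans, PSU losses) is deliberately left out.

```python
L4_TDP_W = 72
L4_BF16_DENSE_TFLOPS = 121
L4_BF16_SPARSE_TFLOPS = 242


def chassis_gpu_power_w(cards: int, tdp_w: float = L4_TDP_W) -> float:
    """GPU-only power budget for a chassis, excluding CPUs, fans and PSU losses."""
    return cards * tdp_w


def tflops_per_watt(tflops: float, tdp_w: float) -> float:
    """Peak Tensor throughput per watt of board TDP."""
    return tflops / tdp_w


print("8x L4 GPU power (W):", chassis_gpu_power_w(8))
print("16x L4 GPU power (W):", chassis_gpu_power_w(16))
print("BF16 dense TFLOPS/W:", round(tflops_per_watt(L4_BF16_DENSE_TFLOPS, L4_TDP_W), 2))
print("BF16 sparse TFLOPS/W:", round(tflops_per_watt(L4_BF16_SPARSE_TFLOPS, L4_TDP_W), 2))
```

Comparing cards on the same basis (dense against dense, or sparse against sparse) matters here, because the sparse figure is exactly twice the dense one.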
Software ecosystem
L4 inherits the full CUDA / Ada software stack. Driver requirements are R525 minimum, R550+ recommended for FP8 paths via Transformer Engine. CUDA 12.0+ supports the silicon; CUDA 12.4+ is required for the full TE recipe set used by vLLM / TensorRT-LLM / SGLang.
- Inference servers: vLLM 0.5+ (FP8 weight + KV cache paths run on L4 with the same flags as Hopper at proportionally lower throughput), TensorRT-LLM (FP8 engines compile for sm_89; the same workflow as H100 minus FP4), SGLang (same vLLM-style FP8 path), Triton Inference Server (TensorRT-LLM backend), Hugging Face TGI.
- Embedding / dense vector: sentence-transformers, text-embeddings-inference (TEI), Cohere reranker.
- Speech: whisper.cpp, faster-whisper (CTranslate2 backend), WhisperX, NVIDIA Riva.
- Vision: TensorRT, ONNX Runtime, OpenCV with CUDA backend, NVIDIA DeepStream for video pipelines exploiting AV1.
- Cloud-native: with the NVIDIA GPU Operator on Kubernetes, L4 nodes carry the `nvidia.com/gpu.product=NVIDIA-L4` node label and expose the standard `nvidia.com/gpu` resource, so L4 schedules like any other GPU. KServe, Knative-Serving, KEDA + Prometheus all work transparently.
- Yobibyte exposes L4 capacity as the default pool for 7B chat and embedding workspaces; customers describe a model, Yobibyte places the replica on an L4 (or L40S / H100 for larger contexts) per the published routing matrix.
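As an illustration of the vLLM FP8 path mentioned above, the sketch below assembles a `vllm serve` command line for a single L4 replica. It is one plausible configuration, not the flags Yobibyte or InferenceBench actually use. The helper name, the model ID and the 8K context limit are assumptions chosen for the example; check the flags against the vLLM version you deploy.

```python
import shlex


def vllm_serve_args(model: str, max_model_len: int = 8192,
                    gpu_memory_utilization: float = 0.90) -> list[str]:
    """Build an argument list for a single-GPU FP8 vLLM server (illustrative)."""
    return [
        "vllm", "serve", model,
        "--quantization", "fp8",           # FP8 weights
        "--kv-cache-dtype", "fp8",         # FP8 KV cache to stretch the 24 GB budget
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--enable-prefix-caching",
    ]


if __name__ == "__main__":
    # Model ID is an example; substitute the checkpoint you actually serve.
    print(shlex.join(vllm_serve_args("meta-llama/Llama-3.1-8B-Instruct")))
```

Building the arguments as a list keeps the configuration easy to diff and pin in a container entrypoint or Kubernetes manifest.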
Sizing: workload-to-config mapping
Sizing tables we use to scope L4 footprints on Yobitel NeoCloud. All figures assume FP8 via Transformer Engine, vLLM 0.6 with paged KV cache and prefix caching, and a healthy single-GPU placement. Throughput is output tokens per second per replica at the listed concurrency. Anchor your plan on InferenceBench numbers and validate before locking in.
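- Rule of thumb: a 7B-class FP8 chat stream needing about 6,000 aggregate output TPS (for example, roughly 1.5 requests/s at 4K average output) needs 3-4 L4 replicas at the 1,500-2,100 TPS per replica in the table below. The same workload on H100 SXM5 needs 1 replica but at 4-6x the hourly rate.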
- Memory ceiling on 24 GB GDDR6: weights + KV cache + cuBLAS scratch < 22 GB. Llama 3.1 8B FP8 weights ~9 GB, leaves 13 GB for KV cache + activations; tune `gpu_memory_utilization=0.90`.
- L4 PCIe Gen4 x16 link is 32 GB/s — fine for inference (host-to-GPU bandwidth is not the bottleneck) but limits multi-card data-parallel scaling beyond ~8 cards behind a single CPU.
- Spot/preemptible L4 capacity on GCP G2 is viable for batch inference and embeddings (5-10 % eviction/day); not for chat SLAs.
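The memory-ceiling bullet above can be turned into a rough KV-cache budget. This is a sketch under stated assumptions: the architecture values (32 layers, 8 KV heads, head dimension 128) are what we understand Llama 3.1 8B to use and should be checked against the model config, the KV cache is FP8 (1 byte per element), and the 13 GB budget is the figure left after weights in the bullet above. Function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_gb: float, per_token_bytes: int) -> int:
    """How many tokens of KV cache fit in the budget (1 GB taken as 1e9 bytes)."""
    return int(budget_gb * 1e9) // per_token_bytes


# Assumed Llama 3.1 8B shape with an FP8 KV cache.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=1)
tokens = max_cached_tokens(13.0, per_token)

print("KV bytes per token:", per_token)
print("Tokens that fit in 13 GB:", tokens)
print("Full 8K-context sequences:", tokens // 8192)
```

Paged KV caching allocates on demand, so real concurrency can exceed the full-context count when most requests use less than the maximum context; treat the result as a conservative bound.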
| Workload | Model | Precision | Concurrency | Throughput / replica | Notes |
|---|---|---|---|---|---|
| Chat (7B-class) | Llama 3.1 8B Instruct | FP8 | 16-24 | 1,500-2,100 output TPS | Sweet spot; FP8 KV cache enables 8K-32K context. |
| Chat (7B-class) | Mistral 7B Instruct v0.3 | FP8 | 16-24 | 1,650-2,300 output TPS | Faster than Llama 8B: fewer parameters and a smaller vocabulary (both use grouped-query attn). |
| Chat (7B-class) | Qwen2.5 7B Instruct | FP8 | 16-24 | 1,400-1,900 output TPS | 128K context: KV pressure binds at concurrency > 8. |
| Chat (13B-class, FP8) | Llama 2 13B Chat | FP8 | 8-12 | 650-900 output TPS | Fits 8K context comfortably; 32K is tight. |
| Dense embedding | BGE-M3 / text-embedding-3 equivalent | FP16 | 256-512 | ~4,500 embeddings/sec | Batched short sequences; bandwidth-bound. |
| Reranker | BGE-reranker-large | FP16 | 128-256 | ~3,200 pairs/sec | Same shape as embeddings. |
| Whisper (transcription) | Whisper Large v3 (CT2 INT8) | INT8 | 8-16 streams | 10-12x realtime | faster-whisper backend; bandwidth-bound. |
| OCR | PaddleOCR / TrOCR | FP16 | 32-64 | ~120 pages/sec | Multi-stage pipeline; NVENC/NVDEC unused. |
| SDXL inference | SDXL Base 1.0 | FP16 | 1 (latency) / 4 (throughput) | 0.25-0.40 images/sec | Compute-bound; H100 PCIe is 4x faster. |
| AV1 transcode | 1080p60 H.264 -> AV1 | n/a | 8 streams / NVENC | 16-20 streams per card | Media engines, not Tensor Cores. |
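A throughput row from the table converts to a replica count with a ceiling division. This is a minimal sketch: the 6,000 aggregate output TPS target is a hypothetical workload, the per-replica range is the Llama 3.1 8B row above, and the function name is illustrative. It does not model burstiness, so add headroom for peak traffic.

```python
import math


def replicas_needed(target_tps: float, per_replica_tps: float) -> int:
    """Replicas required to sustain an aggregate output-token rate."""
    return math.ceil(target_tps / per_replica_tps)


TARGET_TPS = 6000           # hypothetical aggregate demand
LOW_TPS, HIGH_TPS = 1500, 2100   # Llama 3.1 8B FP8 range from the table

print("Pessimistic plan:", replicas_needed(TARGET_TPS, LOW_TPS), "replicas")
print("Optimistic plan:", replicas_needed(TARGET_TPS, HIGH_TPS), "replicas")
```

Planning against the low end of the range and validating with a load test is the safer default.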
Cost and TCO
L4 pricing in 2026 is in the $0.40-0.55/GPU-hr band on-demand at hyperscalers, and $0.30-0.45/GPU-hr on Tier-1/2 neoclouds. On the worked figures below, the cost-per-million-output-tokens advantage over H100 SXM5 on 7B chat is roughly 1.3x ($0.065 vs $0.086), and it widens where H100 hourly rates are higher than the $2.00 assumed here. That advantage is why Yobibyte routes the 7B request stream to L4 pools by default.
- Cost-per-million-output-tokens on Llama 3.1 8B FP8, 1x L4 at $0.42/GPU-hr and 1,800 TPS sustained: roughly $0.065 per million tokens before margin — competitive with the cheapest hosted-API rates.
- Same workload on 1x H100 SXM5 at $2.00/GPU-hr and 6,500 TPS sustained: roughly $0.086 per million tokens. H100 wins on latency, L4 wins on cost.
- Embeddings on L4 (BGE-M3 FP16, 4,500/sec): ~$0.025 per million embeddings before margin.
- Reserved 3-year on Yobitel NeoCloud L4 cuts effective $/GPU-hr roughly 40-50 % vs on-demand — commit only if utilisation > 65 %.
- FinOps Foundation FOCUS billing: L4 lands as `ServiceName=AcceleratorCompute`, `SkuId=gpu.l4.pcie` on Yobitel-issued invoices — same schema as H100, so per-workload attribution is uniform across the fleet.
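The per-million figures in the bullets above follow from one formula: hourly rate divided by units produced per hour. The sketch below reproduces them from the rates and throughputs quoted in this section; the function name is hypothetical and the result is cost before margin, at 100 % utilisation.

```python
def cost_per_million(rate_per_gpu_hr: float, units_per_second: float) -> float:
    """Dollars per million output units (tokens or embeddings) at full utilisation."""
    units_per_hour = units_per_second * 3600
    return rate_per_gpu_hr / units_per_hour * 1_000_000


l4_chat = cost_per_million(0.42, 1800)       # L4, Llama 3.1 8B FP8
h100_chat = cost_per_million(2.00, 6500)     # H100 SXM5, same workload
l4_embed = cost_per_million(0.42, 4500)      # L4, BGE-M3 FP16 embeddings

print(f"L4 chat:    ${l4_chat:.3f} per million tokens")
print(f"H100 chat:  ${h100_chat:.3f} per million tokens")
print(f"L4 embed:   ${l4_embed:.3f} per million embeddings")
print(f"H100 / L4 cost ratio: {h100_chat / l4_chat:.2f}x")
```

At lower utilisation the effective cost scales up proportionally, so divide by the expected utilisation fraction when comparing against hosted-API prices.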
| Provider class | SKU | On-demand $/GPU-hr | 1y reserved | 3y reserved | Notes |
|---|---|---|---|---|---|
| Hyperscaler (GCP G2) | L4 | $0.50-0.55 | $0.35-0.40 | $0.25-0.32 | Largest L4 footprint; widely available. |
| Hyperscaler (AWS G6) | L4 | $0.45-0.55 | $0.30-0.38 | $0.22-0.30 | g6.xlarge - g6.48xlarge. |
| Tier-1 neocloud | L4 | $0.35-0.45 | $0.28-0.36 | $0.22-0.30 | Often surplus capacity at favourable rates. |
| Tier-2 neocloud | L4 | $0.30-0.40 | $0.24-0.32 | $0.20-0.26 | Best raw rate; verify chassis density. |
| Yobitel NeoCloud (UK + EU) | L4 (Ada Inference Pool) | $0.40-0.50 | $0.30-0.38 | $0.22-0.30 | NCSC OFFICIAL-aligned; FOCUS-conformant billing. |
| Yobibyte (managed endpoint) | L4-backed 7B endpoint | Per-million-token | Spend-cap | Spend-cap | Customer sees /v1/chat/completions; placement on L4 hidden. |
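The reserved-versus-on-demand decision can be checked with a break-even calculation. This is a sketch assuming a reservation is billed for every hour whether used or not, while on-demand is billed only for hours used. The rates are midpoints picked from the Yobitel NeoCloud row above purely for illustration, and the function names are hypothetical.

```python
def break_even_utilisation(on_demand_rate: float, reserved_rate: float) -> float:
    """Utilisation above which a reservation beats on-demand."""
    return reserved_rate / on_demand_rate


def effective_reserved_rate(reserved_rate: float, utilisation: float) -> float:
    """Cost per hour actually used when the reservation is billed around the clock."""
    return reserved_rate / utilisation


ON_DEMAND = 0.45   # illustrative midpoint of the on-demand band
RESERVED_3Y = 0.26  # illustrative midpoint of the 3-year band

print(f"Break-even utilisation: {break_even_utilisation(ON_DEMAND, RESERVED_3Y):.0%}")
print(f"Effective rate at 65% utilisation: "
      f"${effective_reserved_rate(RESERVED_3Y, 0.65):.2f}/GPU-hr")
```

A commit threshold set somewhat above the raw break-even leaves margin for demand forecasts that turn out optimistic.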
Migration and alternatives
L4 sits in a competitive band against L40S (more VRAM, more compute, higher TDP), A10G (more bandwidth, fewer FLOPS, no FP8), T4 (the predecessor, no FP8, lower throughput), and H100 PCIe (much higher throughput, much higher cost). Picking correctly is mostly about identifying the binding constraint — VRAM, bandwidth, FLOPS, density, or dollars-per-token.
- Two heuristics: pick L40S when a single replica needs to serve 13B+ models or much higher per-replica throughput; pick L4 in every other 7B-class case where density and dollars dominate.
- Yobibyte's router picks L4 vs L40S vs H100 per-request based on the workspace's model and SLA; customers see a single endpoint, the router maps it to the cheapest SKU that meets the latency budget.
- Omniscient Compute treats L4 capacity as commoditised — neocloud arbitrage is straightforward, FOCUS-normalised pricing surfaces the best rate per workspace region.
| From / to | When it pays | Migration effort | Key incompatibility |
|---|---|---|---|
| T4 -> L4 | Need FP8, AV1, or higher throughput per watt | Low (drop-in PCIe upgrade) | Driver R525+; CUDA 12+ |
| L4 -> L40S | Need 48 GB VRAM or 4x compute for 13B-34B | Low in software; chassis power and cooling redesign (350 W vs 72 W) | Power and cooling envelope |
| L4 -> A10G | Bandwidth-bound decode dominates over density | Low (A10G is also PCIe) | A10G has no FP8 — Transformer Engine paths fall back |
| L4 -> H100 PCIe | Need 70B-class serving or latency-critical 7B SLAs | Medium (much higher TDP) | Power, cooling, $/hr step-change |
| L4 -> AMD Radeon Pro V710 | All-AMD strategy | High (ROCm rewrite) | CUDA kernels not portable |
| L4 -> Inferentia 2 | AWS-resident, inference-only, and its model coverage is sufficient | High (Neuron compiler) | Limited model + framework support |
| L40S -> L4 | Density / dollars more important than per-replica throughput | Low | Drop VRAM ceiling from 48 GB to 24 GB |
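The "cheapest SKU that meets the constraint" idea behind these heuristics can be sketched as a small selection function. This is not Yobibyte's router, whose logic is not described here. The L4 and H100 SXM5 rate and throughput values are the worked figures from the cost section; the L40S rate and throughput are placeholders invented for the example, and `pick_sku` is a hypothetical name.

```python
from typing import Optional

# (name, VRAM in GB, $/GPU-hr, sustained output TPS per replica)
SKUS = [
    ("L4", 24, 0.42, 1800),         # rate and TPS from the cost section
    ("L40S", 48, 1.00, 3500),       # PLACEHOLDER rate and TPS, not measured
    ("H100 SXM5", 80, 2.00, 6500),  # rate and TPS from the cost section
]


def pick_sku(required_vram_gb: float, required_tps: float,
             skus=SKUS) -> Optional[str]:
    """Return the cheapest SKU with enough VRAM and per-replica throughput."""
    candidates = [s for s in skus
                  if s[1] >= required_vram_gb and s[3] >= required_tps]
    if not candidates:
        return None  # nothing fits on a single card
    return min(candidates, key=lambda s: s[2])[0]


print(pick_sku(required_vram_gb=12, required_tps=1500))   # 7B-class chat
print(pick_sku(required_vram_gb=30, required_tps=1000))   # larger model
print(pick_sku(required_vram_gb=12, required_tps=5000))   # latency-critical
```

A production router would also weigh latency percentiles, regional availability and queue depth; the sketch only captures the binding-constraint logic.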
Pitfalls / operational notes
L4 issues we see most often in production, ranked by frequency.
- Bandwidth surprise: teams sizing L4 from FLOPS alone over-promise single-replica decode TPS. 300 GB/s GDDR6 vs 600 GB/s on A10G means L4 is bandwidth-bound for LLM decode — verify TPS on InferenceBench before committing.
- VRAM headroom: 24 GB is generous for 7B but tight for 13B in FP8 with long context. Llama 2 13B Chat FP8 + 32K context + concurrency 12 will OOM; size 8K context or drop concurrency.
- PCIe-only: data-parallel multi-card scaling past 8 cards behind a single CPU starves the PCIe root complex. Plan one CPU per 8 L4s for high-throughput inference.
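- No NVLink: rules out tensor-parallel inference for models that do not fit in 24 GB (in practice, anything much above 13B in FP8, and certainly 34B+). L40S is also PCIe-only, so use it when 48 GB on one card is enough, and NVLink-capable parts such as H100 when the model must be split across cards.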
- Low-profile constraint: not every 1U/2U chassis accepts low-profile cards. Specify L4-certified server SKUs (Supermicro SYS-211GT-HNTR, Dell XR8000, HPE DL380a) when procuring; do not assume a generic chassis fits.
- No MIG: a single inference replica owns the whole card. Multi-tenant inference relies on per-process scheduling, not silicon-level isolation — use H100 with MIG if hardware isolation is required.
- Driver / CUDA version drift: FP8 paths in vLLM require CUDA 12.4+. CUDA 12.0 silently falls back to FP16, halving throughput. Pin the container CUDA version explicitly.
- AV1 encoder licensing: NVENC AV1 encode is unrestricted on L4 in 2026 (the previous concurrent-stream cap was lifted in R535+); confirm driver version before sizing video pipelines.
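The version-drift pitfall above is cheap to guard against with a startup check that fails loudly instead of letting the server fall back silently. The sketch below is pure Python and takes the versions as arguments; in a PyTorch-based container they could plausibly come from `torch.version.cuda` and `torch.cuda.get_device_capability()`. The function names are hypothetical, and the thresholds are the CUDA 12.4 and sm_89 figures quoted in this entry.

```python
def parse_version(text: str) -> tuple:
    """Turn '12.4' or '12.4.1' into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in text.split(".")[:2])


def fp8_path_ok(cuda_version: str, compute_capability: tuple,
                min_cuda: tuple = (12, 4), min_cc: tuple = (8, 9)) -> bool:
    """True when both the CUDA runtime and the GPU meet the FP8 requirements."""
    return (parse_version(cuda_version) >= min_cuda
            and tuple(compute_capability) >= min_cc)


def assert_fp8_ready(cuda_version: str, compute_capability: tuple) -> None:
    """Raise at startup rather than serve at half throughput on an FP16 fallback."""
    if not fp8_path_ok(cuda_version, compute_capability):
        raise RuntimeError(
            f"FP8 path unavailable: CUDA {cuda_version}, "
            f"compute capability {compute_capability}"
        )


assert_fp8_ready("12.4", (8, 9))  # passes for an L4 on a pinned CUDA 12.4 image
```

Running the check in the container entrypoint turns a silent throughput regression into a failed deployment that is easy to spot.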
Where this fits in the Yobitel stack
L4 is the default inference SKU for Yobibyte's 7B-class workspaces on Yobitel NeoCloud. When a customer creates a Yobibyte workspace and points it at Llama 3.1 8B, Qwen2.5 7B, Mistral 7B, BGE-M3 embeddings or Whisper Large v3, the placement layer routes the replica to a NeoCloud L4 node by default. The router falls through to L40S or H100 PCIe only when the model size or context length pushes past the L4 envelope, or when the workspace SLA demands lower P99 latency than L4 can hit at the requested concurrency.
Yobitel NeoCloud's L4 footprint sits in the UK and EU sovereign regions with NCSC OFFICIAL-aligned host hardening, FOCUS-conformant per-GPU-hour billing and standard FinOps Foundation cost attribution — the same observability and compliance surface as the H100 fleet, billed at a fraction of the rate. Yobibyte exposes L4-backed endpoints behind an OpenAI-compatible API so customers consume them with no code changes vs an H100-backed equivalent.
InferenceBench publishes L4-vs-L40S-vs-H100-vs-T4 throughput, latency and cost-per-million-output-tokens numbers for every covered 7B-13B open-weight model, with the exact vLLM / TensorRT-LLM flags used to produce each number. The sizing tables in this entry are anchored on InferenceBench L4 runs; the production numbers your team will see in steady state are typically within 10 % of the published figures. Omniscient Compute treats L4 as a commoditised capacity tier — neocloud arbitrage is straightforward and Yobibyte inherits the cheapest qualifying placement per workspace region.