Unit Economics for AI Workloads

TL;DR

Unit economics expresses cost in terms of business-meaningful units — cost per inference, per million tokens, per training step, per user-session — rather than per resource.
For AI workloads, the dominant unit metrics are cost per million input tokens, cost per million output tokens, and cost per training step or per token-of-training-data.
Tracked over time, unit economics tells you whether scale is improving margins or eroding them, and whether optimisation work is actually moving the metric customers pay for.
Achievable cost-per-token improves with model quantisation, batching, KV-cache reuse, multi-instance GPU partitioning and choice of accelerator generation.

Why Resource-Level Cost Is Not Enough

Knowing that you spent $50,000 on H100 GPU-hours last month tells you nothing about whether your AI product is sustainable. What matters is how many inferences that $50,000 produced, and whether the price you charge per inference covers it with margin to spare.
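The arithmetic behind that question can be sketched roughly as follows. The $50,000 figure comes from the text above; the inference count and the price per inference are hypothetical placeholders, and `unit_margin` is an illustrative helper rather than part of any library.

```python
def unit_margin(total_cost: float, units: int, price_per_unit: float) -> dict:
    """Return cost per unit and gross margin for one billing period."""
    if units <= 0:
        raise ValueError("units must be positive")
    if price_per_unit <= 0:
        raise ValueError("price_per_unit must be positive")
    cost_per_unit = total_cost / units
    gross_margin = (price_per_unit - cost_per_unit) / price_per_unit
    return {"cost_per_unit": cost_per_unit, "gross_margin": gross_margin}


# $50,000 of GPU spend (from the text). The 25 million inferences and the
# $0.004 price per inference are hypothetical values for illustration.
result = unit_margin(total_cost=50_000, units=25_000_000, price_per_unit=0.004)
print(f"cost per inference: ${result['cost_per_unit']:.4f}")
print(f"gross margin: {result['gross_margin']:.0%}")
```

With these assumed inputs the same $50,000 bill could be healthy or ruinous depending only on the denominator, which is the point of the paragraph above.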

Unit economics is the bridge between infrastructure cost and product P&L. It is the metric that lets a product team optimise the right thing — sometimes that is the GPU bill, sometimes it is throughput per GPU, sometimes it is the prompt or model architecture itself.

Common Units

Unit	Workload	What it tells you
Cost per million input tokens	LLM inference	Comparable to public model pricing — directly competitive.
Cost per million output tokens	LLM inference	Output tokens typically dominate cost because autoregressive decode generates them one at a time.
Cost per request	API-style inference	Aligns with the unit customers are charged for.
Cost per training step	Training	Tracks training efficiency irrespective of total run length.
Cost per token-of-training-data	Pre-training and continued pre-training	Industry-comparable measure of training cost.
Cost per embedding	RAG ingest	Tracks vector-DB economics at corpus scale.
Cost per active user	Multi-tenant SaaS	Ties infrastructure cost to product revenue.

Building the Metric

A unit-economics metric is a quotient: total cost divided by total units over the same window. The numerator is straightforward once FOCUS-normalised billing is in place; the denominator requires the serving stack to emit telemetry that counts the units consumed.

Numerator: fully-loaded infrastructure cost for the workload, including amortised commitments, idle capacity, and a fair share of shared platform costs.
Denominator: a counter for the unit (input tokens, output tokens, requests served), emitted by the serving stack.
Window: daily, weekly, or monthly; daily gives an early signal but is noisy, monthly is smoother but lags.
Cohort: segment by customer tier, geography, model variant, or deployment configuration to see where the economics differ.
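The four ingredients above can be combined in a few lines. This is a minimal sketch under stated assumptions: `EffectiveCost` is the FOCUS column name, while the `day` and `cohort` keys, the row layout, and all figures are hypothetical (in real FOCUS data they would have to be derived from the charge period and from tags).

```python
from collections import defaultdict


def cost_per_million(cost_rows, usage_rows, unit="output_tokens"):
    """Cost per million units, keyed by (day, cohort)."""
    cost = defaultdict(float)
    units = defaultdict(int)
    for row in cost_rows:          # numerator: fully-loaded cost
        cost[(row["day"], row["cohort"])] += row["EffectiveCost"]
    for row in usage_rows:         # denominator: emitted unit counters
        units[(row["day"], row["cohort"])] += row[unit]
    result = {}
    for key, total_cost in cost.items():
        n = units.get(key, 0)
        # No usage recorded: report None rather than dividing by zero.
        result[key] = total_cost * 1_000_000 / n if n else None
    return result


# Hypothetical daily window with two cohorts.
cost_rows = [
    {"day": "2025-01-01", "cohort": "enterprise", "EffectiveCost": 1200.0},
    {"day": "2025-01-01", "cohort": "enterprise", "EffectiveCost": 300.0},  # shared platform share
    {"day": "2025-01-01", "cohort": "free", "EffectiveCost": 400.0},
]
usage_rows = [
    {"day": "2025-01-01", "cohort": "enterprise", "output_tokens": 500_000_000},
    {"day": "2025-01-01", "cohort": "free", "output_tokens": 100_000_000},
]
metric = cost_per_million(cost_rows, usage_rows)
for key, value in sorted(metric.items()):
    print(key, value)
```

Widening the window only changes how rows are bucketed; the quotient itself stays the same.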

Emit token and request counters from the inference server itself, not from the client application. Server-side counters are harder to lose and easier to audit against the model provider's own metering.
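One way to picture server-side counting is a small in-process counter that the request handler updates once per completed request. The class, method names, and model name below are hypothetical; a real deployment would more likely export these values through whatever metrics library the inference server already uses.

```python
import threading
from collections import Counter


class UsageCounters:
    """Thread-safe, in-process counters keyed by (model, unit)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Call once per completed request, on the serving side."""
        with self._lock:
            self._counts[(model, "input_tokens")] += input_tokens
            self._counts[(model, "output_tokens")] += output_tokens
            self._counts[(model, "requests")] += 1

    def snapshot(self) -> dict:
        """Return a copy of the current totals for export or audit."""
        with self._lock:
            return dict(self._counts)


counters = UsageCounters()
counters.record("chat-small", input_tokens=812, output_tokens=164)
counters.record("chat-small", input_tokens=1_024, output_tokens=256)
print(counters.snapshot())
```

Because the totals accumulate where the tokens are actually processed, they do not depend on every client remembering to report its own usage.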

Levers That Move Cost-Per-Token

For LLM inference workloads, the levers that meaningfully improve cost per million tokens are well understood. Each comes with trade-offs in latency, quality, or operational complexity.

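Quantisation: moving from FP16 to FP8 or INT8 can raise throughput per GPU by up to roughly 2x on hardware with native support, with minimal quality impact for many models; validate quality per model before relying on it.
Continuous batching: modern inference servers (vLLM, TensorRT-LLM, SGLang) schedule concurrent requests at the token level rather than waiting for a whole batch to finish, which substantially improves GPU utilisation.
KV-cache reuse: sharing the cached prefix across requests that start with the same system prompt, so the shared prefix is not recomputed for each request.
Speculative decoding: using a small draft model to propose several tokens that the large model then verifies in one pass; the gain depends on how often the draft tokens are accepted.
Multi-Instance GPU (MIG): partitioning a single H100 or B200 into multiple isolated GPU instances so that several smaller models can share one physical GPU.
Newer-generation accelerators: B200 is typically reported to deliver 2-3x the throughput of H100 on transformer inference, which can more than offset the higher unit price for sustained workloads; the actual ratio depends on model, precision, and utilisation.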
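Most of these levers act on the same simple relationship: cost per million tokens is the hourly price of the accelerator divided by the tokens it actually serves in an hour. The sketch below makes that explicit. Every price and throughput figure is a hypothetical placeholder, not a benchmark, and the function names are illustrative.

```python
def cost_per_million_tokens(hourly_price: float, tokens_per_second: float,
                            utilisation: float = 1.0) -> float:
    """Cost per million tokens for one accelerator at a given throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * utilisation * 3600
    return hourly_price * 1_000_000 / tokens_per_hour


def break_even_speedup(old_hourly_price: float, new_hourly_price: float) -> float:
    """Throughput multiple a pricier accelerator needs to match the old cost per token."""
    return new_hourly_price / old_hourly_price


# Hypothetical figures for illustration only.
baseline = cost_per_million_tokens(hourly_price=3.00, tokens_per_second=2_500)
quantised = cost_per_million_tokens(hourly_price=3.00, tokens_per_second=4_500)
newer_gpu = cost_per_million_tokens(hourly_price=6.00, tokens_per_second=6_000)

print(f"baseline:  ${baseline:.3f} per million tokens")
print(f"quantised: ${quantised:.3f} per million tokens")
print(f"newer GPU: ${newer_gpu:.3f} per million tokens")
print(f"break-even speedup at 2x price: {break_even_speedup(3.00, 6.00):.1f}x")
```

Under these assumed numbers the newer accelerator only wins if its measured speedup exceeds the price ratio, which is why the throughput figure should come from a benchmark of the actual workload.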

Reporting and Targets

Treat unit economics as a first-class engineering metric, on the same dashboards as latency and error rate. Set explicit targets — for example, 'cost per million output tokens for production chat must remain below $X' — and gate releases that materially worsen the metric.
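A release gate of this kind can be sketched as a small check run in the deployment pipeline. The $2.00 target, the 5% regression tolerance, and the function name are hypothetical placeholders; the real values would come from the product's own pricing and margin targets.

```python
def gate_release(baseline: float, candidate: float, target: float,
                 max_regression: float = 0.05) -> tuple:
    """Decide whether a release may ship, given cost per million output tokens.

    baseline  - metric for the currently deployed version
    candidate - metric measured for the release under test
    target    - absolute ceiling the metric must stay below
    """
    if candidate > target:
        return False, f"candidate {candidate:.2f} exceeds target {target:.2f}"
    if baseline > 0:
        regression = (candidate - baseline) / baseline
        if regression > max_regression:
            return False, f"regression of {regression:.0%} exceeds {max_regression:.0%}"
    return True, "ok"


# Hypothetical measurements in dollars per million output tokens.
print(gate_release(baseline=1.50, candidate=1.52, target=2.00))
print(gate_release(baseline=1.50, candidate=1.80, target=2.00))
print(gate_release(baseline=1.50, candidate=2.10, target=2.00))
```

Checking both an absolute ceiling and a relative regression catches slow drift as well as a single release that breaches the target outright.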

Forecast the metric alongside usage. A flat cost-per-token at growing usage means cost is scaling linearly with usage; an improving cost-per-token at growing usage means optimisation or economies of scale are winning; a worsening cost-per-token means something has regressed, for example utilisation, batching efficiency, or the amount of idle capacity.
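The three cases can be told apart from two period-over-period growth rates. The sketch below is one possible formulation; the 2% tolerance band that counts as "flat" and the function name are assumptions.

```python
def classify_trend(usage_growth: float, cost_growth: float,
                   tolerance: float = 0.02) -> tuple:
    """Classify the unit-cost trend from fractional growth rates.

    usage_growth and cost_growth are period-over-period changes,
    for example 0.20 for +20%.
    """
    if usage_growth <= -1:
        raise ValueError("usage_growth must be greater than -1")
    # Unit cost scales with (cost / usage), so its change is the ratio of the two.
    unit_cost_change = (1 + cost_growth) / (1 + usage_growth) - 1
    if abs(unit_cost_change) <= tolerance:
        label = "flat"
    elif unit_cost_change < 0:
        label = "improving"
    else:
        label = "worsening"
    return label, unit_cost_change


# Hypothetical months in which usage grows 20% each time.
for cost_growth in (0.20, 0.08, 0.32):
    label, change = classify_trend(usage_growth=0.20, cost_growth=cost_growth)
    print(f"cost {cost_growth:+.0%}: {label} ({change:+.1%} per unit)")
```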

Yobitel and Unit Economics

Yobibyte exposes per-deployment token counters and cost attribution by default. Customers deploying models through Yobibyte can read cost-per-million-input-tokens and cost-per-million-output-tokens directly from the platform, without having to instrument the inference server separately. For comparative analysis against hyperscaler-hosted models, the same metric methodology applies — FOCUS EffectiveCost over server-side token counts.

References

FinOps Foundation — Unit Economics · FinOps Foundation
NVIDIA TensorRT-LLM · NVIDIA
vLLM project · vLLM
