TL;DR
- Unit economics expresses cost in terms of business-meaningful units — cost per inference, per million tokens, per training step, per user-session — rather than per resource.
- For AI workloads, the dominant unit metrics are cost per million input tokens, cost per million output tokens, and cost per training step or per token-of-training-data.
- Tracked over time, unit economics tells you whether scale is improving margins or eroding them, and whether optimisation work is actually moving the metric customers pay for.
- Achievable cost-per-token improves with model quantisation, batching, KV-cache reuse, multi-instance GPU partitioning and choice of accelerator generation.
Why Resource-Level Cost Is Not Enough
Knowing that you spent $50,000 on H100 GPU-hours last month tells you nothing about whether your AI product is sustainable. What matters is how many inferences that $50,000 produced, and whether the price you charge per inference covers it with margin to spare.
Unit economics is the bridge between infrastructure cost and product P&L. It is the metric that lets a product team optimise the right thing — sometimes that is the GPU bill, sometimes it is throughput per GPU, sometimes it is the prompt or model architecture itself.
Common Units
| Unit | Workload | What it tells you |
|---|---|---|
| Cost per million input tokens | LLM inference | Comparable to public model pricing — directly competitive. |
| Cost per million output tokens | LLM inference | Output is typically the dominant cost — autoregressive decode. |
| Cost per request | API-style inference | Aligns with the unit customers are charged for. |
| Cost per training step | Training | Tracks training efficiency irrespective of total run length. |
| Cost per token-of-training-data | Pre-training and continued pre-training | Industry-comparable measure of training cost. |
| Cost per embedding | RAG ingest | Tracks vector-DB economics at corpus scale. |
| Cost per active user | Multi-tenant SaaS | Ties infrastructure cost to product revenue. |
Building the Metric
Unit economics is a quotient: total cost divided by total units over the same window. The numerator is straightforward once billing data is normalised to FOCUS (the FinOps Open Cost and Usage Specification); the denominator requires the application to emit telemetry that counts the units consumed.
- Numerator — fully-loaded infrastructure cost for the workload, including amortised commitments, idle capacity, and a fair share of shared platform costs.
- Denominator — application-emitted counter for the unit (input tokens, output tokens, requests served).
- Window — daily, weekly or monthly; daily gives early signal but is noisy, monthly is smoother but lags.
- Cohort — segment by customer tier, geography, model variant, or deployment configuration to see where economics differ.
Emit token and request counters from the inference server itself rather than from the client code that calls it. Server-side counters are harder to lose and easier to audit against the model provider's own metering.
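The quotient described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `WindowUsage` record, its field names, and the `output_weight` parameter are all hypothetical. In particular, splitting one cost figure into separate input and output rates requires an allocation assumption; here it is a single weight stating how many times more expensive an output token is than an input token, which you would need to calibrate for your own workload.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WindowUsage:
    """Cost and unit counters for one cohort over one reporting window."""
    cohort: str                # hypothetical label, e.g. customer tier or model variant
    effective_cost_usd: float  # numerator: fully-loaded cost for the window
    input_tokens: int          # denominator: server-side counter
    output_tokens: int         # denominator: server-side counter


def cost_per_million(cost_usd: float, units: float) -> Optional[float]:
    """Cost per million units, or None when no units were served."""
    if units <= 0:
        return None
    return cost_usd / units * 1_000_000


def blended_cost_per_million(usage: WindowUsage) -> Optional[float]:
    """Single rate across input and output tokens combined."""
    return cost_per_million(
        usage.effective_cost_usd, usage.input_tokens + usage.output_tokens
    )


def split_cost_per_million(
    usage: WindowUsage, output_weight: float
) -> Optional[Tuple[float, float]]:
    """Return (input rate, output rate) per million tokens.

    Assumes one output token costs `output_weight` times one input token.
    The two rates multiplied by their token counts sum back to the total cost.
    """
    if output_weight <= 0:
        raise ValueError("output_weight must be positive")
    weighted_units = usage.input_tokens + output_weight * usage.output_tokens
    input_rate = cost_per_million(usage.effective_cost_usd, weighted_units)
    if input_rate is None:
        return None
    return input_rate, input_rate * output_weight


# Illustrative figures only.
usage = WindowUsage("enterprise-tier", 50_000.0, 8_000_000_000, 2_000_000_000)
print(blended_cost_per_million(usage))
print(split_cost_per_million(usage, output_weight=4.0))
```

With these made-up figures the blended rate comes to roughly $5 per million tokens. Running the same calculation once per cohort gives the segmented view described in the list above.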
Levers That Move Cost-Per-Token
For LLM inference workloads, the levers that meaningfully improve cost per million tokens are well understood. Each comes with trade-offs in latency, quality, or operational complexity.
- Quantisation — moving from FP16 to FP8 or INT8 can raise throughput per GPU by up to roughly 2x on hardware with native low-precision support, with minimal quality impact for many models.
- Continuous batching — modern inference servers (vLLM, TensorRT-LLM, SGLang) batch concurrent requests at the token level, dramatically improving GPU utilisation.
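- KV-cache reuse — sharing prefix cache across requests with common system prompts, so the shared prefix is not recomputed on every request.
- Speculative decoding — using a small draft model to propose tokens for verification by the large model, so that several tokens can be accepted per large-model forward pass.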
- Multi-Instance GPU (MIG) — partitioning a single H100 or B200 into multiple isolated GPU slices for smaller models.
- Newer-generation accelerators — B200 typically delivers 2-3x the throughput of H100 on transformer inference, more than offsetting the higher unit price for sustained workloads.
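Most of the levers above work through one relationship: cost per million tokens is the hourly price of the GPU divided by the tokens it actually serves in that hour. The sketch below makes that arithmetic explicit. The function names, the `utilisation` parameter, and every price and throughput figure are hypothetical placeholders, not measured values.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float,
    tokens_per_second: float,
    utilisation: float = 1.0,
) -> float:
    """Cost per million tokens for one GPU.

    `utilisation` is the fraction of each hour spent serving traffic,
    so idle capacity raises the effective cost per token.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def is_upgrade_cheaper(
    old_hourly_usd: float,
    old_tokens_per_second: float,
    new_hourly_usd: float,
    new_tokens_per_second: float,
) -> bool:
    """True when the new configuration lowers cost per token at equal utilisation."""
    old_cost = cost_per_million_tokens(old_hourly_usd, old_tokens_per_second)
    new_cost = cost_per_million_tokens(new_hourly_usd, new_tokens_per_second)
    return new_cost < old_cost


# Hypothetical numbers: a $4/hour GPU serving 2,500 tokens/s, then the same
# GPU after an optimisation that doubles throughput.
baseline = cost_per_million_tokens(4.0, 2_500)
optimised = cost_per_million_tokens(4.0, 5_000)
print(f"baseline:  ${baseline:.4f} per million tokens")
print(f"optimised: ${optimised:.4f} per million tokens")

# A newer accelerator pays off only if its throughput gain exceeds its price premium.
print(is_upgrade_cheaper(4.0, 2_500, 7.0, 6_250))
```

The same function covers the hardware comparison: a faster accelerator lowers cost per token only when its throughput multiple is larger than its price multiple, and only if the workload keeps it busy.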
Reporting and Targets
Treat unit economics as a first-class engineering metric, on the same dashboards as latency and error rate. Set explicit targets — for example, 'cost per million output tokens for production chat must remain below $X' — and gate releases that materially worsen the metric.
Forecast the metric alongside usage. A flat cost-per-token at growing usage means the workload is scaling linearly; an improving cost-per-token at growing usage means optimisation is winning; a worsening cost-per-token signals something has regressed.
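The trend reading and the release gate described in this section could be expressed as two small checks. This is a sketch under stated assumptions: the function names, the 2% noise tolerance, and the 5% regression allowance are hypothetical defaults chosen for illustration, and a real gate would take its thresholds from the team's own targets.

```python
def classify_trend(
    previous_cpm: float, current_cpm: float, tolerance: float = 0.02
) -> str:
    """Label the period-over-period change in cost per million tokens.

    Changes smaller than `tolerance` (as a fraction) are treated as noise.
    """
    if previous_cpm <= 0:
        raise ValueError("previous_cpm must be positive")
    change = (current_cpm - previous_cpm) / previous_cpm
    if change > tolerance:
        return "regressing"
    if change < -tolerance:
        return "improving"
    return "flat"


def passes_release_gate(
    candidate_cpm: float,
    baseline_cpm: float,
    target_cpm: float,
    max_regression: float = 0.05,
) -> bool:
    """Allow a release only if it stays under the target and does not
    worsen the current baseline by more than `max_regression`."""
    within_target = candidate_cpm <= target_cpm
    no_material_regression = candidate_cpm <= baseline_cpm * (1 + max_regression)
    return within_target and no_material_regression


# Illustrative values in USD per million output tokens.
print(classify_trend(previous_cpm=10.0, current_cpm=9.0))
print(passes_release_gate(candidate_cpm=9.5, baseline_cpm=8.8, target_cpm=10.0))
```

Checking against both the absolute target and the current baseline is a deliberate choice here: a release can sit under the target and still give back a large share of earlier optimisation gains.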
Yobitel and Unit Economics
Yobibyte exposes per-deployment token counters and cost attribution by default. Customers deploying models through Yobibyte can read cost-per-million-input-tokens and cost-per-million-output-tokens directly from the platform, without having to instrument the inference server separately. For comparative analysis against hyperscaler-hosted models, the same metric methodology applies — FOCUS EffectiveCost over server-side token counts.
References
- FinOps Foundation — Unit Economics · FinOps Foundation
- NVIDIA TensorRT-LLM · NVIDIA
- vLLM project · vLLM