Professional Services · Inference Engineering
Production inference engineered to the numbers that matter
Cost-per-token, p99 latency, GPU utilisation, throughput. Serving stack chosen against your workload trace, not a vendor benchmark. Quantisation validated on your eval set before traffic ever hits it.
Representative engagement
Live · Llama-3 70B · 24 hr workload trace
Cost / 1M tok
$0.81 (-75%)
p99 TTFT
140 ms (-71%)
GPU util.
74% (+95%)
Same model, same hardware, same traffic. Continuous batching + FP8 quantisation + paged attention.
The dashboard that decides it
Engineered to the metrics that move unit economics
Vendor benchmarks measure throughput on idealised prompts. Production wins or fails on what your dashboard reads during a Wednesday lunchtime spike.
Cost-per-token
$ per 1M tok
The unit economics. Quantisation, batching, and serving-stack choice move this by 3–10x. Engineered against your model + traffic mix, not a vendor benchmark.
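As a rough illustration of the arithmetic behind this metric, the sketch below converts a GPU hourly price and a measured per-GPU throughput into dollars per 1M tokens. The function name and the price and throughput figures are hypothetical, chosen only to show how a throughput gain feeds straight into the unit cost.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec_per_gpu: float) -> float:
    """Dollars per 1M generated tokens for one fully billed GPU."""
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical figures, for illustration only (not measured results).
baseline = cost_per_million_tokens(gpu_hourly_usd=3.00, tokens_per_sec_per_gpu=250)
tuned = cost_per_million_tokens(gpu_hourly_usd=3.00, tokens_per_sec_per_gpu=1000)
print(f"baseline ${baseline:.2f}, tuned ${tuned:.2f}, ratio {baseline / tuned:.1f}x")
```

On the same hardware and price, cost-per-token falls in proportion to sustained throughput, which is why batching and quantisation show up directly on this line.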
p99 latency
TTFT + TBT
Time-to-first-token plus time-between-tokens at the 99th percentile, not the median. The number that decides whether a chat UI feels alive.
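To make the median-versus-p99 distinction concrete, here is a minimal sketch that derives TTFT and TBT from per-token timestamps and takes a nearest-rank percentile. The helper names and the sample trace are assumptions for illustration; a real pipeline would read these timings from your serving metrics.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_and_tbt(request_start_ms, token_times_ms):
    """Return (time-to-first-token, gaps between consecutive tokens), all in ms."""
    ttft = token_times_ms[0] - request_start_ms
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    return ttft, gaps


# Hypothetical trace: 98 fast requests and 2 slow ones (TTFT in ms).
ttfts_ms = [100] * 98 + [2000] * 2
print(percentile(ttfts_ms, 50), percentile(ttfts_ms, 99))  # 100 2000
print(ttft_and_tbt(0, [120, 150, 190]))                    # (120, [30, 40])
```

In this made-up trace the median looks healthy while the p99 is twenty times worse, which is the gap a median-only dashboard hides.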
Throughput
tok/sec per GPU
How much work an H100 / H200 / B200 actually does for you. We engineer against measured workload traces, not synthetic prompts.
GPU utilisation
% on the hot path
Most production fleets sit at 30–50% utilisation. Idle GPUs are a cost line with no revenue. We build the autoscaling + batching to fix that.
Operational failure modes
Where production inference quietly degrades
Every production engagement we take on hits some subset of these. The cost line climbs, p99 widens, the on-call gets paged at midnight. Knowing they exist is most of the win.
Batching is not just throughput tuning
What bad looks like
Static max-batch=32
What we design for
Continuous batching with adaptive policy
Static batching holds every request until the longest sequence in its batch finishes, so chasing throughput with bigger batches costs latency. Continuous batching (vLLM / SGLang style) decouples the two so p99 stops creeping when concurrency climbs. We tune the policy against your real traffic, not the example in the docs.
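The difference can be sketched with a toy simulation. It is not how vLLM or SGLang schedule internally: it assumes every request arrives at once, ignores prefill and memory, and counts one decode step per generated token. The function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching(lengths, slots):
    """Completion step per request when each batch runs until its longest sequence ends."""
    done, clock = [], 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        clock += max(batch)  # the whole batch is held until the longest one finishes
        done.extend([clock] * len(batch))
    return done


def continuous_batching(lengths, slots):
    """Completion step per request when freed slots are refilled every decode step."""
    queue = deque(enumerate(lengths))
    active = {}  # request index -> tokens still to generate
    done = [0] * len(lengths)
    clock = 0
    while queue or active:
        while queue and len(active) < slots:
            idx, remaining = queue.popleft()
            active[idx] = remaining
        clock += 1
        for idx in list(active):
            active[idx] -= 1
            if active[idx] <= 0:
                done[idx] = clock
                del active[idx]
    return done


lengths = [2, 8, 2, 2]  # output tokens per request (hypothetical)
print(static_batching(lengths, slots=2))      # [8, 8, 10, 10]
print(continuous_batching(lengths, slots=2))  # [2, 8, 4, 6]
```

In this toy run the short requests stop waiting behind the 8-token one, so tail latency drops without giving up batch occupancy.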
Quantisation has cliffs, not curves
What bad looks like
INT4 dropped accuracy 6 points
What we design for
FP8 / INT8 validated per model
Quantisation looks linear in headline numbers and is anything but in practice. Some model families tolerate INT4; some lose double-digit accuracy. We validate against your eval set before anything ships.
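A validation gate of this kind can be sketched as below: score the baseline and the quantised model on the same eval set and block the rollout if the drop exceeds a budget. The function names, the one-point threshold, and the toy predictions are assumptions for illustration; your eval set and metric will differ.

```python
def accuracy_points(predictions, labels):
    """Accuracy in percentage points (0 to 100)."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100 * correct / len(labels)


def quantisation_gate(baseline_preds, quant_preds, labels, max_drop_points=1.0):
    """Compare baseline and quantised accuracy; pass only if the drop is within budget."""
    base = accuracy_points(baseline_preds, labels)
    quant = accuracy_points(quant_preds, labels)
    drop = base - quant
    return {
        "baseline": base,
        "quantised": quant,
        "drop_points": drop,
        "passed": drop <= max_drop_points,
    }


# Toy data: baseline gets 90 of 100 right, the quantised model gets 84.
labels = ["A"] * 100
baseline_preds = ["A"] * 90 + ["B"] * 10
quant_preds = ["A"] * 84 + ["B"] * 16
report = quantisation_gate(baseline_preds, quant_preds, labels)
print(report)
```

Here the six-point drop fails the gate, so the quantised build would not reach production traffic.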
KV-cache eats your memory budget
What bad looks like
OOM at 6K concurrent
What we design for
PagedAttention + KV-cache budgeting
Long contexts and high concurrency both consume KV-cache. Without paged attention and explicit budgeting, the cluster oversubscribes memory and starts dropping requests under load.
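The budgeting arithmetic can be sketched as follows. The model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is our assumption for a Llama-3 70B-class model, and the FP16 cache and 40 GiB budget are hypothetical; substitute the values from your own model config and hardware.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes needed for one token across all layers."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(kv_budget_bytes, per_token_bytes):
    """How many tokens fit in the memory left over for the KV cache."""
    return kv_budget_bytes // per_token_bytes


# Assumed shape for a Llama-3 70B-class model with an FP16 (2-byte) cache.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
budget = 40 * 1024**3  # hypothetical: 40 GiB left for KV cache after weights
tokens = max_cached_tokens(budget, per_token)

print(per_token)       # 327680 bytes, i.e. 320 KiB per token
print(tokens)          # 131072 tokens in total
print(tokens // 4096)  # 32 concurrent sequences at 4K context each
```

Under these assumptions the cache holds 32 full 4K-token sequences, which is the kind of ceiling an explicit budget surfaces before the cluster hits it under load.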
Multi-tenant noisy neighbour
What bad looks like
p99 doubles when tenant B traffic spikes
What we design for
Tenant-aware admission + isolation
Sharing a serving cluster across tenants is cheap until one tenant's traffic bursts. We design the admission control + per-tenant rate limits so your golden customer's p99 stays stable while the rest fight for the leftovers.
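One common building block for per-tenant rate limits is a token bucket per tenant. The sketch below is a minimal version under stated assumptions: the class names, tenant names, and limits are hypothetical, and the clock is passed in explicitly so the behaviour is deterministic. A production admission layer would also weigh request cost (for example prompt plus expected output tokens) and queue depth.

```python
class TokenBucket:
    """Refills at `rate_per_sec` up to `burst`; each admitted request spends `cost`."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.updated = 0.0

    def allow(self, now, cost=1.0):
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantAdmission:
    """Per-tenant buckets, so one tenant's burst cannot spend another tenant's budget."""

    def __init__(self, limits):
        # limits: tenant name -> (rate_per_sec, burst)
        self.buckets = {t: TokenBucket(r, b) for t, (r, b) in limits.items()}

    def admit(self, tenant, now, cost=1.0):
        bucket = self.buckets.get(tenant)
        return bucket is not None and bucket.allow(now, cost)


admission = TenantAdmission({"gold": (10, 20), "bronze": (1, 2)})
print([admission.admit("bronze", now=0.0) for _ in range(3)])  # [True, True, False]
print(admission.admit("gold", now=0.0))                        # True
```

The bronze tenant exhausts its own burst while the gold tenant's budget is untouched, which is the isolation property the admission design is after.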
Reference serving stacks
We deploy the stack that fits the workload
No single stack wins on every dimension. We pick the one that fits your traffic, your model family, your tenancy model, and your team. Comparison reads in seconds.
vLLM
Default · PagedAttention + continuous batching. The volume-throughput leader for most LLMs.
Fit · OSS LLMs · cost-per-token pressure · multi-model serving
SGLang
Structured-generation primitives + RadixAttention. Strong for agent / RAG workloads.
Fit · Agent + RAG workloads · structured outputs · tool-use
TensorRT-LLM
NVIDIA's tightly-tuned compiler-backed engine. Top per-GPU throughput when you can build engines.
Fit · Single-model · max throughput · stable model spec
Triton Inference Server
Multi-framework, multi-model server. Pairs with TensorRT-LLM as the runtime backend.
Fit · Multi-model fleets · mixed frameworks · enterprise ops
Ray Serve
Python-native distributed serving. Best when inference is part of a larger Ray cluster.
Fit · Ray-native shops · composite pipelines · Python-first
Hugging Face TGI
Open-source server (Apache-2.0, currently in maintenance mode). Familiar to most teams, easy on-ramp.
Fit · Pilots · teams starting out · default for the HF model zoo
| Comparison dimension | What it measures |
|---|---|
| Throughput | Tokens/sec per GPU at target latency |
| Multi-model | Serving many models from one cluster |
| Quantisation | FP8 / INT8 / INT4 support depth |
| Multi-GPU | Tensor + pipeline parallelism |
| Ops surface | Metrics, autoscaling, runbook-friendliness |
We deploy whichever stack fits your workload. Most production builds settle on vLLM as the default and add SGLang or TensorRT-LLM for specific models.
Your handover pack
What lands when we leave the room
Every engagement closes with version-controlled artefacts your team can act on the day after we leave. Not a slide deck. Not a “we'll send the runbook next week.”
If day-two stays with your team, these are the manual. If day-two stays with ours, the same artefacts back the SLA.
Workload trace + benchmark report
We capture your real traffic, replay it against the candidate stacks, and write a benchmark report that compares them at your latency budget. The decision record everyone can argue with on facts, not vendor claims.
Serving-stack decision record
Which stack we deploy, and why. With the alternatives we ruled out and the conditions under which we'd revisit the call.
Quantisation profile + accuracy delta
Per-model accuracy delta on your evaluation set, signed off before anything reaches production traffic.
Autoscaling config + alert pack
The scaling policy as code, the SLO targets, and the alert rules. Tuned to your traffic shape, not a one-size template.
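As a sketch of what "scaling policy as code" can look like, the example below applies a proportional target-tracking rule, the same shape of formula the Kubernetes Horizontal Pod Autoscaler uses, to a queue-depth signal. The `ScalingPolicy` fields, the choice of queue depth as the metric, and the numbers are assumptions for illustration, not a delivered config.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    target_queue_per_replica: float  # desired in-flight requests per replica
    min_replicas: int
    max_replicas: int


def desired_replicas(policy, current_replicas, observed_queue_per_replica):
    """Scale the replica count by observed/target, then clamp to the policy bounds."""
    raw = math.ceil(
        current_replicas * observed_queue_per_replica / policy.target_queue_per_replica
    )
    return max(policy.min_replicas, min(policy.max_replicas, raw))


policy = ScalingPolicy(target_queue_per_replica=8, min_replicas=2, max_replicas=16)
print(desired_replicas(policy, current_replicas=4, observed_queue_per_replica=20))   # 10
print(desired_replicas(policy, current_replicas=4, observed_queue_per_replica=1))    # 2
print(desired_replicas(policy, current_replicas=4, observed_queue_per_replica=100))  # 16
```

Keeping the target and the bounds in a versioned object like this makes the policy reviewable alongside the SLO targets and alert rules it has to agree with.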
Cost-per-token improvement report
Before / after on the same workload. Where the savings came from. What we can't reduce further without changing the model.
Runbook for traffic spikes
What the on-call does when traffic jumps 10x, when a tenant burns through its rate limit, when a model regression slips in. Written down, exercised on a game day.
How we engage
Pick the shape that fits your team
From end-to-end delivery to time-boxed advisory. The scope call confirms which fits; the statement of work names the deliverables.
Yobitel-led
We own the serving stack end-to-end
Workload capture, stack selection, deployment, tuning, observability, and optional 24/7 day-2. Best when you want production inference delivered against a fixed milestone.
Collaborative
We engineer with your team
Paired work on the trickier surfaces: continuous batching tuning, quantisation validation, multi-tenant admission, KV-cache budgeting. Your team executes; we sign off on the design and join the cutover.
Advisory
Time-boxed review
Fixed-window engagement to review your serving design + benchmarks. We spot risk, suggest a focused set of changes, deliver a written report.
Related
Network fabrics for AI clusters
The east-west fabric design that decides whether your inference cluster can actually reach the throughput numbers above.
Related
Platform layer for AI GPU clouds
Total-estate platform delivery across bare metal, VMs, and containers. The layer your inference cluster runs on.
Tell us what your inference cluster is doing today.
A short questionnaire covers workload, performance targets, and engagement model. Our inference practice lead replies inside one working day with a fitted serving-stack recommendation and a workload-trace plan you can take to your CFO.
Same engineering bench that designs the fabric and the platform layer below your inference cluster. Engagements scoped to any sovereignty perimeter (NCSC, GDPR, FedRAMP, MeitY, and beyond). Optional 24/7 day-2 handover available.