TL;DR
- Yobibyte is Yobitel's fully-managed AI-native platform service — a Yobitel-operated workspace surface where customers deploy models, fine-tune adapters, and consume an OpenAI-compatible inference API without running any of the underlying infrastructure themselves.
- Sovereignty is a first-class control: every workspace pins to a Yobitel-operated region — UK (NCSC OFFICIAL), EU (Data Boundary), or US (FedRAMP-equivalent) — and admission refuses to spill workloads across those boundaries regardless of capacity or price.
- Accelerator coverage spans NVIDIA B300/B200/H200/H100/A100/L40S/L4/A10G/T4 and AMD MI300X; Yobibyte selects the right SKU on the customer's behalf using the marketplace recipe attached to each model.
- Customer-facing surfaces: workspaces with OIDC SSO and SCIM, an OpenAI-compatible API on the standard `openai` SDK, declarative `Inference`, `FineTune`, and `AIApplication` resources, a marketplace covering both models and Yobitel AI Applications (MediQuery and the rest of the vertical suite), FOCUS 1.1 billing export, customer-owned KMS, customer-owned object storage.
- Yobitel operates the runtime end-to-end; the customer never installs, upgrades, or operates the inference engines, schedulers, or clusters that sit underneath.
Overview#
Most teams that want to use large language models do not want to become an MLOps shop. They want to deploy a model, get an OpenAI-compatible URL, point an application at it, and have something predictable handle GPU capacity, identity, autoscaling, observability, and billing. That problem space is what Yobibyte is built for — a managed consumption surface that turns 'I need a Llama 3.1 70B endpoint in the UK with a $5,500/month spend cap' into a deployed, observable, billed endpoint without anyone on the customer side touching a Kubernetes cluster.
Yobibyte is Yobitel's fully-managed AI-native platform service. Customers consume it as workspaces. Inside a workspace they declare `Inference` resources (deployed model endpoints), `FineTune` resources (managed adapter-training jobs), and `AIApplication` resources (first-party Yobitel applications such as MediQuery deployed into the workspace and configured against the customer's own data). They browse the Yobibyte marketplace for both vetted models and Yobitel AI Applications, federate identity from their existing OIDC provider, set per-resource and per-workspace spend caps, and pull a FOCUS 1.1 billing export into their FinOps tooling. The inference surface is OpenAI-compatible, so the standard `openai` Python and TypeScript SDKs work unchanged against a Yobibyte endpoint — only the base URL changes.
Compared with AWS Bedrock and Google Vertex AI, the operational shape is similar (you consume models as a service, you do not run clusters), but Yobibyte's differentiator is multi-region sovereignty pinning across Yobitel-operated regions. A workspace is bound at creation to UK (NCSC OFFICIAL alignment), EU (EU Data Boundary), or US (FedRAMP-equivalent), and admission refuses to place a workload outside that boundary regardless of capacity or price. Bedrock and Vertex AI are anchored to their parent cloud's region map; Yobibyte runs across Yobitel NeoCloud plus Yobitel-managed presence on partner cloud regions, with the same workspace API across all of them.
Yobitel Communications — a UK-headquartered AI infrastructure company and NVIDIA Inception partner — runs the platform end-to-end. Yobitel operates the runtime, the GPU fleet, the schedulers, the inference engines, and the upgrade cadence. The customer owns the data: models, datasets, fine-tuned adapter weights, and prompt/response logs live in customer-controlled object storage encrypted with customer-managed KMS keys. The customer never installs, upgrades, or operates Yobibyte itself.
Quick start#
The customer experience for getting an OpenAI-compatible endpoint live on Yobibyte has five steps and no Kubernetes. Sign in to the Yobibyte console with your corporate identity provider — OIDC federation is configured once per workspace, and the console will accept the SSO bounce on first login. Create a workspace and choose a sovereignty region at creation time: UK for NCSC OFFICIAL alignment, EU for the EU Data Boundary, or US for FedRAMP-equivalent controls. The region is bound to the workspace and cannot be silently changed afterwards.
Open the marketplace, find the model you want (say Llama 3.1 70B Instruct), and click Deploy. The deploy dialog asks for three things: replicas (minimum and maximum), a spend cap in your currency, and an optional autoscaling profile. Yobibyte fills in the serving defaults from the marketplace recipe attached to the model; you do not pick the inference engine, the tensor-parallel size, the quantisation, or the GPU SKU. Individual recipe values such as max context length or sampling defaults can be overridden later via the `overrides` block on the `Inference` resource, but engine selection and core serving flags stay with Yobibyte.
Once the endpoint reaches Ready (typically 60–180 seconds for an in-region model, longer for the first deploy of a cold model), copy the endpoint URL and mint an API key in the workspace's Keys tab. The endpoint speaks the OpenAI Chat Completions and Embeddings API, so the standard `openai` SDK works against it unchanged — only `base_url` and `api_key` differ.
```bash
# Install the standard OpenAI SDK
pip install openai

# Point it at your Yobibyte workspace endpoint
export OPENAI_API_KEY="ybt_live_..."   # from the workspace Keys tab
export OPENAI_BASE_URL="https://acme.yobitel.app/v1"

python - <<'PY'
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello from Yobibyte"}],
)
print(resp.choices[0].message.content)
PY
```

The same `base_url` swap works with the TypeScript `openai` SDK, LangChain, LlamaIndex, and Vercel AI SDK: any client that targets the OpenAI Chat Completions shape works against a Yobibyte endpoint with no code change beyond the URL and key.
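For clients that cannot take the SDK as a dependency, the same call can be made as a plain HTTPS POST. The sketch below is a minimal illustration, assuming the endpoint follows the standard OpenAI Chat Completions wire format (`POST {base_url}/chat/completions` with a bearer token). The base URL and key are the placeholder values from the quick start, and `build_chat_request` is a hypothetical helper name, not part of any Yobibyte SDK.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "https://acme.yobitel.app/v1",  # placeholder workspace endpoint
    "ybt_live_...",                 # placeholder key from the Keys tab
    "llama-3.1-70b-instruct",
    "Hello from Yobibyte",
)
print(req.get_method(), req.full_url)
# To send it: json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"]
```

Building the request separately from sending it keeps the example runnable offline and makes the URL, headers, and body easy to inspect before any traffic reaches the endpoint.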
Concepts#
Yobibyte exposes a small set of concepts that map to how customers think about AI workloads. The mental model is workspace at the top, with `Inference`, `FineTune`, and `AIApplication` resources and marketplace items inside it, and identity, sovereignty, and spend governing what is allowed to happen inside the workspace boundary.
- Workspace — the tenant boundary, the billing root, the OIDC anchor, and the sovereignty region binding. Every resource belongs to exactly one workspace; quotas, spend caps, audit logs, and the FOCUS billing export are scoped to it.
- Inference — a deployed model endpoint. The customer supplies model name, region, replica range, spend cap, and an autoscaling profile; Yobibyte selects the inference engine, GPU SKU, and serving flags from the marketplace recipe. The endpoint speaks OpenAI-compatible HTTP.
- FineTune — a managed adapter-training job. The customer supplies a base model, a dataset URI (s3://, azure://, or gs://), a fine-tune method (QLoRA / LoRA / full SFT), a hyperparameter preset or a small set of high-level knobs, and an output bucket with a KMS key. Yobibyte runs the job and writes the adapter back to the customer's bucket.
- AI Application — a Yobitel-published first-party application (MediQuery for clinical workflows, and the rest of the vertical AI Applications suite) deployed into a workspace. The customer configures the application — data sources, knowledge bases, user roles, branding — rather than the underlying models. The application composes `Inference` and `FineTune` resources internally; Yobibyte runs them on the customer's behalf, and the customer's interface is the application's configuration surface, not the model surface. Partner-published applications appear in the marketplace alongside Yobitel's own.
- Marketplace — a curated catalogue of two things: models and AI Applications. Model entries carry sovereignty, licence, and hardware-compatibility metadata, ranked using InferenceBench data, and ship with the serving defaults Yobibyte uses when the model is deployed. AI Application entries carry sovereignty, licence, industry/vertical, and integration metadata, with first-party Yobitel apps featured alongside partner-published apps. Customer-private models and customer-private application packages live alongside the curated catalogue but are scoped to the workspace they were uploaded to.
- Sovereignty Region — the compliance pin chosen at workspace creation. UK regions align to NCSC Cloud Security Principles and the OFFICIAL classification; EU regions sit inside the EU Data Boundary; US regions align to FedRAMP-equivalent controls. Admission refuses to place workloads outside the workspace's region.
- Spend Cap — a hard budget set per Inference and per Workspace. Yobibyte tracks accrued cost against the FOCUS-shaped cost stream and enforces at hourly granularity; when the cap is reached, replicas pause (they are not deleted) and a P2 alert fires on the configured channel.
- Identity — workspaces federate from any OIDC-compliant IdP (Okta, Microsoft Entra ID, Auth0, Keycloak, Google Workspace tested). SCIM 2.0 provisions users and groups. Permissions follow owner / maintainer / viewer plus a fine-grained resource-level set (`inference:deploy`, `finetune:read`, `workspace:billing:read`).
Inference resource reference#
Customers who prefer a declarative workflow can submit `Inference` resources directly via the Yobibyte API or commit them to Git. The table and example below cover the core customer-facing fields: what Yobibyte needs to deploy and operate an endpoint on the customer's behalf. Engine selection, GPU SKU selection, and serving flags are not customer fields; they come from the marketplace recipe attached to the model name. Customers who need to override a recipe value can do so via the `overrides` block, but the defaults are correct for the vast majority of workloads.
| Field | Type | Description |
|---|---|---|
| spec.model | string | Marketplace model name. Yobibyte resolves the serving recipe (engine, SKU, flags) from this. |
| spec.region | string | Sovereignty region; must equal the workspace's bound region (e.g. `uk-london`, `eu-frankfurt`, `us-ashburn`). |
| spec.replicas.min | integer | Minimum warm replicas. Set to 1 for any latency-sensitive workload; 0 enables scale-to-zero at the cost of a cold start. |
| spec.replicas.max | integer | Upper bound for autoscaling. Quotas still apply; see Limits and quotas. |
| spec.autoscaling.scaleMetric | enum | What Yobibyte scales on: `tokensPerSecond`, `requestsPerSecond`, or `concurrency`. |
| spec.autoscaling.target | number | Target value of the scale metric per replica. Crossing the target triggers a scale-out. |
| spec.spendCap.amount | number | Hard budget in the workspace's currency. When reached, replicas pause and a P2 alert fires. |
| spec.spendCap.currency | string | Currency code for `amount` (`USD` in the example below). |
| spec.spendCap.window | enum | `hourly`, `daily`, or `monthly`. The cap resets at the end of each window. |
| spec.accessControl.apiKeys | string[] | Logical API-key names; the actual key material is minted in the workspace Keys tab and rotated independently. |
| spec.accessControl.allowedOriginCidrs | string[] | Optional client-IP allow-list, evaluated at the gateway. |
| spec.observability.otelEndpoint | string | Customer-owned OTel collector endpoint; Yobibyte ships traces to it in addition to the hosted view. |
| spec.observability.prometheusScrape | boolean | Exposes the standard Yobibyte metric set on a workspace-scoped scrape endpoint for the customer's own Prometheus. |
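The field constraints in the table can be checked client-side before a resource is submitted or committed to Git. The sketch below is an illustration only: `validate_inference` is a hypothetical helper, the rule `0 <= min <= max` is inferred from the field descriptions, and the real admission checks Yobibyte applies may be stricter or different.

```python
import ipaddress

SCALE_METRICS = {"tokensPerSecond", "requestsPerSecond", "concurrency"}
CAP_WINDOWS = {"hourly", "daily", "monthly"}


def validate_inference(spec: dict, workspace_region: str) -> list:
    """Return a list of problems found in an Inference spec (empty means none found)."""
    problems = []
    if spec.get("region") != workspace_region:
        problems.append("region must equal the workspace's bound region")

    replicas = spec.get("replicas", {})
    if not 0 <= replicas.get("min", 0) <= replicas.get("max", 0):
        problems.append("replicas must satisfy 0 <= min <= max")

    autoscaling = spec.get("autoscaling")  # optional in the deploy dialog
    if autoscaling is not None:
        if autoscaling.get("scaleMetric") not in SCALE_METRICS:
            problems.append("unknown autoscaling.scaleMetric")
        if autoscaling.get("target", 0) <= 0:
            problems.append("autoscaling.target must be positive")

    cap = spec.get("spendCap", {})
    if cap.get("amount", 0) <= 0 or cap.get("window") not in CAP_WINDOWS:
        problems.append("spendCap needs a positive amount and a valid window")

    for cidr in spec.get("accessControl", {}).get("allowedOriginCidrs", []):
        try:
            ipaddress.ip_network(cidr)  # raises ValueError on malformed CIDRs
        except ValueError:
            problems.append(f"invalid CIDR: {cidr}")
    return problems


spec = {
    "model": "llama-3.1-70b-instruct",
    "region": "uk-london",
    "replicas": {"min": 1, "max": 6},
    "autoscaling": {"scaleMetric": "tokensPerSecond", "target": 1800},
    "spendCap": {"amount": 5500, "currency": "USD", "window": "monthly"},
    "accessControl": {"allowedOriginCidrs": ["10.0.0.0/8"]},
}
print(validate_inference(spec, "uk-london"))  # prints []
```

Running a check like this in CI catches a region mismatch or an inverted replica range before the resource reaches admission.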
```yaml
apiVersion: yobibyte.yobitel.com/v1
kind: Inference
metadata:
  name: support-bot
  workspace: acme-uk
spec:
  model: llama-3.1-70b-instruct   # marketplace model name
  region: uk-london               # must match the workspace region
  replicas:
    min: 1
    max: 6
  autoscaling:
    scaleMetric: tokensPerSecond  # tokensPerSecond | requestsPerSecond | concurrency
    target: 1800
  spendCap:
    amount: 5500
    currency: USD
    window: monthly
  accessControl:
    apiKeys: [ "support-bot-prod" ]
    allowedOriginCidrs: [ "10.0.0.0/8" ]
  observability:
    otelEndpoint: "https://otel.acme.com:4317"
    prometheusScrape: true
```

FineTune resource reference#
`FineTune` is the managed adapter-training surface. The customer points at a dataset and a base model, picks a method and a hyperparameter preset, and supplies an output bucket and KMS key — Yobibyte runs the job, ships metrics, and writes the resulting adapter back to the customer's bucket. The customer never picks the training framework or schedules nodes manually.
| Field | Type | Description |
|---|---|---|
| spec.baseModel | string | Marketplace model name to fine-tune from. |
| spec.region | string | Sovereignty region; must match the workspace. |
| spec.dataset.uri | string | `s3://`, `azure://`, or `gs://` URI to the training dataset; access is via a customer-supplied role/identity. |
| spec.dataset.format | enum | `chat`, `completion`, `instruction`, or `raw`. |
| spec.method | enum | `qlora` (memory-efficient, 4-bit base + 16-bit adapters), `lora` (full-precision adapters), or `full` (full SFT). |
| spec.hyperparameters.preset | enum | `fast`, `balanced`, or `quality` — sets sensible defaults; override individual knobs below if you need to. |
| spec.hyperparameters.epochs | integer | Number of passes over the dataset. |
| spec.hyperparameters.learningRatePreset | enum | `conservative`, `standard`, or `aggressive`. |
| spec.hyperparameters.rank | integer | LoRA/QLoRA rank. Higher rank = more capacity, larger adapter. |
| spec.output.bucket | string | Customer-owned bucket the adapter is written to. |
| spec.output.kmsKey | string | Customer-managed KMS key the adapter is encrypted with at rest. |
| spec.spendCap.amount | number | Hard budget for the job. When reached, the job is paused and surfaced for review; no silent overrun. |
| spec.spendCap.currency | string | Currency code for `amount` (`USD` in the example below). |
```yaml
apiVersion: yobibyte.yobitel.com/v1
kind: FineTune
metadata:
  name: support-llama-v3
  workspace: acme-uk
spec:
  baseModel: llama-3.1-70b-instruct
  region: uk-london
  dataset:
    uri: s3://acme-ml/support-tickets-2025-q4.jsonl
    format: chat
  method: qlora                   # qlora | lora | full
  hyperparameters:
    preset: balanced              # fast | balanced | quality
    epochs: 3
    learningRatePreset: standard
    rank: 64
  output:
    bucket: s3://acme-ml/adapters/support-v3
    kmsKey: arn:aws:kms:eu-west-2:111122223333:key/abcd-...
  spendCap:
    amount: 1500
    currency: USD
```

The `balanced` preset is the right starting point for 90% of fine-tunes; use `fast` for iteration on a held-out evaluation set and `quality` when the adapter is going to production and a few extra GPU-hours are cheaper than another iteration cycle.
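The effect of `spec.hyperparameters.rank` on adapter size can be estimated with simple arithmetic: a rank-r LoRA adapter on a weight matrix of shape d_out × d_in adds r · (d_in + d_out) parameters. The sketch below assumes adapters on the four attention projections only and uses the published Llama 3.1 70B shapes (hidden size 8192, 80 layers, 8 KV heads of dimension 128). This document does not state which modules Yobibyte adapts, so the target list is an assumption and the helper names are hypothetical.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter factorises the update as B (d_out x r) @ A (r x d_in)."""
    return rank * (d_in + d_out)


def adapter_params(rank: int, hidden: int = 8192, kv_dim: int = 1024, layers: int = 80) -> int:
    # Assumed targets: q_proj and o_proj (hidden x hidden), k_proj and v_proj (hidden -> kv_dim).
    per_layer = (
        2 * lora_params(hidden, hidden, rank)
        + 2 * lora_params(hidden, kv_dim, rank)
    )
    return per_layer * layers


for rank in (16, 64, 128):
    params = adapter_params(rank)
    size_mb = params * 2 / 1e6  # 2 bytes per parameter for 16-bit adapter weights
    print(f"rank={rank:>3}: {params / 1e6:.1f}M params, ~{size_mb:.0f} MB")
```

Adapter size grows linearly with rank under these assumptions, so doubling the rank roughly doubles what is written to `spec.output.bucket`.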
Marketplace#
The marketplace is where customers find what they can deploy into a workspace, and it covers two distinct catalogues that share a common shopfront: open-weight models, and Yobitel AI Applications. The same filters (sovereignty, licence, region, popularity, last-updated) apply to both; the result detail page differs in shape according to what is being deployed.
On the model side, every entry carries sovereignty metadata (which regions can host it), licence metadata (commercial use, redistribution, attribution), hardware-compatibility metadata (which accelerator families it fits), and an InferenceBench-sourced ranking on price per million tokens and tokens per second at representative input/output shapes. Filters narrow by sovereignty (only models that can be hosted in the workspace region), licence (only models cleared for commercial use), family (Llama, Qwen, Mistral, Phi, Gemma, etc.), and modality (text, embeddings, vision, multimodal, speech). The result list shows the InferenceBench rank, the per-million-token price band at default replicas, and the smallest GPU configuration the model fits on.
On the AI Application side, every entry carries sovereignty and residency metadata, licence and commercial terms, industry/vertical tags (clinical, financial-services, manufacturing, retail, telco, public-sector, etc.), integration metadata (which data sources, identity systems, and downstream tools the application natively connects to), and a deployment footprint (which Inferences and FineTunes the application provisions on deploy, and the spend-cap implications). Yobitel-first-party applications such as MediQuery are featured alongside partner-published applications; the deploy dialog asks the customer to configure the data sources, knowledge bases, RBAC, and branding rather than to pick models or serving flags.
Customer-private models and customer-private application packages live in the marketplace alongside the curated catalogues but are scoped to the workspace they were uploaded to. Bring-your-own model uploads go directly to a customer-owned object bucket; Yobibyte registers the entry and stamps the recipe metadata it needs to serve the model, but the weights stay in the customer's storage. Customer-built applications follow the same pattern: package + manifest land in the workspace, Yobibyte registers them, and the runtime is operated on the customer's behalf.
Customers do not interact with the internal recipe shape on either side. For models, the marketplace supplies sensible serving defaults that work for the listed model on the listed accelerators; customers can override individual fields (max context length, replica concurrency, sampling defaults) via the `overrides` block on the `Inference` resource, but engine selection and core serving flags are Yobibyte's responsibility. For applications, customers configure the application's surface (data, identity, RBAC, branding) and the Inferences and FineTunes the application provisions are operated by Yobibyte.
Sizing and capacity planning#
The table below summarises baseline sizing for the workload shapes most teams hit first. Treat figures as starting points; actual throughput varies with prompt length, output length, batch size, and quantisation. Validate against InferenceBench numbers for the exact model/GPU combination.
| Workload | Recommended config | Throughput baseline | Assumptions |
|---|---|---|---|
| Llama 3.1 8B chat, low latency | 1× L40S, FP16 | ~3,200 tok/s/replica | Batch ≤ 16, 2K input / 512 output, P50 TTFT < 250 ms. |
| Llama 3.1 8B chat, high throughput | 1× H100 SXM5, FP8 | ~12,500 tok/s/replica | Batch 64, prefix-cache-aware routing on. |
| Llama 3.1 70B chat | 4× H100 SXM5 TP=4, FP8 | ~3,800 tok/s/replica | NVLink 4.0 mandatory, max_model_len 8192. |
| Llama 3.1 70B chat, ultra low latency | 8× H100 SXM5 TP=8, FP8 | ~6,200 tok/s, P50 TTFT < 180 ms | NVLink 4.0, in-flight batching, FP8 KV cache. |
| Llama 3.1 405B chat | 8× H200 TP=8, FP8 | ~1,400 tok/s/replica | NVLink 4.0 plus 141 GB HBM3e per card. |
| Whisper Large-v3 | 1× L4 | Realtime × 32 streams/card | 16 kHz mono, 30 s segments. |
| SDXL image gen | 1× L40S, FP16 | ~3.5 images/s at 1024² 30 steps | Single replica; multi-GPU not worth it under 1024². |
| Embeddings (BGE-Large) | 1× A10G | ~8,000 embeddings/s/replica | Sequence length 512, batch 64. |
| 70B QLoRA fine-tune | 8× H100 SXM5 single node | ~12,000 tok/s training | Rank 64, microbatch 4, grad accum 8. |
| 70B full SFT | 32× H100 SXM5 across 4 nodes | ~6,400 tok/s training | InfiniBand NDR, ZeRO-3, FP8. |
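The throughput baselines above translate into a replica count with a ceiling division. The sketch below is a rough planning aid, not the platform's autoscaler: the 30% headroom default and the `replicas_needed` helper are assumptions introduced for illustration, and the per-replica figure is the Llama 3.1 70B baseline from the table.

```python
import math


def replicas_needed(peak_tokens_per_s: float, per_replica_tokens_per_s: float,
                    headroom: float = 0.3) -> int:
    """Replicas required to serve a peak load while keeping `headroom` spare capacity."""
    usable = per_replica_tokens_per_s * (1.0 - headroom)
    return max(1, math.ceil(peak_tokens_per_s / usable))


# Baseline from the table: Llama 3.1 70B on 4x H100 SXM5, ~3,800 tok/s per replica.
PER_REPLICA = 3800
GPUS_PER_REPLICA = 4

for peak in (2_000, 10_000, 25_000):
    n = replicas_needed(peak, PER_REPLICA)
    print(f"peak {peak:>6} tok/s -> {n} replicas ({n * GPUS_PER_REPLICA} GPUs)")
```

Compare the resulting GPU count with the workspace quota in Limits and quotas (16 GPUs by default) before setting `spec.replicas.max`.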
Sizing: workspace footprint by team size#
Workspaces map one-to-one with business units or product teams; sovereignty regions map to a Yobitel-operated region and isolation tier. The recommended ratios below come from production deployments at 10/50/200-engineer scale.
| Team size | Workspaces | Inference replicas (steady) | Fine-tune concurrency | Recommended GPU pool |
|---|---|---|---|---|
| 10 engineers | 1–2 | 5–10 | 1–2 | 4× H100 SXM5 + 2× L40S shared pool |
| 50 engineers | 4–6 | 30–60 | 4–6 | 16× H100 SXM5 + 8× L40S + 4× L4 |
| 200 engineers | 12–20 | 120–250 | 12–20 | 64× H100 SXM5 + 16× H200 + dedicated MIG pool |
| Org-wide platform | 50+ | 500+ | 30+ | Multi-region, 256+ H100/H200 + reserved capacity + spot tier |
Limits and quotas#
Default workspace quotas exist to protect shared infrastructure during onboarding; almost every limit is raisable on request. Hard ceilings exist where the underlying primitive imposes one (e.g. etcd object size limits in Kubernetes).
| Resource | Default | Enterprise ceiling | How to raise |
|---|---|---|---|
| Workspaces per org | 5 | 200 | Self-service in console. |
| Clusters per workspace | 3 | 50 | Self-service; subject to region availability. |
| GPUs per workspace | 16 | 4,096 | Support request plus reserved capacity commit. |
| Concurrent Inference resources | 20 | 500 | Self-service up to 100; ticket beyond. |
| Concurrent FineTune jobs | 5 | 100 | Support request; gated on quota and budget. |
| Replicas per Inference | 10 | 200 | Self-service. |
| Request rate per endpoint | 1,000 RPS | 100,000 RPS | Per-key rate-limit overrides via API. |
| Max model size (single replica) | 1.4 TB weights | 8 TB | Bounded by HBM × TP × node count. |
| Marketplace private models per workspace | 100 | 10,000 | Self-service. |
| Spend cap precision | USD 1 | USD 1 | Hard floor. |
| OIDC IdP connections per workspace | 5 | 20 | Support request. |
| FineTune job queue depth | 50 | 1,000 | Self-service up to 200. |
| Custom domains per Inference | 3 | 50 | Self-service. |
| Audit log retention | 90 days | 7 years | Enterprise tier; immutable S3 destination. |
Observability#
Yobibyte emits three telemetry streams for every customer workload. GPU metrics come from NVIDIA's open-source DCGM exporter and cover SM occupancy, HBM usage, power draw, and NVLink throughput. Inference engine metrics come from the managed inference runtime; Yobibyte normalises them under a stable `yobibyte_*` metric namespace so customers do not have to chase engine-specific metric names across upgrades. OpenTelemetry traces cover the full request lifecycle from the workspace gateway through to the response.
All three streams are scrape-compatible with customer Prometheus and customer OTel collectors. Customers can scrape directly from the workspace, ship to their own OTel collector via the `observability.otelEndpoint` field on the `Inference` resource, or view the hosted Grafana with curated dashboards. The metric names are stable across engine upgrades — the inference runtime can change underneath without changing the customer-facing metric surface.
The Prometheus alerting rules below are the ones most production deployments add first. The first rule catches the 'replica is up, traffic is flowing, but tokens/sec has collapsed' failure mode that simple liveness probes miss; the second pages when GPU HBM usage stays above 97%.
```yaml
groups:
  - name: yobibyte-inference
    interval: 30s
    rules:
      - alert: InferenceTokensPerSecondDegraded
        expr: |
          avg by (workspace, inference) (
            yobibyte_inference_tokens_per_second_replica
          ) < 1000
          and
          sum by (workspace, inference) (
            rate(yobibyte_inference_requests_total[5m])
          ) > 0.5
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "{{ $labels.inference }} tokens/sec/replica below 1k"
          runbook: https://docs.yobitel.com/runbooks/inference-throughput
      - alert: GpuHbmExhaustion
        expr: yobibyte_gpu_hbm_used_ratio > 0.97
        for: 5m
        labels: { severity: page }
```

Three SLIs cover roughly 90% of regressions before users notice them: tokens/sec per replica (throughput health), P99 inter-token latency (responsiveness), and KV-cache headroom (capacity health). Page on all three at the workspace level and treat the per-Inference signals as drilldown.
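P99 inter-token latency can also be measured from the client side as a cross-check on the hosted metric. The sketch below is a generic nearest-rank percentile over the gaps between streamed token arrivals; it is not how Yobibyte computes its own SLI, and the timestamps are illustrative values rather than measurements.

```python
import math


def inter_token_gaps(arrival_ms: list) -> list:
    """Gaps between consecutive token arrivals in one streamed response."""
    return [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]


def percentile(values: list, pct: float):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative arrival times in milliseconds; in practice these would be monotonic
# clock readings taken as each streamed chunk arrives.
arrivals = [210, 240, 270, 310, 340, 520, 550]
gaps = inter_token_gaps(arrivals)
print("P50 gap (ms):", percentile(gaps, 50))
print("P99 gap (ms):", percentile(gaps, 99))
```

A single long stall dominates the P99 while leaving the median untouched, which is why the tail percentile is the more useful responsiveness signal.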
Cost and FinOps#
Billing is per-resource and per-second for GPUs, per-GB-month for storage, and per-million-tokens for marketplace-hosted models. Every line item carries the FOCUS 1.1 columns (BilledCost, EffectiveCost, ListCost, ChargePeriod*, ServiceCategory, SubAccountId, Tags), so the export drops directly into a Cloudability/Apptio/Vantage pipeline or a customer-built BigQuery or Snowflake lakehouse.
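Because the export uses FOCUS column names, a first-pass cost rollup needs nothing beyond the standard library. The sketch below assumes the export is delivered as CSV and that `SubAccountId` identifies the workspace; both are assumptions, and the rows are made-up illustrative values, not real billing data.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Illustrative rows only; a real export carries many more FOCUS columns.
SAMPLE_EXPORT = """\
SubAccountId,ServiceCategory,BilledCost,EffectiveCost
acme-uk,AI and Machine Learning,412.50,398.10
acme-uk,Storage,18.22,18.22
acme-eu,AI and Machine Learning,97.00,97.00
"""


def billed_cost_by(rows, *keys):
    """Sum BilledCost grouped by the given FOCUS column names."""
    totals = defaultdict(Decimal)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += Decimal(row["BilledCost"])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
for group, cost in sorted(billed_cost_by(rows, "SubAccountId").items()):
    print(group[0], cost)
```

Using `Decimal` rather than `float` keeps currency sums exact, and passing a second key such as `"ServiceCategory"` gives a per-workspace, per-category breakdown.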
Reserved and committed-use pricing is available for steady workloads; spot capacity is available at a material discount for fault-tolerant fine-tunes and batch scoring. The table below is indicative for UK regions in mid-2026; always re-validate against the live Omniscient Compute price feed before forecasting.
| SKU / mode | On-demand $/GPU/hr | 1-yr reserved $/GPU/hr | 3-yr reserved $/GPU/hr | Spot floor |
|---|---|---|---|---|
| H100 SXM5 80GB | $3.25 | $2.45 | $1.95 | $1.20 |
| H200 141GB | $4.25 | $3.20 | $2.55 | $1.75 |
| B200 192GB | $6.00 | $4.50 | $3.60 | n/a |
| A100 80GB | $2.25 | $1.70 | $1.40 | $0.80 |
| L40S 48GB | $1.20 | $0.90 | $0.70 | $0.45 |
| L4 24GB | $0.50 | $0.40 | $0.30 | $0.22 |
| MI300X 192GB | $4.00 | $3.00 | $2.45 | n/a |
| Object storage | $0.022/GB-mo | — | — | — |
| Egress to internet | $0.075/GB | — | — | — |
| Egress between Yobibyte regions | $0.00/GB | — | — | — |
Spend caps are enforced at hourly granularity; when an Inference or workspace cap is exceeded, replicas pause (they are not deleted) and a P2 alert fires. Caps are a safety net, not a substitute for forecasting.
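A quick burn-rate calculation shows why forecasting still matters. The sketch below assumes an endpoint billed purely on GPU time at the indicative on-demand rate from the table, with one 70B replica on 4× H100 SXM5; actual charges depend on the pricing mode in use, and `hours_until_cap` is a hypothetical helper.

```python
def hours_until_cap(cap_amount: float, accrued: float, replicas: int,
                    gpus_per_replica: int, gpu_hourly_rate: float) -> float:
    """Hours of runtime left before accrued cost reaches the cap at the current burn rate."""
    burn_per_hour = replicas * gpus_per_replica * gpu_hourly_rate
    if burn_per_hour <= 0:
        return float("inf")
    return max(0.0, cap_amount - accrued) / burn_per_hour


# Assumptions: $3.25/GPU/hr on-demand, 4 GPUs per replica, one always-on replica,
# and the $5,500 monthly cap used in the Inference example.
remaining = hours_until_cap(cap_amount=5500, accrued=0, replicas=1,
                            gpus_per_replica=4, gpu_hourly_rate=3.25)
print(f"{remaining:.0f} h of runtime, about {remaining / 24:.1f} days")
```

Under those assumptions a single always-on replica reaches the cap well before a 30-day month (720 hours) ends, which is the kind of gap a forecast should surface before the cap does.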
Security and compliance#
Identity federates from any OIDC-compliant IdP; tested integrations include Okta, Microsoft Entra ID, Auth0, Keycloak, and Google Workspace. SCIM 2.0 is supported for user and group provisioning. RBAC follows the standard owner/maintainer/viewer triple plus a finer-grained resource-level permission set (e.g. `inference:deploy`, `finetune:read`, `workspace:billing:read`).
Tenant isolation is a customer choice exposed per `Inference`. `shared` runs workloads on multi-tenant nodes with cgroup and MIG isolation and is the right default for most workloads. `dedicated` pins the workload to dedicated nodes and is billed per node-hour rather than per pod-hour; choose this when a regulatory profile requires single-tenant compute or when a workload's noisy-neighbour sensitivity justifies the price. `confidential` runs the workload on NVIDIA H100/H200 GPUs in confidential-computing mode with TEE attestation, so that data and encryption keys stay inside the attested hardware boundary; choose this for the most sensitive data classes.
Data residency is enforced by policy at admission. A workload labelled `compliance=ncsc-official` is rejected if it targets a non-UK region; a workload labelled `compliance=eu-data-boundary` is rejected if it targets outside the EU; a workload labelled `compliance=fedramp-equiv` is rejected if it targets outside the US partner regions. Yobibyte will not silently spill workloads across compliance boundaries regardless of capacity or price.
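The admission rule described above can be sketched as a small lookup. This is an illustration of the logic, not Yobibyte's admission controller: the label values come from the text, but the region-prefix convention (`uk-`, `eu-`, `us-`) is inferred from the example region names and `admit` is a hypothetical function.

```python
from typing import Optional, Tuple

# Compliance label -> required region prefix (prefix convention is an assumption).
COMPLIANCE_REGION_PREFIX = {
    "ncsc-official": "uk-",
    "eu-data-boundary": "eu-",
    "fedramp-equiv": "us-",
}


def admit(compliance_label: Optional[str], target_region: str,
          workspace_region: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for a placement request."""
    if target_region != workspace_region:
        return False, "target region differs from the workspace's bound region"
    if compliance_label is None:
        return True, "no compliance label"
    prefix = COMPLIANCE_REGION_PREFIX.get(compliance_label)
    if prefix is None:
        return False, f"unknown compliance label: {compliance_label}"
    if not target_region.startswith(prefix):
        return False, "complianceMismatch"
    return True, "ok"


print(admit("ncsc-official", "uk-london", "uk-london"))
print(admit("fedramp-equiv", "uk-london", "uk-london"))
```

The check fails closed: an unknown label or a region outside the workspace binding is rejected rather than placed on whatever capacity happens to be available.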
- NCSC Cloud Security Principles — controls mapped per principle; OFFICIAL-tier UK regions audited annually.
- G-Cloud framework — listed under Cloud Software (Lot 2) and Cloud Support (Lot 3).
- SOC 2 Type II — annual third-party audit covering security, availability, confidentiality.
- ISO 27001:2022 and ISO 27017/27018 — current certificates available under NDA.
- GDPR / UK DPA 2018 — DPA, sub-processor list, and EU SCCs available; data residency enforced at admission.
- FedRAMP-equivalent — Moderate-baseline-aligned controls available via partner regions in the US.
- HIPAA — BAA available for healthcare workloads; encryption, logging, and access-control controls audited.
- CSA STAR Level 2 — published self-assessment plus third-party attestation.
Migration and alternatives#
Yobibyte is one option among several. Use the comparison table to decide when each fits.
| Concern | Yobibyte | DIY on cloud Kubernetes | AWS Bedrock / Vertex AI | Self-hosted Kubeflow |
|---|---|---|---|---|
| Operational ownership | Yobitel runs the platform | You run the platform | Cloud runs the platform | You run the platform |
| Cold start from request to live endpoint | Minutes | Hours to days | Minutes | Days |
| GPU SKU coverage | 30+ SKUs across Yobitel-operated regions | Whatever the cloud sells | Cloud's own SKUs only | Whatever you provision |
| Sovereignty enforcement | Pinned per workspace, enforced at admission | DIY policy | Region-only, no admission gate | DIY |
| FOCUS-aligned billing export | Built in | DIY | Proprietary export | DIY |
| Identity federation | OIDC + SCIM built in | DIY | Cloud IAM only | DIY |
| Marketplace plus benchmark integration | InferenceBench-sourced | DIY | Vendor curated | DIY |
| Pricing model | Per-second + reserved + spot, FOCUS-shaped | Cloud bill + your ops cost | Per-token, opaque | Cloud bill + your ops cost |
Today, most teams that want an in-house inference platform own all of this themselves — GPU node pools, drivers, the inference engine, autoscaling, OIDC federation, KMS integration, the billing pipeline, the observability stack — and burn two to three weeks of platform engineering per production endpoint. Yobibyte consolidates that into a single consumption contract with a workspace, a sovereignty region, and an OpenAI-compatible URL.
Troubleshooting#
The errors below cover the failure modes seen most often during onboarding and the first weeks of production. The full runbook library is at docs.yobitel.com/runbooks.
| Error | Cause | Fix |
|---|---|---|
| WorkspaceProvisioningFailed: OIDC discovery 401 | Wrong issuer URL or audience in the IdP application; Yobibyte cannot read the IdP's `.well-known/openid-configuration`. | In the workspace Identity tab, re-enter the issuer URL and audience; confirm the IdP's discovery document resolves over the public internet. |
| InferenceColdStartTimeout | First request hit a scale-to-zero endpoint while the model was still loading from object storage (typical for 70B+ models on a cold replica). | Raise `replicas.min` to 1 on the Inference resource, or pre-warm via the console's 'Warm replicas' control before traffic ramps. |
| QuotaExceeded: workspace gpus 16/16 in use | Workspace has reached its default GPU quota. | Either lower `replicas.max` on a less-critical Inference, or request a higher workspace GPU quota (a support request plus reserved capacity commit; see Limits and quotas). |
| RegionCapacityUnavailable: uk-london-1 | Requested SKU has no capacity in the workspace's pinned region right now. | Accept a transient queue (Yobibyte retries automatically), pre-purchase reserved capacity via the console's Capacity tab, or talk to your account team about a sibling region inside the same compliance boundary. |
| AdmissionDenied: complianceMismatch | Inference labelled with a compliance tag that does not match the workspace region, for example a UK workspace asked to host a workload labelled `compliance=fedramp-equiv`. | Either move the Inference to a workspace bound to the correct sovereignty region, or remove the compliance label if the constraint no longer applies. |
| BillingExportEmpty | FOCUS export bucket policy does not allow the Yobibyte writer role to write objects. | Apply the bucket policy snippet shown in the workspace's Billing tab and wait one export window; exports retry hourly. |
| OidcLoginRedirectMismatch | Redirect URI in the IdP application does not include the workspace's callback URL. | Add `https://<workspace>.yobitel.app/auth/callback` to the IdP's allowed redirects and retry sign-in. |
| KmsDecryptDenied | Customer KMS key policy is missing the Yobibyte data-plane role. | Add the role ARN shown in the workspace's Setup tab to the KMS key policy; FineTune jobs and Inference deployments resume automatically on the next reconciliation tick. |
| SpendCapExceeded: inference paused | Accrued cost has reached the configured spend cap; replicas are paused (not deleted). | Either raise the spend cap on the Inference resource or wait for the next budget window to begin; paused replicas resume automatically. |
| MarketplaceModelUnavailable: under license review | Selected model is temporarily withheld from the marketplace pending a licence or sovereignty review. | Pick a substitute from the marketplace's 'Similar models' list, or contact your account team for an ETA on the review. |
Where Yobibyte fits in the Yobitel stack#
Yobitel has three principal platforms and one applications suite, and they sit in a clear stack. Omniscient Compute is the layer below Yobibyte — the vendor-neutral search and orchestration index that knows every SKU, every region, every price, every compliance tag. Yobibyte calls Omniscient at admission and reconciliation time so a workload's placement is always grounded in current capacity and pricing.
InferenceBench is the benchmark and economics layer that sits alongside Yobibyte rather than below it. The marketplace pulls model/GPU/provider scoring directly from InferenceBench, so the ranking shown in the marketplace for `meta-llama/Llama-3.1-70B-Instruct` is the same public ranking anyone can verify at inferencebench.io. Yobitel AI Applications (MediQuery and the rest of the vertical suite) sit on top of Yobibyte; they are first-party applications that use Workspace, Inference, and FineTune the same way any customer would.
Practically, a customer can adopt the stack at any layer. A FinOps team can use Omniscient alone to broker capacity. A platform team can adopt Yobibyte and let it consume Omniscient internally. An application team can consume an AI Application without ever seeing the layers beneath. The boundaries are deliberate, the APIs are stable, and the contracts are documented at each layer.
References
- Yobibyte product page · Yobitel
- Omniscient Compute · Yobitel
- InferenceBench · Yobitel
- AWS Bedrock · AWS
- Google Vertex AI · Google Cloud
- FOCUS — FinOps Open Cost and Usage Specification · FinOps Foundation
- NCSC Cloud Security Principles · NCSC