Yobibyte

TL;DR

Yobibyte is Yobitel's fully-managed AI-native platform service — a Yobitel-operated workspace surface where customers deploy models, fine-tune adapters, and consume an OpenAI-compatible inference API without running any of the underlying infrastructure themselves.
Sovereignty is a first-class control: every workspace pins to a Yobitel-operated region — UK (NCSC OFFICIAL), EU (Data Boundary), or US (FedRAMP-equivalent) — and admission refuses to spill workloads across those boundaries regardless of capacity or price.
Accelerator coverage spans NVIDIA B300/B200/H200/H100/A100/L40S/L4/A10G/T4 and AMD MI300X; Yobibyte selects the right SKU on the customer's behalf using the marketplace recipe attached to each model.
Customer-facing surfaces: workspaces with OIDC SSO and SCIM, an OpenAI-compatible API on the standard `openai` SDK, declarative `Inference`, `FineTune`, and `AIApplication` resources, a marketplace covering both models and Yobitel AI Applications (MediQuery and the rest of the vertical suite), FOCUS 1.1 billing export, customer-owned KMS, customer-owned object storage.
Yobitel operates the runtime end-to-end; the customer never installs, upgrades, or operates the inference engines, schedulers, or clusters that sit underneath.

Overview

Most teams that want to use large language models do not want to become an MLOps shop. They want to deploy a model, get an OpenAI-compatible URL, point an application at it, and have something predictable handle GPU capacity, identity, autoscaling, observability, and billing. That problem space is what Yobibyte is built for — a managed consumption surface that turns 'I need a Llama 3.1 70B endpoint in the UK with a $5,500/month spend cap' into a deployed, observable, billed endpoint without anyone on the customer side touching a Kubernetes cluster.

Yobibyte is Yobitel's fully-managed AI-native platform service. Customers consume it as workspaces. Inside a workspace they declare `Inference` resources (deployed model endpoints), `FineTune` resources (managed adapter-training jobs), and `AIApplication` resources (first-party Yobitel applications such as MediQuery deployed into the workspace and configured against the customer's own data). They browse the Yobibyte marketplace for both vetted models and Yobitel AI Applications, federate identity from their existing OIDC provider, set per-resource and per-workspace spend caps, and pull a FOCUS 1.1 billing export into their FinOps tooling. The inference surface is OpenAI-compatible, so the standard `openai` Python and TypeScript SDKs work unchanged against a Yobibyte endpoint — only the base URL changes.

Compared with AWS Bedrock and Google Vertex AI, the operational shape is similar (you consume models as a service, you do not run clusters), but Yobibyte's differentiator is multi-region sovereignty pinning across Yobitel-operated regions. A workspace is bound at creation to UK (NCSC OFFICIAL alignment), EU (EU Data Boundary), or US (FedRAMP-equivalent), and admission refuses to place a workload outside that boundary regardless of capacity or price. Bedrock and Vertex AI are anchored to their parent cloud's region map; Yobibyte runs across Yobitel NeoCloud plus Yobitel-managed presence on partner cloud regions, with the same workspace API across all of them.

Yobitel Communications — a UK-headquartered AI infrastructure company and NVIDIA Inception partner — runs the platform end-to-end. Yobitel operates the runtime, the GPU fleet, the schedulers, the inference engines, and the upgrade cadence. The customer owns the data: models, datasets, fine-tuned adapter weights, and prompt/response logs live in customer-controlled object storage encrypted with customer-managed KMS keys. The customer never installs, upgrades, or operates Yobibyte itself.

Quick start

The customer experience for getting an OpenAI-compatible endpoint live on Yobibyte has five steps and no Kubernetes. Sign in to the Yobibyte console with your corporate identity provider; OIDC federation is configured once per workspace, and the console completes the SSO redirect on first login. Create a workspace and choose a sovereignty region at creation time: UK for NCSC OFFICIAL alignment, EU for the EU Data Boundary, or US for FedRAMP-equivalent controls. The region is bound to the workspace and cannot be silently changed afterwards.

Open the marketplace, find the model you want — say Llama 3.1 70B Instruct — and click Deploy. The deploy dialog asks for three things: replicas (minimum and maximum), a spend cap in your currency, and an optional autoscaling profile. Yobibyte fills in the serving defaults from the marketplace recipe attached to the model; you do not pick the inference engine, the tensor-parallel size, the quantisation, or the GPU SKU — those come from the recipe and you can override them later if you want.

Once the endpoint reaches Ready (typically 60–180 seconds for an in-region model, longer for the first deploy of a cold model), copy the endpoint URL and mint an API key in the workspace's Keys tab. The endpoint speaks the OpenAI Chat Completions and Embeddings APIs, so the standard `openai` SDK works against it unchanged; only `base_url` and `api_key` differ.

bash

# Install the standard OpenAI SDK
pip install openai

# Point it at your Yobibyte workspace endpoint
export OPENAI_API_KEY="ybt_live_..."           # from the workspace Keys tab
export OPENAI_BASE_URL="https://acme.yobitel.app/v1"

python - <<'PY'
from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello from Yobibyte"}],
)
print(resp.choices[0].message.content)
PY

The same `base_url` swap works with the TypeScript `openai` SDK, LangChain, LlamaIndex, and Vercel AI SDK — any client that targets the OpenAI Chat Completions shape works against a Yobibyte endpoint with no code change beyond the URL and key.
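
For clients that do not use an SDK at all, the request on the wire is small. The sketch below builds (but does not send) a Chat Completions request using only the Python standard library. The workspace URL and the truncated key are the placeholders from the quick start above, and `build_chat_request` is a hypothetical helper name, not part of any Yobibyte tooling.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style Chat Completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "https://acme.yobitel.app/v1",  # example workspace URL from the quick start
    "ybt_live_...",                  # placeholder; mint a real key in the Keys tab
    "llama-3.1-70b-instruct",
    "Hello from Yobibyte",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req), then json.load() the response.
```

Anything that can produce this request shape (a bearer token plus a JSON body with `model` and `messages`) should be able to talk to the endpoint, which is why the SDK swap needs nothing beyond the URL and key.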

Concepts

Yobibyte exposes a small set of concepts that map to how customers think about AI workloads. The mental model is workspace at the top, with `Inference`, `FineTune`, and `AIApplication` resources and marketplace items inside it, and identity, sovereignty, and spend governing what is allowed to happen inside the workspace boundary.

Workspace — the tenant boundary, the billing root, the OIDC anchor, and the sovereignty region binding. Every resource belongs to exactly one workspace; quotas, spend caps, audit logs, and the FOCUS billing export are scoped to it.
Inference — a deployed model endpoint. The customer supplies model name, region, replica range, spend cap, and an autoscaling profile; Yobibyte selects the inference engine, GPU SKU, and serving flags from the marketplace recipe. The endpoint speaks OpenAI-compatible HTTP.
FineTune — a managed adapter-training job. The customer supplies a base model, a dataset URI (s3://, azure://, or gs://), a fine-tune method (QLoRA / LoRA / full SFT), a hyperparameter preset or a small set of high-level knobs, and an output bucket with a KMS key. Yobibyte runs the job and writes the adapter back to the customer's bucket.
AI Application — a Yobitel-published first-party application (MediQuery for clinical workflows, and the rest of the vertical AI Applications suite) deployed into a workspace. The customer configures the application — data sources, knowledge bases, user roles, branding — rather than the underlying models. The application composes `Inference` and `FineTune` resources internally; Yobibyte runs them on the customer's behalf, and the customer's interface is the application's configuration surface, not the model surface. Partner-published applications appear in the marketplace alongside Yobitel's own.
Marketplace — a curated catalogue of two things: models and AI Applications. Model entries carry sovereignty, licence, and hardware-compatibility metadata, are ranked using InferenceBench data, and ship with the serving defaults Yobibyte uses when the model is deployed. AI Application entries carry sovereignty, licence, industry/vertical, and integration metadata, with first-party Yobitel apps featured alongside partner-published apps. Customer-private models and customer-private application packages live alongside the curated catalogue but are scoped to the workspace they were uploaded to.
Sovereignty Region — the compliance pin chosen at workspace creation. UK regions align to NCSC Cloud Security Principles and the OFFICIAL classification; EU regions sit inside the EU Data Boundary; US regions align to FedRAMP-equivalent controls. Admission refuses to place workloads outside the workspace's region.
Spend Cap — a hard budget set per resource (`Inference` or `FineTune`) and per workspace. Yobibyte tracks accrued cost against the FOCUS-shaped cost stream and enforces at hourly granularity; when an Inference or workspace cap is reached, replicas pause (they are not deleted) and a P2 alert fires on the configured channel. A FineTune job that reaches its cap is paused and surfaced for review.
Identity — workspaces federate from any OIDC-compliant IdP (Okta, Microsoft Entra ID, Auth0, Keycloak, Google Workspace tested). SCIM 2.0 provisions users and groups. Permissions follow owner / maintainer / viewer plus a fine-grained resource-level set (`inference:deploy`, `finetune:read`, `workspace:billing:read`).

Inference resource reference

Customers who prefer a declarative workflow can submit `Inference` resources directly via the Yobibyte API or commit them to Git. The fields below are what a customer supplies for Yobibyte to deploy and operate an endpoint on their behalf. Engine selection, GPU SKU selection, and serving flags are not customer fields; they come from the marketplace recipe attached to the model name. Customers who need to override a recipe value can do so via the `overrides` block (not itemised in the table below), but the defaults are correct for the vast majority of workloads.

Field	Type	Description
spec.model	string	Marketplace model name. Yobibyte resolves the serving recipe (engine, SKU, flags) from this.
spec.region	string	Sovereignty region; must equal the workspace's bound region (e.g. `uk-london`, `eu-frankfurt`, `us-ashburn`).
spec.replicas.min	integer	Minimum warm replicas. Set to at least 1 for any latency-sensitive workload; 0 enables scale-to-zero at the cost of a cold start.
spec.replicas.max	integer	Upper bound for autoscaling. Quotas still apply; see Limits and quotas.
spec.autoscaling.scaleMetric	enum	What Yobibyte scales on: `tokensPerSecond`, `requestsPerSecond`, or `concurrency`.
spec.autoscaling.target	number	Target value of the scale metric per replica. Crossing the target triggers a scale-out.
spec.spendCap.amount	number	Hard budget in the workspace's currency. When reached, replicas pause and a P2 alert fires.
spec.spendCap.window	enum	`hourly`, `daily`, or `monthly`. The cap resets at the end of each window.
spec.accessControl.apiKeys	string[]	Logical API-key names; the actual key material is minted in the workspace Keys tab and rotated independently.
spec.accessControl.allowedOriginCidrs	string[]	Optional client-IP allow-list, evaluated at the gateway.
spec.observability.otelEndpoint	string	Customer-owned OTel collector endpoint; Yobibyte ships traces to it in addition to the hosted view.
spec.observability.prometheusScrape	boolean	Exposes the standard Yobibyte metric set on a workspace-scoped scrape endpoint for the customer's own Prometheus.
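
The constraints in the table can be checked before a resource is submitted. The sketch below is a hypothetical client-side pre-flight check, not Yobibyte's own admission logic: it works on a plain dictionary with the same shape as the resource's `spec`, and it assumes a spend cap is mandatory and that `autoscaling` may be omitted, which the table does not state outright.

```python
SCALE_METRICS = {"tokensPerSecond", "requestsPerSecond", "concurrency"}
SPEND_WINDOWS = {"hourly", "daily", "monthly"}


def validate_inference_spec(spec: dict, workspace_region: str) -> list:
    """Return a list of problems; an empty list means the spec passes these checks."""
    problems = []
    if not spec.get("model"):
        problems.append("spec.model is required")
    if spec.get("region") != workspace_region:
        problems.append("spec.region must equal the workspace's bound region")

    replicas = spec.get("replicas", {})
    low, high = replicas.get("min", 0), replicas.get("max", 0)
    if not 0 <= low <= high or high < 1:
        problems.append("replicas must satisfy 0 <= min <= max and max >= 1")

    scaling = spec.get("autoscaling")  # assumed optional
    if scaling is not None:
        if scaling.get("scaleMetric") not in SCALE_METRICS:
            problems.append("autoscaling.scaleMetric is not a known metric")
        if scaling.get("target", 0) <= 0:
            problems.append("autoscaling.target must be positive")

    cap = spec.get("spendCap", {})     # assumed mandatory
    if cap.get("amount", 0) <= 0:
        problems.append("spendCap.amount must be positive")
    if cap.get("window") not in SPEND_WINDOWS:
        problems.append("spendCap.window must be hourly, daily, or monthly")
    return problems


example_spec = {
    "model": "llama-3.1-70b-instruct",
    "region": "uk-london",
    "replicas": {"min": 1, "max": 6},
    "autoscaling": {"scaleMetric": "tokensPerSecond", "target": 1800},
    "spendCap": {"amount": 5500, "currency": "USD", "window": "monthly"},
}
print(validate_inference_spec(example_spec, "uk-london"))     # []
print(validate_inference_spec(example_spec, "eu-frankfurt"))  # one region problem
```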

yaml

apiVersion: yobibyte.yobitel.com/v1
kind: Inference
metadata:
  name: support-bot
  workspace: acme-uk
spec:
  model: llama-3.1-70b-instruct        # marketplace model name
  region: uk-london                     # must match the workspace region
  replicas:
    min: 1
    max: 6
  autoscaling:
    scaleMetric: tokensPerSecond        # tokensPerSecond | requestsPerSecond | concurrency
    target: 1800
  spendCap:
    amount: 5500
    currency: USD
    window: monthly
  accessControl:
    apiKeys: [ "support-bot-prod" ]
    allowedOriginCidrs: [ "10.0.0.0/8" ]
  observability:
    otelEndpoint: "https://otel.acme.com:4317"
    prometheusScrape: true
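
To build intuition for how `autoscaling.target` interacts with the replica range, the sketch below uses simple proportional scaling: divide the observed total by the per-replica target, round up, and clamp to `replicas.min` and `replicas.max`. This is an assumption for illustration (it mirrors the common horizontal-autoscaler formula); the source does not document Yobibyte's actual scaling algorithm, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(observed_total: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling: enough replicas to keep each one at or below target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(observed_total / target_per_replica)
    # Clamp to the configured replica range.
    return max(min_replicas, min(max_replicas, wanted))


# Values from the example resource: target 1800 tokens/s, min 1, max 6.
print(desired_replicas(0, 1800, 1, 6))       # 1  (floor at replicas.min)
print(desired_replicas(5000, 1800, 1, 6))    # 3
print(desired_replicas(20000, 1800, 1, 6))   # 6  (capped at replicas.max)
```

Under this model, sustained demand above `max × target` (10,800 tokens/s here) cannot be absorbed by scaling and would show up as queueing or latency instead.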

FineTune resource reference

`FineTune` is the managed adapter-training surface. The customer points at a dataset and a base model, picks a method and a hyperparameter preset, and supplies an output bucket and KMS key — Yobibyte runs the job, ships metrics, and writes the resulting adapter back to the customer's bucket. The customer never picks the training framework or schedules nodes manually.

Field	Type	Description
spec.baseModel	string	Marketplace model name to fine-tune from.
spec.region	string	Sovereignty region; must match the workspace.
spec.dataset.uri	string	`s3://`, `azure://`, or `gs://` URI to the training dataset; access is via a customer-supplied role/identity.
spec.dataset.format	enum	`chat`, `completion`, `instruction`, or `raw`.
spec.method	enum	`qlora` (memory-efficient: 4-bit quantised base + 16-bit adapters), `lora` (low-rank adapters on an unquantised base), or `full` (full SFT).
spec.hyperparameters.preset	enum	`fast`, `balanced`, or `quality` — sets sensible defaults; override individual knobs below if you need to.
spec.hyperparameters.epochs	integer	Number of passes over the dataset.
spec.hyperparameters.learningRatePreset	enum	`conservative`, `standard`, or `aggressive`.
spec.hyperparameters.rank	integer	LoRA/QLoRA rank. Higher rank = more capacity, larger adapter.
spec.output.bucket	string	Customer-owned bucket the adapter is written to.
spec.output.kmsKey	string	Customer-managed KMS key the adapter is encrypted with at rest.
spec.spendCap.amount	number	Hard budget for the job. When reached, the job is paused and surfaced for review; no silent overrun.

yaml

apiVersion: yobibyte.yobitel.com/v1
kind: FineTune
metadata:
  name: support-llama-v3
  workspace: acme-uk
spec:
  baseModel: llama-3.1-70b-instruct
  region: uk-london
  dataset:
    uri: s3://acme-ml/support-tickets-2025-q4.jsonl
    format: chat
  method: qlora                          # qlora | lora | full
  hyperparameters:
    preset: balanced                     # fast | balanced | quality
    epochs: 3
    learningRatePreset: standard
    rank: 64
  output:
    bucket: s3://acme-ml/adapters/support-v3
    kmsKey: arn:aws:kms:eu-west-2:111122223333:key/abcd-...
  spendCap:
    amount: 1500
    currency: USD

The `balanced` preset is the right starting point for 90% of fine-tunes; use `fast` for iteration on a held-out evaluation set and `quality` when the adapter is going to production and a few extra GPU-hours are cheaper than another iteration cycle.
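
The effect of `hyperparameters.rank` on adapter size follows from how LoRA works: each adapted weight matrix of shape `d_in × d_out` gains two low-rank factors, adding `rank × (d_in + d_out)` trainable parameters. The sketch below applies that formula. The 8192 dimension is an illustrative assumption, and which matrices Yobibyte adapts is not stated in this document.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


def size_mib(num_params: int, bytes_per_param: int = 2) -> float:
    """Storage for the adapter factors at 16-bit precision by default."""
    return num_params * bytes_per_param / 2**20


# Illustrative square projection of width 8192 (an assumed dimension).
for rank in (16, 64):
    n = lora_params(8192, 8192, rank)
    print(rank, n, size_mib(n))
# 16 262144 0.5
# 64 1048576 2.0
```

Adapter size grows linearly with rank, so moving from rank 16 to rank 64 quadruples both the trainable parameters and the artefact written to the output bucket.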

Marketplace

The marketplace is where customers find what they can deploy into a workspace, and it covers two distinct catalogues that share a common shopfront: open-weight models, and Yobitel AI Applications. The same filters (sovereignty, licence, region, popularity, last-updated) apply to both; the result detail page differs in shape according to what is being deployed.

On the model side, every entry carries sovereignty metadata (which regions can host it), licence metadata (commercial use, redistribution, attribution), hardware-compatibility metadata (which accelerator families it fits), and an InferenceBench-sourced ranking on price per million tokens and tokens per second at representative input/output shapes. Filters narrow by sovereignty (only models that can be hosted in the workspace region), licence (only models cleared for commercial use), family (Llama, Qwen, Mistral, Phi, Gemma, etc.), and modality (text, embeddings, vision, multimodal, speech). The result list shows the InferenceBench rank, the per-million-token price band at default replicas, and the smallest GPU configuration the model fits on.

On the AI Application side, every entry carries sovereignty and residency metadata, licence and commercial terms, industry/vertical tags (clinical, financial-services, manufacturing, retail, telco, public-sector, etc.), integration metadata (which data sources, identity systems, and downstream tools the application natively connects to), and a deployment footprint (which Inferences and FineTunes the application provisions on deploy, and the spend-cap implications). Yobitel-first-party applications such as MediQuery are featured alongside partner-published applications; the deploy dialog asks the customer to configure the data sources, knowledge bases, RBAC, and branding rather than to pick models or serving flags.

Customer-private models and customer-private application packages live in the marketplace alongside the curated catalogues but are scoped to the workspace they were uploaded to. Bring-your-own model uploads go directly to a customer-owned object bucket; Yobibyte registers the entry and stamps the recipe metadata it needs to serve the model, but the weights stay in the customer's storage. Customer-built applications follow the same pattern: package + manifest land in the workspace, Yobibyte registers them, and the runtime is operated on the customer's behalf.

Customers do not interact with the internal recipe shape on either side. For models, the marketplace supplies sensible serving defaults that work for the listed model on the listed accelerators; customers can override individual fields (max context length, replica concurrency, sampling defaults) via the `overrides` block on the `Inference` resource, but engine selection and core serving flags are Yobibyte's responsibility. For applications, customers configure the application's surface (data, identity, RBAC, branding) and the Inferences and FineTunes the application provisions are operated by Yobibyte.

Sizing and capacity planning

The table below summarises baseline sizing for the workload shapes most teams hit first. Treat figures as starting points; actual throughput varies with prompt length, output length, batch size, and quantisation. Validate against InferenceBench numbers for the exact model/GPU combination.

Workload	Recommended config	Throughput baseline	Assumptions
Llama 3.1 8B chat, low latency	1× L40S, FP16	~3,200 tok/s/replica	Batch ≤ 16, 2K input / 512 output, P50 TTFT < 250 ms.
Llama 3.1 8B chat, high throughput	1× H100 SXM5, FP8	~12,500 tok/s/replica	Batch 64, prefix-cache-aware routing on.
Llama 3.1 70B chat	4× H100 SXM5 TP=4, FP8	~3,800 tok/s/replica	NVLink 4.0 mandatory, max_model_len 8192.
Llama 3.1 70B chat, ultra low latency	8× H100 SXM5 TP=8, FP8	~6,200 tok/s, P50 TTFT < 180 ms	NVLink 4.0, in-flight batching, FP8 KV cache.
Llama 3.1 405B chat	8× H200 TP=8, FP8	~1,400 tok/s/replica	NVLink 4.0 plus 141 GB HBM3e per card.
Whisper Large-v3	1× L4	Realtime × 32 streams/card	16 kHz mono, 30 s segments.
SDXL image gen	1× L40S, FP16	~3.5 images/s at 1024² 30 steps	Single replica; multi-GPU not worth it under 1024².
Embeddings (BGE-Large)	1× A10G	~8,000 embeddings/s/replica	Sequence length 512, batch 64.
70B QLoRA fine-tune	8× H100 SXM5 single node	~12,000 tok/s training	Rank 64, microbatch 4, grad accum 8.
70B full SFT	32× H100 SXM5 across 4 nodes	~6,400 tok/s training	InfiniBand NDR, ZeRO-3, FP8.
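
A rough way to turn the table into a replica count is to divide expected peak throughput by the per-replica baseline and add headroom. The sketch below does that, plus a back-of-envelope check of weight memory against aggregate HBM. The 30% headroom and the 10,000 tok/s peak are assumptions for illustration, and the helper names are hypothetical.

```python
import math


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 byte/param for FP8, 2 for FP16)."""
    return params_billions * bytes_per_param


def replicas_for_peak(peak_tok_s: float, per_replica_tok_s: float,
                      headroom: float = 0.3) -> int:
    """Replicas needed to serve peak load with a safety margin."""
    return math.ceil(peak_tok_s * (1 + headroom) / per_replica_tok_s)


# Llama 3.1 70B at FP8 on 4x H100 80 GB (row 3 of the table).
w = weights_gb(70, 1)
hbm = 4 * 80
print(w, hbm, hbm - w)                  # weights, total HBM, left for KV cache etc.

# Assumed peak of 10,000 tok/s at the table's ~3,800 tok/s/replica baseline.
print(replicas_for_peak(10_000, 3_800))  # 4
```

The result is a sensible starting value for `replicas.max`; measured throughput for the actual prompt and output lengths should replace the table baseline as soon as it is available.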

Sizing: workspace footprint by team size

Workspaces map one-to-one with business units or product teams; sovereignty regions map to a Yobitel-operated region and isolation tier. The recommended ratios below come from production deployments at 10/50/200-engineer scale.

Team size	Workspaces	Inference replicas (steady)	Fine-tune concurrency	Recommended GPU pool
10 engineers	1–2	5–10	1–2	4× H100 SXM5 + 2× L40S shared pool
50 engineers	4–6	30–60	4–6	16× H100 SXM5 + 8× L40S + 4× L4
200 engineers	12–20	120–250	12–20	64× H100 SXM5 + 16× H200 + dedicated MIG pool
Org-wide platform	50+	500+	30+	Multi-region, 256+ H100/H200 + reserved capacity + spot tier

Limits and quotas

Default workspace quotas exist to protect shared infrastructure during onboarding; almost every limit is raisable on request. Hard ceilings exist where the underlying primitive imposes one (e.g. Kubernetes etcd object size limits).

Resource	Default	Enterprise ceiling	How to raise
Workspaces per org	5	200	Self-service in console.
Clusters per workspace	3	50	Self-service; subject to region availability.
GPUs per workspace	16	4,096	Support request plus reserved capacity commit.
Concurrent Inference resources	20	500	Self-service up to 100; ticket beyond.
Concurrent FineTune jobs	5	100	Support request; gated on quota and budget.
Replicas per Inference	10	200	Self-service.
Request rate per endpoint	1,000 RPS	100,000 RPS	Per-key rate-limit overrides via API.
Max model size (single replica)	1.4 TB weights	8 TB	Bounded by HBM × TP × node count.
Marketplace private models per workspace	100	10,000	Self-service.
Spend cap precision	USD 1	USD 1	Hard floor.
OIDC IdP connections per workspace	5	20	Support request.
FineTune job queue depth	50	1,000	Self-service up to 200.
Custom domains per Inference	3	50	Self-service.
Audit log retention	90 days	7 years	Enterprise tier; immutable S3 destination.

Observability

Yobibyte emits three telemetry streams for every customer workload. GPU metrics come from NVIDIA's open-source DCGM exporter and cover SM occupancy, HBM usage, power draw, and NVLink throughput. Inference engine metrics come from the managed inference runtime; Yobibyte normalises them under a stable `yobibyte_*` metric namespace so customers do not have to chase engine-specific metric names across upgrades. OpenTelemetry traces cover the full request lifecycle from the workspace gateway through to the response.

All three streams are scrape-compatible with customer Prometheus and customer OTel collectors. Customers can scrape directly from the workspace, ship to their own OTel collector via the `observability.otelEndpoint` field on the `Inference` resource, or view the hosted Grafana with curated dashboards. The metric names are stable across engine upgrades — the inference runtime can change underneath without changing the customer-facing metric surface.

The Prometheus alerting rules below start with the alert most production deployments add first. It catches the 'replica is up, traffic is flowing, but tokens/sec has collapsed' failure mode that simple liveness probes miss. A second rule pages when GPU HBM usage stays above 97%.

yaml

groups:
- name: yobibyte-inference
  interval: 30s
  rules:
  - alert: InferenceTokensPerSecondDegraded
    expr: |
      avg by (workspace, inference) (
        yobibyte_inference_tokens_per_second_replica
      ) < 1000
      and
      sum by (workspace, inference) (
        rate(yobibyte_inference_requests_total[5m])
      ) > 0.5
    for: 10m
    labels: { severity: page }
    annotations:
      summary: "{{ $labels.inference }} tokens/sec/replica below 1k"
      runbook: https://docs.yobitel.com/runbooks/inference-throughput

  - alert: GpuHbmExhaustion
    expr: yobibyte_gpu_hbm_used_ratio > 0.97
    for: 5m
    labels: { severity: page }

Three SLIs cover roughly 90% of regressions before users notice them: tokens/sec per replica (throughput health), P99 inter-token latency (responsiveness), and KV-cache headroom (capacity health). Page on all three at the workspace level and treat the per-Inference signals as drilldown.
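
The reason the throughput alert joins two conditions with `and` is easier to see in plain code: low tokens/sec only matters while traffic is flowing, otherwise an idle endpoint would page. The sketch below restates the instantaneous condition of `InferenceTokensPerSecondDegraded` in Python. It is an illustration only, does not model the rule's `for: 10m` hold period, and `should_fire` is a hypothetical name.

```python
def should_fire(tokens_per_sec_by_replica: list, requests_per_sec: float,
                floor: float = 1000.0, min_rps: float = 0.5) -> bool:
    """True when average per-replica throughput is low AND traffic is present."""
    if not tokens_per_sec_by_replica:
        return False  # no series, nothing to evaluate
    avg = sum(tokens_per_sec_by_replica) / len(tokens_per_sec_by_replica)
    return avg < floor and requests_per_sec > min_rps


print(should_fire([400.0, 600.0], requests_per_sec=12.0))    # True: degraded under load
print(should_fire([0.0, 0.0], requests_per_sec=0.0))         # False: idle, not degraded
print(should_fire([1900.0, 2100.0], requests_per_sec=12.0))  # False: healthy
```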

Cost and FinOps

Billing is per-resource and per-second for GPUs, per-GB-month for storage, and per-million-tokens for marketplace-hosted models. Every line item carries the FOCUS 1.1 columns (BilledCost, EffectiveCost, ListCost, ChargePeriod*, ServiceCategory, SubAccountId, Tags), so the export drops directly into a Cloudability/Apptio/Vantage pipeline or a customer-built BigQuery or Snowflake lakehouse.
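
As a minimal sketch of consuming the export, the code below sums `BilledCost` per `SubAccountId` from FOCUS-shaped rows. It assumes the export is delivered as CSV and that `SubAccountId` identifies the workspace; neither is stated above. The three sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Invented sample rows using real FOCUS column names.
SAMPLE = """BilledCost,ServiceCategory,SubAccountId
13.00,AI and Machine Learning,acme-uk
0.44,Storage,acme-uk
6.50,AI and Machine Learning,acme-eu
"""


def billed_cost_by(rows, column: str) -> dict:
    """Sum BilledCost grouped by the given FOCUS column."""
    totals = defaultdict(Decimal)  # Decimal() starts at 0 and avoids float drift
    for row in rows:
        totals[row[column]] += Decimal(row["BilledCost"])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for key, total in billed_cost_by(rows, "SubAccountId").items():
    print(f"{key}: {total}")
# acme-uk: 13.44
# acme-eu: 6.50
```

The same function groups by `ServiceCategory` or any other column, which is usually the first cut a FinOps team wants before loading the data into a warehouse.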

Reserved and committed-use pricing is available for steady workloads; spot capacity is available at material discount for fault-tolerant fine-tunes and batch scoring. The table below is indicative for UK regions in mid-2026 — always re-validate against the live Omniscient Compute price feed before forecasting.

SKU / mode	On-demand $/GPU/hr	1-yr reserved $/GPU/hr	3-yr reserved $/GPU/hr	Spot floor
H100 SXM5 80GB	$3.25	$2.45	$1.95	$1.20
H200 141GB	$4.25	$3.20	$2.55	$1.75
B200 192GB	$6.00	$4.50	$3.60	n/a
A100 80GB	$2.25	$1.70	$1.40	$0.80
L40S 48GB	$1.20	$0.90	$0.70	$0.45
L4 24GB	$0.50	$0.40	$0.30	$0.22
MI300X 192GB	$4.00	$3.00	$2.45	n/a
Object storage	$0.022/GB-mo	—	—	—
Egress to internet	$0.075/GB	—	—	—
Egress between Yobibyte regions	$0.00/GB	—	—	—

Spend caps are enforced at hourly granularity; when an Inference or workspace cap is exceeded, replicas pause (they are not deleted) and a P2 alert fires. Caps are a safety net, not a substitute for forecasting.
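
Before relying on a cap, it helps to estimate how long a steady workload can run before reaching it. The sketch below divides the cap by the hourly GPU burn rate, using the indicative prices from the table above and the 4× H100 per-replica configuration from the sizing table. It deliberately ignores storage, egress, and autoscaling, so treat the result as a rough upper bound on runtime.

```python
def hours_until_cap(cap_amount: float, gpus_per_replica: int,
                    replicas: int, price_per_gpu_hour: float) -> float:
    """Hours a constant GPU footprint can run before accrued cost reaches the cap."""
    burn_per_hour = gpus_per_replica * replicas * price_per_gpu_hour
    return cap_amount / burn_per_hour


# One 4x H100 replica against a 5,500 monthly cap.
on_demand = hours_until_cap(5500, 4, 1, 3.25)   # on-demand price from the table
reserved_3y = hours_until_cap(5500, 4, 1, 1.95)  # 3-yr reserved price from the table
print(f"{on_demand:.1f} h on-demand, {reserved_3y:.1f} h on 3-yr reserved")
# 423.1 h on-demand, 705.1 h on 3-yr reserved
```

A 30-day month has 720 hours, so under these assumptions a single always-on replica of this size would reach a 5,500 cap before the month ends at either price. That is the kind of mismatch worth catching in a forecast instead of through a paused endpoint.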

Security and compliance

Identity federates from any OIDC-compliant IdP; tested integrations include Okta, Microsoft Entra ID, Auth0, Keycloak, and Google Workspace. SCIM 2.0 is supported for user and group provisioning. RBAC follows the standard owner/maintainer/viewer triple plus a finer-grained resource-level permission set (e.g. `inference:deploy`, `finetune:read`, `workspace:billing:read`).

Tenant isolation is a customer choice exposed per `Inference`. `shared` runs workloads on multi-tenant nodes with cgroup and MIG isolation and is the right default for most workloads. `dedicated` pins the workload to dedicated nodes and is billed per node-hour rather than per pod-hour; choose this when a regulatory profile requires single-tenant compute or when a workload's noisy-neighbour sensitivity justifies the price. `confidential` runs the workload on NVIDIA H100/H200 GPUs in confidential-compute mode with TEE attestation, so encryption keys never leave the GPU; choose this for the most sensitive data classes.

Data residency is enforced by policy-based admission control. A workload labelled `compliance=ncsc-official` is rejected if it targets a non-UK region; a workload labelled `compliance=eu-data-boundary` is rejected if it targets outside the EU; a workload labelled `compliance=fedramp-equiv` is rejected if it targets outside the US partner regions. Yobibyte will not silently spill workloads across compliance boundaries regardless of capacity or price.
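
The admission rule described above amounts to a lookup from compliance label to permitted geography. The sketch below illustrates that logic using the label values from this section and the example region names from the Inference reference. Matching on a region-name prefix and rejecting unknown labels are both assumptions made for the example; this is not Yobibyte's policy engine.

```python
from typing import Optional, Tuple

# Label values come from the text; the prefix convention is an assumption.
ALLOWED_REGION_PREFIX = {
    "ncsc-official": "uk-",
    "eu-data-boundary": "eu-",
    "fedramp-equiv": "us-",
}


def admit(compliance_label: Optional[str], target_region: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for placing a workload in target_region."""
    if compliance_label is None:
        return True, "no compliance label to enforce"
    prefix = ALLOWED_REGION_PREFIX.get(compliance_label)
    if prefix is None:
        return False, f"unknown compliance label {compliance_label!r}"  # fail closed
    if not target_region.startswith(prefix):
        return False, f"{compliance_label} workloads may not run in {target_region}"
    return True, "region is inside the compliance boundary"


print(admit("ncsc-official", "uk-london"))
print(admit("ncsc-official", "us-ashburn"))
print(admit("eu-data-boundary", "eu-frankfurt"))
```

Note that capacity and price never appear as inputs: the decision depends only on the label and the target region, which is what prevents a workload from spilling across a boundary when a cheaper region has room.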

NCSC Cloud Security Principles — controls mapped per principle; OFFICIAL-tier UK regions audited annually.
G-Cloud framework — listed under Cloud Software (Lot 2) and Cloud Support (Lot 3).
SOC 2 Type II — annual third-party audit covering security, availability, confidentiality.
ISO 27001:2022 and ISO 27017/27018 — current certificates available under NDA.
GDPR / UK DPA 2018 — data processing agreement, sub-processor list, and EU SCCs available; data residency enforced at admission.
FedRAMP-equivalent — Moderate-baseline-aligned controls available via partner regions in the US.
HIPAA — BAA available for healthcare workloads; encryption, logging, and access-control controls audited.
CSA STAR Level 2 — published self-assessment plus third-party attestation.

Migration and alternatives

Yobibyte is one option among several. Use the comparison table to decide when each fits.

Concern	Yobibyte	DIY on cloud Kubernetes	AWS Bedrock / Vertex AI	Self-hosted Kubeflow
Operational ownership	Yobitel runs the platform	You run the platform	Cloud runs the platform	You run the platform
Cold start from request to live endpoint	Minutes	Hours to days	Minutes	Days
GPU SKU coverage	30+ SKUs across Yobitel-operated regions	Whatever the cloud sells	Cloud's own SKUs only	Whatever you provision
Sovereignty enforcement	Pinned per workspace, enforced at admission	DIY policy	Region-only, no admission gate	DIY
FOCUS-aligned billing export	Built in	DIY	Proprietary export	DIY
Identity federation	OIDC + SCIM built in	DIY	Cloud IAM only	DIY
Marketplace plus benchmark integration	InferenceBench-sourced	DIY	Vendor curated	DIY
Pricing model	Per-second + reserved + spot, FOCUS-shaped	Cloud bill + your ops cost	Per-token, opaque	Cloud bill + your ops cost

Today, most teams that want an in-house inference platform own all of this themselves — GPU node pools, drivers, the inference engine, autoscaling, OIDC federation, KMS integration, the billing pipeline, the observability stack — and burn two to three weeks of platform engineering per production endpoint. Yobibyte consolidates that into a single consumption contract with a workspace, a sovereignty region, and an OpenAI-compatible URL.

Troubleshooting

The errors below cover the failure modes seen most often during onboarding and the first weeks of production. The full runbook library is at docs.yobitel.com/runbooks.

Error	Cause	Fix
WorkspaceProvisioningFailed: OIDC discovery 401	Wrong issuer URL or audience in the IdP application; Yobibyte cannot read the IdP's `.well-known/openid-configuration`.	In the workspace Identity tab, re-enter the issuer URL and audience; confirm the IdP's discovery document resolves over the public internet.
InferenceColdStartTimeout	First request hit a scale-to-zero endpoint while the model was still loading from object storage (typical for 70B+ models on a cold replica).	Raise `replicas.min` to 1 on the Inference resource, or pre-warm via the console's 'Warm replicas' control before traffic ramps.
QuotaExceeded: workspace gpus 16/16 in use	Workspace has reached its default GPU quota.	Either lower `replicas.max` on a less-critical Inference, or request a higher workspace GPU quota (see Limits and quotas for how each quota is raised).
RegionCapacityUnavailable: uk-london-1	Requested SKU has no capacity in the workspace's pinned region right now.	Either accept a transient queue (Yobibyte retries automatically), pre-purchase reserved capacity via the console's Capacity tab, or talk to your account team about a sibling region inside the same compliance boundary.
AdmissionDenied: complianceMismatch	Inference labelled with a compliance tag that does not match the workspace region, for example a UK workspace asked to host a workload labelled `compliance=fedramp-equiv`.	Either move the Inference to a workspace bound to the correct sovereignty region, or remove the compliance label if the constraint no longer applies.
BillingExportEmpty	FOCUS export bucket policy does not allow the Yobibyte writer role to write objects.	Apply the bucket policy snippet shown in the workspace's Billing tab and wait one export window; exports retry hourly.
OidcLoginRedirectMismatch	Redirect URI in the IdP application does not include the workspace's callback URL.	Add `https://<workspace>.yobitel.app/auth/callback` to the IdP's allowed redirects and retry sign-in.
KmsDecryptDenied	Customer KMS key policy is missing the Yobibyte data-plane role.	Add the role ARN shown in the workspace's Setup tab to the KMS key policy; FineTune jobs and Inference deployments resume automatically on the next reconciliation tick.
SpendCapExceeded: inference paused	Accrued cost has reached the configured spend cap; replicas are paused (not deleted).	Either raise the spend cap on the Inference resource or wait for the next budget window to begin; paused replicas resume automatically.
MarketplaceModelUnavailable: under license review	Selected model is temporarily withheld from the marketplace pending a licence or sovereignty review.	Pick a substitute from the marketplace's 'Similar models' list, or contact your account team for an ETA on the review.

Where Yobibyte fits in the Yobitel stack

Yobitel has three principal platforms and one applications suite, and they sit in a clear stack. Omniscient Compute is the layer below Yobibyte — the vendor-neutral search and orchestration index that knows every SKU, every region, every price, every compliance tag. Yobibyte calls Omniscient at admission and reconciliation time so a workload's placement is always grounded in current capacity and pricing.

InferenceBench is the benchmark and economics layer that sits alongside Yobibyte rather than below it. The marketplace pulls model/GPU/provider scoring directly from InferenceBench, so the ranking shown for `meta-llama/Llama-3.1-70B-Instruct` is the same public ranking anyone can verify at inferencebench.io. Yobitel AI Applications (MediQuery and the rest of the vertical suite) sit on top of Yobibyte; they are first-party applications that use Workspace, Inference, and FineTune the same way any customer would.

Practically, a customer can adopt the stack at any layer. A FinOps team can use Omniscient alone to broker capacity. A platform team can adopt Yobibyte and let it consume Omniscient internally. An application team can consume an AI Application without ever seeing the layers beneath. The boundaries are deliberate, the APIs are stable, and the contracts are documented at each layer.

References

Yobibyte product page · Yobitel
Omniscient Compute · Yobitel
InferenceBench · Yobitel
AWS Bedrock · AWS
Google Vertex AI · Google Cloud
FOCUS — FinOps Open Cost and Usage Specification · FinOps Foundation
NCSC Cloud Security Principles · NCSC


Quick start#

bash

# Install the standard OpenAI SDK
pip install openai

# Point it at your Yobibyte workspace endpoint
export OPENAI_API_KEY="ybt_live_..."           # from the workspace Keys tab
export OPENAI_BASE_URL="https://acme.yobitel.app/v1"

python - <<'PY'
from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello from Yobibyte"}],
)
print(resp.choices[0].message.content)
PY

Concepts#

Workspace — the tenant boundary, the billing root, the OIDC anchor, and the sovereignty region binding. Every resource belongs to exactly one workspace; quotas, spend caps, audit logs, and the FOCUS billing export are scoped to it.
Inference — a deployed model endpoint. The customer supplies model name, region, replica range, spend cap, and an autoscaling profile; Yobibyte selects the inference engine, GPU SKU, and serving flags from the marketplace recipe. The endpoint speaks OpenAI-compatible HTTP.
FineTune — a managed adapter-training job. The customer supplies a base model, a dataset URI (s3://, azure://, or gs://), a fine-tune method (QLoRA / LoRA / full SFT), a hyperparameter preset or a small set of high-level knobs, and an output bucket with a KMS key. Yobibyte runs the job and writes the adapter back to the customer's bucket.
AI Application — a Yobitel-published first-party application (MediQuery for clinical workflows, and the rest of the vertical AI Applications suite) deployed into a workspace. The customer configures the application — data sources, knowledge bases, user roles, branding — rather than the underlying models. The application composes `Inference` and `FineTune` resources internally; Yobibyte runs them on the customer's behalf, and the customer's interface is the application's configuration surface, not the model surface. Partner-published applications appear in the marketplace alongside Yobitel's own.
Marketplace — a curated catalogue of two things: models and AI Applications. Model entries carry sovereignty, licence, and hardware-compatibility metadata, ranked using InferenceBench data, and ship with the serving defaults Yobibyte uses when the model is deployed. AI Application entries carry sovereignty, licence, industry/vertical, and integration metadata, with first-party Yobitel apps featured alongside partner-published apps. Customer-private models and customer-private application packages live alongside the curated catalogue but are scoped to the workspace they were uploaded to.
Sovereignty Region — the compliance pin chosen at workspace creation. UK regions align to NCSC Cloud Security Principles and the OFFICIAL classification; EU regions sit inside the EU Data Boundary; US regions align to FedRAMP-equivalent controls. Admission refuses to place workloads outside the workspace's region.
Spend Cap — a hard budget set per Inference and per Workspace. Yobibyte tracks accrued cost against the FOCUS-shaped cost stream and enforces at hourly granularity; when the cap is reached, replicas pause (they are not deleted) and a P2 alert fires on the configured channel.
Identity — workspaces federate from any OIDC-compliant IdP (tested with Okta, Microsoft Entra ID, Auth0, Keycloak, and Google Workspace). SCIM 2.0 provisions users and groups. Permissions follow owner / maintainer / viewer roles plus a fine-grained resource-level set (for example `inference:deploy`, `finetune:read`, `workspace:billing:read`).
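
The Spend Cap behaviour described above (hourly enforcement, replicas paused rather than deleted, automatic resume in the next window) can be illustrated with a small simulation. This is a sketch only: the `Endpoint` class, its fields, and its methods are hypothetical names for illustration and are not part of the Yobibyte API.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical stand-in for an Inference resource with a spend cap."""
    cap: float             # hard budget for the current window
    accrued: float = 0.0   # cost accrued so far in this window
    paused: bool = False   # paused replicas are kept, not deleted

    def hourly_tick(self, hour_cost: float) -> None:
        """Add one hour of cost, then enforce the cap (hourly granularity)."""
        if self.paused:
            return         # a paused endpoint accrues no further cost here
        self.accrued += hour_cost
        if self.accrued >= self.cap:
            self.paused = True

    def new_window(self) -> None:
        """Start a new budget window: reset accrued cost and resume."""
        self.accrued = 0.0
        self.paused = False


ep = Endpoint(cap=100.0)
for _ in range(10):
    ep.hourly_tick(13.0)   # illustrative hourly cost
print(ep.paused, ep.accrued)
```

In this sketch the cap is checked once per tick, so accrued cost can overshoot the cap by at most one hour of spend before the pause takes effect.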

Inference resource reference#

Field	Type	Description
spec.model	string	Marketplace model name. Yobibyte resolves the serving recipe (engine, SKU, flags) from this.
spec.region	string	Sovereignty region; must equal the workspace's bound region (e.g. `uk-london`, `eu-frankfurt`, `us-ashburn`).
spec.replicas.min	integer	Minimum warm replicas. Set to 1 for any latency-sensitive workload; 0 enables scale-to-zero at the cost of a cold start.
spec.replicas.max	integer	Upper bound for autoscaling. Quotas still apply; see Limits and quotas.
spec.autoscaling.scaleMetric	enum	What Yobibyte scales on: `tokensPerSecond`, `requestsPerSecond`, or `concurrency`.
spec.autoscaling.target	number	Target value of the scale metric per replica. Crossing the target triggers a scale-out.
spec.spendCap.amount	number	Hard budget in the workspace's currency (the example below states it explicitly as `spendCap.currency`). When reached, replicas pause and a P2 alert fires.
spec.spendCap.window	enum	`hourly`, `daily`, or `monthly`. The cap resets at the end of each window.
spec.accessControl.apiKeys	string[]	Logical API-key names; the actual key material is minted in the workspace Keys tab and rotated independently.
spec.accessControl.allowedOriginCidrs	string[]	Optional client-IP allow-list, evaluated at the gateway.
spec.observability.otelEndpoint	string	Customer-owned OTel collector endpoint; Yobibyte ships traces to it in addition to the hosted view.
spec.observability.prometheusScrape	boolean	Exposes the standard Yobibyte metric set on a workspace-scoped scrape endpoint for the customer's own Prometheus.
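
The constraints in the table above can be checked client-side before a spec is submitted. The following is a minimal sketch, assuming the spec has already been parsed into a Python dictionary. The function name and the exact rules (for example, requiring `max >= 1`) are assumptions for illustration; the authoritative validation happens at Yobibyte admission.

```python
SCALE_METRICS = {"tokensPerSecond", "requestsPerSecond", "concurrency"}
CAP_WINDOWS = {"hourly", "daily", "monthly"}


def validate_inference_spec(spec: dict, workspace_region: str) -> list:
    """Return a list of problems found; an empty list means the spec passes."""
    problems = []
    # The region must equal the workspace's bound sovereignty region.
    if spec.get("region") != workspace_region:
        problems.append("region must equal the workspace's bound region")
    # Replica bounds: assumed rule is 0 <= min <= max and max >= 1.
    replicas = spec.get("replicas", {})
    lo, hi = replicas.get("min", 0), replicas.get("max", 0)
    if lo < 0 or hi < max(lo, 1):
        problems.append("replicas need 0 <= min <= max and max >= 1")
    # Enum fields must use one of the documented values.
    if spec.get("autoscaling", {}).get("scaleMetric") not in SCALE_METRICS:
        problems.append("unknown autoscaling.scaleMetric")
    cap = spec.get("spendCap", {})
    if cap.get("window") not in CAP_WINDOWS:
        problems.append("unknown spendCap.window")
    if cap.get("amount", 0) <= 0:
        problems.append("spendCap.amount must be positive")
    return problems


spec = {
    "model": "llama-3.1-70b-instruct",
    "region": "uk-london",
    "replicas": {"min": 1, "max": 6},
    "autoscaling": {"scaleMetric": "tokensPerSecond", "target": 1800},
    "spendCap": {"amount": 5500, "window": "monthly"},
}
print(validate_inference_spec(spec, "uk-london"))
print(validate_inference_spec(spec, "eu-frankfurt"))
```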

yaml

apiVersion: yobibyte.yobitel.com/v1
kind: Inference
metadata:
  name: support-bot
  workspace: acme-uk
spec:
  model: llama-3.1-70b-instruct        # marketplace model name
  region: uk-london                     # must match the workspace region
  replicas:
    min: 1
    max: 6
  autoscaling:
    scaleMetric: tokensPerSecond        # tokensPerSecond | requestsPerSecond | concurrency
    target: 1800
  spendCap:
    amount: 5500
    currency: USD
    window: monthly
  accessControl:
    apiKeys: [ "support-bot-prod" ]
    allowedOriginCidrs: [ "10.0.0.0/8" ]
  observability:
    otelEndpoint: "https://otel.acme.com:4317"
    prometheusScrape: true
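
To show how `autoscaling.target` interacts with `replicas.min` and `replicas.max`, here is a sketch of a proportional scaling rule. The actual Yobibyte autoscaler algorithm is not documented in this entry; the rule below is an assumption modelled on the common "observed load divided by per-replica target, rounded up, then clamped" approach, and the function name is hypothetical.

```python
import math


def desired_replicas(observed_total: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep per-replica load at or below the target."""
    raw = math.ceil(observed_total / target_per_replica)
    # Clamp to the range declared in spec.replicas.
    return max(min_replicas, min(max_replicas, raw))


# Values from the support-bot example: target 1800 tokens/s, 1 to 6 replicas.
for load in (0, 5000, 20000):
    print(load, desired_replicas(load, 1800, 1, 6))
```

Under these assumptions an idle endpoint stays at the warm minimum, 5,000 tokens/s needs 3 replicas, and 20,000 tokens/s is capped at the declared maximum of 6.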

FineTune resource reference#

Field	Type	Description
spec.baseModel	string	Marketplace model name to fine-tune from.
spec.region	string	Sovereignty region; must match the workspace.
spec.dataset.uri	string	`s3://`, `azure://`, or `gs://` URI to the training dataset; access is via a customer-supplied role/identity.
spec.dataset.format	enum	`chat`, `completion`, `instruction`, or `raw`.
spec.method	enum	`qlora` (memory-efficient: 4-bit quantized base model with 16-bit adapters), `lora` (adapters trained on an unquantized base model), or `full` (full SFT).
spec.hyperparameters.preset	enum	`fast`, `balanced`, or `quality` — sets sensible defaults; override individual knobs below if you need to.
spec.hyperparameters.epochs	integer	Number of passes over the dataset.
spec.hyperparameters.learningRatePreset	enum	`conservative`, `standard`, or `aggressive`.
spec.hyperparameters.rank	integer	LoRA/QLoRA rank. Higher rank = more capacity, larger adapter.
spec.output.bucket	string	Customer-owned bucket the adapter is written to.
spec.output.kmsKey	string	Customer-managed KMS key the adapter is encrypted with at rest.
spec.spendCap.amount	number	Hard budget for the job. When reached, the job is paused and surfaced for review; no silent overrun.

yaml

apiVersion: yobibyte.yobitel.com/v1
kind: FineTune
metadata:
  name: support-llama-v3
  workspace: acme-uk
spec:
  baseModel: llama-3.1-70b-instruct
  region: uk-london
  dataset:
    uri: s3://acme-ml/support-tickets-2025-q4.jsonl
    format: chat
  method: qlora                          # qlora | lora | full
  hyperparameters:
    preset: balanced                     # fast | balanced | quality
    epochs: 3
    learningRatePreset: standard
    rank: 64
  output:
    bucket: s3://acme-ml/adapters/support-v3
    kmsKey: arn:aws:kms:eu-west-2:111122223333:key/abcd-...
  spendCap:
    amount: 1500
    currency: USD
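
The note that a higher `rank` means a larger adapter follows from how LoRA works: each adapted weight matrix of shape `d_out × d_in` gains two low-rank factors with `rank × (d_in + d_out)` parameters in total. The sketch below estimates adapter size under stated assumptions. The list of adapted matrices (160 square 8192 × 8192 projections) is purely illustrative and is not the set of modules Yobibyte adapts, and 2 bytes per parameter assumes 16-bit adapter weights.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def adapter_bytes(shapes, rank: int, bytes_per_param: int = 2) -> int:
    """Total adapter size in bytes across all adapted (d_in, d_out) matrices."""
    return bytes_per_param * sum(lora_params(i, o, rank) for i, o in shapes)


# Illustrative assumption: 160 adapted matrices, each 8192 x 8192.
shapes = [(8192, 8192)] * 160
for rank in (16, 64):
    mib = adapter_bytes(shapes, rank) / 2**20
    print(f"rank={rank}: {mib:.0f} MiB")
```

Adapter size grows linearly with rank, so under these assumptions moving from rank 16 to rank 64 quadruples the artefact written to the output bucket.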

Marketplace#

Sizing and capacity planning#

Workload	Recommended config	Throughput baseline	Assumptions
Llama 3.1 8B chat, low latency	1× L40S, FP16	~3,200 tok/s/replica	Batch ≤ 16, 2K input / 512 output, P50 TTFT < 250 ms.
Llama 3.1 8B chat, high throughput	1× H100 SXM5, FP8	~12,500 tok/s/replica	Batch 64, prefix-cache-aware routing on.
Llama 3.1 70B chat	4× H100 SXM5 TP=4, FP8	~3,800 tok/s/replica	NVLink 4.0 mandatory, max_model_len 8192.
Llama 3.1 70B chat, ultra low latency	8× H100 SXM5 TP=8, FP8	~6,200 tok/s, P50 TTFT < 180 ms	NVLink 4.0, in-flight batching, FP8 KV cache.
Llama 3.1 405B chat	8× H200 TP=8, FP8	~1,400 tok/s/replica	NVLink 4.0 plus 141 GB HBM3e per card.
Whisper Large-v3	1× L4	Realtime × 32 streams/card	16 kHz mono, 30 s segments.
SDXL image gen	1× L40S, FP16	~3.5 images/s at 1024² 30 steps	Single replica; multi-GPU not worth it under 1024².
Embeddings (BGE-Large)	1× A10G	~8,000 embeddings/s/replica	Sequence length 512, batch 64.
70B QLoRA fine-tune	8× H100 SXM5 single node	~12,000 tok/s training	Rank 64, microbatch 4, grad accum 8.
70B full SFT	32× H100 SXM5 across 4 nodes	~6,400 tok/s training	InfiniBand NDR, ZeRO-3, FP8.
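
The per-replica baselines above translate into a replica count with simple arithmetic. The sketch below assumes a 30% headroom factor over peak demand; that factor and the function name are illustrative assumptions, not Yobibyte guidance.

```python
import math


def replicas_needed(peak_tok_s: float, per_replica_tok_s: float,
                    headroom: float = 0.3) -> int:
    """Replicas required to serve peak demand plus a safety margin."""
    return math.ceil(peak_tok_s * (1 + headroom) / per_replica_tok_s)


# Llama 3.1 70B on 4x H100 SXM5: ~3,800 tok/s/replica from the table above.
replicas = replicas_needed(peak_tok_s=10_000, per_replica_tok_s=3_800)
gpus = replicas * 4          # 4 GPUs per replica at TP=4
print(replicas, gpus)
```

For a 10,000 tok/s peak this gives 4 replicas, or 16 GPUs, which is worth comparing against the workspace GPU quota before setting `replicas.max`.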

Sizing: workspace footprint by team size#

Team size	Workspaces	Inference replicas (steady)	Fine-tune concurrency	Recommended GPU pool
10 engineers	1–2	5–10	1–2	4× H100 SXM5 + 2× L40S shared pool
50 engineers	4–6	30–60	4–6	16× H100 SXM5 + 8× L40S + 4× L4
200 engineers	12–20	120–250	12–20	64× H100 SXM5 + 16× H200 + dedicated MIG pool
Org-wide platform	50+	500+	30+	Multi-region, 256+ H100/H200 + reserved capacity + spot tier

Limits and quotas#

Resource	Default	Enterprise ceiling	How to raise
Workspaces per org	5	200	Self-service in console.
Clusters per workspace	3	50	Self-service; subject to region availability.
GPUs per workspace	16	4,096	Support request plus reserved capacity commit.
Concurrent Inference resources	20	500	Self-service up to 100; ticket beyond.
Concurrent FineTune jobs	5	100	Support request; gated on quota and budget.
Replicas per Inference	10	200	Self-service.
Request rate per endpoint	1,000 RPS	100,000 RPS	Per-key rate-limit overrides via API.
Max model size (single replica)	1.4 TB weights	8 TB	Bounded by HBM × TP × node count.
Marketplace private models per workspace	100	10,000	Self-service.
Spend cap precision	USD 1	USD 1	Hard floor.
OIDC IdP connections per workspace	5	20	Support request.
FineTune job queue depth	50	1,000	Self-service up to 200.
Custom domains per Inference	3	50	Self-service.
Audit log retention	90 days	7 years	Enterprise tier; immutable S3 destination.

Observability#

yaml

groups:
- name: yobibyte-inference
  interval: 30s
  rules:
  - alert: InferenceTokensPerSecondDegraded
    expr: |
      avg by (workspace, inference) (
        yobibyte_inference_tokens_per_second_replica
      ) < 1000
      and
      sum by (workspace, inference) (
        rate(yobibyte_inference_requests_total[5m])
      ) > 0.5
    for: 10m
    labels: { severity: page }
    annotations:
      summary: "{{ $labels.inference }} tokens/sec/replica below 1k"
      runbook: https://docs.yobitel.com/runbooks/inference-throughput

  - alert: GpuHbmExhaustion
    expr: yobibyte_gpu_hbm_used_ratio > 0.97
    for: 5m
    labels: { severity: page }
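
The first rule combines two conditions with `and`, so an idle endpoint (low throughput simply because nobody is calling it) does not page anyone. The same logic is sketched below for a single evaluation. The function name is hypothetical, and the `for: 10m` hold period is not modelled.

```python
from statistics import mean


def throughput_degraded(replica_tok_s, request_rate: float,
                        floor: float = 1000.0, min_rate: float = 0.5) -> bool:
    """True when average tokens/s per replica is low while traffic is present."""
    if not replica_tok_s:
        return False       # no replicas reporting: nothing to evaluate
    return mean(replica_tok_s) < floor and request_rate > min_rate


print(throughput_degraded([800, 900], request_rate=2.0))    # slow under load
print(throughput_degraded([800, 900], request_rate=0.0))    # slow but idle
print(throughput_degraded([1500, 1200], request_rate=2.0))  # healthy
```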

Cost and FinOps#

SKU / mode	On-demand $/GPU/hr	1-yr reserved $/GPU/hr	3-yr reserved $/GPU/hr	Spot floor
H100 SXM5 80GB	$3.25	$2.45	$1.95	$1.20
H200 141GB	$4.25	$3.20	$2.55	$1.75
B200 192GB	$6.00	$4.50	$3.60	n/a
A100 80GB	$2.25	$1.70	$1.40	$0.80
L40S 48GB	$1.20	$0.90	$0.70	$0.45
L4 24GB	$0.50	$0.40	$0.30	$0.22
MI300X 192GB	$4.00	$3.00	$2.45	n/a
Object storage	$0.022/GB-mo	—	—	—
Egress to internet	$0.075/GB	—	—	—
Egress between Yobibyte regions	$0.00/GB	—	—	—
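
The rates above can be turned into a monthly estimate and a rough reserved-versus-on-demand break-even. This is a sketch under two assumptions that are not stated in the table: a month is taken as 730 hours, and reserved capacity is billed for every hour whether or not it is used. Function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month

# $/GPU/hr for H100 SXM5 80GB, copied from the table above.
H100_RATES = {"on_demand": 3.25, "reserved_1y": 2.45, "reserved_3y": 1.95}


def monthly_cost(gpus: int, rate: float, utilisation: float = 1.0) -> float:
    """Monthly cost in USD for `gpus` GPUs billed for `utilisation` of the month."""
    return round(gpus * rate * HOURS_PER_MONTH * utilisation, 2)


def break_even_utilisation(on_demand: float, reserved: float) -> float:
    """Utilisation above which reserved is cheaper than on-demand.

    Assumes reserved capacity is paid for every hour, used or not.
    """
    return reserved / on_demand


for mode, rate in H100_RATES.items():
    print(mode, monthly_cost(4, rate))       # one 4x H100 replica, always on
print(round(break_even_utilisation(3.25, 2.45), 3))
```

Under these assumptions a 1-year reservation pays off once the GPUs are busy for roughly three quarters of the month.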

Security and compliance#

NCSC Cloud Security Principles — controls mapped per principle; OFFICIAL-tier UK regions audited annually.
G-Cloud framework — listed under Cloud Software (Lot 2) and Cloud Support (Lot 3).
SOC 2 Type II — annual third-party audit covering security, availability, confidentiality.
ISO 27001:2022 and ISO 27017/27018 — current certificates available under NDA.
GDPR / UK DPA 2018 — data processing agreement, sub-processor list, and EU SCCs available; data residency enforced at admission.
FedRAMP-equivalent — Moderate-baseline-aligned controls available via partner regions in the US.
HIPAA — BAA available for healthcare workloads; encryption, logging, and access controls audited.
CSA STAR Level 2 — published self-assessment plus third-party attestation.

Migration and alternatives#

Yobibyte is one option among several. Use the comparison table to decide when each fits.

Concern	Yobibyte	DIY on cloud Kubernetes	AWS Bedrock / Vertex AI	Self-hosted Kubeflow
Operational ownership	Yobitel runs the platform	You run the platform	Cloud runs the platform	You run the platform
Time from request to live endpoint	Minutes	Hours to days	Minutes	Days
GPU SKU coverage	30+ SKUs across Yobitel-operated regions	Whatever the cloud sells	Cloud's own SKUs only	Whatever you provision
Sovereignty enforcement	Pinned per workspace, enforced at admission	DIY policy	Region-only, no admission gate	DIY
FOCUS-aligned billing export	Built in	DIY	Proprietary export	DIY
Identity federation	OIDC + SCIM built in	DIY	Cloud IAM only	DIY
Marketplace plus benchmark integration	InferenceBench-sourced	DIY	Vendor curated	DIY
Pricing model	Per-second + reserved + spot, FOCUS-shaped	Cloud bill + your ops cost	Per-token, opaque	Cloud bill + your ops cost

Troubleshooting#

The errors below cover the failure modes seen most often during onboarding and the first weeks of production. The full runbook library is at docs.yobitel.com/runbooks.

Error	Cause	Fix
WorkspaceProvisioningFailed: OIDC discovery 401	Wrong issuer URL or audience in the IdP application; Yobibyte cannot read the IdP's `.well-known/openid-configuration`.	In the workspace Identity tab, re-enter the issuer URL and audience; confirm the IdP's discovery document resolves over the public internet.
InferenceColdStartTimeout	First request hit a scale-to-zero endpoint while the model was still loading from object storage (typical for 70B+ models on a cold replica).	Raise `replicas.min` to 1 on the Inference resource, or pre-warm via the console's 'Warm replicas' control before traffic ramps.
QuotaExceeded: workspace gpus 16/16 in use	Workspace has reached its default GPU quota.	Either lower `replicas.max` on a less-critical Inference, or raise the workspace GPU quota; per Limits and quotas, that takes a support request plus a reserved-capacity commit.
RegionCapacityUnavailable: uk-london-1	Requested SKU has no capacity in the workspace's pinned region right now.	Accept a transient queue (Yobibyte retries automatically), pre-purchase reserved capacity via the console's Capacity tab, or ask your account team about a sibling region inside the same compliance boundary.
AdmissionDenied: complianceMismatch	Inference labelled with a compliance tag that does not match the workspace region — for example a UK workspace asked to host a workload labelled `compliance=us-fedramp`.	Either move the Inference to a workspace bound to the correct sovereignty region, or remove the compliance label if the constraint no longer applies.
BillingExportEmpty	FOCUS export bucket policy does not allow the Yobibyte writer role to write objects.	Apply the bucket policy snippet shown in the workspace's Billing tab and wait one export window; exports retry hourly.
OidcLoginRedirectMismatch	Redirect URI in the IdP application does not include the workspace's callback URL.	Add `https://<workspace>.yobitel.app/auth/callback` to the IdP's allowed redirects and retry sign-in.
KmsDecryptDenied	Customer KMS key policy is missing the Yobibyte data-plane role.	Add the role ARN shown in the workspace's Setup tab to the KMS key policy; FineTune jobs and Inference deployments resume automatically on the next reconciliation tick.
SpendCapExceeded: inference paused	Accrued cost has reached the configured spend cap; replicas are paused (not deleted).	Either raise the spend cap on the Inference resource or wait for the next budget window to begin; paused replicas resume automatically.
MarketplaceModelUnavailable: under license review	Selected model is temporarily withheld from the marketplace pending a license or sovereignty review.	Pick a substitute from the marketplace's 'Similar models' list, or contact your account team for an ETA on the review.

