Professional Services · Infrastructure Architecture

The reference architecture your AI cluster is actually built from

From compute and power envelope through fabric topology, storage tier, platform layer, MLOps, and operations posture. Vendor-neutral. Sovereignty-aware. Signed off before procurement, traceable for years after.

See a sample ADR pack

NVIDIA HGX H100 · H200 · B200 · GB200InfiniBand NDR / XDR · Lustre · GPFS · WEKA · MaaS · Foreman

Representative decision record

Signed

ADR-011 · Storage tier

WEKA hot tier in front of Lustre warm

ADR-009 · Compute SKU

HGX H200 over HGX H100 for Q3 wave

ADR-007 · Fabric topology

Status: Active · Rev 3

Anchor

Decision: 800G InfiniBand XDR over 400G NDR for the east-west fabric.

Context

Cluster scales from 2k GPU at first cut to a targeted 8k by year two.

Trade-off

14% capex now in exchange for ~40% capacity headroom and no fabric refresh.

Consequence

Switch SKU locks NVIDIA Quantum-X800; storage egress sized to match.

Every architecture call lands as an ADR. Reviewable, dateable, revisitable. The whole pack ships with the reference architecture at sign-off.

The surface area

What an architecture engagement actually covers

Every layer in the cluster makes the others harder to design. We span the lot in one engagement so the decisions reinforce instead of fight each other.

Compute selection

GPU SKU, host topology, accelerator mix. NVIDIA HGX H100 / H200 / B200 / GB200 against your training and inference profile. Right-sized, not over-sized.

Power and cooling envelope

Rack-level kW budget, PUE target, liquid vs air, the conversation with the data centre. Sized for first cut and the eighteen-month growth path.

Fabric topology

East-west (InfiniBand NDR / XDR, RoCE) and north-south. Rails, oversubscription, blast radius. Designed against intended job sizes, not picked from a catalogue.

Storage tier

Hot, warm, cold. Parallel file systems (Lustre, GPFS, WEKA) feeding the hot path. Checkpoint sizing, dataset staging, MaaS / object durability for the rest.

Platform layer

Bare metal lifecycle (Foreman), Kubernetes, multi-tenancy, scheduler choice, Crossplane-style declarative provisioning. The substrate your teams actually consume.

MLOps surface

Pipeline orchestration, model registry, feature store, eval harness. The day-two product the ML org runs on, designed to compose with the platform underneath.

Operations posture

Sovereignty perimeter (UK NCSC, G-Cloud, OFFICIAL, plus EU / FedRAMP / MeitY where they apply), DR class per workload, observability and on-call shape from day one.

Architecture risks we close

The decisions that get made by accident when no one is watching

These are the patterns we see when we walk into an architecture review. None of them are technically catastrophic on day one. All of them compound into a Y2 forklift.

Procurement before architecture

What bad looks like

4,096 GPUs ordered, then the fabric is sized to fit

What we design for

Fabric and storage topology drive the BoM, in that order

The biggest single mistake we see. Compute spend is the headline so it gets signed first, then the fabric and storage get squeezed into what the budget has left. We invert the sequence: define the workload envelope, pick the fabric and storage that serve it, then size compute against the bandwidth those tiers can actually feed.

Year-two surprise

What bad looks like

No capacity model. Hit the ceiling, ad-hoc forklift in Q6.

What we design for

Scaling envelope documented from day one

Most clusters get re-architected within eighteen months because nobody modelled the path from first cut to steady state. We write the capacity envelope alongside the reference architecture: what scales linearly, what needs a forklift, what triggers a re-design, with the trigger conditions named.

RFP-driven vendor lock

What bad looks like

Single-vendor stack, no substitution path

What we design for

Vendor-neutral BoM with named substitution rules

A reference architecture that names exactly one vendor per layer is a procurement instruction, not an architecture. We design with substitution rules: which decisions are vendor-load-bearing and which are interchangeable, what changes if you swap accelerator family or fabric vendor, what stays the same.

DR posture left to operations

What bad looks like

No RTO or RPO targets. Discovered during the first incident.

What we design for

RTO and RPO defined per workload class

Training, inference, RAG indexes, and feature stores all have different recovery profiles. Treating them as one class lands a DR plan that is either too expensive for low-tier workloads or too thin for the tier-zero ones. We write the workload classification and pin recovery targets to it before the operations team inherits a problem.

Sovereignty as an afterthought

What bad looks like

Cluster designed, then compliance review rewrites it

What we design for

Sovereignty perimeter named in section one of the reference architecture

UK NCSC, G-Cloud, OFFICIAL, GDPR, FedRAMP, MeitY. Every perimeter constrains data residency, key custody, vendor nationality, and operations access. We start there. The reference architecture lands compliant by construction, not retrofitted to a checklist.

Decisions without traceability

What bad looks like

Why is this Lustre, not WEKA? Nobody on the team remembers.

What we design for

Every load-bearing call lands as a dated, reviewable ADR

Two years in, the team that designed the cluster has moved on. Without architecture decision records, every revisit becomes archaeology. We deliver the ADR pack alongside the architecture: context, decision, trade-offs, consequence, and the conditions under which we'd revisit the call.

Architecture decision records

The format every load-bearing call lands in

Three sample ADRs below. Real engagements ship dozens. The pack lives in version control with the reference architecture and the team that joins in Y3 can read what the team that signed in Y1 actually decided.

ADR-007

Fabric topology

Signed

800G InfiniBand XDR over 400G NDR

Context

Cluster scales from 2k GPU at first cut to a targeted 8k by year two. Mixed training + RAG-heavy inference share the fabric.

Decision

Standardise on 800G InfiniBand XDR for the east-west fabric. Rail-optimised topology with 1:1 oversubscription on the spine.

Trade-off

Roughly 14% capex now in exchange for ~40% capacity headroom and no fabric refresh inside the three-year envelope.

Revisit when

Revisit if (a) target scale changes by more than 2x, or (b) a RoCEv2 alternative meets the acceptance test's latency and lossless thresholds when paired with PFC/ECN and DCQCN-style congestion control tuning.

ADR-009

Compute SKU

Signed

HGX H200 over HGX H100 for the Q3 procurement wave

Context

Workload mix is inference-dominant with long-context RAG. KV-cache pressure is the dominant constraint, not raw FP16 throughput.

Decision

Standardise the Q3 wave on HGX H200. Existing H100 inventory stays in service for training-class workloads.

Trade-off

Per-GPU capex premium of around 18% over H100. Justified by ~1.4x memory bandwidth (4.8 vs 3.35 TB/s) and ~1.8x HBM capacity (141 GB vs 80 GB) for the dominant workload.

Revisit when

Revisit before the Q1 wave once B200 lead times and rack-level power delivery are signed off with the data centre.

ADR-011

Storage tier

Signed

WEKA hot tier in front of Lustre warm

Context

Training jobs need sustained reads at fabric line rate. Checkpoint writes pulse hard at end-of-epoch. Existing Lustre footprint serves the warm tier well.

Decision

WEKA fronts the hot path; Lustre retains the warm tier. Dataset staging moves automatically on access pattern, not on a schedule.

Trade-off

Two file systems to operate instead of one. Justified by 3x sustained hot-read throughput at the trained job and a clean evacuation path for cold data.

Revisit when

Revisit if the hot working set drops below 8% of total storage or if a single-vendor solution lands competitive on the same benchmark.

Bill of materials snapshot

The categories your procurement team takes to market

The shape of a real BoM, not specific quantities. The deliverable names exact vendor options against your workload, with substitution rules so the architecture survives an RFP cycle intact.

Compute

NVIDIA HGX H100 / H200 / B200 / GB200
Head-node and management hosts
Optional CPU-only inference tier

Fabric

InfiniBand NDR or XDR (east-west)
Switch + cabling BoM
Out-of-band management network

Storage

Hot parallel FS (WEKA / VAST)
Warm parallel FS (Lustre / GPFS)
Object tier for datasets and checkpoints

Platform

Bare-metal lifecycle (Foreman / MAAS)
Kubernetes distribution + GPU operator
Multi-tenancy and scheduler choice

MLOps

Pipeline orchestrator
Model registry
Feature store and eval harness

Observability

Metrics, logs, traces stack
GPU + fabric telemetry
Cost and utilisation attribution

DR + sovereignty

RTO / RPO classification per workload
Key management and custody model
Sovereignty perimeter (NCSC / G-Cloud / OFFICIAL / GDPR / FedRAMP / MeitY)

Your handover pack

What lands when we leave the room

Every engagement closes with version-controlled artefacts your team can act on the day after we leave. Not a slide deck. Not a “we'll send the runbook next week.”

These are the documents your procurement, platform, and operations teams will keep referencing for years. They have to be readable on their own.

Reference architecture document

The canonical write-up. Workload envelope, layer-by-layer design, the sovereignty perimeter, the operations posture. The artefact every later decision is checked against.

Architecture decision records (ADR pack)

Every load-bearing call, dated and reviewable. Context, decision, trade-off, consequence, and the conditions under which we'd revisit the call.

Vendor-neutral bill of materials

Categorised by layer, with named substitution rules. Procurement can take it to multiple vendors without losing the architectural intent.

Capacity model and scaling envelope

First cut, steady state, the trigger conditions that prompt a re-design. The number the CFO and the head of platform read off the same page.

Vendor evaluation matrix

Weighted scoring against the criteria that matter for your cluster, not a generic analyst grid. Used to compare RFP responses on facts.

Runbook seeds

Skeleton runbooks for the operations team: bring-up sequence, day-two scenarios, the on-call shape. The starting point, not the finished article.

How we engage

Pick the shape that fits your team

From end-to-end architecture ownership to a time-boxed review of a draft. The scope call confirms which fits; the statement of work names the deliverables.

Yobitel-led

We own the architecture end-to-end

Workload discovery, reference architecture, ADR pack, BoM, capacity model, vendor evaluation, sign-off. Best when you want the architecture delivered against a fixed milestone with a single owner.

Collaborative

We pair with your architecture team

Joint design sessions on the trickier surfaces: fabric topology, storage tier, sovereignty constraints. Your team executes the BoM and the procurement; we co-author the ADRs and join the sign-off.

Advisory

Time-boxed review

Fixed-window engagement to review your draft architecture or an incumbent vendor's proposal. We spot the load-bearing risks, write a focused set of recommendations, deliver a signed report.

Network fabrics for AI clusters

Where the fabric topology ADR turns into a built, tested, and tuned east-west fabric. The build practice that sits downstream of architecture.

Platform layer for AI GPU clouds

The total-estate platform delivery that the architecture's platform-layer ADR feeds into. Bare metal, VMs, containers, the lot.

Tell us what your cluster needs to do, and at what scale.

A short questionnaire covers cluster intent, target scale, hard constraints, and engagement shape. Our architecture practice lead replies inside one working day with a fitted scope, an indicative timeline to first ADR, and a few sample reference architectures from comparable engagements.

Prefer email? Contact us

Same engineering bench that builds the fabric and the platform layer the architecture lands on. Engagements scoped to any sovereignty perimeter (NCSC, G-Cloud, OFFICIAL, GDPR, FedRAMP, MeitY, and beyond). Vendor-neutral. Substitution rules named. Procurement-ready.

The reference architecture your AI cluster is actually built from

NVIDIA HGX H100 · H200 · B200 · GB200InfiniBand NDR / XDR · Lustre · GPFS · WEKA · MaaS · Foreman

What lands when we leave the room

Every engagement closes with version-controlled artefacts your team can act on the day after we leave. Not a slide deck. Not a “we'll send the runbook next week.”

These are the documents your procurement, platform, and operations teams will keep referencing for years. They have to be readable on their own.

Tell us what your cluster needs to do, and at what scale.