Professional Services · Infrastructure Architecture
The reference architecture your AI cluster is actually built from
From compute and power envelope through fabric topology, storage tier, platform layer, MLOps, and operations posture. Vendor-neutral. Sovereignty-aware. Signed off before procurement, traceable for years after.
Representative decision record
Signed · ADR-011 · Storage tier
WEKA hot tier in front of Lustre warm
ADR-009 · Compute SKU
HGX H200 over HGX H100 for Q3 wave
ADR-007 · Fabric topology
Status: Active · Rev 3
Decision: 800G InfiniBand XDR over 400G NDR for the east-west fabric.
Context
Cluster scales from 2k GPUs at first cut to a targeted 8k by year two.
Trade-off
Roughly 14% more capex now in exchange for ~40% capacity headroom and no fabric refresh.
Consequence
Switch SKU locks NVIDIA Quantum-X800; storage egress sized to match.
Every architecture call lands as an ADR. Reviewable, dateable, revisitable. The whole pack ships with the reference architecture at sign-off.
The surface area
What an architecture engagement actually covers
Every layer in the cluster makes the others harder to design. We span the lot in one engagement so the decisions reinforce each other instead of fighting.
Compute selection
GPU SKU, host topology, accelerator mix. NVIDIA HGX H100 / H200 / B200 / GB200 against your training and inference profile. Right-sized, not over-sized.
Power and cooling envelope
Rack-level kW budget, PUE target, liquid vs air, the conversation with the data centre. Sized for first cut and the eighteen-month growth path.
Fabric topology
East-west (InfiniBand NDR / XDR, RoCE) and north-south. Rails, oversubscription, blast radius. Designed against intended job sizes, not picked from a catalogue.
Storage tier
Hot, warm, cold. Parallel file systems (Lustre, GPFS, WEKA) feeding the hot path. Checkpoint sizing, dataset staging, MaaS / object durability for the rest.
Platform layer
Bare metal lifecycle (Foreman), Kubernetes, multi-tenancy, scheduler choice, Crossplane-style declarative provisioning. The substrate your teams actually consume.
MLOps surface
Pipeline orchestration, model registry, feature store, eval harness. The day-two product the ML org runs on, designed to compose with the platform underneath.
Operations posture
Sovereignty perimeter (UK NCSC, G-Cloud, OFFICIAL, plus EU GDPR / FedRAMP / MeitY where they apply), DR class per workload, observability and on-call shape from day one.
Architecture risks we close
The decisions that get made by accident when no one is watching
These are the patterns we see when we walk into an architecture review. None of them are technically catastrophic on day one. All of them compound into a Y2 forklift.
Procurement before architecture
What bad looks like
4,096 GPUs ordered, then the fabric is sized to fit
What we design for
Fabric and storage topology drive the BoM, in that order
The biggest single mistake we see. Compute spend is the headline so it gets signed first, then the fabric and storage get squeezed into what the budget has left. We invert the sequence: define the workload envelope, pick the fabric and storage that serve it, then size compute against the bandwidth those tiers can actually feed.
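A minimal sketch of that last step, sizing compute against what a tier can feed. The function name, the per-GPU read rate, and the headroom fraction are illustrative assumptions, not figures from an engagement.

```python
import math


def max_gpus_fed(tier_read_gb_s: float, per_gpu_read_gb_s: float,
                 headroom: float = 0.25) -> int:
    """Largest GPU count a storage tier can feed while keeping `headroom` in reserve.

    All inputs are hypothetical planning numbers supplied by the caller.
    """
    if per_gpu_read_gb_s <= 0:
        raise ValueError("per_gpu_read_gb_s must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = tier_read_gb_s * (1 - headroom)  # bandwidth left after the reserve
    return math.floor(usable / per_gpu_read_gb_s)


# Hypothetical example: a 1,000 GB/s hot tier, 0.5 GB/s sustained read per GPU.
print(max_gpus_fed(1000.0, 0.5))
```

If the GPU count on the order form is higher than the number this returns, the storage tier is the thing to resize before the compute is signed.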
Year-two surprise
What bad looks like
No capacity model. Hit the ceiling, ad-hoc forklift in Q6.
What we design for
Scaling envelope documented from day one
Most clusters get re-architected within eighteen months because nobody modelled the path from first cut to steady state. We write the capacity envelope alongside the reference architecture: what scales linearly, what needs a forklift, what triggers a re-design, with the trigger conditions named.
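One way to keep named trigger conditions reviewable is to hold them as data next to the reference architecture. The sketch below assumes a simple threshold model; the class names, metrics, and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SCALE_LINEARLY = "scale linearly"
    FORKLIFT = "forklift"
    REDESIGN = "re-design"


@dataclass(frozen=True)
class Trigger:
    layer: str        # which layer of the cluster the trigger belongs to
    metric: str       # name of the observed metric
    threshold: float  # fires when the observed value reaches this
    action: Action


def fired(triggers: list[Trigger], observed: dict[str, float]) -> list[Trigger]:
    """Return the triggers whose metric is present and at or above threshold."""
    return [t for t in triggers
            if t.metric in observed and observed[t.metric] >= t.threshold]


# Hypothetical envelope; real thresholds come out of the capacity model.
ENVELOPE = [
    Trigger("compute", "gpu_count", 2_000, Action.SCALE_LINEARLY),
    Trigger("storage", "hot_tier_util", 0.80, Action.FORKLIFT),
    Trigger("fabric", "gpu_count", 8_000, Action.REDESIGN),
]

for t in fired(ENVELOPE, {"gpu_count": 2_400, "hot_tier_util": 0.85}):
    print(f"{t.layer}: {t.action.value}")
```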
RFP-driven vendor lock
What bad looks like
Single-vendor stack, no substitution path
What we design for
Vendor-neutral BoM with named substitution rules
A reference architecture that names exactly one vendor per layer is a procurement instruction, not an architecture. We design with substitution rules: which decisions are vendor-load-bearing and which are interchangeable, what changes if you swap accelerator family or fabric vendor, what stays the same.
DR posture left to operations
What bad looks like
No RTO or RPO targets. Discovered during the first incident.
What we design for
RTO and RPO defined per workload class
Training, inference, RAG indexes, and feature stores all have different recovery profiles. Treating them as one class lands a DR plan that is either too expensive for low-tier workloads or too thin for the tier-zero ones. We write the workload classification and pin recovery targets to it before the operations team inherits a problem.
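As an illustration of pinning recovery targets to a classification, a sketch along these lines can live beside the DR plan. The four classes come from the text above; every RTO and RPO value here is a placeholder, not a recommendation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


# Placeholder targets per workload class (assumed values for illustration).
TARGETS = {
    "inference": RecoveryTarget(rto=timedelta(minutes=15), rpo=timedelta(0)),
    "rag_index": RecoveryTarget(rto=timedelta(hours=1), rpo=timedelta(hours=1)),
    "feature_store": RecoveryTarget(rto=timedelta(hours=4), rpo=timedelta(minutes=30)),
    "training": RecoveryTarget(rto=timedelta(hours=24), rpo=timedelta(hours=4)),
}


def breaches(workload_class: str, downtime: timedelta,
             data_loss: timedelta) -> list[str]:
    """List which targets an incident or DR test exceeded for a class."""
    target = TARGETS[workload_class]
    out = []
    if downtime > target.rto:
        out.append("RTO")
    if data_loss > target.rpo:
        out.append("RPO")
    return out


print(breaches("training", timedelta(hours=30), timedelta(hours=1)))
```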
Sovereignty as an afterthought
What bad looks like
Cluster designed, then compliance review rewrites it
What we design for
Sovereignty perimeter named in section one of the reference architecture
UK NCSC, G-Cloud, OFFICIAL, GDPR, FedRAMP, MeitY. Every perimeter constrains data residency, key custody, vendor nationality, and operations access. We start there. The reference architecture lands compliant by construction, not retrofitted to a checklist.
Decisions without traceability
What bad looks like
Why is this Lustre, not WEKA? Nobody on the team remembers.
What we design for
Every load-bearing call lands as a dated, reviewable ADR
Two years in, the team that designed the cluster has moved on. Without architecture decision records, every revisit becomes archaeology. We deliver the ADR pack alongside the architecture: context, decision, trade-offs, consequence, and the conditions under which we'd revisit the call.
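For teams that keep ADRs in version control, the fields listed above map onto a small record type. This is a sketch of one possible shape, not the format the pack ships in; the class, its field names, and the placeholder date are assumptions, while the ADR-007 text is taken from this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ADR:
    number: int
    layer: str
    title: str
    status: str
    decided_on: date
    context: str
    decision: str
    trade_off: str
    consequence: str
    revisit_when: str

    def render(self) -> str:
        """Plain-text form, one field per line, suitable for committing."""
        header = f"ADR-{self.number:03d} · {self.layer} · {self.title}"
        fields = [
            ("Status", f"{self.status} ({self.decided_on.isoformat()})"),
            ("Context", self.context),
            ("Decision", self.decision),
            ("Trade-off", self.trade_off),
            ("Consequence", self.consequence),
            ("Revisit when", self.revisit_when),
        ]
        return "\n".join([header] + [f"{k}: {v}" for k, v in fields])


adr_007 = ADR(
    number=7,
    layer="Fabric topology",
    title="800G InfiniBand XDR over 400G NDR",
    status="Active · Rev 3",
    decided_on=date(2025, 1, 1),  # placeholder date, not from the source
    context="Cluster scales from 2k GPUs at first cut to a targeted 8k by year two.",
    decision="Standardise on 800G InfiniBand XDR for the east-west fabric.",
    trade_off="Roughly 14% more capex now for ~40% capacity headroom.",
    consequence="Switch SKU locks NVIDIA Quantum-X800; storage egress sized to match.",
    revisit_when="Target scale changes by more than 2x, or a RoCEv2 alternative passes the acceptance test.",
)
print(adr_007.render())
```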
Architecture decision records
The format every load-bearing call lands in
Three sample ADRs below. Real engagements ship dozens. The pack lives in version control with the reference architecture, so the team that joins in Y3 can read what the team that signed in Y1 actually decided.
ADR-007
Fabric topology
800G InfiniBand XDR over 400G NDR
Context
Cluster scales from 2k GPUs at first cut to a targeted 8k by year two. Mixed training + RAG-heavy inference share the fabric.
Decision
Standardise on 800G InfiniBand XDR for the east-west fabric. Rail-optimised topology with a non-blocking (1:1) spine.
Trade-off
Roughly 14% more capex now in exchange for ~40% capacity headroom and no fabric refresh inside the three-year envelope.
Revisit when
Revisit if (a) target scale changes by more than 2x, or (b) a RoCEv2 alternative meets the acceptance test's latency and lossless thresholds when paired with PFC/ECN and DCQCN-style congestion control tuning.
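Two of the numbers in this ADR are easy to check mechanically: the 1:1 ratio in the decision and revisit condition (a). The sketch below assumes the ratio is defined as downlink bandwidth over uplink bandwidth at a switch tier; the port counts and function names are hypothetical.

```python
from fractions import Fraction


def oversubscription(downlinks: int, uplinks: int,
                     down_gbps: int = 800, up_gbps: int = 800) -> Fraction:
    """Downlink-to-uplink bandwidth ratio at one switch tier (1 means non-blocking)."""
    return Fraction(downlinks * down_gbps, uplinks * up_gbps)


def scale_revisit(planned_gpus: int, new_target_gpus: int, factor: int = 2) -> bool:
    """True when the target moves by more than `factor` in either direction."""
    hi = max(planned_gpus, new_target_gpus)
    lo = min(planned_gpus, new_target_gpus)
    return hi > factor * lo


# Hypothetical leaf switch: 32 ports down to hosts, 32 ports up to the spine.
print(oversubscription(32, 32))     # ratio of 1, i.e. 1:1
print(scale_revisit(8_000, 20_000)) # the 8k target more than doubles
```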
ADR-009
Compute SKU
HGX H200 over HGX H100 for the Q3 procurement wave
Context
Workload mix is inference-dominant with long-context RAG. KV-cache pressure is the dominant constraint, not raw FP16 throughput.
Decision
Standardise the Q3 wave on HGX H200. Existing H100 inventory stays in service for training-class workloads.
Trade-off
Per-GPU capex premium of around 18% over H100. Justified by ~1.4x memory bandwidth (4.8 vs 3.35 TB/s) and ~1.8x HBM capacity (141 GB vs 80 GB) for the dominant workload.
Revisit when
Revisit before the Q1 wave once B200 lead times and rack-level power delivery are signed off with the data centre.
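The ratios in the trade-off follow directly from the quoted figures, and KV-cache pressure can be roughed out with the usual per-token estimate. In the sketch below the H100 and H200 numbers are the ones quoted above; the model shape (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example, not a workload from an engagement.

```python
# Per-GPU figures as quoted in the trade-off above.
H100 = {"hbm_gb": 80, "bw_tb_s": 3.35}
H200 = {"hbm_gb": 141, "bw_tb_s": 4.8}

bw_ratio = H200["bw_tb_s"] / H100["bw_tb_s"]   # memory bandwidth uplift
cap_ratio = H200["hbm_gb"] / H100["hbm_gb"]    # HBM capacity uplift


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes stored per token: one K and one V vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # hypothetical model
print(f"bandwidth x{bw_ratio:.2f}, capacity x{cap_ratio:.2f}")
print(f"{per_token} bytes of KV cache per token")
```

For long-context serving, the per-token figure multiplied by context length and concurrent sequences is what competes with the model weights for HBM, which is why capacity and bandwidth weigh more heavily than raw throughput in this decision.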
ADR-011
Storage tier
WEKA hot tier in front of Lustre warm
Context
Training jobs need sustained reads at fabric line rate. Checkpoint writes pulse hard at end-of-epoch. Existing Lustre footprint serves the warm tier well.
Decision
WEKA fronts the hot path; Lustre retains the warm tier. Dataset staging moves automatically on access pattern, not on a schedule.
Trade-off
Two file systems to operate instead of one. Justified by 3x sustained hot-read throughput for training jobs and a clean evacuation path for cold data.
Revisit when
Revisit if the hot working set drops below 8% of total storage or if a single-vendor solution lands competitive on the same benchmark.
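Both the checkpoint pulse in the context and the 8% revisit threshold can be expressed as simple checks. This is a rough sketch: the 14 bytes per parameter (half-precision weights plus full-precision master weights and two optimizer moments) and the 60-second write window are assumptions the caller should replace with measured values.

```python
def checkpoint_write_gb_s(params_billion: float, bytes_per_param: float,
                          window_s: float) -> float:
    """Sustained write rate needed to land one full checkpoint inside `window_s`."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    size_gb = params_billion * bytes_per_param  # 1e9 params x bytes, in GB
    return size_gb / window_s


def hot_set_revisit(hot_tb: float, total_tb: float, floor: float = 0.08) -> bool:
    """True when the hot working set falls below `floor` of total storage."""
    return hot_tb / total_tb < floor


# Hypothetical: 70B parameters, assumed 14 bytes per parameter, 60 s window.
print(f"{checkpoint_write_gb_s(70, 14, 60):.1f} GB/s")
print(hot_set_revisit(hot_tb=70, total_tb=1_000))
```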
Bill of materials snapshot
The categories your procurement team takes to market
The shape of a real BoM, not specific quantities. The deliverable names exact vendor options against your workload, with substitution rules so the architecture survives an RFP cycle intact.
Compute
- NVIDIA HGX H100 / H200 / B200 / GB200
- Head-node and management hosts
- Optional CPU-only inference tier
Fabric
- InfiniBand NDR or XDR (east-west)
- Switch + cabling BoM
- Out-of-band management network
Storage
- Hot parallel FS (WEKA / VAST)
- Warm parallel FS (Lustre / GPFS)
- Object tier for datasets and checkpoints
Platform
- Bare-metal lifecycle (Foreman / MAAS)
- Kubernetes distribution + GPU operator
- Multi-tenancy and scheduler choice
MLOps
- Pipeline orchestrator
- Model registry
- Feature store and eval harness
Observability
- Metrics, logs, traces stack
- GPU + fabric telemetry
- Cost and utilisation attribution
DR + sovereignty
- RTO / RPO classification per workload
- Key management and custody model
- Sovereignty perimeter (NCSC / G-Cloud / OFFICIAL / GDPR / FedRAMP / MeitY)
Your handover pack
What lands when we leave the room
Every engagement closes with version-controlled artefacts your team can act on the day after we leave. Not a slide deck. Not a “we'll send the runbook next week.”
These are the documents your procurement, platform, and operations teams will keep referencing for years. They have to be readable on their own.
Reference architecture document
The canonical write-up. Workload envelope, layer-by-layer design, the sovereignty perimeter, the operations posture. The artefact every later decision is checked against.
Architecture decision records (ADR pack)
Every load-bearing call, dated and reviewable. Context, decision, trade-off, consequence, and the conditions under which we'd revisit the call.
Vendor-neutral bill of materials
Categorised by layer, with named substitution rules. Procurement can take it to multiple vendors without losing the architectural intent.
Capacity model and scaling envelope
First cut, steady state, the trigger conditions that prompt a re-design. The number the CFO and the head of platform read off the same page.
Vendor evaluation matrix
Weighted scoring against the criteria that matter for your cluster, not a generic analyst grid. Used to compare RFP responses on facts.
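A weighted matrix reduces to a normalised weighted sum per vendor. The sketch below shows the arithmetic only; the criteria, weights, scores, and vendor names are hypothetical.

```python
def weighted_score(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Weighted average of criterion scores, normalised by the total weight."""
    missing = weights.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[c] * scores[c] for c in weights) / total


# Hypothetical criteria and RFP responses, scored 1 to 5.
WEIGHTS = {"sustained_read": 4, "sovereignty_fit": 3, "lead_time": 2, "support": 1}
RESPONSES = {
    "vendor_a": {"sustained_read": 5, "sovereignty_fit": 3, "lead_time": 2, "support": 4},
    "vendor_b": {"sustained_read": 3, "sovereignty_fit": 5, "lead_time": 4, "support": 3},
}

for vendor, scores in RESPONSES.items():
    print(vendor, round(weighted_score(WEIGHTS, scores), 2))
```

Raising on a missing criterion keeps an incomplete RFP response from quietly scoring as if the gap did not matter.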
Runbook seeds
Skeleton runbooks for the operations team: bring-up sequence, day-two scenarios, the on-call shape. The starting point, not the finished article.
How we engage
Pick the shape that fits your team
From end-to-end architecture ownership to a time-boxed review of a draft. The scope call confirms which fits; the statement of work names the deliverables.
Yobitel-led
We own the architecture end-to-end
Workload discovery, reference architecture, ADR pack, BoM, capacity model, vendor evaluation, sign-off. Best when you want the architecture delivered against a fixed milestone with a single owner.
Collaborative
We pair with your architecture team
Joint design sessions on the trickier surfaces: fabric topology, storage tier, sovereignty constraints. Your team executes the BoM and the procurement; we co-author the ADRs and join the sign-off.
Advisory
Time-boxed review
Fixed-window engagement to review your draft architecture or an incumbent vendor's proposal. We spot the load-bearing risks, write a focused set of recommendations, deliver a signed report.
Related
Network fabrics for AI clusters
Where the fabric topology ADR turns into a built, tested, and tuned east-west fabric. The build practice that sits downstream of architecture.
Related
Platform layer for AI GPU clouds
The total-estate platform delivery that the architecture's platform-layer ADR feeds into. Bare metal, VMs, containers, the lot.
Tell us what your cluster needs to do, and at what scale.
A short questionnaire covers cluster intent, target scale, hard constraints, and engagement shape. Our architecture practice lead replies inside one working day with a fitted scope, an indicative timeline to first ADR, and a few sample reference architectures from comparable engagements.
Same engineering bench that builds the fabric and the platform layer the architecture lands on. Engagements scoped to any sovereignty perimeter (NCSC, G-Cloud, OFFICIAL, GDPR, FedRAMP, MeitY, and beyond). Vendor-neutral. Substitution rules named. Procurement-ready.