TL;DR
- Multi-Instance GPU (MIG) carves a single A100, H100, H200 or B200 into up to seven hardware-isolated instances, each with its own SMs, L2 cache slice, and memory partition.
- Each MIG instance appears to Kubernetes as a distinct schedulable resource (e.g. `nvidia.com/mig-1g.10gb`) so multi-tenant workloads share a GPU without interfering with each other's memory bandwidth or latency.
- MIG is configured via the NVIDIA GPU Operator's MIG Manager, with either a `single` strategy (uniform partitions per GPU) or `mixed` (heterogeneous sizes), reconciled from a node label.
- Best fit for inference and notebook workloads where deterministic isolation matters more than peak throughput; not used for tensor-parallel training, which needs full GPUs and NVLink.
What MIG Provides#
MIG is a hardware feature introduced on the A100 (Ampere) and extended through Hopper (H100, H200) and Blackwell (B200). It partitions the GPU's SMs, L2 cache banks, memory controllers, and DRAM address buses, so instances are physically separate rather than time-sliced. A noisy neighbour in one MIG instance cannot starve another of memory bandwidth.
On an H100 80 GB, the supported profiles range from `7g.80gb` (a single instance spanning the whole GPU) down to `1g.10gb` (seven instances, each with one compute slice and 10 GB). On an H200 141 GB the equivalent finest split is `1g.18gb`, and on B200 192 GB the profile catalogue widens further.
Single vs Mixed Strategy#
The Kubernetes device plugin supports two MIG-aware strategies, set via the GPU Operator (a third, `none`, ignores MIG and is the plugin's default):
- `single`: every GPU on the node is partitioned identically (e.g. seven `1g.10gb` instances per GPU). The device plugin advertises `nvidia.com/gpu` and each "GPU" is one MIG instance. Workloads written for full GPUs run unchanged.
- `mixed`: different profiles can coexist on the same node (e.g. one `3g.40gb` + four `1g.10gb`). The device plugin advertises resources by profile name (`nvidia.com/mig-3g.40gb`, `nvidia.com/mig-1g.10gb`) and pods must request the specific profile.
Start with `single` and a uniform profile across the cluster — it is easier to schedule and reason about. Move to `mixed` only when you have a clear workload mix (e.g. one big notebook + several small inference replicas per GPU).
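As a rough illustration, the strategy is usually selected through the GPU Operator's Helm values. The fragment below is a minimal sketch that assumes the operator is installed from the `gpu-operator` chart; every other value is left at its default, and the exact keys should be checked against the chart version in use.

```yaml
# values.yaml fragment for the gpu-operator Helm chart (sketch, not a full file)
mig:
  strategy: single      # switch to "mixed" to advertise per-profile resources
migManager:
  enabled: true         # lets MIG Manager reconcile the nvidia.com/mig.config label
```

Keeping the strategy in version-controlled values means a later move from `single` to `mixed` is an explicit, reviewable change rather than an ad hoc edit on the cluster.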
Pod Resource Requests#
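Under the `single` strategy, a pod requests the generic `nvidia.com/gpu` resource and receives one MIG instance. A minimal sketch, reusing the image and node label from this article; the pod name is hypothetical:

```yaml
# Single strategy: each advertised "GPU" is one MIG instance
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker-single   # hypothetical name
spec:
  containers:
    - name: worker
      image: yobitel/vllm-inference:0.6.4
      resources:
        limits:
          nvidia.com/gpu: 1       # satisfied by one MIG instance on this node
  nodeSelector:
    nvidia.com/mig.config: all-1g.10gb
```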
```yaml
# Mixed strategy: pod requests a specific MIG profile
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: worker
      image: yobitel/vllm-inference:0.6.4
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
  nodeSelector:
    nvidia.com/mig.config: all-1g.10gb
```
Where MIG Helps and Where It Doesn't#
MIG is the right answer when many small workloads share a GPU and need predictable latency. Typical examples are Jupyter notebooks for data scientists, small-model inference endpoints, CI test runners, and per-tenant inference replicas. A `1g.10gb` slice of an H100 is roughly enough to serve a 7B model in FP8 at modest concurrency.
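As a back-of-envelope check on the 7B figure, assume one byte per parameter for FP8 weights and take the nominal 10 GB of the slice at face value (the usable amount is somewhat lower):

$$
M_{\text{weights}} \approx 7 \times 10^{9}\ \text{params} \times 1\ \tfrac{\text{byte}}{\text{param}} = 7\ \text{GB},
\qquad
10\ \text{GB} - 7\ \text{GB} \approx 3\ \text{GB}
$$

The remaining 3 GB or so has to cover the KV cache, activations, and CUDA context, which is why the slice only supports modest concurrency.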
MIG is the wrong answer for distributed training. Tensor and pipeline parallelism rely on NVLink between full GPUs, and MIG instances cannot participate in NCCL collectives across slices. It is also the wrong answer when you genuinely need the full memory bandwidth: large-batch inference of frontier models will always prefer a whole GPU.
| Workload | MIG? | Reason |
|---|---|---|
| Notebook fleet | Yes | Many idle users, isolation matters |
| Small model inference | Yes | Latency-bounded, 7B fits in 1g.10gb |
| Frontier model serving | No | Needs full HBM + NVLink |
| Distributed training | No | NCCL collectives need whole GPUs |
| Batch fine-tuning | Sometimes | Single-GPU LoRA fits, FSDP doesn't |
Repartitioning and Lifecycle#
Changing a node's MIG layout requires that nothing is using the affected GPUs, which in practice means draining the node. The GPU Operator's MIG Manager handles this when the `nvidia.com/mig.config` label changes: it cordons, drains, repartitions, and uncordons. The whole operation typically takes 60-90 seconds, but any in-flight workloads on the node are terminated.
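The value of the `nvidia.com/mig.config` label names an entry in the MIG Manager's mig-parted configuration. As a sketch, a custom entry for the mixed layout mentioned earlier (one `3g.40gb` plus four `1g.10gb` per GPU) might look like the following. The entry name `custom-3g-4x1g` is hypothetical, and how the file is wired into the operator (typically through a ConfigMap) depends on the installation.

```yaml
# mig-parted configuration fragment (sketch)
version: v1
mig-configs:
  custom-3g-4x1g:          # hypothetical name; used as the node label value
    - devices: all         # apply the same layout to every GPU on the node
      mig-enabled: true
      mig-devices:
        "3g.40gb": 1
        "1g.10gb": 4
```

Setting the node label to that entry name would then trigger the reconfiguration described above, so it is worth validating a new layout on a single drained node before rolling it out more widely.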
Telemetry and Quotas#
DCGM Exporter reports metrics per MIG instance, not just per GPU, which is essential for FinOps chargeback. Each MIG slice's GPU-time and memory utilisation can be billed to a namespace or tenant. Combined with Kubernetes ResourceQuota objects on the `nvidia.com/mig-*` resources, MIG gives a hard, auditable substrate for multi-tenant GPU sharing — the foundation of Yobitel's sovereign multi-tenant clusters.
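A per-tenant cap can be sketched with a ResourceQuota like the one below. The namespace `tenant-a`, the quota name, and the limit of four slices are illustrative assumptions. Note that quota keys for extended resources use the `requests.` prefix and do not accept wildcards, so each MIG profile needs its own entry.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mig-quota            # hypothetical name
  namespace: tenant-a        # hypothetical tenant namespace
spec:
  hard:
    # At most four 1g.10gb slices may be requested across the namespace
    requests.nvidia.com/mig-1g.10gb: "4"
    # Explicitly deny larger slices to this tenant
    requests.nvidia.com/mig-3g.40gb: "0"
```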