TL;DR
- MPI is a standardised message-passing API (currently MPI-4.1, ratified 2023) implemented by Open MPI, MPICH, MVAPICH, and Intel MPI.
- In AI clusters it serves mainly as a job launcher and process manager (`mpirun`, or SLURM's `srun` over PMI-2/PMIx), with NCCL handling the data-plane collectives.
- Defines point-to-point operations (Send/Recv), collectives (AllReduce, Bcast, etc.), and one-sided RMA; NCCL provides GPU-native equivalents for the collectives and for point-to-point send/receive.
- Plays a central role in MPI+NCCL hybrid setups: MPI manages ranks, hostfiles, and rendezvous; NCCL does the GPU-fabric heavy lifting.
Overview
The Message Passing Interface is the standardised API that the HPC community has used to write distributed applications since 1994. It defines point-to-point send/receive, blocking and non-blocking collectives, one-sided remote memory access, and process management primitives. The current standard is MPI-4.1 (2023).
In AI infrastructure, MPI is rarely the data plane any more: NCCL handles GPU-to-GPU collectives with better topology awareness. But MPI remains the dominant job launcher and process manager: `mpirun` and SLURM's `srun` (which speaks the PMI-2 and PMIx wire-up protocols that MPI libraries use) start a large share of multi-node training jobs.
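To make the collective semantics concrete, here is a minimal plain-Python model of what AllReduce (with a sum operator) and Bcast compute. It makes no MPI or NCCL calls: the "ranks" are simply entries in a list, and the function names `allreduce_sum` and `bcast` are illustrative, not part of any library.

```python
def allreduce_sum(rank_buffers):
    """Model AllReduce(sum): every rank ends up with the element-wise sum."""
    total = [sum(column) for column in zip(*rank_buffers)]
    # Each rank receives its own copy of the reduced buffer.
    return [list(total) for _ in rank_buffers]


def bcast(rank_buffers, root=0):
    """Model Bcast: every rank ends up with a copy of the root's buffer."""
    return [list(rank_buffers[root]) for _ in rank_buffers]


# Four simulated ranks, each holding a two-element "gradient".
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = allreduce_sum(grads)
print(reduced[0])  # every rank holds [16.0, 20.0]
```

A real implementation reaches the same end state without gathering all buffers in one place; how the data moves between ranks is exactly what MPI and NCCL optimise.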
Implementations
| Implementation | Maintainer | Notes |
|---|---|---|
| Open MPI | Community | Most common in academic and neocloud environments |
| MPICH | Argonne National Laboratory | Reference implementation |
| MVAPICH | Ohio State University | InfiniBand-optimised derivative of MPICH |
| Intel MPI | Intel | Commercial MPICH derivative, used in Intel HPC stacks |
| HPC-X / NVIDIA HPC SDK | NVIDIA | Open MPI + UCX + HCOLL bundle |
Role in AI Training
A typical multi-node PyTorch job uses `mpirun` (or SLURM's `srun`) to start one process per GPU on each node. MPI handles rendezvous, giving every process its rank, the world size, and the means to reach every other process, and then the application initialises a `torch.distributed` process group with the NCCL backend.
From that point, NCCL takes over the data plane: every gradient AllReduce, every parameter broadcast, every all-gather goes through NCCL, not MPI. MPI's role narrows to occasional control-plane operations like barriers between training phases or rank-0 broadcasts of metadata.
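The hand-off described above can be sketched as follows. This is a minimal illustration, not a definitive launcher integration: it assumes Open MPI's `mpirun` (which sets the `OMPI_COMM_WORLD_*` variables), PyTorch's `env://` initialisation (which additionally needs `MASTER_ADDR` and `MASTER_PORT` in the environment), and a helper name, `nccl_init_kwargs`, that is purely hypothetical.

```python
import os


def nccl_init_kwargs(env):
    """Build init_process_group arguments from Open MPI's per-process variables."""
    return {
        "backend": "nccl",
        "init_method": "env://",  # reads MASTER_ADDR and MASTER_PORT
        "rank": int(env["OMPI_COMM_WORLD_RANK"]),
        "world_size": int(env["OMPI_COMM_WORLD_SIZE"]),
    }


def main():
    # Imported lazily so the helper above works without PyTorch installed.
    import torch
    import torch.distributed as dist

    # One process per GPU: bind this process to the GPU matching its local rank.
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(**nccl_init_kwargs(os.environ))

    # Data plane: NCCL sums this tensor across all ranks.
    grad = torch.ones(4, device="cuda") * dist.get_rank()
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()


# Only run the distributed part when actually started by Open MPI's launcher.
if __name__ == "__main__" and "OMPI_COMM_WORLD_RANK" in os.environ:
    main()
```

Under these assumptions a launch might look like `mpirun -np 16 --hostfile hosts -x MASTER_ADDR -x MASTER_PORT python train.py`, where `hosts` and `train.py` are placeholder names and `-x` forwards the two variables to every rank.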
Operational Notes
- PMIx vs PMI-2: modern launchers prefer PMIx for richer rendezvous and async event support; older clusters may still use PMI-2.
- MPI and NCCL must agree on rank numbering: pass `OMPI_COMM_WORLD_RANK` (Open MPI), `PMI_RANK` (MPICH-based launchers), or `SLURM_PROCID` (`srun`) through to the framework.
- UCX is the default transport layer for Open MPI on modern InfiniBand/RoCE clusters; verify with `ompi_info | grep ucx`.
- HCOLL (NVIDIA's hierarchical collective library) can replace MPI's own collectives for CPU-side reductions; rarely used in AI but valuable in hybrid CPU+GPU jobs.
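The rank pass-through from the notes above can be sketched as a small translation step run at the top of the training script. This is an illustrative helper, not part of any library: `framework_env` and `LAUNCHER_VARS` are hypothetical names, only Open MPI and SLURM are covered, and the target names `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` are the ones PyTorch's `env://` initialisation and `torchrun` conventionally use.

```python
import os

# (rank, world size, local rank) variable names set by each launcher.
# Open MPI is checked first: under mpirun inside a SLURM allocation the
# SLURM_* variables may also be present but do not describe the MPI rank.
LAUNCHER_VARS = [
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE", "OMPI_COMM_WORLD_LOCAL_RANK"),
    ("SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID"),
]


def framework_env(env):
    """Translate launcher-specific rank variables into framework-style names."""
    for rank_var, size_var, local_var in LAUNCHER_VARS:
        if rank_var in env:
            return {
                "RANK": env[rank_var],
                "WORLD_SIZE": env[size_var],
                "LOCAL_RANK": env[local_var],
            }
    raise RuntimeError("no known launcher rank variable found")


def export_framework_env():
    """Copy the translated values into this process's environment."""
    os.environ.update(framework_env(os.environ))
```

Deriving all three values from one launcher's variables, rather than mixing sources, keeps the framework's rank numbering identical to the launcher's.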