MPI (Message Passing Interface)

TL;DR

MPI is a standardised message-passing API (currently MPI-4.1, ratified 2023) implemented by Open MPI, MPICH, MVAPICH, and Intel MPI.
Used in AI clusters principally as a job launcher and process manager (`mpirun`, or SLURM's `srun` speaking PMI-2/PMIx), with NCCL handling the data-plane collectives.
Defines point-to-point operations (Send/Recv), collectives (AllReduce, Bcast, etc.), and one-sided RMA — many of which NCCL provides GPU-aware equivalents for.
Plays a central role in MPI+NCCL hybrid setups: MPI manages ranks, hostfiles, and rendezvous; NCCL does the GPU-fabric heavy lifting.

Overview

The Message Passing Interface is the standardised API that the HPC community has used to write distributed applications since 1994. It defines point-to-point send/receive, blocking and non-blocking collectives, one-sided remote memory access, and process management primitives. The current standard is MPI-4.1 (2023).

In AI infrastructure, MPI is rarely the data plane any more: NCCL handles GPU-to-GPU collectives with better topology awareness. MPI tooling nevertheless remains a dominant way to launch and manage processes. `mpirun` and SLURM's `srun` (which speaks the PMI-2/PMIx protocols that MPI implementations use to bootstrap their processes) are how a large share of multi-node training jobs are started.

Implementations

Implementation	Maintainer	Notes
Open MPI	Community	Most common in academic and neocloud environments
MPICH	Argonne National Lab	Reference implementation
MVAPICH	Ohio State	InfiniBand-optimised derivative of MPICH
Intel MPI	Intel	Commercial MPICH derivative, used in Intel HPC stacks
HPC-X / NVIDIA HPC SDK	NVIDIA	Open MPI + UCX + HCOLL bundle

Role in AI Training

A typical multi-node PyTorch job uses `mpirun` (or SLURM's `srun`) to start one process per GPU on each node. MPI handles rendezvous — every process learns the addresses of every other process — and then the application initialises a `torch.distributed` process group with the NCCL backend.
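As a rough illustration of the launch step, the sketch below assembles an Open MPI `mpirun` command line that starts one process per GPU. The helper name `build_mpirun_command`, the host names, and `train.py` are hypothetical; the flags used (`-np`, `--host` with `host:slots` pairs, and `-x` to forward an environment variable) are Open MPI options, and other MPI implementations spell them differently.

```python
import shlex


def build_mpirun_command(hosts, gpus_per_node, script, exported_env=()):
    """Assemble an Open MPI `mpirun` argv that starts one process per GPU.

    Hypothetical helper for illustration; it only builds the command and
    does not run it.
    """
    total_ranks = len(hosts) * gpus_per_node
    # Open MPI accepts a comma-separated list of host:slots pairs.
    host_spec = ",".join(f"{host}:{gpus_per_node}" for host in hosts)
    argv = ["mpirun", "-np", str(total_ranks), "--host", host_spec]
    for name in exported_env:
        # -x forwards a variable from the launching shell to every rank.
        argv += ["-x", name]
    argv += ["python", script]
    return argv


if __name__ == "__main__":
    cmd = build_mpirun_command(["node1", "node2"], 8, "train.py", ["NCCL_DEBUG"])
    print(shlex.join(cmd))
```

Under SLURM the same layout is usually requested through `srun` options rather than a host list, so this sketch applies only to the `mpirun` path.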

From that point, NCCL takes over the data plane: every gradient AllReduce, every parameter broadcast, every all-gather goes through NCCL, not MPI. MPI's role narrows to occasional control-plane operations like barriers between training phases or rank-0 broadcasts of metadata.
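The hand-off from launcher to NCCL can be sketched as follows, assuming the job was started with Open MPI's `mpirun` (which sets `OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, and `OMPI_COMM_WORLD_LOCAL_RANK` for each process) and that `MASTER_ADDR` and `MASTER_PORT` are exported to every rank for PyTorch's `env://` rendezvous. The helper names are hypothetical, and this is one possible wiring rather than the only one.

```python
import os


def nccl_init_kwargs(env):
    """Translate Open MPI's per-rank variables into init_process_group kwargs."""
    return {
        "backend": "nccl",
        # env:// makes PyTorch read MASTER_ADDR / MASTER_PORT from the environment.
        "init_method": "env://",
        "rank": int(env["OMPI_COMM_WORLD_RANK"]),
        "world_size": int(env["OMPI_COMM_WORLD_SIZE"]),
    }


def local_gpu_index(env):
    """Rank of this process within its node, used here as the GPU index."""
    return int(env["OMPI_COMM_WORLD_LOCAL_RANK"])


def main():
    # Imported lazily so the helpers above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(local_gpu_index(os.environ))
    dist.init_process_group(**nccl_init_kwargs(os.environ))

    # Data plane: a gradient-style AllReduce (sum) that runs over NCCL, not MPI.
    grad = torch.ones(1, device="cuda")
    dist.all_reduce(grad)

    dist.destroy_process_group()


# Only run the distributed part when started under Open MPI's mpirun.
if "OMPI_COMM_WORLD_RANK" in os.environ:
    main()
```

After the AllReduce, `grad` on every rank holds the world size, which makes a quick sanity check that all ranks joined the same group.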

Operational Notes

PMIx vs PMI-2: modern launchers prefer PMIx for richer rendezvous and async event support; older clusters may still use PMI-2.
MPI and NCCL must agree on rank numbering: pass `OMPI_COMM_WORLD_RANK` (or the launcher's equivalent, such as `SLURM_PROCID` under `srun`) through to the framework.
UCX is the default transport layer for Open MPI on modern InfiniBand/RoCE clusters; verify with `ompi_info | grep ucx`.
HCOLL (NVIDIA's hierarchical collective library) can replace MPI's own collectives for CPU-side reductions; rarely used in AI but valuable in hybrid CPU+GPU jobs.
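For the rank-numbering note above, a minimal sketch of passing launcher ranks through to the framework might look like the following. It maps the launcher's variables onto the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` names that `torch.distributed` reads with `env://` initialisation. The function name `framework_env` is hypothetical. The Open MPI and SLURM variable names are the documented ones; the MPICH/Hydra names (`PMI_RANK`, `PMI_SIZE`, `MPI_LOCALRANKID`) are an assumption to verify against the installed version.

```python
import os

# (launcher, rank variable, world-size variable, local-rank variable).
# Open MPI is checked first: under mpirun inside a SLURM allocation both sets
# of variables can exist, and only the Open MPI ones describe the MPI ranks.
LAUNCHER_VARS = [
    ("Open MPI", "OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE", "OMPI_COMM_WORLD_LOCAL_RANK"),
    ("SLURM srun", "SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID"),
    # Assumed names for MPICH's Hydra launcher; verify before relying on them.
    ("MPICH/Hydra", "PMI_RANK", "PMI_SIZE", "MPI_LOCALRANKID"),
]


def framework_env(env):
    """Return RANK / WORLD_SIZE / LOCAL_RANK derived from the launcher's variables."""
    for launcher, rank_var, size_var, local_var in LAUNCHER_VARS:
        if all(name in env for name in (rank_var, size_var, local_var)):
            rank = int(env[rank_var])
            size = int(env[size_var])
            local = int(env[local_var])
            if not 0 <= rank < size:
                raise ValueError(f"{launcher}: rank {rank} outside world size {size}")
            return {"RANK": str(rank), "WORLD_SIZE": str(size), "LOCAL_RANK": str(local)}
    raise RuntimeError("no supported launcher variables found in the environment")


if __name__ == "__main__":
    try:
        os.environ.update(framework_env(os.environ))
    except RuntimeError as err:
        print(f"not running under a known launcher: {err}")
```

Calling this once at the top of the training script, before the process group is created, keeps the framework's rank numbering identical to the launcher's.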

References

MPI Forum — Standards · MPI Forum
Open MPI Documentation · Open MPI
MPICH Documentation · Argonne National Lab
