TL;DR
- UCX is an open-source communication framework that abstracts over InfiniBand, RoCE, TCP, shared memory, CUDA IPC, and other transports.
- Sits below MPI implementations (Open MPI, MPICH) and OpenSHMEM, occupies the same layer as libfabric, and can carry NCCL network traffic through an optional UCX plugin; provides RMA, atomics, tag matching, and active messages.
- Maintained by the Unified Communication Framework (UCF) consortium, whose members include NVIDIA (which absorbed Mellanox), IBM, Argonne National Laboratory, and Oak Ridge National Laboratory.
- Tuning UCX is part of the standard hygiene for high-performance MPI jobs on InfiniBand and RoCE clusters.
Overview
UCX provides a single API that higher-level libraries — MPI implementations, OpenSHMEM, frameworks doing point-to-point communication — can call without caring whether the underlying transport is InfiniBand RDMA, RoCEv2, TCP, shared memory, or CUDA IPC. It internally chooses the best transport per pair of endpoints.
The framework sits between communication libraries such as MPI and the underlying driver and hardware. Where NCCL specifically targets GPU-to-GPU collectives, UCX is more general: it offers point-to-point and one-sided primitives and can switch transports per endpoint. It is a common default below MPI stacks in HPC and AI deployments; Open MPI 4.x and later, for example, prefers UCX on InfiniBand and RoCE devices when built with UCX support.
Architecture
UCX is organised in three layers. UCS (Unified Communication Services) provides logging, configuration, and OS primitives. UCT (Unified Communication Transports) wraps each underlying transport: `rc_mlx5` for reliable connection and `dc_mlx5` for dynamic connection on Mellanox/NVIDIA HCAs (both run over InfiniBand and RoCE), `tcp`, the shared-memory transports (selected with the `sm` or `shm` alias), `cuda_ipc`, `cuda_copy`, and others. `rdmacm` is not a RoCE data transport; it is a connection manager used to establish connections over RDMA-capable devices. UCP (Unified Communication Protocols) is the high-level API that applications call; it composes UCT transports automatically.
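To make the UCT layer more concrete, the sketch below groups the transport names mentioned above by the kind of path they serve and reports which paths a comma-separated transport list (the format the `UCX_TLS` variable takes) leaves without a dedicated transport. The `TRANSPORT_PATHS` table, `PREFERRED_PATHS`, and the `uncovered_paths` helper are illustrative assumptions, not part of UCX, and the grouping is a simplification: an InfiniBand transport can also carry intra-node traffic, just less efficiently than shared memory. The authoritative list for a given build comes from `ucx_info -d`.

```python
# Illustrative grouping of UCT transport names by the path they serve.
# This table is a simplification for the example, not UCX's own data.
TRANSPORT_PATHS = {
    "rc_mlx5": "inter-node",     # reliable connection over InfiniBand/RoCE
    "dc_mlx5": "inter-node",     # dynamic connection over InfiniBand/RoCE
    "tcp": "inter-node",         # sockets
    "sm": "intra-node",          # alias for the shared-memory transports
    "shm": "intra-node",         # same alias, alternative spelling
    "self": "loopback",          # a process communicating with itself
    "cuda_copy": "gpu-staging",  # copies between host and device memory
    "cuda_ipc": "gpu-peer",      # device-to-device within one node
}

# Paths this example expects a CPU-side job to have a dedicated transport for.
PREFERRED_PATHS = ("inter-node", "intra-node", "loopback")


def uncovered_paths(tls: str) -> list[str]:
    """Return the preferred path kinds that no listed transport serves."""
    names = [n.strip() for n in tls.split(",") if n.strip()]
    unknown = [n for n in names if n not in TRANSPORT_PATHS]
    if unknown:
        raise ValueError(f"not in the example table: {unknown}")
    covered = {TRANSPORT_PATHS[n] for n in names}
    return [p for p in PREFERRED_PATHS if p not in covered]


print(uncovered_paths("rc_mlx5,sm,self,cuda_copy,cuda_ipc"))  # []
print(uncovered_paths("rc_mlx5,cuda_copy"))  # ['intra-node', 'loopback']
```

A check like this is only a pre-flight sanity test on a configuration string; UCP still makes the actual per-endpoint choice at run time.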
Operational Notes
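- Use `UCX_NET_DEVICES` to pin specific HCA ports (e.g. `mlx5_0:1`) per job; if the default selection lands on an unintended device, for example a slower or NUMA-distant one, throughput drops without any error being reported.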
- `UCX_TLS` chooses the transport list — common production values are `rc_mlx5,sm,self,cuda_copy,cuda_ipc` for InfiniBand + intra-node CUDA paths.
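- Logging via `UCX_LOG_LEVEL=info` surfaces transport selection, which is invaluable when debugging slow startup or an unexpected fallback; registration cache activity typically needs a more verbose level such as `debug` or `trace`.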
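- Memory registration is expensive; UCX amortises it with a registration cache (controlled by `UCX_RCACHE_ENABLE` in recent releases), which benefits long-running training jobs that reuse the same buffers. `UCX_MEMTYPE_CACHE` is a separate setting that caches memory-type detection (host versus CUDA memory), not registrations.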
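The settings above are usually applied together at launch time. The following is a minimal sketch, assuming an Open MPI launcher (which forwards variables to remote ranks with `-x NAME=value`) and a hypothetical application binary `./my_app`; the helper names `ucx_env` and `open_mpi_command` are made up for this example. Other launchers use different forwarding flags, so treat the rendered command as a template rather than a recipe.

```python
import shlex


def ucx_env(net_device: str, transports: list[str], log_level: str = "info") -> dict[str, str]:
    """Collect the UCX_* settings from the notes above into one dict."""
    return {
        "UCX_NET_DEVICES": net_device,
        "UCX_TLS": ",".join(transports),
        "UCX_LOG_LEVEL": log_level,
    }


def open_mpi_command(env: dict[str, str], app: list[str], nprocs: int) -> str:
    """Render an Open MPI launch line that forwards each variable with -x."""
    argv = ["mpirun", "-np", str(nprocs)]
    for name, value in env.items():
        argv += ["-x", f"{name}={value}"]
    argv += app
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(argv)


# Hypothetical job: one HCA port, InfiniBand plus intra-node CUDA paths.
env = ucx_env("mlx5_0:1", ["rc_mlx5", "sm", "self", "cuda_copy", "cuda_ipc"])
print(open_mpi_command(env, ["./my_app"], nprocs=2))
```

Keeping the settings in one dict makes it easy to log exactly what a job ran with, which helps when comparing throughput between runs.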
References
- UCX Documentation · UCF Consortium
- UCX GitHub Repository · UCF Consortium
- UCX: An Open Source Framework for HPC Network APIs · Hot Interconnects 2015