TL;DR
- Open-source distributed training framework from HPC-AI Tech (Bian et al., 2021); aims to be a vendor-neutral alternative to Megatron-LM + DeepSpeed.
- Provides DP, 1D / 2D / 2.5D / 3D tensor parallelism, pipeline parallelism, ZeRO-style sharding, and CPU/NVMe offload behind one config surface.
- Notably accessible — its examples reproduce Stable Diffusion, ChatGPT-style RLHF, and Llama fine-tuning with small infrastructure footprints.
Overview
Colossal-AI started at the National University of Singapore's HPC-AI lab (Prof. Yang You's group), spun out as HPC-AI Tech, and grew into an open-source training framework that competes with Megatron-LM and DeepSpeed for the same workloads. Its differentiator is breadth: rather than picking one parallelism dimension, it implements many and lets the user mix and match.
It is particularly notable for its less common tensor-parallelism variants: 2D, 2.5D, and 3D tensor parallelism (after Optimus, Tesseract, and similar papers), which partition weights and activations across a device grid and so make different communication trade-offs than Megatron's 1D TP. In practice most users still pick 1D TP plus PP plus ZeRO, the same recipe as everyone else.
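The device-count constraints behind these variants can be illustrated with a short sketch. It assumes the grid shapes described in the original papers: a q × q mesh for 2D, `depth` stacked q × q meshes for 2.5D, and a q × q × q cube for 3D. The function name `valid_tp_size` and its `depth` parameter are hypothetical and are not part of Colossal-AI's API.

```python
import math


def valid_tp_size(mode: str, size: int, depth: int = 1) -> bool:
    """Return True if `size` devices can form the grid required by `mode`.

    Hypothetical helper for illustration; not a Colossal-AI function.
    """
    if size < 1 or depth < 1:
        return False
    if mode == "1d":
        return True  # any device count can split a single dimension
    if mode == "2d":
        q = math.isqrt(size)
        return q * q == size  # q x q mesh
    if mode == "2.5d":
        if size % depth != 0:
            return False
        q = math.isqrt(size // depth)
        return q * q * depth == size  # `depth` layers of q x q meshes
    if mode == "3d":
        q = round(size ** (1 / 3))
        return q ** 3 == size  # q x q x q cube
    raise ValueError(f"unknown tensor-parallel mode: {mode}")
```

Under these assumptions, 8 GPUs can form a 2.5D layout (depth 2 over 2 × 2 meshes) or a 3D layout (2 × 2 × 2) but not a 2D one, which is one practical reason 1D TP remains the default choice.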
What Colossal-AI Provides
- Parallelism: 1D / 2D / 2.5D / 3D tensor parallelism, pipeline parallelism, sequence parallelism, ZeRO sharding.
- Heterogeneous memory: Gemini auto-offload between GPU, CPU, and NVMe — similar to ZeRO-Infinity.
- PEFT recipes: LoRA and prefix tuning for LLM fine-tuning.
- Application examples: Open-Sora (video diffusion), ColossalChat (RLHF reference), Llama / GPT-2 / OPT pretraining.
- Inference adapter: Colossal-Inference for serving models trained in the framework.
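To make the ZeRO sharding entry in the list above more concrete, the following back-of-the-envelope sketch estimates per-GPU memory for model states. It assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter) and ignores activations, buffers, and fragmentation. The function name is hypothetical, and real memory use in Colossal-AI will differ.

```python
def zero_model_state_gib(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory in GiB for a ZeRO stage.

    Hypothetical helper; stage 0 means plain data parallelism.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    param_bytes, grad_bytes, optim_bytes = 2.0, 2.0, 12.0
    if stage >= 1:
        optim_bytes /= num_gpus  # stage 1 shards optimizer states
    if stage >= 2:
        grad_bytes /= num_gpus   # stage 2 also shards gradients
    if stage >= 3:
        param_bytes /= num_gpus  # stage 3 also shards parameters

    return num_params * (param_bytes + grad_bytes + optim_bytes) / 2**30


if __name__ == "__main__":
    for stage in range(4):
        gib = zero_model_state_gib(7e9, num_gpus=8, stage=stage)
        print(f"stage {stage}: {gib:.1f} GiB per GPU")
```

The estimate covers model states only, so it is a lower bound on what each GPU actually needs.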
Mechanism
In the original design, Colossal-AI's central abstraction is the ParallelContext: a single object that declares how DP, TP, PP, and (optionally) sequence parallelism map onto the available devices. Models are then wrapped with the framework's distributed engine, which inserts the appropriate communication primitives at module boundaries. More recent releases expose the same choices through the Booster API and its plugins.
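As a rough illustration of what such a device mapping involves, the sketch below decomposes a flat rank into data-, pipeline-, and tensor-parallel coordinates. It is plain Python, not Colossal-AI code: the `ParallelLayout` class is hypothetical, and the rank ordering (TP varying fastest, then PP, then DP) is an assumption chosen so that tensor-parallel peers sit on adjacent ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical stand-in for a parallel-layout declaration."""

    dp: int  # data-parallel degree
    pp: int  # pipeline-parallel degree
    tp: int  # tensor-parallel degree

    @property
    def world_size(self) -> int:
        return self.dp * self.pp * self.tp

    def coords(self, rank: int) -> tuple[int, int, int]:
        """Map a flat rank to (dp_rank, pp_rank, tp_rank)."""
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} outside world of {self.world_size}")
        tp_rank = rank % self.tp
        pp_rank = (rank // self.tp) % self.pp
        dp_rank = rank // (self.tp * self.pp)
        return dp_rank, pp_rank, tp_rank

    def tp_group(self, rank: int) -> list[int]:
        """Ranks that share this rank's DP and PP coordinates."""
        start = rank - rank % self.tp
        return list(range(start, start + self.tp))
```

With `ParallelLayout(dp=2, pp=2, tp=2)` on 8 GPUs, rank 5 lands at coordinates (1, 0, 1) and shares its tensor-parallel group with rank 4. A real framework would build one communication group per such set of ranks.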
Gemini, the heterogeneous-memory manager, treats GPU memory, CPU RAM, and NVMe as a tiered cache. Tensors are paged between tiers based on access frequency, with prefetching for known forward/backward schedules.
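The tiering idea can be sketched as a toy placement policy. This is not Gemini's implementation: the class, its per-tier capacities, and the rule of evicting the least frequently accessed tensor are assumptions for illustration, and the sketch tracks only placement metadata rather than moving real tensors or prefetching.

```python
from collections import Counter

TIERS = ("gpu", "cpu", "nvme")  # ordered fastest to slowest


class TieredPlacement:
    """Toy frequency-based placement across memory tiers (hypothetical)."""

    def __init__(self, capacity):
        # capacity: max tensors per tier, e.g. {"gpu": 2, "cpu": 2}.
        # The last tier ("nvme") is treated as unbounded.
        self.capacity = capacity
        self.tier_of = {}      # tensor name -> current tier
        self.hits = Counter()  # tensor name -> access count

    def _place(self, name, level):
        tier = TIERS[level]
        # Tensors already in this tier, excluding the one being placed.
        residents = [n for n, t in self.tier_of.items() if t == tier and n != name]
        self.tier_of[name] = tier
        is_last = level + 1 == len(TIERS)
        if not is_last and len(residents) >= self.capacity[tier]:
            # Tier is over capacity: demote the coldest resident one level.
            victim = min(residents, key=lambda n: self.hits[n])
            self._place(victim, level + 1)

    def access(self, name):
        """Record an access and make sure the tensor is in the GPU tier."""
        self.hits[name] += 1
        if self.tier_of.get(name) != "gpu":
            self._place(name, 0)
        return self.tier_of[name]
```

A production manager would additionally account for tensor sizes and use the known forward/backward schedule to prefetch, which this sketch leaves out.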
When to Use
Use Colossal-AI when you want one framework that covers pretrain, fine-tune, RLHF, and inference for a small team's worth of GPUs — its examples are well-tuned for 8-64 GPU setups and reproduce well-known recipes (Stable Diffusion training, Llama LoRA, ChatGPT-style RLHF) end to end. Its 2D/3D TP variants are research-interesting but rarely the best choice in production; Megatron-LM and NeMo dominate above ~256 GPUs.
Pitfalls
- Community size is smaller than Megatron-LM / DeepSpeed / FSDP — debugging exotic configurations may require reading source.
- Some advanced features (Gemini, 2.5D TP) have throughput trade-offs that are not always documented.
- Checkpoint formats differ from Megatron and HuggingFace; conversion utilities exist but add steps.
Software
- github.com/hpcaitech/ColossalAI — main repository.
- Open-Sora — open-source text-to-video model trained on Colossal-AI.
- ColossalChat — RLHF reference implementation.
- HPC-AI Tech maintains commercial offerings on top of the framework.
References
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training · arXiv (Bian et al., 2021)
- Colossal-AI on GitHub · GitHub (HPC-AI Tech)
- Open-Sora project · GitHub