TL;DR
- At frontier scale, a poorly designed data loader leaves GPUs idle for 20-50 % of each step. A good one keeps them fed at >95 % utilisation.
- Three production patterns: indexed binary formats (Megatron MMapIndexedDataset), streaming over object storage (WebDataset, Mosaic StreamingDataset), and on-the-fly tokenisation (NeMo Curator pipelines).
- Beyond throughput: deterministic shuffling, exact resumption, dataset blending, and per-rank sharding without overlap.
Overview#
Frontier model training reads tens to hundreds of terabytes of tokenised data over a multi-week run. Per-GPU throughput targets are in the hundreds of MB/s; cluster-wide throughput targets are tens of GB/s. If the loader cannot keep up, GPUs sit idle on `cuda.synchronize` waiting for the next batch — wasted hardware.
The discipline of distributed data loading covers three problems: storage format and access pattern, sharding across ranks, and bookkeeping (shuffling, resumption, blending).
Storage Formats#
Pretraining data is almost always pre-tokenised and stored as binary blobs with a separate index. Megatron-LM's MMapIndexedDataset is the reference format: documents are concatenated into one .bin file, document offsets into a .idx file, and the dataset is `mmap`ed for random-access reads. The OS page cache absorbs most of the I/O cost.
Alternatives include WebDataset (tar shards, sequential read, ideal for object storage), MosaicML StreamingDataset (purpose-built for cloud-streamed pretraining), and Hugging Face Datasets (Arrow-backed, friendlier for fine-tuning workloads).
- Megatron MMapIndexedDataset — random access, low overhead, the LLM pretraining default.
- WebDataset — sequential tar streams, plays nicely with shuffle buffers and object storage.
- Mosaic StreamingDataset — designed for streaming over cloud object stores with deterministic resumption.
- Hugging Face Datasets (Arrow) — ergonomic for fine-tuning, less optimised for trillion-token pretrains.
Dataset Blending and Resumption#
Pretraining datasets are mixtures: 60 % web text, 15 % code, 10 % scientific, 10 % conversational, 5 % math (typical proportions). The loader samples from each constituent in proportion at each step. Megatron and NeMo express this as weighted sampling with per-source independent indexing.
Resumption — restarting training from a checkpoint and reading the same sample sequence — requires the loader to expose state. Modern loaders save (epoch, sample_index, shuffle_seed) at every checkpoint and resume from exactly that point, so a recovered run sees the same data it would have seen without the interruption.
If your loader cannot resume exactly, your training is not reproducible. Resumed runs will see a different data ordering, which interacts badly with learning-rate warmup, curriculum schedules, and any analysis that assumes consistent sample ordering. Test resumption before you need it.
Performance Characteristics#
- Target per-GPU throughput: 100-500 MB/s for pretraining, 10-50 MB/s for fine-tuning.
- Cluster-wide aggregate: tens to hundreds of GB/s — dictates choice of parallel filesystem.
- OS page cache hit rate >90 % is normal for repeated-epoch training; cold reads from object store are 10-100× slower.
- CPU cost: pre-tokenised binary formats need ~0.1 CPU core per GPU; on-the-fly tokenisation 1-4 cores per GPU.
Pitfalls#
- Object-store-backed datasets without prefetching show step-time jitter — pre-stage to NVMe.
- On-the-fly tokenisation is convenient but can become the bottleneck; pre-tokenise once for serious pretrains.
- Loader workers (`num_workers` in PyTorch DataLoader) too high spawns thrashing; too low leaves the GPU waiting. Tune.
- Pinned memory + non-blocking H2D copies matter at scale — frameworks expose them but the user must enable them.
- Curriculum and dataset blending must be deterministic across resumes, or the curriculum schedule silently shifts.
Software#
- Megatron-LM MMapIndexedDataset — reference format for LLM pretrain.
- WebDataset (github.com/webdataset/webdataset).
- MosaicML StreamingDataset (github.com/mosaicml/streaming).
- Hugging Face Datasets (huggingface.co/docs/datasets) — Arrow-backed, IterableDataset for streaming.
- NeMo Curator — distributed dataset preparation (dedup, filter, tokenise) at trillion-token scale.
- PyTorch DataLoader with `num_workers`, `pin_memory`, `prefetch_factor`.
References
- WebDataset on GitHub · GitHub
- MosaicML StreamingDataset · GitHub (Databricks / MosaicML)
- NeMo Curator documentation · NVIDIA
- Megatron-LM data preparation · GitHub (NVIDIA)