Distributed Data Loaders

TL;DR

At frontier scale, a poorly designed data loader leaves GPUs idle for 20-50 % of each step. A good one keeps them fed at >95 % utilisation.
Three production patterns: indexed binary formats (Megatron MMapIndexedDataset), streaming over object storage (WebDataset, Mosaic StreamingDataset), and on-the-fly tokenisation (NeMo Curator pipelines).
Beyond throughput: deterministic shuffling, exact resumption, dataset blending, and per-rank sharding without overlap.

Overview

Frontier model training reads tens to hundreds of terabytes of tokenised data over a multi-week run. Per-GPU throughput targets are in the hundreds of MB/s; cluster-wide targets are tens of GB/s. If the loader cannot keep up, the training loop blocks waiting for the next batch and the GPUs sit idle, which is wasted hardware.

The discipline of distributed data loading covers three problems: storage format and access pattern, sharding across ranks, and bookkeeping (shuffling, resumption, blending).

Storage Formats

Pretraining data is almost always pre-tokenised and stored as binary blobs with a separate index. Megatron-LM's MMapIndexedDataset is the reference format: documents are concatenated into one .bin file, document offsets into a .idx file, and the dataset is `mmap`ed for random-access reads. The OS page cache absorbs most of the I/O cost.
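The .bin/.idx idea can be sketched with the Python standard library alone. This is a minimal illustration, not Megatron-LM's actual on-disk layout: it assumes a hypothetical format in which the .bin file holds little-endian uint32 token IDs back to back and the .idx file holds uint64 document boundaries counted in tokens. The names `write_indexed` and `IndexedReader` are invented for the example.

```python
import mmap
import struct


def write_indexed(docs, prefix):
    """Write token lists to prefix.bin (uint32 tokens) and prefix.idx (uint64 offsets)."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as bin_file:
        for doc in docs:
            bin_file.write(struct.pack(f"<{len(doc)}I", *doc))
            offsets.append(offsets[-1] + len(doc))
    with open(prefix + ".idx", "wb") as idx_file:
        idx_file.write(struct.pack(f"<{len(offsets)}Q", *offsets))


class IndexedReader:
    """Random-access reader: the index is held in memory, tokens are read through mmap."""

    def __init__(self, prefix):
        with open(prefix + ".idx", "rb") as idx_file:
            raw = idx_file.read()
        self.offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
        self._file = open(prefix + ".bin", "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        start, end = self.offsets[i], self.offsets[i + 1]
        # Slicing the mmap only touches the pages that back this document.
        chunk = self._mm[start * 4 : end * 4]
        return list(struct.unpack(f"<{end - start}I", chunk))

    def close(self):
        self._mm.close()
        self._file.close()
```

After `write_indexed([[1, 2, 3], [4, 5], [6]], prefix)`, an `IndexedReader(prefix)` can fetch any document by number without reading the rest of the file. Repeated reads are then served from the OS page cache.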

Alternatives include WebDataset (tar shards, sequential read, ideal for object storage), MosaicML StreamingDataset (purpose-built for cloud-streamed pretraining), and Hugging Face Datasets (Arrow-backed, friendlier for fine-tuning workloads).

Megatron MMapIndexedDataset — random access, low overhead, the LLM pretraining default.
WebDataset — sequential tar streams, plays nicely with shuffle buffers and object storage.
Mosaic StreamingDataset — designed for streaming over cloud object stores with deterministic resumption.
Hugging Face Datasets (Arrow) — ergonomic for fine-tuning, less optimised for trillion-token pretrains.

Sharding

Every data-parallel (DP) rank must see disjoint data within an epoch; overlap means duplicate gradients and biased training. The standard approach is to shard the dataset by index: rank r out of N reads samples r, r+N, r+2N, …. With sharded data parallelism (FSDP, ZeRO), the DP group is still the relevant shard dimension; tensor-parallel (TP) and pipeline-parallel (PP) ranks within one model replica share the same data.
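A minimal sketch of strided sharding follows. The function names are illustrative. `dp_rank_of` additionally assumes a rank layout in which TP varies fastest, then DP, then PP; real frameworks expose the DP rank directly and may order ranks differently.

```python
def shard_indices(num_samples, dp_rank, dp_world_size, drop_last=True):
    """Return the sample indices for one data-parallel rank: r, r+N, r+2N, ..."""
    if not 0 <= dp_rank < dp_world_size:
        raise ValueError("dp_rank must be in [0, dp_world_size)")
    # Dropping the ragged tail gives every rank the same number of samples.
    usable = num_samples - (num_samples % dp_world_size) if drop_last else num_samples
    return range(dp_rank, usable, dp_world_size)


def dp_rank_of(global_rank, tp_size, dp_size):
    """Map a global rank to its DP rank (assumed layout: TP fastest, then DP, then PP)."""
    return (global_rank // tp_size) % dp_size
```

Under that assumed layout, with `tp_size=2` and `dp_size=2`, global ranks 0 and 1 both map to DP rank 0 and therefore read the same shard, while ranks 2 and 3 read the other one.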

Shuffling at frontier scale is non-trivial. A naive global shuffle has to materialise a permutation over every sample index in memory, which at trillion-token scale can run to billions of entries. Practical implementations use bucketed shuffles (random order within each bucket, deterministic bucket order) or two-pass shuffles seeded by the dataset version.
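One possible shape for a bucketed shuffle is sketched below, under stated assumptions: the bucket order is a seeded permutation (deterministic for a given seed and epoch), and each bucket is shuffled with its own RNG. The function name and the seeding scheme are hypothetical, not taken from any particular library.

```python
import random


def bucketed_shuffle(num_samples, bucket_size, seed, epoch=0):
    """Yield every index in [0, num_samples) exactly once, shuffled bucket by bucket."""
    num_buckets = (num_samples + bucket_size - 1) // bucket_size
    # The bucket order is reproducible and only num_buckets entries long.
    bucket_order = list(range(num_buckets))
    random.Random(f"{seed}-{epoch}-order").shuffle(bucket_order)
    for bucket in bucket_order:
        start = bucket * bucket_size
        indices = list(range(start, min(start + bucket_size, num_samples)))
        # A per-bucket RNG lets any bucket be regenerated without replaying the others.
        random.Random(f"{seed}-{epoch}-{bucket}").shuffle(indices)
        yield from indices
```

Peak memory is one bucket plus the list of bucket IDs, rather than a full permutation. The trade-off is that samples only mix within a bucket, so the bucket size bounds how far a sample can move from its neighbours.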

Dataset Blending and Resumption

Pretraining datasets are mixtures, for example 60 % web text, 15 % code, 10 % scientific, 10 % conversational, and 5 % math. The loader samples from each constituent in proportion to its weight at each step. Megatron and NeMo express this as weighted sampling, with each source keeping its own independent index.
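One deterministic way to realise such a blend is to pick, at every step, the source that is furthest behind its target share. The sketch below illustrates that idea only; `blend_schedule` is a hypothetical name and the code is not copied from Megatron or NeMo.

```python
def blend_schedule(weights, num_samples):
    """Return (source_id, index_within_source) pairs that track the target weights."""
    total = sum(weights)
    norm = [w / total for w in weights]
    counts = [0] * len(weights)
    schedule = []
    for step in range(1, num_samples + 1):
        # Deficit = samples the source should have contributed by now minus actual count.
        deficits = [norm[i] * step - counts[i] for i in range(len(weights))]
        source = max(range(len(weights)), key=lambda i: deficits[i])
        # Each source advances its own counter, independent of the others.
        schedule.append((source, counts[source]))
        counts[source] += 1
    return schedule
```

Because the schedule is a pure function of the weights and the step count, it is identical on every rank and after every restart. The per-source index can then be mapped through that source's own shuffle order.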

Resumption — restarting training from a checkpoint and reading the same sample sequence — requires the loader to expose state. Modern loaders save (epoch, sample_index, shuffle_seed) at every checkpoint and resume from exactly that point, so a recovered run sees the same data it would have seen without the interruption.

If your loader cannot resume exactly, your training is not reproducible. Resumed runs will see a different data ordering, which interacts badly with learning-rate warmup, curriculum schedules, and any analysis that assumes consistent sample ordering. Test resumption before you need it.
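The (epoch, sample_index, shuffle_seed) state described above can be sketched as a small sampler with `state_dict` and `load_state_dict` methods. The class and its seeding scheme are hypothetical, and it uses a full in-memory permutation for brevity.

```python
import random


class ResumableSampler:
    """Endless iterator over seeded per-epoch permutations with checkpointable state."""

    def __init__(self, num_samples, shuffle_seed):
        self.num_samples = num_samples
        self.shuffle_seed = shuffle_seed
        self.epoch = 0
        self.sample_index = 0  # samples already consumed in the current epoch
        self._cached_epoch = None
        self._order = []

    def __iter__(self):
        return self

    def __next__(self):
        if self.sample_index >= self.num_samples:
            self.epoch, self.sample_index = self.epoch + 1, 0
        if self._cached_epoch != self.epoch:
            # The order depends only on (seed, epoch), so it can be rebuilt after a restart.
            self._order = list(range(self.num_samples))
            random.Random(f"{self.shuffle_seed}-{self.epoch}").shuffle(self._order)
            self._cached_epoch = self.epoch
        sample = self._order[self.sample_index]
        self.sample_index += 1
        return sample

    def state_dict(self):
        return {"epoch": self.epoch, "sample_index": self.sample_index,
                "shuffle_seed": self.shuffle_seed}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.sample_index = state["sample_index"]
        self.shuffle_seed = state["shuffle_seed"]
        self._cached_epoch = None  # force the permutation to be rebuilt
```

A resumption test then amounts to saving the state mid-run, loading it into a fresh sampler, and checking that both produce the same remaining sequence. With prefetching workers, the saved index has to count samples the training step has consumed, not samples the workers have already fetched.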

Performance Characteristics

Target per-GPU throughput: 100-500 MB/s for pretraining, 10-50 MB/s for fine-tuning.
Cluster-wide aggregate: tens to hundreds of GB/s — dictates choice of parallel filesystem.
OS page cache hit rate >90 % is normal for repeated-epoch training; cold reads from object store are 10-100× slower.
CPU cost: pre-tokenised binary formats need ~0.1 CPU core per GPU; on-the-fly tokenisation 1-4 cores per GPU.

Pitfalls

Object-store-backed datasets without prefetching show step-time jitter — pre-stage to NVMe.
On-the-fly tokenisation is convenient but can become the bottleneck; pre-tokenise once for serious pretrains.
Loader workers (`num_workers` in PyTorch DataLoader): too many cause CPU and memory thrashing; too few leave the GPU waiting. Tune per node.
Pinned memory + non-blocking H2D copies matter at scale — frameworks expose them but the user must enable them.
Curriculum and dataset blending must be deterministic across resumes, or the curriculum schedule silently shifts.
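The worker and pinned-memory pitfalls above can be illustrated with a short PyTorch sketch. The helper names are hypothetical, and the prefetch depth of 2 is a placeholder starting value to be tuned, not a recommendation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, batch_size, num_workers):
    """DataLoader with pinned host memory and background prefetching."""
    kwargs = {
        "batch_size": batch_size,
        "num_workers": num_workers,
        # Pinned (page-locked) buffers are what allow non_blocking copies to be asynchronous.
        "pin_memory": torch.cuda.is_available(),
    }
    if num_workers > 0:
        # These options are only valid when worker processes exist.
        kwargs["prefetch_factor"] = 2
        kwargs["persistent_workers"] = True
    return DataLoader(dataset, **kwargs)


def to_device(batch, device):
    """Move a sequence of tensors to the device without blocking the host thread."""
    return tuple(t.to(device, non_blocking=True) for t in batch)


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    data = TensorDataset(torch.arange(8).reshape(8, 1), torch.arange(8))
    for batch in build_loader(data, batch_size=4, num_workers=2):
        inputs, labels = to_device(batch, device)
```

`non_blocking=True` only overlaps the copy with compute when the source tensor is in pinned memory. On a CPU-only machine both settings are harmless no-ops.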

Software

Megatron-LM MMapIndexedDataset: reference format for LLM pretraining.
WebDataset (github.com/webdataset/webdataset).
MosaicML StreamingDataset (github.com/mosaicml/streaming).
Hugging Face Datasets (huggingface.co/docs/datasets) — Arrow-backed, IterableDataset for streaming.
NeMo Curator — distributed dataset preparation (dedup, filter, tokenise) at trillion-token scale.
PyTorch DataLoader with `num_workers`, `pin_memory`, `prefetch_factor`.

