TL;DR
- Dataset curation — not algorithm choice — is the single biggest determinant of fine-tune quality at the open-model scale.
- A standard pipeline includes language filtering, deduplication (exact, near, and semantic), benchmark decontamination, quality scoring, format normalisation, and balancing across task types.
- Modern open recipes (Tulu 3, Llama 3, OLMo 2, Hermes 3) publish their curation pipelines in detail; the playbook is no longer proprietary.
- The most common failure mode is silent benchmark contamination — training examples that overlap with eval sets, inflating reported scores and masking real-world degradation.
Why Curation Dominates
Public fine-tuning ablations since 2023 have repeatedly reinforced the same finding: clean, diverse, well-balanced data beats a larger volume of lower-quality data. LIMA showed that 1,000 carefully chosen examples could produce a competitive model. The Phi series showed that carefully curated synthetic data could rival much larger training corpora. The Llama 3 report credits much of its post-training improvement to data pipeline work rather than algorithmic changes.
The reason is simple: the model fits whatever you give it. To the model, mislabelled examples, duplicates, and contaminated samples are not noise; they are signal, and it dutifully memorises them. That makes curating the dataset the most important thing the practitioner does.
A Canonical Pipeline
A modern fine-tuning curation pipeline applies the following stages in roughly this order. Skip any of them at your peril.
- Language filtering: drop examples not in the target language(s); a language-ID model such as fastText's does this cheaply.
- Format validation: enforce the schema (e.g. valid JSON, role tags present, non-empty response).
- Length filters: drop very short responses (often refusals or errors) and very long ones that destabilise packing.
- Exact deduplication: hash-based removal of byte-identical examples.
- Near-duplicate filtering: MinHash or SimHash to remove examples that differ only by small surface edits.
- Semantic deduplication: embedding-based clustering catches paraphrased or restated content that hashing misses.
- Benchmark decontamination: n-gram or embedding search against every eval set you plan to report on.
- Quality scoring: rate examples with a judge model and drop the bottom tail.
- Topic / task balancing: re-weight or sub-sample to match a target distribution.
- Final spot check: human review of a random sample to catch silent failures.
Benchmark decontamination must happen against the exact eval sets you will report. Decontaminating against MMLU but not GSM8K, then reporting GSM8K, is a common and embarrassing mistake.
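The cheapest stages (format validation, length filtering, and exact deduplication) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the `prompt`/`response` schema, the function names, and the two length thresholds are all assumptions made for the example.

```python
import hashlib
import json

# Assumed thresholds for illustration; tune them on your own data.
MIN_RESPONSE_CHARS = 20
MAX_RESPONSE_CHARS = 8000


def is_valid(example: dict) -> bool:
    """Format validation: prompt and response must be non-empty strings."""
    prompt = example.get("prompt")
    response = example.get("response")
    return (
        isinstance(prompt, str)
        and isinstance(response, str)
        and bool(prompt.strip())
        and bool(response.strip())
    )


def length_ok(example: dict) -> bool:
    """Length filter on the response, measured in characters."""
    return MIN_RESPONSE_CHARS <= len(example["response"]) <= MAX_RESPONSE_CHARS


def fingerprint(example: dict) -> str:
    """SHA-256 over a canonical serialisation, used for exact deduplication."""
    canonical = json.dumps(
        {"prompt": example["prompt"], "response": example["response"]},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cheap_filters(examples):
    """Apply validation, length filtering, and exact dedup, keeping first occurrences."""
    seen = set()
    kept = []
    for ex in examples:
        if not is_valid(ex) or not length_ok(ex):
            continue
        fp = fingerprint(ex)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(ex)
    return kept
```

Running the cheap filters first shrinks the dataset before the more expensive near-duplicate, semantic, and judge-model stages see it.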
Deduplication in Detail
Deduplication is the single highest-leverage step. Public instruction datasets routinely contain 20-40% near-duplicate content; training on duplicated examples is equivalent to upweighting them, which biases the model toward whatever pattern is over-represented.
| Method | Catches | Cost |
|---|---|---|
| SHA-256 hash | Byte-exact duplicates | Trivial |
| MinHash LSH | Near-duplicates (token-level) | Cheap |
| SimHash | Near-duplicates (single fingerprint per example, faster to compare) | Cheap |
| Embedding clustering | Semantic duplicates | Moderate |
| Pairwise LLM judge | Subtle paraphrases | Expensive |
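To make the MinHash row concrete, here is a small standard-library sketch of the idea: hash each example's word shingles under several seeded hash functions, keep the minimum per function, and treat the fraction of matching minima as an estimate of Jaccard similarity. The shingle size, signature length, and threshold are assumed values, and the comparison is a plain pairwise loop. A production tool would add LSH banding so that each example is compared only against likely candidates.

```python
import hashlib

NUM_PERM = 64  # number of seeded hash functions (assumed value)
SHINGLE = 3    # word n-gram size used as shingles (assumed value)


def shingles(text: str, n: int = SHINGLE) -> set:
    """Lower-cased word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, item: str) -> int:
    """64-bit hash of `item`, with the seed supplied as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str) -> tuple:
    """One minimum hash value per seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a: tuple, sig_b: tuple) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_near_duplicates(texts, threshold: float = 0.7):
    """Keep the first example of each near-duplicate group (O(n^2) comparison)."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(text)
        kept_sigs.append(sig)
    return kept
```

Because the signature is built from surface shingles, this catches case changes and small edits but not genuine paraphrases, which is why the embedding-based and judge-based rows exist.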
Decontamination
Modern benchmarks leak. MMLU, HumanEval, GSM8K, and MATH question text appears in many web crawls, in tutorial sites, in Stack Overflow answers, and in synthetic data generated by models that themselves memorised the benchmark. A fine-tuning dataset compiled without explicit decontamination will almost always have some contamination.
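The standard approach is n-gram overlap: slide a window of 8 to 13 tokens over each training example and drop any example that shares an n-gram with the eval set. Embedding-based similarity search supplements this to catch paraphrased versions. Tulu 3 and OLMo 2 publish their decontamination protocols and release the resulting datasets; the Llama 3 report describes its contamination analysis but does not release the training data.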
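A minimal sketch of the n-gram overlap check follows. It assumes simple regex word tokenisation and an 8-token window (the low end of the range above); the function names are illustrative, and real pipelines usually tokenise and normalise more carefully.

```python
import re

NGRAM = 8  # window size in tokens (assumed; low end of the 8-13 range)


def tokens(text: str) -> list:
    """Lower-cased word tokens; punctuation is discarded."""
    return re.findall(r"\w+", text.lower())


def ngrams(text: str, n: int = NGRAM) -> set:
    """All n-token windows in the text. Texts shorter than n yield an empty set."""
    toks = tokens(text)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def build_eval_index(eval_texts, n: int = NGRAM) -> set:
    """Union of n-grams across every eval example you plan to report on."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(example_text: str, eval_index: set, n: int = NGRAM) -> bool:
    """True if the example shares at least one n-gram with the eval index."""
    return not ngrams(example_text, n).isdisjoint(eval_index)


def decontaminate(train_texts, eval_texts, n: int = NGRAM) -> list:
    """Drop every training example that overlaps the eval set."""
    index = build_eval_index(eval_texts, n)
    return [t for t in train_texts if not is_contaminated(t, index, n)]
```

Note that examples shorter than the window can never be flagged by this check, so very short eval items need a smaller window or the embedding-based pass.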
Balancing and Mixing
Once the data is clean, the next question is composition. A fine-tune dataset is typically a mixture of several source datasets, each weighted to achieve a target task-type distribution. The mixture matters more than any single source.
- Maintain coverage across all target task types; under-representation produces dead zones in the model's behaviour.
- Up-weight high-quality sources and down-weight low-quality ones, but cap the weights so that heavily repeated examples are not memorised.
- When fine-tuning for a narrow domain, mix general data with the domain-specific data; otherwise the model loses general capability.
- For chat models, mix single-turn and multi-turn examples; single-turn-only fine-tunes degrade conversational ability.
- For reasoning models, mix verbose chain-of-thought with concise responses so the model can produce both on demand.
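One way to turn a target distribution into concrete sample counts, with a cap on how often any source is repeated, is sketched below. The function names, the `max_repeat` parameter, and the example sources are hypothetical; the point is only to show the capping logic.

```python
import random


def mixture_counts(sizes: dict, target_share: dict, budget: int,
                   max_repeat: float = 2.0) -> dict:
    """Number of examples to draw from each source.

    sizes:        {source: number of available examples}
    target_share: {source: desired fraction of the final mix}, summing to 1
    budget:       desired total number of examples
    max_repeat:   cap on how many passes over a source are allowed
    """
    counts = {}
    for name, share in target_share.items():
        wanted = round(share * budget)
        cap = int(max_repeat * sizes[name])
        counts[name] = min(wanted, cap)
    return counts


def sample_mixture(sources: dict, counts: dict, seed: int = 0) -> list:
    """Draw the requested counts and shuffle the sources together."""
    rng = random.Random(seed)
    mixed = []
    for name, k in counts.items():
        if k == 0:
            continue
        pool = sources[name]
        full_passes, remainder = divmod(k, len(pool))
        mixed.extend(pool * full_passes)            # whole passes over the source
        mixed.extend(rng.sample(pool, remainder))   # partial pass, no replacement
    rng.shuffle(mixed)
    return mixed
```

When the cap binds, the mixture comes out smaller than the budget and the small source ends up under its target share. That shortfall is a useful signal that more data is needed for that task type, rather than more repetition.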
Tooling
- datatrove (Hugging Face): production-scale text processing pipelines.
- dolma toolkit (AI2): used to build OLMo's training and post-training data.
- text-dedup: MinHash and SimHash deduplication at scale.
- lm-evaluation-harness: an evaluation framework that also provides benchmark decontamination hooks.
- fastText: language ID and quality classifiers, used widely since CommonCrawl-LM.
- Judge models: Llama-3-70B-Instruct, Qwen2.5-72B, or Prometheus 2 for open quality scoring.
When to Invest in Curation
Always. The marginal hour spent improving the dataset returns more quality than the marginal hour spent tuning hyperparameters, in every fine-tuning project the author has seen. Treat curation as 60-80% of the project budget and the rest of the pipeline gets easier.
References
- Tulu 3: Pushing Frontiers in Open Language Model Post-Training · arXiv (Lambert et al., 2024)
- datatrove — Hugging Face data processing library · GitHub
- Deduplicating Training Data Makes Language Models Better · arXiv (Lee et al., 2021)