TL;DR
- Foundation models are large neural networks pretrained on broad data with self-supervised objectives, then adapted to downstream tasks. The term was coined by Stanford's CRFM in 2021.
- For language, foundation models are almost universally decoder-only Transformers trained with next-token prediction on trillions of tokens of web, code and curated data.
- Frontier families in 2026: GPT-5/4.1 (OpenAI), Claude 4.x (Anthropic), Gemini 2.x (Google DeepMind), Llama 3.x/4 (Meta), Qwen 3 (Alibaba), DeepSeek-V3/R1 (DeepSeek), Mistral Large (Mistral).
- Capability now sits on a stack: foundation pretraining → supervised fine-tuning → RLHF/DPO → tool use and reasoning training → deployment via vLLM/TensorRT-LLM/SGLang.
What 'Foundation Model' Means
Stanford's Center for Research on Foundation Models defined the term in 2021: a model trained on broad data at scale that can be adapted to a wide range of downstream tasks. The defining properties are scale, breadth of training data, and adaptability. The economic implication is that a small number of foundation models are trained at huge expense, and a much larger number of downstream applications adapt them via fine-tuning, prompting or retrieval.
In 2026, 'foundation model' colloquially refers to large language models (LLMs) and their multimodal extensions. The same architectural idea underlies image generation (Stable Diffusion, FLUX), speech (Whisper, Voicebox), code (Codestral, DeepSeek-Coder) and biology (AlphaFold 3, ESM-3).
The Pretraining Recipe
Modern LLM pretraining follows a remarkably stable recipe:
- Architecture — decoder-only Transformer with RoPE, SwiGLU, RMSNorm, GQA, often MoE.
- Tokeniser — byte-level BPE with 100k-256k vocabulary.
- Data — trillions of tokens of curated web text (Common Crawl, FineWeb), code (GitHub), books, papers and synthetic data.
- Objective — next-token prediction (cross-entropy) with optional infilling.
- Optimiser — AdamW with cosine learning-rate schedule and warm-up.
- Hardware — thousands to tens of thousands of H100 / B200 GPUs with InfiniBand interconnect.
- Duration — weeks to months of wall-clock training.
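Two items in the recipe above, the next-token objective and the warm-up plus cosine schedule, are easy to state in code. The following is a minimal pure-Python sketch for illustration only, not how any particular lab implements them: the function names `next_token_loss` and `lr_at_step` and all of their parameters are assumptions introduced here, and real training code would use a tensor library with batched tensors.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: list of per-position score lists (one score per vocabulary entry).
    token_ids: the token sequence; position t is scored against token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    total = 0.0
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        target = token_ids[t + 1]
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the true next token.
        total += log_norm - scores[target]
    return total / (len(token_ids) - 1)


def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr after total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

As a sanity check, uniform logits over a four-entry vocabulary give a loss of ln 4, which is what an untrained model that knows nothing about the data should score.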
Scaling Laws
Hoffmann et al.'s 2022 Chinchilla paper established the compute-optimal scaling rule: for a fixed compute budget, model parameters and training tokens should grow in roughly equal proportion. The empirical recipe became ~20 tokens per parameter — a 70B model wants ~1.4T tokens.
By 2024 frontier training had moved well past Chinchilla-optimal because inference cost matters more than training cost at deployment scale. Llama 3 8B was trained on 15T tokens (≈1,875 tokens/parameter, roughly 90 times the Chinchilla ratio), deliberately overtrained to produce a more capable small model. Mixture-of-experts models such as DeepSeek-V3 follow their own scaling regime, since only a fraction of the parameters is active for each token.
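The arithmetic behind these figures can be sketched in a few lines. This is an illustrative calculation, not a reproduction of the Chinchilla fitting procedure: the 20 tokens-per-parameter constant is the rule of thumb quoted above, the `6 * N * D` training-FLOPs estimate is a widely used approximation for dense Transformers that this article does not itself state, and all function names are hypothetical.

```python
import math

CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb quoted in the text


def chinchilla_tokens(params):
    """Compute-optimal token count for a dense model under the 20:1 rule."""
    return CHINCHILLA_TOKENS_PER_PARAM * params


def tokens_per_param(tokens, params):
    """How many training tokens each parameter 'sees'."""
    return tokens / params


def training_flops(params, tokens):
    """Assumption: the common 6 * N * D estimate of training compute."""
    return 6 * params * tokens


def compute_optimal_split(flops):
    """Solve 6 * N * (20 * N) = flops for N, then derive the token count."""
    params = math.sqrt(flops / (6 * CHINCHILLA_TOKENS_PER_PARAM))
    return params, chinchilla_tokens(params)


if __name__ == "__main__":
    print(chinchilla_tokens(70e9))        # tokens for the 70B example
    ratio = tokens_per_param(15e12, 8e9)  # Llama 3 8B figures from the text
    print(ratio, ratio / CHINCHILLA_TOKENS_PER_PARAM)
```

For the 70B example this gives 1.4T tokens, and the 15T / 8B ratio works out to 1,875 tokens per parameter, about 94 times the rule of thumb.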
Post-Training Pipeline
A pretrained 'base' model is a sophisticated next-token predictor but not a useful assistant. Post-training adds:
- Supervised fine-tuning (SFT) on high-quality instruction-following data.
- Preference optimisation — RLHF (PPO), DPO, or GRPO — against human or AI preference data.
- Constitutional AI / RLAIF — AI-generated critique and rewriting against a written policy.
- Tool-use training — function-calling, code execution, web search and agent-environment loops.
- Reasoning training — long-chain-of-thought distillation, OpenAI o-series-style RL on verifiable tasks.
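Of the preference-optimisation methods listed above, DPO is the simplest to write down, because it reduces to a classification-style loss over pairs of responses. The sketch below computes the loss for a single preference pair from summed sequence log-probabilities. The function name, argument names and the `beta=0.1` default are illustrative assumptions; production implementations work on batched tensors and average over many pairs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability a model assigns to a full
    response; 'ref' is the frozen reference model (typically the SFT model).
    beta controls how strongly the policy is held close to the reference.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy is identical to the reference the loss is ln 2; it falls below that as the policy shifts probability towards the chosen response relative to the rejected one.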
The 2026 Landscape
| Family | Latest flagship | Weights | Notable trait |
|---|---|---|---|
| OpenAI | GPT-5 / o5 | Closed | Strongest general reasoning |
| Anthropic | Claude 4.7 Opus | Closed | Long context, strong agentic use |
| Google DeepMind | Gemini 2.5 Ultra | Closed | 1M+ token context, multimodal |
| Meta | Llama 4 400B / 4 70B | Open | Largest open frontier model |
| DeepSeek | DeepSeek-V3.5 / R2 | Open | MoE efficiency, strong reasoning |
| Alibaba | Qwen 3 / Qwen 3-MoE | Open | Strong multilingual |
| Mistral | Mistral Large 2 / Codestral 2 | Mixed | European, code-strong |
Open vs Closed Weights
The 2023-2026 era has produced a remarkable open-weights ecosystem. Llama, Qwen, DeepSeek, Mistral and Gemma have released models within a few months of comparable closed flagships. For most enterprise applications outside the absolute frontier, open weights — fine-tuned and served on dedicated infrastructure — are now the cost-efficient choice.
Closed-weights APIs retain the absolute capability lead at any given moment, generally by 6-12 months. They also bundle reliability, safety filtering and rapid iteration, which are non-trivial to replicate at enterprise scale.
When choosing a foundation model, weigh capability ceiling, total cost of ownership (training + inference), data sovereignty, latency requirements and tool ecosystem. The right answer is rarely 'the most capable model'.
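The cost side of that decision can be roughed out with a break-even calculation. The sketch below is deliberately simplistic and every input is a placeholder to be replaced with real quotes: it assumes a flat per-token API price and a self-hosted deployment billed per GPU-hour, and it ignores capability differences, engineering time, utilisation and throughput limits. All function and parameter names are hypothetical.

```python
def api_monthly_cost(tokens_per_month, price_per_million_tokens):
    """Monthly spend on a hosted API at a flat per-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def self_hosted_monthly_cost(gpu_count, gpu_hourly_cost,
                             fixed_monthly_cost=0.0, hours_per_month=730):
    """Monthly spend on dedicated GPUs plus any fixed overhead.

    hours_per_month defaults to an always-on deployment (about 730 hours).
    """
    return gpu_count * gpu_hourly_cost * hours_per_month + fixed_monthly_cost


def break_even_tokens(price_per_million_tokens, hosted_monthly_cost):
    """Monthly token volume above which self-hosting is cheaper than the API."""
    return hosted_monthly_cost / price_per_million_tokens * 1_000_000
```

Below the break-even volume the API is cheaper on these assumptions; above it, dedicated infrastructure wins, provided the hardware can actually sustain that throughput.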