LLM Foundation Models

TL;DR

Foundation models are large neural networks pretrained on broad data with self-supervised objectives, then adapted to downstream tasks. The term was coined by Stanford's CRFM in 2021.
For language, foundation models are almost universally decoder-only Transformers trained with next-token prediction on trillions of tokens of web, code and curated data.
Frontier families in 2026: GPT-5/4.1 (OpenAI), Claude 4.x (Anthropic), Gemini 2.x (Google DeepMind), Llama 3.x/4 (Meta), Qwen 3 (Alibaba), DeepSeek-V3/R1 (DeepSeek), Mistral Large (Mistral).
Capability now sits on a stack: foundation pretraining → supervised fine-tuning → RLHF/DPO → tool use and reasoning training → deployment via vLLM/TensorRT-LLM/SGLang.

What 'Foundation Model' Means

Stanford's Center for Research on Foundation Models defined the term in 2021: a model trained on broad data at scale that can be adapted to a wide range of downstream tasks. The defining properties are scale, breadth of training data, and adaptability. The economic implication is that a small number of foundation models are trained at huge expense, and a much larger number of downstream applications adapt them via fine-tuning, prompting or retrieval.

In 2026, 'foundation model' colloquially refers to large language models (LLMs) and their multimodal extensions. The same pretrain-then-adapt approach, though not always the same architecture, underlies models for image generation (Stable Diffusion, FLUX), speech (Whisper, Voicebox), code (Codestral, DeepSeek-Coder) and biology (AlphaFold 3, ESM-3).

The Pretraining Recipe

Modern LLM pretraining follows a remarkably stable recipe:

Architecture — decoder-only Transformer with RoPE, SwiGLU, RMSNorm, GQA, often MoE.
Tokeniser — byte-level BPE with 100k-256k vocabulary.
Data — trillions of tokens of curated web text (Common Crawl, FineWeb), code (GitHub), books, papers and synthetic data.
Objective — next-token prediction (cross-entropy) with optional infilling.
Optimiser — AdamW with cosine learning-rate schedule and warm-up.
Hardware — thousands to tens of thousands of H100 / B200 GPUs with InfiniBand interconnect.
Duration — weeks to months of wall-clock training.
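
Two items in the recipe above, the next-token objective and the warm-up plus cosine learning-rate schedule, are compact enough to sketch in plain Python. This is a minimal illustration rather than any lab's training code: the function names, the list-of-lists input format and the `min_lr` floor are assumptions made for the example, and real runs compute the same quantities on GPU tensors.

```python
import math


def next_token_loss(logits, targets):
    """Mean cross-entropy of next-token predictions.

    logits:  one list of raw scores per position (one score per vocab entry)
    targets: the correct next-token id for each position
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # -log softmax(scores)[target]
        total += log_z - scores[target]
    return total / len(targets)


def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the end of training
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

A model that assigns equal scores to every entry of a vocabulary of size V has a loss of ln V, which gives a sanity check for the value of the loss at the very start of training.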

Scaling Laws

Hoffmann et al.'s 2022 Chinchilla paper established the compute-optimal scaling rule: for a fixed compute budget, model parameters and training tokens should grow in roughly equal proportion. The empirical recipe became ~20 tokens per parameter — a 70B model wants ~1.4T tokens.
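
The arithmetic behind the rule of thumb can be checked in a few lines. The sketch below assumes the widely used approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in the text above; the function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ≈ 6 * N * D


def compute_optimal_tokens(n_params):
    """Training tokens suggested by the ~20 tokens/parameter rule."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params, n_tokens):
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops_budget):
    """Solve C = 6 * N * (20 * N) for N, then set D = 20 * N."""
    n_params = math.sqrt(
        flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    return n_params, TOKENS_PER_PARAM * n_params


tokens = compute_optimal_tokens(70e9)   # 1.4e12 tokens for a 70B model
flops = training_flops(70e9, tokens)    # about 5.9e23 FLOPs
```

Under these assumptions both N and D scale with the square root of the budget, so quadrupling compute doubles the model size and doubles the token count, which is what "grow in roughly equal proportion" means in practice.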

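By 2024 frontier training had moved well past the Chinchilla-optimal ratio, because at deployment scale the cumulative cost of inference outweighs the one-off cost of training, and a smaller model trained for longer is cheaper to serve. Llama 3 8B was trained on 15T tokens (≈1,900 tokens per parameter, nearly 100 times the Chinchilla ratio) to produce a more capable small model. Mixture-of-experts models such as DeepSeek-V3 (671B total parameters, about 37B activated per token, 14.8T training tokens) sit in a different regime again, because per-token compute tracks the activated rather than the total parameter count.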
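
The trade-off between training and serving cost can be made concrete with a rough compute model. The sketch below assumes about 6·N FLOPs per training token and about 2·N FLOPs per token served for a dense model with N parameters. It compares compute only and says nothing about whether the two models reach the same quality, so the break-even figure is purely illustrative. All function names are hypothetical.

```python
def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training plus inference compute for a dense model.

    Assumes ~6*N FLOPs per training token and ~2*N per served token.
    """
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens


def break_even_served_tokens(small, large):
    """Served tokens at which the smaller model's lifetime compute
    drops below the larger one's.

    small, large: (n_params, train_tokens) tuples.
    """
    (n_s, d_s), (n_l, d_l) = small, large
    extra_training = 6 * n_s * d_s - 6 * n_l * d_l
    saving_per_token = 2 * n_l - 2 * n_s
    return extra_training / saving_per_token


ratio = tokens_per_param(8e9, 15e12)     # 1875 tokens per parameter
overtraining = ratio / 20                # ~94x the Chinchilla ratio

# 8B model on 15T tokens vs a 70B model on 1.4T tokens
crossover = break_even_served_tokens((8e9, 15e12), (70e9, 1.4e12))
```

With these inputs the smaller model costs more to train but less per served token, and the crossover lands at roughly a trillion served tokens, a volume that a widely deployed model can plausibly exceed.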

Post-Training Pipeline

A pretrained 'base' model is a sophisticated next-token predictor but not a useful assistant. Post-training adds:

Supervised fine-tuning (SFT) on high-quality instruction-following data.
Preference optimisation — RLHF (PPO), DPO, or GRPO — against human or AI preference data.
Constitutional AI / RLAIF — AI-generated critique and rewriting against a written policy.
Tool-use training — function-calling, code execution, web search and agent-environment loops.
Reasoning training — long-chain-of-thought distillation, OpenAI o-series-style RL on verifiable tasks.
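
Of the preference-optimisation methods listed above, DPO is the simplest to write down because it needs no separate reward model. The sketch below computes the DPO loss for a single preference pair from summed sequence log-probabilities. The argument names and the default `beta=0.1` are assumptions for the example; a real implementation would batch this over tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    # How much more the policy favours each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a stable form
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference model the loss is ln 2, and it falls as the policy shifts probability toward the chosen response relative to the rejected one.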

The 2026 Landscape

Family	Latest flagship	Weights	Notable trait
OpenAI	GPT-5 / o5	Closed	Strongest general reasoning
Anthropic	Claude 4.7 Opus	Closed	Long context, strong agentic use
Google DeepMind	Gemini 2.5 Ultra	Closed	1M+ token context, multimodal
Meta	Llama 4 400B / 4 70B	Open	Largest open frontier model
DeepSeek	DeepSeek-V3.5 / R2	Open	MoE efficiency, strong reasoning
Alibaba	Qwen 3 / Qwen 3-MoE	Open	Strong multilingual
Mistral	Mistral Large 2 / Codestral 2	Mixed	European, code-strong

Open vs Closed Weights

The 2023-2026 era has produced a remarkable open-weights ecosystem. Llama, Qwen, DeepSeek, Mistral and Gemma have released models within a few months of comparable closed flagships. For most enterprise applications outside the absolute frontier, open weights — fine-tuned and served on dedicated infrastructure — are now the cost-efficient choice.

Closed-weights APIs retain the absolute capability lead at any given moment, generally by 6-12 months. They also bundle reliability, safety filtering and rapid iteration, which are non-trivial to replicate at enterprise scale.

When choosing a foundation model, weigh capability ceiling, total cost of ownership (training + inference), data sovereignty, latency requirements and tool ecosystem. The right answer is rarely 'the most capable model'.
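
The inference side of the cost comparison can be roughed out as a break-even calculation between a per-token API price and a fixed pool of dedicated GPUs. Every number in the usage lines is a hypothetical placeholder, not a quoted price, and the sketch ignores whether the GPUs can actually sustain the required throughput, along with engineering, fine-tuning and idle-capacity costs.

```python
def api_monthly_cost(tokens_per_month, usd_per_million_tokens):
    """Cost of serving a monthly token volume through a metered API."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def self_hosted_monthly_cost(num_gpus, usd_per_gpu_hour, hours_per_month=730):
    """Cost of keeping a fixed GPU pool running all month."""
    return num_gpus * usd_per_gpu_hour * hours_per_month


def break_even_tokens_per_month(usd_per_million_tokens, num_gpus,
                                usd_per_gpu_hour, hours_per_month=730):
    """Monthly token volume above which the fixed GPU pool is cheaper."""
    fixed = self_hosted_monthly_cost(num_gpus, usd_per_gpu_hour, hours_per_month)
    return fixed / usd_per_million_tokens * 1_000_000


# Hypothetical inputs: $5 per million tokens vs 8 GPUs at $2.50 per hour
threshold = break_even_tokens_per_month(5.0, 8, 2.5)
```

Below the threshold the metered API is cheaper on inference alone; above it, dedicated infrastructure starts to pay for itself, provided the hardware can carry the load.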

References

On the Opportunities and Risks of Foundation Models (Bommasani et al., 2021) · arXiv / Stanford CRFM
Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) · arXiv
The Llama 3 Herd of Models (Llama Team, AI @ Meta, 2024) · arXiv
DeepSeek-V3 Technical Report (DeepSeek-AI, 2024) · arXiv
