TL;DR
- Vision-Language Models map images and text into a shared representation space, enabling tasks from image-text retrieval to visual question answering and grounded reasoning.
- CLIP (Radford et al., 2021) established the contrastive image-text pretraining recipe and launched the field's modern era.
- Generative VLMs (LLaVA, BLIP-2, Qwen-VL, Llama 3.2-Vision, and closed systems such as GPT-4V, Claude 3, and Gemini) pair visual input with an LLM; the open models typically do so by joining a vision encoder to the LLM through a small connector module.
- The dominant 2024-2026 architecture: ViT or SigLIP vision encoder → MLP connector → decoder-only LLM, trained on image-text pairs and instruction data.
CLIP and Contrastive Pretraining
Before CLIP, vision models were trained on labelled image datasets (ImageNet) and language models on text — separate worlds. CLIP (Contrastive Language-Image Pretraining, Radford et al., 2021) trained a vision encoder and a text encoder jointly on 400M image-text pairs scraped from the web, using a contrastive InfoNCE loss: in a batch of N pairs, the correct (image, caption) pair should score higher than any of the N-1 mismatched pairs.
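The batch-level objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's training code: the function names are hypothetical, the embeddings are random stand-ins for encoder outputs, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched (image, caption) pairs."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    # logits[i, j] = similarity between image i and caption j, shape (N, N).
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the matching caption.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should put its mass on the matching image.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))                    # stand-in image embeddings
text_emb = image_emb + 0.1 * rng.normal(size=(8, 32))   # roughly aligned captions

aligned_loss = clip_contrastive_loss(image_emb, text_emb)
# Rolling the captions by one position mismatches every pair.
shuffled_loss = clip_contrastive_loss(image_emb, np.roll(text_emb, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

The loss is low when each image scores highest against its own caption and high when the pairs are mismatched. Minimising it over many batches is what pulls matching images and captions together.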
The result: image embeddings and caption embeddings lived in the same vector space, enabling zero-shot image classification (compare an image to caption candidates like 'a photo of a cat'), text-to-image retrieval, and downstream task transfer with minimal fine-tuning.
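Zero-shot classification in a shared embedding space reduces to a nearest-caption lookup. The sketch below assumes the embeddings have already been produced by an image encoder and a text encoder; the three-dimensional vectors and the `zero_shot_classify` helper are toy stand-ins invented for illustration.

```python
import numpy as np

def zero_shot_classify(image_emb, caption_embs, labels):
    # Normalise so that the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ img                      # one similarity per candidate caption
    return labels[int(np.argmax(sims))], sims

labels = ["cat", "dog", "car"]
# Toy stand-ins for text-encoder outputs of "a photo of a {label}".
caption_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy stand-in for the image-encoder output of a cat photo.
image_emb = np.array([0.9, 0.2, 0.1])

predicted, sims = zero_shot_classify(image_emb, caption_embs, labels)
print(predicted, sims)
```

No classifier head is trained here: adding a new class only requires embedding one more caption.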
Beyond CLIP: SigLIP and EVA-CLIP
SigLIP (Zhai et al., 2023) replaced CLIP's softmax-based contrastive loss with a pairwise sigmoid loss that scores every image-text pair independently as a binary match/no-match decision. Because no batch-wide softmax normalisation is needed, the loss is far less sensitive to batch size and works well at much smaller batches. SigLIP-So400M became the de facto vision encoder for open generative VLMs in 2024.
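A minimal NumPy sketch of a sigmoid pairwise loss follows, for comparison with the softmax formulation. The function names are hypothetical, and the `scale` and `bias` values are illustrative fixed constants chosen for this example; in SigLIP they are learnable parameters.

```python
import numpy as np

def log_sigmoid(x):
    # Stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def sigmoid_pairwise_loss(image_emb, text_emb, scale=10.0, bias=-10.0):
    """Each of the N*N (image, caption) pairs is an independent binary decision."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = scale * (img @ txt.T) + bias
    n = logits.shape[0]
    # +1 on the diagonal (matching pairs), -1 everywhere else (mismatches).
    targets = 2.0 * np.eye(n) - 1.0
    # Sum over all pairs, averaged over the batch; no softmax across the batch.
    return -log_sigmoid(targets * logits).sum() / n

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))
text_emb = image_emb + 0.1 * rng.normal(size=(8, 32))

aligned_loss = sigmoid_pairwise_loss(image_emb, text_emb)
shuffled_loss = sigmoid_pairwise_loss(image_emb, np.roll(text_emb, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

Each term depends only on its own pair, which is why the loss does not need a normalisation over the whole batch.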
EVA-CLIP scaled the recipe further and is used as the vision tower in several Chinese open VLMs. DINOv2 is a self-supervised alternative (no text) with strong visual features for dense prediction tasks.
Generative VLM Architecture
The dominant generative VLM recipe has three components:
- Vision encoder (ViT, SigLIP, EVA-CLIP): produces a sequence of visual tokens from an image.
- Connector: a small MLP or Q-Former that maps visual tokens into the LLM's embedding space.
- LLM decoder: a pretrained decoder-only language model that consumes visual tokens followed by text and generates output autoregressively.
Training typically runs in two stages: connector-only pretraining on image-caption data (LLM frozen), then end-to-end instruction tuning on visual question-answering data (LLM unfrozen).
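The data flow through the three components can be sketched with plain NumPy shapes. Everything here is a stand-in chosen for illustration: the "vision encoder" is a single linear patch embedding (a real ViT adds Transformer layers), the weights are random, and the dimensions `VISION_DIM` and `LLM_DIM` are arbitrary. The point is only to show how an image becomes a token sequence that is placed ahead of the text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, VISION_DIM, LLM_DIM = 14, 64, 128   # illustrative sizes, not from any real model

def patchify(image, patch):
    # Split an (H, W, C) image into non-overlapping flattened patches.
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Stand-in vision encoder: one linear projection per patch.
w_patch = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, VISION_DIM))
# Connector: a two-layer MLP into the LLM embedding space.
w1 = rng.normal(scale=0.02, size=(VISION_DIM, LLM_DIM))
w2 = rng.normal(scale=0.02, size=(LLM_DIM, LLM_DIM))

def vision_encoder(image):
    return patchify(image, PATCH) @ w_patch      # (num_patches, VISION_DIM)

def connector(visual_features):
    return gelu(visual_features @ w1) @ w2       # (num_patches, LLM_DIM)

image = rng.random((224, 224, 3))
text_embeddings = rng.normal(size=(5, LLM_DIM))  # stand-in for an embedded 5-token prompt

visual_tokens = connector(vision_encoder(image))
# The decoder-only LLM would consume visual tokens followed by text tokens.
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(visual_tokens.shape, llm_input.shape)
```

In the two-stage recipe, only the connector weights (`w1`, `w2` in this sketch) would be updated during the first stage.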
Representative Models
| Model | Vision encoder | LLM backbone | Notes |
|---|---|---|---|
| LLaVA-1.5 / 1.6 | CLIP ViT-L/14 | Vicuna / Mistral | Open, popular research baseline |
| BLIP-2 | EVA-CLIP | FlanT5 / OPT | Q-Former connector |
| GPT-4V / GPT-4o | Proprietary | GPT-4 / 4o | Closed, strong general VQA |
| Claude 3 / 4 Vision | Proprietary | Claude | Closed, strong document VQA |
| Gemini 1.5 / 2 | Native multimodal | Gemini | Closed, native long-context video |
| Qwen-VL / Qwen2-VL | ViT (Qwen-specific) | Qwen | Open, strong OCR and grounding |
| Llama 3.2-Vision | Custom | Llama 3.1 | Open Meta release |
Native Multimodal vs Encoder-Connector
Gemini was the first frontier model presented as natively multimodal, with images, text, and audio handled by a single Transformer that uses multimodal tokenisation from the start of training. GPT-4o followed with a unified omni-modal model. The claimed advantages are tighter integration, lower latency, and better cross-modal reasoning.
Most open VLMs still use the encoder-connector recipe because it leverages existing strong vision and language models without retraining either from scratch. The capability gap is closing as connector designs (multi-resolution, dynamic patching, tiling) improve.
When VLMs 'see' an image, they actually process a few hundred to a few thousand visual tokens. High-resolution input or fine-grained text in images therefore needs more tokens per image, through dynamic-resolution encoding (Qwen2-VL), multi-crop tiling (GPT-4o-style), or a dedicated OCR step.
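The token budget is simple arithmetic on image size and patch size. The sketch below uses a 336-pixel input with 14-pixel patches (the LLaVA-1.5 configuration of CLIP ViT-L/14) as the single-image case. The tiling scheme, with fixed square tiles plus one downscaled thumbnail, is a generic illustration and not the exact scheme of any model named above; both helper functions are hypothetical.

```python
import math

def visual_token_count(height, width, patch_size):
    # One token per non-overlapping patch.
    return (height // patch_size) * (width // patch_size)

def tiled_token_count(height, width, tile_size, patch_size, add_thumbnail=True):
    # Cover the image with square tiles, each encoded as a separate image.
    tiles = math.ceil(height / tile_size) * math.ceil(width / tile_size)
    per_tile = visual_token_count(tile_size, tile_size, patch_size)
    # Optionally add one low-resolution overview of the whole image.
    return (tiles + (1 if add_thumbnail else 0)) * per_tile

single = visual_token_count(336, 336, 14)       # 24 * 24 patches
tiled = tiled_token_count(1344, 672, 336, 14)   # 4 * 2 tiles plus a thumbnail
print(single, tiled)                            # 576 5184
```

Tiling a 1344×672 image this way costs nine times the tokens of a single 336×336 view, which is the trade-off between legibility of fine detail and context length.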
Common Tasks and Benchmarks
- Visual Question Answering: VQAv2, MMBench, MMMU, MathVista.
- Document understanding: DocVQA, ChartQA, InfographicVQA.
- Image captioning: COCO Captions.
- Grounding and referring: RefCOCO, Visual Grounding benchmarks.
- Video understanding: Video-MME, MVBench, LongVideoBench.
- Scientific reasoning: ScienceQA, AI2D, MMMU.
MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) is the closest VLM equivalent of MMLU; top models in 2026 score in the high 70s, expert humans in the high 80s.