Vision-Language Models (VLMs)

TL;DR

Vision-Language Models map images and text into a shared representation space, enabling tasks from image-text retrieval to visual question answering and grounded reasoning.
CLIP (Radford et al., 2021) established the contrastive image-text pretraining recipe and launched the field's modern era.
Generative VLMs (LLaVA, BLIP-2, GPT-4V, Claude 3 Vision, Gemini, Qwen-VL, Llama 3.2-Vision) combine a vision encoder with an LLM via a small connector module.
The dominant 2024-2026 architecture: ViT or SigLIP vision encoder → MLP connector → decoder-only LLM, trained on image-text pairs and instruction data.

CLIP and Contrastive Pretraining

Before CLIP, vision models were trained on labelled image datasets such as ImageNet and language models on text, with the two kept largely separate. CLIP (Contrastive Language-Image Pretraining, Radford et al., 2021) trained a vision encoder and a text encoder jointly on 400M image-text pairs scraped from the web, using a contrastive InfoNCE loss: in a batch of N pairs, each correct (image, caption) pair should score higher than the N-1 mismatched pairings of that image or that caption.
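As a rough illustration of the contrastive objective described above, the sketch below computes a symmetric InfoNCE loss over a batch of matched pairs in plain Python. The function names and the temperature default of 0.07 are illustrative assumptions, not details taken from this article; real implementations work on tensors and learn the temperature.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; image_embs[i] and text_embs[i] are a matched pair.

    The temperature value is an assumed default for illustration only.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: softmax over each row, target is the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: softmax over each column, same target.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In this toy case the loss is close to zero when each image lines up with its own caption and much larger when the captions are swapped, which is the behaviour the objective is meant to reward.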

The result: image embeddings and caption embeddings live in the same vector space, enabling zero-shot image classification (compare an image to candidate captions such as 'a photo of a cat' and pick the best match), text-to-image retrieval, and downstream task transfer with minimal fine-tuning.
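A minimal sketch of zero-shot classification in a shared embedding space, assuming the image and caption embeddings have already been produced by the two encoders. The 3-dimensional vectors and the function name `zero_shot_classify` are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, caption_embs):
    """Return the caption closest to the image, plus all similarity scores."""
    scores = {label: cosine(image_emb, emb) for label, emb in caption_embs.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings standing in for real text-encoder outputs (hypothetical values).
caption_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]  # stand-in for an image-encoder output

label, scores = zero_shot_classify(image_emb, caption_embs)
print(label)
```

No classifier head is trained here: adding a new class only requires embedding one more caption.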

Beyond CLIP: SigLIP and EVA-CLIP

SigLIP (Zhai et al., 2023) replaced CLIP's softmax-based contrastive loss with a pairwise sigmoid loss that scores every image-text pair independently as a match or a non-match. Because no batch-wide softmax normalisation is needed, the loss is decoupled from batch size and trains well on much smaller batches. SigLIP-So400M became the de-facto vision encoder for open generative VLMs in 2024.
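To make the contrast with the softmax loss concrete, here is a hedged sketch of a pairwise sigmoid loss in plain Python. Each (image, caption) cell is treated as its own binary classification, so no term depends on the rest of the batch. The `scale` and `bias` defaults are assumptions chosen for illustration; in practice both are learned parameters.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss; image_embs[i] and text_embs[i] are a matched pair.

    scale and bias are assumed illustrative defaults, not fixed constants.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(imgs[i], txts[j]))
            sign = 1.0 if i == j else -1.0  # +1 for matches, -1 for non-matches
            # Each cell contributes independently: no softmax over the batch.
            total += log_sigmoid(sign * (scale * sim + bias))
    return -total / n

aligned = sigmoid_pairwise_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = sigmoid_pairwise_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The negative bias offsets the fact that most cells in the similarity matrix are non-matches, so an untrained model is not penalised heavily for them at the start.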

EVA-CLIP scaled the recipe further and is used as the vision tower in several Chinese open VLMs. DINOv2 is a self-supervised alternative (no text) with strong visual features for dense prediction tasks.

Generative VLM Architecture

The dominant generative VLM recipe has three components:

Vision encoder (ViT, SigLIP, EVA-CLIP): produces a sequence of visual tokens from an image.
Connector: a small MLP or Q-Former that maps visual tokens into the LLM's embedding space.
LLM decoder: a pretrained decoder-only language model that consumes visual tokens followed by text and generates output autoregressively.

Training typically proceeds in two stages: connector-only pretraining on image-caption data (LLM frozen), then end-to-end instruction tuning on visual question-answering data (LLM unfrozen).
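The data flow through the three components described above can be sketched with toy dimensions. This is an illustrative stand-in, not any particular model's code: the two-layer MLP with GELU, the widths, and all names (`mlp_connector`, `build_llm_input`) are assumptions, and random vectors replace the real vision encoder and text embedding table.

```python
import math
import random

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def init_linear(rng, in_dim, out_dim):
    """Random weight matrix (out_dim rows) and zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def mlp_connector(visual_tokens, fc1, fc2):
    """Project each visual token from the encoder width to the LLM width."""
    return [linear([gelu(h) for h in linear(tok, fc1)], fc2) for tok in visual_tokens]

def build_llm_input(visual_tokens, text_embeddings, fc1, fc2):
    """Projected visual tokens first, followed by the text token embeddings."""
    return mlp_connector(visual_tokens, fc1, fc2) + text_embeddings

rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16  # toy widths, far smaller than real models
fc1 = init_linear(rng, VISION_DIM, LLM_DIM)
fc2 = init_linear(rng, LLM_DIM, LLM_DIM)

# Stand-ins for vision-encoder output (4 tokens) and embedded prompt text (3 tokens).
visual_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]

sequence = build_llm_input(visual_tokens, text_embeddings, fc1, fc2)
print(len(sequence), len(sequence[0]))
```

From the decoder's point of view the result is a single sequence of vectors at its own embedding width, which is why a pretrained LLM can be reused with only the connector trained at first.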

Representative Models

Model	Vision encoder	LLM backbone	Notes
LLaVA-1.5 / 1.6	CLIP ViT-L/14	Vicuna / Mistral	Open, popular research baseline
BLIP-2	EVA-CLIP	FlanT5 / OPT	Q-Former connector
GPT-4V / GPT-4o	Proprietary	GPT-4 / 4o	Closed, strong general VQA
Claude 3 / 4 Vision	Proprietary	Claude	Closed, strong document VQA
Gemini 1.5 / 2	Native multimodal	Gemini	Closed, native long-context video
Qwen-VL / Qwen2-VL	ViT (Qwen-specific)	Qwen	Open, strong OCR and grounding
Llama 3.2-Vision	Custom	Llama 3.1	Open Meta release

Native Multimodal vs Encoder-Connector

Gemini was among the first frontier models trained natively multimodal, with images, text and audio handled by a single Transformer with multimodal tokenisation from the start. GPT-4o followed with a unified omni-modal model. The intended advantages are tighter integration, lower latency and better cross-modal reasoning.

Most open VLMs still use the encoder-connector recipe because it leverages existing strong vision and language models without retraining either from scratch. The capability gap is closing as connector designs (multi-resolution, dynamic patching, tiling) improve.

When VLMs 'see' images, they really see a few hundred to a few thousand visual tokens. High-resolution input or fine-grained text in images requires tiling strategies (Qwen2-VL, GPT-4o-style multi-crop) or dedicated OCR.
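The token budget mentioned above follows from simple arithmetic on image size and patch size. The sketch below assumes a plain ViT with square patches and a simple tiling scheme with an optional thumbnail crop; the function names are hypothetical, and real models often add steps such as token merging or special separator tokens that change the exact count.

```python
import math

def visual_token_count(height, width, patch_size):
    """Patch tokens a plain ViT produces for one image (ignoring any CLS token)."""
    return (height // patch_size) * (width // patch_size)

def tiled_token_count(height, width, tile_size, patch_size, include_thumbnail=True):
    """Tokens when the image is cut into tile_size x tile_size crops.

    Each crop is encoded separately; an optional downscaled thumbnail adds
    one more crop's worth of tokens for global context.
    """
    tiles = math.ceil(height / tile_size) * math.ceil(width / tile_size)
    per_tile = visual_token_count(tile_size, tile_size, patch_size)
    return (tiles + (1 if include_thumbnail else 0)) * per_tile

single = visual_token_count(336, 336, 14)      # 24 x 24 patches
tiled = tiled_token_count(672, 1008, 336, 14)  # 6 tiles plus 1 thumbnail
print(single, tiled)  # 576 4032
```

A 336-pixel image at patch size 14 lands in the "few hundred" range, while tiling a larger image quickly pushes the count into the thousands, which is the trade-off between fine detail and context length.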

Common Tasks and Benchmarks

Visual Question Answering: VQAv2, MMBench, MMMU, MathVista.
Document understanding: DocVQA, ChartQA, InfographicVQA.
Image captioning: COCO Captions.
Grounding and referring: RefCOCO, Visual Grounding benchmarks.
Video understanding: Video-MME, MVBench, LongVideoBench.
Scientific reasoning: ScienceQA, AI2D, MMMU.

MMMU (Massive Multi-discipline Multimodal Understanding) is the closest VLM equivalent of MMLU; top models in 2026 score in the high 70s, expert humans in the high 80s.
