SigLIP

TL;DR

Introduced by Zhai et al. at Google in 'Sigmoid Loss for Language Image Pre-Training' (arXiv:2303.15343, March 2023).
Replaces CLIP's softmax-based InfoNCE loss with a per-pair sigmoid loss, removing the batch-wide normalisation and, with it, the need to all-gather embeddings across devices to form the full similarity matrix.
Simpler scaling: the sigmoid loss outperforms softmax at small batch sizes, and the paper finds that the gains from larger batches largely saturate around a batch size of 32k.
SigLIP and SigLIP 2 (2025) have become standard vision-language encoders in modern multimodal LLMs (PaliGemma, the Gemma 3 vision tower, many open MLLMs).

What Changed from CLIP

CLIP's contrastive loss is softmax over the in-batch similarity matrix — for every image, all other texts in the batch are negatives, and vice versa. That requires gathering all embeddings from all distributed-training devices to compute the loss, and the loss quality scales with the batch size because more negatives sharpen the contrast.

SigLIP replaces softmax with a sigmoid loss applied independently to every (image, text) pair. Each pair becomes a binary classification: should these match? Every non-matching pair in the batch still acts as a negative, but because the loss is a plain sum over pairs, it can be computed chunk by chunk across devices without all-gathering embeddings or materialising the full similarity matrix. The paper reports that the sigmoid loss performs better than softmax at small batch sizes. The architecture (two encoders that map into a shared embedding space) is otherwise unchanged.
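To make the independence claim concrete, the following minimal sketch compares how the two losses react when a single negative pair's logit changes. The 3x3 logit matrix is made up for illustration, and `softmax_row_losses` covers only the image-to-text half of CLIP's symmetric loss; both function names are hypothetical helpers, not part of any library.

```python
import numpy as np

def softmax_row_losses(z):
    """Image-to-text softmax cross-entropy, one value per image (row)."""
    log_norm = np.logaddexp.reduce(z, axis=1)  # log sum_j exp(z_ij)
    return log_norm - np.diag(z)

def sigmoid_pair_losses(z):
    """Per-pair logistic loss log(1 + exp(-y_ij * z_ij)), one value per pair."""
    y = 2.0 * np.eye(z.shape[0]) - 1.0  # +1 on the diagonal, -1 elsewhere
    return np.logaddexp(0.0, -y * z)

# Hypothetical logits: rows are images, columns are texts.
z = np.array([[ 2.0, -1.0, -3.0],
              [-2.0,  1.5, -1.0],
              [-4.0, -2.0,  3.0]])

z_perturbed = z.copy()
z_perturbed[0, 1] += 2.0  # make one negative pair look more similar

softmax_delta = softmax_row_losses(z_perturbed) - softmax_row_losses(z)
sigmoid_delta = sigmoid_pair_losses(z_perturbed) - sigmoid_pair_losses(z)
changed_pairs = np.argwhere(~np.isclose(sigmoid_delta, 0.0))

print("softmax loss change per image:", softmax_delta)
print("sigmoid pairs whose loss changed:", changed_pairs.tolist())
```

Under softmax, the loss for image 0 moves because the changed negative enters its row normaliser. Under the sigmoid loss, only the perturbed pair's own term changes, which is what lets the sum be split across chunks or devices.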

Loss in One Line

Compared to CLIP's symmetric InfoNCE, this is a per-pair logistic loss on a learned affine transform of the similarity (temperature t and bias b). There is no softmax and no batch-wide normalisation: each pair's loss term depends only on that pair's own similarity.

```text
For each image-text pair (i, j):
  z_ij = t * (image_i · text_j) + b
  loss_ij = log(1 + exp(-y_ij * z_ij))

where y_ij = +1 if i == j (positive pair), -1 otherwise
(t, b) are learned temperature and bias scalars.
```
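The formula above can be sketched in a few lines of NumPy. This is an illustrative version, not the authors' implementation: it assumes L2-normalised embeddings and a sum over all pairs divided by the batch size, which is how the paper states the loss. The function name `siglip_loss` and the toy inputs are hypothetical, and the values t = 10 and b = -10 are used here because they are the initial values the paper reports for the learned scalars.

```python
import numpy as np

def siglip_loss(image_emb, text_emb, t, b):
    """Pairwise sigmoid loss for a batch of matched (image, text) embeddings.

    Row i of image_emb is assumed to match row i of text_emb.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    z = t * (image_emb @ text_emb.T) + b   # z_ij for every pair
    y = 2.0 * np.eye(z.shape[0]) - 1.0     # +1 for i == j, -1 otherwise

    # log(1 + exp(-y * z)), computed stably via logaddexp.
    pair_loss = np.logaddexp(0.0, -y * z)
    return pair_loss.sum() / z.shape[0]

# Toy batch: four perfectly aligned, mutually orthogonal embeddings.
emb = np.eye(4)
loss = siglip_loss(emb, emb, t=10.0, b=-10.0)
print(loss)
```

With a strongly negative bias, the many negative pairs contribute almost nothing at the start of training, so the loss is not dominated by the roughly n² - n negatives in a batch of n.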

Variants

SigLIP: original release, several ViT backbones (B/16, L/16, So400m/14, So400m/16).
SigLIP 2: 2025 successor that extends the recipe with captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation; current default for many multimodal stacks.
PaliGemma: Google's open vision-language model that pairs a SigLIP image encoder with a Gemma text decoder.
Gemma 3 vision: uses a SigLIP-style encoder as its vision tower.

Why It Spread

Three practical wins moved the open ecosystem onto SigLIP: simpler distributed training, strong small-batch performance (lower barrier to fine-tuning), and consistently competitive or better zero-shot benchmark numbers versus same-scale CLIP and EVA-CLIP. Hugging Face's `SiglipModel` made integration trivial.

Multimodal LLM authors picked SigLIP as the vision tower because it scaled cleanly to the higher image resolutions modern MLLMs need (typically 384 or 448 pixels), without the throughput penalty that full-batch contrastive training imposes at high resolution.

For new vision-language work in 2026, SigLIP 2 So400m is a strong default encoder. Pair with a Gemma, Qwen, or Llama text decoder for a working multimodal stack.

Practical Use

Note that SigLIP's logits are interpreted with sigmoid, not softmax. Each label gets an independent probability of matching the image — useful for multi-label classification, threshold-based gating, and zero-shot tagging tasks where 'none of the above' is a valid answer.

```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

ckpt = "google/siglip-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("photo.jpg").convert("RGB")
texts = ["a photo of a dog", "a photo of a cat"]

# Pad text to the model's fixed length, matching how SigLIP was trained.
inputs = processor(text=texts, images=image, return_tensors="pt", padding="max_length")
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid turns each logit into an independent match probability.
probs = torch.sigmoid(outputs.logits_per_image)
print(dict(zip(texts, probs[0].tolist())))
```
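The difference between the sigmoid and softmax readings matters most when no label fits. The sketch below uses made-up logits for an image that shows neither a dog nor a cat; the logit values and the 0.5 threshold are illustrative assumptions, not model outputs or recommended settings.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a photo of a dog", "a photo of a cat"]
logits = [-6.0, -7.5]  # hypothetical: both labels are poor matches

sigmoid_probs = [sigmoid(z) for z in logits]  # independent per label
softmax_probs = softmax(logits)               # forced to sum to 1

threshold = 0.5  # illustrative cut-off for threshold-based gating
tags = [label for label, p in zip(labels, sigmoid_probs) if p >= threshold]

print("sigmoid:", sigmoid_probs)
print("softmax:", softmax_probs)
print("tags:", tags)
```

Softmax still hands most of the probability mass to one of the two labels, while the sigmoid probabilities both stay near zero and the tag list comes back empty, which is the "none of the above" behaviour described above.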

Compared to CLIP and DINOv2

vs CLIP — SigLIP is the modern default for vision-language tasks. Same dual-encoder pattern, simpler loss, better scaling, current public checkpoints.
vs DINOv2 — different problems. SigLIP for language-conditioned tasks, DINOv2 for vision-only dense features.
vs EVA-CLIP — comparable performance on zero-shot benchmarks; SigLIP wins on training simplicity and ecosystem support, EVA-CLIP wins where larger encoder backbones (up to 18B) are needed.

References

Sigmoid Loss for Language Image Pre-Training (Zhai et al., 2023) · arXiv
SigLIP on Hugging Face · Hugging Face
PaliGemma technical report · arXiv
