Mask R-CNN

TL;DR

Introduced by He, Gkioxari, Dollár, and Girshick at Facebook AI Research in 'Mask R-CNN' (arXiv:1703.06870, March 2017).
Extends Faster R-CNN by adding a parallel branch that predicts a binary mask for each Region of Interest, producing instance segmentation alongside detection.
Introduced RoIAlign — a bilinear-interpolation replacement for RoIPool that fixes the quantisation misalignment that previously hurt mask quality.
Remains widely used in 2026 for high-accuracy instance segmentation and as the architectural ancestor of Cascade Mask R-CNN, HTC, and most subsequent two-stage segmentation work.

Lineage

Mask R-CNN sits at the end of a clear lineage: R-CNN (Girshick et al., 2014) introduced region proposals plus a CNN classifier; Fast R-CNN (Girshick, 2015) shared the CNN computation across proposals; Faster R-CNN (Ren et al., 2015) replaced selective search with a learned Region Proposal Network. Mask R-CNN added the mask head and the RoIAlign operation that lets that head produce pixel-precise masks.

The 2017 paper won the Marr Prize at ICCV. Almost a decade later, Mask R-CNN remains the reference two-stage detector and the most common baseline against which new instance segmentation methods are reported.

Architecture

Backbone — typically ResNet-50/101 with Feature Pyramid Network (FPN).
Region Proposal Network — sliding-window anchor-based proposal generator on each FPN level.
RoIAlign — extracts a fixed-size feature map for each proposal using bilinear interpolation, avoiding the integer quantisation of RoIPool.
Detection head — per-RoI classification and bounding-box regression.
Mask head — small FCN per RoI emitting one binary mask per class (K × m × m output).
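
The per-class mask output can be illustrated with a toy sketch. It assumes K = 3 classes and m = 2 purely for readability, and the helper name `select_instance_mask` is hypothetical. Given the class predicted by the detection head, only that class's channel is kept and binarised at 0.5. The resizing of the m × m mask to the RoI's size, which happens before thresholding in a real pipeline, is omitted here.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def select_instance_mask(mask_logits, class_id, threshold=0.5):
    """Pick the m x m channel for class_id from K x m x m logits and binarise it."""
    channel = mask_logits[class_id]
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in channel]

# Toy K x m x m logits: K = 3 classes, m = 2 (illustrative values only).
mask_logits = [
    [[-2.0, -1.0], [-3.0, -0.5]],  # class 0
    [[2.0, -1.0], [0.5, 3.0]],     # class 1
    [[-4.0, 4.0], [-4.0, 4.0]],    # class 2
]

predicted_class = 1  # in a real model this comes from the detection head
binary_mask = select_instance_mask(mask_logits, predicted_class)
print(binary_mask)  # [[1, 0], [1, 1]]
```

The other K − 1 channels are simply discarded at inference, which is what decouples mask prediction from class prediction.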

RoIAlign: Why It Mattered

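RoIPool, used in Fast/Faster R-CNN, quantised RoI coordinates to the integer feature-map grid twice: once when projecting the RoI from image coordinates onto the feature map, and once when dividing the projected RoI into pooling bins. Each quantisation shifted the extracted features by a fraction of a feature-map cell, which at a typical stride of 16 can amount to several image pixels. Box-level classification is robust to small translations of this kind, but pixel-precise masks are not.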
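
A small arithmetic sketch of the first quantisation step may make the size of the error concrete. It assumes a feature stride of 16 and a made-up RoI edge at x = 90 image pixels; the function names are hypothetical. The second quantisation (bin boundaries) adds further error on top of this.

```python
STRIDE = 16  # assumed feature-map stride

def project_quantised(x_img: float) -> int:
    """RoIPool-style projection: divide by the stride, then round to an integer cell."""
    return round(x_img / STRIDE)

def project_continuous(x_img: float) -> float:
    """RoIAlign-style projection: keep the fractional coordinate."""
    return x_img / STRIDE

x_img = 90.0  # hypothetical left edge of an RoI, in image pixels
q = project_quantised(x_img)
c = project_continuous(x_img)
shift_px = abs(q - c) * STRIDE  # misalignment expressed back in image pixels
print(q, c, shift_px)  # 6 5.625 6.0
```

In this example the rounded edge lands 6 image pixels away from where the RoI actually starts, which is enough to visibly shift a mask boundary.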

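RoIAlign removes both quantisations. Coordinates stay continuous, and bilinear interpolation samples the feature map at fractional positions inside each bin, with the samples then aggregated by max or average. In the paper's ablations the change was worth about 3 points of COCO mask AP on a stride-16 backbone and more at stride 32, which makes it one of the main reasons Mask R-CNN worked so well.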

RoIAlign is now the default RoI feature extractor in most two-stage detectors. It is one of the few cases where a single op-level fix moved the state of the art measurably.
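
The operation can be sketched in plain Python on a single-channel feature map. This is a minimal illustration under stated assumptions, not a reference implementation: feature values are taken to sit at integer coordinates, the RoI is already given in continuous feature-map coordinates, each bin averages 2 × 2 regularly spaced samples, and the names `bilinear` and `roi_align` are chosen here for illustration. Library implementations differ in details such as half-pixel offsets.

```python
import math

def bilinear(feat, y, x):
    """Sample a 2D feature map (list of rows) at fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, roi, out_size=2, samples=2):
    """Average samples x samples bilinear samples in each of out_size x out_size bins.

    roi = (y_start, x_start, y_end, x_end) in continuous feature-map coordinates.
    """
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_size
    bin_w = (x_end - x_start) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside the bin; nothing is rounded.
                    y = y_start + (i + (sy + 0.5) / samples) * bin_h
                    x = x_start + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feat, y, x)
            row.append(total / (samples * samples))
        out.append(row)
    return out

# Feature map whose value equals its column index, so the result is easy to verify by hand.
feat = [[float(col) for col in range(6)] for _ in range(6)]
pooled = roi_align(feat, roi=(0.5, 1.25, 4.5, 3.25))
print(pooled)  # [[1.75, 2.75], [1.75, 2.75]]
```

Because the toy feature map is linear in x, each output value equals the mean x-position of its bin's sample points, so the fractional RoI boundaries at 1.25 and 3.25 are reflected exactly in the output. A quantising pooler would have snapped them to integer columns first.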

Loss

The per-RoI training loss is the sum of three terms: classification cross-entropy, bounding-box smooth L1, and the average per-pixel binary cross-entropy over the predicted mask. The mask loss is computed only on the channel of the ground-truth class, so masks for different classes do not compete. The network does not have to choose 'which mask to predict' because each class has its own mask channel, and the class label comes from the classification branch.
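
The mask term can be sketched as follows. This is a toy illustration with K = 2 classes and a 2 × 2 mask; the function names are hypothetical and the logits are made up. It shows that only the ground-truth class channel enters the loss, however wrong the other channels are.

```python
import math

def bce_with_logits(logit: float, target: int) -> float:
    """Binary cross-entropy for one pixel, computed from the raw logit."""
    # Numerically stable form of -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(logit).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel BCE over the ground-truth class channel only."""
    channel = mask_logits[gt_class]  # the other K-1 channels contribute nothing
    losses = [
        bce_with_logits(logit, target)
        for logit_row, target_row in zip(channel, gt_mask)
        for logit, target in zip(logit_row, target_row)
    ]
    return sum(losses) / len(losses)

mask_logits = [
    [[5.0, 5.0], [5.0, 5.0]],  # class 0: arbitrary values, ignored by the loss
    [[0.0, 0.0], [0.0, 0.0]],  # class 1: ground-truth class, maximally uncertain
]
gt_mask = [[1, 0], [0, 1]]
loss = mask_loss(mask_logits, gt_mask, gt_class=1)
print(round(loss, 4))  # 0.6931, i.e. ln 2
```

With every logit at 0 the sigmoid output is 0.5 on each pixel, so the loss is ln 2 regardless of the target mask. Changing the class 0 logits leaves the value untouched.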

Compared to SAM 2 and Modern Segmentation

Mask R-CNN and SAM 2 solve different problems. Mask R-CNN is a closed-vocabulary instance segmentation model: it predicts masks for a fixed set of trained classes. SAM 2 is a class-agnostic, promptable segmentation model: it predicts a mask for whatever object the user indicates with a point, box, or mask prompt, and attaches no class label.

Where the trained classes are known and stable (industrial QC, medical imaging with fixed pathology classes, retail product detection), Mask R-CNN remains competitive and often easier to deploy — no prompt UI, deterministic outputs, well-understood failure modes. Where the object vocabulary is open or shifting, SAM 2 is the better fit.

Deployment Reality

Mask R-CNN ships in Detectron2, MMDetection, and TorchVision with mature ResNet-FPN reference configurations (ResNet-50 and ResNet-101 in Detectron2 and MMDetection, ResNet-50 in TorchVision). On an L4 at 800×800 input, FP16 TensorRT throughput is in the tens of images per second per stream, which is fine for offline batch segmentation but marginal for high-frame-rate video. For high-throughput streaming, modern single-stage segmenters (YOLO11-seg, RT-DETR-seg variants) are preferred.