Annotation Practice · Supervised Labelling
The labels your training set stands on
Classification, named-entity spans, bounding boxes, segmentation masks, intent and slot tagging, relation extraction. Double-blind labelling. Krippendorff's α tracked per task. Per-item lineage so the dataset card you ship answers the regulator's question.
Representative item
Reviewed · Clinical-note NER + chest X-ray bbox
Reviewed by Dr. Patel at Royal Free Hospital on 12 Mar 2026.
Started on amoxicillin 25mg, review with Dr. Okafor in two weeks.
Bounding box · 3 regions
512 × 512 px source
IAA (Krippendorff α): 0.91
Labellers / item: 3
Items in batch: 12k
Double-blind labelling. Disagreements adjudicated by a clinical reviewer. Per-item lineage.
The shape of the work
The task shapes that supervised learning trains against
Supervised labelling is not one task. Span tagging, box drawing, mask painting, and relation linking are distinct crafts with distinct failure modes. We treat them that way.
Classification
Single-label or multi-label. The simplest task on paper, the easiest to mis-design in practice. Label set and edge cases get nailed down in calibration before any volume ships.
single-label · multi-label · hierarchical
Named-entity recognition
Span-level tagging across people, places, organisations, dosages, dates, custom domain entities. Nested spans handled. Span boundary disagreement adjudicated, not averaged.
spans · nested · custom entities
Bounding boxes
Tight axis-aligned or rotated boxes for object detection. IoU thresholds set against your downstream model's tolerance, not a generic default.
axis-aligned · rotated · IoU-tuned
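As a rough illustration of what an IoU threshold check involves, here is a minimal sketch. It assumes axis-aligned boxes stored as (x, y, width, height); the names `iou` and `boxes_agree` and the 0.5 threshold are hypothetical, and rotated boxes would need a polygon intersection instead.

```python
from typing import Tuple

# (x, y, width, height) in pixels. This layout is an assumption for the sketch.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def boxes_agree(a: Box, b: Box, threshold: float) -> bool:
    """True when two boxes overlap by at least `threshold` IoU."""
    return iou(a, b) >= threshold


labeller_1 = (0, 0, 10, 10)
labeller_2 = (5, 0, 10, 10)  # same size, shifted right by half a width
print(round(iou(labeller_1, labeller_2), 3))                 # 0.333
print(boxes_agree(labeller_1, labeller_2, threshold=0.5))    # False
```

The threshold is passed in rather than hard-coded so it can be set against the downstream model's tolerance.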
Polygon and segmentation masks
Pixel-precise polygons for instance and semantic segmentation. Used where bounding boxes lose too much information. Slower per item, calibration-heavy, worth it when the model needs it.
instance · semantic · panoptic
Intent and slot tagging
Turn utterances into the intent plus structured slots a conversational system can act on. Calibration covers ambiguous intent boundaries upfront so the dataset reflects one rulebook.
intent · slots · conversational
Relation extraction and span linking
Typed relations between entities, coreference chains, span-to-knowledge-base linking. The signal that pushes a model from extractive to compositional.
relations · coreference · entity linking
Real annotation samples
What the labeller actually draws on the canvas
Three real photographs with the bounding boxes, polygons, and keypoints a labeller would produce. Class labels and confidence scores have the same shape your downstream trainer consumes. Each card is followed by its COCO-format JSON.
person · 0.97
person · 0.94
bicycle · 0.88
skateboard · 0.91
person (group) · 0.74
Multi-object scene · bounding boxes + class labels + confidence scores
COCO-format annotation:
{
  "image_id": 87038,
  "annotations": [
    { "category": "person", "bbox": [342, 200, 80, 173], "score": 0.97 },
    { "category": "person", "bbox": [246, 228, 77, 158], "score": 0.94 },
    { "category": "bicycle", "bbox": [234, 266, 93, 127], "score": 0.88 },
    { "category": "skateboard", "bbox": [307, 339, 70, 41], "score": 0.91 }
  ]
}
bowl · 0.89
person · 0.95
apron · mask · 0.86
Instance segmentation · polygon outline traces the apron silhouette
COCO-format annotation:
{
  "image_id": 397133,
  "annotations": [
    {
      "category": "person",
      "bbox": [352, 119, 141, 257],
      "segmentation": [[387, 154, ... 18 pts ...]],
      "score": 0.95
    },
    { "category": "bowl", "bbox": [141, 277, 77, 56], "score": 0.89 }
  ]
}
bowl of fruit · 0.92
refrigerator · 0.96
oven · 0.88
Keypoint + bbox · point annotations on small objects with class labels
COCO-format annotation:
{
  "image_id": 37777,
  "annotations": [
    { "category": "refrigerator", "bbox": [296, 51, 56, 161], "score": 0.96 },
    { "category": "oven", "bbox": [88, 83, 77, 92], "score": 0.88 },
    { "category": "bowl", "bbox": [146, 138, 79, 41], "score": 0.92,
      "keypoints": [186, 154, 1, 180, 161, 1, 194, 161, 1, 173, 169, 1] }
  ]
}
Photos sampled from COCO 2017 validation set (Creative Commons Attribution 4.0). Bounding boxes, polygons, and keypoints hand-placed to illustrate the labelling shapes Yobitel ships.
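To show how a trainer-side loader might consume annotations of this shape, here is a minimal sketch using the first card's JSON. It assumes `bbox` follows the COCO convention of [x, y, width, height]; the helper names `xywh_to_xyxy` and `load_boxes` and the `min_score` filter are hypothetical.

```python
import json

# The multi-object card's annotation, copied from the sample above.
card = json.loads("""
{
  "image_id": 87038,
  "annotations": [
    { "category": "person", "bbox": [342, 200, 80, 173], "score": 0.97 },
    { "category": "person", "bbox": [246, 228, 77, 158], "score": 0.94 },
    { "category": "bicycle", "bbox": [234, 266, 93, 127], "score": 0.88 },
    { "category": "skateboard", "bbox": [307, 339, 70, 41], "score": 0.91 }
  ]
}
""")


def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to corner form [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def load_boxes(card, min_score=0.0):
    """Return (category, corner box, score) for annotations at or above min_score."""
    return [
        (ann["category"], xywh_to_xyxy(ann["bbox"]), ann["score"])
        for ann in card["annotations"]
        if ann["score"] >= min_score
    ]


for category, box, score in load_boxes(card, min_score=0.9):
    print(category, box, score)
```

Many detection trainers expect corner coordinates rather than width and height, which is why the conversion is kept as its own function.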
Where supervised datasets quietly fail
The failure modes that surface at eval time
Every supervised labelling engagement we audit hits some subset of these. The training loss looks fine. The held-out split looks fine. Production traffic exposes the cracks.
Span boundaries are where supervised NER quietly breaks
What bad looks like: Whichever boundary the first labeller picks wins
What we design for: Boundary-aware adjudication with a written tie-break rule
Two labellers can agree the entity exists and still disagree by a token on where it starts or ends. Without a written boundary rule and an adjudication step, the resulting dataset trains the model to be inconsistent in exactly that way.
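A minimal sketch of how such boundary disagreements could be surfaced for adjudication. It assumes spans are token-indexed with an exclusive end; the `Span` type, the tokenisation of the sample sentence, and the two labellers' spans are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # token index, inclusive
    end: int    # token index, exclusive
    label: str


def boundary_disagreements(a: List[Span], b: List[Span]) -> List[Tuple[Span, Span]]:
    """Pairs with the same label that overlap but differ in start or end."""
    pairs = []
    for sa in a:
        for sb in b:
            overlaps = sa.start < sb.end and sb.start < sa.end
            same_bounds = (sa.start, sa.end) == (sb.start, sb.end)
            if sa.label == sb.label and overlaps and not same_bounds:
                pairs.append((sa, sb))
    return pairs


tokens = ["Started", "on", "amoxicillin", "25mg", ",", "review", "with",
          "Dr.", "Okafor", "in", "two", "weeks", "."]

# Labeller A includes the title in the person span, labeller B does not.
labeller_a = [Span(2, 3, "DRUG"), Span(7, 9, "PERSON")]
labeller_b = [Span(2, 3, "DRUG"), Span(8, 9, "PERSON")]

for sa, sb in boundary_disagreements(labeller_a, labeller_b):
    print(" ".join(tokens[sa.start:sa.end]), "vs", " ".join(tokens[sb.start:sb.end]))
```

Each flagged pair goes to an adjudicator, who applies the written tie-break rule rather than averaging the two boundaries.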
Auto-suggested labels become unchecked truth
What bad looks like: Pre-fill accepted unless the labeller notices
What we design for: Human review on every auto-suggested label until calibration holds
Model-assisted labelling raises throughput by 2 to 3x. It also smuggles model bias into the gold set if every pre-fill is accepted by default. We keep human review mandatory on every auto-suggestion until measured calibration says we can ease off.
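One way "measured calibration" might be checked is sketched below. It assumes a sample where humans labelled each item without seeing the pre-fill; the function names and the 0.95 threshold are hypothetical, not a documented bar.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def suggestion_precision_by_class(suggested: Sequence[str], human: Sequence[str]) -> Dict[str, float]:
    """For each suggested class, the share of suggestions the human label confirmed."""
    if len(suggested) != len(human):
        raise ValueError("suggested and human must be the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s, h in zip(suggested, human):
        totals[s] += 1
        hits[s] += int(s == h)
    return {c: hits[c] / totals[c] for c in totals}


def classes_needing_full_review(precision: Dict[str, float], threshold: float) -> List[str]:
    """Classes whose pre-fills are not yet reliable enough to ease review."""
    return sorted(c for c, p in precision.items() if p < threshold)


suggested = ["DRUG", "DRUG", "DATE", "DATE"]
human = ["DRUG", "DRUG", "DATE", "DOSAGE"]  # labelled blind to the pre-fill

precision = suggestion_precision_by_class(suggested, human)
print(precision)
print(classes_needing_full_review(precision, threshold=0.95))
```

Measuring per class matters because a pre-fill model can be reliable on common classes and poor on rare ones.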
Single-pass labelling cannot tell signal from vibe
What bad looks like: One labeller per item, no inter-annotator agreement reported
What we design for: Double-blind labelling with Krippendorff's α tracked per task
If you cannot measure agreement between labellers, you cannot defend the dataset to your auditor, your trainer, or yourself. Every project ships with IAA per task, per labeller, per batch, and disagreement-driven adjudication on the items that need it.
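For readers who want to see what the statistic computes, here is a minimal sketch of Krippendorff's α for nominal labels, built from the coincidence matrix. It assumes one row per item with one entry per labeller and `None` for a missing label; the function name and the sample data are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Optional[Hashable]]]) -> float:
    """Krippendorff's alpha for nominal data; None marks a missing label."""
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        # Every ordered pair of labels from different labellers, weighted 1/(m-1).
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        raise ValueError("alpha is undefined with fewer than two categories in use")
    return 1.0 - (n - 1) * observed / expected


# Two labellers, four items, one disagreement.
batch = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
print(round(krippendorff_alpha_nominal(batch), 3))  # 0.533
```

Unlike raw percent agreement, α corrects for the agreement expected by chance and tolerates items with missing labels.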
Class imbalance hides until eval time
What bad looks like: Random sampling of source data, ship the labels
What we design for: Stratified sampling plus targeted rare-class harvesting
Most real corpora are 90% boring and 10% interesting. Random sampling leaves you with a dataset that scores well on a held-out random split and falls over on the long-tail classes that actually matter. We sample with intent and audit the class distribution before training touches it.
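A minimal sketch of equal-allocation stratified sampling, assuming each source item carries a stratum known before labelling (for example a source system or a keyword match). The `stratified_sample` name, the `stratum` field, and the 90/10 toy corpus are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def stratified_sample(
    items: Sequence[T],
    stratum_of: Callable[[T], Hashable],
    per_stratum: int,
    seed: int = 0,
) -> List[T]:
    """Draw up to `per_stratum` items from every stratum, reproducibly."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for item in items:
        buckets[stratum_of(item)].append(item)
    sample: List[T] = []
    for key in sorted(buckets, key=str):  # fixed order keeps the draw repeatable
        bucket = buckets[key]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


# Toy corpus: 90 routine items and 10 rare ones.
corpus = [{"id": i, "stratum": "routine"} for i in range(90)]
corpus += [{"id": 90 + i, "stratum": "rare"} for i in range(10)]

sample = stratified_sample(corpus, lambda item: item["stratum"], per_stratum=10)
print(Counter(item["stratum"] for item in sample))
```

A plain random draw of 20 items from this corpus would contain about two rare items on average; the stratified draw contains ten.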
Tooling by modality
The right tool for the task shape
No single platform wins on every modality. We choose based on the work, your residency constraints, and whether your team will operate the platform after we leave.
Text: NER, classification, span linking, relation extraction
Image: bounding boxes, polygon masks, classification
Multimodal: document layout, screenshot understanding, image plus text
Where regulated data residency rules out a managed SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same. The hosting posture changes.
Your handover pack
What ships with the labels
A folder of labels on its own is a liability. The same labels plus their guideline version, IAA history, confusion matrix, adjudication record, and lineage are an asset your training programme and your auditor can both work with.
Every batch ships with these artefacts. A one-shot project receives them once. A rolling cadence refreshes them per batch.
Versioned annotation guideline
Examples-rich, edge-case-explicit, with a written tie-break rule for every ambiguous label. Versioned so every batch can be traced to the exact rulebook it was labelled against.
Calibration set
Sealed gold items used to ramp labellers, retest after every guideline change, and quality-check every batch you receive. Refreshed when drift is detected.
IAA dashboard
Krippendorff's α and Cohen's κ tracked per task, per labeller, per batch. The signal that says ship the batch or rework it. Trended over the life of the engagement.
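For the pairwise, per-labeller view, a minimal sketch of Cohen's κ between two labellers over the same items. The function name and the sample labels are hypothetical, and any ship-or-rework threshold applied to the result would be set per project.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two labellers who labelled the same items in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of items where both labellers chose the same label.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each labeller's own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both labellers use a single label")
    return (p_observed - p_expected) / (1.0 - p_expected)


labeller_1 = ["x", "x", "y", "y"]
labeller_2 = ["x", "y", "y", "y"]
print(cohens_kappa(labeller_1, labeller_2))  # 0.5
```

κ compares exactly two labellers with complete data, which makes it a natural per-labeller check.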
Per-label confusion matrix
Where the labellers confuse class A for class B, and where the eventual model is going to do the same. Drives both adjudicator focus and guideline refinement.
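A minimal sketch of how the confusion pairs could be tallied, assuming each item has a labeller label and a final adjudicated label. The function names and the sample labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def confusion_counts(adjudicated: Sequence[str], labeller: Sequence[str]) -> Counter:
    """Count (adjudicated label, labeller label) pairs across items."""
    if len(adjudicated) != len(labeller):
        raise ValueError("label lists must be the same length")
    return Counter(zip(adjudicated, labeller))


def top_confusions(counts: Counter, k: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """The k most frequent off-diagonal pairs, i.e. the actual confusions."""
    off_diagonal = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    return sorted(off_diagonal, key=lambda entry: (-entry[1], entry[0]))[:k]


adjudicated = ["DOSAGE", "DOSAGE", "DATE", "DATE", "PERSON"]
labeller = ["DOSAGE", "DATE", "DATE", "DATE", "PERSON"]

counts = confusion_counts(adjudicated, labeller)
print(top_confusions(counts))  # [(('DOSAGE', 'DATE'), 1)]
```

The off-diagonal pairs with the highest counts are the ones to address first in the guideline.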
Per-item lineage
Who labelled it, when, against which guideline version, and which adjudicator signed off. Labeller identity hashed. The audit trail your model card and your regulator can both cite.
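A minimal sketch of what one lineage record might contain. The field names, the `make_record` helper, and the use of a keyed HMAC-SHA256 for the labeller hash are assumptions; the source states only that labeller identity is hashed, and does not say whether the adjudicator is.

```python
import hashlib
import hmac
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


def hash_labeller_id(labeller_id: str, secret: bytes) -> str:
    """Keyed hash so the raw identity never appears in the shipped record."""
    return hmac.new(secret, labeller_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    item_id: str
    labeller_hash: str
    labelled_at: str          # ISO 8601 timestamp, UTC
    guideline_version: str
    adjudicator: str          # who signed off; hashing this is left open here


def make_record(item_id: str, labeller_id: str, guideline_version: str,
                adjudicator: str, secret: bytes,
                labelled_at: Optional[str] = None) -> LineageRecord:
    """Build a record, stamping the current UTC time unless one is supplied."""
    stamp = labelled_at or datetime.now(timezone.utc).isoformat()
    return LineageRecord(item_id, hash_labeller_id(labeller_id, secret),
                         stamp, guideline_version, adjudicator)


record = make_record(
    item_id="note-0001",
    labeller_id="labeller-17",
    guideline_version="v1.3",
    adjudicator="clinical-reviewer-2",
    secret=b"example-secret-not-for-production",
    labelled_at="2026-03-12T10:00:00+00:00",
)
print(asdict(record))
```

A keyed hash keeps records from the same labeller linkable for per-labeller IAA while making the identity hard to recover without the secret.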
Dataset card and split
Train, validation, and held-out test split. Dataset card describes composition, class distribution, biases known and mitigated, intended use, and the disagreement profile.
How we engage
Pick the shape that fits your team
From end-to-end programme delivery to time-boxed audit. The scope call confirms which fits. The statement of work names the deliverables.
Yobitel-led
We own the labelling programme end-to-end
Guideline authoring, calibration set, labeller pool, adjudication, IAA tracking, QA, lineage, redaction. You receive shipped batches against a fixed quality bar. Best when the dataset is on the critical path of a training or eval milestone.
Collaborative
You bring the labellers, we run the craft
You provide an in-house or contracted labelling team. We own guidelines, calibration, IAA tracking, adjudication, and the QA loop. Best when you already operate labellers and want to raise the methodology behind them.
Advisory
Time-boxed audit of your existing programme
Fixed-window review of your current supervised labelling work. We sample the data, re-label a control set, run IAA against your existing labels, and write a remediation plan. Best when last year's dataset is not holding up at eval time.
Parent practice
Annotation + dataset practice
The full annotation surface. RLHF preference data, instruction-tune curation, eval set authoring, safety datasets, multimodal annotation, synthetic data, and the domain SME pool.
Related
Model training + fine-tuning
The training-run engineering that consumes the supervised dataset you commission here. SFT, DPO, RLHF, evaluation. Same engineering bench across both.
Tell us what the labels are for.
A short questionnaire covers workload, quality bar, and engagement shape. Our supervised labelling lead replies inside one working day with a calibration plan and a candidate tooling stack fitted to your modality, your residency, and your timeline.
Calibrated labeller pool with domain SME adjudication when the work needs it. Per-item lineage shipped with every batch. Engagements scoped to any sovereignty perimeter (NCSC, G-Cloud, OFFICIAL, GDPR, HIPAA, MeitY). Same practice that authors the eval sets your training engagements are evaluated against.