Annotation Practice · Supervised Labelling

The labels your training set stands on

Classification, named-entity spans, bounding boxes, segmentation masks, intent and slot tagging, relation extraction. Double-blind labelling. Krippendorff's α tracked per task. Per-item lineage, so the dataset card you ship can answer a regulator's questions.


Label Studio · Prodigy · V7 · Encord · CVAT · Argilla · doccano · Roboflow

Double-blind labelling with disagreement-driven adjudication

ISO/IEC 27001:2022 + SOC 2 Type II aligned data handling

Representative item

Reviewed

Clinical-note NER + chest X-ray bbox

PERSON · ORG · DOSE · DATE

Reviewed by Dr. Patel at Royal Free Hospital on 12 Mar 2026.

Started on amoxicillin 25mg, review with Dr. Okafor in two weeks.

Bounding box · 3 regions

512 × 512 px source

IAA (Krippendorff α)

0.91

Labellers / item

Items in batch

12k

Double-blind labelling. Disagreements adjudicated by a clinical reviewer. Per-item lineage.

The shape of the work

The task shapes that supervised learning trains against

Supervised labelling is not one task. Span tagging, box drawing, mask painting, and relation linking are distinct crafts with distinct failure modes. We treat them that way.

Classification

Single-label or multi-label. The simplest task on paper, the easiest to mis-design in practice. Label set and edge cases get nailed down in calibration before any volume ships.

single-label · multi-label · hierarchical

Named-entity recognition

Span-level tagging across people, places, organisations, dosages, dates, custom domain entities. Nested spans handled. Span boundary disagreement adjudicated, not averaged.

spans · nested · custom entities

Bounding boxes

Tight axis-aligned or rotated boxes for object detection. IoU thresholds set against your downstream model's tolerance, not a generic default.

axis-aligned · rotated · IoU-tuned
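
As a rough illustration of what an IoU threshold measures, here is a minimal sketch for axis-aligned boxes in the [x, y, width, height] layout used in the annotation samples on this page. The function names and the threshold values are hypothetical; the right threshold depends on the downstream model, and rotated boxes need a polygon intersection that this sketch does not cover.

```python
from typing import Sequence


def iou_xywh(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection-over-union of two axis-aligned boxes given as [x, y, w, h]."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    # Overlap along each axis, clamped at zero when the boxes do not touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def boxes_agree(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """True when two labellers' boxes overlap at or above the chosen threshold."""
    return iou_xywh(a, b) >= threshold
```

For example, `boxes_agree([0, 0, 10, 10], [5, 0, 10, 10], threshold=0.5)` is false, because the two boxes share only a third of their combined area.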

Polygon and segmentation masks

Pixel-precise polygons for instance and semantic segmentation. Used where bounding boxes lose too much information. Slower per item, calibration-heavy, worth it when the model needs it.

instance · semantic · panoptic

Intent and slot tagging

Turn utterances into the intent plus structured slots a conversational system can act on. Calibration covers ambiguous intent boundaries upfront so the dataset reflects one rulebook.

intent · slots · conversational

Relation extraction and span linking

Typed relations between entities, coreference chains, span-to-knowledge-base linking. The signal that pushes a model from extractive to compositional.

relations · coreference · entity linking

Real annotation samples

What the labeller actually draws on the canvas

Three real photographs with the bounding boxes, polygons, and keypoints a labeller would produce. Class labels and confidence scores are the same shape your trainer downstream consumes. The COCO-format JSON follows each card.

Skatepark scene with two riders, one mid-air on a skateboard and one on a BMX bike, graffiti walls behind.

person · 0.97, person · 0.94, bicycle · 0.88, skateboard · 0.91, person (group) · 0.74

Bounding boxes · 640×480

Multi-object scene · bounding boxes + class labels + confidence scores

COCO-format annotation

{
  "image_id": 87038,
  "annotations": [
    { "category": "person",     "bbox": [342, 200, 80, 173], "score": 0.97 },
    { "category": "person",     "bbox": [246, 228, 77, 158], "score": 0.94 },
    { "category": "bicycle",    "bbox": [234, 266, 93, 127], "score": 0.88 },
    { "category": "skateboard", "bbox": [307, 339, 70, 41],  "score": 0.91 }
  ]
}

Victorian-style kitchen interior with a person in an apron from behind, copper pots hanging overhead, prep table in the foreground.

bowl · 0.89, person · 0.95, apron · mask · 0.86

Polygon segmentation · 640×427

Instance segmentation · polygon outline traces the apron silhouette

COCO-format annotation

{
  "image_id": 397133,
  "annotations": [
    {
      "category": "person",
      "bbox": [352, 119, 141, 257],
      "segmentation": [[387, 154, ... 18 pts ...]],
      "score": 0.95
    },
    { "category": "bowl", "bbox": [141, 277, 77, 56], "score": 0.89 }
  ]
}

Modern kitchen with wooden cabinets, a stove and refrigerator at the back, fruit bowl on the foreground table.

bowl of fruit · 0.92, refrigerator · 0.96, oven · 0.88

Keypoints + bbox · 352×230

Keypoint + bbox · point annotations on small objects with class labels

COCO-format annotation

{
  "image_id": 37777,
  "annotations": [
    { "category": "refrigerator", "bbox": [296, 51, 56, 161], "score": 0.96 },
    { "category": "oven",         "bbox": [88, 83, 77, 92],   "score": 0.88 },
    { "category": "bowl",         "bbox": [146, 138, 79, 41], "score": 0.92,
      "keypoints": [186, 154, 1,  180, 161, 1,  194, 161, 1,  173, 169, 1] }
  ]
}
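
To make the sample above easier to read, the following sketch unpacks the two array layouts it uses: a bbox as [x, y, width, height] and keypoints as a flat run of (x, y, visibility) triples, following the COCO convention. The helper names are hypothetical and the check is only a basic sanity test, not a full validator.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, w, h] to corner form [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def keypoint_triples(flat):
    """Group a flat keypoint list into (x, y, visibility) triples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints must come in (x, y, v) triples")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def keypoints_inside(bbox, flat):
    """True when every keypoint falls within its object's bounding box."""
    x1, y1, x2, y2 = xywh_to_xyxy(bbox)
    return all(x1 <= x <= x2 and y1 <= y <= y2
               for x, y, _v in keypoint_triples(flat))


# The "bowl" record from the sample above.
bowl_bbox = [146, 138, 79, 41]
bowl_keypoints = [186, 154, 1, 180, 161, 1, 194, 161, 1, 173, 169, 1]
```

With these values, `keypoints_inside(bowl_bbox, bowl_keypoints)` holds: all four points sit inside the box whose corners are (146, 138) and (225, 179).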

Photos sampled from COCO 2017 validation set (Creative Commons Attribution 4.0). Bounding boxes, polygons, and keypoints hand-placed to illustrate the labelling shapes Yobitel ships.

Where supervised datasets quietly fail

The failure modes that surface at eval time

Every supervised labelling engagement we audit hits some subset of these. The training loss looks fine. The held-out split looks fine. Production traffic exposes the cracks.

Span boundaries are where supervised NER quietly breaks

What bad looks like

Whichever boundary the first labeller picks wins

What we design for

Boundary-aware adjudication with a written tie-break rule

Two labellers can agree the entity exists and still disagree by a token on where it starts or ends. Without a written boundary rule and an adjudication step, the resulting dataset trains the model to be inconsistent in exactly that way.
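
One way to surface these cases for adjudication is sketched below. It assumes spans are stored as token offsets with an exclusive end, which is an illustrative choice rather than a description of any particular tool, and the `Span` type and function name are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # first token index, inclusive
    end: int    # last token index, exclusive
    label: str


def boundary_disagreements(a, b):
    """Pairs of spans that overlap and share a label but differ in boundaries."""
    pairs = []
    for sa in a:
        for sb in b:
            overlaps = sa.start < sb.end and sb.start < sa.end
            same_extent = (sa.start, sa.end) == (sb.start, sb.end)
            if overlaps and sa.label == sb.label and not same_extent:
                pairs.append((sa, sb))
    return pairs


# Tokens (whitespace split, an assumption):
# Reviewed by Dr. Patel at Royal Free Hospital on 12 Mar 2026.
labeller_a = [Span(2, 4, "PERSON"), Span(5, 8, "ORG")]  # "Dr. Patel"
labeller_b = [Span(3, 4, "PERSON"), Span(5, 8, "ORG")]  # "Patel"
```

Here `boundary_disagreements(labeller_a, labeller_b)` returns the single PERSON pair: both labellers found the entity, but only one included the title. That is the kind of item a written boundary rule should settle.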

Auto-suggested labels become unchecked truth

What bad looks like

Pre-fill accepted unless the labeller notices

What we design for

Human review on every auto-suggested label until calibration holds

Model-assisted labelling can raise throughput by 2 to 3x. It also smuggles model bias into the gold set if every pre-fill is accepted by default. We keep human review mandatory on every auto-suggestion until measured calibration says we can ease off.
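
A simple way to measure this is sketched below, assuming an audit sample where each item has both the model's suggestion and a label produced by a human who did not see it. The function names and the 0.9 default are hypothetical; the actual bar would be set per project.

```python
from collections import defaultdict


def prefill_agreement(records):
    """Per-class rate at which blind human labels match the suggested label.

    records: iterable of (suggested_label, blind_human_label) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for suggested, human in records:
        totals[suggested] += 1
        hits[suggested] += int(suggested == human)
    return {label: hits[label] / totals[label] for label in totals}


def classes_needing_full_review(records, min_agreement=0.9):
    """Classes whose suggestions still fall below the agreement bar."""
    rates = prefill_agreement(records)
    return sorted(label for label, rate in rates.items() if rate < min_agreement)
```

Reporting the rate per class rather than overall matters here: a strong average can hide one class where the model's suggestions are routinely wrong.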

Single-pass labelling cannot tell signal from vibe

What bad looks like

One labeller per item, no inter-annotator agreement reported

What we design for

Double-blind labelling with Krippendorff's α tracked per task

If you cannot measure agreement between labellers you cannot defend the dataset to your auditor, your trainer, or yourself. Every project ships with IAA per task, per labeller, per batch, and disagreement-driven adjudication on the items that need it.
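
For readers who want to see what the statistic involves, here is a minimal sketch of Krippendorff's α for nominal labels, built from a coincidence matrix. It is an illustration under simple assumptions (categorical labels, items with fewer than two labels skipped), not the tooling behind the dashboards, and the function name is hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list of labels per item, from the labellers who saw that item.
    """
    coincidence = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        # Every ordered pair of labels on the same item, weighted by 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidence.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if n <= 1 or expected == 0:
        raise ValueError("alpha is undefined: need at least two distinct labels")
    return 1.0 - (n - 1) * observed / expected
```

Four items labelled `[["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]` give α = 8/15, roughly 0.53: one disagreement in four items is penalised more heavily than raw percentage agreement (75%) would suggest, because chance agreement is discounted.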

Class imbalance hides until eval time

What bad looks like

Random sampling of source data, ship the labels

What we design for

Stratified sampling plus targeted rare-class harvesting

Most real corpora are 90% boring and 10% interesting. Random sampling leaves you with a dataset that scores well on a held-out random split and falls over on the long-tail classes that actually matter. We sample with intent and audit the class distribution before training touches it.
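
The sampling idea can be sketched as follows. Because true labels do not exist before labelling, the sketch assumes each source item already carries some proxy stratum (metadata, a keyword match, or a weak classifier's guess); that proxy, the function name, and the quota are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, per_stratum_quota, seed=0):
    """Draw up to `per_stratum_quota` items from every stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for stratum in sorted(strata):
        pool = strata[stratum]
        # Small strata are taken whole instead of being under-sampled.
        sample.extend(rng.sample(pool, min(per_stratum_quota, len(pool))))
    return sample
```

On a corpus of 90 common items and 10 rare ones, a quota of 10 yields a sample that is half rare class, where a uniform random draw of 20 would be expected to contain about two.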

Tooling by modality

The right tool for the task shape

No single platform wins on every modality. We pick on the work, your residency constraints, and whether your team will operate the platform after we leave.

Text

NER, classification, span linking, relation extraction

Label Studio · Prodigy · doccano · Argilla

Image

Bounding boxes, polygon masks, classification

CVAT · Label Studio · V7 · Encord · Roboflow · SuperAnnotate · Supervisely

Multimodal

Document layout, screenshot understanding, image plus text

Label Studio · Encord

Where regulated data residency rules out a managed SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same. The hosting posture changes.

Your handover pack

What ships with the labels

A folder of labels on its own is a liability. The same labels plus their guideline version, IAA history, confusion matrix, adjudication record, and lineage are an asset your training programme and your auditor can both work with.

Every batch ships with these artefacts. A one-shot project receives them once. A rolling cadence refreshes them per batch.

Versioned annotation guideline

Examples-rich, edge-case-explicit, with a written tie-break rule for every ambiguous label. Versioned so the first batch and the last batch are labelled against the same rulebook.

Calibration set

Sealed gold items used to ramp labellers, retest after every guideline change, and quality-check every batch you receive. Refreshed when drift fires.

IAA dashboard

Krippendorff's α and Cohen's κ tracked per task, per labeller, per batch. The signal that says ship the batch or rework it. Trended over the life of the engagement.
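
As a companion illustration, Cohen's κ for a pair of labellers who labelled the same items can be computed as below. This is a minimal sketch with a hypothetical function name; it covers only the two-labeller, nominal-label case.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers over the same items, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labellers must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each labeller's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa is undefined when both labellers use one label only")
    return (p_observed - p_chance) / (1 - p_chance)
```

For labels `["x", "x", "y", "y"]` against `["x", "y", "y", "y"]`, observed agreement is 0.75, chance agreement is 0.5, and κ comes out at 0.5.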

Per-label confusion matrix

Where the labellers confuse class A for class B, and where the eventual model is going to do the same. Drives both adjudicator focus and guideline refinement.
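
A minimal sketch of how such a matrix might be tallied, assuming each item has one labeller label and one adjudicated label to compare it against. The function names are hypothetical.

```python
from collections import Counter


def confusion_counts(adjudicated, given):
    """Count (adjudicated_label, labeller_label) pairs across items."""
    return Counter(zip(adjudicated, given))


def top_confusions(counts, k=3):
    """The k most frequent off-diagonal cells, largest first."""
    off_diagonal = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    # Ties are broken alphabetically so the ordering is stable.
    return sorted(off_diagonal, key=lambda cell: (-cell[1], cell[0]))[:k]
```

The off-diagonal cells are the useful part: if ("ORG", "PERSON") dominates, the guideline needs a sharper rule for that pair before more volume is labelled.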

Per-item lineage

Who labelled it, when, against which guideline version, and which adjudicator signed off. Labeller identity hashed. The audit trail your model card and your regulator can both cite.
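
One possible shape for such a record is sketched below. The field names are hypothetical, and using a keyed hash (HMAC-SHA256) for identities, including the adjudicator's, is an assumption made for illustration; the text above only states that labeller identity is hashed.

```python
import hashlib
import hmac
from dataclasses import dataclass, asdict
from typing import Optional


def hash_identity(person_id: str, secret: bytes) -> str:
    """Keyed hash so the raw identity never appears in the shipped lineage."""
    return hmac.new(secret, person_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    item_id: str
    labeller_hash: str               # who labelled it (hashed)
    labelled_at: str                 # when, as an ISO 8601 timestamp
    guideline_version: str           # which rulebook was in force
    adjudicator_hash: Optional[str]  # who signed off, if adjudicated (hashed)
```

A record can then be serialised next to each label with `asdict(record)`, which keeps the audit trail in the same file as the data it describes.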

Dataset card and split

Train, validation, and held-out test split. Dataset card describes composition, class distribution, biases known and mitigated, intended use, and the disagreement profile.

How we engage

Pick the shape that fits your team

From end-to-end programme delivery to time-boxed audit. The scope call confirms which fits. The statement of work names the deliverables.

Yobitel-led

We own the labelling programme end-to-end

Guideline authoring, calibration set, labeller pool, adjudication, IAA tracking, QA, lineage, redaction. You receive shipped batches against a fixed quality bar. Best when the dataset is on the critical path of a training or eval milestone.

Collaborative

You bring the labellers, we run the craft

You provide an in-house or contracted labelling team. We own guidelines, calibration, IAA tracking, adjudication, and the QA loop. Best when you already operate labellers and want the methodology to lift.

Advisory

Time-boxed audit of your existing programme

Fixed-window review of your current supervised labelling work. We sample the data, re-label a control set, run IAA against your existing labels, and write a remediation plan. Best when last year's dataset is not holding up at eval time.

Parent practice

Annotation + dataset practice

The full annotation surface. RLHF preference data, instruction-tune curation, eval set authoring, safety datasets, multimodal annotation, synthetic data, and the domain SME pool.

Model training + fine-tuning

The training-run engineering that consumes the supervised dataset you commission here. SFT, DPO, RLHF, evaluation. Same engineering bench across both.

Tell us what the labels are for.

A short questionnaire covers workload, quality bar, and engagement shape. Our supervised labelling lead replies inside one working day with a calibration plan and a candidate tooling stack fitted to your modality, your residency, and your timeline.

Prefer email? Contact us

Calibrated labeller pool with domain SME adjudication when the work needs it. Per-item lineage shipped with every batch. Engagements scoped to any sovereignty perimeter (NCSC, G-Cloud, OFFICIAL, GDPR, HIPAA, MeitY). Same practice that authors the eval sets your training engagements are evaluated against.
