Professional Services · Data Annotation + RLHF Prep

Datasets your training run can actually trust

Supervised labels, RLHF preference pairs, instruction-tune corpora, golden eval sets, safety datasets. Double-blind labelling. Inter-annotator agreement on every task. Per-item lineage, so the model card you ship can answer the regulator's questions.

See the tooling we drive

Label Studio · Prodigy · V7 · Encord · Argilla · CVAT · doccano

Vetted SME adjudicator pool (clinical · legal · finance)

NCSC + GDPR-aligned data handling

Representative project

On track

RLHF preference set · 14k pairs · medical Q&A

Pairwise comparisons. Clinician adjudicators. PII-redacted source corpus.

Calibrate

Guidelines + golden set + labeller induction

Done

Label

Double-blind pass, 3 labellers per item

In progress

Adjudicate

Disagreements routed to senior reviewer

In progress

QA spike

Hidden gold checks + holdout audit

Queued

Ship

Train split + eval split + lineage record

Queued

Krippendorff's α

0.83

QA reject rate

4.2%

Throughput

640 / day

Per-item lineage. Labeller identity hashed. PII redaction report ships with the dataset.

The shape of the work

From raw corpus to a dataset your trainer can use

Annotation is not one thing. Pre-training filtering, supervised fine-tune corpora, RLHF preference data, and safety evals are different crafts with different failure modes. We run all of them, distinctly.

Supervised labelling

Classification, named-entity spans, bounding boxes, segmentation masks, intent + slot tagging. The foundation layer that supervised fine-tunes and classical eval pipelines depend on.

NER · classification · spans · boxes · masks

View practice

Preference data for RLHF + DPO

Pairwise comparisons and ranked preferences over model outputs. The signal your reward model or DPO loss actually trains against. Double-blind to keep order bias out of the data.

Pairs · rankings · win-rate sets · reward modelling

View practice

Instruction-tuning curation

Prompt authoring + ideal-response writing + edit-and-improve passes on generated responses. The artefact your supervised fine-tune (SFT) stage consumes before any RLHF kicks in.

SFT corpus · prompt design · response editing

View practice

Eval set authoring

Golden sets, rubrics, scoring guides, regression suites. Built so a model swap or a fine-tune iteration can be judged against a fixed bar instead of vibes.

Golden sets · rubrics · regression suites

View practice

Safety + red-teaming datasets

Jailbreak attempts, refusal calibration sets, harm taxonomies, adversarial prompts. The data your safety post-training and your release-gate evals both pull from.

Jailbreaks · refusals · harm taxonomies · red-team

View practice

Domain-grounded review

Clinician, lawyer, or financial-analyst review on domain-sensitive content. Pulled from a vetted adjudicator pool with credentials checked, not anonymous gig labour.

Clinical · legal · finance · domain SME pool

View practice

Multimodal annotation

Image bounding boxes and segmentation, video keyframes and event timelines, audio speaker diarisation and ASR correction. One operating model across modalities, with the right tool for each.

Image · video · audio · ASR · diarisation

View practice

Synthetic data generation

LLM-augmented prompt and response generation, diffusion-generated images, plus the human-in-the-loop quality gates that decide what reaches your training set. Used to bootstrap thin corpora and cover long-tail slices.

LLM augmentation · diffusion gen · QA gates

View practice

Where datasets quietly fail

The pitfalls that show up at eval time

Every annotation engagement we audit hits some subset of these. The model trains fine, evals look reasonable, and then production exposes the cracks. Knowing they exist is most of the win.

Labeller agreement is the floor of dataset quality

What bad looks like

Single labeller per item, no IAA reported

What we design for

Double-blind, Krippendorff's α + Cohen's κ tracked per task

If you can't measure agreement between labellers, you can't tell a signal from a vibe. Every project we run reports inter-annotator agreement per task type, with disagreement-driven adjudication on the items that need it.
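As a rough illustration of what "measuring agreement" means in practice, here is a minimal sketch of Cohen's κ for two labellers over the same items. It assumes nominal (unordered) labels and exactly two labellers; the function name is hypothetical, and a three-labeller pass like the one above would lean on Krippendorff's α instead.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers who labelled the same items (nominal labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both labellers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label given their own label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A κ near 0 means the labellers agree no more often than chance would predict, which is the "vibe" case; values close to 1 indicate agreement well beyond chance.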

Order bias quietly corrupts preference data

What bad looks like

Labellers see model A always on the left

What we design for

Randomised order, blind labelling, periodic gold checks

Preference labelling for RLHF is brittle to presentation. Fixed-order pairs let labellers anchor on position rather than content. We randomise, blind the source model, and slip hidden gold items in to catch drift.
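A minimal sketch of the randomise-and-blind step, under our own assumptions: the function names and record fields are hypothetical, and a real labelling tool would handle presentation itself. The idea is that the labeller only ever sees "left" and "right", while a separately stored key maps the choice back to the source model.

```python
import random


def present_pair(item_id, response_a, response_b, rng):
    """Return a blinded pair in random left/right order, plus the hidden key."""
    swapped = rng.random() < 0.5
    left, right = (response_b, response_a) if swapped else (response_a, response_b)
    # What the labeller sees: no model names, only positions.
    shown = {"item_id": item_id, "left": left, "right": right}
    # What is stored out of the labeller's sight: which model sits on the left.
    key = {"item_id": item_id, "left_is": "B" if swapped else "A"}
    return shown, key


def resolve_choice(key, choice):
    """Map a labeller's 'left' / 'right' choice back to source model 'A' or 'B'."""
    if choice not in ("left", "right"):
        raise ValueError("choice must be 'left' or 'right'")
    left_model = key["left_is"]
    right_model = "A" if left_model == "B" else "B"
    return left_model if choice == "left" else right_model
```

Passing a seeded `random.Random` as `rng` keeps the presentation order reproducible for audit without making it predictable to labellers.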

No provenance means no auditability

What bad looks like

Spreadsheet of labels, no link back to source

What we design for

Per-item lineage, labeller identity hashed, change history

If your regulator (or your CTO) asks who labelled an item, when, and against which guideline version, the dataset has to answer. We ship lineage as a first-class artefact, not a retrofit.
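To make "who, when, and against which guideline version" concrete, here is a sketch of what one per-item lineage record could look like. The field names and the keyed-hash approach (HMAC-SHA256 with a secret held by the annotation team) are illustrative assumptions, not a description of a fixed schema.

```python
import hashlib
import hmac
from dataclasses import dataclass, asdict


def hash_labeller(labeller_id: str, secret: bytes) -> str:
    """Keyed hash of a labeller identity, so records stay linkable but not directly identifying."""
    return hmac.new(secret, labeller_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """One label event for one item (hypothetical field names)."""
    item_id: str
    source_uri: str          # link back to the source document or record
    labeller_hash: str       # output of hash_labeller, never the raw identity
    guideline_version: str   # which rulebook the label was made against
    labelled_at: str         # ISO 8601 timestamp
    label: str


def to_row(record: LineageRecord) -> dict:
    """Flatten a record into a plain dict for export alongside the dataset."""
    return asdict(record)
```

Using a keyed hash rather than a bare hash means the same labeller maps to the same value across the dataset, while someone without the secret cannot recompute the hash from a list of known identities.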

Guideline drift breaks long-running projects

What bad looks like

Guidelines updated mid-run, old labels not rebaselined

What we design for

Versioned guidelines, calibration retests on every change

Annotation guidelines evolve as edge cases surface. Without versioning + recalibration, the first 30% of the dataset is labelled against a different rulebook than the last 30%. We version, recalibrate, and report the delta.
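A small sketch of the two checks this implies, assuming each label record carries the guideline version it was made against (the `guideline_version` and `item_id` keys are hypothetical): find the items labelled under an older rulebook, and measure how many calibration labels changed after the update.

```python
def stale_items(records, current_version):
    """Item ids whose labels were made against an older guideline version."""
    return [r["item_id"] for r in records if r["guideline_version"] != current_version]


def recalibration_delta(before, after):
    """Fraction of shared calibration items whose label changed after a guideline update.

    `before` and `after` map item id -> label for the same calibration set.
    """
    shared = before.keys() & after.keys()
    if not shared:
        raise ValueError("no calibration items in common")
    changed = sum(before[i] != after[i] for i in shared)
    return changed / len(shared)
```

A large delta on the calibration set suggests the guideline change was substantive, and that the stale items are candidates for relabelling rather than grandfathering.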

Tooling-agnostic by design

We drive the tool that fits the work

The right tool depends on modality, scale, security posture, and whether your team will operate the platform after we leave. We pick on those grounds, not on which vendor sponsored last quarter's webinar.

Label Studio

Open-source default. NLP + vision + audio.

Prodigy

Active-learning + script-driven workflows.

V7

Vision-heavy + medical imaging.

Encord

Video + complex vision pipelines.

Argilla

LLM-feedback + preference data tooling.

CVAT

Computer-vision annotation at scale.

doccano

Lightweight text annotation.

Custom UI

Bespoke labelling app when the tools above don't fit.

Where regulated data residency rules out a managed SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same; the hosting posture changes.

Your handover pack

What ships with the dataset

A dataset on its own is a liability. A dataset plus its provenance, IAA history, adjudication record, and redaction report is an asset your training programme and your auditor can both work with.

Every batch ships with these artefacts. If you commission a one-shot project they arrive once. If you commission a steady cadence they refresh per batch.

Annotation guideline document

Versioned, examples-rich, edge-case-explicit. Written so a new labeller can ramp in a day and produce work that holds against your IAA bar.

Calibration + golden set

The sealed item set we use to ramp labellers, retest after guideline changes, and quality-check every batch you receive.

IAA + quality dashboard

Inter-annotator agreement per task, per labeller, per batch. The signal that tells you whether to ship the batch or rework it.

Adjudication record

Every disagreement, who adjudicated it, and the decision rationale. The audit trail your downstream model card can cite.

Lineage + redaction report

Per-item provenance. Labeller identity hashed. PII redaction report for any regulated source corpus. Signed off before the dataset leaves us.

Train / eval split + dataset card

Pre-split train, validation, and held-out eval sets. Dataset card describes composition, biases known and mitigated, and intended use.

How we engage

Pick the shape that fits your team

From end-to-end programme delivery to time-boxed audit. The scope call confirms which fits; the statement of work names the deliverables.

Yobitel-led

We own the annotation programme end-to-end

Guidelines, calibration set, labeller pool, adjudication, QA, lineage, redaction. You receive shipped batches against a fixed quality bar. Best when annotation is on the critical path of a training or eval milestone.

Collaborative

You bring the labellers, we run the craft

You provide an in-house or contracted labelling team. We own guidelines, calibration, IAA tracking, adjudication, and the QA loop. Best when you already operate labellers and want to lift the methodology.

Advisory

Time-boxed review of your existing process

Fixed-window audit of your current annotation programme. We sample the data, run IAA against a re-labelled control, write a remediation plan. Best when last year's dataset isn't holding up.

Model training + fine-tuning

The training-run engineering that consumes the dataset you commissioned here. SFT, DPO, RLHF, evaluation. Same engineering bench across both.

ML pipelines + continuous evaluation

The pipeline that re-runs your eval set on every model bump and triggers your re-labelling loop when drift fires. The annotation work becomes a continuous signal, not a one-shot.

Tell us what the dataset is for.

A short questionnaire covers scope, quality bar, and engagement shape. Our annotation practice lead replies inside one working day with a calibration plan and a candidate tooling stack fitted to your data sensitivity and timeline.

Prefer email? Contact us

Vetted SME adjudicator pool across clinical, legal, and financial domains. Per-item lineage shipped with every dataset. Same practice that powers the evaluation suites our training engagements train against. Engagements scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).
