Professional Services · Data Annotation + RLHF Prep
Datasets your training run can actually trust
Supervised labels, RLHF preference pairs, instruction-tune corpora, golden eval sets, safety datasets. Double-blind labelling. Inter-annotator agreement on every task. Per-item lineage so the model card you ship can answer the regulator's questions.
Representative project
On track · RLHF preference set · 14k pairs · medical Q&A
Pairwise comparisons. Clinician adjudicators. PII-redacted source corpus.
Calibrate
Guidelines + golden set + labeller induction
Label
Double-blind pass, 3 labellers per item
Adjudicate
Disagreements routed to senior reviewer
QA spike
Hidden gold checks + holdout audit
Ship
Train split + eval split + lineage record
Krippendorff's α
0.83
QA reject rate
4.2%
Throughput
640 / day
Per-item lineage. Labeller identity hashed. PII redaction report ships with the dataset.
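The Label and Adjudicate steps in the project above can be sketched as a simple routing rule. This is an illustration only: the `route_item` function, the field names, and the choice to treat any non-unanimous vote as a disagreement are assumptions, not a description of the production workflow.

```python
from collections import Counter

def route_item(item_id, votes):
    """Accept unanimous items; send anything else to adjudication.

    Assumption: "disagreement" means the labellers were not unanimous.
    A real project might use a different threshold.
    """
    top_label, top_count = Counter(votes).most_common(1)[0]
    if top_count == len(votes):
        return {"item_id": item_id, "status": "accepted", "label": top_label}
    # Keep every vote so the senior reviewer sees the full disagreement.
    return {"item_id": item_id, "status": "adjudicate", "votes": list(votes)}

# Hypothetical votes from three labellers on two preference pairs.
unanimous = route_item("q-001", ["A", "A", "A"])
contested = route_item("q-002", ["A", "B", "A"])
```

Keeping the raw votes on contested items means the adjudication record can show what the reviewer was resolving, not just the final decision.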
The shape of the work
From raw corpus to a dataset your trainer can use
Annotation is not one thing. Pre-training filtering, supervised fine-tune corpora, RLHF preference data, and safety evals are different crafts with different failure modes. We run all of them, each as its own discipline.
Where datasets quietly fail
The pitfalls that show up at eval time
Every annotation engagement we audit hits some subset of these. The model trains fine, evals look reasonable, and then production exposes the cracks. Knowing they exist is most of the win.
Labeller agreement is the floor of dataset quality
What bad looks like
Single labeller per item, no IAA reported
What we design for
Double-blind, Krippendorff's α + Cohen's κ tracked per task
If you can't measure agreement between labellers, you can't tell a signal from a vibe. Every project we run reports inter-annotator agreement per task type, with disagreement-driven adjudication on the items that need it.
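As a rough illustration of what an agreement score measures, here is a minimal Cohen's κ calculation for two labellers. The function name and the sample labels are hypothetical. With three or more labellers per item, Krippendorff's α is the more suitable statistic, and a maintained library is a better choice than hand-rolled code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers who rated the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both labellers chose the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each labeller's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical preference labels ("A" or "B" preferred) from two labellers.
labeller_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
labeller_2 = ["A", "B", "B", "B", "A", "B", "A", "B"]
kappa = cohens_kappa(labeller_1, labeller_2)
```

In this sample the labellers agree on 6 of 8 items (75%), but κ comes out near 0.53, because a large part of that raw agreement is what chance alone would produce.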
Order bias quietly corrupts preference data
What bad looks like
Labellers see model A always on the left
What we design for
Randomised order, blind labelling, periodic gold checks
Preference labelling for RLHF is brittle to presentation. Fixed-order pairs let labellers anchor on position rather than content. We randomise, blind the source model, and slip hidden gold items in to catch drift.
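One way to randomise and blind a comparison is sketched below. The function names, field names, and the 50/50 swap are assumptions made for illustration. The labeller sees only the task; the key is held back and used to map the choice to a model afterwards.

```python
import random

def build_comparison(item_id, response_model_a, response_model_b, rng):
    """Return a blinded comparison task plus the key needed to unblind it."""
    swapped = rng.random() < 0.5  # flip presentation order half the time
    if swapped:
        left, right = response_model_b, response_model_a
    else:
        left, right = response_model_a, response_model_b
    # Shown to the labeller: no model names, only positions.
    task = {"item_id": item_id, "left": left, "right": right}
    # Held back from the labeller: which model sits on the left.
    key = {"item_id": item_id, "left_is": "model_b" if swapped else "model_a"}
    return task, key

def unblind(choice, key):
    """Map a labeller's 'left' or 'right' choice back to the source model."""
    other = "model_b" if key["left_is"] == "model_a" else "model_a"
    return key["left_is"] if choice == "left" else other

rng = random.Random(7)  # seeded so the assignment can be reproduced in an audit
task, key = build_comparison("q-001", "answer from A", "answer from B", rng)
preferred = unblind("left", key)
```

Seeding the generator per batch keeps the ordering reproducible, which matters if a position-bias check has to be re-run later.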
No provenance means no auditability
What bad looks like
Spreadsheet of labels, no link back to source
What we design for
Per-item lineage, labeller identity hashed, change history
If your regulator (or your CTO) asks who labelled an item, when, and against which guideline version, the dataset has to answer. We ship lineage as a first-class artefact, not a retrofit.
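A lineage record that answers those three questions might look like the sketch below. The schema, the field names, the sample values, and the use of a keyed HMAC-SHA256 hash for labeller identity are all illustrative assumptions, not the shipped format.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LineageRecord:
    item_id: str
    source_uri: str         # link back to the source item
    labeller_hash: str      # who labelled it, pseudonymised
    guideline_version: str  # which rulebook applied
    labelled_at: str        # when, as an ISO 8601 timestamp
    label: str

def hash_labeller(labeller_id: str, secret_key: bytes) -> str:
    """Keyed hash: stable per labeller, not reversible without the key."""
    return hmac.new(secret_key, labeller_id.encode("utf-8"), hashlib.sha256).hexdigest()

secret = b"project-specific-secret"  # hypothetical; keep real keys in a secret store
record = LineageRecord(
    item_id="q-001",
    source_uri="corpus/batch-03/q-001.json",  # hypothetical path
    labeller_hash=hash_labeller("labeller-17", secret),
    guideline_version="v1.2",
    labelled_at="2024-01-15T10:30:00+00:00",  # hypothetical timestamp
    label="A",
)
line = json.dumps(asdict(record), sort_keys=True)  # one JSON line per label event
```

A keyed hash rather than a plain one is a deliberate choice here: labeller IDs are usually short and guessable, so an unkeyed SHA-256 could be reversed by hashing every candidate ID.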
Guideline drift breaks long-running projects
What bad looks like
Guidelines updated mid-run, old labels not rebaselined
What we design for
Versioned guidelines, calibration retests on every change
Annotation guidelines evolve as edge cases surface. Without versioning + recalibration, the first 30% of the dataset is labelled against a different rulebook than the last 30%. We version, recalibrate, and report the delta.
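If every label carries its guideline version, finding what needs rebaselining and reporting the delta become small queries. The sketch below assumes hypothetical field names and version strings, and defines the delta as the share of re-labelled items whose label changed.

```python
from collections import Counter

def labels_needing_rebaseline(labels, current_version):
    """Return labels recorded against any guideline version but the current one."""
    return [row for row in labels if row["guideline_version"] != current_version]

def relabel_delta(old_labels, new_labels):
    """Share of re-labelled items whose label changed under the new guideline."""
    changed = sum(old_labels[item_id] != new_labels[item_id] for item_id in new_labels)
    return changed / len(new_labels)

# Hypothetical labels spanning a guideline change from v1.0 to v1.1.
labels = [
    {"item_id": "q-001", "guideline_version": "v1.0", "label": "A"},
    {"item_id": "q-002", "guideline_version": "v1.0", "label": "B"},
    {"item_id": "q-003", "guideline_version": "v1.1", "label": "A"},
    {"item_id": "q-004", "guideline_version": "v1.1", "label": "A"},
]
stale = labels_needing_rebaseline(labels, current_version="v1.1")
per_version = Counter(row["guideline_version"] for row in labels)

# After re-labelling the stale items under v1.1 (hypothetical results):
old = {row["item_id"]: row["label"] for row in stale}
new = {"q-001": "A", "q-002": "A"}
delta = relabel_delta(old, new)
```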
Tooling-agnostic by design
We drive the tool that fits the work
The right tool depends on modality, scale, security posture, and whether your team will operate the platform after we leave. We pick on those grounds, not on which vendor sponsored last quarter's webinar.
Label Studio
Open-source default. NLP + vision + audio.
Prodigy
Active-learning + script-driven workflows.
V7
Vision-heavy + medical imaging.
Encord
Video + complex vision pipelines.
Argilla
LLM-feedback + preference data tooling.
CVAT
Computer-vision annotation at scale.
doccano
Lightweight text annotation.
Custom UI
Bespoke labelling app when the tools above don't fit.
Where regulated data residency rules out a managed SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same; the hosting posture changes.
Your handover pack
What ships with the dataset
A dataset on its own is a liability. A dataset plus its provenance, IAA history, adjudication record, and redaction report is an asset your training programme and your auditor can both work with.
Every batch ships with these artefacts. If you commission a one-shot project they arrive once. If you commission a steady cadence they refresh per batch.
Annotation guideline document
Versioned, examples-rich, edge-case-explicit. Written so a new labeller can ramp in a day and produce work that holds against your IAA bar.
Calibration + golden set
The sealed item set we use to ramp labellers, retest after guideline changes, and quality-check every batch you receive.
IAA + quality dashboard
Inter-annotator agreement per task, per labeller, per batch. The signal that tells you whether to ship the batch or rework it.
Adjudication record
Every disagreement, who adjudicated it, and the decision rationale. The audit trail your downstream model card can cite.
Lineage + redaction report
Per-item provenance. Labeller identity hashed. PII redaction report for any regulated source corpus. Signed off before the dataset leaves us.
Train / eval split + dataset card
Pre-split train, validation, and held-out eval sets. Dataset card describes composition, biases known and mitigated, and intended use.
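One common way to keep a split stable across batches is to derive it from a hash of the item ID, so an item never migrates from the held-out eval set into train when the dataset is refreshed. The sketch below is an assumption about approach, not the shipped method, and the 80/10/10 proportions are placeholders.

```python
import hashlib

def assign_split(item_id: str, validation_pct: int = 10, eval_pct: int = 10) -> str:
    """Deterministically map an item ID to 'train', 'validation', or 'eval'."""
    digest = hashlib.sha256(item_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < eval_pct:
        return "eval"
    if bucket < eval_pct + validation_pct:
        return "validation"
    return "train"

# Hypothetical item IDs; the same ID always lands in the same split.
splits = {f"q-{i:04d}": assign_split(f"q-{i:04d}") for i in range(1000)}
```

For paired or grouped data, such as several preference pairs drawn from one source question, hashing the group ID instead of the item ID keeps related items on the same side of the split.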
How we engage
Pick the shape that fits your team
From end-to-end programme delivery to time-boxed audit. The scope call confirms which fits; the statement of work names the deliverables.
Yobitel-led
We own the annotation programme end-to-end
Guidelines, calibration set, labeller pool, adjudication, QA, lineage, redaction. You receive shipped batches against a fixed quality bar. Best when annotation is on the critical path of a training or eval milestone.
Collaborative
You bring the labellers, we run the craft
You provide an in-house or contracted labelling team. We own guidelines, calibration, IAA tracking, adjudication, and the QA loop. Best when you already operate labellers and want to lift the methodology.
Advisory
Time-boxed review of your existing process
Fixed-window audit of your current annotation programme. We sample the data, run IAA against a re-labelled control, write a remediation plan. Best when last year's dataset isn't holding up.
Related
Model training + fine-tuning
The training-run engineering that consumes the dataset you commissioned here. SFT, DPO, RLHF, evaluation. Same engineering bench across both.
Related
ML pipelines + continuous evaluation
The pipeline that re-runs your eval set on every model bump and triggers your re-labelling loop when drift fires. The annotation work becomes a continuous signal, not a one-shot.
Tell us what the dataset is for.
A short questionnaire covers scope, quality bar, and engagement shape. Our annotation practice lead replies inside one working day with a calibration plan and a candidate tooling stack fitted to your data sensitivity and timeline.
Vetted SME adjudicator pool across clinical, legal, and financial domains. Per-item lineage shipped with every dataset. Same practice that powers the evaluation suites our training engagements are measured against. Engagements scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).