Annotation Practice · Instruction-Tuning Curation
The SFT corpus your fine-tune actually deserves
Prompt authoring, ideal-response writing, and edit-and-improve passes on model drafts. Shipped in the chat template your trainer expects, with an edit-history audit attached. The artefact your supervised fine-tune consumes before any RLHF kicks in.
Representative edit pass
Accepted · SFT corpus · customer-support persona · ShareGPT
Prompt
Write a customer email apologising for a delayed shipment.
Model draft
verbose · stiff
We deeply regret to inform you that your recent order has unfortunately experienced an unexpected delay in the shipment process.
Edited
tight · direct
Your order was delayed. Here's what happened and what we're doing about it.
+12 words tightened · tone: corporate → direct
Edits / week
1,420
Accept rate
73%
Format
ShareGPT
Every edit captures the original draft, the curator, the rationale, and the time spent. Train / val split shipped in target schema with full edit history attached.
What we curate
Six distinct passes on the same corpus
An SFT corpus is not one kind of writing. Prompt construction, gold-response authoring, edit-from-draft, multi-turn dialogue, refusal voice, and persona grounding are each a separate craft with its own failure modes. We run all of them.
Prompt authoring
Curators write prompts to a versioned taxonomy. Coverage of task types, intents, difficulty bands, and edge cases is tracked so the corpus does not over-index on the easy slice.
Taxonomy · intent map · difficulty bands · edge slices
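A minimal sketch of how coverage against a taxonomy might be tracked. The axis values, the field names (`task_type`, `difficulty`) and the `min_per_slice` threshold are hypothetical; a real taxonomy is versioned and much larger.

```python
from collections import Counter
from itertools import product

# Hypothetical taxonomy axes, for illustration only.
TASK_TYPES = ["summarise", "rewrite", "classify"]
DIFFICULTY_BANDS = ["easy", "medium", "hard"]


def thin_slices(prompts, min_per_slice):
    """Return (task_type, difficulty) slices holding fewer than min_per_slice prompts."""
    counts = Counter((p["task_type"], p["difficulty"]) for p in prompts)
    return [
        cell
        for cell in product(TASK_TYPES, DIFFICULTY_BANDS)
        if counts[cell] < min_per_slice
    ]


prompts = [
    {"task_type": "summarise", "difficulty": "easy"},
    {"task_type": "summarise", "difficulty": "easy"},
    {"task_type": "rewrite", "difficulty": "hard"},
]
# Every slice except the two populated ones is reported as thin.
gaps = thin_slices(prompts, min_per_slice=1)
```

The list of thin slices is what would steer the next batch of prompt writing away from the easy slice.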
Ideal-response writing
Senior curators write the gold response for the highest-value prompts. Used where the model has no good draft to start from, or where the answer is the artefact you want to teach.
Gold responses · reference answers · style anchors
Model-assisted (post-editing)
A base model generates a draft. A curator tightens phrasing, fixes facts, strips verbosity, and reshapes tone. This is the standard model-assisted annotation pattern, also called post-editing. It is usually the best signal per curator-hour for general-assistant corpora.
Draft + improve · diff tracked · rationale captured
Multi-turn dialogue authoring
Whole conversations curated turn-by-turn. State carried across turns, clarification questions written where the prompt is ambiguous, follow-up prompts shaped to stress the model.
Multi-turn · clarifying questions · state-tracking
Refusal authoring
Helpful refusals to genuinely unsafe asks, soft redirects on borderline ones, full answers where the model was wrongly refusing. Calibrated against your safety policy, not a generic refusal voice.
Helpful refusals · soft redirects · policy-aligned
Persona-grounded responses
Responses written in the voice you ship to customers. Brand register, reading age, allowed claims, escalation phrases. The persona becomes a constraint the model learns, not a system-prompt hack.
Brand voice · register · allowed-claims · escalations
Real curation samples
What a curator actually strikes and rewrites
Three real edit shapes across customer-support, code review, and patient-facing clinical copy. The strikethrough is what the model produced; the green spans are the curator's rewrite. Rationale captured per row.
Prompt
Write a customer email apologising for a delayed shipment. Tone: direct, no throat-clearing.
Model draft
auto-generated
We sincerely and deeply regret to inform you that your order has unfortunately been delayed. We understand the impact this may have on you and we want to be fully transparent about the situation.
Curator's edited response
Accepted
Your order was delayed. Here's what happened and what we're doing about it.
Words trimmed
−23
Tone shift
corporate → direct
Curator rationale
“Strip throat-clearing. Drop the corporate-apology preamble. Lead with the fact, then the recovery.”
Prompt
Review this function. The user wants concise feedback, not a textbook.
Model draft
auto-generated
This function appears to be implementing a binary search algorithm. Binary search is a divide-and-conquer algorithm that works on sorted arrays. Looking at your implementation, I have several observations to share. The off-by-one in line 7 (`high = mid` should be `high = mid - 1`) will skip the matching element on a left-leaning split. Other than that, the code looks generally fine and follows standard conventions.
Curator's edited response
Accepted
The off-by-one in line 7 (`high = mid` should be `high = mid - 1`) will skip the matching element on a left-leaning split. Rest looks clean.
Words trimmed
−41
Tone shift
lecture → peer review
Curator rationale
“Cut the textbook intro. Inline the specific code suggestion. The dev wants the fix, not the lesson.”
Prompt
Explain why we're switching from warfarin to apixaban. Audience: patient with high-school reading level.
Model draft
auto-generated
Apixaban is a direct oral anticoagulant (DOAC) that selectively inhibits factor Xa, offering a pharmacological profile distinct from vitamin K antagonists such as warfarin. We're moving you from warfarin to apixaban. Both are blood thinners. Apixaban is easier to live with: you won't need monthly blood tests for INR monitoring, fewer foods interact with it due to its different metabolism pathway.
Curator's edited response
Accepted
We're moving you from warfarin to apixaban. Both are blood thinners. Apixaban is easier to live with: you won't need monthly blood tests, and fewer foods interact with it.
Words trimmed
−28
Tone shift
clinical → conversational
Curator rationale
“Reading age. Drop pharmacology jargon, lead with what the patient actually experiences differently. SME signed off.”
Prompts written for illustration. The diff visualisation and rationale capture pattern mirror what curators produce per row in the SFT corpus we ship.
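As a rough illustration of how a word-level diff and a words-trimmed figure could be derived from a draft and its edit, here is a sketch built on Python's standard `difflib`. The function names and the `keep` / `strike` / `insert` tags are our own labels for this example, not a description of any particular tool.

```python
import difflib


def word_delta(draft: str, edited: str) -> int:
    """Signed change in whitespace-delimited word count (negative means trimmed)."""
    return len(edited.split()) - len(draft.split())


def word_diff(draft: str, edited: str):
    """Yield (tag, text) pairs where tag is 'keep', 'strike', or 'insert'."""
    a, b = draft.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            yield "keep", " ".join(a[a0:a1])
        else:
            # A replace contributes both a struck span and an inserted span.
            if a1 > a0:
                yield "strike", " ".join(a[a0:a1])
            if b1 > b0:
                yield "insert", " ".join(b[b0:b1])


draft = "We regret to inform you that your order was delayed."
edited = "Your order was delayed."
spans = list(word_diff(draft, edited))
trimmed = word_delta(draft, edited)
```

Dropping the `strike` spans reproduces the edit and dropping the `insert` spans reproduces the draft, which is a useful sanity check before a row is written to the audit.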
Where SFT corpora quietly fail
The drift modes that show up in production behaviour
Every SFT corpus we audit hits some subset of these. The model trains cleanly, evals look reasonable, and then real users surface what the corpus actually taught. Knowing the failure modes is most of the win.
Verbosity drift
What bad looks like
Every response opens with throat-clearing. Models trained on it open with throat-clearing forever.
What we design for
Length budget per response type. Responses that exceed it are flagged in review. Tight openers preferred.
SFT corpora silently teach length. A corpus where every gold response runs 400 words ships a model that cannot write a one-line answer. We set length budgets per task and police them in review.
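A length budget can be checked mechanically before human review. The sketch below assumes budgets are expressed in words and keyed by a `response_type` field; the budget values and field names are hypothetical.

```python
# Hypothetical word budgets per response type, for illustration only.
LENGTH_BUDGETS = {"one_line_answer": 25, "support_email": 120, "code_review": 80}


def over_budget(items):
    """Return (item_id, word_count, budget) for each response that exceeds its budget."""
    flagged = []
    for item in items:
        budget = LENGTH_BUDGETS[item["response_type"]]
        words = len(item["response"].split())
        if words > budget:
            flagged.append((item["id"], words, budget))
    return flagged


items = [
    {"id": "a1", "response_type": "one_line_answer",
     "response": "Yes, refunds take five working days."},
    {"id": "a2", "response_type": "one_line_answer",
     "response": " ".join(["word"] * 40)},
]
flagged = over_budget(items)
```

A flag here is a prompt for the reviewer, not an automatic reject: some items legitimately need the extra length.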
Hallucinated facts
What bad looks like
Curators paraphrase what sounds right. Wrong facts get baked in as ground truth.
What we design for
Source-citation required on factual claims. SME review on domain-sensitive items. Fact-check pass before accept.
An ideal response that is confidently wrong is worse than a clumsy correct one. We require citations on factual claims and route domain-sensitive items to vetted SMEs before they reach the train split.
Formulaic safety language
What bad looks like
Every refusal opens the same way. Model learns to refuse anything that pattern-matches.
What we design for
Refusal voice varied. Soft-redirect option available. Calibrated against your actual safety policy.
Copy-paste refusal voice over-refuses. A model that learned to say the same sentence in response to anything that looks risky will block legitimate user asks too. Varied refusals plus a soft-redirect register keep the safety surface usable.
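One cheap signal for copy-paste refusal voice is how often the same opening words recur. This is a sketch under our own assumptions: openers are compared on the first four lower-cased words, and what share counts as too high is left to the reviewer.

```python
from collections import Counter


def opener_share(responses, n_words=4):
    """Return the most common n-word opener and the share of responses that start with it."""
    openers = Counter(
        " ".join(r.lower().split()[:n_words]) for r in responses if r.strip()
    )
    if not openers:
        return "", 0.0
    opener, count = openers.most_common(1)[0]
    return opener, count / sum(openers.values())


refusals = [
    "I'm sorry, but I can't help with that.",
    "I'm sorry, but I can't share that.",
    "That one is outside what I can do, but here is a safer route.",
]
opener, share = opener_share(refusals)
```

A high share for a single opener suggests the refusal slice needs rewriting for variety before it reaches the train split.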
Instruction drift
What bad looks like
Response ignores half the prompt. Curator accepts it because the prose is good.
What we design for
Constraint checklist per item. Each prompt constraint verified individually before accept.
Prompts have constraints (format, length, tone, what to include, what to exclude). When curators grade on prose quality alone, half the constraints leak. We attach a constraint checklist to every item and verify each one.
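A constraint checklist can be expressed as named checks that are verified one by one. The helper names below (`max_words`, `must_include`, `must_exclude`) are hypothetical, and only mechanical constraints are covered; tone and similar judgments still need a human.

```python
import re


# Each helper returns a predicate over the response text.
def max_words(limit):
    return lambda text: len(text.split()) <= limit


def must_include(phrase):
    return lambda text: phrase.lower() in text.lower()


def must_exclude(pattern):
    return lambda text: re.search(pattern, text, flags=re.IGNORECASE) is None


def verify(response, checklist):
    """Run every named check and return {name: passed}."""
    return {name: check(response) for name, check in checklist.items()}


checklist = {
    "under_40_words": max_words(40),
    "states_the_delay": must_include("delayed"),
    "no_throat_clearing": must_exclude(r"regret to inform"),
}
results = verify(
    "Your order was delayed. Here's what happened and what we're doing about it.",
    checklist,
)
# The item is accepted only when every constraint passes individually.
accept = all(results.values())
```

Keeping the per-constraint results, rather than a single pass/fail, shows which constraint leaked when an item is sent back.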
Length bias in preference data
What bad looks like
Curators reflexively prefer longer responses. SFT corpus inherits the bias.
What we design for
Length-blind review. Spot-checks on items where the shorter response was correct.
Curators trained on long-form writing prefer long-form writing. Without active de-biasing, the SFT data teaches the model to be wordy. We run length-blind review and spot-check items where the shorter response was the correct one.
Formats we ship
Shaped to the chat template your trainer expects
The same curated content can ship in any of the standard SFT shapes. Format is a packaging decision, not a craft decision. You tell us what the trainer consumes; we hand over a clean split in that schema.
Alpaca
Instruction + input + output triples. The original format that launched the SFT-on-open-models era. Easy to consume across most trainers.
OpenAssistant
Tree-shaped conversation graphs with multiple replies and rankings per turn. Use when you want hierarchy and reply branching in the corpus itself.
ShareGPT
Linear multi-turn conversations with role tags. The de facto standard for chat-shaped SFT data. Read natively by common open-source fine-tuners such as Axolotl and LLaMA-Factory, and also consumed by the serving benchmark scripts in vLLM and SGLang.
Vicuna
Conversation format with explicit USER and ASSISTANT tags and a system prompt slot. Use when the trainer expects the Vicuna chat template directly.
Llama 2 Chat
[INST] / <<SYS>> wrapper format used by Llama 2 fine-tunes. Use when targeting Llama 2-derived models on the original chat template.
Llama 3 Instruct
Special-token header format (<|start_header_id|>, <|end_header_id|>, <|eot_id|>) introduced in Llama 3. Use when targeting Llama 3 and 3.x fine-tunes.
JSONL custom
Your bespoke schema, mapped from any of the standard shapes above. Useful when your trainer or pipeline expects a non-standard field set.
If your trainer expects something none of the above describe, share the spec on the scope call. Format conversion is part of the handover, not a separate workstream.
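To make the packaging point concrete, here is a sketch that maps an Alpaca triple to a ShareGPT-style conversation and then renders it with the Llama 3 header tokens. The function names are hypothetical, joining instruction and input with a blank line is our assumption, and in a real pipeline the tokenizer's own chat template should be treated as the source of truth.

```python
def alpaca_to_sharegpt(record, system=None):
    """Map an Alpaca instruction/input/output triple to a ShareGPT-style conversation."""
    prompt = record["instruction"]
    if record.get("input"):
        # Assumption: instruction and input are joined with a blank line.
        prompt = f"{prompt}\n\n{record['input']}"
    turns = []
    if system:
        turns.append({"from": "system", "value": system})
    turns.append({"from": "human", "value": prompt})
    turns.append({"from": "gpt", "value": record["output"]})
    return {"conversations": turns}


# ShareGPT role tags mapped to the role names used in Llama 3 headers.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_llama3(example):
    """Render a ShareGPT-style conversation as one Llama 3 Instruct training string."""
    parts = ["<|begin_of_text|>"]
    for turn in example["conversations"]:
        role = ROLE_MAP[turn["from"]]
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{turn['value'].strip()}<|eot_id|>"
        )
    return "".join(parts)


alpaca = {
    "instruction": "Write a customer email apologising for a delayed shipment.",
    "input": "",
    "output": "Your order was delayed. Here's what happened and what we're doing about it.",
}
sharegpt = alpaca_to_sharegpt(alpaca)
rendered = sharegpt_to_llama3(sharegpt)
```

The curated text is identical in all three shapes; only the wrapper changes.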
Tooling-agnostic by design
The right tool for the curation surface
Argilla for feedback-shaped corpora, Label Studio when the UI needs to be bespoke, Prodigy where active learning surfaces the next most-informative item to label, an in-house edit interface when nothing fits. The platform follows the work, not the other way around.
Argilla
LLM-feedback + curation workflows. First-class fit for instruction-tune data.
Label Studio
Flexible XML-based labeling configs. Good for custom curation UIs.
Prodigy
Active-learning + scripting. Strong on routing the next item to the right curator.
Hugging Face Hub
Versioned dataset hosting + community discovery for non-sensitive corpora.
LangSmith
Trace-driven curation when the prompts come from a live LangChain app.
In-house UI
Bespoke edit interface when the off-the-shelf tools do not fit the workflow.
Where regulated data residency rules out a SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same; the hosting posture changes.
Your handover pack
What ships with the corpus
A JSONL file on its own teaches your trainer nothing about how the corpus was built. The pack below makes the dataset auditable, reproducible, and extendable by whoever owns it after the project ships.
Every batch refreshes the pack. If you commission a one-shot project it arrives once. If you commission a continuous cadence it refreshes on every drop.
Curation guidelines
Versioned, example-rich, edge-case-explicit. Covers prompt construction, response standards, refusal voice, length budgets, and the brand register if you have one.
Prompt taxonomy
The intent + task-type + difficulty map your corpus is sampled against. Coverage report tells you which slices are thin so the next batch can fill them.
Sampled QA set
A sealed sample from every batch, re-reviewed by a senior curator independent of the original pass. The signal that decides whether the batch ships or goes back for rework.
Edit-history audit
For every edit-from-draft item: the original draft, the final edit, the hashed curator identity, the rationale string, and the time spent. The audit trail your model card can cite.
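One possible shape for a single audit row, written as a JSONL line. The class name, field names, and the salted SHA-256 scheme are assumptions made for this sketch, not a fixed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def hash_curator(curator_id: str, salt: str) -> str:
    """Salted SHA-256 so the row identifies a curator consistently without naming them."""
    return hashlib.sha256(f"{salt}:{curator_id}".encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EditRecord:
    item_id: str
    original_draft: str
    final_edit: str
    curator_hash: str
    rationale: str
    seconds_spent: int


record = EditRecord(
    item_id="cs-0001",  # hypothetical identifier
    original_draft="We regret to inform you that your order was delayed.",
    final_edit="Your order was delayed.",
    curator_hash=hash_curator("curator-17", salt="project-salt"),
    rationale="Strip throat-clearing. Lead with the fact.",
    seconds_spent=95,
)
# One JSON object per line keeps the audit easy to stream alongside the corpus.
line = json.dumps(asdict(record), ensure_ascii=False)
```

The salt would be held outside the dataset so the hash cannot be reversed by guessing curator identifiers.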
Train / val split in target format
Pre-split corpus shipped in the chat template you asked for (Alpaca, OpenAssistant, ShareGPT, Vicuna, Llama 2 Chat, Llama 3 Instruct, or your JSONL spec). Held-out validation kept clean of train.
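One way to keep validation clean of train is to assign each item to a partition from a hash of its normalised prompt, so that repeated prompts can never straddle the split. This is a sketch under that assumption: the function names and the 10% default are hypothetical, and it catches exact and whitespace or case variants only, not paraphrases.

```python
import hashlib


def split_for(prompt: str, val_percent: int = 10) -> str:
    """Assign a prompt to 'train' or 'val' from a hash of its normalised text."""
    normalised = " ".join(prompt.lower().split())
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "val" if bucket < val_percent else "train"


def split_corpus(items, val_percent=10):
    """Split items so that every copy of a prompt lands in the same partition."""
    parts = {"train": [], "val": []}
    for item in items:
        parts[split_for(item["prompt"], val_percent)].append(item)
    return parts


items = [
    {"prompt": "Write a customer email apologising for a delayed shipment."},
    {"prompt": "write a customer email   apologising for a delayed shipment."},
    {"prompt": "Review this function."},
]
parts = split_corpus(items)
```

Because the assignment depends only on the prompt text, re-running the split on a refreshed corpus keeps existing items in the partition they already had.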
Eval rubric
The scoring rubric your post-training evals can grade against. Aligned with the curation guidelines so training data and evals are judged against the same bar.
How we engage
Pick the shape that fits your team
From end-to-end corpus delivery to time-boxed audit. The scope call confirms which shape fits; the statement of work names the deliverables.
Yobitel-led
We own the curation programme end-to-end
Guidelines, prompt taxonomy, curator pool, edit-from-draft loop, senior review, format conversion, dataset card. You receive a shipped corpus against a fixed quality bar in the format your trainer wants.
Collaborative
You bring the curators, we run the craft
Your in-house or contracted writers do the work. We provide the guidelines, the taxonomy, the QA loop, the format conversion, and the senior review bench. Best when you already have domain writers and want to raise the methodology around them.
Advisory
Time-boxed audit of an existing SFT corpus
Fixed-window review of the corpus you already have. We sample it, re-grade against a tighter rubric, surface the drift modes, and write a remediation plan. Best when an earlier fine-tune is not behaving the way the corpus promised.
Back to hub
Data annotation + RLHF preparation
The full annotation practice. Supervised labelling, preference data, eval set authoring, safety datasets, multimodal, synthetic generation. Instruction-tune curation sits inside it.
Downstream
Model training + fine-tuning (SFT consumption)
The training-run engineering that consumes the SFT corpus you commissioned here. Trainer wiring, eval-loss tracking, checkpoint discipline, downstream-eval gate.
Tell us what the corpus is for.
A short questionnaire covers volume, format target, quality bar, and engagement shape. Our instruction-tuning lead replies inside one working day with a candidate taxonomy, a tooling pick, and a curator-pool plan fitted to your timeline and sensitivity.
Curator pool with SME bench across clinical, legal, and finance domains. Edit history captured as a first-class artefact, never retrofitted. Engagements scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).