Annotation Practice · Instruction-Tuning Curation

The SFT corpus your fine-tune actually deserves

Prompt authoring, ideal-response writing, and edit-and-improve passes on model drafts. Shipped in the chat template your trainer expects, with an edit-history audit attached. The artefact your supervised fine-tune consumes before any RLHF kicks in.

See the formats we ship

Alpaca · OpenAssistant · ShareGPT · Vicuna · Llama 2 Chat / 3 Instruct · JSONL custom

Edit history captured per item, not retrofitted

SME pool for clinical, legal, and finance domains

Representative edit pass

Accepted

SFT corpus · customer-support persona · ShareGPT

Prompt

Write a customer email apologising for a delayed shipment.

Model draft

verbose · stiff

We deeply regret to inform you that your recent order has unfortunately experienced an unexpected delay in the shipment process.

Edited

tight · direct

Your order was delayed. Here's what happened and what we're doing about it.

+12 words tightened · tone: corporate → direct

Edits / week

1,420

Accept rate

73%

Format

ShareGPT

Every edit captures the original draft, the curator, the rationale, and the time spent. Train / val split shipped in target schema with full edit history attached.

What we curate

Six distinct passes on the same corpus

An SFT corpus is not one kind of writing. Prompt construction, gold-response authoring, edit-from-draft, multi-turn dialogue, refusal voice, and persona grounding are each a separate craft with its own failure modes. We run all of them.

Prompt authoring

Curators write prompts to a versioned taxonomy. Coverage of task types, intents, difficulty bands, and edge cases is tracked so the corpus does not over-index on the easy slice.

Taxonomy · intent map · difficulty bands · edge slices

Ideal-response writing

Senior curators write the gold response for the highest-value prompts. Used where the model has no good draft to start from, or where the answer is the artefact you want to teach.

Gold responses · reference answers · style anchors

Model-assisted (post-editing)

A base model generates a draft. A curator tightens phrasing, fixes facts, strips verbosity, and reshapes tone. This is the industry-standard model-assisted annotation pattern, also called post-editing, and it yields the most signal per curator-hour for general-assistant corpora.

Draft + improve · diff tracked · rationale captured

Multi-turn dialogue authoring

Whole conversations curated turn-by-turn. State carried across turns, clarification questions written where the prompt is ambiguous, follow-up prompts shaped to stress the model.

Multi-turn · clarifying questions · state-tracking

Refusal authoring

Helpful refusals to genuinely unsafe asks, soft redirects on borderline ones, full answers where the model was wrongly refusing. Calibrated against your safety policy, not a generic refusal voice.

Helpful refusals · soft redirects · policy-aligned

Persona-grounded responses

Responses written in the voice you ship to customers. Brand register, reading age, allowed claims, escalation phrases. The persona becomes a constraint the model learns, not a system-prompt hack.

Brand voice · register · allowed-claims · escalations

Real curation samples

What a curator actually strikes and rewrites

Three real edit shapes across customer-support, code review, and patient-facing clinical copy. Each sample shows the model's draft, then the curator's rewrite, with the rationale captured per row.

Customer support

Format · Alpaca

Prompt

Write a customer email apologising for a delayed shipment. Tone: direct, no throat-clearing.

Model draft

auto-generated

We sincerely and deeply regret to inform you that your order has unfortunately been delayed. We understand the impact this may have on you and we want to be fully transparent about the situation.

Curator's edited response

Accepted

Your order was delayed. Here's what happened and what we're doing about it.

Words trimmed

−23

Tone shift

corporate → direct

Curator rationale

“Strip throat-clearing. Drop the corporate-apology preamble. Lead with the fact, then the recovery.”

Code review

Format · ShareGPT

Prompt

Review this function. The user wants concise feedback, not a textbook.

Model draft

auto-generated

This function appears to be implementing a binary search algorithm. Binary search is a divide-and-conquer algorithm that works on sorted arrays. Looking at your implementation, I have several observations to share. The off-by-one in line 7 (`high = mid` should be `high = mid - 1`) will skip the matching element on a left-leaning split. Other than that, the code looks generally fine and follows standard conventions.

Curator's edited response

Accepted

The off-by-one in line 7 (`high = mid` should be `high = mid - 1`) will skip the matching element on a left-leaning split. Rest looks clean.

Words trimmed

−41

Tone shift

lecture → peer review

Curator rationale

“Cut the textbook intro. Inline the specific code suggestion. The dev wants the fix, not the lesson.”

Patient-facing explanation

Format · Vicuna

Prompt

Explain why we're switching from warfarin to apixaban. Audience: patient with high-school reading level.

Model draft

auto-generated

Apixaban is a direct oral anticoagulant (DOAC) that selectively inhibits factor Xa, offering a pharmacological profile distinct from vitamin K antagonists such as warfarin. We're moving you from warfarin to apixaban. Both are blood thinners. Apixaban is easier to live with: you won't need monthly blood tests for INR monitoring, fewer foods interact with it due to its different metabolism pathway.

Curator's edited response

Accepted

We're moving you from warfarin to apixaban. Both are blood thinners. Apixaban is easier to live with: you won't need monthly blood tests, and fewer foods interact with it.

Words trimmed

−28

Tone shift

clinical → conversational

Curator rationale

“Reading age. Drop pharmacology jargon, lead with what the patient actually experiences differently. SME signed off.”

Prompts written for illustration. The diff visualisation and rationale capture pattern mirror what curators produce per row in the SFT corpus we ship.
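The per-row diff can be approximated with Python's standard `difflib`. This is a minimal sketch, not the tooling behind the samples above: the function names `word_diff` and `words_trimmed`, and the `[-…-]` / `{+…+}` markers, are assumptions chosen for plain-text display.

```python
import difflib


def word_diff(draft: str, edited: str) -> str:
    """Word-level diff: [-struck-] for removed words, {+added+} for new ones."""
    a, b = draft.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words the curator struck from the draft
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words the curator wrote in
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


def words_trimmed(draft: str, edited: str) -> int:
    """Net whitespace-delimited words removed by the edit."""
    return len(draft.split()) - len(edited.split())
```

Whitespace tokenisation is the simplest choice here; a production diff would probably treat punctuation and inline code spans separately.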

Where SFT corpora quietly fail

The drift modes that show up in production behaviour

Every SFT corpus we audit hits some subset of these. The model trains cleanly, evals look reasonable, and then real users surface what the corpus actually taught. Knowing the failure modes is most of the win.

Verbosity drift

What bad looks like

Every response opens with throat-clearing. Models trained on it open with throat-clearing forever.

What we design for

Length budget per response type. Curators flagged when output exceeds it. Tight openers preferred.

SFT corpora silently teach length. A corpus where every gold response runs 400 words ships a model that cannot write a one-line answer. We set length budgets per task and police them in review.
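A length budget can be enforced with a very small check. The sketch below is illustrative only: the task-type names and word limits in `LENGTH_BUDGETS` are hypothetical placeholders, and real budgets would come from the curation guidelines.

```python
# Hypothetical per-task word budgets; the real values live in the guidelines.
LENGTH_BUDGETS = {
    "one_line_answer": 25,
    "code_review": 80,
    "support_email": 120,
}


def word_count(text: str) -> int:
    return len(text.split())


def over_budget(task_type: str, response: str, budgets=LENGTH_BUDGETS) -> bool:
    """True when the response exceeds the word budget for its task type."""
    return word_count(response) > budgets[task_type]


def flag_for_review(items, budgets=LENGTH_BUDGETS):
    """Return the ids of items whose response breaks its budget."""
    return [
        item["id"]
        for item in items
        if over_budget(item["task_type"], item["response"], budgets)
    ]
```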

Hallucinated facts

What bad looks like

Curators paraphrase what sounds right. Wrong facts get baked in as ground truth.

What we design for

Source-citation required on factual claims. SME review on domain-sensitive items. Fact-check pass before accept.

An ideal response that is confidently wrong is worse than a clumsy correct one. We require citations on factual claims and route domain-sensitive items to vetted SMEs before they reach the train split.

Formulaic safety language

What bad looks like

Every refusal opens the same way. Model learns to refuse anything that pattern-matches.

What we design for

Refusal voice varied. Soft-redirect option available. Calibrated against your actual safety policy.

Copy-paste refusal voice over-refuses. A model that learned to say the same sentence in response to anything that looks risky will block legitimate user asks too. Varied refusals plus a soft-redirect register keep the safety surface usable.

Instruction drift

What bad looks like

Response ignores half the prompt. Curator accepts it because the prose is good.

What we design for

Constraint checklist per item. Each prompt constraint verified individually before accept.

Prompts have constraints (format, length, tone, what to include, what to exclude). When curators grade on prose quality alone, half the constraints leak. We attach a constraint checklist to every item and verify each one.
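One way to picture a constraint checklist is as a set of named predicates that must all pass before an item is accepted. This is a sketch under assumptions: the check names and the rules inside them are hypothetical, and many real constraints need human judgment rather than string tests.

```python
from typing import Callable, Dict

Check = Callable[[str], bool]


def verify_constraints(response: str, checklist: Dict[str, Check]) -> Dict[str, bool]:
    """Evaluate every constraint individually so failures are visible by name."""
    return {name: check(response) for name, check in checklist.items()}


# Hypothetical checklist for the delayed-shipment prompt.
checklist: Dict[str, Check] = {
    "under_40_words": lambda r: len(r.split()) <= 40,
    "no_apology_preamble": lambda r: not r.lower().startswith("we sincerely"),
    "states_recovery": lambda r: "what we're doing" in r.lower(),
}

response = "Your order was delayed. Here's what happened and what we're doing about it."
results = verify_constraints(response, checklist)
accept = all(results.values())  # accept only if every constraint holds
```

Keeping the results per constraint, rather than a single pass/fail flag, shows which constraint leaked when an item is rejected.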

Length bias in preference data

What bad looks like

Curators reflexively prefer longer responses. SFT corpus inherits the bias.

What we design for

Length-blind review. Spot-checks on items where the shorter response was correct.

Curators trained on long-form writing prefer long-form writing. Without active de-biasing the SFT data teaches the model to be wordy. We run length-blind review and spike-test the bias.
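A simple way to quantify the bias is to measure how often the accepted response is the longer of two candidates. The sketch below assumes review decisions are available as (chosen, rejected) text pairs; that data shape and the function name are assumptions for illustration.

```python
def longer_preferred_rate(decisions) -> float:
    """Share of non-tied decisions where the chosen response was the longer one.

    `decisions` is an iterable of (chosen_text, rejected_text) pairs.
    A rate well above 0.5 suggests reviewers may be rewarding length.
    """
    longer = total = 0
    for chosen, rejected in decisions:
        c, r = len(chosen.split()), len(rejected.split())
        if c == r:
            continue  # equal-length pairs carry no length signal
        total += 1
        if c > r:
            longer += 1
    return longer / total if total else 0.0
```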

Formats we ship

Shaped to the chat template your trainer expects

The same curated content can ship in any of the standard SFT shapes. Format is a packaging decision, not a craft decision. You tell us what the trainer consumes; we hand over a clean split in that schema.

Alpaca

Instruction + input + output triples. The original format that launched the SFT-on-open-models era. Easy to consume across most trainers.
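For illustration, an Alpaca-style record uses the fields `instruction`, `input`, and `output`. The sketch below serialises records to JSONL with the standard library; the helper name `to_jsonl` is hypothetical and the sample text is taken from the delayed-shipment example.

```python
import json

record = {
    "instruction": "Write a customer email apologising for a delayed shipment.",
    "input": "Tone: direct, no throat-clearing.",
    "output": "Your order was delayed. Here's what happened and what we're doing about it.",
}


def to_jsonl(records) -> str:
    """One JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


jsonl_text = to_jsonl([record])
```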

OpenAssistant

Tree-shaped conversation graphs with multiple replies and rankings per turn. Use when you want hierarchy and reply branching in the corpus itself.

ShareGPT

Linear multi-turn conversations with role tags. The de facto standard for chat-shaped SFT data. Widely supported by open-source fine-tuning frameworks, and accepted as a dataset shape by the vLLM and SGLang serving benchmarks.
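As a rough sketch of the shape, a ShareGPT-style record holds a `conversations` list whose turns carry `from` and `value` keys, with `human` and `gpt` as the usual role tags. The converter below maps a single Alpaca-style triple into that shape; the function name is hypothetical, and joining instruction and input with a blank line is an assumption rather than a fixed rule.

```python
def alpaca_to_sharegpt(rec: dict) -> dict:
    """Map one instruction/input/output triple to a single-exchange conversation."""
    prompt = rec["instruction"]
    if rec.get("input"):
        # Assumption: append the optional input after a blank line.
        prompt = f"{prompt}\n\n{rec['input']}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": rec["output"]},
        ]
    }


example = alpaca_to_sharegpt(
    {
        "instruction": "Review this function.",
        "input": "",
        "output": "Rest looks clean.",
    }
)
```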

Vicuna

Conversation format with explicit USER and ASSISTANT tags and a system prompt slot. Use when the trainer expects the Vicuna chat template directly.

Llama 2 Chat

[INST] / <<SYS>> wrapper format used by Llama 2 fine-tunes. Use when targeting Llama 2-derived models on the original chat template.
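The wrapper can be sketched as plain string formatting. This is an illustrative single-turn renderer only: the function name is hypothetical, and in practice the tokenizer's own chat template should be treated as the source of truth, since it also controls how the `<s>` and `</s>` special tokens are added.

```python
from typing import Optional


def render_llama2_turn(user: str, assistant: str, system: Optional[str] = None) -> str:
    """Render one user/assistant exchange in the Llama 2 chat layout."""
    if system:
        # The system prompt is folded into the first user message.
        user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
    return f"<s>[INST] {user} [/INST] {assistant} </s>"


text = render_llama2_turn(
    user="Write a customer email apologising for a delayed shipment.",
    assistant="Your order was delayed. Here's what happened and what we're doing about it.",
    system="You are a direct, friendly support agent.",
)
```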

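Llama 3 Instruct

Header-token chat template (<|start_header_id|> role headers, <|eot_id|> turn terminators) used by Llama 3 Instruct fine-tunes. Use when targeting Llama 3-derived models on their native chat template.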

JSONL custom

Your bespoke schema, mapped from any of the standard shapes above. Useful when your trainer or pipeline expects a non-standard field set.

If your trainer expects something none of the above describe, share the spec on the scope call. Format conversion is part of the handover, not a separate workstream.

Tooling-agnostic by design

The right tool for the curation surface

Argilla for feedback-shaped corpora, Label Studio when the UI needs to be bespoke, Prodigy where active learning surfaces the next most-informative item to label, an in-house edit interface when nothing fits. The platform follows the work, not the other way around.

Argilla

LLM-feedback + curation workflows. First-class fit for instruction-tune data.

Label Studio

Flexible JSON-schema label configs. Good for custom curation UIs.

Prodigy

Active-learning + scripting. Strong on routing the next item to the right curator.

Hugging Face Hub

Versioned dataset hosting + community discovery for non-sensitive corpora.

LangSmith

Trace-driven curation when the prompts come from a live LangChain app.

In-house UI

Bespoke edit interface when the off-the-shelf tools do not fit the workflow.

Where regulated data residency rules out a SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same; the hosting posture changes.

Your handover pack

What ships with the corpus

A JSONL file on its own teaches your trainer nothing about how the corpus was built. The pack below makes the dataset auditable, reproducible, and extendable by whoever owns it after the project ships.

Every batch refreshes the pack. If you commission a one-shot project it arrives once. If you commission a continuous cadence it refreshes on every drop.

Curation guidelines

Versioned, example-rich, edge-case-explicit. Covers prompt construction, response standards, refusal voice, length budgets, and the brand register if you have one.

Prompt taxonomy

The intent + task-type + difficulty map your corpus is sampled against. Coverage report tells you which slices are thin so the next batch can fill them.
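A coverage report of this kind can be sketched as a count per taxonomy slice, compared against a minimum. The field names `intent` and `difficulty`, the function name, and the threshold are all assumptions for illustration; a real taxonomy would have more axes.

```python
from collections import Counter
from itertools import product


def thin_slices(items, intents, difficulties, min_count):
    """List (slice, count) pairs for taxonomy cells below the minimum count.

    Cells with zero items are included, since missing slices matter most.
    """
    counts = Counter((it["intent"], it["difficulty"]) for it in items)
    return [
        (cell, counts[cell])
        for cell in product(intents, difficulties)
        if counts[cell] < min_count
    ]


items = [{"intent": "refund", "difficulty": "easy"}] * 3 + [
    {"intent": "refund", "difficulty": "hard"}
]
report = thin_slices(items, ["refund", "billing"], ["easy", "hard"], min_count=2)
```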

Sampled QA set

A sealed sample from every batch, re-reviewed by a senior curator independent of the original pass. The signal that decides whether the batch ships or reworks.
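Drawing the sample reproducibly is straightforward with a seeded random generator. The sketch below is one possible approach, not a description of the actual QA process; the function name, the sampling rate, and the seed handling are assumptions.

```python
import random


def sample_for_qa(item_ids, rate: float, seed: int):
    """Draw a reproducible QA sample from a batch.

    Sorting first makes the draw independent of input order; a fixed seed
    lets the same sample be re-derived later for audit.
    """
    ids = sorted(item_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * rate))  # always review at least one item
    return sorted(random.Random(seed).sample(ids, k))


batch = [f"item-{i:03d}" for i in range(100)]
qa_sample = sample_for_qa(batch, rate=0.05, seed=7)
```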

Edit-history audit

For every edit-from-draft item: the original draft, the final edit, the hashed curator identity, the rationale string, and the time spent. The audit trail your model card can cite.
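The audit fields listed above could be modelled as one record per item. This is a minimal sketch, not the shipped schema: the class and field names are hypothetical, and salted SHA-256 is an assumed way of hashing the curator identity.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def hash_curator(curator_id: str, salt: str) -> str:
    """Salted SHA-256 so the audit trail never stores the raw identity."""
    return hashlib.sha256(f"{salt}:{curator_id}".encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EditRecord:
    item_id: str
    original_draft: str
    final_edit: str
    curator_hash: str
    rationale: str
    seconds_spent: int


record = EditRecord(
    item_id="cs-0001",
    original_draft="We sincerely and deeply regret to inform you...",
    final_edit="Your order was delayed.",
    curator_hash=hash_curator("curator-17", salt="project-salt"),
    rationale="Strip throat-clearing. Lead with the fact.",
    seconds_spent=95,
)
audit_line = json.dumps(asdict(record), ensure_ascii=False)  # one JSONL row
```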

Train / val split in target format

Pre-split corpus shipped in the chat template you asked for (Alpaca, OpenAssistant, ShareGPT, Vicuna, Llama 2 Chat / 3 Instruct, or your JSONL spec). Held-out validation kept clean of train.
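One way to keep validation clean of train is to assign the split from a hash of the normalised prompt, so near-identical prompts always land on the same side. This is an illustrative sketch under assumptions: the function names, the lowercase-and-collapse-whitespace normalisation, and the 10% default are not taken from the actual pipeline.

```python
import hashlib


def normalise(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(prompt.lower().split())


def split_of(prompt: str, val_fraction: float = 0.1) -> str:
    """Deterministically assign a prompt to 'train' or 'val'."""
    digest = hashlib.sha256(normalise(prompt).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "val" if bucket < val_fraction else "train"


def leaked_prompts(train_prompts, val_prompts):
    """Normalised prompts that appear on both sides of the split."""
    return {normalise(p) for p in train_prompts} & {normalise(p) for p in val_prompts}
```

Because the assignment depends only on the prompt text, adding later batches does not move existing items between splits.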

Eval rubric

The scoring rubric your post-training evals can grade against. Aligned with the curation guidelines so training data and evals are judged against the same bar.

How we engage

Pick the shape that fits your team

From end-to-end corpus delivery to time-boxed audit. The scope call confirms which shape fits; the statement of work names the deliverables.

Yobitel-led

We own the curation programme end-to-end

Guidelines, prompt taxonomy, curator pool, edit-from-draft loop, senior review, format conversion, dataset card. You receive a shipped corpus against a fixed quality bar in the format your trainer wants.

Collaborative

You bring the curators, we run the craft

Your in-house or contracted writers do the work. We provide the guidelines, the taxonomy, the QA loop, the format conversion, and the senior review bench. Best when you already have domain writers and want the methodology to lift.

Advisory

Time-boxed audit of an existing SFT corpus

Fixed-window review of the corpus you already have. We sample it, re-grade against a tighter rubric, surface the drift modes, and write a remediation plan. Best when an earlier fine-tune is not behaving the way the corpus promised.

Back to hub

Data annotation + RLHF preparation

The full annotation practice. Supervised labelling, preference data, eval set authoring, safety datasets, multimodal, synthetic generation. Instruction-tune curation sits inside it.

Downstream

Model training + fine-tuning (SFT consumption)

The training-run engineering that consumes the SFT corpus you commissioned here. Trainer wiring, eval-loss tracking, checkpoint discipline, downstream-eval gate.

Tell us what the corpus is for.

A short questionnaire covers volume, format target, quality bar, and engagement shape. Our instruction-tuning lead replies inside one working day with a candidate taxonomy, a tooling pick, and a curator-pool plan fitted to your timeline and sensitivity.

Prefer email? Contact us

Curator pool with SME bench across clinical, legal, and finance domains. Edit history captured as a first-class artefact, never retrofitted. Engagements scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).
