Annotation Practice · Multimodal
Image, video, and audio annotated by one operating model
Bounding boxes and masks on image. Keyframes, event timelines, and tracking on video. Speaker diarisation and ASR correction on audio. Per-modality labeller pools, the right tool per modality, frame-accurate time-sync where the modalities meet.
Tri-modality canvas
In review · One dataset · image, video, audio · shared annotators
Image · bbox
3 / 3 labelled
Video · timeline
00:00 → 02:00
event: object_handoff · 00:01:14 → 00:01:18
Audio · diarisation
2 speakers
S1: “Roger that” · 00:00:08
Cross-modal IAA on shared annotators: 0.84 · 7,200 frames + 480 audio min + 14k bboxes.
Real annotation samples · all three modalities
What lands in the trainer for each modality
A real image with bounding boxes, a real video keyframe + event timeline, and an audio waveform with two-speaker diarisation. Same operating discipline across modalities; the export format follows the consumer (COCO, WebVTT, ELAN .eaf).
car · 0.96
bicycle · 0.98
car · 0.91
truck · 0.89
Object detection on a real street scene. Class labels and per-box confidence, the way a Label Studio or CVAT export lands in your training pipeline.
00:00:00
00:00:14 · person enters
00:00:32
00:00:48
Keyframes plucked from a 60-second clip plus event-span markers on the timeline. Exports to WebVTT / SRT / ELAN .eaf depending on the downstream consumer.
+ 2 more utterances
Waveform with two-speaker diarisation bands. Each utterance carries the speaker tag plus the ASR-corrected transcript. Procedural render — no real PII audio surfaces on this page.
Image and video keyframe stills sampled from COCO 2017 validation set (Creative Commons Attribution 4.0). Audio waveform is procedurally generated for illustration. Bounding boxes, event markers, and diarisation bands hand-placed.
The shape of the work, per modality
Each modality is a distinct craft
The labellers, the tools, and the failure modes differ per modality. We treat each as its own discipline, then stitch them together where the trainer expects a joint item.
Image · bounding box, polygon, mask
From flat classification through tight polygons and per-pixel masks. We match the geometry to the trainer's expectation, not the labeller's preference. Exports to COCO, Pascal VOC, and YOLO.
bbox · polygon · semantic mask · instance mask
Video · keyframe, event timeline, tracking
Keyframe-and-interpolate for object tracking. Event timelines for action recognition. Tracklet IDs preserved across cuts. Annotated frames stay in step with the source video timecode.
keyframe · event span · tracklet id · timecode
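As a rough illustration of keyframe-and-interpolate, the sketch below fills in a box for every frame between two hand-labelled keyframes using plain linear interpolation. The `Keyframe` class and `interpolate_track` function are hypothetical names for this example, and linear interpolation is an assumption; production tools may use other schemes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyframe:
    """Hypothetical keyframe record: a hand-labelled box on one frame."""
    frame: int                                  # frame index in the source video
    box: tuple[float, float, float, float]      # (x, y, width, height) in pixels


def interpolate_track(start: Keyframe, end: Keyframe) -> dict[int, tuple[float, ...]]:
    """Linearly interpolate a box for every frame from start to end, inclusive."""
    if end.frame <= start.frame:
        raise ValueError("end keyframe must come after start keyframe")
    span = end.frame - start.frame
    boxes = {}
    for frame in range(start.frame, end.frame + 1):
        t = (frame - start.frame) / span        # 0.0 at start, 1.0 at end
        boxes[frame] = tuple(a + (b - a) * t for a, b in zip(start.box, end.box))
    return boxes


# Two keyframes ten frames apart; the frames in between are generated.
track = interpolate_track(
    Keyframe(frame=0, box=(0, 0, 10, 10)),
    Keyframe(frame=10, box=(100, 50, 20, 10)),
)
```

A reviewer would typically add an extra keyframe wherever the interpolated box visibly departs from the object, which is why the keyframes stay the source of truth.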
Audio · diarisation, ASR correction
Speaker turns segmented and labelled. ASR transcripts corrected against the audio with timestamps preserved. Multi-language pools where the corpus crosses scripts.
diarisation · ASR · timestamps · multi-language
OCR + document layout
Per-character OCR correction and per-region layout (heading, paragraph, table, figure). Reading order tracked so a downstream LLM sees the document the way a human would.
OCR · layout · reading order · tables
Point-cloud + LiDAR (AV teams)
3D bounding cuboids on LiDAR returns with optional 2D image-plane projection. For autonomy stacks where camera-only labels miss occluded objects.
3D cuboid · LiDAR · sensor fusion
Cross-modal aligned items
Items where one timeline spans image, video, and audio. Frame-accurate alignment between the bbox in the frame, the event span on the timeline, and the speaker turn on the audio track.
shared timeline · synced labels · joint item id
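One way to picture a joint item is as a record that puts the bbox frame, the event span, and the speaker turn on a single clock. The sketch below is a minimal illustration only: `JointItem`, `Span`, the field names, and the 30 fps figure are all assumptions for this example, not a schema we ship.

```python
from dataclasses import dataclass


def timecode_to_seconds(tc: str) -> float:
    """Convert an HH:MM:SS(.fff) timecode to seconds."""
    hours, minutes, seconds = tc.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


@dataclass(frozen=True)
class Span:
    """Hypothetical closed time interval in seconds on the shared timeline."""
    start_s: float
    end_s: float

    def contains(self, t: float) -> bool:
        return self.start_s <= t <= self.end_s

    def overlaps(self, other: "Span") -> bool:
        return self.start_s < other.end_s and other.start_s < self.end_s


@dataclass(frozen=True)
class JointItem:
    """Hypothetical joint item: one id, three labels, one timeline."""
    item_id: str
    fps: float
    bbox_frame: int       # frame index carrying the bounding box
    event: Span           # event span on the video timeline
    speaker_turn: Span    # diarised turn on the audio track

    def bbox_time_s(self) -> float:
        return self.bbox_frame / self.fps

    def is_aligned(self) -> bool:
        # The bbox frame must fall inside the event, and the turn must overlap it.
        return self.event.contains(self.bbox_time_s()) and self.event.overlaps(self.speaker_turn)


item = JointItem(
    item_id="demo-0001",                      # illustrative id
    fps=30.0,                                 # assumed frame rate
    bbox_frame=2250,                          # 75.0 s at 30 fps
    event=Span(timecode_to_seconds("00:01:14"), timecode_to_seconds("00:01:18")),
    speaker_turn=Span(73.5, 75.2),
)
```

A check like `item.is_aligned()` is the kind of assertion a joint QA sample can run automatically before a human reviewer looks at the item.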
Cross-modal trap modes
Where multimodal datasets quietly mis-align
Most multimodal corpora we audit hit some subset of these. The model trains, the single-modality evals look fine, and the cross-modal benchmark exposes the cracks. Knowing the traps exist is most of the win.
Modality-blind labeller pool corrupts every modality at once
What bad looks like
One labeller pool labelling image, video, and audio
What we design for
Per-modality pools with cross-trained adjudicators
A bbox labeller is not an audio diariser, and an ASR corrector is not a video tracker. Treating the workforce as fungible drops IAA across the board. We staff per modality and cross-train adjudicators on the joint items.
Format translation is where the labels quietly mutate
What bad looks like
Export COCO → convert to YOLO → lose mask geometry
What we design for
Native export per target, validated against the source
Every format conversion is a chance to drop information. Masks become bboxes. Polygons become rectangles. Timecodes become fractional seconds and lose precision. We export native per target and validate the round-trip before shipping.
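A round-trip check for one conversion might look like the sketch below. It converts a COCO-style box (`[x_min, y_min, width, height]` in pixels) to YOLO-style normalised centre coordinates, rounds as a text export would, converts back, and reports the worst per-coordinate error. The function names and the `decimals` parameter are assumptions for illustration; mask and polygon geometry is not covered, since a bbox-only format cannot carry it at all.

```python
def coco_to_yolo(box, img_w, img_h):
    """COCO [x_min, y_min, w, h] in pixels -> YOLO (x_c, y_c, w, h) normalised to 0..1."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_coco(box, img_w, img_h):
    """YOLO (x_c, y_c, w, h) normalised -> COCO [x_min, y_min, w, h] in pixels."""
    xc, yc, w, h = box
    box_w, box_h = w * img_w, h * img_h
    return (xc * img_w - box_w / 2, yc * img_h - box_h / 2, box_w, box_h)


def round_trip_error(box, img_w, img_h, decimals=6):
    """Worst per-coordinate pixel error after COCO -> YOLO text -> COCO.

    `decimals` models how many digits the exported .txt file keeps (an assumption).
    """
    yolo = tuple(round(v, decimals) for v in coco_to_yolo(box, img_w, img_h))
    back = yolo_to_coco(yolo, img_w, img_h)
    return max(abs(a - b) for a, b in zip(box, back))


source_box = (100, 50, 200, 120)
fine = round_trip_error(source_box, img_w=640, img_h=480, decimals=6)
coarse = round_trip_error(source_box, img_w=640, img_h=480, decimals=2)
```

With six decimals the error stays far below a pixel; with two decimals the same box drifts by more than a pixel, which is the kind of silent mutation a validation step is meant to catch.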
Single-modality QA on a multimodal item misses half the bugs
What bad looks like
Sample bboxes for QA, never re-listen to the audio
What we design for
QA passes by modality, plus a joint cross-modal sample
If the QA dashboard only samples one modality at a time, the cross-modal mis-alignments never surface. We run per-modality QA plus a smaller joint sample where the reviewer checks the bbox, the event span, and the speaker turn together.
Time-sync drift between video and audio breaks fusion models
What bad looks like
Video at 30 fps, audio re-sampled, 80 ms drift never measured
What we design for
Frame-accurate time-sync audit on every batch
When a model fuses video and audio it learns the alignment as a feature. Even sub-second drift erases that signal. Every batch with both modalities ships with a time-sync audit and a measured drift figure on the dataset card.
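One simple way to produce a drift figure is to compare sync markers that appear in both tracks, such as a clap visible on a frame and audible on the waveform. The sketch below assumes that approach; `drift_report`, the marker format, and the sample values are hypothetical, and other methods (for example cross-correlating the audio against a reference track) are equally valid.

```python
from statistics import mean


def drift_report(markers, fps):
    """Summarise audio-vs-video drift from paired sync markers.

    markers: list of (video_frame_index, audio_time_seconds) for the same physical event.
    Positive drift means the audio timestamp is later than the video frame time.
    """
    if not markers:
        raise ValueError("at least one sync marker is required")
    drifts_ms = [(audio_s - frame / fps) * 1000.0 for frame, audio_s in markers]
    worst = max(drifts_ms, key=abs)
    return {
        "mean_drift_ms": mean(drifts_ms),
        "max_abs_drift_ms": abs(worst),
        "max_abs_drift_frames": abs(worst) * fps / 1000.0,
    }


# Illustrative markers: drift grows from 0 ms to roughly 80 ms over one minute at 30 fps.
report = drift_report([(0, 0.000), (900, 30.040), (1800, 60.080)], fps=30.0)
```

At 30 fps a frame lasts about 33 ms, so the 80 ms case above works out to roughly 2.4 frames of mis-alignment by the end of the clip.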
Tool-per-modality matrix
We drive the tool that fits the modality
No single platform wins on every modality. We pick per modality, then stitch the outputs into a single dataset bundle. The matrix reads in seconds.
Image
Bounding box, polygon, semantic and instance segmentation. CVAT for open-source default and on-prem deploys; the commercial four where active learning or hosted scale matters.
Video
Keyframe interpolation, object tracking across cuts, event-timeline tagging. We lean towards Encord and Supervisely for long-form video; CVAT and V7 for shorter clips at higher volume.
Audio
Speaker diarisation, ASR correction, phonetic and prosodic markup. ELAN and Praat for linguistic depth; Label Studio and Audino for production-scale ASR-correction pipelines.
Document + OCR
OCR correction, layout-region tagging, reading-order chains. Label Studio and doccano for text-heavy; V7 for image-dense document layout with mixed scripts.
Where regulated data residency rules out a managed SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same; the hosting posture changes.
Formats we ship
Native export per target, validated round-trip
Every format conversion is a chance to mutate labels silently. We export native to your trainer's preferred format and ship the converter scripts so the round-trip is reproducible on your side too.
COCO
Image bbox + segmentation. Industry default for detection models.
Pascal VOC
Image bbox in XML. Legacy detector stacks.
YOLO
One-line-per-object .txt. Tight fit for YOLO-family trainers.
WebVTT
Time-coded subtitles + caption tracks. Web-native.
SRT
Time-coded subtitles, broad NLE compatibility.
ELAN .eaf
Multi-tier annotated transcript. Linguistic + research workflows.
Custom JSONL
When your trainer expects a bespoke schema.
Sidecar conversions
We ship the converter scripts as artefacts, not a black box.
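To give a sense of what a shipped converter can look like, here is a minimal sketch that writes event spans as WebVTT cues. The function names and the `(start_s, end_s, label)` tuple layout are assumptions for this example, not the scripts in the bundle. Timestamps are rounded to integer milliseconds once, at the point of formatting, so fractional-second noise does not leak into the cue text.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)            # single rounding step, integer maths after
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def events_to_webvtt(events) -> str:
    """Render (start_s, end_s, label) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]                      # required header, then a blank line
    for start_s, end_s, label in events:
        if end_s <= start_s:
            raise ValueError(f"event {label!r} must end after it starts")
        lines.append(f"{format_timestamp(start_s)} --> {format_timestamp(end_s)}")
        lines.append(label)
        lines.append("")                        # blank line closes the cue
    return "\n".join(lines)


vtt = events_to_webvtt([(74, 78, "object_handoff")])
```

Because the converter is ordinary code, the same input can be re-run on your side and diffed against the delivered file.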
Your handover pack
What ships with the multimodal dataset
A multimodal dataset on its own is a liability. The same dataset plus per-modality guidelines, format converters, cross-modal QA, time-sync audit, and a modality-aware dataset card is an artefact your training programme and your auditor can both work with.
Every batch ships with these artefacts. One-shot projects receive them once; steady cadences refresh per batch.
Per-modality annotation guidelines
Versioned guidelines per modality, with the cross-modal joint-item rules called out separately. Edge-case examples sourced from your corpus so a new labeller can ramp inside a working day.
Format-converter scripts
The scripts that translate our native export to your trainer's preferred format. Shipped as code in the dataset bundle so the round-trip is reproducible, not a vendor button-click.
Cross-modal QA dashboard
Per-modality IAA, per-labeller drift, per-batch joint-sample reviews. The signal that tells you whether to ship the batch or rework it.
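For readers who want to reproduce an agreement figure, the sketch below computes Cohen's kappa for two labellers on the same items. Kappa is one common IAA statistic and is used here as an assumption; the choice of metric per modality is not specified above, and the function name and sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labellers must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where both labellers chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each labeller's own class frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    classes = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0                              # both used a single identical class
    return (observed - expected) / (1 - expected)


perfect = cohens_kappa(["car", "car", "truck", "bicycle"],
                       ["car", "car", "truck", "bicycle"])
chance = cohens_kappa(["car", "car", "truck", "truck"],
                      ["car", "truck", "truck", "car"])
```

Running the same function per modality and per batch gives comparable numbers for the dashboard; geometry-based labels such as boxes and masks usually need an overlap threshold first to decide what counts as agreement.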
Time-sync audit
For any batch that crosses video and audio, a measured drift figure with the methodology used to produce it. Goes on the dataset card next to the IAA numbers.
Dataset card with modality split
Per-modality composition. Volume per modality. IAA per modality. Bias notes per modality. The artefact your model card cites and your auditor can read end-to-end.
Language + script coverage report
When audio or document is in scope, a per-language coverage report covering volume, labeller pool, and IAA per language. Catches the under-represented scripts before training does.
How we engage
Pick the shape that fits your team
From end-to-end programme delivery to a time-boxed audit of an existing dataset. The scope call confirms which fits; the statement of work names the deliverables.
Yobitel-led
We own the multimodal programme end-to-end
Per-modality guidelines, calibration sets, labeller pools, adjudication, cross-modal QA, time-sync audit, dataset card. You receive shipped batches against a fixed quality bar. Best when annotation is on the critical path of a training milestone.
Collaborative
You bring labellers, we run the craft
You provide an in-house or contracted labelling team. We own per-modality guidelines, calibration, IAA tracking, cross-modal QA, and the format-export pipeline. Best when you already operate labellers and want the methodology to lift.
Advisory
Time-boxed review of your existing dataset
Fixed-window audit of an existing multimodal dataset. We sample by modality, re-label a control, measure cross-modal drift, write a remediation plan. Best when last year's corpus is no longer holding up.
Back to hub
Data annotation + RLHF preparation
The full practice: supervised labelling, preference data, instruction-tune curation, eval sets, safety datasets, domain-grounded review, synthetic data.
Related
Model training + fine-tuning
The training-run engineering that consumes the multimodal dataset you commissioned here. SFT, vision-language post-training, ASR fine-tuning. Same bench across both.
Tell us which modalities the dataset has to cover.
A short questionnaire covers modality mix, volume per modality, format target, quality bar, and engagement shape. Our multimodal lead replies inside one working day with a calibration plan and a per-modality tooling stack fitted to your data sensitivity and timeline.
Per-modality labeller pools with cross-trained adjudicators on joint items. One dataset bundle, native exports per trainer target. Frame-accurate time-sync audit on every batch that crosses video and audio. Engagements scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).