Annotation Practice · Multimodal

Image, video, and audio annotated by one operating model

Bounding boxes and masks on image. Keyframes, event timelines, and tracking on video. Speaker diarisation and ASR correction on audio. Per-modality labeller pools, the right tool per modality, frame-accurate time-sync where the modalities meet.

See the tool-per-modality matrix

CVAT · V7 · Encord · Roboflow · SuperAnnotate · Supervisely

Label Studio · ELAN · Praat · Audino · Subtitle Edit

COCO · Pascal VOC · YOLO · WebVTT · SRT · ELAN .eaf

Tri-modality canvas

In review

One dataset · image, video, audio · shared annotators

Image · bbox

3 / 3 labelled

Video · timeline

00:00 → 02:00

event: object_handoff · 00:01:14 → 00:01:18

Audio · diarisation

2 speakers

S1: “Roger that” · 00:00:08

Cross-modal IAA on shared annotators: 0.84 · 7,200 frames + 480 audio min + 14k bboxes.

Real annotation samples · all three modalities

What lands in the trainer for each modality

A real image with bounding boxes, a real video keyframe + event timeline, and an audio waveform with two-speaker diarisation. Same operating discipline across modalities; the export format follows the consumer (COCO, WebVTT, ELAN .eaf).

Street scene with bicycle in foreground, parked cars and a delivery van in the background.

car 0.96 · bicycle 0.98 · car 0.91 · truck 0.89

Image · bounding boxes

COCO format

Object detection on a real street scene. Class labels and per-box confidence, as a Label Studio or CVAT export lands them in your training pipeline.

Keyframes: 00:00:00 · 00:00:14 (person enters) · 00:00:32 · 00:00:48

Scene + event timeline, 00:00:00 to 00:01:00: scene_start · person_enters · scene_cut · object_handoff · scene_end

Video · keyframes + events

WebVTT export

Keyframes plucked from a 60-second clip plus event-span markers on the timeline. Exports to WebVTT / SRT / ELAN .eaf depending on the downstream consumer.
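As a rough illustration of the WebVTT export, the sketch below turns event spans into cues. The function names are hypothetical, the object_handoff span is the one shown in the canvas above, and the end time of person_enters is assumed for the example.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hh, rem = divmod(total_ms, 3_600_000)
    mm, rem = divmod(rem, 60_000)
    ss, ms = divmod(rem, 1000)
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{ms:03d}"


def events_to_vtt(events) -> str:
    """Render (start_s, end_s, label) tuples as WebVTT cues in time order."""
    lines = ["WEBVTT", ""]  # header line, then a blank line before the first cue
    for start, end, label in sorted(events):
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(label)
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)


# Illustrative spans: person_enters end time is an assumption.
vtt = events_to_vtt([
    (74.0, 78.0, "object_handoff"),
    (14.0, 16.0, "person_enters"),
])
print(vtt)
```

SRT differs mainly in using numbered cues and a comma before the milliseconds, so a sibling writer can share the same span list.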

2 speakers · 5 utterances · 0:00 to 1:00

S1: Yeah, the trial enrolment is on track for Q3.

S2: How many sites are we live in now?

S1: Seven sites confirmed. London, Manchester, Berlin, two in NYC, San Francisco, and Tokyo.

+ 2 more utterances

Audio · diarisation + ASR

.eaf · .vtt

Waveform with two-speaker diarisation bands. Each utterance carries the speaker tag plus the ASR-corrected transcript. Procedural render — no real PII audio surfaces on this page.

Image and video keyframe stills sampled from COCO 2017 validation set (Creative Commons Attribution 4.0). Audio waveform is procedurally generated for illustration. Bounding boxes, event markers, and diarisation bands hand-placed.

The shape of the work, per modality

Each modality is a distinct craft

The labellers, the tools, and the failure modes differ per modality. We treat each as its own discipline, then stitch them together where the trainer expects a joint item.

Image · bounding box, polygon, mask

From flat classification through tight polygons and per-pixel masks. We match the geometry to the trainer's expectation, not the labeller's preference. COCO, Pascal VOC, and YOLO on export.

bbox · polygon · semantic mask · instance mask

Video · keyframe, event timeline, tracking

Keyframe-and-interpolate for object tracking. Event timelines for action recognition. Tracklet IDs preserved across cuts. Annotated frames stay in step with the source video timecode.
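A minimal sketch of keyframe-and-interpolate, assuming plain linear interpolation between labelled keyframes (annotation tools may use other schemes) and a non-drop-frame timecode at an integer frame rate. The function names and sample boxes are illustrative.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in pixels


def interpolate_boxes(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Linearly interpolate a box for every frame between labelled keyframes."""
    frames = sorted(keyframes)
    out: Dict[int, Box] = {}
    for start, end in zip(frames, frames[1:]):
        a, b = keyframes[start], keyframes[end]
        span = end - start
        for f in range(start, end):
            t = (f - start) / span  # 0.0 at the first keyframe, approaching 1.0
            out[f] = tuple(av + t * (bv - av) for av, bv in zip(a, b))
    if frames:
        out[frames[-1]] = keyframes[frames[-1]]  # keep the last keyframe as labelled
    return out


def frame_to_timecode(frame: int, fps: int = 30) -> str:
    """Non-drop-frame HH:MM:SS:FF timecode for an integer frame rate."""
    seconds, ff = divmod(frame, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


# Two hand-labelled keyframes, ten frames apart.
track = interpolate_boxes({
    0: (100.0, 50.0, 40.0, 80.0),
    10: (200.0, 50.0, 40.0, 80.0),
})
print(track[5], frame_to_timecode(2220))
```

Deriving the timecode from the frame index, rather than storing it separately, is one way to keep annotated frames in step with the source.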

keyframe · event span · tracklet id · timecode

Audio · diarisation, ASR correction

Speaker turns segmented and labelled. ASR transcripts corrected against the audio with timestamps preserved. Multi-language pools where the corpus crosses scripts.

diarisation · ASR · timestamps · multi-language

OCR + document layout

Per-character OCR correction and per-region layout (heading, paragraph, table, figure). Reading order tracked so a downstream LLM sees the document the way a human would.

OCR · layout · reading order · tables

Point-cloud + LiDAR (AV teams)

3D bounding cuboids on LiDAR returns with optional 2D image-plane projection. For autonomy stacks where camera-only labels miss occluded objects.

3D cuboid · LiDAR · sensor fusion

Cross-modal aligned items

Items where one timeline spans image, video, and audio. Frame-accurate alignment between the bbox in the frame, the event span on the timeline, and the speaker turn on the audio track.
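One way to picture a joint item is a single record that holds all three label types against one clock. The sketch below is a hypothetical schema, not a delivered format: the class names, fields, and sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoxLabel:
    frame: int   # video frame index the box was drawn on
    label: str


@dataclass
class Span:
    start_s: float
    end_s: float
    label: str

    def contains(self, t: float) -> bool:
        # Half-open interval so adjacent spans never both claim a boundary.
        return self.start_s <= t < self.end_s


@dataclass
class JointItem:
    item_id: str
    fps: float
    boxes: List[BoxLabel] = field(default_factory=list)
    events: List[Span] = field(default_factory=list)   # video event spans
    turns: List[Span] = field(default_factory=list)    # audio speaker turns

    def at_frame(self, frame: int) -> dict:
        """Collect every label that is active at the given video frame."""
        t = frame / self.fps
        return {
            "time_s": t,
            "boxes": [b.label for b in self.boxes if b.frame == frame],
            "events": [e.label for e in self.events if e.contains(t)],
            "speakers": [s.label for s in self.turns if s.contains(t)],
        }


item = JointItem(
    item_id="clip-0001",
    fps=30.0,
    boxes=[BoxLabel(frame=2250, label="person")],
    events=[Span(74.0, 78.0, "object_handoff")],
    turns=[Span(73.5, 76.0, "S1")],
)
view = item.at_frame(2250)
print(view)
```

A reviewer working on a joint sample can use a lookup like this to see the box, the event span, and the speaker turn side by side for one frame.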

shared timeline · synced labels · joint item id

Cross-modal trap modes

Where multimodal datasets quietly mis-align

Most multimodal corpora we audit hit some subset of these. The model trains, the single-modality evals look fine, and the cross-modal benchmark exposes the cracks. Knowing the traps exist is most of the win.

Modality-blind labeller pool corrupts every modality at once

What bad looks like

One labeller pool labelling image, video, and audio

What we design for

Per-modality pools with cross-trained adjudicators

A bbox labeller is not an audio diariser, and an ASR corrector is not a video tracker. Treating the workforce as fungible drops IAA across the board. We staff per modality and cross-train adjudicators on the joint items.

Format translation is where the labels quietly mutate

What bad looks like

Export COCO → convert to YOLO → lose mask geometry

What we design for

Native export per target, validated against the source

Every format conversion is a chance to drop information. Masks become bboxes. Polygons become rectangles. Timecodes become fractional seconds and lose precision. We export native per target and validate the round-trip before shipping.
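A round-trip check can be sketched as follows, assuming COCO boxes as pixel [x, y, width, height] from the top-left corner and YOLO boxes as centre x, centre y, width, height normalised by the image size. The function names and the sample box are illustrative, and the sketch covers boxes only, since a YOLO detection line has no slot for a mask.

```python
def coco_to_yolo(box, img_w, img_h):
    """COCO pixel [x, y, w, h] to YOLO normalised (cx, cy, w, h)."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_coco(box, img_w, img_h):
    """YOLO normalised (cx, cy, w, h) back to COCO pixel [x, y, w, h]."""
    cx, cy, w, h = box
    bw, bh = w * img_w, h * img_h
    return (cx * img_w - bw / 2, cy * img_h - bh / 2, bw, bh)


def round_trip_error(box, img_w, img_h, decimals=6):
    """Largest per-coordinate pixel error after writing YOLO values
    with a fixed number of decimals and converting back."""
    yolo = tuple(round(v, decimals) for v in coco_to_yolo(box, img_w, img_h))
    back = yolo_to_coco(yolo, img_w, img_h)
    return max(abs(a - b) for a, b in zip(box, back))


source = (120.0, 80.0, 64.0, 48.0)  # illustrative box on a 640x480 image
err_fine = round_trip_error(source, 640, 480, decimals=6)
err_coarse = round_trip_error(source, 640, 480, decimals=2)
print(f"6 decimals: {err_fine:.4f} px, 2 decimals: {err_coarse:.4f} px")
```

The same pattern extends to any converter: convert, convert back, and fail the batch if the error exceeds a tolerance agreed with the trainer's owner.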

Single-modality QA on a multimodal item misses half the bugs

What bad looks like

Sample bboxes for QA, never re-listen to the audio

What we design for

QA passes by modality, plus a joint cross-modal sample

If the QA dashboard only samples one modality at a time, the cross-modal mis-alignments never surface. We run per-modality QA plus a smaller joint sample where the reviewer checks the bbox, the event span, and the speaker turn together.

Time-sync drift between video and audio breaks fusion models

What bad looks like

Video at 30 fps, audio re-sampled, 80 ms drift never measured

What we design for

Frame-accurate time-sync audit on every batch

When a model fuses video and audio, it learns the alignment as a feature. Drift well under a second, such as 80 ms (more than two frames at 30 fps), is enough to degrade that signal. Every batch with both modalities ships with a time-sync audit and a measured drift figure on the dataset card.
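The page does not spell out how drift is measured, so the sketch below assumes one simple approach: paired sync markers (for example a clap) located once as a video frame index and once as an audio sample index. The function name, marker values, and the 48 kHz sample rate are assumptions for illustration.

```python
from statistics import mean


def drift_report(video_frames, audio_samples, fps=30.0, sample_rate=48_000):
    """Compare where each sync marker falls on the video and audio clocks.

    Positive drift means the audio marker lands later than the video marker.
    """
    if len(video_frames) != len(audio_samples) or not video_frames:
        raise ValueError("need one audio marker per video marker")
    drifts_ms = [
        (a / sample_rate - v / fps) * 1000.0
        for v, a in zip(video_frames, audio_samples)
    ]
    worst = max(drifts_ms, key=abs)
    return {
        "mean_ms": mean(drifts_ms),
        "max_abs_ms": abs(worst),
        "max_abs_frames": abs(worst) * fps / 1000.0,
    }


# Three hypothetical markers at 1 s, 30 s, and 59 s of video time,
# with the audio falling progressively behind.
report = drift_report(
    video_frames=[30, 900, 1770],
    audio_samples=[48_000, 1_441_920, 2_835_840],
)
print(report)
```

Reporting the worst case in frames as well as milliseconds makes the figure easier to judge against a frame-accurate target.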

Tool-per-modality matrix

We drive the tool that fits the modality

No single platform wins on every modality. We pick per modality, then stitch the outputs into a single dataset bundle. The matrix reads in seconds.

Image

CVAT · V7 · Encord · Roboflow · SuperAnnotate

Bounding box, polygon, semantic and instance segmentation. CVAT for open-source default and on-prem deploys; the commercial four where active learning or hosted scale matters.

Video

CVAT · V7 · Encord · Supervisely

Keyframe interpolation, object tracking across cuts, event-timeline tagging. Encord and Supervisely suit long-form video; CVAT and V7 suit shorter clips at higher volume.

Audio

Label Studio · ELAN · Praat · Audino

Speaker diarisation, ASR correction, phonetic and prosodic markup. ELAN and Praat for linguistic depth; Label Studio and Audino for production-scale ASR-correction pipelines.

Document + OCR

Label Studio · doccano · V7

OCR correction, layout-region tagging, reading-order chains. Label Studio and doccano for text-heavy; V7 for image-dense document layout with mixed scripts.

Where regulated data residency rules out a managed SaaS tool, we deploy the self-hostable equivalent into your perimeter. The methodology is the same; the hosting posture changes.

Formats we ship

Native export per target, validated round-trip

Every format conversion is a chance to mutate labels silently. We export native to your trainer's preferred format and ship the converter scripts so the round-trip is reproducible on your side too.

COCO

Image bbox + segmentation. Industry default for detection models.

Pascal VOC

Image bbox in XML. Legacy detector stacks.

YOLO

One-line-per-object .txt. Tight fit for YOLO-family trainers.

WebVTT

Time-coded subtitles + caption tracks. Web-native.

SRT

Time-coded subtitles, broad NLE compatibility.

ELAN .eaf

Multi-tier annotated transcript. Linguistic + research workflows.

Custom JSONL

When your trainer expects a bespoke schema.

Sidecar conversions

We ship the converter scripts as artefacts, not a black box.

Your handover pack

What ships with the multimodal dataset

A multimodal dataset on its own is a liability. The same dataset plus per-modality guidelines, format converters, cross-modal QA, time-sync audit, and a modality-aware dataset card is an artefact your training programme and your auditor can both work with.

Every batch ships with these artefacts. One-shot projects receive them once; steady cadences refresh per batch.

Per-modality annotation guidelines

Versioned guidelines per modality, with the cross-modal joint-item rules called out separately. Edge-case examples sourced from your corpus so a new labeller can ramp inside a working day.

Format-converter scripts

The scripts that translate our native export to your trainer's preferred format. Shipped as code in the dataset bundle so the round-trip is reproducible, not a vendor button-click.

Cross-modal QA dashboard

Per-modality IAA, per-labeller drift, per-batch joint-sample reviews. The signal that tells you whether to ship the batch or rework it.
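The page does not name the agreement statistic behind its IAA figures. As one common choice for two labellers assigning categorical labels, Cohen's kappa can be sketched as below; the function name and the label lists are made up for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labellers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labellers must label the same non-empty item list")
    n = len(labels_a)
    # Share of items where the two labellers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each labeller's class frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical class throughout
    return (observed - expected) / (1 - expected)


# Hypothetical class labels from two labellers on eight boxes.
a = ["car", "car", "bicycle", "truck", "car", "bicycle", "car", "truck"]
b = ["car", "car", "bicycle", "truck", "car", "car", "car", "truck"]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))
```

Box geometry and time spans usually need an overlap-based measure instead, so a per-modality dashboard would likely carry more than one statistic.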

Time-sync audit

For any batch that crosses video and audio, a measured drift figure with the methodology used to produce it. Goes on the dataset card next to the IAA numbers.

Dataset card with modality split

Per-modality composition. Volume per modality. IAA per modality. Bias notes per modality. The artefact your model card cites and your auditor can read end-to-end.

Language + script coverage report

When audio or document is in scope, a per-language coverage report covering volume, labeller pool, and IAA per language. Catches the under-represented scripts before training does.

How we engage

Pick the shape that fits your team

From end-to-end programme delivery to a time-boxed audit of an existing dataset. The scope call confirms which fits; the statement of work names the deliverables.

Yobitel-led

We own the multimodal programme end-to-end

Per-modality guidelines, calibration sets, labeller pools, adjudication, cross-modal QA, time-sync audit, dataset card. You receive shipped batches against a fixed quality bar. Best when annotation is on the critical path of a training milestone.

Collaborative

You bring labellers, we run the craft

You provide an in-house or contracted labelling team. We own per-modality guidelines, calibration, IAA tracking, cross-modal QA, and the format-export pipeline. Best when you already operate labellers and want to lift the methodology.

Advisory

Time-boxed review of your existing dataset

Fixed-window audit of an existing multimodal dataset. We sample by modality, re-label a control, measure cross-modal drift, write a remediation plan. Best when last year's corpus is no longer holding up.

Back to hub

Data annotation + RLHF preparation

The full practice: supervised labelling, preference data, instruction-tune curation, eval sets, safety datasets, domain-grounded review, synthetic data.

Model training + fine-tuning

The training-run engineering that consumes the multimodal dataset you commissioned here. SFT, vision-language post-training, ASR fine-tuning. Same bench across both.

Tell us which modalities the dataset has to cover.

A short questionnaire covers modality mix, volume per modality, format target, quality bar, and engagement shape. Our multimodal lead replies inside one working day with a calibration plan and a per-modality tooling stack fitted to your data sensitivity and timeline.

Prefer email? Contact us

Per-modality labeller pools with cross-trained adjudicators on joint items. One dataset bundle, native exports per trainer target. Frame-accurate time-sync audit on every batch that crosses video and audio. Engagements scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).
