TL;DR
- Originally created at Huawei in 2019, Volcano is the de facto batch scheduler for AI / HPC workloads on Kubernetes: CNCF Sandbox since 2020, Incubating since 2022, Apache 2.0, written in Go.
- Replaces the default kube-scheduler for pods that opt in via `schedulerName: volcano`; introduces `Job` and `PodGroup` CRDs with `minAvailable`, `minMember`, `minResources`, plugins (ssh / svc / env / tensorflow / pytorch / mpi), tasks and per-task replicas.
- Gang scheduling, queue-based DRF fair-share, preemption with reclaim, topology-aware placement on NVLink / NVSwitch / rack domains, reservation + backfill — every primitive an MPI or NCCL workload needs to start atomically.
- First-class plugins for PyTorchJob, MPIJob, TensorFlow, Spark, Ray, Flink and the Kubeflow Training Operator; the canonical pairing in production is Volcano (scheduler) + KubeRay or Kubeflow (framework) + NVIDIA GPU Operator (hardware layer).
- Yobibyte runs Volcano internally as part of its scheduling substrate, so Yobitel NeoCloud customers never see a partial pod-group admission — distributed training and tensor-parallel inference launch atomically or stay queued.
Overview
Volcano is a Kubernetes-native batch scheduler purpose-built for AI / ML / HPC / Big Data workloads. The default kube-scheduler optimises for stateless single-pod workloads — REST services, ingestion daemons, control-plane controllers — and treats every pod as an independent placement decision. That model breaks the moment you submit a distributed training job that needs 64 H100 GPUs across eight nodes to launch atomically: the scheduler admits the first 60 pods, the cluster runs out of fitting GPUs, the remaining four sit Pending, and the 60 admitted ranks burn GPU-hours waiting for the rendezvous that will never complete.
Volcano was created at Huawei in 2019, growing out of the earlier `kube-batch` project, to fix exactly this class of problem. It introduces a `PodGroup` abstraction with `minMember` and `minResources` so the scheduler can reason about "the whole job is admitted or none of it is"; layers a queue-based fair-share model with DRF (Dominant Resource Fairness) across GPU + CPU + memory; adds topology-aware placement so an eight-rank tensor-parallel job lands on a single NVLink / NVSwitch island; and provides HPC-style preemption with reclaim so a high-priority production training run can evict opportunistic backfill without losing accounting fidelity.
Volcano joined CNCF as a Sandbox project in 2020 and was promoted to Incubating in 2022. By mid-2026 it is on v1.10.x, supports Kubernetes 1.27-1.33, and ships pre-built integrations for PyTorch, TensorFlow, MPI (Horovod / OpenMPI), Ray, Spark, Flink and the Kubeflow Training Operator. It is dual-purpose: it can either fully replace the default scheduler in a Volcano-only namespace, or run side-by-side with kube-scheduler in a mixed cluster where only workloads that opt in via `schedulerName: volcano` use Volcano's placement logic.
This entry helps you decide when Volcano is the right addition to a Kubernetes cluster, how to wire it up against the NVIDIA GPU Operator and your training operators, how to size the resulting queue / cohort plane, and how its job model differs from the lighter-weight Kueue alternative. Yobibyte runs Volcano under the hood across every Yobitel NeoCloud region so that Yobibyte customers never experience a partial gang-admission failure — this entry documents the surface for teams that operate their own clusters or want to understand what Yobibyte provides on their behalf.
Quick start
The fastest sane path is the upstream Helm install plus a single `Job` (Volcano's own CRD, distinct from `batch/v1 Job`) running a four-replica worker task; the container just sleeps, standing in for a real MPI or NCCL command. The five steps below install Volcano, check the control plane, define a queue, submit a gang-scheduled job and observe the admission decision. Run them against a cluster that already has the NVIDIA GPU Operator installed and at least four `nvidia.com/gpu` resources free.
```bash
# 1. Install Volcano via the upstream Helm chart
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install --wait volcano volcano-sh/volcano \
  --version "1.10.0" \
  --namespace volcano-system --create-namespace

# 2. Confirm the controller, scheduler and admission webhook are Ready
kubectl -n volcano-system get pods

# 3. Create a queue with weight 4 and a 16-GPU capability ceiling
cat <<'YAML' | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: training }
spec:
  weight: 4
  capability:
    nvidia.com/gpu: "16"
YAML

# 4. Submit a gang-scheduled job (4 workers, all-or-nothing)
cat <<'YAML' | kubectl apply -f -
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: nccl-smoke }
spec:
  schedulerName: volcano
  minAvailable: 4
  queue: training
  plugins:
    ssh: []
    svc: []
    env: []
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: nccl-test
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["sleep", "infinity"]
              resources:
                limits: { nvidia.com/gpu: 1 }
YAML

# 5. Inspect the gang-admission decision
kubectl get podgroup
# Use the PodGroup name printed above; recent releases may append a UID suffix to the Job name
kubectl describe podgroup nccl-smoke
kubectl get pods -l volcano.sh/job-name=nccl-smoke
```
Always pair Volcano with the NVIDIA GPU Operator and Node Feature Discovery (NFD). Without `nvidia.com/gpu.nvlink.domain` and similar topology labels, Volcano's `topology-aware` plugin has nothing to optimise against: gang admission still works, but placement falls back to default spread. See `nvidia-gpu-operator` for the install.
How it works
Internally Volcano is three components running in `volcano-system`: the `vc-controller-manager` (reconciles `Job` and `PodGroup` CRDs into pods), the `vc-scheduler` (the scheduler itself — replaces or supplements kube-scheduler), and the `vc-webhook-manager` (validating + mutating admission for Volcano CRDs). The scheduler is built around a session model: every scheduling tick (default 1 s) opens a session, walks a configurable pipeline of actions (`enqueue`, `allocate`, `backfill`, `preempt`, `reclaim`), and closes the session by committing the resulting bindings to the API server. Plugins implement the policy each action consults: `gang`, `priority`, `drf`, `predicates`, `nodeorder`, `proportion`, `binpack`, `topology-aware`, `numa-aware`, `task-topology`.
The `gang` plugin is the headline. A `PodGroup` (created automatically by Volcano's `Job`, or explicitly for raw pods) declares `minMember` (populated from the Job's `minAvailable`) and `minResources`. The scheduler will not transition the group to `Inqueue` (allowed to be admitted) until enough total cluster capacity exists; will not start binding individual pods until the entire group can be bound; and will preempt or reclaim only when the gang as a whole can be satisfied after the eviction. This eliminates the partial-admission deadlock that breaks the default scheduler for distributed training.
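As a rough illustration of the explicit path for raw pods, the sketch below declares a `PodGroup` and one member pod. The names `raw-gang` and `raw-gang-0`, the namespace and the container are hypothetical. The `scheduling.k8s.io/group-name` annotation is how upstream Volcano associates a plain pod with a group, but verify it against the release you run.

```yaml
# Hypothetical names throughout; a sketch, not a tested manifest.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: raw-gang
  namespace: default
spec:
  minMember: 4               # gang size: bind all four pods or none
  minResources:
    nvidia.com/gpu: "4"      # aggregate floor for the whole group
  queue: training
---
apiVersion: v1
kind: Pod
metadata:
  name: raw-gang-0           # repeat for raw-gang-1 .. raw-gang-3
  namespace: default
  annotations:
    scheduling.k8s.io/group-name: raw-gang   # ties the pod to the PodGroup
spec:
  schedulerName: volcano
  containers:
    - name: worker
      image: nvcr.io/nvidia/pytorch:24.10-py3
      command: ["sleep", "infinity"]
      resources:
        limits: { nvidia.com/gpu: 1 }
```

With fewer than four such pods created, the group should stay un-admitted rather than binding a subset.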
Queues are first-class. A `Queue` CRD has a `weight` (DRF share), `capability` (hard ceiling per resource), `reclaimable` flag (can higher-priority queues claw back from this queue?) and `priority`. The `proportion` plugin allocates fair shares across queues per resource dimension; the `drf` plugin handles the multi-dimensional case (GPU + CPU + memory) so a CPU-heavy data-prep job and a GPU-heavy training job get fair allocations on the dimension that dominates each. The `reclaim` action then enforces the contract — when a higher-priority queue arrives, opportunistic borrowers are evicted in a deterministic order until the guarantee is restored.
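The two fairness notions in this paragraph can be written down compactly. The notation is ours, not Volcano's: $w_q$ is the weight of queue $q$, $a_{j,r}$ is the amount of resource $r$ allocated to job $j$, and $C_r$ is the cluster total of $r$. The worked numbers (weights 8, 4, 4; a cluster of 64 GPUs and 1024 cores) are assumptions chosen only for illustration.

```latex
% Weighted share of queue q under full contention: weight / sum(weights)
\text{share}_q = \frac{w_q}{\sum_{k} w_k}

% Dominant share of job j under DRF: its largest fractional use of any single resource
s_j = \max_{r} \frac{a_{j,r}}{C_r}

% Worked example (assumed numbers): a queue with weight 8 among weights 8, 4, 4,
% and a training job holding 8 of 64 GPUs and 32 of 1024 cores
\text{share}_{\text{acme}} = \frac{8}{8 + 4 + 4} = 0.5
\qquad
s_{\text{train}} = \max\!\left(\frac{8}{64},\ \frac{32}{1024}\right) = \frac{1}{8}
```

In the example the training job is GPU-dominant, so DRF compares it against other jobs on its 1/8 GPU share, not on its much smaller CPU share.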
Topology awareness wires Volcano to the underlying fabric. The `topology-aware` plugin reads node labels emitted by NFD and the GPU Operator (`nvidia.com/gpu.nvlink.domain`, `topology.kubernetes.io/zone`, custom `volcano.sh/topology=rack-3-leaf-4`) and groups nodes into hierarchical domains. The `task-topology` plugin then bin-packs an MPI / NCCL job into the smallest fitting domain — an 8-rank tensor-parallel job lands on a single 8-way NVLink island, a 64-rank pipeline-parallel job lands on a single rack, a 512-rank pretraining job picks the smallest spine-leaf cluster that fits.
- Session-based scheduling — actions execute per tick over a snapshot, commits are atomic; failed bindings are reverted without leaking partial state.
- Plugins are config-driven via the `volcano-scheduler-configmap` — enable / disable / reorder without rebuilding the binary.
- Plugin extension model — `gang`, `priority`, `drf`, `binpack`, `topology-aware`, `numa-aware`, `task-topology`, `tdm` (time-division multiplexing), `proportion`, `overcommit`, `usage`, `rescheduling`.
- Per-task templates — a `Job` can declare multiple `tasks` (e.g. `master` + `worker` + `param-server`), each with its own replica count, image and resource shape, all admitted together as a single gang.
- Built-in plugins for framework idioms — `pytorch` (sets `MASTER_ADDR` / `RANK` / `WORLD_SIZE`), `tensorflow` (TF_CONFIG), `mpi` (mpirun rendezvous), `ssh` (SSH key fan-out), `svc` (headless service for collective bootstrap), `env` (rank-aware env vars).
- Preemption is policy-driven: `priority` + `victim` selection minimises killed pods; `tdm` preempts on a time-share rather than killing outright.
Volcano's scheduler is not a drop-in replacement for kube-scheduler — pods must opt in via `schedulerName: volcano`. This is by design: a typical cluster runs Volcano for batch / training workloads and kube-scheduler for services, with admission webhooks routing pods to the right scheduler based on namespace or label.
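To make the config-driven pipeline concrete, here is a sketch of a scheduler ConfigMap that adds `preempt` and `reclaim` to the action list. The tier layout follows the general shape of the upstream default, but plugin names and ordering change between releases, so treat it as an assumption to check against the ConfigMap your install actually ships.

```yaml
# A sketch under stated assumptions; diff against your installed ConfigMap before applying.
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    # Actions run in this order on every session
    actions: "enqueue, allocate, preempt, reclaim, backfill"
    tiers:
      - plugins:
          - name: priority
          - name: gang
          - name: conformance
      - plugins:
          - name: drf
          - name: predicates
          - name: proportion
          - name: nodeorder
          - name: binpack
```

Keeping `gang` in the first tier means its verdict is consulted before the fairness and node-scoring plugins in the second tier.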
Reference and specifications
The fields below are the Volcano CRD surface that matters in production. The reference covers `Job` (batch.volcano.sh/v1alpha1), `PodGroup` (scheduling.volcano.sh/v1beta1), `Queue` (scheduling.volcano.sh/v1beta1) and the scheduler ConfigMap. Defaults are taken from v1.10.0.
| Resource / field | Type | Default | Purpose |
|---|---|---|---|
| Job.spec.schedulerName | string | volcano | Pin the job to the Volcano scheduler; required for gang. |
| Job.spec.minAvailable | int | (required) | Minimum number of tasks that must be admitted for the gang to run. |
| Job.spec.queue | string | default | Which `Queue` consumes the job's resource share. |
| Job.spec.priorityClassName | string | (none) | Standard Kubernetes PriorityClass; drives preemption order. |
| Job.spec.policies | list | [] | Lifecycle policies — restart-on-failure, restart-on-pod-evicted, etc. |
| Job.spec.plugins.ssh | object | disabled | Generate SSH keys and inject into pods for mpirun rendezvous. |
| Job.spec.plugins.svc | object | disabled | Create a headless Service for collective bootstrap. |
| Job.spec.plugins.env | object | disabled | Inject the task index as `VC_TASK_INDEX` (and legacy `VK_TASK_INDEX`); the `svc` plugin adds `VC_<task>_HOSTS` / `VC_<task>_NUM`. |
| Job.spec.plugins.pytorch | object | disabled | Set `MASTER_ADDR` / `MASTER_PORT` / `RANK` / `WORLD_SIZE` for torchrun. |
| Job.spec.plugins.tensorflow | object | disabled | Build TF_CONFIG cluster spec across worker / ps / chief tasks. |
| Job.spec.plugins.mpi | object | disabled | Configure master/worker tasks for an mpirun-style launch. |
| Job.spec.tasks[].name | string | (required) | Logical task name (e.g. `master`, `worker`, `ps`). |
| Job.spec.tasks[].replicas | int | (required) | How many pods of this task. |
| Job.spec.tasks[].template | PodTemplate | (required) | Standard PodSpec for this task — containers, resources, volumes. |
| Job.spec.tasks[].policies | list | [] | Per-task restart policies; can override Job-level. |
| Job.spec.maxRetry | int | 3 | Job-level retry count before terminal failure. |
| PodGroup.spec.minMember | int | (required) | Minimum pods for gang admission — set by Job, or explicit for raw pods. |
| PodGroup.spec.minResources | ResourceList | (optional) | Minimum aggregate resources required — used for preemption sizing. |
| PodGroup.spec.queue | string | default | Queue this group consumes from. |
| PodGroup.spec.priorityClassName | string | (none) | Drives reclaim victim selection. |
| Queue.spec.weight | int | 1 | Relative DRF share — queue's resources / total = weight / sum(weights). |
| Queue.spec.capability | ResourceList | (unlimited) | Hard ceiling per resource (`nvidia.com/gpu`, `cpu`, `memory`). |
| Queue.spec.reclaimable | bool | true | If false, this queue's resources are never preempted. |
| Queue.spec.priority | int | 0 | Tie-breaker when DRF shares are equal. |
| Queue.spec.guarantee.resource | ResourceList | (optional) | Minimum resources guaranteed even under preemption. |
| scheduler ConfigMap.actions | string | enqueue,allocate,backfill | Pipeline of actions per tick; add `preempt`, `reclaim` for HPC patterns. |
| scheduler ConfigMap.tiers[].plugins | list | see default | Plugin list per priority tier — gang, drf, priority, predicates, nodeorder. |
| scheduler ConfigMap.schedulerPeriod | duration | 1s | Session frequency; lower for high-churn clusters. |
Setting `minAvailable` to the sum of `spec.tasks[*].replicas` (i.e. all tasks must start) is the safe default for distributed training. Setting it lower than the total replicas turns on elastic-batch semantics: useful for hyperparameter sweeps where partial completion is acceptable, dangerous for tensor-parallel training where every rank is mandatory.
Workload patterns
Three patterns cover the bulk of production Volcano deployments on Yobitel-operated clusters and on the upstream community. Each pattern uses a different combination of plugins and a different queue topology; pick the one closest to your dominant workload.
Pattern A — distributed PyTorch + Horovod gang training. The canonical Volcano use case. A single `Job` declares one or more `worker` tasks at the desired tensor / data-parallel scale, with `minAvailable` set to the full replica count and the `pytorch` (or `mpi`) plugin enabled. Horovod's `mpirun` finds peers through the headless service the `svc` plugin creates; `ssh` plugin handles key fan-out. Volcano holds the gang until the topology-aware plugin can land all ranks on the same NVLink island (or, for >8-way jobs, the same rack with InfiniBand RDMA paths).
Pattern B — multi-queue tenant fair-share on shared GPU capacity. Each tenant gets a `Queue` with a `weight` proportional to their committed share and a `capability` ceiling. Opportunistic backfill is allowed via `reclaimable: true`. Tenants submit `Job`s into their own queue; DRF allocates fair shares; when a higher-priority queue arrives, the `reclaim` action evicts opportunistic borrowers in priority order. This is the substrate Yobitel sovereign tenancies use to publish guaranteed GPU shares per tenant while keeping average utilisation high.
Pattern C — heterogeneous task types in one job. Training Operator and KubeRay drive their own pod lifecycles, but you can use Volcano's `Job` directly for jobs that have asymmetric task types — e.g. one `master` (parameter server, FP32) + many `worker` (FP16, NCCL) + a `metrics-sidecar` (CPU-only, scrapes worker telemetry to Prometheus). Each `task` declares its own replicas, template and policies; the gang admits them together.
```yaml
# Pattern A: distributed PyTorch on 8x H100 with gang scheduling (one GPU per pod)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: llama-pretrain }
spec:
  schedulerName: volcano
  minAvailable: 8            # 1 master + 7 workers, all-or-nothing
  queue: training
  priorityClassName: high-priority
  plugins:
    ssh: []
    svc: []
    env: []
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.10-py3
              # Multi-node rendezvous uses the env vars injected by the pytorch plugin;
              # --standalone would confine torchrun to a single node
              command: ["sh", "-c", "torchrun --nnodes=$WORLD_SIZE --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=1 /app/pretrain.py"]
              resources:
                limits: { nvidia.com/gpu: 1, hugepages-1Gi: 16Gi }
    - name: worker
      replicas: 7
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["sh", "-c", "torchrun --nnodes=$WORLD_SIZE --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=1 /app/pretrain.py"]
              resources:
                limits: { nvidia.com/gpu: 1, hugepages-1Gi: 16Gi }
---
# Pattern B: tenant queues with weighted fair-share and hard ceilings
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: tenant-acme }
spec:
  weight: 8                  # 8/16 of cluster GPU share under contention, if all queue weights sum to 16
  reclaimable: true          # opportunistic borrow allowed
  capability:                # hard ceiling
    nvidia.com/gpu: "64"
  guarantee:                 # never preempted below this floor
    resource:
      nvidia.com/gpu: "16"
```
For Pattern A on multi-node tensor-parallel jobs, enable the `task-topology` plugin in the scheduler ConfigMap and label nodes with `volcano.sh/topology=island-3`. The scheduler will pack the eight ranks into the same NVLink island first, falling back to InfiniBand-connected racks only if no island has eight free H100s. This matches what Yobibyte does internally on Yobitel NeoCloud.
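Pattern C is described only in prose, so here is a minimal sketch of what an asymmetric job could look like. Task names follow the description of Pattern C; the job name, the `/app/...` paths, the exporter image and the resource figures are placeholders invented for illustration.

```yaml
# Pattern C sketch: three task types admitted as one gang (all names and paths hypothetical)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: hetero-train }
spec:
  schedulerName: volcano
  minAvailable: 6            # 1 master + 4 workers + 1 sidecar
  queue: training
  plugins:
    svc: []
    env: []
  tasks:
    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: ps
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["python", "/app/param_server.py"]   # placeholder path
              resources:
                limits: { nvidia.com/gpu: 1 }
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["python", "/app/worker.py"]         # placeholder path
              resources:
                limits: { nvidia.com/gpu: 1 }
    - name: metrics-sidecar
      replicas: 1
      template:
        spec:
          containers:
            - name: exporter
              image: example.com/metrics-exporter:latest    # placeholder image
              resources:
                requests: { cpu: "500m", memory: 512Mi }    # CPU-only task
```

Because `minAvailable` equals the sum of replicas across all three tasks, the CPU-only sidecar is held back until the GPU tasks can also be placed.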
Sizing and capacity planning
Volcano's footprint is modest. The control plane runs three deployments in `volcano-system` (controller-manager, scheduler, webhook-manager) that together cost roughly 1-2 vCPU and 2-4 GiB at steady state on a 100-node cluster. The scheduler is the dominant component: it walks the snapshot every tick, and the work scales roughly as O(pods × nodes × plugins). Past ~1,000 pods you should raise the scheduler's CPU and memory requests; past ~5,000 pods or ~500 nodes you should partition the workload (by queue or node pool) rather than add replicas, because HA replicas use leader election with only one active at a time and so add availability, not throughput.
- Single Volcano scheduler handles ~5,000 pods + 500 nodes comfortably; beyond that, partition by queue or by namespace label.
- Schedule period (default 1 s) trades latency for CPU: drop to 500 ms for high-churn clusters; raise to 2 s if sessions routinely take longer than the period.
- Action pipeline matters: enable only what you need. Adding `preempt` and `reclaim` to a small homogeneous cluster wastes session time without measurable benefit.
- Each Volcano `Job` is a single etcd object plus N pods plus 1 `PodGroup` — overhead is dominated by pods, not by the CRDs themselves.
- On Yobitel NeoCloud regions Yobibyte uses, the scheduler runs HA with two replicas and a 750 ms tick — sized for the typical workspace mix on H100 + H200 capacity.
| Component | CPU | Memory | Notes |
|---|---|---|---|
| vc-controller-manager | 300-600 mCPU | 512 MiB - 1 GiB | Reconciles Job / PodGroup CRDs to pods. |
| vc-scheduler | 500 mCPU - 2 vCPU | 1-4 GiB | Scales with pods × nodes × plugins; HA via leader election. |
| vc-webhook-manager | 100-200 mCPU | 128-256 MiB | Validating + mutating admission for Volcano CRDs. |
| Per session cost | n/a | n/a | ~50-200 ms per tick on 100-node, 500-pod cluster. |
| Per Job overhead | n/a | ~5-10 KiB etcd | PodGroup + Job CRDs are small; reclaim events generate audit entries. |
Limits and quotas
Volcano's quota model is the `Queue.spec.capability` field — a hard ceiling on resources the queue can hold across all admitted gangs. Combined with Kubernetes `ResourceQuota` (namespace-level cap on counts and resources) and NetworkPolicy (tenant isolation), queues form the basis of hard-isolated multi-tenant GPU clusters. The limits below are the practical envelope teams hit in production.
| Dimension | Soft limit | Hard limit | Mitigation |
|---|---|---|---|
| Pods scheduled by a single Volcano scheduler | ~5,000 | ~10,000 | Partition by queue selector; deploy multiple Volcano installs per cell. |
| Nodes per scheduler snapshot | ~500 | ~1,500 | Use `nodeSelector` filtering and shard by node-pool label. |
| Queues per cluster | ~50 | ~500 | DRF cost grows with queue count; aggregate tiny tenants under a shared queue. |
| minMember per PodGroup | ~256 | ~1,024 | Past 1,024-rank gangs, partition into sub-jobs with explicit synchronisation. |
| Schedule tick (default 1 s) | 500 ms | 100 ms | Sub-second ticks burn CPU; profile before lowering. |
| Plugins enabled in pipeline | ~8 | ~12 | Each plugin runs per pod per node; pruning the pipeline often beats raising resources. |
| Webhook latency budget | ~100 ms | ~500 ms | Slow admission webhooks throttle Job creation; tune timeouts. |
| Reclaim cascade depth | ~3 | ~10 | Deep preemption chains thrash; cap with `reclaim.tolerance`. |
Yobibyte exposes Volcano's queue capabilities through workspace-level GPU caps — a customer sees "workspace can burst to 32 GPUs, guaranteed 8" rather than the underlying `Queue.spec.weight` / `capability` / `guarantee` fields. The mechanism is the same; the surface is intentionally simpler.
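On a self-operated cluster, a namespace-level `ResourceQuota` to sit alongside the queue ceiling might look like the sketch below. The namespace `tenant-acme`, the object name and the numbers are assumptions chosen to mirror a 64-GPU queue `capability`; `requests.nvidia.com/gpu` is the standard Kubernetes form for quota on extended resources.

```yaml
# Hypothetical namespace and figures; align them with the tenant's Queue.spec.capability
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: tenant-acme
spec:
  hard:
    requests.nvidia.com/gpu: "64"   # same ceiling as the queue capability
    pods: "500"                     # cap on object count, independent of GPUs
```

The quota is enforced at pod creation by the API server, while the queue ceiling is enforced at scheduling time, so the two act as independent backstops.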
Observability
Volcano exposes Prometheus metrics on `:8080/metrics` from both the controller-manager and the scheduler. The metric set covers scheduler session timing, allocation success / failure rates, queue share utilisation, plugin error counters and pod-group state transitions. Combined with the standard Kubernetes scheduler audit log, this is enough to alert on starvation, deadlock and reclaim churn — the three failure modes that produce 90% of operator pages on a busy batch cluster.
The metrics worth alerting on are: PodGroup phase distribution (especially `Pending` duration), scheduler session latency, queue allocation vs capability ratios, and the reclaim action's eviction rate. Yobibyte's internal SRE alerts on the equivalent customer-facing signals (workspace pending queue depth, gang admission latency) without exposing the underlying Volcano metric names.
- `volcano_scheduler_session_duration_seconds` — per-tick session time; alert if p95 > schedule period.
- `volcano_scheduler_pod_scheduling_attempts_total` / `_failures_total` — admission throughput and failure rate.
- `volcano_e2e_scheduling_latency_milliseconds` — gang admission end-to-end latency; the SLO that matters to customers.
- `volcano_queue_allocated` / `volcano_queue_deserved` / `volcano_queue_capability` — fair-share vs ceiling per queue; ratio > 1 = overcommit.
- `volcano_podgroup_phase_count{phase=...}` — distribution of `Pending`, `Inqueue`, `Running`, `Completed`, `Failed`.
- `volcano_reclaim_total` — count of reclaim evictions; a sustained non-zero rate means the cluster is over-subscribed.
- `volcano_admission_latency_milliseconds` — webhook latency; slow admission throttles job submission.
- `workqueue_depth{name="volcano-controller"}` — controller-manager backlog; alert if growing.
```yaml
# Prometheus alerts for Volcano in production
groups:
  - name: volcano-sla
    interval: 30s
    rules:
      - alert: VolcanoGangStarvation
        expr: max by (podgroup, queue) (time() - volcano_podgroup_pending_since_seconds) > 1800
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "PodGroup {{ $labels.podgroup }} in {{ $labels.queue }} pending > 30m"
      - alert: VolcanoSchedulerSlow
        expr: histogram_quantile(0.95, rate(volcano_scheduler_session_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Volcano scheduler p95 session > 2 s: cluster too large for single scheduler"
      - alert: VolcanoQueueOversubscribed
        expr: volcano_queue_allocated / volcano_queue_capability > 1.0
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Queue {{ $labels.queue }} allocated > capability: investigate borrowing"
      - alert: VolcanoReclaimThrash
        expr: rate(volcano_reclaim_total[10m]) > 0.5
        for: 30m
        labels: { severity: critical }
        annotations:
          summary: "Reclaim eviction rate > 0.5/s: quota plane misconfigured"
```
Cost and FinOps
Volcano itself is free (Apache 2.0). The cost surface is the GPU capacity Volcano allocates. Where Volcano changes FinOps is in the conversion from raw GPU-hours to *useful* GPU-hours: by eliminating partial-admission deadlock, Volcano can lift effective utilisation on a busy training cluster from ~60-70% (default scheduler with no gang) to ~85-90% (Volcano with topology-aware reclaim). On a 64-node H100 cluster at ~$3.00/GPU/hr on Yobitel NeoCloud, that is roughly $90,000-$130,000/month of recovered productivity; the exact figure depends on GPUs per node, since recovered spend ≈ GPU count × hourly price × ~730 h/month × utilisation points gained.
- Effective utilisation lift — measure `volcano_queue_allocated` vs cluster capacity over 30 days; the gap to 100% is what reclaim + backfill can recover.
- Yobitel NeoCloud H100 SXM5 list — roughly $3.00/GPU/hr on-demand, $2.00/GPU/hr reserved, ~$1.50/GPU/hr opportunistic backfill via Volcano's lowest-priority queue.
- Recovered productivity from gang admission alone — typically 10-20% lift on training-heavy clusters that previously saw partial deadlock.
- Reclaim overhead — each reclaim event kills a pod and burns its checkpoint window; budget for one re-do every reclaim cycle on opportunistic backfill jobs.
- Quota-pricing alignment — if you sell capacity to internal teams, set `Queue.spec.weight` to match the dollar committed, and surface `volcano_queue_allocated` as a chargeback feed.
- Yobibyte's workspace billing surface is the customer-facing equivalent — Yobitel runs the Volcano + Kueue plane and bills the customer in USD per GPU-hour without exposing the raw queue metrics.
Security and compliance
Volcano runs as a privileged controller — it can read every pod and node in the cluster, create / delete pods on behalf of users, and mutate scheduling decisions. The standard mitigations apply: namespace-scoped RBAC for end users (they create `Job` and `PodGroup` in their own namespace, but never in `volcano-system`), restricted PodSecurity for everything except the named controller pods, and audit logging on every `Job` / `Queue` mutation. For UK NCSC OFFICIAL workloads, Volcano sits inside the sovereign perimeter on Yobitel-operated clusters — the scheduler never makes calls to a SaaS control plane.
Multi-tenant isolation comes from the combination of Queue capability ceilings, ResourceQuota, NetworkPolicy and PodSecurity. A tenant can never exceed its queue's `capability`; a tenant's pods can never see another tenant's pods on the network without a NetworkPolicy exception; a tenant's pods cannot mount host paths or run privileged. Volcano enforces the resource ceiling but not the network or security boundary — those are standard Kubernetes primitives layered on top.
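The network half of that boundary can be sketched with a standard `NetworkPolicy`. The namespace `tenant-acme` and the policy name are hypothetical; the manifest uses only core Kubernetes fields and assumes a CNI that enforces NetworkPolicy.

```yaml
# Allow ingress only from pods in the same namespace (hypothetical tenant namespace)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: tenant-acme
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector: {}      # peers must live in this same namespace
```

Ranks of one gang share a namespace, so collective traffic between them is still allowed while cross-tenant ingress is dropped.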
Reclaim and preemption are auditable. Every reclaim event is a Kubernetes event with the victim pod, the claimant queue and the reason, fed to the standard audit pipeline. For SOC 2 and ISO 27001 evidence, the audit log plus the Volcano metric stream is sufficient to demonstrate that resource shares were honoured per the contracted SLA.
Do not expose `scheduling.volcano.sh` CRDs to end users via cluster-admin RBAC. End users should only have permission to create `batch.volcano.sh/v1alpha1 Job` in their own namespace — Queue and PodGroup are platform-team objects. Yobibyte's workspace surface enforces this implicitly; on a self-operated cluster you must wire the RBAC yourself.
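A minimal sketch of that RBAC wiring, assuming a tenant namespace `tenant-acme` and an identity-provider group `tenant-acme-users` (both hypothetical), could look like this. It grants access to Volcano `Job` objects only and deliberately omits the `scheduling.volcano.sh` group.

```yaml
# Namespace-scoped permission to manage Volcano Jobs; no Queue or PodGroup access
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: volcano-job-submitter
  namespace: tenant-acme
rules:
  - apiGroups: ["batch.volcano.sh"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: volcano-job-submitter
  namespace: tenant-acme
subjects:
  - kind: Group
    name: tenant-acme-users          # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: volcano-job-submitter
  apiGroup: rbac.authorization.k8s.io
```

Using a `Role` rather than a `ClusterRole` keeps the grant from leaking into `volcano-system` or other tenants' namespaces.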
Migration and alternatives
Most clusters that need Volcano migrate from one of three starting points: default kube-scheduler (which produces partial-admission deadlock on distributed training), Kueue (which queues jobs but does not gang-schedule individual pods), or a legacy YARN / Slurm cluster (which has gang semantics but lives outside Kubernetes). The migration playbook differs by source.
The dominant 2026 alternative is Kueue — see [[kueue]] for the full comparison. The short version: Kueue queues whole jobs at admission and then delegates pod placement to the default scheduler; Volcano replaces the scheduler entirely with a gang-aware pipeline. Kueue is lighter touch and easier to audit; Volcano is more powerful when you need topology-aware placement and HPC reclaim. Many production clusters run both: Kueue at the platform level for fair-share queueing across teams, Volcano in the training-only namespace for the actual gang admission.
| From | Effort | Risk | Notes |
|---|---|---|---|
| Default kube-scheduler | Low | Low | Volcano runs alongside; pods opt in via `schedulerName: volcano` per namespace. |
| Kueue only | Medium | Low | Keep Kueue at platform level; add Volcano under the training namespace for gang. |
| Slurm / PBS Pro on bare metal | High | Medium | Re-model batch scripts as Volcano Jobs; preserve mpirun semantics via `mpi` plugin. |
| YARN on Hadoop | High | Medium | Spark / Flink integrations match YARN queue semantics; data locality differs. |
| Run:ai pre-NVIDIA-acquisition | Medium | Low | Run:ai uses its own scheduler rather than Volcano, though both share `kube-batch` lineage; gang and queue concepts map closely but surface APIs differ. |
| KubeRay autoscaler only | Low | Low | Layer Volcano under KubeRay for Ray cluster admission gang semantics. |
| vs Yobibyte managed alternative | n/a | n/a | If you would rather not run the scheduler plane at all, Yobibyte exposes the equivalent customer surface (gang-admitted training jobs, workspace queues, GPU pool budgets) on Yobitel-managed tenancies — see `yobibyte` and `neocloud`. |
Troubleshooting
The error patterns below cover roughly 80% of production Volcano incidents observed on Yobitel-operated fleets and on the upstream community tracker. Each row maps a symptom to the underlying mechanism and the minimum-viable fix.
| Symptom | Cause | Fix |
|---|---|---|
| PodGroup stuck Pending forever | `minResources` exceeds cluster free capacity, or queue at `capability` ceiling. | Check `volcano_queue_allocated`; free capacity or raise `Queue.spec.capability`; reduce `minAvailable` only for jobs that tolerate elastic semantics. |
| Some pods schedule but not all | Volcano not the scheduler — default scheduler picked some pods up. | Confirm `schedulerName: volcano` on every task template; check admission webhook. |
| Gang admits but NCCL hangs at init | Pods on different NVLink islands or no RDMA path. | Enable `task-topology` plugin; label nodes; verify InfiniBand subnet manager. |
| Reclaim evicts wrong pod | Priority class missing on victim queue's jobs. | Set `priorityClassName` explicitly on every Job; verify queue `priority`. |
| Scheduler session latency > 1 s | Plugin pipeline too long or cluster too large. | Prune action pipeline; partition by queue selector; scale scheduler replicas. |
| Webhook timeout — Job creation fails | vc-webhook-manager OOM or slow node. | Raise webhook timeout in apiserver; add resources to webhook deployment. |
| Queue capability silently exceeded | Borrowing across queues without reclaim enabled. | Set `reclaimable: false` on critical queues; enable `reclaim` action. |
| Job restarts forever after PodEvicted | Restart policy too aggressive. | Cap retries with `Job.spec.maxRetry` (default 3) or change action to `CompleteJob`. |
| Multi-task Job partially launches | `minAvailable` lower than sum(replicas). | Raise `minAvailable` to total; or accept elastic-batch semantics deliberately. |
| Pods admitted, no IP, stuck ContainerCreating | CNI plugin overloaded after gang admission burst. | Throttle gang size; pre-warm CNI; not a Volcano issue per se. |
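For the missing-priority-class row in the table above, the fix presupposes that the referenced classes exist. A sketch of two standard Kubernetes `PriorityClass` objects follows; the name `high-priority` matches the name used in Pattern A, while `opportunistic` and both numeric values are assumptions.

```yaml
# Values are illustrative; only their relative order matters for victim selection
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "Production training gangs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: opportunistic            # hypothetical class for backfill jobs
value: 1000
globalDefault: false
description: "Backfill jobs that may be evicted first"
```

Setting `priorityClassName` on every Job to one of these gives preemption and reclaim a well-defined ordering to work from.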
Where this fits in the Yobitel stack
Volcano is the gang-scheduling substrate under every distributed training and large multi-rank inference job that Yobibyte runs on Yobitel NeoCloud. When a Yobibyte customer launches an 8-rank tensor-parallel inference deployment or a 64-rank pretraining job through their workspace, the underlying job is admitted by Volcano with gang semantics on Yobitel-operated capacity — the customer never sees a partial admission, never burns budget on stranded ranks, and never has to author a `PodGroup` CRD themselves. Yobibyte presents the customer-facing surface as a workspace, a model name and a region; Volcano handles the atomic admission on the back end.
On Yobitel-managed clusters, Volcano is installed via GitOps from the platform's Argo CD root, paired with the NVIDIA GPU Operator (for the hardware layer) and Kueue (for cross-team fair-share at the platform level). Topology labels emitted by NFD and the GPU Operator drive Volcano's `task-topology` plugin so that NCCL collectives stay inside NVLink islands by default and fall back to InfiniBand-connected racks only when island capacity is exhausted. The InferenceBench benchmark engine uses the same plane for reproducible, gang-admitted benchmark runs — every benchmark is admitted atomically or rejected, never partial.
For UK and EU sovereign tenancies, Volcano runs inside the sovereign perimeter on Yobitel-operated clusters under the NCSC Cloud Security Principles and OFFICIAL handling caveat — no SaaS control plane, audit logs feed the regional SIEM, and queue shares are documented in the customer's contracted SLA. Customers who want the gang-scheduling primitive but not the operations burden consume it through Yobibyte; customers who want to run their own cluster with Yobitel Managed Operations get Volcano installed, tuned and on-call covered as part of the engagement.