Volcano Scheduler

TL;DR

Originally created at Huawei in 2019, Volcano is the de facto batch scheduler for AI / HPC workloads on Kubernetes: a CNCF Sandbox project since 2020 and Incubating since 2022, Apache 2.0 licensed, written in Go.
Replaces the default kube-scheduler for pods that opt in via `schedulerName: volcano`; introduces `Job` and `PodGroup` CRDs with `minAvailable`, `minMember`, `minResources`, plugins (ssh / svc / env / tensorflow / pytorch / mpi), tasks and per-task replicas.
Gang scheduling, queue-based DRF fair-share, preemption with reclaim, topology-aware placement on NVLink / NVSwitch / rack domains, reservation + backfill — every primitive an MPI or NCCL workload needs to start atomically.
First-class plugins for PyTorchJob, MPIJob, TensorFlow, Spark, Ray, Flink and the Kubeflow Training Operator; the canonical pairing in production is Volcano (scheduler) + KubeRay or Kubeflow (framework) + NVIDIA GPU Operator (hardware layer).
Yobibyte runs Volcano internally as part of its scheduling substrate, so Yobitel NeoCloud customers never see a partial pod-group admission — distributed training and tensor-parallel inference launch atomically or stay queued.

Overview

Volcano is a Kubernetes-native batch scheduler purpose-built for AI / ML / HPC / Big Data workloads. The default kube-scheduler optimises for stateless single-pod workloads — REST services, ingestion daemons, control-plane controllers — and treats every pod as an independent placement decision. That model breaks the moment you submit a distributed training job that needs 64 H100 GPUs across eight nodes to launch atomically: the scheduler admits the first 60 pods, the cluster runs out of fitting GPUs, the remaining four sit Pending, and the 60 admitted ranks burn GPU-hours waiting for the rendezvous that will never complete.

Volcano was created at Huawei in 2019, building on the earlier `kube-batch` project, to fix exactly this class of problem. It introduces a `PodGroup` abstraction with `minMember` and `minResources` so the scheduler can reason about "the whole job is admitted or none of it is"; layers a queue-based fair-share model with DRF (Dominant Resource Fairness) across GPU + CPU + memory; adds topology-aware placement so an eight-rank tensor-parallel job lands on a single NVLink / NVSwitch island; and provides HPC-style preemption with reclaim so a high-priority production training run can evict opportunistic backfill without losing accounting fidelity.

Volcano joined CNCF as a Sandbox project in 2020 and was promoted to Incubating in 2022. By mid-2026 it is on v1.10.x, supports Kubernetes 1.27-1.33, and ships pre-built integrations for PyTorch, TensorFlow, MPI (Horovod / OpenMPI), Ray, Spark, Flink and the Kubeflow Training Operator. It is dual-purpose: it can either fully replace the default scheduler in a Volcano-only namespace, or run side-by-side with kube-scheduler in a mixed cluster where only workloads that opt in via `schedulerName: volcano` use Volcano's placement logic.

This entry helps you decide when Volcano is the right addition to a Kubernetes cluster, how to wire it up against the NVIDIA GPU Operator and your training operators, how to size the resulting queue / cohort plane, and how its job model differs from the lighter-weight Kueue alternative. Yobibyte runs Volcano under the hood across every Yobitel NeoCloud region so that Yobibyte customers never experience a partial gang-admission failure — this entry documents the surface for teams that operate their own clusters or want to understand what Yobibyte provides on their behalf.

Quick start

The fastest sane path is the upstream Helm install plus a single `Job` (Volcano's own CRD, distinct from `batch/v1 Job`) with four workers. The five steps below install Volcano, define a queue, submit a gang-scheduled four-worker job and inspect the admission decision. The worker container only runs `sleep infinity`, so the example exercises gang admission rather than a real MPI or NCCL run. Run the steps against a cluster that already has the NVIDIA GPU Operator installed and at least four `nvidia.com/gpu` resources free.

bash

# 1. Install Volcano via the upstream Helm chart
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install --wait volcano volcano-sh/volcano \
    --version "1.10.0" \
    --namespace volcano-system --create-namespace

# 2. Confirm the controller, scheduler and admission webhook are Ready
kubectl -n volcano-system get pods

# 3. Create a queue with weight 4 and a 16-GPU capability ceiling
cat <<'YAML' | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: training }
spec:
  weight: 4
  capability:
    nvidia.com/gpu: "16"
YAML

# 4. Submit a gang-scheduled job (4 workers, all-or-nothing); the containers only sleep
cat <<'YAML' | kubectl apply -f -
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: nccl-smoke }
spec:
  schedulerName: volcano
  minAvailable: 4
  queue: training
  plugins:
    ssh: []
    svc: []
    env: []
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: nccl-test
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["sleep", "infinity"]
              resources:
                limits: { nvidia.com/gpu: 1 }
YAML

# 5. Inspect gang-admission decision
kubectl get podgroup
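# Depending on the Volcano version, the PodGroup name may carry a generated suffix;
# if so, use the name printed by the previous command instead.
kubectl describe podgroup nccl-smoke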
kubectl get pods -l volcano.sh/job-name=nccl-smoke

Always pair Volcano with the NVIDIA GPU Operator and Node Feature Discovery (NFD). Without `nvidia.com/gpu.nvlink.domain` and similar topology labels, Volcano's `topology-aware` plugin has nothing to optimise against: gang admission still works, but placement falls back to default spread. See `nvidia-gpu-operator` for the install.

How it works

Internally Volcano is three components running in `volcano-system`: the `vc-controller-manager` (reconciles `Job` and `PodGroup` CRDs into pods), the `vc-scheduler` (the scheduler itself — replaces or supplements kube-scheduler), and the `vc-webhook-manager` (validating + mutating admission for Volcano CRDs). The scheduler is built around a session model: every scheduling tick (default 1 s) opens a session, walks a configurable pipeline of actions (`enqueue`, `allocate`, `backfill`, `preempt`, `reclaim`), and closes the session by committing the resulting bindings to the API server. Plugins implement the policy each action consults: `gang`, `priority`, `drf`, `predicates`, `nodeorder`, `proportion`, `binpack`, `topology-aware`, `numa-aware`, `task-topology`.

The `gang` plugin is the headline. A `PodGroup` (created automatically by Volcano's `Job`, or explicitly for raw pods) declares `minMember` (set from the Job's `minAvailable`) and `minResources`. The scheduler will not transition the group to `Inqueue` (allowed to be admitted) until enough total cluster capacity exists; will not start binding individual pods until the entire group can be bound; and will preempt or reclaim only when the gang as a whole can be satisfied after the eviction. This eliminates the partial-admission deadlock that breaks the default scheduler for distributed training.
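The all-or-nothing rule can be sketched in a few lines of Python. This is a toy model, not Volcano's implementation: the function name `try_admit_gang`, the greedy first-fit placement and the GPU-only resource view are all assumptions made for illustration. It replays the Overview scenario, where 64 single-GPU ranks face a cluster with only 60 free GPUs.

```python
from typing import Dict, List, Optional


def try_admit_gang(
    free_gpus: Dict[str, int],
    pod_requests: List[int],
    min_member: int,
) -> Optional[Dict[int, str]]:
    """Return pod-index -> node bindings, or None if the gang cannot start.

    Greedy first-fit over a copy of the free-GPU map (illustrative only).
    """
    remaining = dict(free_gpus)  # scratch copy; the caller's view is untouched
    bindings: Dict[int, str] = {}
    for pod_index, gpus in enumerate(pod_requests):
        for node, free in remaining.items():
            if free >= gpus:
                remaining[node] = free - gpus
                bindings[pod_index] = node
                break
    # All-or-nothing: with fewer than min_member placements, bind nothing.
    if len(bindings) < min_member:
        return None
    return bindings


# Hypothetical cluster: seven nodes with 8 free GPUs and one with 4 (60 total).
nodes = {f"node-{i}": 8 for i in range(7)}
nodes["node-7"] = 4
blocked = try_admit_gang(nodes, [1] * 64, min_member=64)

# Once the last four GPUs free up, the whole gang is bound in one decision.
nodes["node-7"] = 8
admitted = try_admit_gang(nodes, [1] * 64, min_member=64)
print(blocked is None, len(admitted or {}))
```

The point of the sketch is the final check: 60 placements out of 64 produce no bindings at all, so no rank holds a GPU while waiting for peers that cannot be scheduled.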

Queues are first-class. A `Queue` CRD has a `weight` (DRF share), `capability` (hard ceiling per resource), `reclaimable` flag (can higher-priority queues claw back from this queue?) and `priority`. The `proportion` plugin allocates fair shares across queues per resource dimension; the `drf` plugin handles the multi-dimensional case (GPU + CPU + memory) so a CPU-heavy data-prep job and a GPU-heavy training job get fair allocations on the dimension that dominates each. The `reclaim` action then enforces the contract — when a higher-priority queue arrives, opportunistic borrowers are evicted in a deterministic order until the guarantee is restored.
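A small numeric sketch may help with the two fairness ideas in this paragraph. The helper names (`dominant_share`, `weighted_share`), the cluster size and the queue names are hypothetical; the code follows the textbook definition of Dominant Resource Fairness and the weight formula from the reference table, not Volcano's plugin code.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(allocated: Resources, capacity: Resources) -> float:
    """Largest fraction of any single cluster resource held by a job or queue."""
    return max(
        allocated.get(name, 0.0) / total
        for name, total in capacity.items()
        if total > 0
    )


def weighted_share(weights: Dict[str, int], total: float) -> Dict[str, float]:
    """Split one resource dimension across queues in proportion to weight."""
    weight_sum = sum(weights.values())
    return {queue: total * weight / weight_sum for queue, weight in weights.items()}


# Hypothetical cluster and two jobs with very different shapes.
cluster = {"nvidia.com/gpu": 64, "cpu": 1024, "memory_gib": 8192}
data_prep = {"cpu": 256, "memory_gib": 512}                        # CPU-heavy
training = {"nvidia.com/gpu": 16, "cpu": 64, "memory_gib": 1024}   # GPU-heavy

prep_share = dominant_share(data_prep, cluster)   # dominated by cpu
train_share = dominant_share(training, cluster)   # dominated by nvidia.com/gpu
gpu_split = weighted_share({"training": 4, "research": 2, "batch": 2}, 64)
print(prep_share, train_share, gpu_split)
```

Both jobs end up with the same dominant share even though one holds no GPUs and the other holds few CPUs, which is the sense in which DRF treats them as equally served.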

Topology awareness wires Volcano to the underlying fabric. The `topology-aware` plugin reads node labels emitted by NFD and the GPU Operator (`nvidia.com/gpu.nvlink.domain`, `topology.kubernetes.io/zone`, custom `volcano.sh/topology=rack-3-leaf-4`) and groups nodes into hierarchical domains. The `task-topology` plugin then bin-packs an MPI / NCCL job into the smallest fitting domain — an 8-rank tensor-parallel job lands on a single 8-way NVLink island, a 64-rank pipeline-parallel job lands on a single rack, a 512-rank pretraining job picks the smallest spine-leaf cluster that fits.
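The "smallest fitting domain" rule can be illustrated as follows. This is a sketch under stated assumptions: the two-level hierarchy, the domain names and the free-GPU counts are invented, and `smallest_fitting_domain` is a hypothetical helper rather than a Volcano API.

```python
from typing import Dict, List, Optional

# Each level maps a domain name to its free GPU count. Levels are ordered
# from the tightest fabric (NVLink island) to the loosest (rack).
Level = Dict[str, int]


def smallest_fitting_domain(levels: List[Level], ranks: int) -> Optional[str]:
    """Return the tightest domain with at least `ranks` free GPUs, else None."""
    for level in levels:  # tightest level first
        fitting = {name: free for name, free in level.items() if free >= ranks}
        if fitting:
            # Best fit within the level keeps larger domains free for larger jobs.
            return min(fitting, key=lambda name: (fitting[name], name))
    return None


# Hypothetical free capacity per domain.
islands = {"island-1": 8, "island-2": 6, "island-3": 8}
racks = {"rack-a": 32, "rack-b": 64}

tp8 = smallest_fitting_domain([islands, racks], 8)       # fits inside one island
pp64 = smallest_fitting_domain([islands, racks], 64)     # needs a whole rack
too_big = smallest_fitting_domain([islands, racks], 512)  # nothing fits
print(tp8, pp64, too_big)
```

Searching level by level is what keeps an 8-rank job off the rack fabric whenever a single island can hold it.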

Session-based scheduling — actions execute per tick over a snapshot, commits are atomic; failed bindings are reverted without leaking partial state.
Plugins are config-driven via the `volcano-scheduler-configmap` — enable / disable / reorder without rebuilding the binary.
Plugin extension model — `gang`, `priority`, `drf`, `binpack`, `topology-aware`, `numa-aware`, `task-topology`, `tdm` (time-division multiplexing), `proportion`, `overcommit`, `usage`, `rescheduling`.
Per-task templates — a `Job` can declare multiple `tasks` (e.g. `master` + `worker` + `param-server`), each with its own replica count, image and resource shape, all admitted together as a single gang.
Built-in plugins for framework idioms — `pytorch` (sets `MASTER_ADDR` / `RANK` / `WORLD_SIZE`), `tensorflow` (TF_CONFIG), `mpi` (mpirun rendezvous), `ssh` (SSH key fan-out), `svc` (headless service for collective bootstrap), `env` (rank-aware env vars).
Pre-emption is policy-driven — `priority` + `victim` selection minimises killed pods; `tdm` preempts on a time-share rather than killing outright.
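The snapshot-then-commit idea in the first bullet can be modelled loosely as below. This is a deliberately simplified sketch: `run_session`, the dictionary-shaped cluster state and the two toy actions are assumptions for illustration, and the real scheduler's commit granularity is finer than "the whole session".

```python
import copy
from typing import Callable, Dict, List

# Toy cluster state: free GPUs per node plus the bindings made so far.
State = Dict[str, dict]
Action = Callable[[State], None]  # an action mutates the snapshot in place


def run_session(cluster: State, actions: List[Action]) -> State:
    """Run every action against a snapshot; keep all changes or none."""
    snapshot = copy.deepcopy(cluster)
    try:
        for action in actions:
            action(snapshot)
    except RuntimeError:
        return cluster   # discard the snapshot: no partial state leaks out
    return snapshot      # commit: the snapshot becomes the new state


def allocate(state: State) -> None:
    state["free"]["node-0"] -= 4
    state["bindings"]["job-a"] = "node-0"


def failing_backfill(state: State) -> None:
    state["free"]["node-0"] -= 1
    raise RuntimeError("bind failed")


cluster = {"free": {"node-0": 8}, "bindings": {}}
after_ok = run_session(cluster, [allocate])
after_fail = run_session(cluster, [allocate, failing_backfill])
print(after_ok["free"], after_fail["free"])
```

Because actions only ever touch the deep copy, a failure midway leaves the original state exactly as it was before the tick.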

Volcano's scheduler is not a drop-in replacement for kube-scheduler — pods must opt in via `schedulerName: volcano`. This is by design: a typical cluster runs Volcano for batch / training workloads and kube-scheduler for services, with admission webhooks routing pods to the right scheduler based on namespace or label.

Reference and specifications

The fields below are the Volcano CRD surface that matters in production. The reference covers `Job` (batch.volcano.sh/v1alpha1), `PodGroup` (scheduling.volcano.sh/v1beta1), `Queue` (scheduling.volcano.sh/v1beta1) and the scheduler ConfigMap. Defaults are taken from v1.10.0.

Resource / field	Type	Default	Purpose
Job.spec.schedulerName	string	volcano	Pin the job to the Volcano scheduler; required for gang.
Job.spec.minAvailable	int	(required)	Minimum number of pods that must be admitted for the gang to run.
Job.spec.queue	string	default	Which `Queue` consumes the job's resource share.
Job.spec.priorityClassName	string	(none)	Standard Kubernetes PriorityClass; drives preemption order.
Job.spec.policies	list	[]	Lifecycle policies — restart-on-failure, restart-on-pod-evicted, etc.
Job.spec.plugins.ssh	object	disabled	Generate SSH keys and inject into pods for mpirun rendezvous.
Job.spec.plugins.svc	object	disabled	Create a headless Service for collective bootstrap.
Job.spec.plugins.env	object	disabled	Inject `VC_TASK_INDEX` (and the legacy `VK_TASK_INDEX`) so each pod knows its index within its task.
Job.spec.plugins.pytorch	object	disabled	Set `MASTER_ADDR` / `MASTER_PORT` / `RANK` / `WORLD_SIZE` for `torch.distributed` rendezvous.
Job.spec.plugins.tensorflow	object	disabled	Build TF_CONFIG cluster spec across worker / ps / chief tasks.
Job.spec.plugins.mpi	object	disabled	Configure master/worker tasks for an mpirun-style launch.
Job.spec.tasks[].name	string	(required)	Logical task name (e.g. `master`, `worker`, `ps`).
Job.spec.tasks[].replicas	int	(required)	How many pods of this task.
Job.spec.tasks[].template	PodTemplate	(required)	Standard PodSpec for this task — containers, resources, volumes.
Job.spec.tasks[].policies	list	[]	Per-task restart policies; can override Job-level.
Job.spec.maxRetry	int	3	Job-level retry count before terminal failure.
PodGroup.spec.minMember	int	(required)	Minimum pods for gang admission — set by Job, or explicit for raw pods.
PodGroup.spec.minResources	ResourceList	(optional)	Minimum aggregate resources required — used for preemption sizing.
PodGroup.spec.queue	string	default	Queue this group consumes from.
PodGroup.spec.priorityClassName	string	(none)	Drives reclaim victim selection.
Queue.spec.weight	int	1	Relative DRF share — queue's resources / total = weight / sum(weights).
Queue.spec.capability	ResourceList	(unlimited)	Hard ceiling per resource (`nvidia.com/gpu`, `cpu`, `memory`).
Queue.spec.reclaimable	bool	true	If false, this queue's resources are never preempted.
Queue.spec.priority	int	0	Tie-breaker when DRF shares are equal.
Queue.spec.guarantee.resource	ResourceList	(optional)	Minimum resources guaranteed even under preemption.
scheduler ConfigMap.actions	string	enqueue,allocate,backfill	Pipeline of actions per tick; add `preempt`, `reclaim` for HPC patterns.
scheduler ConfigMap.tiers[].plugins	list	see default	Plugin list per priority tier — gang, drf, priority, predicates, nodeorder.
scheduler ConfigMap.schedulerPeriod	duration	1s	Session frequency; lower for high-churn clusters.

Setting `minAvailable` to the sum of `spec.tasks[*].replicas` (i.e. every pod of every task must start) is the safe default for distributed training. Setting `minAvailable` lower than the total replicas turns on elastic-batch semantics: useful for hyperparameter sweeps where partial completion is acceptable, dangerous for tensor-parallel training where every rank is mandatory.
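A pre-submission lint for this rule might look like the sketch below. The function `gang_mode` and its labels are hypothetical, and the bounds check is this sketch's own policy rather than a statement about what Volcano's admission webhook accepts.

```python
from typing import Dict


def gang_mode(task_replicas: Dict[str, int], min_available: int) -> str:
    """Classify a Job as a strict gang or an elastic batch.

    task_replicas maps each task name to its replica count.
    """
    total = sum(task_replicas.values())
    if not 1 <= min_available <= total:
        raise ValueError(f"minAvailable must be between 1 and {total}")
    return "strict-gang" if min_available == total else "elastic"


# One master plus seven workers, all mandatory (tensor-parallel style).
training = gang_mode({"master": 1, "worker": 7}, min_available=8)

# A 16-trial sweep that may start with any four trials.
sweep = gang_mode({"trial": 16}, min_available=4)
print(training, sweep)
```

A check like this could run in CI so that a tensor-parallel job is never submitted with a `minAvailable` below its total replica count by accident.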

Workload patterns

Three patterns cover the bulk of production Volcano deployments on Yobitel-operated clusters and on the upstream community. Each pattern uses a different combination of plugins and a different queue topology; pick the one closest to your dominant workload.

Pattern A — distributed PyTorch + Horovod gang training. The canonical Volcano use case. A single `Job` declares one or more `worker` tasks at the desired tensor / data-parallel scale, with `minAvailable` set to the full replica count and the `pytorch` (or `mpi`) plugin enabled. Horovod's `mpirun` finds peers through the headless service the `svc` plugin creates; `ssh` plugin handles key fan-out. Volcano holds the gang until the topology-aware plugin can land all ranks on the same NVLink island (or, for >8-way jobs, the same rack with InfiniBand RDMA paths).

Pattern B — multi-queue tenant fair-share on shared GPU capacity. Each tenant gets a `Queue` with a `weight` proportional to their committed share and a `capability` ceiling. Opportunistic backfill is allowed via `reclaimable: true`. Tenants submit `Job`s into their own queue; DRF allocates fair shares; when a higher-priority queue arrives, the `reclaim` action evicts opportunistic borrowers in priority order. This is the substrate Yobitel sovereign tenancies use to publish guaranteed GPU shares per tenant while keeping average utilisation high.
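The "evict opportunistic borrowers in priority order" step of Pattern B can be sketched as follows. This is an illustration only: the `RunningJob` record, the `pick_victims` helper and the tie-break by name are assumptions, and Volcano's real victim selection consults its plugins rather than a single sort.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunningJob:
    name: str
    priority: int
    gpus: int


def pick_victims(borrowers: List[RunningJob], gpus_needed: int) -> List[RunningJob]:
    """Choose jobs to evict, lowest priority first, until gpus_needed is covered.

    Returns an empty list when evicting every borrower would still fall short,
    so a reclaim that cannot satisfy the claimant evicts nobody.
    """
    if sum(job.gpus for job in borrowers) < gpus_needed:
        return []
    victims: List[RunningJob] = []
    freed = 0
    # Deterministic order: priority ascending, then name as a tie-breaker.
    for job in sorted(borrowers, key=lambda j: (j.priority, j.name)):
        if freed >= gpus_needed:
            break
        victims.append(job)
        freed += job.gpus
    return victims


# Hypothetical borrowers running above their queue's guaranteed share.
borrowers = [
    RunningJob("finetune", priority=50, gpus=16),
    RunningJob("sweep-2", priority=10, gpus=8),
    RunningJob("sweep-1", priority=10, gpus=8),
]
small_reclaim = [job.name for job in pick_victims(borrowers, 12)]
impossible = pick_victims(borrowers, 40)
print(small_reclaim, impossible)
```

The early return mirrors the gang rule described earlier in this entry: eviction is only worthwhile when the claimant can actually be satisfied afterwards.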

Pattern C — heterogeneous task types in one job. Training Operator and KubeRay drive their own pod lifecycles, but you can use Volcano's `Job` directly for jobs that have asymmetric task types — e.g. one `master` (parameter server, FP32) + many `worker` (FP16, NCCL) + a `metrics-sidecar` (CPU-only, scrapes worker telemetry to Prometheus). Each `task` declares its own replicas, template and policies; the gang admits them together.

yaml

# Pattern A: distributed PyTorch on 8x H100 with gang scheduling (1 master + 7 workers)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: llama-pretrain }
spec:
  schedulerName: volcano
  minAvailable: 8
  queue: training
  priorityClassName: high-priority
  plugins:
    ssh: []
    svc: []
    env: []
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.10-py3
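              # The pytorch plugin injects MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE,
              # so the script is started directly (torch.distributed env:// init);
              # "torchrun --standalone" would form a separate single-node group.
              command: ["python", "/app/pretrain.py"]
              resources:
                limits: { nvidia.com/gpu: 1, hugepages-1Gi: 16Gi }
    - name: worker
      replicas: 7
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["python", "/app/pretrain.py"]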
              resources:
                limits: { nvidia.com/gpu: 1, hugepages-1Gi: 16Gi }

---
# Pattern B: tenant queues with weighted fair-share and hard ceilings
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: tenant-acme }
spec:
  weight: 8         # 8/16 of cluster GPU share under fair contention, assuming all queue weights sum to 16
  reclaimable: true # opportunistic borrow allowed
  capability:       # hard ceiling
    nvidia.com/gpu: "64"
  guarantee:        # never preempted below this floor
    resource:
      nvidia.com/gpu: "16"

For Pattern A on multi-node tensor-parallel jobs, enable the `task-topology` plugin in the scheduler ConfigMap and label nodes with their island, for example `volcano.sh/topology=island-3`. The scheduler will pack the eight ranks into the same NVLink island first, falling back to InfiniBand-connected racks only if no island has eight free H100s. This matches what Yobibyte does internally on Yobitel NeoCloud.

Sizing and capacity planning

Volcano's footprint is modest. The control plane runs three deployments in `volcano-system` (controller-manager, scheduler, webhook-manager) that together cost roughly 1-2 vCPU and 2-4 GiB at steady state on a 100-node cluster. The scheduler is the dominant component: it walks the snapshot every tick and the work scales roughly as O(pods × nodes × plugins). Past ~1,000 pods or ~500 nodes you should raise the scheduler's resource requests; past ~5,000 pods you should partition the workload across separate schedulers. Extra replicas of a single scheduler run in HA mode with one active leader at a time, so they add availability rather than throughput.

Single Volcano scheduler handles ~5,000 pods + 500 nodes comfortably; beyond that, partition by queue or by namespace label.
Schedule period (default 1 s) trades latency for CPU — drop to 500 ms for high-churn clusters; raise to 2 s if pod creation lags scheduler load.
Action pipeline matters: enable only what you need. Adding `preempt` and `reclaim` to a small homogeneous cluster wastes session time without measurable benefit.
Each Volcano `Job` is a single etcd object plus N pods plus 1 `PodGroup` — overhead is dominated by pods, not by the CRDs themselves.
On Yobitel NeoCloud regions Yobibyte uses, the scheduler runs HA with two replicas and a 750 ms tick — sized for the typical workspace mix on H100 + H200 capacity.

Component	CPU	Memory	Notes
vc-controller-manager	300-600 mCPU	512 MiB - 1 GiB	Reconciles Job / PodGroup CRDs to pods.
vc-scheduler	500 mCPU - 2 vCPU	1-4 GiB	Scales with pods × nodes × plugins; HA via leader election.
vc-webhook-manager	100-200 mCPU	128-256 MiB	Validating + mutating admission for Volcano CRDs.
Per session cost	n/a	n/a	~50-200 ms per tick on 100-node, 500-pod cluster.
Per Job overhead	n/a	~5-10 KiB etcd	PodGroup + Job CRDs are small; reclaim events generate audit entries.

Limits and quotas

Volcano's quota model is the `Queue.spec.capability` field — a hard ceiling on resources the queue can hold across all admitted gangs. Combined with Kubernetes `ResourceQuota` (namespace-level cap on counts and resources) and NetworkPolicy (tenant isolation), queues form the basis of hard-isolated multi-tenant GPU clusters. The limits below are the practical envelope teams hit in production.

Dimension	Soft limit	Hard limit	Mitigation
Pods scheduled by a single Volcano scheduler	~5,000	~10,000	Partition by queue selector; deploy multiple Volcano installs per cell.
Nodes per scheduler snapshot	~500	~1,500	Use `nodeSelector` filtering and shard by node-pool label.
Queues per cluster	~50	~500	DRF cost grows with queue count; aggregate tiny tenants under a shared queue.
minMember per PodGroup	~256	~1,024	Past 1,024-rank gangs, partition into sub-jobs with explicit synchronisation.
Schedule tick (default 1 s)	500 ms	100 ms	Sub-second ticks burn CPU; profile before lowering.
Plugins enabled in pipeline	~8	~12	Each plugin runs per pod per node; pruning the pipeline often beats raising resources.
Webhook latency budget	~100 ms	~500 ms	Slow admission webhooks throttle Job creation; tune timeouts.
Reclaim cascade depth	~3	~10	Deep preemption chains thrash; cap with `reclaim.tolerance`.

Yobibyte exposes Volcano's queue capabilities through workspace-level GPU caps — a customer sees "workspace can burst to 32 GPUs, guaranteed 8" rather than the underlying `Queue.spec.weight` / `capability` / `guarantee` fields. The mechanism is the same; the surface is intentionally simpler.
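One way to picture the mapping from a workspace-level cap to the underlying queue fields is the sketch below. It is a hypothetical translation written for this entry, not Yobibyte's implementation: the function name `workspace_queue` and its parameters are invented, and only the `Queue` fields already shown in this entry (`weight`, `reclaimable`, `capability`, `guarantee.resource`) are used.

```python
import json
from typing import Any, Dict


def workspace_queue(
    name: str, guaranteed_gpus: int, burst_gpus: int, weight: int = 1
) -> Dict[str, Any]:
    """Build a Queue manifest (as a dict) from 'guaranteed N, burst to M' caps."""
    if not 0 <= guaranteed_gpus <= burst_gpus:
        raise ValueError("guarantee must be between 0 and the burst ceiling")
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {
            "weight": weight,
            "reclaimable": True,  # capacity above the guarantee can be clawed back
            "capability": {"nvidia.com/gpu": str(burst_gpus)},
            "guarantee": {"resource": {"nvidia.com/gpu": str(guaranteed_gpus)}},
        },
    }


# "Workspace can burst to 32 GPUs, guaranteed 8."
queue = workspace_queue("workspace-demo", guaranteed_gpus=8, burst_gpus=32)
print(json.dumps(queue, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since the API server accepts JSON manifests as well as YAML.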

Observability

Volcano exposes Prometheus metrics on `:8080/metrics` from both the controller-manager and the scheduler. The metric set covers scheduler session timing, allocation success / failure rates, queue share utilisation, plugin error counters and pod-group state transitions. Combined with the standard Kubernetes scheduler audit log, this is enough to alert on starvation, deadlock and reclaim churn — the three failure modes that produce 90% of operator pages on a busy batch cluster.

The metrics worth alerting on are: PodGroup phase distribution (especially `Pending` duration), scheduler session latency, queue allocation vs capability ratios, and the reclaim action's eviction rate. Yobibyte's internal SRE alerts on the equivalent customer-facing signals (workspace pending queue depth, gang admission latency) without exposing the underlying Volcano metric names.

`volcano_scheduler_session_duration_seconds` — per-tick session time; alert if p95 > schedule period.
`volcano_scheduler_pod_scheduling_attempts_total` / `_failures_total` — admission throughput and failure rate.
`volcano_e2e_scheduling_latency_milliseconds` — gang admission end-to-end latency; the SLO that matters to customers.
`volcano_queue_allocated` / `volcano_queue_deserved` / `volcano_queue_capability` — fair-share vs ceiling per queue; ratio > 1 = overcommit.
`volcano_podgroup_phase_count{phase=...}` — distribution of `Pending`, `Inqueue`, `Running`, `Completed`, `Failed`.
`volcano_reclaim_total` — count of reclaim evictions; a sustained non-zero rate means the cluster is over-subscribed.
`volcano_admission_latency_milliseconds` — webhook latency; slow admission throttles job submission.
`workqueue_depth{name="volcano-controller"}` — controller-manager backlog; alert if growing.

yaml

# Prometheus alerts for Volcano in production
groups:
  - name: volcano-sla
    interval: 30s
    rules:
      - alert: VolcanoGangStarvation
        expr: max by (podgroup, queue) (time() - volcano_podgroup_pending_since_seconds) > 1800
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "PodGroup {{ $labels.podgroup }} in {{ $labels.queue }} pending > 30m"

      - alert: VolcanoSchedulerSlow
        expr: histogram_quantile(0.95, rate(volcano_scheduler_session_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Volcano scheduler p95 session > 2 s — cluster too large for single scheduler"

      - alert: VolcanoQueueOversubscribed
        expr: volcano_queue_allocated / volcano_queue_capability > 1.0
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Queue {{ $labels.queue }} allocated > capability — investigate borrowing"

      - alert: VolcanoReclaimThrash
        expr: rate(volcano_reclaim_total[10m]) > 0.5
        for: 30m
        labels: { severity: critical }
        annotations:
          summary: "Reclaim eviction rate > 0.5/s — quota plane misconfigured"
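To make the `VolcanoSchedulerSlow` expression less opaque, the sketch below approximates how a quantile is read off cumulative histogram buckets. It is a simplified re-implementation for illustration: Prometheus applies this to per-second rates rather than raw counts, and the bucket boundaries and counts here are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Approximate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # quantile lies beyond the last finite bound
            width = count - lower_count
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / width if width else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical session durations: 50 ticks <= 0.5 s, 90 <= 1 s, all 100 <= 2 s.
session_buckets = [(0.5, 50), (1.0, 90), (2.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, session_buckets)
alert_fires = p95 > 2
print(p95, alert_fires)
```

Because the result is interpolated within a bucket, the alert threshold is only as precise as the bucket boundaries the scheduler exports.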

Cost and FinOps

Volcano itself is free (Apache 2.0). The cost surface is the GPU capacity Volcano allocates. Where Volcano changes FinOps is in the conversion from raw GPU-hours to *useful* GPU-hours: by eliminating partial-admission deadlock, Volcano can lift effective utilisation on a busy training cluster from ~60-70% (default scheduler with no gang) to ~85-90% (Volcano with topology-aware reclaim). On a 64-node H100 cluster at ~$3.00/GPU/hr on Yobitel NeoCloud, assuming 8 GPUs per node (512 GPUs) and a 730-hour month, full capacity is worth about $1.12M/month, so a 15-30 point utilisation lift is roughly $170,000-$340,000/month of recovered productivity.

Effective utilisation lift — measure `volcano_queue_allocated` vs cluster capacity over 30 days; the gap to 100% is what reclaim + backfill can recover.
Yobitel NeoCloud H100 SXM5 list — roughly $3.00/GPU/hr on-demand, $2.00/GPU/hr reserved, ~$1.50/GPU/hr opportunistic backfill via Volcano's lowest-priority queue.
Recovered productivity from gang admission alone — typically 10-20% lift on training-heavy clusters that previously saw partial deadlock.
Reclaim overhead — each reclaim event kills a pod and burns its checkpoint window; budget for one re-do every reclaim cycle on opportunistic backfill jobs.
Quota-pricing alignment — if you sell capacity to internal teams, set `Queue.spec.weight` to match the dollar committed, and surface `volcano_queue_allocated` as a chargeback feed.
Yobibyte's workspace billing surface is the customer-facing equivalent — Yobitel runs the Volcano + Kueue plane and bills the customer in USD per GPU-hour without exposing the raw queue metrics.
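The utilisation arithmetic in this section can be reproduced with a few lines of Python. The sketch rests on assumptions that are not stated in the pricing bullets: 8 GPUs per node (the ratio used in the Overview example) and a 730-hour month. The function name `recovered_usd_per_month` is hypothetical.

```python
def recovered_usd_per_month(
    gpus: int,
    usd_per_gpu_hour: float,
    util_before: float,
    util_after: float,
    hours_per_month: float = 730.0,
) -> float:
    """Value of the GPU-hours that move from idle to useful in one month."""
    if not 0.0 <= util_before <= util_after <= 1.0:
        raise ValueError("expected 0 <= util_before <= util_after <= 1")
    return gpus * usd_per_gpu_hour * hours_per_month * (util_after - util_before)


gpus = 64 * 8  # 64 nodes, assuming 8 H100s per node

# Narrowest and widest gaps between the quoted utilisation ranges.
low = recovered_usd_per_month(gpus, 3.00, util_before=0.70, util_after=0.85)
high = recovered_usd_per_month(gpus, 3.00, util_before=0.60, util_after=0.90)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Swapping in your own GPU count, blended price and measured utilisation (for example from `volcano_queue_allocated` against cluster capacity) gives a figure you can defend in a budget review.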

Security and compliance

Volcano runs as a privileged controller — it can read every pod and node in the cluster, create / delete pods on behalf of users, and mutate scheduling decisions. The standard mitigations apply: namespace-scoped RBAC for end users (they create `Job` and `PodGroup` in their own namespace, but never in `volcano-system`), restricted PodSecurity for everything except the named controller pods, and audit logging on every `Job` / `Queue` mutation. For UK NCSC OFFICIAL workloads, Volcano sits inside the sovereign perimeter on Yobitel-operated clusters — the scheduler never makes calls to a SaaS control plane.

Multi-tenant isolation comes from the combination of Queue capability ceilings, ResourceQuota, NetworkPolicy and PodSecurity. A tenant can never exceed its queue's `capability`; a tenant's pods can never see another tenant's pods on the network without a NetworkPolicy exception; a tenant's pods cannot mount host paths or run privileged. Volcano enforces the resource ceiling but not the network or security boundary — those are standard Kubernetes primitives layered on top.

Reclaim and preemption are auditable. Every reclaim event is a Kubernetes event with the victim pod, the claimant queue and the reason, fed to the standard audit pipeline. For SOC 2 and ISO 27001 evidence, the audit log plus the Volcano metric stream is sufficient to demonstrate that resource shares were honoured per the contracted SLA.

Do not expose `scheduling.volcano.sh` CRDs to end users via cluster-admin RBAC. By default, end users should only have permission to create `batch.volcano.sh/v1alpha1 Job` in their own namespace. `Queue` is a platform-team object, and `PodGroup` access should be granted only to teams that gang-schedule raw pods, scoped to their own namespace. Yobibyte's workspace surface enforces this implicitly; on a self-operated cluster you must wire the RBAC yourself.

Migration and alternatives

Most clusters that need Volcano migrate from one of three starting points: default kube-scheduler (which produces partial-admission deadlock on distributed training), Kueue (which queues jobs but does not gang-schedule individual pods), or a legacy YARN / Slurm cluster (which has gang semantics but lives outside Kubernetes). The migration playbook differs by source.

The dominant 2026 alternative is Kueue (see [[kueue]] for the full comparison). The short version: Kueue queues whole jobs at admission and then delegates pod placement to the default scheduler; Volcano places the pods that opt in itself, using a gang-aware pipeline. Kueue is lighter touch and easier to audit; Volcano is more powerful when you need topology-aware placement and HPC reclaim. Many production clusters run both: Kueue at the platform level for fair-share queueing across teams, Volcano in the training-only namespace for the actual gang admission.

From	Effort	Risk	Notes
Default kube-scheduler	Low	Low	Volcano runs alongside; pods opt in via `schedulerName: volcano` per namespace.
Kueue only	Medium	Low	Keep Kueue at platform level; add Volcano under the training namespace for gang.
Slurm / PBS Pro on bare metal	High	Medium	Re-model batch scripts as Volcano Jobs; preserve mpirun semantics via `mpi` plugin.
YARN on Hadoop	High	Medium	Spark / Flink integrations match YARN queue semantics; data locality differs.
Run:ai pre-NVIDIA-acquisition	Medium	Low	Run:ai's scheduler (since open-sourced as KAI Scheduler) descends from `kube-batch`, as Volcano does; queue and pod-group concepts carry over but the APIs differ.
KubeRay autoscaler only	Low	Low	Layer Volcano under KubeRay for Ray cluster admission gang semantics.
vs Yobibyte managed alternative	n/a	n/a	If you would rather not run the scheduler plane at all, Yobibyte exposes the equivalent customer surface (gang-admitted training jobs, workspace queues, GPU pool budgets) on Yobitel-managed tenancies — see `yobibyte` and `neocloud`.

Troubleshooting

The error patterns below cover roughly 80% of production Volcano incidents observed on Yobitel-operated fleets and on the upstream community tracker. Each row maps a symptom to the underlying mechanism and the minimum-viable fix.

Symptom	Cause	Fix
PodGroup stuck Pending forever	`minResources` exceeds cluster free capacity, or queue at `capability` ceiling.	Reduce `minAvailable`; raise `Queue.spec.capability`; check `volcano_queue_allocated`.
Some pods schedule but not all	Volcano not the scheduler — default scheduler picked some pods up.	Confirm `schedulerName: volcano` on every task template; check admission webhook.
Gang admits but NCCL hangs at init	Pods on different NVLink islands or no RDMA path.	Enable `task-topology` plugin; label nodes; verify InfiniBand subnet manager.
Reclaim evicts wrong pod	Priority class missing on victim queue's jobs.	Set `priorityClassName` explicitly on every Job; verify queue `priority`.
Scheduler session latency > 1 s	Plugin pipeline too long or cluster too large.	Prune action pipeline; partition by queue selector; shard the workload across separate schedulers.
Webhook timeout — Job creation fails	vc-webhook-manager OOM or slow node.	Raise webhook timeout in apiserver; add resources to webhook deployment.
Queue capability silently exceeded	Borrowing across queues without reclaim enabled.	Set `reclaimable: false` on critical queues; enable `reclaim` action.
Job restarts forever after PodEvicted	Restart policy too aggressive.	Set `spec.maxRetry: 3` or change the policy action to `CompleteJob`.
Multi-task Job partially launches	`minAvailable` lower than sum(replicas).	Raise `minAvailable` to total; or accept elastic-batch semantics deliberately.
Pods admitted, no IP, stuck ContainerCreating	CNI plugin overloaded after gang admission burst.	Throttle gang size; pre-warm CNI; not a Volcano issue per se.

Where this fits in the Yobitel stack

Volcano is the gang-scheduling substrate under every distributed training and large multi-rank inference job that Yobibyte runs on Yobitel NeoCloud. When a Yobibyte customer launches an 8-rank tensor-parallel inference deployment or a 64-rank pretraining job through their workspace, the underlying job is admitted by Volcano with gang semantics on Yobitel-operated capacity — the customer never sees a partial admission, never burns budget on stranded ranks, and never has to author a `PodGroup` CRD themselves. Yobibyte presents the customer-facing surface as a workspace, a model name and a region; Volcano handles the atomic admission on the back end.

On Yobitel-managed clusters, Volcano is installed via GitOps from the platform's Argo CD root, paired with the NVIDIA GPU Operator (for the hardware layer) and Kueue (for cross-team fair-share at the platform level). Topology labels emitted by NFD and the GPU Operator drive Volcano's `task-topology` plugin so that NCCL collectives stay inside NVLink islands by default and fall back to InfiniBand-connected racks only when island capacity is exhausted. The InferenceBench benchmark engine uses the same plane for reproducible, gang-admitted benchmark runs — every benchmark is admitted atomically or rejected, never partial.

For UK and EU sovereign tenancies, Volcano runs inside the sovereign perimeter on Yobitel-operated clusters under the NCSC Cloud Security Principles and OFFICIAL handling caveat — no SaaS control plane, audit logs feed the regional SIEM, and queue shares are documented in the customer's contracted SLA. Customers who want the gang-scheduling primitive but not the operations burden consume it through Yobibyte; customers who want to run their own cluster with Yobitel Managed Operations get Volcano installed, tuned and on-call covered as part of the engagement.

References

Volcano Documentation · Volcano
volcano on GitHub · GitHub (volcano-sh)
CNCF Volcano Project Page · CNCF
Volcano Plugins Reference · GitHub (volcano-sh)
Gang Scheduling in Kubernetes (Original Proposal) · Kubernetes SIGs

Quick start#

bash

# 1. Install Volcano via the upstream Helm chart
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install --wait volcano volcano-sh/volcano \
    --version "1.10.0" \
    --namespace volcano-system --create-namespace

# 2. Confirm the controller, scheduler and admission webhook are Ready
kubectl -n volcano-system get pods

# 3. Create a queue with weight 4 and a hard ceiling of 16 GPUs
cat <<'YAML' | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: training }
spec:
  weight: 4
  capability:
    nvidia.com/gpu: "16"
YAML

# 4. Submit a gang-scheduled MPI job (4 workers, all-or-nothing)
cat <<'YAML' | kubectl apply -f -
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: nccl-smoke }
spec:
  schedulerName: volcano
  minAvailable: 4
  queue: training
  plugins:
    ssh: []
    svc: []
    env: []
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: nccl-test
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["sleep", "infinity"]
              resources:
                limits: { nvidia.com/gpu: 1 }
YAML

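# 5. Inspect the gang-admission decision
kubectl get podgroup
# The PodGroup may be named nccl-smoke or nccl-smoke-<uid> depending on the Volcano release;
# use the name printed by the command above if the next command returns NotFound.
kubectl describe podgroup nccl-smoke
kubectl get pods -l volcano.sh/job-name=nccl-smoke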

How it works#

Session-based scheduling — actions execute per tick over a snapshot, commits are atomic; failed bindings are reverted without leaking partial state.
Plugins are config-driven via the `volcano-scheduler-configmap` — enable / disable / reorder without rebuilding the binary.
Plugin extension model — `gang`, `priority`, `drf`, `binpack`, `topology-aware`, `numa-aware`, `task-topology`, `tdm` (time-division multiplexing), `proportion`, `overcommit`, `usage`, `rescheduling`.
Per-task templates — a `Job` can declare multiple `tasks` (e.g. `master` + `worker` + `param-server`), each with its own replica count, image and resource shape, all admitted together as a single gang.
Built-in plugins for framework idioms — `pytorch` (sets `MASTER_ADDR` / `RANK` / `WORLD_SIZE`), `tensorflow` (TF_CONFIG), `mpi` (mpirun rendezvous), `ssh` (SSH key fan-out), `svc` (headless service for collective bootstrap), `env` (rank-aware env vars).
Pre-emption is policy-driven — `priority` + `victim` selection minimises killed pods; `tdm` preempts on a time-share rather than killing outright.
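
The snapshot-and-commit behaviour described above can be illustrated with a toy model. This is a minimal sketch, not Volcano's implementation: the `Gang` type, the `run_session` function and the first-fit placement are all hypothetical simplifications, and only GPUs are modelled. It shows the one property that matters for gang scheduling: a gang that cannot reach `min_available` leaves the snapshot untouched.

```python
from dataclasses import dataclass


@dataclass
class Gang:
    name: str
    gpus_per_pod: int
    replicas: int
    min_available: int


def run_session(free_gpus: dict[str, int], gangs: list[Gang]):
    """One scheduling tick: place each gang all-or-nothing on a snapshot."""
    snapshot = dict(free_gpus)  # work on a copy, never on live cluster state
    bindings: dict[str, list[str]] = {}
    for gang in gangs:
        trial = dict(snapshot)  # scratch copy for this gang only
        placed: list[str] = []
        for _ in range(gang.replicas):
            # First-fit: any node with enough free GPUs for one more pod.
            node = next(
                (n for n, free in trial.items() if free >= gang.gpus_per_pod),
                None,
            )
            if node is None:
                break
            trial[node] -= gang.gpus_per_pod
            placed.append(node)
        if len(placed) >= gang.min_available:
            snapshot = trial  # commit: the whole gang is admitted
            bindings[gang.name] = placed
        # Otherwise `trial` is dropped, so no partial placement leaks out.
    return snapshot, bindings


free = {"node-a": 4, "node-b": 2}
gangs = [Gang("big", 1, 8, 8), Gang("small", 1, 4, 4)]
remaining, bound = run_session(free, gangs)
print(bound)      # only "small" is admitted; "big" needs 8 GPUs and 6 exist
print(remaining)
```

In this run the 8-pod gang is rejected as a unit and holds no GPUs, which is what lets the 4-pod gang start instead of deadlocking behind a partial admission.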

Reference and specifications#

Resource / field	Type	Default	Purpose
Job.spec.schedulerName	string	volcano	Pin the job to the Volcano scheduler; required for gang.
Job.spec.minAvailable	int	(required)	Minimum number of pods that must be schedulable together before the gang is admitted.
Job.spec.queue	string	default	Which `Queue` consumes the job's resource share.
Job.spec.priorityClassName	string	(none)	Standard Kubernetes PriorityClass; drives preemption order.
Job.spec.policies	list	[]	Lifecycle policies — restart-on-failure, restart-on-pod-evicted, etc.
Job.spec.plugins.ssh	object	disabled	Generate SSH keys and inject into pods for mpirun rendezvous.
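Job.spec.plugins.svc	object	disabled	Create a headless Service for collective bootstrap; also exposes per-task host lists and counts (`VC_<TASK>_HOSTS`, `VC_<TASK>_NUM`).
Job.spec.plugins.env	object	disabled	Inject the pod's task index as `VC_TASK_INDEX` (and the legacy `VK_TASK_INDEX`).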
Job.spec.plugins.pytorch	object	disabled	Set `MASTER_ADDR` / `MASTER_PORT` / `RANK` / `WORLD_SIZE` for torchrun.
Job.spec.plugins.tensorflow	object	disabled	Build TF_CONFIG cluster spec across worker / ps / chief tasks.
Job.spec.plugins.mpi	object	disabled	Configure master/worker tasks for an mpirun-style launch.
Job.spec.tasks[].name	string	(required)	Logical task name (e.g. `master`, `worker`, `ps`).
Job.spec.tasks[].replicas	int	(required)	How many pods of this task.
Job.spec.tasks[].template	PodTemplate	(required)	Standard PodSpec for this task — containers, resources, volumes.
Job.spec.tasks[].policies	list	[]	Per-task restart policies; can override Job-level.
Job.spec.maxRetry	int	3	Job-level retry count before terminal failure.
PodGroup.spec.minMember	int	(required)	Minimum pods for gang admission — set by Job, or explicit for raw pods.
PodGroup.spec.minResources	ResourceList	(optional)	Minimum aggregate resources required — used for preemption sizing.
PodGroup.spec.queue	string	default	Queue this group consumes from.
PodGroup.spec.priorityClassName	string	(none)	Drives reclaim victim selection.
Queue.spec.weight	int	1	Relative fair-share weight: the queue's deserved fraction of the cluster is weight / sum(weights of all queues).
Queue.spec.capability	ResourceList	(unlimited)	Hard ceiling per resource (`nvidia.com/gpu`, `cpu`, `memory`).
Queue.spec.reclaimable	bool	true	If false, resources held by this queue are never reclaimed by other queues.
Queue.spec.priority	int	0	Tie-breaker when DRF shares are equal.
Queue.spec.guarantee.resource	ResourceList	(optional)	Minimum resources guaranteed even under preemption.
scheduler ConfigMap.actions	string	enqueue,allocate,backfill	Pipeline of actions per tick; add `preempt`, `reclaim` for HPC patterns.
scheduler ConfigMap.tiers[].plugins	list	see default	Plugin list per priority tier — gang, drf, priority, predicates, nodeorder.
vc-scheduler flag `--schedule-period`	duration	1s	Session frequency (a scheduler command-line flag, not a ConfigMap key); lower for high-churn clusters.

Workload patterns#

yaml

# Pattern A: distributed PyTorch on 8x H100 with gang scheduling
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: llama-pretrain }
spec:
  schedulerName: volcano
  minAvailable: 8
  queue: training
  priorityClassName: high-priority
  plugins:
    ssh: []
    svc: []
    env: []
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
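    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.10-py3
              # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are injected by the pytorch plugin.
              # With one process per pod, WORLD_SIZE equals the node count.
              command: ["sh", "-c", "torchrun --nnodes=$WORLD_SIZE --node_rank=$RANK --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT /app/pretrain.py"]
              resources:
                limits: { nvidia.com/gpu: 1, hugepages-1Gi: 16Gi }
    - name: worker
      replicas: 7
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.10-py3
              # Same launch line as the master; only the injected RANK differs per pod.
              command: ["sh", "-c", "torchrun --nnodes=$WORLD_SIZE --node_rank=$RANK --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT /app/pretrain.py"]
              resources:
                limits: { nvidia.com/gpu: 1, hugepages-1Gi: 16Gi }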

---
# Pattern B: tenant queues with weighted fair-share and hard ceilings
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: tenant-acme }
spec:
  weight: 8         # half of the cluster share under contention, assuming all queue weights sum to 16
  reclaimable: true # usage above the deserved share can be reclaimed by other queues
  capability:       # hard ceiling
    nvidia.com/gpu: "64"
  guarantee:        # never preempted below this floor
    resource:
      nvidia.com/gpu: "16"
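
The interaction between `weight`, `capability` and `guarantee` in Pattern B can be worked through numerically. The sketch below is a simplified single-pass calculation, not the algorithm Volcano's plugins actually run (which redistributes unused share iteratively). The `deserved_gpus` function and the second queue `tenant-other` are assumptions made for illustration.

```python
def deserved_gpus(total_gpus, queues):
    """Weighted share per queue, clamped to [guarantee, capability]."""
    total_weight = sum(q["weight"] for q in queues.values())
    result = {}
    for name, q in queues.items():
        share = total_gpus * q["weight"] / total_weight
        share = min(share, q.get("capability", float("inf")))  # hard ceiling
        share = max(share, q.get("guarantee", 0))              # floor
        result[name] = share
    return result


queues = {
    # Values taken from the tenant-acme Queue in Pattern B.
    "tenant-acme": {"weight": 8, "capability": 64, "guarantee": 16},
    # Hypothetical second tenant so that the weights sum to 16.
    "tenant-other": {"weight": 8},
}

print(deserved_gpus(96, queues))   # weighted split of a 96-GPU cluster
print(deserved_gpus(160, queues))  # tenant-acme is capped by capability
print(deserved_gpus(16, queues))   # tenant-acme is lifted to its guarantee
```

On a 96-GPU cluster each queue deserves 48 GPUs. At 160 GPUs the weighted share for `tenant-acme` would be 80, so the 64-GPU ceiling applies. At 16 GPUs the weighted share is 8, so the 16-GPU guarantee applies.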

Sizing and capacity planning#

Single Volcano scheduler handles ~5,000 pods + 500 nodes comfortably; beyond that, partition by queue or by namespace label.
Schedule period (default 1 s) trades latency for CPU — drop to 500 ms for high-churn clusters; raise to 2 s if pod creation lags scheduler load.
Action pipeline matters: enable only what you need. Adding `preempt` and `reclaim` to a small homogeneous cluster wastes session time without measurable benefit.
Each Volcano `Job` is a single etcd object plus N pods plus 1 `PodGroup` — overhead is dominated by pods, not by the CRDs themselves.
On Yobitel NeoCloud regions Yobibyte uses, the scheduler runs HA with two replicas and a 750 ms tick — sized for the typical workspace mix on H100 + H200 capacity.

Component	CPU	Memory	Notes
vc-controller-manager	300-600 mCPU	512 MiB - 1 GiB	Reconciles Job / PodGroup CRDs to pods.
vc-scheduler	500 mCPU - 2 vCPU	1-4 GiB	Scales with pods × nodes × plugins; HA via leader election.
vc-webhook-manager	100-200 mCPU	128-256 MiB	Validating + mutating admission for Volcano CRDs.
Per session cost	n/a	n/a	~50-200 ms per tick on 100-node, 500-pod cluster.
Per Job overhead	n/a	~5-10 KiB etcd	PodGroup + Job CRDs are small; reclaim events generate audit entries.

Limits and quotas#

Dimension	Soft limit	Hard limit	Mitigation
Pods scheduled by a single Volcano scheduler	~5,000	~10,000	Partition by queue selector; deploy multiple Volcano installs per cell.
Nodes per scheduler snapshot	~500	~1,500	Use `nodeSelector` filtering and shard by node-pool label.
Queues per cluster	~50	~500	DRF cost grows with queue count; aggregate tiny tenants under a shared queue.
minMember per PodGroup	~256	~1,024	Past 1,024-rank gangs, partition into sub-jobs with explicit synchronisation.
Schedule tick (default 1 s)	500 ms	100 ms	Sub-second ticks burn CPU; profile before lowering.
Plugins enabled in pipeline	~8	~12	Each plugin runs per pod per node; pruning the pipeline often beats raising resources.
Webhook latency budget	~100 ms	~500 ms	Slow admission webhooks throttle Job creation; tune timeouts.
Reclaim cascade depth	~3	~10	Deep preemption chains thrash; cap with `reclaim.tolerance`.
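
As a rough planning aid, the soft limits in the table above translate into a minimum number of scheduler partitions. This is back-of-envelope arithmetic under the assumption that pods and nodes can be split evenly across partitions; the `schedulers_needed` helper is hypothetical and not part of Volcano.

```python
import math

# Soft limits copied from the table above.
SOFT_LIMITS = {"pods": 5000, "nodes": 500}


def schedulers_needed(pods: int, nodes: int) -> int:
    """Smallest partition count that keeps every shard under both soft limits."""
    by_pods = math.ceil(pods / SOFT_LIMITS["pods"])
    by_nodes = math.ceil(nodes / SOFT_LIMITS["nodes"])
    return max(1, by_pods, by_nodes)


print(schedulers_needed(12000, 800))  # 3: pod count is the binding limit
print(schedulers_needed(3000, 1200))  # 3: node count is the binding limit
```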

Observability#

`volcano_scheduler_session_duration_seconds` — per-tick session time; alert if p95 > schedule period.
`volcano_scheduler_pod_scheduling_attempts_total` / `_failures_total` — admission throughput and failure rate.
`volcano_e2e_scheduling_latency_milliseconds` — gang admission end-to-end latency; the SLO that matters to customers.
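`volcano_queue_allocated` / `volcano_queue_deserved` / `volcano_queue_capability`: current usage, fair share and hard ceiling per queue; allocated / deserved > 1 means the queue is borrowing beyond its share, and allocated / capability > 1 should never persist.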
`volcano_podgroup_phase_count{phase=...}` — distribution of `Pending`, `Inqueue`, `Running`, `Completed`, `Failed`.
`volcano_reclaim_total` — count of reclaim evictions; a sustained non-zero rate means the cluster is over-subscribed.
`volcano_admission_latency_milliseconds` — webhook latency; slow admission throttles job submission.
`workqueue_depth{name="volcano-controller"}` — controller-manager backlog; alert if growing.

yaml

# Prometheus alerts for Volcano in production
groups:
  - name: volcano-sla
    interval: 30s
    rules:
      - alert: VolcanoGangStarvation
        expr: max by (podgroup, queue) (time() - volcano_podgroup_pending_since_seconds) > 1800
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "PodGroup {{ $labels.podgroup }} in {{ $labels.queue }} pending > 30m"

      - alert: VolcanoSchedulerSlow
        expr: histogram_quantile(0.95, rate(volcano_scheduler_session_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Volcano scheduler p95 session > 2 s — cluster too large for single scheduler"

      - alert: VolcanoQueueOversubscribed
        expr: volcano_queue_allocated / volcano_queue_capability > 1.0
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Queue {{ $labels.queue }} allocated > capability — investigate borrowing"

      - alert: VolcanoReclaimThrash
        expr: rate(volcano_reclaim_total[10m]) > 0.5
        for: 30m
        labels: { severity: critical }
        annotations:
          summary: "Reclaim eviction rate > 0.5/s — quota plane misconfigured"
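
The `VolcanoSchedulerSlow` rule depends on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate in plain Python so the threshold can be sanity-checked offline. The bucket boundaries and counts are made-up sample data, and edge cases that Prometheus handles (non-monotonic buckets, `q` outside 0..1) are ignored.

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from sorted (upper_bound, cumulative_count) pairs.

    The last bucket is expected to have upper_bound == inf, as in Prometheus.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical session-duration buckets in seconds (cumulative counts).
sample = [(0.5, 60), (1.0, 90), (2.0, 98), (4.0, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, sample)
print(round(p95, 3))  # compare against the 2 s alert threshold
```

With this sample, rank 95 falls in the 1 s to 2 s bucket, so the estimate lands at 1.625 s and the alert would not fire. The estimate is only as fine as the bucket layout, which is worth remembering when choosing a threshold near a bucket edge.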

Cost and FinOps#

Effective utilisation lift — measure `volcano_queue_allocated` vs cluster capacity over 30 days; the gap to 100% is what reclaim + backfill can recover.
Yobitel NeoCloud H100 SXM5 list — roughly $3.00/GPU/hr on-demand, $2.00/GPU/hr reserved, ~$1.50/GPU/hr opportunistic backfill via Volcano's lowest-priority queue.
Recovered productivity from gang admission alone — typically 10-20% lift on training-heavy clusters that previously saw partial deadlock.
Reclaim overhead — each reclaim event kills a pod and burns its checkpoint window; budget for one re-do every reclaim cycle on opportunistic backfill jobs.
Quota-pricing alignment: if you sell capacity to internal teams, set `Queue.spec.weight` in proportion to the dollars each team has committed, and surface `volcano_queue_allocated` as a chargeback feed.
Yobibyte's workspace billing surface is the customer-facing equivalent — Yobitel runs the Volcano + Kueue plane and bills the customer in USD per GPU-hour without exposing the raw queue metrics.
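
The price tiers and the reclaim overhead above combine into a blended rate that is easy to compute. The sketch below uses the list prices quoted in this section; the GPU-hour mix and the 10% redo fraction are illustrative assumptions, not measured values, and both helper functions are hypothetical.

```python
# USD per GPU-hour, from the list prices quoted above.
PRICES = {"on_demand": 3.00, "reserved": 2.00, "backfill": 1.50}


def blended_rate(gpu_hours):
    """Average USD per GPU-hour across tiers, weighted by hours consumed."""
    total_hours = sum(gpu_hours.values())
    total_cost = sum(PRICES[tier] * hours for tier, hours in gpu_hours.items())
    return total_cost / total_hours


def effective_backfill_rate(redo_fraction):
    """Backfill price per useful GPU-hour when a fraction of work is redone
    after reclaim evictions (work since the last checkpoint is lost)."""
    return PRICES["backfill"] / (1.0 - redo_fraction)


# Assumed monthly mix: mostly reserved, some on-demand burst, some backfill.
mix = {"reserved": 600, "on_demand": 100, "backfill": 300}
print(round(blended_rate(mix), 2))
print(round(effective_backfill_rate(0.10), 2))
```

With this mix the blended rate is 1.95 USD per GPU-hour. Losing 10% of backfill work to reclaim raises the effective backfill price to about 1.67 USD per useful GPU-hour, still below the reserved rate.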

Security and compliance#

Migration and alternatives#

From	Effort	Risk	Notes
Default kube-scheduler	Low	Low	Volcano runs alongside; pods opt in individually via `schedulerName: volcano` in the pod or Job spec.
Kueue only	Medium	Low	Keep Kueue at platform level; add Volcano under the training namespace for gang.
Slurm / PBS Pro on bare metal	High	Medium	Re-model batch scripts as Volcano Jobs; preserve mpirun semantics via `mpi` plugin.
YARN on Hadoop	High	Medium	Spark / Flink integrations match YARN queue semantics; data locality differs.
Run:ai pre-NVIDIA-acquisition	Medium	Low	Run:ai ships its own scheduler (open-sourced by NVIDIA as KAI Scheduler) rather than Volcano; the concepts overlap but the APIs and CRDs differ, so queues and projects need re-modelling.
KubeRay autoscaler only	Low	Low	Layer Volcano under KubeRay for Ray cluster admission gang semantics.
vs Yobibyte managed alternative	n/a	n/a	If you would rather not run the scheduler plane at all, Yobibyte exposes the equivalent customer surface (gang-admitted training jobs, workspace queues, GPU pool budgets) on Yobitel-managed tenancies — see `yobibyte` and `neocloud`.

Troubleshooting#

Symptom	Cause	Fix
PodGroup stuck Pending forever	`minResources` exceeds cluster free capacity, or queue at `capability` ceiling.	Reduce `minAvailable`; raise `Queue.spec.capability`; check `volcano_queue_allocated`.
Some pods schedule but not all	Volcano not the scheduler — default scheduler picked some pods up.	Confirm `schedulerName: volcano` on every task template; check admission webhook.
Gang admits but NCCL hangs at init	Pods on different NVLink islands or no RDMA path.	Enable `task-topology` plugin; label nodes; verify InfiniBand subnet manager.
Reclaim evicts wrong pod	Priority class missing on victim queue's jobs.	Set `priorityClassName` explicitly on every Job; verify queue `priority`.
Scheduler session latency > 1 s	Plugin pipeline too long or cluster too large.	Prune action pipeline; partition by queue selector; give the scheduler more CPU (extra replicas add HA via leader election, not throughput).
Webhook timeout — Job creation fails	vc-webhook-manager OOM or slow node.	Raise webhook timeout in apiserver; add resources to webhook deployment.
Queue capability silently exceeded	Borrowing across queues without reclaim enabled.	Set `reclaimable: false` on critical queues; enable `reclaim` action.
Job restarts forever after PodEvicted	Restart policy too aggressive.	Set `Job.spec.maxRetry: 3` or change the policy action to `CompleteJob`.
Multi-task Job partially launches	`minAvailable` lower than sum(replicas).	Raise `minAvailable` to total; or accept elastic-batch semantics deliberately.
Pods admitted, no IP, stuck ContainerCreating	CNI plugin overloaded after gang admission burst.	Throttle gang size; pre-warm CNI; not a Volcano issue per se.
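
Several rows in the table above come down to mistakes that are visible in the manifest before submission. The sketch below checks a Job, already parsed into a Python dict, for three of them. The `lint_job` function is a hypothetical helper, not a Volcano tool, and it assumes an unset `schedulerName` is defaulted to `volcano` as the reference table states.

```python
def lint_job(job: dict) -> list[str]:
    """Return human-readable warnings for common gang-scheduling mistakes."""
    spec = job.get("spec", {})
    problems = []

    # Row: "Some pods schedule but not all".
    if spec.get("schedulerName", "volcano") != "volcano":
        problems.append("spec.schedulerName is not 'volcano'")

    # Row: "Multi-task Job partially launches".
    total = sum(task.get("replicas", 0) for task in spec.get("tasks", []))
    min_available = spec.get("minAvailable")
    if min_available is not None:
        if min_available < total:
            problems.append(
                f"minAvailable ({min_available}) < sum(replicas) ({total}): "
                "the gang may launch partially"
            )
        elif min_available > total:
            problems.append(
                f"minAvailable ({min_available}) > sum(replicas) ({total}): "
                "the gang can never be satisfied"
            )

    # Row: "Reclaim evicts wrong pod".
    if not spec.get("priorityClassName"):
        problems.append("priorityClassName is not set")

    return problems


example = {
    "spec": {
        "schedulerName": "volcano",
        "minAvailable": 4,
        "tasks": [
            {"name": "master", "replicas": 1},
            {"name": "worker", "replicas": 7},
        ],
    }
}
for warning in lint_job(example):
    print(warning)
```

A check like this fits naturally in CI ahead of `kubectl apply`. It only covers the manifest, so capacity and topology symptoms still need the metrics and labels described above.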

