TL;DR
- Kueue is the Kubernetes SIG-batch project (Apache 2.0, Go) that adds job-level queueing, quotas, fair-share and preemption on top of the default kube-scheduler — without replacing it.
- Five core CRDs: `ResourceFlavor` (a labelled slice of cluster capacity), `ClusterQueue` (a quota over one or more flavours), `LocalQueue` (namespace-scoped pointer), `Workload` (internal representation of a queued job) and `Cohort` (group of ClusterQueues that lend / borrow).
- First-class integrations for native `Job`, `JobSet`, `MPIJob`, `RayJob`, `RayCluster`, `PyTorchJob`, `TFJob`, `XGBoostJob`, `JAXJob`, `SparkApplication` and Kubeflow Training Operator v2 — pattern is `suspend: true` on submit, Kueue un-suspends when quota is available.
- Lighter-weight alternative to Volcano for clusters that need queueing, quota enforcement and cohort borrowing but not gang scheduling or topology-aware placement. Many production clusters run both — Kueue at platform level, Volcano in training namespaces.
- Yobibyte exposes Kueue-style cluster quotas as workspace-owner-facing "GPU pool budgets" — customers see a clean per-workspace floor and ceiling without ever touching ClusterQueue or Cohort directly.
Overview
Plain Kubernetes has no notion of a queue. If a hundred distributed training jobs are submitted at once, the API server admits all hundred, the default scheduler races to place their pods, the cluster runs out of GPU resources, and most jobs partially launch and then sit half-running, holding capacity they cannot use. There is no concept of a guaranteed tenant share, no fair-share across teams, no way to borrow idle quota without permanently giving it away, and no preemption that respects a queue contract. Every team that has run a serious training cluster has felt this pain.
Kueue is the Kubernetes SIG-batch answer. It is a small, opinionated queueing controller that inserts a job-level admission step in front of the default scheduler: jobs sit in a `LocalQueue`, get matched against a `ClusterQueue` quota, and have their pods released to the scheduler only when their share is available. Unlike Volcano (see [[volcano-scheduler]]) Kueue does not replace the scheduler — pods are placed by kube-scheduler with its full feature set (topology spread, pod affinity, taints / tolerations, image locality). Kueue only decides *when* a job is allowed to start; placement remains the scheduler's job.
Kueue first shipped from Kubernetes SIG-batch in 2022. It is still numbered v0.x (v0.12 is the release this entry documents), but the prefix understates the maturity of the codebase: the core CRDs are served at `v1beta1`, and new behaviour is introduced behind feature gates that graduate over successive releases. It runs on Kubernetes 1.27-1.33, ships an admission webhook plus a controller, and integrates with every major batch framework through per-framework integrations enabled in the manager's configuration.
This entry helps you decide when Kueue is the right addition to a Kubernetes cluster, how to model your tenants as ResourceFlavors / ClusterQueues / Cohorts, how to size the queueing plane, and how Kueue's queueing model differs from Volcano's gang scheduling and from Run:ai's commercial GPU orchestration. Yobibyte exposes Kueue-style cluster quotas as customer-visible workspace GPU pool budgets — this entry documents the surface for teams that operate their own clusters or want to understand what Yobibyte provides on their behalf.
Quick start
The fastest sane path is the upstream manifest install plus a single ResourceFlavor / ClusterQueue / LocalQueue triad and a suspended Job. The commands below install Kueue, define a single-flavour quota of eight H100 GPUs, create a namespace queue pointer, and submit a Job that Kueue admits when capacity is free. Run them against a cluster that has the NVIDIA GPU Operator installed and at least eight `nvidia.com/gpu` resources free.
```shell
# 1. Install Kueue from the upstream release manifest
VERSION=v0.12.0
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml

# 2. Wait for the controller to be Ready
kubectl -n kueue-system rollout status deployment/kueue-controller-manager

# 3. Define a single ResourceFlavor (on-demand H100 nodes labelled "nvidia.com/gpu.product=NVIDIA-H100-SXM5")
cat <<'YAML' | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata: { name: h100-ondemand }
spec:
  nodeLabels: { "nvidia.com/gpu.product": "NVIDIA-H100-SXM5" }
YAML

# 4. Define a ClusterQueue with 8-GPU nominal quota
cat <<'YAML' | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata: { name: applied-ai-gpu }
spec:
  namespaceSelector: {}
  cohort: gpu-pool
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu", "cpu", "memory"]
    flavors:
    - name: h100-ondemand
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "8"
        borrowingLimit: "8"
      - name: cpu
        nominalQuota: "256"
      - name: memory
        nominalQuota: "2Ti"
  preemption:
    reclaimWithinCohort: LowerPriority
    withinClusterQueue: LowerPriority
YAML

# 5. Define a namespace LocalQueue pointing at the ClusterQueue
kubectl create namespace team-applied-ai
cat <<'YAML' | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata: { name: applied-ai-gpu, namespace: team-applied-ai }
spec:
  clusterQueue: applied-ai-gpu
YAML

# 6. Submit a suspended Job; Kueue un-suspends it when quota is available
cat <<'YAML' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llama-3-8b
  namespace: team-applied-ai
  labels: { "kueue.x-k8s.io/queue-name": "applied-ai-gpu" }
spec:
  parallelism: 4
  completions: 4
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.10-py3
        command: ["sleep", "infinity"]
        resources:
          limits: { nvidia.com/gpu: 1, cpu: "16", memory: "128Gi" }
YAML

# 7. Inspect the Workload Kueue created from the Job
kubectl -n team-applied-ai get workloads
kubectl -n team-applied-ai describe workload
```

Always combine Kueue with NVIDIA GPU Operator labels. The `nodeLabels` selector on ResourceFlavor lets you split on-demand vs spot, H100 vs H200, NVLink vs PCIe, so that Kueue's quotas operate over the right physical pool. Without flavours, a customer requesting H200 can borrow H100 quota and be silently wrong.
How it works
Kueue runs a single controller-manager in `kueue-system` that also serves the validating and mutating admission webhooks. The controller reconciles `Workload` objects (Kueue's internal representation of any queued job) derived from the framework object the user submits (`batch/v1 Job`, `MPIJob`, `RayJob`, etc.). The submission flow is identical for every framework: the user submits the job with a `kueue.x-k8s.io/queue-name` label and `suspend: true` (the mutating webhook sets `suspend: true` on labelled jobs that omit it); the integration creates a matching `Workload`; the controller waits until the corresponding `ClusterQueue` has enough quota in the right `ResourceFlavor`; then it patches the job to `suspend: false` and the framework's own controller creates the pods.
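To illustrate that the same label and suspend field drive admission for frameworks other than `batch/v1 Job`, here is a minimal sketch of a Kubeflow `PyTorchJob` queued through the LocalQueue from the quick start. It assumes the Kubeflow Training Operator (v1 API) is installed and the `kubeflow.org/pytorchjob` integration is enabled; the job name, replica counts and placeholder command are illustrative only.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-finetune                  # hypothetical name
  namespace: team-applied-ai
  labels:
    kueue.x-k8s.io/queue-name: applied-ai-gpu   # names the LocalQueue, not the ClusterQueue
spec:
  runPolicy:
    suspend: true                     # Kueue patches this to false on admission
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch             # container name the Training Operator looks for
            image: nvcr.io/nvidia/pytorch:24.10-py3
            command: ["sleep", "infinity"]   # placeholder for the real training command
            resources:
              limits: { nvidia.com/gpu: 1 }
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: nvcr.io/nvidia/pytorch:24.10-py3
            command: ["sleep", "infinity"]   # placeholder for the real training command
            resources:
              limits: { nvidia.com/gpu: 1 }
```

Under these assumptions Kueue would represent the job as one Workload with two pod sets (Master and Worker) and release it only when quota for all four GPUs is available.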
ClusterQueue is the unit of quota. Each ClusterQueue declares a `resourceGroups` list, where each group covers a set of resources (typically `nvidia.com/gpu`, `cpu`, `memory`) and lists the `flavors` that satisfy those resources. Per flavour per resource, the ClusterQueue declares `nominalQuota` (guaranteed floor), `borrowingLimit` (how far above the floor it can borrow from cohort peers) and `lendingLimit` (how much of the floor it is willing to lend out). The controller maintains a usage accounting for each ClusterQueue and admits Workloads in FIFO order within each priority class, respecting all three limits.
Cohort is the unit of borrowing. ClusterQueues with the same `cohort` field form a borrowing pool: idle quota in one can be used by another up to the borrowing / lending limits. Borrowing is cohort-scoped, never cluster-wide, so platform teams can carve out hard boundaries (e.g. production-cohort cannot borrow from research-cohort) while still allowing elastic overflow within each pool. When a queue needs its nominal quota back, Kueue preempts Workloads that are running on borrowed quota according to the reclaiming queue's `reclaimWithinCohort` policy (`Never`, `LowerPriority` or `Any`).
Preemption is policy-driven and respects both `priorityClassName` (standard Kubernetes priority) and the queue's `preemption.withinClusterQueue` / `reclaimWithinCohort` fields. Kueue evicts at the Workload level, not the pod level — when it preempts, it patches the job back to `suspend: true` so the framework's own controller cleans up pods cleanly. This is a meaningfully different model from Volcano's pod-level preemption: Kueue cooperates with the framework operator's lifecycle hooks, which means MPIJob, RayJob and friends see preemption as a clean restart rather than an abrupt eviction.
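Queue-level priority can be set separately from pod priority. The following is a minimal sketch of a `WorkloadPriorityClass` and a Job that opts into it through the `kueue.x-k8s.io/priority-class` label; the class name, value and Job name are hypothetical, and the queue is the one from the quick start.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: prod-training                 # hypothetical name
value: 10000                          # higher value is ordered first in the queue
description: "Production training runs, ordered ahead of exploratory work"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-eval                  # hypothetical name
  namespace: team-applied-ai
  labels:
    kueue.x-k8s.io/queue-name: applied-ai-gpu
    kueue.x-k8s.io/priority-class: prod-training
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: eval
        image: nvcr.io/nvidia/pytorch:24.10-py3
        command: ["sleep", "infinity"]       # placeholder workload
        resources:
          limits: { nvidia.com/gpu: 1 }
```

The label changes how Kueue orders and preempts the Workload; the pods themselves keep whatever `priorityClassName` (if any) their template sets, so scheduler-level eviction is unaffected.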
- Workload: Kueue's internal representation of one queued job; one Workload per Job / MPIJob / RayJob / etc.
- ResourceFlavor: labelled slice of cluster capacity; selected by `nodeLabels`, tolerated by `tolerations`, identified by name in ClusterQueue.
- ClusterQueue: cluster-scoped quota over one or more flavours; cohort-aware; preemption-aware; namespace-selectable.
- LocalQueue: namespace-scoped pointer to a ClusterQueue; what end users actually reference via `kueue.x-k8s.io/queue-name`.
- Cohort: name shared by multiple ClusterQueues that wish to lend / borrow from each other.
- AdmissionCheck: pluggable gate that runs after quota is reserved but before un-suspend (e.g. ProvisioningRequest for autoscaling).
- WorkloadPriorityClass: Kueue-scoped priority distinct from Pod PriorityClass; resolves queue ordering vs scheduler eviction independently.
- Integrations: enabled per framework under `integrations.frameworks` in the manager's Configuration, e.g. `batch/job`, `jobset.x-k8s.io/jobset`, `kubeflow.org/mpijob`, `ray.io/rayjob`, `ray.io/raycluster`, `kubeflow.org/pytorchjob`, `kubeflow.org/tfjob`, `kubeflow.org/xgboostjob`, `kubeflow.org/paddlejob`, `sparkoperator.k8s.io/sparkapplication`.
Kueue does not gang-schedule pods. It admits Workloads atomically — either the whole Workload is allowed in or none of it is — but it relies on the framework operator and the default scheduler to launch pods. For jobs that require atomic pod rendezvous (NCCL collectives that fail if one rank is missing), pair Kueue with a framework operator that owns the gang (Kubeflow Training Operator v2, JobSet) or layer Volcano underneath.
Reference and specifications
The fields below are the Kueue CRD surface that matters in production. The reference covers `ResourceFlavor`, `ClusterQueue`, `LocalQueue`, `Workload` and `WorkloadPriorityClass`. Defaults are taken from v0.12.0.
| Resource / field | Type | Default | Purpose |
|---|---|---|---|
| ResourceFlavor.spec.nodeLabels | map | {} | Required node labels for this flavour. |
| ResourceFlavor.spec.nodeTaints | list | [] | Taints present on this flavour's nodes; a Workload must tolerate them to be assigned the flavour. |
| ResourceFlavor.spec.tolerations | list | [] | Tolerations added to admitted pods. |
| ClusterQueue.spec.cohort | string | (none) | Borrowing pool name; queues sharing a cohort lend / borrow. |
| ClusterQueue.spec.namespaceSelector | LabelSelector | (none) | Which namespaces' LocalQueues can use this ClusterQueue; `{}` matches all namespaces, unset matches none. |
| ClusterQueue.spec.queueingStrategy | string | BestEffortFIFO | `BestEffortFIFO` or `StrictFIFO`; controls head-of-line blocking. |
| ClusterQueue.spec.resourceGroups | list | (required) | Group resources that share flavour candidacy. |
| ...resourceGroups[].coveredResources | list | (required) | Resources this group quotes (e.g. nvidia.com/gpu, cpu, memory). |
| ...resourceGroups[].flavors[].name | string | (required) | ResourceFlavor name. |
| ...flavors[].resources[].nominalQuota | Quantity | (required) | Guaranteed floor for this resource on this flavour. |
| ...flavors[].resources[].borrowingLimit | Quantity | (unset) | How much above nominal can be borrowed from cohort peers; unset means no cap, `0` disables borrowing. |
| ...flavors[].resources[].lendingLimit | Quantity | (unset) | How much of nominal this queue is willing to lend out; unset means all unused nominal quota can be lent. |
| ClusterQueue.spec.preemption.reclaimWithinCohort | string | Never | `Never`, `LowerPriority` or `Any`; whether to preempt Workloads in other queues that are running on quota borrowed from this one. |
| ClusterQueue.spec.preemption.withinClusterQueue | string | Never | `Never`, `LowerPriority` or `LowerOrNewerEqualPriority`; preempt own Workloads to honour a higher-priority arrival. |
| ClusterQueue.spec.stopPolicy | string | None | `None`, `Hold` or `HoldAndDrain`; emergency pause. |
| ClusterQueue.spec.flavorFungibility | object | Borrow / TryNextFlavor | `whenCanBorrow` / `whenCanPreempt`: whether to borrow or preempt in the current flavour, or try the next listed flavour, when the current one is full. |
| ClusterQueue.spec.fairSharing | object | (off) | Per-queue `weight` for DRF-like fair sharing within the cohort; takes effect only when fair sharing is enabled in the Kueue Configuration. |
| ClusterQueue.spec.admissionChecks | list | [] | AdmissionCheck names that must pass before un-suspend. |
| LocalQueue.spec.clusterQueue | string | (required) | Which ClusterQueue this LocalQueue points at. |
| Workload.spec.priorityClassName | string | (inherited) | Standard Pod PriorityClass. |
| Workload.spec.priorityClassSource | string | (auto) | Source of the priority value — Pod or Workload. |
| Workload.spec.podSets | list | (required) | Pod templates that constitute the Workload. |
| Workload.spec.queueName | string | (required) | LocalQueue name. |
| WorkloadPriorityClass.value | int | 0 | Higher = more important; used for queue ordering only, not pod eviction. |
| Job label `kueue.x-k8s.io/queue-name` | string | (required) | Marks a Job for Kueue admission; integration webhook creates Workload. |
| Job label `kueue.x-k8s.io/priority-class` | string | (optional) | Use a WorkloadPriorityClass for queue ordering. |
| Configuration `integrations.frameworks` | list | batch/job | Framework integrations to enable; set in the manager's Configuration file rather than on the command line. |
| Manager flag `--feature-gates` | list | (many) | Toggle alpha / beta features — TopologyAwareScheduling, ManagedJobsNamespaceSelector, etc. |
Set `queueingStrategy: BestEffortFIFO` unless you have a strong reason for `StrictFIFO`. Strict FIFO will hold a small Workload behind a head-of-line giant Workload that cannot fit, leaving GPUs idle — best-effort lets the small one slip past while the big one waits.
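Manager-level settings such as the enabled integrations and fair sharing live in the Kueue `Configuration` object rather than in any CRD. The fragment below is a minimal sketch, assuming the layout of the upstream release manifests, where this file is stored under the `controller_manager_config.yaml` key of the `kueue-manager-config` ConfigMap in `kueue-system`; the framework list shown is only an example subset.

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:                  # one entry per framework Kueue should manage
  - "batch/job"
  - "jobset.x-k8s.io/jobset"
  - "kubeflow.org/mpijob"
  - "kubeflow.org/pytorchjob"
  - "ray.io/rayjob"
fairSharing:
  enable: true                 # required before per-queue fairSharing.weight has any effect
```

The controller-manager reads this file at start-up, so a change normally needs a restart of the `kueue-controller-manager` Deployment to take effect.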
Workload patterns
Three patterns cover the bulk of production Kueue deployments. Pick the one closest to your tenant model.
Pattern A: single-cohort fair-share across teams. Every team gets its own ClusterQueue in the shared cohort, with `nominalQuota` proportional to its committed share and `borrowingLimit` covering the elastic overflow. `reclaimWithinCohort: LowerPriority` lets a queue reclaim quota it has lent out by preempting lower-priority borrowers; with fair sharing enabled in the Kueue configuration, the per-queue `fairSharing.weight` lets the DRF-style algorithm divide the cohort's free capacity among queues in proportion to their weights. This is the canonical platform-team-runs-the-cluster pattern.
Pattern B: flavour fungibility for on-demand + spot mix. Two ResourceFlavors (`h100-ondemand` and `h100-spot`); one ClusterQueue with both flavours listed, `flavorFungibility.whenCanBorrow: Borrow` and `whenCanPreempt: Preempt`. Kueue tries flavours in the order they are listed, so with on-demand listed first a Workload requesting H100 consumes on-demand nominal quota, then (per the fungibility settings) borrows from cohort peers or preempts within that flavour, and falls back to the spot flavour only when none of those fit. To ride on cheap capacity by default and burst onto on-demand only when spot is full, list `h100-spot` first.
Pattern C: admission checks gate provisioning. Kueue's `AdmissionCheck` is a pluggable gate between quota reservation and un-suspend. Common uses: a `ProvisioningRequest` check that asks Cluster Autoscaler (or another autoscaler that implements the ProvisioningRequest API) to bring up nodes before un-suspending; a budget check that asks the FinOps controller whether the spend window allows the job; a compliance check that confirms the workload is allowed in the requested ResourceFlavor's geographic region. The Workload stays at `QuotaReserved` until every check reports `Ready`; only then is it marked `Admitted` and un-suspended.
```yaml
# Pattern A: cohort fair-share across two teams sharing 32 H100s
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata: { name: applied-ai }
spec:
  cohort: research-pool
  namespaceSelector:
    matchLabels: { team: applied-ai }
  preemption:
    reclaimWithinCohort: LowerPriority
    withinClusterQueue: LowerPriority
  fairSharing:
    weight: 4                  # 4/6 of free cohort capacity under contention
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: h100-ondemand
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "16"     # guaranteed floor
        borrowingLimit: "16"   # may borrow up to 16 more, but the peer lends at most 8,
                               # so the effective ceiling is 24; raise the peer's
                               # lendingLimit to 16 to allow a burst to 32
        lendingLimit: "8"      # will lend up to 8 when idle
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata: { name: research-platform }
spec:
  cohort: research-pool
  namespaceSelector:
    matchLabels: { team: research-platform }
  preemption:
    reclaimWithinCohort: LowerPriority
  fairSharing:
    weight: 2
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: h100-ondemand
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "16"
        borrowingLimit: "16"
        lendingLimit: "8"
---
# Pattern B: flavour fungibility; flavours are tried in listed order (on-demand, then spot)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata: { name: batch-inference }
spec:
  cohort: inference-pool
  namespaceSelector: {}        # all namespaces; an unset selector matches none
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: Preempt
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: h100-ondemand
      resources: [{ name: nvidia.com/gpu, nominalQuota: "8" }]
    - name: h100-spot
      resources: [{ name: nvidia.com/gpu, nominalQuota: "24" }]
```

Yobibyte's workspace GPU pool budget is a higher-level surface over Pattern A + B combined. Customers configure `committed-gpus: 16, burst-gpus: 32, prefer-flavour: h100-spot` on their workspace; Yobibyte translates that into ClusterQueue / ResourceFlavor / flavour fungibility on the Yobitel NeoCloud back end. The customer never edits the CRDs directly.
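Pattern C has no manifest of its own in this section, so the following is a minimal sketch of a provisioning gate. It assumes Kueue's built-in ProvisioningRequest admission check controller and a Cluster Autoscaler that supports the `check-capacity.autoscaling.x-k8s.io` provisioning class; the object names (`capacity-check`, `capacity-check-config`, `autoscaled-training`) are hypothetical.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: capacity-check                       # hypothetical name
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: capacity-check-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: capacity-check-config                # hypothetical name
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  managedResources:
  - nvidia.com/gpu                           # only GPU requests trigger a ProvisioningRequest
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata: { name: autoscaled-training }      # hypothetical name
spec:
  namespaceSelector: {}
  admissionChecks:
  - capacity-check                           # must report Ready before un-suspend
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: h100-ondemand
      resources: [{ name: nvidia.com/gpu, nominalQuota: "8" }]
```

With this in place a Workload would first reserve quota in `autoscaled-training`, then wait for the check to report `Ready` before its job is un-suspended. Budget or compliance gates follow the same shape with a different `controllerName` backed by your own controller.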
Sizing and capacity planning
Kueue's controller is light. On a 100-node cluster with 500 Workloads in flight, the controller costs ~500 mCPU and 1-2 GiB. The cost scales primarily with Workload count and Cohort complexity, not with raw pod count — the default scheduler handles pod placement, Kueue handles queue admission. The admission webhook is on the hot path for every Job / MPIJob / RayJob creation, so a slow webhook will throttle job submission rate.
- Single Kueue controller handles ~10,000 Workloads in flight comfortably; beyond that, partition by ClusterQueue or shard by namespaceSelector.
- Webhook latency should stay below 100 ms p95 — slow webhooks cause apiserver to time out and reject job submission.
- Cohort complexity is the steeper scaling axis — N ClusterQueues with full borrowing produces O(N²) borrowing edges; cap cohorts at ~20 queues each.
- AdmissionChecks add latency to admission — a `ProvisioningRequest` check that calls out to Karpenter adds 30-90 s per Workload.
- Yobibyte runs Kueue HA with two replicas in every Yobitel NeoCloud region; the workspace UI never blocks on Kueue admission decisions.
| Component | CPU | Memory | Notes |
|---|---|---|---|
| kueue-controller-manager | 300 mCPU - 1 vCPU | 512 MiB - 2 GiB | Reconciles Workloads, ClusterQueues, AdmissionChecks. |
| Admission webhook (in same pod) | 100 mCPU | 128 MiB | On Job / MPIJob create — must respond within apiserver timeout. |
| Per Workload overhead | n/a | ~5-10 KiB etcd | Workload + status; small compared to underlying Job + pods. |
| Per ClusterQueue overhead | n/a | ~20 KiB etcd | Quota accounting + cohort references. |
| Reconciliation rate | n/a | n/a | ~10-50 ms per Workload admission decision on 500-Workload cluster. |
| Memory per 1,000 Workloads | n/a | ~50 MB | Cached state; grows linearly with active workload count. |
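As a starting point, the figures above can be turned into resource requests on the controller Deployment. The fragment below is a sketch of a strategic-merge patch, assuming the container is named `manager` as in the upstream manifests; the request and limit values are taken from the ranges in the table and should be tuned against observed usage.

```yaml
# Apply with, for example:
#   kubectl -n kueue-system patch deployment kueue-controller-manager --patch-file kueue-resources.yaml
spec:
  template:
    spec:
      containers:
      - name: manager              # assumed container name from the upstream manifests
        resources:
          requests:
            cpu: 500m              # steady-state figure quoted for ~500 Workloads in flight
            memory: 1Gi
          limits:
            cpu: "1"
            memory: 2Gi            # upper end of the range in the table
```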
Limits and quotas
Kueue's quota model is its core; the practical limits below are the envelope teams hit in production.
| Dimension | Soft limit | Hard limit | Mitigation |
|---|---|---|---|
| Active Workloads per ClusterQueue | ~1,000 | ~10,000 | Partition by namespaceSelector; split ClusterQueues by team. |
| ClusterQueues per Cohort | ~20 | ~50 | Cohort cost is O(N²) in borrowing edges; cap and split. |
| Cohorts per cluster | ~50 | ~200 | Soft only; mostly limited by humans tracking the model. |
| ResourceFlavors per ClusterQueue | ~5 | ~20 | More flavours = more admission paths to check per Workload. |
| Workload PodSets per Workload | ~8 | ~32 | Each PodSet is checked independently against the quota. |
| AdmissionChecks per ClusterQueue | ~3 | ~10 | Each check adds end-to-end admission latency. |
| Pods per admitted Workload | no Kueue limit | scheduler limit | Bounded by kube-scheduler throughput and per-node pod capacity, not by Kueue. |
| Webhook timeout | 5 s | 30 s | Set via `timeoutSeconds` on the webhook configuration; Kubernetes defaults to 10 s and allows at most 30 s. |
Yobibyte surfaces a single per-workspace `gpu-pool-budget` knob — committed floor + burst ceiling, opaque flavour — and translates that into a ClusterQueue per workspace under the hood. This is the customer-facing analogue of the limits above; workspace owners never see ClusterQueue, ResourceFlavor or Cohort CRDs.
Observability
Kueue exposes Prometheus metrics on `:8080/metrics` from the controller-manager. The metric set covers Workload state distribution, admission latency, ClusterQueue quota utilisation, cohort borrowing rates and preemption counts. Combined with the standard Kubernetes audit log, this is enough to operate Kueue at scale and to evidence SLA compliance against contracted queue shares.
- `kueue_admitted_workloads_total{cluster_queue=...}`: admission throughput per queue.
- `kueue_pending_workloads{cluster_queue=...,status=...}`: queue depth, partitioned by `status` into `active` (still being considered for admission) and `inadmissible` (cannot be admitted until quota or cluster state changes).
- `kueue_admission_attempts_total{result=...}`: admission attempt outcomes (`success`, `inadmissible`).
- `kueue_admission_wait_time_seconds`: distribution of how long Workloads waited before admission; this is the customer-felt SLO.
- `kueue_cluster_queue_resource_usage{cluster_queue=...,resource=...}`: current usage per resource per queue.
- `kueue_cluster_queue_nominal_quota` / `_borrowing_limit` / `_lending_limit`: the quota constants for chargeback.
- `kueue_preempted_workloads_total{reason=...}`: preemption rate, partitioned by reason (`InClusterQueue`, `InCohortReclamation`).
- `kueue_workload_check_status`: AdmissionCheck pass / fail rates.
```yaml
# Prometheus alerts for Kueue in production
groups:
- name: kueue-sla
  interval: 30s
  rules:
  - alert: KueueAdmissionSlow
    expr: histogram_quantile(0.95, rate(kueue_admission_wait_time_seconds_bucket[10m])) > 600
    for: 15m
    labels: { severity: warning }
    annotations:
      summary: "Kueue admission p95 > 10 min for {{ $labels.cluster_queue }}"
  - alert: KueueQueueOversubscribed
    expr: kueue_cluster_queue_resource_usage > kueue_cluster_queue_nominal_quota + kueue_cluster_queue_borrowing_limit
    for: 10m
    labels: { severity: critical }
    annotations:
      summary: "ClusterQueue {{ $labels.cluster_queue }} exceeds quota + borrowing"
  - alert: KueuePreemptionThrash
    expr: rate(kueue_preempted_workloads_total[10m]) > 0.5
    for: 30m
    labels: { severity: warning }
    annotations:
      # this metric labels the queue as preempting_cluster_queue, not cluster_queue
      summary: "Preemption rate > 0.5/s on {{ $labels.preempting_cluster_queue }}; investigate cohort sizing"
  - alert: KueueControllerSlow
    expr: workqueue_depth{name="workload"} > 100
    for: 10m
    labels: { severity: critical }
    annotations:
      summary: "Kueue Workload queue depth > 100; controller-manager backlog"
```

Cost and FinOps
Kueue itself is free (Apache 2.0). The cost surface is the GPU capacity Kueue meters and the FinOps story it enables. Kueue is essentially a FinOps tool dressed as a queue: it converts "committed shares" into auditable resource accounting and lets you charge tenants for their `nominalQuota` while keeping average utilisation high through cohort borrowing.
- Chargeback feed: surface `kueue_cluster_queue_resource_usage` per queue per resource as the per-tenant invoice line; multiply by USD per-resource-hour for the dollar amount.
- Yobitel NeoCloud H100 SXM5 list: roughly $3.00/GPU/hr on-demand, $2.00/GPU/hr reserved, ~$1.50/GPU/hr spot (admitted into queues marked `borrowingLimit`-only).
- Borrowing recovery: the gap between `nominalQuota` and `resource_usage` over time is the dollar value of cohort lending; teams that lend a lot can be rewarded with priority class boosts.
- Reservation policy: set the cohort `nominalQuota` sum to the contracted reserved capacity; cohort borrowing absorbs the burst without overcommitting the underlying physical reserve.
- Spot vs on-demand fungibility: use Pattern B above with the spot flavour listed first to prefer spot for batch and fall back to on-demand only when spot is unavailable; at the list prices above this saves ~50% on training spend when workloads tolerate preemption.
- Yobibyte's workspace GPU pool budget is the customer-facing dollar surface: Yobitel runs the Kueue plane on Yobitel NeoCloud and bills the customer per the workspace budget without exposing raw queue metrics.
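The chargeback feed described above can be approximated with Prometheus recording rules. This is a minimal sketch, not a billing implementation: the rule names are hypothetical, the flat 3.00 USD rate is the on-demand list figure quoted above, and it assumes the per-queue resource metrics are being exported (in upstream Kueue they are opt-in via `metrics.enableClusterQueueResources: true` in the Configuration).

```yaml
groups:
- name: kueue-chargeback
  interval: 1m
  rules:
  # Average number of GPUs in use per ClusterQueue over the last hour
  - record: cluster_queue:kueue_gpu_usage:avg1h
    expr: |
      sum by (cluster_queue) (
        avg_over_time(kueue_cluster_queue_resource_usage{resource="nvidia.com/gpu"}[1h])
      )
  # GPU-hours in that hour times an assumed flat rate of 3.00 USD per GPU-hour
  - record: cluster_queue:kueue_gpu_cost_usd:avg1h
    expr: cluster_queue:kueue_gpu_usage:avg1h * 3.00
```

Summing the second series over a billing period gives an approximate per-queue invoice line; a real feed would apply different rates per flavour (on-demand, reserved, spot) rather than one constant.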
Security and compliance
Kueue runs as a cluster-scoped controller — it watches all Workloads, ClusterQueues and ResourceFlavors, and mutates Job objects via admission. The standard mitigations apply: namespace-scoped RBAC for end users (they create Jobs and LocalQueues in their own namespace; ClusterQueue and ResourceFlavor and Cohort are platform-team objects), restricted PodSecurity on the controller pod, and admission audit logging. For UK NCSC OFFICIAL workloads, Kueue sits inside the sovereign perimeter on Yobitel-operated clusters with no SaaS dependency.
Multi-tenant isolation in Kueue comes from `ClusterQueue.spec.namespaceSelector` — a ClusterQueue only admits Workloads from namespaces matching the selector. Combined with standard NetworkPolicy and PodSecurity, this means a tenant can only consume the queues their namespace label allows. Cohort borrowing is the only path by which one tenant's quota can flow to another; cohort boundaries should mirror organisational trust boundaries exactly.
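A minimal sketch of the tenant side of this isolation follows. It assumes a ClusterQueue whose `namespaceSelector` matches `team: applied-ai` and uses the `kueue-batch-user-role` ClusterRole that the upstream manifests install for job submitters; the group and binding names are hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-applied-ai
  labels:
    team: applied-ai                    # the label the ClusterQueue's namespaceSelector matches
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: applied-ai-batch-users          # hypothetical name
  namespace: team-applied-ai
subjects:
- kind: Group
  name: applied-ai-engineers            # hypothetical identity-provider group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kueue-batch-user-role           # assumed to be installed by the upstream manifests
  apiGroup: rbac.authorization.k8s.io
```

Because the binding is namespace-scoped, members of the group can submit jobs only in `team-applied-ai`. Permission to label namespaces should stay with the platform team, since the label is what grants access to a ClusterQueue.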
Audit and accountability are straightforward — every Workload admission, preemption and quota change is a Kubernetes audit event. For SOC 2 and ISO 27001 evidence, the audit log plus the Kueue metric stream is sufficient to demonstrate contracted queue shares were honoured. For GDPR Article 32, Kueue processes no personal data; the relevant evidence is that Kueue ran inside the sovereign perimeter and that audit logs were retained per policy.
Cohort borrowing crosses tenant boundaries — make absolutely sure cohort membership matches your trust model. A misconfigured cohort can let an external research team reclaim quota from a production-payments team. Yobibyte enforces cohort boundaries at the platform layer so a workspace cannot accidentally cohort itself with another customer's workspace.
Migration and alternatives
Most clusters that adopt Kueue migrate from one of four starting points: no queueing at all (just default kube-scheduler), namespace-level ResourceQuota (no fair-share), Volcano (gang-only without cohort borrowing), or a custom admission webhook (homegrown queueing). The migration playbook below summarises each path.
Volcano vs Kueue is the most-asked comparison. Short version: Kueue is queue-and-quota-only; Volcano is queue-and-quota-and-scheduler. Kueue cooperates with the default scheduler and the framework operator's lifecycle; Volcano replaces the scheduler entirely. Kueue is the right answer when you want auditable per-team quotas and cohort borrowing but no scheduler swap; Volcano is the right answer when you need gang admission of individual pods and topology-aware placement. Many production clusters — including Yobitel NeoCloud — run both: Kueue at the platform level for tenant quotas, Volcano in training namespaces for gang admission.
| From | Effort | Risk | Notes |
|---|---|---|---|
| No queueing (default scheduler only) | Low | Low | Add ResourceFlavor / ClusterQueue / LocalQueue; relabel existing Jobs. |
| Namespace ResourceQuota | Low | Low | Keep ResourceQuota as belt-and-braces; ClusterQueue becomes the active queueing layer. |
| Volcano only | Medium | Medium | Keep Volcano in training namespaces; add Kueue at platform level for cross-team fair-share. |
| Custom admission webhook | High | Medium | Retire the homegrown logic in stages; mirror its rules in ClusterQueue.spec first. |
| Run:ai pre-NVIDIA-acquisition | Medium | Low | Run:ai now layers over the same primitives; Kueue is the open equivalent of Run:ai's quota plane. |
| YuniKorn | Medium | Low | YuniKorn fills a similar niche; migrate per-queue mappings to ClusterQueues. |
| KEDA only | Low | Low | KEDA is event-driven autoscaling; Kueue complements it, not replaces it. |
| vs Yobibyte managed alternative | n/a | n/a | If you would rather not run the quota plane at all, Yobibyte exposes the equivalent customer surface (workspace GPU pool budget, committed vs burst, cohort isolation) on Yobitel-managed tenancies — see `yobibyte` and `neocloud`. |
Troubleshooting
The error patterns below cover the failure modes that account for roughly 80% of production Kueue incidents on Yobitel-operated clusters and on the upstream community tracker.
| Symptom | Cause | Fix |
|---|---|---|
| Workload stuck Pending forever | ClusterQueue at `nominalQuota`, no cohort borrowing available. | Raise `borrowingLimit`; inspect `kueue_cluster_queue_resource_usage`; add capacity. |
| Workload admitted but pods never start | Default scheduler cannot place — node taint / resource gap. | Check pods' `Pending` reason; confirm ResourceFlavor `nodeLabels` match real nodes. |
| Job creates but no Workload appears | Integration not enabled for that framework. | Add the framework to `integrations.frameworks` in the Kueue Configuration; restart controller-manager. |
| Webhook timeout: Job creation rejected | Controller pod under-resourced; webhook slow. | Raise controller resources; check webhook latency, for example `controller_runtime_webhook_latency_seconds` on the manager or `apiserver_admission_webhook_admission_duration_seconds` on the API server. |
| Preemption evicts the wrong Workload | PriorityClass missing on protected jobs. | Set `priorityClassName` on the job's pod template, or label the job with `kueue.x-k8s.io/priority-class` to use a WorkloadPriorityClass. |
| Cohort borrowing not happening | `borrowingLimit` set to `0`, peer's `lendingLimit` is `0`, or the queues are not in the same cohort. | Raise or unset the limits (unset means no cap); verify cohort name spelling matches across queues. |
| FairSharing weight ignored | Fair sharing not enabled in the Kueue Configuration, or `fairSharing` block missing. | Set `fairSharing.enable: true` in the Configuration; set per-queue `fairSharing.weight`. |
| ClusterQueue quota silently exceeded | Workloads created without the queue-name label bypass Kueue. | Use namespaceSelector to scope; enforce via OPA / Kyverno that Jobs in the namespace carry the label, or set `manageJobsWithoutQueueName: true` in the Kueue Configuration. |
| AdmissionCheck never returns Ready | External controller not reconciling the check. | Check the controller named in the AdmissionCheck's `spec.controllerName`; restart it if hung. |
| StrictFIFO blocking small jobs | Head-of-line big Workload cannot fit. | Switch to `BestEffortFIFO` or raise borrowing limits. |
Where this fits in the Yobitel stack
Kueue is the quota and fair-share substrate that Yobitel uses to convert Yobitel NeoCloud capacity into per-workspace and per-tenant budgets. Every Yobibyte workspace maps to a ClusterQueue on the back end, with the workspace's committed share encoded as `nominalQuota`, the burst ceiling as `borrowingLimit`, and the cohort scoped to the customer's account. When a workspace owner sees "committed 16 GPUs / burst 32 GPUs" in the Yobibyte UI, the engine driving that is a Kueue ClusterQueue inside a customer-scoped cohort on Yobitel-operated capacity.
On Yobitel-managed clusters Kueue is installed via GitOps from the platform's Argo CD root, paired with Volcano (for gang admission within training namespaces) and the NVIDIA GPU Operator (for the hardware layer). On customer-managed clusters where Yobitel provides Managed Operations, Kueue is the first thing installed and the last thing touched — incidents almost always resolve to a layer above. InferenceBench uses Kueue to enforce per-benchmark-run quota, so a misbehaving benchmark cannot stall the cluster's other tenants.
For UK and EU sovereign workloads, Kueue runs entirely inside the sovereign perimeter on Yobitel-operated clusters under NCSC Cloud Security Principles, G-Cloud 14 lot definitions and OFFICIAL handling. Customers consuming Yobibyte never edit Kueue CRDs directly — Yobitel runs the queueing plane, the customer sees a clean per-workspace floor and ceiling in USD per GPU-hour. Customers who want to run their own cluster with Yobitel Managed Operations get Kueue (alongside Volcano) installed, tuned and on-call covered as part of the engagement.