Troubleshooting common problems with Kueue
If you are experiencing errors in Alauda AI relating to Kueue workload management, read this section to understand what could be causing the problem, and how to resolve it.
If the problem is not documented here or in the release notes, contact Alauda Support.
TOC

- I see a "failed to call webhook" error message
- I see a "Default Local Queue not found" error message
- I see a "local_queue provided does not exist" error message
- My workload is stuck in a suspended state
- My workload pod is terminated before the image pull completes
- My ClusterQueue is not ready
- Workloads are not being admitted in order

I see a "failed to call webhook" error message
Problem
When creating or updating a workload (such as a Job, RayCluster, or InferenceService), you see an error message that includes "failed to call webhook".
Diagnosis
The Kueue controller pod might not be running, or the webhook service has no available endpoints.
Resolution
- Check the status of the Kueue controller pod:
- If the pod is not running, check the pod events for errors:
- Restart the Kueue controller pod if necessary:
- Check the webhook service and its endpoints:
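The checks above can be run with kubectl. This sketch assumes Kueue's controller runs in the kueue-system namespace (the upstream default); the namespace and pod name may differ in your installation:

```shell
# Assumption: Kueue is installed in the kueue-system namespace.
kubectl get pods -n kueue-system

# Inspect events on a pod that is not running (pod name is a placeholder).
kubectl describe pod <kueue-controller-pod-name> -n kueue-system

# Deleting the pod lets its Deployment recreate it.
kubectl delete pod <kueue-controller-pod-name> -n kueue-system

# The webhook service must have at least one ready endpoint.
kubectl get svc,endpoints -n kueue-system
```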
I see a "Default Local Queue not found" error message
Problem
After submitting a workload, you see an error message similar to "Default Local Queue not found".
Diagnosis
No default local queue is defined in the namespace, and a local queue was not specified in the workload configuration.
Resolution
Resolve the problem in one of the following ways:
- If a local queue exists in the namespace, add the kueue.x-k8s.io/queue-name label to your workload manifest.
- If no local queue exists, create a default local queue in the namespace.
- Contact your administrator to request that a local queue be created for your namespace.
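As an illustration (all names are placeholders, and the ClusterQueue must already exist in your cluster), a workload carrying the queue-name label and a default LocalQueue might look like this. Note that whether a LocalQueue named "default" is picked up automatically depends on your Kueue version and configuration; labeling workloads explicitly always works:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job                   # placeholder
  namespace: my-namespace        # placeholder
  labels:
    kueue.x-k8s.io/queue-name: my-local-queue
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.example.com/my-image:latest
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default                  # may act as the namespace default, depending on configuration
  namespace: my-namespace
spec:
  clusterQueue: my-cluster-queue # must reference an existing ClusterQueue
```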
I see a "local_queue provided does not exist" error message
Problem
After submitting a workload, you see an error message similar to "local_queue provided does not exist".
Diagnosis
An incorrect value is specified for the local queue, or the local queue exists in a different namespace.
Resolution
- Verify the local queue exists in the correct namespace:
- Ensure the local queue name in the kueue.x-k8s.io/queue-name label matches exactly.
- If no local queue exists in your namespace, contact your administrator to request one.
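A quick way to run these checks, assuming my-namespace is the namespace where you submit workloads:

```shell
# List the local queues that actually exist in the namespace.
kubectl get localqueues -n my-namespace

# Compare against the queue-name label on your workload.
kubectl get job <job-name> -n my-namespace -o yaml | grep queue-name
```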
My workload is stuck in a suspended state
Problem
A workload (Job, RayCluster, InferenceService, etc.) remains in a Suspended or SchedulingGated state and its pods are not created.
Diagnosis
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
Resolution
- Check the Workload resource status:
- Inspect the Workload YAML for detailed status messages:
Check the status.conditions.message field, which provides the reason for the suspended state.
- Verify the ClusterQueue has sufficient quota:
- Either reduce the requested resources in your workload, or contact your administrator to increase the quota.
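The inspection steps above can be performed with kubectl; namespace and resource names are placeholders:

```shell
# List Workload resources and their admission status.
kubectl get workloads -n my-namespace

# Read status.conditions for the reason the workload is suspended.
kubectl get workload <workload-name> -n my-namespace -o yaml

# Compare the quota and current usage on the ClusterQueue.
kubectl describe clusterqueue <cluster-queue-name>
```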
My workload pod is terminated before the image pull completes
Problem
After admitting a workload, Kueue waits for all of the workload pods to be provisioned and running before marking the workload as ready. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled when this waiting period elapses, Kueue fails the workload and terminates the related pods.
Diagnosis
- Check the events on the pod:
- Look for events indicating the image pull was still in progress when the pod was terminated.
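For example (pod name and namespace are placeholders):

```shell
# Pod events show image pull progress and the termination reason.
kubectl describe pod <pod-name> -n my-namespace

# Alternatively, filter namespace events down to the affected pod.
kubectl get events -n my-namespace --field-selector involvedObject.name=<pod-name>
```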
Resolution
To resolve this issue, use one of the following approaches:
- Add an OnFailure restart policy to your workload pod template so that the pod restarts and the partially pulled image can be used.
- Contact your administrator to increase the waitForPodsReady timeout in the Kueue deployment configuration.
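A minimal sketch of the first approach, using a Job with placeholder names and image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-training-job              # placeholder
  labels:
    kueue.x-k8s.io/queue-name: my-local-queue
spec:
  template:
    spec:
      restartPolicy: OnFailure       # restart the pod instead of leaving it failed
      containers:
      - name: trainer
        image: registry.example.com/large-training-image:latest
```

On restart, layers already pulled to the node are reused, so a subsequent pull is more likely to finish within the waiting period.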
My ClusterQueue is not ready
Problem
A ClusterQueue exists but is not admitting any workloads.
Diagnosis
The ClusterQueue references a ResourceFlavor that does not exist.
Resolution
- Check the ClusterQueue status:
- Verify all referenced ResourceFlavors exist:
- Create any missing ResourceFlavors. A ClusterQueue is not ready until all referenced ResourceFlavors are created.
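These checks can be run as follows (the ClusterQueue name is a placeholder):

```shell
# The status conditions indicate why the ClusterQueue is inactive,
# including references to missing ResourceFlavors.
kubectl describe clusterqueue <cluster-queue-name>

# List the ResourceFlavors that exist in the cluster.
kubectl get resourceflavors
```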
Workloads are not being admitted in order
Problem
Workloads are not being admitted in the expected order (e.g., first-in-first-out).
Diagnosis
This can happen when workloads request different resource amounts, or when fair sharing and preemption policies are configured.
Resolution
- Check the workload priorities:
- Review the ClusterQueue's fair sharing weight and preemption configuration.
- Use the Visibility API to check the pending workload order:
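Assuming the Kueue visibility API is enabled in your cluster, the pending queue order and workload priorities can be inspected like this (names are placeholders):

```shell
# Pending workloads for a ClusterQueue, in admission order.
kubectl get --raw "/apis/visibility.kueue.x-k8s.io/v1beta1/clusterqueues/<cluster-queue-name>/pendingworkloads"

# Priorities of the Workload objects in a namespace.
kubectl get workloads -n my-namespace \
  -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority
```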