Troubleshooting common problems with distributed workloads for administrators

If users report errors related to distributed workloads in Alauda AI, read this section to understand what could be causing the problem and how to resolve it as an administrator.

If the problem is not documented here or in the release notes, contact Alauda Support.

Ray cluster is in a suspended state

Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the ResourceFlavor might not yet be created.

Diagnosis

The RayCluster resource remains in a suspended state, and its head and worker pods are not created.

Resolution

  1. Check the Workload resource status:
    kubectl get workloads -n <namespace>
  2. Inspect the Workload YAML for the detailed reason:
    kubectl get workload <workload-name> -n <namespace> -o yaml
    Check the status.conditions.message field:
    status:
      conditions:
        - message: "couldn't assign flavors to pod set worker: insufficient quota for nvidia.com/gpu in flavor gpu-flavor in ClusterQueue"
  3. Check the ClusterQueue configuration:
    kubectl get clusterqueues -o yaml
  4. Verify that the requested resources are within the limits defined in the ClusterQueue:
    • If the quota is insufficient, increase the nominalQuota for the relevant resource.
    • If the ResourceFlavor does not exist, create it.
    • If the user requested more resources than available, ask them to reduce their request.
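For reference, a minimal ResourceFlavor and ClusterQueue pair might look like the following sketch. The names (`gpu-flavor`, `team-queue`) and quota values are examples; adapt them to your environment:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor            # must match the flavor name referenced by the ClusterQueue
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-queue            # example name
spec:
  namespaceSelector: {}       # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-flavor
          resources:
            - name: cpu
              nominalQuota: "16"
            - name: memory
              nominalQuota: 64Gi
            - name: "nvidia.com/gpu"
              nominalQuota: "4"   # increase this value if GPU quota is insufficient
```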

Ray cluster is in a failed state

Problem

The cluster might have insufficient resources, or the Ray cluster might be misconfigured.

Diagnosis

The Ray cluster head pod or worker pods are not running. When a Ray cluster is first created, it may initially enter a failed state. This usually resolves after the reconciliation process completes and the pods start running.

Resolution

If the failed state persists:

  1. Check the pod events:
    kubectl describe pod <pod-name> -n <namespace>
  2. Check the RayCluster resource status:
    kubectl get raycluster <name> -n <namespace> -o yaml
    Review the status.conditions.message field.
  3. Common causes:
    • Insufficient node resources: The cluster does not have enough physical resources. Scale up the cluster or reduce the workload request.
    • Image pull failure: The container image cannot be pulled. Check image registry access and image name.
    • Scheduling failure: Nodes do not match the required labels or tolerations. Verify the ResourceFlavor configuration.
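For the scheduling-failure case, node matching is controlled by the ResourceFlavor. A sketch, assuming a GPU node pool labeled `gpu-type: a100` and tainted `nvidia.com/gpu=present:NoSchedule` (both example values):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor              # example name
spec:
  nodeLabels:
    gpu-type: a100              # pods admitted through this flavor are directed to matching nodes
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule        # lets admitted pods tolerate the GPU node taint
```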

Ray cluster does not start

Problem

After a Ray cluster is created, it remains in the Starting state and no pods are created.

Diagnosis

  1. Check the Workload resource:
    kubectl get workloads -n <namespace>
  2. Inspect the status.conditions.message field of both the Workload and RayCluster resources.

Resolution

  1. Verify the KubeRay operator pod is running:
    kubectl get pods -n cpaas-system | grep kuberay
  2. If the KubeRay operator pod is not running, restart it:
    kubectl delete pod -n cpaas-system -l app=kuberay-operator
  3. Check the KubeRay operator logs for errors:
    kubectl logs -n cpaas-system -l app=kuberay-operator --tail=100

PyTorchJob is not being admitted

Problem

A PyTorchJob remains in a pending state and its pods are not created.

Diagnosis

  1. Check if a Workload was created for the PyTorchJob:
    kubectl get workloads -n <namespace>
  2. If a Workload exists, check its status conditions:
    kubectl get workload <workload-name> -n <namespace> -o yaml

Resolution

  1. Verify the PyTorchJob has the kueue.x-k8s.io/queue-name label:
    kubectl get pytorchjob <name> -n <namespace> -o yaml | grep queue-name
  2. If the label is missing, add it to the PyTorchJob manifest.
  3. Verify the LocalQueue exists in the namespace and is backed by a ClusterQueue with sufficient quota.
  4. Ensure all resources requested by the PyTorchJob (CPU, memory, GPU) are covered in the ClusterQueue's coveredResources.
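For reference, the queue label on the PyTorchJob and the LocalQueue that backs it might look like the following sketch; the names (`example-job`, `user-queue`, `team-queue`, `example-ns`) are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-job                         # example name
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # must match a LocalQueue in the same namespace
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: example-ns                     # same namespace as the PyTorchJob
spec:
  clusterQueue: team-queue                  # this ClusterQueue must cover the requested resources
```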

Kueue webhook service has no endpoints

Problem

When creating distributed workloads, you see a 500 error about "failed to call webhook" with "no endpoints available for service".

Diagnosis

The Kueue controller manager pod is not running.

Resolution

  1. Check the Kueue pod status:
    kubectl get pods -n cpaas-system | grep kueue
  2. If the pod is in CrashLoopBackOff or not running, check the logs:
    kubectl logs -n cpaas-system -l app=kueue-controller-manager --tail=100
  3. Restart the Kueue controller:
    kubectl delete pod -n cpaas-system -l app=kueue-controller-manager
  4. Verify the webhook service endpoints are available:
    kubectl get endpoints -n cpaas-system kueue-webhook-service

Workload pod terminated before image pull completes

Problem

Workload pods are terminated before their container images finish pulling, because Kueue's waitForPodsReady timeout (default: 5 minutes) is too short for the large images commonly used in distributed workloads (for example, CUDA images or large model images).

Diagnosis

  1. Check the pod events:
    kubectl describe pod <pod-name> -n <namespace>
  2. Look for events indicating the image was still being pulled when the pod was terminated.

Resolution

  • For workloads that use large images, add an OnFailure restart policy to the pod template so that the container restarts on the same node and reuses image layers that have already been pulled:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
  • Increase the waitForPodsReady timeout in the Alauda Build of Kueue deployment configuration. Contact Alauda Support for guidance on modifying this setting.
  • Pre-pull large images on GPU nodes to reduce image pull time.
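If you manage the Kueue configuration directly, the timeout is set under waitForPodsReady in the Kueue Configuration (typically mounted from a ConfigMap). A sketch with an extended timeout; the 15m value is an example, and you should confirm the exact procedure for the Alauda Build of Kueue with Alauda Support:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 15m          # default is 5m; extend for workloads with large images
  blockAdmission: true
```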

Insufficient resources across the cohort

Problem

Distributed workloads are not being admitted even though other ClusterQueues in the same cohort have unused resources.

Diagnosis

The ClusterQueue might not be part of a cohort, or borrowing limits might be configured too restrictively.

Resolution

  1. Check if the ClusterQueue belongs to a cohort:
    kubectl get clusterqueue <name> -o yaml | grep cohort
  2. If the ClusterQueue does not have a spec.cohort field, it cannot borrow resources. Add a cohort:
    spec:
      cohort: shared-cohort
  3. If borrowing limits are set, verify they allow sufficient borrowing for the workload's resource requirements.
  4. Check other ClusterQueues in the cohort to verify they have unused resources and their lendingLimit allows lending.
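A ClusterQueue that participates in a cohort and permits borrowing and lending might be configured as in the following sketch; the cohort name, flavor name, and limits are examples:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-queue          # example name
spec:
  cohort: shared-cohort       # queues in the same cohort can borrow unused quota from each other
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-flavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: "4"
              borrowingLimit: "4"   # this queue may borrow up to 4 extra GPUs from the cohort
              lendingLimit: "2"     # other queues may borrow at most 2 of this queue's GPUs
```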