Running jobs with Kueue

As a data scientist or ML engineer, you can submit workloads to Alauda Build of Kueue for quota-managed scheduling. This page shows how to run batch Jobs, RayJobs, RayClusters, and PyTorchJobs with Kueue.

Prerequisites

  • The Alauda Build of Kueue cluster plugin is installed.
  • A ClusterQueue and LocalQueue have been configured by your administrator.
  • The Alauda Container Platform Web CLI can communicate with your cluster.

Identifying available local queues

Before submitting a job, identify the local queues available in your namespace:

kubectl get localqueues -n <your-namespace>

If a default local queue (named default) exists, you do not need to add the kueue.x-k8s.io/queue-name label to your workload. Otherwise, you must specify the local queue name.
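For example, when a default local queue exists, a Job can omit the label entirely. A minimal sketch (the namespace and image are illustrative, reusing the sample Job below):

```yaml
# Sketch: with a LocalQueue named "default" in the namespace, no
# kueue.x-k8s.io/queue-name label is needed; Kueue assigns the Job
# to the default queue automatically.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: team-ml        # illustrative namespace
spec:
  template:
    spec:
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["entrypoint-tester", "hello"]
        resources:
          requests:
            cpu: 1
            memory: "200Mi"
      restartPolicy: Never
```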

Running a batch Job

To run a standard Kubernetes batch Job with Kueue, add the kueue.x-k8s.io/queue-name label to the Job manifest:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["entrypoint-tester", "hello", "world"]
        resources:
          requests:
            cpu: 1
            memory: "200Mi"
      restartPolicy: Never
  1. kueue.x-k8s.io/queue-name: Specifies the LocalQueue that manages this Job. Replace team-ml-queue with the name of a LocalQueue in your namespace.

Submit the Job:

kubectl create -f job.yaml
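Kueue admits batch Jobs through the Job's suspend mechanism: on creation the Job is held with spec.suspend set to true, and Kueue flips it to false once quota is reserved. While the Job is queued but not yet admitted, the relevant fields look roughly like this (a sketch, not full output):

```yaml
# Relevant fields of a queued, not-yet-admitted Job (sketch):
spec:
  suspend: true          # set by Kueue; becomes false on admission
status:
  conditions:
  - type: Suspended      # standard batch Job condition while suspended
    status: "True"
```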

Running a RayJob

To run a Ray-based distributed job with Kueue, add the kueue.x-k8s.io/queue-name label to the RayJob manifest:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-training-job
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  entrypoint: python /home/ray/train.py
  runtimeEnvYAML: |
    pip:
      - torch
      - transformers
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0-py310-gpu
            resources:
              requests:
                cpu: "2"
                memory: "4Gi"
              limits:
                cpu: "2"
                memory: "4Gi"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: gpu-workers
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0-py310-gpu
            resources:
              requests:
                cpu: "4"
                memory: "8Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "4"
                memory: "8Gi"
                nvidia.com/gpu: "1"
  1. kueue.x-k8s.io/queue-name: Specifies the LocalQueue for this RayJob. Kueue will admit the entire RayJob (head + workers) as a single unit using gang scheduling.

Running a RayCluster

To create a Ray cluster managed by Kueue:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0-py310-gpu
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
  workerGroupSpecs:
  - replicas: 2
    minReplicas: 2
    maxReplicas: 2
    groupName: gpu-workers
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-py310-gpu
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
  1. kueue.x-k8s.io/queue-name: Specifies the LocalQueue for this RayCluster. The cluster's pods are not started until Kueue admits the workload based on available quota.

Running a PyTorchJob

To run a distributed PyTorch training job with Kueue:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-training
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
            command:
            - python
            - -m
            - torch.distributed.launch
            - --nproc_per_node=1
            - train.py
            resources:
              requests:
                cpu: "4"
                memory: "8Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "4"
                memory: "8Gi"
                nvidia.com/gpu: "1"
          restartPolicy: OnFailure
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
            command:
            - python
            - -m
            - torch.distributed.launch
            - --nproc_per_node=1
            - train.py
            resources:
              requests:
                cpu: "4"
                memory: "8Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "4"
                memory: "8Gi"
                nvidia.com/gpu: "1"
          restartPolicy: OnFailure
  1. kueue.x-k8s.io/queue-name: Kueue will admit all replicas (Master + Workers) together using gang scheduling, ensuring the entire training job starts only when all required GPUs are available.

Monitoring your workloads

After submitting a workload, you can monitor its status:

  1. Check if the workload was admitted:

    kubectl get workloads -n <your-namespace>
  2. View the position of your workload in the queue:

    kubectl get --raw "/apis/visibility.kueue.x-k8s.io/v1beta2/namespaces/<your-namespace>/localqueues/<queue-name>/pendingworkloads"
  3. Check the workload's admission status:

    kubectl get workload <workload-name> -n <your-namespace> -o jsonpath='{.status.conditions}'
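The admission check in step 3 can be scripted. A minimal sketch (the function name and sample data are illustrative) that takes a Workload object as JSON, in practice the output of kubectl get workload <workload-name> -o json, and reports whether its Admitted condition is True:

```python
import json

def is_admitted(workload_json: str) -> bool:
    """Return True if the Workload has a condition of type 'Admitted'
    with status 'True', meaning Kueue has admitted it for execution."""
    wl = json.loads(workload_json)
    for cond in wl.get("status", {}).get("conditions", []):
        if cond.get("type") == "Admitted":
            return cond.get("status") == "True"
    return False

# Sample Workload status shaped like Kueue's Workload API reports it:
sample = '''{"status": {"conditions": [
  {"type": "QuotaReserved", "status": "True"},
  {"type": "Admitted", "status": "True"}]}}'''
print(is_admitted(sample))  # prints True
```

Pipe live output into such a script, for example with kubectl get workload <workload-name> -n <your-namespace> -o json, to poll admission state from automation.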