CNCF AI Conformance
The CNCF Kubernetes AI Conformance defines a set of additional capabilities, APIs, and configurations that a Kubernetes cluster MUST offer, on top of standard CNCF Kubernetes Conformance, to reliably and efficiently run AI/ML workloads.
This document demonstrates how these requirements can be met using RKE2 v1.34.1+rke2r1.
Support Dynamic Resource Allocation (DRA)
DRA is a new API that enables more flexible and fine-grained resource requests beyond simple counts.
It has been GA since Kubernetes v1.34. The requirement states that it must be possible to verify that all the resource.k8s.io/v1 DRA API resources are enabled. This can be done by running:
kubectl api-resources --api-group=resource.k8s.io
The output should be:
NAME                     SHORTNAMES   APIVERSION           NAMESPACED   KIND
deviceclasses                         resource.k8s.io/v1   false        DeviceClass
resourceclaims                        resource.k8s.io/v1   true         ResourceClaim
resourceclaimtemplates                resource.k8s.io/v1   true         ResourceClaimTemplate
resourceslices                        resource.k8s.io/v1   false        ResourceSlice
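Beyond checking that the API group is served, here is a hedged sketch of how a workload would consume a device through DRA. It assumes a DRA driver is installed and publishes a DeviceClass named gpu.nvidia.com (as the NVIDIA DRA driver does); neither the driver nor that class name is part of RKE2 itself:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-test-pod
spec:
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    resources:
      claims:
      - name: gpu-claim
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: single-gpu
The scheduler allocates a device that satisfies the claim before binding the pod, instead of relying on opaque counts such as nvidia.com/gpu: 1.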
Support the Gateway API
Gateway API represents the next generation of Kubernetes ingress, load balancing and service mesh APIs.
To enable the Gateway API in RKE2, deploy with Traefik as the ingress controller and enable its KubernetesGateway provider, as explained in the Ingress Controller docs.
The requirement states that it must be possible to verify that all the gateway.networking.k8s.io/v1 Gateway API resources are enabled. This can be done by running:
kubectl api-resources --api-group=gateway.networking.k8s.io
The output should be:
NAME              SHORTNAMES   APIVERSION                          NAMESPACED   KIND
gatewayclasses    gc           gateway.networking.k8s.io/v1        false        GatewayClass
gateways          gtw          gateway.networking.k8s.io/v1        true         Gateway
grpcroutes                     gateway.networking.k8s.io/v1        true         GRPCRoute
httproutes                     gateway.networking.k8s.io/v1        true         HTTPRoute
referencegrants   refgrant     gateway.networking.k8s.io/v1beta1   true         ReferenceGrant
To verify that Traefik is consuming Gateway API resources, you can create a GatewayClass:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: traefik
spec:
  controllerName: traefik.io/gateway-controller
Then, if you verify its status:
kubectl get gatewayclass traefik -o jsonpath='{.status}'
The output should include an Accepted condition with:
"message":"Handled by Traefik controller","observedGeneration":1,"reason":"Handled","status":"True","type":"Accepted"
Gang Scheduling
It must be possible to install a gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue or Volcano).
We will install Volcano in RKE2 and run a quick test:
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
That should install three deployments in the volcano-system namespace:
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           130m
deployment.apps/volcano-controllers   1/1     1            1           130m
deployment.apps/volcano-scheduler     1/1     1            1           130m
This is enough evidence for the requirement; nevertheless, let's check that it actually works. In a cluster with two GPUs, we can create a gang job with two tasks, each requesting an NVIDIA GPU:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-nbody-gang-job
  namespace: default
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
  - name: nbody-task-1
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        runtimeClassName: nvidia
        containers:
        - name: cuda-container-1
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody
          command: ["/bin/bash", "-c"]
          args:
          - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
          resources:
            limits:
              nvidia.com/gpu: 1
  - name: nbody-task-2
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        runtimeClassName: nvidia
        containers:
        - name: cuda-container-2
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody
          command: ["/bin/bash", "-c"]
          args:
          - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
          resources:
            limits:
              nvidia.com/gpu: 1
After a few seconds, we should see both pods running.
If we delete the job, set minAvailable: 3 in the manifest, and add a third task:
  - name: nbody-task-3
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        runtimeClassName: nvidia
        containers:
        - name: cuda-container-3
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody
          command: ["/bin/bash", "-c"]
          args:
          - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
          resources:
            limits:
              nvidia.com/gpu: 1
We'll see that none of the three pods run; they all stay in STATUS Pending:
default   gpu-nbody-gang-job-nbody-task-1-0   0/1   Pending   0   50s
default   gpu-nbody-gang-job-nbody-task-2-0   0/1   Pending   0   50s
default   gpu-nbody-gang-job-nbody-task-3-0   0/1   Pending   0   50s
This demonstrates that gang scheduling is working as expected.
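As an additional check, Volcano represents the gang as a PodGroup custom resource created for the job; its status and events should indicate that the group cannot be fully scheduled:
kubectl get podgroups -n default
kubectl describe podgroups -n default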
Cluster autoscaler
If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators. RKE2 is a Kubernetes distro and thus does not provide a cluster autoscaler.
For reference, here is how the upstream Cluster Autoscaler can be used, taking Azure as an example.
1 - Create a VMSS (Virtual Machine Scale Set) whose VMs are equipped with GPUs.
2 - Deploy RKE2 with the following options:
disable-cloud-controller: true # Only in rke2-server
kubelet-arg: # On both rke2-server and rke2-agent
- --cloud-provider=external
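For reference, these options go into the RKE2 configuration file at /etc/rancher/rke2/config.yaml; a minimal sketch for a server node looks like this (agent nodes only need the kubelet-arg part):
# /etc/rancher/rke2/config.yaml on an rke2-server node
disable-cloud-controller: true
kubelet-arg:
  - --cloud-provider=external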
3 - Install the Azure CCM:
helm install --repo https://raw.githubusercontent.com/kubernetes-sigs/cloud-provider-azure/master/helm/repo cloud-provider-azure \
  --generate-name \
  --set cloudControllerManager.imageRepository=mcr.microsoft.com/oss/kubernetes \
  --set cloudControllerManager.imageName=azure-cloud-controller-manager \
  --set cloudNodeManager.imageRepository=mcr.microsoft.com/oss/kubernetes \
  --set cloudNodeManager.imageName=azure-cloud-node-manager \
  --set cloudControllerManager.configureCloudRoutes=false \
  --set cloudControllerManager.allocateNodeCidrs=false
4 - Create a correct azure.json and save it under /etc/kubernetes/azure.json. Make sure these two config options are present:
"useManagedIdentityExtension": false,
"useInstanceMetadata": true
If it is correctly deployed, nodes should include a providerID. You can check it with:
kubectl get nodes -o yaml | grep providerID
This ID is taken from the instance metadata, which can be checked with:
curl -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
5 - Install the upstream autoscaler. First, create a values.yaml specifying the VMSS created in step 1 and the other Azure details (a sketch is shown after the commands below). Then:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler -f values.yaml
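The values.yaml referenced above could look roughly like the following; the key names follow the cluster-autoscaler Helm chart and may vary between chart versions, and all IDs, names, and sizes are placeholders:
cloudProvider: azure
azureTenantID: "<tenant-id>"
azureSubscriptionID: "<subscription-id>"
azureClientID: "<client-id>"
azureClientSecret: "<client-secret>"
azureResourceGroup: "<resource-group>"
azureVMType: "vmss"
autoscalingGroups:
  - name: <gpu-vmss-name>
    minSize: 1
    maxSize: 3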
6 - If everything is correctly deployed, the autoscaler will detect pending pods that request a GPU resource. If the cluster cannot satisfy the request, the autoscaler will contact Azure, and a new node with a GPU will be created and added to the cluster automatically.
Horizontal pod autoscaler
The requirement includes the ability to scale Pods based on custom metrics relevant to AI/ML workloads. For that we use the HorizontalPodAutoscaler (HPA) included by default in Kubernetes.
We can demonstrate this requirement by installing an Ollama deployment in RKE2 and then applying the following manifest (see the note at the end of this section on how the custom gpu_utilization metric is expected to be served):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: v1
        kind: Namespace
        name: suse-private-ai
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"
When the load on Ollama increases, the GPU utilization will reach 70, which triggers the creation of additional ollama pods.
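Note that gpu_utilization is not a metric Kubernetes knows about natively: something has to serve it through the custom metrics API. One way to do that is prometheus-adapter translating the DCGM GPU utilization metric described in the following sections; a hedged sketch of such a rule (prometheus-adapter chart values; the exposed metric name is our choice) could be:
# prometheus-adapter values.yaml excerpt (sketch)
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'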
Accelerator Performance Metrics
The requirement states that it must be possible to allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage.
If the NVIDIA GPU operator is installed as described in the docs, a daemonset with the name nvidia-dcgm-exporter is deployed together with the service nvidia-dcgm-exporter. We can query this service to collect the required GPU metrics: accelerator utilization, memory usage, temperature, power usage, etc.
From within the cluster (for example, after SSHing into one of the cluster nodes):
# Get the clusterIP
svcIP=$(kubectl get svc nvidia-dcgm-exporter -n gpu-operator -o jsonpath='{.spec.clusterIP}')
# Get the port
svcPort=$(kubectl get svc nvidia-dcgm-exporter -n gpu-operator -o jsonpath='{.spec.ports[0].port}')
# Output the metrics
curl -sL http://${svcIP}:${svcPort}/metrics
That will show the metrics exposed in the OpenMetrics text format. The next section explains how to consume them with Prometheus and Grafana.
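As a quick sanity check before wiring up Prometheus, you can filter for the per-GPU utilization and memory-usage series; DCGM_FI_DEV_FB_USED is assumed here to be part of the metric set enabled in your DCGM exporter configuration:
curl -sL http://${svcIP}:${svcPort}/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|FB_USED)'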
AI Job & Inference Service Metrics
It is required that there is a system capable of discovering and collecting metrics from workloads that expose them in a standard format.
We can use Prometheus and Grafana for that. First, we install them:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Once everything is running, we create a ServiceMonitor that tells Prometheus which workloads to scrape. As an example, we will collect metrics from the DCGM exporter, a component of the NVIDIA GPU Operator described in the previous section:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-monitor
  namespace: monitoring
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
    - gpu-operator
  endpoints:
  - port: gpu-metrics
    path: /metrics
    interval: 15s
After a few minutes, the Grafana dashboard will show the DCGM metrics, for example DCGM_FI_DEV_GPU_UTIL.
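You can also confirm that the metric reaches Prometheus without going through Grafana by port-forwarding the prometheus-operated service (the headless service the Prometheus Operator creates for the Prometheus instance) and querying the HTTP API:
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'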
Secure accelerator access
We must ensure that access to accelerators from within containers is properly isolated and mediated by Kubernetes. To achieve this, we install the NVIDIA GPU Operator as described in the docs.
Once that is done, please verify that the toolkit config under /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml contains:
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
and that the device-plugin daemonset includes the environment variable:
DEVICE_LIST_STRATEGY: volume-mounts
If that is the case, you can verify this requirement by running the following three pods in a cluster with only one GPU:
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark1
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    command: ["/bin/bash", "-c"]
    args:
    - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark2
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container2
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    command: ["/bin/bash", "-c"]
    args:
    - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark3
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container3
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    command: ["/bin/bash", "-c"]
    args:
    - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
The first pod will run correctly and in the logs you will see that it is able to consume the GPU.
The second pod will not be scheduled by Kubernetes because the only GPU available in the cluster is already being consumed by the first pod.
The third pod will run, but in the logs you will see that it does not find any GPU available (it did not request one, so no device is exposed to it). Therefore, the isolation is working.
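A quick way to check the three outcomes, using the pod names from the manifests above:
kubectl get pods -n default
# First pod: Running, logs show the benchmark using the GPU
kubectl logs nbody-gpu-benchmark1 --tail=20
# Third pod: Running, but its logs report that no GPU is available
kubectl logs nbody-gpu-benchmark3 --tail=20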
Robust CRD and Controller Operation
The requirement states that at least one complex AI operator with CRDs can be installed and functions reliably. To test this, it recommends checking that the CRDs are correctly registered and that invalid configurations are rejected by an admission webhook.
We are going to install the Kubeflow Training Operator in RKE2 to verify this requirement.
Unfortunately, Kubeflow does not provide a Helm chart, but we can install the training operator with kustomize:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.8.0"
After a few seconds, everything should be up and running, and we should be able to verify that the CRDs are installed and the webhook is registered:
$> kubectl get crds | grep kubeflow
mpijobs.kubeflow.org       2025-10-24T13:04:27Z
mxjobs.kubeflow.org        2025-10-24T13:04:27Z
paddlejobs.kubeflow.org    2025-10-24T13:04:28Z
pytorchjobs.kubeflow.org   2025-10-24T13:04:28Z
tfjobs.kubeflow.org        2025-10-24T13:04:29Z
xgboostjobs.kubeflow.org   2025-10-24T13:04:29Z
$> kubectl get validatingwebhookconfigurations
NAME                                       WEBHOOKS   AGE
validator.training-operator.kubeflow.org   5          10m
$> kubectl get pods -n kubeflow
NAME                                READY   STATUS    RESTARTS   AGE
training-operator-f7d4b59f6-vdnh9   1/1     Running   0          9m54s
If we now try to create a TFJob that is missing a required field, for example:
# saved as invalid-tfjob.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-invalid-test
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            # INTENTIONAL ERROR: Missing the 'image' field
            # image: tensorflow/tensorflow:latest
            # command: ["/bin/bash", "-c"]
            # args: ["echo 'Chief running'; sleep 10;"]
When we apply it with kubectl apply -f invalid-tfjob.yaml, we get the expected error from the admission webhook:
Error from server (Forbidden): error when creating "invalid-tfjob.yaml": admission webhook "validator.tfjob.training-operator.kubeflow.org" denied the request: spec.tfReplicaSpecs[Chief].template.spec.containers[0].image: Required value: must be required
If we restore the commented lines in the previous example and retry, the TFJob is admitted and works as expected, as shown below.
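For completeness, the valid version of the same TFJob, with the previously commented lines restored (renamed to tfjob-valid-test for clarity):
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-valid-test
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest
            command: ["/bin/bash", "-c"]
            args: ["echo 'Chief running'; sleep 10;"]
This version passes the admission webhook and is accepted by the cluster.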