GPU Operators

Deploy NVIDIA operator

The NVIDIA GPU Operator lets administrators of Kubernetes clusters manage GPUs much like CPUs. It includes everything pods need to use GPUs.

Host OS requirements

To expose the GPU to the pod, the NVIDIA kernel drivers and the libnvidia-ml library must be correctly installed in the host OS. The NVIDIA Operator can automatically install drivers and libraries on some operating systems; check the NVIDIA documentation for information on supported operating system releases.

Starting with GPU Operator v26.3.x, the operator can also manage driver and library installation on any operating system, provided the OS vendor or administrator supplies a compatible driver container image.

Installing the NVIDIA components on your host OS is outside the scope of this document; refer to the NVIDIA documentation for instructions.

Checks for pre-installed NVIDIA drivers/libraries

The following three commands should each produce output similar to the examples shown if the kernel driver and the library were correctly installed:

  1. lsmod | grep nvidia

    Returns a list of nvidia kernel modules, for example:

    nvidia_uvm 2129920 0
    nvidia_drm 131072 0
    nvidia_modeset 1572864 1 nvidia_drm
    video 77824 1 nvidia_modeset
    nvidia 9965568 2 nvidia_uvm,nvidia_modeset
    ecc 45056 1 nvidia
  2. cat /proc/driver/nvidia/version

    Returns the NVRM and GCC versions of the driver, for example:

    NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 555.42.06 Release Build (abuild@host) Thu Jul 11 12:00:00 UTC 2024
    GCC version: gcc version 7.5.0 (SUSE Linux)
  3. find /usr/ -iname libnvidia-ml.so

    Returns the path to the libnvidia-ml.so library, for example:

    /usr/lib64/libnvidia-ml.so

    This library is used by Kubernetes components to interact with the kernel driver.
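The three checks above can be bundled into one script. The following is a minimal sketch, not an official tool: the function names (has_nvidia_module, nvrm_version, check_nvidia_host) are hypothetical, and the version parsing assumes the "NVRM version:" line looks like the example output shown earlier.

```shell
# Sketch only: these helper names are hypothetical, not part of any NVIDIA or RKE2 tool.

# Succeeds when lsmod-style text on stdin lists the core "nvidia" kernel module.
has_nvidia_module() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Prints the first dotted version number on the "NVRM version:" line of stdin.
# Assumes the line format shown in the example output above.
nvrm_version() {
    grep '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Runs the three host checks and prints one PASS/FAIL line per check.
check_nvidia_host() {
    if lsmod | has_nvidia_module; then
        echo "PASS: nvidia kernel module is loaded"
    else
        echo "FAIL: nvidia kernel module is not loaded"
    fi

    if [ -r /proc/driver/nvidia/version ]; then
        echo "PASS: driver version $(nvrm_version < /proc/driver/nvidia/version)"
    else
        echo "FAIL: /proc/driver/nvidia/version is missing"
    fi

    lib=$(find /usr/ -iname 'libnvidia-ml.so' 2>/dev/null | head -n 1)
    if [ -n "$lib" ]; then
        echo "PASS: found $lib"
    else
        echo "FAIL: libnvidia-ml.so not found under /usr"
    fi
}
```

To use it, source the functions on the GPU node and run check_nvidia_host; any FAIL line points to the matching manual check above.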

Operator installation

Once the OS is ready and RKE2 is running, install the GPU Operator with one of the following YAML manifests. There are two installation options.

If the drivers and libraries are pre-installed, or you are using an operating system supported by NVIDIA, use the following manifest:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  version: v26.3.1
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    cdi:
      nriPluginEnabled: true

If your operating system vendor supplies a compatible driver image, you can point the operator to it with the driver values. For example, on SLES 16.0, you can use the following manifest:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  version: v26.3.1
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    cdi:
      nriPluginEnabled: true
    driver:
      repository: registry.suse.com/third-party/nvidia
      usePrecompiled: true
      version: 595 # This depends on the NVIDIA driver that works with your GPU architecture

Info

NVIDIA GPU Operator v26.3.x recommends using the Node Resource Interface (NRI) specification, which simplifies operations: no extra environment variables need to be passed and the containerd configuration does not need to change. NRI requires containerd 2.1.

Version Gate

Containerd 2.1 is available as of September 2025 releases: v1.31.13+rke2r1, v1.32.9+rke2r1, v1.33.5+rke2r1, v1.34.1+rke2r1
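As a rough illustration of the version gate, the sketch below compares an RKE2 version string against the releases listed above. The function name rke2_has_containerd_2_1 is hypothetical. The code also assumes that minor lines newer than 1.34 ship containerd 2.1 and that lines older than 1.31 do not, which the list above does not state explicitly.

```shell
# Sketch only: hypothetical helper, not part of RKE2.
# Succeeds if the given RKE2 version (e.g. v1.33.5+rke2r1) is at or above
# the first release in its minor line that is listed as shipping containerd 2.1.
rke2_has_containerd_2_1() {
    version=${1#v}            # drop the leading "v"
    version=${version%%+*}    # drop the "+rke2rN" suffix
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    patch=${rest#*.}

    [ "$major" -eq 1 ] || return 1

    case "$minor" in
        31) min_patch=13 ;;
        32) min_patch=9 ;;
        33) min_patch=5 ;;
        34) min_patch=1 ;;
        *)
            # Assumption: minor lines after 1.34 include containerd 2.1,
            # minor lines before 1.31 do not.
            [ "$minor" -gt 34 ]
            return
            ;;
    esac

    [ "$patch" -ge "$min_patch" ]
}
```

The version string can be read from the node's .status.nodeInfo.kubeletVersion field. The .status.nodeInfo.containerRuntimeVersion field reports the containerd version directly and is the more authoritative check. Pre-release suffixes such as -rc1 are not handled by this sketch.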

After a few minutes, you can make the following checks to verify that everything worked as expected:

  1. Assuming the drivers and libnvidia-ml.so library were previously installed, check if the operator detects them correctly:

    kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' | grep "nvidia.com"

    You should see labels describing the driver and the GPU (e.g. nvidia.com/gpu.machine or nvidia.com/cuda.driver.major).

  2. Check if the GPU was added (by nvidia-device-plugin-daemonset) as an allocatable resource in the node:

    kubectl get node $NODENAME -o jsonpath='{.status.allocatable}'

    You should see "nvidia.com/gpu": followed by the number of GPUs in the node.

  3. Check that the container runtime binary exists (it gets installed by the nvidia-container-toolkit-daemonset):

    ls /usr/local/nvidia/toolkit/nvidia-container-runtime
  4. (Only if not using NRI) Verify that the containerd config was updated to include the NVIDIA container runtime:

    grep nvidia /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  5. Run a pod to verify that the GPU resource can be scheduled successfully and that the pod can detect it:

    apiVersion: v1
    kind: Pod
    metadata:
      name: nbody-gpu-benchmark
      namespace: default
    spec:
      restartPolicy: OnFailure
      # runtimeClassName: nvidia <== Only needed for v25.3.x
      containers:
      - name: cuda-container
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody
        args: ["nbody", "-gpu", "-benchmark"]
        resources:
          limits:
            nvidia.com/gpu: 1
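Checks 2 and 5 lend themselves to scripting. The following is a minimal sketch under a few assumptions: the helper names node_gpu_count and wait_for_pod are hypothetical, the pod runs in the default namespace as in the manifest above, and the polling interval and retry count are arbitrary. It relies on kubectl's JSONPath support for escaping dots in key names.

```shell
# Sketch only: hypothetical helpers built on standard kubectl calls.

# Prints the allocatable GPU count for a node (empty if none is advertised).
node_gpu_count() {
    kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Polls a pod in the default namespace until its phase is Succeeded (returns 0)
# or Failed (returns 1). Gives up and returns 1 after the given number of tries.
wait_for_pod() {
    pod=$1
    tries=${2:-30}   # assumption: 30 tries, 10 seconds apart
    while [ "$tries" -gt 0 ]; do
        phase=$(kubectl get pod "$pod" -n default -o jsonpath='{.status.phase}')
        case "$phase" in
            Succeeded) return 0 ;;
            Failed) return 1 ;;
        esac
        tries=$((tries - 1))
        sleep 10
    done
    return 1
}
```

For example, after saving the pod manifest to a file and applying it with kubectl apply -f, running wait_for_pod nbody-gpu-benchmark followed by kubectl logs nbody-gpu-benchmark shows the benchmark output once the pod completes.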

RKE2 also uses PATH to find alternative container runtimes, in addition to checking the default paths used by the container runtime packages. To use this feature, you must modify the RKE2 service's PATH environment variable to add the directories containing the container runtime binaries.

It's recommended that you modify one of these two environment files:

  • /etc/default/rke2-server # or rke2-agent
  • /etc/sysconfig/rke2-server # or rke2-agent

This example appends the current shell's PATH to /etc/default/rke2-server:

echo PATH=$PATH >> /etc/default/rke2-server

Warning

PATH changes should be done with care to avoid placing untrusted binaries in the path of services that run as root.
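To keep the service PATH narrow, one option is to append only the directory that holds the runtime binaries rather than a whole shell PATH. The sketch below is an illustration under assumptions: path_with_dir is a hypothetical helper, and /usr/local/nvidia/toolkit is used only because it is the toolkit location mentioned in the checks above.

```shell
# Sketch only: hypothetical helper.
# Prints a PATH value made of the second argument with the directory in the
# first argument appended, unless that directory is already present.
path_with_dir() {
    dir=$1
    current=$2
    if [ -z "$current" ]; then
        printf '%s\n' "$dir"
        return
    fi
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;
        *) printf '%s\n' "$current:$dir" ;;
    esac
}

# Example use (run as root, not executed here):
#   echo "PATH=$(path_with_dir /usr/local/nvidia/toolkit "$PATH")" >> /etc/default/rke2-server
```

Review the resulting line before restarting the RKE2 service, since every directory on it is searched by a service that runs as root.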