So, playing as I am with AI applications these days, I convinced myself (hey, it was my birthday...) that I needed to build a machine with a decent graphics card in it. An A100 is somewhat outside my price range, and even my preferred NVIDIA L4s[1] (just look at that blissfully low power requirement...) would break the budget, so I was forced (honestly!) to look at an RTX 4090 build. The GPU performance matters less than the 24GB of VRAM, which lets me play with larger models than my current machine can handle - and it's probably the most cost-effective way to get that much VRAM on the market at the moment.
Having built the beast, it does seem like a bit of a shame not to use it for maybe playing the odd game, though...
So, I'd like it to be able to run AI tasks when I want it to, but at the same time be a decent gaming machine. It seems to me that this is the kind of problem I can solve with Kubernetes; if I add it to my Kubernetes cluster, but then configure AI applications to "scale to zero" the actual compute-heavy AI workloads (such as a custom coding assistant model) when I'm not making use of them, I can have the best of both worlds. AI-on-demand when I'm doing work, and a GPU all-to-myself when I want to run Steam.
So, this is a record of my journey getting this set up. Part one will document getting the basics ready - adding the machine to my Kubernetes cluster. In the next part, I'll have worked out how to deploy an example AI workload with automatic scale-up and scale-down when idle.
First, we configure the machine as we would any other Kubernetes node (that's out of the scope of this article, but essentially it's the usual node prep: turning off swap, adjusting kernel parameters so the CNI can work, and of course making sure `containerd` is installed; a rough sketch follows the join command below), and then join the cluster:
```bash
kubeadm join cv-new.k8s.int.snowgoons.ro:6443 \
  --token <...> \
  --discovery-token-ca-cert-hash sha256:<...>
```
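For completeness, the node prep I'm glossing over amounts to something like the following on Ubuntu 22.04 - a rough sketch rather than a full guide, and your CNI of choice may need extra settings:

```bash
# kubelet refuses to run with swap enabled
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# kernel modules and sysctls most CNI plugins rely on
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system

# containerd comes from the Ubuntu repos; kubeadm/kubelet/kubectl come from
# the Kubernetes package repository (pkgs.k8s.io), which needs adding separately
sudo apt-get install -y containerd
```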
I don't want to let any old workload be deployed to this machine, though - after all, sometimes I want to use it to play Tomb Raider (it's a shame to waste a nice graphics card), and I don't want my games slowing down because Kubernetes scheduled PostgreSQL or Kafka on that machine.
So, I add a taint[2] to the new node, which restricts scheduling on that node. Only workloads which have been specifically deployed with a corresponding *toleration* will be scheduled to my games, err, research, machine:
```bash
kubectl taint nodes joi restriction=CUDAonly:NoSchedule
```
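A quick sanity check that the taint is in place; the `describe` output should show something like `restriction=CUDAonly:NoSchedule` under `Taints:`:

```bash
kubectl describe node joi | grep -i taints
```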
NVIDIA provide an operator which automates the deployment of the appropriate GPU drivers and container toolkit on worker nodes that have GPUs.
However, my node is not dedicated to Kubernetes. It will be used as a workstation (and gaming machine!) as well as a specialist Kubernetes node for CUDA deployments. So I prefer to be able to manage the deployment of graphics drivers etc. myself; that means installing the GPU drivers and container toolkit on the machine manually.
I covered installing the CUDA toolkit in an earlier article[3], so I'll not go over that again. But installing the Container Toolkit is a new challenge - and in particular, I need to install it for `containerd` (as used on my Kubernetes cluster) and not Docker. To do this I'm going to follow the instructions on the NVIDIA Website[4].
First, we need to install the software; since I use Ubuntu 22.04 as my host operating system, we install it from NVIDIA's APT repository:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
```
Then we need to configure `containerd`. In the past, when I did something similar to enable the container toolkit for Docker, there was some assembly required - but mercifully NVIDIA have made the configuration for `containerd` trivially easy:
```bash
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```
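Under the hood, `nvidia-ctk` just edits `/etc/containerd/config.toml` to register an extra runtime handler; the added section looks roughly like this (the exact contents vary a little between toolkit versions):

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```

That `nvidia` handler is what the `runtimeClassName: nvidia` in the Pod spec later on refers to; the GPU operator we install below takes care of creating the corresponding RuntimeClass object in the cluster.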
To see if it's working, let's try running a CUDA container directly with `containerd`. We should be able to execute the `nvidia-smi` command within an appropriate container and see our graphics card:
```bash
sudo ctr images pull docker.io/nvidia/cuda:12.5.0-runtime-ubuntu22.04
sudo ctr run --gpus 0 --rm docker.io/nvidia/cuda:12.5.0-runtime-ubuntu22.04 nvidia-smi nvidia-smi
```
Hopefully, you will see an output like this:
```
Sun Jun 30 08:32:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0  On |                  Off |
|  0%   34C    P8              22W / 450W |    478MiB / 24564MiB |     12%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```
It's interesting to note that the container runtime properly isolates the GPU's processes from the container - if I ran that same command directly on the host, I'd see a couple of processes making use of the GPU, but they are hidden from the container environment.
The NVIDIA GPU Operator gives Kubernetes knowledge of GPUs as a special resource that containers can request in order to run. We install the operator so that we can add resource limits like this to our deployment manifests and have Kubernetes automatically schedule the pod on an appropriate node:
```yaml
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
```
Installing the operator is as simple as adding the Helm repo and then installing like so:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n nvidia-gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```
Installing this Helm chart will deploy a `feature-discovery-worker` to every node in your cluster. This worker runs on each node and attempts to determine whether that node has a GPU available to the container runtime; if it does, it adds the label `feature.node.kubernetes.io/pci-10de.present=true` to the node (0x10de being NVIDIA's PCI vendor ID).
BUT, in our environment this won't work. Why not? Because we have a taint on our GPU node; it will prevent the discovery worker from being deployed on the one node where we really need it.
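You can see the symptom for yourself by listing the operator's pods along with the nodes they landed on - the GPU node will be conspicuously absent:

```bash
kubectl get pods -n nvidia-gpu-operator -o wide
```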
So we need to edit the `values.yaml` for our deployment, add tolerations to the daemonsets, and then upgrade the release (or, if you got this far without making the mistake I did, use an appropriate configuration from the beginning!):
```yaml
# values.yaml for deploying on tainted cluster
driver:
  enabled: false
toolkit:
  enabled: false
daemonsets:
  tolerations:
  - key: restriction
    operator: "Equal"
    value: "CUDAonly"
    effect: "NoSchedule"
operator:
  tolerations:
  - key: restriction
    operator: "Equal"
    value: "CUDAonly"
    effect: "NoSchedule"
node-feature-discovery:
  worker:
    tolerations:
    - key: restriction
      operator: "Equal"
      value: "CUDAonly"
      effect: "NoSchedule"
```
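Applying the change is then a `helm upgrade` of the existing release with the new values; the release name below is a placeholder, since `--generate-name` gave mine a random suffix:

```bash
helm list -n nvidia-gpu-operator     # find the generated release name
helm upgrade <release-name> nvidia/gpu-operator \
  -n nvidia-gpu-operator \
  -f values.yaml \
  --wait
```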
So, our first test to see if it is working is to have a look-see for that label and other attributes on our CUDA-enabled node. From `kubectl describe node joi`[^1]:
```
Name:               joi
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    [...snipped for brevity...]
                    feature.node.kubernetes.io/kernel-version.major=6
                    feature.node.kubernetes.io/kernel-version.minor=5
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10ec.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=joi
                    kubernetes.io/os=linux
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.present=true
```
It's there! Along with a selection of other labels added by the NVIDIA operator. If we carry on down, we can also see that now the scheduler is aware of the resource type `nvidia.com/gpu`:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests        Limits
  --------           --------        ------
  cpu                650m (2%)       1 (3%)
  memory             471966464 (0%)  1101219328 (1%)
  ephemeral-storage  0 (0%)          0 (0%)
  hugepages-1Gi      0 (0%)          0 (0%)
  hugepages-2Mi      0 (0%)          0 (0%)
  nvidia.com/gpu     0               0
```
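Another quick confirmation that the device plugin has registered the card is to look at the node's allocatable resources, which should now include one `nvidia.com/gpu`:

```bash
kubectl get node joi -o jsonpath='{.status.allocatable}'
```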
So, now we have it working (hopefully), we can try deploying the example Jupyter Notebook - my `test-jupyter.yaml` looks just like the one in NVIDIA's documentation[5], but with the added toleration for my CUDA node (and I use a LoadBalancer for my service):
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: LoadBalancer
  ports:
  - port: 80
    name: http
    targetPort: 8888
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  runtimeClassName: nvidia
  securityContext:
    fsGroup: 0
  tolerations:
  - key: restriction
    operator: "Equal"
    value: "CUDAonly"
    effect: "NoSchedule"
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888
      name: notebook
```
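Then it's just a case of applying it and waiting for the (rather large) image to pull; once the pod is running, the Jupyter login token shows up in the container log - at least, that's how the stock TensorFlow Jupyter image behaves:

```bash
kubectl apply -f test-jupyter.yaml
kubectl get pod tf-notebook -o wide      # should be Running, on the GPU node
kubectl logs tf-notebook | grep token    # Jupyter prints its login URL and token here
```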
`kubectl get services` tells me the IP it allocated...
```
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)        AGE
tf-notebook   LoadBalancer   10.102.248.172   192.168.0.196   80:30392/TCP   10m
```
...and, with bated breath, pointing a web browser at it:

{{< figure src="img/Screenshot_2024-06-30_14-51-27.png" caption="Nobody is more surprised than me" captionPosition="right">}}

It works!
That's good enough for a Sunday afternoon's work. In the next part, I'll work out how to take an application like Stable Diffusion/ComfyUI and deploy it so that it can scale-up and scale-down (to zero) on demand when I want to use it.
[^1]: In case it wasn't already obvious, my AI node is named after the AI character in Blade Runner 2049[6]
1: https://www.nvidia.com/en-us/data-center/l4/
2: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
3: /posts/posts-2024-06-06-local-llama3-assistant-in-jetbrains-.gmi
4: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
6: https://en.wikipedia.org/wiki/Blade_Runner_2049