Combining NVIDIA GPUs, which provide raw computing power, with Kubernetes, which manages containerized workloads, may seem like a perfect marriage of two complementary tools and an obvious solution. Yet at the technical level this combination, like many marriages, turned out to be trickier than expected. Read this blog post to find out how CodiLime found a way to make it work.
Let’s introduce the main characters, then. NVIDIA GPUs (Graphics Processing Units) are powerful tools used to accelerate computationally intensive tasks. They power millions of desktops, notebooks, workstations and supercomputers around the world and are used by developers, scientists, researchers and everyday users alike. Kubernetes (also known as K8s), on the other hand, is an open-source container orchestration system, sometimes called an operating system for data centres, that automates application deployment, scaling and management. It is widely used for running containerized applications.
The challenge - installing a Kubernetes cluster that can work with NVIDIA GPUs
We had to create a Kubernetes cluster that could effectively allocate NVIDIA GPUs. We wanted to use our own hardware, which was already at our disposal, instead of paying for cloud-based services. This was not only a question of money, but also of better management: hardware under our control would let us easily verify different versions of the GPU drivers and flexibly adapt them to users’ needs.
Testing the approach based on documentation
We had full documentation provided by NVIDIA, but it covered outdated versions of Kubernetes and Ubuntu. The docs also required installing K8s from the NVIDIA repository rather than from the official distribution, and came with the network plugin for Kubernetes already preinstalled. We decided to run some tests and create a cluster. It wasn’t really what we expected: the cluster was unstable and fussy, sometimes working properly and sometimes not. After a reboot, the nodes became unavailable. What’s worse, NVIDIA provided ready-made software packages with preinstalled tools (Helm, Tiller, Prometheus), which made debugging difficult when problems occurred.
The next step was to try installing the NVIDIA packages on a cluster prepared from the official packages. This time we could use Ubuntu 18.04 and the newest version of K8s. We could also choose a network plugin ourselves and install any additional packages we considered necessary. The cluster itself worked just fine, but Docker couldn’t communicate properly with the GPUs, and the pods deployed by the Kubernetes DaemonSet didn’t work. What was wrong? While investigating, we discovered that the nvidia-docker2 package was to blame. It was supposed to install the NVIDIA runtime that lets Docker talk to the GPUs; in practice, the NVIDIA runtime had not been installed. Moreover, the package forced us to use Docker Community Edition (CE) instead of other Docker versions, including docker.io from the official Ubuntu repositories. It was this very package that made Kubernetes act out. We tested other versions of the NVIDIA packages from their repository, but the results were no different. We also tried both Ubuntu 18.04 LTS and 16.04 LTS; 16.04 was easier to work with, but the problems persisted.
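A quick way to check whether the NVIDIA runtime actually reached Docker is to inspect the daemon’s registered runtimes. The snippet below is a minimal diagnostic sketch; the CUDA image tag is only illustrative and should match the driver installed on the node.

```bash
# Show which runtimes Docker has registered and which one is the default.
# If "nvidia" is missing from the list, GPU containers cannot start.
docker info | grep -i runtime

# Optional smoke test: run nvidia-smi inside a CUDA container using the
# NVIDIA runtime explicitly (image tag is illustrative).
docker run --rm --runtime=nvidia nvidia/cuda:11.0-base nvidia-smi
```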
A viable solution
We put our heads together and decided to use the nvidia-container-runtime package from NVIDIA’s GitHub repository instead of nvidia-docker2. We set the NVIDIA runtime as the default in Docker, which let us create a cluster on the newest version of Kubernetes. We then downloaded the Device Plugin from NVIDIA’s GitHub repository and applied it to the cluster, and added extra packages like Helm, Tiller and Prometheus ourselves. A huge success in our couple’s therapy: this time NVIDIA GPUs worked correctly with Kubernetes.
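Below is a minimal sketch of that setup on an Ubuntu node that already has the NVIDIA drivers installed. The package name and runtime path follow NVIDIA’s documentation, while the device plugin manifest URL and version are illustrative; check the NVIDIA/k8s-device-plugin README for the release matching your cluster.

```bash
# 1. Install the container runtime instead of nvidia-docker2
#    (assumes NVIDIA's apt repository is already configured on the node).
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime

# 2. Make the NVIDIA runtime Docker's default runtime.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker

# 3. Deploy the NVIDIA device plugin as a DaemonSet so the scheduler
#    sees GPUs as an allocatable resource (URL and version are illustrative).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml

# 4. Verify that the nodes now advertise GPUs.
kubectl describe nodes | grep nvidia.com/gpu
```

Making the NVIDIA runtime the Docker default matters here: the device plugin and the GPU pods that Kubernetes launches do not pass any runtime flags themselves, so they rely on whatever runtime the daemon uses by default.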
What are the major benefits? First, we can harness the computing power of the GPUs together with the flexibility of Kubernetes. Second, it’s more cost-effective: we don’t pay for expensive cloud-based services. Finally, whenever it is needed, developers can simply request a driver change, something that is impossible with cloud-based solutions.
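From a developer’s perspective, using a GPU then comes down to requesting the nvidia.com/gpu resource that the device plugin exposes. A minimal sketch, with a hypothetical pod name and an illustrative CUDA image tag:

```bash
# Create a throwaway pod that asks the scheduler for one GPU.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # placed only on a node with a free GPU
EOF

# Once the pod has completed, its logs should show the nvidia-smi table.
kubectl logs gpu-test
```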
Lessons learned
The key takeaway is that, with a bit of research, you can get a much better understanding of how things work under the hood. That way you can build your own solution which, from the very beginning to the very end, works as you designed it. A prepackaged solution makes it difficult to troubleshoot issues as they arise and to manage the whole installation. Better still, with your own setup you can upgrade every component yourself and you are not tied to the versions chosen by the author of the prepackaged solution.
Yet you have to bear in mind that this setup will not work correctly with virtual machines running on Kubernetes nodes, as is the case with Virtlet or KubeVirt. At the moment, there is no universal solution for sharing GPUs between containers and VMs running as pods.
If you are experiencing problems with GPUs and Kubernetes, just give us a shout: our technical team is here to make this bumpy relationship work and to help your apps live happily ever after.