Scaling Kubernetes to 2,500 nodes

Our Dota⁠(opens in a new window) project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often have pods sitting in Pending⁠(opens in a new window) for a long time. The game image is around 17GB, and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while — but this was true for other containers as well. Digging in, we found that kubelet⁠(opens in a new window) has a --serialize-image-pulls flag which defaults to true, meaning the Dota image pull blocked all other images. Changing to false required switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.

Even after optimizing the pull speed, we saw pods failing to start with a cryptic error message: rpc error: code = 2 desc = net/http: request canceled. The kubelet and Docker logs also contained messages indicating that the image pull had been canceled, due to a lack of progress. We tracked the root to large images taking too long to pull/extract, or times when we had a long backlog of images to pull. To address this, we set kubelet’s --image-pull-progress-deadline flag to 30 minutes, and set the Docker daemon’s max-concurrent-downloads option to 10. (The second option didn’t speed up extraction of large images, but allowed the queue of images to pull in parallel.)

Our last Docker pull issue was due to the Google Container Registry. By default, kubelet pulls a special image from gcr.io (controlled by the --pod-infra-container-image flag) which is used when starting any new container. If that pull fails for any reason, like exceeding your quota⁠(opens in a new window), that node won’t be able to launch any containers. Because our nodes go through a NAT to reach gcr.io rather than having their own public IP, it’s quite likely that we’ll hit this per-IP quota limit. To fix this, we simply preloaded that Docker image in the machine image for our Kubernetes workers by using docker image save -o /opt/preloaded_docker_images.tar and docker image load -i /opt/preloaded_docker_images.tar. To improve performance, we do the same for a whitelist of common OpenAI-internal images like the Dota image.