Software stack #
Cluster nodes do not have any software pre-installed beyond what is
necessary for cluster operation.
Instead, we use containerized environments to manage the software stack
required to run experiments,
which provides better stability and reproducibility than
bare-metal installations.
The most widely known tool for this is Docker; however,
several compatibility and security issues make
Docker unsuitable as a container runtime for our cluster.
We selected
Enroot as the
container runtime on the Slurm worker nodes.
It integrates seamlessly with Slurm via the
Pyxis
plugin, which adds container options to srun.
These are the most important options:
--container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
                      [pyxis] the image to use for the container
                      filesystem. Can be either a docker image given as
                      an enroot URI, or a path to a squashfs file on the
                      remote host filesystem.
--container-mounts=SRC:DST[:FLAGS][,SRC:DST...]
                      [pyxis] bind mount[s] inside the container. Mount
                      flags are separated with "+", e.g. "ro+rprivate"
--container-workdir=PATH
                      [pyxis] working directory inside the container
The following command creates a container from the image
/enroot/nvcr.io_nvidia_pytorch_23.12-py3.sqsh,
uses the current directory as the working directory,
and makes /netscratch, /ds, and the current directory available
inside the container:
$ srun \
--container-image=/enroot/nvcr.io_nvidia_pytorch_23.12-py3.sqsh \
--container-workdir="`pwd`" \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`" \
[your command]
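For a quick interactive test, [your command] can be replaced with a shell; adding --pty (a standard srun option, not part of Pyxis) attaches a proper terminal, for example:

$ srun --container-image=/enroot/nvcr.io_nvidia_pytorch_23.12-py3.sqsh --pty bash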
Due to a quirk in how Pyxis uses storage, pointing --container-image
to a docker image works for small images (up to a few GiB), but will fail for larger ones with the error "no space left on device".
Instead, import the image first with srun enroot import
as outlined in the custom software section.
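For example, an import of the PyTorch NGC image used above might look like this (the output path under /netscratch/$USER is only an illustration; pick any location you have write access to):

$ srun enroot import -o /netscratch/$USER/nvidia_pytorch_23.12-py3.sqsh docker://nvcr.io#nvidia/pytorch:23.12-py3

The resulting .sqsh file can then be passed to --container-image just like the images under /enroot.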
Available environments #
Our Enroot images are mostly converted Nvidia NGC Docker images with the same name. They contain the same software and should thus behave exactly like the equivalent Docker container.
Run ls -1 /enroot/ to see all available images.
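The listing contains one .sqsh file per image, each with a matching .packages file; for example (names vary as environments are updated):

$ ls -1 /enroot/
dlcc_pytorch_20.10.sqsh
dlcc_pytorch_20.10.sqsh.packages
nvcr.io_nvidia_pytorch_23.12-py3.sqsh
nvcr.io_nvidia_pytorch_23.12-py3.sqsh.packages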
The .packages files list the software installed in each image.
Grep them to find images that contain the software you require, e.g.:
$ grep torch==1.7 /enroot/*.packages
/enroot/dlcc_pytorch_20.10.sqsh.packages:torch==1.7.0a0+7036e91
/enroot/huggingface+transformers-pytorch-gpu+latest.sqsh.packages:torch==1.7.1
/enroot/nvcr.io_nvidia_pytorch_20.10-py3.sqsh.packages:torch==1.7.0a0+7036e91
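Once you have found a suitable image, pass it to --container-image as shown above. For instance, a quick check that the huggingface image really ships torch 1.7.1:

$ srun --container-image=/enroot/huggingface+transformers-pytorch-gpu+latest.sqsh \
    python -c 'import torch; print(torch.__version__)'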
Note: Some GPUs require a minimum image version: A100 GPUs require 20.06 or newer, RTX 3090 and RTX A6000 require 20.10 or newer, and A10 GPUs require 21.05 or newer. See also Use cases & connectivity.
Startup times #
Compared to Docker, starting Enroot containers currently takes a bit longer,
mainly because the .sqsh
file must first be transferred from
/enroot
over the network to the node and the container file system created.
After startup, operations should be as fast as with Docker, if not faster.
We’re looking into ways to optimize this, but expect 20–30 seconds for
containers to spin up (possibly longer when traffic is high).