Pegasus Docs

Custom software #

You have several options to install custom software if none of the existing images in /enroot/ suit your needs.

Install scripts #

If all you require is a small system/Python package (or similar) that only takes a couple of seconds to download and install, you can create a simple shell script, e.g. install.sh, that installs it before your actual experiment runs.

Note: Make sure to run chmod a+x install.sh to make your script executable.
Note that container images can ship with very old versions of pip. We recommend upgrading it before installing Python packages; the script templates below do so.

This is the recommended way to use custom software. It is space efficient, flexible, and easy to use. However, consider one of the other options if installing this way takes too much time.

Task prolog #

The most flexible way to run install.sh is as a task prolog.

Simply replace the template commands to install your required packages (with exact versions for reproducibility). For Python users, a requirements.txt is the preferred way of installing several packages with pip.
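For illustration, such a requirements.txt with pinned versions might look like this (the package names and version numbers here are purely hypothetical examples):

```
torch==2.1.2
numpy==1.26.3
tqdm==4.66.1
```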

Simplistic template for single-task jobs #

If you have a single-task, single-node job, you can probably just use this simplistic template (remove lines you don’t need):

#!/bin/bash
apt update ; apt install -y [...] ; apt clean
conda install -y [...]
python -m pip install --upgrade pip
pip install -r requirements.txt

Better generic template #

However, as soon as you scale up and run multi-task jobs, you probably want to wrap that “install block” so it only runs once per node. So in general, you might want to use this template:

#!/bin/bash

# make sure only first task per node installs stuff, others wait
DONEFILE="/tmp/install_done_${SLURM_JOBID}"
if [[ $SLURM_LOCALID == 0 ]]; then
  
  # put your install commands here (remove lines you don't need):
  apt update; apt install -y [...] ; apt clean
  conda install -y [...]
  python -m pip install --upgrade pip
  pip install -r requirements.txt
  
  # Tell other tasks we are done installing
  touch "${DONEFILE}"
else
  # Wait until packages are installed
  while [[ ! -f "${DONEFILE}" ]]; do sleep 1; done
fi
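If you want to see how the done-file handshake behaves, you can try it locally without Slurm. The following sketch simulates two tasks by passing task IDs in by hand instead of reading $SLURM_LOCALID (the function name and file path are made up for the demo):

```shell
#!/bin/bash
# Local sketch of the done-file handshake used in the template above.
DONEFILE="/tmp/install_done_demo_$$"
rm -f "${DONEFILE}"

run_task() {
  local localid=$1
  if [[ $localid == 0 ]]; then
    sleep 1                      # stands in for the install commands
    echo "task 0: install finished"
    touch "${DONEFILE}"
  else
    # other tasks poll until the done-file appears
    while [[ ! -f "${DONEFILE}" ]]; do sleep 0.2; done
    echo "task ${localid}: proceeding"
  fi
}

run_task 1 &   # a waiting task starts first
run_task 0     # the "installer" task
wait
rm -f "${DONEFILE}"
```

Task 1 blocks until task 0 has touched the done-file, exactly as the non-zero $SLURM_LOCALID tasks do in the template.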

Run it as task-prolog arg #

Now simply tell srun to run your script as the task prolog. As always, make sure that the script and all other files, like requirements.txt, are in your workdir so Slurm can access them.

srun \
  --container-mounts="`pwd`":"`pwd`" \
  --container-image=/enroot/[image].sqsh \
  --container-workdir="`pwd`" \
  --task-prolog="`pwd`/install.sh" \
  python train.py

Wrapper script #

You can also use this install.sh template to wrap your command, i.e., first install requirements and then run your command.

#!/bin/bash

# make sure only first task per node installs stuff, others wait
DONEFILE="/tmp/install_done_${SLURM_JOBID}"
if [[ $SLURM_LOCALID == 0 ]]; then
  
  # put your install commands here:
  apt update
  apt install -y [...]
  apt clean
  conda install -y [...]
  python -m pip install --upgrade pip
  pip install -r requirements.txt
  
  # Tell other tasks we are done installing
  touch "${DONEFILE}"
else
  # Wait until packages are installed
  while [[ ! -f "${DONEFILE}" ]]; do sleep 1; done
fi

# This runs your wrapped command
"$@"

"$@" means “all remaining parameters”, so you can now do something like this:

srun \
  --container-mounts="`pwd`":"`pwd`" \
  --container-image=/enroot/[image].sqsh \
  --container-workdir="`pwd`" \
  install.sh python train.py

Modify an existing image #

Similar to how Docker images are often based on another image, Enroot allows you to start from an existing image, modify its contents, and save the result as a new image with the --container-save option:

srun \
  --time=04:00:00 \
  --immediate=3600 \
  --container-image=/enroot/[image].sqsh \
  --container-save=/netscratch/$USER/[modified image].sqsh \
  --pty /bin/bash

When modifying an existing image, make sure that the requested memory (--mem) is more than twice the size of the container you want to modify, i.e., 2x the size of the original container plus the added size from your modifications.

Keep in mind that larger images take longer to start and occupy more memory, so start with a lightweight image and try to remove unused files. E.g., avoid installing documentation packages and run apt-get clean (or similar) after installing packages. Also see Available environments for hints on how to find a base image that contains specific versions of required software like PyTorch or TensorFlow.
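As a sketch, an install block run inside the interactive session might look like the following (the package names are placeholders); --no-install-recommends and the cleanup steps help keep the image small:

```shell
# skip recommended/suggested packages to save space
apt-get update
apt-get install -y --no-install-recommends [packages]
# remove cached package archives and package lists afterwards
apt-get clean
rm -rf /var/lib/apt/lists/*
```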

For reproducibility, we recommend these steps:

  1. Specify exact versions of packages when installing.
  2. Once you confirm the image is working, convert your history to an install script:
    srun \
      --container-image=/netscratch/[your username]/[modified image].sqsh \
      bash -c 'fc -l -n' | sed 's/^\s*//' > install.sh
    
    You can remove unnecessary lines from install.sh and use it to recreate your modified image.

Convert an image from Docker Hub #

In addition to the provided Enroot images (see /enroot), you can also import and run images from Docker Hub. See here for instructions on how to push your own Docker images to Docker Hub.

For example, use the following command to import the nvidia/cuda:10.2-runtime-centos7 image and store it at /netscratch/$USER/nvidia_cuda_10.2-runtime-centos7.sqsh:

srun \
  enroot import \
  -o /netscratch/$USER/nvidia_cuda_10.2-runtime-centos7.sqsh \
  docker://nvidia/cuda:10.2-runtime-centos7

If the registry requires credentials to access the Docker images, please see here. Also, see this in case you have issues pulling Docker images with credentials.

Once finished, you can use the created .sqsh file as --container-image in your srun command.

Build custom Docker / OCI images #

We recommend the provided podman+enroot.sqsh image to build images with Podman and import them with Enroot for use on the cluster.

To begin, start a basic interactive session on a compute node. You may also want to request additional CPUs to speed up image builds.

srun \
  --mem=[enough] \
  --time=04:00:00 \
  --immediate=3600 \
  --container-image=/enroot/podman+enroot.sqsh \
  --container-mounts=/dev/fuse:/dev/fuse,/netscratch/$USER:/netscratch/$USER,"`pwd`":"`pwd`" \
  --container-workdir="`pwd`" \
  --pty bash

The /dev/fuse mount for the container is strictly required for Podman to work. Make sure you request enough memory with your srun command to build large images: you will need enough to store the base images and all changes made on top.

Inside the container, use podman to build the image as usual, then enroot import to create the SquashFS file.

podman build -t temp .
enroot import -o /netscratch/$USER/[name].sqsh podman://temp

The podman command is also aliased to docker for improved compatibility with existing build scripts. You can use either interchangeably.
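For reference, a minimal setup for such a build could look like the following Dockerfile (the base image, file names, and commands are illustrative examples, not a prescribed layout):

```dockerfile
# Hypothetical minimal Dockerfile for a Python training job
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN python -m pip install --upgrade pip && \
    pip install -r requirements.txt

COPY train.py .
CMD ["python", "train.py"]
```

With this Dockerfile in the current directory, the podman build and enroot import commands above produce a ready-to-use .sqsh file.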

Using APT in rootless containers #

APT (the Debian package manager) is incompatible with rootless containers. This will manifest in errors like Operation not permitted, Method gave invalid 400 URI, Could not switch group, etc. As a workaround, you can disable sandboxing for APT. Add the following line to your Dockerfile before any apt* commands:

RUN echo 'APT::Sandbox::User "root";' > /etc/apt/apt.conf.d/sandbox-disable

Move a local image to the cluster #

To move an existing image from your machine to the cluster, first run docker save to create an archive file.

docker save -o image.tar [name of the image]

If you manually modified a container, you must first run docker commit. This creates a new image that contains your changes; docker export on a container will not work.

Copy the archive to the cluster using scp, rsync, an SFTP client of your choice, etc. Once copying finishes, start an interactive job with the podman+enroot.sqsh image as outlined above. There, run podman load to import your archive, then podman images to determine the name Podman assigned your imported image, and finally enroot import to save it in Enroot format.

podman load -i image.tar
podman images
enroot import -o /netscratch/$USER/[name].sqsh podman://[name of the image]

A note on native Podman #

Podman is available on the compute nodes, but the installed version is quite outdated and buggy. You can try to use it to run containers, but we currently do not recommend it. In particular, using GPUs with Podman has not been tested, and we do not intend to support this use case. Enroot is still the way to go.

Further, the Podman registry on the compute nodes and all built images are temporary: once your job ends, they are gone. Remember to enroot import any images you want to preserve.

Virtualenv / conda env #

Note: This section is a little sparse on details as we’re less familiar with this procedure. We recommend you try one of the other methods first.

If you need a completely custom environment, e.g., because code provided for some paper uses conda to specify its requirements, you can also do this. Start with a small base image like nvidia/cuda:11.3.0-cudnn8-runtime-centos8 (if a CUDA runtime is required; use devel if code needs to be compiled) and create your environment in /netscratch/$USER. Do not use $HOME, as outlined in the storage section, since these environments can easily occupy several GB once deep learning frameworks and their dependencies are installed.
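As a rough sketch, creating such an environment on /netscratch could look like this (the env path and the use of a requirements.txt are illustrative; substitute your own project name and conda commands as needed):

```shell
# create the environment on /netscratch, not in $HOME
python3 -m venv /netscratch/$USER/envs/[project]
source /netscratch/$USER/envs/[project]/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
```

Activate the environment the same way in your job scripts before running your experiment.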