Custom software #
You have several options to install custom software if none of the existing images in /enroot/ suit your needs.
Install scripts #
If all you require is a small system/Python package (or similar) that only takes a couple of seconds to download and install, you can create a simple shell script, e.g. install.sh, that installs it before your actual experiment runs.
Note: Make sure to run chmod a+x install.sh to make your script executable.
Note that container images can come with very old versions of pip. We recommend upgrading it before installing Python packages. The script templates below do so.
This is the recommended way to install custom software. It is space-efficient, flexible, and easy to use. However, consider one of the other options if installing this way takes too much time.
Task prolog #
The most flexible way to run install.sh is as a task prolog. Simply replace the template commands with those that install your required packages (with exact versions for reproducibility). For Python users, a requirements.txt is the preferred way of installing several packages with pip.
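For example, a requirements.txt pinning exact versions might look like this (package names and versions are purely illustrative):
numpy==1.24.4
tqdm==4.66.1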
Simplistic template for single-task jobs #
If you have a single-task, single-node job, you can probably just use this simplistic template (remove lines you don’t need):
#!/bin/bash
apt update; apt install -y [...]; apt clean
conda install -y [...]
python -m pip install --upgrade pip
pip install -r requirements.txt
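As a concrete illustration, a filled-in script that installs a system library and the pinned Python requirements might look like this (the package name is hypothetical; adapt it to your setup):
#!/bin/bash
apt update; apt install -y libsndfile1; apt clean
python -m pip install --upgrade pip
pip install -r requirements.txt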
Better generic template #
However, as soon as you scale your jobs up and run multi-task jobs, you will probably want to wrap that “install block” so it only runs once per node. So in general, you might want to use this template:
#!/bin/bash
# make sure only the first task per node installs stuff, others wait
DONEFILE="/tmp/install_done_${SLURM_JOBID}"
if [[ $SLURM_LOCALID == 0 ]]; then
  # put your install commands here (remove lines you don't need):
  apt update; apt install -y [...]; apt clean
  conda install -y [...]
  python -m pip install --upgrade pip
  pip install -r requirements.txt
  # Tell other tasks we are done installing
  touch "${DONEFILE}"
else
  # Wait until packages are installed
  while [[ ! -f "${DONEFILE}" ]]; do sleep 1; done
fi
Run it as task-prolog arg #
Now simply tell srun to run your script as the task prolog. As always, make sure that the script and all other files like requirements.txt are in your workdir so Slurm can access them.
srun \
--container-mounts="`pwd`":"`pwd`" \
--container-image=/enroot/[image].sqsh \
--container-workdir="`pwd`" \
--task-prolog="`pwd`/install.sh" \
python train.py
Wrapper script #
You can also use this install.sh template to wrap your command, i.e., first install requirements and then run your command.
#!/bin/bash
# make sure only the first task per node installs stuff, others wait
DONEFILE="/tmp/install_done_${SLURM_JOBID}"
if [[ $SLURM_LOCALID == 0 ]]; then
  # put your install commands here:
  apt update
  apt install -y [...]
  apt clean
  conda install -y [...]
  python -m pip install --upgrade pip
  pip install -r requirements.txt
  # Tell other tasks we are done installing
  touch "${DONEFILE}"
else
  # Wait until packages are installed
  while [[ ! -f "${DONEFILE}" ]]; do sleep 1; done
fi
# This runs your wrapped command
"$@"
"$@"
means “all remaining parameters”, so you can now do
something like this:
srun \
--container-mounts="`pwd`":"`pwd`" \
--container-image=/enroot/[image].sqsh \
--container-workdir="`pwd`" \
install.sh python train.py
Modify an existing image #
Similar to how Docker images are often based on another image, Enroot allows you to start an existing image, modify its contents, and save the result as a new image with the --container-save option:
srun \
--time=04:00:00 \
--immediate=3600 \
--container-image=/enroot/[image].sqsh \
--container-save=/netscratch/$USER/[modified image].sqsh \
--pty /bin/bash
When modifying an existing image, make sure that the requested memory (--mem) is more than twice the size of the container you want to modify: 2x the size of the original container plus the added size due to your modification.
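For example, to modify a 20 GB image while adding roughly 5 GB of packages, you would need at least 2 × 20 GB + 5 GB = 45 GB; rounding up for headroom (the sizes here are illustrative), the request might look like this:
srun \
--mem=48G \
--time=04:00:00 \
--immediate=3600 \
--container-image=/enroot/[image].sqsh \
--container-save=/netscratch/$USER/[modified image].sqsh \
--pty /bin/bash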
Keep in mind that larger images take longer to start and occupy more memory, so start with a lightweight image and try to remove unused files. E.g., avoid installing documentation packages and run apt-get clean (or similar) after installing packages.
Also see Available environments for hints on how to find a base image that contains a specific version of required software like PyTorch or TensorFlow.
For reproducibility, we recommend these steps:
- Specify exact versions of packages when installing.
- Once you confirm the image is working, convert your shell history to an install script:
srun \
--container-image=/netscratch/[your username]/[modified image].sqsh \
bash -c 'fc -l -n' | sed 's/^\s*//' > install.sh
You can remove unnecessary lines from install.sh and use it to recreate your modified image.
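One way to recreate the image from the cleaned-up script is a sketch combining the --container-save option and mounts shown above (base image and output name depend on your setup):
srun \
--container-mounts="`pwd`":"`pwd`" \
--container-workdir="`pwd`" \
--container-image=/enroot/[image].sqsh \
--container-save=/netscratch/$USER/[modified image].sqsh \
bash install.sh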
Convert an image from Docker Hub #
In addition to the provided Enroot images (see /enroot), you can also import and run images from Docker Hub. See here for instructions on how to push your own Docker images to Docker Hub.
For example, use the following command to import the nvidia/cuda:10.2-runtime-centos7 image and store it at /netscratch/$USER/nvidia_cuda_10.2-runtime-centos7.sqsh:
srun \
enroot import \
-o /netscratch/$USER/nvidia_cuda_10.2-runtime-centos7.sqsh \
docker://nvidia/cuda:10.2-runtime-centos7
If the registry requires credentials for accessing the Docker images, please see here. Also see this in case you have issues with pulling Docker images with credentials.
Once finished, you can use the created .sqsh file as --container-image in your srun command.
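For instance, following the import above:
srun \
--container-image=/netscratch/$USER/nvidia_cuda_10.2-runtime-centos7.sqsh \
--pty bash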
Build custom Docker / OCI images #
We recommend the provided podman+enroot.sqsh image to build images with Podman and import them with Enroot for use on the cluster.
To begin, start a basic interactive session on a compute node. You may also want to request more CPUs to speed up compilation.
srun \
--mem=[enough] \
--time=04:00:00 \
--immediate=3600 \
--container-image=/enroot/podman+enroot.sqsh \
--container-mounts=/dev/fuse:/dev/fuse,/netscratch/$USER:/netscratch/$USER,"`pwd`":"`pwd`" \
--container-workdir="`pwd`" \
--pty bash
The /dev/fuse mount for the container is strictly required for Podman to work. Make sure you request enough memory with your srun command to build large images: you will need enough memory to store base images and all changes made on top.
Inside the container, use podman to build the image as usual, then enroot import to create the SquashFS file.
podman build -t temp .
enroot import -o /netscratch/$USER/[name].sqsh podman://temp
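For reference, a minimal Dockerfile for such a build might look like this (base image and contents are purely illustrative):
FROM python:3.10-slim
# install pinned Python requirements into the image
COPY requirements.txt .
RUN python -m pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt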
The podman command is also aliased to docker for improved compatibility with existing build scripts. You can use either interchangeably.
Using APT in rootless containers #
APT (the Debian package manager) is incompatible with rootless containers. This will manifest in errors like Operation not permitted, Method gave invalid 400 URI, Could not switch group, etc.
As a workaround, you can disable sandboxing for APT. Add the following line to your Dockerfile before any apt* commands:
RUN echo 'APT::Sandbox::User "root";' > /etc/apt/apt.conf.d/sandbox-disable
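Placed in context, a Dockerfile using APT could then look like this (base image and package are hypothetical):
FROM debian:bookworm
# disable APT sandboxing so apt works in the rootless build
RUN echo 'APT::Sandbox::User "root";' > /etc/apt/apt.conf.d/sandbox-disable
RUN apt-get update && apt-get install -y git && apt-get clean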
Move a local image to the cluster #
To move an existing image from your machine to the cluster, first run docker save to create an archive file.
docker save -o image.tar [name of the image]
If you manually modified a container, you must first run docker commit. This creates a new image that contains your changes. docker export on a container will not work.
Copy the archive to the cluster using scp, rsync, an SFTP client of your choice, etc.
Once copying finishes, start an interactive job with the podman+enroot.sqsh image as outlined above. Here, run podman load to import your archive, then podman images to determine the name Podman assigned your imported image, and finally enroot import to save it in Enroot format.
podman load -i image.tar
podman images
enroot import -o /netscratch/$USER/[name].sqsh podman://[name of the image]
A note on native Podman #
Podman is available on the compute nodes, but the installed version is quite outdated and buggy. You can try to use it to run containers, but we currently do not recommend it. In particular, using GPUs in Podman has not been tested and we do not intend to support this use case. Enroot is still the way to go.
Further, the Podman registry on the compute nodes and all built images are temporary, meaning once your job ends they are gone. Remember to enroot import images to preserve them.
Virtualenv / conda env #
Note: This section is a little sparse on details as we’re less familiar with this procedure. We recommend you try one of the other methods first.
If you need a completely custom environment, e.g., because the code provided for some paper uses conda to specify its requirements, you can also do this.
Start with a small base image like nvidia/cuda:11.3.0-cudnn8-runtime-centos8 (if the CUDA runtime is required; use devel if code needs to be compiled) and create your environment in /netscratch/$USER. Do not use $HOME, as outlined in the storage section, as these environments can easily occupy several GB when deep learning frameworks and their dependencies are installed.
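A minimal sketch of such a setup, assuming conda is available in the image (the environment path, Python version, and packages are illustrative):
# create the environment on /netscratch instead of $HOME
conda create -y -p /netscratch/$USER/envs/myenv python=3.9
conda activate /netscratch/$USER/envs/myenv
pip install -r requirements.txt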