Known issues #
This is a collection of known issues with available workarounds (if any).
Installation failures with old pip #
Container images can come with very old versions of pip, which are likely to misinterpret metadata provided by PyPI. This can manifest itself as pip resorting to building packages from source instead of downloading a binary distribution, with errors like these:
```
ERROR: Could not find a version that satisfies the requirement cython>=3.0.0
ERROR: No matching distribution found for cython>=3.0.0
[...] inconsistent Name: expected 'cython', but metadata has 'Cython'
```
We recommend upgrading pip before installing Python packages with it:

```
python -m pip install --upgrade pip
```
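If you are unsure whether an image is affected, you can check which pip version it ships with, e.g. via `python -m pip --version`, or from Python itself (a minimal sketch using only the standard library):

```
import importlib.metadata

# Prints the version of the pip distribution installed in the current environment.
print(importlib.metadata.version('pip'))
```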
GPUs inaccessible with --gpus-per-task #
When starting jobs with a combination of `--ntasks` and `--gpus-per-task`, some GPUs may be inaccessible. This manifests in errors such as `CUDA error: invalid device ordinal` or similar. We have reported this issue to SchedMD. Here are some possible workarounds:
- Before accessing GPUs, set `CUDA_VISIBLE_DEVICES` to the correct value, e.g.:

  ```
  import os

  num_local_gpus = int(os.getenv('SLURM_GPUS_ON_NODE', 1))
  os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(map(str, range(num_local_gpus)))
  ```

- If you do not need multi-node operation, use the `--gpus` option instead of `--gpus-per-task`. E.g., to request 4 GPUs, use `--nodes=1 --ntasks=4 --gpus=4`. Setting the node count to one is necessary to prevent Slurm from allocating different numbers of tasks and GPUs on a node.
- Use a launcher program instead of Slurm tasks, e.g., torchrun. Specify the number of GPUs with `--gpus` and remove `--ntasks` entirely.
GPU binding and peer-to-peer communication #
For effective GPU peer-to-peer communication (e.g., using the NCCL library, see here), all involved processes need to see all GPUs. When using Slurm tasks to manage GPU processes (e.g., `srun --ntasks=4 --gpus-per-task=1`), the default behavior is to bind GPUs to their respective task, making them inaccessible to other tasks. There is currently no Slurm config setting to change the default behavior. Add `--gpu-bind=none` to let all processes see all GPUs.
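As a quick sanity check, every task should then report all GPUs on the node. A minimal sketch, assuming PyTorch is installed and the script is launched with one task per GPU (e.g., `srun --ntasks=4 --gpus-per-task=1 --gpu-bind=none python check_gpus.py`):

```
import os
import torch

# With --gpu-bind=none, every task should see all GPUs on the node,
# not just the one assigned to its task.
rank = int(os.getenv('SLURM_PROCID', '0'))
print(f'task {rank} sees {torch.cuda.device_count()} GPU(s)')
```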
Container root mapping #
By default, your user will be mapped to root inside containers. Of course, this does not mean you have root access on the machine, but it allows you to manipulate the software within the container as needed.
However, some software (e.g. CARLA) refuses to run as the root user.
Add `--no-container-remap-root` to your srun command to disable root mapping.
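If in doubt, you can check from inside the container whether your process was remapped (a minimal sketch; an effective UID of 0 means it runs as root):

```
import os

# UID 0 indicates the process is running as root inside the container.
print('running as root' if os.geteuid() == 0 else f'running as uid {os.getuid()}')
```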
Multithreading contention #
You may experience that your job runs slower than expected and the CPU graph in the job details dashboard shows a large portion of red (system) CPU usage. This is the result of many threads competing for the limited number of CPUs reserved for your job, which leads to a large amount of system overhead.
Usually, this is caused by software parallelizing tasks to use all CPUs installed in the system (sometimes hundreds), e.g., linear algebra libraries may use all CPUs to perform even small matrix multiplications. You can prevent this behavior by setting the following environment variables:
```
MKL_NUM_THREADS=1
NUMEXPR_NUM_THREADS=1
OMP_NUM_THREADS=1
USE_OPENMP=1  # prevents OpenBLAS from overriding OMP_NUM_THREADS
```
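If you cannot set these variables in your job script, you can also set them at the top of your Python program. They are typically read when the numerical libraries are first loaded, so this must happen before those libraries are imported. A minimal sketch:

```
import os

# Must run before numpy/torch/etc. are imported.
for var in ('MKL_NUM_THREADS', 'NUMEXPR_NUM_THREADS', 'OMP_NUM_THREADS'):
    os.environ.setdefault(var, '1')

import numpy as np  # BLAS calls now use a single thread per process
```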
Additionally, set the number of workers (e.g., the `num_workers` parameter of the PyTorch dataloader) to an appropriate value. You can check the `SLURM_CPUS_ON_NODE` environment variable (the number of CPUs available on the current node), or `SLURM_CPUS_PER_TASK` (the number of CPUs requested per task) if the `--cpus-per-task` option is specified. Check the Slurm documentation for more details.
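For example, a sketch of sizing the PyTorch DataLoader worker pool from the Slurm allocation rather than the machine's total CPU count (the dataset below is just a placeholder):

```
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Prefer SLURM_CPUS_PER_TASK if set, otherwise fall back to SLURM_CPUS_ON_NODE.
cpus = int(os.environ.get('SLURM_CPUS_PER_TASK',
                          os.environ.get('SLURM_CPUS_ON_NODE', '1')))

dataset = TensorDataset(torch.zeros(128, 3))  # placeholder dataset
loader = DataLoader(dataset, batch_size=16, num_workers=max(cpus - 1, 0))
```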
multiprocessing.cpu_count reports too many CPUs #
Slurm assigns a specific set of CPUs to each job, depending on how many CPUs were requested. The remainder of the CPUs on the node cannot be used. However, the Python function `multiprocessing.cpu_count` always reports the number of installed CPUs (and will continue to do so for the foreseeable future, see here), irrespective of how many are actually accessible. Use `num_cpu = int(os.environ['SLURM_CPUS_ON_NODE'])` instead to find out how many CPUs your code can use.
It is very likely that other tools are similarly affected by this issue. See also multithreading contention.
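A minimal sketch contrasting the two values from inside a job:

```
import multiprocessing
import os

print('installed CPUs:', multiprocessing.cpu_count())          # the whole machine
print('usable CPUs:', int(os.environ['SLURM_CPUS_ON_NODE']))   # this job's allocation
```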
Jobs appear running after processes exit #
We sometimes observe that jobs appear to be running after all managed processes have exited. Slurm still considers the associated resources to be in use. Unfortunately, we do not know why this happens. It is rare, so it may be caused by something specific to the affected jobs, or by an issue with Slurm in general. Peggy can detect this issue and sends a message about the job not ending. You can simply `scancel` affected jobs to fix it.
OMP shared memory errors #
You may encounter OpenMP errors like these related to shared memory (`/dev/shm`), e.g. when using the PyTorch dataloader:

```
OMP: Error #178: Function Can't open SHM failed:
OMP: System error #0: Success
...
RuntimeError: DataLoader worker (pid 12345) is killed by signal: Aborted.
```
The cause of these errors is not currently known. There are, however, a couple of workarounds that have been shown to help in some cases:
- Your process may be running out of open file descriptors. Increase the limit before running your training script with `ulimit -n 4000`.
- The job may be running out of memory and crashing when it tries to allocate more shared memory. Try requesting more memory for your job.
- Reduce the number of workers and set the environment variables mentioned in the multithreading contention section to reduce the amount of shared memory required by your job.
You can try higher values for `ulimit` if the issue persists, but beware that there is a hard limit on the number of open files the kernel can handle. Exceeding it may crash important system services.
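The file descriptor limit can also be raised from inside Python before worker processes are started. A sketch using the standard resource module; note that the soft limit can only be raised up to the hard limit configured by the administrator:

```
import resource

# Equivalent to `ulimit -n 4000`: raise the soft limit on open file
# descriptors, capped at the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4000, hard), hard))
```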