Known issues #
This is a collection of known issues with available workarounds (if any).
Installation failures with old pip #
Container images can come with very old versions of pip, which are likely to misinterpret metadata provided by PyPI. This can manifest itself as pip resorting to building packages from source instead of downloading a binary distribution, with errors like these:
```
ERROR: Could not find a version that satisfies the requirement cython>=3.0.0
ERROR: No matching distribution found for cython>=3.0.0
[...] inconsistent Name: expected 'cython', but metadata has 'Cython'
```
We recommend upgrading pip before installing Python packages with it:

```
python -m pip install --upgrade pip
```
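If you are unsure whether an image is affected, you can check which pip version it ships with, e.g. via `python -m pip --version`, or from Python itself (a minimal sketch using only the standard library):

```
import importlib.metadata

# Prints the version of the pip distribution installed in the current environment.
print(importlib.metadata.version('pip'))
```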
GPUs inaccessible with --gpus-per-task #
When starting jobs with a combination of `--ntasks` and `--gpus-per-task`, some GPUs may be inaccessible. This manifests in errors such as `CUDA error: invalid device ordinal` or similar. We have reported this issue to SchedMD. Here are some possible workarounds:
- Before accessing GPUs, set `CUDA_VISIBLE_DEVICES` to the correct value, e.g.:

  ```
  import os

  num_local_gpus = int(os.getenv('SLURM_GPUS_ON_NODE', 1))
  os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(map(str, range(num_local_gpus)))
  ```

- If you do not need multi-node operation, use the `--gpus` option instead of `--gpus-per-task`. E.g., to request 4 GPUs, use `--nodes=1 --ntasks=4 --gpus=4`. Setting the node count to one is necessary to prevent Slurm from allocating different numbers of tasks and GPUs on a node.
- Use a launcher program instead of Slurm tasks, e.g., torchrun. Specify the number of GPUs with `--gpus` and remove `--ntasks` entirely.
GPU binding and peer-to-peer communication #
For effective GPU peer-to-peer communication (e.g., using the NCCL library, see here), all involved processes need to see all GPUs. When using Slurm tasks to manage GPU processes (e.g., `srun --ntasks=4 --gpus-per-task=1`), the default behavior is to bind GPUs to their respective task, making them inaccessible to other tasks. There is currently no Slurm config setting to change the default behavior. Add `--gpu-bind=none` to let all processes see all GPUs.
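As a quick sanity check, every task should then report all GPUs on the node. A minimal sketch, assuming PyTorch is installed and the script is launched with one task per GPU (e.g., `srun --ntasks=4 --gpus-per-task=1 --gpu-bind=none python check_gpus.py`):

```
import os
import torch

# With --gpu-bind=none, every task should see all GPUs on the node,
# not just the one assigned to its task.
rank = int(os.getenv('SLURM_PROCID', '0'))
print(f'task {rank} sees {torch.cuda.device_count()} GPU(s)')
```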
Container root mapping #
By default, your user will be mapped to root inside containers. Of course, this does not mean you have root access on the machine, but it allows you to manipulate the software within the container as needed.
However, some software (e.g. CARLA) refuses to run as the root user.
Add `--no-container-remap-root` to your srun command to disable root mapping.
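If in doubt, you can check from inside the container whether your process was remapped (a minimal sketch; an effective UID of 0 means it runs as root):

```
import os

# UID 0 indicates the process is running as root inside the container.
print('running as root' if os.geteuid() == 0 else f'running as uid {os.getuid()}')
```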
Multithreading contention #
You may experience that your job runs slower than expected and the CPU graph in the job details dashboard shows a large portion of red (system) CPU usage. This is the result of many threads competing for the limited number of CPUs reserved for your job, which leads to a large amount of system overhead.
Usually, this is caused by software parallelizing tasks to use all CPUs installed in the system (sometimes hundreds), e.g., linear algebra libraries may use all CPUs to perform even small matrix multiplications. You can prevent this behavior by setting the following environment variables:
```
MKL_NUM_THREADS=1
NUMEXPR_NUM_THREADS=1
OMP_NUM_THREADS=1
USE_OPENMP=1  # prevents OpenBLAS from overriding OMP_NUM_THREADS
```
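If you cannot set these variables in your job script, you can also set them at the top of your Python program. They are typically read when the numerical libraries are first loaded, so this must happen before those libraries are imported. A minimal sketch:

```
import os

# Must run before numpy/torch/etc. are imported.
for var in ('MKL_NUM_THREADS', 'NUMEXPR_NUM_THREADS', 'OMP_NUM_THREADS'):
    os.environ.setdefault(var, '1')

import numpy as np  # BLAS calls now use a single thread per process
```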
Additionally, set the number of workers (e.g., the `num_workers` parameter of the PyTorch dataloader) to an appropriate value. You can check the `SLURM_CPUS_ON_NODE` environment variable (the number of CPUs available on the current node), or `SLURM_CPUS_PER_TASK` (the number of CPUs requested per task) if the `--cpus-per-task` option is specified. Check the Slurm documentation for more details.
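For example, a sketch of sizing the PyTorch DataLoader worker pool from the Slurm allocation rather than the machine's total CPU count (the dataset below is just a placeholder):

```
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Prefer SLURM_CPUS_PER_TASK if set, otherwise fall back to SLURM_CPUS_ON_NODE.
cpus = int(os.environ.get('SLURM_CPUS_PER_TASK',
                          os.environ.get('SLURM_CPUS_ON_NODE', '1')))

dataset = TensorDataset(torch.zeros(128, 3))  # placeholder dataset
loader = DataLoader(dataset, batch_size=16, num_workers=max(cpus - 1, 0))
```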
multiprocessing.cpu_count reports too many CPUs #
Slurm assigns a specific set of CPUs to each job, depending on how many CPUs were requested. The remainder of the CPUs on the node cannot be used. However, the Python function `multiprocessing.cpu_count` always reports the number of installed CPUs (and will continue to do so for the foreseeable future, see here), irrespective of how many are actually accessible. Use `num_cpu = int(os.environ['SLURM_CPUS_ON_NODE'])` instead to find out how many CPUs your code can use.
It is very likely that other tools are similarly affected by this issue. See also multithreading contention.
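A minimal sketch contrasting the two values from inside a job:

```
import multiprocessing
import os

print('installed CPUs:', multiprocessing.cpu_count())          # the whole machine
print('usable CPUs:', int(os.environ['SLURM_CPUS_ON_NODE']))   # this job's allocation
```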
Jobs appear running after processes exit #
We sometimes observe that jobs appear to be running after all managed processes have exited. Slurm still considers the associated resources to be in use. Unfortunately, we do not know why this happens. It is rare, so it may be caused by something specific to the affected jobs, or by an issue with Slurm in general. Peggy can detect this issue and sends a message about the job not ending. You can simply `scancel` affected jobs to fix it.
OMP shared memory errors #
You may encounter OpenMP errors like these related to shared memory (`/dev/shm`), e.g. when using the PyTorch dataloader:

```
OMP: Error #178: Function Can't open SHM failed:
OMP: System error #0: Success
...
RuntimeError: DataLoader worker (pid 12345) is killed by signal: Aborted.
```
The cause of these errors is not currently known. There are, however, a couple of workarounds that have been shown to help in some cases:
- Your process may be running out of open file descriptors. Increase the limit before running your training script with `ulimit -n 4000`.
- The job may be running out of memory and crashing when it tries to allocate more shared memory. Try requesting more memory for your job.
- Reduce the number of workers and set the environment variables mentioned in the multithreading contention section to reduce the amount of shared memory required by your job.
You can try higher values for `ulimit` if the issue persists, but beware that there is a hard limit on the number of open files the kernel can handle. Exceeding it may crash important system services.
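The file descriptor limit can also be raised from inside Python before worker processes are started. A sketch using the standard resource module; note that the soft limit can only be raised up to the hard limit configured by the administrator:

```
import resource

# Equivalent to `ulimit -n 4000`: raise the soft limit on open file
# descriptors, capped at the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4000, hard), hard))
```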