Resource allocation #
srun offers many options to specify what resources to allocate for your job.
These are the most important ones (run srun --help for the complete list):
-n, --ntasks=ntasks          number of tasks to run
--ntasks-per-node=n          number of tasks to invoke on each node
-N, --nodes=N                number of nodes on which to run (N = min[-max])
-c, --cpus-per-task=ncpus    number of cpus required per task
--cpus-per-gpu=n             number of CPUs required per allocated GPU
-G, --gpus=n                 count of GPUs required for the job
--gpus-per-node=n            number of GPUs required per allocated node
--gpus-per-task=n            number of GPUs required per spawned task
--mem=MB                     minimum amount of real memory
(--mem-per-gpu=MB)           DO NOT USE, BUGGY!
                             real memory required per allocated GPU
--mem-per-cpu=MB             maximum amount of real memory per allocated
                             cpu required by the job.
                             --mem >= --mem-per-cpu if --mem is specified.
--time=1-00:00               job runtime limit [d-hh:mm] (default 7 days for
                             non-privileged partitions, 1-3 for A100)
Use the respective partition to select the GPU type.
Warning: While you can specify many GPUs for a job, you still have to make some modifications to your code to actually use them. For more details, see the section on Multi-GPU & distributed training.
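For example, a single-task job that needs one GPU could be launched roughly like this; -p/--partition selects the partition and therefore the GPU type, and the resource figures and the script name train.py are only illustrative:
$ srun -p A100 --gpus=1 --cpus-per-gpu=8 --mem=32G --time=0-08:00 \
    python train.py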
The Slurm documentation states that a combination of --gpus and --gpus-per-task should imply how many tasks to start. This is currently not the case, so use --ntasks and --gpus-per-task with --gpu-bind=none instead. For a single-task job on a single GPU, specifying --gpus=1 is sufficient.
Sadly, --mem-per-gpu is buggy at the moment and does not work.
We recommend configuring jobs according to how many GPUs are required, using a combination of --ntasks and --gpus-per-task with --gpu-bind=none, and giving each GPU a sufficient number of CPUs for pre-processing with --cpus-per-gpu.
Optionally, set the number of nodes to 1 to ensure that all GPUs are on the same node. This is not generally required for jobs on DGX systems (V100 and A100 partitions), where InfiniBand is available, but may yield minor performance improvements for jobs on other machines. This is, of course, at the expense of longer wait times for your job to get scheduled.
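As a rough sketch of this recommendation, a job running four tasks with one GPU each might be submitted as follows; the CPU count and the script name are illustrative, and --nodes=1 can be dropped if the GPUs do not need to share a node:
$ srun -p A100 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none \
    --cpus-per-gpu=16 --time=1-00:00 python train.py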
You can verify the default settings for each partition with scontrol show partition:
$ scontrol show partition A100
PartitionName=A100
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=serv-33[28-32]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1280 TotalNodes=5 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=24576 MaxMemPerNode=UNLIMITED
By default, GPU-based jobs receive 24 GiB of memory per node. Your job may run poorly or crash with an “out of memory” message if more memory is required. You can request more memory with, e.g., --mem=42G or --mem-per-cpu=12G.
Enroot containers require RAM roughly equivalent to the size of the .sqsh file (10-15 GiB).
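For instance, a single-GPU job that loads a large Enroot image and does heavy pre-processing might request memory explicitly; the figures below are illustrative, not a recommendation:
$ srun -p batch --gpus=1 --cpus-per-gpu=8 --mem=42G --time=0-12:00 \
    python train.py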
Interactive jobs #
Please use interactive jobs or notebooks sparingly, and never for long-running tasks. Forgotten interactive jobs are the #1 reason for idling hardware. You are also missing out on some really nice features like email notifications when your jobs finish/crash.
Some tasks like debugging are significantly easier with interactive jobs.
You can use srun [...] --time=04:00:00 --immediate=3600 --pty /bin/bash
to open an interactive shell where you can start and stop processes and use
a debugger like you would on your local machine.
Note the time limit of 4 hours and the immediate period of 3600 seconds.
Both are required and are the maximum values allowed for interactive jobs.
Finally, unless you need a specific hardware feature, consider using GPUs from the batch partition for your interactive jobs. They are perfectly adequate for small examples.
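Putting this together, an interactive debugging session on a single GPU from the batch partition could be requested like this; the CPU and memory figures are illustrative:
$ srun -p batch --gpus=1 --cpus-per-gpu=8 --mem=32G \
    --time=04:00:00 --immediate=3600 --pty /bin/bash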