Resource allocation #
srun offers many options to specify what resources to allocate for your job.
These are the most important ones (run srun --help for the complete list):
-n, --ntasks=ntasks          number of tasks to run
--ntasks-per-node=n          number of tasks to invoke on each node
-N, --nodes=N                number of nodes on which to run (N = min[-max])
-c, --cpus-per-task=ncpus    number of cpus required per task
--cpus-per-gpu=n             number of CPUs required per allocated GPU
-G, --gpus=n                 count of GPUs required for the job
--gpus-per-node=n            number of GPUs required per allocated node
--gpus-per-task=n            number of GPUs required per spawned task
--mem=MB                     minimum amount of real memory
(--mem-per-gpu=MB)           DO NOT USE, BUGGY!
                             real memory required per allocated GPU
--mem-per-cpu=MB             maximum amount of real memory per allocated
                             cpu required by the job.
                             --mem >= --mem-per-cpu if --mem is specified.
--time=1-00:00               job runtime limit [d-hh:mm] (default 7 days for
                             non-privileged partitions, 1-3 for A100)
Use the respective partition to select the GPU type.
Warning: While you can specify many GPUs for a job, you still have to make some modifications to your code to actually use them. For more details, see the section on Multi-GPU & distributed training.
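For example, a single-task job that needs one GPU could be launched roughly like this; -p/--partition selects the partition and therefore the GPU type, and the resource figures and the script name train.py are only illustrative:
$ srun -p A100 --gpus=1 --cpus-per-gpu=8 --mem=32G --time=0-08:00 \
    python train.py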
The Slurm documentation states that a combination of --gpus and --gpus-per-task should imply how many tasks to start. This is currently not the case, so use --ntasks and --gpus-per-task with --gpu-bind=none instead. For a single-task job on a single GPU, specifying --gpus=1 is sufficient.
Sadly, --mem-per-gpu is buggy at the moment and does not work.
We recommend configuring jobs according to how many GPUs are required, using a combination of --ntasks and --gpus-per-task with --gpu-bind=none, and giving each GPU a sufficient number of CPUs for pre-processing with --cpus-per-gpu.
Optionally, set the number of nodes to 1 to ensure that all GPUs are on the same node. This is not generally required for jobs on DGX systems (V100 and A100 partitions), where InfiniBand is available, but may yield minor performance improvements for jobs on other machines. This is, of course, at the expense of longer wait times for your job to get scheduled.
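As a rough sketch of this recommendation, a job running four tasks with one GPU each might be submitted as follows; the CPU count and the script name are illustrative, and --nodes=1 can be dropped if the GPUs do not need to share a node:
$ srun -p A100 --nodes=1 --ntasks=4 --gpus-per-task=1 --gpu-bind=none \
    --cpus-per-gpu=16 --time=1-00:00 python train.py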
You can verify the default settings for each partition with scontrol show partition:
$ scontrol show partition A100
PartitionName=A100
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=serv-33[28-32]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1280 TotalNodes=5 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=24576 MaxMemPerNode=UNLIMITED
By default, GPU-based jobs receive 24 GiB of memory per node. Your job may run poorly or crash with an “out of memory” message if more memory is required. You can request more memory with, e.g., --mem=42G or --mem-per-cpu=12G.
Enroot containers require RAM roughly equivalent to the size of the .sqsh file (10-15 GiB).
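For instance, a single-GPU job that loads a large Enroot image and does heavy pre-processing might request memory explicitly; the figures below are illustrative, not a recommendation:
$ srun -p batch --gpus=1 --cpus-per-gpu=8 --mem=42G --time=0-12:00 \
    python train.py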
Interactive jobs #
Please use interactive jobs or notebooks sparingly, and never for long-running tasks. Forgotten interactive jobs are the #1 reason for idling hardware. You are also missing out on some really nice features like email notifications when your jobs finish/crash.
Some tasks like debugging are significantly easier with interactive jobs.
You can use srun [...] --time=04:00:00 --immediate=3600 --pty /bin/bash
to open an interactive shell where you can start and stop processes and use
a debugger like you would on your local machine.
Note the time limit of 4 hours and the immediate period of 3600 seconds.
Both are required and are the maximum values allowed for interactive jobs.
Finally, unless you need a specific hardware feature, consider using GPUs from the batch partition for your interactive jobs. They are perfectly adequate for small examples.
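Putting this together, an interactive debugging session on a single GPU from the batch partition could be requested like this; the CPU and memory figures are illustrative:
$ srun -p batch --gpus=1 --cpus-per-gpu=8 --mem=32G \
    --time=04:00:00 --immediate=3600 --pty /bin/bash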