Partitions #
Our cluster is organized in partitions by GPU type. Add `-p [name]` or `--partition [name]` to your `srun` command:
- A100-40GB
- A100-80GB
- A100-PCI
- H100
- H200
- L40S
- RTX3090
- RTXA6000
- V100-16GB
- V100-32GB
- batch
You can use the partition to specify which kind of GPU, if any, your job requires.
There are also sub-partitions for the group that contributed the node to the cluster. Only users from that group can use these sub-partitions and jobs scheduled there have a higher priority.
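For example, a minimal interactive run on a specific partition might look like this (the GPU count and script name are placeholders, not a prescribed workflow):

```bash
# Run a script on one GPU of the RTXA6000 partition (placeholder script name)
srun -p RTXA6000 --gpus=1 python train.py
```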
Resources in partitions #
Values below like CPUs and memory per GPU are based on evenly distributing all available resources of a node. While this is a reasonable starting point, please consider that some jobs may require more. If everyone requests only the necessary amount of CPUs and memory, we can ensure that more demanding jobs can run as well. Monitor your jobs to find out what they actually require and add a margin for safety. The resources dashboard in particular will tell you what is currently available. An example request is shown after the table below.
| Partition | GPU Name | GPU Arch | GPU Mem (GB) | GPUs per Node | CPUs per GPU | Mem per GPU (GB) | Time Limit (Default - Max) |
|---|---|---|---|---|---|---|---|
| A100-40GB / A100-80GB | A100-SXM4 | Ampere | 40 / 80 | 8 | 32 | 112 / 224 | 1 - 3 days |
| A100-PCI | A100-PCIE | Ampere | 40 | 8 | 12 | 48 | 1 - 3 days |
| H100 | H100-SXM5 | Hopper | 80 | 8 | 28 | 224 | 1 - 1 day |
| H200 | H200-SXM5 | Hopper | 141 | 8 | 28 | 224 | 1 - 1 day |
| L40S | L40S | Ada Lovelace | 48 | 8 | 16 | 125 | 1 - 3 days |
| RTX3090 | RTX 3090 | Ampere | 24 | 8 | 12 | 64 | 1 - 3 days |
| RTXA6000 | RTX A6000 | Ampere | 48 | 8 | 12 | 108 | 1 - 3 days |
| V100-16GB | V100-SXM2 | Volta | 16 | 8 | 10 | 64 | 1 - 3 days |
| V100-32GB | V100-SXM2 | Volta | 32 | 8 | 10 | 64 | 1 - 3 days |
| batch | RTX 6000 | Turing | 24 | 10 | 7 | 64 | 1 - 3 days |
Group sub-partitions have the same default time limit, but no maximum.
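As a rough sketch, a batch script for the A100-80GB partition that requests the evenly distributed per-GPU share from the table above could look like the following (job name and training script are placeholders; request fewer CPUs and less memory if your job does not need the full share):

```bash
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --partition=A100-80GB
#SBATCH --gpus=1
#SBATCH --cpus-per-gpu=32         # even per-GPU share from the table; reduce if possible
#SBATCH --mem-per-gpu=224G        # even per-GPU share from the table; reduce if possible
#SBATCH --time=1-00:00:00         # default time limit of 1 day

srun python train.py              # placeholder training script
```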
Use cases & connectivity #
Here is some more info on how GPUs and nodes are connected in each partition. NVLink or NVSwitch speeds up communication between GPUs in the same node. InfiniBand improves communication between nodes.
Some jobs can work perfectly fine using PCIe / Ethernet as interconnect, while others may run significantly slower. As a rule of thumb: the more GPUs and the more iterations/s (i.e., the more synchronization overhead), the larger the impact. A multi-node example follows the table below.
| Partition | GPU Link | InfiniBand | Comments |
|---|---|---|---|
| A100-40GB / A100-80GB | NVSwitch | yes | best for multi-GPU, InfiniBand for multi-node jobs, needs image version 20.06 or newer |
| A100-PCI | PCIe | no | good for single-node multi-GPU, needs image version 20.10 or newer |
| H100 | NVSwitch | no | latest architecture, best for multi-GPU, needs image version 22.09 or newer |
| H200 | NVSwitch | no | latest architecture, best for multi-GPU, needs image version 22.09 or newer |
| L40S | PCIe | no | good for single-node multi-GPU, needs image version 22.09 or newer |
| RTX3090 | PCIe | no | good for single-node multi-GPU, needs image version 20.10 or newer |
| RTXA6000 | PCIe | no | good for single-node multi-GPU, needs image version 20.10 or newer |
| V100-16GB / V100-32GB | NVLink | yes | NVLink for multi-GPU, InfiniBand for multi-node jobs |
| batch | PCIe | no | default partition, good for single-GPU jobs, OK otherwise |
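As a minimal sketch, a multi-node job on one of the InfiniBand partitions could be requested as shown below (node, task, and resource counts are only examples; how the distributed processes are launched depends on your framework):

```bash
# Two full A100-80GB nodes, one task per GPU (placeholder values and script)
srun -p A100-80GB --nodes=2 --ntasks-per-node=8 --gpus-per-node=8 \
     --cpus-per-gpu=32 --mem-per-gpu=224G \
     python train_distributed.py
```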
Performance #
Here are some benchmarks for ImageNet training. Throughput is measured in images/s using one entire node in each partition and the PyTorch 20.10 container (for H100/H200 and GH200, the PyTorch 24.05 container). An example of how such a containerized run can be submitted follows the table.
| Partition | GPUs | Batch size | Throughput (images/s) |
|---|---|---|---|
| batch | 8 | 192 | 2250 |
| batch | 10 | 192 | 2835 |
| V100 | 8 | 256 | 2900 |
| V100 (node gera) | 16 | 256 | 6234 |
| A100 | 8 | 256 | 6500 |
| A10 | 4 | 160 | 1162 |
| H100 | 8 | 256 | 12879 |
| H200 | 8 | 256 | 13472 |
| L40S | 8 | 256 | 4500 |
| GH200 | 1 | 256 | 2044 |
| RTX3090 | 8 | 192 | 3300 |
| RTXA6000 | 8 | 192 | 3450 |
| RTXA6000 | 8 | 384 | 3550 |
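If the cluster uses the pyxis/enroot Slurm container plugin (an assumption here; check the cluster's container documentation), a containerized run similar to the benchmarks above might be submitted like this (image tag, script, and batch size are placeholders):

```bash
# --container-image is a pyxis/enroot flag; image tag and script are placeholders
srun -p A100-80GB --gpus=8 \
     --container-image=nvcr.io/nvidia/pytorch:20.10-py3 \
     python imagenet_benchmark.py --batch-size 256
```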
Benchmark results for Transformer network training. Throughput is measured in tokens/s using one entire node in each partition and the PyTorch 21.05 container.
| Partition | GPUs | Batch size | Throughput (tokens/s) | Data type |
|---|---|---|---|---|
| batch | 10 | 5120 | 63,000 | FP32 |
| V100 | 8 | 5120 | 65,000 | FP32 |
| RTX3090 | 8 | 5120 | 29,500 | FP32 |
| RTX3090 | 8 | 5120 | 32,500 | TF32 |
| RTXA6000 | 8 | 10240 | 80,000 | FP32 |
| RTXA6000 | 8 | 10240 | 150,000 | TF32 |
| A100-40GB | 8 | 10240 | 90,000 | FP32 |
| A100-40GB | 8 | 10240 | 326,000 | TF32 |
| A100-80GB | 8 | 10240 | 90,000 | FP32 |
| A100-80GB | 8 | 20480 | 92,000 | FP32 |
| A100-80GB | 8 | 10240 | 341,000 | TF32 |
| A100-80GB | 8 | 20480 | 360,000 | TF32 |
Partition status #
Check the resources dashboard for the most up-to-date info on available resources.
On head nodes, you can also run `sinfo` to see a list of available partitions. This is just an example of what its output looks like; the current situation is almost certainly different.
```bash
$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*     up     infinite   3      mix    kent,kersey,kusel
V100-32GB  up     infinite   4      mix    garda,gera,gifu,glendale
V100-16GB  up     infinite   1      mix    glasgow
A100-40GB  up     infinite   5      mix    serv-[3328-3332]
```
Note: Use this extended `sinfo` command to get a bit more info:

```bash
sinfo -o "| %15P | %15f | %8c | %10m | %15G | %10O | %8t | %N"
```