
Multi-GPU & distributed jobs

Distributed jobs #

Our cluster supports distributed jobs with PyTorch out of the box. We recommend a setup with DistributedDataParallel and one Slurm task (Python process) per GPU. If you need to use the non-distributed DataParallel instead, you must specify -N 1 to limit your job to a single node.

While our cluster has excellent support for distributed jobs, some modifications to your code are still required on your end. Check the section on required modifications below.
Using the default batch partition for multi-GPU jobs is strongly discouraged. Since Slurm does not differentiate between the different GPU models in that partition, your job may receive an arbitrary mix. In the best case your job works but runs poorly; in the worst case it crashes with unhelpful error messages.

Modify your srun command #

We recommend you add -K to your srun command to kill remaining tasks if one of your GPU workers dies unexpectedly:

-K, --kill-on-bad-exit      kill the job if any task terminates with a
                            non-zero exit code

Modify your code #

Some modifications to your code are necessary to enable distributed jobs; a minimal sketch follows the list below.

  • Initialize the process group with the environment method and NCCL backend: torch.distributed.init_process_group('NCCL'). The required environment variables are already set up correctly for you (PyTorch: MASTER_PORT, MASTER_ADDR, WORLD_SIZE, RANK; NCCL: NCCL_SOCKET_IFNAME=bond,eth, NCCL_IB_HCA=mlx5). Check the available NCCL settings, e.g., NCCL_DEBUG to output debug information.
  • Use the GPU device specified by the LOCAL_RANK environment variable. Either make sure that model and tensors are on the correct torch.device (recommended), or use torch.cuda.set_device.
  • You can read the WORLD_SIZE environment variable to see how many processes (also known as ranks) are part of the group.
  • Typically, each process/rank needs to load a different subset of the whole dataset. The PyTorch dataloader supports this with samplers, e.g., the DistributedSampler.
  • Otherwise, follow the official guides on distributed training and DistributedDataParallel.
Important: Since there are many parallel processes involved in the job, you may want to limit output (printing, snapshots) to rank 0 only and make all other ranks quiet. Use the RANK environment variable to find out which rank a process has.
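
Putting these points together, a minimal training script could look like the sketch below. The toy model, dataset, batch size, and learning rate are placeholders for illustration only, not part of the cluster setup:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are already set by the
    # cluster, so the environment initialization method needs no arguments.
    dist.init_process_group('nccl')
    rank = dist.get_rank()

    # One process per GPU: pick the device from LOCAL_RANK.
    device = torch.device('cuda', int(os.environ['LOCAL_RANK']))
    torch.cuda.set_device(device)

    # Toy model and data; replace with your own.
    model = DistributedDataParallel(
        torch.nn.Linear(16, 2).to(device), device_ids=[device.index]
    )
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

    # DistributedSampler gives each rank a different subset of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        # Limit printing to rank 0 to keep the job output readable.
        if rank == 0:
            print(f'epoch {epoch} done on {dist.get_world_size()} ranks')

    dist.destroy_process_group()


if __name__ == '__main__':
    main()

Launched with one of the srun commands below, every Slurm task runs this script once, and the environment variables tie the processes together into one group.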

Example training script #

We provide a relatively complete training script that trains ResNet 18 on the ImageNet dataset using PyTorch. It does not yet store the training state or resume from a stored state.
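
If you want to add checkpointing yourself, a common pattern is to write the state from rank 0 only and let every rank load it on startup. The sketch below illustrates this; the file name and the saved fields are examples, not part of the provided script:

import os

import torch
import torch.distributed as dist

CHECKPOINT = 'checkpoint.pth'  # example path, adjust to your setup


def save_state(model, optimizer, epoch):
    # Only rank 0 writes the file to avoid concurrent writes.
    if dist.get_rank() == 0:
        torch.save({
            'model': model.module.state_dict(),  # unwrap DistributedDataParallel
            'optimizer': optimizer.state_dict(),
            'epoch': epoch,
        }, CHECKPOINT)


def load_state(model, optimizer, device):
    if not os.path.exists(CHECKPOINT):
        return 0  # nothing to resume from, start at epoch 0
    # Map the stored tensors to this rank's GPU.
    state = torch.load(CHECKPOINT, map_location=device)
    model.module.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    return state['epoch'] + 1  # continue with the next epoch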

It uses datadings to load the dataset. You can use the simple install script as a task prolog to install it, as outlined in the custom software section.

ResNet 18 is a relatively small network by today’s standards and thus requires a lot of CPU workers to saturate a GPU. The following command should be a reasonable starting point:

$ srun -K \
    --partition=A100 \
    --nodes=1 \
    --ntasks=4 \
    --cpus-per-task=10 \
    --gpus-per-task=1 \
    --gpu-bind=none \
    --container-image=/enroot/nvcr.io_nvidia_pytorch_22.04-py3.sqsh \
    --container-workdir="`pwd`" \
    --container-mounts="`pwd`":"`pwd`",/ds:/ds:ro \
    --task-prolog="`pwd`/install_datadings.sh" \
    --mem-per-cpu=6G \
    python resnet18_imagenet.py
The example script uses AMP to accelerate training on GPUs with tensor cores. Both the large batch sizes needed for effective distributed training and AMP itself can have a negative effect on the training result, so accuracy may be slightly lower than expected. For final results you usually want to run with AMP disabled and fewer GPUs.
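
For reference, AMP in PyTorch is used via an autocast context and a gradient scaler. The minimal sketch below shows the pattern with a toy model, random data, and an assumed use_amp flag for switching it off in final runs:

import torch

device = torch.device('cuda')
model = torch.nn.Linear(16, 2).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
use_amp = True  # set to False for final runs
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(64, 16, device=device)
y = torch.randint(0, 2, (64,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):  # forward pass in mixed precision
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales gradients, skips the step on inf/nan
scaler.update()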

Example commands #

1 task per GPU #

Schedule a job with 4 tasks and one RTX A6000 GPU each:

$ srun -p RTXA6000 --ntasks 4 --gpus-per-task 1 --gpu-bind=none [your command]

This configuration is used with PyTorch DistributedDataParallel and scales better with more GPUs.

All GPUs in 1 task #

Schedule a job with 4 GPUs in a single task:

$ srun -p RTXA6000 --ntasks 1 --gpus-per-task 4 [your command]

This configuration is used for older PyTorch DataParallel code and other frameworks that do not support distributed training.
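
For reference, a DataParallel script needs no process group; it simply wraps the model, which then splits each batch across all GPUs visible to the single task (toy model and shapes are placeholders):

import torch

# DataParallel replicates the model on every GPU visible to this task.
model = torch.nn.DataParallel(torch.nn.Linear(16, 2)).cuda()

x = torch.randn(256, 16).cuda()  # the batch is scattered across the GPUs
out = model(x)                   # outputs are gathered back on GPU 0
print(out.shape)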

Example project #

A PyTorch Lightning based project template that allows for asymmetric DDP configurations is available. More details can be found in the repository.