Examples

Simple single-GPU job

This command requests 1 GPU on the default partition and runs python train.py inside the PyTorch 23.12 container, with the current directory mounted and set as the working directory:

$ srun -K \
  --job-name="EZVEProject_MyExperimentName" \
  --gpus=1 \
  --container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`" \
  --container-image=/enroot/nvcr.io_nvidia_pytorch_23.12-py3.sqsh \
  --container-workdir="`pwd`" \
  python train.py
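
Here, train.py stands for any training script. As a quick sanity check, a minimal hypothetical version can simply confirm that the container sees the allocated GPU:

# train.py -- hypothetical minimal script to verify the allocation
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
print(f"Visible GPUs:    {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Device 0:        {torch.cuda.get_device_name(0)}")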

Distributed training with 16 GPUs

The srun command below does the following once all requested resources have been allocated:

  • Start a container from the nvcr.io_nvidia_pytorch_23.12-py3.sqsh image.
  • Mount /netscratch/$USER, the read-only /ds directory, and the current working directory from the worker host.
  • Start 16 tasks.
  • Request one GPU from the RTXA6000 partition for each task.
  • Request 6 CPUs for each GPU (96 CPUs in total).
  • Run the Python script train.py from the current directory 16 times, once per task (see the sketch after the command for how each task can identify itself).

$ srun -K \
  --job-name="EZVEProject_MyExperimentName" \
  --container-image=/enroot/nvcr.io_nvidia_pytorch_23.12-py3.sqsh \
  --container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`" \
  --container-workdir="`pwd`" \
  -p RTXA6000 \
  --ntasks=16 \
  --gpus-per-task=1 \
  --gpu-bind=none \
  --cpus-per-gpu=6 \
  python train.py
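
Because srun launches train.py once per task, the script itself has to join the 16 processes into a single training job. A minimal sketch of how each task can derive its rank from Slurm's environment variables, assuming PyTorch DistributedDataParallel with the NCCL backend (your actual training code may be set up differently), could look like this:

# train.py -- hypothetical sketch of distributed setup under srun
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # srun sets these variables for each of the 16 tasks.
    rank = int(os.environ["SLURM_PROCID"])         # 0..15, global rank
    world_size = int(os.environ["SLURM_NTASKS"])   # 16
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node

    # With --gpu-bind=none every task sees all GPUs on its node,
    # so the local rank selects which one this task uses.
    torch.cuda.set_device(local_rank)

    # init_process_group reads MASTER_ADDR and MASTER_PORT from the
    # environment; they must point to one of the allocated nodes and
    # be exported before python train.py runs (not shown here).
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = torch.nn.Linear(128, 128).cuda()
    model = DDP(model, device_ids=[local_rank])
    # ... regular training loop using a DistributedSampler goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()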