Examples #
Simple single-GPU job #
This command requests 1 GPU from the default partition and executes python train.py inside the PyTorch 23.12 container, with the current directory as the working directory:
$ srun -K \
--job-name="EZVEProject_MyExperimentName" \
--gpus=1 \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`" \
--container-image=/enroot/nvcr.io_nvidia_pytorch_23.12-py3.sqsh \
--container-workdir="`pwd`" \
python train.py
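Here train.py stands in for your own training script. As a quick sanity check that the container actually sees the allocated GPU, a minimal hypothetical version might look like the following sketch (assuming the NVIDIA PyTorch image, so torch is preinstalled):
# train.py: minimal sanity-check sketch (hypothetical; replace with your
# actual training script)
import torch

def main():
    # With --gpus=1, exactly one GPU should be visible inside the container.
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Visible GPUs: {torch.cuda.device_count()}")
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
        # Run a trivial computation to confirm the GPU actually works.
        x = torch.randn(1024, 1024, device=device)
        print(f"Matmul on {torch.cuda.get_device_name(device)} OK: {(x @ x).sum().item():.2f}")

if __name__ == "__main__":
    main()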
Distributed training with 16 GPUs #
The srun command below does the following once all requested resources have been allocated:
- Start a container from the nvcr.io_nvidia_pytorch_23.12-py3.sqsh image.
- Mount some directories from the worker host.
- Start 16 tasks.
- Request one GPU from the RTXA6000 partition for each task.
- Request 6 CPUs for each GPU.
- Run the Python script train.py, located in the current directory, once per task (16 times in total).
$ srun -K \
--job-name="EZVEProject_MyExperimentName" \
--container-image=/enroot/nvcr.io_nvidia_pytorch_23.12-py3.sqsh \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`" \
--container-workdir="`pwd`" \
-p RTXA6000 \
--ntasks=16 \
--gpus-per-task=1 \
--gpu-bind=none \
--cpus-per-gpu=6 \
python train.py
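srun launches one copy of train.py per task, so the script itself has to discover its rank and join the process group. A common pattern is to read the per-task environment variables that Slurm sets. The sketch below assumes train.py uses torch.distributed with the NCCL backend and that MASTER_ADDR and MASTER_PORT are exported before srun is called (e.g. pointing at the first node of the allocation); it is an illustration, not a fixed recipe:
# train.py: distributed-init sketch (hypothetical; assumes MASTER_ADDR and
# MASTER_PORT are already set in the environment)
import os
import torch
import torch.distributed as dist

def main():
    # Slurm sets these per task; they map directly onto rank and world size.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # One GPU per task (--gpus-per-task=1); with --gpu-bind=none each task
    # can see all GPUs on its node, so select by local task index.
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    print(f"rank {rank}/{world_size} on {os.uname().nodename}, local GPU {local_rank}")
    # ... build model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()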