Changes to interactive jobs
November 8, 2024, Joachim Folz
We will soon be rolling out changes to interactive jobs. Read below how to adjust your jobs to fit the new requirements.
New requirements for interactive jobs
From Wednesday, 13th November 2024, jobs started with srun --pty (interactive jobs) must have a time limit of 4 hours or less (--time 04:00:00) and an immediate period of at most 3600 seconds (--immediate=3600).
The latter means srun will exit if resources do not become available
during the given time period.
If you attempt to start an interactive job with a longer time limit or a longer immediate period, srun will exit with this error message:
Interative jobs (with --pty) must have --time <= 4 hours and --immediate <= 3600 seconds
We also recommend setting --mail-type=BEGIN and --mail-user to get notified by email when the job starts.
Example: srun --time=01:00:00 --immediate=300 --mail-type=BEGIN --mail-user=address@domain.tld --pty bash
The equals sign in --immediate=300 is mandatory. Slurm does not accept a space there, since the value of --immediate is optional.
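To catch a rejected request before submitting, the two limits can be checked locally. The following is only a minimal sketch: the function name interactive_limits_ok is hypothetical, and it assumes the time limit is written as HH:MM:SS (other time formats that Slurm accepts, such as days-hours, are not handled).

```shell
# Hypothetical helper, not part of Slurm or Pegasus.
# Returns 0 if both limits satisfy the new rules, 1 if they do not,
# and 2 if the input is not in the assumed format (HH:MM:SS and whole seconds).
interactive_limits_ok() {
    ilo_time=$1        # time limit, e.g. 01:00:00
    ilo_immediate=$2   # immediate period in seconds, e.g. 300

    case $ilo_time in
        [0-9][0-9]:[0-9][0-9]:[0-9][0-9]) ;;
        *) return 2 ;;
    esac
    case $ilo_immediate in
        ''|*[!0-9]*) return 2 ;;
    esac

    # Split HH:MM:SS into its three fields.
    ilo_hours=${ilo_time%%:*}
    ilo_rest=${ilo_time#*:}
    ilo_minutes=${ilo_rest%%:*}
    ilo_seconds=${ilo_rest#*:}

    # Drop one leading zero so that "08" is not read as an octal number.
    ilo_hours=${ilo_hours#0}
    ilo_minutes=${ilo_minutes#0}
    ilo_seconds=${ilo_seconds#0}

    ilo_total=$(( $ilo_hours * 3600 + $ilo_minutes * 60 + $ilo_seconds ))

    # 4 hours = 14400 seconds; the immediate period may be at most 3600 seconds.
    [ "$ilo_total" -le 14400 ] && [ "$ilo_immediate" -le 3600 ]
}
```

It could be used as a guard, for example: interactive_limits_ok 01:00:00 300 && srun --time=01:00:00 --immediate=300 --pty bash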
Migrating from interactive jobs
Migrating to non-interactive jobs is simple.
If you currently start your job with srun --pty bash and then run python train.py manually, you can pass your training command to srun directly, i.e., srun python train.py.
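As a rough sketch of that migration, the command can be wrapped in a small shell function. The name run_training and the choice of --mail-type=END,FAIL are assumptions made for illustration; only the srun options themselves come from Slurm.

```shell
# Hypothetical wrapper, not provided by Pegasus.
# First argument: address to notify; remaining arguments: the command to run.
run_training() {
    rt_mail_user=$1
    shift
    # No --pty here: srun runs the given command directly and exits when it finishes.
    # END,FAIL asks Slurm to send mail when the job ends or fails (an assumed preference).
    srun --mail-type=END,FAIL --mail-user="$rt_mail_user" "$@"
}
```

With this in place, run_training address@domain.tld python train.py would replace the manual sequence of srun --pty bash followed by python train.py.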
If you have been installing additional software manually until now, please refer to our page on using Custom Software to automate this step.
Why we believe this is necessary
In combination, a shorter time limit and set immediate period are intended to avoid long periods of inactivity. It is currently too easy to inadvertently block resources if interactive jobs start many hours after they are submitted. In addition, an interactive job implies a higher priority than a normal job, and shorter time limits often allow the scheduler to squeeze a job into a gap between larger, long-running jobs.
Feedback welcome
Ultimately, Pegasus is supposed to support your research, and hard limits can prevent that from happening. 4 hours and 3600 seconds are just our best guesses for sensible values. Please let us know if they prevent one of your use cases so we can find a solution.