Getting started #
These docs aren’t long. Please read all of them before running anything. It will save you and everyone else a lot of time.
Requesting access #
As mentioned in the guidelines, you need a MyDFKI/Intranet account to access our intranet, and a Pegasus account to access the Slurm cluster itself. Send a request to cluster@dfki.de from your DFKI mail address.
Connecting to the cluster #
Use ssh to connect to one of our head nodes:
- login1.pegasus.kl.dfki.de
  SSH fingerprint: SHA256:BPnYmJOFDWqJEKoUP4GoC1+k7olmFLhP0zRaxZ+pl4M
- login2.pegasus.kl.dfki.de
  SSH fingerprint: SHA256:oZqynX+tSPdHpSkryArsWHEVuls18o32mjuWYMYCMwY
- login3.pegasus.kl.dfki.de
  SSH fingerprint: SHA256:ui6zRhZbSZtmMVEDRbEoTL0QBGgM+mtQCoSZ9b1ZTJ8
ssh [username]@[head node]
Here you can, among other things, modify your home directory and schedule jobs with the srun command.
See the Slurm Cluster section for more details on this.
Read the intro message as it contains important information.
Do not run compute jobs or other resource-intensive commands on head nodes! This includes computation inside VS Code and other remote IDEs!
SSH key authentication #
In case you don’t want to type in your credentials each time, feel free to set up public key auth like this (once):
# Create a ssh key pair on your local machine
ssh-keygen
# Copy your public key to remote machine
ssh-copy-id [username]@[head node]
# You should now be able to ssh using your key to authorize
ssh [username]@[head node]
Your $HOME is synchronized across all machines, so you only need to do this once.
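Optionally, a host entry in your local ~/.ssh/config saves you from typing the full host name as well. A minimal sketch (the alias "pegasus" is arbitrary):

# ~/.ssh/config on your local machine
Host pegasus
    HostName login1.pegasus.kl.dfki.de
    User [username]

# You can then connect with just:
ssh pegasus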
Moving data to the cluster #
While the cluster nodes all have shared access to NAS mounts such as /ds (it’s actually not unlikely that you’ll find your standard dataset there already) or /netscratch, currently we don’t allow mounting these file systems directly on other machines (e.g., yours).
The reason for this is a trade-off between performance and security.
Hence, if you want to get data in and out of the cluster, you’re left with the options you always have when you have ssh access:
- scp, rsync
- sftp clients, often included directly in your favorite file browser via sftp://... or fish:// (KDE); otherwise standalone tools such as CyberDuck or WinSCP
- sshfs (FUSE) mounts
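For example, a quick sketch of copying a local dataset to the cluster with scp or rsync; the destination path under /netscratch is only an assumption, see the storage section for where your data actually belongs:

# Copy a local directory to the cluster via a head node
# ([username], the head node, and the target path are placeholders)
scp -r ./my-dataset [username]@login1.pegasus.kl.dfki.de:/netscratch/[username]/my-dataset
# rsync only transfers what changed and can resume interrupted copies
# (-a: archive mode, -P: show progress and keep partial files)
rsync -aP ./my-dataset/ [username]@login1.pegasus.kl.dfki.de:/netscratch/[username]/my-dataset/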
See the section on storage to understand where to put what kind of data.
If you’re planning to download a large dataset or import a lot of data (let’s say starting at 20 GB) or don’t have the right permissions (to put it in what you think is the right place), please contact us.
Running commands in the background #
When you disconnect from an ssh session, your shell will close, and with it all processes running inside that shell.
Start a screen or tmux session to be able to disconnect and reconnect to your jobs later:
screen -S [descriptive name]
Detach from the screen with CTRL+A, followed by D.
Then to reattach run:
screen -r [descriptive name]
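If you prefer tmux, the equivalent commands look roughly like this (the session name is up to you):

# Start a named tmux session
tmux new -s [descriptive name]
# Detach with CTRL+B, followed by D; reattach later with:
tmux attach -t [descriptive name]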
Interactive jobs #
While it’s possible and sometimes useful to run interactive jobs on the cluster, e.g., to debug training code, avoid lengthy setup times, or create custom environments, we ask you to keep them to a minimum.
From experience, we know that interactive sessions rarely utilize the requested resources while they’re in use and are often forgotten shortly after.
Remember that jobs have exclusive access to the resources that
they receive, so lingering jobs waste compute time.
To limit their impact, interactive jobs must be started with a time limit of 4 hours or less (--time=04:00:00) and an immediate period of at most 3600 seconds (--immediate=3600).
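Put together, an interactive session could be started along these lines; the --pty bash part is illustrative, and any GPU, partition, or container flags depend on your setup (see the Slurm Cluster section):

# Interactive shell with the required time limit and immediate period
srun --time=04:00:00 --immediate=3600 --pty bash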
Connecting via Saarbrücken VPN #
In case you’re connecting via the VPN provided by DFKI Saarbrücken, be aware that only certain ports on cluster nodes are reachable: the standard ports for ssh (22), http (80), and https (443), as well as ports >= 10000. So, if you want to connect to self-deployed services, e.g., Jupyter notebooks, TensorBoard, etc., keep in mind to choose a port of 10000 or above for the service to listen on.
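For example, a hypothetical Jupyter server could be told to listen on an arbitrary port above 10000 (the port number below is just an example):

# Make the notebook listen on all interfaces on port 10042
jupyter notebook --no-browser --ip=0.0.0.0 --port=10042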
Organizing experiments #
Command lines to run jobs can get quite lengthy, so we recommend that you create shell scripts to reduce some (or all) of the boilerplate.
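As a minimal sketch, such a wrapper could look like the following; the srun flags, script name, and paths are placeholders for your own setup:

#!/bin/bash
# run.sh -- keeps the srun boilerplate in one place; extra arguments
# are passed through to the training script (all flags are examples)
srun --time=04:00:00 \
     --gpus=1 \
     python train.py "$@"

You would then launch an experiment with something like ./run.sh --lr 0.001, and the full job configuration lives in one reviewable file.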
It also helps to reproduce experiments if all parameters are known. Use Sacred or MLflow to automatically document your experiments in minute detail. Combined with our fixed software environments, this provides excellent reproducibility.
Make sure to write your code in a way that jobs can be continued in case of any interruption. Create regular snapshots of model and optimizer state so you can resume later. Power outages and other service interruptions are rare, but they do happen eventually. With the ability to continue the job, a lot of time and resources can be saved compared to starting over.
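One way to make a job resumable, sketched in shell; the checkpoint directory and the --resume flag are assumptions about your own training script:

# Resume from the latest checkpoint if one exists, otherwise start fresh
CKPT_DIR=/netscratch/$USER/my-experiment/checkpoints
LATEST=$(ls -t "$CKPT_DIR"/*.pt 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    python train.py --resume "$LATEST"
else
    python train.py
fi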