Getting started #
These docs aren’t long. Please read all of them before running anything. It will save you and everyone else a lot of time.
Requesting access #
As mentioned in the guidelines, you need a MyDFKI/Intranet account to access our intranet, and a Pegasus account to access the Slurm cluster itself. Send a request to cluster@dfki.de from your DFKI mail address.
Connecting to the cluster #
Use ssh to connect to one of our head nodes:
- login1.pegasus.kl.dfki.de
  SSH fingerprint: SHA256:BPnYmJOFDWqJEKoUP4GoC1+k7olmFLhP0zRaxZ+pl4M
- login2.pegasus.kl.dfki.de
  SSH fingerprint: SHA256:oZqynX+tSPdHpSkryArsWHEVuls18o32mjuWYMYCMwY
- login3.pegasus.kl.dfki.de
  SSH fingerprint: SHA256:ui6zRhZbSZtmMVEDRbEoTL0QBGgM+mtQCoSZ9b1ZTJ8
ssh [username]@[head node]
Here you can, among other things, modify your home directory and schedule jobs with the srun command.
See the Slurm Cluster section for more details on this.
Read the intro message as it contains important information.
Do not run compute jobs or other resource-intensive commands on head nodes! This includes computation inside VS Code and other remote IDEs!
SSH key authentication #
In case you don’t want to type in your credentials each time, feel free to set up public key auth like this (once):
# Create a ssh key pair on your local machine
ssh-keygen
# Copy your public key to remote machine
ssh-copy-id [username]@[head node]
# You should now be able to ssh using your key to authorize
ssh [username]@[head node]
Your $HOME is synchronized across all machines, so you only need to do this once.
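Optionally, a host entry in your local ~/.ssh/config saves you from typing the full host name as well. A minimal sketch (the alias "pegasus" is arbitrary):

# ~/.ssh/config on your local machine
Host pegasus
    HostName login1.pegasus.kl.dfki.de
    User [username]

# You can then connect with just:
ssh pegasus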
Moving data to the cluster #
While the cluster nodes all have shared access to NAS mounts such as /ds (it’s actually not unlikely that you’ll find your standard dataset there already) or /netscratch, currently we don’t allow mounting these file systems directly on other machines (e.g., yours).
The reason for this is a trade-off between performance and security.
Hence, if you want to get data in and out of the cluster, you’re left with the options you always have when you have ssh access:
- scp, rsync
- sftp clients, often included directly in your favorite file browser via sftp://... or fish:// (KDE); otherwise standalone tools such as CyberDuck or WinSCP
- sshfs (FUSE) mounts
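For example, a quick sketch of copying a local dataset to the cluster with scp or rsync; the destination path under /netscratch is only an assumption, see the storage section for where your data actually belongs:

# Copy a local directory to the cluster via a head node
# ([username], the head node, and the target path are placeholders)
scp -r ./my-dataset [username]@login1.pegasus.kl.dfki.de:/netscratch/[username]/my-dataset
# rsync only transfers what changed and can resume interrupted copies
# (-a: archive mode, -P: show progress and keep partial files)
rsync -aP ./my-dataset/ [username]@login1.pegasus.kl.dfki.de:/netscratch/[username]/my-dataset/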
See the section on storage to understand where to put what kind of data.
If you’re planning to download a large dataset or import a lot of data (let’s say starting at 20 GB) or don’t have the right permissions (to put it in what you think is the right place), please contact us.
Running commands in the background #
When you disconnect from an ssh session, your shell will close, and with it all processes running inside that shell.
Start a screen or tmux session to be able to disconnect and reconnect to your jobs later:
screen -S [descriptive name]
Detach from the screen with CTRL+A, followed by D.
Then to reattach run:
screen -r [descriptive name]
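If you prefer tmux, the equivalent commands look roughly like this (the session name is up to you):

# Start a named tmux session
tmux new -s [descriptive name]
# Detach with CTRL+B, followed by D; reattach later with:
tmux attach -t [descriptive name]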
Interactive jobs #
While it’s possible and sometimes useful to run interactive jobs on the cluster, e.g., to debug training code, avoid lengthy setup times, or create custom environments, we ask you to keep them to a minimum.
From experience, we know that interactive sessions rarely utilize the requested resources while they’re in use and are often forgotten shortly after.
Remember that jobs have exclusive access to the resources that
they receive, so lingering jobs waste compute time.
To limit their impact, interactive jobs must be started with a time limit of 4 hours or less (--time=04:00:00) and an immediate period of at most 3600 seconds (--immediate=3600).
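Put together, an interactive session could be started along these lines; the --pty bash part is illustrative, and any GPU, partition, or container flags depend on your setup (see the Slurm Cluster section):

# Interactive shell with the required time limit and immediate period
srun --time=04:00:00 --immediate=3600 --pty bash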
Connecting via Saarbrücken VPN #
In case you’re connecting via the VPN provided by DFKI Saarbrücken, be aware that only certain ports on cluster nodes are reachable: the standard ports for ssh (22), http (80), and https (443), as well as ports >= 10000. So, if you want to connect to self-deployed services, e.g., Jupyter notebooks, TensorBoard, etc., keep in mind to choose a port of 10000 or above for the service to listen on.
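For example, a hypothetical Jupyter server could be told to listen on an arbitrary port above 10000 (the port number below is just an example):

# Make the notebook listen on all interfaces on port 10042
jupyter notebook --no-browser --ip=0.0.0.0 --port=10042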
Organizing experiments #
Command lines to run jobs can get quite lengthy, so we recommend that you create shell scripts to reduce some (or all) of the boilerplate.
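As a minimal sketch, such a wrapper could look like the following; the srun flags, script name, and paths are placeholders for your own setup:

#!/bin/bash
# run.sh -- keeps the srun boilerplate in one place; extra arguments
# are passed through to the training script (all flags are examples)
srun --time=04:00:00 \
     --gpus=1 \
     python train.py "$@"

You would then launch an experiment with something like ./run.sh --lr 0.001, and the full job configuration lives in one reviewable file.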
It also helps to reproduce experiments if all parameters are known. Use Sacred or MLflow to automatically document your experiments in minute detail. Combined with our fixed software environments, this provides excellent reproducibility.
Make sure to write your code in a way that jobs can be continued in case of any interruption. Create regular snapshots of model and optimizer state so you can resume later. Power outages and other service interruptions are rare, but they do happen eventually. With the ability to continue the job, a lot of time and resources can be saved compared to starting over.
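One way to make a job resumable, sketched in shell; the checkpoint directory and the --resume flag are assumptions about your own training script:

# Resume from the latest checkpoint if one exists, otherwise start fresh
CKPT_DIR=/netscratch/$USER/my-experiment/checkpoints
LATEST=$(ls -t "$CKPT_DIR"/*.pt 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    python train.py --resume "$LATEST"
else
    python train.py
fi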