Guidelines for DFKI Deep Learning infrastructure #

These guidelines will walk you through the basics of using the infrastructure to train your models, but first things first:

Non-adherence to the following guidelines may result in account suspension!

Read these guidelines #

Yes, all of them. From start to finish. It won’t take you long, and you’ll quickly learn a lot and avoid a lot of RTFMs.

Be considerate #

We have a lot of cool resources and compute power, but we are also many researchers, and we all want to run experiments. Use as many resources as you need, but make sure you put them to good use. Don’t reserve resources for later use.

Be responsible #

Monitor your jobs to make sure they are working as intended and not wasting compute time. This also helps you select the correct amount of resources for your jobs. We don’t like to do it, but we will have to restrict your access if your actions interfere with the operation of the cluster.
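One way to check whether a running job is actually using its GPUs is to look at GPU utilization from inside the job. The following is a minimal sketch, not an official tool of this cluster: it assumes `nvidia-smi` is available where your job runs, and the helper names (`parse_gpu_stats`, `idle_gpus`) and the 10% idle threshold are illustrative choices.

```python
import subprocess

# Query a few per-GPU metrics as plain CSV (no header, no units).
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse the CSV output of QUERY_CMD into a list of dicts, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # skip lines that do not match the expected layout
        try:
            index, util, mem_used, mem_total = (int(field) for field in fields)
        except ValueError:
            continue  # skip GPUs that report non-numeric values such as "[N/A]"
        stats.append(
            {
                "index": index,
                "util_percent": util,
                "mem_used_mib": mem_used,
                "mem_total_mib": mem_total,
            }
        )
    return stats


def idle_gpus(stats, util_threshold=10):
    """Return the indices of GPUs whose utilization is below the threshold.

    The threshold is an arbitrary assumption; pick one that fits your workload.
    """
    return [gpu["index"] for gpu in stats if gpu["util_percent"] < util_threshold]


if __name__ == "__main__":
    try:
        result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"Could not query GPUs: {err}")
    else:
        gpu_stats = parse_gpu_stats(result.stdout)
        for gpu in gpu_stats:
            print(gpu)
        print("GPUs that look idle:", idle_gpus(gpu_stats))
```

A single reading is only a snapshot, so it is worth checking a few times during a run. GPUs that stay idle for long usually mean the job requested more than it needs or is bottlenecked elsewhere, for example on data loading.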

Be reachable #

We primarily use our Mattermost chat to communicate. Ask your supervisor to invite you. Join the Deep Learning channel (you can find it by clicking “More…” under “Public Channels”) to keep up to date with news and developments about our infrastructure. If there are issues with your experiments and you cannot be reached within a reasonable time frame (1-2 hours), we might be forced to cancel your jobs to prevent harm to others.

Keep your directories tidy #

With the Pegasus account you are given some default directories: a $HOME directory with a 10 GB quota and a /netscratch/$USER directory without space limitations (yet). Since /netscratch is shared among all Pegasus users, we strongly recommend cleaning up your /netscratch directory regularly. In general, the space you occupy permanently should not exceed a couple of terabytes.

Calls in the Deep Learning channel to clean up /netscratch directories must be followed as soon as possible. Ignoring such a call can lead to account suspension, cancellation of all running jobs, and removal of the data.
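To see what is taking up space before cleaning up, it helps to list the largest entries in your directory. The sketch below is one possible way to do that in Python. The function names are illustrative, and it assumes your data lives under /netscratch/$USER as described above. Walking a very large tree on a shared filesystem can be slow, so run it sparingly.

```python
import os
from pathlib import Path


def dir_size_bytes(path):
    """Return the total size in bytes of all files below `path`."""
    total = 0
    # onerror swallows unreadable directories instead of aborting the walk
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                # lstat counts a symlink as the link itself, not its target
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def largest_entries(base, top_n=10):
    """Return up to `top_n` (size_bytes, name) pairs for the direct children
    of `base`, largest first."""
    sizes = []
    for entry in Path(base).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            size = dir_size_bytes(entry)
        else:
            size = entry.lstat().st_size
        sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:top_n]


def format_size(num_bytes):
    """Format a byte count with binary units, e.g. 2048 -> '2.0 KiB'."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024


if __name__ == "__main__":
    # Assumed location, taken from the directory layout described above.
    base = os.path.join("/netscratch", os.environ.get("USER", ""))
    if os.path.isdir(base):
        for size, name in largest_entries(base):
            print(f"{format_size(size):>12}  {name}")
    else:
        print(f"{base} does not exist on this machine")
```

The sizes are apparent file sizes, which can differ from the actual disk usage on the filesystem, so treat the output as a rough guide to where cleanup will pay off most.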

Ask questions #

While it’s good to learn for yourself, it also isn’t very productive if you bang your head against the wall for hours. Google / think first, but if you really can’t figure something out or wonder what the best way to achieve something is, please ask in the Deep Learning channel. See the Questions / Contact page for hints on how to ask questions in such a way that allows others to answer them. We’re actually interested in making the HPC stack easy, hopefully even fun to use. We have pretty cool resources here and we have many people like you with crazy ideas using them. So if something doesn’t feel right, or you wonder “why can’t we do this”: Please ask. Otherwise, we can’t help!

Contribute back #

Don’t only take. If you benefited from the HPC cluster or these docs, remember that you probably benefited from others sharing their hardware and knowledge with you. Keep this going by contributing back! If your group is writing its next proposal, get in touch with us and ask if you can buy a GPU / Node / Storage for the cluster! If you figured something out and think it could save others some time, consider updating these docs (see the link in the footer of this page) or sharing your findings in our Deep Learning channel. If someone is asking questions in that channel and you know the answer, help them out!

Before you continue #

Here are some requirements and prior knowledge that you will need to get the most out of this documentation:

- Accounts (we’re working on simplifying this):
  - Pegasus account to access the Slurm cluster itself. Send a request to cluster@dfki.de.
  - Access to DFKI GitLab
  - Access to DFKI Mattermost
- Familiarity with basic shell commands (e.g., ssh, htop, tmux and/or screen)