

Storage #

Important! Pegasus storage systems are not to be used as long-term storage space! They only provide the capacity needed to run experiments. Mission-critical data must(!) be saved/copied outside Pegasus. It is your responsibility!

Here’s some information on what files to put where. A brief overview (explained in more detail below):

  • $HOME: source code, final results, out of system backup
  • /netscratch: cluster file system, environments, data (temporary files, results), no backup
  • /fscratch: cluster file system, data needing low latency or used by (very) distributed jobs, no backup
  • /ds(-*): dataset shares (read from here, don’t write), local snapshots
In general, avoid working with many small files on cluster setups, as network file systems punish you with their overhead (latency) in such cases. Also avoid placing large numbers of small files into a single directory: lookups in such directories take very long, both for you on the shell and for your code.
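
A quick way to spot problem directories is to count how many files they contain (standard tools only; some_dir is a placeholder for one of your directories):

find /netscratch/$USER/some_dir -type f | wc -l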

$HOME directory #

Your $HOME directory is limited to 10 GB, so use it for source code only.

One reason for this is that /home is backed up every 3 hours and backups are kept for 30 days. Hence, do not put the following in $HOME:

  • temporary files (e.g., logs / training snapshots) (use /netscratch instead)
  • final results (use /netscratch instead, download a copy)
  • software environments (virtualenv, conda env, …) (use one of the many existing images or customize one; otherwise use /netscratch)
  • large datasets (use /ds or your respective department dataset share instead)
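
To check how much of the 10 GB you are currently using, a plain du is enough (standard coreutils, no Pegasus-specific quota tooling assumed):

du -sh $HOME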

/netscratch #

/netscratch is our large data store (BeeGFS cluster file system). It is available on all login and compute nodes.

Use your personal directory /netscratch/$USER for all your large files and other data such as experiments, results, etc. If in doubt where to place something, this is probably a good place.
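
For example, you might keep experiments and results in separate subdirectories (the layout below is just a suggestion, not a required structure):

mkdir -p /netscratch/$USER/experiments /netscratch/$USER/results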

Although the risk of data loss is very small, note that, contrary to $HOME, there are no backups of /netscratch (that’s what scratch means!). So if there is something that you absolutely must not lose and cannot reproduce, store it in $HOME or download it.

/fscratch #

/fscratch is a different, smaller, but much faster data store than /netscratch. It is available on all login and compute nodes.

If your experiment requires low-latency data access or runs highly distributed across top GPU systems, /fscratch might be a good choice to avoid waiting for data. Otherwise, all statements regarding /netscratch apply here too.
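
If reading directly from /ds or /netscratch turns out to be too slow for such a job, one option is to stage the needed files onto /fscratch once and read them from there (placeholder paths below; weigh this against the general advice to avoid copying data around):

cp -r /ds/some_dataset /fscratch/$USER/some_dataset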

Cleanup #

Storage is not endless and /netscratch is primarily intended for temporary data. Please clean up your /netscratch folder regularly.

Use ncdu to check how much space is occupied and where it’s going. You can also delete things directly from within the ncdu interface (press d and confirm).
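
For example, to scan only your own directory rather than all of /netscratch:

ncdu /netscratch/$USER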

You can also use our springclean script. It looks for serially numbered files such as training snapshots. The help text (/netscratch/springclean -h) explains how it works and how you can influence what it selects. To see what it would do without deleting anything, try a dry run first:

/netscratch/springclean /netscratch/$USER --dry-run
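
If the dry-run output looks right, run the same command again without --dry-run to actually delete the selected files.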

Datasets #

Use /ds for datasets.

Note that there is already a large collection of common datasets.

If you think something is missing (e.g., it is common or others might also be interested), feel free to create a folder in one of the sub-folders and just put it there. A short README.md stating the source URL, a (LaTeX) citation, and maybe even an abstract, as well as a ping in our chat channel, are very much appreciated. If the categories in /ds don’t really fit, if it’s bigger than a couple of GBs, if you need help downloading it, or really anything else, also just ask in chat.
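
A minimal sketch of adding a dataset this way (some-category and my-dataset are placeholders, not existing shares):

mkdir -p /ds/some-category/my-dataset
# copy the data in, then add a README.md with the source URL, citation, and abstract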

In general, try to avoid copying data around. Instead, if possible, mount /ds into your containers and read directly from there. Then write models, results, etc. to /netscratch, which is also mounted.
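
As a sketch of what this can look like, assuming jobs are started via Slurm with the pyxis/enroot container plugin (image path and script are placeholders; check the job submission docs for the exact flags used on Pegasus):

srun --container-image=/netscratch/$USER/my_image.sqsh --container-mounts=/ds:/ds,/netscratch/$USER:/netscratch/$USER python train.py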

If at all possible, avoid datasets provided as many small files. Performance is always terrible, even more so on network file systems (such as ours). Use datadings (some datasets are already available in /ds in this format) or webdataset to convert them into a few larger files.
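
For webdataset, for example, shards are plain tar files, so a sample directory can be packed with standard GNU tar (placeholder paths; --sort=name keeps the files belonging to one sample next to each other, which webdataset expects):

tar --sort=name -cf /netscratch/$USER/my_dataset-000000.tar -C /netscratch/$USER/my_dataset .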

Dataset shares are snapshotted locally every 3 hours and those are preserved for 30 days.

Git repositories #

Use DFKI GitLab to create personal repositories. Don’t store anything that is large or binary in your git repository (datasets, models, …).
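
One simple safeguard is a .gitignore that excludes typical large artifacts (the patterns below are only examples; adjust them to your project):

printf '%s\n' '*.ckpt' '*.pt' 'data/' 'results/' >> .gitignore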

Note: the skeleton project is outdated and needs an update. Make sure that your experiments are configured correctly. Check out our skeleton project for more detailed explanations on how to properly set up experiments. It defines a sensible structure for your repository and has template scripts for experiments that prepare the environment and avoid common pitfalls.