Storage #
Important! Pegasus storage systems are not to be used as long-term storage space! They are only there to provide the necessary capacity to run experiments. Mission-critical data must(!) be saved/copied outside of Pegasus. It is your responsibility!
Here’s some information on what files to put where. A brief overview (explained in more detail below):

$HOME
: source code, final results, out of system backup

/netscratch
: cluster file system, environments, data (temporary files, results), no backup

/fscratch
: cluster file system, data needing low latency or used by (very) distributed jobs, no backup

/ds(-*)
: dataset shares (read from here, don’t write), local snapshots
In general, avoid working with many small files on cluster setups, as network file systems punish you with per-file overhead (latency). Also avoid placing large numbers of small files in a single directory: lookups in such directories take very long, both for you and for your code.
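If you are unsure whether a directory is affected, a quick check and the usual fix (packing the small files into a single archive) might look like this; the paths are just placeholders:
# count the files below a directory (a very large number means slow lookups)
find /netscratch/$USER/my_data -type f | wc -l
# pack many small files into a single archive and read from that instead
tar -cf /netscratch/$USER/my_data.tar -C /netscratch/$USER my_data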
$HOME directory #
Your $HOME directory is limited to 10 GB, so use it for source code only.
One reason for this is that /home is backed up every 3 hours and backups are kept for 30 days.
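To see how close you are to that limit, a quick check is:
du -sh $HOME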
Hence, do not put the following in $HOME:

- temporary files (e.g., logs / training snapshots): use /netscratch instead
- final results: use /netscratch instead and download a copy
- software environments (virtualenv, conda env, …): use one of the many existing images or customize one; otherwise /netscratch
- large datasets: use /ds or your respective department dataset share instead
/netscratch #
/netscratch is our large data store (a BeeGFS cluster file system).
It is available on all login and compute nodes.
Use your personal directory /netscratch/$USER for all your large files and other data such as experiments, results, etc.
If in doubt where to place something, this is probably a good place.
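If your personal directory does not exist yet, simply create it:
mkdir -p /netscratch/$USER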
Although the risk of data loss is very small, note that, contrary to $HOME, there are no backups of /netscratch (that’s what scratch means!).
So if there is something that you absolutely must not lose and cannot reproduce, store it in $HOME or download it.
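To pull important results to your own machine, something like the following could work from outside the cluster (user name and login host are placeholders):
rsync -avz <user>@<login-node>:/netscratch/<user>/results/ ./results/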
/fscratch #
/fscratch is a different, smaller, but much faster data store than /netscratch.
It is available on all login and compute nodes.
If your experiment requires low-latency data access or runs highly distributed across top GPU systems, /fscratch may be a good choice to avoid waiting for data.
Otherwise, all statements regarding /netscratch apply here too.
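For example, staging a frequently read working set onto /fscratch before a run might look like this (assuming a per-user directory like on /netscratch; the paths are placeholders):
rsync -a /netscratch/$USER/my_dataset/ /fscratch/$USER/my_dataset/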
Cleanup #
Storage is not endless and /netscratch is primarily intended for temporary data.
Please clean up your /netscratch folder regularly.
Use ncdu to check how much space is occupied and where it’s going.
You can also delete things straight from within the ncdu interface (press d and confirm).
You can also use our springclean script.
It looks for series of similar files, such as training snapshots.
The help text (/netscratch/springclean -h) explains how it works and how you can influence what it selects.
To see what it would do without deleting anything, try a dry run first:
/netscratch/springclean /netscratch/$USER --dry-run
Datasets #
Use /ds for datasets.
Note that there is already a large collection of common datasets.
If you think something is missing (e.g., it is common / others might also be interested in it), feel free to create a folder in one of the sub-folders and just put it there.
A short README.md stating the source URL, a (LaTeX) citation, and maybe even an abstract, as well as a ping in our chat channel, is very much appreciated.
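A minimal sketch of what that could look like (the category, dataset name, and URL are placeholders):
mkdir -p /ds/<category>/MyNewDataset
cp -r /netscratch/$USER/MyNewDataset/. /ds/<category>/MyNewDataset/
printf 'Source: https://example.org/mynewdataset\nCitation: <BibTeX entry>\n' > /ds/<category>/MyNewDataset/README.md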
If the categories in /ds don’t really fit, the dataset is bigger than a couple of GB, you need help downloading it, or really anything else, just ask in chat.
In general, try to avoid copying data around. Instead, if possible, mount /ds into your containers and read directly from there. Then write models, results, etc. to /netscratch, which is also mounted.
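A minimal sketch of what that can look like, assuming the usual Slurm + enroot/pyxis workflow (the image, script, and dataset paths are placeholders):
srun --container-image=/netscratch/$USER/my_image.sqsh \
  --container-mounts=/ds:/ds:ro,/netscratch/$USER:/netscratch/$USER \
  python train.py --data /ds/<some_dataset> --output /netscratch/$USER/results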
If at all possible, avoid using datasets provided as many small files directly.
Performance is always poor, even more so on network file system setups (such as ours).
Use datadings (some datasets are already available in /ds in this format) or webdataset to convert them into a few larger files.
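For example, webdataset simply reads tar files; assuming your samples are already named with a common key prefix (e.g., 000001.jpg / 000001.json), a sketch of building a shard with GNU tar could be:
# files belonging to one sample must share a name prefix before the first dot
tar --sort=name -cf /netscratch/$USER/shard-000000.tar my_samples/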
Dataset shares are snapshotted locally every 3 hours and those are preserved for 30 days.
Git repositories #
Use DFKI GitLab to create personal repositories. Don’t store anything large or binary (datasets, models, …) in your git repository.
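A simple safeguard is to ignore typical large artifacts from the start; the patterns below are only examples, adjust them to your project:
printf '%s\n' '*.ckpt' '*.pt' '*.h5' 'data/' 'results/' >> .gitignore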
TODO: re/move? The skeleton is outdated / needs an update.

Make sure that your experiments are configured correctly. Check out our skeleton project for more detailed explanations on how to properly set up experiments. It defines a sensible structure for your repository and has template scripts for experiments that prepare the environment and avoid common pitfalls.