Summary of 2023 Maintenance
January 2, 2024, Christian Schulze, Joachim Folz
Here is our summary of all changes to Pegasus during the 2023 year-end maintenance period.
Slurm update to 23.11 #
Slurm has been updated to the latest stable version 23.11. We were previously running 22.05. For the complete (and long) list of changes, see the SchedMD NEWS.
Enroot images moved to NFS with caching #
We moved the Enroot SquashFS images to a new NFS mount /enroot with local caching enabled.
This should reduce container creation times.
/netscratch/enroot is now a symlink for backwards compatibility,
but please update your scripts to point to the new location.
Some “clutter” (e.g., code) was moved to the /netscratch directory of the respective owner.
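For scripts that still reference the old location, the path update can be sketched in Python as below. This is a minimal sketch, not an official migration tool: the function name `update_enroot_paths` and the example file name are hypothetical, and the matching is a conservative heuristic that only rewrites `/netscratch/enroot` when it appears as a complete path prefix. Review the result before overwriting any script.

```python
import re

OLD_PREFIX = "/netscratch/enroot"  # old location, now a symlink
NEW_PREFIX = "/enroot"             # new NFS mount

# Heuristic (an assumption of this sketch): only match the old prefix when it
# is not preceded by another path fragment (e.g. "/data/netscratch/enroot")
# and not followed by more name characters (e.g. "/netscratch/enroot_old").
_OLD_PATH = re.compile(
    r"(?<![\w.~}-])" + re.escape(OLD_PREFIX) + r"(?![\w.-])"
)


def update_enroot_paths(text: str) -> str:
    """Return `text` with references to the old Enroot location rewritten."""
    return _OLD_PATH.sub(NEW_PREFIX, text)


# Hypothetical usage: rewrite a single line of a job script.
line = "IMAGE=/netscratch/enroot/example.sqsh"
print(update_enroot_paths(line))
```

Paths that merely resemble the old prefix, and personal directories under /netscratch, are left untouched.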
Podman is now installed on compute nodes #
While Enroot is excellent for running GPU-based containers with
very low overhead, it lacks the ability to build images.
Podman is a drop-in replacement for Docker
and does not require admin privileges.
The docker command is aliased to podman,
so scripts that rely on Docker should work as well.
Check out our updated instructions
on building Docker / OCI images on the cluster.
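To illustrate the drop-in relationship, here is a minimal Python sketch that assembles an image build command for either tool. The helper `build_command`, the image tag, and the directory layout are hypothetical; `--tag` and `--file` are options accepted by both `podman build` and `docker build`. The cluster's own instructions remain the reference for the recommended workflow.

```python
import shlex
from typing import List, Optional


def build_command(
    tag: str,
    context_dir: str = ".",
    containerfile: Optional[str] = None,
    tool: str = "podman",
) -> List[str]:
    """Assemble the argument list for an image build (nothing is executed)."""
    cmd = [tool, "build", "--tag", tag]
    if containerfile is not None:
        # Explicit path to a Dockerfile / Containerfile
        cmd += ["--file", containerfile]
    # The build context directory comes last
    cmd.append(context_dir)
    return cmd


# The same arguments work with either tool name.
print(shlex.join(build_command("myimage:latest")))
print(shlex.join(build_command("myimage:latest", tool="docker")))
```

To actually run the build, the list can be passed to `subprocess.run(cmd, check=True)` on a node where Podman is installed.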
New NVMe-based /fscratch #
A new /fscratch mount is available on all head and compute nodes.
It is based on fast NVMe flash storage, so it is faster, but also
much smaller than /netscratch.
Each user has a 1 TiB quota that they can use for data that needs
very fast access and/or may be replaced frequently.
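For a rough idea of how much of the 1 TiB quota a directory would consume, the following Python sketch sums file sizes below a given path. It is only an approximation under stated assumptions: the helper names are hypothetical, the directory you pass in is up to you, and the quota accounting actually enforced by the filesystem may differ from a simple sum of file sizes.

```python
import os

TIB = 1024 ** 4                 # 1 TiB = 1,099,511,627,776 bytes
FSCRATCH_QUOTA_BYTES = 1 * TIB  # per-user quota stated in the announcement


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all files below `root` (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    return total


def quota_fraction(used_bytes: int, quota_bytes: int = FSCRATCH_QUOTA_BYTES) -> float:
    """Return the used share of the quota as a fraction (1.0 means full)."""
    return used_bytes / quota_bytes
```

For example, `quota_fraction(directory_size_bytes(path))` for a dataset directory you plan to copy tells you whether it fits before you start the transfer.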
BeeGFS update to 7.4.2 #
Also storage related, all BeeGFS (the cluster filesystem that runs
both /netscratch and /fscratch) clients and servers have been
updated to version 7.4.2.
/ds migrated to Debian 12 #
Previously on SUSE, the /ds host has been migrated over to Debian 12
to bring it in line with the remainder of the cluster.
Miscellaneous OS and software updates #
All cluster nodes and VMs have received OS upgrades where available. For the monitoring stack, all metrics exporters, Grafana, Loki, VictoriaMetrics, as well as various Python libraries have been upgraded as well.
BIOS performance settings for compute nodes #
Optimized BIOS settings were applied to the AMD EPYC-based compute nodes. They result in better and more consistent CPU performance.
Firmware updates #
Firmware has been updated on all DGX A100 and RTX A6000 nodes, as well as on all storage and miscellaneous nodes where updates were available.
Virtualization hosts updated #
The virtualization hosts running the cluster head nodes were updated to the latest Proxmox 8.
Increased log storage #
Since you guys are logging so much, we increased the storage space for logs.