Summary of 2023 Maintenance

January 2, 2024, Christian Schulze, Joachim Folz

Here is our summary of all changes to Pegasus during the 2023 year-end maintenance period.

Slurm update to 23.11

Slurm has been updated to the latest stable version 23.11. We were previously running 22.05. For the complete (and long) list of changes, see the SchedMD NEWS.

Enroot images moved to NFS with caching

We moved the Enroot SquashFS images to a new NFS mount /enroot with local caching enabled. This should reduce container creation times. /netscratch/enroot is now a symlink for backwards compatibility, but please update your scripts to point to the new location. Some “clutter” (e.g., code) was moved to the /netscratch directory of the respective owner.
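As a rough aid for updating scripts, the following sketch rewrites references to the old location in a piece of text. The function name `migrate_enroot_paths` is hypothetical, and the matching rule (only replace the old prefix when it is a complete path component) is our assumption about what is safe; review the result before overwriting any script.

```python
import re

OLD_PREFIX = "/netscratch/enroot"  # old location, now a symlink
NEW_PREFIX = "/enroot"             # new NFS mount with local caching

# Match the old prefix only as a complete path: it must not be preceded or
# followed by a character that would make it part of a longer name, so
# /netscratch/enroot_backup and /data/netscratch/enroot are left untouched.
_OLD_PATH = re.compile(r"(?<![\w.-])" + re.escape(OLD_PREFIX) + r"(?![\w.-])")


def migrate_enroot_paths(text: str) -> str:
    """Return text with /netscratch/enroot references rewritten to /enroot."""
    return _OLD_PATH.sub(NEW_PREFIX, text)
```

For example, `migrate_enroot_paths("IMAGE=/netscratch/enroot/example.sqsh")` returns `"IMAGE=/enroot/example.sqsh"`, where `example.sqsh` is a made-up file name.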

Podman is now installed on compute nodes

While Enroot is excellent for running GPU-based containers with very low overhead, it lacks the ability to build images. Podman is a drop-in replacement for Docker and does not require admin privileges. The docker command is aliased to podman, so scripts that rely on Docker should work as well. Check out our updated instructions on building Docker / OCI images on the cluster.
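To illustrate what a scripted image build could look like, here is a minimal sketch that assembles and runs a `podman build` call from Python. The helper names and the example tag are hypothetical, and this is not a substitute for the cluster's build instructions, which remain the authoritative reference.

```python
import subprocess


def build_command(tag: str, context: str = ".", cli: str = "podman") -> list[str]:
    """Assemble an image build command; cli may be 'podman' or 'docker'."""
    return [cli, "build", "--tag", tag, context]


def build_image(tag: str, context: str = ".") -> None:
    """Run the build and raise CalledProcessError if it fails."""
    subprocess.run(build_command(tag, context), check=True)
```

Calling `build_image("myimage:latest")` from a directory containing a Dockerfile would run `podman build --tag myimage:latest .`; the tag here is only an example.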

New NVMe-based /fscratch

A new /fscratch mount is available on all head and compute nodes. It is based on fast NVMe flash storage, so it is faster, but also much smaller than /netscratch. Each user has a 1 TiB quota that they can use for data that needs very fast access and/or may be replaced frequently.
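To get a rough idea of how much of the 1 TiB quota a directory tree occupies, a sketch like the following sums file sizes below a given path. The function names are hypothetical, and the result is only an estimate: it counts apparent file sizes, which may differ from what the filesystem's own quota accounting reports.

```python
import os

QUOTA_BYTES = 1 * 1024**4  # 1 TiB per-user quota


def tree_size_bytes(root: str) -> int:
    """Sum apparent file sizes below root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def quota_fraction(used_bytes: int, quota_bytes: int = QUOTA_BYTES) -> float:
    """Return the share of the quota in use, e.g. 0.5 for half full."""
    return used_bytes / quota_bytes
```

You would pass your own directory on /fscratch to `tree_size_bytes`; the exact per-user path layout is not specified here, so adjust it to wherever your data lives.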

BeeGFS update to 7.4.2

Also storage related, all BeeGFS (the cluster filesystem that runs both /netscratch and /fscratch) clients and servers have been updated to version 7.4.2.

/ds migrated to Debian 12

The /ds host, previously running SUSE, has been migrated to Debian 12 to bring it in line with the rest of the cluster.

Miscellaneous OS and software updates

All cluster nodes and VMs have received OS upgrades where available. For the monitoring stack, all metrics exporters, Grafana, Loki, and VictoriaMetrics, along with various Python libraries, have been upgraded as well.

BIOS performance settings for compute nodes

Optimized BIOS settings have been applied to the AMD EPYC-based compute nodes. They result in better and more consistent CPU performance.

Firmware updates

Firmware has been updated on all DGX A100 and RTX A6000 nodes, as well as on all storage and miscellaneous nodes where available.

Virtualization hosts updated

The virtualization hosts running the cluster head nodes were updated to the latest Proxmox 8 release.

Increased log storage

Since you guys are logging so much, we increased the storage space for logs.