Summary of 2025 Maintenance

Christian Schulze, Joachim Folz

This is a summary of all changes made to Pegasus during the 2025 year-end maintenance period.

Slurm update to 25.05

Slurm has been updated to the latest stable version, 25.05 (see the Slurm changelog). We were previously running 24.11. Pyxis and enroot were also updated to their latest versions:

  • pyxis 0.20.0 -> 0.21.0 (introducing native container timing)
  • enroot 3.5.0 -> 4.0.1
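Job scripts that depend on features of the new Slurm release can check the installed version first. The following is a minimal sketch, assuming that `sinfo --version` prints a line of the form `slurm 25.05.x`; the helper names `parse_slurm_version`, `slurm_at_least`, and `installed_slurm_version` are hypothetical and not part of Pegasus or Slurm.

```python
import re
import subprocess


def parse_slurm_version(output: str) -> tuple[int, ...]:
    """Extract (major, minor[, patch]) from text such as 'slurm 25.05.3'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    # Drop the patch component when it is absent from the output.
    return tuple(int(g) for g in match.groups() if g is not None)


def slurm_at_least(output: str, minimum: tuple[int, int] = (25, 5)) -> bool:
    """Return True if the reported major.minor version is >= minimum."""
    return parse_slurm_version(output)[:2] >= minimum


def installed_slurm_version() -> str:
    """Ask the local Slurm client for its version (assumes sinfo is on PATH)."""
    result = subprocess.run(
        ["sinfo", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

On a login node, `slurm_at_least(installed_slurm_version())` should then report whether the client is already on 25.05 or newer.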

GPU nodes

All GPU nodes (40+) were updated to the latest kernel (1044-nvidia) and NVIDIA driver (580). The DGX A100 (7) and DGX H100/200 (7) nodes also received significant updates to their RDMA stack, including the firmware of all network interfaces.

Storage nodes

All storage nodes have received:

  • OS updates (/home, /netscratch, /fscratch, /ds*)
  • HCA firmware upgrades (/fscratch, /ds-{sds,slt,albatross,da,iml,dsa}, /curatime)
  • the respective CPU microcode updates
  • system drive replacement of /ds

The share /ds-av-extern has been merged into /ds-av-nda, so the /ds-av-extern mountpoint was removed.
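Scripts and configs that still reference the removed mountpoint need their paths updated. The sketch below rewrites such paths, under the assumption (not confirmed above) that the directory layout below /ds-av-extern was preserved unchanged under /ds-av-nda; please verify the actual location of your data before relying on it. The function name `migrate_path` is hypothetical.

```python
from pathlib import PurePosixPath

OLD_ROOT = PurePosixPath("/ds-av-extern")  # removed mountpoint
NEW_ROOT = PurePosixPath("/ds-av-nda")     # share it was merged into


def migrate_path(path: str) -> str:
    """Rewrite a path under the removed share; return other paths unchanged.

    Assumes the relative layout below the old mountpoint is identical
    under the new one, which should be checked on the cluster.
    """
    try:
        relative = PurePosixPath(path).relative_to(OLD_ROOT)
    except ValueError:
        # Not located under the old mountpoint, so nothing to rewrite.
        return path
    return str(NEW_ROOT / relative)
```

Because the comparison is done on path components rather than string prefixes, a sibling path such as `/ds-av-extern-old/x` is left untouched.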

Head nodes

Slurm controllers and login nodes were upgraded to Ubuntu 24.04. All other nodes (7+) have been updated/upgraded to their latest state.

Network

All DGX A100 InfiniBand switches (4) were updated to their latest OS, an upgrade spanning six or more versions. The two 200Gb switches connecting all DGX nodes to the storage via RDMA were reinstalled with the latest Cumulus OS version, jumping from 5.9 to 5.15.

Miscellaneous OS and software updates

All cluster nodes and VMs have received OS upgrades where available. In the monitoring stack, all metrics exporters, Grafana, Loki, VictoriaMetrics, and various Python libraries have also been upgraded.

Virtualization hosts updated

The virtualization hosts running the cluster head nodes were updated to Proxmox 9, including an upgrade of the underlying Ceph.