A(C)-pocalypse Update
August 2, 2022, Joachim Folz, Christian Schulze
A quick update on the cluster state after the AC issues we experienced on 18th July.
Situation update
By now you are probably aware that, due to an AC malfunction, some of our nodes exceeded critical temperatures and had to be taken offline. Here’s an update on what has happened since then:
- Our AC system has been checked and reconfigured to prevent similar runaway temperature incidents in the future.
- Some nodes are back online and AC functionality has been verified for the current load. We have also implemented extra monitoring of AC operation and alerts for abnormal behavior.
- Extra measures will be put into place shortly to automatically shut down individual machines before they reach critical temperatures. We hope this will let the cluster handle this type of emergency more gracefully.
- Heat load will be increased gradually over the coming days, especially considering the expected heat wave. Two extra A100 nodes should go online tomorrow (3rd August).
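As a rough illustration of the automatic shutdown measure mentioned above, a per-node check could look something like the sketch below. It is not our actual implementation: the `ShutdownPolicy` and `should_shut_down` names, the 85 °C limit, and the three-reading rule are all hypothetical values chosen for the example. Requiring several consecutive hot readings is one way to avoid powering off a machine because of a single sensor glitch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ShutdownPolicy:
    # Hypothetical values for illustration only, not the cluster's real limits.
    shutdown_celsius: float = 85.0   # shut down at or above this temperature
    consecutive_readings: int = 3    # number of hot readings in a row required


def should_shut_down(readings: Sequence[float], policy: ShutdownPolicy) -> bool:
    """Return True if the most recent readings are all at or above the limit."""
    n = policy.consecutive_readings
    if len(readings) < n:
        # Not enough history yet to make a decision.
        return False
    recent = readings[-n:]
    return all(t >= policy.shutdown_celsius for t in recent)


if __name__ == "__main__":
    policy = ShutdownPolicy()
    history = [70.0, 84.0, 86.0, 87.5, 90.0]  # made-up sample readings in °C
    if should_shut_down(history, policy):
        print("Temperature limit exceeded, node would be shut down.")
```

In practice the readings would come from the node's own sensors, and the shutdown limit would be set below the hardware's critical temperature so that the machine powers off cleanly before the hardware protection kicks in.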
Update 3rd August, 15:16
Two A100 nodes are back online. If temperatures stay reasonable, the others will be powered on during the day.
Update 23rd August, 12:50
After further adjustments to the AC system and a period of monitoring, the cluster is now operating at full capacity.