Who is Peggy?
Peggy is our AI assistant running on the cluster. She detects common issues with jobs, alerts users when they occur, and sometimes gives helpful hints on how to avoid them in the future.
Peggy can send emails (e.g. to the address set with the --mail-user option) and Mattermost chat messages.
Since cluster and Mattermost accounts are not currently linked, Peggy must be told whom to send messages to.
We sometimes set this up for you, but you can also send her a message yourself, and she will tell you what to do.
What can Peggy find?
Peggy is mostly focused on GPU-related issues right now. She can detect, among other things:
- Jobs that are not stopping when they should
- Frozen GPUs in multi-GPU jobs
- Very high system overhead, leading to poor performance
- Lots of I/O delay, again lowering performance
How does Peggy work?
Peggy periodically analyzes the same time-series data that drives our dashboards in Grafana. A set of Boolean rules defines whether issues are present in a job. If an issue persists long enough (from 30 minutes to several hours, depending on severity), a message is sent out. The message may be repeated at most once every 12 hours, unless a more severe issue is detected.
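To make the timing behaviour more concrete, here is a minimal Python sketch of how such a notification policy could look. It is not Peggy's actual code. The names (frozen_gpu_rule, JobAlertState, should_notify), the three severity levels, the utilization thresholds, and the exact persistence times per severity are all assumptions made for illustration. Only the 30-minute lower bound and the 12-hour repeat interval come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

# Assumed mapping from severity to required persistence. The text only says
# "from 30 minutes to several hours, depending on severity".
MIN_PERSISTENCE = {
    1: timedelta(hours=4),     # hypothetical: mild issue
    2: timedelta(hours=1),     # hypothetical: moderate issue
    3: timedelta(minutes=30),  # hypothetical: severe issue
}
REPEAT_INTERVAL = timedelta(hours=12)  # stated in the text


def frozen_gpu_rule(gpu_util_percent: Sequence[float]) -> bool:
    """Hypothetical Boolean rule: in a multi-GPU job, one GPU sits idle
    while another is busy. The thresholds are invented for this sketch."""
    return (
        len(gpu_util_percent) > 1
        and min(gpu_util_percent) < 1.0
        and max(gpu_util_percent) > 50.0
    )


@dataclass
class JobAlertState:
    """Per-job bookkeeping kept between two analysis passes."""
    issue_since: Optional[datetime] = None  # when the tracked issue was first seen
    issue_severity: int = 0                 # severity of the tracked issue
    last_sent: Optional[datetime] = None    # when the last message went out
    last_sent_severity: int = 0             # severity of the last message


def should_notify(state: JobAlertState, severity: int, now: datetime) -> bool:
    """Update `state` with the latest rule result and report whether a
    message should be sent. `severity` is 0 when no rule fired."""
    if severity == 0:
        # Issue cleared: stop tracking it.
        state.issue_since = None
        state.issue_severity = 0
        return False

    if state.issue_since is None or severity != state.issue_severity:
        # New issue or changed severity: restart the persistence clock
        # (a simplification chosen for this sketch).
        state.issue_since = now
        state.issue_severity = severity

    if now - state.issue_since < MIN_PERSISTENCE[severity]:
        return False  # has not persisted long enough yet

    escalated = severity > state.last_sent_severity
    throttled = (
        state.last_sent is not None
        and now - state.last_sent < REPEAT_INTERVAL
    )
    if throttled and not escalated:
        return False  # same or lower severity within the 12-hour window

    state.last_sent = now
    state.last_sent_severity = severity
    return True
```

In this sketch, a periodic analysis pass would evaluate rules such as frozen_gpu_rule on the latest metrics, turn the result into a severity, and call should_notify once per job. Keeping the state per job is what allows a more severe issue to bypass the 12-hour repeat limit while an unchanged issue stays quiet.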