AlertmanagerClusterCrashlooping #
Meaning #
Half or more of the Alertmanager instances within the same cluster are crashlooping.
Impact #
Alerts could be notified multiple time unless pods are crashing to fast and no alerts can be sent.
Diagnosis #
kubectl get pod -l app=alertmanager
NAMESPACE NAME READY STATUS RESTARTS AGE
default alertmanager-main-0 1/2 CrashLoopBackOff 37107 2d
default alertmanager-main-1 2/2 Running 0 43d
default alertmanager-main-2 2/2 Running 0 43d
Find the root cause by looking to events for a given pod/deployement
kubectl get events --field-selector involvedObject.name=alertmanager-main-0
Mitigation #
Make sure pods have enough resources (CPU, MEM) to work correctly.