KubeletDown

Meaning

This alert is triggered when the monitoring system has not been able to reach any of the cluster’s Kubelets for more than 15 minutes.

Impact

This alert represents a critical threat to the cluster’s stability. Unless a network issue is merely preventing the monitoring system from scraping Kubelet metrics, multiple nodes in the cluster are likely unable to act on configuration changes for pods and other resources, and debugging tools that depend on the Kubelet, such as kubectl exec and kubectl logs, are likely not functional.

Diagnosis

Check the status of the nodes, then look for recent events on Node objects or for recent events in general:

$ kubectl get nodes
$ kubectl describe node $NODE_NAME
$ kubectl get events --field-selector 'involvedObject.kind=Node'
$ kubectl get events
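
On a large cluster it can help to reduce the node list to the nodes that are not reporting Ready. The following is a minimal sketch, assuming the default table output of kubectl get nodes (node name in the first column, STATUS in the second); not_ready_nodes is a hypothetical helper name, not part of kubectl:

```shell
# Hypothetical helper: reads `kubectl get nodes` table output on stdin and
# prints the names of nodes whose STATUS does not start with "Ready".
# "Ready,SchedulingDisabled" (a cordoned node) still counts as Ready here.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes | not_ready_nodes
# Then inspect each reported node:
#   kubectl describe node "$NODE_NAME"
```

If this prints most or all nodes, the problem is more likely cluster-wide (for example networking) than specific to one node.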

If you have SSH access to the nodes, read the Kubelet logs directly:

$ journalctl -b -f -u kubelet.service
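
When the live log is too noisy to follow, it may be easier to look only at error and fatal messages from the current boot. This is a minimal sketch, assuming the Kubelet writes its default klog text format, where each message starts with a severity letter (I, W, E or F) followed by the date as mmdd; kubelet_errors is a hypothetical helper name:

```shell
# Hypothetical helper: reads Kubelet log messages on stdin and keeps only
# klog lines with severity E (error) or F (fatal).
# Assumes klog text format, e.g. "E0102 10:00:01.000000 1234 file.go:2] msg".
kubelet_errors() {
  # `|| true` keeps the exit status zero when no line matches.
  grep -E '^[EF][0-9]{4} ' || true
}

# Usage on a node (requires systemd and access to the journal):
#   journalctl -b -u kubelet.service --no-pager -o cat | kubelet_errors | tail -n 20
```

The -o cat option makes journalctl print only the message text, so the klog severity letter is the first character of each line. If the Kubelet is configured for JSON logging, this filter will not match anything.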

Mitigation

The mitigation depends on what is causing the Kubelets to become unresponsive. Check for widespread networking issues or node-level configuration issues.