KubeNodeReadinessFlapping #
Meaning #
The readiness status of node has changed few times in the last 15 minutes.
Impact #
The performance of the cluster deployments is affected, depending on the overall workload and the type of the node.
Diagnosis #
The notification details should list the node that’s not reachable. For Example:
- alertname = KubeNodeUnreachable
...
- node = node1.example.com
...
Login to the cluster. Check the status of that node:
$ kubectl get node $NODE -o yaml
The output should describe why the node is not reachable.
Common failure scenarios:
- disruptive software upgrades
- network patitioning due to hardware failures
- firewall rules
- virtual machines suspended due to storage area network problems
- system crashes / freezes due to software or hardware malfunctions
Mitigation #
In case of maintenance ensure to cordon and drain node.
In other cases ensure storage and networking redundancy if applicable.
See KubeNode See node problem detector See Watchdog timer