The cluster has overcommitted CPU resource requests for Namespaces and cannot tolerate a node failure.
In the event of a node failure, some Pods will remain in the
Pending state due to a lack of available CPU resources.
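
The condition described above can be illustrated with a small sketch. It assumes that "tolerating a node failure" means the total requested CPU still fits on the remaining nodes after the largest node is lost. The function name and the sample numbers are hypothetical and are not taken from the alert definition.

```python
from typing import Sequence


def tolerates_node_failure(requested_cores: float,
                           node_allocatable_cores: Sequence[float]) -> bool:
    """Return True if the requested CPU still fits after losing the largest node.

    Assumption: the worst case is the loss of the single largest node.
    """
    if not node_allocatable_cores:
        return False
    remaining = sum(node_allocatable_cores) - max(node_allocatable_cores)
    return requested_cores <= remaining


# Hypothetical cluster: three nodes with 8 allocatable cores each.
nodes = [8.0, 8.0, 8.0]
print(tolerates_node_failure(15.0, nodes))  # 15 <= 16, so the requests still fit
print(tolerates_node_failure(18.0, nodes))  # 18 > 16, so some Pods would stay Pending
```

In the second case the cluster has 24 cores in total, which looks sufficient, but only 16 remain once one node is gone.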
- Check whether CPU resource requests match actual application usage
- Check whether enough nodes are available and not cordoned
- Check whether cluster-autoscaler has problems adding new nodes
- Check whether the given namespace's usage is growing over time faster than expected
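
As a rough aid for the first check, the sketch below compares CPU requests with observed usage and flags workloads that request far more than they use. The workload names, the sample values, and the `factor` threshold are hypothetical. The parser handles only plain decimal quantities and the `m`, `u`, and `n` suffixes.

```python
from typing import Dict, List, Tuple

# Divisors for the Kubernetes CPU quantity suffixes handled here.
_DIVISORS = {"n": 1e9, "u": 1e6, "m": 1e3}


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as "500m" or "2" into cores."""
    quantity = quantity.strip()
    if quantity and quantity[-1] in _DIVISORS:
        return float(quantity[:-1]) / _DIVISORS[quantity[-1]]
    return float(quantity)


def overrequested(workloads: Dict[str, Tuple[str, str]],
                  factor: float = 2.0) -> List[str]:
    """Return workloads whose request exceeds `factor` times observed usage.

    `workloads` maps a name to a (request, usage) pair of CPU quantities.
    """
    return [
        name
        for name, (request, usage) in workloads.items()
        if cpu_cores(request) > factor * cpu_cores(usage)
    ]


# Hypothetical data: (CPU request, observed CPU usage) per workload.
sample = {
    "api": ("1", "150m"),
    "worker": ("500m", "400m"),
}
print(overrequested(sample))  # ['api']
```

Workloads flagged this way are candidates for lower requests, provided the observed usage covers peak load and not only idle periods.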
Review the existing quota for the given namespace and adjust it accordingly.
Add more nodes to the cluster. It is usually better to have many smaller nodes than a few larger ones.
Add node pools with different instance types to avoid the problems that arise when the cluster depends on a single instance type in the cloud.
Use pod priorities to keep important services from losing performance; see pod priority and preemption.
Fine-tune the settings for special pods used with cluster-autoscaler.
Prepare performance tests for the expected workload and plan cluster capacity accordingly.
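
The advice to prefer many smaller nodes over a few larger ones can be made concrete with a little arithmetic. The sketch below assumes equally sized nodes and a single node failure. The function name and the node configurations are hypothetical.

```python
def safe_request_budget(node_count: int, cores_per_node: float) -> float:
    """CPU cores that can be requested while still surviving the loss of one node.

    Assumption: all nodes have the same allocatable CPU.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return (node_count - 1) * cores_per_node


# Three hypothetical layouts with the same total of 64 cores.
for node_count, cores in [(2, 32), (4, 16), (8, 8)]:
    budget = safe_request_budget(node_count, cores)
    total = node_count * cores
    print(f"{node_count} x {cores} cores: {budget:g} of {total} cores can be requested")
```

With two 32-core nodes only half of the capacity can be requested safely, while eight 8-core nodes allow 56 of the same 64 cores, because a single failure removes a smaller share of the cluster.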