The cluster has overcommitted memory resource requests across its Namespaces.
Services may degrade or become unavailable if a single node fails.
- Check whether memory resource requests match actual application usage
- Check that nodes are available and not cordoned
- Check whether cluster-autoscaler has problems adding new nodes
- Check whether the namespace's usage is growing over time faster than expected
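The first two checks come down to comparing the memory promised to namespaces with the memory that schedulable nodes can actually provide. The sketch below shows that arithmetic only. The `Node` type, the `overcommit_ratio` helper, and all figures are hypothetical; in practice the inputs would come from the cluster's ResourceQuota objects and node status, and the ratio at which the alert fires depends on how the alert rule is configured.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class Node:
    # Hypothetical, simplified view of a node for this sketch.
    name: str
    allocatable_bytes: int
    cordoned: bool = False


def overcommit_ratio(quota_bytes_by_namespace, nodes):
    """Return total namespace memory-request quota divided by the
    allocatable memory of nodes that are not cordoned."""
    schedulable = sum(n.allocatable_bytes for n in nodes if not n.cordoned)
    if schedulable == 0:
        raise ValueError("no schedulable memory in the cluster")
    return sum(quota_bytes_by_namespace.values()) / schedulable


# Illustrative figures only (assumptions, not taken from a real cluster).
quotas = {"team-a": 24 * GIB, "team-b": 16 * GIB, "team-c": 8 * GIB}
nodes = [
    Node("node-1", 16 * GIB),
    Node("node-2", 16 * GIB),
    Node("node-3", 16 * GIB, cordoned=True),
]

ratio = overcommit_ratio(quotas, nodes)
print(f"memory quota / schedulable allocatable = {ratio:.2f}")
```

A ratio above 1 means the quotas promise more memory than the schedulable nodes have. Note how a single cordoned node raises the ratio even though no quota changed.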
Review the existing quota for the given namespace and adjust it accordingly.
Add more nodes to the cluster; it is usually better to have many smaller nodes than a few bigger ones.
Add node pools with different instance types to avoid problems that arise from relying on a single instance type in the cloud.
Use pod priorities to keep important services from losing performance; see the Kubernetes documentation on pod priority and preemption.
Fine-tune the settings for special pods used with cluster-autoscaler.
Prepare performance tests for the expected workload and plan cluster capacity accordingly.
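When planning capacity, one simple check is whether the requested memory still fits after the largest node is lost. The sketch below illustrates why several smaller nodes tolerate a single failure better than a few large ones. The helper name `survives_single_node_failure` and all figures are hypothetical, and the check is only a coarse upper bound: it ignores how individual pods pack onto nodes, so a real cluster may need more headroom.

```python
GIB = 1024 ** 3


def survives_single_node_failure(requested_bytes, node_allocatable_bytes):
    """Return True if the total requests still fit in the remaining
    allocatable memory after the largest node fails."""
    if len(node_allocatable_bytes) < 2:
        # With zero or one node there is nothing left after a failure.
        return False
    remaining = sum(node_allocatable_bytes) - max(node_allocatable_bytes)
    return requested_bytes <= remaining


# Illustrative figures only: the same 64 GiB total in two layouts.
requested = 40 * GIB
few_big = [32 * GIB, 32 * GIB]
many_small = [16 * GIB] * 4

print("2 x 32 GiB survives:", survives_single_node_failure(requested, few_big))
print("4 x 16 GiB survives:", survives_single_node_failure(requested, many_small))
```

With two 32 GiB nodes, a failure leaves 32 GiB for 40 GiB of requests; with four 16 GiB nodes, 48 GiB remain and the requests still fit.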