Troubleshooting Node Provisioning
Node Provisioning is the process of acquiring and configuring a new Node in your cluster so that your Project (pod) can be scheduled onto it. It occurs when the Kubernetes Control Plane cannot schedule your Project onto any existing Node. This can happen for several reasons, such as Node resource constraints, Node Taints that your pod does not tolerate, or Node Affinity rules. For more information about Node Provisioning and Pod Scheduling, please refer to the documentation for the Kubernetes Scheduler and the Cluster Autoscaler.
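To make the Taints and Tolerations case concrete, here is a minimal sketch of how a Taint keeps a pod off a Node. It models the matching rules described in the Kubernetes documentation; it is not the scheduler's actual code, and the names Taint, Toleration, tolerates, and blocking_taints are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # an empty key with operator "Exists" matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches any taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def blocking_taints(taints, tolerations):
    """Return the taints that prevent the pod from being scheduled on the node.

    "PreferNoSchedule" is only a preference, so it never blocks scheduling.
    """
    return [
        taint
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]


# Hypothetical node that is reserved for GPU workloads.
gpu_node_taints = [Taint("dedicated", "gpu", "NoSchedule")]
matching = [Toleration("dedicated", "Equal", "gpu", "NoSchedule")]
print(blocking_taints(gpu_node_taints, matching))  # → []
```

If every existing Node has at least one blocking taint for your pod, the pod stays unschedulable until a suitable Node is provisioned.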
Node Provisioning can take up to 30 minutes and usually completes without any manual intervention. However, if you encounter timeouts or failures during this process, here are three common reasons why Node Provisioning may be failing.
1. Cloud Quota Limit
One possible reason for Node Provisioning failure is that you have reached a quota limit on your cloud account. Quotas cap the resources allotted to your account. They are enforced by your cloud provider to prevent misuse of its resources.
Resolution:
Check your cloud provider's dashboard to view your current limits and usage, then submit a request to increase the limits that are blocking you.
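The arithmetic behind a quota failure can be sketched as follows. This example assumes a quota that is counted in vCPUs, which is how some providers meter instance usage; your provider may count instances or other units instead. The function names and all numbers are hypothetical and should be replaced with the values shown in your provider's dashboard.

```python
def vcpu_headroom(quota_vcpus: int, used_vcpus: int) -> int:
    """Return how many vCPUs can still be launched before the quota is hit."""
    return max(quota_vcpus - used_vcpus, 0)


def fits_in_quota(quota_vcpus: int, used_vcpus: int,
                  node_vcpus: int, node_count: int = 1) -> bool:
    """Return True if launching node_count nodes stays within the quota."""
    return node_vcpus * node_count <= vcpu_headroom(quota_vcpus, used_vcpus)


# Hypothetical account: 64 vCPU quota, 56 vCPUs already running.
print(vcpu_headroom(64, 56))      # → 8
print(fits_in_quota(64, 56, 16))  # → False: one 16 vCPU node exceeds the quota
print(fits_in_quota(64, 56, 8))   # → True
```

When the check comes out False for the node type your cluster needs, provisioning will keep failing until the limit is raised or usage is reduced.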
2. Cloud Resource Capacity
Another possible reason Node Provisioning is failing is that your cloud provider does not currently have enough capacity for the requested instance type in the region or Availability Zone you are requesting. You can check your cloud provider's Autoscaler Activity Log for information regarding this.
For example, in the case of AWS, you can check the Auto Scaling group (ASG) Activity Log. When capacity is the problem, it usually contains an error message like the following:
We currently do not have sufficient <instance type> capacity in the Availability Zone you requested (<requested Availability Zone>). Our system will be working on provisioning additional capacity. You can currently get <instance type> capacity by not specifying an Availability Zone in your request or choosing <list of Availability Zones that currently supports the instance type>. Launching EC2 instance failed.
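If you review many activity log entries, a small script can flag the ones that report this capacity error. The sketch below assumes the message keeps the wording quoted above and that you have already retrieved the message text yourself (from the console, CLI, or an SDK); parse_capacity_error is a hypothetical helper, and the sample message fills the placeholders with made-up values.

```python
import re
from typing import Optional

# Matches the instance type and Availability Zone in the message quoted above.
_CAPACITY_RE = re.compile(
    r"do not have sufficient (?P<instance_type>\S+) capacity "
    r"in the Availability Zone you requested \((?P<zone>[^)]+)\)"
)


def parse_capacity_error(status_message: str) -> Optional[dict]:
    """Return the instance type and zone if the message is a capacity error."""
    match = _CAPACITY_RE.search(status_message)
    if match is None:
        return None
    return {
        "instance_type": match.group("instance_type"),
        "zone": match.group("zone"),
    }


# Hypothetical log entry with the placeholders filled in.
sample = (
    "We currently do not have sufficient m5.2xlarge capacity in the "
    "Availability Zone you requested (us-east-1a). Our system will be "
    "working on provisioning additional capacity. "
    "Launching EC2 instance failed."
)
print(parse_capacity_error(sample))
# → {'instance_type': 'm5.2xlarge', 'zone': 'us-east-1a'}
print(parse_capacity_error("Launching a new EC2 instance: i-0abc"))  # → None
```

Because the pattern depends on the exact message wording, treat a None result as "not recognized" rather than as proof that capacity is available.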
Resolution:
- If you are using Spot Instances, try switching to On-Demand Instances.
- Contact Zeet Support to request a different node type.
3. Node Group Capacity
Node Provisioning can also fail if none of your node groups offers a node large enough to accommodate the pod that you are attempting to schedule. This is usually only seen when you are attempting to schedule a pod with unusually high CPU or Memory requirements. For more information regarding troubleshooting this problem, refer to the official Kubernetes Documentation.
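A rough way to check this yourself is to compare your pod's CPU and Memory requests with what a single empty node in each node group can offer. The sketch below handles only a simplified subset of the Kubernetes quantity notation (for example "500m", "2", "512Mi", "4Gi"), ignores resources already consumed by system pods, and uses hypothetical node group names and sizes.

```python
from decimal import Decimal

# Binary and decimal suffixes commonly used for memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "500m" or "2" to cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "4Gi" to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: plain bytes


def node_groups_that_fit(pod_cpu: str, pod_memory: str, node_groups: dict) -> list:
    """Return the node groups whose per-node allocatable covers the pod requests."""
    need_cpu, need_memory = parse_cpu(pod_cpu), parse_memory(pod_memory)
    return [
        name
        for name, allocatable in node_groups.items()
        if parse_cpu(allocatable["cpu"]) >= need_cpu
        and parse_memory(allocatable["memory"]) >= need_memory
    ]


# Hypothetical node groups with per-node allocatable resources.
node_groups = {
    "general": {"cpu": "3920m", "memory": "14Gi"},
    "highmem": {"cpu": "15890m", "memory": "120Gi"},
}
print(node_groups_that_fit("8", "32Gi", node_groups))   # → ['highmem']
print(node_groups_that_fit("32", "64Gi", node_groups))  # → []
```

An empty result means no amount of waiting will help: either the pod's requests must shrink or a node group with larger nodes must be added.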
Resolution:
- Reduce your Project resource requirements.
- Contact Zeet Support to add node groups with larger nodes that can accommodate your pod.