Kubernetes Autoscaling Best Practices

Gilad David Maayan
Published 03/28/2023

What Is Kubernetes Autoscaling?

Kubernetes autoscaling refers to the Kubernetes platform’s ability to automatically adjust the capacity of a workload, such as the number of replicas of a deployment or stateful set, according to observed metrics such as CPU utilization. With autoscaling, applications running on Kubernetes can scale their resources automatically in response to workload changes. This helps ensure they have enough resources to handle increased traffic without over-provisioning and wasting resources.

Kubernetes provides two types of pod autoscaling: horizontal (HPA) and vertical (VPA). Horizontal pod autoscaling automatically scales the number of a deployment’s replicas based on metrics such as CPU utilization or traffic demand, while vertical pod autoscaling adjusts the resources (i.e., CPU and memory) allocated to each pod. Both types of autoscaling can help improve the performance and cost-efficiency of applications running on Kubernetes.

Kubernetes Autoscaling Methods

Kubernetes offers the following scalability tools.




Vertical Pod Autoscaler

The vertical pod autoscaler (VPA) is an add-on, maintained in the Kubernetes autoscaler project and installed separately from the core platform, that automatically increases or decreases the resources (e.g., CPU or memory) requested by each pod. It works by continuously analyzing the resource usage of pods and comparing it to the resource requests and limits specified in the pod’s resource configuration.

When a pod is underutilized, the VPA reduces the resources requested for it, freeing up capacity for other pods. When a pod is heavily utilized and is approaching its resource limit, the VPA increases its resources so that it can handle the workload.

The VPA bases its recommendations on the pod’s observed usage and respects any resource constraints that apply, such as minimum and maximum allowed values set in its policy. This allows it to make informed decisions about how much of each resource to allocate to a pod, and helps prevent both overprovisioning and resource contention.
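To make the idea of a usage-based recommendation concrete, here is a minimal Python sketch. It is a deliberate simplification and not the VPA’s actual algorithm: the function name, the nearest-rank percentile, the 90th-percentile default, and the 15% safety margin are all assumptions chosen for illustration. It only shows the general shape of the decision: pick a value from observed usage, add headroom, and clamp it to the allowed range.

```python
def recommend_request(usage_samples_m, min_allowed_m, max_allowed_m,
                      percentile=90, margin_percent=15):
    """Suggest a CPU request in millicores from observed usage samples.

    Hypothetical helper: real recommenders use more elaborate statistics.
    """
    if not usage_samples_m:
        return min_allowed_m
    ordered = sorted(usage_samples_m)
    # Nearest-rank percentile, using integer arithmetic only.
    rank = (len(ordered) * percentile + 99) // 100
    base = ordered[max(0, rank - 1)]
    # Add headroom on top of the observed usage.
    with_margin = base * (100 + margin_percent) // 100
    # Never go below the minimum or above the maximum allowed value.
    return max(min_allowed_m, min(max_allowed_m, with_margin))


samples = [100, 120, 110, 400, 130, 125, 115, 105, 140, 135]
print(recommend_request(samples, min_allowed_m=50, max_allowed_m=300))
```

Because the recommendation is taken from a high percentile rather than the maximum, the single 400-millicore spike in the sample data does not dominate the result.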


Horizontal Pod Autoscaler

The horizontal pod autoscaler (HPA) automatically increases or decreases the number of pods in a deployment based on observed metrics that the user specifies, such as CPU utilization. It does this by periodically checking the utilization of the deployment’s pods and comparing it to the user-specified target. Like the Kubernetes health check mechanism, this is a recurring check rather than a one-time evaluation.

When the metrics for a deployment exceed the specified target, Kubernetes automatically adds replicas to handle the increased workload. When the workload decreases, Kubernetes scales the replicas back down to save resources.

The HPA scales horizontally: it raises or lowers the replica count of the Kubernetes deployment. This means that it adds or removes entire pods, each a copy of the deployment’s pod template, rather than resizing existing pods.
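The comparison between observed and target values can be sketched in a few lines of Python. The sketch follows the ratio formula given in the Kubernetes documentation (desired replicas = ceil(current replicas × current value / target value)), with a tolerance band around the target. The function name and the default bounds are assumptions for illustration, and the real controller handles additional cases (missing metrics, pods that are not ready, stabilization windows) that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return the replica count suggested by the HPA ratio formula."""
    ratio = current_value / target_value
    # Inside the tolerance band, leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90.0, 60.0))
```

With 4 replicas at 90% average utilization and a 60% target, the function returns 6; at 62% it stays at 4, because the ratio falls within the tolerance band.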

Cluster Autoscaler

This autoscaler scales the number of nodes in a cluster up or down based on the workloads running on that cluster. It does this by periodically checking for pods that cannot be scheduled and for nodes whose resources are underused.

If a node’s utilization, measured as the sum of its pods’ resource requests relative to the node’s capacity, stays below a configurable threshold and its pods can be rescheduled onto other nodes, the Cluster Autoscaler removes the node from the cluster to save resources. It does not add nodes in response to high measured utilization; scale-up is triggered by pods that cannot be scheduled.

When the Cluster Autoscaler detects unschedulable pods, it automatically scales up the cluster by adding nodes that provide the resources these pods need to be scheduled and run. This helps ensure the cluster has enough resources to handle the workloads running on it, and that no pods are left pending for lack of resources.
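The two decisions described in this section can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the Cluster Autoscaler’s implementation: it considers CPU only, uses a first-fit packing of pending pods onto identical new nodes, and treats the 0.5 threshold as an illustrative value. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity_m: int   # allocatable CPU in millicores
    cpu_requested_m: int  # sum of CPU requests of the pods on the node


def nodes_needed(pending_cpu_requests_m, node_cpu_capacity_m):
    """Count new nodes needed to place pending pods (first-fit, largest first)."""
    free = []  # remaining capacity on each new node
    for request in sorted(pending_cpu_requests_m, reverse=True):
        if request > node_cpu_capacity_m:
            continue  # this pod can never fit on this node type
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] -= request
                break
        else:
            # No existing new node has room, so add another one.
            free.append(node_cpu_capacity_m - request)
    return len(free)


def scale_down_candidates(nodes, threshold=0.5):
    """Names of nodes whose requested CPU is below the threshold share."""
    return [n.name for n in nodes
            if n.cpu_requested_m / n.cpu_capacity_m < threshold]


print(nodes_needed([1500, 1500, 500, 500], node_cpu_capacity_m=2000))
print(scale_down_candidates([Node("a", 4000, 1000), Node("b", 4000, 3000)]))
```

Note that both functions work from resource requests rather than live usage, which is why pods without requests are hard for a node-level autoscaler to reason about.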

Best Practices for Kubernetes Autoscaling

Using VPA Together with Cluster Autoscaler

Using the VPA together with the Cluster Autoscaler is a best practice for Kubernetes autoscaling because the two operate at different levels: one sizes the pods, the other sizes the cluster.

The VPA provides fine-grained control over the resources requested by the pods, while the Cluster Autoscaler ensures that the cluster has the right number of nodes to support those requests. This can improve the performance and efficiency of the deployment, and helps it handle changes in workload without interruptions or downtime.

Using a Service Mesh with HPA

Using a service mesh, such as Istio (which is built on the Envoy proxy), in conjunction with the horizontal pod autoscaler is a best practice for Kubernetes autoscaling because it allows the HPA to more accurately and efficiently calibrate the pod count in a Kubernetes deployment.

A service mesh provides a layer of infrastructure that sits between the individual services in a microservice-based application, and provides features such as load balancing, service discovery, and traffic routing.

With a service mesh in place, more accurate and granular metrics about the workloads running in a deployment’s pods become available, for example the average response time or the rate of requests per second. Once these are exposed to Kubernetes through the custom metrics API, the HPA can use them to make more informed decisions about when and how to scale the number of pods, which can improve the efficiency and performance of the deployment.

In addition, a service mesh can provide automatic retries and circuit breaking, which can help to prevent cascading failures and other issues that can arise when the HPA scales the number of pods. This can improve the reliability and stability of the deployment, and ensure that it can handle changes in workloads without experiencing interruptions or downtime.
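As a rough illustration of scaling on a traffic metric, the Python sketch below builds an HPA manifest in the `autoscaling/v2` format that targets an average per-pod value. The metric name `requests_per_second`, the resource names, and the helper function are hypothetical; the sketch also assumes a metrics adapter is already publishing the mesh’s metric through the custom metrics API, which is outside the scope of this example.

```python
import json


def hpa_manifest(name, deployment, metric_name, target_per_pod,
                 min_replicas, max_replicas):
    """Build an autoscaling/v2 HPA that scales on a per-pod custom metric."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    # Assumed metric name; it depends on the mesh and adapter.
                    "metric": {"name": metric_name},
                    "target": {
                        "type": "AverageValue",
                        "averageValue": str(target_per_pod),
                    },
                },
            }],
        },
    }


manifest = hpa_manifest("web-hpa", "web", "requests_per_second", 100, 2, 20)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied to a cluster in the same way as a YAML manifest, since Kubernetes accepts both formats.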

Ensuring VPA and HPA Policies Don’t Clash

Ensuring that the policies for the Vertical Pod Autoscaler and the Horizontal Pod Autoscaler do not clash can prevent conflicts and inconsistencies in the scaling behavior of a deployment.

If the policies for the VPA and HPA are not coordinated, the two autoscaling mechanisms can react to the same signal in conflicting ways. For example, the VPA may try to increase the resource requests and limits for the pods in a given deployment, while the HPA is trying to reduce the deployment’s replicas. This can make it difficult to predict or control the deployment’s pod count and pod size.

To avoid this situation, coordinate the policies for the VPA and HPA so that they are aligned with each other and with the overall goals of the deployment. In particular, avoid having both act on the same resource metric (CPU or memory); for example, let the HPA scale on a custom metric such as request rate while the VPA manages CPU and memory requests. This helps the VPA and HPA work together, and keeps the scaling behavior of the deployment consistent and predictable.
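A simple way to catch such a clash before deploying is to compare the resources each autoscaler acts on. The following Python sketch is a hypothetical pre-deployment check, not part of any Kubernetes tool; the function name and its inputs are assumptions. It treats a VPA in `Off` mode as harmless, since in that mode the VPA only publishes recommendations and does not change pods.

```python
def conflicting_resources(hpa_resource_metrics, vpa_controlled_resources,
                          vpa_update_mode="Auto"):
    """Return the resources that both the HPA and the VPA would act on."""
    if vpa_update_mode == "Off":
        # Recommendation-only mode: the VPA does not modify pods.
        return set()
    return set(hpa_resource_metrics) & set(vpa_controlled_resources)


# The HPA scales on CPU while the VPA manages CPU and memory: a clash on CPU.
print(conflicting_resources(["cpu"], ["cpu", "memory"]))

# The HPA scales only on a custom metric, so no resource metric overlaps.
print(conflicting_resources([], ["cpu", "memory"]))
```

An empty result means the two policies do not compete over the same resource metric for that deployment.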

Ensuring All Pods Have Configured Resource Requests

Ensuring that all pods have resource requests configured gives the HPA the information it needs to scale the pods in the deployment accurately and efficiently. The HPA expresses CPU and memory utilization as a percentage of the requested amount, so a container without a request has no baseline to measure against. Configured requests help ensure that the deployment has the right number of pods and appropriate resources to handle the workload, which improves its performance and reliability.
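One way to enforce this practice is to check pod specifications for missing requests before they are deployed. The Python sketch below assumes the pod spec has already been loaded into a dictionary that mirrors the manifest structure (`containers[].resources.requests`); the function name and the sample data are hypothetical.

```python
def containers_missing_requests(pod_spec, resources=("cpu", "memory")):
    """List (container name, missing resources) for containers lacking requests."""
    missing = []
    for container in pod_spec.get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        absent = [r for r in resources if r not in requests]
        if absent:
            missing.append((container["name"], absent))
    return missing


pod_spec = {
    "containers": [
        {"name": "app",
         "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
        {"name": "sidecar",
         "resources": {"requests": {"cpu": "50m"}}},
        {"name": "logger"},
    ]
}
print(containers_missing_requests(pod_spec))
```

In this sample, the `sidecar` container is flagged for a missing memory request and the `logger` container for having no requests at all.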

Using Mixed Instances to Reduce Costs

Mixed instances allow you to run different types of cloud instances as nodes within a single Kubernetes cluster, which can help reduce costs. This is because different types of instances often have different pricing models, and using a mix of instance types can help you take advantage of the most cost-effective options for your workload.

For example, you might use larger, more expensive instances for your more compute-intensive workloads, and smaller, less expensive instances for your less demanding workloads. By using mixed instances, you can optimize your cluster’s performance and minimize your overall costs.

When using mixed instances, the Cluster Autoscaler can help ensure that your cluster is sized appropriately for the workloads running on it, and that you are using the most cost-effective instance types for your workloads.
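The cost reasoning behind mixed instances can be sketched as a small selection problem: for each workload, pick the cheapest instance type that still satisfies its resource requests. In the Python sketch below, the instance names, sizes, and hourly prices are made-up placeholders rather than real cloud pricing, and the function is a hypothetical helper, not something the Cluster Autoscaler exposes.

```python
def cheapest_fit(instance_types, cpu_needed, memory_gib_needed):
    """Return the name of the cheapest instance type that fits, or None."""
    fitting = [t for t in instance_types
               if t["cpu"] >= cpu_needed and t["memory_gib"] >= memory_gib_needed]
    if not fitting:
        return None
    return min(fitting, key=lambda t: t["hourly_price"])["name"]


# Placeholder catalog: names and prices are illustrative only.
catalog = [
    {"name": "small", "cpu": 2, "memory_gib": 4, "hourly_price": 0.05},
    {"name": "medium", "cpu": 4, "memory_gib": 16, "hourly_price": 0.20},
    {"name": "large", "cpu": 16, "memory_gib": 64, "hourly_price": 0.80},
]

print(cheapest_fit(catalog, cpu_needed=1, memory_gib_needed=2))
print(cheapest_fit(catalog, cpu_needed=8, memory_gib_needed=32))
```

A light workload lands on the small type while a compute-intensive one needs the large type, which is the trade-off a single-instance-type cluster cannot make.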


Conclusion

Kubernetes autoscaling is a powerful way to manage the performance and cost of your containerized applications. With autoscaling, you can keep your applications running at an appropriate capacity without overpaying for the resources they require. Following the best practices above helps you get the most out of autoscaling while keeping down the cost of running your applications on Kubernetes.

About the Writer

Gilad David Maayan is a technology writer who has worked with over 150 technology companies, including SAP, Imperva, Samsung NEXT, NetApp, and Check Point, producing technical and thought leadership content that elucidates technical solutions for developers and IT leadership. Today he heads Agile SEO, the leading marketing agency in the technology industry. Connect with him on LinkedIn.


Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.