One of the main advantages of Kubernetes is that it can take care of your containers for you. For example, it will move containers around to distribute the load evenly across the cluster, automatically restart failed pods, and kill those that misbehave and try to consume too many resources. Another nice feature is the Horizontal Pod Autoscaler (HPA). As the name suggests, HPA can automatically scale your pods. But how? I'm glad you asked, because that's exactly what this post is about.
What Is HPA in Kubernetes?
Normally when you create a deployment in Kubernetes, you need to specify how many pods you want to run. This number is static. Therefore, every time you want to increase or decrease the number of pods, you need to edit the deployment.
If you only need to do that once or twice a year, it's not that big of a deal. But it's very unlikely that your traffic will be at the same exact level for the whole year. And the more spikes of traffic you have, the more time you'll have to spend editing your deployment to cope with the traffic. This also works the other way around: If you're running a very big cluster with lots of applications, you could save a lot of money by decreasing the number of replicas for each deployment during periods of less traffic, like during the night. But again, it would be a lot of work to make these adjustments manually all the time. And that brings us to HPA.
The main purpose of HPA is to automatically scale your deployments up or down so that the number of pods matches the current demand. Horizontal, in this case, means that we're talking about scaling the number of pods. You specify the minimum and maximum number of pods for a deployment and a condition to scale on, such as CPU or memory usage. Kubernetes then periodically checks your deployment's metrics and, based on the condition you specified, increases or decreases the number of pods accordingly.
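For a rough idea of how that increase or decrease is calculated, the Kubernetes documentation describes the core HPA rule as the current replica count scaled by the ratio between the current and the desired metric value, rounded up. The second line is a worked example with illustrative numbers (one pod averaging 120m of CPU against an 80m target), not a measurement:

```latex
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil

\left\lceil 1 \times \frac{120\,\text{m}}{80\,\text{m}} \right\rceil = \lceil 1.5 \rceil = 2
```

The result is then kept within the minimum and maximum you configured. HPA also skips scaling when the ratio is very close to 1 (within a 10% tolerance by default), which keeps it from reacting to tiny fluctuations.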
What Is VPA in Kubernetes?
Kubernetes also has a Vertical Pod Autoscaler (VPA), although, unlike HPA, it isn't built in and has to be installed separately. As you may guess from the name, it works differently from the Horizontal Pod Autoscaler. Instead of adjusting the number of pods up or down, as HPA does, VPA scales the resource requests and limits of the pods up or down.
So, in theory, VPA tries to achieve the same thing as HPA, but in practice they serve very different purposes, and you shouldn't use them interchangeably. To understand the difference, we need to take a step back and talk about how an autoscaler knows when to scale your pods in the first place.
A Few Words About Resource Requests
We mentioned before that you need to provide a condition for HPA, such as CPU usage. So, for example, you can specify that you want your HPA to add an additional pod to the deployment when current pods have average CPU usage higher than 80%. But what does 80% mean? 80% of what? That's a very good question, and the answer will help you understand the main difference between HPA and VPA.
You see, if you create a very basic Kubernetes deployment just by specifying its name and which container image to use, a CPU- or memory-based HPA won't be able to work with it. Why is that? It's because you didn't specify resource requests for it, so there's nothing to calculate a usage percentage against. Long story short, specifying resource requests and limits for your pods isn't just good practice; it also drastically helps Kubernetes do its job more efficiently. And that brings us back to the question of "What does 80% usage mean?" when setting up HPA thresholds. The percentage is calculated relative to the pods' resource requests, and that's how HPA knows when to scale your pods up or down.
If you know that your application uses around 2GB of RAM under normal load, you set the resource requests accordingly: for example, to 2.5GB (you should always set the request a little higher than the average). Then your HPA will know that it needs to add another pod to your deployment when the current pods are using, on average, more than ~2.4GB (the exact value depends on the target you specify when creating your HPA).
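Put as a formula, the scaling threshold is the target utilization applied to the request. Using the 2.5GB request from the example, a quick sketch of the arithmetic:

```latex
\text{threshold} = \text{target utilization} \times \text{request}

0.80 \times 2.5\,\text{GB} = 2.0\,\text{GB}
\qquad
0.96 \times 2.5\,\text{GB} = 2.4\,\text{GB}
```

So the ~2.4GB figure corresponds to a target of roughly 96%, while the common 80% target would trigger scaling at an average of 2.0GB per pod.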
Figuring Out Correct Values for Resource Requests
But how do you know what resource requests to set in the first place? You could, of course, run your pods without requests first and check how many resources they normally use. But that's quite a time-consuming process, especially on big clusters with multiple applications.
That brings us back to the VPA. You see, the point of VPA isn't to scale your deployments up or down in order to keep up with sudden spikes in traffic. That's the HPA's job. VPA should be used to get you a good baseline of resource requests and limits for your pods. This way, you're free from that time-consuming task of monitoring pods' typical usage and setting requests and limits accordingly. When the traffic goes up and your pods can't keep up, the HPA should add one or more pods to the pack to get resource usage back to the "average." At that point, VPA doesn't need to do anything.
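As a sketch of that baseline-finding role, the manifest below runs VPA in recommendation-only mode against the nginx-deployment used later in this post. It assumes the VPA components from the kubernetes/autoscaler project are installed in the cluster; the object name nginx-vpa is hypothetical.

```yaml
# Assumes the VPA CRDs and components (kubernetes/autoscaler) are installed.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment   # the deployment whose usage VPA should observe
  updatePolicy:
    updateMode: "Off"        # only compute recommendations; never evict or resize pods
```

With updateMode set to "Off", VPA only records its suggestions (visible with kubectl describe vpa nginx-vpa) and leaves running pods alone, so you can copy the recommended values into your deployment yourself. The quotes around "Off" matter, since some YAML parsers read a bare Off as the boolean false.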
You can think of it as VPA working much slower and more long-term, looking at patterns of usage, while HPA responds quicker and provides short-term solutions for load spikes.
How Do I Configure HPA in Kubernetes?
Now that you know all the theory, let's create some HPAs. We'll start by creating an example deployment with a resource request set:
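A minimal sketch of such a deployment follows. The name nginx-deployment and the 100m CPU request match what's used later in this post; the app: nginx label, the image tag, and the single replica are assumptions for illustration.

```yaml
# nginx.yaml: a minimal sketch; label, image tag, and replica count are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1                  # HPA will adjust this once it's attached
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m        # HPA computes CPU utilization relative to this request
```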
I'll save it as nginx.yaml and apply it with kubectl apply -f nginx.yaml.
Our deployment is up and running, so we can now add HPA to it. As with anything in Kubernetes, you can add HPA by creating and applying a YAML definition, just like we did with the deployment. Another option is to use the kubectl autoscale command. Let's start with the latter. To create HPA with kubectl autoscale, you execute a command of the form kubectl autoscale deploy nginx-deployment --min=<minimum> --max=<maximum> --cpu-percent=80.
After kubectl autoscale, we need to specify the type of resource we want to autoscale. In our case, it's deploy (short for deployment), followed by the deployment name. After that, we pass the minimum and maximum number of pods HPA can run and the condition on which to scale.
Validating That HPA Works
In our example, we tell HPA to scale out our pods when their CPU usage goes over 80%. And since we specified a CPU request of 100m in our deployment earlier, this means HPA will start scaling out nginx-deployment when the average CPU usage of its pods goes over 80m. Let's validate that. First, we double-check that HPA is working using kubectl get hpa.
We can see it's working, and our nginx deployment is currently running one pod. There is no traffic on it, so it makes sense. But let's see if HPA will do its job when we put some traffic on nginx. For that, I'll deploy a simple load generator:
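Here's one possible load generator, sketched as a busybox deployment that requests the nginx pods in a loop. It assumes a Service named nginx-service (a hypothetical name) in front of nginx-deployment, which you could create with kubectl expose deployment nginx-deployment --name=nginx-service --port=80.

```yaml
# load-generator.yaml: a sketch; assumes a Service called nginx-service exists
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-generator
spec:
  replicas: 1                  # raise this if one generator isn't enough
  selector:
    matchLabels:
      app: load-generator
  template:
    metadata:
      labels:
        app: load-generator
    spec:
      containers:
        - name: load-generator
          image: busybox:1.36
          # Fetch the nginx front page roughly every 10 ms, forever
          command: ["/bin/sh", "-c"]
          args: ["while sleep 0.01; do wget -q -O /dev/null http://nginx-service; done"]
```

If a single generator doesn't push the average above 80m, raising its replicas is a simple way to add more load.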
Once it's deployed (using kubectl apply -f load-generator.yaml), we can monitor the CPU usage of our nginx deployment with kubectl top pods.
When we see the CPU usage go over 80m, we can check the status of HPA again and see that it has scaled nginx-deployment out to two pods. Shortly after that, the CPU usage reported for the target should also drop, since the traffic is now distributed across two pods.
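So, HPA is working as expected: it increased the number of pods for our deployment based on the load. Just keep in mind that HPA doesn't react instantly. Metrics are collected periodically (by default, the HPA controller checks them every 15 seconds), so it can take a little while before any action is taken. Scaling down is deliberately slower still: by default, HPA waits for a five-minute stabilization window before removing pods, to avoid unnecessary scaling on very short load spikes.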
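If the default timing doesn't suit your workload, the autoscaling/v2 API lets you tune it through the behavior field. The fragment below is a sketch that goes under spec: of an HPA definition; the values shown are the documented defaults, written out explicitly.

```yaml
# Fragment: place under spec: of an autoscaling/v2 HorizontalPodAutoscaler
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # act on the latest recommendation right away
  scaleDown:
    stabilizationWindowSeconds: 300   # use the highest recommendation from the last 5 minutes
```

Shortening the scale-down window makes HPA release pods sooner, at the cost of more back-and-forth scaling when load is bursty.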
You now know how to create HPA with kubectl, and this is how you do the same with a YAML definition:
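A sketch of an equivalent definition using the autoscaling/v2 API. kubectl autoscale names the HPA after the deployment, so the same name is used here; minReplicas: 1 matches the single pod we saw earlier, while maxReplicas: 5 is an assumed value, since the exact limits aren't given.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-deployment
spec:
  scaleTargetRef:              # the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 5               # assumed upper limit
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # percent of each pod's CPU request (80m of 100m)
```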
I showed you the basic usage of HPA using only CPU metrics, but you can also use memory usage instead. For that, you only need to change name: cpu to name: memory in your HPA YAML definition, as long as your pods also have a memory request set.
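Sketched as a fragment, the metrics section would then look like this. It assumes the containers in the target deployment have a memory request set, since utilization is calculated relative to it; the 80% target is carried over from the CPU example.

```yaml
# Fragment: the metrics section of an autoscaling/v2 HPA, switched to memory
metrics:
  - type: Resource
    resource:
      name: memory               # was: cpu
      target:
        type: Utilization
        averageUtilization: 80   # percent of each pod's memory request
```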
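HPA can also be configured using custom metrics, such as the average HTTP response time. For that, HPA offers either pod metrics, which are averaged across all the pods of the target, or so-called object metrics, which describe a single Kubernetes object other than a pod, such as an Ingress. Note that custom metrics have to be made available to HPA through the custom metrics API, usually by installing a metrics adapter for your monitoring system.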
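For illustration, an object metric entry could look like the sketch below, modeled on the example in the Kubernetes documentation. The metric name requests-per-second, the Ingress called main-route, and the 10k target are assumptions, and the metric must be served by whatever custom metrics adapter you run.

```yaml
# Fragment: the metrics section of an autoscaling/v2 HPA using an object metric
metrics:
  - type: Object
    object:
      metric:
        name: requests-per-second    # hypothetical metric name
      describedObject:               # the single object the metric describes
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route             # hypothetical Ingress
      target:
        type: Value                  # compare the object's value directly to the target
        value: 10k
```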
Let's see an example. If your application exposes a metric called "orders-pending," we can use the AverageValue of that metric to configure HPA as follows:
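A sketch of what that could look like, treating orders-pending as a per-pod metric. The deployment name orders-app, the replica limits, and the target of 10 pending orders per pod are all hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-app               # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-app             # hypothetical deployment exposing orders-pending
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: orders-pending
        target:
          type: AverageValue     # Pods metrics only support AverageValue targets
          averageValue: "10"     # aim for about 10 pending orders per pod
```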
Custom metrics are quite a broad topic, and how you use them will depend on your use case. For more information, refer to the Kubernetes documentation on horizontal pod autoscaling.
As you can see, HPA is relatively easy to set up. It only takes one kubectl command or a few lines of YAML, and you can get even more out of it by having it monitor custom metrics from your application. HPA is well worth knowing: it makes your Kubernetes cluster even more self-managing and relieves you of repetitive tasks.
If you want to learn more about Kubernetes, feel free to take a look at the rest of our blog.
Release is the simplest way to spin up even the most complicated environments. We specialize in taking your complicated application and data and making reproducible environments on-demand.