Autoscaling Pods in Kubernetes

Naresh Waswani
8 min read · Dec 29, 2022

If you are hosting your workload in a cloud environment and your traffic pattern fluctuates (think unpredictable), you need a mechanism to automatically scale out (and, of course, scale in) your workload, so that the service performs as per its defined Service Level Objective (SLO) without impacting the user experience. This is referred to as Autoscaling, or to be precise, Horizontal Scaling.


Horizontal Scaling is the construct of adding or removing machines of similar size (think replicas) depending on demand or other conditions. In the context of Kubernetes, it means adding more Pods or removing existing ones.

Scaling can be of two types, Vertical and Horizontal, and this blog is focussed on Horizontal scaling.

By the way, there can be other components in the system which impact performance and user experience, but the focus of this blog is the Compute layer, realised with Kubernetes Pods.

Apart from ensuring that the service can handle load, Horizontal Scaling brings a couple of indirect benefits:

  1. Cost: if you are running in a Cloud environment, one of the key goals is to run the workload in a cost-effective manner. If we dynamically run just enough pods to handle the traffic, we pay only for what we actually need. Without the AutoScaling construct, we would have to over-provision pods and shell out more dollars.
  2. Better Resource Utilisation: if a Kubernetes workload is implemented with AutoScaling, other workload pods get a fair opportunity to scale out should there be a need. That is hard to realise if every workload always runs a fixed number of Pods irrespective of demand.

Now that we understand what Horizontal Scaling is and why we need it, the next logical question is: how do we make it work in a Kubernetes environment?

Well, Kubernetes provides a native feature, the HorizontalPodAutoscaler (referred to as HPA from here on), which can horizontally scale the workload pods. The important question to ask is: when to scale?

Generally, scaling of an application is done based on one of two metrics: CPU and Memory. Configure HPA against these metrics to scale the pods and you are done. Not really!

There is still a prerequisite to make it work: you need to make these metrics available for the HPA to consume.

Some of the other metrics which can be used to scale the pods are: number of incoming requests, number of outgoing requests, message broker queue depth, etc.

Lost? Let's make it clearer by walking through how this logically works in a Kubernetes cluster:

  1. Every Pod running in Kubernetes environment will be consuming CPU and Memory.
  2. Make these metrics available as a centralised service for others to consume (referred to as the Metrics Server from here on).
  3. Configure an HPA (a Kubernetes native resource object) for a specific workload using that metric, saying: if CPU consumption reaches x%, scale the workload pods.
  4. The HPA controller checks the pod metrics against the Metrics Server at regular intervals, validates them against the HPA resource configured for the workload, and, if the scaling condition matches, updates the replica count of the workload's Deployment resource (a Kubernetes native object).
  5. Deployment controller then takes the action depending on the replica count updated in the Deployment resource object.
  6. And the pod replicas either increases (scale out) or decreases (scale in).

Kind of makes sense, but how do I implement the Metrics Service? There are multiple options available; one of the easiest is the Kubernetes Metrics Server.

Let’s see things in action and build an HPA for a workload running in Kubernetes.

We will be using the AWS EKS managed service for hosting our Kubernetes cluster and the eksctl tool for creating the cluster.

Step 0 — Install the tools needed to create the Kubernetes infrastructure. The commands are tested with Linux OS.

# Install eksctl tool (standard upstream download URL)
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl tool (standard upstream download URL)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -m 0755 kubectl /usr/local/bin/kubectl

# Install or update AWS CLI (standard upstream download URL)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

Step 1 — Clone the public repository from GitHub, navigate to the folder kubernetes-hpa, and create the EKS cluster using the below command:

# Create EKS Cluster with version 1.23
eksctl create cluster -f eks-cluster.yaml

# Output like below shows the cluster has been created successfully

2022-12-29 08:42:22 [ℹ] kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2022-12-29 08:42:22 [✔] EKS cluster "hpa-demo" in "us-west-2" region is ready

For demo purposes, you can attach the Administrator policy to the AWS IAM user or role used for launching the EKS cluster.

It will take roughly 15–20 minutes for the cluster to come up. The data plane will have 3 nodes of size t3.small.
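For reference, here is a minimal sketch of what the eks-cluster.yaml config might look like, reconstructed from the details above (cluster "hpa-demo" in us-west-2, Kubernetes 1.23, 3 t3.small nodes); the node group name is an assumption for illustration:

```yaml
# eks-cluster.yaml (illustrative sketch; the node group name is assumed)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: hpa-demo
  region: us-west-2
  version: "1.23"
nodeGroups:
  - name: hpa-demo-workers   # assumed name
    instanceType: t3.small
    desiredCapacity: 3
```

The actual file in the repository may differ; eksctl accepts this schema via `eksctl create cluster -f eks-cluster.yaml`.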

Step 2 — Deploy the Metrics Server in the Kubernetes infrastructure.

# Deploy the Metric server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Output of the above command looks something like below -

serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created

Step 3 — Deploy a sample workload in the default namespace, fronted by a Service which listens on port 80.

# Deploy the sample application
kubectl apply -f apache-php-deployment.yaml

# Expose the deployed application via Service, listening on port 80
kubectl apply -f apache-php-service.yaml

# Get pods running as part of the Deployment
kubectl get pods

NAME                          READY   STATUS    RESTARTS   AGE
apache-php-65498c4955-d2tbj   1/1     Running   0          8m55s

The service is exposed with the hostname apache-php inside the Kubernetes cluster. The replica count is configured as 1, hence only a single pod is running.
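For context, a sketch of what apache-php-deployment.yaml and apache-php-service.yaml likely contain. The image and resource values are assumptions modeled on the standard Kubernetes HPA walkthrough; note that CPU *requests* must be set for a CPU-based HPA, because utilization is computed as a percentage of the request:

```yaml
# Illustrative sketch; image and resource values are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache-php
spec:
  replicas: 1
  selector:
    matchLabels:
      app: apache-php
  template:
    metadata:
      labels:
        app: apache-php
    spec:
      containers:
        - name: apache-php
          image: registry.k8s.io/hpa-example   # assumed demo image
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 200m    # required for CPU-utilization-based scaling
            limits:
              cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: apache-php
spec:
  selector:
    app: apache-php
  ports:
    - port: 80
```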

# To test if metrics server is responding well, execute below command and 
# you should see the current resource consumption of the pod which just got
# deployed using above deployment

kubectl top pod

# Output
NAME                          CPU(cores)   MEMORY(bytes)
apache-php-65498c4955-d2tbj   1m           9Mi

Step 4 — Create the HPA resource specifying the condition for the auto scaling to happen.

We will configure the HPA with a condition of keeping the average CPU utilization of the pods at 50%. If this value goes above 50%, the HPA will trigger a scale-out to add more pods.
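A sketch of what apache-php-hpa.yaml likely contains, reconstructed from the values visible in the output below (min 1, max 7, 50% average CPU); the autoscaling/v2 API is available on Kubernetes 1.23:

```yaml
# Illustrative sketch of apache-php-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: apache-php
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-php
  minReplicas: 1
  maxReplicas: 7
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```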

# Create HPA
kubectl apply -f apache-php-hpa.yaml

# Get the HPA configured
kubectl get hpa

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/50%    1         7         1          21s

The get hpa output shows that current CPU utilization is 0% and the target scaling condition is configured at 50%. It also shows the min, max and current replica counts of the workload.

Before executing the next step, open another shell and run the below command to watch the HPA:

kubectl get hpa -w

Step 5 — Bombard the workload service with requests and wait for the workload to auto scale out.

# Launch a busybox pod and fire requests at the workload service in an infinite loop
kubectl run -i --tty load-gen --image=busybox -- /bin/sh

# And from the busybox shell, run the below command
while true; do wget -q -O - http://apache-php; done

If you look at the shell where you executed the kubectl get hpa -w command, you will see pods getting scaled out as the load increases.

NAME         REFERENCE               TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/50%     1         7         1          4m41s
apache-php   Deployment/apache-php   0%/50%     1         7         1          5m16s
apache-php   Deployment/apache-php   3%/50%     1         7         1          5m31s
apache-php   Deployment/apache-php   0%/50%     1         7         1          6m1s
apache-php   Deployment/apache-php   352%/50%   1         7         1          6m46s
apache-php   Deployment/apache-php   455%/50%   1         7         4          7m1s
apache-php   Deployment/apache-php   210%/50%   1         7         7          7m16s
apache-php   Deployment/apache-php   31%/50%    1         7         7          7m31s
apache-php   Deployment/apache-php   50%/50%    1         7         7          7m46s
apache-php   Deployment/apache-php   59%/50%    1         7         7          8m1s
apache-php   Deployment/apache-php   46%/50%    1         7         7          8m16s
apache-php   Deployment/apache-php   54%/50%    1         7         7          8m31s
apache-php   Deployment/apache-php   57%/50%    1         7         7          8m46s
apache-php   Deployment/apache-php   59%/50%    1         7         7          9m1s
apache-php   Deployment/apache-php   52%/50%    1         7         7          9m16s

Are you wondering why, towards the end, no more pods are launched even though CPU utilization is above 50%? The reason is that the maximum replica count configured in the HPA is 7, and the HPA controller will not go beyond that limit.
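For intuition, the HPA controller computes the desired replica count roughly as ceil(currentReplicas × currentMetricValue / targetValue), then clamps the result between minReplicas and maxReplicas. A quick sketch of that arithmetic for the 352%/50% sample above, using shell integer math:

```shell
# desired = ceil(currentReplicas * currentMetric / targetMetric), clamped to maxReplicas
current_replicas=1
current_cpu=352        # observed average utilization (%)
target_cpu=50          # target utilization (%) from the HPA spec
max_replicas=7         # maxReplicas from the HPA spec

# integer ceiling division: (a + b - 1) / b
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
if [ "$desired" -gt "$max_replicas" ]; then desired=$max_replicas; fi

echo "$desired"        # raw result (8) is capped at maxReplicas, printing 7
```

In practice the controller also rate-limits how fast it scales up, which is why the replica count in the output above stepped 1 → 4 → 7 rather than jumping straight to the computed value.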

Step 6 — Stop the load and wait for the workload to auto scale in.

In the shell where you launched busybox and ran the while loop, stop that process (Ctrl+C). Then watch the output of the kubectl get hpa command.

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   27%/50%   1         7         7          9m31s
apache-php   Deployment/apache-php   3%/50%    1         7         7          9m46s
apache-php   Deployment/apache-php   0%/50%    1         7         7          10m
apache-php   Deployment/apache-php   0%/50%    1         7         7          14m
apache-php   Deployment/apache-php   0%/50%    1         7         7          14m
apache-php   Deployment/apache-php   0%/50%    1         7         4          14m
apache-php   Deployment/apache-php   0%/50%    1         7         1          14m

The number of pods automatically scales in as the load decreases. Notice the gap between the 10m and 14m entries: by default, the HPA waits through a scale-down stabilization window (around 5 minutes) before removing pods, to avoid flapping.

Step 7 — One of the most important steps: delete the Kubernetes cluster to avoid running up charges.

# Delete EKS Cluster 
eksctl delete cluster -f eks-cluster.yaml

# If you see an output like this, assume all has gone well :)
2022-12-29 10:46:35 [ℹ] will delete stack "eksctl-hpa-demo-cluster"
2022-12-29 10:46:35 [✔] all cluster resources were deleted

In the above example, we scaled our workload based on CPU as the metric. But if you need to scale your workload on a custom metric, you must ensure that the specific metric is available for the HPA controller to consume. One option to achieve this is the Prometheus Adapter service, which exposes Prometheus metrics to the HPA through the custom metrics API.
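As a hypothetical sketch, an HPA scaling on a custom per-pod metric might look like the below. The metric name "http_requests_per_second" is an assumption for illustration; it must actually be served by the custom.metrics.k8s.io API (e.g. via the Prometheus Adapter) for this to work:

```yaml
# Hypothetical sketch: scaling on a custom metric via the Prometheus Adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: apache-php-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-php
  minReplicas: 1
  maxReplicas: 7
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed metric name
        target:
          type: AverageValue
          averageValue: "100"              # scale out above 100 req/s per pod
```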

On a related concern, what happens if the maximum number of pods configured is not yet reached, but the Kubernetes cluster has run short of CPU resources? Logically, we would want additional nodes to be added automatically to the cluster; otherwise Pods will go into a Pending state, impacting service performance.

Additionally, as a Platform engineer, you would also want an alert if a Pod goes into a Pending state, for whatever reason.

Hence, the next few blogs will cover some of the aforementioned situations:

  1. Automatically scale Kubernetes Cluster if pods are going into Pending state because of resource shortage — Blog
  2. Scale the Pods based on Custom Metrics
  3. Get alert on Slack if Pods go in Pending state

Hope you enjoyed reading this blog. Do share this blog with your friends and don’t forget to give claps if this has helped you in any way.

Happy Blogging…Cheers!!!

#ProductionReadyKubernetes #Kubernetes #EKS #AWS
