Hosting workloads on the right node in Kubernetes

Naresh Waswani
8 min read · Jan 31, 2023


Image Credit — https://unsplash.com/photos/KU9ABpm7eV8

Are you running Kubernetes as a platform for application teams to run their services? If yes, then very soon you will start receiving typical requests from the teams, such as —

  1. Our service is compute-heavy and we want the service pods hosted on a compute-heavy node, not a generic one.
  2. Pods of these two specific services should always be running on the same node.
  3. This is an AI/ML specific service and it should always be running on a GPU based node.
  4. My service can run on any node provided the node has SSD type storage.
  5. And this list literally can just go on, and on, and on.

Or, as a Platform team, you might collaborate with the Application teams to identify stateless workloads and check whether they can run on Spot instances, perhaps in lower environments if not in production. This helps you implement a cost-saving strategy from Day 0. I know it sounds like premature optimisation, but if properly planned, it can be done right in the first attempt.
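As a concrete illustration (a hedged sketch, not from this post's repo): EKS managed node groups label their nodes with eks.amazonaws.com/capacityType, whose value is SPOT or ON_DEMAND, so a stateless pod could target Spot capacity using the nodeSelector construct covered below. The pod name and image here are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: stateless-batch-worker  # hypothetical name
spec:
  containers:
  - name: worker
    image: busybox  # placeholder image
    command: ["sleep", "3600"]
  nodeSelector:
    # Label applied automatically by EKS managed node groups
    eks.amazonaws.com/capacityType: SPOT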

Well, if you are in this situation, there is no need to panic, as Kubernetes has out-of-the-box features which can help you implement the above asks in a very clean way. These features are known as —

  1. Node Selector or Node Affinity, and
  2. Taints and Tolerations

Let’s understand these two concepts and then we will jump straight into implementation.

Node Selector is a construct where a Pod helps the Kubernetes Scheduler find the desired node on which the pod should be hosted, using the labels attached to nodes. Example —

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    stack: frontend

In the above example, the Pod manifest specifies the node it should be deployed on by saying: schedule me on a node which has a label with key=stack and value=frontend.
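For this selector to ever match, some node must actually carry that label. Assuming a node named node1 (a hypothetical name here), it can be labelled and verified like this:

# Apply the label the nodeSelector above is looking for
kubectl label nodes node1 stack=frontend

# List the nodes carrying that label
kubectl get nodes -l stack=frontend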

But Node Selector does not support more complex situations, for example where node selection needs expressions like In, NotIn, or Exists, or where the selection criteria is a soft or preferred one (meaning, if no matching node is found, the scheduler will still host the pod on a node which does not match the defined conditions). Node Affinity was therefore introduced, and it can handle such tricky situations.

For more details on Node affinity, please check this link.
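As a hedged sketch of the soft variant: the pod below asks the scheduler to prefer nodes labelled workload-type=compute-optimized (the same label used later in this post), but still allows scheduling elsewhere if no such node is available.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  affinity:
    nodeAffinity:
      # Soft rule: prefer matching nodes, but do not require them
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: workload-type
            operator: In
            values:
            - compute-optimized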

Taints — while Node Selector or Node Affinity is a property of a Pod which helps the Scheduler find the desired nodes, Taints are the opposite: they are a property of a Node and are configured to repel a set of pods.

Tolerations — tolerations are applied on Pods to nullify the taint effect, helping the Kubernetes Scheduler schedule those pods onto nodes with matching taints.

Taints and Tolerations work hand in hand. Let’s see an example —

# Below command helps to apply a Taint on node1 -

kubectl taint nodes node1 cloudforbeginners.com/ssd-storage=true:NoSchedule

A taint follows a convention of Key=Value:Effect. In the above example, key of the taint is “cloudforbeginners.com/ssd-storage”, value of the taint is “true” and the effect is “NoSchedule”.
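To remove a taint, the same command is used with a trailing minus sign after the effect:

# Remove the taint applied above (note the trailing "-")
kubectl taint nodes node1 cloudforbeginners.com/ssd-storage=true:NoSchedule-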

Here is a Pod manifest which is configured to tolerate this taint —

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "cloudforbeginners.com/ssd-storage"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

In the above example, we used the effect as NoSchedule. The other possible options for this attribute are — PreferNoSchedule and NoExecute.

Effect helps the Kubernetes Scheduler decide —

  1. What should happen to an existing pod when a taint is applied to a node at runtime: should it continue to run on the node even though it does not tolerate the taint, or should it be evicted?
  2. Whether a new pod should be scheduled on a node, based on the Taints and Tolerations configured (see the sketch below for the NoExecute case).
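As a hedged sketch: NoExecute is the only effect that evicts pods already running on the node, and a toleration for it can optionally carry tolerationSeconds, which keeps the pod bound for that long after the taint appears before evicting it.

tolerations:
- key: "cloudforbeginners.com/compute-optimized"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  # Optional: stay on the node for 120 seconds after the taint is applied,
  # then get evicted. Omit tolerationSeconds to tolerate the taint forever.
  tolerationSeconds: 120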

For further details on Taints and Tolerations, check this link.

Now that we have understood the concepts of Node Selector/Affinity and Taints and Tolerations, it’s time to see them in action.

We will use the AWS EKS managed service to host our Kubernetes cluster and the eksctl tool to create it.

We will create a Kubernetes cluster with 3 different categories of worker nodes, and each category will have taints and labels applied as per our business need —

Category A — Compute optimised nodes

  1. Labels (workload-type=compute-optimized)
  2. Taints (cloudforbeginners.com/compute-optimized=true:NoExecute)

Category B— SSD optimised nodes

  1. Labels (workload-type=ssd-optimized)
  2. Taints (cloudforbeginners.com/ssd-storage=true:NoSchedule)

Category C — General purpose nodes

  1. Labels (workload-type=generic)
  2. Taints (cloudforbeginners.com/generic=true:NoExecute)

In the AWS EKS service, a logical grouping of worker nodes with a specific configuration is managed via the Node Group concept. Behind the scenes, these Node Groups are managed using the AWS Auto Scaling service; one Auto Scaling group per Node Group.
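For reference, here is a hedged sketch of what one node group in the repository’s eks-cluster.yaml (used in Step 1 below) might look like; the actual file in the repo is authoritative. The names and values are taken from the outputs shown later in this post.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: labels-taints-demo
  region: us-west-2
managedNodeGroups:
  # One of the three node groups; the others follow the same pattern
  # with the ssd-optimized and generic labels/taints
  - name: compute-optimized-workload
    instanceType: t3.small
    desiredCapacity: 2
    labels:
      workload-type: compute-optimized
    taints:
      - key: cloudforbeginners.com/compute-optimized
        value: "true"
        effect: NoExecute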

Once the cluster is up, we will deploy multiple application services with different resource requirements and see how they leverage Kubernetes labels along with the Taint and Toleration features to get themselves scheduled onto the desired nodes.

Let’s jump in —

Step 0 — Install the tools needed to create the Kubernetes infrastructure. The commands below have been tested on Linux.

# Install eksctl tool
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl tool
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.13/2022-10-31/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin

# Install or update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

Step 1 — Clone this public repository from GitHub — https://github.com/waswani/kubernetes-labels-taints, navigate to the folder kubernetes-labels-taints, and create the EKS cluster using the below command —

# Create EKS Cluster with version 1.23
eksctl create cluster -f eks-cluster.yaml

# Output like below shows cluster has been successfully created
2023-01-31 19:19:19 [ℹ] kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2023-01-31 19:19:19 [✔] EKS cluster "labels-taints-demo" in "us-west-2" region is ready

Check the labels and taints applied to the Nodes.

#Get nodes 
kubectl get nodes --show-labels | grep workload-type=compute-optimized

#Output (scroll towards the end to see the label)
ip-192-168-14-93.us-west-2.compute.internal Ready <none> 15m v1.23.13-eks-fb459a0 alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=compute-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=compute-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-095b33994789c7a16,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-14-93.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2a,workload-type=compute-optimized
ip-192-168-53-19.us-west-2.compute.internal Ready <none> 15m v1.23.13-eks-fb459a0 alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=compute-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=compute-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-095b33994789c7a16,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2b,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-53-19.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2b,workload-type=compute-optimized

kubectl get nodes --show-labels | grep workload-type=generic

#Output (scroll towards the end to see the label)
ip-192-168-68-185.us-west-2.compute.internal Ready <none> 15m v1.23.13-eks-fb459a0 alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=generic-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=generic-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0afd5658b9ad2b18c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-68-185.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2c,workload-type=generic
ip-192-168-9-183.us-west-2.compute.internal Ready <none> 15m v1.23.13-eks-fb459a0 alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=generic-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=generic-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0afd5658b9ad2b18c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-9-183.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2a,workload-type=generic

kubectl get nodes --show-labels | grep workload-type=ssd-optimized

#Output (scroll towards the end to see the label)
ip-192-168-2-176.us-west-2.compute.internal Ready <none> 16m v1.23.13-eks-fb459a0 alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=ssd-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=ssd-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0b9ff8f227fef842c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-2-176.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2a,workload-type=ssd-optimized
ip-192-168-91-138.us-west-2.compute.internal Ready <none> 17m v1.23.13-eks-fb459a0 alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=ssd-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=ssd-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0b9ff8f227fef842c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-91-138.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2c,workload-type=ssd-optimized

#To check the Taints on the nodes
for kube_node in $(kubectl get nodes | awk '{ print $1 }' | tail -n +2); do
  echo ${kube_node} $(kubectl describe node ${kube_node} | grep Taint);
done

#Output
ip-192-168-14-93.us-west-2.compute.internal Taints: cloudforbeginners.com/compute-optimized=true:NoExecute
ip-192-168-2-176.us-west-2.compute.internal Taints: cloudforbeginners.com/ssd-storage=true:NoSchedule
ip-192-168-53-19.us-west-2.compute.internal Taints: cloudforbeginners.com/compute-optimized=true:NoExecute
ip-192-168-68-185.us-west-2.compute.internal Taints: cloudforbeginners.com/generic=true:NoExecute
ip-192-168-9-183.us-west-2.compute.internal Taints: cloudforbeginners.com/generic=true:NoExecute
ip-192-168-91-138.us-west-2.compute.internal Taints: cloudforbeginners.com/ssd-storage=true:NoSchedule
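Alternatively (a hedged one-liner, not from the original repo), the taints can be listed via JSONPath without a shell loop:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'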

Step 2 — Let us now verify that the application pods get deployed to the desired nodes as expected.

But before that, let’s see what happens if we just try to deploy a pod with no tolerations configured.

# Create a nginx pod with no Toleration configured
kubectl run nginx --image=nginx

kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 18s

If you check the Pod status, it shows as Pending. To see the reason, execute the below command —

kubectl describe pod nginx

Name: nginx
Namespace: default
Priority: 0
Node: <none>
Labels: run=nginx
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Containers:
nginx:
Image: nginx
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jb4z4 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-jb4z4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 29s default-scheduler 0/6 nodes are available: 2 node(s) had taint {cloudforbeginners.com/compute-optimized: true}, that the pod didn't tolerate, 2 node(s) had taint {cloudforbeginners.com/generic: true}, that the pod didn't tolerate, 2 node(s) had taint {cloudforbeginners.com/ssd-storage: true}, that the pod didn't tolerate.

Because the pod did not have a toleration for any of the taints applied to the nodes, the Kubernetes Scheduler could not find a target node to assign the pod to.

Let’s now deploy a pod which is configured to tolerate the taint “cloudforbeginners.com/compute-optimized=true:NoExecute” and which uses node affinity against the label “workload-type=compute-optimized”.
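Here is a hedged sketch of what nginx-with-compute-toleration-and-label.yaml likely contains; the actual file in the repo is authoritative. The pod name matches the one seen in the output below.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-toleration-no-label  # name as seen in the output below
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "cloudforbeginners.com/compute-optimized"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
  affinity:
    nodeAffinity:
      # Hard rule: only nodes labelled workload-type=compute-optimized qualify
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload-type
            operator: In
            values:
            - compute-optimized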

kubectl apply -f nginx-with-compute-toleration-and-label.yaml 

#To get the status of the Pod
kubectl get pods -o wide

#Output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-with-toleration-no-label 1/1 Running 0 48s 192.168.19.30 ip-192-168-14-93.us-west-2.compute.internal <none> <none>

And the node ip-192-168-14-93.us-west-2.compute.internal is indeed one of the nodes tainted with compute-optimized.

Similarly, we can create pods targeting the ssd-optimized or generic nodes, as sketched below.
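For instance, here is a hedged sketch targeting the SSD nodes, this time using the simpler nodeSelector construct alongside the matching toleration. The pod name is hypothetical.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd  # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "cloudforbeginners.com/ssd-storage"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    workload-type: ssd-optimized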

In a nutshell: Node Selector or Node Affinity is for Pods to select Nodes, while Taints are for Nodes to allow or deny Pods from getting hosted on them.

Hope you enjoyed reading this blog. Do share it with your friends, and don’t forget to give claps if it has helped you in any way.

Happy Blogging…Cheers!!!

#ProductionReadyKubernetes #EKS #KubernetesTaintsAndTolerations #KubernetesNodeAffinity #KubernetesNodeSelector
