Kubernetes – taming PODs

In this article, we will answer questions such as: how to control where PODs are launched, and how does the kube-scheduler work? We will also discuss a number of tools used to determine the relationship of PODs to NODEs: labels, nodeSelector, affinity, and tainting NODEs.

The following examples were carried out on a Kubernetes lab consisting of two NODEs, so we recommend having access to a similar setup. We described the process of creating such a test cluster in the article Combining two EuroLinux 8 machines into a Kubernetes cluster.

kubectl get nodes
NAME          STATUS     ROLES           AGE   VERSION
euro1         Ready      control-plane   35d   v1.24.0
euro2         Ready      <none>          35d   v1.24.0

Kubernetes scheduler

Scheduling is the process of placing PODs on appropriately matched NODEs so that the kubelet process running there can handle them.

The scheduler's work can be summarized in the following steps:

1. Waiting for a new POD to appear that does not have an assigned NODE.
2. Finding the best NODE for each detected POD.
3. Informing the API server of the selection.
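
On a kubeadm-based lab such as ours, the default scheduler runs as a static POD in the kube-system namespace, and every successful binding is recorded as a Scheduled event. The commands below are a quick, optional way to look at both (the component=kube-scheduler label is a kubeadm convention; other installers may label the POD differently):

kubectl -n kube-system get pods -l component=kube-scheduler
kubectl get events --field-selector reason=Scheduled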

kube-scheduler

kube-scheduler is the default Kubernetes scheduler and runs as part of the control plane. It is designed so that it can be replaced by other scheduling components, written in-house or by third parties.

Selecting the best NODE for a new POD requires filtering the available NODEs. NODEs that meet the scheduling requirements are called feasible NODEs. When none of the NODEs is suitable, the POD remains unscheduled until the scheduler finds a suitable NODE for it. After finding the feasible NODEs, the scheduler runs a set of scoring functions on them and selects the NODE with the highest score. The final step is to inform the API server about the selection in a process called binding.
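
When no feasible NODE exists, the easiest place to look for the reason is the POD's event list. A minimal check (assuming the unscheduled POD lives in the current namespace) could look like this; the Events section of the describe output typically contains a FailedScheduling entry explaining which filters rejected which NODEs:

kubectl get pods --field-selector=status.phase=Pending
kubectl describe pod <POD_NAME>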

Assigning PODs to NODEs

Kubernetes allows assigning PODs to defined NODE classes. All recommended methods use label selectors. Usually such an assignment is not necessary, because the scheduler places PODs on NODEs with the appropriate resources well on its own. On the other hand, it is easy to imagine a case where defining an additional class of NODEs is useful, for example when a workload needs access to fast SSD storage or when NODEs should belong to the same fast LAN.

NODE labels

Like many other Kubernetes objects, NODEs have labels. These can be assigned manually. In addition, Kubernetes implements a standard set of labels for all NODEs in the cluster. It’s worth getting to know them for troubleshooting purposes. However, we will not elaborate on that in this article.
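
To see which labels a NODE already carries, including the standard ones assigned automatically (for example kubernetes.io/hostname, or node-role.kubernetes.io/control-plane set by kubeadm), it is enough to run:

kubectl get nodes --show-labels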

Adding labels allows you to run PODs on a selected group of NODEs. This method is often used to ensure that applications run on specially isolated NODEs that meet specific security requirements. In this case, it is recommended to choose a label key that the kubelet cannot modify. This prevents a compromised NODE from setting such a label on itself and attracting those workloads. This can be done in the following steps:

  • make sure you are using the Node authorizer and that the NodeRestriction admission plugin is enabled:
kubectl -n kube-system describe po kube-apiserver-euro1|grep NodeRestriction
      --enable-admission-plugins=NodeRestriction
  • add a label with the prefix node-restriction.kubernetes.io/ to selected NODEs:
kubectl label nodes euro1 node-restriction.kubernetes.io/supersecure=true
  • use these labels in the nodeSelector field of the POD specification:
cat << EOF | tee deployment-supersecure.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-supersecure
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
      nodeSelector:
        node-restriction.kubernetes.io/supersecure: "true"
EOF

Running the deployment:

kubectl apply -f deployment-supersecure.yaml
deployment.apps/deployment-supersecure created

Verification of POD distribution:

kubectl get pods -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
deployment-supersecure-5d4ccb7468-gf54d   1/1     Running   0          20s   10.33.0.20   euro1   <none>           <none>
deployment-supersecure-5d4ccb7468-vlncv   1/1     Running   0          20s   10.33.0.18   euro1   <none>           <none>
deployment-supersecure-5d4ccb7468-znpfw   1/1     Running   0          20s   10.33.0.19   euro1   <none>           <none>

Removing the sample deployment:

kubectl delete deployments.apps deployment-supersecure
deployment.apps "deployment-supersecure" deleted

nodeSelector

The nodeSelector field can be added to the POD specification. It contains a set of labels that a NODE must have for the Kubernetes scheduler to run the POD on it. The NODE must carry all of the listed labels.

Example:

Giving the NODEs euro1 and euro2 the label a=a:

kubectl label nodes euro1 euro2 a=a
node/euro1 labeled
node/euro2 labeled

Giving the euro2 NODE a b=b label:

kubectl label nodes euro2 b=b
node/euro2 labeled

Rewriting the previous sample deployment to require the labels a=a and b=b:

cat << EOF | tee deployment-ab.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-ab
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
      nodeSelector:
        a: "a"
        b: "b"
EOF

Running the deployment:

kubectl apply -f deployment-ab.yaml
deployment.apps/deployment-ab created

Verification of POD distribution:

kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
deployment-ab-74d6cc75b9-74qsq   1/1     Running   0          8s    10.33.2.19   euro2   <none>           <none>
deployment-ab-74d6cc75b9-bflh9   1/1     Running   0          8s    10.33.2.18   euro2   <none>           <none>
deployment-ab-74d6cc75b9-gvt7l   1/1     Running   0          8s    10.33.2.20   euro2   <none>           <none>

Deleting the old deployment:

kubectl delete deployment deployment-ab
deployment.apps "deployment-ab" deleted

Affinity / anti-affinity

nodeSelector is a simplified method of assigning PODs to NODEs. The affinity and anti-affinity fields greatly expand the possibilities of tying PODs to NODEs, as well as PODs to other PODs.

nodeAffinity

nodeAffinity (linking NODEs) works similarly to nodeSelector. There are two types of nodeAffinity:

  • requiredDuringSchedulingIgnoredDuringExecution – the Kubernetes scheduler can only run the POD if the rule is satisfied. The rule can be specified in a more complex way than with nodeSelector, where the only option is to match all labels
  • preferredDuringSchedulingIgnoredDuringExecution – Kubernetes will try to select a NODE that meets the rule. However, this is only a preference with an assigned weight.

IgnoredDuringExecution should be understood as follows: if a NODE's labels change while the POD is already running (DuringExecution), the change will not interfere with the operation of that POD.

Example of a POD configuration tied to a NODE with the label a=a, with a preference for NODEs with the label b=b (minimum weight of 1) and for NODEs with the label node-role.kubernetes.io/control-plane (maximum weight of 100):

cat <<EOF | tee node-affinity-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a
            operator: In
            values:
            - a
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: b
            operator: In
            values:
            - b
      - weight: 100
        preference:
          matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists

  containers:
  - image: nginx
    name: node-affinity
EOF

Running a POD:

kubectl apply -f node-affinity-pod.yaml

The result of the kubectl get pods -o wide command should look like this:

NAME                READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
node-affinity-pod   1/1     Running   0          19s   10.33.0.23   euro1   <none>           <none>

POD can be deleted with the command:

kubectl delete pod node-affinity-pod

In the following examples, we will configure a deployment based on analogous PODs.

Configuring a deployment consisting of 4 PODs analogous to the POD in the example above:

cat << EOF | tee node-affinity-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-deployment
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: a
                operator: In
                values:
                - a
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: b
                operator: In
                values:
                - b
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists

      containers:
      - name: node-affinity-deployment
        image: nginx
EOF

Running the deployment:

kubectl apply -f node-affinity-deployment.yaml

Checking the distribution of PODs among NODEs:

kubectl get pods -o wide
NAME                                     READY   STATUS    RESTARTS       AGE   IP           NODE    NOMINATED NODE   READINESS GATES
node-affinity-deployment-bbc88d9-25qr2   1/1     Running   1 (2m2s ago)   14m   10.33.0.30   euro1   <none>           <none>
node-affinity-deployment-bbc88d9-947x5   1/1     Running   1 (2m2s ago)   14m   10.33.0.28   euro1   <none>           <none>
node-affinity-deployment-bbc88d9-pg4zz   1/1     Running   1 (2m2s ago)   14m   10.33.0.33   euro1   <none>           <none>
node-affinity-deployment-bbc88d9-s9vsf   1/1     Running   1 (2m2s ago)   14m   10.33.0.32   euro1   <none>           <none>

All PODs were run on the control-plane NODE (euro1) due to its higher preference weight.
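
To confirm which NODE carries which of the preferred labels, the -L option of kubectl get nodes can be used; it prints the values of the given label keys as additional columns:

kubectl get nodes -L node-role.kubernetes.io/control-plane,b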

In the following example, we will set two equal preference weights.

cat << EOF | tee node-affinity-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-deployment
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: a
                operator: In
                values:
                - a
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: b
                operator: In
                values:
                - b
          - weight: 1
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists

      containers:
      - name: node-affinity-deployment
        image: nginx
EOF

Applying configuration changes:

kubectl apply -f node-affinity-deployment.yaml

Verification of POD distribution:

kubectl get pods -o wide
NAME                                       READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
node-affinity-deployment-dc8df7bbf-nlkcz   1/1     Running   0          72s   10.33.2.24   euro2   <none>           <none>
node-affinity-deployment-dc8df7bbf-nq88k   1/1     Running   0          72s   10.33.0.34   euro1   <none>           <none>
node-affinity-deployment-dc8df7bbf-nxt85   1/1     Running   0          70s   10.33.2.25   euro2   <none>           <none>
node-affinity-deployment-dc8df7bbf-rpt5k   1/1     Running   0          69s   10.33.0.35   euro1   <none>           <none>

The PODs are evenly distributed among the NODEs. euro1 matches the node-role.kubernetes.io/control-plane preference with a weight of 1, and euro2 matches the b=b preference, also with a weight of 1.

What happens if we remove the label a=a from the euro1 NODE and the label b=b from the euro2 NODE?

kubectl label nodes euro1 a- ; kubectl label nodes euro2 b-
node/euro1 unlabeled
node/euro2 unlabeled
kubectl get pods -o wide
NAME                                       READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
node-affinity-deployment-dc8df7bbf-nlkcz   1/1     Running   0          32m   10.33.2.24   euro2   <none>           <none>
node-affinity-deployment-dc8df7bbf-nq88k   1/1     Running   0          32m   10.33.0.34   euro1   <none>           <none>
node-affinity-deployment-dc8df7bbf-nxt85   1/1     Running   0          32m   10.33.2.25   euro2   <none>           <none>
node-affinity-deployment-dc8df7bbf-rpt5k   1/1     Running   0          32m   10.33.0.35   euro1   <none>           <none>

Nothing has changed. Removing the labels did not affect the distribution of PODs that were already running. On the other hand, after a restart, all PODs will go to euro2, since euro2 is now the only NODE with the a=a label.

kubectl rollout restart deployment node-affinity-deployment
deployment.apps/node-affinity-deployment restarted
kubectl get pods -o wide
NAME                                       READY   STATUS    RESTARTS   AGE    IP           NODE    NOMINATED NODE   READINESS GATES
node-affinity-deployment-979965644-45kch   1/1     Running   0          114s   10.33.2.27   euro2   <none>           <none>
node-affinity-deployment-979965644-88gng   1/1     Running   0          110s   10.33.2.28   euro2   <none>           <none>
node-affinity-deployment-979965644-gjlwv   1/1     Running   0          114s   10.33.2.26   euro2   <none>           <none>
node-affinity-deployment-979965644-mnl6s   1/1     Running   0          109s   10.33.2.29   euro2   <none>           <none>

You can delete the sample deployment with the command:

kubectl delete deployments node-affinity-deployment
deployment.apps "node-affinity-deployment" deleted

Inter-pod affinity / anti-affinity

The principle of inter-pod affinity and isolation (anti-affinity) can be summarized as follows:

This POD should (or should not) run on X, provided that PODs satisfying rule Y are already running on X. Here X is a topology domain identified by a topologyKey label, and Y is a label selector rule with an optional list of namespaces.

There are two types (similar to nodeAffinity):

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

For the purposes of the demonstration, we will configure two PODs.

cat << EOF | tee examplePODs.yaml
apiVersion: v1
kind: Pod
metadata:
  name: euro1-pod
  labels:
    nr: "1"
spec:
  nodeSelector:
    kubernetes.io/hostname: euro1
  containers:
  - image: nginx
    name: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: euro2-pod
  labels:
    nr: "2"
spec:
  nodeSelector:
    kubernetes.io/hostname: euro2
  containers:
  - image: nginx
    name: nginx
EOF

Running the PODs:

kubectl apply -f examplePODs.yaml
pod/euro1-pod created
pod/euro2-pod created
kubectl get pods -o wide --show-labels
NAME        READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES   LABELS
euro1-pod   1/1     Running   0          18m   10.33.0.36   euro1   <none>           <none>            nr=1
euro2-pod   1/1     Running   0          18m   10.33.2.30   euro2   <none>           <none>            nr=2

In the next step of the demonstration, we will run 2 PODs associated with the euro1-pod and euro2-pod PODs, respectively, using the nr label.

cat << EOF | tee podAffinity-pods.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nr1-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: nr
            operator: In
            values:
            - "1"
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: nr2-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: nr
            operator: In
            values:
            - "2"
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx
EOF

Launching and verifying the distribution of PODs:

kubectl apply -f podAffinity-pods.yaml
pod/nr1-pod created
pod/nr2-pod created
kubectl get pods -o wide --show-labels
NAME        READY   STATUS    RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES   LABELS
euro1-pod   1/1     Running   0          70m     10.33.0.36   euro1   <none>           <none>            nr=1
euro2-pod   1/1     Running   0          70m     10.33.2.30   euro2   <none>           <none>            nr=2
nr1-pod     1/1     Running   0          2m20s   10.33.0.37   euro1   <none>           <none>            <none>
nr2-pod     1/1     Running   0          2m20s   10.33.2.31   euro2   <none>           <none>

Analogous to nodeAffinity, you can define podAffinity as a preference with an assigned weight (preferredDuringSchedulingIgnoredDuringExecution). In addition, you can define podAntiAffinity, that is, specify which PODs the configured POD should not or must not share a topology domain with.

It is important that the selected topologyKey be consistently defined for each NODE. The kubernetes.io/hostname label, used in the example, is defined automatically. You can also choose a different topology key, such as topologyKey: city. In this case, make sure that each NODE is assigned this label (e.g. using the kubectl label nodes NODE city=Krakow command).
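
For completeness, a podAntiAffinity configuration may look like the sketch below. It is not part of the demonstration above, but a hypothetical deployment (named nginx-spread here) whose replicas must not be co-located with another POD carrying the label app: nginx-spread, which in practice spreads the replicas across NODEs. With replicas: 2 on our two-NODE lab, each replica should land on a different NODE, assuming both NODEs are schedulable:

cat << EOF | tee podAntiAffinity-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-spread
  labels:
    app: nginx-spread
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-spread
  template:
    metadata:
      labels:
        app: nginx-spread
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx-spread
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx
EOF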

You can remove the PODs used for the demonstration:

kubectl delete pods euro1-pod euro2-pod nr1-pod nr2-pod

Taints and Tolerations

Taints are the opposite of nodeAffinity: they allow a NODE to repel PODs. A NODE marked with a taint cannot be selected by the scheduler for a POD that does not declare a matching toleration.

Adding a taint using kubectl taint:

kubectl taint nodes euro1 node-role.kubernetes.io/control-plane:NoSchedule
node/euro1 tainted

Most often, the control-plane NODE has this taint added by default.
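
Whether the taint is actually set can be verified, for example, with:

kubectl describe node euro1 | grep Taints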

Taints can also be defined as KEY=VALUE pairs. Example:

kubectl taint nodes euro2 skaza=tragiczna:NoSchedule

Removing a taint (note the trailing - operator):

kubectl taint nodes euro2 skaza=tragiczna:NoSchedule-

After following the above steps (euro1 is tainted, euro2 is not), an attempt to run a standard nginx deployment should proceed as follows:

kubectl create deployment nginx --image nginx --replicas 3
deployment.apps/nginx created
kubectl get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
nginx-8f458dc5b-jd56c   1/1     Running   0          68s   10.33.2.36   euro2   <none>           <none>
nginx-8f458dc5b-p72s6   1/1     Running   0          68s   10.33.2.37   euro2   <none>           <none>
nginx-8f458dc5b-tr5jn   1/1     Running   0          68s   10.33.2.38   euro2   <none>           <none>

All PODs have “landed” on euro2. The NODE euro1 is tainted.

Taints can create the following effects:

  • NoSchedule – the Kubernetes scheduler will not place new PODs without the matching toleration on the NODE
  • PreferNoSchedule – a preference not to use the NODE. If no other NODE is feasible, the scheduler will still run the POD on the marked NODE
  • NoExecute – all PODs that do not tolerate the taint are evicted from the NODE, and new PODs will not be scheduled on it.
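
A convenient way to list the taints of all NODEs at once is a custom-columns query (a small sketch; the TAINTS column is printed as a raw list and shows <none> for untainted NODEs):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints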

The next example shows the difference between the NoSchedule and NoExecute taint effects:

kubectl taint nodes euro2 skaza:NoSchedule
node/euro2 tainted
kubectl get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
nginx-8f458dc5b-jd56c   1/1     Running   0          7m28s   10.33.2.36   euro2   <none>           <none>
nginx-8f458dc5b-p72s6   1/1     Running   0          7m28s   10.33.2.37   euro2   <none>           <none>
nginx-8f458dc5b-tr5jn   1/1     Running   0          7m28s   10.33.2.38   euro2   <none>           <none>

The PODs continue to run on the tainted NODE, but new PODs cannot be started by the scheduler, since all NODEs are now “tainted”.

kubectl create deployment nginx2 --image nginx
deployment.apps/nginx2 created
kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE    IP           NODE     NOMINATED NODE   READINESS GATES
nginx-8f458dc5b-jd56c     1/1     Running   0          11m    10.33.2.36   euro2    <none>           <none>
nginx-8f458dc5b-p72s6     1/1     Running   0          11m    10.33.2.37   euro2    <none>           <none>
nginx-8f458dc5b-tr5jn     1/1     Running   0          11m    10.33.2.38   euro2    <none>           <none>
nginx2-7cc8cd4598-tpp2s   0/1     Pending   0          3s     <none>       <none>   <none>           <none>

Then we add the NoExecute taint:

kubectl taint nodes euro2 handsUP:NoExecute
node/euro2 tainted
kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP       NODE     NOMINATED NODE   READINESS GATES
nginx-8f458dc5b-29h9v     0/1     Pending   0          48s     <none>   <none>   <none>           <none>
nginx-8f458dc5b-2bj96     0/1     Pending   0          48s     <none>   <none>   <none>           <none>
nginx-8f458dc5b-twf8z     0/1     Pending   0          48s     <none>   <none>   <none>           <none>
nginx2-7cc8cd4598-tpp2s   0/1     Pending   0          8m20s   <none>   <none>   <none>           <none>

The PODs running on euro2 have been evicted and are now waiting in the Pending state.

A toleration is a property of a POD that allows it to run even though a NODE is “tainted”. Tolerations are defined in the POD specification (PodSpec). For example:

cat << EOF | tee toleration.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerancyjny-pod
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "skaza"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "handsUP"
    operator: "Exists"
    effect: "NoExecute"
EOF

Running a POD:

kubectl apply -f toleration.yaml
pod/tolerancyjny-pod created

Verifying the state of the POD:

kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE    IP           NODE     NOMINATED NODE   READINESS GATES
nginx-8f458dc5b-29h9v     0/1     Pending   0          5m3s   <none>       <none>   <none>           <none>
nginx-8f458dc5b-2bj96     0/1     Pending   0          5m3s   <none>       <none>   <none>           <none>
nginx-8f458dc5b-twf8z     0/1     Pending   0          5m3s   <none>       <none>   <none>           <none>
nginx2-7cc8cd4598-98h5r   0/1     Pending   0          5m3s   <none>       <none>   <none>           <none>
tolerancyjny-pod          1/1     Running   0          68s    10.33.2.46   euro2    <none>           <none>

The POD was started on a tainted euro2 host.

Removing the “taints” from euro2 allows the suspended PODs to be scheduled and started again.

kubectl taint nodes euro2 skaza:NoSchedule-
node/euro2 untainted
kubectl taint nodes euro2 handsUP:NoExecute-
node/euro2 untainted
kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
nginx-7588f7b96-45f9m     1/1     Running   0          11m     10.33.2.50   euro2   <none>           <none>
nginx-7588f7b96-f6745     1/1     Running   0          11m     10.33.2.47   euro2   <none>           <none>
nginx-7588f7b96-tws56     1/1     Running   0          11m     10.33.2.48   euro2   <none>           <none>
nginx2-7cc8cd4598-98h5r   1/1     Running   0          11m     10.33.2.49   euro2   <none>           <none>
tolerancyjny-pod          1/1     Running   0          7m46s   10.33.2.46   euro2   <none>           <none>

Summary

In this article, we introduced the most important Kubernetes mechanisms for managing PODs in a cluster. We used concepts such as nodeSelector, affinity and taints in uncomplicated examples on a cluster consisting of only 2 NODEs. We also explained what a key component of Kubernetes – the scheduler – does.
