Kubernetes Pod Priority Class

Manish Sharma

Jun 21, 2022

Priority indicates the importance of a Pod relative to other Pods.

Key Points

When Pod priority is enabled, the scheduler orders pending Pods by their priority and a pending Pod is placed ahead of other pending Pods with lower priority in the scheduling queue. As a result, the higher priority Pod may be scheduled sooner than Pods with lower priority if its scheduling requirements are met.
If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.
Kubernetes already ships with two PriorityClasses: system-cluster-critical and system-node-critical. These are common classes and are used to ensure that critical components are always scheduled first.
A PriorityClass is a non-namespaced (cluster-scoped) object.
The higher the value of the PriorityClass, the higher the priority.
A user-defined PriorityClass can have any 32-bit integer value smaller than or equal to 1 billion.
Values above 1 billion are reserved for critical system Pods that should not normally be preempted or evicted.
Setting globalDefault: true indicates that the value of this PriorityClass should be used for Pods without a priorityClassName.
Only one PriorityClass with globalDefault set to true can exist in the system.
If you delete a PriorityClass, existing Pods that use the name of the deleted PriorityClass remain unchanged, but you cannot create more Pods that use the name of the deleted PriorityClass.
Non-Preemptive PriorityClass: Pods with preemptionPolicy: Never are placed in the scheduling queue ahead of lower-priority Pods, but they cannot preempt other Pods. A non-preempting Pod waiting to be scheduled stays in the scheduling queue until sufficient resources are free and it can be scheduled.
The priority admission controller uses the priorityClassName field and populates the integer value of the priority. If the priority class is not found, the Pod is rejected.
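
To make the globalDefault point concrete, here is a minimal sketch of a cluster-wide default class. The name default-priority and the value 1000 are assumptions chosen for illustration, not values from any real cluster; Pods created without a priorityClassName would pick up this value.

```yaml
# Hypothetical default class: the name and value are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 1000
# Only one PriorityClass in the cluster may set this to true.
globalDefault: true
description: "Applied to Pods that do not set priorityClassName."
```

Pods that already existed before this class was created keep the priority they were admitted with.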

Credit: This image is copied from tanzu.vmware.com website

PriorityClass Sample Manifest

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

Use Case: Some production workloads must always be running and should not be evicted to make room for less important ones. Examples include a metrics collector DaemonSet, logging agents, or a payment service.

To ensure the availability of mission-critical Pods, you can create a hierarchy of Pod tiers with different priorities. When the cluster runs short of resources, the scheduler tries to preempt (evict) lower-priority Pods to make room for pending Pods with a higher PriorityClass.
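
One way such a hierarchy might look is sketched below as three classes in a single multi-document file. The tier names and values are hypothetical; only their relative order matters, and all of them stay below the 1 billion limit for user-defined classes.

```yaml
# Hypothetical three-tier hierarchy; names and values are illustrative only.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-critical
value: 1000000
globalDefault: false
description: "Mission-critical services such as payments."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-standard
value: 100000
globalDefault: false
description: "Regular production services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-batch
value: 1000
globalDefault: false
description: "Batch and best-effort work that may be preempted first."
```

Leaving gaps between the values makes it easier to slot in an extra tier later without renumbering the existing ones.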

Non-Preemptive PriorityClass Sample Manifest

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."

Data Science Workload Use Case: A user may submit a job that they want to be prioritized above other workloads, but do not wish to discard existing work by preempting running pods. The high priority job with preemptionPolicy: Never will be scheduled ahead of other queued pods, as soon as sufficient cluster resources "naturally" become free.
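
As a rough sketch of that use case, a Job could reference the non-preempting class from the manifest above in its Pod template. The Job name, container name, image, and command here are placeholders assumed for illustration.

```yaml
# Hypothetical Job; metadata.name, the image, and the command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      # Queued ahead of lower-priority Pods, but never preempts running ones.
      priorityClassName: high-priority-nonpreempting
      restartPolicy: Never
      containers:
      - name: trainer
        image: python:3.12
        command: ["python", "-c", "print('training')"]
```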

Bind PriorityClass to Pod

To assign a priority to a Pod, set priorityClassName in its spec to the name of an existing PriorityClass. The priority admission controller resolves that name to the integer priority value; if no PriorityClass with that name exists, the Pod is rejected.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
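
After admission, the stored Pod should carry the resolved priority. The abridged fragment below is an illustration of roughly what you might see when reading the Pod back, assuming the high-priority class from the sample manifest (value 1000000) and the default preemption policy; other fields are omitted.

```yaml
# Abridged, illustrative view of the admitted Pod; unrelated fields omitted.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  priorityClassName: high-priority
  # Filled in by the priority admission controller from the PriorityClass.
  priority: 1000000
  # Default policy when the PriorityClass does not set preemptionPolicy.
  preemptionPolicy: PreemptLowerPriority
```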

Preemption

When Pods are created, they go to a queue and wait to be scheduled.
The scheduler picks a Pod from the queue and tries to schedule it on a Node.
If no Node is found that satisfies all the specified requirements of the Pod, preemption logic is triggered for the pending Pod.
Let’s call the pending Pod P. Preemption logic tries to find a Node where removal of one or more Pods with lower priority than P would enable P to be scheduled on that Node.
If such a Node is found, one or more lower priority Pods get evicted from the Node. After the Pods are gone, P can be scheduled on the Node.
When Pod P preempts one or more Pods on Node N, the nominatedNodeName field of Pod P's status is set to the name of Node N. This field helps the scheduler track the resources reserved for Pod P and also gives users information about preemptions in their clusters.
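
For illustration, the status of a preempting Pod that is still waiting for its victims to terminate might contain something like the fragment below. The node name node-n is a placeholder, and all other status fields are omitted.

```yaml
# Abridged, illustrative Pod status; node-n is a placeholder node name.
status:
  phase: Pending
  # Set by the scheduler when this Pod preempts Pods on the node.
  nominatedNodeName: node-n
```

Note that the Pod is not guaranteed to land on the nominated node: if another suitable node frees up first, the scheduler may place it there instead.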

Credit: this image is copied from devopcube.com