Brian Grant (@bgrant0607), 14 tweets, 4 min read

Kubernetes Borg/Omega history topic 13: Priority and preemption. Some work is more important and/or urgent than other work. Borg represented this as an integer value: priority. A higher value meant a task was more important than a lower value, and should be able to displace it.

When choosing a machine for a task, the scheduler ignored lower-priority tasks for determining whether/where a task would fit, but considered the number of tasks that would have to be preempted as part of the ranking function for choosing the best machine.
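The placement rule above can be sketched roughly as follows. This is a minimal illustration, not Borg's actual code: the `Task` and `Machine` types, the single integer `cpu` resource, and the "fewest preemptions wins" ranking are all simplifying assumptions (the real ranking function weighed other factors too).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Task:
    name: str
    priority: int  # higher value = more important
    cpu: int       # hypothetical single resource dimension


@dataclass
class Machine:
    name: str
    capacity: int
    tasks: List[Task] = field(default_factory=list)


def victims_needed(machine: Machine, incoming: Task) -> Optional[List[Task]]:
    """Return the tasks to preempt so `incoming` fits, or None if it cannot fit."""
    # Feasibility: lower-priority tasks are ignored when checking the fit.
    kept = sum(t.cpu for t in machine.tasks if t.priority >= incoming.priority)
    if kept + incoming.cpu > machine.capacity:
        return None
    # Count what would actually have to be evicted, lowest priority first.
    free = machine.capacity - sum(t.cpu for t in machine.tasks)
    lower = sorted(
        (t for t in machine.tasks if t.priority < incoming.priority),
        key=lambda t: t.priority,
    )
    victims: List[Task] = []
    for t in lower:
        if free >= incoming.cpu:
            break
        victims.append(t)
        free += t.cpu
    return victims


def pick_machine(
    machines: List[Machine], incoming: Task
) -> Optional[Tuple[Machine, List[Task]]]:
    """Pick the feasible machine that requires the fewest preemptions."""
    best: Optional[Tuple[Machine, List[Task]]] = None
    for m in machines:
        victims = victims_needed(m, incoming)
        if victims is None:
            continue
        if best is None or len(victims) < len(best[1]):
            best = (m, victims)
    return best
```

Separating the feasibility check from the victim count mirrors the two roles described above: lower-priority tasks do not block a fit, but the cost of displacing them still influences which machine is chosen.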

Disruption budgets were never added to the scheduler. Doing so would have been hard, and there were also concerns about performance and priority inversion. Higher-priority tasks could specify how long they would wait for lower-priority ones to terminate gracefully.

Priorities were used to ensure production/critical serving workloads could always get the resources they needed. This was essential to enabling mixed workloads to run together in the same clusters. Batch and experimental workloads ran at lower priorities, infrastructure at higher ones.

For a while, users tried spreading their workloads across multiple priority bands in order to be nice to other tenants -- a crude kind of fairness in the case of resource crunches. That resulted in preemption cascades, with higher-priority tasks preempting lower-priority ones, which in turn preempted tasks below them.

Batch workloads, many of which were submitted continuously and automatically, primarily preempted other batch tasks, causing significant amounts of lost work. So, priorities were "collapsed" into bands such that everything in the same band was treated as the same priority.
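A rough sketch of what collapsing priorities into bands means for preemption decisions. The band floors below are made-up numbers for illustration, not Borg's real values, and the band names are used only as labels.

```python
import bisect

# Hypothetical band floors (illustrative only, not Borg's actual values).
BAND_FLOORS = [0, 100, 200, 300]
BAND_NAMES = ["free", "batch", "production", "monitoring"]


def band_of(priority: int) -> int:
    """Map a raw priority to the index of the band that contains it."""
    if priority < 0:
        raise ValueError("priority must be non-negative")
    return bisect.bisect_right(BAND_FLOORS, priority) - 1


def can_preempt(incoming_priority: int, running_priority: int) -> bool:
    """After the collapse, only a strictly higher band may preempt."""
    return band_of(incoming_priority) > band_of(running_priority)
```

Under this scheme a batch task at 150 can no longer displace a batch task at 110, because both fall in the same band, while any production-band task can still displace either of them.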

The collapse reduced preemption, but other mechanisms were needed to ensure timely and efficient scheduling. The rescheduler ensured that pending production-priority tasks could schedule by choosing others to displace. To avoid cascades, it verified that both the pending task and the displaced task would schedule.

Groups of batch tasks were queued and admitted to the cluster when enough resources became available to schedule them. Resource quota by priority prevented priority inflation over time. Space was left between the bands in case new bands were needed -- like BASIC line numbering.
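The queue-and-admit behavior for batch groups can be sketched like this. It is only an illustration under stated assumptions: the `JobGroup` type, the single `cpu` resource, and the strict first-in-first-out order (where a large group at the head blocks the ones behind it) are choices made for the example, not a description of how Borg ordered its queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class JobGroup:
    name: str
    tasks: int
    cpu_per_task: int  # hypothetical single resource dimension

    @property
    def demand(self) -> int:
        """Total resources needed to schedule every task in the group."""
        return self.tasks * self.cpu_per_task


def admit(queue: Deque[JobGroup], free_cpu: int) -> List[str]:
    """Admit whole groups from the head of the queue while they fit."""
    admitted: List[str] = []
    # A group is admitted only if all of its tasks can be scheduled at once.
    while queue and queue[0].demand <= free_cpu:
        group = queue.popleft()
        free_cpu -= group.demand
        admitted.append(group.name)
    return admitted
```

Admitting a group only when its full demand fits avoids starting part of a job that would then sit idle, or be preempted, while waiting for the rest.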

Eventually the priority values of virtually all tasks were changed to rationalize them with the new scheme, across thousands of jobs, in their configuration files, through a painstaking process. This reiterated the importance of abstracting the operational intent.

Borg's approach is described in the Borg paper: ai.google/research/pubs/…. K8s design proposals were in github.com/kubernetes/com… and github.com/kubernetes/com…. Priority in resource quota: github.com/kubernetes/enh…. Coscheduling: github.com/kubernetes/enh…

Priority in Kubernetes is relatively new, and it's still evolving. For instance, there's an open proposal to add a preemption policy, github.com/kubernetes/enh…, primarily to avoid preempting other pods. Borg has a similar mechanism. I'll discuss why when covering QoS.

Waiting for preempted pods to terminate gracefully before starting newly scheduled pods creates significant complexity in the design. The scheduler then needs to model the future state, and some controller needs to watch for the space to become available before starting the new pod.

The complexity of priority and preemption is primarily what drove the change for the DaemonSet controller to rely on the default scheduler to bind pods to nodes. It also drove the scheduler framework proposal, github.com/kubernetes/enh…, so the code could be reused in custom schedulers.

I'll cover Quality of Service (QoS) and oversubscription next. Over time, priority bands in Borg (specific hardcoded integer values) came to be used as part of the determination of QoS level, for reasons I'll go into in that thread.
