Kubernetes Borg/Omega history topic 13: Priority and preemption. Some work is more important and/or urgent than other work. Borg represented this as an integer value: priority. A higher value meant a task was more important than a lower value, and should be able to displace it.
When choosing a machine for a task, the scheduler ignored lower-priority tasks for determining whether/where a task would fit, but considered the number of tasks that would have to be preempted as part of the ranking function for choosing the best machine.
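A minimal sketch of that placement rule, not Borg's actual code: lower-priority tasks are ignored when checking fit, and the number of tasks that would be preempted is used to rank machines. The `Task`, `Machine`, `victims_needed`, and `pick_machine` names are hypothetical, a single CPU dimension stands in for all resources, and preemption count is the only ranking term here, whereas the thread says it was just one part of the ranking function.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int  # higher value = more important
    cpu: int


@dataclass
class Machine:
    name: str
    capacity: int
    tasks: list = field(default_factory=list)


def victims_needed(machine, task):
    """Return the tasks to preempt so `task` fits, or None if it cannot fit."""
    # Feasibility: only tasks at the same or higher priority count against capacity.
    protected = sum(t.cpu for t in machine.tasks if t.priority >= task.priority)
    if protected + task.cpu > machine.capacity:
        return None
    free = machine.capacity - sum(t.cpu for t in machine.tasks)
    lower = sorted(
        (t for t in machine.tasks if t.priority < task.priority),
        key=lambda t: t.priority,
    )
    victims = []
    # Evict the lowest-priority tasks first until the new task fits.
    for t in lower:
        if free >= task.cpu:
            break
        victims.append(t)
        free += t.cpu
    return victims


def pick_machine(machines, task):
    """Choose the feasible machine that needs the fewest preemptions."""
    candidates = []
    for m in machines:
        victims = victims_needed(m, task)
        if victims is not None:
            candidates.append((len(victims), m.name, m, victims))
    if not candidates:
        return None, []
    # Ties are broken by machine name only to keep the sketch deterministic.
    _, _, machine, victims = min(candidates, key=lambda c: (c[0], c[1]))
    return machine, victims
```

With two machines that are both full of lower-priority work, the one that needs a single eviction wins over the one that needs two, and a machine full of equal-priority work is not a candidate at all.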
Disruption budgets were never added to the scheduler. That would have been hard to do, and there were also concerns about performance and priority inversion. Higher-priority tasks could specify how long they would wait for lower-priority ones to terminate gracefully.
Priorities were used to ensure production/critical serving workloads could always get the resources they needed. This was essential to enabling mixed workloads to run together in the same clusters. Batch and experimental workloads ran at lower priorities, infrastructure at higher ones.
For a while, users tried spreading their workloads across multiple priority bands in order to be nice to other tenants -- a crude kind of fairness in the case of resource crunches. That resulted in preemption cascades of higher-priority tasks preempting lower-priority ones.
Batch workloads, many of which were continuously and automatically submitted, primarily preempted other batch tasks, causing significant amounts of lost work. So, priorities were "collapsed" into bands such that everything in the same band was treated as the same priority.
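The collapse can be pictured roughly as follows. This is only an illustration: the band floors and names below are made up (the thread gives no actual values), and `band_of` and `may_preempt` are hypothetical helpers. The point is that preemption compares bands rather than raw integers, so two batch tasks with different raw priorities can no longer displace each other.

```python
import bisect

# Hypothetical band floors and names; the real Borg values are not given here.
BAND_FLOORS = [0, 100, 200, 300]
BAND_NAMES = ["free", "batch", "production", "monitoring"]


def band_of(priority):
    """Map a raw integer priority to the index of the band that contains it."""
    idx = bisect.bisect_right(BAND_FLOORS, priority) - 1
    if idx < 0:
        raise ValueError(f"priority {priority} is below the lowest band")
    return idx


def may_preempt(preemptor_priority, victim_priority):
    """After the collapse, only a task in a strictly higher band may preempt."""
    return band_of(preemptor_priority) > band_of(victim_priority)
```

Under these assumed floors, priorities 150 and 110 both land in the batch band and cannot preempt each other, while 200 can still preempt 199 because they sit in different bands.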
The collapse reduced preemption, but other mechanisms were needed to ensure timely and efficient scheduling. The rescheduler ensured that pending production-priority tasks could schedule by choosing others to displace. It verified that both tasks would schedule, to avoid cascades.
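The "both tasks would schedule" check might look something like this sketch. It assumes a single CPU dimension, that every task already on a machine is eligible for displacement by the pending task, and a hypothetical `find_displacement` helper; none of this is taken from the real rescheduler.

```python
def free_cpu(machine):
    """CPU not used by any task currently on the machine."""
    return machine["capacity"] - sum(cpu for _, cpu in machine["tasks"])


def find_displacement(machines, pending_cpu):
    """Return (source, victim, destination) or None.

    A victim is chosen only if the pending task fits once the victim leaves
    AND the victim fits in free space on another machine, so displacing it
    cannot trigger a further preemption.
    """
    for src_name, src in machines.items():
        for v_name, v_cpu in src["tasks"]:
            # Would the pending task fit on this machine without the victim?
            if free_cpu(src) + v_cpu < pending_cpu:
                continue
            # Would the victim fit somewhere else without preempting anyone?
            for dst_name, dst in machines.items():
                if dst_name != src_name and free_cpu(dst) >= v_cpu:
                    return src_name, v_name, dst_name
    return None
```

If no victim can be placed elsewhere, the function returns `None` and nothing is displaced, which is the cascade-avoiding behavior the thread describes.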
Groups of batch tasks were queued and admitted to the cluster when enough resources became available to schedule them. Resource quota by priority prevented priority inflation over time. Space was left between the bands in case new bands were needed -- like BASIC line numbering.
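As a rough illustration of group admission, the sketch below admits whole groups from a queue only when their total demand fits in the free capacity. The strict first-in-first-out order, the single CPU dimension, and the `admit_groups` name are assumptions made for the example; the thread does not say how the real queue was ordered.

```python
from collections import deque


def admit_groups(queue, free_cpu):
    """Admit whole groups from the head of the queue while capacity allows.

    `queue` holds (group_name, total_cpu_demand) pairs. A group is admitted
    all at once or not at all, and a group that does not fit blocks the ones
    behind it (an assumption of this sketch).
    """
    admitted = []
    while queue and queue[0][1] <= free_cpu:
        name, demand = queue.popleft()
        free_cpu -= demand
        admitted.append(name)
    return admitted, free_cpu
```

Calling it again after more capacity is released admits the groups that were left waiting.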
Eventually the priority values of virtually all tasks were changed to rationalize them with the new scheme, across thousands of jobs, in their configuration files, through a painstaking process. This reiterated the importance of abstracting the operational intent.
Borg's approach is described in the Borg paper: ai.google/research/pubs/…. K8s design proposals were in github.com/kubernetes/com… and github.com/kubernetes/com…. Priority in resource quota: github.com/kubernetes/enh…. Coscheduling: github.com/kubernetes/enh…
Priority in Kubernetes is relatively new, and it's still evolving. For instance, there's an open proposal to add a preemption policy, github.com/kubernetes/enh…, primarily to avoid preempting other pods. Borg has a similar mechanism. I'll discuss why when covering QoS.
Waiting for preempted pods to terminate gracefully before starting newly scheduled pods creates significant complexity in the design. The scheduler then needs to model the future state, and some controller needs to watch for the space to become available before starting the new pod.
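A toy model of that bookkeeping, under stated assumptions: the `Node` class and its methods are hypothetical, only CPU is tracked, and this is not how the Kubernetes scheduler is implemented. It shows the two pieces described above: space promised to a preempting pod is hidden from other pods while the victims are still terminating, and something has to notice when enough space has actually been freed before the new pod starts.

```python
class Node:
    def __init__(self, capacity):
        self.capacity = capacity
        self.running = {}   # pod name -> cpu, including pods still terminating
        self.reserved = {}  # pod name -> cpu promised to a pod that preempted

    def free_now(self):
        """CPU that is actually unused at this moment."""
        return self.capacity - sum(self.running.values())

    def free_for_new_pods(self):
        """Future-state view: promised space is not offered to other pods.

        This can be negative while victims are still shutting down.
        """
        return self.free_now() - sum(self.reserved.values())

    def nominate(self, pod, cpu):
        """Record that `pod` will run here once its victims have exited."""
        self.reserved[pod] = cpu

    def on_terminated(self, victim):
        """Handle a victim exiting; return the waiting pods that can now start."""
        self.running.pop(victim, None)
        started = []
        for pod, cpu in list(self.reserved.items()):
            if cpu <= self.free_now():
                del self.reserved[pod]
                self.running[pod] = cpu
                started.append(pod)
        return started
```

The waiting pod starts only after the last victim it depends on has exited, not when the preemption decision is made.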
The complexity of priority and preemption is primarily what drove the change to have the DaemonSet controller rely on the default scheduler to bind pods to nodes, as well as the scheduler framework proposal, github.com/kubernetes/enh…, so that the code could be reused in custom schedulers.
I'll cover Quality of Service (QoS) and oversubscription next. Over time, priority bands in Borg (specific hardcoded integer values) came to be used as part of the determination of QoS level, for reasons I'll go into in that thread.
Thread by Brian Grant.
