Details still pending, but what if I told you I discovered a way to improve cluster scheduling performance under load by up to 60000% or more? Sun Microsystems hid a really amazing optional feature in #GridEngine and it has largely been unknown/unused/unfinished for 19 years!
The problem arises when you have over 100k pending jobs. Things start to slow down, job wait times skyrocket, and the end result is low throughput. All bad if you spent $8MM on your equipment and it is under-performing and not returning your investment. Well...
It turns out that Sun Microsystems identified the problem almost two full decades before hardware would be powerful enough to push these boundaries. Thankfully there was a solution 99% coded and ready to go. The missing 1% you ask? 💥
Sun didn’t finish the feature, so when you turn it on, it appears as though everything is gravy. Better than gravy. It solves all the slowness issues, which stemmed from the scheduler making poor choices that became most evident with a huge number of pending jobs.
But, invariably, it could be 4 seconds or (in our case) 4 days until we hit a race condition and 💥 SIGSEGV. It took a week to solve because we are talking linked lists of references to pointers to linked lists containing more linked lists accessed in a multi-threaded manner 🥳
It took adding over 400 lines of new debugging code to find, but ultimately only 49 lines of code to fix: a single static mutex and a new function to scrub old references from linked lists when a qalter (job modification request) is being processed.
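To make the shape of that fix concrete, here is a minimal sketch in C of scrubbing stale references out of nested linked lists while holding a lock. It is not the actual patch: the names `job_ref`, `category`, `category_lock`, `category_add_ref`, `scrub_job_refs` and `category_ref_count` are all hypothetical, and I am assuming a single POSIX mutex guards every list.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical types: none of these names come from the Grid Engine source. */
typedef struct job_ref {
    unsigned long job_id;       /* job this reference points at */
    struct job_ref *next;
} job_ref;

typedef struct category {
    job_ref *refs;              /* linked list of job references */
    struct category *next;      /* next category in the outer list */
} category;

/* Assumption: one static lock serializes all readers and writers. */
static pthread_mutex_t category_lock = PTHREAD_MUTEX_INITIALIZER;

/* Prepend a reference to a category. Returns 0 on success, -1 if malloc fails. */
int category_add_ref(category *cat, unsigned long job_id)
{
    assert(cat != NULL);
    job_ref *ref = malloc(sizeof *ref);
    if (ref == NULL)
        return -1;
    ref->job_id = job_id;
    pthread_mutex_lock(&category_lock);
    ref->next = cat->refs;
    cat->refs = ref;
    pthread_mutex_unlock(&category_lock);
    return 0;
}

/* Unlink and free every reference to job_id in every category.
 * Meant to run before a modified job is re-inserted, so no thread
 * can follow a pointer to the old entry. Returns the number removed. */
size_t scrub_job_refs(category *cats, unsigned long job_id)
{
    size_t removed = 0;
    pthread_mutex_lock(&category_lock);
    for (category *c = cats; c != NULL; c = c->next) {
        job_ref **link = &c->refs;      /* pointer to the link being examined */
        while (*link != NULL) {
            job_ref *cur = *link;
            if (cur->job_id == job_id) {
                *link = cur->next;      /* unlink, then free */
                free(cur);
                removed++;
            } else {
                link = &cur->next;
            }
        }
    }
    pthread_mutex_unlock(&category_lock);
    return removed;
}

/* Count the references held by one category, under the same lock. */
size_t category_ref_count(category *cat)
{
    assert(cat != NULL);
    size_t n = 0;
    pthread_mutex_lock(&category_lock);
    for (const job_ref *r = cat->refs; r != NULL; r = r->next)
        n++;
    pthread_mutex_unlock(&category_lock);
    return n;
}
```

The pointer-to-pointer walk avoids special-casing the list head, and taking the lock around the whole sweep means no other thread sees a half-updated list.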
The feature was originally written in 2001, before Sun made a big push in 2003 to make the code multi-threaded, and my guess is that it broke then because it was neglected. That would mean this forward-thinking feature, ahead of its time, was only viable from 2001 to 2003 (1-2 years).
I even found documentation written much later saying that the feature was scheduled to be deprecated and removed 🤕, probably because not many people rely on the ability to regularly submit 250k jobs at once (the people who need the feature), so nobody was willing to fix it.
Truth be told, when I was spelunking the code, a small team (I found out) was secretly planning on evaluating other software to replace SGE in the event I failed (they don’t know me well enough; I absolutely love C, there was no way I was going to fail)
And so, I will be cleaning up the patch and releasing it publicly. This excites me because SGE is also used for things like processing CT scans. Imagine if your hospital is running an unpatched #GridEngine in default configuration. As more CT scans need to be processed, ...
... the pending queue grows and the scheduler slows, leading to woes: protracted scheduling runs to start pending jobs lead to lower running slot counts, which saps throughput, meaning fewer scans get through precisely when you need them most! Now imagine ...
A hospital patches their SGE with my patch, turns on scheduler job category filtering originally written in 2001 by Sun themselves, and all of a sudden you can process millions of CT scans a day instead of thousands. The difference is stark, and involved over a year of research
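For anyone wondering why filtering by job category helps with a huge pending list, here is a rough sketch in C of the general idea as I understand it, not Sun's code. The assumption is that jobs with identical requests fall into the same category, so the scheduler only needs to look at the first few jobs of each category per run instead of every pending job. The names `pending_job`, `cat_slot`, `MAX_CATEGORIES` and `filter_by_category` are hypothetical, and keying a category on a request string is a simplification for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CATEGORIES 64   /* hypothetical cap, keeps the sketch allocation-free */

/* Hypothetical pending job: its category is just its request string. */
typedef struct {
    const char *request;
} pending_job;

/* Per-category bookkeeping for one scheduling run. */
typedef struct {
    const char *request;
    size_t taken;           /* jobs already kept from this category */
} cat_slot;

/* Walk the pending jobs in FIFO order and keep at most per_category_limit
 * jobs from each category. Indices of kept jobs are written to out.
 * Returns the number of jobs kept (never more than out_cap). */
size_t filter_by_category(const pending_job *jobs, size_t n_jobs,
                          size_t per_category_limit,
                          size_t *out, size_t out_cap)
{
    cat_slot cats[MAX_CATEGORIES];
    size_t n_cats = 0;
    size_t kept = 0;

    assert(jobs != NULL || n_jobs == 0);
    assert(out != NULL || out_cap == 0);

    for (size_t i = 0; i < n_jobs && kept < out_cap; i++) {
        /* Linear search for this job's category. */
        size_t c = 0;
        while (c < n_cats && strcmp(cats[c].request, jobs[i].request) != 0)
            c++;

        if (c == n_cats) {
            if (n_cats == MAX_CATEGORIES) {
                /* Table full: pass the job through unfiltered. */
                out[kept++] = i;
                continue;
            }
            cats[n_cats].request = jobs[i].request;
            cats[n_cats].taken = 0;
            n_cats++;
        }

        if (cats[c].taken < per_category_limit) {
            cats[c].taken++;
            out[kept++] = i;
        }
    }
    return kept;
}
```

With a limit of 2, seven pending jobs whose requests are a, a, b, a, b, b, c shrink to five candidates, and the work per scheduling run scales with the number of categories rather than the number of pending jobs.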
Keep Current with FreeBSD Frau
