1/ A story about a bug that was taking out nodes in our container fleet. Mostly in a cell in us-west-2. For some reason we were having nodes fail with workqueue stalls during cleanup_net. cleanup_net is kind of notorious. Nothing is special about us-west-2. Wtf?
2/ One of the things cleanup_net does is put some work on a high-priority workqueue for each CPU and wait for it to complete. While doing this it holds the rtnl lock. If this gets stuck, it's very bad, and it makes the system misbehave in exciting ways.
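(If you want to picture it, the pattern looks roughly like the sketch below. This is illustrative, not the actual kernel source -- the per-CPU work item setup is omitted and the function name is mine. The point: a flush work item gets queued on every CPU's high-priority workqueue, then the caller waits on all of them, with the rtnl lock held further up the stack.)

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

/* Illustrative sketch of the per-CPU flush done during network cleanup. */
static DEFINE_PER_CPU(struct work_struct, flush_works);

static void flush_all_cpus_sketch(void)
{
	unsigned int cpu;

	/* Queue a flush work item on every online CPU's highpri workqueue. */
	for_each_online_cpu(cpu)
		queue_work_on(cpu, system_highpri_wq,
			      per_cpu_ptr(&flush_works, cpu));

	/* Wait for every one of them. If one CPU's highpri kworker never gets
	 * to run, this never returns -- and the caller is holding rtnl_lock. */
	for_each_online_cpu(cpu)
		flush_work(per_cpu_ptr(&flush_works, cpu));
}

If one CPU stops running its kworkers, every later network-namespace teardown piles up behind the rtnl lock, which is exactly the "exciting" misbehavior above.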
3/ Our kernel logs looked something like this:

BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 309s!

gist.github.com/sargun/0c2ad0b… (more if interested)

It was as if CPU 2 just stopped doing work. We had no idea what was going on.
4/ AWS introduced a feature last year that allows you to send an NMI to an instance (aws.amazon.com/about-aws/what…). You can use this to trigger a kernel crashdump. We rolled linux-crashdump out across our fleet on 1/11. @_msw_ probably knows who to thank for this feature.
5/ Once we caught a crashdump, the backtrace was uninteresting on the stalled CPU. There was this ruby process that kept showing up, though. Huh. Weird. Then we looked at the runq of that CPU, thinking maybe CFS had shit itself:
6/
crash> runq -c 2
CPU 2 RUNQUEUE: ...
CURRENT: PID: 406729 TASK: ... COMMAND: "ruby"
RT PRIO_ARRAY: ...
[ 0] PID: 406729 TASK: ... COMMAND: "ruby"
CFS RB_ROOT: ...
[120] PID: 389870 TASK: ... COMMAND: "kworker/2:2"
[and many more]
7/ Well, that's fucking weird. How did ruby end up on the RT prio queue?
8/ Well, about 2.5 years ago, we decided to take over being pid 1 in the container. We used Tini (github.com/Netflix-Skunkw…). It does a bunch of stuff, like taking over the workload's stderr/stdout and redirecting it.

I think @sebp called this a container "manhole".
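(For flavor, here's a rough sketch of the "pid 1 owns the output" idea -- not Tini's actual code, just the general shape: the init process holds the pipe, the workload's stdout/stderr point at it, and pid 1 forwards whatever comes out.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int out[2];
	pid_t child;
	char buf[4096];
	ssize_t n;

	if (pipe(out) != 0) {
		perror("pipe");
		return EXIT_FAILURE;
	}

	child = fork();
	if (child == 0) {
		/* Workload side: stdout and stderr both go into the pipe. */
		dup2(out[1], STDOUT_FILENO);
		dup2(out[1], STDERR_FILENO);
		close(out[0]);
		close(out[1]);
		execlp("sh", "sh", "-c", "echo hello from the workload", (char *)NULL);
		_exit(127);
	}

	/* Init side: read the workload's output and forward it along. */
	close(out[1]);
	while ((n = read(out[0], buf, sizeof(buf))) > 0) {
		if (write(STDOUT_FILENO, buf, (size_t)n) < 0)
			break;
	}
	close(out[0]);
	waitpid(child, NULL, 0);
	return EXIT_SUCCESS;
}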
9/ When terminating containers, though, we saw a problem: sometimes tini (pid 1) didn't get scheduled in time to handle the signal.

This meant that containers were getting stuck when we tried to kill them.
10/ Our solution: set SCHED_RR on it (github.com/Netflix/titus-…). What could possibly go wrong? It fork-execed, and the scheduling class reset on fork.

tini just hangs in wait(), yielding the CPU, so it's okay.
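(A minimal sketch of that idea, assuming it's done via sched_setscheduler(2) with the SCHED_RESET_ON_FORK flag -- the linked titus-executor code is what actually shipped, and the priority value here is illustrative. Needs CAP_SYS_NICE.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct sched_param param = { .sched_priority = 1 };

	/* Make the current process (think: tini, pid 1 in the container) SCHED_RR
	 * so it gets the CPU promptly when a signal arrives. SCHED_RESET_ON_FORK
	 * means anything it fork()s drops back to the default policy. Note that
	 * exec() does NOT reset the policy -- so whatever ends up being pid 1
	 * keeps SCHED_RR. */
	if (sched_setscheduler(0, SCHED_RR | SCHED_RESET_ON_FORK, &param) != 0) {
		perror("sched_setscheduler");
		return EXIT_FAILURE;
	}

	printf("now SCHED_RR; forked children will reset to SCHED_OTHER\n");
	return EXIT_SUCCESS;
}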
11/ Little did I know, I had laid a landmine for us.
12/ People wanted systemd in their containers, and systemd requires being pid 1. So, a while later, we added a feature that let users exec into their own pid 1. They opted in by putting a special label on their image.
13/ At the time, that meant systemd in containers (as pid 1) was running SCHED_RR. It was never a problem because systemd is a light user of CPU.
14/ A user decided to build an image that inherited from an image with this special label, not knowing what it did. They changed the entrypoint from systemd to their ruby script. That meant their ruby process was pid 1 and had SCHED_RR.
15/ What was happening: the user's workload was being prioritized as an RT workload, and because ruby was a heavy CPU user, the high-priority workqueue was never getting scheduled on that CPU.
16/ This would have been totally okay in the land of plain CPU shares. But a while ago we introduced cpusets. This container only asked for 1 hyperthread, which meant the ruby process couldn't run anywhere else.
17/ Put 15 and 16 together: SCHED_RR meant the high-priority workqueue never got scheduled on that CPU, and the cpuset kept the workload pinned there, stalling out the workqueue.
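(If you want to see the failure mode without any of the container machinery, here's a hedged repro sketch: pin a task to one CPU, make it SCHED_RR, and spin. Needs CAP_SYS_NICE; depending on kernel.sched_rt_runtime_us the starvation is partial or total, and the CPU number is arbitrary. Don't run this on a box you care about.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t mask;
	struct sched_param param = { .sched_priority = 1 };

	/* Stand-in for the cpuset: confine this task to CPU 2 only. */
	CPU_ZERO(&mask);
	CPU_SET(2, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		return EXIT_FAILURE;
	}

	/* Stand-in for the inherited policy: go SCHED_RR. */
	if (sched_setscheduler(0, SCHED_RR, &param) != 0) {
		perror("sched_setscheduler");
		return EXIT_FAILURE;
	}

	/* The "efficient" workload: burn CPU forever. Per-CPU kworkers on CPU 2
	 * (including the highpri ones that cleanup_net waits on) get little or
	 * no time, and "BUG: workqueue lockup" splats should follow. */
	for (;;)
		;
}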
18/ Lastly, you ask why this affected that cell in us-west-2 more than others? The user's workload ran in order, us-west-2 -> us-east-1 -> eu-west-1, and if the first one failed, they'd abort the entire pipeline.
19/ Other people turned out to have the special "make me pid 1" label on their image but didn't take out the system. What was special about this workload?

1) It asked for 1 hyperthread
2) It used up all the CPU it could possibly get (it was efficient)
3) It only ran periodically
20/

I laid a landmine -- wrote a bug that sat in our codebase for ~3 years. It required two new features to arrive, plus a very specific workload.

It blew up the kernel, and manifested in a damn weird way.
Once we figured out what it was, we were able to get a fix deployed globally in about 2 days.

I apologize to @gabehartmann for making his life exciting, but thank him for fixing it and deploying the fix.
(Also, thanks, and sorry, to my other coworkers who do not appear to be on social media.)