Daniel Bittman

Systems research is hard and time-consuming. Let's talk about that using this bug in Twizzler that I just squashed as an example.

It can be hard to make “research progress” in the face of these things.

#osdev is fun, I promise! Let's start with the bug itself... (0x0/n)

So the bug was this: Twizzler would boot, I would run “for i in {1..10}; do ls & done”, and then hit enter. The system would freeze.

This ended up being a combination of the scheduler, interrupt handling, userspace drivers, thread exiting, and page-faults. 0x1/n

Here’s what would happen: hitting any key after that for loop would cause an infinite number of page faults. Why? Well, the page-fault handler is supposed to “fix” the state so that the memory access succeeds (or kill the thread). Right? 0x2/n

Well, it would do its work, but it wasn’t enough. The access still failed, triggering the fault again, thus "looping" (faulting) forever. So I must have missed something in updating the page tables, right? 0x3/n

Nope! The page tables were updated correctly. I printed out the entire mapping, it was correct. Invalidation, then, surely? Nope! That was working correctly too. 0x4/n

Put that aside for a sec. Let’s talk about what happens when a thread exits. Thread exit is a multi-step process, because we can’t clean up all the state while the thread is actually running. So it sets a flag, queues some cleanup work on a workqueue, and reschedules. 0x5/n

If the scheduler detects that the thread has exited, and there are no other threads, it sets the current_thread pointer to NULL and halts.
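A minimal sketch of that exit path, to make the ordering concrete. Everything here (`struct thread`, `thread_exit`, `schedule`, the `cleanup_queued` flag standing in for a workqueue entry) is a hypothetical stand-in, not Twizzler's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types and names, for illustration only. */
struct thread {
    bool exited;          /* set by the exiting thread itself */
    bool cleanup_queued;  /* stands in for an entry on a workqueue */
};

static struct thread *current_thread = NULL;
static bool cpu_halted = false;

/* The exiting thread cannot free its own state while still running on it,
 * so it only marks itself as exited and defers the real cleanup. */
void thread_exit(struct thread *t)
{
    assert(t != NULL);
    t->exited = true;
    t->cleanup_queued = true;
}

/* 'next' is the thread the scheduler picked, or NULL if nothing is runnable.
 * If the current thread has exited and nothing else can run, clear
 * current_thread and halt until the next interrupt. */
void schedule(struct thread *next)
{
    if (next != NULL) {
        current_thread = next;
        cpu_halted = false;
        return;
    }
    if (current_thread != NULL && current_thread->exited) {
        current_thread = NULL;
        cpu_halted = true;
    }
}
```

Note what this sketch leaves out: nothing in `schedule` touches the page tables when it clears `current_thread`.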

So you might be thinking, how is this related to the fault-handler that updates the page-tables correctly but also somehow doesn’t? 0x6/n

The function that actually updates the page-tables does a trick. It needs to know which page-tables to use, so it looks at the current thread’s virtual memory context. But current_thread might be NULL, remember? That would cause a NULL pointer deref, which would be Bad™. 0x7/n

If current_thread is NULL, it uses a “kernel bootstrap” memory context instead. Thus, whether or not we have a current thread, we can map memory in the kernel.

fwiw current_thread is also NULL during startup, so we use this kernel_ctx to bootstrap the mmu as well. 0x8/n
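Roughly, the trick looks like the sketch below. The name `kernel_ctx` comes from the thread above; the struct layouts and the `ctx_for_mapping` helper are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts; only the name kernel_ctx appears in the thread. */
struct vm_context { int id; };
struct thread { struct vm_context *ctx; };

static struct vm_context kernel_ctx = { 0 };  /* bootstrap context */
static struct thread *current_thread = NULL;  /* NULL at boot and when idle */

/* Pick the context whose page tables a mapping update should go into.
 * Falling back to kernel_ctx avoids dereferencing a NULL current_thread. */
struct vm_context *ctx_for_mapping(void)
{
    if (current_thread == NULL)
        return &kernel_ctx;
    assert(current_thread->ctx != NULL);
    return current_thread->ctx;
}
```

The fallback is safe on its own. It only picks which tables get edited, though, and says nothing about which tables the CPU is currently using.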

Here was my oversight: resetting the current_thread pointer to NULL without switching the page-tables of the processor to those bootstrap page-tables. So updating the page-tables in the fault handler didn’t actually update the tables the CPU was looking at! Duh. Ugh. 0x9/n
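Here is a toy model of that mismatch, small enough to run in userspace. It is only a sketch: `cpu_active_ctx` stands in for the hardware page-table base register (CR3 on x86-64), the `mapped` array stands in for real page tables, and every function name is hypothetical. The fault count is capped so the "infinite" loop terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical model: one bool per page instead of real page tables. */
struct vm_context { bool mapped[NPAGES]; };
struct thread { struct vm_context *ctx; };

static struct vm_context kernel_ctx;                    /* bootstrap context */
static struct thread *current_thread = NULL;
static struct vm_context *cpu_active_ctx = &kernel_ctx; /* models CR3 */

/* Same fallback as described above: no current thread means kernel_ctx. */
static struct vm_context *ctx_for_mapping(void)
{
    return current_thread != NULL ? current_thread->ctx : &kernel_ctx;
}

/* The handler "fixes" the mapping, but in the context it believes is active. */
static void handle_fault(size_t page)
{
    ctx_for_mapping()->mapped[page] = true;
}

/* Retry an access until it succeeds. Returns the number of faults taken,
 * or -1 if max_faults was reached without the access ever succeeding. */
int access_page(size_t page, int max_faults)
{
    int faults = 0;
    assert(page < NPAGES);
    while (!cpu_active_ctx->mapped[page]) {
        if (faults == max_faults)
            return -1;
        handle_fault(page);
        faults++;
    }
    return faults;
}
```

With a live thread, one fault fixes the access. If `current_thread` is set to NULL while `cpu_active_ctx` still points at the exited thread's context, the handler keeps editing `kernel_ctx` and `access_page` returns -1 for any page that context never mapped.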

Why didn’t it switch away? Well, resetting that current_thread pointer was actually a new thing, added because of some idle work that the system takes care of that might refer to current_thread. Safer to set it to NULL if it’s meaningless! 0xa/n

It's an involved bug, to be fair to past me :)

0xb/n

Also… I figured that the scheduler isn’t actually going to schedule a thread if the last one exited. It’ll eventually free the page-tables and switch away then, but before that, what could happen? Well… an interrupt could happen… Like, when you hit enter on the keyboard. 0xc/n

Why does an interrupt cause a page-fault? Drivers in Twizzler are in userspace, except for a generic interrupt “upcall” system. This involves writing to a “device representation” object and waking a thread. That’s what triggered the page-fault. That write. That damn write. 0xd/n

The device object had never been mapped into the bootstrap context, because it never needed to be (it was only used in an actual thread with a "real" VM context). 0xe/n

So we come full circle. That write triggered the infinite page-fault train because a thread had exited and reset the current_thread pointer without switching away from the page-tables that had been used.

Easy fix! Switch page-tables immediately when exiting, not delayed. 0xf/n
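The fix amounts to keeping one invariant: the tables the CPU has loaded are always the ones the mapping code would pick. A hedged sketch of the before and after, again with `cpu_active_ctx` modeling the page-table base register and all function names invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layouts and names; not Twizzler's real structures. */
struct vm_context { int id; };
struct thread { struct vm_context *ctx; };

static struct vm_context kernel_ctx = { 0 };
static struct thread *current_thread = NULL;
static struct vm_context *cpu_active_ctx = &kernel_ctx; /* models CR3 */

static struct vm_context *ctx_for_mapping(void)
{
    return current_thread != NULL ? current_thread->ctx : &kernel_ctx;
}

/* The invariant: the CPU is using the tables the fault handler would edit. */
bool tables_consistent(void)
{
    return cpu_active_ctx == ctx_for_mapping();
}

void switch_to_thread(struct thread *t)
{
    assert(t != NULL && t->ctx != NULL);
    current_thread = t;
    cpu_active_ctx = t->ctx;
}

/* Before: the pointer is cleared, but the exited thread's tables stay loaded. */
void clear_current_thread_buggy(void)
{
    current_thread = NULL;
}

/* After: switch to the bootstrap tables at the same moment, not later. */
void clear_current_thread_fixed(void)
{
    cpu_active_ctx = &kernel_ctx;
    current_thread = NULL;
}
```

A check like `tables_consistent()` on entry to the fault handler is one way such a mismatch could be caught early, assuming the kernel can cheaply read back the active table base.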

Why am I writing all this? Well, it’s fun. This was complex. It took me a few hours. I went down several dead-ends. These experiences are common in systems research. I want to talk about it more. This was a "success" (solved), but it was a failure by its very existence. 0x10/n

I think we need to talk about failures more. This bug was 100% my fault. An oversight. This happens a lot. All the time. It can be hard to admit to embarrassing things, but we all need to, so that we don't discourage folks in the community. 0x11/n

Doing #osdev is hard. Like any big system, there are huge numbers of complex interactions. But you can do it! It’s fun. Debugging is hard, but doable.
0x12/n

From an academic perspective: solving this bug was important. I’m working towards open-sourcing Twizzler in time for @usenix ATC this coming week. I want the system to be as stable as possible before then.

0x13/n

Of course, it’s a research OS. There are bound to be problems. But I still feel embarrassed by them. So I’m trying to squash the bugs :)

But hey I'm just one person. Can't do it all. 0x14/n

Does solving this bug help my research? Not at all. This bug doesn’t affect the programming model nor the performance (which are what I’m actually interested in). Sure it's harder to run some tests. But lots of research code is "we got it stable enough to collect stats". 0x15/n

I want it to be more usable than that. But it's so hard! So time-consuming! It takes away from time I could spend writing papers.

My goal is not to make a stable OS. It’s to research OS design and programming model changes. I can’t spend all my time perfecting the kernel. 0x16/n

This is one reason systems research is hard. The bug was hard, sure. But there’s _so much_ that needs to be done to have a usable artifact to experiment with. And it’s easy to get stuck on things like this. These bugs can take hours to days to solve, depending on luck.
0x17/n

Anyway, I hope you enjoyed my ramblings with little-to-no cohesion or editing. Check out this thread for more details on the actual research :)

https://twitter.com/danielbittman/status/1278394236698628096?s=20

0x18/n
