Concurrent programming can be hard. After days of work (spread over weeks) I was finally able to make a channel implementation pass my tests today. Sometimes the smallest things take the most time to get right.

Here's a short thread …
The kind of "channel" I've implemented is a form of CSP (see swtch.com/~rsc/thread/) which is used to communicate between threads. It's similar to Go's chan but works with threads rather than a custom coroutine runtime.
There are just two operations: send and receive.
send() blocks until there's a thread calling recv()
recv() blocks until there's a thread calling send()
A channel may be buffered in which case send and recv may succeed immediately.
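To make the send/recv semantics concrete, here is a deliberately naive, mutex-and-condition-variable sketch of an unbuffered channel in C11 threads. All names are made up for the example, and this heavyweight locking is not how the real implementation works; it's only here to show the rendezvous behaviour.

```c
/* Naive unbuffered channel: send blocks until a receiver has taken the
 * message, recv blocks until a sender has provided one. Illustrative only;
 * all names are invented for this sketch. */
#include <threads.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
  mtx_t mu;
  cnd_t cond;
  void* msg;
  bool  has_msg;
} chan_t;

static void chan_init(chan_t* c) {
  mtx_init(&c->mu, mtx_plain);
  cnd_init(&c->cond);
  c->msg = NULL;
  c->has_msg = false;
}

/* Blocks until a receiver has taken the message (rendezvous). */
static void chan_send(chan_t* c, void* msg) {
  mtx_lock(&c->mu);
  while (c->has_msg)            /* wait for any previous message to be consumed */
    cnd_wait(&c->cond, &c->mu);
  c->msg = msg;
  c->has_msg = true;
  cnd_broadcast(&c->cond);      /* wake a receiver */
  while (c->has_msg)            /* block until the receiver has taken it */
    cnd_wait(&c->cond, &c->mu);
  mtx_unlock(&c->mu);
}

/* Blocks until a sender has provided a message. */
static void* chan_recv(chan_t* c) {
  mtx_lock(&c->mu);
  while (!c->has_msg)
    cnd_wait(&c->cond, &c->mu);
  void* msg = c->msg;
  c->has_msg = false;
  cnd_broadcast(&c->cond);      /* let the sender return */
  mtx_unlock(&c->mu);
  return msg;
}

static chan_t ch;

static int producer(void* arg) {
  (void)arg;
  static int value = 42;
  chan_send(&ch, &value);       /* blocks here until main() calls chan_recv */
  return 0;
}

int main(void) {
  chan_init(&ch);
  thrd_t t;
  thrd_create(&t, producer, NULL);
  int* v = chan_recv(&ch);      /* rendezvous: unblocks the producer's send */
  printf("received %d\n", *v);
  thrd_join(t, NULL);
  return 0;
}
```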
So what makes this so challenging?
Time and causality; avoiding deadlocks and carefully handling CPU cache hierarchies, all while being efficient in time and space. For this to not be a confusing wild goose chase of trial and error, we need to have Mechanical Sympathy.
"Mechanical Sympathy" is a term coined by racing driver Jackie Stewart. Paraphrased and applied to computers: "You don’t have to be a hardware engineer to write good software, but you do have to have Mechanical Sympathy". What this means to me …
What this means to me is that having a sense of how the computer works helps tremendously in writing correct, efficient and elegant software, even if I don't fully understand all the details of the hardware.
One of the key aspects of modern software performance is managing memory. Modern CPUs have a hierarchy of data caches in addition to main working memory (RAM). CPUs today can complete a staggering amount of computation not just through raw clock speed but through a complex combination of clever "temporal" tricks, like intra-core pipeline parallelism and predicting what will be needed soon. The latter is key: the CPU and its cores constantly move instructions and data around between the various caches, in chunks called "cache lines."
To write fast software you need to "trick" the CPU into invalidating as little data in its caches as possible. Each cache level is typically many times faster than the next; don't be surprised by a 400x performance difference between the fastest cache and main memory.
A second important fallout of this kind of hardware design is that changes to data made by one core may not be visible to another core at the same point in time. A simple example is a counter: without some form of synchronization the value will become corrupt.
We'd expect the counter x in the image to be 2 after it has been incremented twice, but it's 1!
Shared memory is simply a hard problem to solve efficiently. What we need to do is tell the CPU that "x may be modified by other cores", forcing loads & stores of x to go through shared memory rather than staying in a core-private cache.
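To make the difference concrete, here's a small sketch with two C11 threads bumping two counters: a plain one that silently loses increments, and an _Atomic one that tells the compiler and CPU the value is shared (iteration counts and names are made up for illustration):

```c
/* Two threads increment both counters a million times each.
 * The plain counter races (load + add + store can interleave and lose
 * updates); the atomic counter is a synchronized read-modify-write. */
#include <threads.h>
#include <stdatomic.h>
#include <stdio.h>

static long        plain_counter  = 0;
static atomic_long atomic_counter = 0;

static int worker(void* arg) {
  (void)arg;
  for (int i = 0; i < 1000000; i++) {
    plain_counter++;                       /* racy */
    atomic_fetch_add(&atomic_counter, 1);  /* shared, coherent, but slow */
  }
  return 0;
}

int main(void) {
  thrd_t a, b;
  thrd_create(&a, worker, NULL);
  thrd_create(&b, worker, NULL);
  thrd_join(a, NULL);
  thrd_join(b, NULL);
  /* plain will usually print less than 2000000; atomic always prints 2000000 */
  printf("plain:  %ld\natomic: %ld\n", plain_counter, atomic_load(&atomic_counter));
  return 0;
}
```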
But that's really slow, as discussed earlier. So a better implementation of a parallel counter is to give each core its own counter, occasionally ask each of them to store its result to shared memory, and have one of them sum up all the counters.
This is better and will be faster than a naïve "locking"/"synchronized" counter (if we ignore the up-front costs of introducing multithreading in the first place.) But we might run into another problem here which can badly hurt performance: false sharing.
"False sharing" is seemingly weird term for what happens when multiple cores update memory that sits inside the same "cache line". Chances are your CPU's cache line size is 64 bytes and that it chunks up memory into cache lines by simply aligning addresses to 64.
So for our parallel counter example we need to make sure that the memory of the core-specific counters x¹ and x² lies within separate cache line boundaries. This is easy to do in C et al. If you look at multiprocessing-oriented code like Intel TBB or the LMAX Disruptor you'll see this a lot.
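A sketch of what that looks like in C11: each per-thread counter is padded with alignas so the counters land in different cache lines (the 64-byte line size is an assumption about the target CPU, and the names are made up):

```c
/* Per-thread counters, each in its own cache line, summed up at the end. */
#include <threads.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache line size */
#define NTHREADS   2

struct padded_counter {
  alignas(CACHE_LINE) atomic_long value;  /* owns a whole cache line */
};

static struct padded_counter counters[NTHREADS];

static int worker(void* arg) {
  struct padded_counter* c = arg;
  for (int i = 0; i < 1000000; i++)
    /* relaxed ordering is enough: only this thread writes this counter */
    atomic_fetch_add_explicit(&c->value, 1, memory_order_relaxed);
  return 0;
}

int main(void) {
  thrd_t threads[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    thrd_create(&threads[i], worker, &counters[i]);
  for (int i = 0; i < NTHREADS; i++)
    thrd_join(threads[i], NULL);

  long total = 0;
  for (int i = 0; i < NTHREADS; i++)  /* one reader sums up all the counters */
    total += atomic_load(&counters[i].value);
  printf("total: %ld\n", total);      /* always 2000000 */
  return 0;
}
```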
Okay that's going to make our silly parallel counter really efficient.

So what's up with this channel thing I've been struggling with?

Well, I wanted to make it very fast and avoid thread sync. TLDR: I reached a compromise.

How many implementation attempts did I make? Many:
After researching and testing existing solutions, I tried to port a few of them. One neat one is Plan 9's libthread, but it, like the Go implementation, relies on a coroutine runtime; it could however be adapted to work with plain C11 threads.
I also tried porting a few other approaches, like the (very Java, very complex lol) LMAX Disruptor (which sadly requires call-site level changes, so that went out the door) and Intel TBB's concurrent_queue, which was an exercise in unwrapping 100 layers of complex C++ just to understand it. So that was fun. The TBB implementation was actually a pretty good fit, but I must have made some mistake, as after 20 hours of debugging a correctness issue I gave up. Also, the code was too complex (my C port was 1000 lines; the C++ original is about 10x that.)
So I ended up with a really neat solution that is only about 350 lines of C, primarily based on the Go chan implementation, combined with bits and pieces of ideas from other things I've written for other reasons (like a hybrid spinning-then-syscalling mutex.)
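The spinning-then-syscalling mutex idea is roughly shaped like this (a hedged sketch, not the actual code from my implementation; mtx_trylock normally stays in user space, so spinning on it for a bounded number of attempts avoids a syscall when the lock is released quickly):

```c
/* Hybrid lock: spin briefly in user space, then fall back to blocking. */
#include <threads.h>

typedef struct {
  mtx_t inner;
} hybrid_mutex;

enum { SPIN_LIMIT = 4000 };  /* arbitrary; tune for the workload */

static void hybrid_init(hybrid_mutex* m) {
  mtx_init(&m->inner, mtx_plain);
}

static void hybrid_lock(hybrid_mutex* m) {
  /* Fast path: trylock is typically just an atomic compare-and-swap,
   * so a short spin never enters the kernel. */
  for (int i = 0; i < SPIN_LIMIT; i++) {
    if (mtx_trylock(&m->inner) == thrd_success)
      return;
  }
  /* Slow path: stop burning CPU and block in the kernel instead. */
  mtx_lock(&m->inner);
}

static void hybrid_unlock(hybrid_mutex* m) {
  mtx_unlock(&m->inner);
}
```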
The channel passes a set of tests (which I ran a few thousand times over and over, over lunch, just to be sure ha ha... non-determinism, yay). The GIF at the head of this thread shows the tests. It's also fairly efficient:
(That benchmark is first an interleaving single-threaded program, followed by a worst-case, starving, unbalanced multi-threaded program.)
There was a lot of this over the past weeks. Deadlocks, "fun" memory crashes, etc.
I call it "fun with lldb" 😬 (It's basically "programming BDSM.")
For some deadlock bugs I had to devise a special logging mechanism, since printf (anything dealing with FILE in libc) will implicitly synchronize threads (libc locks a mutex on the FILE for each operation), which would cause results to vary depending on whether I ran an experiment with instrumentation or not.
The debug logging basically works like the "parallel counter" example shown earlier but uses a write syscall to send its results to a central location. With that, I was able to order log messages by the CPU's view of time/causality and find two bugs. [img: timeline of 3 threads]
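The shape of that logging trick, roughly (buffer sizes and names are made up; the real mechanism is more involved): each thread appends to its own private buffer with no locks and no FILE*, then flushes it with a single write() syscall.

```c
/* Per-thread log buffer flushed with one write() syscall; avoids the
 * implicit FILE* locking that printf would introduce. */
#include <unistd.h>   /* write */
#include <stdio.h>    /* snprintf only; no FILE-based output */

#define LOGBUF_SIZE 4096

typedef struct {
  char   buf[LOGBUF_SIZE];
  size_t len;
} logbuf;

static _Thread_local logbuf tl_log;  /* one buffer per thread, no sharing */

static void log_event(const char* what) {
  /* append a line to this thread's private buffer; no synchronization */
  int n = snprintf(tl_log.buf + tl_log.len, sizeof(tl_log.buf) - tl_log.len,
                   "%s\n", what);
  if (n > 0 && tl_log.len + (size_t)n < sizeof(tl_log.buf))
    tl_log.len += (size_t)n;
}

static void log_flush(void) {
  /* one syscall sends the whole chunk to a central place (stderr here) */
  ssize_t n = write(STDERR_FILENO, tl_log.buf, tl_log.len);
  (void)n;
  tl_log.len = 0;
}
```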
If you made it all the way here, congrats.
You get a cake.
{\___/}
( • - •)
/ >🍰
If you're curious and managed to find this hidden tweet, this is what the interface to the channel implementation looks like. It's very plain and simple (which is the way I like my codez.)
Note that there are "atomic" and "memory fence" CPU instructions that can be used to tell the CPU that "this memory used for x is shared by many cores", which in turn makes the CPU synchronize its caches when storing to it, to maintain cache coherency (and that is slow.)
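In C11 those instructions are reached through stdatomic.h. A tiny sketch of the release/acquire "publish" pattern (not from the channel code, just illustrative):

```c
/* One thread publishes data; another safely observes it on a different core. */
#include <stdatomic.h>
#include <stdbool.h>

static int         payload = 0;      /* ordinary data being handed over       */
static atomic_bool ready   = false;  /* flag that publishes it to other cores */

void producer(void) {
  payload = 42;
  /* release store: everything written before this becomes visible to any
   * thread that later observes ready == true with an acquire load */
  atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void) {
  /* acquire load: once we see ready == true, the write to payload is
   * guaranteed to be visible on this core as well */
  while (!atomic_load_explicit(&ready, memory_order_acquire)) {
    /* spin; a real implementation would back off or block */
  }
  return payload;
}
```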
If you're curious to learn more about the impact of CPU cache hierarchies, check out this good recent article by @ForrestTheWoods: forrestthewoods.com/blog/memory-ba…

