(1) asynchronous state machines (all targeting the same queue hierarchy), which is one way to address the C10k problem, and it is fast
(2) getting concurrency (a better pthread_create())
(3) parallelism (dispatch_apply())
each workitem needs to represent enough work (100µs at the very least, 1ms is best)
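The sizing rule above is plain arithmetic, sketched here under stated assumptions. Both function names are hypothetical, and `ns_per_iter` is a per-iteration cost you would have to measure yourself. The idea is to batch iterations so that one workitem lands near the 1ms target, rather than dispatching one workitem per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many iterations to batch into one workitem so
 * that a workitem runs for roughly target_ns. ns_per_iter is a measured
 * cost supplied by the caller; it is an assumption of this sketch. */
size_t iterations_per_workitem(double ns_per_iter, double target_ns)
{
    if (ns_per_iter <= 0.0 || target_ns <= ns_per_iter)
        return 1;   /* one iteration is already big enough (or input is bogus) */
    return (size_t)(target_ns / ns_per_iter);
}

/* Number of workitems needed to cover n iterations with the given stride,
 * rounded up so the last workitem may be shorter than the others. */
size_t workitem_count(size_t n, size_t stride)
{
    assert(stride > 0);
    return (n + stride - 1) / stride;
}
```

With a measured 50ns per iteration and a 1ms target, this gives a stride of 20000 iterations, so 1,000,000 iterations become 50 workitems. That count is what you would pass as the iteration count to dispatch_apply(), with each invocation looping over its own stride.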
your workitems, if running concurrently, must not contend with each other, else your perf sinks dramatically. Common sources of contention:
- IPC / daemons
- malloc (locks)
- shared memory (false sharing and other cacheline snoops between cores)
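As an illustration of the shared-memory point, here is a minimal sketch of workers that do not contend. It uses plain pthreads only to stay self-contained; the same layout applies to workitems run by dispatch_apply(). The names `padded_sum`, `worker_arg`, `sum_range` and `parallel_sum`, the 64-byte line size and the worker count of 4 are all assumptions of this sketch. Each worker accumulates into a private local, does no allocation, and writes once into a slot that sits on its own cacheline.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64  /* assumption: 64-byte lines; some cores use 128 */
#define NWORKERS  4   /* assumption: fixed worker count for the sketch */

/* One result slot per worker, aligned (and therefore padded) to a full
 * cacheline so two workers never write to the same line. */
struct padded_sum {
    _Alignas(CACHELINE) uint64_t value;
};

struct worker_arg {
    const uint32_t *data;
    size_t begin, end;          /* half-open range [begin, end) */
    struct padded_sum *out;
};

static void *sum_range(void *p)
{
    struct worker_arg *a = p;
    uint64_t local = 0;         /* private accumulator, no shared writes */
    for (size_t i = a->begin; i < a->end; i++)
        local += a->data[i];
    a->out->value = local;      /* single write to this worker's own line */
    return NULL;
}

uint64_t parallel_sum(const uint32_t *data, size_t n)
{
    pthread_t threads[NWORKERS];
    struct worker_arg args[NWORKERS];
    struct padded_sum sums[NWORKERS];
    size_t chunk = (n + NWORKERS - 1) / NWORKERS;

    for (int w = 0; w < NWORKERS; w++) {
        size_t begin = (size_t)w * chunk;
        size_t end = begin + chunk;
        if (begin > n) begin = n;   /* clamp trailing workers to empty ranges */
        if (end > n) end = n;
        args[w] = (struct worker_arg){ data, begin, end, &sums[w] };
        int rc = pthread_create(&threads[w], NULL, sum_range, &args[w]);
        assert(rc == 0);
        (void)rc;
    }

    uint64_t total = 0;
    for (int w = 0; w < NWORKERS; w++) {
        pthread_join(threads[w], NULL);
        total += sums[w].value;     /* results are combined only after the join */
    }
    return total;
}
```

The contended variant would have every worker do `total += data[i]` on one shared counter (under a lock or atomically), which bounces that cacheline between cores on every iteration.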
It is not a 2x cost in instruction count though; it's just that concurrency kills your IPC (instructions per cycle) rate and you spend a lot of time just waiting.
Writing your code as if you're the only one on the system, despite being taught just that in CS-101, is unforgivable.