Matthew Honnibal @honnibal
If you try out the new spacy-nightly (v2.1.0a3), you might be surprised to see it's single-threaded. This actually took a tonne of work! I've spent many hours getting the Blis linear algebra routines into a stand-alone, wheel-installable package. Why? A thread on threads 🧵1/10
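A minimal sketch of poking at the standalone package directly, assuming the wheel installs as "blis" and exposes its Python-level matrix multiply as blis.py.gemm (that's how the package is laid out on PyPI):

    import numpy as np
    from blis.py import gemm  # BLIS matrix multiply; no thread pool spun up

    # Shapes roughly like a batch of token vectors times a weight matrix.
    A = np.random.uniform(size=(1000, 768)).astype("float64")
    B = np.random.uniform(size=(768, 300)).astype("float64")

    C = gemm(A, B)  # runs entirely in the calling thread
    assert C.shape == (1000, 300)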
In spaCy 2 we switched over to neural network models, so the bottleneck in spaCy comes down to matrix multiplication. Most Python libraries delegate CPU matrix multiplication to numpy, which then delegates it to a low-level library. Which library? Well, that depends. 2/10
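To see which library your own numpy build delegates to (the output format varies across numpy versions):

    import numpy as np

    # Prints the BLAS/LAPACK libraries numpy was built against:
    # MKL, OpenBLAS, Accelerate, or a plain reference BLAS.
    np.__config__.show()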
There are three main libraries a default numpy build might delegate to. All have different problems, and each has its own thread-count knob (sketched after this list).
* Intel MKL: May not perform well on non-Intel CPUs
* OpenBLAS: Often misdetects my CPU, leading to poor performance.
* Accelerate (for OSX): Crashes if executed from a subprocess.
3/10
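Each of those libraries reads its own environment variable to size its thread pool, and the variable has to be set before numpy loads the library. A sketch (the "1" is illustrative; pick whatever cap fits your workload):

    import os

    # Cap the BLAS thread pools *before* importing numpy; once the
    # library is loaded, these variables are usually ignored.
    os.environ["OMP_NUM_THREADS"] = "1"         # OpenMP builds (MKL, OpenBLAS)
    os.environ["OPENBLAS_NUM_THREADS"] = "1"    # OpenBLAS
    os.environ["MKL_NUM_THREADS"] = "1"         # Intel MKL
    os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # Accelerate on OSX

    import numpy as np  # matmuls now stay on one thread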
Aside from the variation in problems, all of these matrix multiplication libraries will eagerly launch a tonne of threads. Most people see this as a good thing. It makes people happy to see their CPU working hard. But all these threads probably aren't helping you. 4/10
When we used OpenBLAS for matrix multiplication, people kept reporting terrible performance: on a 96 core machine, OpenBLAS would launch 96 child threads, and throughput tanked. The fix, OMP_NUM_THREADS=2, was not exactly obvious. 5/10
The problem was that OpenBLAS -- like most other matrix multiplication libraries -- launched far too many threads for our relatively small workloads. That just caused a pile of contention and context-switching overhead, killing performance. 6/10
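You can watch the contention happen using the threadpoolctl package (not part of spaCy; it just lets you flip the BLAS thread count at runtime). The exact timings will vary by machine; the shape of the curve is the point:

    import time

    import numpy as np
    from threadpoolctl import threadpool_limits

    # Smallish matrices, like the per-batch matmuls in a parser.
    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)

    for n_threads in (1, 2, 4, 8):
        with threadpool_limits(limits=n_threads):
            start = time.perf_counter()
            for _ in range(2000):
                A @ B
            elapsed = time.perf_counter() - start
        print(f"{n_threads} threads: {elapsed:.2f}s")
    # Past a couple of threads, small workloads usually get *slower*:
    # the pool's synchronisation overhead swamps the useful work.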
Piping lots of data through a statistical model is an embarrassingly parallel workload. It's completely backwards to launch lots of threads for the *matrix multiplications*. That's the *lowest* level of computation! You want to parallelise at the *highest* level! 7/10
Let's say you want to pipe 1 billion documents through spaCy. Great. Spin up 1,000 worker CPUs, give them 1 million documents each, and you'll be done in a few minutes. There's zero advantage to an individual worker launching threads of its own. You shouldn't want that. 8/10
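A sketch of parallelising at the document level with plain multiprocessing (the model name and chunk sizes are illustrative, not anything spaCy prescribes):

    from multiprocessing import Pool

    import spacy

    def process_chunk(texts):
        # Each worker loads its own pipeline and chews through its
        # share of the documents single-threaded.
        nlp = spacy.load("en_core_web_sm")  # illustrative model name
        return [[token.text for token in doc] for doc in nlp.pipe(texts)]

    if __name__ == "__main__":
        texts = ["One document."] * 100_000  # stand-in corpus
        n_workers = 8
        chunk = len(texts) // n_workers
        chunks = [texts[i:i + chunk] for i in range(0, len(texts), chunk)]
        with Pool(n_workers) as pool:
            results = pool.map(process_chunk, chunks)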
The place where the single-threading sucks at the moment is training. I hope a multi-processing solution won't be too hard to implement. I've also got some ideas for a software transactional memory strategy I've been meaning to try. 9/10
With the new models, spaCy is running at around 8000 words per second on an n1-standard-1 machine on Google Compute Engine. This is a bit short of our target of 10k words per second, but still works out to more than 28m words parsed per $0.01, which ain't bad. 10/10
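The arithmetic behind that figure, for the record -- the only assumption is the machine price, which looks like the preemptible n1-standard-1 rate of roughly $0.01/hour:

    words_per_second = 8_000
    dollars_per_hour = 0.01  # assumed preemptible n1-standard-1 rate

    words_per_hour = words_per_second * 3600  # 28,800,000
    cents_per_hour = dollars_per_hour * 100   # 1 cent per hour
    print(f"{words_per_hour / cents_per_hour:,.0f} words per $0.01")
    # -> 28,800,000 words per $0.01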