Let’s talk vectorization! You may have heard about or experienced how simple NumPy array ops (such as the dot product) run significantly faster than for loops or list comprehensions in Python. How? Why? Thread incoming.
Suppose we are doing a dot product on two n-dimensional vectors. In a Python for loop, scalars are individually loaded into registers, and operations are performed at the scalar level. Ignoring the additions, this gives us n separate multiplication operations.
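To make the scalar version concrete, here’s a minimal sketch of that loop in pure Python (the function name is mine, just for illustration):

```python
# Pure-Python dot product: each multiply is its own interpreter-level
# operation on individually loaded scalars.
def dot_loop(a, b):
    total = 0.0
    for i in range(len(a)):  # n scalar multiplications, one at a time
        total += a[i] * b[i]
    return total
```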
NumPy makes this faster by employing vectorization: multiple scalars are loaded into wide registers at once, and you get many products for the price of a single instruction. This is SIMD (single instruction, multiple data), the backbone of NumPy vectorization.
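For comparison, the vectorized version is a single NumPy call; a minimal sketch:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# One call: the multiplies run in compiled code, where the CPU can apply
# the same instruction to several elements at once (SIMD).
result = np.dot(a, b)
```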
I think this StackOverflow answer about vectorization’s relationship to SIMD is nice and concrete: stackoverflow.com/a/35092190
I like this diagram that highlights the difference between multiplication performed in a naive for loop and a vectorized product. Credit to datascience.blog.wzb.eu/2018/02/02/vec…
So essentially, behind the scenes, if you use NumPy array operations, the for loops are pushed down to the C level, where the compiled code can take advantage of SIMD. Especially for large arrays, this can speed up computation dramatically!
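If you want to see the difference on your own machine, here’s a rough benchmark sketch; the exact numbers depend on your hardware and the array size:

```python
import timeit
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Scalar-at-a-time Python vs. the vectorized NumPy call.
loop_time = timeit.timeit(lambda: sum(x * y for x, y in zip(a, b)), number=10)
vec_time = timeit.timeit(lambda: np.dot(a, b), number=10)

print(f"Python loop: {loop_time:.3f}s  |  np.dot: {vec_time:.3f}s")
```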
Vectorization is particularly useful when doing “broadcasting,” i.e., ops on arrays that don’t have the same shape. I have my own qualms about broadcasting leading to silent data science errors, but from a systems perspective it’s a cool concept that pairs naturally with vectorization.
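A quick sketch of both sides of that, the convenience and the kind of silent shape surprise I mean (toy arrays, names are mine):

```python
import numpy as np

matrix = np.ones((3, 4))
row = np.array([0., 1., 2., 3.])       # shape (4,)

# Convenient: the row is applied to every row of the matrix without copying.
shifted = matrix + row                 # shape (3, 4)

# Surprising: a column vector plus a 1-D vector silently produces a matrix.
col = np.array([[1.], [2.], [3.]])     # shape (3, 1)
vec = np.array([10., 20., 30.])        # shape (3,)
surprise = col + vec                   # shape (3, 3), not (3,)!
```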
Finally, there's surprisingly little documentation about this, but I think vectorization is a core feature of modern scientific computing libraries. @HeyChelseaTroy did a great investigation a few years ago to benchmark for loops in C, Cython, and Python: chelseatroy.com/2018/11/07/cod…
Today I will be talking about some of the data structures we use regularly when doing data science work. I will start with NumPy’s ndarray.
What is an ndarray? It’s NumPy’s abstraction for describing an array, i.e., a group of numbers. In math terms, “array” is a catch-all covering both matrices and vectors. Behind the scenes, an ndarray essentially describes a block of memory using several key attributes (see the sketch after this list):
* pointer: the memory address of the first byte in the array
* type: the kind of elements in the array, such as floats or ints
* shape: the size of each dimension of the array (e.g., 5 × 5 × 5)
* strides: the number of bytes to skip in memory to move one step along each dimension
* flags: metadata such as whether the array owns its memory and is writeable
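You can inspect all of these directly; a minimal sketch:

```python
import numpy as np

arr = np.zeros((5, 5, 5), dtype=np.float64)

print(arr.ctypes.data)  # pointer: address of the first byte
print(arr.dtype)        # type: float64
print(arr.shape)        # shape: (5, 5, 5)
print(arr.strides)      # strides: (200, 40, 8) bytes per step in each dim
print(arr.flags)        # flags: C_CONTIGUOUS, OWNDATA, WRITEABLE, ...
```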
During my leave I’ve really enjoyed reading about the inspiring women trailblazers in statistics who paved the way for us. Here are some of my favourite quotes in chronological order. Please share yours! #WSDS
Florence Nightingale states in her essay Cassandra 👇
🖼 source: Wikimedia Commons
I’m really looking forward to attending this 👇 #Nightingale2020 has been one of the few things worth celebrating this year! Her lessons on sanitation couldn’t be more relevant. #WSDS
As part of the bicentenary celebrations of the birth of the first woman elected a fellow of the @RoyalStatSoc, we’ve also organised several events at the society throughout the year rss.org.uk/news-publicati…
Support mechanisms for students and early career researchers have become ever so important during the pandemic, yet more difficult to provide.
🖼️Another beautiful and on-point creation by @allison_horst
@allison_horst As a consequence, the power and potential of the support they receive from online communities like this one have been strengthened by the circumstances. I have personally valued them more than ever.
@allison_horst When I registered to curate this account earlier in the year, I didn’t know there was going to be either a pandemic or elections. I just thought it would be a nice way to return to work after extended maternity leave, and a great way to get my confidence & stats interests back.
Throughout my career, I’ve become a bit wary of institutions that claim to be the best and ask for “exceptional” candidates in job postings and PhD studentships…
(Shout-out to the great @Letxuga007 for the mean gif 😉)
I’d like to take this opportunity to demand the right for the less excellent or “tending towards average” to be given opportunities and have their well deserved place in Academia!
This hyperbolic language is exclusionary: it will deter not only the “average” student from applying, but also very smart yet humble candidates who are perhaps more realistic, and indeed more honest, in their self-assessments 🤔
Tweetorial on going from regression to estimating causal effects with machine learning.
I get a lot of questions from students regarding how to think about this *conceptually*, so this is a beginner-friendly #causaltwitter high-level overview with additional references.
One thing to keep in mind is that a traditional parametric regression is estimating a conditional mean E(Y|T,X).
The bias-variance tradeoff is for that conditional mean, not for the coefficients in front of T and X.
The next step to think about conceptually is that this conditional mean E(Y|T,X) can be estimated with other tools. Yes, standard parametric regression, but also machine learning tools like random forests.
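As a minimal sketch of that point, assuming scikit-learn and simulated data (variable names and the data-generating process are mine, just for illustration): both models below target the same conditional mean E(Y|T,X).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 1))                           # covariate
T = rng.binomial(1, 0.5, size=(n, 1)).astype(float)   # binary treatment
Y = (2 * T + X + rng.normal(size=(n, 1))).ravel()     # outcome

features = np.hstack([T, X])

# Two different estimators of the same target, E(Y|T,X).
linear = LinearRegression().fit(features, Y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(features, Y)
```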
It’s OK if this is a big conceptual leap for you! It is for many people!