Sebastian Raschka (@rasbt) · Sep 7
Some techniques for optimizing inference speeds (without changing the model architecture):

(1) Parallelization
(2) Vectorization
(3) Loop tiling
(4) Operator fusion
(5) Quantization

Anything missing?

[1/6]
[2/6] (1) Parallelization (in an inference context) essentially means splitting the batches you want to predict on into chunks; the chunks are then processed in parallel. PyTorch has a nice tutorial on that here: pytorch.org/tutorials/inte…
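A minimal sketch of the chunking idea (my illustration, not the tutorial's code): since PyTorch ops release the GIL, even a plain thread pool can parallelize CPU inference across chunks.

```python
# Hedged sketch: split one large batch into chunks and run them through
# the model concurrently. Assumes CPU inference with a small model.
from concurrent.futures import ThreadPoolExecutor

import torch

@torch.no_grad()
def parallel_predict(model, inputs, num_chunks=4):
    chunks = torch.chunk(inputs, num_chunks)          # split the batch
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        results = list(pool.map(model, chunks))       # process chunks in parallel
    return torch.cat(results)

model = torch.nn.Linear(128, 10).eval()
x = torch.randn(10_000, 128)
preds = parallel_predict(model, x)  # same values as model(x)
```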
[3/6] (2) Vectorization is a classic that probably doesn't need much explanation. In a nutshell, it means replacing costly Python for-loops with ops that apply the same operation to multiple elements at once. You probably already do this automatically if you are using a linear algebra library or a DL framework.
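A toy example (mine): the same dot product via a Python loop vs. a single vectorized call that runs in optimized native code.

```python
# Hedged sketch: Python-loop dot product vs. one vectorized op.
import torch

a, b = torch.randn(1_000_000), torch.randn(1_000_000)

# Slow: element-by-element Python loop
total = 0.0
for i in range(len(a)):
    total += a[i] * b[i]

# Fast: the whole loop happens inside one optimized kernel
total_vec = torch.dot(a, b)
```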
[4/6] (3) Loop tiling. I actually only just learned about this recently (thx to #MLSystemsBook). Something that is still slightly above my head 🤯: essentially, you change the data access order in a loop to leverage the hardware's memory layout & cache: en.wikipedia.org/wiki/Loop_nest…
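To illustrate the access pattern (my sketch; in pure Python the loop overhead dominates, so the real payoff comes from compiled kernels doing this internally):

```python
# Hedged sketch of loop tiling: blocked matrix multiply. Each tile x tile
# block fits in cache and gets reused before being evicted.
import numpy as np

def matmul_tiled(A, B, tile=64):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(matmul_tiled(A, B), A @ B)
```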
[5/6] (4) Operator fusion: here, if you have multiple loops (or ops) over the same data, you try to merge them into one. (A classic example is calculating the mean and standard deviation in one pass.)
There was another nice example in the DANets paper I recently posted about (arxiv.org/abs/2112.02962).
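Spelling out the classic mean/std example by hand (my sketch; note this is the numerically naive one-pass variant):

```python
# Hedged sketch of operator fusion: mean and std in a single loop instead
# of one pass for the mean and a second pass for the deviations.
import math

def mean_std_fused(xs):
    n, s, sq = 0, 0.0, 0.0
    for x in xs:               # one fused loop over the data
        n += 1
        s += x
        sq += x * x
    mean = s / n
    var = sq / n - mean ** 2   # population variance via E[x^2] - E[x]^2
    return mean, math.sqrt(var)

print(mean_std_fused([1.0, 2.0, 3.0, 4.0]))  # (2.5, ~1.118)
```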
[6/6] (5) Quantization essentially reduces the numerical precision (typically casting floats -> ints) to speed up computation & lower memory requirements, while ideally maintaining accuracy. I borderline-included it since it can reduce the accuracy of your model. Tutorial: pytorch.org/tutorials/reci…
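A minimal sketch along the lines of the linked recipe (check the tutorial for the full workflow):

```python
# Hedged sketch: post-training dynamic quantization. Linear weights are
# stored as int8; activations are quantized on the fly at inference time.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
preds = quantized(torch.randn(1, 128))
```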

More from @rasbt

Sep 7
Going down some deep rabbit holes here and learning new things ...
Seems like a successful Kaggle strategy is randomly swapping cols in a tabular dataset (~like mix-up, but w/o including the labels).
Anyone tried this for a serious project with a non-deep learning tabular algo?
Link to the code (here used in a competition-winning deep learning for tabular data method as part of a denoising autoencoder backbone): kaggle.com/code/danofer/s…
It's worth clarifying (since the original tweet above looks misleading) that it doesn't literally swap columns but row values within a column. E.g., given two batches, you exchange row data among the same columns. A rough sketch of the idea below:
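My attempt at the idea in code (not the linked Kaggle notebook; the function name and noise probability are made up):

```python
# Hedged sketch of "swap noise": for each cell, with probability p replace
# its value with the value from a random row of the SAME column.
import numpy as np

def swap_noise(X, p=0.15, rng=np.random.default_rng(0)):
    mask = rng.random(X.shape) < p                 # which cells to corrupt
    rand_rows = rng.integers(0, X.shape[0], size=X.shape)
    cols = np.arange(X.shape[1])[None, :]          # column index stays fixed
    return np.where(mask, X[rand_rows, cols], X)
```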

Sep 2
My top 5 basic checks when training deep learning models

1) Make sure training loss converged
2) Check for overfitting
3) Compare accuracy to a zero-rule baseline
4) Look at failure cases
5) Plot a confusion matrix

6) <fill in the blank; what's your fav I am missing?>

[1/7]
[2/7]

1) Making sure training loss converged
=> that's a classic. We typically want to see that the loss plateaus
(Left: bad; right: better)
[3/7]

2) Check for overfitting

Another classic. We typically don't want the gap between training and validation accuracy to be too large (left: bad, right: better)
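Check 3 from the list above is cheap to automate, e.g., with sklearn's DummyClassifier as the zero-rule baseline (my sketch, on toy data):

```python
# Hedged sketch: a zero-rule baseline that always predicts the majority
# class -- any real model should beat this accuracy.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("zero-rule accuracy:", baseline.score(X_te, y_te))  # ~0.7 here
```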
Read 7 tweets
Aug 31
Just added another paper to the "tabular deep learning" list -- I intend to keep it up to date, and I previously missed DANets! DANets are centered around finding and grouping correlated features [right] -- something we usually would do manually [left]. How does that work? [1/9]
[2/9] The main idea behind DANets is to introduce an Abstract Layer (ABSTLAY) building block; multiple such blocks are then stacked to form a DANet. What does ABSTLAY do? It performs two steps: 1) feature selection and 2) feature abstraction.
[3/9] The feature selection (step 1) groups correlated features using a sparse learnable mask (they use Entmax -- analogous to Softmax but sparsity-inducing). The feature abstraction (step 2) uses a fully connected layer with attention (not shown) on the selected features.
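A very rough sketch of how I read the two steps (softmax standing in for Entmax, and the attention part omitted -- see the paper for the real block):

```python
# Hedged sketch of the ABSTLAY idea: a learnable (ideally sparse) mask
# selects features, then a fully connected layer abstracts them.
import torch

class AbstractLayerSketch(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.mask_logits = torch.nn.Parameter(torch.zeros(in_features))
        self.fc = torch.nn.Linear(in_features, out_features)

    def forward(self, x):
        mask = torch.softmax(self.mask_logits, dim=0)  # Entmax in the paper
        return torch.relu(self.fc(x * mask))           # select, then abstract
```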
Aug 28
In practice, a trained machine learning model is never final -- concept drift will inevitably cause a performance decline of a production model over time. [1/10]
[2/10] There are two main flavors of concept drift: feature drift and "real" concept drift.
There's an excellent article here that illustrates this in more detail: concept-drift.fastforwardlabs.com
[3/10] In a nutshell, feature drift describes the change in the input feature distribution over time. In rare cases, this is not harmful (subpanel to the right), but in most cases, it will require retraining the model (subpanel in the center).
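One simple way to monitor for feature drift (my sketch; the test choice and threshold are illustrative, not from the linked article):

```python
# Hedged sketch: flag feature drift with a two-sample Kolmogorov-Smirnov
# test comparing training data against recent production data.
import numpy as np
from scipy.stats import ks_2samp

train_feature = np.random.normal(0.0, 1.0, 1000)
prod_feature = np.random.normal(0.5, 1.0, 1000)   # distribution has shifted

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print("feature distribution changed -- consider retraining")
```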
Aug 23
Random Forest is my favorite baseline algorithm (alongside Logistic Regression). It’s great because it can handle nonlinear problems and has good out-of-the-box performance (RF usually requires little tuning). But …
[1/5]
[2/5] … even though RF performs an implicit feature selection (via the splitting criterion at each node), it's not immune to irrelevant features.
Here's a nice discussion & investigation by Gertjan Verhoeven: gsverhoeven.github.io/post/random-fo….
Performance decreases after adding 100 & 500 noise features.
[3/5] According to "Hyperparameters and Tuning Strategies for Random Forest" (Probst, Wright, Boulesteix 2019), the number of candidate features drawn at each split (called "mtry") is the most influential hyperparameter for random forests. Increasing it improves performance in the presence of noise features.
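In scikit-learn, "mtry" corresponds to max_features (my mapping); a quick way to probe its effect (my sketch, on toy data):

```python
# Hedged sketch: sweep max_features (sklearn's analog of "mtry") on a
# dataset padded with pure-noise features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_noisy = np.hstack([X, np.random.randn(1000, 100)])  # add 100 noise features

for max_features in ["sqrt", 0.3, 0.6]:   # float = fraction of features
    rf = RandomForestClassifier(max_features=max_features, random_state=0)
    print(max_features, cross_val_score(rf, X_noisy, y).mean())
```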
Jul 23
And the deep learning vs. conventional machine learning debate for tabular data continues!
A new paper looks at 45 mid-sized datasets (~10k examples each) and finds that tree-based models (XGBoost & random forests) still outperform deep neural networks on tabular datasets. [1/6]
[2/6] The plot above also nicely highlights one of my favorite points when talking to collaborators: If you use RF, you will often get good out-of-the-box performance! I am positively surprised that this is nowadays true for XGBoost as well!
[3/6] The authors also looked at both numerical and mixed numerical & categorical datasets (categorical features were one-hot encoded). The results hold: tree-based methods perform well.