Unit testing for ML pipelines is challenging because data, features, and models are constantly changing. Shifting inputs and outputs make it hard to write fixed unit tests.

As a hacky workaround, I liberally use assert statements in scheduled tasks. These have saved me so many times. Thread: (1/11)
In ETL, whenever I do a join to get a features table, I assert that all my primary keys are unique. Last time this failed, there was an issue in data ingestion. Without the assertion, I would have duplicate predictions for some primary keys. Rankings would be screwed. (2/11)
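A minimal sketch of this kind of primary-key check, assuming a pandas DataFrame (the thread doesn't say which framework; the same invariant applies to a Spark join via `df.count()` vs. `df.select(pk).distinct().count()`). The table names and columns here are made up for illustration:

```python
import pandas as pd

def assert_unique_pk(df: pd.DataFrame, pk: str) -> pd.DataFrame:
    """Fail fast when a join has fanned out and duplicated primary keys."""
    dupes = df[df.duplicated(subset=[pk], keep=False)]
    assert dupes.empty, (
        f"{dupes[pk].nunique()} primary keys appear more than once, "
        f"e.g. {dupes[pk].iloc[0]!r}"
    )
    return df

features = pd.DataFrame({"user_id": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
events = pd.DataFrame({"user_id": [1, 2, 2], "y": [10, 20, 21]})

# Deduplicate the right side first, then check the invariant after the join.
joined = features.merge(events.drop_duplicates("user_id"), on="user_id", how="left")
assert_unique_pk(joined, "user_id")
```

In pandas specifically, `merge(..., validate="one_to_one")` expresses the same invariant declaratively and raises on fan-out.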
Sometimes ML people (myself included) take fault tolerance for granted. When running ML systems on top of distributed systems, we need to make sure that all upstream tasks for a given timestamp have succeeded and their transactions have been committed. (3/11)
In inference, I assert that the snapshot of data (say, window size w) being fed to the model is not significantly different from random snapshots of size w sampled from the train set. Unfortunately, “significantly different” means something different for each prediction task. (4/11)
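Since "significantly different" is task-specific, here is just one cheap proxy, sketched for a single numeric feature: compare the snapshot's mean to the means of random train windows of the same size, and flag a large z-score. The threshold `k` and the use of a mean statistic are assumptions, not the author's actual test:

```python
import random
import statistics

def assert_snapshot_like_train(snapshot, train, n_windows=100, k=4.0):
    """Drift guard: snapshot mean should look like a random train window's mean."""
    w = len(snapshot)
    means = []
    for _ in range(n_windows):
        start = random.randrange(len(train) - w + 1)
        means.append(statistics.fmean(train[start:start + w]))
    mu, sigma = statistics.fmean(means), statistics.pstdev(means) or 1e-12
    z = abs(statistics.fmean(snapshot) - mu) / sigma
    assert z < k, f"snapshot drifted from train windows (z = {z:.1f})"

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(10_000)]
assert_snapshot_like_train(train[:500], train)  # in-distribution snapshot
```

For real feature tables you would run a per-feature version of this, or a proper two-sample test (e.g. Kolmogorov–Smirnov) instead of a z-score on the mean.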
Once, when the “difference” assertion failed, it was because I had accidentally promoted a model I trained 4 months ago to production. I guess this means my promotion process could use some work, but when working with time series data, small differences matter! (5/11)
In inference, I assert that the dates from the snapshot of data being fed to the model *do not overlap* with the dates of the train set. Once, this failed because of a typo in the dates specified in a DAG. Glad I caught this before showing prototype results to a customer. (6/11)
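The non-overlap assertion itself is simple; the value is in running it on the dates a DAG actually resolved to, not the dates you intended. A minimal sketch with made-up windows:

```python
from datetime import date

def assert_no_date_overlap(train_dates, inference_dates):
    """Leakage guard: the inference window must not intersect the train window."""
    overlap = set(train_dates) & set(inference_dates)
    assert not overlap, f"train/inference dates overlap: {sorted(overlap)}"

train_dates = [date(2020, 1, d) for d in range(1, 29)]     # train: Jan 1-28
inference_dates = [date(2020, 2, d) for d in range(1, 8)]  # infer: Feb 1-7
assert_no_date_overlap(train_dates, inference_dates)
```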
In the product (an API that returns predictions), I assert that we do not lose or gain any prediction values after joining the inference output (predictions) with other metadata. Once, this failed because of my own Spark incompetence and bugs in my code. (7/11)
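The invariant is just a row-count comparison around the join. A pure-Python stand-in (with hypothetical data) for what, in Spark, would be comparing `df.count()` before and after the metadata join:

```python
def assert_rows_preserved(n_before, n_after, step=""):
    """Cardinality guard: a metadata join must neither drop nor duplicate rows."""
    assert n_before == n_after, (
        f"{step}: {n_before} predictions in, {n_after} out; "
        "lost rows suggest an inner join dropped keys, extra rows suggest fan-out"
    )

predictions = {101: 0.9, 102: 0.4, 103: 0.7}
metadata = {101: "US", 102: "CA", 103: "US"}

# Left-join semantics: keep every prediction even if metadata is missing.
enriched = {k: (score, metadata.get(k)) for k, score in predictions.items()}
assert_rows_preserved(len(predictions), len(enriched), "metadata join")
```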
In training, I assert that the “minimum viable metric value” is achieved on all train & val sets. A model gets trained on the “production” training window only if the metric value is achieved on all sets. Once, this failed because of a typo in the dates in the DAG. (8/11)
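A sketch of that gate, assuming the metric is an AUC-style score where higher is better; the threshold and split names are hypothetical:

```python
MIN_VIABLE_AUC = 0.75  # hypothetical bar; choose per prediction task

def assert_min_viable_metric(metrics_by_split, threshold=MIN_VIABLE_AUC):
    """Gate training on the 'production' window: every split must clear the bar."""
    failing = {s: m for s, m in metrics_by_split.items() if m < threshold}
    assert not failing, f"metric below {threshold} on splits: {failing}"

assert_min_viable_metric({"train_2019": 0.83, "val_2019H1": 0.79, "val_2019H2": 0.77})
```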
In training (for tree-based models), I assert that the “feature importance” of the most important feature is below some threshold. Once, this assertion failed because I had accidentally whitelisted a proxy for the label as a feature; the proxy had a very high feature importance. (9/11)
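A sketch of the cap, taking importances as a plain dict — e.g. `dict(zip(feature_names, model.feature_importances_))` from a scikit-learn tree ensemble. The cap value and feature names are made up:

```python
IMPORTANCE_CAP = 0.5  # hypothetical: no single feature should dominate

def assert_no_dominant_feature(importances, cap=IMPORTANCE_CAP):
    """A near-1.0 importance is usually a leaked label proxy, not a great feature."""
    top = max(importances, key=importances.get)
    assert importances[top] < cap, (
        f"{top!r} has importance {importances[top]:.2f} >= {cap}; possible label leak"
    )

assert_no_dominant_feature({"spend_7d": 0.31, "visits_30d": 0.27, "region": 0.12})
```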
These are just some of the assertions I write, and by no means is it ideal to catch errors at runtime. It would be better to have actual unit tests. We have some for code / syntax errors, but lots of room to improve in testing ML “logic” errors. (10/11)
I’m curious how others decide what to test, what others actually test, and how you test. Please do not only paste links to <random MLOps tool>. Looking forward to learning more 😊 (11/11)

Thread by Shreya Shankar (@sh_reya)
