1/ Last week's Production ML meetup featured Peter Gao and Princeton Kwong, former Engineering Managers at Cruise and Aquabyte. Below, their insights on data quality and its downstream effects for computer vision use cases:
2/ What is your experience with data quality?

- Cruise: (1) poorly-labeled data confuses the model; (2) models may perform poorly on edge case objects
3/ Data quality cont'd

- Aquabyte: No public datasets to work with. Our engineers went onsite to collect ground-truth data, built a huge labeling pipeline to get the data to human labelers, and designed our own labeling interface that enabled labelers to label fish properly.
4/ What are common data quality issues?
- Labels aren't present, or they're mis-annotated
- Images are deformed
- Underlying metadata (e.g., image background) isn't properly captured
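
The mechanical versions of these issues can be caught automatically before data reaches training. Below is a minimal sketch, not either company's pipeline: the `Sample` layout, `KNOWN_CLASSES`, `REQUIRED_METADATA`, and the aspect-ratio thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Sample:
    # Hypothetical record layout; adapt the fields to your own dataset.
    image_id: str
    width: int
    height: int
    boxes: List[Box] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


KNOWN_CLASSES = {"fish", "car", "pedestrian"}  # assumed label vocabulary
REQUIRED_METADATA = ("background",)            # assumed required metadata keys


def find_issues(sample: Sample, min_aspect: float = 0.25, max_aspect: float = 4.0) -> List[str]:
    """Return human-readable descriptions of mechanical data problems."""
    issues = []
    # Missing or inconsistent labels.
    if not sample.boxes:
        issues.append("no labels")
    if len(sample.boxes) != len(sample.labels):
        issues.append("box/label count mismatch")
    for label in sample.labels:
        if label not in KNOWN_CLASSES:
            issues.append(f"unknown class: {label}")
    # Boxes must be non-empty and lie inside the image.
    for x_min, y_min, x_max, y_max in sample.boxes:
        if not (0 <= x_min < x_max <= sample.width and 0 <= y_min < y_max <= sample.height):
            issues.append("box outside image or degenerate")
    # Crude proxy for deformed images: implausible dimensions.
    if sample.width <= 0 or sample.height <= 0:
        issues.append("invalid image size")
    elif not (min_aspect <= sample.width / sample.height <= max_aspect):
        issues.append("suspicious aspect ratio")
    # Metadata that should have been captured alongside the image.
    for key in REQUIRED_METADATA:
        if not sample.metadata.get(key):
            issues.append(f"missing metadata: {key}")
    return issues
```

Checks like these only flag structural problems. A box drawn around the wrong object still passes, so human review remains necessary for mis-annotations.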
5/ How to detect such issues:
Watch for the scenarios below and act on them:
- Stagnating model performance
- After being retrained on a new batch of data, models output worse results
- Models in production fail on instances that do not exist in the training set
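
The first two warning signs can be checked automatically after every training run. This is a rough sketch under stated assumptions: `detect_warning_signs`, `patience`, and `min_gain` are hypothetical names and thresholds, and the score is assumed to be a single validation metric where higher is better.

```python
def detect_warning_signs(history, new_score, patience=3, min_gain=0.005):
    """Flag regression or stagnation in a validation metric.

    history:   scores from previous training runs, oldest first.
    new_score: score of the model just retrained on the new data batch.
    patience:  number of most recent runs inspected for stagnation.
    min_gain:  smallest improvement over that window that counts as progress.
    """
    warnings = []
    # Retraining on the new batch made the model worse than the previous run.
    if history and new_score < history[-1]:
        warnings.append("regression after retraining")
    # No meaningful improvement over the last `patience` runs.
    recent = (list(history) + [new_score])[-(patience + 1):]
    if len(recent) == patience + 1 and max(recent[1:]) - recent[0] < min_gain:
        warnings.append("stagnating performance")
    return warnings
```

The third sign, production failures on instances absent from the training set, cannot be read off a validation score. It needs production monitoring or a comparison of production inputs against the training distribution.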
6/ What about active learning?

- Cruise: ML systems are double-checked by vehicle operators, who could validate data quality. This continuous feedback loop substantially eases the workload of the ML engineers.
7/ Active learning cont'd

- Aquabyte: Because of a fixed annotation budget, we used a ranking algorithm to put only the best images in front of the labelers. As images came in in real time, they were dynamically sorted via an API.
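
The thread does not say how Aquabyte scored images, so the sketch below swaps in a common active-learning heuristic, uncertainty sampling: images on which the model's predicted class probabilities have the highest entropy are ranked first. `LabelingQueue` and its methods are hypothetical names. A bounded min-heap keeps only the top `budget` images as new ones stream in.

```python
import heapq
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class LabelingQueue:
    """Keeps the `budget` most uncertain images seen so far."""

    def __init__(self, budget):
        self.budget = budget
        self._heap = []  # min-heap of (score, image_id); the root is the weakest candidate

    def offer(self, image_id, probs):
        """Score an incoming image and keep it only if it beats the weakest candidate."""
        item = (entropy(probs), image_id)
        if len(self._heap) < self.budget:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            heapq.heapreplace(self._heap, item)

    def ranked(self):
        """Image ids to send to labelers, most uncertain first."""
        return [image_id for _, image_id in sorted(self._heap, reverse=True)]
```

For example, with `budget=2`, offering predictions `[0.99, 0.01]`, `[0.5, 0.5]`, and `[0.7, 0.3]` keeps the last two images and drops the confident one. If "best" means something else in your setting, such as image sharpness, only the scoring function needs to change.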
8/ How to anticipate domain shift?

- Cruise: Ad hoc (have people periodically check the results of the classifiers), or (better) capture the data distribution in feature embedding space

- Aquabyte: Customers get a dashboard with histogram and time series data of fish buckets
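
One simple way to track the data distribution in embedding space is to compare the centroid of recent production embeddings against the centroid of the training embeddings. This is only an illustrative sketch, not Cruise's method: `drift_score` is a hypothetical helper, the embeddings are assumed to come from the same model layer, and a centroid comparison only detects a shift in the mean, not changes in spread or shape.

```python
import math


def centroid(vectors):
    """Per-dimension mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means more different."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def drift_score(reference, production):
    """Distance between the training-set centroid and the production centroid."""
    return cosine_distance(centroid(reference), centroid(production))
```

Computing this score per time window and alerting when it exceeds a threshold calibrated on historical data turns the periodic manual check into a continuous one.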
9/ Quality issues in other data types:
- As with images, humans can judge the accuracy of text & audio data
- Tabular/structured data are harder to eyeball. They are usually captured automatically (user activities, session logs, etc.), reducing the leverage that humans have
10/ Who manages data quality?
- Startups: generalist engineers
- Big orgs: specialization of platform/data engineers and ML engineers, though this can lead to an anti-pattern where the two sides speak different languages
- Solution: put both groups under one team with shared incentives
11/ How helpful is synthetic data?
- Synthetic data generally doesn't work - only useful when you cannot collect real data.
- Better off sampling real data at low cost rather than generating synthetic data
- Data augmentation is a better solution
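
As a small illustration of data augmentation on labeled images, the sketch below flips an image horizontally and mirrors its bounding boxes to match. The representation is an assumption made to keep the example dependency-free: the image is a list of pixel rows, and boxes are `(x_min, y_min, x_max, y_max)` in pixel-edge coordinates.

```python
def hflip(image, boxes):
    """Horizontally flip an image and its bounding boxes.

    image: list of rows, each a list of pixel values.
    boxes: list of (x_min, y_min, x_max, y_max) tuples in pixel-edge coordinates.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Mirroring swaps the roles of x_min and x_max; y coordinates are unchanged.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for x_min, y_min, x_max, y_max in boxes
    ]
    return flipped_image, flipped_boxes
```

Each real sample yields an extra training example whose label is correct by construction, which is the appeal over generating fully synthetic scenes. Whether a flip is a valid augmentation depends on the task; it would be wrong for data where left and right carry meaning.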
