Big tech teams win because they have the best MLOps. These teams
- Deploy models at 10x speed
- Spend more time on data science, less on engineering
- Reuse rather than rebuild features
How do they do it? An architecture called a Feature Store. Here's how it works
🧵 1/n
In almost every ML/data science project, your team will spend 90-95% of the time building data cleaning scripts and pipelines
Data scientists rarely get to put their skills to work because they spend most of their time outside of modeling
Enter: The Feature Store
This specialized architecture has:
- Registry to look up/reuse previously built features
- Feature lineages
- Batch+stream transformation pipelines
- Offline store for historical lookups (training)
- Online store for low-latency lookups (live inferences)
You can think of the feature store as a "Feature API" made just for data scientists.
Anyone with access can view, pull, and contribute features for their own models. Over time, the feature store will eliminate countless hours of redundant feature engineering work
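To make the "Feature API" idea concrete, here's a minimal in-memory sketch of a registry. Everything in it (FeatureDef, FeatureRegistry, the avg_order_value feature) is a made-up name for illustration, not the API of any real feature store:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureDef:
    """Metadata for one registered feature (field names are illustrative)."""
    name: str
    owner: str
    description: str
    compute: Callable[[dict], float]  # maps a raw record to a feature value


@dataclass
class FeatureRegistry:
    """Toy registry: contribute, search, and pull features."""
    _features: Dict[str, FeatureDef] = field(default_factory=dict)

    def register(self, feature: FeatureDef) -> None:
        # Refuse duplicates so people reuse a feature instead of rebuilding it.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} already exists; reuse it")
        self._features[feature.name] = feature

    def search(self, keyword: str) -> List[str]:
        # Case-insensitive match on the feature name or its description.
        kw = keyword.lower()
        return sorted(
            f.name
            for f in self._features.values()
            if kw in f.name.lower() or kw in f.description.lower()
        )

    def get(self, name: str) -> FeatureDef:
        return self._features[name]


registry = FeatureRegistry()
registry.register(FeatureDef(
    name="avg_order_value",
    owner="team-growth",  # hypothetical owner
    description="Mean order amount per user",
    compute=lambda rec: sum(rec["orders"]) / len(rec["orders"]),
))

found = registry.search("order")
value = registry.get("avg_order_value").compute({"orders": [10.0, 30.0]})
```

A real registry would also track lineage, versions, and access control. The point here is only the contribute/search/pull loop.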
MLOps done right can supercharge your company
For example, the @NetflixEng team uses an ML architecture with feature-store-like capabilities called Metaflow. With Metaflow, data scientists can push models to production in 1 week or less on average
They have 600+ models deployed today
The feature store has 4 basic functionalities:
1. Feature Transform
This is the main tool for writing and saving features to your feature store. Typically, this takes the form of a job or service orchestration tool such as @ApacheAirflow
Basically: Read, Transform, Write
2. Feature Discovery
This is the hardest part to get right. If you want data scientists to reuse features, you need an intuitive UI that lets them search for what already exists.
@databricks's feature registry has some basic components. But there's ample room for improvement
3. Feature Serving
The Offline Store is a historical data store for feature discovery and model training
The idea behind the Online Store is rather than running feature transforms during inference (slow), you can pre-compute them and cache them in the online store for quick lookups
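Here's a rough sketch of how Read, Transform, Write and the two stores fit together. It's a toy, assuming plain Python lists and dicts in place of a real warehouse and key-value cache. The event data and function names (as_of, materialize) are made up for illustration:

```python
from collections import defaultdict

# Hypothetical raw events standing in for whatever source the job reads.
RAW_EVENTS = [
    {"user_id": "u1", "day": 1, "amount": 20.0},
    {"user_id": "u1", "day": 2, "amount": 40.0},
    {"user_id": "u2", "day": 1, "amount": 5.0},
]


def read():
    return list(RAW_EVENTS)


def transform(events):
    """Running average order value per user, one row per (user, day)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    rows = []
    for e in sorted(events, key=lambda e: e["day"]):
        uid = e["user_id"]
        totals[uid] += e["amount"]
        counts[uid] += 1
        rows.append((uid, e["day"], totals[uid] / counts[uid]))
    return rows


def write(rows, offline_store):
    # Offline store: append-only history, used for training lookups.
    offline_store.extend(rows)


def as_of(offline_store, user_id, day):
    """Historical lookup: latest feature value at or before `day`."""
    matches = [(d, v) for (u, d, v) in offline_store if u == user_id and d <= day]
    return max(matches)[1] if matches else None


def materialize(offline_store):
    """Pre-compute the latest value per user for the online store."""
    latest = {}
    for u, d, v in offline_store:
        if u not in latest or d > latest[u][0]:
            latest[u] = (d, v)
    return {u: v for u, (d, v) in latest.items()}


offline = []
write(transform(read()), offline)
online = materialize(offline)

# Inference time: a dict lookup instead of re-running the transform.
served = online["u1"]
```

In practice the offline side is usually a warehouse or data lake table and the online side a low-latency key-value store, but the split is the same: history for training, latest values for serving.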
Want to use a feature store yourself? You're in luck! There are a few open source options out there
1. @feast_dev is a fantastic open source feature store that plays nicely with both GCP and AWS
Here's what some (closed source) big players use:
@UberEng's Michelangelo has an end-to-end feature engineering -> model training -> model deployment pipeline. Largely built around Spark's MLlib
@WixEng also has a nifty architecture that stores feature data with protobufs
Want to buy instead of building your own? Here are some cool startups bringing feature stores to the market
1. @TectonAI - Staffed by some of the original Feast developers
2. @stream_sql - Founded by the minds behind Michelangelo
3. @databricks - Feature store just left beta
And that's it :)
If you enjoyed it, I post threads like this on the regular, on topics ranging from AI for UI to fintech, crypto (skepticism), and data science
The #NobelPrize in economics was just awarded to 3 top economists. #EconTwitter seems to be over it, but the data science/ML community is totally missing out!
Here's why Data Scientists should start paying attention and what they can take away 🧵
The prize was awarded to David Card, @metrics52, and Guido Imbens for their monumental contributions to statistical methodology and causal inference.
They used and developed strategies that were a true paradigm shift, bridging the gap between data and causation in economics
One part of the prize went to David Card from UC Berkeley.
Card is best known for his minimum wage study, which paradoxically found that an increase in the minimum wage did *not* reduce employment. How?
The study applied a strategy called Difference in Differences
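For a feel of the arithmetic, here's a minimal Difference in Differences sketch. The numbers are invented for illustration and are NOT from Card's study. The idea: compare the before/after change in the group that got the policy against the before/after change in a comparison group that didn't, assuming both would have followed the same trend otherwise:

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """DiD estimate: (change in treated group) minus (change in control group)."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Made-up average employment figures, purely illustrative.
effect = diff_in_diff(
    treated_pre=100.0, treated_post=110.0,   # group affected by the policy
    control_pre=100.0, control_post=105.0,   # comparison group
)
```

With these invented numbers, the treated group rose by 10 and the control group by 5, so the estimated effect of the policy is 5. The estimate is only as good as the "same trend" assumption behind it.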