1/ The year is 2003, and I’m in a bar questioning my life choices.
After months of travel and $100k+ in fees, our machine learning model failed badly on the test set.
Most ML efforts create no value. We should learn from those failures.
Here is a favorite of mine.
(attempt #2)
2/ Three months earlier, I was a mile under ground staring into an abyss left by a long-wall excavator.
The metal plates overhead were protecting us from a roof collapse, and would advance each time the massive machine cut away another slice from the wall ahead.
Jun 14, 2019 • 10 tweets • 2 min read
1/ Productivity is a skill.
It requires as much study, experimentation, introspection and experience as any other complex skill.
2/ First, we must clearly define productivity.
Value = Effort * Productivity
Where value is something that *really* matters, like happiness, wealth creation, or community impact.
Jun 12, 2019 • 13 tweets • 3 min read
1/ The most effective ML engineers I know master visualizing complex data.
Simply put, if it isn’t visualized with real data, you do not understand it, and you cannot trust it.
I generate 1,000s of visualizations per ML product.
2/ First, let’s dispel some myths.
- your math will not save you
- neither will your test set
- you cannot rely on canned visualizations
- you cannot just “look” at the data
Jun 10, 2019 • 9 tweets • 2 min read
1/ I often used data “quality”, “anomaly” and “outlier” interchangeably, but now have a much crisper sense of what I think each term should really mean.
Here are some definitions and advice for what to do about them.
2/ Data “quality” is when the data, as recorded or reported, is inconsistent with the reality it was intended to capture.
Data quality can only be assessed by an expert who understands the physics of the system being monitored.
Jun 10, 2019 • 8 tweets • 2 min read
1/ There are some great options for data warehousing today. After dozens of conversations with data teams, and a lot of testing ourselves, here is our (@eshmu and I) take on:
redshift v presto v snowflake v bigquery v athena
2/ Redshift is the incumbent. It’s a good compromise between time, value and performance.
But if your compute spikes dramatically, query times can slow to a crawl, as storage and compute are tied. Elastic resize and concurrency scaling may address this.
Jun 10, 2019 • 8 tweets • 2 min read
1/ shapley values (and the Python shap package) have become an *integral* part of my machine learning methodology
I’m surprised by how many people still don’t know about them github.com/slundberg/shap2/ They allow you to distribute credit for the prediction of an arbitrarily complex machine learning model to the underlying features
They resulting values are additive, and locally explain how the model behaves