After months of travel and $100k+ in fees, our machine learning model failed badly on the test set.
Most ML efforts create no value. We should learn from those failures.
Here is a favorite of mine.
Our client would use this model to better price and underwrite mines, which would ultimately lead to increased safety and efficiency.
We also identified multiple state and national datasets on mining activity and safety that we could use to augment this data.
Then we used S-Plus to visualize the distribution of claim outcomes, and to validate that some of the most obvious relationships we expected in the data held true.
We chose the Tweedie distribution to model this data, as a compound Poisson-Gamma distribution matched our observation of many zeros (no claims) and occasional large values.
en.wikipedia.org/wiki/Tweedie_d…
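To make that shape concrete, here is a minimal simulation sketch of a compound Poisson-Gamma process, in Python rather than the S-Plus we used at the time; every parameter value below is invented for illustration, not the client's.

```python
# Minimal sketch of the compound Poisson-Gamma idea behind the Tweedie choice:
# most policies produce zero claims, a few produce large aggregate losses.
# All parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

n_policies = 10_000
claim_rate = 0.15                         # Poisson mean: expected claims per policy
claim_shape, claim_scale = 2.0, 25_000.0  # Gamma severity per claim

# Draw a claim count per policy, then sum Gamma-distributed severities.
n_claims = rng.poisson(claim_rate, size=n_policies)
losses = np.array([
    rng.gamma(claim_shape, claim_scale, size=k).sum() if k > 0 else 0.0
    for k in n_claims
])

print(f"share of zero-loss policies: {(losses == 0).mean():.1%}")
print(f"mean loss: {losses.mean():,.0f}, 99th percentile: {np.percentile(losses, 99):,.0f}")
```

A histogram of `losses` shows exactly what we saw in the claims data: a spike at zero and a long right tail.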
We used GLMs with smoothing splines for continuous features and one-hot encoding for categorical features.
We were following best practices (at the time) in the wonderful (and free) Elements of Statistical Learning:
web.stanford.edu/~hastie/ElemSt…
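For the curious, here is roughly what that setup looks like in today's Python tooling (ours was fit in S-Plus). The feature names and parameter values are invented, and scikit-learn's SplineTransformer is a regression-spline stand-in for the smoothing splines described above, so treat this as a sketch rather than the original model.

```python
# Sketch: Tweedie GLM with spline-expanded continuous features and
# one-hot encoded categorical features. Names and values are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import TweedieRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "tons_extracted": rng.lognormal(10, 1, n),                        # hypothetical continuous feature
    "employee_count": rng.integers(2, 500, n),                        # hypothetical continuous feature
    "mine_type": rng.choice(["underground", "strip", "family"], n),   # hypothetical categorical feature
})
# Zero-heavy stand-in target (not a true compound Poisson-Gamma draw).
y = rng.gamma(2.0, 25_000.0, n) * rng.poisson(0.15, n)

preprocess = ColumnTransformer([
    ("splines", SplineTransformer(n_knots=5, degree=3), ["tons_extracted", "employee_count"]),
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["mine_type"]),
])

# power=1.5 puts the variance function between Poisson (1) and Gamma (2),
# i.e. a compound Poisson-Gamma model, fit with a log link.
model = Pipeline([
    ("prep", preprocess),
    ("glm", TweedieRegressor(power=1.5, alpha=1e-3, link="log", max_iter=1000)),
])
model.fit(df, y)
```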
Meanwhile, we cross-validated our model to ensure we didn’t overfit.
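A sketch of that step with modern tooling, again on invented data: k-fold cross-validation scored by Tweedie deviance on the held-out folds, so an overfit model shows up as poor out-of-fold deviance.

```python
# Sketch of k-fold cross-validation for a Tweedie GLM, scored by Tweedie
# deviance on held-out folds. Data and parameters are placeholders.
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import make_scorer, mean_tweedie_deviance
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(2_000, 4))                                    # placeholder feature matrix
y = rng.gamma(2.0, 25_000.0, 2_000) * (rng.random(2_000) < 0.15)   # zero-heavy placeholder target

# greater_is_better=False because lower deviance is better.
scorer = make_scorer(mean_tweedie_deviance, greater_is_better=False, power=1.5)
scores = cross_val_score(
    TweedieRegressor(power=1.5, alpha=1e-3, link="log", max_iter=1000),
    X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring=scorer,
)
print("held-out Tweedie deviance per fold:", -scores)
```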
We went to a bar.
We tested that model, and it was slightly better, but not meaningfully so.
We went back to the bar.
In retrospect, I doubt it.
So what went wrong?
So while we could estimate the global parameters of the Tweedie distribution, there was no way the data could support a machine learning model on top of it.
They ranged from large-scale, heavy-machinery mines like the one we had visited, to family operations with pickaxes, to above-ground strip-mining operations.
There was little to generalize across these entities.
While I am confident they had many biases (Moneyball), they also had tremendous wisdom.