Senior PowerPoint Engineer
Risk manager @AlamedaResearch. | Prev: QA engineer at Knight Capital Group | he/him | https://t.co/Z2YNAksZLO
Oct 29, 2023 27 tweets 5 min read
Linear regression brain-dump mega thread.

Simple (i.e. 1 independent variable + constant term) linear regression coefficients are equivalent to:

β = Cov(x, y) / Var(x)

Understanding where this comes from is valuable for understanding both regression and ML/AI more generally. Let's first note that the closed form solution of OLS for X of arbitrary size is:

β = (X'X)⁻¹ X'y

With scalars, multiplying A by the inverse of some value B is like dividing A by B:

A * B⁻¹ = A / B

So X'y is being "divided" by X'X in some sense.
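A quick numerical check of the two formulas above, as a minimal sketch in Python with NumPy (my choice of library, the thread doesn't name one). The function names and the simulated data are made up for illustration. It computes the slope both ways, Cov(x, y) / Var(x) and (X'X)⁻¹ X'y, and the two should agree up to floating point error.

```python
import numpy as np


def slope_cov_var(x, y):
    """Slope of a simple regression computed as Cov(x, y) / Var(x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.cov returns the 2x2 sample covariance matrix of x and y.
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]


def ols_closed_form(x, y):
    """Return [intercept, slope] from beta = (X'X)^-1 X'y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (constant term) plus x.
    X = np.column_stack([np.ones_like(x), x])
    # Solve (X'X) beta = X'y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Simulated data (illustrative only): y = 1.5 + 2.0 * x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=200)

intercept, slope = ols_closed_form(x, y)
print("closed form slope:", slope)
print("cov / var slope:  ", slope_cov_var(x, y))
```

Using np.linalg.solve rather than inverting X'X is a numerical convenience on my part, not something the thread prescribes.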
Jun 19, 2023 9 tweets 2 min read
Hold up, I think I was one of the 101 participants in this study. Guess it doesn't matter because they ended up faking the data anyway, but they had you throw away the sheet of paper into a pristine empty trash bin and it was so obvious what they were doing. datacolada.org/109

In retrospect younger me should have cheated more on these Dan Ariely studies. I thought there was a nonzero risk that cheating may invalidate future study eligibility, but I would have been doing Dan Ariely and co a favor and I'd have gotten more money. Win-win.
Jun 13, 2023 14 tweets 3 min read
A take: data lineage / data catalog tools are really just lipstick on a pig and they're not going to solve your discoverability problems. Not usually worth implementing.

What you need is some planning ahead, agreed upon org practices, elbow grease, and people making quick calls. If you're using dbt you already have a table-level data lineage, just fill the docs in and turn on persist_docs (no reason to not turn this on).

Even then, people will be asking you questions directly and/or doing things incorrectly. Column-level lineage isn't gonna stop that!
May 19, 2023 4 tweets 1 min read
College admissions really just comes down to whatever a school wants to prioritize. In some sense there is no such thing as objective criteria because it begs the question of the decision to prioritize certain criteria above others. 1/

Even if SAT scores were insurmountably correlated with parental HH income (and they are strongly correlated!), you can just do:

satscore = b_0 + b_income * income + ε_satscore

And evaluate candidates on ε_satscore. Boom, economic justice!*

* if the school actually cares.

2/
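A minimal sketch of the residual idea above, in Python with NumPy (my choice of library). The applicant numbers and the function name are entirely made up for illustration. It fits satscore on income by least squares and ranks candidates on the residual ε_satscore, which by construction is uncorrelated with income in the sample.

```python
import numpy as np


def income_adjusted_residuals(income, satscore):
    """Residuals from regressing satscore on a constant and income."""
    income = np.asarray(income, dtype=float)
    satscore = np.asarray(satscore, dtype=float)
    # Design matrix: constant term (b_0) plus income (b_income).
    X = np.column_stack([np.ones_like(income), income])
    beta, *_ = np.linalg.lstsq(X, satscore, rcond=None)
    # epsilon_satscore = actual score minus the score predicted by income
    return satscore - X @ beta


# Hypothetical applicants: household income (in $1000s) and SAT score.
income = np.array([30.0, 60.0, 90.0, 120.0, 150.0])
satscore = np.array([1000.0, 1150.0, 1180.0, 1300.0, 1320.0])

resid = income_adjusted_residuals(income, satscore)
# Rank applicants from largest to smallest income-adjusted residual.
ranking = np.argsort(-resid)
print("residuals:", resid)
print("ranking (applicant indices):", ranking)
```

This is only the single-regressor version from the tweet. A real admissions model would presumably need more care than a straight line through income.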
Mar 28, 2023 4 tweets 1 min read
Addendum to last night's thread, but this is another one of those phenomena that is not surprising if you see ChatGPT a little less like talking to a robot and a little more like querying a giant corpus of text off the internet.

Every time I post this, and I've said it multiple times, I get people saying it's doing more than that. In a sense, yeah sure. But "querying a giant corpus of text" should be at least a nonzero part of your mental model of what it does, lest you be mystified by things like this.
Mar 28, 2023 12 tweets 2 min read
Mini thread on how I like to explore ChatGPT in interesting ways.

ChatGPT is just a giant corpus of text from the internet. When you type stuff into ChatGPT, you are querying that text.

This is a better first order approximation for what ChatGPT is doing than "it's rly smart."

So there was a prompt a while ago that turned ChatGPT into a terminal that was really cool.

That works surprisingly well because there are a bajillion code logs on the internet. (Not because ChatGPT is rly smart.)
Mar 28, 2023 4 tweets 1 min read
Large language model trained on ten thousand paperclip optimizer thought experiment blog posts by LessWrong bloggers, being asked to query these blog posts: "I would not make paperclips."

Big yud: "I can't believe it."

Just feels like sometimes Yud has the machine learning equivalent of a child's object permanence.
Jan 9, 2023 24 tweets 5 min read
The optimal, general purpose + not-overengineered infrastructure for production data science batch jobs is actually really easy to do once you know what it is.

I'll explain it in this thread. Why not.

So I think what I'll be describing is the modal setup for DS batch jobs. Nothing out of the ordinary here. That said, lots of companies do not do this because they entrusted their DS infra to people who aren't engineers. Hence unnecessary things like SageMaker, Vertex AI, etc.