Senior PowerPoint Engineer's Threads

Jan 9 • 4 tweets • 1 min read

Don't ask about linear regression assumptions as a data science interviewer. It's tacky. Linear regression doesn't assume "normal residuals." It doesn't really assume much at all beyond the data it asks of you to have. It's just an equation. You can do whatever you want with it. Certain uses of linear regression require certain assumptions, which can be met to varying degrees of conformity. That much is true. But linear regression itself is just a thing you put data into and then it outputs some numbers.

Jul 24, 2024 • 5 tweets • 2 min read

after working with data scientists whose jobs are more ostensibly math related than SRE, I think you could banish 90% of them to the shadow realm if you pop quiz them on high school math. There is no shame in admitting you need the refresher and then doing the refresher.

https://twitter.com/samwhoo/status/1815506551114801326

I would expect a massive overlap between the ones incredulous that Sam needs a refresher on polynomials and the ones who would get shadow realm'd by like a high school honors precalculus exam question. try being 5 years into a SWE job and doing one of those, you might be humbled!

May 5, 2024 • 5 tweets • 1 min read

my equivalent of "I could beat a chimpanzee in a fight" is I think I could fix the spam on Twitter. like how hard could it really be to detect this and squash this specific category of spam

Oct 29, 2023 • 27 tweets • 5 min read

Linear regression brain-dump mega thread.

In simple (i.e. 1 independent variable + constant term) linear regression, the slope coefficient is equivalent to:

β = Cov(x, y) / Var(x)

Understanding where this comes from is valuable for understanding both regression and ML/AI more generally. Let's first note that the closed form solution of OLS for X of arbitrary size is:

β = (X'X)⁻¹ X'y

With scalars, multiplying A by the inverse of some value B is like dividing A by B:

A * B⁻¹ = A / B

So X'y is being "divided" by X'X in some sense.
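A quick numerical check of the equivalence above, as a minimal sketch on synthetic data. The sample size, the true coefficients (2.0 and 3.0), and the variable names are all made up for illustration. It computes the slope as Cov(x, y) / Var(x) and compares it to the closed-form OLS solution with a constant column in X.

```python
import numpy as np

# Synthetic data: all numbers here are arbitrary assumptions for the demo.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# Slope via Cov(x, y) / Var(x). Use the same ddof in both so the
# normalizing constants cancel.
slope_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# The intercept then follows from the sample means.
intercept_cov = y.mean() - slope_cov * x.mean()

# Closed-form OLS: beta = (X'X)^-1 X'y, with a column of ones for the constant.
# Solving the linear system avoids forming the inverse explicitly.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

print("Cov/Var slope:   ", slope_cov)
print("OLS closed form: ", beta[1])
```

The two slopes should agree up to floating point error, and beta[0] should match the intercept computed from the means.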

Jun 19, 2023 • 9 tweets • 2 min read

Hold up, I think I was one of the 101 participants in this study. Guess it doesn't matter because they ended up faking the data anyway but they had you throw away the sheet of paper into a pristine empty trash bin and it was so obvious what they were doing. datacolada.org/109

In retrospect younger me should have cheated more on these Dan Ariely studies. I thought there was a nonzero risk that cheating may invalidate future study eligibility but I would have been doing Dan Ariely and co a favor and I'd have gotten more money. Win win.

Jun 13, 2023 • 14 tweets • 3 min read

A take: data lineage / data catalog tools are really just lipstick on a pig and they're not going to solve your discoverability problems. Not usually worth implementing.

What you need is some planning ahead, agreed upon org practices, elbow grease, and people making quick calls. If you're using dbt you already have a table-level data lineage, just fill the docs in and turn on persist_docs (no reason to not turn this on).

Even then, people will be asking you questions directly and/or doing things incorrectly. Column level lineage isn't gonna stop that!
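For the persist_docs setting mentioned above, a minimal sketch of what turning it on can look like in dbt_project.yml. The project name my_project is a placeholder, and whether column comments get persisted depends on your warehouse adapter, so treat this as a starting point and check the dbt docs for your setup.

```yaml
# dbt_project.yml (fragment); "my_project" is a hypothetical project name
models:
  my_project:
    +persist_docs:
      relation: true   # write model descriptions to the table/view
      columns: true    # write column descriptions to the columns
```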

May 19, 2023 • 4 tweets • 1 min read

College admissions really just comes down to whatever a school wants to prioritize. In some sense there is no such thing as objective criteria because it begs the question of the decision to prioritize certain criteria above others. 1/

https://twitter.com/ryxcommar/status/1659596501935353856

Even if SAT scores were insurmountably correlated with parental HH income (and they are strongly correlated!), you can just do:

satscore = b_0 + b_income * income + ε_satscore

And evaluate candidates on ε_satscore. Boom, economic justice!*

* if the school actually cares.

2/
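A minimal sketch of the residual idea above, on synthetic data. The income distribution, the coefficients, and the noise level are invented for illustration and are not estimates of anything real. It fits satscore on income by least squares and then ranks candidates by the residual ε_satscore instead of the raw score.

```python
import numpy as np

# Synthetic applicants: every number here is an assumption for the demo.
rng = np.random.default_rng(1)
n = 500
income_k = rng.lognormal(mean=4.0, sigma=0.5, size=n)  # household income, $ thousands
satscore = 900 + 2.0 * income_k + rng.normal(scale=100, size=n)

# Fit satscore = b_0 + b_income * income + eps by ordinary least squares.
X = np.column_stack([np.ones(n), income_k])
b, *_ = np.linalg.lstsq(X, satscore, rcond=None)

# eps_satscore: the part of the score not linearly explained by income.
eps_satscore = satscore - X @ b

# Rank candidates on the residual rather than the raw score (best first).
ranking = np.argsort(-eps_satscore)

print("b_0, b_income:", b)
print("corr(income, raw score):", np.corrcoef(income_k, satscore)[0, 1])
print("corr(income, residual): ", np.corrcoef(income_k, eps_satscore)[0, 1])
```

By construction the residual has zero sample correlation with income, which is the whole point. It only removes the linear relationship, though, so a real version would need more thought about functional form.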

Mar 28, 2023 • 4 tweets • 1 min read

Addendum to last night's thread but this is another one of those phenomena that is not surprising if you see ChatGPT a little less like talking to a robot and a little more like querying a giant corpus of text off the internet.

https://twitter.com/peteskomoroch/status/1640758062016507905

Every time I post this, and I've said it multiple times, I get people saying it's doing more than that. In a sense, yeah sure. But "querying a giant corpus of text" should be at least a nonzero part of your mental model of what it does lest you be mystified by things like this.

Mar 28, 2023 • 12 tweets • 2 min read

Mini thread on how I like to explore ChatGPT in interesting ways.

ChatGPT is just a giant corpus of text from the internet. When you type stuff into ChatGPT, you are querying that text.

This is a better first order approximation for what ChatGPT is doing than "it's rly smart." So there was a prompt a while ago that turned ChatGPT into a terminal that was really cool.

That works surprisingly well because there are a bajillion code logs on the internet. (Not because ChatGPT is rly smart.)

Mar 28, 2023 • 4 tweets • 1 min read

Large language model trained on ten thousand paperclip optimizer thought experiment blog posts by LessWrong bloggers, being asked to query these blog posts: "I would not make paperclips."

Big yud: "I can't believe it."

https://twitter.com/ESYudkowsky/status/1640511156254289926

Just feels like sometimes Yud has the machine learning equivalent of a child's object permanence.

Jan 9, 2023 • 24 tweets • 5 min read

The optimal, general purpose + not-overengineered infrastructure for production data science batch jobs is actually really easy to do once you know what it is.

I'll explain it in this thread. Why not.

https://twitter.com/ryxcommar/status/1612277817563119617

So I think what I'll be describing is the modal setup for DS batch jobs. Nothing out of the ordinary here. That said, lots of companies do not do this because they entrusted their DS infra to people who aren't engineers. Hence unnecessary things like sagemaker, vertex AI, etc.
