Charles 🎉 Frye

Feb 26 • 24 tweets • 7 min read

Read through these awesome notes by @chipro and noticed something interesting about distribution shifts: they form a lattice, so you can represent them like you do sets, ie using a Venn diagram!

I find this view super helpful for understanding shifts, so let's walk through it.

https://twitter.com/chipro/status/1490924046350909442

(inb4 pedantry: the above diagram is an Euler diagram, not a Venn diagram, meaning not all possible joins are represented. that is good, actually, for reasons to be revealed!)

From the notes: the joint distribution of data X and targets Y is shifting. We can decompose the joint into two pieces (a marginal and a conditional) in two separate ways (starting from Y or from X).

There are four major classes of distribution shift, defined by which pieces vary and which don't.

Wait, you say, where's the fourth shift?

Because of price increases due to global supply chain issues, we can only afford 3 distribution shifts in this post, where there should be 4.

Actually, it just doesn't have a canonical name, because it's "too difficult to study" -- perhaps because it involves the opposite conditional, P(X|Y), from what the model learns, P(Y|X)?

In any case, I'll call it "mechanism drift".

Etymology: if we think of our labels Y as latent variables in a generative model, X|Y is the generative process. Ideally, there's an intervening causal mechanism (dogs cause patterns of light on camera sensors, sentiments cause sentences), which is what has changed in this case.

At first, these four types seem independent -- I have a label shift problem or a covariate shift problem, but not both.

But that's not the case! We can have covariate shift with or without label shift. Read the attached if you want an example.

This gave me pause, because I realized my intuition was based on only one type of shift happening at a time!

So I dug a little deeper.

Fundamentally, this is because Bayes' rule says that

P(Y|X) P(X) = P(X|Y) P(Y)

And if we vary just one term on one side, the joint changes, so at least one of the two terms on the other side must vary to compensate.

So distribution shifts are always entangled with each other!
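To make the entanglement concrete, here's a minimal numpy sketch on 2x2 probability tables. The helper names (`decompose`, `changed_terms`, `covariate_shift`) and all the numbers are made up for illustration; they are not from the notes. It checks that both factorizations rebuild the joint, then applies a covariate shift (new P(X), same P(Y|X)) in three toy worlds and reports which other terms moved.

```python
import numpy as np

def decompose(joint):
    """Split a joint table P(X, Y) (rows index x, columns index y) into the four Bayes terms."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return {
        "X": p_x,
        "Y": p_y,
        "Y|X": joint / p_x[:, None],  # each row sums to 1
        "X|Y": joint / p_y[None, :],  # each column sums to 1
    }

def changed_terms(before, after):
    """Names of the terms that differ between two joint tables."""
    old, new = decompose(before), decompose(after)
    return {name for name in old if not np.allclose(old[name], new[name])}

def covariate_shift(joint, new_p_x):
    """Swap in a new P(X) while holding P(Y|X) fixed."""
    return decompose(joint)["Y|X"] * np.asarray(new_p_x)[:, None]

# Both sides of Bayes' rule rebuild the same joint.
joint = np.array([[0.3, 0.1], [0.2, 0.4]])
terms = decompose(joint)
assert np.allclose(terms["Y|X"] * terms["X"][:, None], joint)
assert np.allclose(terms["X|Y"] * terms["Y"][None, :], joint)

# Toy world 1: a generic joint. Changing P(X) drags P(Y) and P(X|Y) along.
generic = changed_terms(joint, covariate_shift(joint, [0.7, 0.3]))

# Toy world 2: Y always equals X. P(Y) moves too, but both conditionals hold still.
copies = np.diag([0.5, 0.5])
with_label = changed_terms(copies, covariate_shift(copies, [0.8, 0.2]))

# Toy world 3: Y ignores X. P(Y) stays put, so P(X|Y) has to move instead.
indep = np.outer([0.4, 0.6], [0.5, 0.5])
without_label = changed_terms(indep, covariate_shift(indep, [0.7, 0.3]))

print(generic, with_label, without_label)
```

In none of the three cases does P(X) move alone, which is the point: something on the other side of the equation always comes with it.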

We can choose to vary or not vary each term in this equation, effectively fixing some subset of the distributions while the others change.

So we can relate our distribution shift types to subsets of {Y|X, X, X|Y, Y}.

A power set, the set of all subsets, has nice structure.

You may have seen that structure depicted geometrically, via Venn diagrams (right), or graphically, via Hasse diagrams (left).

But as noted, Bayes' rule limits what changes we're allowed to make! For example, we can't just change P(Y), because then the RHS of Bayes wouldn't equal the LHS. So there's no such thing as, say, "just the labels changing".

There are also two more disallowed cases. Say that P(X) and P(Y|X) both shift, but P(Y) and P(X|Y) are fixed. The latter means the joint P(X,Y) can't change, so the changes must've canceled each other out. So there'd be no shift! The same goes for the mirror image, where P(Y) and P(X|Y) shift but P(X) and P(Y|X) are fixed.

That means we can get away with a diagram that's missing those spots.
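Here's a rough sketch of that bookkeeping: enumerate the power set of the four terms and cross off the combinations ruled out above. The function name `is_possible` is hypothetical, and the second same-side pair is the mirror image of the one spelled out in the thread, assumed by symmetry. The code only encodes the rules as stated; it doesn't check that every surviving combination can actually be realized by some pair of distributions.

```python
from itertools import combinations

TERMS = ("Y|X", "X", "X|Y", "Y")

# Pairs of terms on the same side of Bayes' rule.
# If only these two move, the other side pins the joint, so nothing really shifted.
SAME_SIDE_PAIRS = [{"Y|X", "X"}, {"X|Y", "Y"}]

def is_possible(varying):
    """Apply the thread's rules to one choice of which terms vary."""
    varying = set(varying)
    if len(varying) < 2:
        # Empty set: no shift at all. Singleton: Bayes' rule would be violated.
        return False
    return varying not in SAME_SIDE_PAIRS

# The full power set: every way of choosing which terms vary.
subsets = [set(c) for r in range(len(TERMS) + 1) for c in combinations(TERMS, r)]
allowed = [s for s in subsets if is_possible(s)]
sizes = sorted(len(s) for s in allowed)

print(len(subsets), "subsets,", len(allowed), "allowed, sizes:", sizes)
```

Under these rules, 9 of the 16 subsets survive: four pairs, four triples, and the case where everything moves at once.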

And in fact, there's a common "Venn diagram" that is missing two combinations. This one, where circles on opposite corners do not touch:

(legally, this is not a Venn diagram, because a Venn diagram, according to some "official" definitions, must represent all possible combinations, like a power set. but no one knows what an "Euler diagram" is, so chalk this one up to another L for prescriptivist pedants)

Let's write down this "Venn" diagram with possible choices of distributions to vary while holding the others fixed.

If we label our diagram with the shift types, we reveal an interesting structure for our four classes of distribution shift: not all pairs can occur together (e.g. no covariate + concept shift) and any triple just results in one type "winning out".

There's no "ordering" of the shifts based on the triples, btw. Label beats covariate if you add in concept, but covariate beats label if you add in mechanism.

Instead, there's a ring:

covariate <-> label <-> concept <-> mechanism <-> covariate
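The ring can be recovered mechanically. This sketch assumes each shift type is defined by one term that moves and one term that stays fixed: the usual textbook definitions for covariate, label, and concept, plus "mechanism" defined by analogy. The `SHIFTS` table and `shift_types` helper are my own illustrative encoding, not something from the notes.

```python
from itertools import combinations

TERMS = ("Y|X", "X", "X|Y", "Y")

# Assumed definitions: shift name -> (term that moves, term that stays fixed).
SHIFTS = {
    "covariate": ("X", "Y|X"),
    "label": ("Y", "X|Y"),
    "concept": ("Y|X", "X"),
    "mechanism": ("X|Y", "Y"),
}

def shift_types(varying):
    """Which named shifts are present, given the set of terms that vary."""
    return {
        name
        for name, (moves, fixed) in SHIFTS.items()
        if moves in varying and fixed not in varying
    }

# Pairs of varying terms that show two named shifts at once: the edges of the ring.
ring_edges = sorted(
    tuple(sorted(shift_types(set(pair))))
    for pair in combinations(TERMS, 2)
    if len(shift_types(set(pair))) == 2
)

# Triples of varying terms: exactly one named shift "wins out" in each.
winners = {
    frozenset(triple): shift_types(set(triple))
    for triple in combinations(TERMS, 3)
}

print(ring_edges)
for triple, winner in winners.items():
    print(sorted(triple), "->", sorted(winner))
```

The two same-side pairs come out with no named shift at all, so covariate + concept and label + mechanism never show up as edges.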

So what are the take-homes?

First, I think this diagram and the "power set" view behind it help clarify what these drifts are: aggregates!

E.g. in the general case, label shift is when concept drift and covariate shift happen together in the absence of mechanism drift.

(and if label shift isn't accompanied by both, it's accompanied by one or the other! and mutatis mutandis for all the other shifts, based on adjacency in the ring from two tweets ago)

Second, I think it suggests some missing pieces in our distribution shift toolkit.

Can we build specific tools for when shifts happen together? Can we leverage the fact that, say, we know X and Y are both shifting, leaving the conditionals intact, and use a special approach?

(or maybe those exist already, would love to know if so!)

Finally, I think it clarifies that the unloved fourth shift type, "mechanism drift", is important and worthy of study.

It's hard to study in general, but what if we know it's happening at the same time as covariate shift, so the true discriminative function isn't changing?

PS: mechanism + concept is truly terrifying.

The generative model and the discriminative function are both changing, and so is the joint, so the loss is changing (going up, probably).

But the marginals are fixed!

So the shift would only show up if you checked the joint 😬
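A toy illustration of how sneaky this is, with made-up 2x2 tables: both marginals are identical before and after, but both conditionals flip, and a rule fit to the old P(Y|X) falls apart. The `accuracy` helper and the numbers are assumptions for the sake of the example.

```python
import numpy as np

# Joint tables P(X, Y): rows index x, columns index y. Numbers are invented.
before = np.array([[0.4, 0.1],
                   [0.1, 0.4]])
after = np.array([[0.1, 0.4],
                  [0.4, 0.1]])

def marginals(joint):
    """Return P(X) and P(Y) for a joint table."""
    return joint.sum(axis=1), joint.sum(axis=0)

def accuracy(joint, predictions):
    """Probability that predictions[x] equals y when (x, y) is drawn from the joint."""
    return sum(joint[x, y_hat] for x, y_hat in enumerate(predictions))

p_x_before, p_y_before = marginals(before)
p_x_after, p_y_after = marginals(after)

# Monitoring either marginal on its own sees nothing.
marginals_match = np.allclose(p_x_before, p_x_after) and np.allclose(p_y_before, p_y_after)

# The best rule under the old P(Y|X): for each x, predict the most likely y.
predictions = (before / p_x_before[:, None]).argmax(axis=1)

acc_before = accuracy(before, predictions)
acc_after = accuracy(after, predictions)
print(marginals_match, acc_before, acc_after)
```

Here the old rule goes from right 80% of the time to right 20% of the time, while every per-variable histogram looks exactly the same.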
