Tweet

Dmitry Kobak

@hippopedoid

Mar 30 • 13 tweets • 8 min read Twitter logo

Scrolly

@giffmana

We held a reading group on Transformers (watched videos / read blog posts / studied papers by @giffmana @karpathy @ch402 @amaarora @JayAlammar @srush_nlp et al.), and now I _finally_ roughly understand what attention does.

Here is my take on it. A summary thread. 1/n

Consider BERT/GPT setting.

We have a text string, split into tokens (<=512). Each token gets a 768-dim vector. So we have a 2D matrix X of arbitrary width. We want to set up a feed-forward layer that would somehow transform X, keeping its shape.

How can this be set up? 2/n

Fully-connected layer does not work: it cannot take input of variable length (and would have too many params anyway).

Only acting on the embedding dimension would process each token separately, which is clearly not sufficient.

How can we make the tokens interact? 3/n

Here is the CORE TRICK:

X @ X.T @ X has the same shape as X, and all tokens get to interact with all other tokens (via scalar products in X.T @ X). Neat!

f(x) = X @ X.T @ X is _almost_ a self-attention layer, but this f(x) has no trainable parameters.

Can we add some? 4/n

Easy: we multiply each X in X @ X.T @ X with 768x768 weight matrices. They are called values, keys, and queries.

But conceptually, values can be folded into MLP later on, and keys are redundant given queries. We basically need only 1 matrix (KQ) that defines inner product. 5/n

This forms "affinities" between all pairs of tokens which are normalized using softmax. We got the attention matrix!!

Then tokens are added up with attention weights.

The interpretation is that this allows each token affect each other token based on the affinity they feel. 6/n

Transformer architecture interleaves attention layers with MLP layers.

An MLP layer (with 1 hidden layer) processes each token separately. Each token is processed using the same identical weights, so it's like a 1-convolution. 7/n

Putting it together, our X goes through self-attention (3 * 768^2 params) and then through MLP (8 * 768^2 params), all wrapped by skip connections.

One such block has ~7M params.

BERT-base stacks 12 blocks, giving ~85M params. (Plus another 25M for initial token encoding.) 8/n

BERT and GPT use identical architecture and identical loss: they mask some input tokens and let the model predict them.

BERT masks random tokens.

GPT always masks *the last* token only.

This is what makes GPT good at generating texts: it's trained to always append a token. 9/n

What about Vision Transformers (ViTs)?

It's the same thing!

Images are split into 16x16 patches. Patches play the role of tokens. 224x224 image gives 196 patches of 768 dimensions. This can directly go into a BERT-like architecture + classifier softmax on top. 10/n

@giffmana

The Google team behind ViTs (@giffmana) later showed that one can replace all attention layers with "1-convolution" *horizontal* MLPs (each dimension processed separately). That's "MLP-Mixer".

Amazing that it works!

But with a big caveat: it requires a fixed input size :( 11/n

@giffmana

RESOURCES:

1. Intro lecture on Transformers by @giffmana:

2. Building nanoGPT from scratch by @karpathy:

3. Analyzing attention-only circuits by @ch402 and team: transformer-circuits.pub/2021/framework…

I highly recommend all three! 12/12

@pbloemesquire

Update:

4. Transformers from scratch by @pbloemesquire: peterbloem.nl/blog/transform….

I haven't seen this blog post before, but it's brilliant! The best blog post about transformers that I've seen. Highly recommended. 13/12

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @hippopedoid

Dmitry Kobak

@hippopedoid

Dec 2, 2022

@FredHamprecht

A very long overdue thread: happy to share preprint led by Sebastian Damrich from @FredHamprecht's lab.

*From t-SNE to UMAP with contrastive learning*
arxiv.org/abs/2206.01816

I think we have finally understood the *real* difference between t-SNE and UMAP. It involves NCE! [1/n]

@jnboehm

In prior work, we (@jnboehm @CellTypist) showed that UMAP works like t-SNE with extra attraction. We argued that it is because UMAP relies on negative sampling, whereas t-SNE does not.

Turns out, this was not the whole story. [2/n]

https://twitter.com/hippopedoid/status/1285154214147235840

@jnboehm

Because UMAP uses negative sampling, its effective loss function is very different from its stated loss function (cross-entropy). @jnboehm showed it via Barnes-Hut UMAP, while Sebastian and Fred did mathematical analysis in their NeurIPS 2021 paper proceedings.neurips.cc/paper/2021/has… [3/n]

Read 8 tweets

Dmitry Kobak

@hippopedoid

Apr 26, 2022

@signmagazine

My paper on Poisson underdispersion in reported Covid-19 cases & deaths is out in @signmagazine. The claim is that underdispersion is a HUGE RED FLAG and suggests misreporting.

Paper: rss.onlinelibrary.wiley.com/doi/10.1111/17…
Code: github.com/dkobak/covid-u…

Figure below highlights 🇷🇺 and 🇺🇦. /1

What is "underdispersion"? Here is an example. Russia reported the following number of Covid deaths during the first week of September 2021: 792, 795, 790, 798, 799, 796, 793.

Mean: 795. Variance: 11. For Poisson random data, mean=variance. So this is *underdispersed*. /2

For comparison, during the same week US reported 1461, 1185, 1202, 1795, 2010, 2003, 1942 deaths. Mean: 1657. Variance: 135470. So this is *overdispersed*.

Overdispersion is not surprising: day-of-week reporting fluctuations, epidemic growth, etc.

But underdispersion is. /3

Read 11 tweets

Dmitry Kobak

@hippopedoid

Sep 30, 2021

So what's up with the Russian election two weeks ago? Was there fraud?

Of course there was fraud. Widespread ballot stuffing was videotaped etc., but we can also prove fraud using statistics.

See these *integer peaks* in the histograms of the polling station results? 🕵️‍♂️ [1/n]

These peaks are formed by polling stations that report integer turnout percentage or United Russia percentage. E.g. 1492 ballots cast at a station with 1755 registered voters. 1492/1755 = 85.0%. Important: 1492 is not a suspicious number! It's 85.0% which is suspicious. [2/n]

We can use binomial Monte Carlo simulation to find how many polling stations with integer percentages there should be by chance. Then we can compute the number of EXCESS integer polling stations (roughly the summed heights of all INTEGER PEAKS).

Resulting excess is 1300. [3/n]

Read 11 tweets

Dmitry Kobak

@hippopedoid

Sep 23, 2021

@lpachter

Chari et al. (@lpachter) have updated their preprint and doubled down on their claim that an 🐘-looking embedding, a random (!) embedding, and 2D PCA, all preserve data structure "similar or better" than t-SNE.

I still think this claim is absurd. [1/n]

https://twitter.com/lpachter/status/1440695021502545934

https://twitter.com/hippopedoid/status/1437421945956470785

They literally say: "Picasso can quantitatively represent [local and global properties] similarly to, or better, than the respective t-SNE/UMAP embeddings".

In my thread below I argued it's a non-sequitur from Fig 2, due to insufficient metrics. [2/n]

https://twitter.com/hippopedoid/status/1437421945956470785

@lpachter

I argued that they should also consider metrics like kNN recall or kNN classification accuracy, where t-SNE would fare much better than these other methods.

I thought it should be obvious from this figure (using MNIST). But now @lpachter says it's a "mirage".

Is it? [3/n]

Read 12 tweets

Dmitry Kobak

@hippopedoid

Sep 13, 2021

@lpachter

I am late to the party (was on holidays), but have now read @lpachter's "Specious Art" paper as well as ~300 quote tweets/threads, played with the code, and can add my two cents.

Spoiler: I disagree with their conclusions. Some claims re t-SNE/UMAP are misleading. Thread. 🐘

https://twitter.com/lpachter/status/1431325969411821572

The paper has several parts and I have too many comments for a twitter thread, so here I will only focus on the core of the authors' argument against t-SNE/UMAP, namely Figures 2 and 3. We can discuss the rest some other time. [2/n]

In this part, Chari et al. claim that:

* t-SNE/UMAP preserve global and local structure very poorly;
* Purposefully silly embedding that looks like an elephant performs as well or even better;
* Even *untrained* neural network performs around as well.

[3/n]

Read 12 tweets

Dmitry Kobak

@hippopedoid

Jan 12, 2021

@KrishnaswamyLab

OK, I'll bite.

PHATE (nature.com/articles/s4158…) from @KrishnaswamyLab is like Isomap meeting Diffusion Maps: MDS on geo distances obtained via diffusion. Cool paper!

So let's test it on: (1) MNIST, (2) Tasic2018, (3) n=1.3mln from 10x. Does it work as well as promised? 🧐 [1/7]

https://twitter.com/KrishnaswamyLab/status/1346302977712254976

Here is MNIST.

PHATE finds the same 4/7/9 and 8/5/3 mega-clusters that are also emphasized by UMAP, but fails to separate some of the digits within mega-clusters, e.g. green & red (3 and 5) overlap a lot.

IMHO that's a clearly worse performance than t-SNE or UMAP. [2/7]

https://twitter.com/KrishnaswamyLab/status/1201937339574050822

Of course PHATE was designed for continuous data and that's where it's supposed to shine. But the original paper and tweets like this one and the one above make it look as if it hands-down outperforms t-SNE/UMAP for clustered data.

I'm unconvinced. [3/7]

https://twitter.com/KrishnaswamyLab/status/1201937339574050822

Read 7 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Dmitry Kobak

People who liked this thread also liked...

Try unrolling a thread yourself!

More from @hippopedoid

Dmitry Kobak

Dmitry Kobak

Dmitry Kobak

Dmitry Kobak

Dmitry Kobak

Dmitry Kobak

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!