François Fleuret
Research Scientist @meta (FAIR), Prof. @Unige_en, co-founder @nc_shape. I like reality.
Jan 24 6 tweets 2 min read
This being said, here is the TL;DR:

On the model architecture side, @deepseek_ai v3/r1 is a standard GPT that is a "causal decoder only", hence an auto-regressive model made of causal attention blocks. It is huge, with 671 billion parameters.

1/6 When you generate a sequence, each new token has to "look" at the K/V of the previous tokens, so they have to be cached.

(1) To reduce the memory footprint of that cache, they store only a low-dimensional projection of the Xs that enter each block (see the sketch below).

2/6
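For concreteness, here is a minimal sketch of that memory trade-off (not DeepSeek's actual MLA code; the module name, dimensions, and projections are mine): instead of caching full per-head K/V for every past token, cache a low-dimensional latent of X and re-expand it into K/V at attention time.

```python
import torch, torch.nn as nn

class LowRankKVCache(nn.Module):
    # Hypothetical sketch: cache a low-dimensional latent of each token's X
    # instead of its full per-head K/V, and re-expand when attending.
    def __init__(self, d_model=1024, d_latent=128, d_head=64, n_heads=16):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress X
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to V
        self.cache = []                                                # one latent per past token

    def append(self, x_t):              # x_t: (batch, d_model), the new token's activation
        self.cache.append(self.down(x_t))

    def keys_values(self):              # each returned tensor: (batch, T, n_heads * d_head)
        latents = torch.stack(self.cache, dim=1)
        return self.up_k(latents), self.up_v(latents)
```

The cache then holds d_latent floats per token and per block instead of 2 * n_heads * d_head, which is where the memory saving comes from.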
Jan 11 6 tweets 2 min read
That kind of visualization is valid for discrete "dimensions" and consists of putting "hyper"-rows next to each other, mixing the row indexes and the in-row indexes. It is IMO more confusing than anything, and certainly does not help to get an intuitive grasp.

1/6 The real thing is hard to grasp. E.g. a hyper-cube is "simply" a 3d cube extended along the 4th dimension by the same length.

The intersection of a 3m×3m×3m×3m hypercube with a 3d space moving along the 4th dimension at 1m/s would be a 3m×3m×3m cube appearing for 3s.

2/6
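A tiny sketch of that arithmetic, under the (assumed) convention that the hypercube occupies [0, 3m] on every axis:

```python
def slice_extent(t, side=3.0, speed=1.0):
    """Cross-section seen by a 3d space moving along the 4th axis at `speed`,
    slicing a side x side x side x side hypercube, at time t (seconds)."""
    w = speed * t                      # position of the slicing 3d space on the 4th axis
    if 0.0 <= w <= side:
        return (side, side, side)      # a full 3m x 3m x 3m cube
    return None                        # no intersection outside the window

# The cube is visible for side / speed = 3 seconds.
print([slice_extent(t) for t in (-1.0, 0.5, 2.9, 3.5)])
```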
Feb 11, 2024 11 tweets 3 min read
We often see people using the term "random variable" (RV), but its mathematical definition is unclear to most.

Here is an attempt at a TL;DR to give an intuition.

1/11

P.S. Okay, now that I have written it, I fear it won't help. If you want to define the notion of something "random", the natural strategy is to define a distribution, that is, in the finite case, a list of values / probabilities.

So for instance, the head / tail result of a coin flip would be (H, 0.5), (T, 0.5).

2/11
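A minimal sketch of the finite case, to fix the intuition (my own toy code, the names are arbitrary): the distribution is the list of values / probabilities, while a random variable is a function from a sample space to values, and its distribution is obtained by pushing the probabilities through that function.

```python
import random

omega = ["heads", "tails"]              # sample space
p = {"heads": 0.5, "tails": 0.5}        # probability of each outcome

def X(outcome):                         # a random variable: a function omega -> {0, 1}
    return 1 if outcome == "heads" else 0

# Distribution of X: P(X = x) is the total probability of the outcomes mapped to x
dist = {}
for o in omega:
    dist[X(o)] = dist.get(X(o), 0.0) + p[o]
print(dist)                             # {1: 0.5, 0: 0.5}

# Sampling X means sampling an outcome and applying the function to it
sample = X(random.choices(omega, weights=[p[o] for o in omega])[0])
```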
Jan 18, 2024 19 tweets 3 min read
Information Theory is awesome so here is a TL;DR about Shannon's entropy.

This field is about quantifying the amount of "information" contained in a signal and how much can be transmitted under certain conditions.

1/11 What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it gives exact bounds on what is possible or not.

The key concept is Shannon entropy.

2/11
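As a concrete reference point, here is a minimal sketch (my own toy code) of the entropy of a finite distribution, H(X) = -sum_x p(x) log2 p(x), measured in bits:

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum_x p(x) log2(p(x)) in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))      # 1.0 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))      # ~0.47 bits: a biased coin is more predictable
print(shannon_entropy([0.25] * 4))      # 2.0 bits: four equally likely symbols
```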
Jan 13, 2024 18 tweets 5 min read
Since these experiments have been popular, here is a recap that will from now on be the thread for updates.

The motivation for all this came from discussions at @neurips_conf with @tri_dao, @_albertgu, and @srush_nlp. What I took away from them was that the reason RNNs have been replaced with transformers is purely computational. The latter are more "GPU friendly" since, with enough cores, the O(T) operations can be done in O(1).
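To make that computational point concrete, here is a bare-bones sketch (my own illustration, not from those discussions): the RNN recurrence is an inherently sequential loop over the T steps, while causal attention over the same T positions reduces to a few matrix products that parallelize across positions.

```python
import torch

T, d = 16, 8
x = torch.randn(T, d)

# RNN-style recurrence: each step depends on the previous hidden state,
# so the T steps have to be computed one after the other.
W = torch.randn(d, d) * 0.1
h = torch.zeros(d)
hs = []
for t in range(T):
    h = torch.tanh(x[t] + h @ W)
    hs.append(h)

# Bare-bones causal attention (x used as queries, keys and values): the whole
# (T, T) score matrix is a single matrix product, so with enough cores all
# positions are processed in parallel.
scores = (x @ x.T) / d ** 0.5
mask = torch.tril(torch.ones(T, T)).bool()
att = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
out = att @ x
```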
Apr 24, 2022 12 tweets 4 min read
To investigate the ability of a GPT-like model to "understand geometrical composition", I made a minimalist CLEVR-like task on which I tested my own minimal GPT.

A thread! The task consists of random arrangements of up to five colored pixels in a 6x8 image, from which I generate a bunch of boolean geometrical properties.

Here are a few train samples.
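For readers who want to reproduce something similar, here is a hypothetical re-implementation sketch of such a data generator (not the exact code behind these experiments; the property names and sampling details are made up for illustration):

```python
import random

H, W = 6, 8
COLORS = ["red", "green", "blue", "yellow", "cyan"]

def sample_scene(max_pixels=5):
    # Place 2 to max_pixels distinctly colored pixels at distinct cells of the grid
    n = random.randint(2, max_pixels)
    cells = random.sample([(r, c) for r in range(H) for c in range(W)], n)
    return [(color, r, c) for color, (r, c) in zip(COLORS, cells)]

def properties(scene):
    # Boolean geometrical statements about pairs of pixels (illustrative choice)
    facts = []
    for col1, r1, c1 in scene:
        for col2, r2, c2 in scene:
            if col1 != col2:
                facts.append((f"{col1} is left of {col2}", c1 < c2))
                facts.append((f"{col1} is above {col2}", r1 < r2))
    return facts

scene = sample_scene()
print(scene)
print(properties(scene)[:4])
```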
Jun 6, 2020 5 tweets 3 min read
One more toyish example in @pytorch: double descent with polynomial regression. (thread) If we use this fitting on a piece-wise function, at first polynomials of increasing degree will tend to pass "more and more" through the samples, but result in a very irregular function. With 8 samples, degree 7 reaches train error ~0.
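A minimal sketch along these lines (not the exact script behind the thread; the piece-wise target and the minimum-norm least-squares fit are assumptions):

```python
import torch

torch.manual_seed(0)
n = 8
x = torch.linspace(-1, 1, n)
# An assumed piece-wise target: a step function
y = torch.where(x < 0, -0.5 * torch.ones_like(x), torch.ones_like(x))

for degree in range(1, 12):
    # Design matrix of monomials 1, x, ..., x^degree
    X = torch.stack([x ** k for k in range(degree + 1)], dim=1)
    # Minimum-norm least-squares fit (well defined even when degree + 1 > n)
    coeffs = torch.linalg.pinv(X) @ y.unsqueeze(1)
    train_err = ((X @ coeffs).squeeze(1) - y).pow(2).mean()
    print(f"degree {degree:2d}  train MSE {train_err.item():.2e}")
```

At degree 7 the polynomial interpolates the 8 samples, which is the ~0 train error mentioned above.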
May 19, 2020 6 tweets 4 min read
To illustrate attention mechanisms, I made a toy seq2seq task and implemented an attention layer from scratch. It worked beautifully. (thread) The toy task is to translate a 1d time series composed of two triangular impulses and two rectangular impulses so that, in each shape group, their heights are equalized to their average.
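A from-scratch single-head attention layer in the same spirit (a sketch with assumed dimensions, not the exact layer used for this experiment):

```python
import torch, torch.nn as nn

class AttentionLayer(nn.Module):
    def __init__(self, d_in, d_qk, d_v):
        super().__init__()
        self.to_q = nn.Linear(d_in, d_qk, bias=False)
        self.to_k = nn.Linear(d_in, d_qk, bias=False)
        self.to_v = nn.Linear(d_in, d_v, bias=False)

    def forward(self, x):                               # x: (batch, T, d_in)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Scaled dot-product attention over the whole sequence
        a = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return a @ v                                    # (batch, T, d_v)

layer = AttentionLayer(d_in=1, d_qk=16, d_v=1)
out = layer(torch.randn(4, 100, 1))                     # a batch of 1d time series
```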