François Fleuret
Research Scientist @meta (FAIR), Prof. @Unige_en, co-founder @nc_shape. I like reality.
Jan 24 6 tweets 2 min read
This being said, here is the TL;DR:

On the model architecture side, @deepseek_ai v3/r1 is a standard GPT that is a "causal decoder only", hence an auto-regressive model made of causal attention blocks. It is huge, with 671 billion parameters.

1/6 When you generate a sequence, each new token has to "look" at the K/V of the previous tokens, so they have to be cached.

(1) To reduce the memory footprint of that cache, they store only a low-dimensional projection of the Xs that enter each block (see the sketch below).

2/6
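For concreteness, here is a minimal sketch of that memory trade-off (not DeepSeek's actual MLA code; the module name, dimensions, and projections are mine): instead of caching full per-head K/V for every past token, cache a low-dimensional latent of X and re-expand it into K/V at attention time.

```python
import torch, torch.nn as nn

class LowRankKVCache(nn.Module):
    # Hypothetical sketch: cache a low-dimensional latent of each token's X
    # instead of its full per-head K/V, and re-expand when attending.
    def __init__(self, d_model=1024, d_latent=128, d_head=64, n_heads=16):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress X
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to V
        self.cache = []                                                # one latent per past token

    def append(self, x_t):              # x_t: (batch, d_model), the new token's activation
        self.cache.append(self.down(x_t))

    def keys_values(self):              # each returned tensor: (batch, T, n_heads * d_head)
        latents = torch.stack(self.cache, dim=1)
        return self.up_k(latents), self.up_v(latents)
```

The cache then holds d_latent floats per token and per block instead of 2 * n_heads * d_head, which is where the memory saving comes from.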
Jan 11 6 tweets 2 min read
That kind of visualization is valid for discrete "dimensions" and consists of putting "hyper"-rows next to each other, mixing the row indexes and the in-row indexes. It is IMO more confusing than anything, and certainly does not help to get an intuitive grasp.

1/6 The real thing is hard to grasp. E.g. a hyper-cube is "simply" a 3d cube extended along the 4th dimension by the same length.

The intersection of a 3m×3m×3m×3m hypercube with a 3d space moving along the 4th dimension at 1m/s would be a 3m×3m×3m cube appearing for 3s.

2/6
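A tiny sketch of that arithmetic, under the (assumed) convention that the hypercube occupies [0, 3m] on every axis:

```python
def slice_extent(t, side=3.0, speed=1.0):
    """Cross-section seen by a 3d space moving along the 4th axis at `speed`,
    slicing a side x side x side x side hypercube, at time t (seconds)."""
    w = speed * t                      # position of the slicing 3d space on the 4th axis
    if 0.0 <= w <= side:
        return (side, side, side)      # a full 3m x 3m x 3m cube
    return None                        # no intersection outside the window

# The cube is visible for side / speed = 3 seconds.
print([slice_extent(t) for t in (-1.0, 0.5, 2.9, 3.5)])
```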
Feb 11, 2024 11 tweets 3 min read
We often see people using the term "random variable" (RV), but its mathematical definition is unclear to most.

Here is an attempt at a TL;DR to give an intuition.

1/11

P.S. Okay, now that I have written it, I fear it won't help. If you want to define the notion of something "random", the natural strategy is to define a distribution, that is, in the finite case, a list of values / probabilities.

So for instance, the head / tail result of a coin flip would be (H, 0.5), (T, 0.5).

2/11
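A minimal sketch of the finite case, to fix the intuition (my own toy code, the names are arbitrary): the distribution is the list of values / probabilities, while a random variable is a function from a sample space to values, and its distribution is obtained by pushing the probabilities through that function.

```python
import random

omega = ["heads", "tails"]              # sample space
p = {"heads": 0.5, "tails": 0.5}        # probability of each outcome

def X(outcome):                         # a random variable: a function omega -> {0, 1}
    return 1 if outcome == "heads" else 0

# Distribution of X: P(X = x) is the total probability of the outcomes mapped to x
dist = {}
for o in omega:
    dist[X(o)] = dist.get(X(o), 0.0) + p[o]
print(dist)                             # {1: 0.5, 0: 0.5}

# Sampling X means sampling an outcome and applying the function to it
sample = X(random.choices(omega, weights=[p[o] for o in omega])[0])
```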
Jan 18, 2024 19 tweets 3 min read
Information Theory is awesome so here is a TL;DR about Shannon's entropy.

This field is about quantifying the amount of "information" contained in a signal and how much can be transmitted under certain conditions.

1/11 What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it gives exact bounds on what is possible or not.

The key concept is Shannon entropy.

2/11
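As a concrete reference point, here is a minimal sketch (my own toy code) of the entropy of a finite distribution, H(X) = -sum_x p(x) log2 p(x), measured in bits:

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum_x p(x) log2(p(x)) in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))      # 1.0 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))      # ~0.47 bits: a biased coin is more predictable
print(shannon_entropy([0.25] * 4))      # 2.0 bits: four equally likely symbols
```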
Jan 13, 2024 18 tweets 5 min read
Since these experiments have been popular, here is a recap that will from now on be the thread for updates.

The motivation for all this came from discussions at @neurips_conf with @tri_dao, @_albertgu, and @srush_nlp. What I took away from them was that the reason RNNs have been replaced with transformers is purely computational. The latter are more "GPU friendly" since, with enough cores, the O(T) operations can be done in O(1).
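To make that computational point concrete, here is a bare-bones sketch (my own illustration, not from those discussions): the RNN recurrence is an inherently sequential loop over the T steps, while causal attention over the same T positions reduces to a few matrix products that parallelize across positions.

```python
import torch

T, d = 16, 8
x = torch.randn(T, d)

# RNN-style recurrence: each step depends on the previous hidden state,
# so the T steps have to be computed one after the other.
W = torch.randn(d, d) * 0.1
h = torch.zeros(d)
hs = []
for t in range(T):
    h = torch.tanh(x[t] + h @ W)
    hs.append(h)

# Bare-bones causal attention (x used as queries, keys and values): the whole
# (T, T) score matrix is a single matrix product, so with enough cores all
# positions are processed in parallel.
scores = (x @ x.T) / d ** 0.5
mask = torch.tril(torch.ones(T, T)).bool()
att = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
out = att @ x
```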
Apr 24, 2022 12 tweets 4 min read
To investigate the ability of a GPT-like model to "understand geometrical composition", I made a minimalist CLEVR-like task on which I tested my own minimal GPT.

A thread! The task consists of random arrangements of up to five colored pixels in a 6x8 image, from which I generate a bunch of boolean geometrical properties.

Here are a few train samples.
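For readers who want to reproduce something similar, here is a hypothetical re-implementation sketch of such a data generator (not the exact code behind these experiments; the property names and sampling details are made up for illustration):

```python
import random

H, W = 6, 8
COLORS = ["red", "green", "blue", "yellow", "cyan"]

def sample_scene(max_pixels=5):
    # Place 2 to max_pixels distinctly colored pixels at distinct cells of the grid
    n = random.randint(2, max_pixels)
    cells = random.sample([(r, c) for r in range(H) for c in range(W)], n)
    return [(color, r, c) for color, (r, c) in zip(COLORS, cells)]

def properties(scene):
    # Boolean geometrical statements about pairs of pixels (illustrative choice)
    facts = []
    for col1, r1, c1 in scene:
        for col2, r2, c2 in scene:
            if col1 != col2:
                facts.append((f"{col1} is left of {col2}", c1 < c2))
                facts.append((f"{col1} is above {col2}", r1 < r2))
    return facts

scene = sample_scene()
print(scene)
print(properties(scene)[:4])
```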
Jun 6, 2020 5 tweets 3 min read
One more toyish example in @pytorch: double descent with polynomial regression. (thread) If we use this fitting on a piece-wise function, at first polynomials of increasing degree will tend to pass "more and more" through the samples, but result in a very irregular function. With 8 samples, degree 7 reaches train error ~0.
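A minimal sketch along these lines (not the exact script behind the thread; the piece-wise target and the minimum-norm least-squares fit are assumptions):

```python
import torch

torch.manual_seed(0)
n = 8
x = torch.linspace(-1, 1, n)
# An assumed piece-wise target: a step function
y = torch.where(x < 0, -0.5 * torch.ones_like(x), torch.ones_like(x))

for degree in range(1, 12):
    # Design matrix of monomials 1, x, ..., x^degree
    X = torch.stack([x ** k for k in range(degree + 1)], dim=1)
    # Minimum-norm least-squares fit (well defined even when degree + 1 > n)
    coeffs = torch.linalg.pinv(X) @ y.unsqueeze(1)
    train_err = ((X @ coeffs).squeeze(1) - y).pow(2).mean()
    print(f"degree {degree:2d}  train MSE {train_err.item():.2e}")
```

At degree 7 the polynomial interpolates the 8 samples, which is the ~0 train error mentioned above.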
May 19, 2020 6 tweets 4 min read
To illustrate attention mechanisms, I made a toy seq2seq task and implemented an attention layer from scratch. It worked beautifully. (thread) The toy task is to translate a 1d time series composed of two triangular impulses and two rectangular impulses so that, in each shape group, their heights are equalized to their average.
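A from-scratch single-head attention layer in the same spirit (a sketch with assumed dimensions, not the exact layer used for this experiment):

```python
import torch, torch.nn as nn

class AttentionLayer(nn.Module):
    def __init__(self, d_in, d_qk, d_v):
        super().__init__()
        self.to_q = nn.Linear(d_in, d_qk, bias=False)
        self.to_k = nn.Linear(d_in, d_qk, bias=False)
        self.to_v = nn.Linear(d_in, d_v, bias=False)

    def forward(self, x):                               # x: (batch, T, d_in)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Scaled dot-product attention over the whole sequence
        a = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return a @ v                                    # (batch, T, d_v)

layer = AttentionLayer(d_in=1, d_qk=16, d_v=1)
out = layer(torch.randn(4, 100, 1))                     # a batch of 1d time series
```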