Gabriele Berton
Postdoc @Amazon working on MLLMs - ex @CarnegieMellon @PoliTOnews @IITalk
Jan 31
Here is a Google NeurIPS paper on how to improve LLM results at virtually no cost:

Scaling Embedding Layers in Language Models

Normal LLMs have a fixed vocabulary, usually around 200k tokens, and each token has its own embedding. [1/N]

These embeddings do not depend on the context, and this is suboptimal, because tokens with multiple meanings are tied to a single embedding.

For example the word "right" means different things when saying "turn right" or "you're right": intuitively, being able to assign different embeddings to the token "right" depending on context helps the LLM.
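To make the idea concrete, here is a toy PyTorch sketch of context-dependent input embeddings: each token's static embedding is augmented with a second embedding looked up by its preceding bigram, so "right" after "turn" and "right" after "you're" get different vectors. This is only an illustration of the general idea, not the paper's actual construction; the bigram hashing, bucket count, and sizes are made up for the example.

```python
import torch
import torch.nn as nn

class ContextualEmbedding(nn.Module):
    """Toy contextual embedding: base per-token table + a (hashed) bigram table."""

    def __init__(self, vocab_size=50_000, dim=128, num_bigram_buckets=200_000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)            # standard per-token embeddings
        self.bigram_emb = nn.Embedding(num_bigram_buckets, dim)   # larger, context-keyed table
        self.num_bigram_buckets = num_bigram_buckets

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        base = self.token_emb(token_ids)
        prev = torch.roll(token_ids, shifts=1, dims=1)
        prev[:, 0] = token_ids[:, 0]                   # first position has no previous token
        bucket = (prev * 1_000_003 + token_ids) % self.num_bigram_buckets
        return base + self.bigram_emb(bucket)          # same token, different vector per context

emb = ContextualEmbedding()
x = torch.randint(0, 50_000, (2, 16))
print(emb(x).shape)  # torch.Size([2, 16, 128])
```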
Dec 12, 2025
NeurIPS 2025 paper by the Qwen team:
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

TLDR: in GRPO-like RLVR you should apply the loss only to the 20% highest entropy tokens. [1/7]

Background:
GRPO takes as input questions and answers. For each question we generate 8 answers with CoT, and mark each one as correct or wrong based on its final answer. We then apply the loss to learn from the correct generated answers and unlearn from the wrong ones. [2/7]
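Here is a rough PyTorch sketch of where the entropy mask enters. It ignores the ratio/clipping and KL terms of actual GRPO and assumes the group-normalized advantages are already computed; it only shows "compute per-token entropy, keep the top 20%, apply the loss there".

```python
import torch
import torch.nn.functional as F

def entropy_masked_pg_loss(logits, token_ids, advantages, top_frac=0.2):
    # logits:     (batch, seq_len, vocab) policy logits over the generated answers
    # token_ids:  (batch, seq_len)        the sampled tokens
    # advantages: (batch,)                group-normalized reward of each answer
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)          # (batch, seq_len)

    # keep only the top 20% highest-entropy tokens of each sequence
    k = max(1, int(entropy.shape[1] * top_frac))
    kth_value = entropy.topk(k, dim=1).values[:, -1:]
    mask = (entropy >= kth_value).float()

    # REINFORCE-style surrogate on the sampled tokens, weighted by the advantage
    tok_logp = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    loss = -(advantages.unsqueeze(1) * tok_logp)
    return (loss * mask).sum() / mask.sum()

loss = entropy_masked_pg_loss(torch.randn(8, 64, 32_000),
                              torch.randint(0, 32_000, (8, 64)),
                              torch.randn(8))
```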
Dec 11, 2025
Just like Lex, I tried Tesla’s world model at NeurIPS

Here is what I learned (obviously they couldn't confirm or deny my hypotheses 😉)

At Tesla's booth, you could drive in the world model in real time, like a video game. It is orders of magnitude faster than other WMs. [1/4]

It felt like ~10 FPS, which suggests they might avoid standard diffusion entirely or use a highly distilled few-step model.
Reaction time was quick, with the bottleneck likely being streaming rather than inference (it wasn't running locally). [2/4]
Jul 3, 2025
Here's a cool paper using LLMs for lossless text compression, in what they call LLMZip, which outperforms SOTA text compression methods

The idea is very intuitive

Given a sentence to compress, like "My first attempt", they feed the first 2 tokens ("My" and " first") to... [1/6]

... the LLM, predict the logits for the next token, and sort tokens by these logits, which in this case would be [" test", " time", " attempt", ...]. Now you can encode the token " attempt" with its position in the ranking, in this case the number 2. [2/6]
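A small sketch of that rank step (assuming a Hugging Face-style causal LM that returns .logits; the real LLMZip then feeds the rank stream to a standard entropy coder, which is omitted here):

```python
import torch

@torch.no_grad()
def text_to_ranks(model, token_ids):
    """Encode: replace each token with its rank in the LLM's sorted prediction."""
    logits = model(token_ids.unsqueeze(0)).logits[0]       # (seq_len, vocab)
    ranks = []
    for pos in range(token_ids.shape[0] - 1):
        order = logits[pos].argsort(descending=True)       # tokens sorted by logit
        ranks.append((order == token_ids[pos + 1]).nonzero().item())
    return ranks                                           # predictable tokens -> tiny ranks

@torch.no_grad()
def ranks_to_text(model, prefix_ids, ranks):
    """Decode: replay the model and pick the token at each stored rank."""
    ids = prefix_ids.clone()
    for r in ranks:
        logits = model(ids.unsqueeze(0)).logits[0, -1]
        order = logits.argsort(descending=True)
        ids = torch.cat([ids, order[r].unsqueeze(0)])
    return ids
```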
Jun 27, 2025
Video-XL (CVPR25) is a really cool paper that enables video understanding (with a VLM) on hour-long videos

The idea is to extract visual tokens (individually from N frames of a video with a visual encoder), and then instead of passing all these tokens ... [1/4]

... to the LLM (which would blow up the memory if the sequence is too long), they sequentially compress them (by a factor of M) into smaller representations, in the form of KV-cache. [2/4]
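As a very rough picture of what "compress by a factor of M" means for the sequence length (this is NOT Video-XL's learned module, which compresses into KV-cache; it's just a pooling stand-in to show the memory math):

```python
import torch
from einops import rearrange

def compress_visual_tokens(tokens, factor_m=8):
    # tokens: (num_frames * tokens_per_frame, dim) visual tokens from the encoder
    seq_len, dim = tokens.shape
    pad = (-seq_len) % factor_m
    if pad:
        tokens = torch.cat([tokens, tokens.new_zeros(pad, dim)])
    # group M consecutive tokens and merge them into one representation
    return rearrange(tokens, "(n m) d -> n m d", m=factor_m).mean(dim=1)

vis = torch.randn(100 * 256, 1024)            # 100 frames x 256 tokens per frame
print(compress_visual_tokens(vis, 8).shape)   # torch.Size([3200, 1024]) instead of 25600
```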
Jun 10, 2025
Want to try a SOTA image localization model, on your own images?

We'll be at #CVPR presenting a demo of MegaLoc!

With our demo you can localize photos from San Francisco using MegaLoc, a SOTA image localization model, and it works in real time!

MegaLoc is trained on ~10M images from 5 different datasets, combining best practices from Visual Place Recognition models.

It is SOTA on countless datasets on multiple tasks (landmark retrieval, VPR, visual localization), and is robust to OOD images like night and underwater!
May 15, 2025
HuggingFace released a nice blog post about the current state of VLMs

Here's a summary, covering recent trends, specialized capabilities, agents, video LMs, new alignment techniques, and HF's fav VLMs [1/8]

Recent trends:
1) any-to-any models, with multi-modal input and output. An example is Qwen 2.5 Omni
2) reasoning models: pretty much a ViT with a reasoning LLM on top. Some models can reason and crop the image accordingly, o3 style
3) Small VLMs, like HF's SmolVLM2, with ~1B parameters [2/8]
May 14, 2025
While everyone is hating on Meta for the Llama 4 debacle, they dropped some very impressive CLIP-like models and VLMs

They came out in two twin papers, released on the same day

Here's a summary, some honest thoughts, and some things I personally liked and disliked about them [1/n]
Results are impressive. In both papers.

The CLIP-like models are an engineering feat, trained with standard CLIP-style image-text alignment and known best practices: progressively increasing resolution, LAMB optimizer, strong augmentation, and lots of data. [2/n]
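For reference, "standard CLIP-style image-text alignment" boils down to this symmetric contrastive loss (a generic sketch, not Meta's exact training code):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, temperature=0.07):
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature                   # (batch, batch) similarities
    targets = torch.arange(img.shape[0], device=img.device)
    # matching pairs are the positives, every other pair in the batch is a negative
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```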
Apr 28, 2025
Ok there's a new paper in my top 3 favorites

Vision transformers need registers

Clear problem, elegant solution, well written, easy to understand, good results, limitations included.

No fancy losses or layers. No equation (at all!)

Here's a short summary: (1/4)

ViTs benefit from using tokens that encode global information, like the CLS. Having multiple such "global tokens" helps the transformer; however, there is only one CLS: the ViT then "secretly" chooses some low-content patches/tokens (for example patches of sky) to ... (2/4)
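The fix is as simple as it sounds. A minimal sketch (patch/positional embeddings omitted): append a few learnable register tokens to the sequence, run the transformer as usual, and throw their outputs away at the end.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, dim=768, num_registers=4, depth=12):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.num_registers = num_registers
        block = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, patch_tokens):                       # (batch, num_patches, dim)
        b = patch_tokens.shape[0]
        x = torch.cat([self.cls.expand(b, -1, -1),
                       self.registers.expand(b, -1, -1),   # dedicated slots for global info
                       patch_tokens], dim=1)
        x = self.encoder(x)
        return x[:, 0], x[:, 1 + self.num_registers:]      # CLS + patches; registers are dropped
```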
Apr 27, 2025
I'm fascinated by similarities between papers on seemingly unrelated tasks

For example LightGlue (image matching paper from ETH) and LayerSkip (LLM paper from Meta)

Both papers do Early Exit: if an intermediate layer is confident about its prediction, skip the final layers.
I do believe the two papers evolved independently, though there's a chance that LayerSkip's authors (October 2024) got the idea from LightGlue (April 2023)

Obviously the differences between the two papers are countless, but I like that the underlying idea is similar
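The shared skeleton, stripped of everything specific to each paper (LightGlue gates on matching confidence, LayerSkip pairs this with self-speculative decoding; here it's just a generic classifier head):

```python
import torch
import torch.nn as nn

def forward_with_early_exit(layers, head, x, confidence=0.95):
    """After each layer, exit as soon as the prediction head is confident enough."""
    for depth, layer in enumerate(layers):
        x = layer(x)
        probs = head(x).softmax(dim=-1)
        if probs.max().item() >= confidence:
            return probs, depth + 1            # skipped the remaining layers
    return probs, len(layers)                  # ran the full stack

layers = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(12)])
head = nn.Linear(64, 10)
probs, used = forward_with_early_exit(layers, head, torch.randn(1, 64))
print(f"exited after {used} layers")
```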
Apr 22, 2025
How to select pre-training data for LLMs?

Two papers came out last week from AllenAI and Nvidia that do it in a similar way, building on the intuition that good data is good regardless of the size of the LLM.

This intuition can be used to select good data in a cheap manner (training a large LLM on many subsets would be unfeasibly expensive).

Here are some similarities and differences between these two papers:

Both papers split the whole available training data into subsets, train a small LLM on the subsets, and see how this performs: its...
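In pseudocode, the shared recipe looks roughly like this (train_small_lm and evaluate are hypothetical placeholders, not functions from either paper):

```python
def rank_data_subsets(subsets, train_small_lm, evaluate):
    """Train a cheap proxy LM on each candidate subset and rank subsets by how
    well the proxy does on a held-out evaluation (lower loss = better data)."""
    scores = {}
    for name, data in subsets.items():
        proxy = train_small_lm(data)       # e.g. a ~100M-parameter model, cheap to train
        scores[name] = evaluate(proxy)     # e.g. held-out loss
    return sorted(scores, key=scores.get)  # best subsets first
```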
Dec 19, 2024
Libraries and tools that every deep learning project should use: loguru, tqdm, torchmetrics, einops, python 3.11, black. Optional: prettytable. Good for debugging: lovely_tensors. Any other ones I've missed?

Below a few words on each of them:

loguru: a nice logging library. With a few lines of initialization you can call info() and debug() functions that print to stdout and log files without having to pass logger objects around. Also, you can set it to log the error traceback in case your code crashes (see the sketch below).
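A minimal sketch of that setup (the file name and rotation size are just examples):

```python
from loguru import logger

# one-time setup: everything below also goes to a rotating log file
logger.add("run_{time}.log", level="DEBUG", rotation="10 MB")

logger.info("starting training")
logger.debug("batch size = {}", 32)

@logger.catch            # logs the full traceback if main() crashes
def main():
    return 1 / 0

main()
```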