Gabriele Berton
Jun 27
Video-XL (CVPR25) is a really cool paper that enables video understanding (with a VLM) on hour-long videos

The idea is to extract visual tokens (individually, from N frames of a video, with a visual encoder), and then, instead of passing all these tokens ... [1/4]
to the LLM (which would blow up the memory if the sequence is too long), they sequentially compress them (by a factor of M) into smaller representations, in the form of a KV-cache [2/4]
This means that when you need to create the last compressed visual token, the query only needs to attend to N/M key-value pairs instead of N

In practice, you sacrifice some parallelization to reduce memory usage

It is probably easier to understand how it works by looking at the figure [3/4]
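To make the mechanism concrete, here is a minimal sketch (in PyTorch) of chunk-wise compression; this is not the authors' implementation, and the module, the chunk size, and the learnable compression queries are my own assumptions:

```python
import torch
import torch.nn as nn

class ChunkwiseKVCompressor(nn.Module):
    """Sketch: compress N frame tokens into ~N/M summary tokens, chunk by chunk."""
    def __init__(self, dim=768, n_heads=12, chunk_size=256, n_comp=16):
        super().__init__()
        # a few learnable queries summarize each chunk of chunk_size tokens (here M = 256/16)
        self.comp_queries = nn.Parameter(torch.randn(n_comp, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.chunk_size = chunk_size

    def forward(self, frame_tokens):               # (B, N, dim) tokens from the visual encoder
        B = frame_tokens.size(0)
        cache = []                                  # compressed summaries accumulated so far
        for chunk in frame_tokens.split(self.chunk_size, dim=1):
            # keys/values = all summaries so far + the current raw chunk,
            # so each step attends to roughly N/M entries instead of all N raw tokens
            kv = torch.cat(cache + [chunk], dim=1)
            q = self.comp_queries.unsqueeze(0).expand(B, -1, -1)
            summary, _ = self.attn(q, kv, kv)       # (B, n_comp, dim)
            cache.append(summary)
        return torch.cat(cache, dim=1)              # short sequence handed to the LLM

x = torch.randn(1, 2048, 768)                       # e.g. 2048 visual tokens
print(ChunkwiseKVCompressor()(x).shape)             # torch.Size([1, 128, 768])
```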
And because I like to draw comparisons between vision and NLP, Video-XL is somewhat similar to DeepMind's LLM paper called "Long Context In-Context Compression by Getting to the Gist of Gisting" [4/4]

Video-XL: arxiv.org/abs/2409.14485

Gist of Gisting: arxiv.org/abs/2504.08934


More from @gabriberton

Jun 10
Want to try a SOTA image localization model, on your own images?

We'll be at #CVPR presenting a demo of MegaLoc!

With our demo you can localize photos from San Francisco using MegaLoc, and it works in real time!
MegaLoc is trained on ~10M images from 5 different datasets, combining best practices from Visual Place Recognition models

It is SOTA on countless datasets across multiple tasks (landmark retrieval, VPR, visual localization), and is robust to OOD images like night and underwater shots!
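For context on how a model like this is used: visual place recognition boils down to image retrieval against a geotagged database. Here is a minimal sketch, where the embedding model is a random stand-in and not the actual MegaLoc API:

```python
import torch
import torch.nn.functional as F

# Stand-in embedding model: returns a random L2-normalized descriptor per image.
# A real VPR model (e.g. MegaLoc) would replace this; the call below is NOT its API.
def embed(images):
    return F.normalize(torch.randn(images.size(0), 512), dim=-1)

# 1) Offline: embed a geotagged database once
db_images = torch.randn(1000, 3, 224, 224)     # placeholders for real photos
db_gps = torch.rand(1000, 2)                   # (lat, lon) of each database image
db_desc = embed(db_images)                     # (1000, 512)

# 2) Online: embed the query and return the GPS of its nearest neighbor
query = torch.randn(1, 3, 224, 224)
sim = embed(query) @ db_desc.T                 # cosine similarity, (1, 1000)
best = sim.argmax(dim=-1)
print("predicted location:", db_gps[best])
```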
MegaLoc is so robust that most of its "mistakes" are... mistakes in the labels. With it we even found wrong GPS labels on images from Google Street View datasets!

See this example from Pitts30k-val (Google Street View): it is obviously the same place, but not according to the GPS!
May 15
HuggingFace released a nice blog post about the current state of VLMs

Here's a summary, covering recent trends, specialized capabilities, agents, video LMs, new alignment techniques, and HF's fav VLMs [1/8]

Recent trends:
1) any-to-any models, with multi-modal input and output. An example is Qwen 2.5 Omni
2) reasoning models: pretty much a ViT with a reasoning LLM on top. Some models can reason and crop the image accordingly, o3 style
3) Small VLMs, like HF's SmolVLM2, with ~1B parameters [2/8]
4) MoE VLMs: usually using an MoE model as decoder
5) Vision Language Action models (VLA): we've all seen the videos of a robot with π0.5 (a VLA) cleaning up messy rooms [3/8]
May 14
While everyone is hating on Meta for the Llama 4 debacle, they dropped some very impressive CLIP-like models and VLMs

They came out in twin papers, released on the same day

Here's a summary, some honest thoughts, and some things I personally liked and disliked about them [1/n]
Results are impressive. In both papers.

The CLIP-like models are an engineering feat, trained with standard CLIP-style image-text alignment with known best practices: progressively increasing resolution, LAMB optimizer, strong augmentation, and lots of data. [2/n]
Unlike previous work (CLIP, SigLIP, AIMv2), they add a second training step with video-text alignment: for each video, they sample 8 frames, pass them through the ViT, average the 8 embeddings, and align them with the text. The text is a combination of synthetic ... [3/n]
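A rough sketch of that video-text alignment step: the encoders are generic stand-ins and the symmetric InfoNCE loss is the standard CLIP recipe, not necessarily Meta's exact implementation:

```python
import torch
import torch.nn.functional as F

def video_text_loss(frames, captions, image_encoder, text_encoder, temperature=0.07):
    """frames: (B, 8, 3, H, W), 8 sampled frames per video; captions: tokenized text."""
    B, T = frames.shape[:2]
    frame_emb = image_encoder(frames.flatten(0, 1))      # embed each frame with the ViT
    video_emb = frame_emb.view(B, T, -1).mean(dim=1)     # average the 8 frame embeddings
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_encoder(captions), dim=-1)
    logits = video_emb @ text_emb.T / temperature        # (B, B) video-text similarities
    targets = torch.arange(B, device=logits.device)      # matching pairs are on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```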
Apr 28
Ok there's a new paper in my top 3 favorites

Vision transformers need registers

Clear problem, elegant solution, well written, easy to understand, good results, limitations included.

No fancy losses or layers. No equation (at all!)

Here's a short summary: (1/4)
ViTs benefit from using tokens that encode global information, like the CLS. Having multiple such "global tokens" helps the transformer; however, there is only one CLS: the ViT then "secretly" picks some low-content patches/tokens (for example patches of sky) to ... (2/4)
stuff them with global information. This comes at a cost, as such tokens "forget" their local information. Therefore the paper proposes adding some extra tokens, called registers, which are simply discarded after the final layer, and help the CLS carry information about ... (3/4)
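A minimal sketch of what registers look like in practice; the dimensions, depth, and number of registers below are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, dim=768, depth=12, n_heads=12, n_registers=4, n_patches=196):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        # extra learnable tokens: scratch space for global information
        self.registers = nn.Parameter(torch.zeros(1, n_registers, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), depth)
        self.n_registers = n_registers

    def forward(self, patch_tokens):                      # (B, 196, dim)
        B = patch_tokens.size(0)
        x = torch.cat([self.cls.expand(B, -1, -1), patch_tokens], dim=1) + self.pos
        x = torch.cat([x, self.registers.expand(B, -1, -1)], dim=1)
        x = self.blocks(x)
        x = x[:, : -self.n_registers]                     # registers are simply discarded
        return x[:, 0], x[:, 1:]                          # CLS token, patch tokens
```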
Apr 27
I'm fascinated by similarities between papers on seemingly unrelated tasks

For example, LightGlue (an image matching paper from ETH) and LayerSkip (an LLM paper from Meta)

Both papers do Early Exit: if an intermediate layer is confident about its prediction, skip the final layers
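A minimal sketch of the shared idea; the confidence test (a simple max-probability threshold) is my illustrative choice, and each paper uses its own exit criterion:

```python
import torch
import torch.nn as nn

def forward_with_early_exit(x, layers, classifier, threshold=0.9):
    """Run layers sequentially, but stop as soon as the prediction looks confident."""
    for i, layer in enumerate(layers):
        x = layer(x)
        probs = classifier(x).softmax(dim=-1)
        if probs.max() > threshold:          # confident enough: skip the remaining layers
            return probs, i + 1
    return probs, len(layers)

layers = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(12)])
classifier = nn.Linear(64, 10)
probs, used = forward_with_early_exit(torch.randn(1, 64), layers, classifier)
print(f"exited after {used} of 12 layers")
```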
I do believe the two papers evolved independently, though there's a chance that LayerSkip's authors (October 2024) got the idea from LightGlue (April 2023)

Obviously the differences between the two papers are countless, but I like that the underlying idea is similar
Apr 22
How to select pre-training data for LLMs?

Two papers came out last week from AllenAI and Nvidia that do it in a similar way, building on the intuition that good data is good regardless of the size of the LLM.

This intuition can be used to select good data in a cheap manner...
(training a large LLM on many candidate subsets would be prohibitively expensive).

Here are some similarities and differences between these two papers:

Both papers split the whole available training data into subsets, train a small LLM on each subset, and see how it performs: its performance is used as a proxy for data quality.
The main difference is that DataDecide splits the data according to its data source (usually training datasets are a collection of multiple datasets), while CLIMB creates clusters from each document's embeddings (meaning...
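A crude sketch of the shared recipe; the subset names, proxy trainer, and evaluation function below are placeholders, not either paper's actual pipeline:

```python
def rank_subsets(subsets, train_small_llm, evaluate):
    """Train a cheap proxy model on each candidate subset and rank subsets by its score,
    using the proxy's performance as a stand-in for data quality."""
    scores = {}
    for name, data in subsets.items():
        proxy = train_small_llm(data)        # e.g. a small LLM, cheap to train
        scores[name] = evaluate(proxy)       # e.g. held-out benchmark accuracy
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# hypothetical usage: keep the best-scoring sources/clusters for the big training run
# subsets = {"web": web_docs, "code": code_docs, "papers": paper_docs}
# best = [name for name, _ in rank_subsets(subsets, train_small_llm, evaluate)[:2]]
```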
