Kimi-VL-A3B-Thinking is the first capable open-source reasoning VLM, and it ships with an MIT license ❤️
> it has only 2.8B activated params 👏
> it's agentic 🔥
> surpasses gpt-4o
I've put it to the test (see below ⤵️)
Try it:
This model consists of a dynamic-resolution MoonViT encoder, a projection layer, and a 16B MoE decoder (with 2.8B active params)
the paper introduces an interesting pre-training pipeline to handle long context and the model saw 4.4T tokens huggingface.co/spaces/moonsho…
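if you want to poke at it locally, here's a minimal loading sketch (mine, not from the paper); it assumes the checkpoint is `moonshotai/Kimi-VL-A3B-Thinking` and that its custom modelling code loads via `trust_remote_code`:

```python
# Minimal loading sketch (my addition, not from the thread). Assumes the
# Hugging Face repo id "moonshotai/Kimi-VL-A3B-Thinking" and that its custom
# modelling code loads via trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# Printing the module tree should reveal the three parts described above:
# the MoonViT vision tower, the projector, and the MoE language model.
print(model)
```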
we'll give this model a test on agentic capabilities, but here's an example from the paper:
people who are baffled by DeepSeek have been and still are sleeping on Qwen, InternLM, ByteDance and Tencent
here's a couple of fan-favorite models from them 🪭
Qwen2VL: best open-source vision language model
here's a demo for the 7B; use the 72B for longer context length and better performance huggingface.co/spaces/Ganymed…
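for reference, a quick inference sketch using the transformers Qwen2-VL integration; the repo id and image URL below are placeholders I picked, not something from the thread:

```python
# Inference sketch for the 7B instruct checkpoint (assumed repo id
# "Qwen/Qwen2-VL-7B-Instruct"); swap in the 72B checkpoint for longer context.
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# placeholder image URL
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```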
Qwen released a new series of 1M context length models yesterday
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗
The models can handle vision-language understanding and visual referring tasks (referring segmentation) for both images and videos ⏯️
take a look 🧶
The models come in 1B, 4B and 8B sizes; they use InternVL2.5 as the base architecture, with Qwen2, Qwen2.5 or InternLM2 as the language model part (depending on the checkpoint)
The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates their outputs and feeds them into the LLM 💬
the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ⤵️
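to make that flow concrete, here's a toy, self-contained sketch of the idea (stand-in modules only, not the actual Sa2VA code or API):

```python
# Toy sketch of the Sa2VA-style flow described above: one encoder per
# modality, concatenation into a single token sequence for the LLM, and the
# segmentation-token state handed to a mask decoder (SAM2 in the real model).
# Every module and shape here is a stand-in, not the real implementation.
import torch
import torch.nn as nn

d = 256  # toy hidden size

# stand-in encoders, one per modality
encoders = nn.ModuleDict({
    "image": nn.Linear(32, d),
    "video": nn.Linear(32, d),
    "text": nn.Embedding(1000, d),
    "visual_prompt": nn.Linear(4, d),
})
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2
)
mask_decoder = nn.Linear(d, 64 * 64)  # stand-in for SAM2's mask head

# toy inputs
image_feats = torch.randn(1, 196, 32)        # image patch features
video_feats = torch.randn(1, 8 * 49, 32)     # frame-patch features
text_ids = torch.randint(0, 1000, (1, 16))   # prompt token ids
box_prompt = torch.randn(1, 1, 4)            # a visual prompt (e.g. a box)

# 1) encode each modality separately, 2) concatenate into one sequence
tokens = torch.cat([
    encoders["image"](image_feats),
    encoders["video"](video_feats),
    encoders["text"](text_ids),
    encoders["visual_prompt"](box_prompt),
], dim=1)

# 3) the LLM processes the joint sequence; pretend the last position is the
#    [SEG] token whose hidden state the mask decoder turns into a mask
hidden = llm(tokens)
seg_state = hidden[:, -1]
mask_logits = mask_decoder(seg_state).view(1, 64, 64)
print(mask_logits.shape)  # torch.Size([1, 64, 64])
```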
NVIDIA solved physics and open-sourced it? Can we just build our own autonomous robots now? 🤯
They released Cosmos: a new family of open world foundation models (WFMs) 🌌
Unwrapping the release and why it's so revolutionary 🧶
They have released ten models:
- four autoregressive video2world models
- four diffusion-based video2world and text2world models
- prompt upsampler
- content safety model
World Foundation Models (WFMs) are pre-trained on open-world video data; you can then fine-tune them on your specific application with fewer labels (be it autonomous driving or robotic arms)
This release matters so much for embodied applications because labelling frames for post-training robotics models costs a lot of time; it will immensely accelerate the development of embodied AI 🤖