people who are baffled by DeepSeek have been and still are sleeping on Qwen, InternLM, ByteDance and Tencent
here are a couple of fan-favorite models from them 🪭
Qwen2VL: best open-source vision language model
here's a demo for the 7B model; use the 72B for longer context length and better performance huggingface.co/spaces/Ganymed…
Qwen released a new series of 1M context length models yesterday
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗
The models can handle vision-language understanding and visual referring tasks (referring segmentation) for images and videos ⏯️
take a look 🧶
The models come in 1B, 4B and 8B sizes, built on InternVL2.5 as the base architecture, with Qwen2, Qwen2.5 or InternLM2 as the language model part (depending on the checkpoint)
The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates the outputs to feed into the LLM 💬
the output segmentation tokens are then passed to SAM2, which matches the text (captions or semantic classes) to masks ⤵️
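The flow above can be sketched in a few lines. Everything here (names, dimensions, the toy encoders) is illustrative stand-in code, not SA2VA's actual API:

```python
# Toy sketch of the SA2VA-style pipeline described above: each modality gets
# its own encoder, the token sequences are concatenated and fed to the LLM,
# and the hidden state of a special [SEG] token is handed to a mask decoder
# (SAM2 in the real model). All names and dimensions are illustrative.

def encode(modality, data, dim=4):
    # Stand-in for a learned per-modality encoder: one "token" per item.
    return [[(len(modality) + len(item) + i) % 5 / 5 for i in range(dim)]
            for item in data]

def llm(tokens):
    # Stand-in for the LLM: here it just averages the input tokens to
    # produce the [SEG] token's hidden state.
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def mask_decoder(seg_hidden, num_pixels=8):
    # Stand-in for SAM2's mask decoder: turns the [SEG] embedding into a
    # binary mask over (fake) pixels.
    return [int(sum(seg_hidden) * 10 * (p + 1)) % 2 for p in range(num_pixels)]

# Concatenate the per-modality token sequences, as the text describes.
tokens = (
    encode("text", ["segment the dog"])
    + encode("image", ["frame0"])
    + encode("video", ["frame0", "frame1"])
)
mask = mask_decoder(llm(tokens))
print(len(tokens), len(mask))  # 4 input tokens, one 8-"pixel" mask
```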
NVIDIA solved physics and open-sourced it? Can we just build our own autonomous robots now? 🤯
They released Cosmos: a new family of open world foundation models (WFMs) 🌌
Unwrapping the release and why it's so revolutionary 🧶
They have released 10 models:
- four autoregressive video2world models
- four diffusion-based video2world and text2world models
- prompt upsampler
- content safety model
World Foundation Models (WFMs) are essentially pre-trained on open-world video data, which you can then fine-tune on your specific application (be it autonomous driving or robotic arms) with fewer labels
This release matters so much for embodied applications because labelling frames for post-training robotics models costs a lot of time; it will immensely accelerate the development of embodied AI 🤖
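To make the post-training idea concrete, here's a toy sketch of the recipe: freeze a backbone "pre-trained" on unlabeled video and fit only a small task head on a handful of labels. All names and numbers below are made up; real fine-tuning would use the Cosmos checkpoints in a framework like PyTorch:

```python
# Schematic of WFM post-training: a backbone pretrained on open-world video
# stays frozen, and only a tiny task head is trained on a few labeled
# examples. Everything here is a toy stand-in, not NVIDIA's actual models.

def pretrain_backbone():
    # Pretend result of large-scale unlabeled video pretraining.
    return {"w": [0.5, -0.25, 0.1]}

def forward(backbone, head, x):
    feats = sum(w * xi for w, xi in zip(backbone["w"], x))  # frozen features
    return head["scale"] * feats + head["bias"]             # tiny task head

def finetune_head(backbone, labeled, lr=0.5, steps=500):
    head = {"scale": 1.0, "bias": 0.0}
    for _ in range(steps):
        for x, y in labeled:  # only a few labels, e.g. robot-arm outcomes
            err = forward(backbone, head, x) - y
            feats = sum(w * xi for w, xi in zip(backbone["w"], x))
            head["scale"] -= lr * err * feats  # gradient step on head only
            head["bias"] -= lr * err           # backbone never updated
    return head

backbone = pretrain_backbone()
labeled = [([1.0, 0.0, 0.0], 1.0), ([0.0, 1.0, 0.0], -0.5)]
head = finetune_head(backbone, labeled)
print(round(forward(backbone, head, [1.0, 0.0, 0.0]), 2))  # close to the label 1.0
```

The point of the sketch: the expensive, label-hungry part (the backbone) is reused as-is, so only the cheap head needs your task-specific labels.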
Microsoft released a groundbreaking model that can be used for web automation, with MIT license 🔥👏
OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V at parsing. 👏
Model:
An interesting highlight for me was the model's performance on Mind2Web (a benchmark for web navigation), which unlocks agentic behavior for RPA agents.
no need for hefty web automation pipelines that break when the website/app design changes! Amazing work. huggingface.co/microsoft/Omni…
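Here's a toy illustration of why parsed UI elements beat brittle selectors: an agent resolves an instruction against the parser's labeled, interactable regions and clicks the box center, so a layout change only moves the boxes. The element schema and helper below are hypothetical, not OmniParser's actual output format:

```python
# Toy RPA step on top of a UI parser's output: pick the interactable element
# whose label best matches the instruction (naive word overlap) and return
# the center of its bounding box as the click point. The data structures and
# names here are made up for illustration.

def click_target(elements, instruction):
    words = set(instruction.lower().split())
    candidates = [e for e in elements if e["interactable"]]
    best = max(candidates,
               key=lambda e: len(words & set(e["label"].lower().split())))
    x1, y1, x2, y2 = best["box"]
    return best["label"], ((x1 + x2) // 2, (y1 + y2) // 2)

# Hypothetical parser output for a checkout page.
elements = [
    {"label": "Search products", "box": (10, 10, 210, 40), "interactable": True},
    {"label": "Hero banner", "box": (0, 50, 800, 300), "interactable": False},
    {"label": "Add to cart", "box": (600, 420, 760, 460), "interactable": True},
]
print(click_target(elements, "add the item to the cart"))
# → ('Add to cart', (680, 440))
```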
Lastly, the authors also fine-tune this model on open-set detection for interactable regions to see if it can serve as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. 👏