people who are baffled by DeepSeek have been and still are sleeping on Qwen, InternLM, ByteDance and Tencent
here are a couple of fan-favorite models from them 🪭
Qwen2VL: best open-source vision language model
here's a demo for the 7B model; use the 72B for longer context length and better performance huggingface.co/spaces/Ganymed…
Qwen released a new series of 1M context length models yesterday
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗
The models can handle vision-language understanding and visual referring tasks (referring segmentation) for images and videos ⏯️
take a look 🧶
The models come in 1B, 4B and 8B sizes, built on InternVL2.5 as the base architecture, with Qwen2, Qwen2.5 or InternLM2 as the language model part (depending on the checkpoint)
The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates the outputs to feed into the LLM 💬
the output segmentation tokens are then passed to SAM2, which matches the text (captions or semantic classes) to masks ⤵️
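The flow above can be sketched in a few lines. Everything here (names, dimensions, the toy encoders) is illustrative stand-in code, not SA2VA's actual API:

```python
# Toy sketch of the SA2VA-style pipeline described above: each modality gets
# its own encoder, the token sequences are concatenated and fed to the LLM,
# and the hidden state of a special [SEG] token is handed to a mask decoder
# (SAM2 in the real model). All names and dimensions are illustrative.

def encode(modality, data, dim=4):
    # Stand-in for a learned per-modality encoder: one "token" per item.
    return [[(len(modality) + len(item) + i) % 5 / 5 for i in range(dim)]
            for item in data]

def llm(tokens):
    # Stand-in for the LLM: here it just averages the input tokens to
    # produce the [SEG] token's hidden state.
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def mask_decoder(seg_hidden, num_pixels=8):
    # Stand-in for SAM2's mask decoder: turns the [SEG] embedding into a
    # binary mask over (fake) pixels.
    return [int(sum(seg_hidden) * 10 * (p + 1)) % 2 for p in range(num_pixels)]

# Concatenate the per-modality token sequences, as the text describes.
tokens = (
    encode("text", ["segment the dog"])
    + encode("image", ["frame0"])
    + encode("video", ["frame0", "frame1"])
)
mask = mask_decoder(llm(tokens))
print(len(tokens), len(mask))  # 4 input tokens, one 8-"pixel" mask
```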
NVIDIA solved physics and open-sourced it? Can we just build our own autonomous robots now? 🤯
They released Cosmos: a new family of open world foundation models (WFMs) 🌌
Unwrapping the release and why it's so revolutionary 🧶
They have released 10 models:
- four autoregressive video2world models
- four diffusion-based video2world and text2world models
- prompt upsampler
- content safety model
World Foundation Models (WFMs) are essentially pre-trained on open-world video data, which you can then fine-tune on your specific application (be it autonomous driving or robotic arms) with fewer labels
This release matters so much for embodied applications because labelling frames for post-training robotics models costs a lot of time; it will immensely accelerate the development of embodied AI 🤖
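To make the post-training idea concrete, here's a toy sketch of the recipe: freeze a backbone "pre-trained" on unlabeled video and fit only a small task head on a handful of labels. All names and numbers below are made up; real fine-tuning would use the Cosmos checkpoints in a framework like PyTorch:

```python
# Schematic of WFM post-training: a backbone pretrained on open-world video
# stays frozen, and only a tiny task head is trained on a few labeled
# examples. Everything here is a toy stand-in, not NVIDIA's actual models.

def pretrain_backbone():
    # Pretend result of large-scale unlabeled video pretraining.
    return {"w": [0.5, -0.25, 0.1]}

def forward(backbone, head, x):
    feats = sum(w * xi for w, xi in zip(backbone["w"], x))  # frozen features
    return head["scale"] * feats + head["bias"]             # tiny task head

def finetune_head(backbone, labeled, lr=0.5, steps=500):
    head = {"scale": 1.0, "bias": 0.0}
    for _ in range(steps):
        for x, y in labeled:  # only a few labels, e.g. robot-arm outcomes
            err = forward(backbone, head, x) - y
            feats = sum(w * xi for w, xi in zip(backbone["w"], x))
            head["scale"] -= lr * err * feats  # gradient step on head only
            head["bias"] -= lr * err           # backbone never updated
    return head

backbone = pretrain_backbone()
labeled = [([1.0, 0.0, 0.0], 1.0), ([0.0, 1.0, 0.0], -0.5)]
head = finetune_head(backbone, labeled)
print(round(forward(backbone, head, [1.0, 0.0, 0.0]), 2))  # close to the label 1.0
```

The point of the sketch: the expensive, label-hungry part (the backbone) is reused as-is, so only the cheap head needs your task-specific labels.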
Microsoft released a groundbreaking model that can be used for web automation, with MIT license 🔥👏
OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V at parsing. 👏
Model:
An interesting highlight for me was the model's performance on Mind2Web (a benchmark for web navigation), which unlocks agentic behavior for RPA agents.
no need for hefty web automation pipelines that break when the website/app design changes! Amazing work. huggingface.co/microsoft/Omni…
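Here's a toy illustration of why parsed UI elements beat brittle selectors: an agent resolves an instruction against the parser's labeled, interactable regions and clicks the box center, so a layout change only moves the boxes. The element schema and helper below are hypothetical, not OmniParser's actual output format:

```python
# Toy RPA step on top of a UI parser's output: pick the interactable element
# whose label best matches the instruction (naive word overlap) and return
# the center of its bounding box as the click point. The data structures and
# names here are made up for illustration.

def click_target(elements, instruction):
    words = set(instruction.lower().split())
    candidates = [e for e in elements if e["interactable"]]
    best = max(candidates,
               key=lambda e: len(words & set(e["label"].lower().split())))
    x1, y1, x2, y2 = best["box"]
    return best["label"], ((x1 + x2) // 2, (y1 + y2) // 2)

# Hypothetical parser output for a checkout page.
elements = [
    {"label": "Search products", "box": (10, 10, 210, 40), "interactable": True},
    {"label": "Hero banner", "box": (0, 50, 800, 300), "interactable": False},
    {"label": "Add to cart", "box": (600, 420, 760, 460), "interactable": True},
]
print(click_target(elements, "add the item to the cart"))
# → ('Add to cart', (680, 440))
```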
Lastly, the authors also fine-tune this model on open-set detection for interactable regions to see if it can serve as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. 👏