SkalskiP
Open-source Lead @roboflow. VLMs. GPU poor. Dog person. Coffee addict. Dyslexic. | GH: https://t.co/dEmzMDGq5H | HF: https://t.co/4Lx1Yw34W7
Mar 12 7 tweets 3 min read
YOLOE is a real-time zero-shot detector (similar to YOLO-World) that lets you prompt with text or boxes

here I used YOLOE to detect croissants on a conveyor using a box prompt; I just picked the first frame, drew a box, and ran prediction on the other frames; runs at around 15 fps on a T4
- paper: arxiv.org/abs/2503.07465
- code: github.com/THU-MIG/yoloe
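if you'd rather prompt with text instead of boxes, here's a minimal sketch assuming the ultralytics YOLOE wrapper; the checkpoint name and exact calls follow the ultralytics docs, not the code behind the demo above

```python
# a minimal sketch of text-prompted YOLOE (the demo above used a box prompt)
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")

# encode the text prompt once and set it as the model vocabulary
names = ["croissant"]
model.set_classes(names, model.get_text_pe(names))

results = model.predict("frame.jpg")
results[0].show()
```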
Feb 18 7 tweets 4 min read
I've been playing with Qwen2.5-VL object detection over the past few days; take a look

notebook link: github.com/roboflow/noteb…

you can prompt the model to detect multiple object classes at the same time
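the detections come back as plain text; here's a hedged sketch of parsing them, assuming the JSON format Qwen2.5-VL typically emits

```python
import json
import re

# Qwen2.5-VL typically answers with a JSON list like
# [{"bbox_2d": [x1, y1, x2, y2], "label": "croissant"}, ...]
# in absolute pixel coordinates; this pulls the list out of the text
def parse_detections(generated_text: str) -> list[dict]:
    match = re.search(r"\[.*\]", generated_text, re.DOTALL)
    return json.loads(match.group(0)) if match else []
```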
Jan 23 7 tweets 3 min read
the first episode of VLMs zero-to-hero will be about Word2Vec

we will train a Skip-Gram model on 17M words from wikipedia; notebook is already in the repository, and the video should be out in about a week

link: github.com/SkalskiP/vlms-… x.com/skalskip92/sta…

the Skip-Gram model predicts the surrounding context words based on a given center word.
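in model form, a minimal sketch (PyTorch here; not the exact notebook code)

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int = 128):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
        # embed the center word, then score every vocabulary word
        # as a candidate context word
        return self.output(self.embeddings(center_ids))

# trained with cross-entropy on (center, context) pairs
# sampled from a sliding window over the corpus
```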
Nov 20, 2024 4 tweets 3 min read
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

check out this SAM2 vs SAMURAI comparison!

- paper: arxiv.org/pdf/2411.11922
- code: github.com/yangchris11/sa…
- license: Apache-2.0

- enhances the visual tracking accuracy of SAM 2 by incorporating motion information through motion modeling, effectively handling fast-moving and occluded objects

- proposes a motion-aware memory selection mechanism that reduces errors in crowded scenes; in contrast to the original fixed-window memory, it selectively stores relevant frames chosen by a mixture of motion and affinity scores
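the selection rule, as an illustrative one-liner (my sketch, not the repo's code)

```python
# a frame earns a memory slot only if its mask both matches the object
# appearance (affinity) and agrees with the Kalman-filter motion
# prediction (kf_iou); alpha is an illustrative weight
def memory_score(affinity: float, kf_iou: float, alpha: float = 0.25) -> float:
    return alpha * kf_iou + (1 - alpha) * affinity
```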
Oct 17, 2024 10 tweets 3 min read
YOLO11 zero to hero tutorial!

- label images for training
- understand the YOLO annotation format
- train YOLO11 on your local machine and in Google Colab
- save and deploy the fine-tuned model
- and more ↓

link: youtu.be/etjkjZoG2F0 x.com/skalskip92/sta…

label images for YOLO11 training
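the training step itself, as a minimal sketch with the ultralytics package (paths are placeholders); the comment shows the YOLO annotation format covered in the video

```python
from ultralytics import YOLO

# YOLO annotation format: one .txt file per image, one line per box,
# all values normalized to 0-1:
# <class_id> <x_center> <y_center> <width> <height>

model = YOLO("yolo11n.pt")  # pretrained nano checkpoint
model.train(data="dataset/data.yaml", epochs=100, imgsz=640)
model.val()  # evaluate the fine-tuned weights on the validation split
```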
Sep 24, 2024 4 tweets 2 min read
last month, we launched @roboflow workflows; this week, we rolled out a massive update with two models you asked about

here's how to perform zero-shot segmentation in a few seconds using Florence-2 and SAM2

inference: github.com/roboflow/infer…

↓ wanna know how those models work? here's my Florence-2 object detection fine-tuning tutorial
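the pipeline itself is simple: Florence-2 grounds a text prompt to boxes, then SAM2 turns those boxes into masks; here's a sketch using the Hugging Face Florence-2 usage pattern and the facebookresearch/sam2 image predictor (model ids and exact calls are my assumptions, not the workflows implementation)

```python
import numpy as np
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from sam2.sam2_image_predictor import SAM2ImagePredictor

image = Image.open("image.jpg")

# stage 1: Florence-2 grounds a text prompt to bounding boxes
model_id = "microsoft/Florence-2-base"
florence = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

task = "<CAPTION_TO_PHRASE_GROUNDING>"
inputs = processor(text=task + "croissant", images=image, return_tensors="pt")
ids = florence.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]
boxes = processor.post_process_generation(text, task=task, image_size=image.size)[task]["bboxes"]

# stage 2: SAM2 turns those boxes into masks
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(box=np.array(boxes), multimask_output=False)
```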

Aug 22, 2024 11 tweets 4 min read
over 200 hours of work compressed into a 90-minute video

the football AI tutorial is finally out!

link to video:

↓ key takeaways
code is here: github.com/roboflow/sports
Aug 8, 2024 9 tweets 4 min read
perspective transformation tutorial

I know many of you have been waiting for this tutorial for a long time, and it's finally here!

link: blog.roboflow.com/camera-calibra…

↓ key takeaways
keypoint detection is a computer vision task that involves identifying specific points of interest in an image or video. keypoints represent distinctive features or landmarks, such as facial features, body joints, or object corners.
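once you have 4 matched keypoints, the transform itself is two OpenCV calls (coordinates below are made up)

```python
import cv2
import numpy as np

# 4 detected pitch keypoints in image space (pixels, made-up values)
source = np.float32([[112, 310], [870, 295], [1500, 600], [40, 640]])
# the same 4 points on the field plan (meters, 105 x 68 pitch)
target = np.float32([[0, 0], [105, 0], [105, 68], [0, 68]])

m = cv2.getPerspectiveTransform(source, target)

# project a tracked player position from image space onto the field plan
player_xy = np.float32([[[640, 400]]])
field_xy = cv2.perspectiveTransform(player_xy, m)
```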
Aug 7, 2024 6 tweets 4 min read
dev team at @roboflow is cooking something cool

workflows is a no-code platform that lets you compose and deploy complex computer vision pipelines

let me show you how to detect cars and read license plates without any coding

tutorial: blog.roboflow.com/license-plate-…
something like ComfyUI, but hosted on @roboflow infrastructure

gives you access to all public models available on the platform (100k+), foundational models like YOLO-World, and APIs like OpenAI

code is open-source; you can run it locally if you want

github.com/roboflow/infer…
Jul 31, 2024 5 tweets 2 min read
SAM2 can be used for ReID (reidentification) across multiple camera views

top video: the reference video; bottom two videos: new, previously unseen camera angles

I only provided point annotations for 3 frames of the reference video (23 points in total); in return, SAM2 gave me precise masks even for frames coming from previously unseen camera angles.

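a sketch of the setup, following the facebookresearch/sam2 video predictor API (config and checkpoint names are assumptions)

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="frames/")

    # one positive click on the target player in the first frame;
    # repeat for the other annotated frames / points
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[430, 260]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive, 0 = negative
    )

    # propagate the prompts through the rest of the video
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```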
Jul 23, 2024 4 tweets 2 min read
player clustering component of my Football AI project is pushed to GitHub

- feature extraction with SigLIP
- dimensionality reduction with UMAP
- clustering with KMeans

code: github.com/roboflow/sport…

the classifier is trained for each game separately using 20-30 video frames; no labels are required
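the whole stage in a few lines (random stand-ins for the SigLIP features keep the sketch runnable)

```python
import numpy as np
import umap
from sklearn.cluster import KMeans

# in the real pipeline `embeddings` are SigLIP features of player crops;
# random stand-ins here
embeddings = np.random.rand(120, 768).astype(np.float32)

projection = umap.UMAP(n_components=3).fit_transform(embeddings)
team_ids = KMeans(n_clusters=2, n_init=10).fit_predict(projection)  # 2 teams
```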
Jul 1, 2024 10 tweets 4 min read
Florence-2 fine-tuning YouTube tutorial is finally out! (sorry it took me so long)

- running the pre-trained model with different vision tasks
- configuring LoRA
- training and benchmarking
- Florence-2 vs. top vision model

link:

↓ key takeaways
deep dive into the dataset format you'll need for Florence-2 object detection fine-tuning
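the gist of the format, as a hedged sketch (field names vary by loader): each entry pairs the task prompt with the expected output text, where boxes become <loc_*> tokens quantized to a 0-999 grid (x1, y1, x2, y2)

```python
import json

# one JSONL entry; coordinates here are illustrative
entry = {
    "image": "0001.jpg",
    "prefix": "<OD>",
    "suffix": "croissant<loc_102><loc_256><loc_541><loc_780>",
}
print(json.dumps(entry))
```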

Jun 26, 2024 6 tweets 2 min read
"What is so special about Florence-2? Other models can do those things. Is this model accurate/faster than lava, gpt4v, and new YOLO models like v10 and yolo-world?"

I got this question today. Here's a short comparison of all those models.

↓ Florence-2:
- MIT license (you can use it for free)
- can perform zero-shot (no training required) object detection, instance segmentation, and image captioning (all in one model)
- you can fine-tune it on a custom dataset (on relatively cheap hardware)
- can run it on edge devices
Jun 20, 2024 7 tweets 4 min read
Wednesday afternoon session of posters #CVPR2024

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks [Poster #102]

TL;DR: Florence-2 is a vision foundation model designed for diverse computer vision and vision-language tasks using a unified, prompt-based representation, excelling in tasks such as captioning, object detection, grounding, and segmentation.

- paper: arxiv.org/pdf/2311.06242
- @huggingface demo: huggingface.co/spaces/gokaygo…
Jun 18, 2024 8 tweets 4 min read
ViP-LLaVA, a model that understands not only textual prompts but also visual prompts, such as pointing with an arrow, drawing an ellipse, or marking with a specific color

cool model presented yesterday by @yong_jae_lee at #CVPR2024 "Prompting in Vision" workshop

↓ read more

- paper: arxiv.org/abs/2312.00784
- code: github.com/WisconsinAIVis…
- demo: pages.cs.wisc.edu/~mucai/vip-lla…

plus a ViP-LLaVA poster to be presented at CVPR (Thu 20 Jun, 1:30 p.m. to 3 p.m. EDT, #317)
Jun 13, 2024 4 tweets 2 min read
awesome blog post by Linas from the supervision team, showing how to detect and segment small objects

link: blog.roboflow.com/small-object-d…
before / after

same model, just with or without supervision

docs: supervision.roboflow.com/latest/how_to/…

Jun 6, 2024 6 tweets 3 min read
I finally managed to fine-tune PaliGemma on a custom segmentation dataset

most of you have probably noticed that I've been spamming all sorts of PaliGemma tutorials for the past few weeks; I have one more

shoutout to @__kolesnikov__ for all the help!

↓ read more + code

here is the PaliGemma response describing one instance segmentation result; it consists of three elements:

- 4 location tokens (0 - 1023)
- 16 segmentation tokens (0 - 127)
- category name
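a sketch of pulling those three elements out of the raw answer (my regex, not the notebook's code)

```python
import re

# one segmented instance looks like
# "<loc0012><loc0256><loc0768><loc1000><seg012>...<seg127> croissant"
PATTERN = re.compile(r"((?:<loc\d{4}>){4})((?:<seg\d{3}>){16})\s*([^<]+)")

def parse(answer: str):
    for locs, segs, name in PATTERN.findall(answer):
        box = [int(v) for v in re.findall(r"<loc(\d{4})>", locs)]    # 0-1023 grid
        codes = [int(v) for v in re.findall(r"<seg(\d{3})>", segs)]  # codebook ids, 0-127
        yield box, codes, name.strip()
```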
May 27, 2024 4 tweets 2 min read
YOLOv10 is fast and light but is NOT the best choice for detecting small objects in the distance.

- YOLOv8 - top-right
- YOLOv9 - bottom-left
- YOLOv10 - bottom-right

YOLOv10 performs worse.

I have created a @huggingface Space where you can simultaneously test YOLOv8, YOLOv9, and YOLOv10 on your images.

link: huggingface.co/spaces/Skalski…
May 20, 2024 5 tweets 3 min read
I updated my PaliGemma fine-tuning notebook

many of you mentioned that the notebook lacked a benchmark for the fine-tuned model. I have just added mAP and the confusion matrix.

btw, you can expect the PaliGemma fine-tuning tutorial this week

↓ read more

here's the link to the updated notebook: colab.research.google.com/github/roboflo…
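a hedged sketch of what the new benchmark cells do, using supervision's benchmark helpers; `dataset` and `callback` are assumed (a sv.DetectionDataset with ground truth, and a function running the fine-tuned model)

```python
import supervision as sv

# `dataset` is a sv.DetectionDataset with ground truth;
# `callback(image) -> sv.Detections` runs the fine-tuned PaliGemma
mean_ap = sv.MeanAveragePrecision.benchmark(dataset=dataset, callback=callback)
print(mean_ap.map50_95)

confusion_matrix = sv.ConfusionMatrix.benchmark(dataset=dataset, callback=callback)
confusion_matrix.plot()
```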
Mar 25, 2024 4 tweets 2 min read
detecting small objects is hard

I spent some time today writing a short how-to guide on using supervision (in combination with the most popular CV libraries) to detect small objects.

btw is that a good idea for a video tutorial?

link: supervision.roboflow.com/develop/how_to…

↓ read more
this image is 3840x2160; running the model even with increased input resolution (1280) and a lowered confidence threshold (0.2) doesn't yield good results
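the guide's fix is tiled (SAHI-style) inference; a minimal sketch with supervision's InferenceSlicer (the detector choice is mine)

```python
import cv2
import numpy as np
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8x.pt")  # any detector works; this choice is illustrative

# run the detector on overlapping tiles instead of the full-resolution
# frame, then let supervision merge the per-tile detections
def callback(image_slice: np.ndarray) -> sv.Detections:
    result = model(image_slice)[0]
    return sv.Detections.from_ultralytics(result)

slicer = sv.InferenceSlicer(callback=callback, slice_wh=(640, 640))
detections = slicer(cv2.imread("image.jpg"))
```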
Mar 21, 2024 9 tweets 5 min read
time analysis with computer vision

- blurring faces
- detection and tracking
- smoothing detections
- filtering detections by zone
- calculating time

let me know if you want me to explain anything else. ;)

code: github.com/roboflow/super…

↓ read more
the full tutorial will be available on Monday on the @roboflow YouTube channel; subscribe so you don't miss it.

link to YouTube: youtube.com/@Roboflow
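the core of the time calculation, as a minimal sketch (zone polygon and fps are made up; detections come from any detector, frame by frame)

```python
import numpy as np
import supervision as sv

FPS = 30
tracker = sv.ByteTrack()
zone = sv.PolygonZone(
    polygon=np.array([[100, 100], [500, 100], [500, 400], [100, 400]])
)
seconds_in_zone: dict[int, float] = {}

def update(detections: sv.Detections) -> None:
    # track detections across frames, keep only those inside the zone,
    # and credit each tracker id with one frame's worth of time
    tracked = tracker.update_with_detections(detections)
    in_zone = tracked[zone.trigger(tracked)]
    for tracker_id in in_zone.tracker_id:
        seconds_in_zone[tracker_id] = seconds_in_zone.get(tracker_id, 0.0) + 1 / FPS
```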