- player and number detection with RF-DETR
- player tracking with SAM2
- team clustering with SigLIP, UMAP and KMeans
- number recognition with SmolVLM2
SAM2.1 tracks objects across video using visual prompts like boxes or points
we use a fine-tuned RF-DETR to detect all players in the first frame, pass these detections to SAM2.1, and track them in the following frames
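roughly, that handoff looks like this (a minimal sketch, not the project's exact code; the rfdetr and sam2 calls, checkpoint paths, and frame paths are my assumptions):

```python
# detect players with RF-DETR in the first frame, then track them with SAM2.1;
# config/checkpoint/frame paths below are placeholders
import torch
from PIL import Image
from rfdetr import RFDETRBase
from sam2.build_sam import build_sam2_video_predictor

detector = RFDETRBase()  # the project uses a fine-tuned checkpoint here
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

# SAM2's video predictor works on a directory of extracted JPEG frames
state = predictor.init_state(video_path="frames/")

# detect players in the first frame and register each box as a SAM2.1 prompt
first_frame = Image.open("frames/00000.jpg")
detections = detector.predict(first_frame, threshold=0.5)  # sv.Detections-style output
for obj_id, box in enumerate(detections.xyxy):
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)

# propagate the prompts through the rest of the video
with torch.inference_mode():
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one mask per tracked player
```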
I sample frames, detect players, crop the central regions, generate SigLIP embeddings, reduce them with UMAP, and cluster with KMeans to separate players into two teams
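something like this (a minimal sketch; the SigLIP checkpoint is a placeholder and `player_crops` stands for the central crops collected from the sampled frames):

```python
# team clustering: SigLIP embeddings -> UMAP projection -> KMeans with 2 clusters
import numpy as np
import torch
import umap
from PIL import Image
from sklearn.cluster import KMeans
from transformers import AutoProcessor, SiglipVisionModel

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224").eval()

def embed(crops: list[Image.Image]) -> np.ndarray:
    """SigLIP embeddings for a batch of player crops."""
    inputs = processor(images=crops, return_tensors="pt")
    with torch.inference_mode():
        outputs = model(**inputs)
    return outputs.pooler_output.cpu().numpy()

embeddings = embed(player_crops)                                  # (N, D)
projected = umap.UMAP(n_components=3).fit_transform(embeddings)   # (N, 3)
team_ids = KMeans(n_clusters=2).fit_predict(projected)            # 0 or 1 per player
```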
I used the same strategy in last year’s Football AI project. I also recorded a YouTube tutorial covering it
check it out if you haven’t already:
reading player numbers from small and blurry crops is not easy
traditional OCR models struggle with this task
for this reason, we decided to use SmolVLM2, fine-tuned on a custom multi-modal dataset
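inference then looks roughly like this (a sketch assuming the stock SmolVLM2 chat-template workflow from transformers; the checkpoint name is a placeholder rather than our fine-tuned model, and `jersey_crop` is assumed to be a PIL crop of the number region):

```python
# read a jersey number from a crop with a SmolVLM2-style model
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # placeholder, not the fine-tune
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What jersey number is visible? Answer with digits only."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[jersey_crop], return_tensors="pt")

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=8)
answer = processor.batch_decode(generated, skip_special_tokens=True)[0]
```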
the next step is to match each jersey number to the right player using the mask IoS metric
unlike IoU, which measures overlap against the union, IoS measures it against the smaller area, so a smaller object fully inside a larger one gives IoS = 1
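in code, mask IoS is just a few lines:

```python
# IoS: intersection over the smaller of the two mask areas
import numpy as np

def mask_ios(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoS between two binary masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    smaller = min(mask_a.sum(), mask_b.sum())
    return float(intersection / smaller) if smaller > 0 else 0.0

# a number region fully contained in a player mask gives IoS = 1.0,
# so each number is assigned to the player with the highest IoS
```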
as player positions change, jersey numbers are not always clear, so relying on a single prediction is unreliable
to reduce errors, we validate numbers across frames
you can see in the video how numbers stabilize once they stay visible across consecutive frames
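the idea boils down to a per-tracker majority vote over a sliding window (an illustrative sketch; the window size and vote threshold are made-up values, not the project's):

```python
# keep a short history of number predictions per tracker ID and only accept
# a number once it dominates the recent window
from collections import Counter, defaultdict, deque

history = defaultdict(lambda: deque(maxlen=15))  # tracker_id -> recent predictions

def update_number(tracker_id: int, predicted_number: str, min_votes: int = 8) -> str | None:
    history[tracker_id].append(predicted_number)
    number, votes = Counter(history[tracker_id]).most_common(1)[0]
    return number if votes >= min_votes else None  # None = not stable yet
```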
added support for parsing and visualizing detection results from @alibaba_cloud Qwen2.5-VL, @moondreamai, and @GoogleDeepMind Gemini 2.0 and 2.5 models.
this comes in addition to existing support for @Microsoft Florence-2 and @GoogleDeepMind PaliGemma.
here's an awesome @huggingface space by @SergioPaniego and @onuralpszr, where they compare Moondream and Qwen2.5-VL object understanding using supervision-0.26.0 for parsing and visualization
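usage looks roughly like this (a sketch; I'm assuming the `sv.Detections.from_vlm` entry point and `sv.VLM.QWEN_2_5_VL` enum name from the 0.26.0 release, and `image` / `qwen_raw_output` are placeholders, so check the docs for the exact signature):

```python
# parse raw VLM output into supervision Detections, then visualize
import supervision as sv

detections = sv.Detections.from_vlm(
    sv.VLM.QWEN_2_5_VL,          # other entries cover Moondream, Gemini, etc.
    qwen_raw_output,             # raw text returned by the model
    resolution_wh=(image.width, image.height),
)

annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(scene=annotated, detections=detections)
```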
YOLOE is a real-time zero-shot detector (similar to YOLO-World), but it lets you prompt with text or boxes
here I used YOLOE to detect croissants on a conveyor belt using a box prompt; I just picked the first frame, drew a box, and ran prediction on the remaining frames; it runs at around 15 fps on a T4
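the box-prompt workflow with Ultralytics looks roughly like this (a sketch; the checkpoint name and coordinates are placeholders, and the `refer_image` / `visual_prompts` arguments follow my reading of the Ultralytics YOLOE docs, so verify against them):

```python
# prompt YOLOE with a box drawn on the first frame, then predict on later frames
import numpy as np
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor

model = YOLOE("yoloe-11l-seg.pt")

# box drawn around a single croissant in the first frame (class index 0)
visual_prompts = dict(bboxes=np.array([[110, 200, 260, 340]]), cls=np.array([0]))

results = model.predict(
    "frame_0042.jpg",            # frame to run prediction on
    refer_image="frame_0000.jpg",  # frame the box prompt was drawn on
    visual_prompts=visual_prompts,
    predictor=YOLOEVPSegPredictor,
)
```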
you can prompt the model to detect multiple object classes at the same time
if there are too many objects in the image, or we try to detect many classes at once, the model can get confused and spin in circles until it reaches the token limit.
the Skip-Gram model predicts the surrounding context words from a given center word.
during training, Skip-Gram learns word embeddings (numerical representations of words) that capture semantic relationships, which can then be used for natural language processing tasks such as measuring word similarity.
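a tiny gensim example (sg=1 selects Skip-Gram instead of CBOW; the corpus and hyperparameters are purely illustrative):

```python
# train Skip-Gram embeddings on a toy corpus and query similar words
from gensim.models import Word2Vec

sentences = [
    ["the", "goalkeeper", "saved", "the", "penalty"],
    ["the", "striker", "scored", "a", "goal"],
    ["the", "defender", "blocked", "the", "shot"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv.most_similar("striker", topn=3))  # words with the closest embeddings
```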
- enhance the visual tracking accuracy of SAM 2 by incorporating motion modeling, so it can effectively handle fast-moving and occluded objects
- propose a motion-aware memory selection mechanism that, unlike the original fixed-window memory, selectively stores relevant frames based on a mixture of motion and affinity scores, reducing errors in crowded scenes
state-of-the-art performance on various VOT benchmarks, including GOT-10k, LaSOT-ext, and NeedForSpeed
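conceptually, the memory selection can be sketched like this (my own illustrative sketch, not the paper's implementation; the score mix and memory size are placeholders):

```python
# score each past frame by a mix of motion and affinity scores and keep only
# the top-scoring frames as memory, instead of a fixed recent window
from dataclasses import dataclass

@dataclass
class FrameRecord:
    frame_idx: int
    motion_score: float    # e.g. agreement with a motion model's predicted box
    affinity_score: float  # e.g. SAM 2 mask confidence for the tracked object

def select_memory(records: list[FrameRecord], num_memory: int = 7, alpha: float = 0.5) -> list[int]:
    """Return the indices of the frames with the highest combined score."""
    scored = sorted(
        records,
        key=lambda r: alpha * r.motion_score + (1 - alpha) * r.affinity_score,
        reverse=True,
    )
    return sorted(r.frame_idx for r in scored[:num_memory])
```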