SkalskiP
Open-source Lead @roboflow. VLMs. GPU poor. Dog person. Coffee addict. Dyslexic. | GH: https://t.co/dEmzMDGq5H | HF: https://t.co/4Lx1Yw34W7
Mar 12 7 tweets 3 min read
YOLOE is a real-time zero-shot detector (similar to YOLO-World) that lets you prompt with text or boxes

here I used YOLOE to detect croissants on a conveyor using a box prompt; I just picked the first frame, drew a box, and ran prediction on the other frames; runs at around 15 fps on a T4
- paper: arxiv.org/abs/2503.07465
- code: github.com/THU-MIG/yoloe
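if you'd rather prompt with text instead of boxes, here's a minimal sketch assuming the ultralytics YOLOE wrapper; the checkpoint name and exact calls follow the ultralytics docs, not the code behind the demo above

```python
# a minimal sketch of text-prompted YOLOE (the demo above used a box prompt)
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")

# encode the text prompt once and set it as the model vocabulary
names = ["croissant"]
model.set_classes(names, model.get_text_pe(names))

results = model.predict("frame.jpg")
results[0].show()
```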
Feb 18 7 tweets 4 min read
I've been playing with Qwen2.5-VL object detection over the past few days; take a look

notebook link: github.com/roboflow/noteb…

you can prompt the model to detect multiple object classes at the same time
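the detections come back as plain text; here's a hedged sketch of parsing them, assuming the JSON format Qwen2.5-VL typically emits

```python
import json
import re

# Qwen2.5-VL typically answers with a JSON list like
# [{"bbox_2d": [x1, y1, x2, y2], "label": "croissant"}, ...]
# in absolute pixel coordinates; this pulls the list out of the text
def parse_detections(generated_text: str) -> list[dict]:
    match = re.search(r"\[.*\]", generated_text, re.DOTALL)
    return json.loads(match.group(0)) if match else []
```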
Jan 23 7 tweets 3 min read
the first episode of VLMs zero-to-hero will be about Word2Vec

we will train a Skip-Gram model on 17M words from wikipedia; notebook is already in the repository, and the video should be out in about a week

link: github.com/SkalskiP/vlms-… x.com/skalskip92/sta…

the Skip-Gram model predicts the surrounding context words based on a given center word.
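in model form, a minimal sketch (PyTorch here; not the exact notebook code)

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int = 128):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
        # embed the center word, then score every vocabulary word
        # as a candidate context word
        return self.output(self.embeddings(center_ids))

# trained with cross-entropy on (center, context) pairs
# sampled from a sliding window over the corpus
```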
Nov 20, 2024 4 tweets 3 min read
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

check out this SAM2 vs SAMURAI comparison!

- paper: arxiv.org/pdf/2411.11922
- code: github.com/yangchris11/sa…
- license: Apache-2.0

- enhances the visual tracking accuracy of SAM 2 by incorporating motion information through motion modeling, effectively handling fast-moving and occluded objects

- proposes a motion-aware memory selection mechanism that reduces errors in crowded scenes; in contrast to the original fixed-window memory, it selectively stores relevant frames chosen by a mixture of motion and affinity scores
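the selection rule, as an illustrative one-liner (my sketch, not the repo's code)

```python
# a frame earns a memory slot only if its mask both matches the object
# appearance (affinity) and agrees with the Kalman-filter motion
# prediction (kf_iou); alpha is an illustrative weight
def memory_score(affinity: float, kf_iou: float, alpha: float = 0.25) -> float:
    return alpha * kf_iou + (1 - alpha) * affinity
```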
Oct 17, 2024 10 tweets 3 min read
YOLO11 zero to hero tutorial!

- label images for training
- understand the YOLO annotation format
- train YOLO11 on your local machine and in Google Colab
- save and deploy the fine-tuned model
- and more ↓

link: youtu.be/etjkjZoG2F0 x.com/skalskip92/sta…

label images for YOLO11 training
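the training step itself, as a minimal sketch with the ultralytics package (paths are placeholders); the comment shows the YOLO annotation format covered in the video

```python
from ultralytics import YOLO

# YOLO annotation format: one .txt file per image, one line per box,
# all values normalized to 0-1:
# <class_id> <x_center> <y_center> <width> <height>

model = YOLO("yolo11n.pt")  # pretrained nano checkpoint
model.train(data="dataset/data.yaml", epochs=100, imgsz=640)
model.val()  # evaluate the fine-tuned weights on the validation split
```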
Sep 24, 2024 4 tweets 2 min read
last month, we launched @roboflow workflows; this week, we rolled out a massive update with two models you asked about

here's how to perform zero-shot segmentation in a few seconds using Florence-2 and SAM2

inference: github.com/roboflow/infer…

↓ wanna know how those models work? here's my Florence-2 object detection fine-tuning tutorial
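the pipeline itself is simple: Florence-2 grounds a text prompt to boxes, then SAM2 turns those boxes into masks; here's a sketch using the Hugging Face Florence-2 usage pattern and the facebookresearch/sam2 image predictor (model ids and exact calls are my assumptions, not the workflows implementation)

```python
import numpy as np
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from sam2.sam2_image_predictor import SAM2ImagePredictor

image = Image.open("image.jpg")

# stage 1: Florence-2 grounds a text prompt to bounding boxes
model_id = "microsoft/Florence-2-base"
florence = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

task = "<CAPTION_TO_PHRASE_GROUNDING>"
inputs = processor(text=task + "croissant", images=image, return_tensors="pt")
ids = florence.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
text = processor.batch_decode(ids, skip_special_tokens=False)[0]
boxes = processor.post_process_generation(text, task=task, image_size=image.size)[task]["bboxes"]

# stage 2: SAM2 turns those boxes into masks
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(box=np.array(boxes), multimask_output=False)
```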

Aug 22, 2024 11 tweets 4 min read
over 200 hours of work compressed into a 90-minute video

the football AI tutorial is finally out!

link to video:

↓ key takeaways
code is here: github.com/roboflow/sports
Aug 8, 2024 9 tweets 4 min read
perspective transformation tutorial

I know many of you have been waiting for this tutorial for a long time, and it's finally here!

link: blog.roboflow.com/camera-calibra…

↓ key takeaways
keypoint detection is a computer vision task that involves identifying specific points of interest in an image or video. keypoints represent distinctive features or landmarks, such as facial features, body joints, or object corners.
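once you have 4 matched keypoints, the transform itself is two OpenCV calls (coordinates below are made up)

```python
import cv2
import numpy as np

# 4 detected pitch keypoints in image space (pixels, made-up values)
source = np.float32([[112, 310], [870, 295], [1500, 600], [40, 640]])
# the same 4 points on the field plan (meters, 105 x 68 pitch)
target = np.float32([[0, 0], [105, 0], [105, 68], [0, 68]])

m = cv2.getPerspectiveTransform(source, target)

# project a tracked player position from image space onto the field plan
player_xy = np.float32([[[640, 400]]])
field_xy = cv2.perspectiveTransform(player_xy, m)
```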
Aug 7, 2024 6 tweets 4 min read
dev team at @roboflow is cooking something cool

workflows is a no-code platform that lets you compose and deploy complex computer vision pipelines

let me show you how to detect cars and read license plates without any coding

tutorial: blog.roboflow.com/license-plate-…
something like ComfyUI, but hosted on @roboflow infrastructure

gives you access to all public models available on the platform (100k+), foundational models like YOLO-World, and APIs like OpenAI

code is open-source; you can run it locally if you want

github.com/roboflow/infer…
Jul 31, 2024 5 tweets 2 min read
SAM2 can be used for ReID (reidentification) across multiple camera views

top video: the reference video; bottom two videos: new, previously unseen camera angles

I only provided point annotations for 3 frames of the reference video (23 points in total); in return, SAM2 gave me precise masks even for frames coming from previously unseen camera angles.

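a sketch of the setup, following the facebookresearch/sam2 video predictor API (config and checkpoint names are assumptions)

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="frames/")

    # one positive click on the target player in the first frame;
    # repeat for the other annotated frames / points
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[430, 260]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive, 0 = negative
    )

    # propagate the prompts through the rest of the video
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```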
Jul 23, 2024 4 tweets 2 min read
player clustering component of my Football AI project is pushed to GitHub

- feature extraction with SigLIP
- dimensionality reduction with UMAP
- clustering with KMeans

code: github.com/roboflow/sport…

the classifier is trained for each game separately using 20-30 video frames; no labels are required
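the whole stage in a few lines (random stand-ins for the SigLIP features keep the sketch runnable)

```python
import numpy as np
import umap
from sklearn.cluster import KMeans

# in the real pipeline `embeddings` are SigLIP features of player crops;
# random stand-ins here
embeddings = np.random.rand(120, 768).astype(np.float32)

projection = umap.UMAP(n_components=3).fit_transform(embeddings)
team_ids = KMeans(n_clusters=2, n_init=10).fit_predict(projection)  # 2 teams
```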
Jul 1, 2024 10 tweets 4 min read
Florence-2 fine-tuning YouTube tutorial is finally out! (sorry it took me so long)

- running the pre-trained model with different vision tasks
- configuring LoRA
- training and benchmarking
- Florence-2 vs. top vision model

link:

↓ key takeaways
deep dive into the dataset format you'll need for Florence-2 object detection fine-tuning
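the gist of the format, as a hedged sketch (field names vary by loader): each entry pairs the task prompt with the expected output text, where boxes become <loc_*> tokens quantized to a 0-999 grid (x1, y1, x2, y2)

```python
import json

# one JSONL entry; coordinates here are illustrative
entry = {
    "image": "0001.jpg",
    "prefix": "<OD>",
    "suffix": "croissant<loc_102><loc_256><loc_541><loc_780>",
}
print(json.dumps(entry))
```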

Jun 26, 2024 6 tweets 2 min read
"What is so special about Florence-2? Other models can do those things. Is this model accurate/faster than lava, gpt4v, and new YOLO models like v10 and yolo-world?"

I got this question today. Here's a short comparison of all those models.

↓ Florence-2:
- MIT license (you can use it for free)
- can perform zero-shot (no training required) object detection, instance segmentation, and image captioning (all in one model)
- you can fine-tune it on a custom dataset (on relatively cheap hardware)
- can run it on edge devices
Jun 20, 2024 7 tweets 4 min read
Wednesday afternoon session of posters #CVPR2024

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks [Poster #102]

TL;DR: Florence-2 is a vision foundation model designed for diverse computer vision and vision-language tasks using a unified, prompt-based representation, excelling in tasks such as captioning, object detection, grounding, and segmentation.

- paper: arxiv.org/pdf/2311.06242
- @huggingface demo: huggingface.co/spaces/gokaygo…
Jun 18, 2024 8 tweets 4 min read
ViP-LLaVA, a model that understands not only textual prompts but also visual prompts, such as pointing with an arrow, drawing an ellipse, or marking with a specific color

cool model presented yesterday by @yong_jae_lee at #CVPR2024 "Prompting in Vision" workshop

↓ read more

- paper: arxiv.org/abs/2312.00784
- code: github.com/WisconsinAIVis…
- demo: pages.cs.wisc.edu/~mucai/vip-lla…

plus a ViP-LLaVA poster to be presented at CVPR (Thu 20 Jun, 1:30 p.m. to 3 p.m. EDT, #317)
Jun 13, 2024 4 tweets 2 min read
awesome blog post by Linas from the supervision team, showing how to detect and segment small objects

link: blog.roboflow.com/small-object-d…
before / after

same model, just with or without supervision

docs: supervision.roboflow.com/latest/how_to/…

Jun 6, 2024 6 tweets 3 min read
I finally managed to fine-tune PaliGemma on a custom segmentation dataset

most of you have probably noticed that I've been spamming all sorts of PaliGemma tutorials for the past few weeks; I have one more

shoutout to @__kolesnikov__ for all the help!

↓ read more + code

here is the PaliGemma response describing one instance segmentation result; it consists of three elements:

- 4 location tokens (0 - 1023)
- 16 segmentation tokens (0 - 127)
- category name
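a sketch of pulling those three elements out of the raw answer (my regex, not the notebook's code)

```python
import re

# one segmented instance looks like
# "<loc0012><loc0256><loc0768><loc1000><seg012>...<seg127> croissant"
PATTERN = re.compile(r"((?:<loc\d{4}>){4})((?:<seg\d{3}>){16})\s*([^<]+)")

def parse(answer: str):
    for locs, segs, name in PATTERN.findall(answer):
        box = [int(v) for v in re.findall(r"<loc(\d{4})>", locs)]    # 0-1023 grid
        codes = [int(v) for v in re.findall(r"<seg(\d{3})>", segs)]  # codebook ids, 0-127
        yield box, codes, name.strip()
```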
May 27, 2024 4 tweets 2 min read
YOLOv10 is fast and light but is NOT the best choice for detecting small objects in the distance.

- YOLOv8 - top-right
- YOLOv9 - bottom-left
- YOLOv10 - bottom-right

YOLOv10 performs worse.

I have created a @huggingface Space where you can simultaneously test YOLOv8, YOLOv9, and YOLOv10 on your images.

link: huggingface.co/spaces/Skalski…
May 20, 2024 5 tweets 3 min read
I updated my PaliGemma fine-tuning notebook

many of you mentioned that the notebook lacked a benchmark for the fine-tuned model. I have just added mAP and the confusion matrix.

btw, you can expect the PaliGemma fine-tuning tutorial this week

↓ read more

here's the link to the updated notebook: colab.research.google.com/github/roboflo…
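a hedged sketch of what the new benchmark cells do, using supervision's benchmark helpers; `dataset` and `callback` are assumed (a sv.DetectionDataset with ground truth, and a function running the fine-tuned model)

```python
import supervision as sv

# `dataset` is a sv.DetectionDataset with ground truth;
# `callback(image) -> sv.Detections` runs the fine-tuned PaliGemma
mean_ap = sv.MeanAveragePrecision.benchmark(dataset=dataset, callback=callback)
print(mean_ap.map50_95)

confusion_matrix = sv.ConfusionMatrix.benchmark(dataset=dataset, callback=callback)
confusion_matrix.plot()
```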
Mar 25, 2024 4 tweets 2 min read
detecting small objects is hard

I spent some time today writing a short how-to guide on using supervision (in combination with the most popular CV libraries) to detect small objects.

btw is that a good idea for a video tutorial?

link: supervision.roboflow.com/develop/how_to…

↓ read more
this image is 3840x2160; running the model even with increased input resolution (1280) and a lowered confidence threshold (0.2) doesn't yield good results
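the guide's fix is tiled (SAHI-style) inference; a minimal sketch with supervision's InferenceSlicer (the detector choice is mine)

```python
import cv2
import numpy as np
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8x.pt")  # any detector works; this choice is illustrative

# run the detector on overlapping tiles instead of the full-resolution
# frame, then let supervision merge the per-tile detections
def callback(image_slice: np.ndarray) -> sv.Detections:
    result = model(image_slice)[0]
    return sv.Detections.from_ultralytics(result)

slicer = sv.InferenceSlicer(callback=callback, slice_wh=(640, 640))
detections = slicer(cv2.imread("image.jpg"))
```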
Mar 21, 2024 9 tweets 5 min read
time analysis with computer vision

- blurring faces
- detection and tracking
- smoothing detections
- filtering detections by zone
- calculating time

let me know if you want me to explain anything else. ;)

code: github.com/roboflow/super…

↓ read more
the full tutorial will be available on Monday on the @roboflow YouTube channel; subscribe so you don't miss it.

link to YouTube: youtube.com/@Roboflow
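the core of the time calculation, as a minimal sketch (zone polygon and fps are made up; detections come from any detector, frame by frame)

```python
import numpy as np
import supervision as sv

FPS = 30
tracker = sv.ByteTrack()
zone = sv.PolygonZone(
    polygon=np.array([[100, 100], [500, 100], [500, 400], [100, 400]])
)
seconds_in_zone: dict[int, float] = {}

def update(detections: sv.Detections) -> None:
    # track detections across frames, keep only those inside the zone,
    # and credit each tracker id with one frame's worth of time
    tracked = tracker.update_with_detections(detections)
    in_zone = tracked[zone.trigger(tracked)]
    for tracker_id in in_zone.tracker_id:
        seconds_in_zone[tracker_id] = seconds_in_zone.get(tracker_id, 0.0) + 1 / FPS
```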