SkalskiP
Jul 1, 2024
Florence-2 fine-tuning YouTube tutorial is finally out! (sorry it took me so long)

- running the pre-trained model with different vision tasks
- configuring LoRA
- training and benchmarking
- Florence-2 vs. top vision model

link:

↓ key takeaways
deep dive into the dataset format you'll need for Florence-2 object detection fine-tuning
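
A minimal sketch of what one object detection annotation could look like in that format. The `image` / `prefix` / `suffix` field names, the `<OD>` task prefix, and the 1000-bin `<loc_N>` quantization are assumptions based on how Florence-2 is commonly described, not something this thread spells out; check them against an actual export before relying on them. The helper names are hypothetical.

```python
import json


def to_loc_token(value: float, size: int, bins: int = 1000) -> str:
    """Quantize a pixel coordinate into a <loc_N> token with N in [0, bins - 1]."""
    index = int(value * bins / size)
    index = max(0, min(bins - 1, index))  # clip so the image edge maps to the last bin
    return f"<loc_{index}>"


def build_od_entry(image_name, width, height, boxes):
    """Build one annotation line; boxes are (class_name, x_min, y_min, x_max, y_max) in pixels."""
    parts = []
    for class_name, x_min, y_min, x_max, y_max in boxes:
        parts.append(
            class_name
            + to_loc_token(x_min, width)
            + to_loc_token(y_min, height)
            + to_loc_token(x_max, width)
            + to_loc_token(y_max, height)
        )
    # Assumed layout: task prompt in "prefix", expected model output in "suffix".
    return {"image": image_name, "prefix": "<OD>", "suffix": "".join(parts)}


entry = build_od_entry("example.jpg", 640, 480, [("car", 64, 48, 320, 240)])
print(json.dumps(entry))
```

Writing one such JSON object per line gives a JSONL annotation file, which is easy to stream during training.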

starting this week, all datasets on @roboflow Universe can be downloaded in a format compatible with Florence-2
using the PEFT library to configure LoRA for Florence-2 fine-tuning
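
To show why LoRA keeps fine-tuning cheap, here is a small pure-Python sketch that counts trainable weights for a single linear layer. The layer size and rank are hypothetical; the thread does not give the values used in the tutorial. In PEFT, the rank is set through `LoraConfig` (parameters such as `r`, `lora_alpha`, and `target_modules`).

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Number of weights updated when the whole layer is fine-tuned."""
    return d_in * d_out


d_in = d_out = 1024  # hypothetical layer size
rank = 8             # hypothetical LoRA rank

lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA trains {lora} of {full} weights ({lora / full:.2%})")
```

The ratio shrinks further as layers get wider, since the LoRA cost grows linearly with layer width while the full weight matrix grows quadratically.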
defining the training loop and fine-tuning on a custom dataset
interesting to see that fine-tuned Florence-2 can accidentally produce misspelled class names
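
One possible way to cope with this at evaluation time is to snap each predicted name to the closest known class. This is a sketch using Python's standard `difflib`, not the approach from the tutorial; the class list, the cutoff, and the `normalize_class_name` helper are hypothetical.

```python
from difflib import get_close_matches


def normalize_class_name(predicted, known_classes, cutoff=0.6):
    """Return the closest known class for a possibly misspelled name, or None if nothing is close."""
    matches = get_close_matches(predicted, known_classes, n=1, cutoff=cutoff)
    return matches[0] if matches else None


classes = ["car", "truck", "motorcycle"]  # hypothetical class list

fixed = normalize_class_name("motorcyle", classes)   # misspelled class name
unknown = normalize_class_name("banana", classes)    # nothing similar in the list
print(fixed, unknown)
```

Returning `None` for names that match nothing lets you count them as errors instead of silently assigning them to a class.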
in the end I got mAP = 0.75 with Florence-2 vs. mAP = 0.91 with YOLOv8 (on the same dataset)
Florence-2 vs. other top computer vision models right now
here's my Florence-2 overview blog post if you want to learn more about the model

link: blog.roboflow.com/florence-2
and here's my Google Colab if you want to follow along

link: colab.research.google.com/github/roboflo…

More from @skalskip92

Nov 13, 2025
RF-DETR paper is finally on arXiv

- real time detection with DINOv2 backbone
- runs neural architecture search (NAS) over about 6000 architecture variants
- uses weight sharing across all configs
- first real-time segmentation DETR to break past top YOLO results

↓ more
RF-DETR uses a DINOv2 backbone

- strong visual priors
- boosts results on small and unusual datasets
- transfers better than COCO-optimized backbones
- gives a solid base for NAS to build fast real-time variants without losing quality
Sep 24, 2025
I finally solved player recognition

- player and number detection with RF-DETR
- player tracking with SAM2
- team clustering with SigLIP, UMAP and KMeans
- number recognition with SmolVLM2

stay tuned for YT tutorial: youtube.com/c/Roboflow

↓ full breakdown + code
we start with RF-DETR model fine-tuned to detect players, numbers, referees, ball, rim

model + dataset: universe.roboflow.com/roboflow-jvuqo…
I recently used the same model to build a jump shot make-or-miss demo, which will also be included in my upcoming YT tutorial

google colab: github.com/roboflow/noteb…
Jul 17, 2025
VLMs are getting a lot better at detection and segmentation

with supervision-0.26.0 we shipped more tools allowing you to parse and visualize results from top VLMs

links to demos and examples below

link: github.com/roboflow/super…
added support for parsing and visualizing detection results from @alibaba_cloud Qwen2.5-VL, @moondreamai, and @GoogleDeepMind Gemini 2.0 and 2.5 models.

this comes in addition to existing support for @Microsoft Florence-2 and @GoogleDeepMind PaliGemma.
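
To give a feel for what "parsing" means here, below is a pure-Python sketch that turns a Qwen2.5-VL style text answer into boxes in original-image pixels. It is not supervision's API or implementation. The fenced JSON answer, the `bbox_2d` / `label` keys, and the need to rescale from the model input size are assumptions about the model output; `parse_qwen_detections` is a hypothetical helper.

```python
import json
import re

FENCE = "`" * 3  # triple backtick, built this way to keep this snippet easy to embed


def parse_qwen_detections(text, input_wh, resolution_wh):
    """Parse a JSON list of boxes and rescale it from model input size to original image size."""
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    payload = match.group(1) if match else text  # fall back to raw JSON without a fence
    items = json.loads(payload)

    scale_x = resolution_wh[0] / input_wh[0]
    scale_y = resolution_wh[1] / input_wh[1]

    detections = []
    for item in items:
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key name
        detections.append(
            {
                "label": item["label"],  # assumed key name
                "xyxy": (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y),
            }
        )
    return detections


response = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "dog"}]\n' + FENCE
detections = parse_qwen_detections(response, input_wh=(500, 400), resolution_wh=(1000, 800))
print(detections)
```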
here's an awesome @huggingface space by @SergioPaniego and @onuralpszr, where they compare Moondream and Qwen2.5-VL object understanding using supervision-0.26.0 for parsing and visualization

huggingface.co/spaces/sergiop…
Jun 12, 2025
CVPR 2025 papers pt. 2 - SAMWISE

SAMWISE adds language understanding and temporal reasoning to SAM2; you can segment and track objects in videos just by describing them

more papers: github.com/SkalskiP/top-c…

↓ more
SAM2 supports visual prompts like points and boxes but has no native support for text prompts.

I often showed how combining SAM2 with VLMs enabled language-guided image segmentation.

SAMWISE allows direct text-driven video object segmentation.

Mar 12, 2025
YOLOE is a real-time zero-shot detector (similar to YOLO-World), but it allows you to prompt with text or boxes

here I used YOLOE to detect croissants on a conveyor using a box prompt; I just picked the first frame, drew a box, and ran prediction on the other frames; runs at around 15 fps on T4
just like YOLO-World, YOLOE allows you to prompt images with text

here are two examples where I asked for:
- ["dog", "eye", "tongue", "nose", "ear"] - the model missed the ear here
- ["dogs tail"]
Feb 18, 2025
I've been playing with Qwen2.5-VL object detection over the past few days; take a look

notebook link: github.com/roboflow/noteb…
you can prompt the model to detect multiple object classes at the same time
if there are too many objects in the image, or we try to detect many classes at once, the model can get confused and spin in circles until it reaches the token limit.
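
When that happens, the answer is usually cut off mid-object and full of repeats. A rough sketch of one way to salvage it, assuming the model answers with a JSON list of flat objects that carry `bbox_2d` and `label` keys (an assumption about the output format, and `salvage_detections` is a hypothetical helper, not part of the notebook):

```python
import json
import re


def salvage_detections(text):
    """Keep complete, unique {...} objects from a possibly truncated JSON list."""
    seen = set()
    unique = []
    for chunk in re.findall(r"\{[^{}]*\}", text):  # only fully closed, non-nested objects
        try:
            item = json.loads(chunk)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        bbox = item.get("bbox_2d")
        label = item.get("label")
        if not isinstance(bbox, list) or label is None:
            continue
        key = (label, tuple(bbox))
        if key not in seen:  # drop exact repeats produced by the loop
            seen.add(key)
            unique.append(item)
    return unique


truncated = (
    '[{"bbox_2d": [1, 2, 3, 4], "label": "cup"}, '
    '{"bbox_2d": [1, 2, 3, 4], "label": "cup"}, '
    '{"bbox_2d": [5, 6'
)
salvaged = salvage_detections(truncated)
print(salvaged)
```

This only removes exact duplicates; near-duplicate boxes would still need something like non-maximum suppression.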
