- player and number detection with RF-DETR
- player tracking with SAM2
- team clustering with SigLIP, UMAP and KMeans
- number recognition with SmolVLM2
SAM2.1 tracks objects across video using visual prompts like boxes or points
we use a fine-tuned RF-DETR to detect all players in the first frame, pass these detections to SAM2.1, and track them in the following frames
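roughly, that handoff looks like this (a minimal sketch, not the project's exact code; the rfdetr and sam2 calls, checkpoint paths, and frame paths are my assumptions):

```python
# detect players with RF-DETR in the first frame, then track them with SAM2.1;
# config/checkpoint/frame paths below are placeholders
import torch
from PIL import Image
from rfdetr import RFDETRBase
from sam2.build_sam import build_sam2_video_predictor

detector = RFDETRBase()  # the project uses a fine-tuned checkpoint here
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

# SAM2's video predictor works on a directory of extracted JPEG frames
state = predictor.init_state(video_path="frames/")

# detect players in the first frame and register each box as a SAM2.1 prompt
first_frame = Image.open("frames/00000.jpg")
detections = detector.predict(first_frame, threshold=0.5)  # sv.Detections-style output
for obj_id, box in enumerate(detections.xyxy):
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)

# propagate the prompts through the rest of the video
with torch.inference_mode():
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one mask per tracked player
```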
I sample frames, detect players, crop the central regions, generate SigLIP embeddings, reduce them with UMAP, and cluster with KMeans to separate players into two teams
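something like this (a minimal sketch; the SigLIP checkpoint is a placeholder and `player_crops` stands for the central crops collected from the sampled frames):

```python
# team clustering: SigLIP embeddings -> UMAP projection -> KMeans with 2 clusters
import numpy as np
import torch
import umap
from PIL import Image
from sklearn.cluster import KMeans
from transformers import AutoProcessor, SiglipVisionModel

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224").eval()

def embed(crops: list[Image.Image]) -> np.ndarray:
    """SigLIP embeddings for a batch of player crops."""
    inputs = processor(images=crops, return_tensors="pt")
    with torch.inference_mode():
        outputs = model(**inputs)
    return outputs.pooler_output.cpu().numpy()

embeddings = embed(player_crops)                                  # (N, D)
projected = umap.UMAP(n_components=3).fit_transform(embeddings)   # (N, 3)
team_ids = KMeans(n_clusters=2).fit_predict(projected)            # 0 or 1 per player
```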
I used the same strategy in last year’s Football AI project. I also recorded a YouTube tutorial covering it
check it out if you haven’t already:
reading player numbers from small and blurry crops is not easy
traditional OCR models struggle with this task
for this reason, we decided to use SmolVLM2, fine-tuned on a custom multi-modal dataset
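inference then looks roughly like this (a sketch assuming the stock SmolVLM2 chat-template workflow from transformers; the checkpoint name is a placeholder rather than our fine-tuned model, and `jersey_crop` is assumed to be a PIL crop of the number region):

```python
# read a jersey number from a crop with a SmolVLM2-style model
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # placeholder, not the fine-tune
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What jersey number is visible? Answer with digits only."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[jersey_crop], return_tensors="pt")

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=8)
answer = processor.batch_decode(generated, skip_special_tokens=True)[0]
```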
the next step is to match each jersey number to the right player using the mask IoS metric
unlike IoU, which measures overlap against the union, IoS measures it against the smaller area, so a smaller object fully inside a larger one gives IoS = 1
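in code, mask IoS is just a few lines:

```python
# IoS: intersection over the smaller of the two mask areas
import numpy as np

def mask_ios(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoS between two binary masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    smaller = min(mask_a.sum(), mask_b.sum())
    return float(intersection / smaller) if smaller > 0 else 0.0

# a number region fully contained in a player mask gives IoS = 1.0,
# so each number is assigned to the player with the highest IoS
```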
as player positions change, jersey numbers are not always clear, so relying on a single prediction is unreliable
to reduce errors, we validate numbers across frames
you can see in the video how numbers stabilize once they stay visible across consecutive frames
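the idea boils down to a per-tracker majority vote over a sliding window (an illustrative sketch; the window size and vote threshold are made-up values, not the project's):

```python
# keep a short history of number predictions per tracker ID and only accept
# a number once it dominates the recent window
from collections import Counter, defaultdict, deque

history = defaultdict(lambda: deque(maxlen=15))  # tracker_id -> recent predictions

def update_number(tracker_id: int, predicted_number: str, min_votes: int = 8) -> str | None:
    history[tracker_id].append(predicted_number)
    number, votes = Counter(history[tracker_id]).most_common(1)[0]
    return number if votes >= min_votes else None  # None = not stable yet
```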
added support for parsing and visualizing detection results from @alibaba_cloud Qwen2.5-VL, @moondreamai, and @GoogleDeepMind Gemini 2.0 and 2.5 models.
this comes in addition to existing support for @Microsoft Florence-2 and @GoogleDeepMind PaliGemma.
here's an awesome @huggingface space by @SergioPaniego and @onuralpszr, where they compare Moondream and Qwen2.5-VL object understanding using supervision-0.26.0 for parsing and visualization
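usage looks roughly like this (a sketch; I'm assuming the `sv.Detections.from_vlm` entry point and `sv.VLM.QWEN_2_5_VL` enum name from the 0.26.0 release, and `image` / `qwen_raw_output` are placeholders, so check the docs for the exact signature):

```python
# parse raw VLM output into supervision Detections, then visualize
import supervision as sv

detections = sv.Detections.from_vlm(
    sv.VLM.QWEN_2_5_VL,          # other entries cover Moondream, Gemini, etc.
    qwen_raw_output,             # raw text returned by the model
    resolution_wh=(image.width, image.height),
)

annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(scene=annotated, detections=detections)
```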
YOLOE is a real-time zero-shot detector (similar to YOLO-World), but it lets you prompt with text or boxes
here I used YOLOE to detect croissants on a conveyor belt using a box prompt; I just picked the first frame, drew a box, and ran prediction on the remaining frames; it runs at around 15 fps on a T4
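the box-prompt workflow with Ultralytics looks roughly like this (a sketch; the checkpoint name and coordinates are placeholders, and the `refer_image` / `visual_prompts` arguments follow my reading of the Ultralytics YOLOE docs, so verify against them):

```python
# prompt YOLOE with a box drawn on the first frame, then predict on later frames
import numpy as np
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor

model = YOLOE("yoloe-11l-seg.pt")

# box drawn around a single croissant in the first frame (class index 0)
visual_prompts = dict(bboxes=np.array([[110, 200, 260, 340]]), cls=np.array([0]))

results = model.predict(
    "frame_0042.jpg",            # frame to run prediction on
    refer_image="frame_0000.jpg",  # frame the box prompt was drawn on
    visual_prompts=visual_prompts,
    predictor=YOLOEVPSegPredictor,
)
```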
you can prompt the model to detect multiple object classes at the same time
if there are too many objects in the image, or we try to detect many classes at once, the model can get confused and spin in circles until it reaches the token limit.
the Skip-Gram model predicts the surrounding context words from a given center word.
during training, Skip-Gram learns word embeddings (numerical representations of words) that capture semantic relationships, which can then be used for natural language processing tasks such as measuring word similarity.
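a tiny gensim example (sg=1 selects Skip-Gram instead of CBOW; the corpus and hyperparameters are purely illustrative):

```python
# train Skip-Gram embeddings on a toy corpus and query similar words
from gensim.models import Word2Vec

sentences = [
    ["the", "goalkeeper", "saved", "the", "penalty"],
    ["the", "striker", "scored", "a", "goal"],
    ["the", "defender", "blocked", "the", "shot"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv.most_similar("striker", topn=3))  # words with the closest embeddings
```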
- enhance the visual tracking accuracy of SAM 2 by incorporating motion modeling, so it can effectively handle fast-moving and occluded objects
- propose a motion-aware memory selection mechanism that, unlike the original fixed-window memory, selectively stores relevant frames based on a mixture of motion and affinity scores, reducing errors in crowded scenes
state-of-the-art performance on various VOT benchmarks, including GOT-10k, LaSOT-ext, and NeedForSpeed
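conceptually, the memory selection can be sketched like this (my own illustrative sketch, not the paper's implementation; the score mix and memory size are placeholders):

```python
# score each past frame by a mix of motion and affinity scores and keep only
# the top-scoring frames as memory, instead of a fixed recent window
from dataclasses import dataclass

@dataclass
class FrameRecord:
    frame_idx: int
    motion_score: float    # e.g. agreement with a motion model's predicted box
    affinity_score: float  # e.g. SAM 2 mask confidence for the tracked object

def select_memory(records: list[FrameRecord], num_memory: int = 7, alpha: float = 0.5) -> list[int]:
    """Return the indices of the frames with the highest combined score."""
    scored = sorted(
        records,
        key=lambda r: alpha * r.motion_score + (1 - alpha) * r.affinity_score,
        reverse=True,
    )
    return sorted(r.frame_idx for r in scored[:num_memory])
```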