🔸 Scale imbalance
It occurs when objects of different sizes are unevenly represented in the dataset: e.g. many small objects vs. few large ones.
✅ Potential Solution
• Oversample small objects using the Copy&Paste data augmentation
• Use higher resolution images
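A minimal NumPy sketch of the Copy&Paste idea: crop each small object with its bounding box, paste a copy at a random location, and duplicate its annotation. The function name and the 1%-of-image-area "small" threshold are illustrative, not from the thread; real implementations also blend masks and handle occlusion.

```python
import numpy as np

def copy_paste(image, boxes, rng=np.random.default_rng(0)):
    """Duplicate each small object at a random location (toy sketch).
    boxes: list of (x1, y1, x2, y2) in pixel coordinates."""
    h, w = image.shape[:2]
    out = image.copy()
    new_boxes = list(boxes)
    for (x1, y1, x2, y2) in boxes:
        bw, bh = x2 - x1, y2 - y1
        if bw * bh > 0.01 * h * w:       # only oversample small objects
            continue
        nx = int(rng.integers(0, w - bw))  # random top-left for the copy
        ny = int(rng.integers(0, h - bh))
        out[ny:ny + bh, nx:nx + bw] = image[y1:y2, x1:x2]
        new_boxes.append((nx, ny, nx + bw, ny + bh))
    return out, new_boxes
```

Each pasted copy adds one more training instance of a small object, which is the oversampling effect the bullet describes.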
🔸 Objective imbalance
It occurs when the total loss combines several terms (e.g. classification and regression losses) and one term dominates the others.
✅ Potential Solution
• Use a weighted loss to rebalance the terms
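The weighted loss is just a weighted sum of the individual terms; the weight values below are illustrative hyperparameters, not values from the thread:

```python
def total_loss(cls_loss, reg_loss, w_cls=1.0, w_reg=2.0):
    """Rebalance the objective so neither term drowns out the other.
    w_cls and w_reg are tuned per task (illustrative values here)."""
    return w_cls * cls_loss + w_reg * reg_loss
```

If the regression loss is typically an order of magnitude smaller than the classification loss, a larger w_reg keeps it from being ignored during training.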
🔸 Class imbalance
It can take two forms: 1- Foreground-Background imbalance (far more background regions than object regions), or 2- Foreground-Foreground imbalance between positive classes: e.g. the Person class vs. the Parking Meter class in the COCO dataset
✅ Potential Solution
• For Case 1: Use the Focal Loss
• For Case 2: Oversample under-represented classes
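For Case 1, Focal Loss down-weights easy, well-classified examples so the abundant background stops dominating training. A minimal NumPy version for binary classification (gamma=2 and alpha=0.25 are the defaults from the RetinaNet paper that introduced it):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL = -alpha_t * (1 - p_t)**gamma * log(p_t).
    p: predicted probability of the positive class, y: label (0 or 1)."""
    p_t = np.where(y == 1, p, 1 - p)           # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

An easy background example (p=0.01, y=0) gets a near-zero loss, while a hard one (p=0.9, y=0) keeps a large loss, so the millions of easy background anchors contribute almost nothing to the gradient.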
🔸 Spatial imbalance
It refers to a set of factors related to the spatial properties of the bounding boxes, such as the regression penalty, location, and IoU.
For example, the L2 loss penalizes heavily shifted predicted boxes much more severely than the L1 or Smooth L1 losses do.
✅ Potential Solution
• Use L1 or Smooth L1 Loss
• Use anchor-free models
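The penalty difference is easy to see numerically: Smooth L1 (a Huber-style loss) behaves like L2 near zero and like L1 for large shifts, capping the penalty on badly localized boxes:

```python
import numpy as np

def l2(x):
    return x ** 2

def l1(x):
    return np.abs(x)

def smooth_l1(x, beta=1.0):
    """Quadratic for |x| < beta, linear beyond (Huber-style)."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)
```

For a 5-pixel shift, l2 gives 25 while smooth_l1 gives 4.5, so a few badly shifted boxes cannot dominate the regression loss.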
If you found this thread valuable,
Follow me for more threads on advanced object detection techniques used in real-world applications → @ai_fast_track
• • •
How do you use transfer learning with images that have 1 channel, or more than 3 channels?
The timm library, developed by @wightmanr, has an elegant way to handle that:
You can specify any number of input channels (e.g. in_chans=1 or in_chans=8) using the timm.create_model() function like this:
m = timm.create_model('resnet34', pretrained=True, in_chans=8)
How does it work?
• Case 1: number of input channels is 1
timm simply sums the 3-channel weights of the first conv layer into a single channel
• Case 2: number of input channels is 8 (more than 3)
timm repeats the 3-channel weights as many times as required, then selects the required number of input-channel weights
In the 8-channel example: repeat 3 times (9 channels generated), then keep the first 8
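The two cases can be mimicked with NumPy on a conv weight tensor of shape (out_ch, 3, k, k). This is a sketch of the idea described above, not timm's actual implementation (timm also rescales the adapted weights to preserve activation magnitudes):

```python
import numpy as np

def adapt_input_weights(w, in_chans):
    """Adapt pretrained RGB conv weights (out_ch, 3, k, k) to `in_chans`.
    Sketch of the behavior described above, not timm's exact code."""
    if in_chans == 1:
        # Case 1: sum the 3 RGB filters into a single channel
        return w.sum(axis=1, keepdims=True)
    # Case 2: repeat the RGB filters, then keep the first `in_chans`
    repeats = int(np.ceil(in_chans / 3))       # e.g. 8 -> repeat 3x (9 channels)
    return np.tile(w, (1, repeats, 1, 1))[:, :in_chans]
```

This lets a model pretrained on RGB ImageNet start from sensible first-layer weights for grayscale or multispectral inputs instead of random initialization.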
🔥 ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation
Heads up: I’m preparing a visual summary on ZSD-YOLO.
So, what is Zero-Shot Detection?
• Zero-shot detection allows a model to detect something in an image even if the model has never seen that thing before
• So, if you have an image of a Chimpanzee and the model has never seen a Chimpanzee before, you can use your zero-shot detector to locate it in the image
• ZSD-YOLO leverages 2 models:
- CLIP: a pretrained Vision-Language model
- YOLOv5: a modified version whose classification branch is replaced so that detections are matched against CLIP text embeddings
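The core zero-shot classification step can be sketched as a cosine-similarity match between a detection's visual embedding and the text embeddings of candidate class names. This is an illustrative sketch of the matching idea, not ZSD-YOLO's actual code (the paper additionally distills CLIP image embeddings into the detector during training):

```python
import numpy as np

def classify_detection(vis_emb, text_embs, class_names):
    """Assign the class whose text embedding has the highest
    cosine similarity with the detection's visual embedding."""
    v = vis_emb / np.linalg.norm(vis_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v                       # cosine similarity per class
    return class_names[int(np.argmax(sims))]
```

Because the class names are only needed at inference time as text, the detector can localize a chimpanzee even if no chimpanzee was annotated in its training set.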
Many open-world applications require the detection of novel objects, but state-of-the-art object detection and instance segmentation models are unable to do so.
• It’s because models learn to suppress any unannotated objects by treating them as background
• To address that issue, the authors propose a simple yet surprisingly powerful data augmentation and training scheme they call Learning to Detect Every Thing (LDET)
• To avoid suppressing hidden objects, i.e. background objects that are visible but unannotated, they paste the annotated objects onto a background image sampled from a small region of the original image (see figure)
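A rough sketch of that pasting step, assuming a single annotated box and a nearest-neighbor upscale of the sampled region (the real LDET pipeline is more involved):

```python
import numpy as np

def ldet_paste(image, box, rng=np.random.default_rng(0), bg_size=16):
    """Paste the annotated object onto a background synthesized from a
    small region of the same image, so unannotated objects disappear
    from the background instead of being learned as negatives."""
    h, w = image.shape[:2]
    # sample a small patch and upscale it (nearest neighbor) to full size
    py = int(rng.integers(0, h - bg_size))
    px = int(rng.integers(0, w - bg_size))
    patch = image[py:py + bg_size, px:px + bg_size]
    ys = np.arange(h) * bg_size // h
    xs = np.arange(w) * bg_size // w
    background = patch[np.ix_(ys, xs)]
    # paste the annotated object back at its original location
    x1, y1, x2, y2 = box
    background[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return background
```

The synthesized background contains no hidden objects by construction, so the model is never trained to treat an unlabeled object as background.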