Ravi Kiran S
Associate Prof, IIIT-H, India | Alum: UW, IISc | Enjoy working on interdisciplinary problems involving multimedia | Meme fan | 🇮🇳

Aug 20, 2021, 17 tweets

📢 In our #ACMMM21 paper, we highlight issues with training and evaluation of crowd counting deep networks. 🧵👇

For far too long, crowd counting works in #CVPR, #AAAI, #ICCV, #NeurIPS have reported only MAE, but not standard deviation.

Looking at MAE together with the standard deviation of the errors around it, a very grim picture emerges. E.g., imagine a SOTA net with an MAE of 71.7 whose deviation is a whopping 376.4!
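To make the point concrete, here is a minimal sketch of how a single badly handled dense-crowd image can inflate the deviation well beyond the MAE. The counts are toy numbers invented for illustration, not results from the paper, and `mae_and_std` is a hypothetical helper name.

```python
import statistics

def mae_and_std(true_counts, predicted_counts):
    """Return (mean, population std) of the per-image absolute count errors."""
    abs_errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return statistics.fmean(abs_errors), statistics.pstdev(abs_errors)

# Toy data (assumption): nine sparse images predicted well, one dense image predicted badly.
true_counts = [50, 60, 55, 70, 65, 40, 45, 80, 75, 3000]
predicted_counts = [52, 57, 56, 66, 67, 41, 43, 84, 72, 2200]

mae, std = mae_and_std(true_counts, predicted_counts)
print(f"MAE = {mae:.1f}, std of absolute error = {std:.1f}")
```

On this toy set the deviation comes out at nearly three times the MAE, even though nine of the ten predictions are off by at most 4.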

How do we address this? There is no easy answer. The problem lies all over the processing pipeline! The standard pipeline for crowd counting looks like 👇

ISSUE-1: The standard sampling procedure for creating train-validation-test splits implicitly assumes a uniform distribution over the target range. But the distribution of crowd counts in benchmark datasets is discontinuous and heavy-tailed, so uniform sampling leaves the tail underrepresented.

The problem is that sampling is done at too fine a resolution, i.e. individual counts.

OUR FIX: Coarsen the resolution. Partition the count range into bins optimal for uniform sampling.

We employ a Bayesian stratification approach to obtain bins which can be uniformly sampled from, for minibatching.
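A rough sketch of the minibatching idea: once bins exist, pick a bin uniformly and then an image uniformly inside it. The bin edges below are hand-picked placeholders (in our work they come from Bayesian stratification, which this sketch does not implement), and the function names and toy counts are assumptions for illustration.

```python
import bisect
import random
from collections import defaultdict

def assign_bins(counts, interior_edges):
    """Map each count to a bin index; bin 0 holds counts below the first edge."""
    return [bisect.bisect_right(interior_edges, c) for c in counts]

def sample_minibatch(counts, interior_edges, batch_size, rng):
    """Draw image indices (with replacement): uniform over non-empty bins, then uniform within the bin."""
    by_bin = defaultdict(list)
    for idx, b in enumerate(assign_bins(counts, interior_edges)):
        by_bin[b].append(idx)
    bins = sorted(by_bin)  # only bins that actually contain images
    return [rng.choice(by_bin[rng.choice(bins)]) for _ in range(batch_size)]

# Hypothetical edges and toy heavy-tailed counts: 80 sparse images, 1 very dense one.
bin_edges = [100, 500, 2000]
counts = [20] * 80 + [300] * 15 + [1200] * 4 + [5000]
batch = sample_minibatch(counts, bin_edges, batch_size=8, rng=random.Random(0))
print([counts[i] for i in batch])
```

Under plain uniform sampling the single 5000-count image would show up in about 1% of draws; sampling over the four bins raises that to roughly a quarter.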

ISSUE-2: Minimizing per-instance loss averaged over a minibatch poses the same issues as minibatch creation (imbalance, bias).

OUR FIX: A novel bin-sensitive loss function. Instead of the loss depending only on the error, we also consider the count bin to which the data sample belongs.
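This thread does not spell out the loss formula, so the snippet below is NOT the loss from the paper. It is one hypothetical way to make a loss depend on the bin as well as the error: weight each sample's absolute error by the inverse of its bin's size in the minibatch, so every bin contributes equally.

```python
from collections import Counter

def bin_weighted_l1(true_counts, predicted_counts, bin_ids):
    """Weighted mean absolute error; each sample's weight is 1 / (size of its bin in the batch)."""
    bin_sizes = Counter(bin_ids)
    weights = [1.0 / bin_sizes[b] for b in bin_ids]
    weighted_sum = sum(
        w * abs(t - p) for w, t, p in zip(weights, true_counts, predicted_counts)
    )
    return weighted_sum / sum(weights)  # equals the average of the per-bin mean errors

# Toy minibatch (assumption): three sparse images in bin 0, one dense image in bin 1.
true_counts = [10, 10, 10, 1000]
predicted_counts = [12, 12, 12, 900]
bin_ids = [0, 0, 0, 1]

plain_l1 = sum(abs(t - p) for t, p in zip(true_counts, predicted_counts)) / len(true_counts)
print(plain_l1, bin_weighted_l1(true_counts, predicted_counts, bin_ids))
```

With these numbers the plain average is dominated by the three sparse images, while the weighted version gives the dense-crowd bin half of the total.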

ISSUE-3: The imbalanced data distribution also causes MSE to be an ineffective representative of performance across the entire test set.

OUR FIX: Instead of using a single pair of numbers (mean, standard deviation) to characterize performance across the *entire* count range, we suggest reporting them for each bin. This provides a much broader idea of performance across the count range.

If a single summary statistic is still desired, mean and standard deviation of bin-level performance measures can be combined in a bin-aware manner.
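A small sketch of bin-level reporting, plus one possible bin-aware summary. The combination rule here (equal-weight mean of bin MAEs, root-mean of bin variances) is an assumption made for illustration; see the paper for the exact definition. Function names and numbers are hypothetical.

```python
import math
import statistics
from collections import defaultdict

def per_bin_stats(abs_errors, bin_ids):
    """Return {bin_id: (MAE, population std)} of absolute errors grouped by bin."""
    grouped = defaultdict(list)
    for err, b in zip(abs_errors, bin_ids):
        grouped[b].append(err)
    return {
        b: (statistics.fmean(errs), statistics.pstdev(errs))
        for b, errs in sorted(grouped.items())
    }

def bin_aware_summary(stats):
    """Equal weight per bin: mean of bin MAEs and square root of the mean bin variance."""
    maes = [m for m, _ in stats.values()]
    variances = [s * s for _, s in stats.values()]
    return statistics.fmean(maes), math.sqrt(statistics.fmean(variances))

# Toy absolute errors (assumption): four sparse images in bin 0, two dense images in bin 1.
abs_errors = [1, 3, 2, 2, 90, 110]
bin_ids = [0, 0, 0, 0, 1, 1]

stats = per_bin_stats(abs_errors, bin_ids)
summary_mae, summary_std = bin_aware_summary(stats)
for b, (m, s) in stats.items():
    print(f"bin {b}: MAE = {m:.1f}, std = {s:.2f}")
print(f"bin-aware summary: MAE = {summary_mae:.1f}, std = {summary_std:.2f}")
```

The per-bin rows make it obvious that the error lives almost entirely in the dense bin, which a single test-set-wide number hides.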

Bin-level results demonstrate that our proposed modifications noticeably reduce the error standard deviation. The comparatively large deviations when binning is not used can clearly be seen.

However, the large magnitudes of deviations relative to MAE are still a big concern.

Studying and addressing the issues we have raised would enable statistically reliable crowd counting approaches in the future. Our project page deepcount.iiit.ac.in contains interactive visualizations for examining results on a per-dataset and per-model basis.

Code and pretrained models can be found at github.com/atmacvit/bincr…

Our crowd counting paper can be read here

... and this work is a happy collaboration with @ganramkr 😀
