📢 In our #ACMMM21 paper, we highlight issues with training and evaluation of crowd counting deep networks. 🧵👇
For far too long, crowd counting works in #CVPR, #AAAI, #ICCV, #NeurIPS have reported only MAE, but not standard deviation.
Looking at MAE together with the standard deviation of the error around it, a very grim picture emerges. E.g., imagine a SOTA net with an MAE of 71.7 but a deviation of a whopping 376.4!
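To make the point concrete, here is a minimal sketch of how MAE and the standard deviation of the absolute error can be computed side by side. The counts below are made up for illustration and are not from any benchmark; `mae_and_std` is a hypothetical helper name.

```python
import statistics

def mae_and_std(predicted, actual):
    """Return (mean, population std) of the absolute counting errors."""
    abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return statistics.fmean(abs_errors), statistics.pstdev(abs_errors)

# Toy data (made up): nine sparse scenes plus one very dense scene.
actual = [10, 20, 30, 40, 50, 60, 70, 80, 90, 5000]
predicted = [12, 18, 33, 37, 52, 58, 73, 77, 92, 4000]

mae, std = mae_and_std(predicted, actual)
print(f"MAE = {mae:.1f}, std = {std:.1f}")
```

A single badly estimated dense scene is enough to push the deviation well above the MAE, which is why reporting MAE alone hides so much.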
How do we address this? There is no easy answer. The problem lies all over the processing pipeline! The standard pipeline for crowd counting looks like 👇
ISSUE-1: The standard sampling procedure for creating train-validation-test splits implicitly assumes a uniform distribution over the target range. But the distribution of crowd counts in benchmark datasets is discontinuous and heavy-tailed. Uniform sampling causes the tail to be underrepresented.
The problem is that sampling is done at too fine a resolution, i.e. individual counts.
OUR FIX: Coarsen the resolution. Partition the count range into bins optimal for uniform sampling.
We employ a Bayesian stratification approach to obtain bins which can be uniformly sampled from, for minibatching.
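A rough sketch of what bin-level sampling for minibatches could look like. It assumes the bin edges are already given: the `bin_edges` values below are hypothetical placeholders, not the output of the Bayesian stratification step, and the function names are illustrative.

```python
import bisect
import random
from collections import defaultdict

def group_by_bin(counts, bin_edges):
    """Map bin index -> list of sample indices (bin i covers [edge[i-1], edge[i]))."""
    bins = defaultdict(list)
    for idx, count in enumerate(counts):
        bins[bisect.bisect_right(bin_edges, count)].append(idx)
    return dict(bins)

def sample_minibatch(bins, batch_size, rng):
    """Pick a bin uniformly at random, then a sample uniformly within that bin."""
    bin_ids = sorted(bins)
    return [rng.choice(bins[rng.choice(bin_ids)]) for _ in range(batch_size)]

# Made-up, heavy-tailed crowd counts.
counts = [5, 8, 12, 20, 35, 40, 60, 90, 150, 400, 1200, 5000]
bin_edges = [50, 500]  # hypothetical edges, chosen by hand for this example

bins = group_by_bin(counts, bin_edges)
batch = sample_minibatch(bins, batch_size=6, rng=random.Random(0))
print(bins)
print(batch)
```

Because every bin is equally likely to be picked, the two dense scenes in the last bin show up far more often than they would under per-image uniform sampling. This sketch samples with replacement, which is one of several possible choices.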
ISSUE-2: Minimizing a per-instance loss averaged over the minibatch poses the same issues as those during minibatch creation (imbalance, bias). OUR FIX: A novel bin-sensitive loss function. Instead of the loss depending only on the error, we also consider the count bin to which each data sample belongs.
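The thread does not spell out the loss formula, so the following is only an illustration of the general idea of making a loss depend on the count bin. It swaps in a simple inverse-bin-frequency weighting of the L1 error, which is an assumption and not the loss from the paper; `bin_balanced_l1` is a hypothetical name.

```python
from collections import Counter

def bin_balanced_l1(predicted, actual, bin_ids):
    """L1 error where each bin contributes equally, regardless of its size.

    Equivalent to averaging the absolute error within each bin and then
    averaging those per-bin values. Illustrative only.
    """
    bin_sizes = Counter(bin_ids)
    n_bins = len(bin_sizes)
    total = 0.0
    for p, a, b in zip(predicted, actual, bin_ids):
        total += abs(p - a) / (bin_sizes[b] * n_bins)
    return total

# Made-up minibatch: three sparse scenes (bin 0) and one dense scene (bin 1).
predicted = [12, 18, 33, 4000]
actual = [10, 20, 30, 5000]
bin_ids = [0, 0, 0, 1]

plain = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
balanced = bin_balanced_l1(predicted, actual, bin_ids)
print(plain, balanced)
```

In this toy batch the plain average is dominated by how many samples each bin happens to contain, while the bin-aware version gives the lone dense scene the same say as the three sparse ones.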
ISSUE-3: The imbalanced data distribution also causes MSE to be an ineffective representative of performance across the entire test set.
OUR FIX: Instead of using a single pair of numbers (mean, standard deviation) to characterize performance across the *entire* count range, we suggest reporting them for each bin. This provides a much broader idea of performance across the count range.
If a single summary statistic is still desired, mean and standard deviation of bin-level performance measures can be combined in a bin-aware manner.
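A small sketch of per-bin reporting, under stated assumptions: the errors and bin assignments are made up, and the summary simply takes an unweighted average of the per-bin means and deviations, which is one possible bin-aware combination and not necessarily the one used in the paper.

```python
import statistics
from collections import defaultdict

def per_bin_report(abs_errors, bin_ids):
    """Return {bin_id: (mean, population std)} of the absolute errors."""
    grouped = defaultdict(list)
    for err, b in zip(abs_errors, bin_ids):
        grouped[b].append(err)
    return {
        b: (statistics.fmean(errs), statistics.pstdev(errs))
        for b, errs in sorted(grouped.items())
    }

def combine(report):
    """Unweighted average of per-bin means and per-bin stds (an assumption)."""
    means = [m for m, _ in report.values()]
    stds = [s for _, s in report.values()]
    return statistics.fmean(means), statistics.fmean(stds)

# Made-up absolute errors for eight test images in three count bins.
abs_errors = [2, 4, 3, 3, 40, 60, 900, 1100]
bin_ids = [0, 0, 0, 0, 1, 1, 2, 2]

report = per_bin_report(abs_errors, bin_ids)
summary = combine(report)
for b, (mean, std) in report.items():
    print(f"bin {b}: MAE = {mean:.1f}, std = {std:.1f}")
print(f"summary: MAE = {summary[0]:.1f}, std = {summary[1]:.1f}")
```

Reading the rows separately shows where a model struggles (here, the densest bin), which a single dataset-wide pair of numbers would blur.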
Bin-level results demonstrate that our proposed modifications noticeably reduce the standard deviation of the error. The comparatively large deviations when binning is not used can be seen clearly.
However, the large magnitudes of deviations relative to MAE are still a big concern.
Studying and addressing the issues we have raised would enable statistically reliable crowd counting approaches in the future. Our project page deepcount.iiit.ac.in contains interactive visualizations for examining results on a per-dataset and per-model basis.
📢 Introducing SynSE, a language-guided approach for generalized zero-shot learning of pose-based action representations! Great effort by @bublaasaur and @divyanshu1709 #actionrecognition
To enable compositional generalization to novel action-object combinations, the action description is transformed into individual Part-of-Speech (PoS) based embeddings.
The PoS-based embeddings are aligned with the action sequence embedding via a VAE-based generative space. This alignment is optimized using within-modality and cross-modality constraints.