I'm super excited to share our work on self-supervised learning for audio! We extend the permutation pretext task with differentiable ranking and show improved performance on low-resource tasks (it also works great on images and video).
1/
When pretraining with permutations, a small fixed subset of permutations is used to train a classifier that predicts which permutation was applied, treating each one as a class.
However, since there are n! different permutations of length n, it's not feasible to cover more than a tiny fraction of them as classes (toy sketch below).
2/
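For context, here's a toy sketch of that classification-style setup. The names (`perm_classes`, `make_example`) and the choice of K = 100 are illustrative, not from the paper:

```python
import math
import random

n = 10                        # segments a clip is chopped into
print(math.factorial(n))      # 3628800 possible orderings

# Classification-style pretext task: pick a small, fixed subset of
# permutations and train a classifier to predict which one was applied.
K = 100                                               # classes actually used
perm_classes = [tuple(random.sample(range(n), n)) for _ in range(K)]

def make_example(segments):
    """Shuffle with one of the K fixed permutations; the label is its index."""
    label = random.randrange(K)
    shuffled = [segments[i] for i in perm_classes[label]]
    return shuffled, label
```

The classifier only ever sees those K orderings, which is a vanishing slice of the 3.6M possibilities even for n = 10.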
We fix this by swapping classification for a differentiable ranking objective, which lets training use arbitrary permutations (sketch after this tweet).
With far more usable permutations, the model learns better representations that transfer to downstream tasks.
3/
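A minimal sketch of what a differentiable ranking loss can look like, using a simple pairwise-sigmoid soft rank in JAX. The paper's actual ranking operator is more principled, so treat this purely as an illustration of how any of the n! permutations can serve as a regression target:

```python
import jax
import jax.numpy as jnp

def soft_rank(scores, tau=0.1):
    """Differentiable surrogate for each score's rank (0 = smallest).

    rank_i ~= sum_j sigmoid((s_i - s_j) / tau), so gradients flow through
    the ordering instead of through a hard argsort.
    """
    diff = scores[:, None] - scores[None, :]              # (n, n) pairwise gaps
    return jax.nn.sigmoid(diff / tau).sum(axis=1) - 0.5   # drop self-comparison

def ranking_loss(scores, original_positions):
    """MSE between the soft ranks of the model's per-segment scores and each
    shuffled segment's true original position: any permutation is a valid target."""
    return jnp.mean((soft_rank(scores) - original_positions) ** 2)

# Gradients w.r.t. the scores (and hence the encoder that produced them):
grads = jax.grad(ranking_loss)(jnp.array([0.3, -1.2, 0.8, 0.1]),
                               jnp.array([1.0, 0.0, 3.0, 2.0]))
```

Because the target is just a vector of positions rather than a class index, nothing limits training to a fixed subset of permutations.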
The paper has nice graphs, cool math, and describes a simple algorithm that is quite competitive.
4/
The work was done while I was at @GoogleAI with an incredible team. @qberthet, @mblondel_ml, Olivier Teboul, and @neilzegh all did amazing work and I learned a ton from them.