I finally got around to reading the Involution #CVPR2021 paper (arxiv.org/abs/2103.06255). Here is a summary and some thoughts: 🧵👇 (1/n)
The method replaces the traditional spatial convolution layers in a CNN with a form of dynamic convolution that uses a different kernel at each (i,j) spatial location. The kernels are thus spatially varying, and they are data dependent, i.e., predicted from the input. (2/n)
Putatively, this would endow the layer with the ability to capture more complex representations, and be particularly effective in the case of large input images where different regions can appear very different. (3/n)
The number of parameters in a traditional convolution layer is already rather large. Predicting a different 3x3 filter per channel for each spatial location would amount to predicting a tensor with 9 times as many channels as the input. (4/n)
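To put rough numbers on the cost per pixel (the channel counts here are illustrative, not from the paper):

```python
# Per-pixel parameter counts for the three kernel types discussed in the thread.
# Ci, Co, G are made-up but plausible values; K = 3 as in a standard conv.
Ci = Co = 256   # input / output channels
K = 3           # kernel size
G = 16          # involution groups, with G << Ci

full_conv = Co * Ci * K * K   # a full kernel per location: 589,824 values per pixel
depthwise = Ci * K * K        # one filter per channel:       2,304 values per pixel
involution = G * K * K        # one filter per group:           144 values per pixel
print(full_conv, depthwise, involution)
```

Even the depthwise variant is expensive to predict at every location, which is what motivates the grouped sharing below.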
That's a no go, obviously. So instead of using a full convolution layer, the paper uses a very restricted depthwise convolution layer, where each filter is shared by multiple channels. (5/n)
A convolution kernel is of size Co x Ci x K x K, a depthwise convolution kernel is of size Ci x K x K, and an involution kernel is of size G x K x K (x H x W, since there is a separate kernel at every spatial location), where G << Ci. Each K x K filter is shared across Ci/G channels. (6/n)
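Here is a minimal numpy sketch of the operation as I understand it; shapes follow the sizes above, but the function and variable names are mine, not the paper's, and real implementations vectorize this (e.g., with an unfold) rather than looping:

```python
import numpy as np

def involution(x, kernels, K):
    """Apply a per-location K x K filter to x.

    x:        (Ci, H, W) input feature map
    kernels:  (G, K, K, H, W) one K x K filter per group, per spatial location
    """
    Ci, H, W = x.shape
    G = kernels.shape[0]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # 'same' padding
    out = np.zeros_like(x)
    ch_per_group = Ci // G
    for c in range(Ci):
        g = c // ch_per_group  # each filter is shared by Ci/G channels
        for u in range(K):
            for v in range(K):
                # At location (i, j), tap (u, v) of the kernel multiplies
                # the shifted input window; accumulate over all K*K taps.
                out[c] += kernels[g, u, v] * xp[c, u:u + H, v:v + W]
    return out
```

With an "identity" kernel (center tap 1, all others 0) at every location, the output equals the input, which is a handy sanity check.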
There is a bunch of marketing spin in the paper, pitching this as some sort of inversion of the 'inherence' of a convolution. I am not sure about others, but I prefer papers to instead describe things as they are, without this unnecessary mumbo jumbo. (7/n)
The parameters of the involution are predicted from the input tensor using two 1x1 convolution layers with a bottleneck structure. For the experiments, the paper replaces all 3x3 convolutions in a ResNet with 7x7 involutions. The experiments are a mixed bag: (8/n)
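The kernel-prediction step can be sketched like this: a 1x1 convolution is just a per-pixel linear map over channels, so two of them with a bottleneck in between map each pixel's Ci features down to Ci/r and back up to the G*K*K kernel values. A minimal numpy sketch (batch norm omitted, weight names `W1`/`W2` are mine):

```python
import numpy as np

def generate_kernels(x, W1, W2, G, K):
    """Predict involution kernels from the input via two 1x1 convs.

    x:  (Ci, H, W)    input feature map
    W1: (Cr, Ci)      first 1x1 conv, the bottleneck (Cr = Ci / r)
    W2: (G*K*K, Cr)   second 1x1 conv, expands to the kernel values
    """
    Ci, H, W = x.shape
    # A 1x1 conv is a per-pixel linear map over the channel dimension.
    h = np.einsum('rc,chw->rhw', W1, x)
    h = np.maximum(h, 0.0)  # ReLU nonlinearity (BN omitted for brevity)
    k = np.einsum('kr,rhw->khw', W2, h)
    return k.reshape(G, K, K, H, W)  # one K x K filter per group, per location
```

So each location's kernel is a function of that location's own feature vector, which is what makes the layer both spatially varying and data dependent.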
Across a range of tasks, involution-based ResNets reach better accuracy than convolutional ResNets, while using somewhat fewer FLOPs and somewhat fewer parameters. (9/n)
However, it is not an apples-to-apples comparison! It pits normal convolutions against (effectively) grouped depthwise convolutions, and with a larger kernel size at that. It is easy to lose sight of this given the "inverting the inherence" spin. (10/n)
If I were the reviewer, I would ask for comparisons against networks that use inverted bottleneck blocks, for instance. Setting this up could be an interesting project for a bachelor's or master's student looking for a small, clearly defined project. (11/n)
Anyone familiar with ImageNet accuracy-vs-FLOPs plots would notice that the RedNet (red) curves are nowhere near where depthwise-convolution-based models tend to fall on these plots. (12/n)
I am not saying the method isn't worthwhile, but rather that better baselines are needed to know its true worth. If this were a submission I was reviewing, I would rate it 'Borderline' pre-rebuttal. (13/n)
