I finally got around to reading the Involution #CVPR2021 paper (arxiv.org/abs/2103.06255). Here is a summary and some thoughts: 🧵👇 (1/n)
The method replaces traditional spatial convolution layers in a CNN with a type of dynamic convolution that uses a different kernel at each (i,j) spatial location. The kernels are spatially varying and data-dependent, i.e., predicted from the input. (2/n)
Putatively, this would endow the layer with the ability to capture more complex representations, and be particularly effective in the case of large input images where different regions can appear very different. (3/n)
The number of parameters in a traditional convolution layer is already rather large. Predicting a full set of 3x3 convolution weights separately for each spatial location would mean predicting Co x Ci x 9 values per (i,j) position, i.e., a tensor of size Co x Ci x 9 x H x W. (4/n)
That's a no-go, obviously. So instead of predicting a full convolution kernel, the paper predicts a very restricted depthwise-style kernel, where each K x K filter is shared across multiple channels. (5/n)
A convolution kernel is of size Co x Ci x K x K, a depthwise convolution kernel is of size Ci x K x K, and an involution kernel is of size G x K x K (x H x W), where G << Ci. Each K x K filter is applied to Ci/G channels. (6/n)
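To make those shapes concrete, here is a quick count. The sizes below (Ci = Co = 256, K = 7, G = 16, H = W = 56) are my own illustrative assumptions, roughly a mid ResNet stage, not numbers from the paper:

```python
# Illustrative sizes (assumed); not taken from the paper's tables.
Ci, Co, K, G, H, W = 256, 256, 7, 16, 56, 56

conv_weights = Co * Ci * K * K          # learned, shared across all positions
depthwise_weights = Ci * K * K          # learned, one K x K filter per channel
involution_values = G * K * K * H * W   # predicted per input: one K x K filter
                                        # per group at each (i, j) position

print(conv_weights, depthwise_weights, involution_values)
```

Note the involution values are not learned parameters at all; they are activations produced fresh for every input, which is why the restricted G x K x K shape matters.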
There is a bunch of marketing spin in the paper, pitching this as some sort of inversion of the 'inherence' of a convolution. I am not sure about others, but I prefer papers to instead describe things as they are, without this unnecessary mumbo jumbo. (7/n)
The parameters of the involution are predicted from the input tensor using two 1x1 convolution layers with a bottleneck structure. For the experiments, the paper replaces all 3x3 convolutions in a ResNet with 7x7 involutions. The experiments are a mixed bag: (8/n)
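A minimal NumPy sketch of the two-step mechanism described above: predict per-position kernels with a 1x1 bottleneck, then apply one K x K filter per group at each location. The function names are mine, and I use a plain ReLU between the two 1x1 layers; the paper's version also includes batch norm and is implemented with efficient unfold ops rather than loops:

```python
import numpy as np

def predict_kernels(x, w_reduce, w_span, G, K):
    """Predict per-position involution kernels via two 1x1 convs (a sketch).
    x: (Ci, H, W); w_reduce: (Cr, Ci); w_span: (G*K*K, Cr)."""
    h = np.maximum(np.einsum('rc,chw->rhw', w_reduce, x), 0.0)  # 1x1 conv + ReLU
    k = np.einsum('or,rhw->ohw', w_span, h)                     # 1x1 conv
    _, H, W = k.shape
    return k.reshape(G, K, K, H, W)

def involve(x, kernels):
    """Apply the predicted K x K filter of each group at every (i, j)."""
    Ci, H, W = x.shape
    G, K, _, _, _ = kernels.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    cpg = Ci // G  # channels per group; each group shares one filter
    for c in range(Ci):
        g = c // cpg
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(kernels[g, :, :, i, j] * xp[c, i:i+K, j:j+K])
    return out

# Toy usage with assumed small sizes.
Ci, Cr, G, K, H, W = 8, 4, 2, 3, 5, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((Ci, H, W))
kernels = predict_kernels(x, rng.standard_normal((Cr, Ci)),
                          rng.standard_normal((G * K * K, Cr)), G, K)
y = involve(x, kernels)  # same shape as x
```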
Across a range of tasks, involution ResNets demonstrate better accuracy than convolution ResNets, while using slightly fewer FLOPs and slightly fewer parameters. (9/n)
However, it is not an apples-to-apples comparison! Involution is effectively pitting depthwise-style convolutions against normal convolutions, and with a larger kernel size at that. It is easy to lose sight of this given the "inverting the inherence" spin. (10/n)
If I were the reviewer, I would ask for comparisons on networks that use inverted bottleneck blocks, for instance. Setting this up could be an interesting project for a bachelor's or master's student looking for a small, clearly defined project. (11/n)
Anyone familiar with ImageNet Accuracy-FLOP plots would notice that the RedNet (red) curves are nowhere near where depthwise convolution based models tend to fall on these plots. (12/n)
I am not saying that the method isn't worthwhile, but rather that better baselines are needed to know its true worth. If this were a submission I was reviewing, I would rate it as 'Borderline' pre-rebuttal. (13/n)