We address the following three key questions about multi-head self-attentions (MSAs) and ViTs:
Q1. What properties of MSAs do we need to better optimize NNs?
Q2. Do MSAs act like Convs? If not, how are they different?
Q3. How can we harmonize MSAs with Convs?
(2/7)
Q1. What Properties of MSAs Do We Need?
MSAs have their pros and cons. MSAs improve NNs by flattening the loss landscapes, and their key property is data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses.
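As a rough illustration of how loss-landscape flatness can be probed (not the paper's code), the sketch below estimates the largest Hessian eigenvalue of the training loss via power iteration; a smaller top eigenvalue indicates a flatter, better-conditioned landscape. `model`, `loss_fn`, `x`, and `y` are hypothetical placeholders.

```python
import torch

def top_hessian_eigenvalue(model, loss_fn, x, y, iters=20):
    """Estimate the largest Hessian eigenvalue of the loss via power iteration.
    A smaller value suggests a flatter loss landscape."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start from a random unit vector shaped like the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]

    eigenvalue = 0.0
    for _ in range(iters):
        # Hessian-vector product via double backprop: H v = d(grad . v)/dp.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eigenvalue = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()
        norm = torch.sqrt(sum((hvi ** 2).sum() for hvi in hv))
        v = [hvi / norm for hvi in hv]
    return eigenvalue
```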
(3/7)
Q2. Do MSAs Act Like Convs?
MSAs and Convs exhibit opposite behaviors, so they are complementary. For example, MSAs are low-pass filters, but Convs are high-pass filters. This suggests that MSAs are shape-biased, whereas Convs are texture-biased.
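The low-pass/high-pass contrast can be checked with a Fourier analysis of intermediate feature maps, in the spirit of the paper's experiments. Below is a hypothetical sketch: it measures the mean log-amplitude of the high-frequency components of a feature map; a low-pass block should reduce this value, while a high-pass block should not. `feat`, `block`, and `cutoff` are placeholders, not identifiers from the paper.

```python
import torch

def high_freq_log_amplitude(feat, cutoff=0.5):
    """Mean log-amplitude of the high-frequency part of a (B, C, H, W) feature map.
    `cutoff` is a radius in the normalized frequency plane (each axis in [-1, 1])."""
    B, C, H, W = feat.shape
    # 2D FFT over the spatial dimensions, shifted so frequency 0 is centered.
    spectrum = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))
    amplitude = spectrum.abs()

    # Radial frequency coordinate for every spatial position.
    fy = torch.linspace(-1, 1, H).view(H, 1).expand(H, W)
    fx = torch.linspace(-1, 1, W).view(1, W).expand(H, W)
    mask = torch.sqrt(fx ** 2 + fy ** 2) > cutoff  # high-frequency region

    return amplitude[..., mask].clamp_min(1e-12).log().mean()

# Usage sketch: compare before/after a block.
# delta = high_freq_log_amplitude(block(feat)) - high_freq_log_amplitude(feat)
# delta < 0 indicates low-pass behavior (expected for MSA blocks);
# delta >= 0 indicates the block preserves or amplifies high frequencies (Convs).
```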
(4/7)
Q3. How Can We Harmonize MSAs With Convs?
MSAs at the end of a stage (not a model) play a key role. We thus introduce AlterNet by replacing Convs at the end of a stage with MSAs. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes.
(5/7)
Then, how do you apply MSAs to your own CNN model?
1. Alternately replace Conv blocks with MSA blocks, starting from the end of a baseline CNN.
2. If an added MSA block does not improve predictive performance, replace a Conv block located at the end of an earlier stage with an MSA block instead.
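As a rough illustration of this rule (not the authors' AlterNet implementation, which uses more elaborate blocks), here is a minimal PyTorch sketch in which the last blocks of a stage are MSA blocks; `ConvBlock`, `MSABlock`, and `build_stage` are simplified placeholders.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Placeholder for a CNN block (e.g. a residual Conv block)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU())

    def forward(self, x):
        return x + self.body(x)

class MSABlock(nn.Module):
    """Placeholder MSA block: self-attention over the spatial positions."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, C, H, W = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return x + out.transpose(1, 2).reshape(B, C, H, W)

def build_stage(dim, depth, num_msa):
    """One stage of `depth` blocks whose last `num_msa` blocks are MSAs,
    so that MSAs sit at the end of the stage."""
    blocks = [ConvBlock(dim) for _ in range(depth - num_msa)]
    blocks += [MSABlock(dim) for _ in range(num_msa)]
    return nn.Sequential(*blocks)
```

For example, `build_stage(dim=256, depth=6, num_msa=1)` ends a six-block stage with a single MSA block; per the rule above, `num_msa` would only be increased as long as it keeps improving predictive performance.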
(6/7)
In summary, MSA ≠ Conv with weak inductive bias.
The self-attention formulation is ANOTHER inductive bias that complements Convs.