We address the following three key questions about multi-head self-attentions (MSAs) and ViTs (a minimal MSA sketch follows the questions for reference):
Q1. What properties of MSAs do we need to better optimize NNs?
Q2. Do MSAs act like Convs? If not, how are they different?
Q3. How can we harmonize MSAs with Convs?
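For concreteness, the sketch below shows a minimal multi-head self-attention block under the standard scaled dot-product formulation; it is an illustrative assumption, not the paper's implementation, and the names (`MSA`, `dim`, `num_heads`) are hypothetical. It highlights the property Q2 contrasts with Convs: each output token is a data-dependent, softmax-weighted average over all tokens, rather than a fixed local filter.

```python
# Minimal multi-head self-attention (MSA) sketch in PyTorch.
# Assumes the standard scaled dot-product attention; hyperparameters are illustrative.
import torch
import torch.nn as nn


class MSA(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # joint query/key/value projection
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape                    # batch, tokens, channels
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4) # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)          # data-dependent weights over all tokens
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)               # 16 tokens of dimension 64
    print(MSA()(x).shape)                     # torch.Size([2, 16, 64])
```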