Overview
We borrow the term “concerto” from music, where a concerto sets contrasting forces against each other: soloists (the concertino) perform in dialogue with an accompanying orchestra (the ripieno). Analogously, in our design the concertino computes local self-attention, while the ripieno captures global (or shared) self-attention. The proposed Concerto Self-Attention can be applied in both the spatial and channel domains, resulting in four distinct self-attention mechanisms. All four have linear complexity, which yields favorable computational efficiency and execution time.
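To make the local/global split concrete, the following is a minimal NumPy sketch under stated assumptions, not the implementation proposed here. It assumes that the local branch attends within non-overlapping windows, that the global branch attends to a fixed number of shared summary tokens obtained by average pooling, and that the two branches are fused by summation. It also omits learned query/key/value projections. The names `local_attention`, `global_attention`, `concerto_sketch`, `window`, and `num_summary` are hypothetical. Reading the four mechanisms as a local and a global branch in each of the two domains is likewise our interpretation for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_attention(x, window):
    """Concertino-like branch (assumed form): attention inside
    non-overlapping windows of `window` tokens.
    Cost is O(n * window * d), linear in n for a fixed window."""
    n, d = x.shape
    assert n % window == 0, "n must be divisible by window"
    xw = x.reshape(n // window, window, d)              # (n/w, w, d)
    scores = xw @ xw.transpose(0, 2, 1) / np.sqrt(d)    # (n/w, w, w)
    return (softmax(scores) @ xw).reshape(n, d)


def global_attention(x, num_summary):
    """Ripieno-like branch (assumed form): every token attends to
    `num_summary` shared summary tokens built by average pooling.
    Cost is O(n * num_summary * d), linear in n for fixed num_summary."""
    n, d = x.shape
    assert n % num_summary == 0, "n must be divisible by num_summary"
    summary = x.reshape(num_summary, n // num_summary, d).mean(axis=1)  # (m, d)
    scores = x @ summary.T / np.sqrt(d)                 # (n, m)
    return softmax(scores) @ summary                    # (n, d)


def concerto_sketch(x, window=4, num_summary=2):
    # Fusion by summation is an assumption made for this sketch only.
    return local_attention(x, window) + global_attention(x, num_summary)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))        # 16 spatial positions, 8 channels

spatial_out = concerto_sketch(tokens)        # positions act as tokens
channel_out = concerto_sketch(tokens.T).T    # channels act as tokens
```

Transposing the input is enough to switch domains in this sketch: the same two branches then mix channels instead of spatial positions, which gives the four combinations. Neither branch ever forms a full n-by-n attention matrix, which is where the linear cost comes from under these assumptions.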