Authors: Lu Chi, Zehuan Yuan, Yadong Mu, Changhu Wang Description: Modeling spatial or temporal long-range dependency plays a key role in deep neural networks. Conventional dominant solutions include recurrent operations on sequential data or deeply stacking convolutional layers with small kernel size. Recently, a number of non-local operators (such as self-attention based) have been devised. They are typically generic and can be plugged into many existing network pipelines for globally computing among any two neurons in a feature map. This work proposes a novel non-local operator. It is inspired by the attention mechanism of human visual system, which can quickly attend to important local parts in sight and suppress other less-relevant information. The core of our method is learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are three-folds: first, BA-Transform is versatile to model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions. Each BA-Transform is learned in a data-adaptive way. Secondly, to address the discrepancy among features, we further design grouped BA-Transforms, which essentially apply different attentional operations to different groups of feature channels. Thirdly, many existing non-local operators are computation-intensive. The proposed BA-Transform is implemented by simple matrix multiplication and admits better efficacy. For empirical evaluation, we perform comprehensive experiments on two large-scale benchmarks, ImageNet and Kinetics, for image / video classification respectively. The achieved accuracies and various ablation experiments consistently demonstrate significant improvement by large margins.