Context-Aware Attention Network for Image-Text Retrieval

Sep 29, 2020
Authors: Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li

Description: As a typical cross-modal problem, bi-directional image-text retrieval relies heavily on joint embedding learning and a similarity measure for each image-text pair. It remains challenging because prior work seldom explores semantic correspondences between modalities and semantic correlations within a single modality at the same time. In this work, we propose a unified Context-Aware Attention Network (CAAN), which selectively focuses on critical local fragments (regions and words) by aggregating the global context. Specifically, it simultaneously exploits global inter-modal alignments and intra-modal correlations to discover latent semantic relations. Considering the interactions between images and sentences during retrieval, intra-modal correlations are derived from the second-order attention of region-word alignments, rather than from naively comparing distances between the original features. Our method achieves competitive results on two standard image-text retrieval datasets, Flickr30K and MS-COCO.
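The core idea described above can be sketched in a few lines of NumPy. The sketch below is a simplified illustration, not the paper's exact formulation: all function names and the `temperature` parameter are assumptions, and CAAN's actual aggregation and normalization details differ. It shows how a region-word alignment matrix can drive both inter-modal attention (regions attending over words) and intra-modal correlations via second-order attention on the alignments (regions that align with similar words are treated as related).

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    # Normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_attention(regions, words, temperature=9.0):
    """Illustrative sketch of alignment-driven attention.

    regions: (n_regions, d) image-region features in the joint space
    words:   (n_words, d)   word features in the joint space
    """
    V, W = l2norm(regions), l2norm(words)
    # Region-word alignment matrix (inter-modal correspondences).
    A = V @ W.T
    # Inter-modal attention: each region aggregates the words it aligns with.
    inter = softmax(temperature * A, axis=1) @ W
    # Intra-modal correlation from second-order attention of the alignments:
    # A @ A.T relates two regions by how similarly they align with the words,
    # instead of comparing the original region features directly.
    S = softmax(temperature * (A @ A.T), axis=1)
    intra = S @ V
    return inter, intra
```

Both outputs have one attended vector per region; a symmetric pass would attend words over regions the same way.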
