Cross-Modality Relevance for Reasoning on Language and Vision