Learning to Generate Grounded Visual Captions without Localization Supervision

ECCV 2020