Hierarchical Recurrent Encoder (2017)

This model implements the Hierarchical Recurrent Encoder architecture as specified in our CVPR 2017 paper. It was trained on VisDial v0.9 train+val and uses VGG-16 to extract image features and NeuralTalk2 for captioning. Code available here.

Late Fusion (2019)

Built on the Late Fusion architecture specified in our CVPR 2017 paper. The model was trained on VisDial v1.0 train+val. It uses ResNeXt detector features for images and Pythia for captioning. Code available here.