Visual Dialog is a novel task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Given an image, dialog history, and a follow-up question about the image, the agent has to answer the question.
This demo runs the Hierarchical Recurrent Encoder architecture described in our CVPR 2017 paper. The model was trained on the VisDial v0.9 train+val splits; it uses VGG-16 to extract image features and NeuralTalk2 for captioning. Code available here.
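To make the hierarchy concrete, here is a minimal sketch of the encoding flow: each past question-answer round is folded into a vector, a dialog-level recurrence runs over those round vectors, and the result is combined with the image and current-question features. All dimensions, weights, and function names below are illustrative stand-ins, not the trained model; the real system uses learned word embeddings and VGG-16 image features.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden size (hypothetical, for illustration only)

def rnn_encode(vectors, W_h, W_x):
    """Plain tanh RNN: fold a sequence of D-dim vectors into one state."""
    h = np.zeros(D)
    for x in vectors:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

# Shared random (untrained) weights, standing in for learned parameters.
W_h = rng.normal(size=(D, D))
W_x = rng.normal(size=(D, D))

def hre_encode(image_feat, dialog_rounds, question_feat):
    """Hierarchy: encode each past (Q, A) round at the word level, then
    run a dialog-level RNN over the round encodings; finally combine
    with the image and current-question features."""
    round_states = [rnn_encode(r, W_h, W_x) for r in dialog_rounds]
    dialog_state = (rnn_encode(round_states, W_h, W_x)
                    if round_states else np.zeros(D))
    return np.concatenate([image_feat, dialog_state, question_feat])

# Toy inputs: random vectors in place of VGG-16 / word-embedding features.
image = rng.normal(size=D)
history = [[rng.normal(size=D) for _ in range(3)] for _ in range(2)]  # 2 rounds
question = rng.normal(size=D)

state = hre_encode(image, history, question)
print(state.shape)  # (48,) — image, dialog, and question parts concatenated
```

In the full model this joint state would be passed to a decoder that ranks or generates candidate answers; the sketch stops at the encoder, which is the part the architecture's name refers to.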
For more details about the dataset, task and models, please visit visualdialog.org.
Terms of Service
Text or media that you upload to the demo will be stored and used for research purposes.