Image captioning via multimodal embeddings
Image captioning is an ongoing problem in computer vision that aims to generate semantically and syntactically correct captions. Vanilla image captioning models fail to capture the structural relationships between the objects present in images. To overcome this problem, scene graphs (knowledge graphs) that describe the relationships between objects have been added to models, improving results. However, current image captioning models do not consider combining image features and scene graph features in a common latent space before generating captions. Graph convolutional neural networks, designed to capture dependency information, have shown promising results in computer vision. This research investigated whether including scene graph and image features in a multimodal layer improves image captioning models. Results show that including scene graph features improves image captioning performance on the standard evaluation metrics. Qualitative analysis shows that including scene graphs improves the structural relationships between objects in the generated captions.
Research Report submitted in partial fulfilment of the requirements for the degree of Master of Science by coursework and research report in Artificial Intelligence to the Faculty of Science, University of the Witwatersrand, Johannesburg.
Keywords: Image captioning, Multimodal embeddings