Image captioning via multimodal embeddings
Image captioning is an ongoing problem in computer vision that aims to generate semantically and syntactically correct captions. Vanilla image captioning models fail to capture the structural relationships between the objects present in images. To overcome this problem, scene graphs (knowledge graphs) that describe the relationships between objects have been added to models, improving results. However, current image captioning models do not consider combining image features and scene graph features in a common latent space before generating captions. Graph convolutional neural networks, designed to capture dependency information, have shown promising results in computer vision. This research investigated whether including scene graph and image features in a multimodal layer improves image captioning models. Results show that including scene graph features improves image captioning performance on the standard evaluation metrics. Qualitative analysis shows that including scene graphs improves the structural relationships between objects in the generated captions.
Research Report submitted in partial fulfilment of the requirements for the degree of Master of Science by coursework and research report in Artificial Intelligence to the Faculty of Science, University of the Witwatersrand, Johannesburg.
Keywords: Image captioning, Multimodal embeddings