Image captioning via multimodal embeddings

Abstract

Image captioning is an ongoing problem in computer vision, with the aim of generating semantically and syntactically correct captions. Vanilla image captioning models fail to capture the structural relationships between the objects present in an image. To overcome this problem, scene graphs (knowledge graphs) that describe the relationships between objects have been incorporated into captioning models, improving their results. However, current image captioning models do not combine image features and scene graph features in a common latent space before generating captions. Graph convolutional neural networks are designed to capture dependency information and are showing promising results in computer vision. This research investigated whether including scene graph and image features in a multimodal layer improves image captioning models. Results show that including scene graph features improves image captioning performance on the standard evaluation metrics. Qualitative analysis shows that including scene graphs improves the structural relationships between objects in the generated captions.
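The fusion the abstract describes can be illustrated with a minimal sketch: one graph-convolution step over scene-graph node features, followed by a projection of both modalities into a shared latent space. All function names, dimensions, and the random weights below are illustrative assumptions, not the report's actual architecture.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step: each node aggregates its neighbours.
    A: (n, n) scene-graph adjacency (with self-loops),
    X: (n, d) node features, W: (d, h) weights (would be learned)."""
    deg = A.sum(axis=1, keepdims=True)        # node degrees
    A_norm = A / deg                          # row-normalised adjacency
    return np.maximum(A_norm @ X @ W, 0.0)    # ReLU activation

def fuse(img_feat, graph_feat, W_img, W_graph):
    """Project both modalities into one latent space and sum them."""
    return img_feat @ W_img + graph_feat @ W_graph

rng = np.random.default_rng(0)
n, d, h = 4, 8, 16                            # 4 objects, toy dimensions
A = np.eye(n) + np.diag(np.ones(n - 1), 1)    # chain graph + self-loops
X = rng.standard_normal((n, d))               # scene-graph node features
img = rng.standard_normal(32)                 # global CNN image feature

nodes = gcn_layer(A, X, rng.standard_normal((d, h)))
graph_feat = nodes.mean(axis=0)               # pool nodes to one vector
z = fuse(img, graph_feat,
         rng.standard_normal((32, h)),
         rng.standard_normal((h, h)))         # fused multimodal vector
print(z.shape)
```

In a full model, the fused vector `z` would condition a caption decoder (e.g. an LSTM or transformer), so the generated words reflect both visual appearance and the object relationships encoded in the scene graph.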

Description

Research Report submitted in partial fulfilment of the requirements for the degree of Master of Science by coursework and research report in Artificial Intelligence to the Faculty of Science, University of the Witwatersrand, Johannesburg.
