Generating Rich Image Descriptions from Localized Attention

dc.contributor.author: Poulton, David
dc.contributor.supervisor: Klein, Richard
dc.date.accessioned: 2024-10-26T18:49:19Z
dc.date.available: 2024-10-26T18:49:19Z
dc.date.issued: 2023-08
dc.description: A dissertation submitted in fulfilment of the requirements for the degree of Master of Science in Computer Science, to the Faculty of Science, School of Computer Science & Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023.
dc.description.abstract: The field of image captioning is constantly growing with swathes of new methodologies, performance leaps, datasets, and challenges. One new challenge is the task of long-text image description. While the vast majority of research has focused on short captions for images with only short phrases or sentences, new research and the recently released Localized Narratives dataset have pushed this to rich, paragraph-length descriptions. In this work we perform additional research to grow the sub-field of long-text image descriptions and determine the viability of our new methods. We experiment with a variety of progressively more complex LSTM and Transformer-based approaches, utilising human-generated localised attention traces and image data to generate suitable captions, and evaluate these methods on a suite of common language evaluation metrics. We find that LSTM-based approaches are not well suited to the task, under-performing Transformer-based implementations on our metric suite while also proving substantially more demanding to train. On the other hand, we find that our Transformer-based methods are capable of generating captions with rich focus over all regions of the image and in a grammatically sound manner, with our most complex model outperforming existing approaches on our metric suite.
dc.description.sponsorship: National Research Foundation (NRF) of South Africa.
dc.description.submitter: MM2024
dc.faculty: Faculty of Science
dc.identifier: 0000-0001-5953-5032
dc.identifier.citation: Poulton, David. (2023). Generating Rich Image Descriptions from Localized Attention. [Master's dissertation, University of the Witwatersrand, Johannesburg]. https://hdl.handle.net/10539/41976
dc.identifier.uri: https://hdl.handle.net/10539/41976
dc.language.iso: en
dc.publisher: University of the Witwatersrand, Johannesburg
dc.rights: ©2023 University of the Witwatersrand, Johannesburg. All rights reserved. The copyright in this work vests in the University of the Witwatersrand, Johannesburg. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of the Witwatersrand, Johannesburg.
dc.rights.holder: University of the Witwatersrand, Johannesburg
dc.school: School of Computer Science and Applied Mathematics
dc.subject: Computer vision
dc.subject: Natural language processing
dc.subject: Machine learning
dc.subject: Deep learning
dc.subject: Data fusion
dc.subject: Multi-modal models
dc.subject: UCTD
dc.subject.other: SDG-9: Industry, innovation and infrastructure
dc.title: Generating Rich Image Descriptions from Localized Attention
dc.type: Dissertation
Files
Original bundle
Name: Poulton_Generating_2023.pdf
Size: 14.49 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.43 KB
Format: Item-specific license agreed upon to submission