Natural language sequence analysis for automatic keyword assignment: an application in scientific publications

dc.contributor.author: Breuning, Pieter
dc.date.accessioned: 2022-12-21T07:32:59Z
dc.date.available: 2022-12-21T07:32:59Z
dc.date.issued: 2022
dc.description: A research report submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in partial fulfilment of the requirements for the degree of Master of Science (by coursework and research report), 2022
dc.description.abstract: This study investigates the task of keyword assignment in text documents, which is a sparse multi-label classification of a discrete input over a large set of target variables. Most existing approaches focus on keyword extraction, which cannot produce descriptive keywords that do not appear in the text. Domain-specific word embeddings in conjunction with neural classifiers have proven effective for keyword assignment in the literature. This work takes a similar approach, using word embeddings from pre-trained transformer models that capture context deeply and bi-directionally. Two language models are used to produce such word embeddings, i.e. BERT and OpenAI GPT, so that both bi-directional and left-to-right context models can be compared. As a basis for comparison, bag-of-words representations as well as pre-trained Word2Vec language models are also implemented. The datasets used for training and inference consist of publicly available abstracts from scientific journals. Separate datasets are collected and used for pre-training the language models, and for training the classifiers on the tasks of both domain prediction and keyword prediction (assignment). Several language model variations are tested to evaluate the impact of pre-training, and several classification architectures are tested in search of the best performance. The results show little differentiation in model performance for the simpler task of domain prediction, but significant differentiation in the complex task of keyword assignment. It is found that the transformer models produce the best results in this complex task, and that the results are thematically coherent with respect to input texts. It is further found that performance is improved by pre-training transformer models using domain-specific text corpora.
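The abstract frames keyword assignment as sparse multi-label classification over transformer embeddings. The following is a minimal illustrative sketch of that formulation, not the author's code: the encoder name ("bert-base-uncased"), the keyword vocabulary size, and the 0.5 decision threshold are all assumptions chosen for the example.

```python
# Illustrative sketch only: a BERT document embedding feeding a sigmoid
# multi-label head for keyword assignment. Encoder name, NUM_KEYWORDS, and
# the 0.5 threshold are assumed values, not taken from the dissertation.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

NUM_KEYWORDS = 500  # assumed size of the candidate keyword vocabulary

class KeywordAssigner(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_keywords=NUM_KEYWORDS):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # One logit per candidate keyword; independent sigmoids match the
        # sparse multi-label formulation described in the abstract.
        self.head = nn.Linear(self.encoder.config.hidden_size, num_keywords)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as document embedding
        return self.head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = KeywordAssigner()
batch = tokenizer(["An abstract about transformer-based keyword assignment."],
                  return_tensors="pt", truncation=True, padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
probs = torch.sigmoid(logits)
assigned = (probs > 0.5).nonzero()  # indices of predicted keywords
# Training such a head would use nn.BCEWithLogitsLoss against
# multi-hot keyword target vectors.
```

A left-to-right model such as GPT would be substituted by pooling its final-token hidden state instead of a [CLS] token; the classifier head and loss stay the same, which is what makes the bi-directional vs. left-to-right comparison in the study possible.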
dc.description.librarian: CK2022
dc.faculty: Faculty of Science
dc.identifier.uri: https://hdl.handle.net/10539/33876
dc.language.iso: en
dc.school: School of Computer Science and Applied Mathematics
dc.title: Natural language sequence analysis for automatic keyword assignment: an application in scientific publications
dc.type: Thesis
Files

Original bundle (showing 1 - 1 of 1)
Name: MSc_Dissertation - 295821 - December 2021.pdf
Size: 1.96 MB
Format: Adobe Portable Document Format
License bundle (showing 1 - 1 of 1)
Name: license.txt
Size: 2.43 KB
Format: Item-specific license agreed upon to submission