Natural language sequence analysis for automatic keyword assignment: an application in scientific publications

dc.contributor.author: Breuning, Pieter
dc.date.accessioned: 2022-12-21T07:32:59Z
dc.date.available: 2022-12-21T07:32:59Z
dc.date.issued: 2022
dc.description: A research report submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in partial fulfilment of the requirements for the degree of Master of Science (by coursework and research report), 2022
dc.description.abstract: This study investigates the task of keyword assignment in text documents, which is a sparse multi-label classification of a discrete input over a large set of target variables. Most existing approaches focus on keyword extraction, which cannot produce descriptive keywords that do not appear in the text. Domain-specific word embeddings in conjunction with neural classifiers have proven effective for keyword assignment in the literature. This work takes a similar approach, using word embeddings from pre-trained transformer models that capture context deeply and bi-directionally. Two language models are used to produce such word embeddings, i.e. BERT and OpenAI GPT, so that both bi-directional and left-to-right context models can be compared. As a basis for comparison, bag-of-words representations as well as pre-trained Word2Vec language models are also implemented. The datasets used for training and inference consist of publicly available abstracts from scientific journals. Separate datasets are collected and used for pre-training the language models, and for training the classifiers on the tasks of both domain prediction and keyword prediction (assignment). Several language model variations are tested to evaluate the impact of pre-training, and several classification architectures are tested in search of the best performance. The results show little differentiation in model performance for the simpler task of domain prediction, but significant differentiation in the complex task of keyword assignment. It is found that the transformer models produce the best results in this complex task, and that the results are thematically coherent with respect to input texts. It is further found that performance is improved by pre-training transformer models using domain-specific text corpora.
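The abstract frames keyword assignment as sparse multi-label classification over transformer embeddings. The following is a minimal illustrative sketch of that formulation, not the author's code: the encoder name ("bert-base-uncased"), the keyword vocabulary size, and the 0.5 decision threshold are all assumptions chosen for the example.

```python
# Illustrative sketch only: a BERT document embedding feeding a sigmoid
# multi-label head for keyword assignment. Encoder name, NUM_KEYWORDS, and
# the 0.5 threshold are assumed values, not taken from the dissertation.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

NUM_KEYWORDS = 500  # assumed size of the candidate keyword vocabulary

class KeywordAssigner(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_keywords=NUM_KEYWORDS):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # One logit per candidate keyword; independent sigmoids match the
        # sparse multi-label formulation described in the abstract.
        self.head = nn.Linear(self.encoder.config.hidden_size, num_keywords)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as document embedding
        return self.head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = KeywordAssigner()
batch = tokenizer(["An abstract about transformer-based keyword assignment."],
                  return_tensors="pt", truncation=True, padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
probs = torch.sigmoid(logits)
assigned = (probs > 0.5).nonzero()  # indices of predicted keywords
# Training such a head would use nn.BCEWithLogitsLoss against
# multi-hot keyword target vectors.
```

A left-to-right model such as GPT would be substituted by pooling its final-token hidden state instead of a [CLS] token; the classifier head and loss stay the same, which is what makes the bi-directional vs. left-to-right comparison in the study possible.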
dc.description.librarian: CK2022
dc.faculty: Faculty of Science
dc.identifier.uri: https://hdl.handle.net/10539/33876
dc.language.iso: en
dc.school: School of Computer Science and Applied Mathematics
dc.title: Natural language sequence analysis for automatic keyword assignment: an application in scientific publications
dc.type: Thesis
Files

Original bundle (showing 1 - 1 of 1)
Name: MSc_Dissertation - 295821 - December 2021.pdf
Size: 1.96 MB
Format: Adobe Portable Document Format
License bundle (showing 1 - 1 of 1)
Name: license.txt
Size: 2.43 KB
Format: Item-specific license agreed upon to submission