Natural language sequence analysis for automatic keyword assignment: an application in scientific publications
This study investigates the task of keyword assignment in text documents, which is a sparse multi-label classification of a discrete input over a large set of target variables. Most existing approaches focus on keyword extraction, which cannot produce descriptive keywords that do not appear in the text. Domain-specific word embeddings in conjunction with neural classifiers have proven effective for keyword assignment in the literature. This work takes a similar approach, using word embeddings from pre-trained transformer models that are able to capture context deeply and bi-directionally. Two language models, BERT and OpenAI GPT, are used to produce such word embeddings, so that bi-directional and left-to-right context models can be compared. As a basis for comparison, bag-of-words representations and pre-trained Word2Vec language models are also implemented. The datasets used for training and inference consist of publicly available abstracts from scientific journals. Separate datasets are collected and used for pre-training the language models and for training the classifiers on two tasks: domain prediction and keyword prediction (assignment). Several language model variations are tested to evaluate the impact of pre-training, and several classification architectures are tested in search of the best performance. The results show little differentiation in model performance for the simpler task of domain prediction, but significant differentiation in the more complex task of keyword assignment. The transformer models produce the best results in this complex task, and their predicted keywords are thematically coherent with the input texts. Performance improves further when the transformer models are pre-trained on domain-specific text corpora.
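To make the multi-label formulation in the abstract above concrete, the following is a minimal sketch of keyword assignment with a bag-of-words baseline. It is an illustration under stated assumptions, not the models or data used in this report: the four-document corpus, the keyword set, and all function names are hypothetical, and a simple one-sigmoid-per-keyword linear classifier stands in for the neural classifiers. It shows how each document maps to a multi-hot target vector, and how a keyword such as "machine learning" can be assigned even though the phrase never appears in the text.

```python
import math

# Hypothetical toy corpus: (abstract text, assigned keywords).
# The keywords are labels, so they need not occur in the text itself.
DOCS = [
    ("neural networks classify images of galaxies", {"machine learning", "astronomy"}),
    ("transformer models encode sentence context", {"machine learning", "nlp"}),
    ("telescopes observe distant galaxies and stars", {"astronomy"}),
    ("parsing sentence structure with grammar rules", {"nlp"}),
]

def build_vocab(texts):
    """Map every distinct word to a feature index."""
    words = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(text, vocab):
    """Count occurrences of known words; unknown words are ignored."""
    vec = [0.0] * len(vocab)
    for w in text.split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def multi_hot(keywords, labels):
    """Encode a keyword set as a 0/1 vector over the full label list."""
    return [1.0 if k in keywords else 0.0 for k in labels]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, W, b):
    """One independent sigmoid output per keyword (multi-label, not softmax)."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def train(X, Y, epochs=500, lr=0.5):
    """Per-sample gradient descent on binary cross-entropy, summed over keywords."""
    n_feat, n_lab = len(X[0]), len(Y[0])
    W = [[0.0] * n_feat for _ in range(n_lab)]
    b = [0.0] * n_lab
    for _ in range(epochs):
        for x, y in zip(X, Y):
            p = predict_proba(x, W, b)
            for j in range(n_lab):
                g = p[j] - y[j]  # gradient of the loss w.r.t. the logit
                for i in range(n_feat):
                    W[j][i] -= lr * g * x[i]
                b[j] -= lr * g
    return W, b

def assign_keywords(text, vocab, labels, W, b, threshold=0.5):
    """Return every keyword whose predicted probability reaches the threshold."""
    p = predict_proba(bag_of_words(text, vocab), W, b)
    return {k for k, pk in zip(labels, p) if pk >= threshold}

texts = [t for t, _ in DOCS]
LABELS = sorted({k for _, ks in DOCS for k in ks})
VOCAB = build_vocab(texts)
X = [bag_of_words(t, VOCAB) for t in texts]
Y = [multi_hot(ks, LABELS) for _, ks in DOCS]
W, b = train(X, Y)

for t in texts:
    print(t, "->", sorted(assign_keywords(t, VOCAB, LABELS, W, b)))
```

In the setting the abstract describes, the bag-of-words vector would be replaced by an embedding from a pre-trained language model and the label list would be far larger, which is what makes the target vectors sparse.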
A research report submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in partial fulfilment of the requirements for the degree of Master of Science (by coursework and research report), 2022