Knowledge-driven language modelling for text embedding and semantic similarity

dc.contributor.author: Bhana, Nimesh
dc.date.accessioned: 2023-07-20T10:37:54Z
dc.date.available: 2023-07-20T10:37:54Z
dc.date.issued: 2023
dc.description: A dissertation submitted in fulfilment of the requirements for the degree of Master of Science to the Faculty of Science, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023
dc.description.abstract: Language models such as BERT have grown in popularity because they can be pre-trained and then perform robustly on a wide range of Natural Language Processing tasks. Often seen as an evolution of traditional word embedding techniques, they produce representations of text that are useful for tasks such as semantic similarity. However, state-of-the-art models often have high computational requirements and lack the global context or domain knowledge required for complete language understanding. To address these limitations, this work investigates the benefits of incorporating knowledge into the fine-tuning stages of BERT. An existing K-BERT model, which enriches sentences with triples from a Knowledge Graph, is adapted for the English language and extended to inject contextually relevant information into sentences. Given the appropriate knowledge, Knowledge-enabled BERT (K-BERT) outperforms USE and SBERT, similar models suited to text embedding and semantic similarity. Performance is evaluated on the STS-B and ag_news_subset datasets. Knowledge ablation studies indicate that injected knowledge introduces noise. When this noise is minimised, statistically significant performance improvements are observed for knowledge-driven tasks. The results provide evidence that, given an appropriate task, modest injection of relevant, high-quality knowledge performs best. However, achieving successful integration autonomously is non-trivial.
dc.description.librarian: NG (2023)
dc.faculty: Faculty of Science
dc.identifier.uri: https://hdl.handle.net/10539/35731
dc.language.iso: en
dc.school: School of Computer Science and Applied Mathematics
dc.title: Knowledge-driven language modelling for text embedding and semantic similarity
dc.type: Dissertation
Files
Original bundle
Name: Nimesh Bhana 2371061 Research Report.pdf
Size: 3.23 MB
Format: Adobe Portable Document Format