Knowledge-driven language modelling for text embedding and semantic similarity
Language models such as BERT have grown in popularity due to their ability to be pre-trained and then perform robustly on a wide range of Natural Language Processing tasks. Often seen as an evolution of traditional word embedding techniques, they produce representations of text useful for tasks such as semantic similarity. However, state-of-the-art models often have high computational requirements and lack the global context or domain knowledge required for complete language understanding. To address these limitations, this work investigates the benefits of incorporating knowledge into the fine-tuning stage of BERT. An existing Knowledge-enabled BERT (K-BERT) model, which enriches sentences with triples from a Knowledge Graph, is adapted for the English language and extended to inject contextually relevant information into sentences. Given the appropriate knowledge, K-BERT outperforms similar models suited to text embedding and semantic similarity, namely USE and SBERT. Performance is evaluated on the STS-B and ag_news_subset datasets. Knowledge ablation studies indicate that injected knowledge can introduce noise; when this noise is minimised, statistically significant performance improvements are observed on knowledge-driven tasks. The results show that, given an appropriate task, modest injection of relevant, high-quality knowledge is most performant. However, achieving such integration autonomously is non-trivial.
A dissertation submitted in fulfilment of the requirements for the degree of Master of Science to the Faculty of Science, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023