A word embedding trained on South African news data
dc.contributor.author | Mafunda, Martin Canaan | |
dc.contributor.author | Schuld, Maria | |
dc.contributor.author | Durrheim, Kevin | |
dc.contributor.author | Mazibuko, Sindisiwe | |
dc.date.accessioned | 2023-04-10T20:47:46Z | |
dc.date.available | 2023-04-10T20:47:46Z | |
dc.date.issued | 2022-12-23 | |
dc.description.abstract | This article presents results from a study that developed and tested a word embedding trained on a dataset of South African news articles. A word embedding is an algorithm-generated word representation that can be used to analyse the corpus of words that the embedding is trained on. The embedding on which this article is based was generated using the Word2Vec algorithm, which was trained on a dataset of 1.3 million African news articles published between January 2018 and March 2021, containing a vocabulary of approximately 124,000 unique words. The efficacy of this Word2Vec South African news embedding was then tested, and compared to the efficacy provided by the globally used GloVe algorithm. The testing of the local Word2Vec embedding showed that it performed well, with similar efficacy to that provided by GloVe. The South African news word embedding generated by this study is freely available for public use. | |
dc.description.librarian | CA2022 | |
dc.description.sponsorship | The authors are grateful for support from the University of KwaZulu-Natal’s Big Data and Informatics Research Flagship, South Africa’s National Research Foundation (NRF-Grant UID: 137755), and the South African Centre for Digital Language Resources (SADiLaR-Grant #OR-AAALV). SADiLaR is a national centre supported by the South African Department of Science and Innovation (DSI). | |
dc.identifier.citation | Mafunda, M. C., Schuld, M., Durrheim, K., Mazibuko, S. (2022). A word embedding trained on South African news data. The African Journal of Information and Communication (AJIC), 30, 1-24. https://doi.org/10.23962/ajic.i30.13906 | |
dc.identifier.doi | https://doi.org/10.23962/ajic.i30.13906 | |
dc.identifier.uri | https://doi.org/10.23962/ajic.i30.13906 | |
dc.orcid.id | https://orcid.org/0000-0001-9008-5834 | |
dc.orcid.id | https://orcid.org/0000-0001-8626-168X | |
dc.orcid.id | https://orcid.org/0000-0003-2926-5953 | |
dc.orcid.id | https://orcid.org/0000-0003-4376-4230 | |
dc.publisher | LINK Centre, University of the Witwatersrand (Wits), Johannesburg | |
dc.rights | Copyright (c) 2022 Martin Canaan Mafunda, Maria Schuld, Kevin Durrheim, Sindisiwe Mazibuko. This article is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence: https://creativecommons.org/licenses/by/4.0 | |
dc.title | A word embedding trained on South African news data |