A word embedding trained on South African news data

dc.contributor.author: Mafunda, Martin Canaan
dc.contributor.author: Schuld, Maria
dc.contributor.author: Durrheim, Kevin
dc.contributor.author: Mazibuko, Sindisiwe
dc.date.accessioned: 2023-04-10T20:47:46Z
dc.date.available: 2023-04-10T20:47:46Z
dc.date.issued: 2022-12-23
dc.description.abstract: This article presents results from a study that developed and tested a word embedding trained on a dataset of South African news articles. A word embedding is an algorithm-generated word representation that can be used to analyse the corpus of words that the embedding is trained on. The embedding on which this article is based was generated using the Word2Vec algorithm, which was trained on a dataset of 1.3 million African news articles published between January 2018 and March 2021, containing a vocabulary of approximately 124,000 unique words. The efficacy of this Word2Vec South African news embedding was then tested, and compared to the efficacy provided by the globally used GloVe algorithm. The testing of the local Word2Vec embedding showed that it performed well, with similar efficacy to that provided by GloVe. The South African news word embedding generated by this study is freely available for public use.
dc.description.librarian: CA2022
dc.description.sponsorship: The authors are grateful for support from the University of KwaZulu-Natal’s Big Data and Informatics’ Research Flagship, South Africa’s National Research Foundation (NRF-Grant UID: 137755), and the South African Centre for Digital Language Resources (SADiLaR-Grant #OR-AAALV). SADiLaR is a national centre supported by the South African Department of Science and Innovation (DSI).
dc.identifier.citation: Mafunda, M. C., Schuld, M., Durrheim, K., Mazibuko, S. (2022). A word embedding trained on South African news data. The African Journal of Information and Communication (AJIC), 30, 1-24. https://doi.org/10.23962/ajic.i30.13906
dc.identifier.doi: https://doi.org/10.23962/ajic.i30.13906
dc.identifier.uri: https://doi.org/10.23962/ajic.i30.13906
dc.orcid.id: https://orcid.org/0000-0001-9008-5834
dc.orcid.id: https://orcid.org/0000-0001-8626-168X
dc.orcid.id: https://orcid.org/0000-0003-2926-5953
dc.orcid.id: https://orcid.org/0000-0003-4376-4230
dc.publisher: LINK Centre, University of the Witwatersrand (Wits), Johannesburg
dc.rights: Copyright (c) 2022 Martin Canaan Mafunda, Maria Schuld, Kevin Durrheim, Sindisiwe Mazibuko. This article is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence: https://creativecommons.org/licenses/by/4.0
dc.title: A word embedding trained on South African news data
Files
Original bundle
Name: AJIC-Issue-30-2022-Mafunda-et-al.pdf
Size: 655.6 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.43 KB
Format: Item-specific license agreed upon to submission