School of Statistics and Actuarial Science (ETDs)
Permanent URI for this community: https://hdl.handle.net/10539/38022
6 results
Search Results
Item Geographically Weighted Statistical Machine Learning Methods for Predicting Net Primary Productivity in the Eastern Sahel Region (University of the Witwatersrand, Johannesburg, 2025-06) Letsela, Kopano Lazarus; Mlambo, Farai; Adam, Elhadi
Sustainable land management and ecosystem resilience are essential for climate adaptation and resource conservation, particularly in regions susceptible to environmental degradation. This study applies Geographically Weighted Statistical Machine Learning (GWSML) methods to predict Net Primary Productivity (NPP) in the eastern Sahel, a semi-arid region characterised by high climate variability, land degradation, and socio-economic vulnerability. By integrating Geographically Weighted Regression (GWR), Geographically Weighted Random Forests (GWRF), and Geographically Weighted Neural Networks (GWNN), the research addresses spatial heterogeneity and nonlinearity in environmental data, overcoming the limitations of traditional global models. Using data from Niger, Chad, and Sudan, spanning 2019-2021, the models leverage spatially explicit climatic variables (rainfall, temperature, soil moisture, and elevation) to estimate NPP with high accuracy. The data were processed and analysed using Ordinary Kriging (OK) to handle missing data, followed by model calibration. Spatial autocorrelation in residuals was examined using Moran’s I, and the evaluation was conducted using spatial regression and geographically weighted machine learning techniques. Model performance was evaluated using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²). This evaluation involved a comparative analysis with global models, particularly Ordinary Least Squares (OLS), and traditional machine learning methods, including Random Forests (RF) and Neural Networks (NN).
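The four evaluation metrics named above can be computed as in the following minimal sketch. It is only an illustration and not the thesis code: the function name `regression_metrics` and the sample values are made up, and NumPy is assumed to be available.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for paired observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))        # mean of squared errors
    rmse = float(np.sqrt(mse))                  # same units as the response
    mae = float(np.mean(np.abs(residuals)))     # less sensitive to large errors

    # R-squared: 1 minus residual sum of squares over total sum of squares.
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}


# Hypothetical values for illustration only; these are not data from the study.
observed = [0.10, 0.25, 0.40, 0.55]
predicted = [0.12, 0.22, 0.41, 0.50]
metrics = regression_metrics(observed, predicted)
```

Because RMSE is the square root of MSE, the two always rank models identically; MAE can rank them differently, which is consistent with one model having the best RMSE and another the best MAE.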
The results demonstrate that machine learning methods, enhanced with geographical weighting, outperform traditional/global approaches by capturing localised variations and nonlinear dependencies. The best results in this study were an R² of 0.9360, RMSE of 0.0333, and MSE of 0.0012, all achieved by the GWNN model. The GWRF model yielded the best MAE of 0.0191, albeit with a lower R² of 0.9308 compared to GWNN. GWR also performed better than the global models, with an R² of 0.9207. The study results show that GWR, GWRF, and GWNN outperform global regression models in their ability to capture spatial variability. Concurrently, GWRF and GWNN significantly improve prediction accuracy, effectively capturing nonlinear relationships and spatial heterogeneity between NPP and its drivers. The findings highlight the importance of spatially adaptive models for predicting ecological productivity and informing climate adaptation strategies. These models can help mitigate land degradation and promote sustainable agriculture in regions with spatial heterogeneity. Integrating these methods into ecological modelling promises improved outcomes for socio-economic stability, environmental sustainability, and food security in developing, climate-vulnerable regions like the eastern Sahel.
Item Optimising Visual Clarity using Clustering Techniques for Overcrowded Biplots (University of the Witwatersrand, Johannesburg, 2025-06) Balisa, Yamkela; Ganey, Raeesa
The increasing use of data in various industries has driven the need for effective data analysis and visualisation. Data visualisation is a key methodology for extracting insights from the data. One powerful visualisation technique based on dimensionality reduction methods is the biplot. Biplots are multivariate scatterplots that facilitate the visualisation of high-dimensional data by projecting it onto lower-dimensional spaces, usually two or three dimensions.
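As an illustration of the projection a biplot relies on, the sketch below derives two-dimensional sample points and variable vectors from a singular value decomposition of the centred data. This is one common PCA biplot scaling, assumed here for illustration and not necessarily the one used in the thesis; the function name `pca_biplot_coordinates` and the random data are hypothetical.

```python
import numpy as np


def pca_biplot_coordinates(X, n_components=2):
    """Return (scores, loadings) for a PCA biplot of the data matrix X.

    scores   -- one row per sample, coordinates in the reduced space
    loadings -- one row per variable, the direction of its biplot vector
    """
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)              # PCA requires column-centred data

    # Thin SVD: X_centred = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)

    scores = U[:, :n_components] * s[:n_components]   # sample points
    loadings = Vt[:n_components].T                     # variable vectors
    return scores, loadings


# Hypothetical data: 50 samples measured on 6 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
scores, loadings = pca_biplot_coordinates(X)
```

Each row of `loadings` is the endpoint of one variable's vector in the plot, so a data set with many variables produces many vectors in the same two-dimensional display.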
This reduction in dimensionality is achieved using techniques such as Principal Component Analysis (PCA) for continuous data. A biplot simultaneously represents both samples and variables within the same visualisation. However, biplots often face challenges when the data contain a very large number of variables. A key issue is the overcrowding of variables within the biplot, making it difficult to obtain meaningful insights. To address this issue, this study explores the integration of unsupervised learning techniques, specifically clustering, into the biplot framework. Unsupervised learning refers to a type of machine learning approach in which the algorithm learns patterns and relationships in the data without prior knowledge of the expected output. Clustering, a fundamental unsupervised learning technique, involves grouping similar data points into clusters, enabling the identification of underlying structures and relationships. By applying clustering, specifically the k-means clustering algorithm, this study aims to group similar variables into distinct clusters within the biplot. Similar variables are determined by the proximity of their endpoints and the angles they form within the biplot. Ultimately, the refined biplot displays only a representative cluster of vectors, thus enhancing clarity and interpretability.
Item Predicting Future Stock Price with Sentiment Analysis: Recurrent vs. Attention Based Learning for Regression Tasks (University of the Witwatersrand, Johannesburg, 2023-08) Mcdonald, Bernard; Nasejje, Justine
Stock price prediction is a lucrative challenge, as successful prediction could yield significant profits for investors, and it attracts research utilising novel data sources and modelling techniques. This research aimed to accurately predict the future closing price of the top five stocks of the NASDAQ100 index by leveraging Twitter data and recent advancements in machine learning.
Three representations of large-scale Twitter data were derived: company, stock market, and general public sentiment. Company sentiment and stock market sentiment were Granger-causal (p < 0.10) for the closing price of four and two of the five companies considered, respectively. Five stock price prediction models were built: ARIMA, RNN, LSTM, GRU, and a novel Transformer model. A hyperparameter grid search selected feature subsets containing sentiment data as optimal in sixteen of the twenty (80%) model-dataset combinations fitted. Assessed using the RMSE, all the machine learning models outperformed the ARIMA model. The attention-based Transformer model outperformed the recurrent models in both predictive performance and computational training efficiency. The model produced test RMSEs of 1.22, 2.07, 35.54, 16.61, and 4.95 when predicting the closing price of Apple, Microsoft, Amazon, Alphabet, and Facebook, respectively.
Item Clustering and Classification Techniques in the Presence of Outliers: An Application to the Johannesburg Stock Exchange Stocks (University of the Witwatersrand, Johannesburg, 2024) Maphalla, Retsebile; Chipoyera, HW
In this study, the impact of outliers on clustering using the K-means algorithm was explored. It was observed that a high prevalence of outliers can seriously compromise the results of clustering. A novel algorithm called Clustering-quality-aided outlier detection (CQAOD) is proposed in this study. The novelty stems from the fact that, apart from identifying outliers, the algorithm achieves good-quality clustering and simultaneously proffers the “optimal” number of clusters for K-means clustering of multivariate Gaussian data.
In the case of the Johannesburg Stock Exchange (JSE) data, the efficacy of the following clustering techniques was compared with the aim of constructing a diversified stock portfolio: hierarchical clustering, spectral clustering, Clustering Large Applications (Clara), and density-based spatial clustering of applications with noise (DBSCAN). The study found that the hierarchical clustering algorithm is the best algorithm to cluster the shares on the JSE