Mahamuza, Phemelo Hope
2025-06-18
2024-03
Mahamuza, Phemelo Hope. (2024). Assessing and comparing the performance of different machine learning regression algorithms in predicting Chlorophyll-a concentration in the Vaal Dam, Gauteng. [Master's dissertation, University of the Witwatersrand, Johannesburg]. WIReDSpace. https://hdl.handle.net/10539/45165
This research report is submitted in partial fulfilment of the requirements of the degree of Master of Science (M.Sc. Geographic information science and Remote sensing), to the Faculty of Science, School of Geography, Archaeology and Environmental Studies, University of the Witwatersrand, Johannesburg, 2024.
The state of the Vaal Dam is influenced by the various land uses surrounding it, including agricultural activities, mining operations, industrial enterprises, urban settlements, and nature reserves. Mining activities, farming practices, and sewage outflows from nearby villages have led to excess contamination within the Dam, increasing algal bloom levels. Sentinel-2 MSI data were used to predict and understand the spatial pattern of Chlorophyll-a concentration, an indicator of algal bloom occurrence, in the Vaal Dam. A Sentinel-2 Level-1C image was preprocessed in Google Earth Engine (GEE), with acquisition dates from 25 – 26 October 30, 2016, corresponding to the on-site data collection between October 26 and October 28, 2016. Due to limited resources, up-to-date data on the Vaal Dam could not be collected. However, since this study focuses on applying various machine learning regression models to predict Chlorophyll-a levels in waterbodies, the dataset is used to test the models rather than to reflect the current state of the Vaal Dam. The dataset, comprising 23 samples, was divided into 70% training and 30% test sets, allowing for comprehensive model evaluation. Band ratio reflectance values were extracted from the satellite image and correlated with in-field Chlorophyll-a values.
The band ratios with the highest correlation coefficients were used to train the five machine learning models employed in this study: Random Forest (RF), Support Vector Regression (SVR), Least Absolute Shrinkage and Selection Operator (LASSO), Ridge Regression, and Multilinear Regression (MLR). Each model was trained for ten iterations, and the best iteration was then used to generate the final Chlorophyll-a predictive model. The predictive models were validated against the Sentinel-2 MSI satellite data and in-situ measurements using R², RMSE, and MAPE. Among the five machine learning algorithms trained, RF performed the best, with an R² of 0.86 and 0.95, an RMSE of 1.38 and 0.8, and a MAPE of 15.09% and 10.92% for the training and testing sets, respectively, indicating its ability to handle small, non-linear datasets. SVR also performed fairly well, particularly in handling multicollinearity in the data, with an R² of 0.68 and 0.87, an RMSE of 2.37 and 1.56, and a MAPE of 18.13% and 19.28% for the training and testing sets, respectively. The spatial pattern of Chlorophyll-a concentrations, mapped from the RF model, indicated that high concentrations of Chlorophyll-a occur along the Dam shorelines, suggesting a significant impact of land use activities on pollution levels. This study emphasizes the importance of selecting machine learning algorithms suited to the dataset's characteristics. RF and SVR demonstrated proficiency in handling nonlinearity, with RF displaying better generalization and resistance to overfitting. The limited number of field samples, their uneven distribution across the Dam, and the mismatch between satellite overpass dates and fieldwork dates may affect the accuracy of the results. Future research should align satellite pass dates with fieldwork dates and ensure an even distribution of in-field samples across the Dam to represent all land uses and concentration levels.
en
©2024 University of the Witwatersrand, Johannesburg. All rights reserved.
The copyright in this work vests in the University of the Witwatersrand, Johannesburg. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of University of the Witwatersrand, Johannesburg.
Machine learning
Random forest
Support vector regression
Sentinel-2
Chlorophyll-a
Water quality
UCTD
Assessing and comparing the performance of different machine learning regression algorithms in predicting Chlorophyll-a concentration in the Vaal Dam, Gauteng
Dissertation
University of the Witwatersrand, Johannesburg
SDG-6: Clean water and sanitation
SDG-9: Industry, innovation and infrastructure
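The abstract reports R², RMSE, and MAPE as the validation metrics for each model. As a rough illustration of how these three quantities are commonly computed from observed and predicted Chlorophyll-a values, a minimal Python sketch follows. The function names and the sample values are hypothetical and are not taken from the dissertation; the formulas are the standard textbook definitions, which may differ in detail from the implementation the author used.

```python
import math


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the observations."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def mape(observed, predicted):
    """Mean absolute percentage error; observed values must be non-zero."""
    ratios = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * ratios / len(observed)


# Hypothetical in-situ and predicted values (illustration only, not study data).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 36.0]

print(f"R2   = {r2_score(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"MAPE = {mape(observed, predicted):.2f}%")
```

Computing the same three metrics separately on the training and testing subsets gives the paired figures quoted in the abstract (for example, an R² of 0.86 for training and 0.95 for testing with RF). With only 23 samples, the test subset is very small, so these metrics can vary noticeably with how the split is drawn.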