Imputation of missing values and the application of transfer machine learning to predict water quality in acid mine drainage treatment plants

Date
2024
Publisher
University of the Witwatersrand, Johannesburg
Abstract
Access to clean water is one of the most difficult challenges of the 21st century. Natural, unpolluted water bodies are among the most rapidly declining resources due to environmental pollution. In countries like South Africa, which has a mining-centred economy, toxic pollutants from mine tailing dumps and unused mines leach into the underground water table and contaminate it. This is known as Acid Mine Drainage (AMD), and it poses a grave threat to humans, animals and the environment due to its toxic element content and acidity. It is, therefore, imperative that sustainable wastewater treatment procedures be put in place to decrease the toxicity of AMD so that clean water may be recovered. An efficient circular economy is created in the process, since the original wastewater can be recycled to provide not only clean water but also valuable byproducts such as sulphur (from the elevated sulphate content) and other important minerals. Traditional analytical chemistry methods used to measure sulphate are usually time-consuming, expensive and inefficient, which leads to incomplete analytical results being reported. To address this, this study aimed to impute missing values for sulphate concentrations in one AMD treatment plant dataset and then use that dataset to conduct transfer learning to predict concentrations in the datasets of two other AMD treatment plants. The approach involved using historical water data and applying geochemical modelling as a thermodynamic tool to assess the water chemistry and conduct preliminary data cleaning. Machine Learning (ML) was then used to predict the sulphate concentrations, thus addressing the limited data on this parameter in the datasets. With complete and accurate sulphate concentrations, it is possible to conduct further modelling and experimental work aimed at recovering important minerals such as octathiocane, S₈ (a commercial form of sulphur), gypsum and metals.
Historical data were obtained from the three AMD treatment plants in Johannesburg, South Africa (viz., Central Rand, East Rand and West Rand), and the larger Central Rand dataset was split into smaller untreated AMD subsets (Pump A and Pump B). Thermodynamic and solution equilibria aspects of the water were assessed using the PHREEQC geochemical modelling code, which served as a preliminary data cleanup step. Eight baseline and three ensemble machine learning regression models were trained on the Central Rand subsets and compared to find the best performing model, which was then used to conduct Transfer Learning (TL) onto the East Rand and West Rand datasets to predict their sulphate levels. The findings pointed to a high correlation of sulphate with temperature (°C), Total Dissolved Solids (mg/L) and, most importantly, iron (mg/L). The linear correlation between iron and sulphate substantiated pyrite (FeS₂) as their source following weathering. Water quality parameters were found to depend on factors such as weather and geography; this was evident in the treated water, whose chemistry differed markedly from that of the untreated AMD. The neutralisation agents used were chosen based on those parameters, further differentiating the chemistry of the treated and untreated water. The best performing ML model was the Stacking Ensemble (SE) regressor trained on Pump B’s data. It combined the best performing models, namely Linear Regressor (LR), Ridge Regressor (RD), K-Nearest Neighbours Regressor (KNNR), Decision Tree Regressor (DT), Extreme Gradient Boosting Regressor (XG), Random Forest Regressor (RF) and Multi-Layer Perceptron Artificial Neural Network Regressor (MLP), as the level 0 models, with LR as the level 1 model. Level 0 consisted of training heterogeneous base models to obtain the crucial features from the dataset.
These individual predictions and features were then fed to a single meta-learner model in the next layer (level 1) to generate a final prediction. The stacking ensemble model performed well, achieving a Mean Squared Error (MSE) of 0.000011, a Mean Absolute Error (MAE) of 0.002617 and an R² of 0.999737 in under 2 minutes. This model was selected for TL to the East Rand and West Rand datasets. Ensemble methods (bagging, boosting and stacking) outperformed the individual baseline models. However, when the stacking ensemble that combined all the baseline models was compared with the stacking ensemble that combined only the best performing models, excluding the poorly performing models from the stack gave no significant improvement as long as the good models were included. In one case, it was actually beneficial to include the poorly performing models. All models were trained in under 2 minutes, which demonstrated the benefit of ML approaches over traditional approaches. The treated water data was so weakly correlated that model training was unsuccessful, with the highest achievable R² value being 0.14; thus, no treated water model was available for TL. TL was successfully conducted on the cleaned and modelled East Rand AMD dataset using the Central Rand (Pump B) stacking regressor, and a high level of accuracy between the predicted and true sulphate values was achieved (MSE: 0.00124, MAE: 0.0290 and R²: 0.963). This was achieved despite a marked difference between the distributions of the Central Rand and East Rand datasets, which further demonstrated the power of ML for water data. TL was successful in imputing missing values in the West Rand dataset following prediction of sulphate levels in the cleaned and modelled West Rand AMD and treated water datasets. No true values for sulphate levels in the West Rand dataset were given; as such, accuracy comparisons could not be made.
However, a general baseline idea of the amount of sulphate present in the West Rand treatment plant could now be obtained. The sulphate levels in the three treatment plants (Central Rand, East Rand and West Rand) were found to differ greatly from each other, with the Central Rand having the most normal distribution, the East Rand having the most precise distribution and the West Rand having the most variable distribution. Whilst the sulphate levels in the treated effluent waters could not be reliably predicted due to inherent issues (e.g., analytical inaccuracies and inconsistencies) and poor correlations within the treated water datasets, sulphate levels in all three of the untreated AMD datasets were successfully predicted with a high degree of accuracy. This underpinned the observation made previously about the discrepancies between treated and untreated water. The study has shown that it is possible to impute missing values in one water dataset and use transfer learning to complete and consolidate other similar, but scarce, datasets. This approach has been lacking in the water industry, resulting in reliance on traditional methods that are expensive and inadequate. This has caused water practitioners to abandon scarce datasets, thus losing potentially valuable information that could be useful for water remediation and the recovery of valuable resources from the water. As a spin-off from the study, it has been shown that automation of such data analysis is possible. This was achieved by developing a Graphical User Interface (GUI) that makes the SE-ML model easy to use for those with little to no programming background or ML knowledge, e.g., the laboratory staff at the AMD treatment plants. It can also be used for teaching purposes in academia.
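The two-level stacking arrangement and the transfer step described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's code. It assumes scikit-learn, uses synthetic data in place of the plant datasets, and assumes that "transfer" means applying the source-trained model directly to the target plant's features, which the abstract does not spell out. The helpers `make_synthetic_plant`, `build_stack` and `report`, the coefficients and all hyperparameters are hypothetical. The XG and MLP base models are left out only to keep dependencies and runtime small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


def make_synthetic_plant(n_rows, seed, iron_shift=0.0):
    """Hypothetical stand-in for one plant's data (NOT real measurements).

    Features: temperature, total dissolved solids, iron.
    Target: a sulphate-like value built as a noisy linear mix of them.
    """
    rng = np.random.default_rng(seed)
    temperature = rng.normal(20.0, 3.0, n_rows)
    tds = rng.normal(4000.0, 500.0, n_rows)
    iron = rng.normal(150.0 + iron_shift, 30.0, n_rows)
    noise = rng.normal(0.0, 20.0, n_rows)
    sulphate = 5.0 * iron + 0.4 * tds + 2.0 * temperature + noise
    return np.column_stack([temperature, tds, iron]), sulphate


def build_stack():
    """Level 0: heterogeneous base regressors. Level 1: linear meta-learner."""
    level0 = [
        ("lr", LinearRegression()),
        ("ridge", Ridge(alpha=1.0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))),
        ("dt", DecisionTreeRegressor(max_depth=6, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]
    # passthrough=True gives the meta-learner the original features as well
    # as the base-model predictions, mirroring "predictions and features".
    return StackingRegressor(
        estimators=level0,
        final_estimator=LinearRegression(),
        passthrough=True,
        cv=5,
    )


def report(y_true, y_pred):
    """Return the three metrics quoted in the abstract."""
    return {
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }


# Source plant: train the stack and score it on a held-out split.
X_src, y_src = make_synthetic_plant(400, seed=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_src, y_src, test_size=0.25, random_state=0
)
stack = build_stack().fit(X_train, y_train)
source_scores = report(y_test, stack.predict(X_test))

# Target plant: different feature distribution, no retraining.
X_tgt, y_tgt = make_synthetic_plant(150, seed=1, iron_shift=40.0)
target_scores = report(y_tgt, stack.predict(X_tgt))

print("source:", source_scores)
print("target:", target_scores)
```

Where the target plant has no measured sulphate values at all, as reported for the West Rand, only the `stack.predict` call applies and the scoring step has to be skipped.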
Description
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfillment of the requirements for the degree of Master of Science, 2024
Keywords
Acid Mine Drainage, Sulphate, Machine Learning, Regression, Water Quality, Stacking-Ensemble Machine Learning
Citation
Hasrod, Taskeen. (2024). Imputation of missing values and the application of transfer machine learning to predict water quality in acid mine drainage treatment plants [PhD thesis, University of the Witwatersrand, Johannesburg]. WireDSpace.