Browsing by Author "Celik, Turgay"
Now showing 1 - 4 of 4
Item: A Data Science Framework for Mineral Resource Exploration and Estimation Using Remote Sensing and Machine Learning (University of the Witwatersrand, Johannesburg, 2023-08)
Authors: Muhammad Ahsan, Mahboob; Celik, Turgay; Genc, Bekir

Exploring mineral resources and transforming them into ore reserves is imperative for sustainable economic growth, particularly in low-income developing economies. Limited exploration budgets, inaccessible areas, and long data-processing times necessitate advanced multidisciplinary technologies for mineral exploration and resource estimation. Conventional methods for mineral resource exploration require expertise in spatial statistics, resource modelling, geology, and mining engineering, as well as clean, validated data, to build accurate estimates. In recent years, data science has become increasingly important in mineral exploration and estimation, and this study is a step forward in integrating the two fields. The research developed a state-of-the-art data science framework that combines limited field data with remotely sensed satellite data for efficient mineral exploration and estimation, and the framework was validated through case studies.

Satellite remote sensing has emerged as a powerful technology for mineral resource exploration and estimation, used to map and identify minerals, geological features, and lithology. Using digital image-processing techniques (band ratios, spectral band combinations, the spectral angle mapper, and principal component analysis), hydrothermal alteration indicative of potential mineralization was mapped and analysed. Machine learning and geostatistical models were then used to evaluate and predict mineralization from field-based geochemical samples, drillhole samples, and hydrothermal-alteration information derived from multispectral satellite imagery. The machine learning models included Convolutional Neural Networks (CNN), Random Forest (RF), Support Vector Machine (SVM), Support Vector Regression (SVR), the Generalized Linear Model (GLM), and Decision Trees (DT). The geostatistical models were Inverse Distance Weighting (IDW) and Kriging with different semivariogram models: IDW interpolates the value at an unsampled location as a distance-weighted average of nearby samples, while Kriging exploits spatial autocorrelation to make predictions. Model performance was assessed with a variety of predictive-accuracy metrics, including the confusion matrix, the receiver operating characteristic (ROC) curve, and the success-rate curve, along with mean absolute error, mean squared error, and root-mean-square prediction error.

At 10 m spatial resolution, Zn was best predicted by RF, with significant R2 values of 0.74 (p < 0.01) and 0.70 (p < 0.01) during training and testing, while Pb was best predicted by SVR, with significant R2 values of 0.72 (p < 0.01) and 0.64 (p < 0.01) for training and testing, respectively. Overall, SVR and RF outperformed the other machine learning models, achieving the highest testing R2 values. The experiments also showed that no single method can be used on its own to predict the spatial distribution of geochemical elements in streams; instead, a combination of IDW and Kriging is advised to generate more accurate predictions.
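To illustrate the geostatistical side of the framework, the following is a minimal sketch of IDW interpolation, not the thesis's own code; the sample coordinates, Zn grades, and the power parameter are invented for the example:

```python
import numpy as np

def idw_interpolate(sample_xy, sample_values, query_xy, power=2.0):
    """Inverse Distance Weighting: predict values at query points as a
    distance-weighted average of the sampled values."""
    preds = np.empty(len(query_xy))
    for i, q in enumerate(query_xy):
        d = np.linalg.norm(sample_xy - q, axis=1)
        if np.any(d == 0):             # query coincides with a sample point
            preds[i] = sample_values[d == 0][0]
            continue
        w = 1.0 / d**power             # closer samples receive larger weights
        preds[i] = np.sum(w * sample_values) / np.sum(w)
    return preds

# Hypothetical stream-sediment samples: (x, y) in metres and Zn grade (ppm)
xy = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
zn = np.array([55.0, 80.0, 62.0, 95.0])
grid = np.array([[50.0, 50.0], [25.0, 75.0]])
print(idw_interpolate(xy, zn, grid))
```

Kriging replaces the fixed 1/d^power weights with weights derived from a fitted semivariogram, which is why the two methods complement each other in practice.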
For the copper case study, the RF model showed the highest predictive accuracy, consistency, and interpretability of the three machine learning models evaluated, and it achieved the highest predictive efficiency in capturing known copper (Cu) deposits within a small prospective area, outperforming both the SVM and CNN models. The evaluation results showed that the data science framework can deliver highly accurate results in mineral exploration and estimation, and the results of the research were published in several peer-reviewed journal and conference articles. The innovative aspect of the research is the application of machine learning models to both satellite remote sensing and field data, which allows highly prospective mineral deposits to be identified. The framework developed in this study is cost-effective and time-saving, and it can be applied to inaccessible and/or new areas with limited ground-based knowledge to obtain reliable, up-to-date mineral information.

Item: Rationalization of Deep Neural Networks in Credit Scoring (University of the Witwatersrand, Johannesburg, 2023-07)
Authors: Dastile, Xolani Collen; Celik, Turgay

Machine learning and deep learning, subfields of artificial intelligence, are pervasive and ubiquitous technologies of the 21st century, a development attributable to the enhanced processing power of computers, the exponential growth of datasets, and the ability to store those growing datasets. Many companies now view their data as an asset, where previously they viewed it as a by-product of business processes. Banks in particular have started to harness deep learning in their day-to-day operations; for example, chatbots that handle questions and answers about different products can be found on banks' websites. One key area in the banking sector is the credit risk department. Credit risk is the risk of lending money to applicants and is measured using credit-scoring techniques that profile applicants according to their risk. Deep learning techniques have the potential to identify and separate applicants based on their lending-risk profiles. A limitation arises, however, when employing deep learning techniques in credit risk: they cannot provide explanations for their decisions or predictions, and they are therefore described as non-transparent models.

This thesis tackles the lack of transparency inherent in deep learning and machine learning techniques in order to make them suitable for adoption within the banking sector. The performance of statistical, classic machine learning, and deep learning models was compared qualitatively and quantitatively, and the results showed that deep learning techniques outperform both traditional machine learning models and statistical models. The predictions from deep learning techniques were explained using state-of-the-art explanation techniques, a novel model-agnostic explanation technique was devised, and credit-scoring experts assessed its validity. The thesis shows that different explanation techniques can be relied upon to explain predictions from deep learning and machine learning models.
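The thesis's novel explanation technique is not described in the abstract, but permutation importance is a simple, widely used member of the model-agnostic family it belongs to: it explains any fitted classifier by measuring how much a score degrades when one feature is shuffled. A minimal sketch follows; the feature names and synthetic data are assumptions made for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a credit-scoring dataset: three applicant features
# and a default flag driven mainly by the first two.
X = rng.normal(size=(1000, 3))
y = ((0.8 * X[:, 0] - 0.6 * X[:, 1]
      + rng.normal(scale=0.5, size=1000)) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)

# Model-agnostic: only predictions are used, so any classifier can be plugged in.
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, mean in zip(["income", "debt_ratio", "age"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

Because the technique touches only inputs and outputs, the same code works unchanged whether the underlying model is a scorecard, a gradient-boosted ensemble, or a deep network.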
Item: Regularized Deep Neural Network for Post-Authorship Attribution (University of the Witwatersrand, Johannesburg, 2024)
Authors: Modupe, Abiodun; Celik, Turgay; Marivate, Vukosi

Post-authorship attribution is the computational process of determining the legitimate author of an online text snippet, such as an email, blog, forum post, or chat log, by employing stylometric features. The process analyses linguistic and writing patterns such as vocabulary, sentence structure, punctuation usage, and even the use of specific words or phrases. By comparing these features against known writing from potential authors, investigators can form educated hypotheses about the true authorship of a text snippet. Post-authorship attribution also has applications in fields such as forensic linguistics and cybersecurity, where determining the source of a text can be crucial for investigations or for identifying potential threats. In verification procedures that proactively uncover misogynistic, misandrist, xenophobic, and abusive posts on the internet or social networks, finding a text representation that adequately captures an author's distinctive writing is, from a computational-linguistics perspective, known as stylometric analysis. Most online and social media posts, however, are rife with ambiguous terminology that can compromise the precision of earlier authorship attribution models: many of the extracted stylistic elements are idioms, onomatopoeias, homophones, phonemes, synonyms, acronyms, anaphora, and instances of polysemy, which most existing natural language processing (NLP) systems find fundamentally difficult to interpret. These difficulties make it hard to identify the true author of a given text, and further advances in NLP systems are needed to handle such complex linguistic elements and improve the accuracy of authorship attribution models.

This thesis introduces a regularised deep neural network (RDNN) model to address the post-authorship attribution problem. The proposed method combines a convolutional neural network, a bidirectional long short-term memory encoder, and a distributed highway network. The convolutional network generates lexical stylometric features, which are fed into the bidirectional encoder to produce a syntactic feature-vector representation. That vector is passed through the distributed highway network for regularisation, reducing generalisation error, and the regularised feature vector is then given to a bidirectional decoder to learn the author's writing style. The classification layer consists of a fully connected network with a softmax function for prediction. The RDNN method outperformed existing state-of-the-art methods in accuracy, precision, and recall on the majority of the benchmark datasets, highlighting its potential to significantly improve classification performance across domains.
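The exact RDNN implementation is not given in the abstract; the following is a minimal PyTorch sketch of the described pipeline (convolutional lexical features, a bidirectional LSTM encoder, a highway layer, and a softmax classifier), with all layer sizes and hyperparameters assumed for illustration:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Highway layer: a learned gate mixes a transformed signal with its input."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * x

class RDNNSketch(nn.Module):
    def __init__(self, vocab_size, n_authors, emb=128, conv=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, conv, kernel_size=3, padding=1)   # lexical features
        self.encoder = nn.LSTM(conv, hidden, bidirectional=True, batch_first=True)
        self.highway = Highway(2 * hidden)                           # regularising mix
        self.classify = nn.Linear(2 * hidden, n_authors)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)  # (batch, emb, seq_len) for Conv1d
        x = torch.relu(self.conv(x)).transpose(1, 2)
        _, (h, _) = self.encoder(x)             # final hidden state, both directions
        x = torch.cat([h[-2], h[-1]], dim=1)    # (batch, 2 * hidden)
        return self.classify(self.highway(x))   # logits; softmax via the loss

model = RDNNSketch(vocab_size=5000, n_authors=10)
logits = model(torch.randint(0, 5000, (4, 120)))  # 4 snippets of 120 tokens
print(logits.shape)  # torch.Size([4, 10])
```

Training with nn.CrossEntropyLoss applies the softmax implicitly, matching the prediction layer described in the abstract.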
An interactive system to visualise the performance of the proposed method would further enhance its usability and its effectiveness in quantifying the contribution of an author's writing characteristics in both online text snippets and literary documents. Such a system would be useful in processing the evidence needed to support claims or draw conclusions about an author's writing style or intent during pre-trial investigation by law enforcement. Incorporating the method at the pre-trial stage would strengthen the credibility and validity of findings presented in court, has the potential to advance the field of authorship attribution and the accuracy of forensic investigations, and helps ensure a fair and just legal process for all parties by providing concrete evidence to support or challenge claims.

The proposed methods have limitations, and additional research is needed to improve the overall reliability and applicability of post-authorship attribution of online text snippets and literary documents for forensic investigations. Although the methods have revealed notable differences in writing style, such as how influential writers, ordinary users, and suspected authors use language, and although the features extracted from the texts show promise for identifying authorship patterns and aiding forensic analysis, much work remains to validate their usefulness and dependability as authorship attribution procedures. Further research is needed to determine the extent to which external factors, such as the context in which a text was written or the author's emotional state, may affect the identified authorship patterns. It is also crucial to establish a comprehensive dataset covering a diverse range of authors and writing styles to ensure the generalisability of the findings and enhance the reliability of forensic analyses. The dataset used in this thesis does not include such diversity, in particular impostors attempting to impersonate another author, which limits the generalisability of the conclusions and undermines their forensic credibility. Future studies could broaden the proposed approach to detect and distinguish impostors' writing styles from those of genuine authors in both online and literary documents. Since several criminals may collaborate to perpetrate a crime, the methods could also be extended to detect multiple impostors, or the contribution of each writer based on the person they are attempting to mimic. The likelihood of several offenders working together complicates an investigation and demands advanced procedures for identifying individual contributions, as well as both genuine and fabricated impostor content within a text. This is especially difficult on social media, where fake accounts and anonymous profiles obscure the true identity of those involved, where evidence can come from many sources, including text, WhatsApp messages, chat images, and videos, and where such material can fuel the spread of misinformation and manipulation.
Consequently, a hybrid approach that goes beyond text as evidence could help address some of the concerns raised above; integrating audio and visual data, for example, may provide a more complete picture of a scenario. Such an approach does compound the data-distribution limitations noted above and may require more storage and analytical resources, but it can also lead to a more accurate and nuanced analysis of the situation.

Item: The dynamics of pathology dataset creation using urine cytology as an example
Authors: McAlpine, Ewen; Michelow, Pamela; Celik, Turgay

Introduction: Dataset creation is one of the first tasks required for training AI algorithms but is underestimated in pathology. High-quality data are essential for training algorithms; data should be labelled accurately and include sufficient morphological diversity. The dynamics and challenges of labelling a urine cytology dataset using The Paris System (TPS) criteria are presented.

Methods: 2,454 images were labelled by pathologist consensus via video conferencing over a 14-day period, and the dynamics of the labelling process were recorded during the sessions. Quality-assurance images were randomly selected from images labelled in previous sessions within this study and randomly distributed throughout new labelling sessions. To assess the effect of time on the labelling process, the labelled set of images was split into two groups at the median relative label time, and the time taken to label images and the intersession agreement were assessed.

Results: Labelling sessions ranged from 24 min 11 s to 41 min 06 s in length, with a median of 33 min 47 s. The majority of the 2,454 images were labelled as benign urothelial cells, with atypical and malignant urothelial cells more sparsely represented. The time taken to label individual images ranged from 1 s to 42 s, with a median of 2.9 s, and differed significantly among categories: the median label time was 7.2 s for the atypical urothelial category, 3.8 s for the malignant urothelial category, and 2.9 s for the benign urothelial category. The overall intersession agreement for quality-assurance images was substantial, but the level of agreement differed among classes of urothelial cells: the benign and malignant classes showed almost perfect agreement, while the atypical class showed only moderate agreement. Image labelling appeared to speed up over time, with no evidence of worsening intersession agreement.

Discussion/conclusion: Important aspects of pathology dataset creation are presented, illustrating the significant resources required to label a large dataset. We present evidence that the time taken to categorise urine cytology images varies by diagnosis/class, and we confirm the known challenges relating to the reproducibility of the AUC (atypical) category in TPS compared with the NHGUC (benign) and HGUC (malignant) categories.
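The abstract does not name the agreement statistic, but terms such as "moderate", "substantial", and "almost perfect" follow the Landis and Koch scale for Cohen's kappa, so kappa is a reasonable assumption. A minimal sketch of computing intersession agreement this way is below; the repeat-label arrays are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical repeat labels for the same quality-assurance images in two
# sessions; classes follow The Paris System: benign, atypical, malignant.
session_1 = ["benign", "benign", "atypical", "malignant", "benign", "atypical"]
session_2 = ["benign", "benign", "malignant", "malignant", "benign", "atypical"]

kappa = cohen_kappa_score(session_1, session_2)
print(f"intersession kappa: {kappa:.2f}")
# Landis & Koch interpretation: 0.41-0.60 moderate, 0.61-0.80 substantial,
# 0.81-1.00 almost perfect agreement.
```

Computing kappa separately per class (one-vs-rest) would reproduce the per-category comparison reported in the results.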