Why is this an anomaly? Explaining anomalies using sequential explanations

Date
2020
Authors
Mokoena, Tshepiso
Abstract
Anomaly detection has received much attention over the years. In real-world applications, human analysts currently use anomaly detectors to help them identify potentially anomalous data points. Unfortunately, most anomaly detectors do not explain what makes a data point anomalous, so analysts must consider the entire feature space of each detected data point to decide whether it is truly anomalous. This process can be time-consuming and costly in most domains, especially if the feature space is large and feature interactions are critical to the analyst's judgement. To assist the analyst and minimise the number of features that they must analyse to identify true anomalies confidently, we introduce an explanation called a Sequential Explanation (SE). An SE for a detected data point contains subsets of features that explain to the analyst why the detected data point could be anomalous. The first subset in the SE contains only one feature, the second contains two features, the third contains three features, and so on. The subsets are presented to the analyst incrementally, one at a time and in order, until the analyst has acquired enough information to decide whether the data point is an anomaly. In this thesis, we introduce two novel methods of generating SEs that work alongside any anomaly detector. The first is the outlier-based method, which adds features to the SE using an anomaly detector's outlier scoring measure guided by a search algorithm. The second is the sample-based method, which uses sampling to turn the problem into a classical feature selection problem so that any feature selection algorithm can be used to generate the SE.
In our experiments we (i) analyse the performance and complexity of different anomaly detectors' outlier scoring measures and search algorithms in the outlier-based SEs, (ii) analyse the performance and complexity of different feature selection methods in the sample-based SEs, and (iii) compare the outlier-based and sample-based SEs on their performance and complexity. In addition, we introduce a new and improved method of evaluating explanations called the area under the curve of the analyst certainty curve (AUCC). Our results show that both the outlier-based and sample-based methods can generate SEs that significantly outperform randomly presenting features to the analyst. In conclusion, we found that our SEs were able to identify the features that explain the anomalies and that our new evaluation method is an improvement on the previous method used to evaluate explanations.
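
As a rough illustration of the outlier-based idea described in the abstract, the following minimal sketch builds an SE by greedy forward search. It is not the thesis's implementation: the function names `outlier_score` and `sequential_explanation`, the z-score based scoring measure, the greedy search strategy, and the choice to make each subset extend the previous one are all assumptions made for this example. Any anomaly detector's scoring measure and any search algorithm could presumably be substituted.

```python
from math import sqrt
from statistics import mean, pstdev


def outlier_score(point, data, features):
    """Hypothetical scoring measure (an assumption, not from the thesis):
    root mean squared z-score of `point` over the given feature indices,
    with means and deviations estimated from the reference rows in `data`."""
    total = 0.0
    for f in features:
        column = [row[f] for row in data]
        sigma = pstdev(column) or 1.0  # guard against constant columns
        total += ((point[f] - mean(column)) / sigma) ** 2
    return sqrt(total / len(features))


def sequential_explanation(point, data, max_size=None):
    """Greedy forward search (assumed search algorithm): at each step, add
    the feature that maximises the outlier score of the growing subset.
    Returns subsets of size 1, 2, 3, ... in presentation order."""
    n_features = len(point)
    max_size = max_size or n_features
    chosen = []
    explanation = []
    while len(chosen) < max_size:
        remaining = [f for f in range(n_features) if f not in chosen]
        best = max(
            remaining,
            key=lambda f: outlier_score(point, data, chosen + [f]),
        )
        chosen.append(best)
        explanation.append(tuple(chosen))
    return explanation


if __name__ == "__main__":
    # Toy reference data with three features (made up for illustration).
    data = [
        (1.0, 10.0, 5.0),
        (2.0, 11.0, 5.5),
        (3.0, 9.0, 4.5),
        (2.0, 10.0, 5.0),
    ]
    point = (2.5, 10.0, 9.0)  # unusual mainly in feature index 2
    print(sequential_explanation(point, data))
    # prints [(2,), (2, 0), (2, 0, 1)]
```

In this toy case the analyst would first be shown feature 2 alone, then features 2 and 0 together, and so on, stopping as soon as they are confident in their decision.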
Description
A dissertation submitted in partial fulfilment of the requirements for the degree Master of Science, Faculty of Science, University of the Witwatersrand, Johannesburg, 2020