Using Optimisation Techniques to Granulise Rough Set Partitions

Bodie Crossingham

A dissertation submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science in Engineering. Johannesburg, 2008

Declaration

I declare that this dissertation is my own, unaided work, except where otherwise acknowledged. It is being submitted for the degree of Master of Science in Engineering in the University of the Witwatersrand, Johannesburg. It has not been submitted before for any degree or examination in any other university.

Signed this day of 20

Bodie Crossingham

Abstract

Rough set theory (RST) is concerned with the formal approximation of crisp sets and is a mathematical tool which deals with vagueness and uncertainty. RST can be integrated into machine learning and can be used to produce forecasts as well as to determine the causal interpretations for a particular data set. The work performed in this research is concerned with using various optimisation techniques to granulise the rough set input partitions in order to achieve the highest forecasting accuracy produced by the rough set. The forecasting accuracy is measured using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The four optimisation techniques used are genetic algorithm, particle swarm optimisation, hill climbing and simulated annealing. This newly proposed method is tested on two data sets, namely, the human immunodeficiency virus (HIV) data set and the militarised interstate dispute (MID) data set. The results obtained from this granulisation method are compared to two previous static granulisation methods, namely, equal-width-bin and equal-frequency-bin partitioning. The results show that all of the proposed optimised methods produce higher forecasting accuracies than the two static methods. In the case of the HIV data set, the hill climbing approach produced the highest accuracy: 69.02%, achieved in a time of 12624 minutes. For the MID data, the genetic algorithm approach produced the highest accuracy: 95.82%, achieved in a time of 420 minutes. The rules generated from the rough set are linguistic and easy to interpret, but this comes at the expense of the accuracy lost in the discretisation process, where the granularity of the variables is decreased.

To my family and friends...

Acknowledgements

I wish to thank my parents for all the support they have given me throughout my studies. I thank them for instilling in me the important lesson of working hard. Most of all, thanks to my parents who provided me with the greatest item of worth: opportunity. I would also like to thank my siblings for all their encouragement and support. I also would like to thank my friends and girlfriend for all their support throughout my studies. I would like to acknowledge my supervisor Prof. Tshilidzi Marwala. It was his determination to succeed and his wealth of knowledge that drives all of us in C-lab to achieve as much as we can. This determination of his is expressed in one of his favourite quotes, "The opportunity of a lifetime must be seized during the lifetime of the opportunity". Prof. Marwala inspires all those around him, and it is this inspiration that drove me to pursue my Masters.
I thank him for encouraging me to study further and for providing me with the financial backing to study as well as to attend conferences. I thank all my colleagues working with me in the C-lab. It is great to be surrounded by such ambitious guys with whom I can share my ideas and receive feedback through their comments and criticisms of my work.

Contents

Declaration
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Nomenclature
1 Introduction
1.1 Background
1.2 Research Focus and Approach
1.3 Literature Review
1.3.1 Rough Set Theory
1.3.2 Optimisation Techniques
1.3.3 Testing using HIV and MID data
1.4 Outline of Dissertation
2 The Theory of Rough Sets
2.1 Information Table
2.2 Information System
2.3 Indiscernibility Relation
2.4 Lower and Upper Approximations
2.5 Rough Membership Function
2.6 Rough Set Accuracy
2.7 Rough Sets Formulation
3 Using Optimisation Techniques for Discretisation
3.1 Introduction
3.2 Discretisation Criteria
3.3 Equal-Width-Bin and Equal-Frequency-Bin Partitioning
3.3.1 Equal-Width-Bin Partitioning
3.3.2 Equal-Frequency-Bin Partitioning
3.4 Genetic Algorithm
3.5 Particle Swarm Optimisation
3.6 Hill Climbing
3.7 Simulated Annealing
4 Experimental Investigation I: Modelling HIV
4.1 Introduction
4.2 HIV Data
4.2.1 Data
4.2.2 Cleaning the Data
4.3 Results Obtained on HIV data
4.3.1 Results of Optimisation Techniques
4.3.2 Comparison of Methods on HIV data
5 Experimental Investigation II: Modelling MID
5.1 Introduction
5.2 MID Data
5.2.1 Data
5.2.2 Cleaning the Data
5.3 Results Obtained on MID data
5.3.1 Results of Optimisation Techniques
5.3.2 Comparison of Methods on MID data
6 Conclusion
6.1 Summary of Findings
6.2 Recommendations for Further Work
A Algorithms
A.1 Rough Set Model Generation
A.2 Receiver Operating Characteristics (ROC) Curve
A.3 Area Under Curve (AUC)
B Flow Diagrams
B.1 Genetic Algorithm
B.2 Particle Swarm Optimisation
B.3 Hill Climbing
B.4 Simulated Annealing
C Published Work
References

List of Figures

2.1 Approximating the set of HIV positive patients of Table 2.1 using all the conditional attributes
3.1 Block diagram of the sequence of events in creating a rough set model
3.2 Figure illustrating a local maximum versus a global maximum
4.1 Receiver operating characteristic (ROC) curves for GA, EWB and EFB on HIV data
4.2 Receiver operating characteristic (ROC) curves for PSO, EWB and EFB on HIV data
4.3 Receiver operating characteristic (ROC) curves for HC, EWB and EFB on HIV data
4.4 Receiver operating characteristic (ROC) curves for SA, EWB and EFB on HIV data
4.5 Receiver operating characteristic (ROC) curves for GA, PSO, HC and SA on HIV data
4.6 Discretisation points for the various attributes using the GA method on HIV data
5.1 Receiver operating characteristic (ROC) curves for GA, EWB and EFB on MID data
5.2 Receiver operating characteristic (ROC) curves for PSO, EWB and EFB on MID data
5.3 Receiver operating characteristic (ROC) curves for HC, EWB and EFB on MID data
5.4 Receiver operating characteristic (ROC) curves for SA, EWB and EFB on MID data
5.5 Receiver operating characteristic (ROC) curves for GA, PSO, HC and SA on MID data
5.6 Discretisation points for the various attributes using the GA method on MID data
B.1 Flow diagram of genetic algorithm optimisation
B.2 Flow diagram of particle swarm optimisation
B.3 Flow diagram of hill climbing optimisation
B.4 Flow diagram of simulated annealing optimisation

List of Tables

2.1 Information table of the HIV data
2.2 Confusion matrix representing the dispositions of the set of instances
4.1 Summary of the HIV data set variables
4.2 Representation of a Confusion Matrix
4.3 Confusion matrix for EWB discretisation on HIV data
4.4 Confusion matrix for EFB discretisation on HIV data
4.5 Confusion matrix of the GA discretisation on HIV data
4.6 Confusion matrix of the PSO discretisation on HIV data
4.7 Confusion matrix of the HC discretisation on HIV data
4.8 Confusion matrix of the SA discretisation on HIV data
4.9 Results obtained for the four optimisation and two static discretisation methods
5.1 Summary of the MID data set variables
5.2 Confusion matrix for EWB discretisation on MID data
5.3 Confusion matrix for EFB discretisation on MID data
5.4 Confusion matrix of the GA discretisation on MID data
5.5 Confusion matrix of the PSO discretisation on MID data
5.6 Confusion matrix of the HC discretisation on MID data
5.7 Confusion matrix of the SA discretisation on MID data
5.8 Results obtained for the four optimisation and two static discretisation methods

Nomenclature

RST Rough Set Theory
HIV Human Immunodeficiency Virus
MID Militarised Interstate Dispute
EWB Equal Width Bin
EFB Equal Frequency Bin
GA Genetic Algorithm
PSO Particle Swarm Optimisation
HC Hill Climbing
SA Simulated Annealing
RRHC Random-Restart Hill Climbing
NN Neural Network
AUC Area Under Curve
ROC Receiver Operating Characteristic
MLP Multi-Layer Perceptron
SVM Support Vector Machine
GSS Golden Section Search
COW Correlates of War
PPS Probability Proportional to Size
GDP Gross Domestic Product
ARD Automatic Relevance Determination
VFSR Very Fast Simulated Re-annealing
ASA Adaptive Simulated Annealing

Chapter 1 Introduction

1.1 Background

Rough set theory (RST) was introduced by Zdzislaw Pawlak in the early 1980s [1]. RST is a mathematical tool which deals with vagueness and uncertainty; it allows for the approximation of sets that are difficult to describe with the available information. It is of fundamental importance to artificial intelligence (AI) and cognitive science and is highly applicable to this study of performing the task of machine learning and decision analysis. RST can be seen as an extension of classical set theory that incorporates knowledge into its formalism; this allows sets to be approximated by a lower and an upper approximation of the original set. These approximations are the basic notion of rough set theory.

Data can be acquired from measurement systems or human experts, and in principle, this data must be discrete [2]. There are methods which allow continuous data to be processed; these involve discretisation, which converts the continuous data into discrete intervals. Several methods that have been used previously, and that are currently used, to perform this task are Boolean reasoning and Enc-MALIC, as well as the more popularly used equal-width-bin (EWB) partitioning and equal-frequency-bin (EFB) partitioning [3, 4].

For knowledge acquisition from data with numerical attributes, special techniques are applied; most frequently, a step called discretisation is taken before the main step of rule induction or decision tree generation [5]. It is this step that this research focuses on.
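To make the role of discretisation concrete, the short Python sketch below (illustrative only, not part of the thesis software, and using made-up attribute values and cut points) shows how a continuous attribute value is mapped onto one of a set of discrete intervals once cut points have been chosen.

Python sketch (illustrative only):

    def discretise(value, cuts):
        """Map a continuous value to the index of the interval it falls into.

        `cuts` are the interior cut points (c1 < c2 < ...); with three cuts
        the attribute is granulised into four partitions, as used later in
        this research.
        """
        for i, c in enumerate(sorted(cuts)):
            if value < c:
                return i
        return len(cuts)

    # Hypothetical example: a mother's age attribute with made-up cut points.
    age_cuts = [22, 28, 34]
    print([discretise(age, age_cuts) for age in (19, 25, 31, 40)])  # -> [0, 1, 2, 3]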
In this work, the use of combinatorial optimisation techniques to discretise the rough set partitions is explored. Combinatorial optimisation is related to algorithm and computational complexity theory, and lies at the intersection of the artificial intelligence, mathematical and software engineering fields. Combinatorial optimisation is the process of finding an optimal solution in a problem space. In the case of optimising the partition sizes of a rough set, the particular optimisation technique will search the entire search space and create the partition sizes according to an objective criterion. The criterion used in this research is the maximisation of the classification accuracy of the rough set. Four optimisation techniques are used, namely genetic algorithm (GA), particle swarm optimisation (PSO), hill climbing (HC) and simulated annealing (SA). The results produced from these four methods are compared to those of the more commonly used equal-width-bin and equal-frequency-bin partitioning methods. To test the above method, human immunodeficiency virus (HIV) data is used as well as militarised interstate dispute (MID) data.

RST's main advantage lies in its ability to deal with imprecise or conflicting data, and it also produces easy-to-interpret and linguistic rules. Its downfall, though, is its classification or forecasting accuracy. Rough sets sacrifice accuracy for rule interpretability, and this is brought about in the discretisation process, where the granularity of the variables is decreased. RST is used for classification as it produces a balance between transparency of the rough set model and accuracy of data estimation, but it does come at the cost of high computational effort.

1.2 Research Focus and Approach

The approach taken to determine the effects of optimising rough sets using various techniques is as follows:

1. Create a rough set model to be used on the various data sets.
2. Using the HIV data set, discretise the partitions using EWB and EFB partitioning; this determines the classification accuracy produced by the well-known, non-optimised methods.
3. Apply the GA, PSO, HC and SA optimisation techniques to the rough set for analysis on the HIV data; doing so generates the optimal partition sizes for each method.
4. Using these newly generated optimal values, compare them to the results produced using EWB and EFB partitioning.
5. Apply the above methodology to the conflict data set (MID) and investigate the results produced by the optimised and non-optimised methods.

The rough sets are discretised into four partitions for the purpose of comparison for this thesis. Four partitions are chosen as this provides a good balance between accuracy achieved and computational time. The focus of the method described in this research is the classification accuracy of the rough set; this accuracy is based on the receiver operating characteristic (ROC) curves. The area under the curve (AUC) of the ROC curves is used as the performance criterion for each method, and this will be explained fully in Chapter 2. It must be noted that, although there are many aspects involved in RST, from the generation of reducts to the optimisation of the number of split/discretisation points, the focus of this research is how to optimally granulise the input data into four predefined partitions.
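To illustrate the search space the optimisers work over, the minimal Python sketch below (not taken from the thesis implementation; the attribute bounds shown are made up) generates one candidate solution: a sorted set of three cut points per attribute, constrained to each attribute's observed range. Each optimisation technique then evaluates such candidates by building the rough set model from the resulting partitions and scoring its AUC.

Python sketch (illustrative only):

    import random

    def random_candidate(attribute_ranges, cuts_per_attribute=3):
        """Generate one candidate solution: sorted cut points for every attribute.

        Three cut points per attribute granulise the inputs into the four
        partitions used in this research; attribute_ranges gives the
        (min, max) bounds of each attribute.
        """
        return [sorted(random.uniform(lo, hi) for _ in range(cuts_per_attribute))
                for lo, hi in attribute_ranges]

    # Hypothetical bounds for two attributes (e.g. mother's age and education).
    print(random_candidate([(14, 50), (0, 13)]))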
For the process of knowledge acquisition there are two discretisation approaches: the first approach is to discretise numerical attributes during the process of knowledge acquisition, while the second and more commonly used approach is to use discretisation as a preprocessing step [5]. This research uses the latter approach; once discretisation is performed, the attributes are represented in a decision table from which rules are extracted.

1.3 Literature Review

1.3.1 Rough Set Theory

RST is a mathematical tool which deals with vagueness and uncertainty. It is of fundamental importance to artificial intelligence (AI) and cognitive science and is highly applicable to this study of performing the task of machine learning and decision analysis. The advantage of rough sets, as with many other AI techniques, is that they do not require rigid a priori assumptions about the mathematical nature of such complex relationships, as do commonly used multivariate statistical techniques [6, 7, 8]. RST is based on the assumption that the information of interest is associated with some information of its universe of discourse [9, 10]. The main concept of rough set theory is an indiscernibility relation (indiscernibility meaning indistinguishable from one another). Rough set theory handles inconsistent information using two approximations, namely the upper and lower approximation. RST is primarily limited to binary concepts, providing either yes/no results in decision processes or positive/negative results in classification processes [11]. Rough sets have been used in many real-life applications; these include various biomedical applications [12, 13, 14], control algorithm acquisition [15], prediction of aircraft component failure, fault diagnosis and stock market analysis [16, 17, 18]. The use of rough sets has also been investigated in the analysis of conflict [19, 20]. Very limited work has been done on the investigation of applying RST to HIV. Tettey et al have looked at applying RST to HIV [21], and it is based on this work that an investigation into the maximisation of rough set classification is performed.

1.3.2 Optimisation Techniques

The genetic algorithm (GA) was developed extensively by John Holland in the mid-1970s [22]. It is inspired by the principles of genetics and evolutionary biology, and it uses techniques such as inheritance, mutation, selection and crossover. The GA employs the principle of survival of the fittest in its search process: the fitness of every individual in the population is evaluated, and multiple individuals are stochastically selected from the current population (based on their fitness) and modified (mutated) to form a new population. The new population is then used in the next iteration of the algorithm. The process terminates when either a set number of generations has been executed or a satisfactory fitness level is reached. It must be noted that if the algorithm terminates after reaching the maximum number of generations, the optimal solution may not have been reached. The fitness/evaluation function is the only part of the GA that has any knowledge of the problem.

Particle swarm optimisation (PSO) is a stochastic, population-based optimisation algorithm. It was invented by Kennedy and Eberhart in the mid-1990s while trying to simulate the swarming of birds as part of a sociocognitive study investigating the notion of "collective intelligence" in biological populations [23].
As mentioned, PSO is based on the analogy of flocks of birds, schools of fish, and herds of animals, and how they maintain an "evolutionary advantage" by either adapting to their environment, avoiding predators or finding rich sources of food, all by the process of "information sharing" [23]. PSO initialises by creating a random population of solutions (particles); each solution's fitness is evaluated (according to a fitness function) and the fittest or best solution is noted (often known as pbest). Each solution in the problem space has its own set of coordinates, and these coordinates are associated with the fittest solution found so far. Another value noted is the best solution found in the neighbourhood of the particular solutions or particles (often known as lbest). The PSO concept consists of accelerating the velocities of each particle towards the pbest and lbest locations; the accelerations towards pbest and lbest are separate, randomly weighted values [24]. The next iteration takes place after all particles have been moved. It is this behaviour that mimics the swarm of birds after which the technique was developed.

Hill climbing (HC) is another optimisation technique used to partition the rough sets. Hill climbing works on the premise of obtaining the state of a current node and then moving towards a state which is better than the current one. In the most basic variant, simple hill climbing, the closest node is chosen and its state evaluated. In another variant, steepest ascent hill climbing, all successors are compared and the one closest to the solution is chosen. For the purpose of this research, the latter is chosen. The algorithm is called gradient ascent when the objective function is maximised, whereas, if the function were to be minimised, the algorithm would be gradient descent.

The final optimisation technique investigated to discretise the rough set partitions is simulated annealing. Simulated annealing (SA) was invented in 1983 by Kirkpatrick, Gelatt and Vecchi [25]. SA is an iterative procedure that continuously updates one candidate solution until a termination condition is reached [25]; it is a technique that mathematically mirrors the cooling of a set of atoms to a state of minimum energy. The SA algorithm replaces the current solution by a random nearby solution, chosen with a probability that depends on the difference between the corresponding function values and on a global parameter T (temperature). SA is based on the manner in which liquids freeze or metals recrystallise in the process of annealing. SA is a generalisation of a Monte Carlo method and relies on the Metropolis acceptance criterion [26]. SA is advantageous as its implementation allows the solution to move further away from, as well as closer to, the solution; this helps prevent the search getting stuck at a local minimum and allows it to reach the desired global minimum. These techniques will be explained in more detail in Chapter 3.

There have been many uses of the four mentioned optimisation techniques, but no work has yet been done on the optimisation of rough set partition sizes. The limited work performed on RST using optimisation techniques includes feature extraction using rough set theory and genetic algorithms [11] and finding minimal reducts using PSO [27]. Chen et al have used a GA to optimise the number of partitions.
For a given number of partitions, the optimisation of the size of the cuts within each granule is yet to be investigated.

1.3.3 Testing using HIV and MID data

The proposed method is tested on two data sets. These two data sets are the HIV (human immunodeficiency virus) data set and the MID (militarised interstate dispute) data set. Previous uses of the particular data sets are listed below, but a full description of the data used is given in Chapters 4 and 5.

Methods applied to HIV data:

Previously, computational intelligence techniques have been used extensively to analyse HIV. Leke et al have used autoencoder network classifiers, inverse neural networks, as well as conventional feedforward neural networks to analyse HIV [28, 29, 30]; they used the inverse neural network for adaptive control of HIV status to understand how the demographic factors affect the risk of HIV infection [30]. Although an accuracy of 84% is achieved when using the autoencoder method [28], it is disadvantageous due to its "black box" nature; this disadvantage also applies to the other mentioned neural network techniques. The reason NNs are considered to be black boxes is because their connection weights and transfer functions are frozen upon completion of training of the neural network [31]. Chapter 4 will discuss previous work performed on HIV in full.

Methods applied to MID data:

One of the methods that has been applied successfully to MIDs is neural networks (NNs) [32, 33, 7], in particular the multi-layer perceptron (MLP) neural network. Neural networks are a mathematical technique especially suited to the interactive, non-linear and contingent relations across the variables that may trigger militarised interstate disputes [33, 34]. Neural networks also provide a clear answer regarding the accuracy of predictions, with numerical accuracies for the number of dyads involved in conflict as well as for those dyads whose relationship is peace. If prediction accuracy is all that is required, it is argued that, on the basis of these prediction accuracies, the interpretation of the NN results is unambiguous [32]. This research is, however, interested in the causal interpretations as well as the prediction accuracy of MIDs and thus uses RST. There are disadvantages associated with the use of NNs to accurately forecast conflict. The first is that NNs are not efficient classifiers of rare events: the conflict data is skewed, which has the effect of biasing the result towards the modal value (i.e. the result with the most common events), and this is problematic as more than 96% of the results have the outcome of zero (peace). The second disadvantage is that it is difficult to comprehend the causal model that the trained networks have internally constructed [32]. To overcome the first problem of rare event prediction, the cross validation technique is used as it allows for the maximisation of the available data [35]; this is explained in Chapter 5.

Support vector machines (SVMs) are another method that has been used to forecast and analyse conflict [8]. Habtemariam and Marwala compared the use of MLP and SVM to analyse conflict situations within a given set of data. It was found that SVM classifies both true positive and true negative results on the data with a higher degree of accuracy than the MLP method. SVM predicts correct results with an accuracy of 79% while an accuracy of 74% is achieved for the MLP [8].
Receiver operating characteristic (ROC) curves were used to compare the performance of the NN and the SVM classifiers. The area under curve (AUC) is used to measure the relative performances. The NN and SVM had AUCs of 0.81 and 0.84, with errors of 0.00998 and 0.01022 respectively [8]. Marwala and Lagazio have conducted a study into modelling interstate conflict using Bayesian trained neural networks, together with a control theory approach that makes use of golden section search (GSS) optimisation to identify input variables that would give a peaceful outcome [36].

1.4 Outline of Dissertation

As mentioned, RST has been used previously to model interstate conflict as well as HIV. Amongst all the applications to which RST has been applied, it has not yet been investigated what the optimal partition sizes are for a particular data set when classifying using rough sets. A significant contribution is given in Chapters 4 and 5, where the results obtained using optimisation methods are compared to those of the non-optimised EWB and EFB techniques. The outline of this thesis is given below:

Chapter 2 describes the core concepts of rough set theory. It also describes how the rough sets are formulated and how the generated rules are extracted. The performance criterion, or accuracy, of the rough set is determined as the area under the curve (AUC) of the ROC curve; these concepts are also explained.

Chapter 3 gives a detailed description of the four optimisation techniques used and how they are applied to RST as performed in this work. The non-optimised methods of EWB and EFB partitioning are also explained.

Chapter 4 displays the results obtained using a neural network on the human immunodeficiency virus data set, after which the results obtained for the application of rough sets to HIV are given. The HIV data set is given with a description of each variable. Also included in Chapter 4 is the comparison of the results obtained using the optimised rough set partitions against those of EWB and EFB partitioning.

Chapter 5 describes the militarised interstate dispute (MID) data set. The variables used are given and explained, and the data set used (Correlates of War) is discussed. Similarly to Chapter 4, the classification accuracies obtained for the rough set using both the optimised and non-optimised methods are compared.

Chapter 6 summarises the findings of this work and presents recommendations for future work on this subject.

Appendix A gives the algorithms for various procedures as discussed in the thesis.

Appendix B illustrates the flow diagrams of the four optimisation techniques used.

Appendix C lists the papers published based on the work performed in this thesis.

Chapter 2 The Theory of Rough Sets

The main goal of rough sets is to synthesise approximations of concepts from the acquired data. Unlike other methods used to handle uncertainty, rough set theory has its own unique advantages: it does not require any preliminary or additional information about the empirical training data, such as probability distributions in statistics, basic probability assignments in Dempster-Shafer theory of evidence, or the values of possibility in fuzzy set theory [37]. Rough set theory deals with the approximation of sets that are difficult to describe with the available information [13]. It deals predominantly with the classification of imprecise, uncertain or incomplete information.
Two approximations, namely the upper and lower approximation, are formed to deal with inconsistent information. These approximations, along with other concepts that are fundamental to RST, are given below.

2.1 Information Table

The data used is represented using an information table; an example of an information table can be shown using data from the HIV data set, up to the ith object. This is given in Table 2.1 below.

Table 2.1: Information table of the HIV data.

            Race  Mothers Age  Education  Gravidity  Parity  Fathers Age  HIV Status
  Obj(1)     2        32          13          1        1         22           1
  Obj(2)     3        22           5          2        1         25           1
  Obj(3)     1        35           6          1        0         33           0
  Obj(4)     1        35           6          1        0         33           0
  Obj(5)     3        22           5          2        1         25           0
  Obj(6)     2        30          10          1        1         30           0
  Obj(7)     3        22           5          2        1         25           1
  ...
  Obj(i)     2        27           9          3        2         30           0

In the information table, each row represents a new case (or object). Besides HIV Status, each of the columns represents the respective case's variables (or condition attributes). The HIV Status is the outcome (also called the concept or decision attribute) of each object. The outcome contains either a 1 or a 0, and this indicates whether the particular case is infected with HIV or not. For the application of this research, the outcome is known. This a posteriori knowledge is expressed as the decision attribute and this process is known as supervised learning. It can be seen from Table 2.1 that cases/objects 3 and 4 have the same condition attributes, and cases 5 and 7 have the same condition attributes. These cases are referred to as indiscernible cases; this concept is explained later. Taking into consideration that the process represented is supervised learning, of the indiscernible cases, cases 3 and 4 have the same decision attribute while cases 5 and 7 have conflicting/different decisions.

2.2 Information System

Once the information table is obtained, the data is discretised into four partitions as mentioned earlier. An information system can be understood as a pair $\Lambda = (U, A)$, where $U$ and $A$ are finite, non-empty sets called the universe and the set of attributes, respectively [14]. With every attribute $a \in A$ we associate a set $V_a$ of its values, where $V_a$ is called the value set of $a$ [1]:

$a : U \to V_a$  (2.1)

Any subset $B$ of $A$ determines a binary relation $I(B)$ on $U$, which is called an indiscernibility relation. In the case shown in Table 2.1, the process is one of supervised learning, i.e. the decision is known. Information systems of this kind are known as decision systems. Formally, a decision system is any information system of the form $\Lambda = (U, A \cup \{d\})$, where $d \notin A$ is the decision attribute [9]. The elements of $A$ are the conditional attributes. Let $\Lambda = (U, A \cup \{d\})$ be given and assume the set $V_d = \{v_d^1, \ldots, v_d^{r(d)}\}$ to be the extension of the decision attribute. Denote the image $d(U) = \{k \mid d(x) = k,\ x \in U\}$, where $d(U) \subseteq V_d$. The cardinality of the set $d(U)$ is called the rank of $d$, and is denoted by $r(d)$ [38]. The decision $d$ determines the partition $CLASS_\Lambda(d) = \{X_\Lambda^1, \ldots, X_\Lambda^{r(d)}\}$ of the universe $U$, where $X_\Lambda^k = \{x \in U \mid d(x) = v_d^k\}$ for $1 \le k \le r(d)$. $CLASS_\Lambda(d)$ is called the classification of objects in $\Lambda$ determined by the decision $d$. The set $X_\Lambda^i$ is called the $i$th decision class of $\Lambda$. The concept of an indiscernibility relation will be explained next.

2.3 Indiscernibility Relation

An information table expresses all the knowledge about a model.
This table is generally unnecessarily large; this is in part because it is redundant in at least two ways: the same or indiscernible objects may be represented several times, and some of the attributes may be superfluous [2]. The concept of indiscernibility is one of the main concepts of rough set theory (indiscernibility meaning indistinguishable from one another). Sets that are indiscernible are called elementary sets, and these are considered the building blocks of RST's knowledge of reality. A union of elementary sets is called a crisp set, while any other sets are referred to as rough or vague. More formally, for a given information system $\Lambda$ and any subset $B \subseteq A$, there is an associated equivalence relation $I(B)$, called the B-indiscernibility relation, represented as shown in Equation 2.2 below:

$(x, y) \in I(B)$ iff $a(x) = a(y)$ for every $a \in B$  (2.2)

RST offers a tool to deal with indiscernibility. The way in which it works is that, for each concept/decision $X$, the greatest definable set contained in $X$ and the least definable set containing $X$ are computed. These two sets are called the lower and upper approximation respectively.

2.4 Lower and Upper Approximations

The sets of cases/objects with the same outcome variable are assembled together. This is done by looking at the "purity" of the particular object's attributes in relation to its outcome. For example, with cases 5 and 7 in Table 2.1, it is not possible to induce a crisp (precise) description of such cases; it is from this that the notion of rough sets emerges. In most cases it is not possible to define cases into crisp sets; in such instances, lower and upper approximation sets are defined.

The lower approximation is defined as the collection of cases whose equivalence classes are fully contained in the set of cases we want to approximate [13]. Let $X \subseteq U$ be a target set that we wish to represent using attribute subset $B$. The lower approximation of set $X$ is denoted $\underline{B}X$ and mathematically it is represented as:

$\underline{B}X = \{x \in U : B(x) \subseteq X\}$  (2.3)

where $B(x)$ denotes the equivalence class of $x$ with respect to the B-indiscernibility relation. The upper approximation is defined as the collection of cases whose equivalence classes are at least partially contained in the set of cases we want to approximate [13]. The upper approximation of set $X$ is denoted $\overline{B}X$ and is mathematically represented as:

$\overline{B}X = \{x \in U : B(x) \cap X \neq \emptyset\}$  (2.4)

It is through these lower and upper approximations that any rough set is defined. Lower and upper approximations are defined differently in the literature, but it follows that a crisp set is only defined for $\underline{B}X = \overline{B}X$. Objects that cannot decisively be classified into $X$ on the basis of knowledge in $B$ are grouped into the B-boundary region of $X$; this is represented as:

$BN_B(X) = \overline{B}X - \underline{B}X$  (2.5)

The B-outside region of $X$ is the set $U - \overline{B}X$, which consists of objects that can be stated with certainty as not belonging to $X$ (on the basis of knowledge of $B$) [2]. Using the example given in Table 2.1, we can synthesise definitions of the outcome in terms of the conditional attributes. If we consider the outcome of HIV positive, i.e. let $H = \{x \mid HIV(x) = 1\}$, then we obtain the following approximation regions:

• $\underline{B}H$ = {Obj(1), Obj(2)};
• $\overline{B}H$ = {Obj(1), Obj(2), Obj(5), Obj(7)};
• $BN_B(H)$ = {Obj(5), Obj(7)};
• $U - \overline{B}H$ = {Obj(3), Obj(4), Obj(6)}.

Thus it follows that the outcome HIV positive is rough, since the boundary region is not empty. Figure 2.1 illustrates this.

Figure 2.1: Approximating the set of HIV positive patients of Table 2.1 using all the conditional attributes.
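As a concrete illustration, the minimal Python sketch below (not part of the thesis software) reproduces the approximation regions above directly from the indiscernibility classes identified in Section 2.1: objects 3 and 4 are indiscernible, as are objects 5 and 7, and the remaining objects are distinguishable from all others.

Python sketch (illustrative only):

    # B-indiscernibility classes of the seven objects of Table 2.1.
    classes = [{1}, {2}, {3, 4}, {5, 7}, {6}]
    H = {1, 2, 7}  # HIV-positive objects (decision attribute equal to 1)

    lower = set().union(*[c for c in classes if c <= H])    # B-lower approximation
    upper = set().union(*[c for c in classes if c & H])     # B-upper approximation
    boundary = upper - lower                                # B-boundary region BN_B(H)
    outside = {1, 2, 3, 4, 5, 6, 7} - upper                 # B-outside region

    print(sorted(lower), sorted(upper), sorted(boundary), sorted(outside))
    # -> [1, 2] [1, 2, 5, 7] [5, 7] [3, 4, 6], matching the regions listed above.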
The above stated approximations have the following properties [2]:

1. $\underline{B}(X) \subseteq X \subseteq \overline{B}(X)$
2. $\underline{B}(\emptyset) = \overline{B}(\emptyset) = \emptyset$; $\underline{B}(U) = \overline{B}(U) = U$
3. $\overline{B}(X \cup Y) = \overline{B}(X) \cup \overline{B}(Y)$
4. $\underline{B}(X \cap Y) = \underline{B}(X) \cap \underline{B}(Y)$
5. $X \subseteq Y$ implies $\underline{B}(X) \subseteq \underline{B}(Y)$ and $\overline{B}(X) \subseteq \overline{B}(Y)$
6. $\underline{B}(X \cup Y) \supseteq \underline{B}(X) \cup \underline{B}(Y)$
7. $\overline{B}(X \cap Y) \subseteq \overline{B}(X) \cap \overline{B}(Y)$
8. $\underline{B}(U - X) = U - \overline{B}(X)$
9. $\overline{B}(U - X) = U - \underline{B}(X)$
10. $\underline{B}(\underline{B}(X)) = \overline{B}(\underline{B}(X)) = \underline{B}(X)$
11. $\overline{B}(\overline{B}(X)) = \underline{B}(\overline{B}(X)) = \overline{B}(X)$

It can be seen that the interior and the closure of the set in the topology generated by the data are the lower and upper approximations respectively. Four basic classes of rough sets can be defined, i.e. four categories of vagueness:

1. $X$ is roughly B-definable, iff $\underline{B}(X) \neq \emptyset$ and $\overline{B}(X) \neq U$
2. $X$ is internally B-undefinable, iff $\underline{B}(X) = \emptyset$ and $\overline{B}(X) \neq U$
3. $X$ is externally B-undefinable, iff $\underline{B}(X) \neq \emptyset$ and $\overline{B}(X) = U$
4. $X$ is totally B-undefinable, iff $\underline{B}(X) = \emptyset$ and $\overline{B}(X) = U$

If $X$ is roughly B-definable, the information provided by the attributes of $B$ is sufficient to determine that some elements of $U$ belong to $X$, as well as that some elements of $U$ belong to $U - X$. If $X$ is internally B-undefinable, the information provided by the attributes of $B$ is sufficient to determine that some elements of $U$ belong to $U - X$, but we are unable to determine whether any element of $U$ belongs to $X$. If $X$ is externally B-undefinable, the information provided by the attributes of $B$ is sufficient to determine that some elements of $U$ belong to $X$, but we are unable to determine whether any element of $U$ belongs to $U - X$. If $X$ is totally B-undefinable, the information provided by the attributes of $B$ is insufficient to determine whether any elements of $U$ belong to $X$ or $U - X$.

It must be noted that in most cases in RST, reducts are generated to enable us to discard functionally redundant information [1]. Rules can be extracted from these reducts; alternatively, rules can be extracted from the generated approximations. This work uses the latter.

2.5 Rough Membership Function

A rough membership function is a function $\mu_X^B : U \to [0, 1]$ that, when applied to object $x$, quantifies the degree of relative overlap between the set $X$ and the indiscernibility set $[x]_B$ to which $x$ belongs. This membership function is a measure of the plausibility with which an object $x$ belongs to set $X$. The membership function is defined as:

$\mu_X^B(x) = \frac{|[x]_B \cap X|}{|[x]_B|}$  (2.6)

The rough membership function can be treated as a fuzzification of the rough approximation. It makes the translation from rough approximation into a membership function. The fact that rough membership functions are derived from data makes them stand out [39].

2.6 Rough Set Accuracy

Once the rules are extracted, they can be tested using a set of testing data. The classification output is expressed as a decision value which lies between 0 and 1. The accuracy of the rough set is determined using a technique called a receiver operating characteristic (ROC) curve. A ROC curve is a technique used for visualising, organising and selecting classifiers based on their performance. ROC curves depict the trade-off between hit rates and false alarm rates of classifiers, i.e. the trade-off between correctly predicted positives and falsely predicted positives.
ROC curves have become increasingly important in the areas of cost-sensitive learning and learning in the presence of unbalanced data sets [40]. The rough set (classifier) will classify results based on a set of testing data that is given to it. The actual result or outcome of the testing data is known as an instance. Having obtained a classifier and an instance, there are four possible outcomes:

• If the instance is positive and it is classified as positive, then it is regarded as a true positive
• If the instance is positive and it is classified as negative, then it is regarded as a false negative
• If the instance is negative and it is classified as negative, then it is regarded as a true negative
• If the instance is negative and it is classified as positive, then it is regarded as a false positive

Given a classifier and a set of instances (test set), a two-by-two confusion matrix can be constructed. The confusion matrix will be in the form of that shown in Table 2.2. Correctly predicted positives are shown in the confusion matrix as actual true cases that are predicted as true cases. Falsely predicted positives are shown as actual false cases that are predicted as true cases.

Table 2.2: Confusion matrix representing the dispositions of the set of instances

                           Actual True Cases   Actual False Cases
  Predicted True Cases            TP                  FP             Pos
  Predicted False Cases           FN                  TN             Neg
                                   P                   N

where P and N denote the totals of actual positive and negative instances, and Pos and Neg denote the totals of predicted positive and negative instances.

It is from the confusion matrix that several of the common metrics are derived; this is illustrated below. The true positive rate (tp rate), otherwise known as the sensitivity of a classifier, is estimated as:

$\text{tp rate} \approx \frac{\text{Positives correctly classified}}{\text{Total positives}}$  (2.7)

The false positive rate (fp rate), which is related to the specificity of a classifier (fp rate = 1 − specificity), is:

$\text{fp rate} \approx \frac{\text{Negatives incorrectly classified}}{\text{Total negatives}}$  (2.8)

Formally these are represented as:

$\text{tp rate} = \frac{TP}{P}$  (2.9)

$\text{fp rate} = \frac{FP}{N}$  (2.10)

As previously mentioned, ROC accuracy is a measure of the trade-off between relative accuracy in correctly predicted positives against incorrectly predicted positives. Mathematically this is represented as:

$\text{accuracy} = \frac{TP + TN}{P + N}$  (2.11)

The ROC curve is a two-dimensional depiction of classifier performance. To compare two or more classifiers, we can reduce the ROC performance to a single scalar value representing expected performance [40]. A common method, and the method used in this research, is the area under the curve (AUC). The AUC is deemed to be a better measure of classifier performance than accuracy [41, 42, 43]. The AUC is a portion of the area of the unit square on which the ROC curve is plotted and therefore will have a value between 0 and 1.0. For a realistic classifier, the AUC will range from 0.5 (indicating that the test ranks no better than chance) to 1.0 (indicating perfect performance) [44]. The AUC of the ROC curve is computed using the trapezoidal method of integration; this can be shown to equal the Wilcoxon-Mann-Whitney statistic, or the probability that the classifier will assign a higher value to a positive case than to a negative case, assuming that the pair is randomly drawn from the population that the ROC curve is derived from [13].
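As an illustration of how the AUC is obtained from a set of decision values, the short Python sketch below (not taken from the thesis software; the scores and labels are made up, and ties between scores are handled naively) sweeps a threshold over the classifier outputs, builds the ROC points and integrates them with the trapezoidal rule.

Python sketch (illustrative only):

    def roc_auc(decision_values, labels):
        """Compute ROC points and the AUC by trapezoidal integration.

        decision_values are classifier outputs in [0, 1]; labels are the known
        outcomes (1 = positive, 0 = negative) of the test instances, and both
        classes are assumed to be present.
        """
        pairs = sorted(zip(decision_values, labels), reverse=True)
        P = sum(labels)
        N = len(labels) - P
        tp = fp = 0
        points = [(0.0, 0.0)]                 # (fp rate, tp rate) as the threshold drops
        for _, label in pairs:
            if label == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / N, tp / P))
        # Trapezoidal integration of tp rate over fp rate.
        auc = sum((x2 - x1) * (y1 + y2) / 2.0
                  for (x1, y1), (x2, y2) in zip(points, points[1:]))
        return points, auc

    points, auc = roc_auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0])
    print(round(auc, 2))  # -> 0.83 for this made-up example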
2.7 Rough Sets Formulation

The process of modelling the rough set can be broken down into five stages.

The first stage is to select the data. The two data sets to be used to test the method are obtained from the South African antenatal survey of 2001 [45] and the Correlates of War (COW) project [46].

The second stage involves pre-processing the data to ensure it is ready for analysis; this stage involves discretising the data and removing unnecessary data (cleaning the data). This pre-processing step of removing outliers and cleaning the data is explained for the HIV and MID data in Sections 4.2.2 and 5.2.2 respectively. Although the optimal selection of set sizes for the discretisation of attributes will not be known at first, an optimisation technique will be run on the set to ensure that the highest degree of accuracy is obtained when forecasting outcomes. This will be explained more clearly in Chapter 3.

If reducts were considered, the third stage would be to use the cleaned data to generate reducts. A reduct is the most concise way in which we can discern object classes [47]. In other words, a reduct is the minimal subset of attributes that enables the same classification of elements of the universe as the whole set of attributes [1]. To cope with inconsistencies, lower and upper approximations of decision classes are defined in this stage [1, 12, 14, 47, 48].

Stage four is where the rules are extracted or generated. The rules are normally determined based on condition attribute values [49]. Once the rules are extracted, they can be presented in an if CONDITION(S)-then DECISION format [50].

The fifth and final stage involves testing the newly created rules on a test set. The accuracy will be noted and sent back into the optimisation method used in step two, and the process will continue until the optimum or highest accuracy is achieved. The algorithm for computing rough sets and extracting rules is given in Appendix A, Algorithm 1.

Chapter 3 Using Optimisation Techniques for Discretisation

3.1 Introduction

Chapter 1 explains that the data acquired from measurement systems or human experts must in principle be discrete [2]. The methods which allow continuous data to be processed involve discretisation. There are several methods available to perform discretisation, but the two popularly used methods of equal-width-bin (EWB) partitioning and equal-frequency-bin (EFB) partitioning are investigated [3, 4]. These methods are compared against the newly proposed method of using various optimisation techniques to discretise the continuous data. The way in which these optimisation techniques work in conjunction with the rough set is explained in Appendix A, Algorithm 1, and it is also illustrated in Figure 3.1.

The optimisation technique is run to create a set of four partitions from the given input data. Using these partitions, the rough set model is generated and the classification accuracy is determined using the AUC of the model produced against the unseen testing data. This result (AUC) is sent back to the optimiser and the partition sizes are changed accordingly to ensure the rough set produces a better model, i.e. a model with a higher classification accuracy.

Figure 3.1: Block diagram of the sequence of events in creating a rough set model

3.2 Discretisation Criteria

Discretisation can be performed using different criteria. These criteria will be explained and the criteria used in this research will be stated.

• Local versus Global methods: Global methods work over an N-dimensional real space, where each attribute value set is partitioned into intervals independent of the other attributes.
Local methods, on the other hand, produce partitions that are applied to localised regions of the object space [51]. In other words, methods that process the entire attribute value set are called global; global discretisation is performed over the entire data set prior to proceeding with induction, and an example of this kind of method would be binning [5, 52, 53]. Methods that work on one attribute at a time are called local; local discretisation is performed locally on partial regions during induction [5].

• Static versus Dynamic methods: Many discretisation methods require the maximum number of intervals, k, to be produced when discretising an attribute. Static methods, such as binning, perform one discretisation pass of the data for each attribute and determine the value of k independently of the other attributes. In static discretisation the continuous features are discretised prior to the classification task. Dynamic methods conduct a search through the space of possible k values for all attributes simultaneously, thereby determining interdependencies in attribute discretisation [51, 53]. Dougherty et al state that there is no clear advantage of either method [53].

• Supervised versus Unsupervised methods: Methods that utilise the decision attribute of objects when performing discretisation are referred to as supervised discretisation methods. In contrast, methods that do not make use of decision attributes are called unsupervised discretisation methods [51].

The methods implemented in this research are global, dynamic and supervised. Binning, as will be discussed below, is a global, static and unsupervised method. The two non-optimised, well-known techniques of equal-width-bin and equal-frequency-bin partitioning will be discussed, after which an explanation of the four optimisation techniques used will be given.

3.3 Equal-Width-Bin and Equal-Frequency-Bin Partitioning

While an investigation into the number of discretised partitions is a promising avenue of research, several studies have been conducted, and for the purpose of this work, the input data into the rough set is discretised into four partitions. Four partitions are chosen as they offer a good balance between the classification accuracy produced and the computational time taken to produce the optimal partition sizes. As mentioned in Chapter 1, there are several approaches available to discretise the continuous input data. Two approaches that are considered merely for the purpose of comparison against the newly proposed method in this research are equal-width-bin (EWB) and equal-frequency-bin (EFB) partitioning.

3.3.1 Equal-Width-Bin Partitioning

EWB partitioning divides the range of observed values of an attribute into k equal-sized bins. In this research, k is taken as four. One notable problem of such a method is that it is vulnerable to outliers that may drastically skew the data range. This problem was eliminated during the pre-processing step of cleaning the data as explained in Chapter 2. The way in which the data was discretised using EWB partitioning is as follows:

• Determine the largest and smallest value for each attribute; these can be denoted L and S.
• The width of each interval, W, can be calculated as follows:

$W = \frac{L - S}{4}$  (3.1)

• The interval boundaries can then be determined as S + W, S + 2W and S + 3W. These boundaries can be determined for any number of intervals k, up to the S + (k − 1)W term.
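A minimal Python sketch of the EWB procedure above is given below (illustrative only, not the thesis implementation, and using made-up sample values); it returns the three interior boundaries that granulise an attribute into the four partitions used in this work.

Python sketch (illustrative only):

    def equal_width_boundaries(values, k=4):
        """Return the k-1 interior boundaries of equal-width-bin partitioning."""
        S, L = min(values), max(values)     # smallest and largest observed value
        W = (L - S) / k                     # width of each interval (Equation 3.1)
        return [S + i * W for i in range(1, k)]

    # Hypothetical attribute sample.
    print(equal_width_boundaries([14, 18, 22, 26, 30, 42]))  # -> [21.0, 28.0, 35.0]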
3.3.2 Equal-Frequency-Bin Partitioning

EFB partitioning sorts the values of each attribute in ascending order and divides them into k bins where (given m instances) each bin contains m/k adjacent values. In most instances there will most probably be duplicated values. This is performed as follows:

• Sort the values of each attribute a (i.e. $v_1^a, v_2^a, \ldots, v_m^a$) and divide them into four intervals (m being the number of instances).
• Each interval now contains

$\rho = \frac{m}{4}$ sequential values.  (3.2)

• The cut points are computed by $c_i = \frac{v_{i\rho} + v_{i\rho + 1}}{2}$ for i = 1, 2, 3. The cut points for k intervals can be computed for i = 1, ..., k − 1.

The results obtained using the above two methods, as well as those obtained using the proposed optimisation methods, will be given in Chapters 4 and 5.

3.4 Genetic Algorithm

Genetic algorithms (GAs) are population-based search methods. GAs are popular and widely used due to their ease of implementation, their intuitiveness and their ability to solve highly nonlinear optimisation problems. A GA is a stochastic search procedure for combinatorial optimisation problems based on the mechanism of natural selection [54]. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection and crossover. The fitness/evaluation function is the only part of the GA that has any knowledge about the problem. The fitness function tries to maximise the AUC. The GA represents the design variables of each individual design with binary strings of 0s and 1s; these are referred to as chromosomes. The variables are limited to the predefined upper and lower bound values. These limits are coded into the GA encoding feature. To perform the task of optimisation, the GA employs three main operators to propagate its population from one generation to the next.

The first operator is selection. During each successive generation, a proportion of the population is selected to breed a new generation. This is done on the basis of "survival of the fittest". The fitness of each solution is evaluated using a fitness function, and it is this function that maximises the AUC. After each generation, the AUC is evaluated and sent to the rough set for evaluation. This process continues until the termination criterion is reached. Several selection functions are available, and in this work, roulette wheel selection is used. Roulette wheel selection is a technique whereby members of the population of chromosomes are chosen in a way that is proportional to their fitness. The better the fitness of the chromosome, the greater the chance it will be selected; however, it is not guaranteed that the fittest member goes to the next generation. Other selection functions that are available include tournament selection, normal geometric selection and ranking selection.
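The roulette wheel selection step can be sketched as follows (a minimal Python illustration, not the thesis code): each chromosome is selected with a probability proportional to its fitness, here the AUC it achieves.

Python sketch (illustrative only):

    import random

    def roulette_wheel_select(population, fitnesses):
        """Pick one chromosome with probability proportional to its fitness (AUC)."""
        total = sum(fitnesses)
        spin = random.uniform(0, total)          # where the "wheel" stops
        running = 0.0
        for chromosome, fitness in zip(population, fitnesses):
            running += fitness
            if running >= spin:
                return chromosome
        return population[-1]                    # guard against rounding at the end

    # Hypothetical example: three candidate cut-point sets and their AUC values.
    parents = [roulette_wheel_select(["A", "B", "C"], [0.62, 0.71, 0.68]) for _ in range(2)]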
The third operator used is mutation, this is whereby an arbitrary bit in a generic sequence or string will be mutated or changed. This is analogous to biological mutation. The reason for implementing this operator is to promote diversity from one generation of chromosomes to another, this prevents the GA from getting stuck at a local optimum but rather a global optimum. Boundary mutation is chosen, this is whereby a variable is randomly selected and is set to either the upper or lower bound depending on a randomly generated uniform number. It must be noted that the chosen operators are case specific, dependent on the nature of the optimisation, different operators would have to be explored. Other mutation functions are uniform, non-uniform and multi-non-uniform. If the individuals (parents) were represented in binary format, binary mutation could have been used. There was no significant difference between all the above mentioned functions for each operator, but alternatives were mentioned for the purpose of completeness. The pseudo-code algorithm for genetic algorithm is given on the next page 26 3.4. GENETIC ALGORITHM Genetic Algorithm Pseudo Code: ? Initialise a population P of chromosomes at random. ? Evaluate the Fitness of each chromosome (individual) h in the population P . ? While Number Of Generations < Termination Number of Generations (100) do Create a new population Ps 1. Selection: The probability of selecting a chromosome hi from the popu- lation P is represented by Pr(hi). This probability is defined as: Pr(hi) = Fitnessi ?PopSize j=1 Fitnessj (3.3) Add this chromosome from population P to population Ps. 2. Crossover : Probabilistically obtain two parents ?h1, h2? from the popu- lation P according to Pr(hi). Create new offspring by copying the cor- responding gene from either parent according to a randomly generated crossover mask. Add all new offspring to the population Ps. 3. Mutation: Let X and Y be two m-dimensional row vectors denoting individuals. For real X and Y , the lower and upper bounds for each variable i in population Ps are defined as ai and bi respectively. Boundary mutation randomly selects one variable j, and sets it to either its lower or upper bound, where r = U(0, 1): x ? i = ? ????? ????? ai, if i = j,r < 0.5 bi, if i = j,r ? 0.5 xi, otherwise 4. Insert new members into the population P , i.e. update P ? Ps 5. Evaluate the new members h in P , i.e. compute fitness ? Return the best chromosome as the solution 27 3.5. PARTICLE SWARM OPTIMISATION An initial population of 20 individuals is chosen. As mentioned in order to prevent premature convergence to a local minima, the mutation diversification mechanism is implemented. Other diversification mechanisms such as elitism can also be im- plemented in an attempt to improve the accuracy. One of the advantages of GAs is that they refine a single solution as they search for optima in a multi-dimensional landscape, genetic algorithms operate on entire populations of candidate solutions in parallel. Their parallel nature is their main strength, it is through this that GAs are more likely to get stuck at a global optimum rather than a local optimum. It is also due to its parallel nature that GAs are not that sensitive to initial conditions, which implies that the GAs time to convergence is rather predictable. The drawback of using GAs is that they are much slower than most traditional methods i.e. 
A good initial guess will allow traditional optimisation techniques to converge quickly towards the solution, whereas a GA will spend time testing the fitness of many suboptimal solutions. Although a GA will always approximate a solution, it must be noted that, due to its stochastic nature, this approximation is only an estimate, whereas traditional methods, if they can find an optimum, will find it exactly [22]. The addition of an elitist strategy could further improve results. The flow chart of the genetic algorithm optimisation is shown in Appendix B, Figure B.1.

3.5 Particle Swarm Optimisation

Particle swarm optimisation (PSO) is not only a tool for optimisation, but also a tool for representing the sociocognition of human and artificial agents, based on principles of social psychology. Eberhart and Kennedy developed PSO based on the analogy of flocks of birds and schools of fish [56]. Social optimisation can be thought of as the process that occurs when a person who is trying to solve a problem interacts with another person. The information processed in an individual's thinking is referred to as cognition. If one person's cognition does not follow from the other's, or if one follows from the converse of the other, the cognitions are described as dissonant. Dissonant cognitions produce an aversive consequence that the individuals will try to reduce; this can be done by reducing both cognitions [57, 58]. The particle swarm simulates this kind of social optimisation.

PSO works as follows. PSO initialises by creating a random population of solutions (particles); these particles are placed in the parameter space of some problem or function. Each particle evaluates the fitness at its current location, and the fittest or best solution it has found is noted (often known as pbest). Each particle has its own set of coordinates, and these coordinates are associated with the fittest solution it has found so far. The movement of each particle is determined by combining some aspect of the history of its own fitness values with those of other particles in its neighbourhood. The best solution found in the neighbourhood of a particular particle (often known as lbest) is also noted. This information is an analogy of knowledge of how the other particles around a particular particle have performed. Each particle tries to modify its position using the following information [59, 23]:

- its current position p(x, y)
- its current velocity v(x, y)
- the distance between its current position p(x, y) and pbest
- the distance between its current position p(x, y) and lbest

This modification can be represented by the concept of velocity. The velocity of a particle can be modified using the following equation [59]:

$$v_i^{k+1} = w v_i^k + c_1 \, rand_1 \times (pbest_i - s_i^k) + c_2 \, rand_2 \times (lbest - s_i^k) \qquad (3.4)$$

where:
- $v_i^k$ = velocity of particle i at iteration k
- $w$ = weighting function (inertia weight)
- $c_j$ = weighting factor
- $rand$ = random number between 0 and 1
- $s_i^k$ = current position of particle i at iteration k
- $pbest_i$ = pbest of particle i
- $lbest$ = lbest of the group

The following weighting function is used in Equation 3.4:

$$w = w_{max} - \frac{w_{max} - w_{min}}{iter_{max}} \times iter \qquad (3.5)$$

where:
- $w_{max}$ = initial weight
- $w_{min}$ = final weight
- $iter_{max}$ = maximum iteration number
- $iter$ = current iteration number

Using the above equations, a velocity that gradually moves the current searching point closer to pbest and lbest can be calculated.
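As an illustration of how Equations 3.4 and 3.5 work together, the sketch below performs a single velocity update for one particle. It is a minimal, illustrative example: particle positions are plain Python lists and the parameter values shown are arbitrary assumptions, not the settings used in this research.

```python
import random

def inertia_weight(w_max, w_min, iter_max, iteration):
    """Linearly decreasing inertia weight (Equation 3.5)."""
    return w_max - (w_max - w_min) / iter_max * iteration

def update_velocity(velocity, position, pbest, lbest, w, c1=2.0, c2=2.0):
    """One application of Equation 3.4 for a single particle.
    All vectors are lists of the same length."""
    r1, r2 = random.random(), random.random()
    return [w * v
            + c1 * r1 * (pb - x)
            + c2 * r2 * (lb - x)
            for v, x, pb, lb in zip(velocity, position, pbest, lbest)]

# Example: one particle in a two-dimensional search space
w = inertia_weight(w_max=0.9, w_min=0.4, iter_max=30, iteration=10)
v_new = update_velocity(velocity=[0.1, -0.2], position=[1.0, 2.0],
                        pbest=[1.5, 1.8], lbest=[0.8, 2.2], w=w)
```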
The current position of the searching point can be calculated as follows:

$$s_i^{k+1} = s_i^k + v_i^{k+1} \qquad (3.6)$$

The next iteration takes place after all particles have been moved. PSO was implemented with a maximum number of generations of 30, and the initial number of particles was also set to 30. PSO has several similarities with other optimisation techniques, including a randomly generated initial population and a search for an optimum through a series of generations; but unlike the GA, for example, PSO does not have evolutionary operators such as mutation and crossover. Instead it has particles that travel (or swarm) through the solution space looking for an optimum. The flow chart of PSO is given in Appendix B, Figure B.2.

3.6 Hill Climbing

Hill climbing is an optimisation technique that belongs to the group of local search algorithms, meaning the algorithm moves from solution to solution in the search space until an optimal solution is found. The algorithm tries to maximise the fitness function (accuracy) by iteratively comparing two solutions; it adopts the better solution and continues the comparison (i.e. moves further up the hill). This iteration terminates when there are no better solutions on either side of the current solution (i.e. it has reached a peak). There are several variants or methods of hill climbing. The first and most basic form is simple hill climbing; in this method the first closest node is chosen for evaluation. A second variant is called steepest ascent hill climbing; in this method all successors are compared and the one closest to the solution is chosen. Other variants that can be investigated include next-ascent hill climbing and zero-temperature Monte Carlo hill climbing [60]. In this work, steepest ascent hill climbing is implemented. A major disadvantage of both simple and steepest ascent hill climbing is that they may only find a local optimum. This can be visualised as shown in Figure 3.2 [61].

Figure 3.2: Figure illustrating a local maximum versus a global maximum

The algorithm may move to a point at the top of a local maximum; searching around it, the algorithm cannot see a better solution, yet, looking at the entire search space, there is a better solution at the global maximum. This local maximum will be returned as the best solution, and if this occurs a sub-optimal solution has been found. There are other local search algorithms that overcome this, and these include stochastic hill climbing, random walks and iterated hill climbing. The advantages of hill climbing are that it requires no memory, as there is no backtracking, and that it is trivial to implement. The pseudo code algorithm for hill climbing is given below.

Hill Climbing Pseudo Code:

- Set initial conditions
  - currentNode = startNode = initial configuration = S
  - Compute S_Fitness, i.e. the initial configuration fitness
- Repeat:
  - Compute a neighbouring configuration S' by a local transformation from the current point in the search space to the next point
  - Compute the fitness of S' (S'_Fitness)
  - If S'_Fitness > S_Fitness then S = S'
- Until a maximum is reached

The flow chart illustrating the implementation of hill climbing is given in Appendix B, Figure B.3. The hill climbing (HC) algorithm is chosen to have 20 initial starting points. As with all the other optimisation techniques, hill climbing uses the AUC as the objective/fitness function, and because it is a "greedy" algorithm, other methods or variations of hill climbing should be investigated.
A "greedy" algorithm is an algorithm that looks at the next solution and, if that solution is better (fitter), takes on the position of the next solution. This is disadvantageous as it does not look at the entire search space: if the next solution (as it is climbing the hill) is worse than the present one, it will return, stating that it has found the best solution. This may not be the case, as explained above with reference to local and global maxima.

A variation of steepest ascent hill climbing is random-restart hill climbing (RRHC). RRHC is an algorithm built on top of the hill climbing algorithm: it runs an outer loop over HC, where each iteration chooses a random initial point (x0) from which to start the HC. The best point found so far is stored (xbest), and if a new run of HC produces a better point than the stored one (if x0 is better than xbest), then the stored point is replaced (xbest := x0). There are several other variations of HC, but these methods are not investigated in this research.

3.7 Simulated Annealing

Simulated annealing (SA) is an algorithm that locates a good approximation to the global optimum of a given function. It originated as a generalisation of the Monte Carlo method and relies on the Metropolis algorithm (see [26] for more information regarding the Metropolis algorithm). As is the case with the GA, PSO and HC, SA continuously updates the solution until a termination criterion is reached. SA is a well established stochastic technique originally developed to model the natural process of crystallisation and later adopted as an optimisation technique [26]. The SA algorithm replaces the current solution with a "nearby" random solution with a probability that depends on the difference between the corresponding function values and on the temperature (T). T decreases throughout the process, so as T approaches zero there are fewer random changes in the solution. As with greedy search methods, SA keeps moving towards the best solution, except that it has the advantage of allowing a reversal in fitness. That is, it can move to a solution with worse fitness than the one it currently has; the advantage of this is that it helps ensure the solution is not trapped at a local maximum, but can instead reach the global maximum. This is the major advantage that SA has over most other methods, but once again its drawback is its computational time: the SA algorithm will find the global optimum if specified, but it can approach infinite time in doing so. The probability of accepting the reversal is given by Boltzmann's equation [62]:

$$P(\Delta E) \propto e^{-\frac{\Delta E}{T}} \qquad (3.7)$$

where $\Delta E$ is the difference in energy (fitness) between the old and new states, and T is the temperature of the system. The rate at which the temperature decreases depends on the cooling schedule chosen. The following cooling model is used [62]:

$$T(k) = \frac{T(k-1)}{1 + \beta} \qquad (3.8)$$

where T(k) is the current temperature, T(k-1) is the previous temperature, and $\beta$ dictates the cooling rate. Other advantages of SA are that it can easily be tuned and that, like the GA, SA can deal with highly nonlinear models with many constraints. It must be noted that the precision of the numbers used in the implementation of SA can have a significant effect on the outcome. As with GAs, SA is also amenable to parallel implementation, and a method to improve the computational time is to implement either very fast simulated re-annealing (VFSR) or adaptive simulated annealing (ASA) [63].
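To make the acceptance rule of Equation 3.7 and the cooling model of Equation 3.8 concrete, the following is a minimal sketch of the accept/reject step and the temperature update. The function names, the cooling rate value and the convention that lower energy is better (e.g. energy taken as the negative of the AUC) are illustrative assumptions, not the exact routine used in this research.

```python
import math
import random

def accept_move(delta_e, temperature):
    """Metropolis-style acceptance (Equation 3.7).
    delta_e = E(new) - E(old), with energy defined so that lower is better.
    Improvements are always accepted; worse moves are accepted with
    probability exp(-delta_e / T)."""
    if delta_e <= 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)

def cool(temperature, beta=0.05):
    """Cooling model of Equation 3.8: T(k) = T(k-1) / (1 + beta)."""
    return temperature / (1.0 + beta)

# Example: a few cooling steps starting from T = 1
T = 1.0
for step in range(5):
    T = cool(T)
```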
The pseudo code algorithm for SA is given below:

Simulated Annealing Pseudo Code:

// s : current state; s' : neighbouring state of s
// sbest : current best state
// E(s) : the function to be optimised (internal energy)
// e = E(s) : current energy
// ebest : current best energy
// e' = E(s') : energy of the neighbour of s
// k : iteration number, initially set to 0
// T : temperature of the system
// P(e, e', T) : acceptance probability as a function of the energies e and e' and the temperature

- While k < kmax and e > emax do
  - s' := neighbour(s)
  - e' := E(s')
  - If e' < ebest then sbest := s'; ebest := e'
  - If random() < P(e, e', T(k/kmax)) then s := s'; e := e'
  - k := k + 1
- Return sbest

This process can be visualised using the flow chart shown in Appendix B, Figure B.4.

Using the optimisation techniques given in this chapter, and the description of generating rough sets given in Chapter 2, the newly proposed method is formed. The method shown in Figure 3.1 and described in Appendix A, Algorithm 1 is generated, and in order to test this method it is executed on two separate data sets, namely the human immunodeficiency virus (HIV) data set and the militarised interstate dispute (MID) data set, from which rules are extracted and classification accuracies obtained. An explanation of these data sets as well as the results of the experimental tests are given in Chapters 4 and 5.

Chapter 4

Experimental Investigation I : Modelling HIV

4.1 Introduction

In the last 20 years, over 60 million people have been infected with HIV (human immunodeficiency virus), and of those cases, 95% are in developing countries [64]. In 2006 alone, an estimated 39.5 million people around the world were living with HIV, with 27.5 million of those people living in Sub-Saharan Africa. During this year, AIDS (acquired immune deficiency syndrome) claimed an estimated 2.9 million lives [65]. HIV has been identified as the causative agent of AIDS. The effect of AIDS is not only detrimental to the individual infected but has a devastating effect on the economic, social, security and demographic levels of a country. Because AIDS is killing people in the prime of their working and parenting lives, it represents a grave threat to economic development. In the worst affected countries, the epidemic has already reversed many of the development achievements of the past generation [65]. AIDS has a large negative impact on the social and security levels of a country. Social levels drop as the health and educational development that is supposed to benefit poor people is impeded, and the average life expectancy drops. It is estimated that by 2010 the number of orphans will be double that of 2006 [65].

AIDS and HIV have been the subject of extensive research, literature and political debate. Although a wide array of viewpoints have been stated on the subject, one consistent presupposition is that the biological disease exists as an ontological reality [66]. Early studies on HIV/AIDS focused on individual characteristics and behaviours in determining HIV risk; Fee and Krieger refer to this as biomedical individualism [67]. But it has been determined that the study of the distribution of health outcomes and their social determinants is just as important; this is referred to as social epidemiology [68].
Therefore risk factor epidemiology examines the individual demographic and social characteristics in an attempt to determine the factors that expose an indi- vidual to the risk of obtaining HIV/AIDS. This study uses individual characteristics as well as social and demographic factors in determining the risk of HIV. It is thus evident from above that the analysis of HIV is of utmost importance. By correctly forecasting HIV, the causal interpretations of a patients being seropositive (infected by HIV) is made much easier. Previously, computational intelligence tech- niques have been used extensively to analyse HIV. Leke et al have used autoencoder network classifiers, inverse neural networks, as well as conventional feedforward neu- ral networks to analyse HIV on the same data set used in this research [28, 29, 30]. They used the inverse neural network for adaptive control of HIV status to under- stand how the demographic factors affect the risk of HIV infections [30]. Lee et al used neural networks to assess HIV/AIDS related health performance [69]. Lumini and Nanni used machine learning for HIV-1 protease cleavage site prediction [70]. Although an accuracy of 84% is achieved when using the autoencoder method of Leke et al [28], it is disadvantageous due to its ?black box? nature, this also applies to the other mentioned neural network techniques. Neural networks offer accuracy over analysis of data, but in the case of analysing HIV data, it can be argued that interpretability of the data is of more importance than just prediction. It is due to this fact that rough set theory (RST) is proposed to forecast and interpret the causal effects of HIV. RST produces linguistic rules that govern the predictions that they make and it is these rules that provide decision-makers and policy-makers with more accurate information to allow them to implement better health care systems 37 4.2. HIV DATA and HIV prevention schemes. 4.2 HIV Data The data set used in this research was obtained from the South African antena- tal sero-prevalence survey of 2001. The data was obtained through questionnaires completed by pregnant women attending selected public clinics and was conducted concurrently across all nine provinces in South Africa. The sentinel population for the study only included pregnant women attending an antenatal clinic for the first time during their current pregnancy. The choice of the first antenatal visit is made to minimise the chance for one woman attending two clinics and being included in the study more than once [45]. The probability proportion to size (PPS) sampling method was used to determine the sample size of the 2001 antenatal survey [45]. PPS is a sampling technique used (in surveys) to ensure the probability of select- ing a sample unit (attribute) is proportional to the size of the population. The collected data used to study the conditions associated with HIV consists of six dif- ferent attributes. The six demographic variables considered are: race, age of mother, education, gravidity, parity and, age of father. For any set of attributes considered, there is an associated outcome or decision. The outcome is either HIV positive or negative. 4.2.1 Data As previously mentioned, HIV infections rate are described using six demographic variables. These variables are further described below. Race: This variable is a nominal polytomous variable as there are four values associ- ated with it but these values have no order. The four possible variables are African, White, Coloured and, Asian. 
Age of Mother: This variable consists of continuous integers, as the values range 38 4.2. HIV DATA from 15 years old to 49 years old. The value taken for the age of mother was taken at the time of conducting the survey. Education: This variable also contains continuous integers, with the values ranging from 0 to 13. The value of 0 indicates that no education has been undertaken by the mother, while an education level of 13 represents tertiary education. Gravidity: Gravidity is given as continuous integers ranging from 0 to 11. Gravidity is the number of pregnancies, complete or incomplete, experienced by the mother. Parity: Parity is also given as continuous integers with values ranging from 0 to 11. Parity is the number of times the mother has given birth. Multiple births are however counted as one. Both parity and gravidity are important factors as they show the reproductive activity as well as the reproductive health state of the women taking part in the test. Age of Father: As with age of mother, this variable consists of continuous integers, the values range from 15 years old to 49 years old. HIV Status: This variable is given as the outcome or decision value. This value is dichotomous as the variables or values are given in binary, with a 0 representing HIV negative while a 1 represents HIV positive. A summary of the data is given in Table 4.1 Table 4.1: Summary of the HIV data set variables Variable Description Value Range x1 Race {African, White, Coloured, Asian} x2 Age of Mother [15, 49] x3 Education [0, 13] x4 Gravidity [0, 11] x5 Parity [0, 11] x6 Age of Father [15, 49] y1 HIV Status 0 or 1 39 4.2. HIV DATA 4.2.2 Cleaning the Data As mentioned the data was collected from surveys, and like any data that is in raw form, there are several steps that need to be undertaken to ensure the data is transformed into a usable form. This usable form includes the removal of any outliers such as missing and/or incorrect data. The data needs to be cleaned and removed of irregularities before any processing can be performed on it. The first irregularity would be the case of missing data. This could be due to the fact that surveyees may have omitted certain information, it could also be attributed to the errors being made when the data was entered onto the computer. Such cases are removed from the data set. The second irregularity would be information that is false. Such an instance would be if gravidity was zero and parity was at least one. Gravidity is defined as the number of times that a woman has been pregnant, and parity is defined as the number of times that she has given birth. Therefore it is impossible for a woman to have given birth, given that she has not been pregnant, such cases are removed from the data set. As mentioned earlier, multiple births are still indicated with a parity of one, therefore if parity is greater than gravidity, that particular case is removed from the data set. Only 12945 cases remained from the initial total of 13087. Of the 12945 cases, the data sets were balanced and then split into training and testing data using the ratio of 70% to 30% respectively. The rough set model was created using the training data, while the performance of the rough set model was validated using the testing data. Once the data had been pre-processed as described above, it was used to create a rough set model, and the model was validated using testing data. 
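As an illustration of the cleaning rules described above, the following pandas sketch applies the two consistency checks (removal of missing values, and removal of cases where parity exceeds gravidity) and the 70% / 30% split. The file name and column names are assumed for illustration, and the class balancing step is only indicated by the shuffle, since the exact balancing procedure is not reproduced here.

```python
import pandas as pd

# Hypothetical column names for the six attributes and the outcome
cols = ["race", "mother_age", "education", "gravidity", "parity",
        "father_age", "hiv_status"]
df = pd.read_csv("antenatal_2001.csv", names=cols)   # assumed file name

# Remove cases with missing attribute values
df = df.dropna()

# Remove logically impossible cases: parity cannot exceed gravidity
df = df[df["parity"] <= df["gravidity"]]

# 70% training / 30% testing split (applied after balancing the classes)
df = df.sample(frac=1, random_state=0)                # shuffle
split = int(0.7 * len(df))
train, test = df.iloc[:split], df.iloc[split:]
```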
The results obtained by applying the newly proposed rough set approach to human immunode- ficiency virus (HIV) data are given below. 40 4.3. RESULTS OBTAINED ON HIV DATA 4.3 Results Obtained on HIV data A confusion matrix is a visualisation tool represented in the form of a table/matrix. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. A confusion matrix is used to see if the classification of a certain class is being confused with that of another, i.e. it illus- trates the mislabelling of classes with another. The confusion matrix is represented in terms of actual positives and negatives against predicted positives and negatives. Table 4.2 illustrates how the actual and predicted classification is shown. Table 4.2: Representation of a Confusion Matrix Predicted Positive Predicted Negative Actual Positive a b Actual Negative c d Where: ? a is the number of correct predictions that an instance is positive ? b is the number of incorrect of predictions that an instance negative ? c is the number of incorrect predictions that an instance is positive ? d is the number of correct predictions that an instance is negative The four different optimisation techniques results will be plotted against each other using a ROC curve. These techniques will also be individually plotted against that of EWB and EFB partitioning to determine how the newly proposed method performs against that of the more well known static methods. The confusion matrix for the EWB partitioning is given in Table 4.3 below. Table 4.4 gives the confusion matrix for EFB partitioning. The confusion matrices for the optimisation methods are given in section 4.3.1. 41 4.3. RESULTS OBTAINED ON HIV DATA Table 4.3: Confusion matrix for EWB discretisation on HIV data Equal-Width-Bin Partitioning Predicted Positive Predicted Negative Actual Positive 800 43 Actual Negative 630 185 59.41% Table 4.4: Confusion matrix for EFB discretisation on HIV data Equal-Frequency-Bin Partitioning Predicted Positive Predicted Negative Actual Positive 485 378 Actual Negative 374 423 54.70% 4.3.1 Results of Optimisation Techniques The results obtained for the four optimisation techniques as described in Chapter 3 are given in this section. The results of equal-width-bin and equal-frequency-bin partitioning are compared to those of the various optimisation approaches. The confusion matrix of each method will be given in each respective section along with the ROC curve of the particular optimisation technique plotted against that of EWB and EFB. Table 4.9 summarises the findings. Genetic Algorithm The genetic algorithm was run with a roulette wheel selection function, a uniform crossover operator, a boundary mutation operator, an initial population of 20 indi- viduals and the termination function of 100 generations. The confusion matrix for the GA is given in Table 4.5. Table 4.5: Confusion matrix of the GA discretisation on HIV data Genetic Algorithm Predicted Positive Predicted Negative Actual Positive 677 135 Actual Negative 496 327 61.41% 42 4.3. RESULTS OBTAINED ON HIV DATA Particle Swarm Optimisation Particle swarm optimisation was implemented with the maximum number of gener- ations of 30, and the initial number of particles also set to 30. The confusion matrix of the results of PSO is given in Table 4.6. 
Table 4.6: Confusion matrix of a the PSO discretisation on HIV data Particle Swarm Optimisation Predicted Positive Predicted Negative Actual Positive 578 239 Actual Negative 380 445 62.30% Hill Climbing Steepest ascent hill climbing was implemented with 20 initial starting points. The theory of hill climbing (HC) as with the other methods is given in Chapter 3. The hill climbing confusion matrix is given in Table 4.7. Table 4.7: Confusion matrix of a the HC discretisation on HIV data Hill Climbing Predicted Positive Predicted Negative Actual Positive 652 162 Actual Negative 438 377 63.17% Simulated Annealing SA has the cooling model as given in Equation 3.8, it was run with a random generator with the bounds of the maximum and minimum input values, an initial temperature of 1, a stopping temperature of 1e?8, a maximum number of consecutive rejection of 200 and a maximum number of successes within one temperature set to 10. The confusion matrix for the results obtained using SA is given in Table 4.8. 43 4.3. RESULTS OBTAINED ON HIV DATA Table 4.8: Confusion matrix of a the SA discretisation on HIV data Simulated Annealing Predicted Positive Predicted Negative Actual Positive 665 141 Actual Negative 466 350 62.58% 4.3.2 Comparison of Methods on HIV data The ROC curves of the various optimisation techniques plotted against EWB and EFB are given below. Figure 4.5 gives the ROC curve of the four optimised tech- niques plotted against each other. As stated in Chapter 2, the performance of the rough set is calculated as the AUC of the ROC curve. The AUC can be interpreted as the probability that when we randomly pick one positive and one negative ex- ample, the classifier will assign a higher score to the positive example than to the negative. The higher the AUC, the better the classification is. Genetic Algorithm ROC Curve Figure 4.1: Receiver operating characteristic (ROC) curves for GA, EWB and EFB on HIV data 44 4.3. RESULTS OBTAINED ON HIV DATA Particle Swarm Optimisation ROC Curve Figure 4.2: Receiver operating characteristic (ROC) curves for PSO, EWB and EFB on HIV data Hill Climbing ROC Curve Figure 4.3: Receiver operating characteristic (ROC) curves for HC, EWB and EFB on HIV data 45 4.3. RESULTS OBTAINED ON HIV DATA Simulated Annealing ROC Curve Figure 4.4: Receiver operating characteristic (ROC) curves for SA, EWB and EFB on HIV data It is clearly visible from the previous four ROC curves that each optimisation tech- nique has a better classification accuracy than both that of EWB and EFB parti- tioning. Figure 4.5 illustrates the performance of all four optimisation techniques against each other. It is clearly shown in Figure 4.5 that there is no significant difference between the various optimisation techniques. There is narrow band in which all the ROC curves fall. This narrow band indicates that no one particular optimisation technique is much better suited than the others. Table 4.9 tabulates the results from which it can be seen that, albeit marginal, hill climbing produces the highest classification accuracy when using the AUC measure. The table displays the number of rules produced, the computational time (minutes) and the AUC. As can be seen from Table 4.9 the simulated annealing optimisation has the lowest com- putational time with respect to the optimised methods. From the table, it is evident that the computational time required for the static EWB and EFB partitioning is much less than that for the optimised approaches. 
Having said that, the higher forecasting accuracy obtained using optimised methods will results in more concrete evidence from which policies can be generated. Less rules are also generated from 46 4.3. RESULTS OBTAINED ON HIV DATA the optimised methods approach and it is from these extracted rules from which the causal interpretations are then formulated by a linguistic approximation. Figure 4.5: Receiver operating characteristic (ROC) curves for GA, PSO, HC and SA on HIV data Table 4.9: Results obtained for the four optimisation and two static discretisation methods Number of Rules Comp. Time AUC Genetic Algorithm 172 3726 0.6748 Particle Swarm Optimisation 146 1690 0.6718 Hill Climbing 209 12624 0.6902 Simulated Annealing 197 1159 0.6802 Equal Width Bin 231 2 0.5952 Equal Frequency Bin 307 2 0.5986 Once the highest classification accuracy was achieved for each method, the partition sizes were noted. Figure 4.6 shows the optimal cut points and partition sizes for the six variables for the GA method. The other methods discretisation points can be represented in the same way but is not shown. These cut points as shown in the figure indicate which values should fall into which range. As for EWB and EFB partitioning, the partitions are split as described in Chapter 3. The results shown in this chapter indicate that the newly proposed method of using optimisation techniques to granulise/discretise rough set partitions is feasible and it 47 4.3. RESULTS OBTAINED ON HIV DATA Figure 4.6: Discretisation points for the various attributes using the GA method on HIV data also produces higher forecasting/classification accuracies than that of the previously used static methods of equal-width-bin and equal-frequency-bin partitioning. Of the four optimisation techniques, it can be stated that no one particular technique is superior to any other. There are marginal difference in the accuracies produced (using AUC), but in all cases easy-to-interpret, linguistic rules are generated. A case generated from the lower and upper approximation is represented as shown on the next page. 48 4.3. RESULTS OBTAINED ON HIV DATA Lower Approximation Rules ? If Race = African and Mothers Age = 23 and Education = 4 and Gravidity = 2 and Parity = 1 and Fathers Age = 20 Then HIV = Most Probably Positive ? If Race = Asian and Mothers Age = 30 and Education = 13 and Gravidity = 1 and Parity = 1 and Fathers Age = 33 Then HIV = Most Probably Negative Upper Approximation Rules ? If Race = Coloured and Mothers Age = 33 and Education = 7 and Gravid- ity = 1 and Parity = 1 and Fathers Age = 30 Then HIV = Positive with plausibility = 0.33333 ? If Race = White and Mothers Age = 20 and Education = 5 and Gravidity = 2 and Parity = 1 and Fathers Age = 20 Then HIV = Positive with plausibility = 0.06666 The proposed method used in this research is further validated by applying it to another data set. The second data set used to test the method is the militarised interstate dispute (MID) data set. The results will be given in Chapter 5 in a similar form as that given in this chapter. 49 Chapter 5 Experimental Investigation II : Modelling MID 5.1 Introduction Recent developments in the conflict literature has emphasised the importance of treating international conflicts as complex phenomena often displaying nonlinear and nonmonotonic patterns of interactions [71, 72]. Various methods have been implemented and there are still efforts underway to study interstate interactions or militarised interstate disputes. 
Militarised interstate disputes (MID) are defined as a set of interactions between or among states involving threats to use military force, displays of military force, or actual uses of military force [73]. In other words MID?s are conflicts between states that do not involve a full scale war. The various states involved in the analysis of conflict are referred to dyads, therefore the variables associated with the analysis of conflicts are referred to as dyadic variables. Beck et al. [33] state that no legitimate statistical model has managed to forecast an international conflict with a probability of greater than 0.5. Statistical models have also produced results that are conflicting; Thompson and Tucker state that for democratic dyads the absence of war and the low probability of serious disputes are restricted, in other words countries or states that are democratic have a reduced 50 5.1. INTRODUCTION chance of being involved in conflict [74]. However, Mansfield and Snyder argue that democratic states or countries may eventually become peaceful but in their early years they are prone to conflict or war involvement [75]. Lagazio et al state that the failure of statistical models is resultant of interstate variables that are related to MIDs are non-linear, highly interdependent and context dependent [32]. Thus the need for modelling and forecasting conflict using artificial techniques has become inevitable. Even though political scientists have long eschewed forecasting of conflicts in favour of causal interpretations, it is argued that causal theories are considered harder to verify than forecast, yet accurate forecasts can be used to verify claims about the causal structure, or at least in part [33]. Accurate forecasting will aid in the interpretation of the conflict. Due to the shift from statistical methods to AI techniques, several AI methods have been applied to conflict management. Neural networks using Gaussian approxima- tion have been used to model the complex interrelationships between the dyadic parameters and the MID?s [7, 32, 33]. Other methods that have been implemented include neuro-fuzzy, regression, and support vector machines (SVM). Similarly, game theory has been used in conflict analysis [20]. Habtemariam and Marwala introduced support vector machines (SVM) to the study of conflict management [8], although SVM?s offer high forecasting accuracy, they lack the ability to perform a causal anal- ysis on the results achieved. Marwala and Lagazio proposed the use of Automatic relevance detection (ARD) to incorporate transparency into the prediction of neural networks. The results obtained allowed the ranking of importance of the 7 dyadic variables. In decreasing order the variables were ranked; Democracy, Capability, Dependency, Allies, Contiguity, Distance and Major power [36]. Offering a better balance between accuracy and interpretability, Tettey and Marwala proposed the use of neuro-fuzzy methods to control and analyze conflict [76]. 51 5.2. MID DATA 5.2 MID Data The MID data used in this study was obtained from the correlates of war project [46]. The correlates of war (COW) project is an academic study of the history of warfare. COW seeks to facilitate the collection, dissemination and use of accurate and reliable quantitative data in international relations [46]. It is argued however, that the COW project has tried to side-step the subjectivity problem by standardising their core concepts. 
Such standardisation by a research clique is often viewed and condemned by competitors as adding an ideological dimension that constrains or dogmatises research [77]. Although extensive data collection efforts have been made, a great deal of research is underway to ensure satisfactory and reliable conflict models are created. But by using databases such as those provided by COW, conflict causality is understood and the results obtained can then be specified as a tool that will contribute to decision making and policy formulation. The collected data, referred to as interstate variables are used to study the in- teractions associated with militarised interstate disputes. The various interstate or dyadic variables of the data set consist of kantian and realist variables [72]. Kan- tian variables arise from the Kantian peace theory which was developed by Immanuel Kant [72, 78]. This theory states that by controlling variables such as democracy, in- terdependence and intergovernmental organizations, the risk of war can be reduced, such a theory is also known as the ?liberal? theory [72]. The realist theory on the other hand encompasses a variety of theories and approaches, all of which share a belief that states are primarily motivated by the desire for military and economic power or security rather than ideals or ethics. Realists believe that overwhelming power will reduce the likelihood of conflict [72, 79]. The realist variables include Allies, Contiguity, Major Powers, Distance and Capa- bility. The kantian variables include Democracy and Dependency. The above set of attributes have associated outcomes or concepts, the outcome is either peace or conflict. These variables will be explained in below. 52 5.2. MID DATA 5.2.1 Data The seven dyadic variables as stated previously will be explained below and sum- marised in Table 5.1 Allies: This variable is dichotomous, consisting of a binary decision value. Allies refers to the measure of the military alliance between dyads. A value of 1 implies that there is an alliance of some kind, whether it be a mutual defence treaty or neutrality pact. A value of 0 implies that there is no alliance between the dyads. Contiguity: This is also a dichotomous value, and it indicates whether the two countries or dyads share any common boundary or border. If they do share a border a value of 1 is assigned, and a value of 0 is assigned if they do not. Major Powers: This dichotomous value is assigned to state if either of the two dyads are major powers or not. A country is considered a major power if it has substantial relative global destructive power based on a consensus of historians. If either dyad is a major power a value of 1 is assigned, or a 0 if not. Distance: This continuous variable is simply the distance between the two coun- tries. It is taken as the logarithmic distance (base 10) between the capital cities of the two countries. The values in the data set range from between 1.87 and 9.41. Capability: Capability is a continuous variable and is the logarithm to base 10 ratio of the total population plus the number of people in urban areas plus industrial energy consumption plus iron and steel production plus number of military personnel in active duty plus military expenditure in dollars in the last 5 years measured on stronger country to weaker country. Democracy: This variable consists of continuous values in the range [-10, 10] stat- ing the democracy level of the relevant country. 
-10 would denote an extremely auto- cratic state whereas 10 denotes an extremely democratic state. The joint democracy level is then calculated as a minimum of both the states democracy scores as it is assumed the less democratic state plays a determinant role for an occurrence of a 53 5.2. MID DATA conflict. Dependency: Dependency is a continuous variable that measure the economic interdependence of the less economically dependent state. It is measured as the sum of the countries imports and exports with its partner divided by the gross domestic product (GDP) of the stronger country, in other words it is the minimum bilateral trade-to-GDP ratio between the two countries. A summary of the described dyadic variables is given in Table 5.1. Table 5.1: Summary of the MID data set variables Variable Description Value Range x1 Allies 0 or 1 x2 Contiguity 0 or 1 x3 Major Powers 0 or 1 x4 Distance [1.84, 9.41] x5 Capability [0, 9.49] x6 Democracy [-10, 10] x7 Dependency [0, 0.1719] y1 Outcome 0 or 1 5.2.2 Cleaning the Data As is the case with the HIV data, any outliers if any, or missing attribute cases must be removed. The MID data used contains 27737 cases each consisting of seven dyadic attributes or variables and an associated dispute outcome. The data contains 26845 peace examples and 892 conflict examples therefore for accurate classification using AUC, the data set needs to be balanced. Once the data was balanced, the data set was used to generated two different sets, one training set and one testing set. This data was split on a ratio of 70% training data to 30% testing data. The training data was used to generate the rough set model while the testing data was used to assess out-of-sample accuracy. 54 5.3. RESULTS OBTAINED ON MID DATA 5.3 Results Obtained on MID data An important statement Beck et al make is, although political scientists are less likely to evaluate conflict models per se than, say economists, forecasting underlines all evaluations of such a model, and it is this information that will be of signifi- cant use [33]. The results using the four different optimisation techniques will be given. This includes the AUC along with their ROC curves as well as the respective confusion matrices. Each method will be plotted against that of EWB and EFB partitioning as well as all the methods will be plotted against each other. 5.3.1 Results of Optimisation Techniques As is the case with HIV, the confusion matrices will be given for both EWB and EFB partitioning as well as for each optimisation technique. For the purpose of consis- tency, the chosen operators used for the optimisation techniques and the numerical values used to fine tune each method are maintained from the HIV experimental investigation. These values will not be stated again as they are given in the respec- tive sections in Chapter 4. The confusion matrices are first given for the two static methods, and then for each optimisation method under their respective section. Table 5.2: Confusion matrix for EWB discretisation on MID data Equal-Width-Bin Partitioning Predicted Positive Predicted Negative Actual Positive 181 79 Actual Negative 70 205 72.15% Table 5.3: Confusion matrix for EFB discretisation on MID data Equal-Frequency-Bin Partitioning Predicted Positive Predicted Negative Actual Positive 213 72 Actual Negative 79 171 71.77% 55 5.3. 
RESULTS OBTAINED ON MID DATA Genetic Algorithm Table 5.4: Confusion matrix of the GA discretisation on MID data Genetic Algorithm Predicted Positive Predicted Negative Actual Positive 256 25 Actual Negative 39 215 88.04% Particle Swarm Optimisation Table 5.5: Confusion matrix of a the PSO discretisation on MID data Particle Swarm Optimisation Predicted Positive Predicted Negative Actual Positive 233 32 Actual Negative 73 197 80.37% Hill Climbing Table 5.6: Confusion matrix of a the HC discretisation on MID data Hill Climbing Predicted Positive Predicted Negative Actual Positive 225 34 Actual Negative 70 206 80.56% Simulated Annealing Table 5.7: Confusion matrix of a the SA discretisation on MID data Simulated Annealing Predicted Positive Predicted Negative Actual Positive 241 20 Actual Negative 67 207 83.74% 56 5.3. RESULTS OBTAINED ON MID DATA 5.3.2 Comparison of Methods on MID data The optimised approaches are compared against EWB and EFB partitioning. The comparison is based on classification accuracy using AUC as the performance mea- sure as is done with HIV. Figure 5.5 displays the four optimised approaches plotted against each other, this is performed to determine if one particular technique is bet- ter suited than the others for discretisation of input data into the rough set using the MID data set. Table 5.8 summarises the results of all the methods applied to the conflict (MID) data set, it states the number of rules produced, accuracies (AUC) and the computational time required (in minutes). Genetic Algorithm ROC Curve Figure 5.1: Receiver operating characteristic (ROC) curves for GA, EWB and EFB on MID data 57 5.3. RESULTS OBTAINED ON MID DATA Particle Swarm Optimisation ROC Curve Figure 5.2: Receiver operating characteristic (ROC) curves for PSO, EWB and EFB on MID data Hill Climbing ROC Curve Figure 5.3: Receiver operating characteristic (ROC) curves for HC, EWB and EFB on MID data 58 5.3. RESULTS OBTAINED ON MID DATA Simulated Annealing ROC Curve Figure 5.4: Receiver operating characteristic (ROC) curves for SA, EWB and EFB on MID data Figure 5.5: Receiver operating characteristic (ROC) curves for GA, PSO, HC and SA on MID data Figure 5.5 illustrates the performance of each optimisation techniques against each other. As can be seen from the graph, the genetic algorithm approach produces the highest accuracy (AUC). 59 5.3. RESULTS OBTAINED ON MID DATA Table 5.8: Results obtained for the four optimisation and two static discretisation methods Number of Rules Comp. Time AUC Genetic Algorithm 422 420 0.9582 Particle Swarm Optimisation 260 394 0.9281 Hill Climbing 187 988 0.8934 Simulated Annealing 396 100 0.9352 Equal Width Bin 123 1 0.7724 Equal Frequency Bin 84 1 0.7672 From the results given in Table 5.8, it can be seen when using the conflict data set, that more rules are produced when a higher accuracy is produced. Once again the optimised approaches produced higher accuracies than the previously more well known static methods of equal-width-bin and equal-frequency-bin partitioning. It must be stated that although the optimised approaches produce higher classifica- tion accuracies, it does come at the expense of computational time as can be seen when a comparison is made between the optimised and static methods. The genetic algorithm approach produced the greatest forecasting accuracy (AUC) of 95.82%, while considering only the optimisation methods, simulated annealing is the most time efficient, i.e. has the lowest computational time. 
Using the results produced by each optimisation method, the optimal known dis- cretisation points can be determined, and using these partition cuts, rough sets can be run on the particular data set and forecast with a certain plausibility based on the rough membership function. From these rough sets, rules can also be extracted in an if CONDITION(S)-then DECISION format. These rules are linguistic and easy to interpret. To illustrate this, a lower approximation case generated from the MID data set is represented as shown below. Before the input attributes are discretised, it will look in the form as it is shown below. The outcome or decision can only be determined once a rule set has been formed (from discretised attributes). Therefore, for a given set of attributes, an outcome associated with it can be determined. ? If Allies = 1 (True) and Contiguity = 0 (False) and Major Powers = 1 (True) and Distance = 8 (High) and Capability = 9 (High) and Democracy = 10 60 5.3. RESULTS OBTAINED ON MID DATA (True) and Dependency = 0.17 (High) ThenOutcome = Most Probably Peace The findings as illustrated in the generated rule shown above, are as what one would expect. Although an attribute sensitivity analysis does not make sense when applied to rough set theory, the rules produced do provide insight into the factors that may contribute to conflict. This can be done by varying a particular attribute or several attributes simultaneously and observe what factors result in what particular outcome. Another important observation is the rules can be generated without the use of a priori information. The optimal cut points and partition sizes for the seven dyadic variables for the GA method are shown in Figure 5.6. From the findings of both experimental investigations (HIV and MID), the use of op- timisation techniques produce higher classification/forecasting accuracies than that of the static EWB and EFB methods. These higher accuracies produce more ac- curate rules from which forecasting and causal interpretations can be performed. For the case of HIV, hill climbing produced the best results but this came at the huge expensive of computational time. For MID, the genetic algorithm approach produced the best classification accuracy, yet simulated annealing produced an accu- racy nearly as high as that of GA but in the least computational time. As expected, there is no significant difference between the use of any of the four optimisation tech- niques. In most cases an optimum will be found. Using the method of hill climbing implemented, the risk of finding a local rather than a global optimum is increased. Genetic algorithms and simulated annealing are better optimisation techniques to be applied to the general problem of discretising rough sets due to their nature of finding a global optimum. Although the GA optimisation method produced good classification results, it must be noted that in some cases the GA may prematurely converge towards local optima. 61 5.3. RESULTS OBTAINED ON MID DATA Figure 5.6: Discretisation points for the various attributes using the GA method on MID data 62 Chapter 6 Conclusion 6.1 Summary of Findings The use of rough set theory is very useful when dealing with vague and uncertain data. It is also particularly useful for generating rules for the purpose of forecasting and determining causal interpretations of a particular data set. An important step in rough set model generation is the discretisation/granulisation of the input data into partitions. 
Four partitions are chosen in this research as they provide a good balance between accuracy achieved and computational time. Several methods have been used to perform this granulisation step, and in particular, two well known static methods called equal-width-bin (EWB) and equal-frequency-bin (EFB) partitioning. The work performed in this research is concerned with using various optimisation techniques to granulise the rough set input partitions to achieve the highest forecast- ing accuracy produced by the rough set. This forecasting accuracy is measured by using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The four optimisation techniques used are genetic algorithm, particle swarm optimisation, hill climbing and simulated annealing. This newly proposed method is tested on two data sets, namely, the human immunodeficiency virus (HIV) data set and the militarised interstate dispute (MID) data set. The HIV data was obtained from the antenatal sero-prevalence survey taken in South Africa in 2001. The MID 63 6.2. RECOMMENDATIONS FOR FURTHER WORK data used is from the correlates of war (COW) project. The COW project facilitates the collection, dissemination and use of accurate and reliable quantitative data in international relations. The results of both data sets indicate that the optimised methods produce higher forecasting/classification accuracies than that of the static EWB and EFB parti- tioning methods. When comparing the optimised approaches against each other, as is the case with both data sets, as expected there is no significant difference between the methods used. When looking at the MID results, the hill climbing accuracy is marginally lower than the other three methods. This could be attributed to the fact that the implementation of hill climbing used allows the optimum to be found at a local rather than a global point. It must be noted that depending on the linearity and complexity of the data set, the optimal technique to be used to discretise the partitions will vary. The rough sets produce a balance between transparency of the rough set model and accuracy of HIV estimation, but it does come at a cost of high computational effort. The data was balanced and split into a 70% to 30% training data to testing data ratio. The highest accuracy achieved for the HIV data set is obtained by using the hill climbing approach, and an accuracy of 69.02% is achieved in a time of 12624 minutes. For the MID data, the genetic algorithm approach pro- duced the highest accuracy. This accuracy is 95.82% in a time of 420 minutes. As shown in Chapters 4 and 5, the rules extracted are linguistic, but this transparency of the model does come at the expense of classification accuracy. This loss in ac- curacy is brought about in the discretisation process where the granularity of the variables are decreased. A highly interpretable model is very useful in the study of HIV and conflict management studies as it allows for policy formulation and the testing of causal hypotheses. 6.2 Recommendations for Further Work The work performed shows that the application of rough set theory to HIV and MID is feasible. The accuracies obtained for MID are very good while the results 64 6.2. RECOMMENDATIONS FOR FURTHER WORK produced for HIV are not as good. Altering the parameters and implementation of the optimisation techniques may lead to an increase in the rough set forecasting accuracy. 
For example, in the genetic algorithm, the introduction of other diver- sification mechanisms such as elitism can also be implemented. Other work that could be investigated includes using optimisation techniques to run in parallel with the existing framework to determine the number of partitions required. An investi- gation into the bounds of the optimisation techniques should also be performed to determine whether this has a large effect on the results or not. The effect of the rule generation from the inclusion of reducts should be tested, it can also be investigated whether optimisation techniques serve any purpose in discretising the input data given the fact that reducts are going to remove redundant information. Current policies and theories of the relevant field can be tested against the obtained results. The model should also be tested in conjunction with experts in the particular field. These experts can indicate whether the results obtained are feasible and more importantly whether the results can actually be used to forecast future occurrences. If the data had been balanced the same way as that in the MID study performed by [80], and the HIV study performed by [50] the results from this research could be compared. 65 Appendix A Algorithms In this appendix, various algorithms that were discussed in the thesis are presented. A.1 Rough Set Model Generation Algorithm 1: Algorithm to generate rough set model Input : Condition and Decision Attributes Output: Certain and Possible Rules Obtain data set to be used ;1 repeat2 for conditionalattribute? 1 to sizeoftrainingdata do3 Pre-process data to ensure its ready for analysis;4 Discretise data according to optimisation technique;5 Compute the lower approximation, such that:6 BX = {x ? U: B(x )?X }; Compute the upper approximation, such that:7 BX = {x ? U: B(x )?X negationslash= ?}; From the generated rules, calculate plausibility measures of which an8 object x belongs to set X. This is defined by: ?XA = |[X]B ?X| [X]B ; Extract the certain rules from the lower approximation generated for9 each subset; Similarly, extract the possible rules from the upper approximation of10 each subset; Remove the rules generated for the purpose of testing on unseen data;11 Compute the classifier performance using AUC;12 end13 until Optimisation technique termination condition ;14 66 A.2. RECEIVER OPERATING CHARACTERISTICS (ROC) CURVE A.2 Receiver Operating Characteristics (ROC) Curve The ROC curve implemented is the efficient method given by [40]. This is given by: Algorithm 2: Algorithm to calculate ROC curve Input : L, the test set of instances; f(i), the probabilistic classifier?s estimate that instance i is positive; P and N , the number of positive and negative examples. Output : R, a list of ROC points increasing by fp rate. Require: P > 0 and N > 0 begin1 Lsorted ? L sorted decreasing by f scores;2 FP ? TP ? 0;3 R? ??;4 fprev ? ??;5 i? 1;6 while i ? |Lsorted| do7 if f(i) negationslash= fprev then8 push (FPN , TP P ) onto R;9 fprev ? f(i);10 end11 if Lsorted[i] is a positive example then12 TP ? TP + 113 else14 FP ? FP + 115 end16 i? i+ 1;17 end18 push (FPN , TP P ) onto R;19 end20 There are various approaches to calculating ROC graphs, and although one possible method is given above, it must be noted that this is not the only method that can be used. The area under curve (AUC) of the ROC curve is computed in a similar way as to that of just the ROC curve. This implementation will be given in Algorithm 3 on the next page. 67 A.3. 
AREA UNDER CURVE (AUC) A.3 Area Under Curve (AUC) Algorithm 3: Algorithm to calculate the AUC Input : L, the set of test examples; f(i), the probabilistic classifier?s estimate that instance i is positive; P and N , the number of positive and negative examples. Output : A, the area under the ROC curve Require: P > 0 and N > 0 begin1 Lsorted ? L sorted decreasing by f scores;2 FP ? TP ? 0;3 FPprev ? TPprev ? 0;4 A? 0;5 fprev ? ??;6 i? 1;7 while i ? |Lsorted| do8 if f(i) negationslash= fprev then9 A? A+ trapezoid area(FP, FPprev, TP, TPprev);10 fprev ? f(i); FPprev ? FP ;11 TPprev ? TP ;12 end13 if i is a positive example then14 TP ? TP + 1;15 else16 FP ? FP + 1;17 end18 end19 A? A+ trapezoid area(1, FPprev, 1, TPprev);20 A? A(P?N) ;21 end22 The algorithm describing the trapezoid area function is given in Algorithm 4. Algorithm 4: trapezoid area function Function: trapezoid area (X1,X2,Y1,Y2) Base? |X1?X2|;1 Heightavg ? (Y 1+Y 2) 2 ;2 return : Base?Heightavg 68 Appendix B Flow Diagrams In this appendix, the flow diagrams for the genetic algorithm, particle swarm optimi- sation, hill climbing and simulated annealing optimisation techniques are illustrated. Each flow chart will be given on a new page starting from the next page. 69 B.1. GENETIC ALGORITHM B.1 Genetic Algorithm Figure B.1: Flow diagram of genetic algorithm optimisation 70 B.2. PARTICLE SWARM OPTIMISATION B.2 Particle Swarm Optimisation Figure B.2: Flow diagram of particle swarm optimisation 71 B.3. HILL CLIMBING B.3 Hill Climbing Figure B.3: Flow diagram of hill climbing optimisation 72 B.4. SIMULATED ANNEALING B.4 Simulated Annealing Figure B.4: Flow diagram of hill climbing optimisation 73 Appendix C Published Work In this appendix the papers published based on the work performed in this thesis are listed. ? B. Crossingham and T. Marwala. ?Using Optimisation Techniques to Gran- ulise Rough Set Partitions?. Mathematical and Statistical Physics, In Proceed- ings of International Symposium on Computational Models of Life Sciences, American Institute of Physics, Volume 1, 2007, pp. 248-258. ? B. Crossingham and T. Marwala. ?Using Genetic Algorithms to Optimise Rough Set Partition Sizes for HIV Data Analysis?. Studies in Computational Intelligence, In Advances in Intelligent and Distributed Computing, Springer, Volume 78, 2008, pp. 245 - 250. These papers are attached to the back of this dissertation. 74 References [1] Z. Pawlak, Rough Sets, Theoretical Aspects of Reasoning about Data, ch. 3, p. 33. Kluwer Academic Publishers, 1991. [2] J. Komorowski, L. Polkowski, and A. Skowron, ?Rough sets: a tutorial. in s.k. pal and a. skowron, editors, rough-fuzzy hybridization: A new method for decision making,? 1998. [3] A. F. B. Jaafar, J. Jais, M. H. B. H. A. Hamid, Z. B. A. Rahman, and D. Be- naouda, ?Using rough set as a tool for knowledge discovery in dss,? in Proceedings of the 4th International Conference on Multimedia and Information and Com- munication Technologies in Education, Seville, Spain, November 2006. [4] U. Fayyad and K. Irani, ?Multi-interval discretization of continuous-valued at- tributes for classification learning,? in Proceedings of the 13th International Joint Conference on Artificial Intelligence, p. 10221027, Los Alamos, CA 1993. [5] J. W. Grzymala-Busse, ?Mining numerical data - a rough set approach,? in Pro- ceedings of the Rough Sets and Emerging Intelligent Systems Paradigms, pp. 12? 21, June 2007. [6] G. D. 
Appendix B

Flow Diagrams

In this appendix, the flow diagrams for the genetic algorithm, particle swarm optimisation, hill climbing and simulated annealing optimisation techniques are illustrated. Each flow diagram is given on a separate page.

B.1 Genetic Algorithm

Figure B.1: Flow diagram of genetic algorithm optimisation

B.2 Particle Swarm Optimisation

Figure B.2: Flow diagram of particle swarm optimisation

B.3 Hill Climbing

Figure B.3: Flow diagram of hill climbing optimisation

B.4 Simulated Annealing

Figure B.4: Flow diagram of simulated annealing optimisation

Appendix C

Published Work

In this appendix, the papers published based on the work performed in this dissertation are listed.

• B. Crossingham and T. Marwala, "Using Optimisation Techniques to Granulise Rough Set Partitions," Mathematical and Statistical Physics, in Proceedings of the International Symposium on Computational Models of Life Sciences, American Institute of Physics, vol. 1, 2007, pp. 248–258.

• B. Crossingham and T. Marwala, "Using Genetic Algorithms to Optimise Rough Set Partition Sizes for HIV Data Analysis," Studies in Computational Intelligence, in Advances in Intelligent and Distributed Computing, Springer, vol. 78, 2008, pp. 245–250.

These papers are attached to the back of this dissertation.

References

[1] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, ch. 3, p. 33. Kluwer Academic Publishers, 1991.

[2] J. Komorowski, L. Polkowski, and A. Skowron, "Rough sets: A tutorial," in S. K. Pal and A. Skowron, editors, Rough-Fuzzy Hybridization: A New Method for Decision Making, 1998.

[3] A. F. B. Jaafar, J. Jais, M. H. B. H. A. Hamid, Z. B. A. Rahman, and D. Benaouda, "Using rough set as a tool for knowledge discovery in DSS," in Proceedings of the 4th International Conference on Multimedia and Information and Communication Technologies in Education, Seville, Spain, November 2006.

[4] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027, Los Alamos, CA, 1993.

[5] J. W. Grzymala-Busse, "Mining numerical data – a rough set approach," in Proceedings of Rough Sets and Emerging Intelligent Systems Paradigms, pp. 12–21, June 2007.

[6] G. D. Garson, "A comparison of neural network and expert systems algorithms with common multivariate procedures for analysis of social science data," Social Science Computer Review, vol. 9, no. 3, pp. 399–433, 1991.

[7] L. Zeng, "Prediction and classification with neural network models," Sociological Methods and Research, vol. 27, pp. 499–524, 1999.

[8] E. A. Habtemariam and T. Marwala, "Artificial intelligence for conflict management," in Proceedings of the IEEE International Joint Conference on Neural Networks, Montreal, Canada, pp. 2583–2588, 2005.

[9] J. Komorowski, Z. Pawlak, L. Polkowski, and A. Skowron, A Rough Set Perspective on Data and Knowledge. The Handbook of Data Mining and Knowledge Discovery, Oxford University Press, 1999.

[10] Y. Yang and R. John, "Roughness bounds in set-oriented rough set operations," in 2006 IEEE International Conference on Fuzzy Systems, pp. 1461–1468, Vancouver, Canada, 2006.

[11] L. Zhai, L. Khoo, and S. Fok, "Feature extraction using rough set theory and genetic algorithms – an application for the simplification of product quality evaluation," Computers and Industrial Engineering, vol. 43, pp. 661–676, 2002.

[12] A. Ohrn, "Discernibility and rough sets in medicine: Tools and applications." PhD thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, 1999.

[13] A. Ohrn and T. Rowland, "Rough sets: A knowledge discovery technique for multifactorial medical outcomes," American Journal of Physical Medicine and Rehabilitation, vol. 79, pp. 100–108, 2000.

[14] A. Deja and P. Peszek, "Applying rough set theory to multi stage medical diagnosing," Fundamenta Informaticae, vol. 54, pp. 387–408, 2003.

[15] A. Mrozek, Rough sets in computer implementation of rule-based control of industrial processes. In R. Slowinski (Ed.), Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, pp. 19–32. Kluwer Academic Publishers, 1992.

[16] J. Peña, S. Létourneau, and A. Famili, "Application of rough sets algorithms to prediction of aircraft component failure," in Proceedings of the Third International Symposium on Intelligent Data Analysis, Amsterdam, 1999.

[17] F. E. H. Tay and L. Shen, "Fault diagnosis based on rough set theory," Engineering Applications of Artificial Intelligence, vol. 16, pp. 39–43, 2003.

[18] R. H. Golan and W. Ziarko, "A methodology for stock market analysis utilizing rough set theory," in Proceedings of Computational Intelligence for Financial Engineering, pp. 32–40, New York, USA, 1995.

[19] R. Deja and D. Slezak, "Rough set theory in conflict analysis," in New Frontiers in Artificial Intelligence: Joint JSAI 2001 Workshop Post-Proceedings, pp. 349–353, 2001.

[20] Z. Pawlak, "Some remarks on conflict analysis," European Journal of Operational Research, vol. 166, pp. 649–654, 2005.

[21] T. Tettey, F. V. Nelwamondo, and T. Marwala, "HIV data analysis via rule extraction using rough sets," in Proceedings of the 11th International Conference on Intelligent Engineering Systems, June 2007.

[22] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, pp. 1–25. Addison-Wesley, Reading, MA, 1989.

[23] R. Hassan, B. Cohanim, and O. de Weck, "A comparison of particle swarm optimisation and the genetic algorithm," in Proceedings of the 46th American Institute of Aeronautics and Astronautics Conference, Austin, Texas, April 2005.
[24] R. Poli, W. B. Langdon, and O. Holland, "Extending particle swarm optimisation via genetic programming," in Proceedings of the 8th European Conference on Genetic Programming, vol. 3447 of Lecture Notes in Computer Science, Lausanne, Switzerland, Springer, 2005.

[25] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 13 May 1983.

[26] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equation of state calculations by fast computing machines," The Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1092, 1953.

[27] X. Wang, J. Yang, N. Peng, and X. Teng, "Finding minimal rough set reducts with particle swarm optimization," in Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, Lecture Notes in Computer Science, vol. 3641, pp. 451–460, 2005.

[28] B. B. Leke, T. Marwala, and T. Tettey, "Autoencoder networks for HIV classification," Current Science, vol. 91, pp. 1467–1473, 2006.

[29] B. B. Leke, T. Marwala, T. Tim, and M. Lagazio, "Prediction of HIV status from demographic data using neural networks," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 2339–2344, Taiwan, 2006.

[30] B. B. Leke, T. Marwala, and T. Tettey, "Using inverse neural network for HIV adaptive control," International Journal of Computational Intelligence Research, vol. 3, pp. 11–15, 2007.

[31] J. Cannady, "Artificial neural networks for misuse detection," in Proceedings of the 1998 National Information Systems Security Conference, Arlington, VA, 1998.

[32] M. Lagazio and B. Russet, "A neural network analysis of militarized disputes, 1885–1992: Temporal stability and causal complexity," ch. in Temporal Stability and Causal Complexity, pp. 28–62. University of Michigan Press, 2003.

[33] N. Beck, G. King, and L. Zeng, "Improving quantitative studies of international conflict: A conjecture," American Political Science Review, vol. 94, pp. 21–33, 2000.

[34] P. Schrodt, "Prediction of interstate conflict outcomes using a neural network," Social Science Computer Review, vol. 9, no. 3, pp. 359–380, 1991.

[35] S. Theodoridis and K. Koutroumbas, Pattern Recognition. New York: Academic Press, first ed., 1999.

[36] T. Marwala and M. Lagazio, "Modelling and controlling interstate conflict," in Proceedings of the IEEE International Joint Conference on Neural Networks, pp. 1233–1238, Budapest, Hungary, 2004.

[37] Z. Pawlak and T. Munakata, "Rough control: Application of rough set theory to control," in Proceedings of the 4th European Congress on Intelligent Techniques and Soft Computing, vol. 1, pp. 209–218, September 1996.

[38] A. Pagnoni, S. Parisi, and S. Lombardo, "Analysis of patient flows via data mining," Medinfo, vol. 10, pp. 1379–1383, 2001.

[39] N. S. Hoa and N. G. Son, "Rough set approach to approximation of concepts from taxonomy," in Proceedings of the Knowledge Discovery and Ontologies Workshop, Pisa, Italy, September 2004.

[40] T. Fawcett, "ROC graphs: Notes and practical considerations for researchers." http://home.comcast.net/~tom.fawcett/public_html/papers/ROC101.pdf, 2004.

[41] A. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, pp. 1145–1159, 1997.

[42] S. Rosset, "Model selection via the AUC," in Proceedings of the 21st International Conference on Machine Learning, July 2004.
[43] C. X. Ling, J. Huang, and H. Zhang, "A statistically consistent and more discriminating measure than accuracy," in Proceedings of the 18th International Joint Conference on Artificial Intelligence, August 2003.

[44] D. Mossman and E. Somoza, "ROC curves, test accuracy, and the description of diagnostic tests," Neuropsychiatry and Clinical Neurosciences, vol. 3, pp. 330–333, 1991.

[45] Department of Health, South Africa, "National HIV and syphilis sero-prevalence survey of women attending public antenatal clinics in South Africa." http://www.info.gov.za/otherdocs/2002/hivsurvey01.pdf, 2001.

[46] "Correlates of War." www.correlatesofwar.org/. Last accessed: 22/2/2007.

[47] F. Witlox and H. Tindemans, "The application of rough sets analysis in activity-based modelling: Opportunities and constraints," Expert Systems with Applications, vol. 27, pp. 585–592, 2004.

[48] M. Inuiguchi and T. Miyajima, "Rough set based rule induction from two decision tables," European Journal of Operational Research, in press, corrected proof, 2006.

[49] C. Goh and R. Law, "Incorporating the rough sets theory into travel demand analysis," Tourism Management, vol. 24, pp. 511–517, 2003.

[50] B. B. Leke, Computational Intelligence for Modelling HIV. PhD thesis, University of the Witwatersrand, School of Electrical and Information Engineering, 2007.

[51] N. H. Son, "Discretization of real value attributes: A Boolean reasoning approach." PhD thesis, Department of Mathematics, Computer Science and Mechanics, Warsaw University, 1997.

[52] M. R. Chmielewski and J. W. Grzymala-Busse, "Global discretization of continuous attributes as preprocessing for machine learning," in Proceedings of the 3rd International Workshop on Rough Sets and Soft Computing, pp. 294–301, 1994.

[53] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in Proceedings of the 12th International Conference on Machine Learning, Los Altos, CA, 1995.

[54] S. Malve and R. Uzsoy, "A genetic algorithm for minimizing maximum lateness on parallel identical batch processing machines with dynamic job arrivals and incompatible job families," Computers and Operations Research, vol. 34, pp. 3016–3028, 2007.

[55] C. H. Lim, Y. S. Yoon, and J. H. Kim, "Genetic algorithm in mix proportioning of high-performance concrete," Cement and Concrete Research, vol. 34, pp. 409–420, 2004.

[56] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of the IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948, Perth, Australia, 1995.

[57] J. Cooper, "Dissonance and the return of the self-concept," Psychological Inquiry, vol. 3, no. 4, pp. 320–323, 1992.

[58] J. H. Wingfield, A Goal Systemic Analysis of Cognitive Dissonance Phenomena. PhD thesis, University of Maryland, College Park, 2005.

[59] S. Naka, T. Genji, T. Yura, and Y. Fukuyama, "Hybrid particle swarm optimization for distribution state estimation," IEEE Transactions on Power Systems, vol. 18, pp. 60–68, 2003.

[60] M. Mitchell, J. Holland, and S. Forrest, "When will a genetic algorithm outperform hill climbing?" in Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 51–58, San Mateo, CA: Morgan Kaufmann, 1994.

[61] Wikipedia, the free encyclopedia, "Hill climbing." http://en.wikipedia.org/wiki/Image:Local_maximum.png, 2007.

[62] K. Bryan, P. Cunningham, and N. Bolshkova, "Application of simulated annealing to the biclustering of gene expression data," IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 3, pp. 519–525, 2006.
[63] R. Salazar and R. Toral, "Simulated annealing using hybrid Monte Carlo," arXiv:cond-mat/9706051, 2006.

[64] A. Lasry, G. S. Zaric, and M. W. Carter, "Multi-level resource allocation for HIV prevention: A model for developing countries," European Journal of Operational Research, vol. 180, pp. 786–799, 2007.

[65] UNAIDS, www.unaids.org/en/HIV_data/2006GlobalReport/default.asp/. Last accessed: 20/3/2007, 2006.

[66] D. Reznik, "A critique of the AIDS metanarrative." Website: http://www.clas.ufl.edu/users/dreznik/Aarrative.pdf.

[67] E. Fee and N. Krieger, "Understanding AIDS: Historical interpretations and the limits of biomedical individualism," American Journal of Public Health, vol. 83, pp. 1477–1486, 1993.

[68] K. E. Poundstone, S. A. Strathdee, and D. D. Celentano, "The social epidemiology of human immunodeficiency virus/acquired immunodeficiency syndrome," Epidemiologic Reviews, vol. 26, pp. 22–35, 2004.

[69] C. W. Lee and J. A. Park, "Assessment of HIV/AIDS-related health performance using an artificial neural network," Information & Management, vol. 38, pp. 231–238, 2001.

[70] A. Lumini and L. Nanni, "Machine learning for HIV-1 protease cleavage site prediction," Pattern Recognition Letters, vol. 27, pp. 1537–1544, 2006.

[71] M. Lagazio and T. Marwala, "Assessing different Bayesian neural network models for militarized interstate dispute," Social Science Computer Review, vol. 24, pp. 119–131, 2006.

[72] J. R. Oneal and B. Russet, "Causes of peace: Democracy, interdependence, and international organizations, 1885–1992," in 2001 Annual Meeting of the American Political Science Association, San Francisco, CA, USA, 2001.

[73] C. Gochman and Z. Maoz, "Militarized interstate disputes 1816–1976," Journal of Conflict Resolution, vol. 28, no. 4, pp. 585–615, 1984.

[74] W. R. Thompson and R. Tucker, "A tale of two democratic peace critiques," The Journal of Conflict Resolution, vol. 41, pp. 428–454, 1997.

[75] E. Mansfield and J. Snyder, "Democratization and the danger of war," International Security, vol. 20, pp. 5–38, 1995.

[76] T. Tettey and T. Marwala, "Neuro-fuzzy modelling and fuzzy rule extraction applied to conflict management," Lecture Notes in Computer Science, in Neural Information Processing, Springer Berlin/Heidelberg, vol. 4234, pp. 1087–1094, 2006.

[77] M. G. Marshall, "The scientific study of international conflict processes: Postcards at the edge of the millennia," Centre for Systematic Peace. Website: http://www.cidcm.umd.edu/inscr/papers/icpmgm.pdf.

[78] L. P. Pojman, "Kant's perpetual peace and cosmopolitanism." Website: http://www.blackwell-synergy.com/doi/pdf/10.1111/j.1467-9833.2005.00258.x.

[79] G. Rose, "Neoclassical realism and theories of foreign policy," World Politics, vol. 51, pp. 144–172, 1998.

[80] M. Lagazio, T. Marwala, and T. Tettey, "An integrated human-computer system for controlling interstate disputes," arXiv:0704.3862, 2006.