SUPPORT VECTOR MACHINE PREDICTION OF HIV-1 DRUG RESISTANCE USING THE VIRAL NUCLEOTIDE PATTERNS

Seare Tesfamichael Araya

A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science.

Johannesburg, 2006

Declaration

I declare that this research report is my own, unaided work. It is being submitted for the degree of Master of Science in the University of the Witwatersrand, Johannesburg. It has not been submitted before for any degree or examination in any other University.

Seare Tesfamichael Araya
20th day of October 2006

Abstract

Drug resistance of the HI virus due to its fast replication and error-prone mutation is a key factor in the failure to combat the HIV epidemic. For this reason, performing pre-therapy drug resistance testing and administering appropriate drugs or combinations of drugs accordingly is very useful. There are two approaches to HIV drug resistance testing: phenotypic (clinical) and genotypic (based on the particular virus's DNA). Genotyping tests HIV drug resistance by detecting specific mutations known to confer drug resistance. It is cheaper and can be computerised. However, it requires being able to know or learn which mutations confer drug resistance. Previous research using pattern recognition techniques has been promising, but the performance needs to be improved. It is also important to have techniques that can quickly learn new rules when faced with new mutations or drugs. A relatively recent addition to these techniques is the Support Vector Machine (SVM). SVMs have proved very successful in many benchmark applications such as face recognition and text recognition, and have also performed well in many computational biology problems where the number of features targeted is large compared to the number of available samples.

This dissertation explores the use of SVMs in predicting the drug resistance of an HIV strain extracted from a patient based on the genetic sequence of those parts of the viral DNA encoding for the two enzymes, Reverse Transcriptase or Protease, which are critical for the replication of the HIV virus. In particular, it is the aim of this research to design the model without incorporating the biological knowledge at hand, to enable the resulting classifier to accommodate new drugs and mutations.

To evaluate the performance of SVMs we used the cross-validation technique to measure an unbiased estimate on 2045 data points. The accuracy of classification and the area under the receiver operating characteristic curve (AUC) were used as performance measures. Furthermore, to compare the performance of our SVM model we also developed other prediction models based on popular classification algorithms, namely neural networks, decision trees and logistic regression.

The results show that SVMs are a highly successful classifier and out-perform the other techniques, with performance ranging between 94.13%–96.33% accuracy and 81.26%–97.49% AUC. Decision trees were rated second and logistic regression performed the worst.

Acknowledgements

The work on this thesis has been an inspiring, often exciting, sometimes challenging, but always interesting experience. It has been made possible by many other people who have supported me. I am very grateful to my supervisor Prof. Scott Hazelhurst, who has supported me with his encouragement and many fruitful discussions.
I would also like to express my sincere thanks to Lynn Morris from the NICD for her invaluable experience and advice regarding data and additional information. I would also like to acknowledge the Carnegie Trust and WITS AIDS research institute for some financial help towards this research. I am also grateful for the support I got from the National Bioinformatics Network (NBN). Finally, I wish to thank my parents for their continuous support and encouragement. iv Contents Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 HIV and the drug resistance problem . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Drug resistance as a pattern recognition problem . . . . . . . . . . . . . . . . . 8 1.5 Overview of the research approach . . . . . . . . . . . . . . . . . . . . . . . . 9 1.6 Overview of the result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.7 Contribution of the research . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.8 Structure of the document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2 Background and Related Work 13 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Introductory Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 HIV Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.2 HIV phylogeny and genome . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.3 HIV replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.4 HIV drug resistance and assay of drug resistance testing . . . . . . . . 17 2.4 Pattern recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5 Statistical pattern recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.5.1 Bayesian decision theory . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5.2 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.6 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.6.2 The Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.6.3 Feedforward multilayer perceptrons . . . . . . . . . . . . . . . . . . . 35 2.7 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 v 2.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.7.2 Constructing of decision tree . . . . . . . . . . . . . . . . . . . . . . . 39 2.7.3 Splitting test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7.4 Stopping criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.7.5 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.8 Logistic regression . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 2.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3 Support Vector Machine (SVM) 47 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2 Risk minimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.1 Empirical Risk Minimisation (ERM) . . . . . . . . . . . . . . . . . . 49 3.2.2 Structural Risk Minimisation (SRM) . . . . . . . . . . . . . . . . . . . 52 3.3 Linear support vector machine . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.1 Linear separable case (Maximum margin classifier) . . . . . . . . . . . 53 3.3.2 Linearly non-separable case: Soft margin classifier . . . . . . . . . . . 56 3.4 Non-linear support vector machine: The Kernel trick . . . . . . . . . . . . . . 58 3.5 Multi-class classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.6 SVM implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.6.1 Chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.6.2 Shrinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.6.3 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.7 Comparison of SVMs to other statistical techniques . . . . . . . . . . . . . . . 65 3.8 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.8.1 SVMs benchmark applications . . . . . . . . . . . . . . . . . . . . . . 67 3.8.2 Other machine learning techniques in predicting drug resistance . . . . 70 3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4 The Experiment 72 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.2 Research question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.3 Classification algorithms description . . . . . . . . . . . . . . . . . . . . . . . 75 4.4 Model selection, generalisation and error estimation . . . . . . . . . . . . . . . 78 4.4.1 Hold-out and cross-validation method . . . . . . . . . . . . . . . . . . 78 4.5 Performance evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6 Data and input encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.6.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.6.2 Input encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.6.3 Alternative input encoding . . . . . . . . . . . . . . . . . . . . . . . . 86 4.7 Detailed methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.7.1 Support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . 88 vi 4.7.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.7.3 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.7.4 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5 Results and Discussion 91 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.2 Performance of SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2.1 Effect of polynomial kernel parameter on classification . . . . . . . . . 93 5.2.2 Effect of RBF Kernel Parameter on classification . . . . . . . . . . . . 
96 5.2.3 Remark on SVMs performance . . . . . . . . . . . . . . . . . . . . . 99 5.3 Performance of neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.4 Performance of decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.5 Performance of logistic regression . . . . . . . . . . . . . . . . . . . . . . . . 107 5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.7 Statistical significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.8 Limitations of the research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 6 Conclusion and Future work 113 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 References 116 A Statistical and Parametric Classifiers 126 A.1 Bayesian decision theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 B SVM: Mathematical formulation 130 B.1 Linear support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . . 130 B.1.1 Linear separable case - Maximum margin classifier . . . . . . . . . . . 130 B.1.2 Linearly non-separable case - Soft margin classifier . . . . . . . . . . . 135 B.2 Important definitions and theorems . . . . . . . . . . . . . . . . . . . . . . . . 138 C Genetic codes 140 C.1 RNA codon table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 C.2 The IUPAC nucleotides Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 141 vii List of Figures 1.1 A toy example illustrating the SVMs training method. . . . . . . . . . . . . . . 5 2.1 Schematic diagram of HIV-1 genome . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Viral replication cycle [Wikipedia 2004] . . . . . . . . . . . . . . . . . . . . . 18 2.3 Model for statistical pattern recognition [Jain et al. 2000, page 8] . . . . . . . . 25 2.4 Number of bins required for different features (a) one feature (b) two features (c) three features [Gutierrez-Osuna] . . . . . . . . . . . . . . . . . . . . . . . 29 2.5 A perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.6 A multilayered perceptron with one hidden layer. . . . . . . . . . . . . . . . . 36 2.7 Decision tree for conditions to play tennis . . . . . . . . . . . . . . . . . . . . 40 2.8 Non-leaf tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.1 Asymptotic behaviour of a minimum empirical and corresponding expected risk for consistent ERM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2 Three points in R2, shattered by oriented lines. The VC dimension of the set of oriented lines in R2 is therefore, three [Burges 1998, page 4]. . . . . . . . . . . 51 3.3 The optimal classifier needs to find some appropriate structure that minimises both the empirical risk and the confidence term [Vapnik 1995, page 98] . . . . 52 3.4 Optimal separating hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.5 Maximum margin hyperplane for linear separable case . . . . . . . . . . . . . 55 3.6 Maximum margin hyperplane for linear non-separable case . . . . . . . . . . . 57 3.7 Linearly non-separable training sample . . . . . . . . . . . . . . . . . . . . . . 59 3.8 Mapping from two dimensional input space into two dimensional feature space where data can be linearly separated . . . . . 
. . . 59
3.9 Decision surface given by (a) Polynomial kernel, and (b) RBF kernel. Support vectors are indicated by dark filled points [Osuna et al. 1997] . . . 62
4.1 An ROC curve for three classifiers A, B and C . . . 82
5.1 d vs Accuracy/AUC graph for SVM with polynomial kernel . . . 94
5.2 γ vs Accuracy/AUC graph for SVM with RBF kernel . . . 98
5.3 C vs Accuracy/AUC graph for SVM with RBF kernel . . . 98
5.4 Example decision tree for NFV . . . 105
5.5 Biologically interpreted decision tree for NFV given in Figure 5.4 . . . 106
5.6 Accuracy of the different classifiers . . . 109
5.7 Area under the ROC curve of the different classifiers . . . 110

List of Tables

2.1 Antiretroviral Agents . . . 19
2.2 Examples of Pattern Recognition applications [Jain et al. 2000, page 5] . . . 22
3.1 Mean error on a test set [Meyer et al. 2002] . . . 67
4.1 A contingency table or confusion matrix . . . 81
4.2 Sequence distribution between resistant (R) and susceptible (S) . . . 84
5.1 List of result tables . . . 92
5.2 SVMs classifier performance with polynomial kernel . . . 95
5.3 Selected results for SVM with RBF kernel showing the effect of γ on the classifier performance . . . 97
5.4 Selected results for SVM with RBF kernel showing the effect of C on the classifier performance . . . 97
5.5 SVMs classifier performance with Radial Basis Function (RBF) kernel . . . 102
5.6 Performance of neural networks model . . . 104
5.7 Performance of pruned C4.5 classification algorithm . . . 105
5.8 Performance of logistic regression . . . 108
5.9 Performance of the different classification algorithms . . . 108
C.1 Standard amino acids used in proteins, and the codons that code for each amino acid . . . 140
C.2 The IUPAC nucleotide codes . . . 141

Chapter 1
Introduction

1.1 Introduction

The Human Immunodeficiency Virus (HIV) is a rapidly evolving virus resulting in the AIDS epidemic. The rapid evolution of this virus is due to fast replication and its mutant behaviour. This mutant behaviour of the virus gives it the advantage of acquiring drug resistance, which is the main reason for the failure of many drug treatments. One solution that has proved to give better results is pre-therapy drug resistance testing. This solution allows for administration of drugs that will prolong viral suppression and help reconstruct the patient's immunity.

There are two approaches for testing HIV drug resistance, phenotypic and genotypic testing. Phenotypic testing is laboratory based and measures the relative drug susceptibility of an HIV strain directly [Beerenwinkel et al. 2003a].
Genotypic testing, on the other hand, considers the genetic information of the HIV strain extracted from the patient and the ability to interpret such data [Beerenwinkel et al. 2003a]. Due to its cost benefit and other advantages, genotyping is more widely used than phenotyping. Genotypic testing aims at identifying specific mutation points on the viral genetic makeup that are known to confer drug resistance, based on the biological knowledge available about the virus and the drug or drugs. In genotyping, the genetic makeup of the HIV strain extracted from a patient is examined for the existence of mutations (patterns) that are known to confer drug resistance to a drug or combination of drugs, and the strain is then classified as resistant or not resistant to the drug or combination of drugs accordingly. There are a number of ways of performing this task, and interpreting the results of genotyping is independent of the testing process; hence this approach has been an ideal application for computerised expert systems [Lathrop et al. 1999]. Based on these characteristics, one can consider genotyping as a classification problem, which is better solved using pattern recognition techniques than a traditional algorithmic approach.

A number of applications addressing the HIV drug resistance problem as a pattern recognition problem are currently available. Draghici and Potter [2003] used neural networks as a tool for predicting the drug resistance profile of an HIV strain based on the genetic sequence of the viral protease, while Wang and Larder [2003] did so based on the amino acids at selected positions of the protease where mutation is known to confer drug resistance. Beerenwinkel et al. [2002] used decision tree models to predict phenotypic resistance from genotypic information.

Pattern recognition techniques have proved to be a better option when there are no known rules relating the input and output of a problem, and the interaction between different reactants is unknown. One such pattern recognition technique is the Support Vector Machine (SVM), a statistical learning method proposed by Boser et al. [1992]. Although this technique is a relatively recent addition to pattern recognition techniques, it has shown superior performance as a tool for pattern recognition in numerous benchmark applications. Furthermore, its ability to give a better performance based on limited training samples makes it an ideal approach in computational biology problems, where generating data is costly or difficult. One such computational biology problem is HIV drug resistance.

This research investigates the performance of SVMs in predicting drug resistance of an HIV strain based on the genetic sequence of two enzymes, namely Reverse Transcriptase (RT) and Protease (PRO), which are critical during viral replication and where most mutations are exhibited [Shafer 2002b]. The performance of SVMs will be compared to three traditional methods of pattern recognition: neural networks, decision trees and logistic regression.

The rest of the chapter is organised as follows. The next section gives an introduction to HIV and the drug resistance problem. This section also shows that this is an open area of research and indicates the importance of the research. Section 1.3 gives an introduction to Support Vector Machines, which is the pattern recognition technique used in this research. This section will highlight why SVMs are an ideal tool for this research and present some applications of SVMs in computational biology.
Sections 1.4 and 1.5 give the formal definition of the problem and an overview of the approach taken to assess the performance of the different algorithms, respectively. Section 1.6 gives an overview of the results found in this work, followed by a section highlighting the contribution of the research. Finally, section 1.8 gives the structure of this document.

1.2 HIV and the drug resistance problem

The Human Immunodeficiency Virus (HIV) and its infectious agents have resulted in the worldwide AIDS epidemic. World Health Organisation (WHO) statistics show that as of December 2003, HIV had already infected between 34 and 46 million people around the world, with the majority of these infections in sub-Saharan African countries [UNAIDS 2004]. Although it has been more than two decades since HIV was first discovered, there is no effective drug to fully stop the virus from replicating and stop the epidemic. At present there are more than 18 antiretroviral drugs available, and the best they have achieved so far is prolonging viral suppression and helping with immunologic reconstruction using combinatorial therapy (a combination of up to 3 or 4 drugs) [Shafer 2002a]. A major reason for the failure to halt the epidemic is HIV drug resistance. HIV drug resistance refers to the loss in the ability of a drug or combination of drugs to suppress the replication of the virus.

The two main reasons for drug resistance are fast replication and mutation [Klatt 2003]. Mutation refers to the change in the genetic makeup (RNA) of the virus during replication. During replication, the viral RNA integrates with the human cell DNA in a number of steps with the aid of different enzymes such as RT and PRO [Klatt 2003]. HIV mutants result from the high viral replication rate and RT infidelity. RT infidelity is due to the viral RT lacking the proof-reading mechanism that preserves the genetic composition of a double-stranded DNA genome [Shafer 2002a]. Although all mutations change the structure of the virus, not all of them have the same effect. Some mutations cause the virus to become extremely infectious while others make it weak. For example, if a mutation occurred that made the newly replicated virus resistant to a particular drug or combination of drugs, treating the patient with this drug or combination of drugs will have a negative effect. On the contrary, other mutations slow down the replication process of the virus. Therefore, one of the main concerns during pharmaceutical therapy of an HIV patient is identifying particular mutations known to confer drug resistance and administering drugs accordingly. Identification and prediction of HIV drug resistance is the problem this research is addressing.

To address this problem a number of approaches have been proposed within the last decade. In this regard, one of the most effective approaches has been pre-therapy HIV drug resistance testing [Hoffmann and Kamps 2003]. There are two approaches to testing an HIV strain's drug resistance: namely phenotypic testing and genotypic testing [Klatt 2003]. Phenotypic testing directly measures the replication of the virus in the presence of a drug or combination of drugs. Genotypic testing, on the other hand, relies on the genetic sequence of the HIV virus and theoretical knowledge of specific mutations and related drug resistance.
The primary objective of genotypic testing is to detect specific mutations known to confer resistance to antiretroviral drugs in well-defined regions of the RT and PRO, the two critical enzymes in viral replication [Klatt 2003]. Genotypic testing is the most widely used because of its simplicity, speed and cost benefit and because, unlike phenotypic testing, it does not require a specialised laboratory. Furthermore, interpreting the result of genotypic testing is independent of the testing process. The latter advantage has opened the door for successful technology from other disciplines to collaborate in solving the problem. One such application is the computerised expert system [Lathrop et al. 1999].

The first research carried out to address this problem of predicting the effect of a drug or combination of drugs based on the genetic sequence of the HIV strain extracted from the patient using computerised expert systems was conducted by Lathrop et al. [1999]. Lathrop et al. [1999] introduced the AI system CTSHIV, which uses the scientific knowledge about HIV drug resistance to customise treatment to an individual patient. Following this breakthrough a number of techniques such as neural networks [Draghici and Potter 2003; Wang and Larder 2003], decision trees [Beerenwinkel et al. 2002], and different regression methods have been applied to predict drug resistance of an HIV strain directly or indirectly. Although most of these applications have been successful, the performance obtained is still far from optimal. Furthermore, the design of these systems was highly dependent on the biological knowledge available about the particular drug or combination of drugs and the corresponding mutations that are known to confer resistance. This means that for these systems to work properly a continuous update (with possible redesigning) of the systems is required with the discovery of new mutation points and new drugs. Therefore, the search for new techniques that not only outperform the existing ones but also address the dynamic behaviour of the virus (mutation and drug resistance) is required.

This research will investigate such a technique by taking no biological knowledge about the virus into consideration while designing the model. Furthermore, this research uses positions 1–250 of the RT and positions 1–99 of the PRO genetic sequence. Hence, the resulting model bases its decision not only on mutations already reported to cause drug resistance but also on patterns not yet known to humans but reflected in the genetic sequence of the strain.

1.3 Support vector machines

Support Vector Machines (SVMs) are a statistical pattern recognition technique proposed by Boser et al. [1992] to perform a number of classification and regression tasks in a wide variety of application domains [Cristianini and Shawe-Taylor 2000]. Since their introduction, SVMs have been successfully applied to many pattern recognition and regression applications such as text categorisation [Joachims 1998], speech recognition [Ganapathiraju 2002], face detection [Osuna et al. 1997], object recognition [Schölkopf 1997], and handwritten text recognition [Cortes and Vapnik 1995] with remarkable performance. The achievements of SVMs are credited to the two basic principles behind them: namely the principle of Structural Risk Minimisation (SRM) and the ability to project data into a high dimensional feature space where complex data can be classified with ease [Vapnik 1995].
In this section we give a high-level introduction on how the simple maximum margin classifier, which is the simplest form of SVM, is extended into classifiers that are complex enough to model real-world problems, yet simple enough to be analysed mathematically. But before discussing the classification method, an introductory summary of risk minimisation and the principle of Structural Risk Minimisation will be presented. A more detailed explanation of risk minimisation and the mathematical formulation of SVMs will be presented in Chapter 3.

Like all pattern recognition techniques, SVMs are aimed at obtaining the best classification based on a limited number of training samples. There are a number of optimisation criteria to estimate the performance of a classifier. One such criterion, the one most commonly used by traditional pattern recognition techniques, is empirical risk minimisation [Ganapathiraju 2002]. The empirical risk of a classifier is defined as the sum of the costs of misclassification over the training samples. Based on this definition, it is intuitive to say that there can be a number of configurations that minimise the empirical risk or even achieve zero empirical risk. But the best configuration is the one that trades off between the empirical risk and the expected error (the error on an unseen validation set). To decide on the best configuration that gives the least expected error, Vapnik and Chervonenkis proposed statistical learning theory. This theory suggests that it is necessary to minimise the capacity of the set of functions (for example the degree of a polynomial function) relative to the number of training samples, along with the empirical risk [Vapnik 1995]. This is called the principle of Structural Risk Minimisation (SRM). There are a number of possible approaches to minimise both the capacity and the empirical risk. Some pattern recognition techniques accomplish this by defining the capacity of the set of functions and minimising the empirical risk, while others minimise the capacity of the set of functions for a predefined empirical risk. However, SVMs accomplish this by simultaneously minimising both the empirical risk and the capacity of the hypothesis space [Osuna et al. 1997].

Figure 1.1: A toy example illustrating the SVM training method. (a) Input space; (b) feature space, with Class 1 (y = 1) and Class 2 (y = −1), margin = 2/|w|, and the hyperplanes given by w·x + b = 0 and y(w·x + b) = 1. For a given non-linear classification problem shown in the input space, the SVM maps the training data nonlinearly into a possibly higher dimensional feature space and constructs the optimal hyperplane with maximum margin there. The square boxes and circles indicate positive and negative examples to be classified. The solid separating hyperplane in the feature space is the optimal hyperplane and the margin is defined as the distance between this hyperplane and either of the hyperplanes shown by broken lines.
The training samples that lie on the hyperplanes indicated by broken lines are called support vectors.

The simplest form of SVM works well when the training sample is linearly separable by an optimal hyperplane, leaving all members of the same class on one side of the hyperplane and the rest on the other (see Figure 1.1.b). The optimal hyperplane is defined as the one that maximises the minimum distance between either of the two classes and itself. This distance is called the margin of the classifier. To find the optimal hyperplane, a constrained optimisation problem must be solved (i.e. maximising the margin subject to the constraints shown in Figure 1.1.b). The solution of this optimisation problem is the orientation of the optimal hyperplane. Using the classical Lagrangian approach to solve this optimisation problem, the orientation of the hyperplane will be given in terms of the training samples and their corresponding non-negative Lagrangian multipliers [Cristianini and Shawe-Taylor 2000]. The classical Lagrangian approach also has the advantage of emphasising the importance of some training examples over the rest; this is reflected in the solution, where not all of the training samples have non-zero Lagrangian multipliers but rather only a subset of them. Those training samples with non-zero Lagrangian multipliers are the ones that determine the optimal hyperplane and are called Support Vectors (SVs) [Vapnik 1995]. Note that for a classifier to perform well, the number of SVs should be relatively small, otherwise over-fitting might occur [Vapnik 1995]. Given the orientation of the hyperplane and the threshold (which can also be determined mathematically in terms of the SVs, their corresponding labels and Lagrangian multipliers), classification of an unseen sample will be according to the sign of the function:

f(x) = sign( Σ_i y_i α_i (x · x_i) + b )    (1.1)

where x is the sample to be classified, α_i, the Lagrangian multiplier for the support vector x_i, is found as a solution of the optimisation problem, the orientation of the optimal hyperplane w is given by Σ_i α_i y_i x_i, and b is the threshold.

So far we have seen the case where the training sample is perfectly separable by a linear hyperplane. But how do SVMs handle the case where the data is not linearly separable or not linear at all?

The case where the data is not linearly separable is handled by associating a misclassification cost whenever necessary. Hence the task is not only to find an optimal hyperplane that maximises the margin but also to minimise the misclassification cost. The result of this optimisation problem also gives us the same decision function (equation 1.1), but the user will be required to freely select a parameter that trades off between the width of the margin and the misclassification error when performing the training (defining the model).

Most real life classification problems are not linearly separable and need a complex classifier. SVMs handle such complex classification problems by mapping the data from the input space (Figure 1.1.a) into a possibly high dimensional feature space (Figure 1.1.b), where the data can be separated by a simple maximum margin classifier. That means one needs to choose a mapping Φ : R^d → R^m, usually into a Hilbert space (H), where m ≥ d (usually the mapping is into a higher dimensional space), such that the data which was not linearly separable becomes linearly separable (in either of the two senses specified above).
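As an illustration of such a mapping, the sketch below (in Python; not part of the original text) maps two-dimensional points into three dimensions with the quadratic feature map Φ(x1, x2) = (x1², x2², √2·x1·x2). Data that can only be separated by a circle in the input space becomes separable by a plane in this feature space; the toy data and the separating plane used here are arbitrary placeholders chosen for the demonstration.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: class +1 inside a circle of radius 1, class -1 outside it.
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)

def phi(x):
    """Quadratic feature map: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

Z = np.apply_along_axis(phi, 1, X)

# In the feature space the circle x1^2 + x2^2 = 1 becomes the plane w.z + b = 0
# with w = (1, 1, 0) and b = -1, so a single linear rule separates the classes.
w, b = np.array([1.0, 1.0, 0.0]), -1.0
pred = np.where(Z @ w + b < 0, 1, -1)
print("separable in feature space:", np.all(pred == y))

No single hyperplane in the original two-dimensional input space could express this circular decision boundary, which is exactly the situation sketched in Figure 1.1.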
One question associated with this is how to choose the mapping function Φ(x). As can be seen from the above equation, the decision function is based on the dot product between the sample to be classified and the support vectors. Hence the decision function in the feature space will have the form f(x) = sign( Σ_i y_i α_i (Φ(x) · Φ(x_i)) + b ). In addition to this, Mercer's theorem has shown that any dot product in a feature space can be computed by a function in the input space without the need for explicit mapping [Ganapathiraju 2002]. These functions are called kernels (see section 3.4). There are a number of points to note here. Firstly, the use of kernel functions enables SVMs to implicitly perform classification in the feature space without the need to perform the mapping first, and hence prevents the curse of dimensionality from occurring. Secondly, as is the case for the linearly separable and non-separable datasets, only support vectors are involved in determining the shape of the hyperplane.

Selection of a kernel map is problem specific and it is up to the user to choose one. Kernel selection can be based on prior information about the problem (for example: mutation points), successful application of the kernel on similar problems or, in the absence of both of the above, kernels can be selected by empirically testing their performance on the given problem. There are a number of kernel functions known to give better performance depending on the characteristics of the problem or the training patterns. However, the polynomial kernel of low degree and the Radial Basis Function (RBF) kernel are known to perform well on most datasets. The polynomial kernel has the form:

k(x, y) = (x · y + 1)^d,    (1.2)

where d is the degree of the polynomial. A large value of d corresponds to a more complicated decision boundary, and for linear classifiers (a linear hyperplane) the value of d is set to 1. The radial basis function (RBF) kernel has the form:

k(x, y) = exp(−γ ||x − y||²),    (1.3)

where γ is the width of the kernel. The smaller γ, the smoother the decision boundaries. This kernel is more favourable when one class is totally encircled by the other, as shown in Figure 1.1. The bigger γ gets, the tighter the closed boundaries (circles) become (a small numerical sketch of these two kernels and of the decision function in equation 1.1 is given below).

Summing up, SVMs are becoming popular because:

• Unlike most traditional pattern recognition techniques that use empirical risk minimisation, SVMs use structural risk minimisation, which minimises the error on yet-to-be-seen data and hence gives good generalisation performance.

• Unlike other complex pattern recognition techniques, such as neural networks, which are very hard to analyse theoretically, SVMs are easy to analyse theoretically without losing the ability to solve complex pattern recognition problems [Hearst et al. 1998].

• Choosing different kernel functions gives different architectures which suit the problem at hand. Polynomial, RBF and sigmoid kernels simulate polynomial classifiers, RBF classifiers and three layer (including feature and output layers) neural networks respectively [Hearst et al. 1998].

For these reasons and their incredible performance in many benchmark applications such as text, speech and image recognition, SVMs have recently been applied in many biological problems and have outperformed other pattern recognition techniques. Knowledge-based microarray analysis by Brown et al. [2000] and classification and validation of cancer tissues using microarray expression data by Furey et al. [2000] are two examples worth mentioning.
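As noted above, the following sketch (again illustrative Python, not from the original work) evaluates the polynomial and RBF kernels of equations 1.2 and 1.3 and uses them in the kernelised decision function of equation 1.1; the support vectors, labels, multipliers and threshold are made-up placeholders rather than the result of an actual optimisation.

import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel, equation 1.2: k(x, y) = (x.y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel, equation 1.3: k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def decision(x, svs, labels, alphas, b, kernel):
    """Kernelised form of equation 1.1: sign(sum_i y_i * alpha_i * k(x_i, x) + b)."""
    s = sum(a * y * kernel(sv, x) for sv, y, a in zip(svs, labels, alphas))
    return 1 if s + b >= 0 else -1

# Placeholder support vectors, labels, Lagrangian multipliers and threshold.
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
labels = [1, -1]
alphas = [0.7, 0.7]
b = 0.0

print(decision(np.array([0.8, 1.2]), svs, labels, alphas, b, rbf_kernel))    # expected +1
print(decision(np.array([-1.1, -0.9]), svs, labels, alphas, b, poly_kernel)) # expected -1

Swapping the kernel function is the only change needed to move between the two architectures, which is the flexibility referred to in the bullet list above.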
Common problems among these applications and many other pattern recognition tasks in computational biology are the shortage of training samples compared to the large number of features per pattern, and the presence of noise in the patterns. These two problems have been the main reason for the indifferent performance of other pattern recognition techniques. However, previous applications of SVMs in computational biology or bioinformatics have shown that these problems are not as critical for SVMs as they are for other pattern recognition techniques [Cristianini and Shawe-Taylor 2000].

1.4 Drug resistance as a pattern recognition problem

In the previous two sections, the HIV drug resistance problem, the methods of HIV drug resistance testing, and the advantages of the state-of-the-art pattern recognition technique, SVMs, were presented. This section will discuss how the HIV drug resistance problem can be considered as a pattern recognition problem and then exactly define the problem for this research.

In section 1.2, it was pointed out that certain mutations cause resistance and others do not. In genotype resistance testing, specific positions in the genetic sequence of the HIV strain are checked for the existence of certain amino acids and, depending on the result, the relative susceptibility of the HIV strain to a drug is measured. For example, consider the mutation represented by the standard notation M184V, known to confer resistance to Epivir (one of the 18+ antiretroviral drugs available). The biological interpretation of the above mutation is as follows: if a mutation occurs and the amino acid methionine (M), which is found at position 184, is changed to valine (V), the newly replicated virus becomes resistant to Epivir [Hoffmann and Kamps 2003].

If we see this problem from a computational (pattern recognition) angle and represent each sequence as a vector in a Euclidean space, different sequences will be plotted to different points in this space depending on their genetic make-up (the amino acid sequence and its corresponding numeric value). Considering the above example, the virus before mutation (the non-resistant virus) will have the numerical equivalent of M at its 184th vector component and is plotted as a point. However, the mutant (resistant) virus will have the numerical equivalent of V at its 184th vector component and will definitely be plotted at a different point in our Euclidean space (an illustrative sketch of such an encoding is given at the end of this section). Once the Euclidean space is divided into disjoint regions, each representing a pattern class (Susceptible, Intermediate, Resistant), based on the training samples available, a pattern (HIV strain) can be assigned its appropriate drug profile depending upon the region into which it is plotted.

The drug resistance problem this research is trying to address can be redefined, based on the above argument, as: given a set of phenotypic data where each sequenced HIV strain is labelled as Susceptible or Resistant to a particular drug, can SVMs learn from these examples and predict the drug resistance profile of an unseen HIV strain? This is the question for this research. The drug profile of an HIV strain can be susceptible, intermediate or resistant, but very few of the patterns in the training data are labelled intermediate. Hence we consider binary classification in this research. The research question is answered by investigating the performance of SVMs and comparing it to the performance of neural networks, decision trees and logistic regression models.
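To make the vector representation described above concrete, the sketch below (illustrative Python; not the encoding actually used in this work, and the per-amino-acid numeric codes are arbitrary placeholders) maps an amino-acid sequence to a numeric vector, so that an M-to-V change at a given position moves the sequence to a different point in the space. The encoding used in the experiments, based on non-overlapping nucleotide triplets, is described in Chapter 4.

# Arbitrary numeric code per amino acid; placeholder values for illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence):
    """Map an amino-acid sequence to a numeric vector, one component per position."""
    return [AA_CODE[aa] for aa in sequence]

# Two toy 5-residue fragments differing only at one position (cf. M184V): the
# mutant gets a different value in that component, so it is plotted at a
# different point in the Euclidean space described above.
wild_type = "AKMLT"
mutant = "AKVLT"
print(encode(wild_type))  # [1, 9, 11, 10, 17]
print(encode(mutant))     # [1, 9, 18, 10, 17]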
1.5 Overview of the research approach

To answer the research question a comparative approach is used. The performance of SVMs in predicting the drug resistance behaviour of HIV based on the viral nucleotide sequence was empirically tested. The SVMs' performance was then compared to that of neural networks, decision trees and logistic regression. These classification techniques were selected because of their popularity and reputation as classification tools, and because they have already been applied to predicting HIV drug resistance. The data used for the empirical testing was the same as that used by Ravela et al. [2003], which comprises sequences of isolates from 2045 individuals. Each sequence constitutes 297 nucleotides of the viral protease and 720+ (some are longer than 720) nucleotides of the viral reverse transcriptase.

One of the factors affecting the performance of a classification algorithm is the input encoding technique. When encoding the input data some biological knowledge about HIV drug resistance was incorporated. As mentioned previously, HIV drug resistance is related to mutation at specific positions of the viral genome (for example M184V). However, not all mutations have the same effect. Some of these mutations are major and might cause resistance alone, while others require the existence of the major mutations to cause resistance. To capture this property of the data, each sequence was converted into a vector form that highlighted both the global and local position of these mutations. Each sequence was grouped into non-overlapping triplets and each triplet was given an equivalent numeric value. With this scheme each sequence was converted to a vector in some higher dimensional space.

After encoding the data, the first set of experiments was done using SVMs. The experiment with SVMs started by further pre-processing the input data to make it suitable for the domain-restricted kernels and to simplify the generalisation and error estimation. Then a grid search using cross-validation was conducted to select an appropriate kernel and tune its parameters. The two kernels used in this experiment were the polynomial (Equation 1.2) and radial basis function (Equation 1.3). There are two parameters of interest for each kernel selected. For the polynomial kernel the parameters of interest are the degree of the polynomial d and a regularisation constant which indicates the trade-off between the training error and the separating margin. The parameters of interest for the RBF kernel are the width of the kernel γ and the regularisation constant. Kernel selection is done starting from a simple dot product kernel, moving to higher degree polynomials and then to different RBF kernels.

The second set of experiments was done using neural networks. The network architecture used in this research is the popular feedforward multilayer perceptron with back-propagation learning. There are different modifications to the standard back-propagation algorithm, each with its advantage in terms of memory usage, convergence speed and training set size. Standard and resilient backpropagation and Levenberg-Marquardt algorithms were tested. Levenberg-Marquardt was found to be the most efficient in terms of performance, convergence and memory usage for our experiment. Hence, a feedforward network with fully connected neurons, a log-sigmoid activation function and Levenberg-Marquardt learning was used. The number of input neurons equalled the dimensionality of the data and there was only one output (a minimal sketch of this architecture is given below).
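The following sketch (illustrative Python using scikit-learn, not the implementation used in this research) sets up a network of the kind just described: one hidden layer of logistic (log-sigmoid) units and a single output. The hidden layer size and the training data are placeholders, and a gradient-based solver stands in for Levenberg-Marquardt, which scikit-learn does not provide.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: 100 samples with 20 features each, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One hidden layer of log-sigmoid units; MLPClassifier uses a single logistic
# output unit for binary problems. 'lbfgs' is used here instead of
# Levenberg-Marquardt, which is not available in scikit-learn.
net = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    solver="lbfgs", max_iter=1000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))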
Different numbers of neurons in the hidden layer were tested. The experiment with neural networks was repeated with reduced dimensionality. Dimensionality reduction was conducted using principal component analysis (PCA), where up to 90% of the information is retained.

Experiments with decision trees and regression were also carried out. The decision tree algorithm used in this experiment was C4.5 by Quinlan [1993]. For C4.5 this experiment used the default values for all parameters except for the confidence factor, which determines how heavy the pruning is. The algorithm used for logistic regression is penalised logistic regression with a ridge estimator. The default values of all parameters were also used for the logistic regression models.

1.6 Overview of the result

The performance of the different models was tested using 75% of the data for training and the remaining 25% for testing. The data was randomly split into training and testing sets. The performance of a model was evaluated in terms of accuracy of classification and the area under the receiver operating characteristic curve (AUC). The AUC, proposed by Bradley [1997], which reflects the trade-off between a classifier's sensitivity and specificity, is a better single-number performance indicator.

The results show that SVMs and decision trees performed the best, followed by neural networks and logistic regression. The accuracy for SVMs ranged from 94% for a reverse transcriptase inhibitor to 96% for a protease inhibitor. The AUC for the SVM models ranged from 81% to 95%, with the lowest AUC associated with the lowest accuracy and the highest AUC with the highest accuracy. The accuracy of decision trees was between 91% and 97% and the AUC was between 85% and 94%. In addition to their performance, the decision tree models showed some other interesting results. For each tree, we compared the split criteria with previously reported mutation points. The result showed that most of these split criteria were positions associated with previously known mutations. The result from the decision tree model was also used as a confirmation of the input encoding technique used, and a reference point for the performance comparison. As specified in the previous section, the training data used for the model evaluation was already labelled using a rule-based algorithm. Hence, the internal structure of the patterns is expected to suit decision trees more than SVMs.

The accuracy for the neural network models ranged from 64% to 91%. The AUC for these models was between 69% and 95%. The performance of these models is below expectation. Dimensionality reduction using PCA was performed to enhance the performance; however, the performance of the models was reduced as a result of the dimensionality reduction. The loss in the performance of the neural network models is attributed to underlying properties of the data. Recall that drug resistance is caused by mutation but not all mutations have the same effect. Some cause resistance alone; however, others depend on the occurrence of additional mutations to cause resistance. This primary-secondary relationship between mutations causing drug resistance results in some correlated attributes, which might be the cause of the loss of valuable information during dimensionality reduction using PCA.

The performance of the logistic regression models was on average the worst compared to the other models. The accuracy for these models was between 75% and 92% and the AUC for these models was between 46% and 85%.
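For reference, the two performance measures quoted throughout this section can be computed from a model's predictions as in the sketch below (illustrative Python using scikit-learn; the labels and scores are made-up placeholders, not results from this research).

from sklearn.metrics import accuracy_score, roc_auc_score

# Placeholder ground truth (1 = resistant, 0 = susceptible), hard predictions,
# and continuous decision scores for a handful of test isolates.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.1, 0.7, 0.6]

print("accuracy:", accuracy_score(y_true, y_pred))  # fraction of correct labels
print("AUC:", roc_auc_score(y_true, y_score))       # area under the ROC curve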
The performance of the SVM and the decision tree models is almost equal, but the input encoding technique and the characteristics of the training patterns might favour the decision tree models. Hence, the performance recorded by SVMs with this naive approach is a promising start. In general, despite the limitations of this research, due to some easily addressable and other more complicated issues, the results answer the research question positively.

1.7 Contribution of the research

This research has a number of contributions. Due to the fast replication of the virus, new mutations occur at a rate on the order of billions per day inside an untreated patient. This high number of single point mutations, together with the possibility of cross-mutations, demands a dynamic system that can cope with new mutations causing drug resistance and with the newly discovered drugs that address them. One such dynamic system is support vector machines. Furthermore, to our knowledge, no-one has applied SVMs as a prediction tool in the domain of HIV drug resistance based on the genetic sequence of the virus. Hence, the application of such a dynamic system to this problem is the first contribution. Secondly, despite the fact that we used genetically classified data, the design of the SVM models does not depend on prior biological knowledge about the virus and the respective drugs. Hence, if trained with phenotypic data, such a model will not require redesigning when new mutation points or drugs are discovered, unlike most of the existing systems. In addition, the fact that the design of the system is not dependent on the biological data (mutations) makes the intended system capable of handling any kind of behavioural variation of the virus that might lead to drug resistance, provided that this behaviour is reflected in the genetic sequence of the virus. Thirdly, Salzberg [1999] has pointed out the importance of comparing classifiers on real data. HIV is one of the hard computational biology problems that is data-rich. Therefore, the results from this work further highlight the outstanding performance of SVMs as a tool for pattern recognition. Fourthly, we have seen how dimensionality reduction techniques affected the performance of some of the models negatively. This result, together with the biological fact of the primary-secondary relationship between different mutations, suggests that unsupervised dimensionality reduction is not a good idea for HIV drug resistance prediction tools. Finally, the recorded result for SVMs with the two
It then presents the basic formulation of SVMs starting from the simplest form of SVMs (linear SVMs) to the most complex non- linear SVMs. An overview of different implementations of SVMs, a comparison of SVMs to other pattern recognition techniques and previous benchmark applications of SVMs are also considered in this chapter. The discussion of SVMs is based on binary classification problems. However, an overview on multi-class extension of SVMs is covered. This chapter also reviews application of other pattern recognition techniques used in predicting HIV drug resistance. Chapter 4 starts by motivating this research and emphasises the superior quality of SVMs for the job. The research question for the reported research is then formulated. The chapter further describes the details of the experiment: the input data and the input encoding technique used, the performance evaluation technique and why such a technique was selected and the experimental steps for the different classification algorithms. Chapter 5 gives the results for the different models and highlights the findings of the differ- ent set of experiments. This chapter also compares the performace of the different models to each other and to previously published works. This chapter will also present limitations of this research. Chapter 6 will present the conclusion for this document. This chapter will also give direction for future work. The mathematical details of Chapter 3 and some part of Chapter 2 are presented in the Appendices. 12 Chapter 2 Background and Related Work 2.1 Introduction In the previous chapter an overview of the research was presented. As presented in the previous chapter, the main objective of the reported work is investigating the performance of SVMs in predicting the drug resistance behaviour of an HIV strain extracted from a patient based on the genetic sequence of the viral RT or PRO mutants. To understand the terms and methodologies used in this research enough background on the biology of the virus, the general principles in pattern recognition tasks and the pros and cons of different pattern recognition technique should be established. Hence this chapter is intended to give a brief background on HIV biology, different pattern recognition techniques with more emphasis on statistical pattern recognition and some diagonal factors on solving pattern recognition problems. Furthermore, this chapter is used as an introduction for Chapter 3 where SVMs are discussed in great detail. This chapter is organised as follows. Section 2.3 covers the necessary biological background on the HIV and drug resistance problem. This section starts with general introductory biology and then describes the phylogenetic information, the genetic structure and viral replication cy- cle. This section also discusses how drug resistance occurs and the different approaches of predicting drug resistance. Section 2.4 gives a historical overview of pattern recognition tech- niques and describes the different techniques available. Section 2.5 gives a detailed background on statistical pattern recognition, Bayesian decision theory, dimensionality problem and tech- niques for dimensionality reduction. Section 2.6, 2.7 and 2.8 give detailed background on neural networks, decision trees and logistic regression respectively. Finally section 2.9 concludes the chapter. 13 2.2 Introductory Biology The basic biological process at a cellular level is identical across organisms. 
The hereditary genetic information of a living organism is stored in the deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). The units of information that are assembled together to form DNA or RNA are four nucleic acid units called nucleotides (bases). Adenine (A), guanine (G), cytosine (C) and thymine (T) make up DNA, and the nucleotide thymine is replaced by uracil (U) in the case of RNA.

DNA and RNA, which are polymers of these nucleotides, have other common characteristics besides being a chain of nucleotides. One of these properties is complementarity. This is the exclusive bonding rule between adenine and thymine/uracil (A-T/U) and guanine and cytosine (G-C) [Hunter 1993]. While RNA is single stranded, DNA is a double strand of bases in a double helix form. Each of these strands is often millions of bases long and its direction is determined by its head (5′ end) and tail (3′ end).¹ The sequence of nucleotides in one of the helices is unrestricted. However, due to the exclusive bonding between nucleotides, the sequence on the complementary strand is completely deterministic. A strand of DNA or RNA which is an exact complement of another strand is called the reverse complement of the second [Hunter 1993]. (For example, TGGCAT is the reverse complement of ATGCCA.)

¹ DNA molecules are directional, due to the asymmetric structure of the sugar which constitutes the skeleton of the molecule. Each sugar is connected to the strand at its fifth carbon to the preceding sugar in the chain and at its third carbon to the following one.

The DNA contains the genes of an organism, which are used as a template for manufacturing RNA, which in turn is used to manufacture protein. The primary role of the nucleic acids is to carry the encoding of the primary structure of proteins. Proteins determine the shape and structure of a cell. Each non-overlapping triplet of nucleotides in the DNA strand is called a codon. Each codon corresponds to a particular amino acid. Grouping the four nucleotides accordingly results in 4³ = 64 possible triplets, encoding 20 amino acids and three special duty codons called stop codons [Hunter 1993] (see Table C.1 in Appendix C for a complete list of amino acids and the codons encoding them). However, not all triplets in the genome encode for protein. In higher organisms approximately 97–98% of the genome is non-coding sequence called introns, and the remaining 2–3%, called exons, codes for proteins [Watson et al. 1997].

The mechanism by which proteins are produced from DNA is a sequence of steps known as the central dogma of molecular biology [Watson et al. 1997]. Generally, they can be summarised as two steps: transcription and translation. The process starts with unwinding the helix into separate strands. Using a single strand of DNA as a template, the RNA is manufactured. However, not all of the DNA is used to manufacture the RNA; as stated in the above paragraph, only exons are genic. This process is called transcription, because RNA is transcribed from the DNA. The RNA is then translated into protein, which determines the structure and function of an organism. Translating the RNA into protein starts from the codon which encodes the amino acid methionine (AUG). From then on each codon is translated into the corresponding amino acid and is added to the growing chain of amino acids. This process is then stopped when one of the stop codons is encountered.
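As an illustration of this translation step, the sketch below (Python; not part of the original text) walks an mRNA sequence codon by codon, starting from the AUG start codon and stopping at a stop codon. Only a small subset of the RNA codon table of Appendix C is included here, and the example sequence is a made-up fragment.

# A small subset of the RNA codon table (see Table C.1); '*' marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "GAA": "E", "GUC": "V", "GCC": "A",
    "ACU": "T", "UUU": "F", "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(mrna):
    """Translate an mRNA string starting at the first AUG and ending at a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # stop codon reached
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("UUUAUGGAAGUCGCCUAA"))  # -> "MEVA"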
There are, however, some complications to the translation process. Depending on where we start reading, there are three possible reading frames in each direction (i.e. six reading frames in total) [Hunter 1993]. For example, the sequence ACTGAAGTCGCCA... can be read as ACT-GAA-GTC-GCC-A..., as CTG-AAG-TCG-CCA-... or as TGA-AGT-CGC-CA..., each giving a different reading frame. Usually only one of these frames produces a functional protein, although this is not always the case. Therefore, identifying the correct reading frame is a primary task in many computational biology problems.

2.3 HIV Biology

2.3.1 Introduction

Viruses are a group of submicroscopic infectious agents, unable to replicate outside a host cell. These submicroscopic organisms essentially consist of genetic material, in the form of DNA or RNA, surrounded by a protein coat. During replication, the virus integrates its DNA or RNA into the host's DNA and takes over the cell's biological machinery to replicate [NIAID 2004]. If the virus carries its genetic material as RNA, this must first be transcribed into DNA before it can integrate with the host's DNA. In most organisms RNA is transcribed from DNA, and hence those viruses that carry their genetic material as RNA are called retroviruses, to indicate the reverse transcription of RNA to DNA.

This section gives an overview of the biological background of HIV, which is a retrovirus. The section is organised as follows. Section 2.3.2 gives an overview of the phylogenetic information of the virus and the structure and function of some of the major genes in the viral genome. It is followed by Section 2.3.3, which describes the viral replication process at a very high level. Finally, Section 2.3.4 starts with the basic definition of drug resistance, gives the reasons why HIV drug resistance occurs and describes the two general approaches to HIV drug resistance testing.

2.3.2 HIV phylogeny and genome

HIV is a retrovirus belonging to the genus Lentivirus, which shares many important characteristics with other retroviruses but also has some special features [Coffin 1999]. (A genus is "a taxonomic category ranking below a family and above a species and generally consisting of a group of species exhibiting similar characteristics. In taxonomic nomenclature the genus name is used, either alone or followed by a Latin adjective or epithet, to form the name of a species." - The American Heritage Dictionary.) Besides HIV, the genus Lentivirus includes the Simian Immunodeficiency Virus (SIV), which is phylogenetically closely related to HIV, and the more distant relatives Visna Virus and Feline Immunodeficiency Virus (FIV) [Hoffmann and Kamps 2003]. HIV is divided into two types, HIV-1 and HIV-2, based on the molecular weight of their proteins and their accessory genes [Hoffmann and Kamps 2003]. Despite the strong resemblance between HIV-1 and HIV-2, HIV-1 is the more common type and is the main infectious agent behind the worldwide AIDS epidemic. Even though HIV-2 infection is less common and less virulent, it also results in AIDS [Klatt 2003]. Furthermore, although HIV-1 and HIV-2 replicate in the same fashion, the immune deficiency may be less severe in HIV-2 infected individuals [Hoffmann and Kamps 2003]. Some 99% of HIV patients are infected with HIV-1, with a growing number infected by HIV-2 [Draghici et al. 2000]. HIV-1 is classified into two principal genetic groups, designated M (main) and O (outlier).
Genetic group M is highly prevalent and is further classified into 10 established subtypes, designated A through J. HIV-1 subtype B predominates in Europe and the Americas, whereas HIV-1 subtype C predominates in sub-Saharan Africa [Klatt 2003]. An additional group, N (non-M, non-O), has been discovered more recently as a result of interaction between the two principal groups [Klatt 2003; Health Canada 2001]. In the remainder of this document HIV refers to HIV-1 unless otherwise specified.

Physically, an HIV viral particle has a diameter of approximately 90 to 100 nm and is surrounded by an envelope, which encapsulates the genome that encodes the major functional and structural components of the virus [Coffin 1999]. The HIV genome contains two single-stranded RNA molecules, each about 9 kilobases in length. These RNA molecules contain 9 different genes encoding 15 different proteins [Greene and Peterlin 2003]. The HIV genes are classified into three major classes, namely structural genes (gag, pol and env), trans-activation genes (tat and rev) and accessory genes (nef, vif, vpr and vpu) [Hoffmann and Kamps 2003; Klatt 2003]. Like all retroviruses, the major genes gag, pol and env appear in the genome in the conserved order 5'-gag-pol-env-3' [Coffin 1999; Klatt 2003]. Figure 2.1 shows a schematic diagram of the HIV-1 genome, based on the diagram from http://hiv-web.lanl.gov/immunology/pdf/2000/intro/GenomeMaps.pdf.

Figure 2.1: Schematic diagram of the HIV-1 genome

The structural genes (gag, pol and env) are common to all retroviruses. The major components encoded by gag (group-antigen) are the core structural proteins: capsid (CA), matrix (MA) and nucleocapsid (NC). The env (envelope) gene encodes the envelope glycoproteins: the outer envelope glycoprotein and the transmembrane glycoprotein. The env proteins of HIV have a number of distinct structural and functional features, not seen in other retroviruses, that enable the virus to replicate efficiently under the threat of the host's immune response. The pol (polymerase) gene encodes the enzymes reverse transcriptase (RT), protease (PRO) and integrase (IN). These enzymes are the key players in the viral replication process and hence are the major targets of most antiretroviral drugs [Coffin 1999; Klatt 2003]. The other classes of genes are also critical in RNA transcription and viral release during the replication process [Klatt 2003; Hoffmann and Kamps 2003].

2.3.3 HIV replication

The replication of the virus within the host body is presented schematically in Figure 2.2. As stated above, the HIV virus is unable to replicate outside a living host cell; viral replication therefore starts with the virus entering the host cell [Coffin 1999]. This step is mediated by the env protein, which interacts with a specific cell-surface receptor. Once the viral core enters the cell, the genomic RNA is reverse transcribed by the HIV reverse transcriptase into a double-stranded DNA molecule. The newly made HIV DNA then moves to the cell's nucleus and integrates into the host cell's DNA, aided by the HIV integrase. At this stage the viral DNA is called a "provirus". The provirus uses the host cell's protein-making machinery to produce new copies of RNA called messenger RNA (mRNA). Once these mRNAs are processed in the cell nucleus, they are transported to the cytoplasm aided by proteins encoded by the rev gene.
In the cytoplasm the HIV mRNAs are used to make long chains of viral proteins and enzymes with the help of the host's protein-making machinery and ribosomes. The newly made HIV core proteins, enzymes and RNA gather inside the cell's membrane, while the viral envelope proteins aggregate within the adjacent membrane. Just before the new virus exits the cell, the long chains of proteins and enzymes that make up the immature viral core are cleaved into smaller pieces by the viral enzyme protease. Finally, the virus is assembled and detaches itself from the host cell, resulting in a new infectious viral particle [Coffin 1999]. An animation of the HIV life-cycle is available at http://www.hopkins-aids.edu/hiv_lifecycle/hivcycle_txt.html.

2.3.4 HIV drug resistance and assay of drug resistance testing

Once the HIV virus enters the human body it begins to replicate at a very high rate, in the order of billions of new virions every day. During replication, HIV produces perfect copies as well as copies containing errors (mutated virus). Mutation is very common in HIV because of the high rate of viral replication and the infidelity of RT, which is largely due to the lack of 3' to 5' proofreading ability in the viral RT [Whitney et al. 2002]. As a result of these changes in HIV's genetic structure (mutation), the ability of a drug or combination of drugs to block HIV replication inside the body is reduced. This phenomenon is called drug resistance.

As stated in the previous section, besides RT another enzyme which plays an important role in viral replication, and in the resulting drug-resistant mutants, is the HIV PRO. During viral replication, the cell produces a long strand of genetic material that must be cut up and put together correctly to form new copies of the virus. Cutting up this long strand is carried out by the enzyme PRO. Furthermore, PRO is responsible for processing the gag and pol genes, which are initially expressed as the precursor polyproteins Gag and Gag-Pro-Pol, into their mature forms immediately after budding [Shafer 2002a; Coffin 1999].

Figure 2.2: Viral replication cycle [Wikipedia 2004]

Due to their important role in viral replication, these two proteins are the targets of most existing HIV drugs; furthermore, most resistance mutations are exhibited in these proteins [Lathrop et al. 1999; Shafer 2002b].

To date, more than eighteen antiretroviral drugs are available. These drugs are generally classified into four categories based on the enzyme targeted, the stage of viral replication interfered with and their chemical composition [Hoffmann and Kamps 2003]. The first class of drugs are the Nucleoside/Nucleotide RT Inhibitors (NRTIs), which interfere with viral replication by blocking the further elongation of the proviral DNA and interrupting the chain; there are seven NRTIs. The second class are the Non-Nucleoside RT Inhibitors (NNRTIs), which interfere with the same stage of viral replication as the NRTIs; there are three NNRTIs. The third major category are the PRO Inhibitors (PIs), which function by interfering with the viral assembly stage of replication; there are eight PIs. The fourth category are the fusion inhibitors (FIs). Fusion inhibitors are members of a broader class, the entry inhibitors, which stop the virus from entering the cell by preventing the final phase of attachment [Beerenwinkel et al. 2003b]. There are also more experimental drugs in each class [Hoffmann and Kamps 2003].
A complete list of antiretroviral drugs is given in Table 2.1. For up-to-date information, visit on-line HIV/AIDS information services such as http://www.hopkins-aids.edu.

Table 2.1: Antiretroviral agents [Beerenwinkel et al. 2003b, page i19]

  Drug          Abbreviation   Target   Class
  zidovudine    ZDV            RT       NRTI
  didanosine    ddI            RT       NRTI
  zalcitabine   ddC            RT       NRTI
  stavudine     d4T            RT       NRTI
  lamivudine    3TC            RT       NRTI
  abacavir      ABC            RT       NRTI
  tenofovir     TDF            RT       NRTI
  nevirapine    NVP            RT       NNRTI
  delavirdine   DLV            RT       NNRTI
  efavirenz     EFV            RT       NNRTI
  saquinavir    SQV            PRO      PI
  indinavir     IDV            PRO      PI
  ritonavir     RTV            PRO      PI
  nelfinavir    NFV            PRO      PI
  amprenavir    APV            PRO      PI
  lopinavir     LPV            PRO      PI
  atazanavir    ATV            PRO      PI
  T-20          -              gp41     FI
  T-1249        -              gp41     FI

The extreme genetic and antigenic variability of HIV, which drives the development of drug-resistant viral strains, is the most common reason for the failure of HIV drug therapy [Draghici et al. 2000]. Although many drugs are available, none of them has been able to stop viral replication completely. Despite the failure to produce a drug that stops viral replication, much has been achieved in combating the pandemic [Shafer 2002b]. One achievement worth mentioning is the prolonged viral suppression, and the accompanying immunological reconstruction of the patient, obtained using combination therapy. These successes have opened a new era of HIV therapy called highly active antiretroviral therapy (HAART) [Hoffmann and Kamps 2003]. HAART is based on pre-therapy drug resistance testing of the HIV strain extracted from the patient and giving the patient a combination of drugs containing a protease inhibitor. Due to intolerable side-effects and toxicity, a combination cannot exceed four drugs (a combination of three drugs, at least one of them a PI, is the most common) [Lathrop and Pazzani 1999]. Hence a very careful measurement of drug resistance is required. There are two ways of testing HIV drug resistance, namely phenotypic testing and genotypic testing [Bean 2000].

Phenotyping

Phenotypic testing directly measures the drug resistance behaviour of an HIV strain. The HIV strain from the patient is placed in a test tube and the growth of the virus is closely studied under treatment with the drug or combination of drugs, varying the concentration and strength. The measured viral replication is then compared to that of the wild type [Hoffmann and Kamps 2003]. Phenotypic testing has some disadvantages. Firstly, the process is time consuming and very expensive. Secondly, it requires a specialised laboratory. Furthermore, drug resistance of an HIV strain cannot be detected when the resistant variant makes up less than about 20% of the viral population [Bean 2000].

Genotyping

Genotypic testing is based on analysing, in the genetic structure of the HIV strain extracted from the patient, certain mutations associated with drug resistance [Hoffmann and Kamps 2003; Bean 2000]. To conduct genotypic testing, the contiguous PRO and RT genes extracted from the plasma are reverse transcribed to cDNA. The cDNA is then amplified using the polymerase chain reaction (PCR) to generate sufficient DNA [Shafer 2002b]. This genetic sequence is then examined carefully for mutations. Depending on the number and type of mutations exhibited, the test reveals whether the patient has developed resistance to a certain drug or combination of drugs. For example, if the mutation M184V/I is detected, the HIV strain will be resistant to the NRTI lamivudine (trade name Epivir) [Hoffmann and Kamps 2003].
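To make the mutation notation used above concrete, the sketch below lists the substitutions in a query amino acid sequence relative to a wild-type reference, in the standard form "wild-type amino acid, codon position, mutant amino acid" (the notation is restated in the following paragraph). The sequences, the starting codon position and the single resistance rule are invented toy values for illustration only.

```python
def list_mutations(reference, query, start_position=1):
    """Report substitutions as '<wild-type AA><codon position><mutant AA>', e.g. 'M184V'."""
    mutations = []
    for offset, (wild_type, mutant) in enumerate(zip(reference, query)):
        if wild_type != mutant:
            mutations.append(f"{wild_type}{start_position + offset}{mutant}")
    return mutations

# Invented six-residue RT fragments, assumed here to start at codon 181 (toy data).
reference_fragment = "YQYMDD"
patient_fragment   = "YQYVDD"

mutations = list_mutations(reference_fragment, patient_fragment, start_position=181)
print(mutations)   # ['M184V']

# The rule quoted in the text: M184V or M184I confers resistance to lamivudine (3TC).
if any(m in ("M184V", "M184I") for m in mutations):
    print("predicted resistant to lamivudine (3TC)")
```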
For an explanation of the M184V/I example, also revisit Section 1.4. (A resistance mutation is described by a number giving the position of the codon where the mutation occurred, flanked by two letters: the letter preceding the number is the amino acid at that position in the wild type, and the letter after the number is the amino acid produced by the mutation [Hoffmann and Kamps 2003].)

Genotyping is more widely used than phenotyping. Some of the reasons are that, unlike phenotyping, genotyping does not require a specialised laboratory, it is cost effective, it is fast, and interpreting the result does not depend on the testing process. Furthermore, the fact that it is based entirely on the genetic sequence of the HIV strain has opened the door for leading technologies from other fields of study to contribute. One such technology is pattern recognition and artificial intelligence.

There are a number of commercially available tools for genotypic resistance testing, but they have a number of limitations. Firstly, certain mutations cause resistance by themselves while others cause resistance only in the presence of others, and different genotypic tools treat these mutation points differently, which leads to discordance between the tools [Ravela et al. 2003]. Secondly, interpretation of the result of genotypic testing is difficult [Bean 2000]. Thirdly, most of these tools use a knowledge-based approach and are highly dependent on the known biological facts about the virus and the drugs. Hence, these tools cannot perform the required task without continuous updates, or even redesign, as new drugs are introduced and new mutation points are found to confer resistance to existing or newly discovered drugs. For these reasons and more, genotypic testing remains open to further research and to contributions from other disciplines.

2.4 Pattern recognition

Pattern recognition is an interdisciplinary field of study, developed mainly from the 1960s onwards, covering developments from a wide variety of disciplines ranging from psychology and physiology to computer science and artificial intelligence [Webb 1999]. Since then scientists have been investigating ways to enable machines to recognise patterns the way humans do when basing decisions on them in daily life. According to Pavlidis [1977], pattern recognition is understanding the building blocks of a given object. Tou and Gonazalez [1974] define pattern recognition as the classification of input data into a category called a pattern class, which is determined by common attributes shared by its exhibited patterns. A pattern is the description of any member of a pattern class, and can be as basic as a set of observations or measurements [Tou and Gonazalez 1974; Schalkoff 1991]. Examples of a pattern could be a DNA/protein sequence, a text document, handwritten characters or a signal waveform.

Pattern recognition is a computationally expensive task; it was therefore a specialised subject in the past, and applications were limited to certain domains. However, recent advances in computer hardware (processor speed and storage) have made pattern recognition widely applicable in a range of application areas (see Table 2.2) [Jain et al. 2000]. Although the applications presented in Table 2.2 are diverse, a number of common properties can be listed across them [Jain et al. 2000].
One common property is that the data for each problem are represented by a very large set of parameters (features). Another is that the features that represent the data usually cannot be defined by the domain expert, but must be extracted from the exhibited patterns.

Independent of the application domain, a pattern recognition task consists of a number of iterative steps. The first step is data collection and preprocessing. This includes recording measurements and/or observations, extracting and/or selecting the most representative features, and a high-level examination of the data to get some idea of its underlying distribution, probability, structure, and so on. The second step is designing the pattern recognition system and performing the training. The pattern recognition system is usually designed based on the properties of the sample data and so depends on the previous step. Sometimes it is impossible to design the system up front, and the intended system is instead the result of the training process; careful representation and preprocessing are therefore required for the system to perform well. The final step is testing the performance of the designed system and interpreting the result. Each of the steps listed above can be further broken down into a number of sub-steps.

Table 2.2: Examples of pattern recognition applications [Jain et al. 2000, page 5]

  Problem domain                 Application                                    Input pattern                     Pattern classes
  Bioinformatics                 Sequence analysis                              DNA/protein sequence              Known types of genes/patterns
  Data mining                    Searching for meaningful patterns              Points in multidimensional space  Compact and well separated clusters
  Document classification        Internet search                                Text document                     Semantic categories (e.g. business, sport)
  Document image analysis        Reading machine for the blind                  Document image                    Alphanumeric characters, words
  Industrial automation          Printed circuit board inspection               Intensity or range image          Defective or non-defective product
  Multimedia database retrieval  Internet search                                Video clip                        Video genres (e.g. action, dialogue)
  Biometric recognition          Person identification                          Face, iris, fingerprint           Authorised users for access control
  Remote sensing                 Forecasting crop yield                         Multispectral image               Land use categories, growth pattern of crops
  Speech recognition             Telephone directory enquiry without operator   Speech waveform                   Spoken words

Based on the training approach, a pattern recognition task can be categorised as supervised or unsupervised. In supervised pattern recognition, the system is given a set of examples (input-output pairs) as a training set. This training set is usually in the form of attribute vectors, a subset of R^n. Given the attribute vectors, we can synthesise the value of a mapping function for the samples in the training set and choose a set of hypotheses for the problem. In contrast, in unsupervised pattern recognition there is no output value associated with the inputs, and the recognition task is to gain some understanding of the process that generated the input, so as to group the training examples into their respective classes according to their properties and to generalise to unseen inputs. In the remaining part of this document, pattern recognition refers to supervised pattern recognition unless otherwise specified.

Another way of classifying pattern recognition approaches is based on properties of the data and the assumptions made.
The preprocessing step not only provides a concise representation of the data, but also crucial information that helps in selecting the right pattern recognition approach for the problem at hand. This step sometimes reveals the statistical basis of the patterns or their underlying structure, which is critical to pattern recognition. In other cases none of this information is available, and all the information needed for the desired system must be discovered during the training process. Thus, based on the information available, one approach might be better suited than another to a particular problem. In summary, based on the underlying principle, the data representation and the assumptions made, pattern recognition can be crudely categorised into four main practical approaches [Jain et al. 2000; Schalkoff 1991]:

- Template matching,
- Syntactical/structural pattern recognition,
- Statistical pattern recognition,
- Neural pattern recognition (neural networks).

Each of these approaches is described below. There are other approaches that do not fit neatly into any of the above categories [Schalkoff 1991], such as combinations of statistical and syntactical pattern recognition, or reason-driven pattern recognition, where artificial intelligence is used to infer rules from the training data.

Template matching is conceptually the simplest and the earliest form of pattern recognition. In some applications the pattern under investigation is almost identical to some prototype of the pattern class, called a template. A template might be a certain object in the problem domain or a string of patterns. The pattern recognition problem is then reduced to matching the unknown pattern against these templates and finding the best match. In other situations the templates may be contained within the unknown pattern, so the recognition task may also involve determining the relative positions of these templates. The performance of this approach depends on the quality of the similarity measure used. A number of similarity measures have been proposed to determine the best match between the known pattern (template) and the unknown pattern, including edit distance, the sum-of-squares difference and the maximum-likelihood formulation proposed by Olson [2000]. More on similarity measures can be found in Schalkoff [1991]. Algorithms based on template matching are easy to implement but suffer from low recognition performance due to distortion, viewpoint change or large interclass variation among the patterns [Jain et al. 2000]. This can often be addressed by incorporating preprocessing techniques, such as rotating or scaling the stimulus to make it upright and of a standard size.

Syntactical/structural pattern recognition: In many real-world problems, the structural and relational information contained in the patterns makes it difficult, or sometimes impossible, to identify quantifiable features that can be represented in vector form [Schalkoff 1991]. Examples of pattern recognition problems with these characteristics are picture recognition, time-series analysis and text recognition. One common characteristic of such patterns is that some kind of inheritance or identifiable organisation is usually exhibited [Olszewski 2001].
Syntactical pattern recognition is used not only for the classification of patterns but also for their description [Fu 1974]. In some pattern recognition problems, the structural information of the pattern is so important that classifying the pattern is not enough: it may be necessary to describe the properties of the pattern that make it eligible for a certain classification. As another example, some patterns, such as fingerprints, are highly individual, and the number of possible descriptions is so large that assigning each pattern to its own class is impractical. The task of such a pattern recognition system therefore involves describing the pattern rather than classifying it.

The complexity, inheritance and identifiable organisation of such patterns have motivated many syntactic methods to adopt a hierarchical perspective, in which complex patterns are viewed as compositions of simpler sub-patterns and are decomposed hierarchically [Fu 1974; Schalkoff 1991]. The simplest sub-patterns are usually referred to as primitives, and the relationships among them represent the structural features of the pattern [Olszewski 2001]. Primitives can then be described using formal grammars or relational descriptions (usually graphs) to facilitate recognition, classification or description of these patterns [Schalkoff 1991].

A statistical approach is based on a statistical analysis of the data to be classified. The data, represented as d-dimensional feature vectors, are assigned to a particular class by computing class-conditional densities. The d-dimensional vector space is then divided into regions corresponding to the different classes based on some criterion. Statistical pattern recognition is discussed in more detail in Section 2.5.

Neural pattern recognition uses a computational system inspired by the learning characteristics and the structure of biological neural networks. The key element of this approach is the novel structure of the information processing system: it is composed of a large number of highly interconnected processing elements (neurons) working together to solve specific problems. Neural pattern recognition is discussed in more detail in Section 2.6.

2.5 Statistical pattern recognition

In statistical pattern recognition, a pattern is represented by a set of features (say d features) obtained through observations or measurements and conveniently viewed as a d-dimensional feature vector X = (x1, x2, ..., xd). The basic assumption in this approach is that there exists a multivariate class-probability distribution which can be inferred from the exhibited patterns [Jain et al. 2000; Kanal 2000]. With this assumption, the problem of statistical pattern recognition can be formulated as follows: given a set of measurements representing patterns as d-dimensional feature vectors, the purpose of the pattern recognition system is to assign these unknown patterns to one of the (say c) known pattern classes [Webb 1999; Jain et al. 2000]. Classification is done by subdividing the space spanned by the feature vectors into c disjoint (non-overlapping) regions, where each region represents a particular class [Schalkoff 1991]. The vector representation also ensures that a pattern maps to a single point in the feature space, and can therefore be assigned to exactly one class, according to the region in which it falls.
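As an illustration of representing a pattern as a feature vector, a nucleotide sequence (one of the example patterns listed in Section 2.4) could, for instance, be encoded position by position with indicator (one-hot) features. This particular encoding is shown purely for illustration and is only one possible choice; the encoding actually used in this research is described in Chapter 4.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot_encode(sequence):
    """Encode a nucleotide sequence as a flat indicator (one-hot) feature vector.

    Each position contributes four features, one per nucleotide, so a sequence of
    length L becomes a point in a (4 * L)-dimensional feature space.
    """
    vector = np.zeros((len(sequence), len(NUCLEOTIDES)))
    for position, base in enumerate(sequence):
        vector[position, NUCLEOTIDES.index(base)] = 1.0
    return vector.ravel()

x = one_hot_encode("ACGTA")
print(x.shape)   # (20,) -> a d-dimensional feature vector with d = 4 * 5
```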
Like all other pattern recognition approaches, statistical pattern recognition is carried out in a number of sequential steps, which can be divided into two major modes: training (learning) and classification (testing) [Jain et al. 2000]. Each of these is further broken down into a number of modules. The two modes and the modules in each are presented schematically in Figure 2.3.

Figure 2.3: Model for statistical pattern recognition: a preprocessing, feature extraction/selection and learning pipeline in training mode, and a preprocessing, feature measurement and classification pipeline in test mode [Jain et al. 2000, page 8]

The first module is the preprocessing module, which is responsible for a compact representation of each pattern. This includes removing noise, segmenting, normalising the pattern and other related operations. The preprocessing module is the same for both the training and the classification modes of the recognition system. Although this module gives a compact representation of the data, it does not necessarily give an effective one. Despite the intuition that considering a large number of parameters (features) to characterise a pattern should give better recognition, in statistical pattern recognition a large number of features does not usually result in minimum classification error. To resolve this contradiction, it is necessary to find representative features by taking into consideration the number of samples available and the correlation between the features. The module responsible for this sub-task, located in the training mode, is the feature extraction and/or selection module. This module is applied recursively, and the feedback path shown in Figure 2.3 allows the designer to optimise the process.

The primary aim of building a pattern recognition system is to build a set of decision boundaries that can classify unseen patterns based on a finite set of sample patterns. Jain et al. [2000] have pointed out that the performance of a classifier depends not only on the complexity of the classifier but also on the relationship between the available sample size and the number of features targeted. For a classification task with an arbitrarily large training sample that is representative of the underlying distribution of the pattern, increasing the number of features used to characterise the pattern will not have a negative effect on the performance of the classifier [Jain et al. 2000]. However, in real-world pattern recognition problems the number of available samples is limited. Moreover, Duda et al. [2000] have commented that increasing the number of features for a finite sample size does not result in smaller classification errors but may rather reduce performance, a point further supported by Trunk [1979] with examples. This property puts a constraint on the number of features one can consider for a finite number of training samples. Although there is no fixed rule for solving this problem, a number of recommendations and guidelines have been given over the past couple of decades. For classifiers based on partitioning the feature space into regions, the number of training samples should be an exponential function of the number of features considered.
However, Jain and Chandrasekaran [1982] have stated that it is generally acceptable for the ratio between the number of training samples and the number of features per pattern class to be greater than ten; this ratio should be higher for more complex classifiers. A more comprehensive discussion of this topic is given in Raudys and Jain [1991] and Jain and Chandrasekaran [1982]. There are a number of approaches to obtaining an acceptable ratio between the sample size and the number of features targeted, such as dimension reduction. The two techniques used in dimension reduction are feature selection and feature extraction. Feature selection is the process of selecting a subset of the features that effectively represents the pattern. Feature extraction, on the other hand, is the process of finding an arithmetic combination of the d-dimensional feature vector that maps it to a lower-dimensional vector. Dimension reduction is discussed in more detail in Section 2.5.2.

The last step, and the final module in statistical pattern recognition, is the classification module. This module classifies patterns based on the class-conditional probability functions obtained during the training process. As stated above, classification is achieved by subdividing the feature space into a number of disjoint regions, whose decision boundaries are defined using the Bayes decision rule [Jain et al. 2000]. Bayesian decision theory is a fundamental statistical approach to the statistical pattern recognition problem; it formulates the classification problem in terms of probability density functions. In the next section an introduction to Bayesian decision theory is presented; a detailed presentation of this topic can be found in Appendix A.1. The presentation of these sections is based on the work by Duda et al. [2000] and Webb [1999], and the notation follows Duda et al. [2000].

2.5.1 Bayesian decision theory

Bayesian decision theory is the basic concept behind statistical pattern recognition techniques. Given a random pattern x and a k-class pattern recognition task, Bayes' theorem answers the question: what is the probability that the observation x belongs to the pattern class ωj? Bayes' rule describes this probability as

P(ωj | x) = p(x | ωj) P(ωj) / p(x)    (2.1)

where P(ωj | x) is the posterior probability, P(ωj) is the prior probability, p(x | ωj) is the state-conditional probability density and p(x) is the probability density function for x.

Note that the posterior probability in Equation 2.1 is calculated from other probability functions that are easier to estimate from the given training samples. Once this probability is estimated, the pattern is assigned to the pattern class with the highest posterior probability, i.e.

x -> ωi  if  P(ωi | x) = max_j P(ωj | x)    (2.2)

(For a complete presentation of Bayesian decision theory see Appendix A.1.)

The Bayes decision rule assumes that all the probability functions are known, which is not true in most real-life pattern recognition problems. However, in many of these classification problems one can assume the form of the class-conditional density (e.g. multivariate Gaussian). Based on this assumption, statistical classifiers are divided into parametric and non-parametric classifiers. Examples of parametric classifiers are linear and non-linear discriminant functions; examples of non-parametric classifiers are the k-nearest-neighbour classifier and the multilayer perceptron.
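A minimal numerical sketch of the decision rule in Equations 2.1 and 2.2, assuming a two-class problem with one-dimensional Gaussian class-conditional densities; the class names, priors, means and standard deviations are all invented for illustration.

```python
import math

# Invented toy problem: two classes with known priors and 1-D Gaussian densities.
classes = {
    "resistant":   {"prior": 0.4, "mean": 2.0,  "std": 1.0},
    "susceptible": {"prior": 0.6, "mean": -1.0, "std": 1.5},
}

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def classify(x):
    """Assign x to the class with the largest posterior P(w_j | x) (Equation 2.2).

    The evidence p(x) in Equation 2.1 is the same for every class, so comparing
    p(x | w_j) * P(w_j) is sufficient.
    """
    scores = {
        name: gaussian_pdf(x, c["mean"], c["std"]) * c["prior"]
        for name, c in classes.items()
    }
    return max(scores, key=scores.get)

print(classify(1.2))   # 'resistant' for these toy numbers
```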
So far we have seen that, given the probability densities or some assumption about their form, we can classify unseen patterns using Equation 2.2. However, a number of factors can affect the performance of the classifier, especially when probability density estimation is involved. One such factor is the proportion between the number of features and the number of available training samples. If the number of features is too large relative to the number of training samples, the classifier is likely to perform badly. There are a number of techniques to avoid this problem; one of the best known is dimensionality reduction, discussed below. Note that, even though this problem is most pronounced in pattern recognition techniques that rely on probability estimation, it also occurs in other pattern recognition techniques.

2.5.2 Dimensionality reduction

The dimension of the data is the number of features that represent each observation, or simply the size of the vector representing the observation. High-dimensional data presents statistical pattern classification systems with many challenges and new perspectives on the problem. Based on naive intuition one could argue that classification error can be decreased by increasing the number of features. However, this is not true, for a number of reasons. Firstly, high-dimensional data places a heavy demand on computational resources. Secondly, with high-dimensional data it is very hard to understand the underlying structure of the data, which degrades the classifier's performance. Beyond these reasons, the importance of maintaining the ratio between the dimensionality of the data and the number of samples in order to increase classifier performance was discussed in Section 2.5. Dimension reduction is therefore important for measurement and computational cost as well as for classification accuracy.

To see further the relationship between the number of features and the sample size, and the importance of dimensionality reduction, consider the 3-class pattern recognition problem of classifying three different geometric shapes (circle, square and triangle), adopted from Gutierrez-Osuna and shown in Figure 2.4. A simple approach is to divide the feature space into a number of uniform bins and compute the ratio of examples of each class in each bin, as shown in Figure 2.4(a). When a new object arrives, it is placed in the feature space and assigned the predominant class (geometric shape) of the bin in which it falls. In Figure 2.4(a) only one feature is considered; there is too much overlap between the classes, so the performance will be low. To address this we need extra features to represent the pattern, so consider two and then three features, as shown in Figures 2.4(b) and 2.4(c) respectively. When the number of features increases from one to two, the number of bins increases from 3 to 9 (3^2). Here we have to decide whether to maintain the density of examples per bin or to keep the number of examples constant: in the first case we need to increase the number of examples from 9 to 27, and in the second the 2D scatter plot becomes sparse. When the dimension is increased to three, the problem becomes worse: the number of bins grows to 27, so maintaining the density requires 81 examples, while keeping the number of examples constant leaves the 3D scatter plot almost empty.
Although this is a trivial example, it illustrates the dimensionality problem that exists in almost all pattern recognition problems: the number of training samples (geometric shapes) has to grow exponentially with the number of features considered. Increasing the number of features indefinitely for a fixed sample size does not improve the performance; rather, it eventually starts to degrade it. This phenomenon is termed the curse of dimensionality [Theodoridis and Koutroumbas 1999]. In practice it means that, for a given sample size, there is a maximum number of features above which the performance of the classifier starts to degrade rather than improve. In the last decade much research has been carried out on this problem, producing different approaches; one of them is dimensionality reduction.

Figure 2.4: Number of bins required for different numbers of features: (a) one feature, (b) two features, (c) three features [Gutierrez-Osuna]

Dimensionality reduction can be described as determining a subset or combination of features that represents the pattern without losing the class-discriminatory information. The methods for reducing dimensionality are grouped into feature selection and feature extraction approaches. Feature selection chooses a more informative subset of all the features, while feature extraction creates a new, lower-dimensional set of features by combining the existing ones. Both approaches select or create features with high discriminative power, giving better between-class separation and better within-class similarity. This is done by choosing an optimal criterion function J that defines this similarity and/or difference [Webb 1999]. Although it is not reliable when the ratio of sample size to number of features is small, a commonly used criterion function is the classification error [Jain et al. 2000]. Another issue that needs to be considered is the number of features to retain.

Feature extraction

The feature extraction method determines a feature subspace of smaller dimensionality which is a linear or non-linear combination of the original feature space [Jain et al. 2000; Webb 1999]. Feature extraction can be defined formally as follows: given a set of features X = {x1, x2, x3, ..., xd}, find a derived set X' with |X'| = m (m <= d) such that the criterion function J(.) is maximised over all derived sets of size m, i.e. J(X') is maximal.

Therefore the aim of feature extraction is to find a concise representation of the data with no, or minimal, loss of information. Feature extraction has been an important problem for decades, and different methods have been proposed. These methods can be grouped into linear and non-linear transformations of the original feature space, and may be further distinguished as supervised or unsupervised. The best-known unsupervised linear transformation technique is Principal Component Analysis [Jain et al. 2000; Webb 1999]. In these methods the transformed features are ranked according to the value assigned by the criterion function and the first m <= d features are chosen [Jain et al. 2000]. Another linear transformation method which uses the same principle is projection pursuit [Jain et al. 2000]. Other methods, such as discriminant analysis, use within-class information to perform linear extraction.
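Operationally, PCA (which is derived formally below) amounts to diagonalising the sample covariance matrix and projecting the data onto the leading eigenvectors. A minimal sketch, assuming NumPy and an invented toy data matrix; the 90% variance threshold corresponds to the retention range suggested later in this section.

```python
import numpy as np

def pca(data, proportion=0.9):
    """Project `data` (samples x features) onto the principal components that
    together capture at least `proportion` of the total variance."""
    centred = data - data.mean(axis=0)
    covariance = np.cov(centred, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)    # ascending order
    order = np.argsort(eigenvalues)[::-1]                     # re-sort, largest first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    captured = np.cumsum(eigenvalues) / eigenvalues.sum()
    k = int(np.searchsorted(captured, proportion)) + 1        # smallest k reaching the target
    return centred @ eigenvectors[:, :k], eigenvalues

# Invented toy data: 100 samples of 5 features, two of which are strongly correlated.
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 5))
toy[:, 3] = toy[:, 0] + 0.1 * rng.normal(size=100)
reduced, eigenvalues = pca(toy, proportion=0.9)
print(reduced.shape)   # (100, k) with k < 5
```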
With the development of neural networks, new methods of feature extraction have been investigated; one such method is the non-linear feature extraction technique known as the Self-Organising Map (SOM). The mathematical formulation and a more detailed discussion of Principal Component Analysis are presented next.

Principal component analysis

The Principal Component Analysis (PCA) method, which originated in 1901, derives a new set of features, each a linear combination of the original features describing the data, sorted in descending order of importance according to the criterion function [Webb 1999]. If the new feature space is to have dimensionality m <= d, the first m newly derived features are kept. The technique is widely used because of three important properties [Roweis 1998]:

- it is an optimal linear method for dimensionality reduction;
- the model parameters are computed from the data itself;
- once the model parameters are computed, compressing and decompressing the data are trivial.

There are a number of ways of describing PCA, mathematically or geometrically. Geometrically, it can be described as finding a set of axes, rotated from the original axes, that better fit the data and magnify between-class differences [Webb 1999; Norris 2002]. Norris [2002] also described PCA as "decomposing the original pattern into a set of distinct patterns over the sample and then recombining them to recreate the original data".

PCA is based on the statistical representation of the pattern. Suppose the pattern is represented as X = (x1, x2, ..., xd)^T, let the mean be denoted by

μ = E[x]

and the d x d covariance matrix by

Σ = E[(x − μ)(x − μ)^T].

The problem of finding the principal components is then defined as finding d-dimensional normalised vectors a_i such that, for y_i = a_i^T x, the following properties hold:

1. Var[y1] >= Var[y2] >= ... >= Var[yd];
2. for all i != j the covariance between y_i and y_j is zero, i.e. y_i is uncorrelated with y_j.

Consider the first principal component,

y1 = a11 x1 + a12 x2 + ... + a1d xd,

defined by choosing a1 = (a11, a12, ..., a1d) to maximise the variance of y1, subject to |a1|^2 = 1. The variance of y1 can be written in terms of the covariance matrix and a1 as

Var[y1] = a1^T Σ a1.

Hence, to find the first principal component one must solve the problem: maximise a1^T Σ a1 subject to a1^T a1 = 1, which is equivalent to maximising

f(a1) = a1^T Σ a1 − λ a1^T a1,

where λ is a Lagrange multiplier. Setting the partial derivative with respect to a1 to zero and solving for a1 gives the vector a1 that maximises the variance of y1:

∂f(a1)/∂a1 = Σ a1 − λ a1 = 0.    (2.3)

Comparing this expression with the eigenstructure of the square matrix Σ, Equation (2.3) tells us that choosing a1 to be an eigenvector of Σ with eigenvalue λ solves the maximisation problem. Furthermore, (2.3) implies

Var[y1] = a1^T Σ a1 = λ.

Order the eigenvalues λ1, λ2, ..., λd of the covariance matrix Σ so that λ1 >= λ2 >= ... >= λd. Because the aim is to maximise the variance, we choose λ to be the largest eigenvalue, λ1.

The second principal component,

y2 = a21 x1 + a22 x2 + ... + a2d xd,

is defined by choosing a2 = (a21, a22, ..., a2d) to maximise the variance of y2, subject to |a2|^2 = 1 and to the covariance between y1 and y2 being zero. The second constraint implies

E[y2 y1] − E[y1] E[y2] = a2^T Σ a1 = 0.

Since a1 is an eigenvector of Σ, this means that a2 is orthogonal to a1 (i.e. a2^T a1 = 0). Using Lagrange multipliers λ and φ, the problem of maximising the variance of y2 can be reformulated as maximising

f(a2) = a2^T Σ a2 − λ a2^T a2 − φ a2^T a1.

Setting the partial derivative with respect to a2 to zero and multiplying by a1^T shows that φ = 0, and hence

Σ a2 = λ a2.

This tells us that a2 is also an eigenvector of Σ, orthogonal to a1. Since we are still looking to maximise the variance of y2, a2 is the eigenvector corresponding to the largest of the remaining eigenvalues of Σ (i.e. λ2). Following the same argument, the variance of the i-th principal component is

Var[y_i] = a_i^T Σ a_i = λ_i,    (2.4)

and the total variance is

sum_{i=1}^{d} Var[x_i] = λ1 + λ2 + ... + λd = sum_{i=1}^{d} Var[y_i].

Thus, the portion of the total variance of the data captured by the i-th principal component is

proportion of variance = λ_i / sum_{j=1}^{d} λ_j,

and the portion of the total variance captured by the first k principal components is

p = sum_{i=1}^{k} λ_i / sum_{i=1}^{d} λ_i.

The value of p determines the size of the reduced dimension and is chosen by the user. Although the appropriate value of p is problem specific, choosing a value between 70% and 90% generally retains the information of the original data well [Jolliffe 1986]. The effect of dimensionality reduction using PCA on the performance of some of the pattern recognition techniques used in this research will be assessed.

Feature selection

Another approach to dimensionality reduction is feature selection. Unlike feature extraction, which finds a lower-dimensional feature space that is a linear or non-linear combination of the input space, feature selection finds a subset of the input space that carries considerable information with lower dimensionality. To decide which subset to use out of the possible subsets, we can use the classification performance of each subspace and select the one with the least classification error. There are a number of approaches to feature selection; in this research feature selection techniques are not applied, and the interested reader is referred to the work by Saeys [2004].

2.6 Neural networks

2.6.1 Introduction

The traditional von Neumann machine, which abstracts human information processing, can solve computational problems faster than the human brain, but it is inefficient at the recognition, classification and description tasks that the human brain handles with ease. Real-life classification and recognition problems are complicated by a number of external factors such as the orientation of the object, the direction of vision and deformation. Aimed at addressing these limitations, a different paradigm of computing called neural networks was proposed as far back as the 1940s [Russell 1991].

Neural networks are systems inspired by the biological nervous system, which is composed of numerous interconnected simple elements (neurons) operating in parallel. These small elements work as a unit to perform a given task (such as prediction or classification) by adjusting the values of the connections between them (termed weights). The values of these connections are usually adjusted dynamically based on the performance of the network on the given training and validation sets [Russell 1991]. A neural network is characterised by three design decisions [Wu 1997]:

- the pattern of connections between neurons (architecture);
- the activation function;
- the method of determining the weights on the connections (training or learning algorithm).

Both supervised and unsupervised learning methods can be used for neural networks.
However, the presentation in this section is based on supervised learning.

2.6.2 The Perceptron

The simplest form of neural network, used to classify linearly separable patterns, is the perceptron, created by Rosenblatt [Rosenblatt 1962]. A perceptron has two layers, the input layer and the output layer (see Figure 2.5). The weights and bias can be adjusted from the presented input/output pairs using the perceptron learning rule, described below. Perceptrons gained popularity because of their ability to generalise well from a given training sample even when starting from randomly distributed connections (random initial weights) [Demuth and Beale 2004].

Figure 2.5: A perceptron. x_i and y are the input and output respectively, w_i is the weight associated with each node in the input layer and b is the bias.

The output of the network in Figure 2.5 is

y = f(sum_i w_i x_i + w_b).    (2.5)

The activation function f can take many forms depending on the application of the network (for example the sign function or the log-sigmoid function); the log-sigmoid function is a typical choice [Rumelhart et al. 1994]. The perceptron uses adaptive learning to adjust its weights in order to produce the correct output. The rule governing this, known as the perceptron learning rule, is as follows [Demuth and Beale 2004]: if the network generates the correct output, make no change; if a wrong output is generated, adjust the weights and the bias by an amount proportional to the difference between the correct output and the generated output. More generally, perceptron learning can be summarised mathematically as

w_{i+1} = w_i + Δw_i    (2.6)

Δw_i = η (T_i − y_i) x_i    (2.7)

where i indexes the current training example, η is the learning rate (0 < η <= 1), T_i is the target output and y_i is the actual output. Equation 2.7 is known as the delta rule.

Like any other error-correction learning rule, the ultimate aim of this rule is to increase the overall performance of the system, or equivalently to minimise a cost [Rumelhart et al. 1994]. The cost function commonly considered is the sum of squared errors (SSE), the sum of squared differences between the desired output and the actual output:

SSE = (1/2) sum_i (T_i − y_i)^2.    (2.8)

With this definition of the cost, the learning goal is to minimise Equation 2.8 with respect to the weights w_i:

Δw_i = −∂SSE/∂w_i    (2.9)

∂SSE/∂w_i = −sum_i (T_i − y_i) ∂y_i/∂w_i

so the learning rule in Equation 2.7 becomes

Δw_i = sum_i (T_i − y_i) ∂y_i/∂w_i.    (2.10)

Equation 2.10 is known as gradient descent learning.

The perceptron learning rule is capable of solving any linearly separable classification problem. Gradient descent learning, in addition, can drive the squared error down to some acceptable value defined by the experimenter even when the problem is not linearly separable. Although these learning rules can solve linearly separable classification problems in finite time, some problems (for example the XOR function) cannot be learned by a single perceptron with acceptable accuracy. Most real-world problems are not linearly separable and cannot be solved with this simplest architecture (a single-layer network). To address this, a number of perceptrons can be combined to handle multi-class and more complicated classification problems, where there is more than one output and a more complex decision surface is needed to separate the patterns.
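A minimal sketch of the perceptron learning rule of Equations 2.6 and 2.7 on an invented, linearly separable toy problem; the data, learning rate and number of epochs are arbitrary choices for illustration.

```python
import numpy as np

def train_perceptron(inputs, targets, learning_rate=0.1, epochs=20):
    """Perceptron learning rule: w <- w + eta * (T - y) * x (Equations 2.6 and 2.7)."""
    # Append a constant 1 to every input so that the last weight acts as the bias w_b.
    x = np.hstack([inputs, np.ones((len(inputs), 1))])
    weights = np.zeros(x.shape[1])
    for _ in range(epochs):
        for xi, target in zip(x, targets):
            output = 1 if xi @ weights > 0 else 0          # hard-threshold activation
            weights += learning_rate * (target - output) * xi
    return weights

# Invented toy data: class 1 whenever x1 + x2 > 1 (linearly separable).
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.9, 0.8], [0.2, 0.1]])
targets = np.array([0, 0, 0, 1, 1, 0])
weights = train_perceptron(inputs, targets)
print(weights)   # learned weights; the last entry is the bias
```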
2.6.3 Feedforward multilayer perceptrons

A multilayer perceptron has one or more additional layers, called hidden layers, whose neurons are called hidden neurons. (They are called hidden because the neurons in these layers have no external connections; they connect only to input, output or other hidden neurons.) The output from one layer serves as the input to the next layer. Figure 2.6 shows a multilayer perceptron with one hidden layer.

The most popular form of multilayer perceptron has a feedforward topology. In this configuration, neurons in one layer are connected only to neurons in the next layer; connections between neurons in the same layer, and loops, are not allowed. (For example, in Figure 2.6 the input neurons are connected to neurons in the hidden layer, and neurons in the hidden layer are connected to neurons in the output layer.)

Figure 2.6: A multilayer perceptron with one hidden layer.

A multilayer perceptron is called fully connected if each neuron in one layer is connected to every neuron in the next layer; otherwise it is called partially connected. Each connection between neurons has an associated weight. The activation function (φ) in the hidden layers must be non-linear: a multilayer perceptron with linear activations in its hidden layers can be represented equivalently by a single-layer perceptron with a suitable activation function. As for the perceptron, the output is computed from the input and the weights of the respective connections. Each neuron in the network produces an output, which is the input for the neurons in the next layer:

y_j = f(sum_i w_ji x_i + w_b)    (2.11)

where w_ji is the weight of the connection from neuron i to neuron j, y_j is the output of the j-th neuron and x_i is the input to the neuron; the x_i equal the network inputs for neurons in the input layer and the outputs of the previous layer for subsequent layers. The final output(s) of the network are computed in the same way. The weights are adjusted during the training process to minimise the sum of squared errors, which for multilayer perceptrons is defined as

SSE = (1/2) sum_i (y_di − y_ai)^2    (2.12)

where y_di is the desired output and y_ai the actual output.

During training, the outputs of the network are obtained by computing the output at each layer in turn until the final output(s) are generated. The weights are then adjusted by an amount proportional to the error of the neuron fed by that weight at the given time during training. Adjusting these weights is not as direct as it is for perceptrons: the error for a hidden layer can only be known once the error at the output layer is known. Hence there must be a mechanism to propagate the error back through the network so that every weight can be adjusted. There are a number of algorithms for this task; the most popular, the backpropagation algorithm, is discussed below.

Backpropagation algorithm

The multilayer perceptron with backpropagation learning is the most popular network architecture [Delen and Kadam 2005]. During training, in the forward pass the activations propagate from the input layer to the output layer of the network.
In the backward pass, on the other hand, the connection weights are adjusted based on the difference between the desired and actual outputs of the layer that follows them (for example, in Figure 2.6 the weights into the output layer are adjusted based on the error at the network output, and the hidden-layer weights are then adjusted based on these values). The backpropagation algorithm adjusts the weight of the connection between neurons i and j at the (k + 1)-th iteration as follows [Riedmiller and Braun 1993]:

w_ji(k + 1) = w_ji(k) − η ∂SSE/∂w_ji(k).    (2.13)

The learning rate η has an important effect on the convergence time of learning: a small learning rate may require a very large number of iterations to converge, while a large one may cause the weights to jump between different values and the network to fail to reach the required error bound. One proposed solution to this problem is to introduce a momentum term, giving the modified weight change shown in Equation 2.14. The momentum term is intended to give stability to the learning process (the parameter μ scales the influence of the previous weight change), although this is not always the case in practice [Riedmiller 1994].

Δw_ji(k) = −η ∂SSE/∂w_ji(k) + μ Δw_ji(k − 1)    (2.14)

The backpropagation algorithm has a number of limitations that often make the training process too slow for real-world problems [Riedmiller and Braun 1993; Delen and Kadam 2005]. A number of modifications to the algorithm have been proposed to address this; some take a heuristic approach while others use numerical optimisation to speed up the training process. Among these are the resilient backpropagation algorithm proposed by Riedmiller and Braun [1993], which takes the heuristic approach, and the Levenberg-Marquardt algorithm of Hagan and Menhaj [1994], which uses numerical optimisation.

Resilient backpropagation algorithm

The resilient backpropagation algorithm was proposed by Riedmiller and Braun [1993] to address the weight-update related limitations of standard backpropagation. The algorithm uses only the sign of the derivative to decide whether to increase or decrease the value of a weight; the size of the weight change is determined by a separate update value. The weights are updated as follows [Riedmiller and Braun 1993]:

Δw_ji(k) = −U_ji(k)   if ∂SSE/∂w_ji(k) > 0
           +U_ji(k)   if ∂SSE/∂w_ji(k) < 0
           0          otherwise    (2.15)

The update value is determined as follows:

U_ji(k) = η+ · U_ji(k − 1)   if ∂SSE/∂w_ji(k − 1) · ∂SSE/∂w_ji(k) > 0
          η− · U_ji(k − 1)   if ∂SSE/∂w_ji(k − 1) · ∂SSE/∂w_ji(k) < 0
          U_ji(k − 1)        otherwise    (2.16)

where 0 < η− < 1 < η+. The update value is initialised to a value preferably proportional to the initial weights; 0.1 is usually a good initial value. The resilient backpropagation algorithm is much faster than standard backpropagation and has a modest memory requirement [Delen and Kadam 2005]. A thorough explanation of the algorithm can be found in Riedmiller and Braun [1993] and Riedmiller [1994].

Levenberg-Marquardt algorithm

Levenberg-Marquardt learning was introduced to feedforward network training by Hagan and Menhaj [1994] to speed up the training process. The algorithm gives a numerical solution to the problem of minimising a sum of squares (Equation 2.12) of a non-linear function (non-linear least squares) [Hagan and Menhaj 1994].
For the function $y = f(x, w)$, with $x$ and $y$ the input and output variables respectively, the Taylor series expansion gives us [Chan 1996]:
\[ J\,\Delta w = e \tag{2.17} \]
where $J$ is the Jacobian matrix containing the first derivatives of the network error with respect to the weights, and $e$ is the error for a given input. The Levenberg-Marquardt modification to the solution of Equation (2.17) by Newton's method is given by [Hagan and Menhaj 1994; Chan 1996]:
\[ \Delta w = (J^{T}J + \mu I)^{-1} J^{T} e \tag{2.18} \]
where $\mu$ is a regularisation parameter introduced to prevent ill-conditioning of the Hessian, which is approximated by the square of the Jacobian ($H = J^{T}J$), and $I$ is the identity matrix.

Depending on the value chosen for $\mu$, Equation (2.18) exhibits different behaviour. When $\mu$ equals zero, it becomes Newton's method with an approximated Hessian matrix. With a large value of $\mu$ the algorithm becomes gradient descent with a step size of $1/\mu$, while a small value moves the algorithm towards the Gauss-Newton method [Hagan and Menhaj 1994; Delen and Kadam 2005]. A thorough explanation of the algorithm can be found in Marquardt [1963]; Hagan and Menhaj [1994].

2.7 Decision trees

2.7.1 Introduction

Decision trees are one of the popular classification techniques. The technique is an example of a multistage decision process, in which a subset of the attributes (for example, a single attribute) that make up the pattern, rather than the whole pattern, is examined at each level of the decision tree [Webb 1999]. Decision trees are tree-structured classifiers (see Figure 2.7) in which each node is either a leaf node, which specifies a class value, or a decision node, which specifies a further test to be carried out on a single attribute value. The number of splits a decision node has depends on the outcome of the split test carried out on the attribute value at that node. For a discrete attribute $P_1$ with possible values $a_1, a_2, \ldots, a_n$, the possible outcomes of the split test are $P_1 = a_1, P_1 = a_2, \ldots, P_1 = a_n$ (see Figure 2.7 for an example). If the attribute $P_1$ has a continuous value, there are two possible outcomes, $P_1 \le t$ and $P_1 > t$.

A decision tree can be used to classify a pattern by traversing the tree from the root until we reach a leaf node, which specifies the class to which the pattern belongs. Travelling from the root to a leaf node is carried out according to the outcome of the split test at each decision node in between. Applying a decision tree to classify a pattern is similar to applying a series of if-then-else statements to the pattern. This classification technique has a number of features that make it popular, such as speed of classification, the ability to interpret individual features separately and the ability to handle missing data [Jain et al. 2000]. Furthermore, a decision tree can easily be translated into decision rules which are readily understood by domain experts.

[Figure 2.7: Decision tree for conditions to play tennis]

2.7.2 Constructing a decision tree

The main issue in decision tree construction is building the smallest possible decision tree for a given learning set. This step is carried out in a recursive manner and is an NP-complete problem [Quinlan 1993]. The general method of constructing a decision tree is summarised in Quinlan [1993, pages 17-18] as follows. If there are $k$ classes denoted $\{C_1, C_2, \ldots, C_k\}$, and a training set $T$, then:
- if $T$ contains one or more patterns which all belong to a single class $C_j$, the decision tree is a leaf identifying class $C_j$;
- if $T$ contains no patterns, the decision tree is a leaf determined from information other than $T$;
- if $T$ contains objects that belong to a mixture of classes, then a test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes $\{O_1, O_2, \ldots, O_n\}$. $T$ is partitioned into subsets $T_1, T_2, \ldots, T_n$, where $T_i$ contains all the objects in $T$ that have outcome $O_i$ of the chosen test. The same method is applied recursively to each subset of training objects.

According to this outline, when constructing a decision tree there are certain questions one needs to answer, such as:
- how to choose the best split;
- when to stop growing the tree;
- how to prune the tree;
- how to handle missing attributes.

Different variants of decision trees, such as C4.5 [Quinlan 1993], CART (classification and regression trees) [Breiman et al. 1984], CHAID (chi-square automatic interaction detection) [Kass 1980] and QUEST (quick unbiased efficient statistical tree) [Loh and Shih 1997], address these questions differently. In this dissertation we will see how these questions are answered in the context of the decision tree induction algorithm C4.5.

2.7.3 Splitting test

Different decision tree algorithms use different criteria to split the training sample into subsets at each split test. The split evaluation criterion used by the C4.5 induction algorithm is information gain, which measures the quality of a given attribute as a split criterion with respect to the targeted classification. The presentation in this section follows Quinlan [1993] unless specified otherwise.

For any subset $S$ of the set $T$ of training examples, let $freq(C_i, S)$ be the number of patterns belonging to class $C_i$ $(i = 1, 2, \ldots, n)$. Selecting one example at random from $S$ and declaring that it belongs to class $C_i$ has a probability of:
\[ \frac{freq(C_i, S)}{|S|} \tag{2.19} \]
where $|S|$ is the number of examples in the subset $S$. The information conveyed (in bits) by this "message" is:
\[ -\log_2 \frac{freq(C_i, S)}{|S|} \tag{2.20} \]
In general, for the probability distribution $P = (p_1, p_2, \ldots, p_n)$, where $p_i = \frac{freq(C_i, S)}{|S|}$, the information conveyed by this distribution is:
\[ Info(P) = -\sum_{i}^{n} p_i \log_2(p_i) \tag{2.21} \]
Similarly, the information conveyed by the set $S$ is given by:
\[ Info(S) = -\sum_{i}^{n} \frac{freq(C_i, S)}{|S|}\, \log_2 \frac{freq(C_i, S)}{|S|} \tag{2.22} \]
This quantity is called the entropy of the set $S$ (see Shannon [2001] for more on entropy and information theory). For example, if we have 14 patterns in $S$, of which 9 belong to one class and the rest to the other (a two-class problem), the average amount of information needed to identify the class of a pattern in $S$ equals:
\[ Info(S) = Info\Big(\frac{9}{14}, \frac{5}{14}\Big) = 0.94 \]
Now consider an attribute $X$ of a pattern in $T$. If $X$ is a discrete attribute with $k$ possible values, and $T_1, T_2, \ldots, T_k$ are the subsets of $T$ consisting of patterns with distinct values of this attribute, then the expected information requirement of $T$ can be found as a weighted sum over the $T_i$:
\[ Info_X(T) = \sum_{i}^{k} \frac{|T_i|}{|T|} \times Info(T_i) \tag{2.23} \]
The information gained by the partitioning induced by $X$ can be calculated by evaluating the information before and after partitioning:
\[ Gain(X) = Info(T) - Info_X(T) \tag{2.24} \]
This value favours tests on attributes with a large number of distinct values. For example, for an attribute $X$ with a unique value for each pattern in $T$, $Info_X(T)$ will be zero, maximising the gain.
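The entropy and gain calculations of Equations 2.22 to 2.24 can be illustrated with a short Python sketch. The function names and the attribute values used below are assumptions for illustration only; the first print reproduces the Info(9/14, 5/14) ≈ 0.94 example above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Info(S) = -sum_i freq(C_i,S)/|S| * log2(freq(C_i,S)/|S|)   (Equation 2.22)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Gain(X) = Info(T) - sum_i |T_i|/|T| * Info(T_i)   (Equations 2.23-2.24)
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    expected = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - expected

# 14 patterns, 9 of one class and 5 of the other: entropy is about 0.940 bits
labels = ['yes'] * 9 + ['no'] * 5
print(round(entropy(labels), 3))

# gain of a hypothetical three-valued attribute over the same 14 patterns
outlook = ['sunny'] * 5 + ['overcast'] * 4 + ['rain'] * 5
print(round(information_gain(labels, outlook), 3))
```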
To avoid this bias, Quinlan [1993] uses the information gain ratio instead. By analogy with Equation (2.21), the information generated by dividing $T$ into $k$ subsets induced by $X$ is given by:
\[ Split(X) = -\sum_{i}^{k} \frac{|T_i|}{|T|}\, \log_2 \frac{|T_i|}{|T|} \tag{2.25} \]
This is the split information. The information gain ratio is then defined as:
\[ Gain\ Ratio(X) = \frac{Gain(X)}{Split(X)} \tag{2.26} \]
The gain and gain ratio of an attribute which has already been selected as a split criterion in an ancestor node will be zero; this attribute will therefore not be selected again as we go down the tree.

In many real-world problems attributes do not always have discrete values, so we also need to address cases where attributes have continuous values. Quinlan [1993], with a later improvement suggested in Quinlan [1996], handles continuous attributes as follows. Suppose the attribute $X$ has a continuous range. Although $X$ is continuous, there can only be a finite number of its values in the set $T$. Computing the information gain of this attribute starts by sorting these values in ascending order. Say the ordered values of attribute $X$ are $v_1, v_2, \ldots, v_m$. Then for each mid-point of an interval, $v = \frac{v_i + v_{i+1}}{2}$ for $i \in [1, m-1]$, we partition $T$ into two sets: the first subset contains the patterns with attribute $X \le v$ and the second subset those with $X > v$. For each of the $m-1$ partitions the gain ratio (Equation 2.26) is computed, and the partition that maximises it is used as the split criterion. Note that if all the attributes of the pattern have continuous ranges, we obtain a binary tree.

2.7.4 Stopping criteria

Based on the criteria specified above, the construction of the decision tree is carried out; further splitting is stopped and a leaf node declared according to a number of criteria. Allowing the tree to grow until every pattern in the training set is correctly classified results in an over-fitted model, which will not have good generalisation ability. Conversely, stopping the construction process too early results in an under-fitted model, which will equally have poor generalisation ability. One suggested solution to this problem is to employ stopping criteria that take both over-fitting and under-fitting into consideration. Some of these include:
- all possible split tests have zero gain, in which case all the patterns in the subset belong to the same class;
- the gain ratio is below some predetermined value;
- the number of patterns in the subset is fewer than some predetermined value, or some proportion of the patterns belong to the same class;
- the depth of the node exceeds some predetermined value.

Employing stopping criteria enables us to avoid over-fitting; however, the results are uneven due to noise and incorrectly classified patterns in the training set. To avoid these anomalies, C4.5 uses a pruning method based on estimating error rates.

2.7.5 Pruning

The generalisation ability of a fully grown tree can be increased by pruning the subtrees that do not contribute positively to the generalisation ability of the resulting model. Pruning a decision tree refers to the process of replacing a subtree with a leaf node or with its most frequently used branch [Quinlan 1993]. Quinlan [1987] suggested three different pruning techniques:
- cost-complexity pruning,
- reduced error pruning, and
- pessimistic pruning.

Cost-complexity pruning, proposed by Breiman et al. [1984], performs pruning as follows: a large tree is created, and then a sub-tree is found starting from the leaves.
Again, from this sub-tree another sub-tree is found, until we are left with a sub-tree containing the root node only. All these sub-trees are then tested on an independent validation set, and the sub-tree with the least cost is selected.

Reduced error pruning also uses a sequence of sub-trees and chooses the one with the fewest misclassifications on an independent validation set. However, the way the sub-trees are generated differs from cost-complexity pruning. In this approach, each non-leaf sub-tree is replaced with the best possible leaf whenever the resulting tree performs at least as well as the original on the validation set. This process is carried out on the resulting tree until no further gain in classification is found. Both cost-complexity pruning and reduced error pruning require an independent validation set, which may be a disadvantage in classification problems where the number of training patterns is limited. There are a number of approaches to address such a scenario, one of which is cross-validation. The approach taken by C4.5, however, is a pruning method that does not require an independent validation set: pessimistic pruning.

Pessimistic pruning

Pessimistic pruning is an approach which pessimistically increases the errors observed at each leaf, using a continuity correction for the binomial distribution, to encourage pruning [Quinlan 1987]. Quinlan [1987] describes pessimistic pruning as effectively increasing the number of observed errors at each leaf by 0.5. However, C4.5 uses a far more pessimistic estimate, presented in Kohavi and Quinlan [2002] as follows. When a non-leaf node covers $N$ training patterns of which $E$ are misclassified, $\frac{E}{N}$ is an estimate of the probability $p$ of misclassification. However, since the decision tree is constructed on the same $N$ training patterns, it tends to minimise the apparent error. To make the estimate more realistic we can derive a confidence limit for $p$. For a given confidence term $CF$, we can find an upper limit $p_r$ such that $p \le p_r$ with probability $1 - CF$. Following [Diem 1962, as cited in Kohavi and Quinlan [2002]], $p_r$ satisfies:
\[ CF = \begin{cases} (1 - p_r)^N & \text{if } E = 0 \\ \sum_{i=0}^{E} \binom{N}{i} p_r^{\,i} (1 - p_r)^{N-i} & \text{if } E > 0 \end{cases} \tag{2.27} \]
Let $U_{CF}(E, N)$ be this upper bound on the error $p_r$. Consider the non-leaf tree $T$ (shown in Figure 2.8), produced from $N$ training patterns, whose sub-trees $T'_i$ have already been pruned. Let $T'_f$ be the subtree corresponding to the most frequent outcome of the split test $B$, and let $L$ be the leaf labelled with the most frequent class in the training patterns.

[Figure 2.8: Non-leaf tree with already pruned subtrees $T'_i$]

If $E_T$, $E_{T'_f}$ and $E_L$ are the numbers of misclassified patterns for the tree $T$, the sub-tree $T'_f$ and the leaf $L$ respectively, the corresponding estimated error rates are:
- $U_{CF}(E_T, N)$
- $U_{CF}(E_{T'_f}, N)$
- $U_{CF}(E_L, N)$

Depending on which of these estimates is lowest, $T$ is left unchanged, replaced by the subtree $T'_f$, or replaced by the leaf $L$.

2.8 Logistic regression

Logistic regression is a regression method which is popular for modelling dichotomous data. In this model the dependent variable $Y$ takes the value 1 with probability $P(X_i)$ and the value 0 with probability $1 - P(X_i)$. The independent variables $X_i$ can have any form. The model does not assume that the relationship between the independent variables and the dependent variable is linear, nor that the dependent variable or the error terms are normally distributed.
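As a minimal illustration of fitting such a dichotomous model (the logit formulation itself is made precise in Equations 2.28 and 2.29 below), the following Python sketch uses scikit-learn; the library choice, the synthetic data and the large C value (an almost unpenalised fit) are assumptions, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy dichotomous data: a single predictor and a 0/1 response
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
true_logit = -0.5 + 2.0 * X[:, 0]                      # beta_0 + beta_1 * x (made-up values)
y = (rng.random(200) < 1 / (1 + np.exp(-true_logit))).astype(int)

# C is the inverse of the regularisation strength; a large C gives a nearly unpenalised fit
model = LogisticRegression(C=1e6).fit(X, y)
print("intercept:", model.intercept_, "slope:", model.coef_)

# predicted probability P(Y = 1 | x) for a new observation
print(model.predict_proba([[0.3]])[:, 1])
```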
Logistic regression is used to predict the effect of the independent variables on the binary response. The relationship between the predictors and the response is determined by the log odds, or logit transformation, of $P(X_i)$. Given the values $X_i = (X_{i1}, X_{i2}, \ldots, X_{id})$, the log odds of the response being 1 (the odds of observing 1 versus 0) is modelled using the logistic regression model as:
\[ \eta_i = \mathrm{logit}(P(X_i)) = \log\Big(\frac{P(X_i)}{1 - P(X_i)}\Big) = \beta_0 + \sum_{j=1}^{d} \beta_j X_{ij} \tag{2.28} \]
where $\eta_i$ is the linear predictor, a combination of the independent variables, $\beta_0$ is the intercept and $\beta = (\beta_1, \beta_2, \ldots, \beta_d)$ is the vector of slope parameters [Cessie and van Houwelingen 1992; So 1995]. An alternative form of the logistic regression model is [Perlich et al. 2004]:
\[ P(X_i) = \frac{\exp\big(\beta_0 + \sum_{j=1}^{d}\beta_j X_{ij}\big)}{1 + \exp\big(\beta_0 + \sum_{j=1}^{d}\beta_j X_{ij}\big)} \tag{2.29} \]
To find the best estimate of $P(X_i)$, the loss function used in this model is the log-likelihood. The parameter estimate $\hat{\beta}$ maximises the log-likelihood and gives the estimate of $P(X_i) = P(Y = 1)$. The log-likelihood for the data $(X, Y)$ under the logistic model is given by:
\[ l(\beta) = \sum_i \big[Y_i \log P(X_i) + (1 - Y_i)\log(1 - P(X_i))\big] \tag{2.30} \]
Maximisation of $l(\beta)$ yields the maximum likelihood estimator $\hat{\beta}$ of $\beta$. However, the model becomes unstable when the dimensionality of the pattern is large compared to the number of available samples [Perlich et al. 2004]. To obtain more realistic estimates of the parameters and improve the predictive power of the model, Cessie and van Houwelingen [1992] extended a method that adjusts the regression estimates by shrinking the correlation matrix of the predictor values towards a fixed point, by adding a constant $\lambda$ to the diagonal elements of the matrix. The penalised log-likelihood is given by:
\[ l_{\lambda}(\beta) = l(\beta) - \lambda \sum_j \beta_j^{2}/2 \tag{2.31} \]
where $l(\beta)$ is the unrestricted log-likelihood function. The second term is the ridge penalty and $\lambda$ is the ridge parameter. The ridge parameter regulates the penalty: when $\lambda = 0$ the solution is the unrestricted maximum likelihood estimator, whereas as $\lambda \to \infty$ the $\beta_j$ all tend to zero [Cessie and van Houwelingen 1992; Eilers et al. 2001].

Choosing an optimal value for the smoothing or regularisation constant $\lambda$ is crucial. The two dominant data-driven ways of selecting the ridge parameter are cross-validation and the Akaike Information Criterion (AIC) [Schimek 2003]. Due to the lack of empirical evidence for the latter, cross-validation is the most popular and successful method [Schimek 2003]. To assess performance within cross-validation one may use either the fraction of misclassifications or the strength of the log-likelihood prediction. A thorough explanation of penalised logistic regression with the ridge estimator can be found in Cessie and van Houwelingen [1992]; Eilers et al. [2001]; Zhu and Hastie [2004].

2.9 Summary

In this chapter a detailed overview of the biological background of HIV, a brief overview of general pattern recognition principles, and the dimensionality problem together with techniques for addressing it were presented. The chapter also presented a detailed background on statistical pattern recognition techniques and Bayesian decision theory, the tool broadly used to perform pattern classification based on probabilistic knowledge about the random patterns and the pattern classes. A brief introduction to feedforward neural networks with backpropagation learning was given, and two algorithms proposed to address the problems with the standard backpropagation algorithm were also introduced. A brief introduction to decision trees based on the C4.5 algorithm was also given.
Furthermore, a brief introduction to logistic regression, with directions for further reading, was also given. The chapter also covered a comprehensive discussion and the mathematical formulation of principal component analysis, the best known unsupervised linear transformation technique.

Chapter 3

Support Vector Machine (SVM)

3.1 Introduction

SVMs are a statistical learning technique developed by Boser et al. [1992] to perform a variety of supervised learning and function estimation tasks. Like any other machine learning technique, SVMs aim to minimise the risk (training or test error), based on a simple idea that provides a clear intuition of what supervised learning is all about. Most traditional machine learning techniques are based on the Empirical Risk Minimisation (ERM) principle, which approximates the estimation function by minimising the empirical risk (the error on the training data). SVMs, however, use the principle of Structural Risk Minimisation (SRM), proposed by Vapnik [1979], which minimises an upper bound on the test error or risk [Schölkopf 1997]. The principle of SRM gives SVMs an advantage over machine learning techniques based on ERM: techniques based on ERM are more likely to suffer from over-fitting, which is less likely to occur in those based on SRM. Over-fitting refers to the situation where the training data is learnt perfectly but the estimation function does not perform well on unseen data. SVMs do not suffer from over-fitting as much and hence give better generalisation.

Besides SRM, the ability of SVMs as a pattern recognition tool relies on their ability to transform the data into a higher dimensional feature space that is nonlinearly related to the input space. Transforming the data from the input space to the feature space turns a complex real-world classification problem into a simpler one, where the classification task can be accomplished using a linear hyperplane [Ganapathiraju 2002]. Although the patterns are mapped from the input space into a higher dimensional feature space, SVMs have the further advantage of performing all computations in the input space, using specialised functions called kernel functions. This trick further saves SVMs from suffering the curse of dimensionality. As a result of the principle of SRM together with the kernel trick, SVMs are rapidly replacing neural networks, Radial Basis Functions (RBF) and polynomial classifiers, which have been the dominant pattern recognition tools [Hearst et al. 1998]. Furthermore, SVMs have already been applied in many benchmark applications, including face detection [Osuna et al. 1997], text classification [Joachims 1998; Dumais et al. 1998; Sun et al. 2002], speech recognition [Ganapathiraju 2002] and a number of other pattern recognition problems.

In this chapter a brief introduction to SVMs, implementation issues and some of the benchmark applications of SVMs will be discussed. Section 3.2 gives a concise description of risk minimisation, namely ERM and SRM. Section 3.3 gives a deeper insight into linear SVMs. In Section 3.4, the non-linear support vector machine is described in detail, following a discussion of feature spaces and kernel mathematics. The material presents the solution to the two-class classification problem; Section 3.5 explains how two-class classification is extended to multi-class classification. Section 3.6 gives a brief overview of the approaches taken in the implementation of SVM tools.
The material presented in this chapter is based on Vapnik [1995]; Burges [1998]; Osuna et al. [1997]; Cristianini and Shawe-Taylor [2000]; Schölkopf [1997]. Any other reference used is cited accordingly.

3.2 Risk minimisation

Consider a set of training samples $(x_1, \omega_1), (x_2, \omega_2), \ldots, (x_l, \omega_l)$, generated independently and identically distributed from an unknown probability distribution, with:
\[ (x_1, \omega_1), (x_2, \omega_2), \ldots, (x_l, \omega_l) \in \mathbb{R}^d \times \Omega \tag{3.1} \]
where $x_i$ is the input data, $\omega_i$ is the class $x_i$ belongs to and $l$ is the number of examples. The two-class pattern recognition problem can be described as choosing a function $f_\alpha$ from a set of decision functions:
\[ \{f_\alpha : \alpha \in \Lambda\}, \quad \text{where } f_\alpha : \mathbb{R}^d \to \{\pm 1\} \tag{3.2} \]
The set $\{f_\alpha : \alpha \in \Lambda\}$ is called the hypothesis space (the set of all possible functions that can perform the classification, from which we choose the one that performs best) [Osuna et al. 1997] and will be denoted by $\mathcal{S}$. $\Lambda$ is a set of abstract parameters, introduced so that $f_\alpha$ can correctly classify an unseen example $(x, \omega)$ generated from the same underlying probability distribution $P(x, \omega)$ as the training data set.

For a given set of training samples, however, there may be a number of such functions $f_\alpha$ which work on different subsets of the training sample. So how do we know which one to choose? Suppose we have two disjoint subsets of the training sample, $X = \{x_1, x_2, \ldots, x_l\}$ and $\bar{X} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{\bar{l}}\}$, such that $X \cap \bar{X} = \emptyset$, and there exists a function $f_{\bar{\alpha}}$, $\bar{\alpha} \in \Lambda$, such that
\[ f_{\bar{\alpha}}(x_i) = \omega_i \quad i = 1, 2, \ldots, l, \qquad f_{\bar{\alpha}}(\bar{x}_i) \ne \bar{\omega}_i \quad i = 1, 2, \ldots, \bar{l} \]
If we use $X$ as the training set and $\bar{X}$ as the validation set, the decision function $f_{\bar{\alpha}}$ will have very poor performance on the validation set. Therefore, the aim of the pattern recognition system should be to find a function $f_\alpha$ that performs well on both the training and validation sets. In order to choose such a function, a way of quantifying the performance is needed. By defining a loss function $L(\omega_i, f_\alpha(x_i))$ that measures the loss, that is the difference between the classification according to our pattern recognition system and the true class of the pattern $x_i$, the expected risk, which should be minimised over all the functions $f_\alpha$, is given by:
\[ R(\alpha) = \int L\big(\omega, f_\alpha(x)\big)\, dP(x, \omega) \qquad \alpha \in \Lambda \tag{3.3} \]
Estimating the expected risk directly from the above equation is impossible because the probability distribution $P(x, \omega)$ is unknown. The only information available is the data, which is independently and identically distributed and can be considered a fair representation of the underlying distribution. Therefore, one can approximate this error by the measured mean error on the training set [Vapnik 1995].

3.2.1 Empirical Risk Minimisation (ERM)

The measured mean error (empirical risk) on the training set is given by:
\[ R_{emp}(\alpha) = \frac{1}{l}\sum_{i=1}^{l} L\big(\omega_i, f_\alpha(x_i)\big) \tag{3.4} \]
Note that no probability distribution is involved in Equation 3.4, and $R_{emp}(\alpha)$ is a real number for a given function $f_\alpha$. Once the estimation has been made on the available training sample, the function $f_\alpha$ that minimises the empirical risk is chosen. This is the principle of Empirical Risk Minimisation (ERM) [Vapnik 1995]. Consider the 0/1 loss function for the two-class pattern recognition problem, defined as:
\[ L\big(\omega_i, f_\alpha(x_i)\big) = \begin{cases} 0 & \omega_i = f_\alpha(x_i), \\ 1 & \omega_i \ne f_\alpha(x_i) \end{cases} \qquad i = 1, 2, \ldots, l \]
where $\omega_i$ takes a value in $\{\pm 1\}$. Using this loss function the empirical risk has the form:
\[ R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} \big|f_\alpha(x_i) - \omega_i\big| \tag{3.5} \]
ERM is intuitive and easy to implement, but the question of consistency needs to be answered [Vapnik 1995; Osuna et al. 1997]: does a non-trivial function $f_\alpha$ that minimises the empirical risk (3.4) also minimise the actual risk (3.3)? Consistency of empirical risk minimisation depends on both the number of samples available (the law of large numbers) and the capacity of the set of functions (the Vapnik-Chervonenkis (VC) dimension) of the learning machine [Osuna et al. 1997; Vapnik 1995; Schölkopf and Smola 2001]. Furthermore, Vapnik and Chervonenkis have shown that finiteness of the VC dimension of the hypothesis space is a necessary and sufficient condition for consistency of ERM [Vapnik and Chervonenkis 1991, as cited in Osuna et al. [1997]].

When the number of available training samples is small, the empirical risk can easily be minimised to a value close to zero, but the data hardly reflects the underlying probability distribution $P(x, \omega)$, and this results in a larger actual risk, that is, inconsistent ERM. This phenomenon is termed over-fitting [Vapnik 1995]. Although it is not guaranteed, as the number of data points increases the empirical risk will increase and, at the same time, the data will more likely represent the underlying probability distribution, so the actual risk converges to the empirical risk. In this case ERM is said to be consistent, and the pattern recognition system that performed well on the training set will also perform well on unseen problems generated from the same underlying probability distribution $P(x, \omega)$. Figure 3.1 shows a simplified description of the consistency or inconsistency of ERM.

[Figure 3.1: Asymptotic behaviour of the minimum empirical risk and the corresponding expected risk for consistent ERM. The x-axis shows the number of examples and the y-axis the corresponding risk.]

In most real-life pattern recognition problems we are limited to finite (usually small) training data and hence need to make the most out of these data. To handle this situation a number of techniques have been proposed in statistical pattern recognition. One such approach is to leave room for a difference between the empirical risk and the actual risk. Following this, Vapnik [1995] has shown that there exists a bound on the actual risk given the empirical risk and a measure of the capacity of the class of functions. This bound is given as:
\[ R(\alpha) \le R_{emp}(\alpha) + \Phi(h) \tag{3.6} \]
The quantity $h$, the measure of the capacity of a set of functions, is called the VC dimension of the set of functions, and the function $\Phi(h)$ is called the confidence term for the ERM [Burges 1998; Schölkopf 1997]. The VC dimension of a set of functions is defined as the maximum number of points that can be shattered by this class of functions. A given training sample of dimension $d$ is said to be shattered by the set of functions $f_\alpha$ if the functions can correctly classify the training sample under all possible labellings. For example, consider the two-class problem defined above: a set of $l$ points has $2^l$ different possible labellings, and if the set of functions $f_\alpha$ can realise all of them, we say the set of points is shattered by the set of functions $f_\alpha$ [Burges 1998] (see Figure 3.2). A comprehensive discussion of convergence, consistency, VC dimensions and other capacity measures can be found in Vapnik [1995].

[Figure 3.2: Three points in $\mathbb{R}^2$, shattered by oriented lines.]
The VC dimension of the set of oriented lines in $\mathbb{R}^2$ is therefore three [Burges 1998, page 4].

From the inequality (3.6) we can see that, for consistency of the risk minimisation and good generalisation of the pattern recognition system, the confidence term $\Phi(h)$ should be minimised at the same time as the empirical risk. Suppose we only minimised the empirical risk to zero via the principle of ERM: the actual risk would then be bounded by $\Phi(h)$, and a large value of this quantity results in a system with poor generalisation. On the other hand, if $\Phi(h)$ is kept close to zero, the bound on the expected risk is dominated by the empirical risk, which must then be minimised using the principle of ERM.

Consider again the two-class problem defined above. For any $\alpha \in \Lambda$ and $l > h$, and some value $\eta$ such that $0 \le \eta \le 1$, Vapnik and Chervonenkis [1991] proposed an upper bound on the error; based on this proposal, the inequality (3.6) can be rewritten as follows and holds with probability at least $1 - \eta$:
\[ R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\big(\log\frac{2l}{h} + 1\big) - \log\frac{\eta}{4}}{l}} \tag{3.7} \]
The second term in Equation 3.7 grows with the ratio $h/l$. Therefore, to achieve good generalisation, the empirical risk and the ratio between the VC dimension and the number of training samples should be kept to a minimum at the same time. The relation between the two terms on the right-hand side of Equation 3.7 is shown graphically in Figure 3.3.

[Figure 3.3: The optimal classifier needs to find some appropriate structure that minimises both the empirical risk and the confidence term [Vapnik 1995, page 98].]

Burges [1998] pointed out that the empirical risk is a decreasing function of $h$, and that to obtain good generalisation from a limited training set one needs to find the optimal VC dimension. To overcome the problem of choosing the appropriate (finite) VC dimension, Vapnik [1979] proposed the principle of structural risk minimisation. A detailed discussion of generalisation theory and the derivation of Equation 3.7 can be found in Cristianini and Shawe-Taylor [2000].

3.2.2 Structural Risk Minimisation (SRM)

"The principle of structural risk minimisation defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function." [Vapnik 1995, page 95]. To realise this trade-off, both terms on the right-hand side of Equation (3.7) should be minimised. If we select one hypothesis class $\mathcal{S}$, we are left with the task of minimising the empirical risk, since the confidence term is fixed for the selected hypothesis class. (By defining the hypothesis space, the VC dimension $h$ of the set of functions is known, and together with the number of training patterns $l$ we can calculate the second term in Equation 3.7.) Now define a nested structure of hypothesis spaces $\mathcal{S}_1 \subset \mathcal{S}_2 \subset \ldots \subset \mathcal{S}_n \subset \ldots$, with $h_i \le h_{i+1}$. Equation 3.7 also holds for each class in this nested sequence. By computing the respective bound on the actual risk for each subset, we choose the function $f_{\bar{\alpha}}$ in the subset $\mathcal{S}_{\hat{i}}$ that minimises the upper bound on the risk [Schölkopf 1997; Osuna et al. 1997]. This is called the principle of structural risk minimisation [Vapnik 1995]. Choosing this optimal combination (the optimal trade-off between the capacity of the functions and the empirical risk) is the task of the learning algorithm.
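To make the trade-off in Equation 3.7 concrete, the short Python sketch below evaluates the confidence term for a few VC dimensions and adds it to an assumed empirical risk. The sample size, confidence level and empirical risk values are made-up numbers for illustration only.

```python
import math

def vc_confidence(h, l, eta):
    # second term of Equation 3.7: sqrt((h(log(2l/h) + 1) - log(eta/4)) / l)
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

l, eta = 2000, 0.05   # number of training samples and confidence level (illustrative)
for h, r_emp in [(10, 0.20), (100, 0.10), (1000, 0.02)]:
    bound = r_emp + vc_confidence(h, l, eta)
    print(f"h={h:5d}  empirical risk={r_emp:.2f}  bound on actual risk={bound:.3f}")
```

Richer hypothesis classes drive the empirical risk down but inflate the confidence term, which is exactly the structure the SRM principle searches over.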
To find the optimal hypothesis space one can use either of the following approaches: fix the confidence term and minimise the empirical risk, or, for a fixed empirical risk, minimise the confidence term. The support vector algorithm, however, accomplishes both simultaneously [Osuna et al. 1997].

3.3 Linear support vector machine

The linear support vector machine, also known as the maximum margin classifier, is the simplest form of support vector machine. In this section linear support vector machines are presented first and used as an introduction to the basic principles, notation and approaches that are later extended to the more general support vector machine. Support vector machines are inherently binary classifiers; the formulation shown below and in subsequent sections is therefore given for the two-class pattern recognition problem. In later sections we will see how it is extended to multi-class pattern recognition problems. To make the presentation of the two-class classification problem simple and consistent with the conventional mathematical presentation, we use $y_i$ instead of $\omega_i$ as the class label. The complete mathematical formulation can be found in Appendix B; some mathematical steps are omitted from the presentation in this section, and for these the reader is referred to the Appendix.

3.3.1 Linearly separable case (maximum margin classifier)

Given a set of examples $((x_1, y_1), \ldots, (x_l, y_l)) \in \mathbb{R}^d \times \{\pm 1\}$, where $y_i \in \{\pm 1\}$ and $x_i \in \mathbb{R}^d$, assume there exists a set of hyperplanes which totally discriminate the positive examples from the negative ones. This means we can find a pair $(w, b)$ such that:
\[ y_i(x_i \cdot w + b) - 1 \ge 0 \quad i = 1, 2, \ldots, l \tag{3.8} \]
where $w$ is the normal direction, or orientation, of the hyperplane and $b$ is the threshold. The mapping function, usually called the hypothesis, is then given by:
\[ f(w, b) = \mathrm{sign}(w \cdot x_i + b) \tag{3.9} \]
Consider the example given in Figure 3.4. The training samples of the two classes can be perfectly separated by a linear hyperplane; in fact, one can find an infinite number of hyperplanes that accomplish this task, some of which are shown in Figure 3.4. As can be seen from the figure, each of these hyperplanes has zero empirical risk, but we wish to find the one that minimises the right-hand side of Equation (3.7), as discussed in the previous section. From the figure one can take an educated guess that the hyperplane passing through the middle is most likely to give the minimum risk. Defining this hyperplane formally, the optimal hyperplane, the one most likely to minimise the expected risk, is the one that maximises the margin, which is defined as the distance between the examples of the two classes that are closest to this hyperplane.
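As a concrete illustration of the maximum margin idea (not a description of the method used in this research), the following sketch fits a linear SVM to separable toy data and reads off the margin $2/\|w\|$ and the support vectors. The use of scikit-learn, the blobs data and the large C value used to approximate a hard margin are assumptions, not part of the original text.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well separated clusters, so a separating hyperplane exists
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

# a very large C approximates the hard-margin (perfectly separable) formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_vectors_))

# the decision is sign(w.x + b), as in Equation 3.9
print("sign(w.x + b) for the first point:", np.sign(w @ X[0] + b))
print("predicted / true label:", clf.predict([X[0]])[0], y[0])
```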
0 0 1 1 0 0 1 1 00 00 00 11 11 11 00 00 00 11 11 11 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000000000000000 000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000000000000000 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 111111111111111 111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 111111111111111 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 00000000000 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 11111111111 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 
000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 000000000000000000000000000000000 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 111111111111111111111111111111111 Optimal hyperplane Figure 3.4: Optimal separating hyperplane Once the optimal hyperplane is found all the training sets will satisfy equations (3.8) and classification will be based on the sign of equation (3.9). The points that lie on the hyperplane separating the data satisfies the equality: w.xi + b = 0 (3.10) Figure 3.5 gives graphical interpretation of support vector classification. All the points that satisfy the inequality w.xi + b ? ?1 lie on H1 or to the left of it and those satisfying w.xi + b ? ?1 will lie on H2 or to the right of it. The margin is therefore defined as the distance between H1 or H2 and the optimal hyperplane (see figure 3.5). It is evident that H1 and H2 are parallel and for a perfectly separable training set, no point lies between the two hyperplanes. Furthermore, H1, H2 and the optimal hyperplane differ only on the threshold b. Formally defining the optimal hyperplane with respect to these two hyperplanes, the separating hyperplane is optimal if the minimum distance between these hyperplane and the optimal hyperplane is maximal. By computing the margin for a given orientation w, the problem of finding the optimal hyperplane will be defined as: Minimise 1 2 ?w?2 subject to yi(xi.w + b)? 1 ? 
0 ?i Minimising a quadratic function under a linear constraint formulated above is called quadratic programming and can be solved to give the solution to the optimal hyperplane using Quadratic 54 0 0 1 1 0 0 1 1 00 00 00 11 11 11 00 00 00 11 11 11 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000000000000000 000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 111111111111111 111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 000 000 000 000 000 111 111 111 111 111 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 00000000000000 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 11111111111111 0011 H2 H1 w ?b |w| Figure 3.5: Maximum margin hyperplane for linear separable case Programming (QP) optimisation [Cristianini and Shawe-Taylor 2000]. Solving this problem using the classical Lagrangian multipliers approach has a number of advantages [Cristianini and Shawe-Taylor 2000]. Firstly, this approach gives an alternative formulation of the original problem (dual form) which is easier to solve. Secondly, the dual form is not only easier to solve but also emphasises the importance of some training examples over the other, leading to a minimised but critical sample size and thirdly, the dual form simplifies generalisation beyond linear separable cases. Introducing a dual vector of non-negative Lagrangian multiplier ? = (?1, ?2, . . . , ?l) cor- responding to each inequality constraint in (3.8) the constrained minimisation problem given above will be rewritten in the dual form as: Maximise LD(w, b,?) = ? i=1 ?i ? 1 2 ? i,j ?i?jyiyj(xi.xj) subject to l? i=1 ?iyi = 0; (3.11) ? ? 0 The Karush-Kuhn-Tucker(KKT) optimisation theory (The KKT theorem is given in Ap- pendix B.2), which guarantees the existence of a solution to the optimisation problem shows that, at the saddle point all points satisfy the constraint (3.8) with strict equality. i.e. ?i(yi(w.xi + b)? 1) = 0 i = 1, . . . , l (3.12) From equation 3.12, the following two conditions need to be distinguished: ? If ?i = 0, then yi(w.xi + b) ? 1 55 ? If ?i > 0, then yi(w.xi + b) = 1 Recall that one of the advantages of using the Lagrangian function to solve the optimisation problem is expressing the importance of each pattern in the training set. 
Consider the value of ?i corresponding to each training pattern. Training patterns with ?i > 0 will fall on the hyperplane H1 or H2 (see Figure 3.5) and hence are critical in defining the decision boundary. Other training patterns with ?i = 0 lies to the left or right of H1 and H2 respectively. These training patterns have no effect in determining the decision boundary. Therefore, if those training patterns with ?i = 0 value are removed and the training is repeated, the decision boundary will remain the same. Training patterns with nonzero ?i are called Support Vectors (the name of this learning technique follows from this). With the orientation w given by w = ?l i=1 ?iyixi, the mapping function (equation 3.9) can be redefined as: f(x,?, b) = sign( l? i=1 yi?i(x.xi) + b) ?i = 1, . . . , l (3.13) We have seen that the parameter ?i = 0 for all training points except for the support vectors, hence the mapping function will have its final form: f(x,?, b) = sign(? i?SV yi?i(x.xi) + b) ?i = 1, . . . , l (3.14) In other words the expression is evaluated in terms of the dot product between the pattern to be classified and the Support Vectors (SV) (xi), and the sign of the function will be used to classify the pattern to their respective class. 3.3.2 Linearly non-separable case: Soft margin classifier So far we have seen the case where the training data is perfectly separable using linear hyper- plane. However, real-world problems involve non-separable data and the assumption taken in the previous section is too ambitious. To extend the above solution to non-separable data a positive slack variable ?i : i = 1, . . . , i is introduced to associate further cost as a penalty for misclassification whenever necessary (see Figure(3.6)). Using this relaxed separation constraint equation (3.8) becomes: yi(xi.w + b) > 1? ?i ?i ? 0 i = 1, . . . , l (3.15) The problem of finding optimal margin will therefore comprise two parts. ? Maximise the margin (the same as the linear separable case) and ? 
Minimise the slack variable ?i which counts for amount of error 56 0 0 1 1 0 0 1 1 00 00 00 11 11 11 00 00 00 11 11 11 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000000000000000 000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 111111111111111 111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 00000000000000000 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 1111111111111111 1111111111111111 1111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 11111111111111111 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[Figure 3.6: Maximum margin hyperplane for the linearly non-separable case. The figure shows the hyperplanes H1 and H2, the weight vector w, the offset −b/‖w‖, and two points X1 and X2 that violate the margin, with slack values ξ1 and ξ2.]

One way of combining these two conditions into a single function is given below:

    \Phi(w, \xi) = \tfrac{1}{2}\|w\|^2 + C \Big( \sum_{i=1}^{l} \xi_i \Big)^k

The constant C is a parameter chosen freely by the user to specify the trade-off between the width of the margin and the misclassification penalty. The optimal hyperplane is therefore the one that minimises the function Φ(w, ξ), i.e.

    Minimise   \Phi(w, \xi) = \tfrac{1}{2}\|w\|^2 + C \Big( \sum_{i=1}^{l} \xi_i \Big)^k
    subject to   y_i (x_i \cdot w + b) \ge 1 - \xi_i,  i = 1, \dots, l;   \xi_i \ge 0,  i = 1, \dots, l        (3.16)

By choosing k = 1, the above optimisation problem can be solved using QP. Introducing a vector of non-negative dual variables α = (α₁, α₂, ..., α_l) for the first set of constraints and μ = (μ₁, μ₂, ..., μ_l) for the second set, the optimisation problem defined above can be rewritten as:

    Maximise   L_D(\alpha) = \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
    subject to   \sum_{i=1}^{l} \alpha_i y_i = 0;   0 \le \alpha_i \le C        (3.17)

Applying the KKT conditions we have:

    \alpha_i \big( y_i (x_i \cdot w + b) - 1 + \xi_i \big) = 0, \qquad i = 1, 2, \dots, l        (3.18)

From the above equation three different cases need to be distinguished:

• If αᵢ = 0, then μᵢ = C (ξᵢ = 0) and yᵢ(xᵢ·w + b) ≥ 1;
• If 0 < αᵢ < C, then μᵢ > 0 (ξᵢ = 0) and yᵢ(xᵢ·w + b) = 1;
• If αᵢ = C, then μᵢ = 0 (ξᵢ ≥ 0) and yᵢ(xᵢ·w + b) = 1 − ξᵢ.

In the first case, the points are on the correct side of the optimal hyperplane and lie further from it than the margin (i.e. they lie to the left of H1 or to the right of H2 respectively). In the second case, the points lie on the hyperplane H1 or H2 and are Support Vectors. In the third case the points are also support vectors, but do not necessarily lie on the hyperplane H1 or H2; these points may be on the wrong side of the hyperplane, or on the right side but closer to it than the hyperplanes H1 or H2 (for example, X1 and X2 in Figure 3.6).

Apart from the additional constraints above, the solution for the linearly separable case carries over, with the decision function given by:

    f(x, \alpha, b) = \mathrm{sign}\Big( \sum_{i \in SV} y_i \alpha_i (x \cdot x_i) + b \Big)
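To make the role of the constant C concrete, the following minimal sketch (an illustration only, using scikit-learn rather than the SVMlight implementation adopted later in this research, and with synthetic data) trains a soft-margin linear SVM on overlapping data for two values of C. A small C tolerates more margin violations and typically produces more support vectors, while a large C penalises slack heavily.

# Illustrative sketch: the effect of the trade-off constant C on a
# soft-margin linear SVM (scikit-learn assumed; data are synthetic).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian clouds: not perfectly linearly separable.
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")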
So far we have seen that the simplest form of SVM, designed for the linearly separable case, can be extended to accommodate data that are linearly non-separable. However, most real-life classification problems do not involve linear datasets but rather non-linear ones (see Figure 3.7). Hence, we still need to extend the solution to accommodate non-linear datasets. It was stated in the introduction to this chapter that one of the underlying principles that makes SVMs interesting is their ability to transform non-linear data from the input space to a higher-dimensional feature space where the data can be classified linearly. The following section gives a brief introduction to feature spaces and the kernel trick, and shows how these concepts are used to generalise linear SVMs to handle non-linear classification problems.

3.4 Non-linear support vector machine: The Kernel trick

In most real-life problems, linear combinations of the individual measurements cannot fully describe the properties of the patterns under consideration. Hence a more complex representation of these measurements is required, resulting in a complex structured dataset in the input space. As pointed out in the previous chapter, changing the representation of the data from the input space into some feature space has a number of advantages. The feature space discussed in the previous chapter usually has a lower dimensionality than the input space (as a result of feature selection or extraction). In contrast, this section shows how a feature space of higher dimension than the input space can be used to increase the computational power of the linear SVM described in the previous sections without suffering from the curse of dimensionality.

[Figure 3.7: Linearly non-separable training sample.]

Suppose the training set is ((x₁, y₁), ..., (x_l, y_l)) ∈ R^d × {±1} and there is a mapping Φ given by:

    x = (x_1, \dots, x_d) \;\mapsto\; \Phi(x) = (\phi_1(x), \dots, \phi_N(x))        (3.19)

where d < N, and F = {Φ(x) | x ∈ X} is called the feature space.

[Figure 3.8: Mapping Φ from a two-dimensional input space into a two-dimensional feature space where the data can be linearly separated.]

Consider the mapping shown graphically in Figure 3.8. It is easy to see that the data cannot be separated using a linear hyperplane in the input space. However, once the data are projected into the feature space, separation by a linear hyperplane is possible. This simple example demonstrates how mapping a non-linear dataset into a feature space can simplify complex classification problems.
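As a concrete, if artificial, illustration of equation (3.19), the sketch below (assuming scikit-learn; the dataset and the mapping are invented for this example and are not part of the thesis) shows data that no linear hyperplane separates in the input space, and the same data after an explicit mapping Φ(x₁, x₂) = (x₁, x₂, x₁² + x₂²), where a linear classifier succeeds.

# A toy illustration of an explicit non-linear feature mapping
# (scikit-learn assumed; data are synthetic).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 0.4, 1, -1)   # inner disc vs outer ring

linear_in_input_space = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print("input space accuracy  :", linear_in_input_space.score(X, y))

# Explicit feature map Phi(x1, x2) = (x1, x2, x1^2 + x2^2)
Phi = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])
linear_in_feature_space = LinearSVC(C=1.0, max_iter=10000).fit(Phi, y)
print("feature space accuracy:", linear_in_feature_space.score(Phi, y))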
As pointed out in section 3.3.1, one of the advantages of using the classical Lagrangian approach to solve the optimisation problem is that it simplifies the extension of the solution found for the linear SVM to the generic non-linear SVM. Note also from equation (3.13) that the decision function requires the computation of the dot product between the point to be classified and some of the training examples. Therefore, performing the classification in the feature space gives the hypothesis and the decision function the new forms:

    f(w, b) = \mathrm{sign}\big( w \cdot \Phi(x) + b \big)        (3.20)

and

    f(x) = \mathrm{sign}\Big( \sum_{i=1}^{l} y_i \alpha_i \big( \Phi(x) \cdot \Phi(x_i) \big) + b \Big)        (3.21)

The dot product Φ(x)·Φ(xᵢ) can be defined conveniently by introducing a function K(x, y) called a kernel function (more formally, a Mercer kernel, to distinguish it from other kernels used in mathematics). For the definition of kernel functions see Appendix B.2.

The introduction of the kernel function allows us to compute the dot product without explicitly mapping the data into the feature space. To show that the kernel function represents the dot product in the feature space we use Mercer's theorem (see Appendix B.2), which guarantees the existence of a mapping Φ into F for any kernel that is a dot product in some feature space [Ganapathiraju 2002]. To show how kernels are used to transform data implicitly into a higher-dimensional space, consider the following example.

First, let us see how mapping from the input space to a feature space simplifies the classification problem. Consider the target function

    f(x, y) = x^2 y^2

This target function cannot be represented by a linear machine. However, a simple transformation makes it representable by a linear machine: (x, y) → (x', y') = (log x, log y) gives the representation

    f'(x', y') = \log f(x, y) = 2 \log x + 2 \log y

and f' can be learnt by a linear machine.

Now suppose we want a mapping from R² to R⁶. Let us choose the kernel to be:

    K(x, y) = (x \cdot y + 1)^2        (3.22)

and require that this kernel represents a dot product in the feature space:

    (x \cdot y + 1)^2 = \Phi(x) \cdot \Phi(y)        (3.23)

Let x = (x₁, x₂) and y = (y₁, y₂). This implies

    (x \cdot y + 1)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 y_1 x_2 y_2 + 1

Now if we define

    \Phi(x) = \big( 1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2 \big)

it can easily be shown that equation (3.23) holds. Note that the above example can easily be generalised to higher-dimensional spaces.

Having introduced the kernel representation of the dot product in the feature space, the decision function (3.21) can be rewritten using kernel functions as:

    f(x) = \mathrm{sign}\Big( \sum_{i \in SV} y_i \alpha_i K(x, x_i) + b \Big)        (3.24)

Note that, using a kernel function, SVMs can classify non-linear data without the need to transform the input space explicitly into a high-dimensional feature space. This strategy also avoids the curse of dimensionality, which usually arises when the dimensionality increases for a fixed sample size. Also note that only the support vectors are involved in the decision function.
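The identity in equation (3.23) can also be checked numerically. The short sketch below (illustrative only; the test vectors are arbitrary) evaluates the degree-2 polynomial kernel directly in the input space and compares it with the dot product of the explicit six-dimensional feature vectors Φ(x) and Φ(y) defined above; the two values agree.

# Numerical check of K(x, y) = (x.y + 1)^2 = Phi(x).Phi(y)  (illustrative only).
import numpy as np

def phi(v):
    """Explicit feature map R^2 -> R^6 for the degree-2 polynomial kernel."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.7, -1.3])
y = np.array([2.0, 0.5])

kernel_value = (np.dot(x, y) + 1.0) ** 2     # computed in the input space
explicit_value = np.dot(phi(x), phi(y))      # computed in the feature space
print(kernel_value, explicit_value)          # both print 3.0625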
Although kernel functions can be chosen freely, by choosing some of the most commonly used kernels SVMs can represent other known classifiers, often with better performance [Osuna et al. 1997]. Some of these kernels are:

• Polynomial kernel of degree d:
      K(x, y) = (x \cdot y + 1)^d        (3.25)
  where d is user-defined.

• Gaussian radial basis function (RBF):
      K(x, y) = \exp\big( -\gamma \|x - y\|^2 \big)        (3.26)
  where γ is user-defined.

• Sigmoid kernel (two-layer neural network):
      K(x, y) = \tanh\big( k (x \cdot y) + \theta \big)        (3.27)
  where k is the gain and θ is the offset; both parameters are user-defined.

Ganapathiraju [2002] reports that although convergence with the RBF kernel is the most expensive, this kernel is very powerful and can model a classifier that is effective when the data points of one class are totally encapsulated by the other. An example of classification using polynomial and RBF kernels is given in Figure 3.9.

Other additional kernels (usually used in function estimation) include [Vapnik 1995]:

• Regularised Fourier kernel (weaker mode of regularisation). For the one-dimensional case:
      K(x_i, x_j) = \frac{\pi}{2\gamma} \, \frac{\cosh\frac{\pi - \|x_i - x_j\|}{\gamma}}{\sinh\frac{\pi}{\gamma}}        (3.28)
  where 0 ≤ ‖xᵢ − xⱼ‖ ≤ 2π and γ is user-defined. For the multidimensional case, K(x, y) = \prod_{k=1}^{d} K_k(x^k, y^k).

• Regularised Fourier kernel (strong mode of regularisation). For the one-dimensional case:
      K(x_i, x_j) = \frac{1 - \gamma^2}{2\big( 1 - 2\gamma \cos(x_i - x_j) + \gamma^2 \big)}        (3.29)
  where 0 ≤ ‖xᵢ − xⱼ‖ ≤ 2π and γ is user-defined. For the multidimensional case, K(x, y) = \prod_{k=1}^{d} K_k(x^k, y^k).

[Figure 3.9: Decision surface given by (a) a polynomial kernel and (b) an RBF kernel. Support vectors are indicated by dark filled points [Osuna et al. 1997].]

3.5 Multi-class classification

SVMs are inherently binary classifiers. The binary SVM discussed in the previous sections uses the discriminant function given in equation (3.20). However, many classification problems are not binary but multi-class: the inputs are classified into more than two groups according to their underlying properties. To solve such problems, the binary SVM has to be extended to handle multi-class classification. The multi-class SVM uses a set of discriminant functions f_ω : X → R, defined as

    f_\omega(x) = \mathrm{sign}\big( w_\omega \cdot \Phi(x) + b_\omega \big)

where X ⊆ R^d and ω ∈ Ω = {ω₁, ω₂, ..., ω_c}, evaluated with the same performance evaluation technique. In multi-class SVM classification the decision function has the form:

    f(x) = \arg\max_{\omega \in \Omega} f_\omega(x)        (3.30)

To implement this, a number of approaches have been proposed in the last decade. Some of these are: one-against-the-rest, presented in Schölkopf [1997]; classification by pairwise coupling, proposed by Hastie and Tibshirani [1996] based on an idea by Friedman [1996]; and the multi-class SVM proposed by Weston and Watkins [1998]. In this section a brief overview of these three approaches is given; for a detailed presentation refer to the cited references.

In the one-against-the-rest approach, a k-class classifier is modelled by training k different binary SVMs, each discriminating the members of one class from the rest. This is done by relabelling the training data, assigning, say, 1 to the class under consideration and −1 to the remaining members of the training set. A new data point is classified by testing it with all k SVMs, and the final class is assigned based on equation (3.30). This technique is easy to implement, is the most widely used, and gives respectable results. It does, however, have some limitations [Schölkopf 1997]. Two of these are variation in the output ranges of the different SVMs and the lack of mutual exclusion of class membership. However, both limitations have well-formulated solutions; more on the limitations and the proposed solutions can be found in Lee et al. [2001] and Mayoraz and Alpaydin [1999].
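A minimal sketch of the one-against-the-rest scheme is given below (an illustration assuming scikit-learn and its bundled iris data, not part of the thesis experiments): k binary SVMs are trained, each separating one relabelled class from the rest, and a new point is assigned by the argmax rule of equation (3.30).

# One-against-the-rest multi-class classification, sketched with scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)                       # k = 3 classes here

binary_svms = []
for c in classes:
    relabelled = np.where(y == c, 1, -1)     # class c versus the rest
    binary_svms.append(SVC(kernel="linear", C=1.0).fit(X, relabelled))

# Classify one point: argmax over the k continuous decision values.
x_new = X[60:61]
scores = [svm.decision_function(x_new)[0] for svm in binary_svms]
print("predicted class:", classes[int(np.argmax(scores))])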
Classification by pairwise coupling also uses a number of binary classifiers to accomplish the multi-class classification. However, unlike the one-against-the-rest approach, here the binary classifiers are trained to perform pairwise classification. This means that to perform k-class classification we need k(k−1)/2 binary classifiers, each performing a pairwise classification. Note also that in this approach each binary classifier is trained on a subset of the training data; for example, if the training data is equally balanced between the classes, a fraction 2/k of the data is available to train each binary SVM. To classify an unseen pattern, the pattern is tested against each pair of classes and is assigned to the class that satisfies the voting criterion. Methods of evaluating the voting criterion are briefly described in Hastie and Tibshirani [1996].

The third approach is the multi-class SVM proposed by Weston and Watkins [1998]. In this approach the mathematical formulation found for the binary SVM is extended to handle multi-class classification by considering all the classes at once.

3.6 SVM implementation

Recall that the KKT conditions (see Appendix B.2) guarantee an optimal solution to the constrained optimisation problems formulated in equations (3.11) and (3.17). As can be seen from these formulations, obtaining an optimal solution requires computing the dot product (xᵢ·xⱼ) between all possible pairs of training samples. For a classification task with a large number of training samples, and/or where each training sample is represented by a large number of features, computing the dot-product matrix and keeping the result in memory is both processor and memory intensive, and sometimes a specialised computing resource might be needed. A simple approach to solving the memory problem is to compute the dot products on demand; however, this makes the computation even more CPU intensive.

To make this processor- and/or memory-intensive computation efficient, a number of approaches have been proposed, based on the principle that a global solution can be obtained by solving a smaller sub-problem at a time. They differ in the way they define the sub-problem. In the remainder of this section an overview of the different algorithms is presented.

3.6.1 Chunking

Chunking, proposed by Boser et al. [1992], is the first approach considered for solving the SVM learning problem with a large training set. It is based on the idea of dividing the optimisation problem into small sub-problems (chunks) that can be solved efficiently. Training with this algorithm proceeds as follows. Training starts with one randomly chosen sub-problem, and examples that do not obey the KKT conditions are then added iteratively. Only the support vectors found in the training stage of one sub-problem are carried over to the next stage. At the end of the optimisation process, when the KKT conditions are met, only the appropriate support vectors remain. Osuna et al. [1997] proved that this approach gives the same global optimal solution and that it takes fewer resources to converge.
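The outer loop of chunking can be sketched roughly as follows. This is an illustration only: real implementations such as SVMlight solve the dual sub-problem directly, whereas here scikit-learn's SVC stands in as the sub-problem solver, the data are synthetic, and the stopping test is simplified.

# A rough, simplified sketch of the chunking idea (illustrative only).
# Labels are assumed to be in {-1, +1} and the initial chunk is assumed
# to contain examples of both classes.
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=100, max_rounds=20):
    rng = np.random.RandomState(0)
    work = rng.choice(len(X), size=min(chunk_size, len(X)), replace=False)
    model = None
    for _ in range(max_rounds):
        model = SVC(kernel="linear", C=1.0).fit(X[work], y[work])
        # Points in the full set that violate the margin (KKT violators).
        margins = y * model.decision_function(X)
        violators = np.where(margins < 1 - 1e-3)[0]
        new = np.setdiff1d(violators, work)[:chunk_size]
        if len(new) == 0:
            break                                    # no violators left
        # Keep the current support vectors and add a batch of violators.
        work = np.union1d(work[model.support_], new)
    return model

X = np.vstack([np.random.RandomState(1).randn(500, 2) + 1.5,
               np.random.RandomState(2).randn(500, 2) - 1.5])
y = np.hstack([np.ones(500), -np.ones(500)])
print("support vectors in final model:", chunked_svm(X, y).n_support_.sum())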
Further extensions of this approach are the decomposition method proposed by Osuna et al. [1997] and Sequential Minimal Optimisation (SMO) proposed by Platt [1998]. The decomposition method is based on selecting a working set when there are more support vectors than can be handled in memory; this working set is selected in such a way that it is big enough to hold all the support vectors, but small enough to be handled by the computer [Osuna et al. 1997]. SMO is a chunking algorithm with a working set of just two points [Platt 1998].

It is evident that chunking has a computational advantage over the naive approach when the training data is large, and this advantage is extended further by the decomposition method and SMO [Kroon 2003].

3.6.2 Shrinking

The linear memory requirement, proportional to the number of training examples and support vectors, has many advantages. However, the algorithm may take longer to converge. To overcome this potential disadvantage, Joachims [1999] proposed shrinking. Shrinking is based on predicting data points that will not be support vectors and eliminating these points during optimisation. Eliminating these data points results in a smaller sub-problem (a shrinking of the optimisation problem). Furthermore, this algorithm uses the concept of "Bounded Support Vectors" (BSV), which are support vectors with αᵢ at the upper bound C. To identify these points the algorithm uses a heuristic approach, studying the behaviour of the points over a number of iterations. On each iteration only those points which behave consistently are removed, which consequently shrinks the original problem [Kroon 2003].

3.6.3 Caching

The optimisation process requires evaluating the kernel matrix in each iteration, which is the most expensive step. One heuristic approach to speed up this process is caching kernel evaluations, which trades off memory consumption against training time [Joachims 1999]. Caching is a well-known technique for speeding up memory-intensive computations and is widely used in many applications of computer science. There are a number of caching strategies, such as First In First Out (FIFO), Least Recently Used (LRU), optimal replacement, and so on. The caching strategy used in SVM implementations is LRU, which, whenever the cache is full, replaces elements that have not been used for a number of iterations.

3.7 Comparison of SVMs to other statistical techniques

In recent years a large variety of pattern recognition techniques have been applied to a wide variety of problems. These include decision trees, neural networks, logistic regression and SVMs. Even though the choice of a particular technique depends on the problem at hand, some techniques are reported to have a number of theoretical advantages over the others. One such technique is the SVM. Many researchers have pointed out the significant advantage of SVMs in a wide variety of benchmark applications. Furthermore, researchers in the field of computational biology have also consistently shown the outstanding performance of SVMs.

In the last couple of sections a detailed theoretical background and an overview of the implementation of SVMs were presented. Furthermore, in Chapter 2 the basic background of decision trees and neural networks was presented. In the remainder of this section a theoretical comparison showing some of the advantages and disadvantages of SVMs compared to decision trees and/or neural networks is presented.
Linear or non-linear model

A true challenge for most pattern recognition techniques is how they handle non-linear, real-life problems. While some techniques can handle non-linear classification directly, others need to approximate, or use some technique to convert the non-linear classification into a linear classification problem.

Decision trees handle non-linear classification problems by approximating the problem with piecewise linear classification [Breiman et al. 1984]. Similar to decision trees, SVMs cannot handle non-linear data directly. To address the problem, SVMs map the non-linear data into a high-dimensional feature space where it can be classified linearly (or, more specifically, SVMs use a kernel function to classify non-linear data without the need for an explicit mapping). Neural networks, however, do not have this problem and address non-linear classification directly, using multiple hyperplanes with the help of different activation functions such as sigmoid, Gaussian or radial basis functions [Jain et al. 2000].

Error minimisation

As pointed out in section 3.2, all pattern recognition techniques aim to increase the performance (minimise the error) of the classifier based on the limited training sample available. It was also pointed out in section 3.2 that there are two principles of risk minimisation, namely empirical risk minimisation and structural risk minimisation. Both neural networks and decision trees use the principle of empirical risk minimisation and are therefore likely to suffer from over-fitting [Jain et al. 2000], whereas SVMs use the principle of structural risk minimisation.

To address the over-fitting problem, decision trees and neural networks have adopted different techniques. Decision trees use two different techniques: the first is to stop growing the tree when a split is not statistically significant [Breiman et al. 1984]; the second is to construct the tree fully and then prune the fully grown tree upwards, considering the subtree with the minimum loss of accuracy [Breiman et al. 1984; Quinlan 1993]. One can address the over-fitting problem in neural networks either by choosing an appropriate number of hidden layers or by stopping the training process before the network is fully trained. Even though the suggested techniques for both decision trees and neural networks help to minimise the possibility of over-fitting, the problem remains as a result of the basic principle of risk minimisation. Furthermore, there is no robust way of selecting a criterion for when to stop the training process so that the network is not over-trained, or when to stop pruning.

As mentioned in section 3.2.2, SVMs use the principle of structural risk minimisation to avoid over-fitting. However, over-fitting is not absolutely avoided. One factor that can cause over-fitting in SVMs is the selection of a poor kernel function, which results in a number of support vectors comparable to the size of the training sample. To address this problem, new data-driven kernel selection techniques have been proposed by Sollich [2000], which give an intuitive guideline for choosing a good kernel function.
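This effect is easy to demonstrate. In the sketch below (illustrative, assuming scikit-learn and synthetic data), an RBF kernel with an extreme bandwidth parameter turns almost the entire training set into support vectors, the warning sign of over-fitting mentioned above, whereas a moderate setting does not.

# A poorly chosen kernel parameter makes almost every point a support vector.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.sign(X[:, 0] + 0.5 * rng.randn(200))      # noisy linear concept

for gamma in (0.5, 1000.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:>7}: support vectors = {clf.n_support_.sum()} "
          f"of {len(X)} training points")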
Global vs local solution

Depending on the approach taken to solve the optimisation problem, pattern recognition techniques give either a global or a local minimum. Neural networks use gradient descent techniques, which give a local minimum [Vapnik 1995]. SVMs, on the other hand, use a quadratic programming technique which is guaranteed to give the global minimum [Burges 1998] (see also the previous sections).

Multi-class classification

Most real-world problems are not restricted to binary classification. Therefore, one needs to consider how multi-class classification is handled when comparing the different techniques. While both decision trees and neural networks can be trained directly to perform one-to-n classification, SVMs can only perform one-to-one (binary) classification. This means that for SVMs to perform multi-class classification one needs to train n different classifiers, each performing a binary classification, which is computationally demanding.

Performance comparison

Decision trees, neural networks and SVMs have all been applied to many classification problems, and the latter two have been the most competitive techniques in recent years. To show the dominant performance of SVMs over decision trees and neural networks, consider Table 3.1, which shows selected results from Meyer et al. [2002].

    Problem         SVMs    Decision trees    Neural networks
    BreastCancer    3.14          5.51              24.27
    tictactoe       0.14          8.24              33.97
    chess           0.49          3.20              39.73
    titanic        21.16         21.48              66.90

    Table 3.1: Mean error on a test set [Meyer et al. 2002]

The results in Table 3.1 show that, despite the minor limitations discussed above, SVMs outperform the other two classification techniques.

3.8 Previous work

3.8.1 SVMs benchmark applications

So far we have seen the theoretical background of SVMs and their comparison with other techniques. In this section, we will see the application of SVMs as a tool for pattern recognition.

Text classification

The volume of information available in electronic format has increased enormously as a result of the fast-growing Internet. This fast-growing volume of electronic information has urged many to look for better ways to organise these resources so that particular information can be found and filtered effectively and efficiently. Text classification, or categorisation, can be defined as a pattern recognition problem which classifies natural-language text into one or more predefined classes according to its content [Dumais et al. 1998]. So far, much text classification work has been done manually, which has created a bottleneck in efficiency and effectiveness for dynamic information. There are a number of systems already in use to organise electronic documents in such a way that filtering and searching can be done quickly and accurately. The speed and accuracy of such systems have a very high impact on their acceptance and usefulness. To increase the speed and accuracy of such systems, and in the development of new systems, many have proposed machine learning techniques. Dumais et al. [1998] and Joachims [1998] have shown the performance of machine learning techniques (naive Bayes classifiers, k-nearest neighbours, decision tree classifiers and SVMs) in text classification in terms of the precision/recall break-even point. In addition, Sun et al. [2002] have used SVMs to classify web documents. In the next paragraphs, we will see the application of SVMs to text classification in more detail, with particular emphasis on concepts such as feature extraction, the kernels used, and the results obtained.

Joachims [1998] conducted his experiment on 12902 documents (75% training and 25% testing) from the Reuters-21578 dataset and 20000 documents (50% training, 50% testing) from the Ohsumed corpus.
To represent the text documents, Joachims [1998] used the frequency of appearance of the words in the document, with the restriction that the frequency be greater than two in order to avoid unnecessarily large feature vectors. Having represented the documents, further features were selected using a feature selection method called information gain, which ranks all the words according to their information gain and selects those with the highest mutual information. Similarly, Dumais et al. [1998] used the same number of documents from the Reuters-21578 collection² and used binary feature values (occurs or does not occur) to reduce the feature space. Both Joachims [1998] and Dumais et al. [1998] conducted their experiments with a number of different machine learning techniques: naive Bayes classifiers, k-nearest neighbours, decision tree classifiers and SVMs. Joachims [1998] used both polynomial kernels of degree d = 1, 2, 3, 4, 5 and RBF kernels with γ = 0.6, 0.8, 1.0, 1.2, while Dumais et al. [1998] used only a linear SVM. The results of both experiments confirmed that the SVM is superior in performance and training time on the given training sets. Furthermore, Sun et al. [2002] extended the text classification technique to web classification and showed that SVMs outperform the FOIL-PILFS³ algorithm.

² Owing to the different times at which the experiments were performed, the Reuters collections used by Joachims [1998] and by Dumais et al. [1998] are different.
³ FOIL-PILFS is an algorithm designed to learn rules that use predicates based on naive Bayesian models of text instead of keyword tests.

Microarray gene expression analysis and cancer tissue classification

In the previous section, we saw how SVMs are applied to text recognition, and it was emphasised that SVMs perform better than other machine learning techniques when used as a pattern recognition tool. This section shows how SVMs are used to analyse microarray gene expressions [Brown et al. 2000] and to classify cancer tissue samples based on microarray expressions [Furey et al. 2000].

As the amount of data from different microarray hybridisation experiments becomes huge, the need for a means of extracting biological significance and classifying genes according to their functional class is increasingly becoming imperative [Brown et al. 2000]. There are a number of approaches available to handle such a task. Most of these approaches use a clustering algorithm based on the similarities between expression patterns to group genes. However, these approaches have a number of limitations, which were the motivation for the research by Brown et al. [2000]. They used SVMs to perform microarray gene expression analysis, and pointed out that SVMs have the advantage of using a large set of similarity-computing functions simultaneously and the ability to use prior knowledge about the true functional classes of the genes. The experiment was based on 2,467 yeast genes, which were selected based on the availability of accurate functional annotation. These data were converted into 79-element gene expression vectors, based on the results of n = 79 experiments on single chips, each element being obtained by dividing the expression level of the gene in the varying condition of interest by the expression level of the gene in some reference condition⁴. Having represented the data in the appropriate vector form, the set of genes that have a common functional class are labelled positive, and the rest negative.

⁴ X_i = \log(E_i / R_i) / \sqrt{ \sum_{j=1}^{79} \log^2(E_j / R_j) }, where E is the expression level in the condition of interest and R the level in the reference condition.
Based on these examples the SVM is trained to discriminate the positive from the negative examples. To carry out the learning task, polynomial and Gaussian kernels were used. One problem identified was that the set of positive examples was small compared to the negative ones, and hence the positive examples tended to be treated as noise rather than as examples belonging to a separate class. To handle this problem Brown et al. [2000] modified the kernel values during the support vector optimisation by adding the ratio of positive examples, multiplied by a scaling factor, to the diagonal elements of the matrix defined by the kernel function. The results showed that higher-degree polynomial and Gaussian kernels give superior results compared to previous analysis techniques.

As a continuation of the research by Brown et al. [2000], Furey et al. [2000] applied SVMs to classifying cancer tissues based on microarray expression data. The feature extraction criterion used differs from that of Brown et al. [2000]: Furey et al. [2000] used the ratio of the difference between the mean of the positive examples (cancer tissue) and the mean of the negative examples (normal tissue) to the sum of their standard deviations⁵, which helped discriminate between the two classes. Genes with the highest scores of this ratio were extracted. Unlike Brown et al. [2000], Furey et al. [2000] used only a simple dot-product kernel, and were not only able to classify tissues into the appropriate class but were also able to detect outliers in the example set. The result was confirmed using biological experiments.

⁵ F(X_j) = (\mu^{+}_j - \mu^{-}_j) / (\sigma^{+}_j + \sigma^{-}_j), where μ and σ denote the mean and standard deviation of gene j over the positive (+) and negative (−) examples.

3.8.2 Other machine learning techniques in predicting drug resistance

In the previous sections, we have seen the power of SVMs as a tool for pattern recognition and some applications of SVMs to bioinformatics and real-world pattern recognition problems. In this section, we will see how other machine learning techniques have been used to predict the drug resistance of HIV mutants. Among the many studies carried out on this particular topic, I will briefly present the methods used, the features targeted, the results achieved and the limitations of the research by Lathrop et al. [1999] and by Draghici and Potter [2003]. Draghici and Potter [2003] took two different approaches: structure-based data mining and sequence-based data mining. Here we are interested in the sequence-based method, which is based on the amino-acid sequence rather than on structural features.

Lathrop et al. [1999] used a rule-based expert system built on a set of 55 rules extracted from the literature on mutation and drug resistance, and used both the RT and PRO portions of the POL gene to predict drug-resistant and nearby drug-resistant mutants over all FDA⁶-approved HIV drugs (11 at the time of the research), in order to suggest an optimal combination of drugs for a patient under therapy. Draghici and Potter [2003], on the other hand, used neural networks to predict resistance to Saquinavir using HIV protease amino-acid sequences, so that an HIV mutant is classified as having low, medium or high resistance. The first step taken in both experiments was to extract features that can represent the input sequence, where both used the codons (433 codons in RT and 99 in PRO). Lathrop et al. [1999] extracted the codons that are targeted by the approved drugs, which reduced the input to 31 different codon positions (20 in RT and 11 in PRO).

⁶ Food and Drug Administration.
Draghici and Potter [2003] extracted all the codons in PRO and assigned each a value between 0 and 1 based on its difference from the wild type⁷ (HXB2). Moreover, two codons which do not show any variation from the wild type were eliminated, resulting in 97 codon positions in the input vector. Each input vector was then converted into a normalised numeric pattern to make it suitable for neural network processing.

⁷ "A value is assigned between 0 and 1 in n equal increments, where n is the number of different mutations from the wild type" [Draghici and Potter 2003, page 103].

Having extracted the relevant features, Lathrop et al. [1999] applied the set of 55 rules, each of the form IF <antecedent> THEN <consequence> WITH <weight>, with a high level of confidence, where the weight varies from 0.1 (low resistance) to 1.0 (high resistance). Based on the rule weights, the current resistance of the mutant is calculated. Nearby resistant mutants were predicted by applying the rules to newly generated sequences derived from those sequences that originally did not trigger the rules. The neural network approach taken by Draghici and Potter [2003] was different from the rule-based expert system. In their research, Draghici and Potter [2003] used thirty-six network architectures with different output matrices (12 × 12 down to 3 × 3), different learning rates (0.9 to 0.5) and different initial neighbourhoods. Each network was trained 32 times with 75% of the data and 10 iterations (except one).

The results found by both research approaches confirm that drug resistance can be predicted from the protein sequence of the HIV mutant. Results found by Lathrop et al. [1999] showed up to a 25-fold reduction in viral load in patients who completed one year of therapy with the optimal drug combination recommended by the rule-based expert system. Likewise, the results from Draghici and Potter [2003] showed 69% coverage and 68% accuracy for a single network (8 × 8 output matrix, learning rate 0.6 and 8 neighbourhoods), and as much as 85% average coverage and 78% accuracy over multiple networks. Although both research approaches showed positive results, there are some limitations: the rule-based expert system needs continuous maintenance, with the addition of new rules as new mutants are discovered, and the neural network approach is theoretically hard to verify.

3.9 Summary

In this chapter a detailed background and the mathematical formulation of SVMs were presented. The two principles of error minimisation were covered in depth in the first section of the chapter. The simplest form of SVM, the maximum margin classifier, was described in detail and was also used to introduce a number of new terms and techniques that were later used to address the more complex model of SVMs, the non-linear SVM. This chapter also covered how the inherently binary classifier is extended to multi-class classification, and gave an overview of the implementation details of SVMs. Finally, a comparison of SVMs with decision trees and neural networks was given, followed by some relevant previous work on SVMs as a pattern recognition tool and on HIV drug resistance prediction. In addition to the SVM background and benchmark applications given so far, the next chapter will motivate why SVMs are a particularly suitable choice for the pattern recognition problem this research addresses.
Chapter 4

The Experiment

4.1 Introduction

In the previous chapters, a detailed background on HIV biology and a brief discussion of how this problem is best solved using pattern recognition techniques were presented. Different approaches to pattern recognition, with particular emphasis on statistical pattern recognition and on SVMs as a classification algorithm, were discussed. The different pattern recognition techniques discussed in Chapter 2 and Chapter 3 have been applied in a number of bioinformatics applications and all of them have produced good results. However, depending on the type of problem at hand, the number of training patterns and the dimensionality of the problem, one technique outperforms another, with no one technique showing absolute superiority. A brief theoretical comparison of some of these techniques was also presented. The results extracted from previous research (Table 3.1) and the theoretical comparison presented give reasons why SVMs are likely to perform better than the traditional pattern recognition techniques. This research compares different pattern recognition techniques on the HIV drug resistance problem, but it is not the intention of this research to rank the different techniques on this problem. The main objective of this work is to assess the performance of SVMs as a tool for HIV drug resistance prediction.

In this chapter, a recap of the different techniques of HIV drug resistance testing and their limitations, why the problem is well suited to being solved as a pattern recognition problem, and some of the benefits of using SVMs as a classification technique are presented before the research question is formulated. Performance evaluation of classification algorithms can be based on a number of different metrics; this chapter discusses the approach taken to measure the performance of the different algorithms and motivates why that approach was chosen. One of the major issues in a pattern recognition problem is the input encoding technique: input encoding has a major effect on the performance of the classifier, and the input data and encoding are also discussed in this chapter. Moreover, this chapter gives a detailed background on different model selection, error estimation and generalisation methods.

In the next section the research question is motivated and the corresponding research hypothesis is formulated. Section 4.3 gives a description of the implementation of the different classification algorithms used for this comparative study. Section 4.4 gives a detailed background on different approaches to model selection, error estimation and generalisation. Section 4.5 gives the evaluation criteria for this research and motivates why a particular evaluation criterion was adopted. Section 4.6 describes the data to be used, motivates why this data is used, and presents the selected input encoding scheme. Section 4.7 gives an overview of the approach to be taken. Finally, Section 4.8 concludes the chapter.

4.2 Research question

The AIDS epidemic has already caused major economic, political and social problems all over the world, especially in developing countries. Although there are a number of reasons for failure in AIDS treatment, the high rate of mutation in the genetic code of the HIV virus is considered the major cause and is the main focus in combating the epidemic.
Mutations are frequently exhibited in the genetic codes encoding the viral reverse transcriptase (RT) or protease (PRO) proteins, and some of these mutations are known to confer drug resistance. Since not all mutations cause drug resistance, the identification of those particular mutation points known to confer drug resistance is advantageous both during drug design and during the clinical therapy of an HIV patient. The two ways of accomplishing this are phenotypic and genotypic testing. In genotypic testing the viral reverse transcriptase and protease are sequenced and checked for the existence of mutations known to confer drug resistance, and hence the drug resistance behaviour of the HIV strain upon therapy is predicted indirectly.

The two approaches to drug resistance testing mentioned above have their advantages and disadvantages relative to each other (for more detail see section 2.3.4). One of the advantages of the genotypic approach is that it is well suited to computerisation. This advantage is the reason for the involvement of non-biological disciplines in addressing the drug resistance problem. Some of the techniques adopted from other disciplines are expert systems and a number of pattern recognition techniques. The proposed and already implemented systems for predicting drug resistance behaviour, based on the sequenced reverse transcriptase or protease of an HIV strain extracted from a patient, have already improved the pharmaceutical therapy of patients. However, these systems are still works in progress and have a number of both biological and computational limitations.

Firstly, most of the existing applications for predicting HIV drug resistance are based on rule-based algorithms [Ravela et al. 2003]. These applications, which are designed to predict the drug resistance of an HIV strain based on the sequenced reverse transcriptase or protease of the virus, rely heavily on the biological knowledge available about the drugs and on the sets of mutation points known to confer resistance to these drugs. Therefore, the steady growth in the number of mutation points causing drug resistance, due to failed retroviral therapy, negatively affects the performance of these systems. To accommodate the effect of newly discovered mutations and to incorporate newly discovered biological information (about the drugs and about new mutation points known to confer drug resistance), redesigning the systems might be required. Furthermore, as presented by Ravela et al. [2003], this approach has resulted in different interpretations of the biological knowledge, causing discordance between the different interpretation algorithms. Secondly, the growing number of mutation points, the dimensionality of the sequenced reverse transcriptase and protease genetic codes, and the commonality of noise in most biological data, together with the cost and politics of data generation, leave these techniques to make the most of a limited (usually small) set of training samples. As mentioned before, many of the existing techniques (based on neural networks and decision trees) rely on probabilistic density estimation as a statistical tool for prediction. This approach leaves these techniques to suffer from the "curse of dimensionality", owing to the large difference between the dimensionality of the data and the number of available training samples (high dimensionality and limited samples).
Furthermore, most of these techniques use the principle of empirical risk minimisation (presented in section 3.7), which makes them highly vulnerable to over-fitting. Thirdly, the discovery of additional drugs will also impose additional restrictions on these systems.

To overcome the computational limitations specified above, there are a number of well studied computational approaches that could be considered during model design. Some of these approaches are reducing the dimensionality of the patterns, introducing steps to avoid over-fitting during the training process, and incorporating advanced techniques that enable these methods to accommodate newly discovered mutation points. One can also explore different pattern recognition techniques that are theoretically known to address the computational limitations specified above and that have shown good performance in other (related) pattern recognition tasks. This research considers SVMs, a pattern recognition technique which has been outperforming other pattern recognition techniques in many benchmark applications and computational biology problems. As specified in the previous chapter, the principle of structural risk minimisation, the fact that only the support vectors determine the classification model, and the kernel trick have enabled SVMs to outperform other pattern recognition techniques in a number of classification problems where the available samples are limited, high-dimensional and noisy. Moreover, SVMs are a pattern recognition technique with a strong mathematical foundation for achieving good generalisation while maintaining high classification accuracy [Ganapathiraju 2002]. For the above-mentioned reasons, SVMs have recently been applied as a classification tool in a number of pattern recognition problems. In section 3.8.1 a couple of applications of SVMs to problems in bioinformatics and computational molecular biology were presented. Additional applications of SVMs in this domain can be found at http://www.support-vector.net/bioinformatics.html.

In section 1.4 we saw how HIV drug resistance prediction maps to a general pattern recognition problem. Therefore, this research explores the performance of SVMs in predicting the drug resistance behaviour of an HIV strain extracted from a patient to the different reverse transcriptase and protease inhibitors, based on the genetic codes of the viral reverse transcriptase and protease respectively. Furthermore, we want to explore the possibility of designing a model without incorporating the already existing biological knowledge, so as to have a model capable of accommodating mutations not yet discovered but contributing to the drug resistance behaviour of the strain. Following the above argument, the research question for this study can be posed as follows:

Given a set of drug resistance data pairs S = {(x₁, y₁), (x₂, y₂), ..., (x_l, y_l)}, where x is a valid genetic sequence of the HIV reverse transcriptase or protease for an HIV strain and y ∈ {Susceptible, Resistant} is the drug resistance behaviour of the corresponding strain for the drug under investigation: can SVMs learn from these examples and predict the drug resistance of unseen HIV strains?

To answer this question, the performance of SVMs was evaluated based on the accuracy of classification and on the ability to trade off sensitivity against specificity (the performance evaluation criteria are presented in section 4.5).
Evaluating the performance of SVMs alone would fall short of answering the research question. Therefore, in addition to SVMs, a number of different pattern recognition techniques, namely decision trees, neural networks and logistic regression methods, were investigated. Previous research by Beerenwinkel et al. [2002] showed the capacity of decision trees for predicting HIV drug resistance based on the genetic sequence of the viral protease or reverse transcriptase. Similarly, Draghici and Potter [2003] and Wang and Larder [2003] showed the capacity of neural networks for HIV drug resistance prediction, using the genetic sequence of the viral protease and reverse transcriptase and, respectively, the presence of mutations at specific positions on the viral protease known to confer drug resistance. For these reasons, the answer to the research question is affirmative or otherwise based on the performance of SVMs in comparison to decision trees, neural networks and/or logistic regression: if SVMs perform equally well or better than these techniques, the research question is answered in the affirmative.

The performance of the different classification algorithms was evaluated using the same performance evaluation technique. The best configuration for each of these algorithms was tuned, and finally a comparison of performance was made. The search for the best configuration for the different algorithms resulted in different data pre-processing steps for the different classification algorithms; however, this did not compromise the performance comparison, since the focus of the research is assessing the performance of SVMs in comparison to the other popular pattern recognition techniques, not ranking the algorithms according to their performance.

4.3 Classification algorithms description

As presented in Chapter 2 and Chapter 3, pattern recognition/classification techniques can be categorised differently based on the underlying background, the approaches taken, the assumptions made, and so on. Furthermore, a single classification technique can in turn be implemented in a number of different ways, each making distinct design decisions and assumptions based on the underlying principles of the technique. Some implementations are designed to be more efficient on large data sets, while others are implemented to have fast training times. The underlying background that led to the different techniques was presented in the previous chapters. In this section, a brief description and a high-level comparison of the different classification algorithms used in this research are presented. The following is the list of algorithms used in this research:

1. Support Vector Machines (SVMs): SVMlight [Joachims 1999]
2. Neural Networks: feedforward multilayer perceptron [Hagan and Menhaj 1994; Riedmiller and Braun 1993]
3. Decision Trees: C4.5 [Quinlan 1993]
4. Regression methods: penalised logistic regression [Zhu and Hastie 2004]

Support vector algorithms: SVMlight

As presented in Chapter 3, finding the optimal solution that satisfies the constrained optimisation problem is computationally expensive. To address this problem, a number of different implementations of SVMs have been proposed, which differ not only in the approach taken to divide and conquer the problem but also in the platforms and problem sizes they target. A list of different SVM implementations and utilities can be found at http://www.kernel-machines.org/software.html.
The different approaches taken to make this CPU- and memory-intensive computation more efficient were discussed in Chapter 3. In this section a brief description of SVMlight, the SVM implementation used in this research, is presented.

SVMlight, written by Joachims [1999], is a C implementation of the support vector machine [Vapnik 1995]. SVMlight is reported to be the most popular implementation of the SVM classification technique, for a number of reasons. Firstly, it is capable of learning ranking functions, in addition to classification and regression. Secondly, it has scalable memory requirements: SVMlight uses a least-recently-used caching strategy to trade off memory consumption against training time. This property enables the algorithm to handle large problems with many thousands of support vectors effectively. The core optimisation method used in this algorithm is based on the "LOQO" algorithm, a software package implementing an infeasible primal-dual path-following method for solving nonlinear optimisation problems [Vanderbei 1999, as cited in Joachims 1999]. Thirdly, it also provides methods for assessing the generalisation performance using the leave-one-out cross-validation technique, and it reports both the error rate and precision/recall on the training and test sets.

SVMlight has a method to classify unseen patterns based on the constructed model. Unlike many classification algorithms, which return ±1, SVMlight returns a continuous-valued output in (−∞, +∞), which makes it more suitable than other SVM implementations for computing the area under the ROC curve. For non-commercial use SVMlight is freely available at http://svmlight.joachims.org.

Neural Networks: Resilient and Levenberg-Marquardt methods

The most popular neural network architecture over a wide range of real-world applications, including computational biology problems, is the feedforward multilayer perceptron with the backpropagation algorithm [Baldi and Brunak 2001]. As mentioned in Section 2.6, the standard backpropagation algorithm is too slow for real-world problems. The two alternative ways of modifying this algorithm are a heuristic approach or numerical optimisation. Resilient backpropagation implements the first approach, while the Levenberg-Marquardt backpropagation algorithm implements the latter. Details of these two algorithms were presented in Section 2.6. The Matlab neural networks toolbox is used for this experiment.

Decision Trees: C4.5

As presented in Section 2.7, there are several heuristic methods for constructing a decision tree, each differing in the approach taken in the construction step. The decision tree algorithm used in this research is C4.5, written by Quinlan [1993] as a replacement for his original implementation ID3 (Iterative Dichotomiser 3). C4.5 is a top-down induction of decision trees algorithm: for a given set of labelled examples the decision tree is constructed in a top-down fashion. Unlike its predecessor, this algorithm can handle both nominal and continuous attributes. The splitting criterion used by C4.5 is the information gain ratio. C4.5 uses this metric to avoid the problem that arises from using information gain as a splitting criterion; recall from Section 2.7 that information gain favours attributes with many distinct values. The fully grown decision tree is then pruned to reduce the size of the tree and avoid possible over-fitting. C4.5 prunes the fully grown tree without the need for a separate validation set. The C4.5 algorithm also has the ability to handle missing attribute values. Release 8 of the algorithm can be found at http://www.rulequest.com/Personal/.
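The gain ratio criterion mentioned above is simple to compute. The toy sketch below (not from the thesis; the data are invented) shows why dividing the information gain by the split information penalises attributes with many distinct values, such as an identifier that is unique for every example.

# Gain ratio = information gain / split information (illustrative only).
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attribute_values, labels):
    n = len(labels)
    groups = {}
    for v, lab in zip(attribute_values, labels):
        groups.setdefault(v, []).append(lab)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)       # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0

# Both attributes below have the same information gain (1.0), but the
# many-valued identifier is penalised by its large split information.
labels = ["R", "R", "S", "S", "R", "S"]
print(gain_ratio(["a", "a", "b", "b", "a", "b"], labels))   # informative split
print(gain_ratio([1, 2, 3, 4, 5, 6], labels))               # unique per example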
Regression Model: Logistic regression

In this research we used penalised logistic regression. The version of penalised logistic regression used in this research was proposed by Zhu and Hastie [2004] as an alternative to SVMs for the microarray cancer diagnosis problem. This algorithm uses the Sequential Minimal Optimisation (SMO) approach proposed by Platt [1998] to make it computationally feasible for problems with larger attribute sets. The Matlab implementation of the algorithm is available at http://www.tsi.enst.fr/~gfort/GLM/Programs.html.

4.4 Model selection, generalisation and error estimation

Model selection is one of the primary steps in a pattern recognition application. This step includes the selection of a particular pattern recognition approach (for example, statistical or structural pattern recognition) and of a particular pattern recognition technique within the selected approach (for example, neural networks or SVMs). Once the appropriate classification algorithm for the problem has been identified, parameter tuning is carried out to determine the configuration that gives the best performance. The configuration parameters that are likely to give better performance, not only on the training sets but also on unseen problems, can be selected based on a theoretical bound on generalisation or on error estimation using the training set.

Generalisation and error estimation are important because they increase the confidence in the selected model. There are a number of approaches to error estimation and generalisation. Estimating the bound on generalisation requires detailed theoretical information about the problem to be solved, such as the underlying probability distribution, which is generally difficult to compute from limited training samples; therefore, estimating the bound is not an easy task for most pattern recognition problems. Error estimation, however, is easy to conduct compared to evaluating the bound on generalisation. The simplest method of error estimation is cross-validation. This method is used in this experiment and is described in detail below. There are more sophisticated methods for deriving the bound on the generalisation error and tuning the classifier parameters; such methods include PAC bounds, complexity penalisation and Bayesian model selection. However, these methods require detailed knowledge about the behaviour of the data and the underlying distribution. The underlying distribution of the data is usually unknown, and hence error estimation using cross-validation is used in much practical work.

4.4.1 Hold-out and cross-validation methods

Cross-validation is a model evaluation method that does not apply the entire data set to training the selected model. In this method some of the available samples are removed from the training set. Once the model under consideration has been trained, the samples that were removed are used to test the performance of the trained model. This is the basic idea behind the family of model evaluation methods called cross-validation. However, depending on how the available samples are divided between training and testing, there are three different cross-validation techniques; these are described next. The material in this section is based on Goutte [1997], Plutowski et al. [1994] and Shao [1993].
Hold-out method

The hold-out method is the simplest kind of cross-validation. The data set is separated into two sets, called the training set and the testing set. The pattern recognition system is trained using the training set only, and is then tested by classifying the samples in the testing set. The performance of the system is evaluated by the mean test-set error. One disadvantage of this method is the consistency of the result: the evaluation may depend heavily on which points end up in the training set and which in the test set. In other words, the evaluation may differ significantly depending on how the available sample is divided between the training set and the testing set.

k-fold cross-validation

One way to address the limitations of the hold-out method is k-fold cross-validation. In this approach the data set is divided into k disjoint subsets. Each time, one of the k subsets is used as the test set and the other k − 1 subsets are put together to form a training set; that is, the hold-out method is repeated k times. The average error over all k trials is used to evaluate the model under investigation. The advantage of this method is that, unlike the hold-out method, how the data gets divided matters much less: every data point appears in a test set exactly once and in a training set k − 1 times. Hence the consistency of the resulting estimate increases as k is increased. One disadvantage is that the training algorithm has to be run k times, once for each training-testing combination, so an evaluation takes k times as much computation. A variation of this method is to randomly divide the data into a test and a training set k different times. This approach has the advantage that the size of each test set and the number of trials averaged over can be chosen independently.

Leave-one-out cross-validation

Leave-one-out cross-validation is k-fold cross-validation taken to its logical extreme, with k set equal to n, the number of data points in the set. The pattern recognition system is trained n separate times, each time on all the data except for one point, and a classification is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by the leave-one-out error is good, but more expensive to compute than the other two methods.

So far we have seen the different techniques of cross-validation. Although the leave-one-out technique is the most powerful, it is also the most expensive. Therefore, in this research the second variation of k-fold cross-validation is used: 75% of the data is used for training and the remaining 25% for testing, and the performance reported is the average over 10 trials.

4.5 Performance evaluation criteria

The main objective of this research is to investigate the performance of SVMs relative to some of the popular machine learning/classification techniques. Like most research on the empirical comparison of classification techniques, the major task of this research is to find the best classification technique/algorithm, along with its optimal parameter set, that not only gives the best performance on the given data sets but is also likely to generalise well. In the last couple of decades, a number of articles comparing the performance of different classification techniques/algorithms on a wide variety of classification problems have been published.
Although the authors of these articles were able to claim that one classifier is better than another for a given task, not all of them agree on the dominant performance of one classifier/algorithm over the rest. This is evidence that different classifiers are more appropriate for different tasks and that one cannot claim that any classifier is globally dominant. The performance of a classifier depends on a number of factors, such as the assumptions governing the problem, the design of the classifier and the dimensionality of the problem. Hence it is hard to claim that one classification technique is generally better than the others.

Traditionally, the performance of a classifier is measured in terms of the accuracy of prediction, defined as the percentage of tested samples classified correctly. In practice, the available samples are divided into a training and a testing set, and the performance of the classifier trained on the training set is measured by its misclassification rate on the test set. However, this approach has a number of limitations due to the assumptions made about the test data [Provost et al. 1998; Fawcett 2003]. Accuracy or error rate as a measure of performance is informative when the available data sets (training and testing) are sufficiently large, the training and test sets are independent, and the distribution of the patterns across the different pattern classes is well balanced [Jain et al. 2000; Ferri et al. 2003]. These assumptions are very hard to satisfy in most real-world problems, where the number of counter-examples is very small relative to the large set of patterns available. The problem becomes even more pronounced in classification problems in computational biology, where data generation is not only costly but the available data is also likely to be highly skewed (see Saitta and Neri [1998] for example). When the number of patterns belonging to one class is very large relative to the other (say 99:1), completely misclassifying the patterns of the class with fewer test patterns will not have a significant effect on accuracy, despite a likely poor overall generalisation ability of the classifier on unseen problems. In addition, using accuracy as the performance measure assumes equal misclassification costs or penalties. This assumption limits the insight we have into the classifier, such as the type of error made (the sensitivity of the classifier) [Provost et al. 1998].

There are a number of alternative approaches that address this bias when evaluating the performance of a classifier and making empirical comparisons of classifier performance. One approach is to evaluate the performance of the classifier on each pattern class separately. An alternative is a technique that originated in signal detection theory: a graphical approach to visualise, organise and select classifiers based on their performance, called Receiver Operating Characteristics (ROC) [Provost et al. 1998; Fawcett 2003]. Spackman [1989] was one of the early adopters of ROC in machine learning for comparing and evaluating different classification algorithms [Fawcett 2003].
Following Spackman [1989] and other early adopters of this technique, an increasing number of researchers are applying ROC analysis to evaluate and compare the performance of machine learning algorithms on classification problems where it is hard to satisfy the assumptions that make the traditional approach appropriate [Fawcett 2003]. In this research the area under the ROC curve (AUC) is used in conjunction with accuracy to measure and compare the performance of the different classification algorithms.

Applying the trained classifier gives us information on the difference between the true and the predicted class for the set of labelled patterns in the test set. For a binary (two-class) classification problem this information can be summarised in the four quantities listed below:

True Positive (TP): number of positive patterns correctly classified as positive
True Negative (TN): number of negative patterns correctly classified as negative
False Positive (FP): number of negative patterns incorrectly classified as positive
False Negative (FN): number of positive patterns incorrectly classified as negative

These four quantities can be represented in a contingency table or confusion matrix, as shown in Table 4.1. This matrix is the basis for many performance measures that combine these quantities to ease the comparison of classifiers.

                      Predicted Positive    Predicted Negative
Positive examples     TP                    FN                    Pos (TP + FN)
Negative examples     FP                    TN                    Neg (FP + TN)
                      PPos (TP + FP)        PNeg (FN + TN)        N

Table 4.1: A contingency table or confusion matrix

In Table 4.1 the row totals Pos and Neg are the numbers of actual positive and negative examples, and the column totals PPos and PNeg are the numbers of predicted positive and negative patterns respectively. N is the total number of test patterns (N = Pos + Neg = PPos + PNeg). The entries along the major diagonal represent correctly classified patterns, while the off-diagonal entries are the confusion between the two classes. Common, meaningful performance metrics can be calculated from the confusion matrix:

error rate = (FP + FN) / (TP + TN + FP + FN)
accuracy (1 − error rate) = (TP + TN) / (TP + TN + FP + FN)
sensitivity (TP rate) = TP / (TP + FN)
FP rate = FP / (FP + TN)
specificity (1 − FP rate) = TN / (FP + TN)
precision = TP / (TP + FP)

Some of these metrics allow us to measure the performance of the classifier with respect to the individual classes. The two terms usually associated with ROC analysis are sensitivity and specificity; sensitivity is also known as recall, and the related measure precision is defined above. For classifiers with continuous-valued output in (−∞, +∞) (for example SVMlight, or neural networks with a tanh activation function), these measures take different values depending on the threshold chosen for the classifier: a pattern is labelled positive or negative according to whether its output is greater or less than the threshold. In such cases the diagonal and off-diagonal entries of the confusion matrix change with the chosen threshold. An ROC curve, a two-dimensional graph in which FP rate is plotted on the x-axis and TP rate on the y-axis, represents all possible combinations of TP rate and FP rate [Fawcett 2003]. An example ROC curve for three classifiers A, B and C is given in Figure 4.1.

Figure 4.1: An ROC curve for three classifiers A, B and C

The ROC curves in Figure 4.1 show the performance of the three classifiers in terms of the trade-off between sensitivity and specificity.
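To make these quantities concrete, the listing below is a minimal sketch of how ROC points and the AUC can be obtained from continuous classifier outputs by sweeping a threshold over the scores. It is an illustration only, not the algorithm of Ferri et al. [2003] used in this research, and the variable names (scores, labels) are hypothetical.

    # Minimal sketch: ROC points and AUC from continuous classifier outputs.
    # 'scores' are real-valued outputs (e.g. SVM decision values); 'labels' are
    # 1 for positive (resistant) and 0 for negative (susceptible) patterns.
    # Tied scores are not given any special treatment in this sketch.

    def roc_points(scores, labels):
        """Return (FP rate, TP rate) pairs obtained by sweeping the threshold."""
        pos = sum(labels)
        neg = len(labels) - pos
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        tp = fp = 0
        points = [(0.0, 0.0)]
        for i in order:
            if labels[i] == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    def auc(points):
        """Area under the ROC curve by the trapezoidal rule."""
        area = 0.0
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0
        return area

    # Toy example with skewed scores.
    scores = [2.1, 1.3, 0.4, -0.2, -0.7, -1.5]
    labels = [1,   1,   0,   1,    0,    0]
    print(auc(roc_points(scores, labels)))   # prints 0.888... (8/9)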
An ROC curve plotted in FP rate–TP rate space always passes through the points (0,0) and (1,1). At (0,0) the classifier has classified all the test patterns as negative; conversely, at (1,1) it has classified all the test patterns as positive. For a random classifier, the ROC curve is a straight line connecting these two points. For a perfect classifier the ROC curve connects the three points (0,0), (0,1) and (1,1). From the ROC curves in Figure 4.1 it can be said that classifier A performs better than a random classifier and better than both classifiers B and C. However, for any two classifiers whose ROC curves intersect (classifiers B and C in Figure 4.1) it is not always easy to compare performance. Both perform better than a random classifier, but it is hard to say whether B is better than C or vice versa: depending on the threshold chosen (a particular sensitivity), one is better than the other. In such cases, where the ROC curves cross, using the area under the ROC curve (AUC) as a single-number performance measure is an appropriate tool [Bradley 1997]. For a random classifier the AUC equals 0.5 and for a perfect classifier it is 1.0. Although it is possible for a classifier with higher accuracy to have a lower AUC than one with lower accuracy, AUC has been argued to be a better way of evaluating classifier performance than accuracy [Ferri et al. 2003]. The algorithm used in this research to calculate the points on the ROC curve and the area under the curve (AUC) is adopted from Ferri et al. [2003].

4.6 Data and input encoding

4.6.1 Data

Ideally, a large training and test data set for which the phenotypic drug resistance status of the different samples was known would be used. Unfortunately, such data are difficult to obtain in sufficient quantity. Instead, a data set that has previously been classified by other researchers is used. While not ideal, we emphasise that our objective is to evaluate machine learning algorithms for future mutations and drugs rather than to discover new biological knowledge about existing drugs and mutations. The data set used has a high degree of biological fidelity but, more importantly, it contains typical patterns and mutations. Thus, if the model can learn the mutations that others have predicted or identified, we can conclude that the machine learning technique can be used to learn new patterns for new drugs and mutations.

This research compared the performance of SVMs, neural networks, decision trees and logistic regression on HIV-1 sub-type B, which is common in Europe and the Americas. There is no reason to believe that performance on other sub-types would be different. The recognition performance of these algorithms is tested by their ability to classify a given nucleotide sequence as drug resistant or susceptible, and by their trade-off between sensitivity and specificity, measured by the area under the ROC curve. Because of the drugs represented in the sample data, the genetic sequences of the viral protease and/or reverse transcriptase were used. The genetic sequences for the protease and reverse transcriptase of the HIV-1 virus used in this experiment are the same set of sequences used by Ravela et al. [2003] in their investigation of the mutation points responsible for the discordances between different genotypic drug resistance interpretation algorithms.
This data set comprises sequences of isolates from 2045 individuals, each covering positions 1–99 of the protease and 1–240+ of the reverse transcriptase, along with the corresponding drug resistance status, labelled S for susceptible, I for intermediate and R for resistant, for 15 different antiretroviral drugs according to the different algorithms used in Ravela et al. [2003] (some of the sequences contain positions 1–250 of RT). These algorithms had in some cases classified some sequences differently. These discordances are reported to result from several frequently occurring simple mutation patterns in the case of nucleoside reverse transcriptase inhibitors (NRTIs) and from a small number of drug resistance mutations in the case of non-nucleoside reverse transcriptase inhibitors (NNRTIs). Discordances for protease inhibitors (PIs), however, result from a large number of complicated mutation patterns [Ravela et al. 2003]. Therefore, to avoid these discordances, this research used the drug resistance status assigned by the HIV RT and Protease Sequence Database (HIVDB) (http://hivdb.stanford.edu). A preliminary experiment showed no performance difference resulting from the choice of drug resistance interpretation algorithm.

As few of the test data sequences were labelled as intermediate, in this research intermediate sequences are treated as resistant. Table 4.2 shows the final count of resistant and susceptible sequences for the five selected drugs.

Drug category   Drug   Resistant (R)   Susceptible (S)
PI              IDV    884             1161
PI              NFV    1037            1008
NRTI            AZT    1090            955
NRTI            D4T    1058            987
NNRTI           NVP    1377            668

Table 4.2: Sequence distribution between resistant (R) and susceptible (S)

4.6.2 Input encoding

The first step towards applying the different classification techniques used is pre-processing and filtering of the selected data sets to meet the specific data representation requirements of these algorithms. The sequences from Ravela et al. [2003] are composed of the codons at positions 1–99 of the protease and 1–240+ of the reverse transcriptase. Each sequence has a reading frame starting from the first base, so the first 99 × 3 = 297 bases correspond to the protease while the remaining 240+ × 3 = 720+ bases correspond to the reverse transcriptase. Therefore the first step was to extract the part of each sequence corresponding to the particular drug category under investigation.

Mutations in different parts of the viral genome are responsible for the drug resistance behaviour of the virus. Resistance to PIs is caused by mutations exhibited in the viral protease, whereas resistance to both NRTIs and NNRTIs is caused by mutations in the viral reverse transcriptase. The first 297 nucleotides of each sequence are used for the drugs in the protease inhibitor category. For the other two drug categories, NRTI and NNRTI, two alternative approaches were taken. The first was to use the entire sequence, the combination of protease and reverse transcriptase, leaving to the classification algorithm the task of discriminating those features that are not responsible for the resistance behaviour of the strain for the particular drug. In the second approach, only the part of the sequence belonging to the reverse transcriptase was considered. After removing the first 297 nucleotides used for the protease inhibitors from each sequence, the remaining sequence is the part of the original sequence comprising positions 1–240+ of the reverse transcriptase.
Since most of the mutation points responsible for resistance to drugs in these two categories lie within the first 240 codon positions of the reverse transcriptase, the sequence remaining after eliminating the protease part contains all the information needed for this experiment. One problem that could arise when removing the first part of the sequence is the question of the open reading frame of the remaining reverse transcriptase sequence for those sequences where the protease might be shorter or longer than 297 nucleotides. To address this possibility a preliminary sequence alignment was carried out, and no such problem was identified. The second approach therefore results in sequences 720 nucleotides (240 codons) long. Thus, there is one set of sequences with the corresponding drug resistance labels for the protease inhibitors and another two sets for the nucleoside and non-nucleoside reverse transcriptase inhibitors.

Designing a complete classifier includes identifying proper pre-processing and post-processing techniques besides selecting an appropriate classification algorithm and tuning the corresponding parameters. As in most computational biology problems, the first pre-processing step is encoding the nucleotide (character) sequences into an appropriate vector (numeric) representation that works across the different classification algorithms used in this research. The choice of encoding technique affects the quality of the information retrieved from the raw data (the nucleotide sequences), and consequently the performance of the classifiers. Therefore, a number of factors should be taken into consideration when selecting an encoding technique. As presented in Wu [1997], when encoding sequences one needs to ask whether to consider fixed- or variable-length sequences, whether local or global information is more important, whether the information has any positional dependency, whether the intention is searching for signal or content, and so on. Recall that drug resistance is caused by mutations exhibited at specific positions in the viral genome. Furthermore, not all mutations contribute equally towards the drug resistance behaviour of the strain: there are major and minor mutation points. To reflect these underlying facts, a data representation that can highlight the global position of the mutation points and their positional dependency on other mutations is of interest. For this purpose the input encoding in this research is as follows.

The sequences in each data set were converted into vectors as follows. Before converting each sequence into an equivalent numeric representation, each sequence was first transformed into a consistent nucleotide representation by substituting those nucleotide codes that represent two or more nucleotide bases. The conversion of these ambiguity codes into bases was done according to the IUPAC codes (see Table C.2 in Appendix C). Each set of 3 nucleotides makes up a codon, which encodes part of the corresponding enzyme. We give each possible codon a numerical value using a standard numbering scheme. The conversion starts by assigning an integer value to each of the nucleotide bases: A, C, U and G were assigned 0, 1, 2 and 3 respectively.
With a numeric value assigned to each nucleotide base, the numeric value of a codon is the sum of the integer representations of its nucleotide bases, each multiplied by 4^i, where i is 2 for the first base of the codon, 1 for the second and 0 for the third (the first base is thus the most significant). For example, the codon AUG becomes 0 × 4² + 2 × 4¹ + 3 × 4⁰ = 11.

Using this scheme, each sequence was converted into a vector. The protease sequences become 99-dimensional vectors (99 × 3 = 297 nucleotides) and the reverse transcriptase sequences become 240-dimensional vectors (240 × 3 = 720 nucleotides). The vector representation of each sequence, paired with the corresponding drug resistance label, makes up the data sets used in this research for training and testing. This representation scheme allows us to detect at which positions mutations caused drug resistance. Note that this particular input encoding has the advantage, for the HIV drug resistance prediction task, that it models both the global and the relative position of the mutations exhibited.

Given the vector representation of the selected data sets, the data were divided into training and test sets for cross-validation. As described in Section 4.4, we randomly selected 75% of each data set for training and the remaining 25% for testing. With this approach the training set comprises 1500 sequences with the corresponding drug resistance profile for each drug, and the testing set contains 545 sequences. The random selection gives equal probability to each sequence and usually yields an even distribution between the resistant and susceptible classes relative to the overall distribution of the sample. Ten training and testing sets were created for each drug in Table 4.2.

4.6.3 Alternative input encoding

Many questions can be asked about the input encoding technique, and we do not claim that it is the best input encoding that could be used for this problem. However, we believe that the input encoding used was sufficient to answer the research question and had an advantage over the other two approaches tested in pre-experiments.

The first alternative approach tested was a slight variation of the scheme described above, investigating the effect of giving a different order of significance to the nucleotides in a codon. In the scheme above, the first nucleotide of the codon is the most significant; in this alternative, the last (third) nucleotide was given the highest significance. With this scheme, the codon AUG, which was equivalent to 11 above, becomes 0 × 4⁰ + 2 × 4¹ + 3 × 4² = 56. The performance of the different classifiers on the data set with this input encoding was slightly lower than with the scheme above, and it did not produce a different ordering of the classifiers in terms of performance. Besides the slight performance advantage of the selected scheme, there is a biological explanation that makes it more favourable: as can be seen from Table C.1 in Appendix C, the most significant nucleotide of a codon encoding an amino acid is the first one and the third nucleotide is the least significant (for example, both GAU and GAC encode the amino acid aspartic acid (Asp)).

The second alternative approach tested was the direct sequence encoding presented in Wu [1997]. In this approach each nucleotide is represented by a vector of four units with three zeros and a single one; the four nucleotides were represented as 1000 (A), 0100 (U), 0010 (G) and 0001 (C). A small sketch contrasting the selected codon encoding with this direct encoding is given below.
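The following is a minimal illustration of the two encodings just described. It assumes that ambiguity codes have already been resolved to A, C, U or G via the IUPAC substitution step; the function names are illustrative only.

    # Minimal sketch of the selected input encoding (one integer in 0..63 per
    # codon, first base most significant) and of the direct (bit) encoding.

    BASE = {'A': 0, 'C': 1, 'U': 2, 'G': 3}

    def encode_codons(seq):
        """Return one integer per codon: base values weighted by powers of 4."""
        assert len(seq) % 3 == 0
        vec = []
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]
            vec.append(BASE[codon[0]] * 16 + BASE[codon[1]] * 4 + BASE[codon[2]])
        return vec

    def encode_one_hot(seq):
        """Direct encoding: four components per nucleotide, in A/U/G/C order."""
        table = {'A': [1, 0, 0, 0], 'U': [0, 1, 0, 0],
                 'G': [0, 0, 1, 0], 'C': [0, 0, 0, 1]}
        vec = []
        for base in seq:
            vec.extend(table[base])
        return vec

    print(encode_codons("AUG"))            # [11], as in the worked example above
    print(len(encode_codons("A" * 297)))   # 99-dimensional protease vector
    print(len(encode_one_hot("A" * 297)))  # 1188-dimensional direct encoding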
With this direct encoding scheme, the dimensionality of the data is twelve times that of the two codon-based schemes presented above. For example, the protease sequence, which is 297 nucleotides long, has a new dimensionality of 1188 vector components compared to the 99 dimensions of the schemes above. The average performance of the different classifiers with this input encoding was slightly better than with the scheme selected for this research and presented in the previous section. However, the models were less stable and the standard deviation was larger. Furthermore, as with the other alternative, this input encoding did not result in a different ordering of the classifiers in terms of performance. Besides the higher dimensionality and the unstable behaviour of some of the models, interpretation of the models in terms of mutation points was more complicated: with the selected input encoding scheme, comparing the split criteria with the mutation points known to confer drug resistance is easier than with the bit encoding. Therefore, for the above-mentioned reasons, the input encoding scheme presented in the previous section was selected.

4.7 Detailed methodology

The experiment reported in this dissertation compares the performance of four classification techniques: SVMs, neural networks, decision trees and logistic regression. The performance (as described in Section 4.5) of each classification algorithm described in Section 4.3 was tested in terms of its ability to classify the sequences of the different HIV strains described in Section 4.6.1 as susceptible or resistant. As described in the previous section, the input data was pre-processed to make it suitable for the different classification algorithms, and this standard input representation was used across all of them. However, further pre-processing, such as normalisation of the vectors or dimensionality reduction, was required for some of the algorithms. These further pre-processing steps are described in this section together with the experimental setup for the different classification algorithms. For SVMs and neural networks significant testing was done to find the best parameters for this problem. In this phase of the research much of the focus was given to neural networks and decision trees, because previous work on drug resistance has been done using neural networks and decision trees; a thorough study was not conducted in the case of logistic regression.

4.7.1 Support vector machines

The first set of experiments was done using SVMs. There is a general approach to using SVMs as a classification tool, which can be summarised in the following sequence of steps. Having defined the problem as a classification problem and prepared the training set from which the learning system is to acquire its knowledge, the next step is kernel selection. Kernel selection is based on available prior knowledge or previous similar experiments or, in the absence of both, on empirical tests of the different kernels, from simple to more complicated ones. For the selected kernels, the kernel parameters are then tuned based on the performance of the trained learning system on the held-out test data. Unseen patterns are classified based on the sign of the decision function. In this section we describe the experimental setup, starting from the kernel selection step.
As described in the research question and the motivation for the research, existing biological knowledge about the problem is not incorporated, and hence kernel selection is not based on prior knowledge about the problem. Furthermore, to our knowledge there is no previous research on drug resistance prediction using SVMs as the classification tool. Hence, empirical testing is used to select the kernel and tune the respective parameters.

In this experiment we used a base SVM kernel to normalise the encoded input before starting training and testing with the other kernels. This means the data is transformed using a normalisation kernel k(x, y) such that the data is contained in a sphere of unit radius. k(x, y) is defined as:

k(x, y) = (x · y) / √((x · x)(y · y))    (4.1)

This transformation makes the data suitable for the domain-restricted kernels and simplifies the generalisation and error estimation during optimisation. The data was further transformed using the polynomial or radial basis kernel during the training process.

There are two parameters of interest for each selected kernel: the kernel parameter and the regularisation constant. For the polynomial kernel (see Equation 3.25) the parameter of interest is the degree of the polynomial d; polynomials of first to third degree (d = 1, 2 and 3) were investigated. The parameter of interest for the RBF kernel (see Equation 3.26) is the width of the kernel σ; the values investigated in this research are σ = 0.1, 0.5, 1, 2, 5 and 10. Different values of the regularisation constant C, which controls the trade-off between the training error and the separating margin, were investigated for each selected kernel parameter: C = 1, 10, 100 and the default value, where the default value is computed from the training data during the learning process. The selection of the kernel parameters was done starting from a simple dot-product kernel and moving to higher-degree polynomials and different RBF kernels. A small number of values for the degree of the polynomial kernel, the width of the RBF kernel and the regularisation constant were selected, with the view that these values are enough to answer the research question.

4.7.2 Neural networks

The second set of experiments was done using neural networks. The network architecture used in this research is the popular feedforward multilayered perceptron with fully connected neurons and backpropagation learning. A neural network with backpropagation learning has a number of parameters affecting its performance, such as the learning rate, the momentum term and the architecture of the network. The network architecture is characterised by the number of hidden layers, the number of neurons in each layer, their activation functions and so on. There are different modifications of the standard backpropagation algorithm, each with its advantages in terms of memory usage, convergence speed and training set size. Two of these are the Levenberg-Marquardt algorithm proposed by Hagan and Menhaj [1994] and the Resilient propagation algorithm proposed by Riedmiller and Braun [1993]. The modification to the standard backpropagation algorithm adopted by each of these is theoretically better suited to certain problems than to others; however, the practical effect is not always in line with the theory. Hence one needs to test these algorithms and determine the best one for the problem at hand.
One of the parameters that define the architecture of the network is the activation function. The activation function determines the response of each neuron it is attached to as a result of its input. Different activation functions have different input and output ranges and hence may lead to different training times. Choosing the appropriate activation function(s) for the problem at hand is another important step of this experiment. A number of preliminary experiments were carried out to choose the appropriate learning algorithm and activation function before further network parameters were tuned.

The number of neurons in the input layer equals the dimensionality of the data and there is only one output. However, the number of hidden layers and the number of neurons in each layer should match the complexity of the classification problem. As described in Riedmiller and Braun [1993] and numerous previous studies, a network with two hidden layers can be tuned to address any complex problem by choosing an appropriate number of neurons in each layer. There is no exact theoretical rule for choosing the number of neurons: the hidden layers should contain enough neurons to address the complexity of the problem, but not so many that the corresponding weights cannot be estimated from the available training data. In this experiment we start with a small number of neurons and increase it to obtain the best network configuration. Furthermore, depending on the learning algorithm, the respective parameters such as the number of epochs, the learning rate and the momentum term are tuned to achieve better performance.

Like most pattern recognition algorithms, the performance of a neural network is affected by high dimensionality and by the representation of the data. As presented in Section 4.6.2, the value of each attribute ranges from 0 to 63. This might be a source of problems for some activation functions that are defined on a standard domain. The former problem is addressed by preprocessing the data with a dimensionality reduction technique; the technique used in this research is principal component analysis (PCA), carried out according to the recommendations of Jolliffe [1986]. The latter problem is addressed by scaling (normalising) the data. The Neural Networks toolbox and the Matlab implementation of PCA are used in this research.

4.7.3 Decision trees

The third set of experiments evaluated the performance of decision trees. As in the previous two experiments, we started by converting the available training and test data into a format suitable for the selected decision tree algorithm, C4.5. This algorithm has a module to build a decision tree from a given training set and evaluate the resulting tree on a given test set. The module has a number of parameters, some general and others affecting performance. In this experiment the default values of all parameters except one were used. The parameter confidence factor (CF) affects decision tree pruning: a small value means heavy pruning. In our experiment we start with the default value (25%) and decrease it if the actual error of the pruned tree on the test set is higher than the estimated error.

4.7.4 Logistic regression

The final set of experiments evaluated the performance of logistic regression.
In this experiment we used the Matlab implementation of penalised logistic regression with a ridge estimator proposed by Zhu and Hastie [2004]. The recommended values were used for all the parameters, and the range of possible values for the regularisation parameter was set between 0.001 and 1000 in steps of 0.1, following recommendations from previous research.

4.8 Summary

This work addresses the question "can SVMs predict HIV drug resistance based on the genetic sequence of the viral protease or reverse transcriptase?". To answer this question we use a comparative approach and compare the performance of SVMs to other popular classification algorithms. The performance evaluation criteria used in this work are the accuracy of classification and the area under the ROC curve. Previous researchers have argued that the area under the ROC curve is a better single-value performance indicator, hence more emphasis is given to this value.

The performance of the different classification algorithms is evaluated using cross-validation testing, in which a random 75% of the data is used for training and the remaining 25% for testing. This procedure is repeated 10 times on different training-testing data sets and the average performance is reported. The results of these experiments are presented in the next chapter.

Chapter 5

Results and Discussion

5.1 Introduction

Model selection can be carried out by analysing a theoretical bound on generalisation or by performing empirical experiments and analysing the performance of the classifier on a given unbiased test set. In this work model selection is carried out using the cross-validation technique. As discussed in Section 4.5, the performance of the different classification algorithms was evaluated in terms of the accuracy of classification and the area under the ROC curve (AUC), based on cross-validation results as outlined in the previous chapter. In this procedure a randomly selected 75% of the data was used for training, and the accuracy and AUC of the different classification algorithms were then measured on the remaining 25%. The results presented in this chapter are the mean accuracy and AUC over 10 trials, each of which used a different pair of training and testing sets.

As described in the previous chapter, the main objective of this research is to evaluate the performance of SVMs in comparison with some of the other popular pattern recognition techniques. Hence the research was aimed at achieving this by designing prediction models for neural networks, decision trees and logistic regression and comparing them with the designed SVM model.

The remainder of this chapter is organised as follows. In the next section the performance of SVMs as a function of the kernel parameters and the regularisation constant is discussed; this section also gives general remarks on SVMs and relates some of our results to the theory. Section 5.3 presents the performance of the neural network models: different architectures and learning algorithms were investigated, the results for the best configuration found are presented, and the effect of dimensionality reduction is investigated. Section 5.4 presents the performance of the decision tree models, followed by Section 5.5, which presents the performance of the logistic regression models. The performance comparison of the different classification algorithms and the findings of this experiment are discussed in Section 5.6.
Section 5.7 gives some remarks on statistical significance testing. Section 5.8 discusses the limitations of the research and hints at ways of addressing them. Finally, the chapter is summarised in Section 5.9.

The results for the different sets of experiments are presented in the following tables.

Description              Table No    Page No
SVM polynomial kernel    Table 5.2   Page 95
SVM RBF kernel           Table 5.5   Page 102
                         Table 5.3   Page 97
                         Table 5.4   Page 97
Neural networks          Table 5.6   Page 104
Decision trees           Table 5.7   Page 105
Logistic regression      Table 5.8   Page 108
Selected results         Table 5.9   Page 108

Table 5.1: List of result tables

5.2 Performance of SVMs

The success of SVMs is largely attributed to the kernel trick, which simplifies highly complex real-life problems by projecting the patterns into a higher-dimensional feature space where they can be solved with ease. Hence kernel selection is the critical part of the SVM approach. As presented in Section 3.4, the three commonly used kernels are the polynomial, RBF and sigmoid kernels. Osuna et al. [1997] pointed out that with these kernels SVMs can emulate different, previously well studied classifiers. The implementation of SVMs used in this research, SVMlight, does not incorporate automatic selection criteria for the kernels or their corresponding parameters; the task of selecting kernel(s) and tuning their parameters is left to the expert designing the classifier.

Kernel selection is motivated by the information at one's disposal. When enough prior information about the problem is available, one can choose the kernel that best suits the problem, modify existing kernels by incorporating this information, or design a specific kernel for the problem. In the absence of such information we are forced to base the kernel choice on previous research in a similar domain or to empirically test the available kernels. In this research empirical testing using cross-validation is used.

The polynomial kernel is one of the most commonly used kernels. Depending on the degree of the polynomial, the shape of the resulting boundary becomes more complicated. Hence, by tuning the degree of the polynomial, either based on the estimated VC-dimension or by assessing the generalisation error (for example by cross-validation), the polynomial kernel can be adapted to the required decision boundaries. The RBF kernel is the most popular kernel because of its capacity to generate decision boundaries that can accommodate a wide range of classification problems. The RBF kernel is best suited to classification problems where one class is enclosed by the other. It can also handle the simple linearly separable case, which according to Keerthi and Lin [2003] can be considered a special case of the RBF kernel. For an SVM classifier with an RBF kernel the support vectors are the centres of the RBFs, and σ, the width of the RBF, determines the area influenced by these support vectors [Vapnik 1995]. The sigmoid kernel, which emulates the multilayered perceptron, is a tricky one to use since it satisfies the Mercer condition only for some values of the gain and the threshold [Vapnik 1995]. For this reason, and because of its comparatively poor performance relative to the other two kernels, the sigmoid kernel is recommended against by many researchers [Lin and Lin 2005], although it is still used for historical reasons. The sigmoid kernel is not used in this research.

The model selection was done by tuning the kernel parameters for the polynomial and the RBF kernels; the grid search sketched below illustrates the procedure.
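As a minimal sketch of this procedure, the listing below runs a grid search over kernel parameters and the regularisation constant, scoring each configuration by the mean accuracy and AUC over ten random 75/25 splits. scikit-learn's SVC is used here purely as a stand-in for SVMlight, the variable names X and y are hypothetical, and the translation of the RBF width σ into scikit-learn's gamma parameter is an assumption rather than the thesis's own parameterisation.

    # Grid search over kernel parameter and C, averaged over repeated random
    # 75/25 train-test splits. Illustrative only; the research itself used
    # SVMlight rather than scikit-learn.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, roc_auc_score

    def evaluate(kernel_args, X, y, trials=10, test_size=0.25):
        """Mean accuracy and AUC of one SVM configuration over random splits."""
        accs, aucs = [], []
        for seed in range(trials):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, random_state=seed)
            model = SVC(**kernel_args).fit(X_tr, y_tr)
            scores = model.decision_function(X_te)   # continuous output
            accs.append(accuracy_score(y_te, model.predict(X_te)))
            aucs.append(roc_auc_score(y_te, scores))
        return np.mean(accs), np.mean(aucs)

    # Grid corresponding to the parameter ranges described in Section 4.7.1.
    grid = (
        [{'kernel': 'poly', 'degree': d, 'C': C}
         for d in (1, 2, 3) for C in (1, 10, 100)]
        + [{'kernel': 'rbf', 'gamma': 1.0 / (2 * s ** 2), 'C': C}  # gamma from width s
           for s in (0.1, 0.5, 1, 2, 5, 10) for C in (1, 10, 100)]
    )

    # for params in grid:
    #     acc, auc = evaluate(params, X, y)
    #     print(params, round(acc, 4), round(auc, 4))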
With the polynomial kernel we tested different values for the degree of the polynomial. Similarly, for the RBF kernel various values for the width of the kernel were tested. For each polynomial or RBF kernel, we further tuned the models by selecting different values for the regularisation constant. The tuning of these parameters was carried out by grid search using the cross-validation technique over the available samples. As pointed out in the previous chapter, we used 75% of the data for training and the remaining 25% for testing, with the patterns for the training and testing sets selected randomly. These steps were repeated 10 times and the results presented are the mean values over the 10 experiments.

5.2.1 Effect of the polynomial kernel parameter on classification

The polynomial kernel (Equation 3.25) has two free parameters that need to be tuned according to the complexity of the problem: the degree of the polynomial d and the regularisation constant C. While the degree of the polynomial controls the complexity of the classifier, or the shape of the decision boundaries, the regularisation constant controls the sharpness of these boundaries; in other words, the regularisation constant accounts for the trade-off between the width of the margin of the classifier and the misclassification penalty. Model selection for this kernel therefore amounts to determining good values for both d and C. To do this we used a grid search with cross-validation for d = 1, 2 or 3 and C in {default value, 1, 10, 100}. The default value of the regularisation constant is calculated from the training set during training and equals

N / Σ_{i=1}^{N} K(x_i, x_i)    (5.1)

where K(x, y) is the chosen kernel (in this case the polynomial kernel) and N is the size of the training set.

The results for the performance of the SVM classifier with the polynomial kernel are presented in Table 5.2 on page 95. As can be seen from the table, the performance of the classifier increased as the degree of the polynomial was increased, except when the default value was used for the regularisation constant. However, the performance gain was not continuous for all of the patterns and there are instances where the performance deteriorated. For NVP(RT) and NVP(PR+RT) when C = 1, and for NFV when C = 100, there was no improvement as the degree of the polynomial was increased from two to three (see Figure 5.1 on page 94). This observation is in line with the theory: if a given classification problem is effectively classified by a polynomial of lower degree, polynomials of higher degree are likely to generate an identical decision boundary. For example, if a classification problem is linearly separable, one can expect the same classification boundary from a linear, a quadratic and a higher-degree polynomial kernel SVM.

SVMlight is designed to optimise generalisation in terms of the accuracy of classification. However, as mentioned above, in this research we give more weight to the AUC as a performance measure. As can be seen from the table, there are instances where the accuracy of classification increased while the AUC remained constant or decreased (for example NVP(RT) and NVP(PR+RT) for C = 100). Hence, when discussing performance gains resulting from parameter tuning, both AUC and accuracy were considered.

Figure 5.1: Selected graphs (degree d vs Accuracy/AUC) showing the behaviour of SVMs as a function of the polynomial kernel degree d.
The solid lines in the graphs show the accuracy of the model and the broken lines the AUC. These graphs are intended to show how the accuracy of classification and the AUC are related as a function of the degree of the polynomial.

Similar to the effect of the degree of the polynomial, the SVM classifier showed a performance gain as the value of the regularisation constant was increased while keeping the degree of the polynomial fixed. As can be seen from the table (for instance for NFV), the performance increased from 92.66% accuracy and 96.96% AUC to 94.31% accuracy and 97.10% AUC as the regularisation constant was increased from the default value to 10. However, there was no further performance gain as the value of the regularisation constant was increased further for the given polynomial degree. Although the results presented in the table do not exhibit this behaviour consistently, it is expected for all patterns for a selected degree: once a value of the regularisation constant is reached that best addresses the trade-off, a further increase will have either no effect or a negative effect on the performance of the classifier.

                degree (d) = 1            degree (d) = 2            degree (d) = 3
Drug            Accu.  AUC    No. SV      Accu.  AUC    No. SV      Accu.  AUC    No. SV
Regularisation constant = default
IDV             95.05  94.96  906         91.19  94.43  1335        95.05  94.97  905
NFV             92.66  96.96  956         79.08  95.67  1376        92.66  96.97  955
AZT (RT)        84.59  96.12  950         64.77  93.86  1486        84.59  96.12  949
AZT (PR + RT)   81.83  96.05  1032        63.12  93.44  1496        81.83  96.04  1030
D4T (RT)        85.14  88.41  971         69.36  86.69  1486        85.14  88.41  969
D4T (PR + RT)   84.22  96.70  1038        68.62  94.85  1496        84.22  96.70  1038
NVP (RT)        68.99  73.53  1020        68.99  73.55  1026        68.99  73.54  1022
NVP (PR + RT)   68.99  77.02  1021        68.99  76.95  1027        68.99  76.98  1019
Regularisation constant = 1
IDV             95.05  95.03  879         96.15  95.66  526         96.78  95.93  880
NFV             92.66  97.02  931         93.39  97.10  523         93.66  97.02  928
AZT (RT)        84.95  96.23  907         92.11  96.43  555         93.01  96.72  907
AZT (PR + RT)   82.94  96.16  987         92.29  96.37  636         92.29  96.16  986
D4T (RT)        85.14  88.45  929         92.29  89.38  573         92.29  88.45  929
D4T (PR + RT)   84.22  96.86  989         92.66  96.83  650         92.73  96.86  989
NVP (RT)        68.99  73.51  1020        72.66  74.31  1015        72.56  74.21  1022
NVP (PR + RT)   68.99  77.01  1022        72.11  77.07  1008        72.11  77.02  1020
Regularisation constant = 10
IDV             94.27  95.96  388         95.23  95.42  289         95.96  94.26  385
NFV             94.31  97.10  355         94.68  97.06  239         94.92  96.85  353
AZT (RT)        92.29  96.80  373         92.48  96.73  241         92.87  96.80  371
AZT (PR + RT)   92.29  96.50  440         92.48  96.67  276         93.01  96.50  440
D4T (RT)        93.02  96.69  489         94.13  96.61  333         94.13  96.68  491
D4T (PR + RT)   92.00  89.46  426         93.94  89.03  283         94.13  89.46  422
NVP (RT)        74.86  76.78  979         79.45  78.93  912         80.25  79.54  979
NVP (PR + RT)   74.68  77.74  970         82.02  79.01  887         82.23  79.55  970
Regularisation constant = 100
IDV             95.73  95.78  258         95.78  95.73  246         95.78  95.73  258
NFV             94.02  97.10  203         95.96  97.40  168         95.96  97.11  205
AZT (RT)        92.84  97.06  208         93.76  96.84  174         94.23  97.06  209
AZT (PR + RT)   92.48  97.13  233         93.76  96.85  208         94.17  97.44  233
D4T (RT)        94.13  88.79  240         94.86  89.33  220         95.05  88.79  240
D4T (PR + RT)   93.97  97.32  278         94.13  97.11  255         94.31  96.94  278
NVP (RT)        82.57  80.12  844         83.49  91.36  816         83.76  80.11  845
NVP (PR + RT)   83.49  80.16  814         83.41  81.25  766         83.54  80.17  816

Table 5.2: SVMs classifier performance with polynomial kernel

Another result that needs to be discussed is the number of support vectors.
As can be seen from the table, the number of support vectors is in most cases less than 50% of the total number of training patterns. As mentioned in Section 2.5.2, one of the major problems compromising the performance of traditional classification techniques is the curse of dimensionality. For a polynomial kernel of degree d over n input attributes, the dimension of the feature space is at least the binomial coefficient C(n + d − 1, d) [Cristianini and Shawe-Taylor 2000]. However, the ability of SVMs to perform this higher-dimensional classification task implicitly in the input space allows the classifier to construct hyperplanes in this high-dimensional space without suffering from the curse of dimensionality and/or overfitting (see Section 3.4). As the table also shows, the increase in the number of support vectors, compared with the growth of the feature-space dimension as the degree of the polynomial is increased for a fixed value of C, is very small, and the number sometimes even decreased.

Finally, one interesting aspect of the support vectors that needs to be discussed is the overlap between the support vectors that shape different hyperplanes. Following the findings reported by both Vapnik [1995] and Schölkopf [1997] on the commonality of support vectors among different classification boundaries for the same problem, we looked for this behaviour in this research. Although numerical results are not presented, commonality of a number of the support vectors across the classification hyperplanes for a particular drug with different polynomial degrees and regularisation constants was indeed observed.

5.2.2 Effect of the RBF kernel parameter on classification

The radial basis function kernel (Equation 3.26) is the most popular kernel for practical applications, since it is complex enough to address complex non-linear classification problems by transforming the data into a higher-dimensional space, and simple enough to solve linear classification problems, which can be considered a special case [Keerthi and Lin 2003]. The classical approach to estimating RBF classifiers involves finding the RBF centres using a k-means clustering mechanism and then estimating the corresponding weight for each cluster using error backpropagation. The SVM approach, however, has a more elegant way of computing the centres, weights and threshold that results in the best generalisation automatically [Schölkopf et al. 1996].

There are two parameters of interest for the RBF kernel: the width of the kernel σ and the regularisation constant C. As with the polynomial kernel, our goal is to find good values for both σ and C. These pairs of values were also determined by grid search using the cross-validation technique. The complete results from these experiments are presented in Table 5.5 on page 102.

This kernel is most favourable when one class is totally encircled by the other, as shown in Figure 3.9(b). However, as presented in Keerthi and Lin [2003], by varying the value of either of these parameters while fixing the other, we can achieve almost any kind of decision boundary, including linear hyperplanes. The value of σ is related to the diameter of the enclosing boundaries: the smaller σ gets, the tighter the closed boundaries (circles) become, and by varying σ, with the regularisation constant C chosen as a function of σ, the decision boundary can be made to approach a linear hyperplane.
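To make the role of the width concrete, the following minimal sketch evaluates an RBF kernel on a single pair of points under two widths. It assumes the common parameterisation k(x, y) = exp(−||x − y||² / (2σ²)); Equation 3.26 in this dissertation may be written in an equivalent but differently scaled form, so the exact formula is an assumption here.

    # Illustrative RBF kernel: smaller sigma makes points look more dissimilar,
    # giving tighter, more local decision boundaries; larger sigma smooths them.

    import numpy as np

    def rbf_kernel(x, y, sigma):
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

    print(rbf_kernel([0, 1], [1, 1], sigma=0.1))   # ~0: points appear dissimilar
    print(rbf_kernel([0, 1], [1, 1], sigma=10))    # ~1: points appear similar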
To aid our discussion, selected results from Table 5.5 on page 102 for three drugs, one from each drug category, are presented in Table 5.3 and Table 5.4 below. Similar to the polynomial kernel, there was a performance gain as a function of σ and C. As can be seen from Table 5.3 (for example for NVP(RT)), there was a performance gain as σ was gradually increased from 0.1 to 1. However, as σ was increased further to 10 the performance deteriorated (see Figure 5.2). It can also be seen from this table and from Table 5.5 on page 102 that this is not an isolated occurrence.

          IDV (C = 1)               D4T(RT) (C = 100)         NVP(RT) (C = 100)
σ         Accu.  AUC    No. SV      Accu.  AUC    No. SV      Accu.  AUC    No. SV
0.1       89.72  94.04  1418        93.94  96.22  328         93.94  78.34  328
0.5       95.05  95.03  892         93.94  96.02  254         93.93  80.02  256
1         95.78  95.37  694         94.50  96.96  239         94.50  81.26  239
2         96.15  94.92  542         94.86  97.17  259         94.86  81.17  259
5         96.33  95.25  426         93.76  97.24  288         93.76  78.23  288
10        96.15  95.63  407         93.03  96.94  325         93.03  76.52  325

Table 5.3: Selected results for the SVM with RBF kernel showing the effect of σ on the classifier performance

          IDV (σ = 10)              D4T(RT) (σ = 2)           NVP(RT) (σ = 1)
C         Accu.  AUC    No. SV      Accu.  AUC    No. SV      Accu.  AUC    No. SV
default   96.33  95.52  473         92.66  96.61  725         91.01  77.63  816
1         96.15  95.63  407         92.29  96.55  593         92.29  77.52  752
10        95.05  96.97  335         93.94  96.35  309         93.94  78.55  343
100       93.76  96.76  333         94.86  97.17  259         94.50  81.26  239

Table 5.4: Selected results for the SVM with RBF kernel showing the effect of C on the classifier performance

Similar to the width of the kernel, a performance gain was also recorded as a function of the regularisation constant. It can be seen from Table 5.4 that for the drug IDV, as the value of C was increased from the default value (usually in the range (0, 1]) to 1, the performance increased from 94.33% accuracy and 95.52% AUC to 96.15% accuracy and 95.63% AUC. However, there was also a saturation point beyond which the performance started to decline (see Figure 5.3 on page 98). As described in the previous section, this is an indication that some value of the regularisation constant around C = 1 has already accounted for the misclassification penalty. The same behaviour can also be seen in Table 5.5. The results in Table 5.4 for the drugs D4T and NVP show that the performance increased gradually as the value of the regularisation constant was increased; however, as C is increased further we expect the performance to deteriorate or to show no gain beyond some value.

Figure 5.2: σ vs Accuracy/AUC graph for the SVM with RBF kernel. The solid lines in the graphs show the accuracy of the model, the broken lines the AUC.

Figure 5.3: C vs Accuracy/AUC graph for the SVM with RBF kernel. The solid lines in the graphs show the accuracy of the model, the broken lines the AUC.

Finally, similar to the result for the polynomial kernel, the number of support vectors is not as high as the dimensionality of the classification problem. The dimensionality of the feature space for a Gaussian kernel is infinite [Cristianini and Shawe-Taylor 2000]. As can be seen from Tables 5.3 and 5.4, the number of support vectors is once again less than 50% of the sample size. Commonality of support vectors among the different models was also exhibited.

5.2.3 Remark on SVMs performance

In this research the performance of the SVM classifier was assessed using the polynomial and the RBF kernels.
As discussed in the previous two sections, the performance of the classifier ranges from 68.99% accuracy and 73.51% AUC for NVP to 96.78% accuracy and 95.93% AUC for IDV with the polynomial kernel, and from 68.99% accuracy and 73.55% AUC for NVP to 96.15% accuracy and 95.63% AUC for IDV with the RBF kernel. These performances were obtained by grid search over a range of values of the degree of the polynomial and/or the width of the RBF kernel and the regularisation constant. The performance of these models might be improved by a more extensive search around the points where the best reported performance was recorded. Furthermore, the recorded performances were achieved without incorporating any prior biological information about the mutation points, and the resistance they are known to confer, into the input encoding or the kernel design; by incorporating this information, further performance gains might be achieved.

As mentioned previously, SVMlight is designed to optimise the accuracy of the classifier, but as can be seen from the graphs in the previous sections (Figure 5.1, Figure 5.2 and Figure 5.3), optimising the accuracy usually also leads to an optimised AUC. This observation was also published previously by Rakotomamonjy [2004]. Consider the highlighted results in Table 5.4 for D4T and NVP, which have accuracies of 94.86% and 94.50% respectively; the AUC shows a much larger spread, at 97.17% and 81.26% for these drugs respectively. This shows that the variance in AUC is higher than the variance in accuracy when the data is skewed, in agreement with the fact that the AUC is more sensitive to skewed data sets. The same result was also reported by Rakotomamonjy [2004].

The dimensionality of the patterns under investigation is one of the major issues compromising the performance of traditional pattern recognition techniques. As can be seen from the tables, we have a pair of data sets for each of the drugs in the nucleoside and non-nucleoside reverse transcriptase inhibitor categories. The difference in dimension between each pair of data sets is 99 attributes (the 99 codons belonging to the protease part of the sequence). Given this dimensionality difference, together with the fact that SVMs construct classification hyperplanes in a high-dimensional feature space, one might expect a large performance difference of a model across these two data sets. As described in Section 2.5.2, to avoid the curse of dimensionality and maintain performance, the number of training samples would normally have to be increased to match the dimensionality difference. The results reported in the previous sections, however, show little performance loss as a result of the extra dimensions. This demonstrates the capacity of SVMs to handle high-dimensional classification tasks without the need for comparably large data sets.
γ = 0.1
Drug             C = default            C = 1                  C = 10                 C = 100
                 Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV
IDV              95.05  95.06  870      89.72  94.04  1418     95.78  95.39  683      95.60  94.61  331
NFV              92.66  97.45  912      74.50  95.12  1446     92.84  97.09  698      94.50  96.81  282
AZT (RT)         85.14  96.06  898      66.24  94.12  1452     92.11  96.46  732      92.11  96.50  284
AZT (PR + RT)    83.67  96.03  978      64.04  93.37  1494     88.99  96.24  802      92.11  96.55  342
D4T (RT)         85.14  96.25  914      70.83  94.69  1450     92.29  96.34  738      93.94  96.22  328
D4T (PR + RT)    84.95  96.68  980      68.62  94.80  1490     87.89  96.82  805      93.94  96.94  396
NVP (RT)         85.14  77.03  914      70.83  77.23  1450     92.29  77.06  738      93.94  78.34  328
NVP (PR + RT)    68.99  73.55  1023     68.99  73.41  1024     70.09  73.56  1024     -      -      -

γ = 0.5
Drug             C = default            C = 1                  C = 10                 C = 100
                 Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV
IDV              95.23  95.30  806      95.05  95.03  892      95.96  94.35  398      95.60  95.74  266
NFV              92.84  97.63  843      92.66  97.03  933      94.31  96.99  366      95.96  97.37  212
AZT (RT)         87.34  96.31  863      85.14  96.15  914      92.11  96.65  371      92.66  96.93  207
AZT (PR + RT)    85.69  96.15  926      83.12  96.10  911      92.11  96.41  455      92.66  97.02  243
D4T (RT)         85.50  96.45  867      85.14  96.42  933      93.94  96.46  426      93.94  96.02  254
D4T (PR + RT)    86.24  97.23  937      84.59  97.27  1000     93.21  96.55  497      93.94  97.34  286
NVP (RT)         85.50  77.25  867      85.14  77.37  933      93.94  77.76  426      93.94  80.02  254
NVP (PR + RT)    69.36  73.59  1023     68.99  73.52  1019     74.86  76.80  979      82.02  79.97  876

γ = 1
Drug             C = default            C = 1                  C = 10                 C = 100
                 Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV
IDV              95.23  95.46  759      95.78  95.37  694      95.78  94.85  340      95.23  95.90  260
NFV              92.84  97.54  769      93.03  97.52  701      94.50  97.14  290      96.15  97.49  203
AZT (RT)         91.93  96.44  809      92.11  96.56  745      92.11  96.58  289      93.39  97.21  193
AZT (PR + RT)    86.79  96.32  880      88.62  96.35  812      92.11  96.58  353      92.66  97.10  226
D4T (RT)         91.01  96.48  816      92.29  96.54  752      93.94  96.31  343      94.50  96.96  239
D4T (PR + RT)    86.97  97.39  881      88.26  97.37  819      93.94  96.62  401      94.50  97.18  288
NVP (RT)         91.01  77.63  816      92.29  77.52  752      93.94  78.55  343      94.50  81.26  239
NVP (PR + RT)    70.09  73.56  1016     70.09  73.65  1017     76.88  78.38  958      82.20  80.16  869

γ = 2
Drug             C = default            C = 1                  C = 10                 C = 100
                 Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV
IDV              95.78  95.17  664      96.15  94.92  542      95.78  95.40  310      94.13  96.65  268
NFV              93.03  97.07  668      93.39  97.12  531      94.86  97.43  258      95.96  97.75  200
AZT (RT)         92.29  96.55  715      92.11  96.56  575      92.29  96.71  263      93.76  97.46  192
AZT (PR + RT)    89.36  96.36  791      91.93  96.43  658      92.29  96.72  309      93.21  96.71  236
D4T (RT)         92.66  96.61  725      92.29  96.55  593      93.94  96.35  309      94.86  97.17  259
D4T (PR + RT)    89.17  97.34  791      92.48  96.86  672      93.94  96.61  372      93.58  97.11  303
NVP (RT)         92.66  78.05  725      92.29  77.99  593      93.94  80.02  309      94.86  81.17  259
NVP (PR + RT)    70.09  73.74  1020     72.66  74.15  1021     79.27  79.09  935      81.83  79.41  886

γ = 5
Drug             C = default            C = 1                  C = 10                 C = 100
                 Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV
IDV              96.15  95.28  252      96.33  95.25  426      95.05  96.43  312      94.13  96.84  291
NFV              93.39  97.11  508      94.13  97.54  397      95.96  97.93  251      95.41  98.14  227
AZT (RT)         92.29  96.53  543      92.11  96.55  408      92.66  97.07  246      94.13  96.64  230
AZT (PR + RT)    92.11  96.48  652      92.11  96.51  513      92.66  97.00  303      93.21  97.16  297
D4T (RT)         93.03  96.64  565      93.94  96.50  460      94.13  96.85  291      93.76  97.24  288
D4T (PR + RT)    92.66  97.40  658      93.21  96.76  538      94.31  97.47  351      92.11  96.81  353
NVP (RT)         93.03  78.11  565      93.94  78.80  460      94.13  80.84  291      93.76  78.23  288
NVP (PR + RT)    73.39  74.72  1032     73.94  76.53  1016     81.10  78.83  937      77.43  74.96  928
γ = 10
Drug             C = default            C = 1                  C = 10                 C = 100
                 Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV   Accu.  AUC    No. SV
IDV              96.33  95.52  473      96.15  95.63  407      95.05  96.97  335      93.76  96.76  333
NFV              94.50  97.12  434      94.86  97.79  369      95.78  98.20  274      95.78  98.24  249
AZT (RT)         92.29  96.54  444      92.11  96.56  356      93.21  97.38  259      94.13  96.18  274
AZT (PR + RT)    92.29  96.49  565      92.11  96.45  469      92.84  96.52  344      93.39  97.20  347
D4T (RT)         93.76  96.60  485      93.94  96.54  417      94.68  97.16  324      93.03  96.94  325
D4T (PR + RT)    93.39  97.39  590      94.13  97.18  504      94.31  97.19  386      92.29  96.68  405
NVP (RT)         93.76  78.92  485      93.94  79.73  417      94.68  80.45  324      93.03  76.52  325
NVP (PR + RT)    73.94  75.76  1034     76.15  77.25  1012     79.63  77.37  989      74.13  73.15  971

Table 5.5: SVMs classifier performance with Radial Basis Function (RBF) kernel

5.3 Performance of neural networks

The performance of a neural network is highly affected by its architecture. Some of the factors defining the architecture, and hence affecting the performance of a neural network model, are: the connection type, the learning or training algorithm, the number of hidden layers and the number of neurons in them, the activation functions, the learning rate, the initial values of the weights and the training stopping criteria. Although we did not perform an exhaustive search over all of these factors, in this experiment we tried different network architectures with different numbers of neurons in the hidden layer, activation functions and learning algorithms. The architectures were chosen by empirical testing, as there is no automatic way of selecting a network architecture for a given problem.

In search of a simple network architecture that is still complex enough, networks with one hidden layer (input layer, one hidden layer and output layer) and with two hidden layers were tested. There was no performance difference between these two architectures, and hence the one-hidden-layer architecture was used as it converges faster. The number of neurons in the input layer equals the dimensionality of the data and the network has one output (the output layer has one neuron). Different numbers of neurons in the hidden layer were tested, starting with a simple network with 5 neurons. The number of neurons was increased by 5 until no performance gain was recorded and the computational time increased greatly (sometimes to longer than an hour). For most of the drugs, an architecture with 10-15 neurons was complex enough.

For each architecture, different learning algorithms were also tested. The three learning algorithms tested are standard backpropagation, resilient backpropagation and the Levenberg-Marquardt algorithm. Each of these learning algorithms has its theoretical pros and cons; however, empirical testing is required to see how they perform on the problem at hand. For this experiment, the Levenberg-Marquardt algorithm was selected, as it was the best in terms of performance, convergence and memory usage. Similarly, different activation functions were tested and the log-sigmoid transfer function was selected. Finally, feedforward neural networks with fully connected neurons and log-sigmoid activation functions were used.

The performance of these architectures was evaluated on the same data sets with the identical input encoding technique as the SVMs model. However, the data was further pre-processed to make it more suitable for a neural network model. The data was normalised in such a way that it has zero mean and unit standard deviation. The data was then divided into three sets, where 50% is used for training, 25% for validation and the remaining 25% for testing.
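The neural network experiments in this research used the Levenberg-Marquardt algorithm with a log-sigmoid activation. The fragment below is only a rough, hypothetical sketch of the same setup (normalisation, a 50/25/25 split and one small hidden layer) using scikit-learn's MLPClassifier; scikit-learn has no Levenberg-Marquardt solver, so L-BFGS is used as a stand-in, and X and y are again hypothetical placeholders for the encoded data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(545, 99)).astype(float)   # hypothetical encoded sequences
y = rng.integers(0, 2, size=545)                         # hypothetical resistance labels

# Normalise to zero mean and unit standard deviation, as described above.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 50% training, 25% validation, 25% test (the validation set would be used for
# monitoring/early stopping in the original setup; it is only held out here).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.50, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=1)

# One hidden layer of about 10 neurons with a logistic (sigmoid) activation;
# L-BFGS stands in for Levenberg-Marquardt, which scikit-learn does not provide.
net = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    solver="lbfgs", max_iter=1000, random_state=1)
net.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, net.predict(X_test)))
print("AUC:     ", roc_auc_score(y_test, net.predict_proba(X_test)[:, 1]))
```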
Each experiment was repeated 10 times and the mean accuracy and AUC are presented in Table 5.6 on page 104.

As can be seen from Table 5.6, the performance of the network decreased as the dimensionality of the data increased. The mean accuracy of the network on the protease inhibitors is between 89% and 91%. However, the performance drops to the range 82% to 84% for AZT and D4T as a result of the increase in the dimensionality of the patterns. The performance was further reduced to the range 64% to 69% for NVP due to the higher dimensionality of the data and its relative skewness. Similar to the SVMs algorithm, the Levenberg-Marquardt algorithm is accuracy-optimised; however, it also optimises AUC in most cases.

                 Original dimensionality        Reduced dimensionality
Drug             Accuracy    AUC                Accuracy    AUC
IDV              89.75       91.76              89.91       86.76
NFV              91.74       95.31              78.90       87.51
AZT (RT)         82.57       87.79              82.20       88.57
AZT (PR + RT)    82.02       87.70              59.63       60.94
D4T (RT)         84.34       91.85              84.22       92.25
D4T (PR + RT)    82.33       87.09              60.37       54.80
NVP (RT)         69.72       68.93              70.49       69.50
NVP (PR + RT)    64.77       71.10              68.99       47.06

Table 5.6: Performance of neural networks model

To address the performance loss resulting from the increase in dimensionality, we applied one more pre-processing technique to reduce the dimensionality of the data. For this task we used principal component analysis. As recommended by Jolliffe [1986], we reduced the dimensionality of the data up to 90%. The data was then reduced to 2- or 3-dimensional patterns. As in the experiment with the original dimensionality, the data was divided into training, validation and test sets, and the mean accuracy and AUC over 10 repeats of each experiment are given in Table 5.6.

As can be seen from the table, there is no performance gain as a result of dimensionality reduction except for NVP(RT). The reason for this reduced performance could be the information loss due to dimensionality reduction. Recall that the data have a discrete representation, with each numeric value representing one of the 20 amino acids. Hence, through dimensionality reduction we might have caused some valuable information to be lost. From the performance on AZT(PR + RT), D4T(PR + RT) and NVP(PR + RT) it can be seen that the information loss was even higher for these data sets.

5.4 Performance of decision trees

In this part of the experiment we generated decision tree models that describe the drug resistance profile of an HIV strain in terms of the numeric composition of the vector, which in turn represents the amino acid composition of the enzyme targeted by the drug. To estimate the prediction power of the decision tree model we performed 10 independent tests; each time we used 75% of the data for training and the remaining 25% for testing. The performance of this model is given in Table 5.7 on page 105, which reports the mean number of interior nodes of the tree, the number of leaves of the tree, the accuracy of prediction and the AUC over the 10 experiments. Each training session resulted in one decision tree for each drug.
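The decision tree experiments in this research used the C4.5 algorithm with pruning. The fragment below is only an illustrative sketch of the repeated 75%/25% train/test protocol described above; scikit-learn provides a CART-style tree rather than C4.5, so an entropy-based DecisionTreeClassifier with cost-complexity pruning is used as a stand-in, and X and y are hypothetical placeholders for the encoded data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(545, 99)).astype(float)   # hypothetical encoded sequences
y = rng.integers(0, 2, size=545)                         # hypothetical resistance labels

accs, aucs = [], []
for seed in range(10):                                   # 10 independent 75%/25% splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    # C4.5 itself is not available in scikit-learn; an entropy-based CART tree with
    # cost-complexity pruning stands in for the pruned C4.5 trees of this research.
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.005, random_state=seed)
    tree.fit(X_tr, y_tr)
    accs.append(accuracy_score(y_te, tree.predict(X_te)))
    aucs.append(roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]))

print(f"mean accuracy: {np.mean(accs):.4f}  mean AUC: {np.mean(aucs):.4f}")
# Interior (non-leaf) nodes of the last fitted tree, for comparison with Table 5.7.
print("interior nodes:", int((tree.tree_.children_left != -1).sum()))
```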
                 No. interior    No.
Drug             vertices        leaves      Accuracy    AUC
IDV              16              17          97.06       92.70
NFV              6               7           97.06       94.43
AZT (RT)         14              15          96.33       93.41
AZT (PR + RT)    18              19          94.86       90.96
D4T (RT)         29              30          93.58       88.29
D4T (PR + RT)    29              30          91.93       85.39
NVP (RT)         25              26          94.31       85.52
NVP (PR + RT)    25              26          94.68       85.99

Table 5.7: Performance of pruned C4.5 classification algorithm

Figure 5.4: Example decision tree for NFV. The label of the interior nodes refers to the position of the codon in the sequence (for example P30 means the amino acid at position 30) and the label of the leaves indicates the drug resistance property of the strain.

Figure 5.5: Biologically interpreted decision tree for NFV given in Figure 5.4. The label for the edges represents the amino acid (codon) at the position given in the node above it. M = Methionine (AUG), T = Threonine (ACC), N = Asparagine (AAU), I = Isoleucine (AUC) and A = Alanine (GCU) (see Table C.1 in Appendix C).

Most of the resulting decision trees were simple, with the most complicated one having an average of 29 interior nodes (see Table 5.7). An example of a decision tree for the drug NFV is given in Figure 5.4. This tree is one of the 10 trees generated for this drug and is selected as an example only because it has the average tree size.

The biological interpretation of the decision trees is not within the scope of this research and requires deep biological knowledge. However, there are some biological aspects of these trees we want to discuss. For each of the interior nodes of the resulting trees for IDV, NFV, AZT(RT), D4T(RT) and NVP(RT) we checked whether these mutation points have previously been reported. We found (results not shown for all) that most of the interior nodes are reported as significant mutation points. For example, for the drug NFV (decision tree given in Figure 5.5) all four of the interior nodes P30, P46, P82 and P90 are identified as positions associated with drug resistance (the mutations related to these positions are 30N, 46I, 82A, 82T, 82I and 90M); see Beerenwinkel et al. [2002] and the drug resistance summaries on http://hivdb.stanford.edu/index.html. However, not all of the positions associated with drug resistance are interior nodes of our tree. For example, position 71 of the protease sequence, indicated as a position associated with drug resistance to NFV by Beerenwinkel et al. [2002], does not appear as a split criterion in our tree. The absence of this and other positions as interior nodes of our tree might be due to a number of reasons. Firstly, the number of samples in our data set associating drug resistance with these positions might be very small. Secondly, most of the mutation patterns are correlated and hence might be ignored as split criteria. Conversely, we have also seen positions which have not previously been associated with mutations for the particular drug appear as interior nodes. For example, some positions inside the protease part of the sequence appeared as interior nodes in the decision tree for D4T(PR + RT). This phenomenon is very hard to explain without further clinical study. However, it has been argued by a number of researchers that one of the limitations of current clinical research is that only an isolated gene is studied during drug design, and there might be a possibility that these positions have an effect on resistance to the drug. The results shown in Table 5.7 give the average performance of the pruned trees.
During the experiment we first constructed the unpruned tree, and then each of these trees was pruned using the information gain ratio criterion. We tested different values of the pruning confidence factor in order to prevent both over- and under-fitting. As can be seen from the table, we found accuracies in the range 91.93% to 97.06% and areas under the ROC curve in the range 85.39% to 94.43%.

The variation in performance between the decision tree models for D4T(RT) and D4T(PR + RT), and for AZT(RT) and AZT(PR + RT), might be due to the protease codons which are used as split criteria in the decision trees for D4T(PR + RT) and AZT(PR + RT). However, the performance variation between NVP(RT) and NVP(PR + RT) is very small compared to the previous two. This small variation between the two decision tree models for NVP might be because there is only one position belonging to protease which occurred as a split criterion, deep inside the tree (decision tree not shown). The variation in performance does not contradict the argument made in the paragraph above about the possible effect of protease mutations on the drug resistance behaviour of reverse transcriptase inhibitors. The main reason behind this variation might be that the data used in this research was already classified with drug resistance prediction algorithms that considered the protease or reverse transcriptase gene alone for the respective inhibitor.

Finally, the high performance of the decision tree models, together with the confirmation of most interior nodes as positions associated with drug resistance, can be considered a validation of the input encoding technique used.

5.5 Performance of logistic regression

The final part of the experiment was the performance evaluation of logistic regression. As mentioned in Section 2.8, the choice of an optimal value for the ridge parameter is crucial, and cross-validation is the most successful way of estimating it. Hence, like the rest of the experiments, we used 75% of the data, randomly selected, for training and the remaining 25% for testing. The results given in Table 5.8 are the mean values of the accuracy and the AUC over 10 experiments. In this part of the experiment we used logistic regression with the ridge estimator, and different values of the ridge parameter were tested; the results presented in the table are the best performance recorded. The accuracy of the logistic regression models ranges from 75.78% to 93.39% and the AUC ranges from 46.26% to 85.06%.

Drug             Accuracy    AUC
IDV              92.66       85.06
NFV              93.39       83.58
AZT (RT)         81.10       78.24
AZT (PR + RT)    84.40       78.01
D4T (RT)         80.55       65.37
D4T (PR + RT)    80.38       63.31
NVP (RT)         75.78       46.26
NVP (PR + RT)    77.98       50.96

Table 5.8: Performance of logistic regression

5.6 Discussion

Evaluation of different algorithms can easily lead to a biased ranking of the algorithms. As specified in Section 4.4, theoretical estimation of the generalisation ability of different algorithms requires deep knowledge about the underlying distribution of the data. This information is usually not available, so researchers are forced to find alternative approaches that can give an unbiased estimate of the true error rate of a classifier. When the number of training and testing patterns is limited, a single train-and-test experiment will give misleading information. To avoid this problem and obtain an unbiased estimate of the performance of the different algorithms, this research used cross-validation.
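To make this concrete, the sketch below shows how the four families of classifiers compared in Table 5.9 could be evaluated on identical cross-validation folds, scoring both accuracy and AUC. It is illustrative only: the actual figures in Table 5.9 come from the tool-specific experiments described in the preceding sections, and X and y are hypothetical placeholders. The ridge penalty of the logistic regression model is controlled here through scikit-learn's inverse-regularisation parameter C.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(545, 99)).astype(float)   # hypothetical encoded sequences
y = rng.integers(0, 2, size=545)                         # hypothetical resistance labels

models = {
    "SVM (RBF)": SVC(kernel="rbf", gamma=1.0, C=100),
    "Neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                      solver="lbfgs", max_iter=1000, random_state=0)),
    "Decision tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Logistic (ridge)": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
}

# The same folds are reused for every model so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=("accuracy", "roc_auc"))
    print(f"{name:<17} acc={scores['test_accuracy'].mean():.4f} "
          f"auc={scores['test_roc_auc'].mean():.4f}")
```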
The performance of each of the different classification algorithms is given in Table 5.9. The results show that SVMs produced the best overall results. Decision trees were rated second best (except for NVP). The performance of the different models on NVP is relatively poor compared to the other drugs, which might be due to the relative skewness of the data sets for this drug. Neural networks performed well compared to logistic regression but were inferior to the other two classifiers.

                 SVM               NN                DT                Logistic Reg.
Drug             Accu.   AUC       Accu.   AUC       Accu.   AUC       Accu.   AUC
IDV              96.33   95.52     89.75   91.76     97.06   92.70     92.66   85.06
NFV              96.15   97.49     91.74   95.31     97.06   94.43     93.39   83.58
AZT (RT)         94.13   96.64     82.20   88.57     94.86   90.96     81.10   78.24
AZT (PR + RT)    93.21   96.64     82.02   87.70     96.33   93.41     84.40   78.01
D4T (RT)         94.86   97.17     84.34   91.85     91.93   85.39     80.55   65.37
D4T (PR + RT)    93.58   97.11     76.33   84.09     93.58   88.29     80.38   63.31
NVP (RT)         94.50   81.26     70.49   69.50     94.31   85.52     75.78   46.26
NVP (PR + RT)    82.20   80.16     64.77   71.10     94.68   85.99     77.98   50.96

Table 5.9: Performance of the different classification algorithms

Decision trees had slightly better accuracy than SVMs on almost all data sets; however, SVMs have an overall better area under the curve, which means the SVM models have a better trade-off between sensitivity and specificity compared to decision trees (see the graphs in Figure 5.6 and Figure 5.7). There can be a number of reasons for the impressive performance of decision trees. Firstly, as mentioned in Section 5.4, the training and testing sets were previously classified using a knowledge-based approach, which best suits a decision tree model. Secondly, the discrete data encoding can be argued to suit the decision tree model better than SVMs.

Figure 5.6: Accuracy of the different classifiers

Comparing the performance of SVMs and neural networks, it can be seen from the table that neural networks were outperformed by SVMs on all datasets (see the graphs in Figure 5.6 and Figure 5.7). This might be attributed to the dimensionality of the data compared to the number of available training samples, and to the loss of information during dimensionality reduction. As mentioned in Section 5.4 and Chapter 2, mutation points complement each other in causing drug resistance: while some mutations are major and cause resistance alone, others are secondary and need the existence of other mutations to cause drug resistance. This means these attributes are highly correlated and might be given a lower rank during dimensionality reduction using principal component analysis. Besides its better performance, the SVM also converged rapidly compared to the neural network, even on the reduced dimensionality. The results also show that the SVM models outperformed the logistic regression models.

Figure 5.7: Area under the ROC curve of the different classifiers

Although there is no previous research known to us which used SVMs to directly predict HIV drug resistance, to check the performance of our models we made comparisons with the SVM regression model by Beerenwinkel et al. [2003a], the decision tree models by Beerenwinkel et al. [2002] and the neural network models by Draghici and Potter [2003] and Wang and Larder [2003]. The differences in the size of the data sets used, the error estimation techniques and the test methods make direct comparison of our models with the models in previous work difficult. The SVMs models for this research compare favourably with the SVM model by Beerenwinkel et al.
[2003a], where SVM regression is used to predict phenotypic resistance from the genotype. Similarly, the decision tree models from this research produced better performance than the decision tree models presented in Beerenwinkel et al. [2002] for all five drugs. Comparing our neural network models with those in Draghici and Potter [2003], our models performed better than their single-network model. However, the performance of the neural network model by Wang and Larder [2003] on one drug in the protease inhibitor category was slightly higher than what we achieved, which might be attributed to the prior knowledge about the drugs incorporated in their model.

A key result is that minimal biological or expert knowledge was used (we only used the fact of what the drug targets to select the general genomic region to study). We do not wish to downplay the contribution that expert knowledge can make to understanding the mutation process (after all, we are only using the one-dimensional knowledge given by the genomic sequence, whereas the virus and its RNA/DNA are both complex three-dimensional objects). However, these results show that machine learning can significantly complement and assist the work of human experts working on the virus.

5.7 Statistical significance

It has become a tradition to report statistical significance if a two-tailed, paired t-test produces a p-value less than some significance level (usually 0.05) [Salzberg 1999]. According to Salzberg [1999], these statistical tests should be used very carefully for classifier comparison as they are not designed for computational experiments. Furthermore, Salzberg [1999] highlighted the following problems:

- The t-test assumes that the test sets are independent for each algorithm. In this research, cross-validation with random partitioning of the data set between training and testing was used, and the different algorithms are trained and tested on the same data set. This means that the test sets share some of the patterns and hence are not independent. Dietterich [1996, as cited in Salzberg [1999]] has pointed out that this has a "high probability of Type I error . . . and should never be used".

- Use of the wrong p-value finds statistical significance where there is none. This is usually addressed by making an adjustment to the significance level, called the Bonferroni adjustment, provided that the experiments are independent of one another.

Salzberg [1997] highlighted k-fold cross validation as a recommended approach, but also emphasised that it will not produce valid statistics because the test data are interdependent. Recommended alternative approaches can be found in Salzberg [1997], Salzberg [1999] or Gascuel and Caraux [1992]. For this research, we have chosen to present the performance of the different classifiers graphically, as given in Figure 5.6 and Figure 5.7.

5.8 Limitations of the research

The results achieved in this research are not without their limitations. Some of the limitations can be easily addressed while others are more complicated and costly to address. One of the limitations is the availability of data. As stated in the introduction, our aim was to design a model without incorporating the existing biological knowledge. To do this one ideally needs pure phenotypic HIV drug resistance data, which is very expensive to generate, has a low quality as computational data and might present a different level of difficulty.
Another limitation is the number of training samples. Again, due to the cost of data generation and the sensitivities around patient-doctor privilege, it is always hard to collect a large amount of data. As an example, due to the small number of sequences labelled intermediate (I), we were forced to perform two-class classification rather than multi-class classification. Another limitation that needs to be addressed is the input encoding technique. In this research, the input encoding technique used converts the patterns into a discrete vector. This encoding might be better suited to one algorithm than to another and might have affected the ranking of the algorithms according to performance.

This research can be further extended to address the above limitations. Further improvement can be obtained by collecting a larger amount of high quality data from different gene databases; the same goal can also be achieved by working in collaboration with HIV laboratories. This research can also be extended to make it more user friendly and web based, so that it is accessible to a wide range of experts working in the field. Furthermore, it can be extended by incorporating different feature extraction and reduction techniques.

5.9 Summary

Performing comparative research is always a source of controversy in the pattern recognition community. The reason for conducting a comparative analysis of the performance of SVMs against neural networks, decision trees and logistic regression was not to verify or falsify the superiority of one classification technique over the others. As a relatively new addition to pattern recognition, SVMs have been compared to numerous pattern recognition techniques in a number of benchmark applications. The results of those studies claimed that SVMs have better performance on high dimensional data when faced with limited training samples. High dimensionality of the patterns and a shortage of training samples are common characteristics of many computational biology problems. Following this, this research investigated whether SVMs can be used as a drug resistance prediction tool based on the genetic sequence of an HIV strain. The results might not answer the question directly, but they are promising. The research has a number of limitations; however, its significance for bioinformatics and computational biology research cannot be ignored.

Chapter 6

Conclusion and Future work

6.1 Conclusion

The Human Immunodeficiency Virus (HIV) has infected millions of people and is spreading at a very high rate. One of the reasons for the failure to combat the epidemic is the drug-resistant behaviour of the virus, which results from its rapid mutation rate. HIV drug resistance testing, which helps optimise drug administration, has shown positive results in prolonging viral suppression and helps to reconstruct patient immunity. Phenotypic testing and genotypic testing are the two approaches to HIV drug resistance testing. The genotypic approach has gained a lot of interest because it is cost effective, easy to conduct and the interpretation of the result is independent of the process. The latter advantage has also made genotypic testing an ideal application for computerised expert systems and/or pattern recognition techniques. Pattern recognition techniques have been used to solve problems that cannot be solved using the traditional algorithmic approach.
These techniques are particularly useful when the relation between the input and the output is not properly defined and the data to be processed is enormous in size and very hard to analyse manually. This research used SVMs as a pattern recognition technique.

Although SVMs are relatively new, they have shown outstanding performance as a tool for pattern recognition. The principle of structural risk minimisation and the ability of SVMs to transform the input data to a higher dimensional feature space (the kernel trick) are behind the success achieved by SVMs. Other traditional pattern recognition techniques use the principle of empirical risk minimisation, which minimises the training error and usually suffers from over-fitting. The principle of structural risk minimisation, on the other hand, which minimises an upper bound on the test error, gives SVMs the ability to generalise better. SVMs can be simply described as a combination of a linear learning machine, regularisation and the kernel trick. They discriminate the positive examples from the negative ones using a hyperplane while maintaining a maximum margin between them. Many real-world problems are not linearly separable (separable by a linear hyperplane). To classify non-linear data sets, SVMs use a kernel function that implicitly works in a higher dimensional feature space, where the data can be discriminated using a linear hyperplane.

There are a number of different implementations of SVMs. This research used SVMlight, a C implementation of Vapnik's support vector machine. SVMlight was selected because it is the most popular implementation, it is optimised and it can handle large problems. It also has the ability to incorporate a user-defined kernel function and a cost model besides the three popular kernel functions.

The input data used to answer the research question is the genetic sequence of the viral protease and reverse transcriptase from 2045 individuals. These sequences were previously classified using some of the popular drug resistance interpretation algorithms. On this data set the performance of SVMs, neural networks, decision trees and logistic regression was tested. The best performance for each of these algorithms was found using grid-search over a range of model parameters. The results showed that SVMs outperformed the neural networks and logistic regression, while the performance of the decision tree models and the SVMs models was broadly similar. The results of this research confirm the findings of many researchers outlining the power of SVMs as a classification tool.

6.2 Future work

Although this research showed the ability of SVMs as a tool for predicting HIV drug resistance based on the genetic sequence of the virus, it has a number of limitations. Following the achievements of this work, and to address some of the limitations, the following future directions are proposed:

1. The two major limitations identified arise from the number and quality of the data available. To address this limitation, further research should be conducted by gathering a large amount of quality data and performing not only binary but multi-class prediction of HIV drug resistance.

2. As pointed out in the previous chapter, the input encoding technique might have favoured some classification algorithms over others. Section 4.6 also highlighted the effect of different input encoding schemes; however, a detailed study was not done.
To investigate the effect of input encoding schemes on the performance of the different classification algorithms, further research should be conducted.

3. The data used in this research was previously classified using rule-based algorithms. Although this does not compromise what has been achieved here, conducting the research using pure phenotypic data would give more insight and might present a different level of challenge.

4. In this research we only concentrated on the computational properties and computational achievements of the different classification tools. However, further work should be done in close collaboration with experts in the field to investigate the biological implications of the results.

5. The results for the decision tree models showed that interior nodes were related to known mutation points, even though we did not use any of the prior biological information. Further research can be conducted to investigate the application of SVMs to identify mutation points, using SVMs for feature selection. This can be carried out by performing sensitivity analysis on SVMs and investigating whether this technique can be used to identify mutation points. This work should also be done in collaboration with biological experts.

6. Similar to the above, Boz [2002] investigated ways of converting a trained neural network into a decision tree in order to make the complex rules discovered by the neural network model more understandable to humans. In our particular problem this could also be used to identify mutation points which are known to confer drug resistance, since these mutation points appear as the split criteria. Hence, future research can be carried out to investigate ways of converting a trained SVM model into a decision tree.

Bibliography

[Baldi and Brunak 2001] Pierre Baldi and Soren Brunak. Bioinformatics: the machine learning approach. MIT Press, Cambridge, MA, USA, 2001.

[Bean 2000] Pamela Bean. An introduction to HIV drug resistance testing. Clinical Note, pages 10-11, May 2000.

[Beerenwinkel et al. 2002] Niko Beerenwinkel, Barbara Schmidt, Hauke Walter, Rolf Kaiser, Thomas Lengauer, Daniel Hoffmann, Klaus Korn, and Joachim Selbig. Diversity and complexity of HIV-1 drug resistance: A bioinformatics approach to predicting phenotype from genotype. Proceedings of the National Academy of Sciences of the United States of America, 99(12):8271-8276, June 2002.

[Beerenwinkel et al. 2003a] Niko Beerenwinkel, Martin Daumer, Mark Oette, Klaus Korn, Daniel Hoffmann, Rolf Kaiser, Thomas Lengauer, Joachim Selbig, and Hauke Walter. Geno2pheno: estimating phenotypic drug resistance from HIV-1 genotypes. Nucleic Acids Research, 31(13):3850-3855, 2003.

[Beerenwinkel et al. 2003b] Niko Beerenwinkel, Thomas Lengauer, Martin Dumer, Rolf Kaiser, Hauke Walter, Klaus Korn, Daniel Hoffmann, and Joachim Selbig. Methods for optimizing antiviral combination therapies. Bioinformatics, 19:i16-i25, 2003.

[Boser et al. 1992] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 27-29 1992. ACM Press.

[Boz 2002] Olcay Boz. Extracting decision trees from trained neural networks. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 456-461, Edmonton, Alberta, Canada, July 23-26 2002.

[Bradley 1997] Andrew P. Bradley.
The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145?1159, 1997. [Breiman et al. 1984] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Wadsworth International Group, Belmont, California, USA, 1984. 116 [Brown et al. 2000] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, Jr. M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene ex- pression data using support vector machines. Proceedings of the National Academy of Sciences of the United States of America, 97(1):262?267, January 2000. [Burges 1998] Christopher J. C. Burges. A tutorial on Support Vector Machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121?167, 1998. [Cessie and van Houwelingen 1992] S. Le Cessie and J. C. van Houwelingen. Classification of gene microarrays by penalized logistic regression. Applied Statistics, 41(1):191?201, 1992. [Chan 1996] Lia-Wan Chan. Levenberg-Marquardt learning and regularization. In Shun-Ichi Amari, Lei Xu, Lai-Wan Chan, Irwin King, and Kwong-Sak Leung, editors, Progress in Neural Information Processing: Iconip?96 : Proceedings of the International Conference on Neural Information Processing, pages 139 ? 144, Hong Kong, 24-27 September 1996. Springer-Verlag Singapore. [Coffin 1999] J. M. Coffin. Molecular Biology of HIV. In Keith A. Crandall, editor, Evolution of HIV, pages 3?41. John Hopkins Univeristy Press, 1999. [Cortes and Vapnik 1995] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learn- ing, 20:273 ? 297, 1995. [Cristianini and Shawe-Taylor 2000] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000. [Delen and Kadam 2005] G. Walker Delen, D. and A. Kadam. Predicting breast cancer surviv- ability: a comparison of three data mining methods. Journal of Artificial Intelligence in Medicine, 34(2):113 ? 127, June 2005. [Demuth and Beale 2004] Howard Demuth and Mark Beale. Neural Network Toolbox: User guide, June 2004. [Diem 1962] Konrad Diem, editor. DOCUMENTA GEIGY : SCIENTIFIC TABLES. Geigy Pharmaceuticals, Basle, Switzerland, 1962. [Dietterich 1996] T. Dietterich. Statistical tests for comparing supervised learining algorithm. Technical report, Oregon State University, Corvallis, OR, 1996. [Draghici and Potter 2003] S. Draghici and R. B. Potter. Predicting HIV drug resistance with neural network. Bioinformatics, 19(1):98 ? 107, January 2003. [Draghici et al. 2000] S. Draghici, L. Cumberland, and L. C. Kovari. Correlation of HIV pro- tease structure with Indinavir resistance: a data mining and neural network approach. 117 In Belur V. Dasarathy, editor, Proceedings, the International Society for Optical Engi- neering, volume 4057 of Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, pages 319 ? 329, Orlando, Florida, April 2000. [Duda et al. 2000] Richard O. Duda, Peter E. Hart, and David G. Stock. Pattern Recognition and Scene Analysis. John Wiley & Sons, New York, 2000. [Dumais et al. 1998] Susan T. Dumais, John Platt, David Hecherman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proc. 7th International Conference on Information and Knowledge Management, pages 148?155, Bethesda, Maryland, United States, November 3-7 1998. ACM Press. [Eilers et al. 2001] P. H. Eilers, J. M. Boer, G.-J. 
van Ommen, and H. C. van Houwelingen. Classification of microarray data with penalized logistic regression. In Proc. SPIE Vol. 4266, p. 187-198, Microarrays: Optical Technologies and Informatics, Michael L. Bit- tner; Yidong Chen; Andreas N. Dorsel; Edward R. Dougherty; Eds., pages 187?198, June 2001. [Fawcett 2003] Tom Fawcett. ROC Graphs: Notes and Practical consideration for Re- searchers. Technical report, HP Laboratory, 2003. [Ferri et al. 2003] Ce?sar Ferri, Jose? Herna?ndez-Orallo, and Miguel A. Salido. Volume under the ROC surface for multi-class problems. In ECML, pages 108?120, Cavtat-Dubrovnik, Croatia, November 2003. LNAS Springer Verlag. [Friedman 1996] Jerome H. Friedman. Another Approach to Polychromous Classification. Technical report, Stanford University, 1996. http://www-stat.stanford.edu/ ?jhf/ftp/poly.ps.Z. [Fu 1974] K. S. Fu. Syntactic Methods in Pattern Recognition. Academic Press, New York, 1974. [Furey et al. 2000] Terrence S. Furey, Nigel Duffy, Nello Cristianini, David Bednarski, Michel Schummer, and David Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906?914, 2000. [Ganapathiraju 2002] Aravind Ganapathiraju. Support Vector Machines for speech recognition. PhD thesis, Mississippi State University, May 2002. http: //www.ece.msstate.edu/academics/gradprog/dissertations/ ganapathiraju_dissertation.pdf,. [Gascuel and Caraux 1992] Olivier Gascuel and Gilles Caraux. Statistical significance in in- ductive learining. In Proceedings of the European Conference on Artificial Intelegence (ECAI), pages 435 ? 439, New York, USA, 1992. Wiley. 118 [Goutte 1997] Cyril Goutte. Note on free lunches and cross-validation. Neural Computation, 9(6):1245?1249, 1997. [Greene and Peterlin 2003] W. C. Greene and B. M. Peterlin. Molecular Insights into HIV biology. Nature Science of HIV, February 2003. HIV InSite Knowledge Based Chapter at http://hivinsite.ucsf.edu/. [Gutierrez-Osuna ] Richardo Gutierrez-Osuna. Pattern recognition lecture notes. http:// research.cs.tamu.edu/prism/lectures.htm. [Hagan and Menhaj 1994] Martin T. Hagan and Mohammad B. Menhaj. Training feedforward networks with the Marquardt algorithm. IEEE Transaction on Neural Networks, 5(6):989 ? 993, November 1994. [Hastie and Tibshirani 1996] T. Hastie and R. Tibshirani. Classification by pairwise coupling. Technical report, University of Toronto, 1996. http://www-stat.stanford. edu/?hastie/Papers/nips97.ps. [Health Canada 2001] Health Canada. HIV-1 Strain and Primary Drug Resistance in Canada. surveillance report to June 30 2002. Division of HIV/AIDS Epidemiology and Surveil- lance, Centre for Infectious Disease Prevention and Control, Health Canada, 2001. http://www.hc-sc.gc.ca/pphb-dgspsp/publicat/hiv1-vih1/. [Hearst et al. 1998] M.A. Hearst, B. Scho?lkopf, S. Dumais, E. Osuna, and J. Platt. Trends and Controversies - Support Vector Machines. IEEE Intelligent Systems, 13(4):18 ? 28, July 1998. [Hoffmann and Kamps 2003] Christian Hoffmann and Bernd Sebastian Kamps. HIV Medicine 2003, January 2003. www.hivMedicine.com. [Hunter 1993] Lawrence Hunter. Artificial Intelligence and Molecular Biology, chapter Molec- ular biology for computer scientists. American Association for Artificial Intelligence, Menlo Park, CA, USA, March 1993. [Jain and Chandrasekaran 1982] Anil K. Jain and B. Chandrasekaran. Dimensionality and sam- ple size consideration in pattern recognition practice. In P.R. Krishnaiah and L. N. 
Kanal, editors, Handbook of Statistics, volume 2, pages 835 ? 855. North-Holland Publishing Company, Amsterdam: North Holland, 1982. [Jain et al. 2000] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4?37, January 2000. 119 [Joachims 1998] Thorsten Joachims. Text categorization with support vector machines: learn- ing with many relevant features. In Claire Ne?dellec and Ce?line Rouveirol, editors, Pro- ceedings of ECML-98, 10th European Conference on Machine Learning, pages 137?142, Chemnitz, DE, April 1998. Springer Verlag, Heidelberg, DE. [Joachims 1999] Thorsten Joachims. Making large-scale SVM learning practical. In Bernhard Scho?lkopf, Chris Burges, and Alex Smola, editors, Advances in Kernel Methods ? Support Vector Learning, pages 169?185. MIT Press, 1999. [Jolliffe 1986] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986. [Kanal 2000] Laveen Kanal. Pattern in pattern recognition: 1968 - 1974. IEEE Transactions on Information Theory, IT-20(6):697 ? 722, November 2000. [Kass 1980] G.V. Kass. An exploratory technique for investigating large quantities of categor- ical data. Applied Statistics, 29(2):119?172, 1980. [Keerthi and Lin 2003] S. Sathiya Keerthi and Chih-Jen Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation., 15(7):1667?1689, 2003. [Klatt 2003] Edward C. Klatt. Pathology of AIDS. Florida State University, College of Medicine, Florida, USA, May 2003. http://medlib.med.utah.edu/ WebPath/AIDS2004.pdf. [Kohavi and Quinlan 2002] Ronny Kohavi and J. Ross Quinlan. Handbook of data mining and knowledge discovery, chapter Data mining tasks and methods: Classification: decision- tree discovery, pages 267?276. Oxford University Press, Inc., New York, NY, USA, 2002. [Kroon 2003] Rodeney Stephen Kroon. Support Vector Machines, Generalization Bounds and Transduction. Master?s thesis, University of Stellenbosch, December 2003. [Lathrop and Pazzani 1999] R. H. Lathrop and M. J. Pazzani. Combinatorial Optimization in Rapidly Mutating Drug-Resistant Viruses. Journal of Combinatorial Optimization, 3(2):301 ? 320, July 1999. [Lathrop et al. 1999] R. H. Lathrop, N. R. Steffen, M. P. Raphael, S.Deeds-Rubin, M. J. Paz- zani, P. J. Cimoch, D. M. See, and J. G. Tilles. Knowledge-based Avoidance of Drug- Resistant HIV Mutant. AI magazine, 20(1):13 ? 25, spring 1999. [Lee et al. 2001] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory Support Vector Machines. Technical Report TR-1043, Department of Statistics, University of Wisconsin- Madison, Madison, USA, 2001. http://www.galaxy.gmu.edu/interface/ I01/I2001Proceedings/YLee/YLee.pdf. 120 [Lin and Lin 2005] Hsuan-Tien Lin and Chih-Jen Lin. A study of Sigmoid Kernels for SVM and the Training of non-PSD Kernels by SOM-type Methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Tai- wan, March 2005. http://www.csie.ntu.edu.tw/?cjlin/papers/tanh. pdf. [Loh and Shih 1997] Wei-Yin Loh and Yu-Shan Shih. Split selection methods for classification trees. Statisica Sinica, 7:815 ? 840, 1997. [Marquardt 1963] Donald Marquardt. An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal on Applied Mathematics, 11:431 ? 441, 1963. [Mayoraz and Alpaydin 1999] Eddy Mayoraz and Ethem Alpaydin. Support Vector Machines for multi-class classification. 
In International Work-conference on Artificial and Natural Neural Networks (2), pages 833?842, Alicante, Spain, June 2-4 1999. [Meyer et al. 2002] David Meyer, Friedrich Leisch, and Kurt Hornik. Benchmarking Support Vector Machines. Report Series Report No 78, Vienna University of Economics and Business Administration, 2002. [NIAID 2004] NIAID. National Inistitute of Aliergy and Infectious Diseases: HIV Vaccine Glossary, May 2004. www.niaid.nih.gov/factsheets/GLOSSARY.htm. [Norris 2002] Amy Norris. Multivariate Analysis and Reverse Engineering of Signal Trans- duction Pathways. Master?s thesis, Institute of Applied Mathematics, Uniersity of British Colombia, Canada, April 2002. http://www.iam.ubc.ca/theses/Norris/ ANorris_MSc_thesis.pdf. [Olson 2000] Clark F. Olson. Maximum-Likelihood Template Matching. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 52?57, Hilton Head, South California, June 2000. [Olszewski 2001] Robert T. Olszewski. Generalized Feature Extraction for Structural Pattern Recognition in TimeSeries Data. PhD thesis, School of Computer Science; Carnegie Mellon University, February 2001. http://reports-archive.adm.cs.cmu. edu/anon/2001/abstracts/01-108.html. [Osuna et al. 1997] E. Osuna, R. Freund, and F. Girosi. Support Vector Machines: Training and Application. Technical Report AIM-1602, MTI A.I. Lab, 1997. http://hdl. handle.net/1721.1/7290. [Pavlidis 1977] Theodosios Pavlidis. Structural Pattern Recognition. Springer, Berlin, Ger- many, 1977. 121 [Perlich et al. 2004] Claudia Perlich, Foster Provost, and Jeffrey S. Simonoff. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4(2):211 ? 255, February 2004. [Platt 1998] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, April 1998. http://research.microsoft.com/users/jplatt/smoTR.pdf. [Plutowski et al. 1994] Mark Plutowski, Shinichi Sakata, and Halbert White. Cross-validation estimates imse. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, In Advances in Neural Information Processing Systems, volume 6, pages 391 ? 398. Morgan Kaufman, San Mateo, CA, 1994. [Provost et al. 1998] Foster J. Provost, Tom Fawcett, and Ron Kohavi. The case against ac- curacy estimation for comparing induction algorithms. In ICML ?98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 445?453, Madison, Wisconsin USA, July 24-27 1998. Morgan Kaufmann Publishers Inc. [Quinlan 1987] J. R. Quinlan. Simplifying decision trees. International Journal of Man- Machine Studies, 27(3):221?234, 1987. [Quinlan 1993] J. Ross Quinlan. C4.5: Programmes for Machine learning. Morgan Kaufmann, San Meteo, California, 1993. [Quinlan 1996] J. Ross Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77?90, 1996. [Rakotomamonjy 2004] Alain Rakotomamonjy. SVMs and Area under ROC curves. Technical report, De?partement Architecture des systmes d?Informations de l?INSA de Rouen, 2004. http://asi.insa-rouen.fr/?arakotom/publi/svmandauc.pdf. [Raudys and Jain 1991] Sarunas J. Raudys and Anil K. Jain. Small sample size effects in sta- tistical pattern recognition: Recommendation for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):252 ? 264, March 1991. [Ravela et al. 2003] Jaideep Ravela, Bradley J. 
Betts, Francoise Brun-Vzinet, Anne-Mieke Vandamme, Diane Descamps, Kristel Van Laethem, Kate Smith, Jonathan M. Schapiro, Dean L. Winslow, Caroline Reid, and Robert W. Shafer. HIV-1 protease and reverse transcriptase mutation patterns responsible for discordances between genotypic drug re- sistance interpretation algorithms. Journal of Acquired Immune Deficiency Syndromes, JAIDS, 33:8 ? 14, 2003. [Riedmiller and Braun 1993] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learining: The RPROP algorithm. In Proceedings of the 122 IEEE International Conference on Neural Networks, pages 586?591, San Francisco, CA, March 28 1993. IEEE Computer Society. [Riedmiller 1994] Martin Riedmiller. Advanced supervised learning in multilayer perceptions - from backpropagation to adaptive learning techniques. International Journal of Computer Standards and Interfaces. Special Issue on Neural Networks, 16:265?278., 1994. [Rosenblatt 1962] Frank Rosenblatt. Principles of neurodynamics; perceptrons and the theory of brain mechanisms. Spartan Books, Washington D.C., 1962. [Roweis 1998] Sam Roweis. EM algorithms for PCA and SPCA. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998. [Rumelhart et al. 1994] David E. Rumelhart, Bernard Widrow, and Michael A. Lehr. The basic ideas in neural networks. Communication of the ACM, 37(3):87?92, 1994. [Russell 1991] Ingrid F. Russell. Neural networks in the undergraduate curriculum. Collegiate Microcomputer, 9(1):1?6, February 1991. [Saeys 2004] Yvan Saeys. Feature Selection for classification of Nucleic Acid Sequences. PhD thesis, Faculty of Science, Universiteit Gent, Gent, Belgium, September 2004. http: //www.psb.ugent.be/?yvsae/phd/phdyvsae.pdf. [Saitta and Neri 1998] Lorenza Saitta and Filippo Neri. Learning in the ?Real World?. Machine Learning., 30(2-3):133?163, 1998. [Salzberg 1997] Steven L. Salzberg. On comparing classifiers: Pitfalls to avoid and recom- mended approach. Data Mining and Knowledge Discovery, 1:317?328, 1997. [Salzberg 1999] Steven L. Salzberg. On comparing classifiers: A critique of current research and methods. Data Mining and Knowledge Discovery, 1:1?12, 1999. [Schalkoff 1991] Robert J. Schalkoff. Pattern recognition: statistical, structural and neural approaches. John Wiley & Sons, Inc., New York, USA, 1991. [Schimek 2003] Michael G. Schimek. Penalized logistic regression in gene expression analy- sis. In Proceedings of 2003 semiparametic conference, Berlin, Germany, 2003. [Scho?lkopf and Smola 2001] Bernhard Scho?lkopf and Alexander J. Smola. Learning with Ker- nels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. The MIT Press, USA, December 2001. [Scho?lkopf et al. 1996] B. Scho?lkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing Support Vector Machines with Gaussian Kernels to Ra- dial Basis Function Classifiers. Technical Report AIM-1599, Massachusetts Institute 123 of Technology, Cambridge, MA, USA, 1996. http://cbcl.mit.edu/cbcl/ publications/ai-publications/1500-1999/AIM-159%9.ps. [Scho?lkopf 1997] Bernhard Scho?lkopf. Support Vector Learning. PhD thesis, Technischen Universita?t Berlin, 1997. Published by: R. Oldenbourg Verlag, Munich http://www. kernel-machines.org/papers/book_ref.ps.gz. [Shafer 2002a] R. W. Shafer. Genotypic Testing for Human Immuniodeficency Virus Type 1 Drug Resistance. 
Clinical Microbiology Reviews, 15(2):247 ? 277, April 2002. [Shafer 2002b] Robert W. Shafer. Assays for antiretroviral resistance. HIV Insight, June 2002. [Shannon 2001] Claude. E. Shannon. A mathematical theory of communication. ACM SIG- MOBILE Mobile Computing and Communications Review, 5(1):3?55, 2001. [Shao 1993] Jun Shao. Linear model selection by cross validation. Journal of the American Statistical Association, pages 486 ? 494, 1993. [So 1995] Ying So. A Tutorial on Logistic Regression. Technical Report 450, SAS Insti- tute Inc., Cary, NC, 1995. http://www.sas.com/service/techsup/tnote/ tnote_index4.html. [Sollich 2000] Peter Sollich. Probabilistic methods for support vector machines. In S.A. Solla, T.K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing, vol- ume 12, pages 349 ? 355. MIT Press, Cambrige, MA, USA, 2000. [Spackman 1989] Kent A. Spackman. Signal detection theory: valuable tools for evaluating inductive learning. In Proceedings of the sixth international workshop on Machine learn- ing, pages 160?163, Ithaca, New York, USA, June 29 - July 1 1989. [Sun et al. 2002] Aixin Sun, Ee-Peng Lim, and Wee-Keong Ng. Web classification using Sup- port Vector Machine. In Proceedings of the Fourth International Workshop on Web Infor- mation and Data Management, pages 96?99, McLean, Virginia, USA, November 8 2002. ACM Press. [Theodoridis and Koutroumbas 1999] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition. Academic Press, London, UK, 1999. [Tou and Gonazalez 1974] Julius T. Tou and Rafael C. Gonazalez. Pattern Recognition Princi- ples. Addison-Wesley Publishing Company, Massachusetts, USA, 1974. [Trunk 1979] G. V. Trunk. A problem of dimensionality: A sample example. IEEE Transac- tions on Pattern analysis and machine intelligence, PAMI-1(3):306 ? 307, July 1979. [UNAIDS 2004] UNAIDS. AIDS epidemics update: 2003, 2004. http://www.who.int/ hiv/pub/epidemiology/epi2003/en/. 124 [Vanderbei 1999] Robert. J. Vanderbei. LOQO: An interior point code for quadratic program- ming. Optimization Methods and Software, 11:451?484, 1999. [Vapnik and Chervonenkis 1991] V. N. Vapnik and A. Ya. Chervonenkis. The necessary and sufficient condition for consistency in the empirical risk minimization methods. Pattern Recognition and Image Analysis, 1(3):283 ? 305, 1991. [Vapnik 1979] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982). [Vapnik 1995] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., 1995. [Wang and Larder 2003] Dechao Wang and Berndan Larder. Enhanced prediction of lopinavir resistance from genotype by use of artificial neural network. The Journal of Infectious Diseases, 188:653 ? 660, September 2003. [Watson et al. 1997] James Watson, Michael Gilman, Jan Witkowski, and Mark Zoller. Re- combinant DNA. Scientific American Books, New York, USA, 2nd edition, 1997. [Webb 1999] Andrew Webb. Statistical Pattern Recognition. Arnold Publishers, London, UK, 1999. [Weston and Watkins 1998] J. Weston and C. Watkins. Multi-class Support Vector Machine. Technical Report CSD-TR-98-04, Royal Holloway College, University of London, Egham, England, 1998. http://www.cs.rhbnc.ac.uk/research/compint/ areas/comp_learn/sv/pub/repo%rt98-04.ps.gz. [Whitney et al. 2002] James B. Whitney, Maureen Oliveira, Mervi Detorio, Yongjun Guan, and Mark A. Wainberg. 
The M184V mutation in reverse transcriptase can delay reversion of attenuated variants of simian immunodeficiency virus. Journal of Virology, 76(17), September 2002.

[Wikipedia 2004] Wikipedia. Wikipedia: The free Encyclopedia. http://en.wikipedia.org/wiki/HIV, 2004.

[Wu 1997] Cathy H. Wu. Artificial neural networks for molecular sequence analysis. Computers & Chemistry, 21(4):237-256, 1997.

[Zhu and Hastie 2004] Ji Zhu and Trevor Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5:427-443, 2004.

Appendix A

Statistical and Parametric Classifiers

A.1 Bayesian decision theory

Bayesian decision theory is a tool broadly used to solve pattern recognition problems, provided the problem is defined in terms of probability densities and all the probability values are defined. Having defined all the relevant probability values, Bayesian decision theory is based on finding optimal trade-offs between the various classification decisions and the accompanying cost (misclassification penalty) [Duda et al. 2000]. The material in this section is extracted from Duda et al. [2000] unless otherwise specified.

Given a set of patterns represented by a $d$-dimensional feature vector $X = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$, called the feature space, let $\omega$ be any of the $c$ finite states of nature $\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$ and let $A = \{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ be the finite set of possible actions. The decision problem is then defined as choosing an appropriate action among the finite set of possible actions for an event in the world (an event belonging to one of the finite states of nature, based on the measurement given by the feature vector $x$). The performance of the decision system is then measured by the loss function $\lambda : A \times \Omega \to \mathbb{R}$, which establishes the cost $\lambda(\alpha_i|\omega_j)$ describing the loss incurred by choosing action $\alpha_i$ when the state of nature is $\omega_j$. However, we cannot observe the world directly and hence we need to consider the measurement $x \in \mathbb{R}^d$ and the mapping function $\delta(x) : \mathbb{R}^d \to A$, called the decision function, which is a function of the measurement and tells us which action to take for the given measurement.

The measurement $x$ and the corresponding state of nature $\omega$ can be viewed as a single observation and might be considered in terms of probability. In this probabilistic framework, the decision will be based on the posterior probability $P(\omega_j|x)$ for $j = 1, \ldots, c$, which is the probability that the measurement $x$ belongs to the state of nature $\omega_j$. An action $\alpha_j$ will be taken based on the largest probability once the posterior is computed for each state of nature. But how do we know the posterior probability?

Let $x \in \mathbb{R}^d$ be a random variable and assume that the prior distribution $P(\omega_j)$, which tells how likely the state of nature $\omega_j$ is, and the state-conditional probability $p(x|\omega_j)$, which describes the relationship between the state of nature $\omega_j$ and the measurement $x$, are known. With the prior distribution and state-conditional density, the joint distribution $p(\omega_j, x)$ of finding a measurement in the state of nature $\omega_j$ and having the vector value $x$ is given as:

$$p(\omega_j, x) = P(\omega_j|x)\,p(x) = p(x|\omega_j)\,P(\omega_j) \qquad \text{(A.1)}$$

Rearranging the above equation we get the Bayes rule:

$$P(\omega_j|x) = \frac{p(x|\omega_j)\,P(\omega_j)}{p(x)} \qquad \text{(A.2)}$$

where $p(x) = \sum_{\omega_j \in \Omega} p(x|\omega_j)P(\omega_j)$ is the probability density function for $x$. Therefore:

$$P(\omega_j|x) = \frac{p(x|\omega_j)\,P(\omega_j)}{\sum_{\omega_j \in \Omega} p(x|\omega_j)\,P(\omega_j)} \qquad \text{(A.3)}$$

The Bayes rule tells us how the probability of the state of nature is updated by inverting the relationship between world state and measurement.
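As a purely illustrative numerical example (the numbers are invented for this sketch and are not taken from the data used in this research), suppose there are two states of nature, $\omega_1$ (resistant) and $\omega_2$ (susceptible), with priors $P(\omega_1) = 0.3$ and $P(\omega_2) = 0.7$, and that a measurement $x$ has class-conditional densities $p(x|\omega_1) = 0.6$ and $p(x|\omega_2) = 0.2$. Applying (A.3):

$$P(\omega_1|x) = \frac{0.6 \times 0.3}{0.6 \times 0.3 + 0.2 \times 0.7} = \frac{0.18}{0.32} \approx 0.56, \qquad P(\omega_2|x) = \frac{0.14}{0.32} \approx 0.44$$

so observing $x$ reverses the ordering suggested by the priors: under the maximum-posterior rule the measurement would be assigned to $\omega_1$.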
The term $P(\omega_j|x)$, the probability that the true state of nature is $\omega_j$, accounts for the fact that once the measurement is observed the probability of being in a particular state changes.

Suppose that for the measurement $x$ the action $\alpha_i$ is taken while the true state of nature is $\omega_j$. By definition, the penalty of taking action $\alpha_i$ while the true state of nature is $\omega_j$ is $\lambda(\alpha_i|\omega_j)$, and the average loss, also known as the conditional risk, of choosing this action is:

$$ R(\alpha_i|x) = \sum_{\omega_j \in \Omega} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) $$

If we have a decision rule $\delta$ which tells us which action to take for every possible observation, we can substitute $\delta(x)$ for $\alpha_i$ and obtain the average loss $R(\delta(x)|x)$ given the measurement $x$. The overall average loss (expected risk) is then obtained by averaging over all possible measurements:

$$ R = \int_{\mathbb{R}^d} R(\delta(x)|x)\,p(x)\,dx \qquad (A.4) $$

The expected risk is a single scalar that measures the overall performance. The best decision rule is found by computing the expected risk for all possible decision rules and selecting the one that minimises the risk, i.e.

$$ \delta^* = \arg\min_{\delta} R_\delta = \arg\min_{\delta} \int_{\mathbb{R}^d} R(\delta(x)|x)\,p(x)\,dx, \quad \forall x \in \mathbb{R}^d \qquad (A.5) $$

By minimising the function under the integral sign point-wise, it is easy to conclude that the optimal decision function is

$$ \delta(x) = \arg\min_{\alpha_i \in A} R(\alpha_i|x), \quad \forall x \in \mathbb{R}^d \qquad (A.6) $$

The resulting overall minimum risk is called the Bayes risk.

Bayesian pattern classification uses Bayes decision theory to define the decision boundaries for pattern recognition tasks. Pattern recognition is a particular decision problem in which the world is seen as a pattern source, the states of nature are the pattern classes, and the measurement is a pattern. Each action $\alpha_i$ simply assigns the pattern to one of the pattern classes, and the decision rule $\delta(x)$ is a discriminant function which maps the pattern $x$ to one of the pattern classes. To evaluate the performance, assume every misclassification is equally bad and consider the 0/1 loss function, defined as

$$ \lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c \qquad (A.7) $$

The optimal decision function is then

$$ \delta(x) = \arg\min_{\alpha_i} R(\alpha_i|x) = \arg\min_{\alpha_i} \sum_{j \neq i} P(\omega_j|x) = \arg\min_{\alpha_i} \left(1 - P(\omega_i|x)\right) = \arg\max_{\alpha_i} P(\omega_i|x) $$

Therefore the Bayes decision rule can be restated as follows:

$$ x \in \omega \quad \text{if} \quad P(\omega|x) = \max_{\omega_j} P(\omega_j|x) \qquad (A.8) $$

Recall equation (A.2) and note that the denominator $p(x)$ is independent of $\omega_i$; therefore the best decision rule is given by

$$ \delta(x) = \arg\max_{\omega_i} p(x|\omega_i)\,P(\omega_i), \quad \forall x \in \mathbb{R}^d \qquad (A.9) $$

In the formulation of the Bayes decision rule it is assumed that all the class-conditional densities are known. This assumption does not hold in practice, and hence these quantities must be learned from the available training samples. Sometimes one can make an assumption about the form of the class-conditional density; depending on the assumption taken we have parametric and non-parametric approaches to density estimation [Jain et al. 2000]. If we can assume that the class-conditional densities have a specific form with unknown parameters, which are then estimated from the sample, equation (A.2) is used to calculate the posterior probability and we have a parametric classification problem. Otherwise the posterior probability must be estimated directly on the feature space (training data), or alternatively the decision boundary is constructed directly on the feature space, resulting in a non-parametric classifier.
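To make the minimum-error-rate rule of equations (A.2) and (A.8) concrete, the following short sketch computes posteriors with the Bayes rule and assigns each measurement to the class with the largest posterior. It is not part of this research; the one-dimensional Gaussian class-conditional densities, the prior values and the test points are purely illustrative assumptions.

import numpy as np
from scipy.stats import norm

# Illustrative one-dimensional, two-class problem with assumed
# class-conditional densities p(x|w_j) and priors P(w_j).
priors = np.array([0.6, 0.4])               # P(w_1), P(w_2)
densities = [norm(loc=0.0, scale=1.0),      # p(x|w_1)
             norm(loc=2.0, scale=1.0)]      # p(x|w_2)

def posterior(x):
    """Bayes rule (A.2): P(w_j|x) = p(x|w_j) P(w_j) / p(x)."""
    joint = np.array([d.pdf(x) * p for d, p in zip(densities, priors)])
    return joint / joint.sum()              # divide by the evidence p(x)

def decide(x):
    """Minimum-error-rate (0/1 loss) rule (A.8): largest posterior wins."""
    return np.argmax(posterior(x)) + 1      # class index 1 or 2

for x in (-1.0, 0.8, 1.2, 3.0):
    print(x, posterior(x).round(3), "-> class", decide(x))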
Appendix B

SVM: Mathematical Formulation

B.1 Linear support vector machines

The linear support vector machine, also known as the maximum margin classifier, is the simplest form of Support Vector Machine. This section therefore serves as an introduction to the basic principles, notation and approach that are later extended to the more general support vector machine. Furthermore, Support Vector Machines are inherently binary classifiers, so the formulation shown below and in subsequent sections is given for the two-class pattern recognition problem; later sections show how this is extended to multi-class pattern recognition problems. To keep the presentation of the two-class classification problem simple and consistent with the conventional mathematical presentation, we use $y_i$ instead of $\omega_i$ as the class label. The material presented in this section is based on Vapnik [1995]; Burges [1998]; Osuna et al. [1997]; Cristianini and Shawe-Taylor [2000]; Schölkopf [1997]. Any other reference used is cited accordingly.

B.1.1 Linearly separable case - maximum margin classifier

Given a set of examples $((x_1, y_1), \ldots, (x_l, y_l)) \in \mathbb{R}^d \times \{\pm 1\}$, where $y_i \in \{\pm 1\}$ and $x_i \in \mathbb{R}^d$, assume there exists a set of hyperplanes which completely separate the positive examples from the negative ones. This means we can find a pair $(w, b)$ such that:

$$ y_i(x_i \cdot w + b) - 1 \geq 0 \qquad i = 1, 2, \ldots, l \qquad (B.1) $$

where $w$ is the normal direction (orientation) of the hyperplane and $b$ is the threshold. The mapping function, usually called the hypothesis, is then given by:

$$ f(w, b) = \mathrm{sign}(w \cdot x_i + b) \qquad (B.2) $$

Consider the example given in Figure 3.4. The training samples of the two classes (circle and square) can be perfectly separated by a linear hyperplane, and one can find an infinite number of hyperplanes that accomplish this task; some of them are shown in Figure 3.4. As can be seen from the figure, each of these hyperplanes has zero empirical risk, but we should find the one that minimises the right-hand side of equation (3.7), as discussed in Section 3.2. From the figure one can make an educated guess that the hyperplane passing through the middle is the most likely to give the minimum risk. Formally, the optimal hyperplane, the one most likely to minimise the expected risk, is the one that maximises the margin, defined as the distance to the closest examples of the two opposite classes. Once the optimal hyperplane is found, all the training samples satisfy equation (B.1) and classification is based on the sign of equation (B.2). The points that lie on the separating hyperplane satisfy:

$$ w \cdot x_i + b = 0 \qquad (B.3) $$

Figure 3.5 gives a graphical interpretation of Support Vector Classification. All the points satisfying the inequality $w \cdot x_i + b \leq -1$ lie on $H_1$ or to the left of it, and those satisfying $w \cdot x_i + b \geq +1$ lie on $H_2$ or to the right of it. The margin is then defined as the distance between $H_1$ (or $H_2$) and the optimal hyperplane (see Figure 3.5). It is evident that $H_1$ and $H_2$ are parallel and that, for a perfectly separable training set, no point lies between the two hyperplanes. Furthermore, $H_1$, $H_2$ and the optimal hyperplane differ only in the threshold $b$. Defining the optimal hyperplane formally with respect to these two hyperplanes, the separating hyperplane is optimal if the minimum distance between these hyperplanes and the optimal hyperplane is maximal, i.e. if

$$ \min\left\{ \min_{y_i = 1} \frac{w \cdot x_i + b}{\|w\|}, \; \min_{y_i = -1} \frac{-(w \cdot x_i + b)}{\|w\|} \right\} $$

is maximal. To compute the threshold for a given $w$, let

$$ d_+ = \min_{y_i = 1} \{w \cdot x_i\}, \qquad d_- = \min_{y_i = -1} \{-w \cdot x_i\} $$

Substituting these values, the optimal hyperplane is the one which maximises the expression:

$$ \frac{1}{\|w\|} \min\{d_+ + b, \; d_- - b\} $$

The maximum is attained when the two expressions inside the brackets are equal. Hence the threshold of the optimal hyperplane is:

$$ b_{opt} = \frac{d_- - d_+}{2} $$

and the corresponding margin is:

$$ \rho = \frac{1}{\|w\|}\,|d_+ + b_{opt}| = \frac{1}{\|w\|}\,|d_- - b_{opt}| \;\Longrightarrow\; \rho = \frac{d_+ + d_-}{2\,\|w\|} $$

Therefore we can find the optimal hyperplane that maximises the margin by minimising $\|w\|$ subject to equation (B.1), i.e.

Minimise $\frac{1}{2}\|w\|^2$ subject to $y_i(x_i \cdot w + b) - 1 \geq 0 \quad \forall i$

Note that minimising $\|w\|$ is the same as minimising $\|w\|^2$. Minimising a quadratic function under linear constraints, as formulated above, is called a quadratic program and can be solved for the optimal hyperplane using Quadratic Programming (QP) optimisation [Cristianini and Shawe-Taylor 2000]. Solving this problem using the classical Lagrangian multipliers approach has a number of advantages [Cristianini and Shawe-Taylor 2000]. Firstly, this approach gives an alternative formulation of the original problem (the dual form) which is easier to solve. Secondly, the dual form is not only easier to solve but also emphasises the importance of some training examples over others, leading to a reduced but critical sample size. Thirdly, the dual form makes generalisation beyond the linearly separable case an easy task.

Introducing a dual vector of non-negative Lagrangian multipliers $\Lambda = (\alpha_1, \alpha_2, \ldots, \alpha_l)$, one for each inequality constraint in (B.1), the Lagrangian function (see Appendix B.2) is defined as:

$$ L_P(w, b, \Lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i \qquad (B.4) $$

The saddle point of the Lagrangian, determined by minimising $L_P$ with respect to $w$ and $b$ and maximising it with respect to $\Lambda \geq 0$, is the solution to the optimisation problem. Minimising $L_P$ with respect to $w$ and $b$ gives:

$$ \frac{\partial L_P(w, b, \Lambda)}{\partial w} = 0 \;\Longrightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i \qquad (B.5) $$

$$ \frac{\partial L_P(w, b, \Lambda)}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (B.6) $$

Substituting equations (B.5) and (B.6) into (B.4) we obtain:

$$ L_D(w, b, \Lambda) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad (B.7) $$

The optimisation problem is now reduced to maximising $L_D$ with respect to $\Lambda$ subject to the constraint (B.6), with the solution given by (B.5), i.e.

Maximise
$$ L_D(w, b, \Lambda) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) $$
subject to
$$ \sum_{i=1}^{l} \alpha_i y_i = 0; \qquad \Lambda \geq 0 \qquad (B.8) $$

The Karush-Kuhn-Tucker theorem of optimisation theory (see Appendix B.2), which guarantees the existence of a solution to the optimisation problem, shows that at the saddle point every point satisfies the constraint (B.1) together with the complementarity condition:

$$ \alpha_i\left(y_i(w \cdot x_i + b) - 1\right) = 0 \qquad i = 1, \ldots, l \qquad (B.9) $$

From this equation, the following two cases need to be distinguished:

- If $\alpha_i = 0$, then $y_i(w \cdot x_i + b) \geq 1$
- If $\alpha_i > 0$, then $y_i(w \cdot x_i + b) = 1$

Recall that one of the advantages of using the Lagrangian function to solve the optimisation problem is that it expresses the importance of each pattern in the training set. Consider the value of $\alpha_i$ corresponding to each training pattern. Training patterns with $\alpha_i > 0$ fall on the hyperplane $H_1$ or $H_2$ (see Figure 3.5) and hence are critical in defining the decision boundary. The other training patterns, with $\alpha_i = 0$, lie to the left of $H_1$ or to the right of $H_2$ respectively; these training patterns have no effect in determining the decision boundary.
Therefore, if those training patterns with $\alpha_i = 0$ are removed and the training is repeated, the decision boundary remains the same. Training patterns with nonzero $\alpha_i$ are called Support Vectors (the name of this learning technique derives from this).

Suppose the parameter set $\Lambda^*$ solves the quadratic optimisation problem given in equation (B.8). Then the orientation of the optimal hyperplane $w^*$ is given by equation (B.5). The geometric margin can be redefined in terms of $\Lambda^*$ as:

$$ \rho = \frac{d_+ + d_-}{2\,\|w^*\|} $$

For the maximum margin hyperplane $d_+ = d_- = 1$, hence

$$ \rho = \frac{1}{\|w^*\|} = \left(\sqrt{w^* \cdot w^*}\right)^{-1} \qquad (B.10) $$

Substituting $w$ from (B.5) into (B.1) we have

$$ y_i\left(\sum_j \alpha_j^* y_j (x_i \cdot x_j) + b^*\right) = 1 $$

and substituting $w$ from (B.5) into (B.10) we have

$$ w^* \cdot w^* = \sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j (x_i \cdot x_j) = \sum_i \alpha_i^* y_i \sum_j \alpha_j^* y_j (x_i \cdot x_j) = \sum_i \alpha_i^* (1 - y_i b^*) = \sum_i \alpha_i^* - \sum_i (\alpha_i^* y_i)\, b^* $$

Using equation (B.6) the second term cancels. Therefore:

$$ \rho = \left(\sqrt{\sum_i \alpha_i^*}\right)^{-1} \qquad (B.11) $$

However, the threshold cannot be computed directly from $\Lambda^*$. To compute the threshold $b^*$, recall that the optimal hyperplane with the maximum margin was defined by:

$$ \max_{y_i = -1}(w^* \cdot x_i) + b^* = -1, \qquad \min_{y_i = +1}(w^* \cdot x_i) + b^* = +1, \qquad i = 1, 2, \ldots, l $$

Adding the two expressions and solving for the threshold of the optimal hyperplane $b^*$ gives:

$$ b^* = -\frac{\max_{y_i = -1}(w^* \cdot x_i) + \min_{y_i = +1}(w^* \cdot x_i)}{2} \qquad (B.12) $$

With the weight vector $w^*$ and the threshold $b^*$ in place, substituting equation (B.5) into equation (B.2) the mapping function can be redefined as:

$$ f(x, \Lambda^*, b^*) = \mathrm{sign}\left(\sum_{i=1}^{l} y_i \alpha_i^* (x \cdot x_i) + b^*\right) \qquad (B.13) $$

We have seen that $\alpha_i^* = 0$ for all training points except the support vectors, hence the mapping function takes its final form:

$$ f(x, \Lambda^*, b^*) = \mathrm{sign}\left(\sum_{i \in SV} y_i \alpha_i^* (x \cdot x_i) + b^*\right) \qquad (B.14) $$

In other words, the expression is evaluated in terms of the dot product between the pattern to be classified and the Support Vectors $x_i$, and the sign of the function is used to assign the pattern to its class. Summing up the dual approach in the case of linearly separable examples, the following proposition can be made:

Proposition B.1.1 Consider a set of training data which can be separated into their respective classes by a linear hyperplane:

$$ ((x_1, y_1), \ldots, (x_l, y_l)) \in \mathbb{R}^n \times \{\pm 1\} $$

Let $\Lambda \in \mathbb{R}^l$ be a vector that solves the constrained optimisation problem given below:

Maximise
$$ L_D(w, b, \Lambda) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) $$
subject to
$$ \sum_{i=1}^{l} \alpha_i y_i = 0; \qquad \Lambda \geq 0 $$

Then the optimal hyperplane is defined by the pair $(w, b)$ given by:

$$ w = \sum_{i=1}^{l} \alpha_i y_i x_i \qquad \text{and} \qquad b = -\frac{1}{2}\left[\max_{y_i = -1}(w \cdot x_i) + \min_{y_i = +1}(w \cdot x_i)\right] $$

with the geometric margin given by:

$$ \rho = \left(\sqrt{\sum_{i \in SV} \alpha_i}\right)^{-1} $$

Classification of unseen data is then based on the sign of the function:

$$ f(x, \Lambda, b) = \mathrm{sign}\left(\sum_{i \in SV} y_i \alpha_i (x \cdot x_i) + b\right) $$
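To make Proposition B.1.1 concrete, the short sketch below fits a linear SVM on a toy data set, recovers $w$ from equation (B.5) using only the support vectors, and classifies a new point from the sign of (B.14). It is not code from this research: the toy data, the use of scikit-learn's solver, and the large value of C used to approximate the hard-margin case are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data in R^2, labels y in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 6.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin (maximum margin) classifier;
# the library solves the dual problem (B.8) internally.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only; every other
# training point has alpha_i = 0 and does not affect the boundary.
alpha_y = clf.dual_coef_.ravel()
sv = clf.support_vectors_

# Recover w = sum_i alpha_i y_i x_i (equation (B.5)) and compare it with the
# weight vector reported by the library.
w_manual = alpha_y @ sv
print("support vectors:\n", sv)
print("w from (B.5):", w_manual, " library w:", clf.coef_.ravel())
print("b:", clf.intercept_[0])

# Classification of a new point uses the sign of (B.14).
x_new = np.array([3.0, 3.0])
score = alpha_y @ (sv @ x_new) + clf.intercept_[0]
print("sign of f(x_new):", np.sign(score), " sklearn:", clf.predict([x_new])[0])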
B.1.2 Linearly non-separable case - soft margin classifier

So far we have considered the case where the training data are perfectly separable by a linear hyperplane, but real-world problems involve non-separable data and the assumption made in the previous section is too ambitious. To extend the above solution to non-separable data, positive slack variables $\xi_i$, $i = 1, \ldots, l$, are introduced to add a further cost as a penalty for misclassification whenever necessary (see Figure 3.6). Using this relaxed separation constraint, equation (B.1) becomes:

$$ y_i(x_i \cdot w + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \qquad i = 1, \ldots, l \qquad (B.15) $$

The problem of finding the optimal margin therefore comprises two parts:

- Maximise the margin (the same as in the linearly separable case), and
- Minimise the slack variables $\xi_i$, which account for the amount of error.

One way of combining these two objectives into a single function is:

$$ \Phi(w, \xi) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{l} \xi_i\right)^k $$

The constant $C$ is a parameter to be chosen freely by the user to specify the trade-off between the width of the margin and the misclassification penalty. The optimal hyperplane is therefore the one that minimises the function $\Phi(w, \xi)$, i.e.

Minimise
$$ \Phi(w, \xi) = \frac{1}{2}\|w\|^2 + C\left(\sum_{i=1}^{l} \xi_i\right)^k $$
subject to
$$ y_i(x_i \cdot w + b) \geq 1 - \xi_i, \quad i = 1, \ldots, l; \qquad \xi_i \geq 0, \quad i = 1, \ldots, l \qquad (B.16) $$

If we choose $k = 1$, the above optimisation problem can be solved using QP. Introducing a dual vector of non-negative multipliers $\Lambda = (\alpha_1, \alpha_2, \ldots, \alpha_l)$ for the first set of constraints and $M = (\mu_1, \mu_2, \ldots, \mu_l)$ for the second, the Lagrangian representation is:

$$ L_P(w, b, \xi, \Lambda, M) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i\left(y_i(x_i \cdot w + b) - 1 + \xi_i\right) - \sum_{i=1}^{l} \mu_i \xi_i + C\sum_{i=1}^{l} \xi_i \qquad (B.17) $$

Following the same approach as in the separable case, the solution to the optimisation problem is determined by the saddle point of the primal form of the Lagrangian, obtained by minimising with respect to $w$, $b$ and $\xi$ and maximising with respect to $\Lambda$ and $M$. Minimising $L_P$ with respect to $w$, $b$ and $\xi$ gives:

$$ \frac{\partial L_P(w, b, \xi, \Lambda, M)}{\partial w} = w - \sum_{i=1}^{l} \alpha_i y_i x_i = 0 \qquad (B.18) $$

$$ \frac{\partial L_P(w, b, \xi, \Lambda, M)}{\partial b} = \sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (B.19) $$

$$ \frac{\partial L_P(w, b, \xi, \Lambda, M)}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \qquad (B.20) $$

Substituting (B.18), (B.19) and (B.20) into (B.17), the dual form is still given by:

$$ L_D(w, b, \Lambda) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad (B.21) $$

The dual form is the same for both the linearly separable and the non-separable cases (compare equations (B.7) and (B.21)), but there are additional constraints to be satisfied in the latter case. It was stated above that both sets of Lagrangian multipliers $\alpha$ and $\mu$ must be non-negative. Although $\mu$ does not appear in the dual form as a result of choosing $k = 1$, equation (B.20) imposes the constraint $C = \alpha_i + \mu_i$. This condition limits the value of $\alpha_i$ to be no greater than $C$: if $\alpha_i > C$, then $\mu_i < 0$ would be required for equation (B.20) to hold, which violates the assumption that the Lagrangian multipliers $\mu_i$ are non-negative. Therefore the new optimisation problem is:

Maximise
$$ L_D(w, b, \Lambda) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) $$
subject to
$$ \sum_{i=1}^{l} \alpha_i y_i = 0; \qquad 0 \leq \alpha_i \leq C \qquad (B.22) $$

Applying the KKT conditions we have:

$$ \alpha_i\left(y_i(x_i \cdot w + b) - 1 + \xi_i\right) = 0 \qquad i = 1, 2, \ldots, l \qquad (B.23) $$

Following the above equation, three cases need to be distinguished:

- If $\alpha_i = 0$, then $\mu_i = C$ ($\xi_i = 0$) and $y_i(x_i \cdot w + b) \geq 1$;
- If $0 < \alpha_i < C$, then $0 < \mu_i < C$ ($\xi_i = 0$) and $y_i(x_i \cdot w + b) = 1$;
- If $\alpha_i = C$, then $\mu_i = 0$ ($\xi_i > 0$) and $y_i(x_i \cdot w + b) = 1 - \xi_i$.

In the first case the points are on the correct side of the optimal hyperplane and are further from it than the margin $\rho$ (i.e. they lie to the left of $H_1$ or to the right of $H_2$ respectively). In the second case the points lie on the hyperplane $H_1$ or $H_2$ and are Support Vectors. In the third case the points are also Support Vectors, but do not necessarily lie on the hyperplane $H_1$ or $H_2$; they may be on the wrong side of the hyperplane, or on the correct side but closer to it than $H_1$ or $H_2$ (for example, $X_1$ and $X_2$ in Figure 3.6). Apart from these additional constraints, the solution for the linearly separable case carries over. Therefore the solution found in the previous section is generalised by restating Proposition B.1.1 as:

Proposition B.1.2 Given a set of training data

$$ ((x_1, y_1), \ldots, (x_l, y_l)) \in \mathbb{R}^n \times \{\pm 1\} $$

let $\Lambda \in \mathbb{R}^l$ be a vector that solves the constrained optimisation problem given below:

Maximise
$$ L_D(w, b, \Lambda) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) $$
subject to
$$ \sum_{i=1}^{l} \alpha_i y_i = 0; \qquad 0 \leq \alpha_i \leq C $$

Then the optimal hyperplane is defined by the pair $(w, b)$ given by:

$$ w = \sum_{i=1}^{l} \alpha_i y_i x_i \qquad \text{and} \qquad b = -\frac{1}{2}\left[\max_{y_i = -1}(w \cdot x_i) + \min_{y_i = +1}(w \cdot x_i)\right] $$

with the geometric margin given by:

$$ \rho = \left(\sqrt{\sum_{i \in SV} \alpha_i}\right)^{-1} $$

Classification of unseen data is then based on the sign of the function:

$$ f(x, \Lambda, b) = \mathrm{sign}\left(\sum_{i \in SV} y_i \alpha_i (x \cdot x_i) + b\right) $$
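As a rough illustration of the role of the constant $C$ in (B.16), the sketch below fits linear soft-margin SVMs for several values of $C$ on overlapping data and reports the number of support vectors and the geometric margin $1/\|w\|$ from equation (B.10). It is not part of this research; the synthetic data and the scikit-learn implementation are assumptions made purely for illustration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian clouds, so no hyperplane separates them perfectly
# and some slack xi_i > 0 is unavoidable.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(1.5, 1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# C trades margin width against the total slack term C * sum_i xi_i in (B.16):
# a small C tolerates more margin violations, a large C pushes the solution
# back towards the hard-margin classifier.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:<6} support vectors={clf.n_support_.sum():3d} "
          f"geometric margin={margin:.3f}")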
B.2 Important definitions and theorems

Definition B.2.1 [Cristianini and Shawe-Taylor 2000] Given an optimisation problem with objective function $f(w)$ and equality constraints $h_i(w) = 0$, $i = 1, 2, \ldots, l$, we define the Lagrangian function as

$$ L(w, \Lambda) = f(w) + \sum_{i=1}^{l} \alpha_i h_i(w) $$

where the coefficients $\alpha_i$ are called the Lagrangian multipliers.

Theorem B.2.2 (Kuhn-Tucker) [Cristianini and Shawe-Taylor 2000] The point $w_0 \in \mathbb{R}^N$ minimises the function $f : \mathbb{R}^N \rightarrow \mathbb{R}$ subject to $g_i(w) \leq 0$, $i = 1, \ldots, l$, where the $g_i$ are convex functions and there is at least one point satisfying all the constraints with strict inequality, if and only if there exists a vector $\Lambda_0 \geq 0$, $\Lambda_0 \in \mathbb{R}^l$, such that

$$ L(w, \Lambda) = f(w) + \sum_{i=1}^{l} \alpha_i g_i(w) $$

has a saddle point at $(w_0, \Lambda_0)$, which is a minimum with respect to $w$ and a maximum with respect to $\Lambda$. The condition of being a maximum with respect to $\Lambda$ is equivalent to the Kuhn-Tucker condition

$$ \alpha_i g_i(w_0) = 0, \qquad \forall \alpha_i \in \Lambda_0 $$

i.e. for each $i$ either $\alpha_i = 0$ or $g_i(w_0) = 0$. The last relation is known as the Karush-Kuhn-Tucker (KKT) complementarity condition.

Definition B.2.3 [Cristianini and Shawe-Taylor 2000] A kernel $K(x, y)$ is a function such that for any two points $x, y$ in the input space:

$$ K(x, y) = \Phi(x) \cdot \Phi(y) \qquad (B.24) $$

where $x, y \in X$ and $\Phi$ is the mapping $X \rightarrow F$.

Theorem B.2.4 (Mercer's Theorem) [Burges 1998] There exist a mapping $\Phi$ and an expansion

$$ K(x, y) = \sum_i \Phi_i(x)\,\Phi_i(y) $$

if and only if, for any $g(x)$ such that $\int g(x)^2\,dx$ is finite,

$$ \int K(x, y)\,g(x)\,g(y)\,dx\,dy \geq 0 $$

Appendix C

Genetic codes

C.1 RNA codon table

Amino acid (name)     Code   Codons coding for it
Alanine (Ala)         A      GCU, GCC, GCA, GCG
Arginine (Arg)        R      CGU, CGC, CGA, CGG, AGA, AGG
Asparagine (Asn)      N      AAU, AAC
Aspartic acid (Asp)   D      GAU, GAC
Cysteine (Cys)        C      UGU, UGC
Glutamine (Gln)       Q      CAA, CAG
Glutamic acid (Glu)   E      GAA, GAG
Glycine (Gly)         G      GGU, GGC, GGA, GGG
Histidine (His)       H      CAU, CAC
Isoleucine (Ile)      I      AUU, AUC, AUA
Leucine (Leu)         L      UUA, UUG, CUU, CUC, CUA, CUG
Lysine (Lys)          K      AAA, AAG
Methionine (Met)      M      AUG
Phenylalanine (Phe)   F      UUU, UUC
Proline (Pro)         P      CCU, CCC, CCA, CCG
Serine (Ser)          S      UCU, UCC, UCA, UCG, AGU, AGC
Threonine (Thr)       T      ACU, ACC, ACA, ACG
Tryptophan (Trp)      W      UGG
Tyrosine (Tyr)        Y      UAU, UAC
Valine (Val)          V      GUU, GUC, GUA, GUG
Start                        AUG, GUG
Stop                         UAG, UGA, UAA

Table C.1: Standard amino acids used in proteins, and the codons that code for each amino acid.

C.2 The IUPAC nucleotide codes

Nucleotide code   Base
A                 Adenine
C                 Cytosine
G                 Guanine
T (or U)          Thymine (or Uracil)
R                 A or G
Y                 C or T
S                 G or C
W                 A or T
K                 G or T
M                 A or C
B                 C or G or T
D                 A or G or T
H                 A or C or T
V                 A or C or G
N                 any base
. or -            gap

Table C.2: The IUPAC nucleotide codes
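As a brief illustration of how the codon table in Table C.1 can be used in software, the following minimal sketch stores it as a lookup structure and translates an RNA coding sequence into one-letter amino-acid codes. This is not code from this research; the example sequence and the choice of a plain Python dictionary are assumptions made only for illustration.

# RNA codon table (Table C.1) as a lookup dictionary; '*' marks stop codons.
CODON_TABLE = {
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
    "AAU": "N", "AAC": "N", "GAU": "D", "GAC": "D",
    "UGU": "C", "UGC": "C", "CAA": "Q", "CAG": "Q",
    "GAA": "E", "GAG": "E",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "CAU": "H", "CAC": "H", "AUU": "I", "AUC": "I", "AUA": "I",
    "UUA": "L", "UUG": "L", "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L",
    "AAA": "K", "AAG": "K", "AUG": "M",
    "UUU": "F", "UUC": "F",
    "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S", "AGU": "S", "AGC": "S",
    "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "UGG": "W", "UAU": "Y", "UAC": "Y",
    "GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V",
    "UAG": "*", "UGA": "*", "UAA": "*",
}

def translate(rna: str) -> str:
    """Read the sequence three bases at a time and look up each codon."""
    rna = rna.upper().replace("T", "U")            # accept DNA input as well
    codons = (rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

print(translate("AUGCCUCAAAUUACAUUAUGG"))          # -> MPQITLW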