Dynamic protein classification: Adaptive models based on incremental learning strategies
Date
2008-03-18T09:27:41Z
Authors
Mohamed, Shakir
Abstract
One of the major problems in computational biology is the inability of existing
classification models to incorporate expanding and new domain knowledge. This
problem of static classification models is addressed in this thesis by the introduction
of incremental learning for problems in bioinformatics. The tools which have been
developed are applied to the problem of classifying proteins into a number of primary
and putative families. This type of classification is particularly important because
of its role in drug discovery programs, where it saves both cost and time. As a
secondary problem, multi-class classification is also addressed. The standard
approach to protein family classification is based on the creation of committees of
binary classifiers. This one-vs-all approach is not ideal, and the classification
systems presented here consist of classifiers that are able to perform all-vs-all
classification.
Two incremental learning techniques are presented. The first is a novel algorithm
based on the fuzzy ARTMAP classifier and an evolutionary strategy. The second
technique applies the incremental learning algorithm Learn++. The two systems
are tested using three datasets: data from the Structural Classification of Proteins
(SCOP) database, the G-Protein Coupled Receptors (GPCR) database and enzymes
from the Protein Data Bank. The results show that the two techniques perform
comparably with each other and with single batch-trained classifiers, while adding
the ability to learn incrementally. Both techniques are shown to be useful for the
problem of protein family classification, and they are also applicable to problems
outside this area, with applications in proteomics, including the prediction of
functions and of secondary and tertiary structures, and applications in genomics,
such as promoter and splice site prediction and the classification of gene
microarrays.
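As an illustration only, the idea of incremental multi-class learning described in the abstract can be sketched with a toy nearest-centroid classifier. This is not the fuzzy ARTMAP-based algorithm or Learn++ from the thesis. The class name, the `partial_fit` method, the two-dimensional feature vectors and the family labels are all hypothetical, chosen to show a single model that handles every class at once and absorbs a new batch (including a previously unseen class) without revisiting earlier data.

```python
import math
from collections import defaultdict


class IncrementalCentroidClassifier:
    """Toy multi-class classifier updated batch by batch.

    Hypothetical stand-in for illustration; NOT fuzzy ARTMAP or Learn++.
    """

    def __init__(self):
        self.sums = {}                  # class label -> per-feature running sum
        self.counts = defaultdict(int)  # class label -> number of samples seen

    def partial_fit(self, X, y):
        """Fold a new batch into the running sums; old batches are not needed."""
        for features, label in zip(X, y):
            if label not in self.sums:
                # A class never seen before is simply added to the model.
                self.sums[label] = [0.0] * len(features)
            self.sums[label] = [s + f for s, f in zip(self.sums[label], features)]
            self.counts[label] += 1
        return self

    def predict(self, X):
        """Assign each sample to the class with the nearest centroid."""
        centroids = {
            label: [s / self.counts[label] for s in total]
            for label, total in self.sums.items()
        }
        return [
            min(centroids, key=lambda label: math.dist(features, centroids[label]))
            for features in X
        ]


# Made-up feature vectors and family labels, for illustration only.
model = IncrementalCentroidClassifier()
model.partial_fit(
    [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9]],
    ["family_A", "family_A", "family_B"],
)
# A later batch introduces a new family without retraining on the first batch.
model.partial_fit([[0.0, 1.0], [0.1, 0.9]], ["family_C", "family_C"])
predictions = model.predict([[0.1, 0.0], [0.9, 1.0], [0.0, 0.95]])
```

The sketch keeps only per-class running sums, so each update costs time proportional to the size of the new batch. A batch-trained classifier would instead need the full dataset again whenever a new family appears.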
Keywords
bioinformatics, protein classification, neural networks, fuzzy ARTMAP, incremental learning