A tree-structured index algorithm for expressed sequence tags clustering
Date
2009-02-04T09:40:35Z
Authors
Kumwenda, Benjamin
Abstract
Expressed sequence tags (ESTs) are complementary deoxyribonucleic acid (cDNA) fragments
that are reverse transcribed from mature messenger ribonucleic acid (mRNA), a direct gene transcript.
ESTs are a rich and readily available source of information on complete expressed gene sequences. They reveal
the type and number of genes being expressed in an organism. Joining ESTs into complete gene
sequences is computationally expensive because they are numerous, error-prone, redundant and
intermixed. In EST clustering, ESTs that originate from the same gene are grouped together. This enables efficient
generation of consensus sequences, which reveal the underlying gene sequences and their possible
alternative splicings. EST clustering thus enables efficient discovery of expressed genes, on
which several fields rely, such as disease diagnostics, drug discovery, genetic engineering and alternative
splicing analysis, among others. Most clustering algorithms developed so far run in quadratic time, so
their running time is prohibitively high. A tree-structured index algorithm has been developed
to cluster ESTs efficiently with respect to both running time and quality of the generated clusters. The
algorithm clusters ESTs in a pseudometric space by recursively partitioning a data set of EST
windows into two disjoint sets. The performance of the algorithm was tested with respect to running
time and quality of generated clusters. Further experiments were performed to investigate
the effectiveness of the triangle inequality, which was implemented to reduce the number of distance computations
during clustering. Experimental results show that the algorithm has a running time close
to linear and a specificity of 100%, but its sensitivity fluctuates. Implementing the triangle
inequality did not significantly improve the performance of the algorithm.
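
The recursive partitioning and the triangle-inequality pruning mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the distance function, the pivot choice or the splitting rule, so everything specific here is an assumption made for illustration: the q-gram profile distance (a pseudometric, since distinct windows can share a profile), the first-window pivot, the median split, and the hypothetical names qgram_distance, build_tree and range_search. It is not the thesis implementation.

```python
from collections import Counter


def qgram_profile(window, q=3):
    """Count the overlapping q-grams of a window (assumed representation)."""
    return Counter(window[i:i + q] for i in range(len(window) - q + 1))


def qgram_distance(a, b, q=3):
    """L1 distance between q-gram profiles; obeys the triangle inequality."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def build_tree(windows, dist=qgram_distance, leaf_size=2):
    """Recursively split the windows into two disjoint sets around a pivot."""
    if len(windows) <= leaf_size:
        return {"leaf": list(windows)}
    pivot, rest = windows[0], windows[1:]  # assumed pivot rule: first window
    scored = [(dist(pivot, w), w) for w in rest]
    radius = sorted(d for d, _ in scored)[len(scored) // 2]  # median split
    inner = [w for d, w in scored if d <= radius]
    outer = [w for d, w in scored if d > radius]
    return {
        "pivot": pivot,
        "radius": radius,
        "inner": build_tree(inner, dist, leaf_size),
        "outer": build_tree(outer, dist, leaf_size),
    }


def range_search(node, query, threshold, dist=qgram_distance):
    """Return all indexed windows within `threshold` of `query`."""
    if "leaf" in node:
        return [w for w in node["leaf"] if dist(query, w) <= threshold]
    d = dist(query, node["pivot"])
    hits = [node["pivot"]] if d <= threshold else []
    # Inner windows satisfy dist(pivot, w) <= radius, so the triangle
    # inequality gives dist(query, w) >= d - radius. Skip the subtree
    # when that lower bound already exceeds the threshold.
    if d - threshold <= node["radius"]:
        hits += range_search(node["inner"], query, threshold, dist)
    # Outer windows satisfy dist(pivot, w) > radius, so
    # dist(query, w) > radius - d. Skip when radius - d >= threshold.
    if d + threshold > node["radius"]:
        hits += range_search(node["outer"], query, threshold, dist)
    return hits


if __name__ == "__main__":
    sample = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "CCCCAAAA"]
    tree = build_tree(sample)
    print(range_search(tree, "ACGTACGT", 2))
```

Pruning of this kind only saves work when the query ball lies clearly on one side of the split radius. When distances between windows are concentrated in a narrow range, both subtrees are usually visited, so little or nothing is skipped.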