A tree-structured index algorithm for expressed sequence tags clustering

Date
2009-02-04T09:40:35Z
Authors
Kumwenda, Benjamin
Abstract
Expressed sequence tags (ESTs) are complementary deoxyribonucleic acid (cDNA) fragments that are reverse transcribed from mature messenger ribonucleic acid (mRNA), a direct gene transcript. ESTs are a readily available and rich source of information on complete expressed gene sequences. They reveal the type and number of genes being expressed in an organism. Joining ESTs into complete gene sequences is computationally expensive because they are numerous, error-prone, redundant and intermixed. In EST clustering, ESTs that originate from the same gene are grouped together. This enables efficient generation of consensus sequences, which reveal the underlying gene sequences and their possible alternative splicings. EST clustering thus enables efficient discovery of expressed genes, on which several fields rely, such as disease diagnostics, drug discovery, genetic engineering, the study of alternative splicing and many others. Most clustering algorithms developed so far are quadratic, and their running time is prohibitively high. A tree-structured index algorithm has been developed to cluster ESTs efficiently with respect to running time and quality of generated clusters. The algorithm clusters ESTs in a pseudometric space by recursively partitioning a data set of EST windows into two disjoint sets. Performance of the algorithm was tested with respect to running time and quality of generated clusters. Further experiments were performed to investigate the effectiveness of the triangle inequality, which was implemented to reduce the number of distance computations during clustering. Experimental results show that the algorithm has a running time close to linear and a specificity of 100%, but its sensitivity fluctuates. Implementation of the triangle inequality did not significantly improve the performance of the algorithm.
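The recursive two-way partitioning and the triangle-inequality pruning described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not specify the distance function, the splitting rule, or the search procedure, so the k-mer count distance, the pivot-and-median split, and the names `kmer_counts`, `distance`, `build` and `range_query` are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt
from statistics import median


def kmer_counts(window, k=3):
    """Count overlapping k-mers in a window (assumed feature choice)."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))


def distance(a, b, k=3):
    """Euclidean distance between k-mer count vectors (an assumption).

    This is a pseudometric: it obeys the triangle inequality, but two
    different windows with identical k-mer counts are at distance 0.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)))


def build(windows):
    """Recursively split windows into two disjoint sets around a pivot."""
    if not windows:
        return None
    pivot, rest = windows[0], windows[1:]
    if not rest:
        return {"pivot": pivot, "radius": 0.0, "inner": None, "outer": None}
    dists = [distance(pivot, w) for w in rest]
    radius = median(dists)  # hypothetical splitting rule
    inner = [w for w, d in zip(rest, dists) if d <= radius]
    outer = [w for w, d in zip(rest, dists) if d > radius]
    return {"pivot": pivot, "radius": radius,
            "inner": build(inner), "outer": build(outer)}


def range_query(node, query, tau, hits=None):
    """Collect all indexed windows within distance tau of the query."""
    if hits is None:
        hits = []
    if node is None:
        return hits
    d = distance(query, node["pivot"])
    if d <= tau:
        hits.append(node["pivot"])
    # Triangle inequality: any window w within tau of the query satisfies
    # d - tau <= distance(pivot, w) <= d + tau, so a subtree whose range of
    # pivot distances cannot overlap that interval is skipped entirely.
    if d - tau <= node["radius"]:
        range_query(node["inner"], query, tau, hits)
    if d + tau > node["radius"]:
        range_query(node["outer"], query, tau, hits)
    return hits
```

In this sketch, windows returned by `range_query` for a small threshold `tau` would be candidates for the same cluster. The pruning only saves work when the threshold is small relative to the spread of pivot distances.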