Using Manhattan distance and standard deviation for expressed sequence tag clustering

No Thumbnail Available

Date

2011-07-07

Authors

Kennedy, Dane

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

An explosion of genomic data in recent years has necessitated the novel application of old algorithms and the development of new algorithms in order to process and understand this information. One specific type of data comes in the form of expressed sequence tags (ESTs) which have significant biological importance. They do, however, require a lot of work in the dry lab once they have been created in a wet lab before anything meaningful can be made of the data. ESTs are short fragments of genomic data that represent the active genes in a sequenced sample. The sequencing process is complex and involves the copying, magnification and then splitting up of long complete sequences. The splitting of each copy is random and by looking for overlaps on the shorter sequences we can then begin to re-assemble the original long sequence. The process of finding the overlaps is called clustering and in theory it groups all the ESTs from a single gene together (where data sets in general contain ESTs from many different genes). Two new heuristics have been developed in this research that aim to speed up the time taken to do the clustering, which is traditionally an all-against-all comparison (and thus quadratic time complexity). The first heuristic uses a sorted word list of all the words in all the sequences to rapidly identify matching pairs of sequences. Further, the standard deviation between the relative position where words occur in the matching sequences is used to determine the quality of the matches – if a pair of sequences has sufficiently many words in common, and the variation in relative positions of those words is sufficiently low it is then tested by the second heuristic. The second heuristic uses a distance measure called d1 which identifies pairs of matching sequences in a very similar way to the d2 distance measure. d2 is an accepted distance calculation for calculating the relatedness of ESTs, but d1 is used as a comparison as it runs in linear time and rules out many of the potential matches from the first heuristic that would also be ruled out by the final d2 comparison. The results show that with careful selection of clustering parameters a significant speed-up can be made over various competing clustering techniques with little significant impact on clustering quality.

Description

Keywords

Citation

Collections

Endorsement

Review

Supplemented By

Referenced By