Space and Time Efficient Parallel Algorithms and Software for EST Clustering
December 2003 (vol. 14 no. 12)
pp. 1209-1221

Abstract—Expressed sequence tags, abbreviated as ESTs, are DNA molecules experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition and for understanding important genetic variations such as those resulting in diseases. In this paper, we present the algorithmic foundations and implementation of PaCE, a parallel software system we developed for large-scale EST clustering. The novel features of our approach include 1) design of space-efficient algorithms to limit the space required to linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce runtime and facilitate clustering of large data sets. Using a combination of these techniques, we report the clustering of 327,632 rat ESTs in 47 minutes, and 420,694 Triticum aestivum ESTs in 3 hours and 15 minutes, using a 60-processor IBM xSeries cluster. These problems are well beyond the capabilities of state-of-the-art sequential software. We also present thorough experimental evaluation of our software including quality assessment using benchmark Arabidopsis EST data.

Index Terms:
Computational biology, EST clustering, maximal common substring, parallel algorithms, suffix tree applications.
Anantharaman Kalyanaraman, Srinivas Aluru, Volker Brendel, Suresh Kothari, "Space and Time Efficient Parallel Algorithms and Software for EST Clustering," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 12, pp. 1209-1221, Dec. 2003, doi:10.1109/TPDS.2003.1255634