Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory
January 2005 (vol. 17 no. 1)
pp. 90-105
Mammalian genomes are typically 3 Gbps (gigabase pairs) in size. The largest public DNA database, NCBI (National Center for Biotechnology Information), contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure that efficiently supports exact and approximate sequence matching queries, as well as the finding of repetitive structures, when they can reside in main memory. However, handling long DNA sequences with suffix trees has proved difficult because of the so-called memory bottleneck problem. The most space-efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB of memory to index the human genome [19]. In this paper, we show that suffix trees for long DNA sequences can be constructed efficiently on disk using a small, bounded amount of main memory and, therefore, that all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) construct a disk-based suffix tree without suffix links and 2) rebuild the suffix links on the suffix tree constructed on disk, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(n log n) experimental behavior in CPU cost and linear behavior in I/O cost. DynaCluster needs only 16 MB of main memory to construct suffix trees for DNA sequences of more than 200 Mbps, and it significantly outperforms the existing disk-based suffix tree construction algorithms that use prepartitioning techniques, in terms of both construction cost and query processing cost. We conducted extensive performance studies and report our findings in this paper.
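As a rough illustration of the kind of index the abstract refers to, the following minimal sketch builds a naive, uncompressed suffix trie without suffix links and answers an exact-match query on it. It is not DynaCluster and is not taken from the paper: it runs entirely in main memory, takes quadratic time and space, and the names `Node`, `build_suffix_trie`, and `find_occurrences` are hypothetical. It only shows why exact matching becomes a simple walk from the root once every suffix has been inserted.

```python
class Node:
    """One trie node; `start` is set only on the node that ends a suffix."""
    __slots__ = ("children", "start")

    def __init__(self):
        self.children = {}
        self.start = None


def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator, one character per edge.

    Naive O(n^2) construction with no suffix links. The terminator is
    assumed not to occur in `text`, so each suffix ends at its own leaf.
    """
    s = text + terminator
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i  # this leaf represents the suffix starting at i
    return root


def find_occurrences(root, pattern):
    """Return the sorted start positions of `pattern` in the indexed text."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []  # the pattern is not a prefix of any suffix
    # Every leaf below the matched node is one occurrence of the pattern.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            positions.append(current.start)
        stack.extend(current.children.values())
    return sorted(positions)


if __name__ == "__main__":
    trie = build_suffix_trie("GATTACA")
    print(find_occurrences(trie, "A"))
    print(find_occurrences(trie, "TA"))
```

A real suffix tree compresses chains of single-child nodes into labeled edges, which brings the node count down to linear in the sequence length; even so, the per-base overhead is what produces the memory bottleneck the abstract describes for genome-scale input.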

[1] R. Baeza-Yates and G. Navarro, “A Hybrid Indexing Method for Approximate String Matching,” J. Discrete Algorithms, 2000.
[2] M. Bender and M. Farach-Colton, “The LCA Problem Revisited,” Proc. Fourth Am. Symp. Theoretical Informatics, pp. 88-94, 2000.
[3] P. Bieganski, “Genetic Sequence Data Retrieval and Manipulation Based on Generalized Suffix Trees,” PhD thesis, Univ. of Minnesota, 1995.
[4] E. Coffman, M. Garey, and D. Johnson, Approximation Algorithms for Bin Packing: A Survey, Dorit S. Hochbaum, ed. PWS Publishing Company, 1997.
[5] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms. The MIT Press, 1989.
[6] M. Farach, “Optimal Suffix Tree Construction with Large Alphabets,” Proc. 38th Ann. Symp. Foundation of Computer Science, pp. 137-143, 1997.
[7] M. Farach, P. Ferragina, and S. Muthukrishnan, “Overcoming the Memory Bottleneck in Suffix Tree Construction,” Proc. 39th Symp. Foundations of Computer Science, pp. 174-185, 1998.
[8] M. Farach and S. Muthukrishnan, “Optimal Logarithmic Time Randomized Suffix Tree Construction,” Proc. 23rd Int'l Colloquium Automata Languages and Programming, pp. 550-561, 1996.
[9] M. Farach-Colton, P. Ferragina, and S. Muthukrishnan, “On the Sorting-Complexity of Suffix Tree Construction,” J. ACM, vol. 47, no. 6, pp. 987-1011, 2000.
[10] R. Giegerich and S. Kurtz, “From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction,” Algorithmica, vol. 19, no. 3, pp. 331-353, 1997.
[11] R. Giegerich, S. Kurtz, and J. Stoye, “Efficient Implementation of Lazy Suffix Trees,” Proc. Third Int'l Workshop Algorithm Eng., pp. 30-42, 1999.
[12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge Univ. Press, 1997.
[13] D. Gusfield and J. Stoye, “Linear Time Algorithms for Finding and Representing All Tandem Repeats in a String,” technical report, Computer Science Dept., Univ. of California, Davis, 1998.
[14] D. Harel and R.E. Tarjan, “Fast Algorithms for Finding Nearest Common Ancestors,” SIAM J. Computing, vol. 13, pp. 338-355, May 1984.
[15] E. Hunt, M.P. Atkinson, and R.W. Irving, “A Database Index to Large Biological Sequences,” Proc. 27th Int'l Conf. Very Large Data Bases, pp. 139-148, 2001.
[16] E. Hunt, R.W. Irving, and M. Atkinson, “Persistent Suffix Trees and Suffix Binary Search Trees as DNA Sequence,” technical report, Dept. of Computing Science, Univ. of Glasgow, 2000.
[17] J. Kärkkäinen, “Suffix Cactus: A Cross between Suffix Tree and Suffix Array,” Proc. Sixth Symp. Combinatorial Pattern Matching, pp. 191-204, 1995.
[18] J. Kärkkäinen and E. Ukkonen, “Sparse Suffix Trees,” Proc. Second Ann. Int'l Conf. Computing and Combinatorics, pp. 219-230, 1996.
[19] S. Kurtz, “Reducing the Space Requirement of Suffix Trees,” Software Practice and Experience, vol. 29, no. 13, pp. 1149-1171, 1999.
[20] E.M. McCreight, “A Space-Economical Suffix Tree Construction Algorithm,” J. ACM, vol. 23, no. 2, pp. 262-272, 1976.
[21] G. Navarro, “A Guided Tour to Approximate String Matching,” ACM Computing Surveys, vol. 33, no. 1, pp. 31-88, 2001.
[22] G. Navarro and R. Baeza-Yates, “A New Indexing Method for Approximate String Matching,” Proc. 10th Ann. Symp. Combinatorial Pattern Matching, pp. 163-185, 1999.
[23] B. Schieber and U. Vishkin, “On Finding Lowest Common Ancestors: Simplification and Parallelization,” SIAM J. Computing, vol. 17, pp. 1253-1262, Dec. 1988.
[24] W. Szpankowski, “Asymptotic Properties of Data Compression and Suffix Trees,” IEEE Trans. Information Theory, vol. 39, no. 5, pp. 1647-1659, 1993.
[25] P. Weiner, “Linear Pattern Matching Algorithms,” Proc. 14th IEEE Symp. Switching and Automata Theory, 1973.
[26] A.D. Wyner and J. Ziv, “Some Asymptotic Properties of the Entropy of a Stationary Ergodic Data Source with Applications to Data Compression,” IEEE Trans. Information Theory, vol. 35, no. 6, 1989.

Index Terms:
Biological sequences, database index, and suffix tree.
Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu, "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, Jan. 2005, doi:10.1109/TKDE.2005.3