Issue No.03 - March (2011 vol.23)

pp: 321-334

Qingguo Wang, University of Missouri, Columbia

Dmitry Korkin, University of Missouri, Columbia

Yi Shang, University of Missouri, Columbia

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.123

ABSTRACT

Finding the longest common subsequence (LCS) of multiple strings is an NP-hard problem with many applications in bioinformatics and computational genomics. Although significant efforts have been made to address the problem and its special cases, the increasing complexity and size of biological data require more efficient methods applicable to an arbitrary number of strings. In this paper, we present a new algorithm for the general case of the multiple LCS (MLCS) problem, i.e., finding an LCS of any number of strings, together with its parallel realization. The algorithm is based on the dominant point approach and employs a fast divide-and-conquer technique to compute the dominant points. When applied to three strings, our algorithm matches the performance of the fastest existing MLCS algorithm designed for that specific case. When applied to more than three strings, our algorithm is significantly faster than the best existing sequential methods, running up to two to three orders of magnitude faster on large problems. Finally, we present an efficient parallel implementation of the algorithm. Evaluating the parallel algorithm on a benchmark set of both random and biological sequences reveals a near-linear speedup with respect to the sequential algorithm.
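To make the dominant point idea concrete, the following is a minimal Python sketch of a level-by-level dominant point search for any number of strings. It is an illustration under stated assumptions, not the algorithm from the paper: the function names (`successors`, `minima`, `mlcs`) are hypothetical, and the minima step uses a brute-force pairwise comparison where the paper describes a fast divide-and-conquer technique.

```python
def successors(point, strings, alphabet):
    """For each symbol, yield the nearest match strictly after `point` in every string."""
    for sym in alphabet:
        nxt = []
        for s, i in zip(strings, point):
            j = s.find(sym, i + 1)  # next occurrence of sym after index i
            if j < 0:
                break               # sym does not occur again in this string
            nxt.append(j)
        else:
            yield sym, tuple(nxt)


def minima(points):
    """Keep the points that no other point dominates.

    q dominates p when q != p and q <= p in every coordinate.
    Brute force (quadratic); an assumption standing in for a
    divide-and-conquer minima computation.
    """
    pts = list(points)
    return {
        p for p in pts
        if not any(q != p and all(a <= b for a, b in zip(q, p)) for q in pts)
    }


def mlcs(strings):
    """Return one longest common subsequence of all the given strings."""
    if not strings:
        return ""
    # Only symbols present in every string can appear in a common subsequence.
    alphabet = sorted(set(strings[0]).intersection(*strings[1:]))
    # Level 0 holds a virtual origin placed before the start of each string.
    level = {(-1,) * len(strings): ""}
    while True:
        candidates = {}
        for point, prefix in level.items():
            for sym, nxt in successors(point, strings, alphabet):
                candidates.setdefault(nxt, prefix + sym)
        if not candidates:
            # No level can be extended further; every stored prefix has maximal length.
            return next(iter(level.values()))
        # The dominant points of the next level are the minima of all successors.
        level = {p: candidates[p] for p in minima(candidates)}
```

For example, `mlcs(["AGGTAB", "GXTXAYB", "GTTAB"])` returns `"GTAB"`. Storing the partial subsequence with each dominant point keeps the sketch short; a more memory-conscious version would keep parent pointers and backtrack from the last level.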

INDEX TERMS

Longest common subsequence (LCS), multiple longest common subsequence (MLCS), dynamic programming, dominant point method, divide and conquer, parallel processing, multithreading.

CITATION

Qingguo Wang, Dmitry Korkin, Yi Shang, "A Fast Multiple Longest Common Subsequence (MLCS) Algorithm",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 23, no. 3, pp. 321-334, March 2011, doi:10.1109/TKDE.2010.123