An O(N^2) Algorithm for Discovering Optimal Boolean Pattern Pairs
October-December 2004 (vol. 1 no. 4)
pp. 159-170

Abstract: We consider the problem of finding the optimal combination of string patterns that characterizes a given set of strings, where each string has a numeric attribute value assigned to it. Pattern combinations are scored based on the correlation between their occurrences in the strings and the numeric attribute values. The aim is to find the combination of patterns that is best with respect to an appropriate scoring function. We present an O(N^2) time algorithm for finding the optimal pair of substring patterns combined with Boolean functions, where N is the total length of the sequences. The algorithm considers all possible Boolean combinations of the patterns, e.g., patterns of the form p ∧ ¬q, which indicates that the pattern pair is considered to occur in a given string s if p occurs in s and q does not occur in s. An efficient implementation using suffix arrays is presented, and we further show that the algorithm can be adapted to find the best k-pattern Boolean combination in O(N^k) time. The algorithm is applied to mRNA sequence data sets of moderate size combined with their turnover rates for the purpose of finding regulatory elements that cooperate, complement, or compete with each other in enhancing and/or silencing mRNA decay.
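
To make the problem statement concrete, the following is a minimal brute-force sketch in Python of scoring Boolean pattern pairs against numeric attribute values. It is not the O(N^2) suffix-array algorithm of the article. It naively enumerates short substrings and tests each ordered pair. The scoring function (absolute difference between the mean attribute value of strings where the combination holds and of strings where it does not), the restriction to four Boolean forms, the `max_len` cap, and all function names are assumptions made for illustration only.

```python
from itertools import permutations

# Illustrative subset of Boolean forms on (p occurs in s, q occurs in s).
# Assumption: the article's algorithm covers all Boolean combinations.
FORMS = {
    "p AND q": lambda a, b: a and b,
    "p OR q": lambda a, b: a or b,
    "p AND NOT q": lambda a, b: a and not b,
    "p OR NOT q": lambda a, b: a or not b,
}

def substrings(strings, max_len):
    """Collect every distinct substring of length 1..max_len."""
    found = set()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                found.add(s[i:j])
    return sorted(found)

def score(hits, values):
    """Assumed score: |mean(values where hit) - mean(values where not hit)|."""
    inside = [v for h, v in zip(hits, values) if h]
    outside = [v for h, v in zip(hits, values) if not h]
    if not inside or not outside:
        return 0.0  # a combination that splits nothing carries no signal
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))

def best_pair(strings, values, max_len=3):
    """Return (score, form name, p, q) for the best-scoring ordered pair."""
    pats = substrings(strings, max_len)
    # Precompute, for each pattern, which strings contain it.
    occ = {p: [p in s for s in strings] for p in pats}
    best = (0.0, None, None, None)
    for p, q in permutations(pats, 2):
        for name, form in FORMS.items():
            hits = [form(a, b) for a, b in zip(occ[p], occ[q])]
            sc = score(hits, values)
            if sc > best[0]:
                best = (sc, name, p, q)
    return best

# Toy data (hypothetical): only the first string has a high value.
strings = ["AUUUA", "AUUUAGG", "CCGG", "CC"]
values = [9.0, 1.0, 1.0, 1.0]
print(best_pair(strings, values))
```

In this toy data set every substring of AUUUA also occurs in AUUUAGG, so no single pattern can separate the first string from the rest. A combination with a negated pattern can, which is the kind of case the Boolean pattern pairs are meant to capture.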

Index Terms:
Pattern discovery, Boolean patterns, suffix tree, suffix array.
Hideo Bannai, Heikki Hyyrö, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai, Satoru Miyano, "An O(N^2) Algorithm for Discovering Optimal Boolean Pattern Pairs," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 4, pp. 159-170, Oct.-Dec. 2004, doi:10.1109/TCBB.2004.36