Issue No.04 - April (2012 vol.24)

pp: 735-744

Jasbir Dhaliwal, Royal Melbourne Institute of Technology, Melbourne

Simon J. Puglisi, Royal Melbourne Institute of Technology, Melbourne

Andrew Turpin, The University of Melbourne, Melbourne

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.242

ABSTRACT

In recent years, several algorithms for mining frequent and emerging substring patterns from databases of string data (such as proteins and natural language texts) have been discovered, all of which traverse an enhanced suffix array data structure. All of these algorithms lie at either extreme of the efficiency spectrum; they are either fast and use enormous amounts of space, or they are compact and orders of magnitude slower. In this paper, we present an algorithm that achieves the best of both these extremes, having runtime comparable to the fastest published algorithms while using less space than the most space efficient ones. This excellent practical performance is underpinned by theoretical guarantees. Our main mechanism for keeping memory usage low is to build the enhanced suffix array incrementally, in blocks. Once built, a block is traversed to output patterns with required support before its space is reclaimed to be used for the next block.
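The block-at-a-time idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes "support" means the number of database strings containing a pattern, it partitions suffixes into blocks by their first character (a hypothetical choice), and it uses a plain dictionary in place of an enhanced suffix array. The function name `frequent_substrings` and its parameters are invented for illustration. Only the build, traverse, reclaim loop mirrors the description above.

```python
from collections import defaultdict


def frequent_substrings(database, min_support):
    """Map each substring found in at least min_support strings to its support."""
    results = {}
    alphabet = sorted({ch for s in database for ch in s})
    for ch in alphabet:  # one block per leading character (assumed partitioning)
        # Build the block: for every suffix starting with ch, record which
        # strings contain each of its prefixes. A dict stands in for the
        # enhanced suffix array here.
        block = defaultdict(set)
        for doc_id, s in enumerate(database):
            for i in range(len(s)):
                if s[i] != ch:
                    continue
                for j in range(i + 1, len(s) + 1):
                    block[s[i:j]].add(doc_id)
        # Traverse the block and output patterns with the required support.
        for pattern, docs in block.items():
            if len(docs) >= min_support:
                results[pattern] = len(docs)
        # block is rebound on the next iteration, so its memory can be reclaimed.
    return results
```

For example, `frequent_substrings(["abab", "bab", "cab"], 3)` returns the three patterns `a`, `b`, and `ab`, each with support 3. At any moment only the patterns beginning with one character are held in memory, which is the space-saving effect the abstract attributes to blockwise construction. This toy version takes quadratic time per string and would not scale to real data.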

INDEX TERMS

String mining, suffix array, suffix tree, data mining, algorithms.

CITATION

Jasbir Dhaliwal, Simon J. Puglisi, Andrew Turpin, "Practical Efficient String Mining",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 4, pp. 735-744, April 2012, doi:10.1109/TKDE.2010.242