q-Gram Matching Using Tree Models
April 2006 (vol. 18 no. 4)
pp. 433-447
Prahlad Fogla; Wenke Lee, IEEE Computer Society
q-gram matching is used for approximate substring matching problems in a wide range of application areas, including intrusion detection. In this paper, we present a tree-based model to perform fast, linear-time q-gram matching. All q-grams present in the text are stored in a tree structure similar to a trie. We use a tree redundancy pruning algorithm to reduce the size of the tree without losing any information. We also use suffix links for fast q-gram search during query matching. We compare our work with the Rabin-Karp-based hash-table technique, commonly used for multiple q-gram search. We present results of experiments on system call sequence data used for intrusion detection.
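
To make the idea of storing q-grams in a trie-like tree concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it omits the redundancy pruning and suffix links described above, so each query window costs O(q) steps instead of the linear-time behavior the paper reports. The function names and the system call names in the example are hypothetical, and the baseline uses Python's built-in set as a stand-in for a hash table rather than a Rabin-Karp rolling hash.

```python
def build_qgram_trie(text, q):
    """Insert every length-q window of `text` into a nested-dict trie.

    Assumption: a plain trie of depth q, with no pruning and no suffix links.
    """
    root = {}
    for start in range(len(text) - q + 1):
        node = root
        for symbol in text[start:start + q]:
            # Create the child node on first use, then descend into it.
            node = node.setdefault(symbol, {})
    return root


def contains_qgram(root, gram):
    """Return True if `gram` labels a path from the root of the trie.

    Every stored path has depth q, so a length-q gram that can be
    followed to the end is one of the stored q-grams.
    """
    node = root
    for symbol in gram:
        node = node.get(symbol)
        if node is None:
            return False
    return True


def unmatched_positions(root, query, q):
    """Start positions in `query` whose q-gram does not occur in the trie."""
    return [
        start
        for start in range(len(query) - q + 1)
        if not contains_qgram(root, query[start:start + q])
    ]


def build_qgram_set(text, q):
    """Hash-based baseline: the set of all q-grams of `text` as tuples."""
    return {tuple(text[i:i + q]) for i in range(len(text) - q + 1)}


if __name__ == "__main__":
    # Hypothetical system call traces, for illustration only.
    normal = ["open", "read", "mmap", "mmap", "open", "read", "close"]
    query = ["open", "read", "mmap", "open", "read", "close"]

    trie = build_qgram_trie(normal, q=3)
    print(unmatched_positions(trie, query, q=3))
```

In an intrusion-detection setting, the positions returned by `unmatched_positions` would mark windows of the query trace that never appear in the reference trace. The set-based baseline answers the same membership question, which makes it a convenient check on the trie.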

Index Terms:
Intrusion detection, q-gram matching, pattern matching, search problems, string matching, suffix tree, trees, tree data structure, word processing.
Citation:
Prahlad Fogla, Wenke Lee, "q-Gram Matching Using Tree Models," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, pp. 433-447, April 2006, doi:10.1109/TKDE.2006.66