Tries for Approximate String Matching
August 1996 (vol. 8 no. 4)
pp. 540-547

Abstract: Tries offer text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k = 0 and k = 1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k = 2. For larger files, complexity arguments indicate that tries will outperform the linear methods for larger values of k. Trie indexes combine suffixes and so are compact in storage. When the text itself does not need to be stored, as in a spelling checker, we even obtain negative overhead: 50% compression. We discuss a variety of applications and extensions, including best match (for spelling checkers), case insensitivity, and limited approximate regular expression matching.
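
To make the idea of approximate search over a trie concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a plain in-memory word trie (the paper targets suffix-based tries on secondary storage) and uses the common textbook technique of carrying one row of the edit-distance table down each trie branch, abandoning a branch as soon as every entry in the row exceeds k. It counts substitutions, insertions, and deletions only; transpositions are omitted for brevity. The names TrieNode, build_trie, and search_within_k are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One trie node: children keyed by character, plus an end-of-word flag."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: List[str]) -> TrieNode:
    """Insert every word into a fresh trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def search_within_k(root: TrieNode, pattern: str, k: int) -> List[Tuple[str, int]]:
    """Return (word, distance) for every stored word within k edits of pattern.

    Edits are substitutions, insertions, and deletions (no transpositions).
    """
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance from "" to pattern[:j] is j.
    first_row = list(range(len(pattern) + 1))

    def walk(node: TrieNode, prefix: str, prev_row: List[int]) -> None:
        # prev_row[-1] is the distance between prefix and the whole pattern.
        if node.is_word and prev_row[-1] <= k:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(pattern) + 1):
                cost = 0 if pattern[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # skip a pattern character
                               prev_row[j] + 1,         # skip a trie character
                               prev_row[j - 1] + cost))  # match or substitute
            # Row minima never decrease with depth, so a row whose entries all
            # exceed k cannot lead to a match: prune the whole subtree.
            if min(row) <= k:
                walk(child, prefix + ch, row)

    walk(root, "", first_row)
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["trie", "tree", "tried", "try", "text"])
    print(search_within_k(trie, "trie", 1))
```

In this sketch the work done depends on how many trie branches survive pruning rather than on how many words are stored, which is the intuition behind a search cost that does not grow linearly with the document.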

Index Terms:
Approximate matching, text search, trie.
Citation:
H. Shang and T.H. Merrett, "Tries for Approximate String Matching," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 4, pp. 540-547, Aug. 1996, doi:10.1109/69.536247