Learning String-Edit Distance
May 1998 (vol. 20 no. 5)
pp. 522-532

Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
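
For orientation, the untrained baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard unit-cost Levenshtein distance and of nearest-prototype classification, not the learned stochastic model the paper proposes. The function names `levenshtein` and `classify`, and the toy prototype list, are assumptions made for this example only.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between sequences a and b.

    Standard dynamic program; every insertion, deletion, and
    substitution costs 1 (the untrained baseline, not a learned cost).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def classify(query, prototypes):
    """Return the label of the prototype closest to query.

    prototypes is a list of (string, label) pairs (hypothetical layout);
    ties go to the earliest prototype in the list.
    """
    return min(prototypes, key=lambda p: levenshtein(query, p[0]))[1]


# Hypothetical toy database of labeled prototypes.
PROTOTYPES = [("tomato", "TOMATO"), ("potato", "POTATO")]
print(levenshtein("kitten", "sitting"))
print(classify("tomahto", PROTOTYPES))
```

A learned distance would keep the same classification loop and replace the fixed unit costs with costs estimated from a corpus of examples.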

Index Terms:
String-edit distance, Levenshtein distance, stochastic transduction, syntactic pattern recognition, spelling correction, string correction, string similarity, string classification, pronunciation modeling, Switchboard corpus.
Eric Sven Ristad, Peter N. Yianilos, "Learning String-Edit Distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 522-532, May 1998, doi:10.1109/34.682181