Issue No.10 - October (2011 vol.23)

pp: 1541-1554

Jordi Nin, CNRS, LAAS, Toulouse

Javier Herranz, Universitat Politècnica de Catalunya, Barcelona

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.190

ABSTRACT

Comparison functions for sequences (of symbols) are important components of many applications, for example, clustering, data cleansing, and integration. For years, many efforts have been made to improve the performance of such comparison functions. These improvements have come either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangular inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large numbers of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, etc.), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other well-known comparison functions such as the Edit or Jaro-Winkler distances.
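The abstract does not specify how the OSA distance is computed, so the sketch below does not attempt to reproduce it. It only illustrates the two notions the abstract relies on: the classic Edit (Levenshtein) distance used as a baseline, and a brute-force check of the triangular inequality on a small sample. The names `edit_distance` and `satisfies_triangle_inequality` are hypothetical helpers introduced for illustration.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance with unit-cost insert, delete, substitute.

    This is the baseline Edit distance, NOT the OSA distance from the paper.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def satisfies_triangle_inequality(dist, items) -> bool:
    """Check d(x, z) <= d(x, y) + d(y, z) for every triple drawn from items.

    Passing this check on a sample is evidence only, not a proof.
    """
    return all(dist(x, z) <= dist(x, y) + dist(y, z)
               for x, y, z in product(items, repeat=3))


if __name__ == "__main__":
    sample = ["ACGT", "AGT", "ACGGT", "TGCA", ""]
    print(edit_distance("kitten", "sitting"))
    print(satisfies_triangle_inequality(edit_distance, sample))
```

A comparison function that fails this check on any triple is not a metric, which matters for metric-space indexing and for clustering methods that assume one. Passing on a finite sample proves nothing in general, which is why the paper gives a mathematical proof.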

INDEX TERMS

Sequences of symbols, string distances, triangular inequality.

CITATION

Jordi Nin, Javier Herranz, "Optimal Symbol Alignment Distance: A New Distance for Sequences of Symbols",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 23, no. 10, pp. 1541-1554, October 2011, doi:10.1109/TKDE.2010.190