Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources
August 2005 (vol. 17 no. 8)
pp. 1111-1126
The Internet has emerged as an ever-growing environment of heterogeneous, autonomous data sources that contain relevant but overlapping information on microorganisms. Microbiologists could therefore benefit considerably from intelligent software agents that assist in navigating this information-rich environment, and from data mining tools that aid in the discovery of new information. These applications depend heavily on well-conditioned data samples that are correlated across multiple information sources; hence, accurate database merging operations are desirable. Information systems designed to join the related knowledge provided by different microbial data sources are hampered by the labeling mechanism used to reference microbial strains and cultures: the labels show syntactic variation in practical usage, and synonymy and homonymy are also known to exist among them. The situation is further complicated by the fact that the label equivalence knowledge is itself recorded in fragments across several data sources, each of which may provide information that is incomplete or incorrect. This paper presents how the extraction and integration of label equivalence information from several distributed data sources led to the construction of a so-called integrated strain database, which helps to resolve most of these problems. Because information retrieved from autonomous resources may be overlapping, incomplete, and incorrect, much effort was devoted to completing missing information, discovering new associations between information objects, and developing and applying tools for error detection and correction. Through a thorough evaluation of the different levels of incompleteness and incorrectness encountered within the incorporated data sources, we demonstrate the added value of the integrated strain database as a necessary service provider for the seamless integration of microbial information sources.
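
The abstract does not spell out the merging procedure, but the index terms name transitive closure and union-find. As a rough illustration only, and not the authors' implementation, the sketch below shows how pairwise label equivalences gathered from several sources could be closed transitively with a union-find structure. The strain labels, the `normalize` rule, and all function names are hypothetical and chosen for this example.

```python
import re
from collections import defaultdict


def normalize(label):
    """Collapse purely syntactic variation (case, spaces, punctuation).

    Assumption: this simple rule is illustrative only; real strain labels
    may need collection-specific handling.
    """
    return re.sub(r"[^A-Za-z0-9]", "", label).upper()


class UnionFind:
    """Disjoint-set forest with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        # First pass: locate the representative of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def equivalence_classes(pairs):
    """Group labels into classes implied by the pairwise equivalences."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(normalize(a), normalize(b))
    groups = defaultdict(set)
    for label in list(uf.parent):
        groups[uf.find(label)].add(label)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical equivalences as they might be reported by different sources.
pairs = [
    ("LMG 1234", "ATCC 5678"),   # source A
    ("atcc-5678", "DSM 42"),     # source B, with syntactic variation
    ("NCIMB 9", "CCUG 77"),      # source C
]
classes = equivalence_classes(pairs)
```

In this toy input, no single source states that "LMG 1234" and "DSM 42" are equivalent; the link follows only from combining sources A and B. The same mechanism also propagates any incorrect pair to a whole class, which is one reason the error detection and correction tools mentioned in the abstract matter.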


Index Terms: Transitive closure, union-find, homology, synonymy, error detection/correction, microbiology.
Peter Dawyndt, Marc Vancanneyt, Hans De Meyer, Jean Swings, "Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1111-1126, Aug. 2005, doi:10.1109/TKDE.2005.131