Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources
Issue No.08 - August (2005 vol.17)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.131
The Internet has emerged as an ever-growing environment of multiple heterogeneous and autonomous data sources that contain relevant but overlapping information on microorganisms. Microbiologists might therefore benefit considerably from the design of intelligent software agents that assist in navigating this information-rich environment, together with the development of data mining tools that can aid in the discovery of new information. These applications depend heavily upon well-conditioned data samples that are correlated with multiple information sources; hence, accurate database merging operations are desirable. Information systems designed for joining the related knowledge provided by different microbial data sources are hampered by the labeling mechanism for referencing microbial strains and cultures, which suffers from syntactical variation in the practical usage of the labels; in addition, synonymy and homonymy are known to exist among the labels. This situation is further complicated by the observation that the label equivalence knowledge is itself recorded in fragments across several data sources, each of which can be suspected of providing information that is both incomplete and incorrect. This paper presents how extraction and integration of label equivalence information from several distributed data sources has led to the construction of a so-called integrated strain database, which helps to resolve most of the above problems. Given that information retrieved from autonomous resources might be overlapping, incomplete, and incorrect, much effort was invested in the completion of missing information, the discovery of new associations between information objects, and the development and application of tools for error detection and correction.
Through a thorough evaluation of the different levels of incompleteness and incorrectness encountered within the incorporated data sources, we have finally given proof of the added value of the integrated strain database as a necessary service provider for the seamless integration of microbial information sources.
Index Terms- Transitive closure, union-find, homonymy, synonymy, error detection/correction, microbiology.
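As a rough illustration of how the transitive closure of label equivalences can be computed with a union-find structure, the following minimal Python sketch groups strain labels into equivalence classes. It is not the authors' implementation: the `UnionFind` class, the `normalize` and `equivalence_classes` helpers, the whitespace/case normalization rule, and the sample label pairs are all assumptions made for this example.

```python
class UnionFind:
    """Disjoint-set structure over arbitrary hashable labels (illustrative)."""

    def __init__(self):
        self.parent = {}

    def find(self, label):
        # Register unseen labels as their own representative.
        self.parent.setdefault(label, label)
        root = label
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[label] != root:
            next_label = self.parent[label]
            self.parent[label] = root
            label = next_label
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def normalize(label):
    # Assumed normalization for syntactical variation:
    # upper-case the label and collapse runs of whitespace.
    return " ".join(label.upper().split())


def equivalence_classes(pairs):
    """Return the label sets implied by the transitive closure of `pairs`."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(normalize(a), normalize(b))
    classes = {}
    for label in list(uf.parent):
        classes.setdefault(uf.find(label), set()).add(label)
    return list(classes.values())


# Hypothetical equivalence statements, as if gathered from separate sources.
pairs = [
    ("LMG 1", "ATCC 2"),
    ("ATCC 2", "DSM 3"),
    ("lmg  1", "NCIMB 4"),
    ("CCUG 5", "JCM 6"),
]
for group in equivalence_classes(pairs):
    print(sorted(group))
```

Because a plain transitive closure merges everything reachable through any recorded equivalence, a single incorrect or homonymous label can fuse unrelated strains into one class, which is one reason error detection and correction matter in this setting.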
Marc Vancanneyt, Hans De Meyer, Peter Dawyndt, "Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources", IEEE Transactions on Knowledge & Data Engineering, vol.17, no. 8, pp. 1111-1126, August 2005, doi:10.1109/TKDE.2005.131