Issue No. 04 - July/August (2006 vol. 10)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIC.2006.79
Mehmed Kantardzic , University of Louisville
Samuel Madden , Massachusetts Institute of Technology
Anup Kumar , University of Louisville
Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data-acquisition technologies, have made it possible to gather and store enormous volumes of data. One example is the hundreds of terabytes of DNA, protein-sequence, and gene-expression data that biological sciences researchers have gathered at steadily increasing rates. Similarly, data warehouses store massive quantities of information about various aspects of business operations. Warehouses of international retailers (such as Wal-Mart) are typically multi-terabyte databases that contain information about retail transactions by customers all over the world. Finally, complex distributed systems (computer systems, communication networks, and power systems, for example) are equipped with sensors and measurement devices that gather and store a variety of data for use in monitoring, controlling, and improving their operations. 1
These developments have created unprecedented opportunities for large-scale data-driven knowledge discovery, as well as the potential for fundamental gains in scientific and business understanding. Data mining technology has emerged as a means of performing this discovery. 2,3 The field draws on extensive work in areas such as statistics, machine learning, pattern recognition, databases, and high-performance computing. The popularity of the Internet and the Web makes it imperative that the data mining framework be extended to include distributed and time-dependent information and tools. The added dimension of distribution significantly increases the complexity of the data mining process. Therefore, data mining requires a structured framework that helps users translate an application-domain problem into a set of data mining tasks at a higher level of abstraction, with less effort, and without knowing all the technical details of the distributed infrastructure. The Web architecture, with layered protocols and services, provides a sound framework for supporting distributed data mining.
Tackling the Distribution Issue
The emergence of these tremendous data sets creates a growing need for analyzing them across geographical lines using distributed and parallel systems. Implementations of data mining techniques on high-performance distributed computing platforms are moving away from centralized computing models for both technical and organizational reasons. In some cases, centralization is hard because it requires these multi-terabyte data sets to be transmitted over very long distances. In others, centralization violates privacy legislation, exposes business secrets, or poses other social challenges. Common examples of such challenges arise in medicine, where relevant data might be spread among multiple parties: commercial organizations such as drug companies or hospitals, government bodies such as the US Food and Drug Administration, and nongovernment organizations such as charities and public-health organizations. Each organization is bound by regulatory restrictions, such as privacy legislation, or corporate requirements on proprietary information that could give competitors a commercial advantage. Consequently, a need exists for developing algorithms, tools, services, and infrastructure that let us mine data distributed across organizations for patterns while preserving privacy.
This shift toward intrinsically distributed, complex environments has prompted a range of new data mining research challenges. In addition to data being distributed, the advent of the Internet has led to increasingly complex data, 4 including natural language text, images, time series, sensor data, multi-relational and object data types, and so on. To further complicate matters, systems need incremental or online mining tools that don't require complete remining whenever a change is made to the underlying data. Providing these features in distributed data mining systems requires novel solutions. 5 It's crucial that we be able to ensure scalability and interactivity as data mining infrastructure continues to grow substantially in size and complexity. 6 Ultimately, systems must be able to hide this technological complexity from users.
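To make the idea of incremental mining concrete, here is a minimal Python sketch of a summary that is updated one record at a time and can be merged across sites, so that new data never forces a rescan of old data. The class name `RunningStats` and its methods are hypothetical and chosen only for illustration; they do not come from any system described in this issue. The update rule is Welford's online algorithm, and the merge is the standard pairwise combination of counts, means, and sums of squared deviations.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical mergeable summary: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        """Fold one new observation into the summary (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two summaries built on disjoint data, without the raw data."""
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far (0.0 if empty)."""
        return self.m2 / self.n if self.n else 0.0
```

In this sketch, each site would call `update` as records arrive and exchange only its three-number summary; merging the summaries yields the same mean and variance as a single pass over the pooled data, up to floating-point rounding. Real incremental mining tasks such as frequent item-set maintenance need richer summaries, but the same update-and-merge structure applies.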
In this Issue
The articles in this special issue aim to provide a snapshot of the latest developments in distributed data mining.
In "Distributed Data Mining in Peer-to-Peer Networks," Souptik Datta and his coauthors discuss the challenges of data mining in peer-to-peer systems and highlight potential applications for P2P data mining in mobile ad hoc networks, sensor networks, distributed federated databases, e-commerce, and other such environments. The authors identify the significance of local algorithms in P2P data mining and classify them as exact-local or approximate-local algorithms. They present algorithms such as majority voting, frequent item-set mining, and K-means clustering for potential use in solving distributed data mining problems in P2P networks. Researchers are still working to meet the challenges of developing mature versions of these algorithms and integrating them into real-world data mining applications in P2P environments.
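As a rough illustration of how K-means can run over distributed data, the following Python sketch has each site compute per-cluster coordinate sums and counts on its own records and share only those aggregates. It is not the algorithm from the article: all function names are hypothetical, and the combining step is written as a simple summation that assumes some aggregation mechanism exists, whereas a true P2P algorithm would compute or approximate that aggregate through local exchanges among neighboring peers.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Stats = List[Tuple[Point, int]]  # per-cluster (coordinate sums, point count)


def nearest(p: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))


def local_stats(points: Sequence[Point], centroids: Sequence[Point]) -> Stats:
    """Run at each site: summarize local points per cluster; raw points stay local."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        k = nearest(p, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return [(tuple(s), c) for s, c in zip(sums, counts)]


def combine(all_stats: Sequence[Stats], centroids: Sequence[Point]) -> List[Point]:
    """Add up the sites' aggregates and recompute each centroid."""
    updated = []
    for k, old in enumerate(centroids):
        total = sum(stats[k][1] for stats in all_stats)
        if total == 0:
            updated.append(old)  # empty cluster: keep the previous centroid
            continue
        updated.append(tuple(
            sum(stats[k][0][d] for stats in all_stats) / total
            for d in range(len(old))))
    return updated


def distributed_kmeans(sites: Sequence[Sequence[Point]],
                       centroids: Sequence[Point],
                       rounds: int = 10) -> List[Point]:
    """Iterate until the centroids stop changing or the round limit is reached."""
    centroids = list(centroids)
    for _ in range(rounds):
        updated = combine([local_stats(s, centroids) for s in sites], centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Calling `distributed_kmeans` with one list of points per site and a list of starting centroids returns the final centroids. Exchanging sums and counts instead of records reduces communication, but it is not by itself a privacy guarantee, since aggregates over small clusters can still reveal individual values.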
"Mining Text with Pimiento," by Juan José García Adeva and Rafael Calvo, describes an object-oriented application framework (OOAF) for text mining. The article highlights the need for mining unstructured information and then identifies the OOAF's benefits. The authors compare some existing text-mining systems' features and discuss how to assess a text-mining framework's quality. They then propose the Pimiento framework, which offers distributed text categorization, language identification, clustering, and similarity analysis. The article concludes with a case study of Pimiento's use in plagiarism detection.
In "Anteater: A Service-Oriented Architecture for High-Performance Data Mining," Dorgival Guedes, Wagner Meira Jr., and Renato Ferreira describe a SOA that offers simple abstractions for users and supports computationally intensive applications for data mining. Using Anthill, a runtime system for iterative distributed applications, they describe and evaluate distributed runtime systems, parallel implementations of decision trees, and association-analysis algorithms. The article then details the authors' Web-services-based Anteater data mining environment for distributed applications. The system is operational and in use for various public health and public safety applications in Brazil.
"Service-Oriented Distributed Data Mining," by William K. Cheung and his coauthors, provides a flexible service-oriented distributed data mining (DDM) framework. In this framework, the authors have adopted the Business Process Execution Language for Web Services (BPEL4WS) to carry out DDM services. To provide privacy control for data, they've used a recently proposed approach called learning from abstraction in their environment. They demonstrate their framework's overall effectiveness by implementing two DDM applications, including distributed data clustering and distributed manifold unfolding for visualization.
Taken as a whole, these articles illustrate the state of the art in distributed data mining, highlighting many of the opportunities and challenges we've outlined. As the amount of data available on the Internet grows, the need for carrying out effective knowledge discovery is becoming ever more critical. The key challenge in achieving this objective is building fast and effective distributed architectures for mining data spread across multiple geographical locations.
Anup Kumar is a professor in the computer engineering and computer science department and the director of the Mobile Information Network and Distributed System (MINDS) lab at the University of Louisville. His research interests include distributed computing, mobile systems, and sensor networks. Kumar has a PhD in electrical and computer engineering from North Carolina State University. He is a senior member of the IEEE. Contact him at firstname.lastname@example.org.
Mehmed Kantardzic is a professor in the computer engineering and computer science department and the director of the Data Mining Lab at the University of Louisville. His research interests include data mining and knowledge discovery, Web-services-based infrastructure for distributed data mining, stream mining, and link analysis. Kantardzic has a PhD in computer science from the University of Sarajevo, Bosnia. He is the author of two data mining books published by Wiley-IEEE Press in 2003 and 2005. Contact him at email@example.com.
Samuel Madden is an assistant professor in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. His research interests include database system design, query processing and optimization, and sensor networks. Madden has a PhD from the University of California, Berkeley. Contact him at firstname.lastname@example.org.