Issue No. 05 - October (1996 vol. 11)
There is a lot of talk these days about data mining—and this special issue will add some more. To help the material here clarify and illustrate rather than add to the hype, here's a brief perspective on the trends that shaped this issue, as well as a road map for the individual articles.
The growing perception of the advantage of putting information on line, beginning with the success of automated data processing in commercial and scientific enterprises, has led to the collection and storage of ever larger amounts of data. The effort that has gone into ensuring the stability, security, and accessibility of these data has resulted in the substantial database management technology we see today.
So, many enterprises are actively constructing and maintaining very large databases that enjoy strong organizational commitment and rely on specific underlying technologies. A common case is the large transaction database that relies on a software/hardware infrastructure of relational database and data-server technology. A relational database has a well-defined structure of rows and columns. Because the information in the database has been forced into this structure to enable the efficient execution of certain access and manipulation operations, in most environments this in turn bounds the uses to which the data will ever be put. And because only easily stated relational queries are well supported, the data is used—indeed is thought of—strictly in terms of basic query-and-display operations.
Industry has long recognized the limitation imposed by this codependence of databases and their underlying technology. There has, for example, been a decades-old call for better decision support: tools for data analysis beyond the retrieval, manipulation, and graphics meant for usual business processing. In fact, as we often hear, relational databases were originally intended for decision support. However, virtually all of the technology and, more important, the organizational culture of databases have been directed toward transaction processing. Thus, while most database companies contend that their products can be used for some flavor of decision support, knowledgeable observers generally agree that in-depth decision support requires new technology. This new technology should enable the discovery of trends and predictive patterns in data, the creation and testing of hypotheses, and the generation of insight-provoking visualizations. Nondatabase experts and nonstatisticians should find the technology easy to use; it should also accommodate their ever-changing needs by giving clear, rapid answers to their unplanned-for and perhaps informally stated questions.
Many think data mining promises this technology. Just as there is nothing new about the cry for decision support, there is nothing particularly new about the technologies that underlie data mining: visualization, statistics, machine learning, and deductive databases. What is new is the confluence of (fairly) mature offshoots of these technologies at a time when the world is ready to see their value (Java is an interesting comparison). Also new is the emergence of an approach for applying these technologies to real problems.
In fact, one of this issue's themes is that data mining is really a general approach that is supported to varying degrees by a set of technologies. The approach has been shaped by the challenges faced by data-mining technologists in applying their technologies to an existing infrastructure. As always happens when new technologies are applied to an existing infrastructure, the technologists need to do a great deal of "other" work to create an environment in which their technologies can operate effectively (several articles in this issue document this process). For data mining, a consensus is currently emerging on a process for creating the environment and applying the technologies.
Therefore, while this issue consciously attempts to sample the diversity of data-mining technology and applications, it also tries to place this diversity in the context of an overall approach and emerging process.
Road map for this issue
The issue begins with a discussion of the scientific underpinnings and current activities of the field. The article by Usama Fayyad opens the discussion of terminology, positioning data mining as part of an overall endeavor of knowledge discovery and providing insight into the many issues that influence the success or failure of that endeavor.
Evangelos Simoudis's article provides an excellent analysis of the data-mining process, including a detailed description of the contributions of the various underlying technologies. The article illustrates the process with a set of actual examples of how data mining has been used with noteworthy effect in the real world.
The rest of the issue provides in-depth descriptions of actual data-mining practice, covering both applications—solutions to specific problems—and techniques that can be applied to a wide range of problems. The articles show the range of problems and technologies that are relevant to this burgeoning field.
Edmond Mesrobian and his colleagues present Oasis, a data-mining environment designed to help geophysical scientists find interesting phenomena in large databases of collected data. Their article outlines the kind of system solution that is required to address the real-world nature of data mining as an ongoing process. They discuss the issues of dealing with software and hardware heterogeneity, and the need to help users find potentially relevant databases in a large distributed environment. It is useful to view this effort in the context of the guidelines and approach described in the initial Fayyad and Simoudis articles.
The article by Kazuo J. Ezawa and Steven W. Norton describes an enormous problem in the telecommunications industry and the way it is being addressed with data-mining technology. They show not only an important application but also a framework for integrating uncertainty reasoning—a major issue for data mining, as it is for much of AI. In particular, they describe a Bayesian network approach that has produced impressive results, and they very usefully put it in the context of related work and alternative approaches.
George H. John, Peter Miller, and Randy Kerber show the application of a very different technology, rule induction, to a very different problem, stock selection. This domain is important not only because of its role as one of the major "interested parties" for data-mining technology, but also because results are relatively easy—in fact, deceptively easy—to score: they are inherently numerical, and the problems have been studied intensively enough to provide solid metrics for success. Again, the results are impressive. The analysis of those results is particularly instructive, both for the thoroughness of the effort and for the illustration of how difficult it is to really understand results based on large amounts of complex data.
In the next article, Diane J. Cook, Lawrence B. Holder, and Surnjani Djoko show the value of combining technologies to improve data-mining results. They address the issue of automatically finding useful substructures in data (for example, identifying reusable substructures in a complex circuit). An interesting aspect of their work is the explicit investigation of the value of domain knowledge, explored by running a substructure discovery algorithm on the same data with and without the addition of specific domain knowledge.
Finally, Hing-Yan Lee and Hwee-Leng Ong offer a visualization technique for multidimensional data. Their technique has broad applicability and is presented in terms of an easy-to-understand software package, amply demonstrating the appeal of data-mining technology to nonexpert users.
Data mining is an emerging field. It is reasonable to expect some differences in scope and definition—and this special issue certainly meets this expectation. But the main message here is that within the diverse set of technologies and applications, technologists are addressing a central thread of an important area: unlocking the information that is buried in the enormous stock of data we have already put on line and developing the underpinnings for better ways to handle data and support future decision making.
Bill Mark is the director of the National Semiconductor Architecture Laboratory and former Associate Editor-in-Chief of IEEE Expert magazine. His research interests include distributed systems of "smart things" and information appliances. He received his BS and MS in electrical engineering and computer science and his PhD in computer science from the Massachusetts Institute of Technology. He is a member of the AAAI and the ACM. Reach him at the Nat'l Semiconductor Architecture Lab, 2900 Semiconductor Dr. M/S E-100, Santa Clara, CA 95052.