Issue No. 04 - July/August (2002 vol. 4)
George Karypis, University of Minnesota, Minneapolis
Data mining is the process of automatically extracting new and useful knowledge hidden in large data sets. This emerging discipline is becoming increasingly important as advances in data collection lead to explosive growth in the amount of available data. Data mining techniques have primarily been used to analyze commercial data sets, where they play a critical role in understanding purchasing behaviors for effective consumer relations management, process optimization, personalized marketing, and customer segmentation.
Data mining's success is evident from the large number of commercial data mining software suites available from established companies and specialized independent software vendors. This success has sparked an interest in applying such analysis techniques to various scientific and engineering fields, such as biology, medicine, fluid dynamics, astronomy, ecosystem modeling, and structural mechanics.
The characteristics of data sets in scientific applications differ significantly from those in commercial and business settings, for which most of the existing data mining algorithms were originally developed. Scientific data sets tend to have a strong temporal, sequential, spatial, topological, geometric, and relational nature. Most of the existing data mining algorithms (pattern discovery, clustering, and classification) expect the data to be described as a set of transactions (market-basket data), a sequence of such transactions (historical market-basket data), or multidimensional vectors (demographic characteristics). Sometimes, transforming scientific data sets into these frameworks is easy; often, however, such transformations are either impossible or can be done only with a substantial loss of information.
Moreover, even though there is an abundance of data, that data is often not in a form that can be mined directly. For example, high-resolution images generated by various sky surveys require sophisticated techniques to identify interesting objects and represent them in a usable form for further analysis. Similarly, the continuous stream of values computed for each node or element of a finite-element mesh in scientific numerical simulations cannot be mined effectively as is; nontrivial feature extraction techniques are required to identify the objects of interest and how they are related before intelligent mining can begin.
Despite these challenges, early successes have shown that data mining can be a very effective tool for analyzing and understanding large scientific data sets, and further algorithmic and modeling advances promise to increase its impact and effectiveness. Thus, the purpose of this special issue is both to report on innovative new data mining algorithms suited for analyzing scientific data sets and to present novel applications and success stories from analyzing such data sets with data mining techniques.
In This Issue
Auroop R. Ganguly's article focuses on the rainfall forecasting problem and describes an approach that combines traditional forecasting methods based on physical and statistical models with neural network techniques to improve the overall accuracy of the predictions. This work illustrates that when reasonably good domain-specific models exist, incorporating them in the data mining process can lead to dramatic improvements.
The article by David S. Thompson, Raghu K. Machiraju, Ming Jiang, Jaya Sreevalsan Nair, Gheorghe Craciun, and Satya Sridhar Dusi Venkata provides insight into the challenges involved in analyzing data sets obtained from numerical simulations of study flows. They discuss Evita, the system they are developing for finding interesting features in the 3D flow and using these features for further mining.
The article by Naren Ramakrishnan and Chris Bailey-Kellogg challenges an assumption underlying many data mining algorithms: that data is abundant. They focus on application domains in which obtaining enough data points for analysis is very expensive, and they present sampling strategies to ensure that any collected data point will be useful in subsequent data mining operations.
Robert Grossman and Marco Mazzucco's article stands out from the rest because it does not focus on any particular problem; instead, it presents DataSpace, a distributed, Web-based infrastructure that can facilitate and simplify data analysis over distributed data sources.
Finally, the article by Chandrika Kamath, Erick Cantú-Paz, Imola K. Fodor, and Nu Ai Tang focuses on the problem of classifying galactic objects and provides an insightful overview of the preprocessing required, and its associated challenges, to convert raw image data into a form that is suitable for data mining.
George Karypis is an assistant professor in the Department of Computer Science & Engineering at the University of Minnesota, Minneapolis. His research interests include data mining, parallel computing, bioinformatics, information retrieval, collaborative filtering, and scientific computing. His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS), hypergraph partitioning (hMETIS), parallel Cholesky factorization (PSPASES), collaborative filtering-based recommendation algorithms (SUGGEST), and clustering of high-dimensional data sets (CLUTO). He is a member of the IEEE and the ACM. Contact him at the EE/CS Bldg. 4-192, University of Minnesota, 200 Union St. SE, Minneapolis, MN 55455; email@example.com