Shichao Zhang, University of Technology, Sydney
Chengqi Zhang, University of Technology, Sydney
Qiang Yang, Hong Kong University of Science and Technology
Pages: 12–13
Many data analysis applications, such as data mining, information retrieval, machine learning, Web data management, data warehousing, and pattern recognition, need information enhancement. This involves taking data in their raw form, removing as much noise and redundancy as possible, and bringing out a core that's ready for further processing. Indeed, information enhancement, which straddles data preprocessing and data mining, is often a less glamorous but more critical step than others in data mining applications; a minor information enhancement adjustment can bring higher effectiveness. Information enhancement is therefore a crucial research topic. However, much work in relevant fields, such as data mining, is based only on quality data. That is, researchers have been assuming that the input to data mining algorithms conforms to a well-defined data distribution and contains no missing, inconsistent, or incorrect values.1 This leaves a large gap between the available data and the machinery available to process it.
Information enhancement is important in three aspects:
We observe that data preprocessing and data mining aren't two separate steps in the data mining life cycle; instead, they form a single process that straddles the traditional phase boundaries. Indeed, in real-world practice, data cleaning and data mining are intertwined.
In particular, as the Web rapidly becomes a channel for a flood of information, individuals and organizations factor the Internet's low-cost information and knowledge into their decisions. Researchers and practitioners must therefore intensify efforts to develop appropriate techniques for efficiently using and managing data. Although data mining technology can support data analysis applications within these organizations, we must be able to enhance information from raw data to enable efficient, high-quality knowledge discovery. Developing information enhancement technologies and methodologies is thus a challenging and critical task.
The articles in this special issue emphasize practical techniques and methodologies for information enhancement for data mining applications. We have striven to include in this issue articles that can benefit all areas of data analysis.
We can categorize the articles into three main parts: data clustering, data cleaning, and Web intelligence.
Eugene Tuv and George Runger propose a new scoring method for variables in heterogeneous (mixed-type) data, using a nontraditional clustering approach called supervised-contrasting-independence clustering. Their method is computationally efficient and flexible in mapping categorical variables to numeric scores in mixed-type data.
Taghi M. Khoshgoftaar, Naeem Seliya, and Shi Zhong describe an interactive approach to software quality estimation that combines unsupervised learning with expert input. This approach is effective in predicting both software modules' fault proneness and potential "noisy" (for example, mislabeled) modules.
Mong Li Lee, Wynne Hsu, and Vijay Kothari formalize a solution to a new real-life data-quality problem, just one of potentially many. Their approach uses context information to clean up spurious links in data: it first identifies and retrieves the data containing potential spurious links, then performs a context similarity comparison to determine records with high overlaps. The degree of context overlap indicates the likelihood that spurious links exist.
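As a rough illustration of the context-comparison idea, here is a minimal sketch, assuming a set-based context representation and a Jaccard overlap measure; the function names and threshold are our simplifications, not the authors' actual technique.

```python
# Illustrative sketch (not Lee, Hsu, and Kothari's exact method):
# score a link between two records by how much their context
# attributes overlap; following the text, high overlap signals a
# potentially spurious link worth inspection.

def context_overlap(ctx_a, ctx_b):
    """Jaccard similarity between two records' context-attribute sets."""
    a, b = set(ctx_a), set(ctx_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_spurious(links, contexts, threshold=0.5):
    """Return the links whose endpoint contexts overlap at or above
    the threshold (hypothetical cutoff chosen for illustration)."""
    return [(u, v) for u, v in links
            if context_overlap(contexts[u], contexts[v]) >= threshold]
```

The flagged candidates would then undergo closer examination rather than automatic deletion.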
Data quality is of prime concern to any task involving data analysis. Choh Man Teng designs a process that corrects potential errors through tenfold cross-validation, comprising a prediction phase and an adjustment phase. In the prediction phase, the algorithm identifies suspect elements in the data and nominates a replacement value for each. In the adjustment phase, it selectively incorporates the nominated changes into the analyzed data set.
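The general prediction-and-adjustment idea can be sketched as follows. This is a minimal illustration, not Teng's exact algorithm: it substitutes leave-one-out prediction with a simple nearest-neighbor vote for the paper's tenfold cross-validation setup, and the function names are ours.

```python
# Hedged sketch of a prediction-and-adjustment correction loop.
# Each record is (feature_tuple, label).

def knn_labels(train, x, k=3):
    """Labels of the k records nearest to x (squared Euclidean distance)."""
    ranked = sorted(train,
                    key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], x)))
    return [label for _, label in ranked[:k]]

def polish(data, k=3):
    corrected = []
    for i, (x, y) in enumerate(data):
        # Prediction phase: predict record i's label from the rest
        # (leave-one-out here as a simple stand-in for tenfold CV).
        held_out = data[:i] + data[i + 1:]
        votes = knn_labels(held_out, x, k)
        nominee = max(set(votes), key=votes.count)
        # Adjustment phase: selectively incorporate the nomination,
        # here only when all k neighbors back it unanimously.
        if nominee != y and votes.count(nominee) == k:
            corrected.append((x, nominee))
        else:
            corrected.append((x, y))
    return corrected
```

The selectivity in the adjustment phase matters: replacing every disagreement wholesale would risk overwriting correct but atypical values.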
In the CRM (customer relationship management) industry, analyzing customer churn is important for retaining customers and delivering greater value. To highlight the importance of preprocessing data in preparation for predicting customer behavior such as churn, Lian Yan, Richard H. Wolniewicz, and Robert Dodier present techniques, experiences, and lessons from customer behavior prediction in the telecom industry. The data preparation involves understanding customers' data and business practices, defining modeling targets, extracting data, reprocessing raw data, and compensating the available data for greater model accuracy.
Ying Yang and Xindong Wu analyze four parameters that can help measure the performance of an induction algorithm in feature elimination. They design a feature elimination method that considers not only the data and the target concept, but also the induction algorithm that will learn the target concept from the data.
Doru Tanasa and Brigitte Trousse advocate an approach for preprocessing multiple Web server logs for Web usage mining. This method can increase data quality and significantly reduce the size of the Web servers' log files while preserving the relevant entries.
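One typical step in this kind of preprocessing, shown here as a hedged sketch rather than Tanasa and Trousse's actual method, is filtering out requests for embedded resources and known robots before usage mining; the predicate name and robot list below are illustrative.

```python
# Minimal illustration of a common Web-log cleaning step: keep only
# log entries that plausibly correspond to human page views.

import re

# Requests for embedded resources (images, stylesheets, scripts)
# inflate logs without representing distinct page views.
RESOURCE = re.compile(r'\.(gif|jpg|jpeg|png|css|js|ico)(\?|$)', re.I)

# A tiny sample of robot user-agent substrings (real filters use
# much longer, regularly updated lists).
ROBOT_AGENTS = ('googlebot', 'slurp', 'crawler', 'spider')

def keep_entry(path, user_agent):
    """True if a log entry is a candidate page view for usage mining."""
    if RESOURCE.search(path):
        return False
    return not any(bot in user_agent.lower() for bot in ROBOT_AGENTS)
```

Applied across several servers' merged logs, even a simple filter like this typically removes a large share of the raw entries.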
The diversity of data and data mining tasks offers many challenging research issues for information enhancement.
Grants from the Australian ARC and UTS in Australia support Shichao Zhang's work. Grants from the Australian ARC, UTS in Australia, and the Australian CMCRC support Chengqi Zhang's work. Grants from the Hong Kong Research Grant Committee and the Hong Kong University of Science and Technology support Qiang Yang's work.