Issue No. 04 - July-August (2008 vol. 25)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MDT.2008.122
Scott Davidson, Sun Microsystems
Data Mining: Concepts, Models, Methods, and Algorithms, by Mehmed Kantardzic (Wiley-IEEE Press, 2002, ISBN: 978-0-471-22852-3, 360 pp., $93.50).
In this book, Data Mining: Concepts, Models, Methods, and Algorithms, Mehmed Kantardzic defines data mining as "the entire process of applying a computer-based methodology ... for discovering knowledge from data." We have all been in contact with data mining. Examples include a search engine collecting information from the Web, the local grocery store building information about our shopping habits through discount cards, and an online bookseller trying to guess what other books we might want to buy.
But why review a book on data mining in IEEE Design & Test? Data mining is used to learn the characteristics of a population on the basis of a set of samples. One application of this in fabs is statistical postprocessing, which learns the defect profile of a sample of thoroughly tested ICs, and thus which tests are most useful for detecting errors in the current population. As chips get bigger, test data volumes grow as well. It might be possible to mine this data to discover more about these chips and the processes used to build them. So, this review is written from the viewpoint of a test engineer, not a statistician. Specifically, is this book useful for the kind of work we do?
The book begins with a good chapter on the basic concepts of data mining and on how it differs from simply retrieving information from a database: data mining is useful when we don't know exactly what we are looking for. The next two chapters focus on the data we can expect to get, and how to deal with practical problems such as missing, corrupt, or redundant data. For example, the author describes how to detect and deal with outliers, which in this case means corrupt data rather than data with values far from the mean. Data reduction techniques attempt to reduce the number of dimensions of the data, which makes for faster processing. If two features are strongly correlated, such as manufacturing location and product type, we can remove one of them. We might also reduce the number of possible values by mapping specific test results into broader classes of failures. The binning of ICs after test does this, and similarly, we can bucket a database of retest results from field failures into logic failures, memory failures, I/O failures, and so on.
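As a rough illustration of the value-reduction idea (this is not taken from the book), the bucketing of retest results can be sketched in a few lines of Python. The test names and the `FAILURE_CLASS` mapping are hypothetical; a real flow would derive them from the test program.

```python
from collections import Counter

# Hypothetical mapping from specific failing tests to broad failure classes.
FAILURE_CLASS = {
    "scan_chain_3": "logic",
    "atpg_stuck_at": "logic",
    "mbist_bank0": "memory",
    "mbist_bank1": "memory",
    "serdes_loopback": "io",
}

def bucket_failures(failing_tests):
    """Count failures per broad class; unmapped tests fall into 'other'."""
    return Counter(FAILURE_CLASS.get(test, "other") for test in failing_tests)

# Made-up retest log from field returns, one entry per failing part.
retest_log = ["mbist_bank0", "scan_chain_3", "mbist_bank1",
              "serdes_loopback", "unknown_test"]
counts = bucket_failures(retest_log)
print(dict(counts))
```

Collapsing many test names into a handful of classes shrinks the number of distinct values a later learning step has to handle, at the cost of losing the detail of which test failed.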
The next stage is learning from the data. Learning involves two steps: estimating how the outputs depend on the inputs from sample data, and using that estimate to predict outputs for future input values not covered in the sample. Regression is an example of supervised learning. In supervised learning, the actual responses to the sample are known, and the learning system is modified through error signals, which represent the difference between the predicted and actual responses, as if a teacher were correcting the responses.
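The error-signal idea can be sketched by fitting a straight line to samples whose responses are known. This is a minimal sketch under my own assumptions (a linear model, a fixed learning rate, made-up data), not the book's presentation; the function name `fit_line` is hypothetical.

```python
def fit_line(samples, rate=0.05, epochs=1000):
    """Fit y ~ w*x + b by repeatedly correcting the model with error signals."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = (w * x + b) - y  # error signal: predicted minus actual
            w -= rate * error * x    # nudge the slope against the error
            b -= rate * error        # nudge the intercept against the error
    return w, b

# Step 1: estimate the dependency from samples (made-up data on y = 2x + 1).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_line(samples)

# Step 2: predict the output for an input not covered by the sample.
prediction = w * 4.0 + b
print(round(w, 3), round(b, 3), round(prediction, 3))
```

The two steps in the code mirror the two steps of learning described above: estimation from the sample, then prediction for a new input.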
Most of the remaining chapters of the book survey various learning methods. Chapter 5 is a good introduction to statistical methods, such as regression; but there is nothing new here for most engineers. The next chapter is on clustering methods, which involve partitioning a set of data into buckets whose members are similar in some way. The output of a clustering method is a description of each bucket and a similarity measurement. The book presents several methods and gives helpful examples. The next chapter discusses another approach to classification: the use of decision trees and decision rules.
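To give a feel for what a clustering method produces, here is a minimal one-dimensional k-means sketch. The choice of k-means, the data, and the starting centers are my assumptions for illustration; I am not claiming this is one of the methods as the book presents them.

```python
def kmeans_1d(values, centers, iterations=10):
    """Partition values into buckets around the given starting centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Similarity here is plain distance to the bucket's center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its bucket; keep it if the bucket is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Made-up measurements with two obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, clusters = kmeans_1d(values, centers=[0.0, 6.0])
print(centers, clusters)
```

Each bucket is described by its center, and the distance of a value from that center serves as the similarity measurement.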
Chapter 8, on association rules, describes some of the most commonly used unsupervised learning methods, which involve searching a database and finding interesting rules and patterns. However, the section of this chapter on Web data mining seems badly out of date. Although the book was published in 2002, this section appears to have been written even a few years earlier—pre-Google.
The next several chapters describe somewhat more exotic learning methods. The first is the use of neural networks for data mining, to recognize patterns and associations in data. Although the book gives some practical hints here, the data miner would have to do a lot of work to use those techniques. A few more examples would have been helpful, although references to some books on neural networks do help a little. The next method discussed is genetic algorithms, which are familiar to many in computer science. The chapter is well written, and a long example is provided. The application here is for supervised learning, in which the error signals for several possible mappings from input to output serve as the fitness function to identify the best members of the population of potential mappings.
Possibly more useful is the chapter on fuzzy sets and fuzzy logic. If we are partitioning a group of heights into short, medium, and tall, there will be values right on the boundaries, so traditional techniques have problems producing discrete clusters. The first part of this chapter gives an excellent basic tutorial on the concept and then describes how to apply fuzzy sets to data mining.
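The height example can be made concrete with simple membership functions. The breakpoints (160, 170, 175, and 185 cm) and the function names are assumptions chosen for illustration, not values from the book.

```python
def falling(x, lo, hi):
    """Membership that is 1 at or below lo, 0 at or above hi, linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)

def rising(x, lo, hi):
    """Mirror image of falling: 0 at or below lo, 1 at or above hi."""
    return 1.0 - falling(x, lo, hi)

def height_memberships(cm):
    """Degree to which a height belongs to each fuzzy set (assumed breakpoints)."""
    return {
        "short": falling(cm, 160, 170),
        "medium": min(rising(cm, 160, 170), falling(cm, 175, 185)),
        "tall": rising(cm, 175, 185),
    }

print(height_memberships(165))  # on the short/medium boundary
print(height_memberships(180))  # on the medium/tall boundary
```

A height of 165 cm is not forced into one bucket: it belongs partly to "short" and partly to "medium", which is the behavior that crisp partitioning cannot express.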
The last chapter, on visualization, was disappointing. Although some good requirements for a visualization tool were given, the chapter needs many more pictures. The book ends with two useful appendices, however: the first is a list of data-mining tools and useful websites; the second provides short descriptions of industry applications. These all might be a bit out of date now, but they can still serve as a useful start for those interested in learning about these methods.
So, is this book useful to the test engineer? Yes. In fact, I'm already working on applying some of the things I've learned to my own work. This book is written at the right level, and it doesn't include overly complex mathematical treatments of the concepts. Readers should come away with a solid overview of the topic. Those wishing to implement a data-mining technique would need to dig much deeper, of course. But those in testing can at least glean some good ideas on how to apply data mining to get the most from their piles of test data, and on some of the tools that could help them.