Generating Synthetic Data to Match Data Mining Patterns
May/June 2008 (vol. 12 no. 3)
pp. 78-82
Josh Eno, University of Arkansas
Craig W. Thompson, University of Arkansas
Synthetic data sets can be useful for repeatable regression testing and for providing realistic (but not real) data to third parties for testing new software. In some cases, it is desirable for the synthetic data set to be realistic in the sense of preserving various properties of the original data. Several existing synthetic data generators produce data that superficially matches known characteristics of the original data. This paper shows how to generate data that exhibits some of the same hidden patterns that data mining algorithms can discover, in particular decision tree patterns.
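
To make the idea of data that matches a decision tree pattern concrete, here is a minimal Python sketch. It is not the authors' generator. It assumes a tree has already been reduced to a list of leaves, each with a per-attribute value range and a count of training rows per class. The attribute names (`x1`, `x2`), class labels, counts, and the `generate_rows` helper are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical leaves of a decision tree (illustrative values, not from the paper).
# Each leaf lists the attribute ranges implied by the path to it and the number
# of training rows of each class that reached it.
LEAVES = [
    {"ranges": {"x1": (0.0, 2.5), "x2": (0.0, 10.0)},
     "class_counts": {"A": 50}},
    {"ranges": {"x1": (2.5, 7.0), "x2": (0.0, 4.0)},
     "class_counts": {"B": 45, "C": 5}},
    {"ranges": {"x1": (2.5, 7.0), "x2": (4.0, 10.0)},
     "class_counts": {"B": 2, "C": 48}},
]


def generate_rows(leaves, n_rows, seed=0):
    """Sample rows whose attributes and labels are consistent with the leaves."""
    rng = random.Random(seed)
    # Pick leaves in proportion to how many training rows reached each one.
    leaf_weights = [sum(leaf["class_counts"].values()) for leaf in leaves]
    rows = []
    for _ in range(n_rows):
        leaf = rng.choices(leaves, weights=leaf_weights, k=1)[0]
        # Assumption: attribute values are uniform within the leaf's range.
        row = {name: rng.uniform(low, high)
               for name, (low, high) in leaf["ranges"].items()}
        # Draw the label from the leaf's class distribution.
        labels = list(leaf["class_counts"])
        counts = list(leaf["class_counts"].values())
        row["label"] = rng.choices(labels, weights=counts, k=1)[0]
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in generate_rows(LEAVES, 5, seed=42):
        print(row)
```

Because each row is drawn from inside a leaf's region with that leaf's class mix, a tree learner run on a large enough sample would be expected to recover splits near the same thresholds. The uniform within-leaf distribution is a simplifying assumption; a real generator might preserve other properties of the original data as well.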

Index Terms:
Synthetic data generation, data mining, decision trees
Citation:
Josh Eno, Craig W. Thompson, "Generating Synthetic Data to Match Data Mining Patterns," IEEE Internet Computing, vol. 12, no. 3, pp. 78-82, May-June 2008, doi:10.1109/MIC.2008.55