Issue No.03 - May/June (2008 vol.12)
pp: 78-82
Josh Eno, University of Arkansas
Craig W. Thompson, University of Arkansas
ABSTRACT
Synthetic data sets are useful for repeatable regression testing and for giving third parties realistic (but not real) data for testing new software. In some cases, the synthetic data set should preserve various properties of the original data. Several existing synthetic data generators produce data that superficially matches known characteristics of the original data. This paper shows how to generate data that also exhibits some of the hidden patterns that data mining algorithms can discover, in particular decision tree patterns.
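As a rough illustration of the idea in the abstract, and not the authors' actual generator, the sketch below emits synthetic records that fall inside the leaves of a decision tree, so that a tree learner run on the output could recover similar splits. The leaf format, attribute names, thresholds, and record counts are all hypothetical and chosen only for the example.

```python
import random

# Hypothetical leaf descriptions: each decision tree leaf is reduced to a set
# of attribute ranges, the class it predicts, and a record count to emit.
# All names, bounds, and counts below are illustrative assumptions.
LEAVES = [
    {"ranges": {"petal_length": (1.0, 2.5), "petal_width": (0.1, 0.6)},
     "label": "setosa", "count": 50},
    {"ranges": {"petal_length": (2.5, 7.0), "petal_width": (0.1, 1.7)},
     "label": "versicolor", "count": 50},
    {"ranges": {"petal_length": (2.5, 7.0), "petal_width": (1.7, 2.5)},
     "label": "virginica", "count": 50},
]


def generate(leaves, seed=0):
    """Draw each attribute uniformly inside its leaf's range and attach the leaf's class."""
    rng = random.Random(seed)  # fixed seed keeps the output repeatable
    rows = []
    for leaf in leaves:
        for _ in range(leaf["count"]):
            row = {name: rng.uniform(lo, hi)
                   for name, (lo, hi) in leaf["ranges"].items()}
            row["species"] = leaf["label"]
            rows.append(row)
    rng.shuffle(rows)  # avoid emitting records grouped by leaf
    return rows


def classify(row):
    """The assumed source tree, written as nested tests, used to check the output."""
    if row["petal_length"] < 2.5:
        return "setosa"
    return "versicolor" if row["petal_width"] < 1.7 else "virginica"


if __name__ == "__main__":
    rows = generate(LEAVES)
    agreeing = sum(classify(r) == r["species"] for r in rows)
    print(f"{agreeing} of {len(rows)} synthetic rows agree with the source tree")
```

Uniform sampling within a leaf preserves only the split structure. It ignores the distribution of values inside each leaf and any correlation between attributes, which a realistic generator would also need to model.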
INDEX TERMS
Synthetic data generation, data mining, decision trees
CITATION
Josh Eno, Craig W. Thompson, "Generating Synthetic Data to Match Data Mining Patterns", IEEE Internet Computing, vol. 12, no. 3, pp. 78-82, May/June 2008, doi:10.1109/MIC.2008.55