Synthetic Generation of High-Dimensional Datasets
Dec. 2011 (vol. 17 no. 12)
pp. 2317-2324
Georgia Albuquerque, TU Braunschweig
Thomas Löwe, TU Braunschweig
Marcus Magnor, TU Braunschweig
Generation of synthetic datasets is a common practice in many research areas. Such data is often generated to meet specific needs or certain conditions that may not be easily found in the original, real data. The nature of the data varies according to the application area and includes text, graphs, social or weather data, among many others. The common approach to creating such synthetic datasets is to write small scripts or programs that are restricted to small problems or to a specific application. In this paper, we propose a framework designed to generate high-dimensional datasets. Users can interactively create and navigate through multidimensional datasets using a suitable graphical user interface. The data creation is driven by statistical distributions based on a few user-defined parameters. First, a grounding dataset is created according to given inputs, and then structures and trends are included in selected dimensions and orthogonal projection planes. Furthermore, our framework supports the creation of complex non-orthogonal trends and classified datasets. It can successfully be used to create synthetic datasets that simulate important trends such as multidimensional clusters, correlations, and outliers.
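The two-stage process described in the abstract (a grounding dataset first, then structures added in selected dimensions) can be illustrated with a minimal sketch. This is not the authors' framework: the function names, the choice of a standard normal distribution for the base data, the label values, and all parameter defaults are assumptions made purely for illustration.

```python
import random


def generate_base(n_points, n_dims, seed=0):
    """Grounding dataset: independent standard-normal values per dimension
    (the distribution is an assumption of this sketch)."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_points)]
    labels = [0] * n_points  # 0 = background, used for a classified dataset
    return data, labels, rng


def add_correlation(data, rng, dim_x, dim_y, slope=1.0, noise=0.1):
    """Make dim_y depend linearly on dim_x, plus Gaussian noise."""
    for row in data:
        row[dim_y] = slope * row[dim_x] + rng.gauss(0.0, noise)


def add_cluster(data, labels, rng, rows, dims, center, spread=0.2, label=1):
    """Move the given rows into a Gaussian cluster in the selected dimensions."""
    for i in rows:
        for d, c in zip(dims, center):
            data[i][d] = rng.gauss(c, spread)
        labels[i] = label


def add_outliers(data, labels, rng, rows, dims, distance=8.0, label=2):
    """Push the given rows far from the bulk of the data in the selected dimensions."""
    for i in rows:
        for d in dims:
            data[i][d] = rng.choice((-1.0, 1.0)) * distance
        labels[i] = label


def pearson(xs, ys):
    """Sample Pearson correlation coefficient, used here only as a sanity check."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Usage: 500 points in 6 dimensions with one correlated plane,
# one two-dimensional cluster, and a handful of outliers.
data, labels, rng = generate_base(500, 6, seed=1)
add_correlation(data, rng, dim_x=0, dim_y=1, slope=2.0, noise=0.1)
add_cluster(data, labels, rng, rows=range(50), dims=(2, 3), center=(5.0, 5.0))
add_outliers(data, labels, rng, rows=range(50, 55), dims=(4,))
```

Each structure is written into its own set of dimensions, so the steps do not overwrite one another. Combining several structures in the same dimensions would require deciding which one takes precedence, which this sketch does not address.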

Index Terms:
Synthetic data generation, multivariate data, high-dimensional data, interaction.
Georgia Albuquerque, Thomas Löwe, Marcus Magnor, "Synthetic Generation of High-Dimensional Datasets," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2317-2324, Dec. 2011, doi:10.1109/TVCG.2011.237