Hierarchical Decision Tree Induction in Distributed Genomic Databases
August 2005 (vol. 17 no. 8)
pp. 1138-1151
Classification based on decision trees is an important problem in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms, such as federated and peer-to-peer databases, are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and bioinformatics, by the emergence of systems that automatically analyze these databases, and by the expectation that these databases will soon contain large amounts of high-dimensional genomic data. Current decision tree algorithms require high communication bandwidth when executed on such data in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction that is nevertheless sufficient to derive exactly the same decision tree as a sequential learner would learn from all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm exploits the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99 percent. Scalability tests show that the algorithm scales well with the size of the data set, the dimensionality of the data, and the size of the distributed system.
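
The exactness claim in the abstract rests on the fact that a split criterion can be evaluated from aggregated counts rather than from raw records. The sketch below illustrates only that baseline idea: each site sends its full table of (attribute, value, class) counts, and the merged table selects the same split as a learner that sees all rows. It is not the algorithm of the paper, which sends only a fraction of these statistics. The Gini index as split criterion, the toy data, and all function names are assumptions made for illustration.

```python
from collections import Counter

def local_counts(rows, n_attrs):
    """Per-site statistics: counts of (attribute, value, class) triples."""
    counts = Counter()
    for *attrs, label in rows:
        for a in range(n_attrs):
            counts[(a, attrs[a], label)] += 1
    return counts

def gini_of_split(counts, attr):
    """Weighted Gini index of splitting on attr, using counts only."""
    by_value = {}
    for (a, v, c), n in counts.items():
        if a == attr:
            by_value.setdefault(v, Counter())[c] += n
    total = sum(sum(cc.values()) for cc in by_value.values())
    gini = 0.0
    for cc in by_value.values():
        n_v = sum(cc.values())
        impurity = 1.0 - sum((n / n_v) ** 2 for n in cc.values())
        gini += (n_v / total) * impurity
    return gini

def best_attribute(counts, n_attrs):
    """Attribute with the lowest weighted Gini index."""
    return min(range(n_attrs), key=lambda a: gini_of_split(counts, a))

# Hypothetical toy data: each row is (snp0, snp1, class).
site_a = [(0, 0, 0), (0, 1, 0), (1, 0, 1)]
site_b = [(1, 1, 1), (0, 1, 0), (1, 0, 1)]

# Summing the per-site tables gives the table a centralized learner would build.
merged = local_counts(site_a, 2) + local_counts(site_b, 2)
central = local_counts(site_a + site_b, 2)
```

Because `merged` and `central` are identical tables, any criterion computed from them picks the same split. The cost of this baseline grows with the number of attributes, which is the communication overhead the abstract says the proposed algorithm reduces.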

Index Terms: Data mining, distributed algorithms, decision trees, classification.
Amir Bar-Or, Daniel Keren, Assaf Schuster, Ran Wolff, "Hierarchical Decision Tree Induction in Distributed Genomic Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1138-1151, Aug. 2005, doi:10.1109/TKDE.2005.129