Issue No. 06, June 2008 (vol. 20), pp. 736-751
We formulate a new data mining problem called storytelling as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next-move operators on search branches to the latter. This approach is practical and effective for mining large datasets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, an exploration of connections between gene sets in a bioinformatics dataset, and a study relating publications in the PubMed index of abstracts.
Data mining, Mining methods and algorithms, Retrieval models, Graph and tree search strategies
Deept Kumar, Naren Ramakrishnan, Richard F. Helm, Malcolm Potts, "Algorithms for Storytelling", IEEE Transactions on Knowledge & Data Engineering, vol. 20, no. 6, pp. 736-751, June 2008, doi:10.1109/TKDE.2008.32
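To make the idea of a "chain of approximate redescriptions" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that two named object sets count as approximate redescriptions of each other when their Jaccard similarity is at least a threshold `theta`, and it replaces CARTwheels with a brute-force scan over all sets to generate the next moves. The names `jaccard`, `find_story`, `theta`, and the toy data are hypothetical and introduced only for illustration. The heuristic used here (0 at the goal, 1 if the goal is directly reachable, 2 otherwise) is a simple admissible choice for unit step costs, not the one described in the paper.

```python
import heapq
from itertools import count


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_story(descriptors, start, goal, theta):
    """A* search for a shortest chain of set names from start to goal.

    descriptors: dict mapping a name to a frozenset of objects (assumed input format).
    Consecutive sets in the chain must have Jaccard similarity >= theta.
    Returns the list of names on the chain, or None if no chain exists.
    """
    def h(name):
        # Admissible lower bound on the remaining number of steps.
        if name == goal:
            return 0
        return 1 if jaccard(descriptors[name], descriptors[goal]) >= theta else 2

    tie = count()  # tie-breaker so the heap never compares paths
    frontier = [(h(start), next(tie), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, _, cost, name, path = heapq.heappop(frontier)
        if name == goal:
            return path
        if cost > best_cost.get(name, float("inf")):
            continue  # stale queue entry
        # Brute-force move generation (stand-in for a redescription miner).
        for other, objs in descriptors.items():
            if other == name or jaccard(descriptors[name], objs) < theta:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(other, float("inf")):
                best_cost[other] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + h(other), next(tie), new_cost, other, path + [other]),
                )
    return None


# Toy data: A and D are disjoint, but overlapping intermediate sets link them.
toy_sets = {
    "A": frozenset({1, 2, 3}),
    "B": frozenset({2, 3, 4}),
    "C": frozenset({3, 4, 5}),
    "D": frozenset({4, 5, 6}),
    "E": frozenset({7, 8, 9}),
}
story = find_story(toy_sets, "A", "D", theta=0.5)
no_story = find_story(toy_sets, "A", "E", theta=0.5)
```

In this toy run, `story` is the chain A, B, C, D, where each consecutive pair shares two of four objects, and `no_story` is `None` because E overlaps with nothing. The brute-force neighbor scan is quadratic in the number of sets, which is the cost a dedicated move generator is meant to avoid on large datasets.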