Issue No.05 - May (2012 vol.24)

pp: 823-839

Arindam Banerjee, University of Minnesota, Minneapolis

Varun Chandola, Oak Ridge National Laboratory, Oak Ridge

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.235

ABSTRACT

This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete/symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other. The key contribution of this survey is the classification of the existing research into three distinct categories, based on the problem formulation that they are trying to solve. These problem formulations are: 1) identifying anomalous sequences with respect to a database of normal sequences; 2) identifying an anomalous subsequence within a long sequence; and 3) identifying a pattern in a sequence whose frequency of occurrence is anomalous. We show how these problem formulations are characteristically distinct from one another and discuss their relevance in various application domains. We review techniques from many disparate and disconnected application domains that address each of these formulations. Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique. This approach shows how different techniques within a category relate to or differ from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection. We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation, thereby providing several novel adaptations to solve the different problem formulations. We also highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection.
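To make the first problem formulation concrete, the following is a minimal illustrative sketch, not a technique taken from the survey itself. It assumes a simple window-based scheme: collect every length-`k` window that occurs in the normal sequences, then score a test sequence by the fraction of its windows that were never seen. The function names `windows` and `sequence_score` and the default `k=3` are hypothetical choices made for this example.

```python
def windows(seq, k):
    """Return all contiguous length-k windows of seq as tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def sequence_score(test_seq, normal_seqs, k=3):
    """Anomaly score in [0, 1]: fraction of test windows absent from the normal data.

    Illustrative assumption: a window counts as normal if it appears
    at least once anywhere in the database of normal sequences.
    """
    # Build the set of windows observed in the normal sequences.
    seen = {w for s in normal_seqs for w in windows(s, k)}
    test_windows = windows(test_seq, k)
    if not test_windows:
        # Sequence shorter than k: nothing to compare, treat as not anomalous.
        return 0.0
    unseen = sum(1 for w in test_windows if w not in seen)
    return unseen / len(test_windows)


normal = ["abcabcabc", "bcabcabca"]
print(sequence_score("abcabc", normal))  # every window was seen in the normal data
print(sequence_score("abcxbc", normal))  # the symbol 'x' produces unseen windows
```

A higher score means the test sequence shares fewer short patterns with the normal database. The other two formulations (an anomalous subsequence within one long sequence, and a pattern with anomalous frequency) would need different scoring, which this sketch does not attempt.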

INDEX TERMS

Discrete sequences, anomaly detection.

CITATION

Arindam Banerjee, Varun Chandola, "Anomaly Detection for Discrete Sequences: A Survey," *IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 5, pp. 823-839, May 2012, doi:10.1109/TKDE.2010.235