Issue No. 05 - September/October 2010 (vol. 36)
pp: 688-703
Xiangyu Zhang, Purdue University, West Lafayette
Zhiqiang Lin, Purdue University, West Lafayette
The syntactic structure of program input is essential for a wide range of applications, such as test case generation, software debugging, and network security. However, this information is often unavailable (e.g., most malware programs communicate over secret protocols) or not directly usable by machines (e.g., many programs specify their inputs in plain text or other ad hoc formats). Furthermore, many programs claim to accept inputs in a published format, but their implementations actually support only a subset or a variant of it. Based on the observations that input structure is manifested by the way input symbols are used during execution, and that most programs take input with top-down or bottom-up grammars, we devise two dynamic analyses, one for each grammar category. Our evaluation on a set of real-world programs shows that our technique is able to precisely reverse engineer input syntactic structure from execution. We apply our technique to hierarchical delta debugging (HDD) and network protocol reverse engineering. Our technique enables the complete automation of HDD, in which programmers were originally required to provide input grammars, and improves the runtime performance of HDD. Our client study on network protocol reverse engineering also shows that our technique supersedes existing techniques.
Input syntactic structure, reverse engineering, control dependence, grammar inference, delta debugging, top-down grammar, bottom-up grammar.
Xiangyu Zhang, Zhiqiang Lin, "Reverse Engineering Input Syntactic Structure from Program Execution and Its Applications", IEEE Transactions on Software Engineering, vol. 36, no. 5, pp. 688-703, September/October 2010, doi:10.1109/TSE.2009.54