2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops (2013)
Philadelphia, PA USA
July 8, 2013 to July 11, 2013
Joshua Saxe, Invincea Inc., Fairfax, VA, USA
David Mentis, Invincea Inc., Fairfax, VA, USA
Christopher Greamo, Invincea Inc., Fairfax, VA, USA
The exponential growth of unique malware binary artifacts has led researchers to explore automated techniques for characterizing the capabilities of unknown malware binaries. Thus far, automatic malware analysis systems have relied on labeled training data and analyst-defined rules to identify malware samples' software features and functional categories. Such approaches require substantial expert analyst effort to maintain as malware authors change programming languages, APIs, malicious tactics, and operating system targets. In this paper we present preliminary results demonstrating the viability of a new research direction for malware capability identification that addresses these issues: mining web technical documentation to automatically identify malware capabilities. This approach does not require expert generation of rules or training labels, and it automatically stays up to date with the latest software engineering trends. We make two contributions aimed at demonstrating the value of this research direction. First, using a corpus of 6 million web technical postings from the programming question-and-answer website StackOverflow.com, we show that symbols found in a corpus of malicious executable files, such as registry keys, file names, and API call names, also occur frequently in the StackOverflow data. This suggests that applying natural language processing to the StackOverflow posts (and other technical documents) may help us automatically generate characterizations of the technical symbols, and thereby the capabilities, found in malware. Second, we show that by analyzing function call symbol co-occurrence within StackOverflow posts, together with the semantic tags associated with these posts, we can create function relationship graphs over the symbols that show promise in helping to identify malware software capabilities.
We argue that these early findings demonstrate the promise of an approach to automating malware capability identification that is based on web technical documents.
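The co-occurrence idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_symbol_graph`, the tokenizing regular expression, and the sample posts and symbol set below are all hypothetical, chosen only to show how symbols that appear in the same post could be linked into a graph and associated with that post's tags.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stand-ins for (post text, post tags) pairs; not real StackOverflow data.
POSTS = [
    ("Use OpenProcess then WriteProcessMemory and CreateRemoteThread to inject a DLL",
     ["winapi", "dll-injection"]),
    ("Calling WriteProcessMemory after OpenProcess fails with access denied",
     ["winapi"]),
    ("How do I use RegSetValueEx to write a registry key?",
     ["registry"]),
]

# Hypothetical symbols, standing in for symbols extracted from malware binaries.
SYMBOLS = {"OpenProcess", "WriteProcessMemory", "CreateRemoteThread", "RegSetValueEx"}


def build_symbol_graph(posts, symbols):
    """Count symbol pairs sharing a post, and the tags seen alongside each symbol."""
    edge_counts = Counter()            # (symbol_a, symbol_b) -> number of shared posts
    tag_counts = defaultdict(Counter)  # symbol -> tag -> number of posts
    for text, tags in posts:
        # Assumed tokenization: identifier-like words only.
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))
        present = sorted(tokens & symbols)
        for symbol in present:
            tag_counts[symbol].update(tags)
        # Sorting above keeps each undirected edge under a single key.
        for a, b in combinations(present, 2):
            edge_counts[(a, b)] += 1
    return edge_counts, tag_counts


edges, tags = build_symbol_graph(POSTS, SYMBOLS)
for (a, b), count in edges.most_common():
    print(f"{a} -- {b}: {count}")
```

In this toy input, symbols that are discussed together end up connected by weighted edges, and each symbol carries counts of the tags on the posts that mention it. A real pipeline would additionally need symbol extraction from binaries and some normalization of edge weights, neither of which is specified in the abstract.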
Malware, Webcams, Data mining, Programming, Internet, Clustering algorithms
J. Saxe, D. Mentis and C. Greamo, "Mining Web Technical Discussions to Identify Malware Capabilities," 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops (ICDCSW), Philadelphia, PA USA, 2014, pp. 1-5.