Signal Processing for Malware Analysis
The goal of this project is to explore signal and image processing methods for analyzing malware. Malware binaries are visualized as grayscale images, based on the observation that, for many malware families, images belonging to the same family appear very similar in layout and texture.
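As a minimal sketch of the visualization step, the snippet below maps each byte of a binary to one grayscale pixel and wraps the byte stream into rows of a fixed width. The function name `bytes_to_image`, the default width of 256, and the zero padding of the last row are illustrative assumptions, not details taken from the project.

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape a byte string into a 2-D uint8 array (one byte = one pixel).

    The width is an assumed parameter; the last row is zero-padded.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = -(-len(pixels) // width)  # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(pixels)] = pixels
    return padded.reshape(height, width)

# Synthetic bytes stand in for a real binary here.
demo = bytes(range(256)) * 3 + b"\x7f" * 10
image = bytes_to_image(demo, width=64)
print(image.shape)
```

For a real file, the bytes could be read with `open(path, "rb").read()` before calling the function; the resulting array can be saved or displayed with any image library.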
Visual Similarity in Malware Variants
Most new malware samples are modifications of existing malware, so variants share almost the same content. Here are some examples of malware variants from different families:
Variants of Agent.FYI Family
Variants of Dialplatform Family
There are two main observations:
- Malware variants within a family are visually similar
- Malware variants from different families are visually dissimilar
We exploit these visual similarities and dissimilarities and propose image-similarity-based features for malware classification, detection, retrieval, and related problems.
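To illustrate how image similarity can drive classification, here is a small sketch that describes each image by the mean intensity of an 8 x 8 grid of blocks and assigns a query to the family of its nearest training image. The block-mean descriptor, the helper names, and the synthetic families "A" and "B" are assumptions chosen for brevity; they are stand-ins and not the features or data used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_mean_descriptor(image: np.ndarray, grid: int = 8) -> np.ndarray:
    """Coarse layout descriptor: mean intensity of each cell in a grid x grid tiling."""
    rows = np.array_split(np.arange(image.shape[0]), grid)
    cols = np.array_split(np.arange(image.shape[1]), grid)
    feat = np.array([[image[np.ix_(r, c)].mean() if r.size and c.size else 0.0
                      for c in cols] for r in rows])
    return feat.ravel() / 255.0

def nearest_family(query, train_feats, train_labels):
    """1-nearest-neighbor classification under Euclidean distance."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    return train_labels[int(np.argmin(dists))]

def synthetic_variant(family: str) -> np.ndarray:
    """Hypothetical 'variant': a fixed family layout plus pixel noise."""
    base = np.zeros((64, 64))
    if family == "A":
        base[:32, :] = 200.0  # bright top half
    else:
        base[:, :32] = 200.0  # bright left half
    noisy = base + rng.normal(0, 10, base.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

train_labels = ["A"] * 5 + ["B"] * 5
train_feats = np.stack([block_mean_descriptor(synthetic_variant(f))
                        for f in train_labels])
query = block_mean_descriptor(synthetic_variant("B"))
predicted = nearest_family(query, train_feats, train_labels)
print(predicted)
```

Any fixed-length texture or layout descriptor could replace `block_mean_descriptor` without changing the nearest-neighbor step, which is what makes the same idea usable for retrieval as well as classification.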
We have released the Malimg Dataset for download. Please cite our paper, Malware Images: Visualization and Automatic Classification, when using this dataset. Please follow this blog for a step-by-step tutorial on using this dataset and obtaining the results in our paper:
Malware Images Album
SARVAM: Search And RetrieVAl of Malware
SARVAM is a demonstration of a simple and effective technique for visualizing and classifying malware using image processing. It is a content-based malware image search and retrieval system, which we made publicly accessible so that researchers and security professionals can upload a malware query and find its best match.
Currently, SARVAM has a database of more than 7 million malware samples. We have received more than 250,000 malware submissions since its launch in 2012.
Block Schematic of SARVAM
SATTVA: SpArsiTy inspired classificaTion of malware VAriants
In the second part, we consider a malware binary as a one-dimensional digital signal rather than a two-dimensional grayscale image. Although images provide better visualization and image similarity features have been well studied in the literature, the choice of column width is somewhat arbitrary. Here is the signal representation of a malware binary:
We then model an unknown malware sample as a sparse linear combination of malware samples from the dataset. Because malware binaries vary in size and can be large, the signal dimensionality can be very high. We therefore apply random projections to reduce the dimensionality of the binaries and then perform sparse modeling:
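The two steps above can be sketched as follows, under several assumptions: a Gaussian random projection matrix, a hand-written Orthogonal Matching Pursuit as the sparse solver (a stand-in, since the solver is not specified here), and classification by the smallest per-family reconstruction residual. All names, sizes, and the synthetic families "A" and "B" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_signal(data: bytes, length: int) -> np.ndarray:
    """Bytes as a 1-D signal in [0, 1], zero-padded or truncated to `length`."""
    x = np.zeros(length)
    raw = np.frombuffer(data[:length], dtype=np.uint8)
    x[: raw.size] = raw / 255.0
    return x

def omp(D: np.ndarray, y: np.ndarray, n_nonzero: int) -> np.ndarray:
    """Greedy Orthogonal Matching Pursuit: approximate y with few columns of D."""
    residual, support = y.copy(), []
    coef = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx in support:
            break
        support.append(idx)
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef

def classify(D, labels, y, n_nonzero=3):
    """Pick the family whose atoms alone reconstruct y with the smallest residual."""
    coef = omp(D, y, n_nonzero)
    labels = np.asarray(labels)
    residuals = {}
    for fam in np.unique(labels):
        mask = labels == fam
        residuals[str(fam)] = np.linalg.norm(y - D[:, mask] @ coef[mask])
    return min(residuals, key=residuals.get)

def mutate(base: bytes, fraction: float) -> bytes:
    """Hypothetical 'variant': overwrite a random fraction of bytes."""
    arr = np.frombuffer(base, dtype=np.uint8).copy()
    idx = rng.choice(arr.size, size=int(fraction * arr.size), replace=False)
    arr[idx] = rng.integers(0, 256, size=idx.size, dtype=np.uint8)
    return arr.tobytes()

N, M = 4096, 128  # assumed original and projected dimensions
bases = {"A": rng.bytes(N), "B": rng.bytes(N)}
labels = ["A"] * 4 + ["B"] * 4
train = np.stack([to_signal(mutate(bases[f], 0.02), N) for f in labels], axis=1)

P = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))  # random projection
D = P @ train
D /= np.linalg.norm(D, axis=0)  # unit-norm dictionary atoms

y = P @ to_signal(mutate(bases["B"], 0.02), N)
predicted = classify(D, labels, y)
print(predicted)
```

Projecting both the dictionary and the query with the same matrix `P` keeps the sparse coding problem small regardless of the original binary size; the sparsity level `n_nonzero` is a tuning choice in this sketch.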
Supervised Classification with k-fold Cross Validation on a Multi Family Malware Dataset: A step-by-step tutorial (with code snippets) on how to perform malware classification on the Malimg Dataset
Finding Visually Similar Malware among Millions of Malware: A gentle tutorial on how SARVAM works
This research has been supported by ONR grants #N00014-11-10111 and #N00014-14-1-0027.