Signal Processing for Malware Analysis

Objective

This project explores signal and image processing methods for analyzing malware. Malware binaries are visualized as grayscale images, based on the observation that, for many malware families, images of binaries from the same family appear very similar in layout and texture.
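As a rough illustration of the visualization step, the sketch below reads a binary as a sequence of 8-bit values and reshapes it into a 2-D array with one byte per pixel. The function name `bytes_to_image`, the fixed `width`, and the choice to drop a trailing partial row are assumptions made for this example, not details taken from the project.

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape raw bytes into a 2-D uint8 array, one byte per pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = pixels.size // width  # assumption: drop the trailing partial row
    if height == 0:
        raise ValueError("input is shorter than one image row")
    return pixels[: height * width].reshape(height, width)

# Synthetic stand-in for a binary: 1000 bytes cycling through 0..255.
sample = bytes(i % 256 for i in range(1000))
image = bytes_to_image(sample, width=64)
print(image.shape)  # (15, 64)
```

The resulting array can be saved or displayed with any image library. The column width directly changes the layout of the image, so it has to be held consistent when comparing samples.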


 

Visual Similarity in Malware Variants

Most new malware samples are modifications of existing malware, so variants of the same family share almost the same content. Here are some examples of malware variants from different families:



Variants of Agent.FYI Family



Variants of Dialplatform Family

There are two main observations:

  • Malware variants within a family are visually similar
  • Malware variants from different families are visually dissimilar

We exploit these visual similarities and dissimilarities and propose image-similarity-based features for malware classification, detection, retrieval, and related problems.
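To make the idea of image-similarity features concrete, here is a minimal sketch that describes each image by a small feature vector and labels a query with the family of its nearest neighbor. The block-average descriptor is a deliberately simple stand-in chosen for this example, not the feature used in the project, and the two synthetic "families" and the names `block_mean_descriptor` and `nearest_family` are hypothetical.

```python
import numpy as np

def block_mean_descriptor(image: np.ndarray, grid: int = 4) -> np.ndarray:
    """Mean intensity over a grid x grid partition of the image, flattened."""
    if min(image.shape) < grid:
        raise ValueError("image is smaller than the descriptor grid")
    rows = np.array_split(np.arange(image.shape[0]), grid)
    cols = np.array_split(np.arange(image.shape[1]), grid)
    feats = [image[np.ix_(r, c)].mean() for r in rows for c in cols]
    return np.asarray(feats, dtype=np.float64)

def nearest_family(query: np.ndarray, gallery: np.ndarray, labels: list) -> str:
    """Return the label of the gallery descriptor closest to the query."""
    dists = np.linalg.norm(gallery - query, axis=1)
    return labels[int(np.argmin(dists))]

rng = np.random.default_rng(0)

def synthetic_variant(bright_top: bool) -> np.ndarray:
    """Toy 32x32 'malware image': noise plus one bright half."""
    img = rng.integers(0, 40, size=(32, 32))
    if bright_top:
        img[:16] += 200
    else:
        img[16:] += 200
    return img.astype(np.uint8)

# Two labeled variants per toy family, then one unseen variant as the query.
gallery = np.stack([block_mean_descriptor(synthetic_variant(top))
                    for top in (True, True, False, False)])
labels = ["family_A", "family_A", "family_B", "family_B"]
query = block_mean_descriptor(synthetic_variant(True))
print(nearest_family(query, gallery, labels))
```

Any descriptor that captures layout and texture could be swapped in for `block_mean_descriptor`; the nearest-neighbor step stays the same.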

Datasets

We have released the Malimg Dataset for download. Please cite our paper, Malware Images: Visualization and Automatic Classification, when using this dataset. Please follow this blog for a step-by-step tutorial on using this dataset and obtaining the results in our paper:


Malware Images Album

We have created an album of "Malware Images" for the Malimg Dataset.

SARVAM: Search And RetrieVAl of Malware

SARVAM is a demonstration of a simple and effective technique for visualizing and classifying malware using image processing. It is a content-based malware image search and retrieval system, which we made publicly accessible so that researchers and security professionals can upload a malware query and find its best match.

Currently, SARVAM has a database of more than 7 million malware samples. We have received more than 250,000 malware submissions since its launch in 2012.


Block Schematic of SARVAM

 

SATTVA: SpArsiTy inspired classificaTion of malware VAriants

In the second part, we consider a malware binary as a one-dimensional digital signal rather than a two-dimensional grayscale image. Although images provide better visualization, and image similarity features have been studied extensively in the literature, the choice of column width for the image is somewhat arbitrary. Here is the signal representation of a malware binary:


We then model an unknown malware sample as a sparse linear combination of malware samples from the dataset. Since malware binaries vary in size, the dimensionality of their signal representations can be very high. We therefore apply Random Projections to reduce the dimensionality of the binaries and then perform sparse modeling:
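The pipeline described above can be sketched roughly as follows, on synthetic data. Several choices here are assumptions made for illustration rather than details from the project: signals are centered and zero-padded to a common length `N`, the projection is a Gaussian random matrix, the sparse coefficients are found with a small Orthogonal Matching Pursuit routine, and the family is chosen by the smallest per-family reconstruction residual. All function names are hypothetical.

```python
import numpy as np

def to_signal(data: bytes, length: int) -> np.ndarray:
    """Bytes -> centered float signal, zero-padded or truncated to `length`."""
    x = np.frombuffer(data, dtype=np.uint8).astype(np.float64) - 127.5
    out = np.zeros(length)
    n = min(length, x.size)
    out[:n] = x[:n]
    return out

def omp(D: np.ndarray, y: np.ndarray, n_nonzero: int) -> np.ndarray:
    """Greedy Orthogonal Matching Pursuit: approximate y with few columns of D."""
    coef = np.zeros(D.shape[1])
    support, residual, sol = [], y.copy(), None
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if k in support:
            break
        support.append(k)
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef

def classify(D: np.ndarray, labels: list, y: np.ndarray, n_nonzero: int = 3) -> str:
    """Pick the family whose atoms alone best reconstruct y."""
    coef = omp(D, y, n_nonzero)
    residuals = {}
    for fam in set(labels):
        mask = np.array([lab == fam for lab in labels])
        residuals[fam] = np.linalg.norm(y - D[:, mask] @ coef[mask])
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(0)
N, m = 4096, 64  # padded signal length and projected dimension

def make_variant(base: bytes, n_changes: int = 200) -> bytes:
    """Toy variant: overwrite a few randomly chosen bytes of the base."""
    buf = bytearray(base)
    for i in rng.choice(len(buf), size=n_changes, replace=False):
        buf[int(i)] = int(rng.integers(0, 256))
    return bytes(buf)

# Two toy families of different sizes, three training variants each.
bases = {"family_A": rng.bytes(3000), "family_B": rng.bytes(4000)}
labels, train = [], []
for fam, base in bases.items():
    for _ in range(3):
        labels.append(fam)
        train.append(to_signal(make_variant(base), N))

R = rng.standard_normal((m, N)) / np.sqrt(m)  # random projection matrix
D = R @ np.column_stack(train)                # projected dictionary, shape (m, 6)
D /= np.linalg.norm(D, axis=0)                # unit-norm atoms

y = R @ to_signal(make_variant(bases["family_B"]), N)  # projected query
predicted = classify(D, labels, y)
print(predicted)
```

Because the random projection approximately preserves inner products, the query stays far more correlated with atoms from its own family than with the others, which is what lets the sparse solver select them in the reduced space.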


 

Blogs

Supervised Classification with k-fold Cross Validation on a Multi Family Malware Dataset: A step by step tutorial (with code snippets) on how to do malware classification on the Malimg Dataset 

Finding Visually Similar Malware among Millions of Malware: A gentle tutorial on how SARVAM works

 

Acknowledgements

This research has been supported by ONR grants #N00014-11-10111 and #N00014-14-1-0027.