D. V. Komashinskiy, I. V. Kotenko, A. V. Shorov, “Approach to detect malware based on positionally dependent information”, Tr. SPIIRAN, 2009, Issue 10, Pages 132

Approach to detect malware based on positionally dependent information

D. V. Komashinskiy, I. V. Kotenko, A. V. Shorov

St. Petersburg Institute for Informatics and Automation of RAS

Abstract: This paper examines an approach to detecting malware on the basis of positionally dependent information. Counteracting malicious software remains a relevant problem, even though increasingly effective detection mechanisms are now available. The paper considers the application of data mining methods to this problem. The novelty of the approach lies in its focus on processing positionally dependent static information, which supports the construction of particular elements of an effective model for detecting malicious executables.
The main idea of the proposed approach is to exploit the features of a file format that can carry malicious code. Knowing the nature and structure of the information contained in a potentially dangerous object reduces the amount of data that must be analyzed.
The base classifiers were built with the Decision Table, C4.5, RandomForest, and Naive Bayes methods. “Position-Value” pairs of variables were selected as the features for learning.
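As a rough illustration of what a “Position-Value” feature could look like, the sketch below maps each relative position in a byte window to the value found there. The abstract does not specify which positions or reference points the authors used, so the function name `position_value_features`, the `pos_N` naming, the window parameters, and the `-1` marker for positions outside the data are all assumptions made for this example.

```python
from typing import Dict


def position_value_features(data: bytes, start: int, count: int) -> Dict[str, int]:
    """Return {"pos_0": value, "pos_1": value, ...} for `count` bytes from `start`.

    Positions that fall outside `data` are marked with -1 (an assumed
    convention for this sketch, not something taken from the paper).
    """
    features = {}
    for pos in range(count):
        offset = start + pos
        in_range = 0 <= offset < len(data)
        features[f"pos_{pos}"] = data[offset] if in_range else -1
    return features


# Hypothetical byte sequence standing in for part of an executable.
sample = bytes([0x55, 0x8B, 0xEC, 0x6A])
feats = position_value_features(sample, start=1, count=4)
```

Each key identifies a position relative to the chosen starting point and each value is the byte observed there, so two files can be compared feature by feature by a classifier.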
Two collections of executable files were selected for the experiments, containing 5854 malicious files and 1656 benign files.
To extract the feature sets, we developed a utility for parsing files in the PE32 format. It is oriented toward accessing file content by relative virtual address. The utility generates output files in the Attribute-Relation File Format (ARFF) that include the set of all possible features. The experiments were carried out with the Weka 3.6.1 software package.
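The two mechanisms mentioned here, access by relative virtual address (RVA) and ARFF output, can be sketched as follows. This is not the authors' utility: the `Section` record, `rva_to_offset`, and `to_arff` are hypothetical names, the section values are invented, and treating features as numeric attributes with a `{malware,benign}` class is an assumption. The sketch only shows the general technique of translating an RVA to a file offset through the section table and writing rows in ARFF syntax.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Section:
    virtual_address: int  # RVA at which the section starts in memory
    virtual_size: int     # size of the section in memory
    raw_offset: int       # file offset of the section's data
    raw_size: int         # size of the section's data in the file


def rva_to_offset(rva: int, sections: List[Section]) -> Optional[int]:
    """Translate an RVA to a file offset, or None if no file bytes back it."""
    for s in sections:
        delta = rva - s.virtual_address
        if 0 <= delta < s.virtual_size:
            # Past the raw data, the bytes exist only in memory, not on disk.
            return s.raw_offset + delta if delta < s.raw_size else None
    return None


def to_arff(relation: str, rows: Sequence[Tuple[Sequence[int], str]],
            n_features: int) -> str:
    """Render (values, label) rows as ARFF text with numeric attributes."""
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE pos_{i} NUMERIC" for i in range(n_features)]
    lines += ["@ATTRIBUTE class {malware,benign}", "", "@DATA"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + f",{label}")
    return "\n".join(lines)


# Invented section layout, for illustration only.
sections = [Section(virtual_address=0x1000, virtual_size=0x2000,
                    raw_offset=0x400, raw_size=0x1E00)]
offset = rva_to_offset(0x1010, sections)
arff_text = to_arff("pe_features", [([139, 236], "malware")], n_features=2)
```

Addressing content by RVA rather than by raw file offset keeps feature positions comparable across files whose sections are laid out differently on disk.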
The experiments showed that positionally dependent features are quite effective with data mining methods based on rule generation and decision tree construction. The most effective method was RandomForest.
The proposed approach does not provide absolute accuracy in detecting malware, but it may be effective at certain stages of deciding how to process an object further, and in the construction of malware detectors. One example is the task of automating the detection and identification of the obfuscation or protection tools applied to executable files.

Keywords: information security, malware, detection of malicious software, data mining, methods of static analysis.

UDC: 004.49