
Proceedings of ISP RAS, 2022 Volume 34, Issue 5, Pages 63–76 (Mi tisp721)

This article is cited in 1 paper

A method to evaluate program similarity using machine learning methods

P. D. Borisov (a), Yu. V. Kosolapov (b)

(a) State Scientific Organization: Research Institute "Spetsvuzavtomatika", Rostov-on-Don
(b) Southern Federal University

Abstract: This paper considers the problem of constructing an algorithm for comparing two executable files. The algorithm builds a vector of similarity features for a given pair of programs; machine learning methods then use this vector to decide whether the programs are similar or dissimilar. Similarity features are built by algorithms of two types: universal and specialized. Universal algorithms do not take the format of the input data into account (values of fuzzy hash functions and of compression ratios). Specialized algorithms work with executable files and analyze machine code using disassemblers. In total, 15 features were built: 9 of the first type and 6 of the second. Seven binary classifiers were trained and tested on a training set of similar and dissimilar program pairs built from coreutils programs. In the experiments, models based on random forest and k-nearest neighbors showed high accuracy. The experiments also showed that combining features of both types can improve classification accuracy.
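
The abstract names compression ratios among the universal features but does not give a formula. As a minimal sketch, assuming the normalized compression distance (NCD) as one common way to turn compressed sizes into a format-agnostic similarity value, a few such features could be computed as follows. The function names `ncd` and `universal_features` and the choice of three standard-library compressors are illustrative assumptions, not the authors' feature set.

```python
import bz2
import lzma
import zlib


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share little.
    """
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # size of the compressed concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def universal_features(x: bytes, y: bytes) -> list[float]:
    """One NCD value per compressor (assumed set: zlib, bz2, lzma).

    The inputs are treated as opaque bytes, so no knowledge of the
    executable format is used.
    """
    compressors = (zlib.compress, bz2.compress, lzma.compress)
    return [ncd(x, y, c) for c in compressors]
```

In use, `x` and `y` would be the raw contents of the two executables (for example, read with `pathlib.Path(...).read_bytes()`), and the resulting list would form part of the feature vector passed to a binary classifier.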

Keywords: obfuscation, program similarity, machine learning

DOI: 10.15514/ISPRAS-2022-34(5)-4


