Program Systems: Theory and Applications, 2021 Volume 12, Issue 2, Pages 73–103 (Mi ps383)

Hardware, software and distributed supercomputer systems

Monitoring applications on the ZHORES cluster at Skoltech

I. E. Zakharov, O. A. Panarin, S. G. Rykovanov, R. R. Zagidullin, A. K. Malyutin, Yu. N. Shkandybin, A. E. Ermekova

Skolkovo Institute of Science and Technology

Abstract: Standard monitoring tools for cluster computing systems assess the performance of the system as a whole but cannot analyze the performance of individual applications. A monitoring system that measures the resources requested by each application separately was developed at Skoltech for the high-performance Zhores cluster. It collects both the usual CPU and GPU utilization metrics and the CPU and GPU event counters, which allow a more detailed analysis of the resources requested by an application. Service programs deployed on each cluster node send measurements to a common time series database at one-second intervals. These data are analyzed offline to isolate the characteristics of each application's use of computing resources. This should reveal suboptimal applications, allow fine-tuning of the cluster's functions, and improve the HPC system overall.
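
The data flow described in the abstract (per-node services sampling once per second, a shared time series store, offline per-application analysis) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` record, the `collect` and `summarize_by_job` functions, the metric names, and the `read_metrics` callback are all hypothetical, and a plain Python list stands in for the time series database.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One time series point, tagged with the node and the job that produced it."""
    timestamp: int   # seconds since the start of the run (assumed unit)
    node: str
    job_id: str
    metric: str      # hypothetical metric name, e.g. "cpu_util"
    value: float


def collect(node: str, job_id: str,
            read_metrics: Callable[[int], Dict[str, float]],
            start: int, seconds: int) -> List[Sample]:
    """Simulate a node-side service taking one reading per metric per second.

    read_metrics is a hypothetical callback returning {metric: value} for a
    given second; a real service would read utilization and event counters.
    """
    samples = []
    for t in range(start, start + seconds):
        for metric, value in read_metrics(t).items():
            samples.append(Sample(t, node, job_id, metric, value))
    return samples


def summarize_by_job(samples: List[Sample]) -> Dict[Tuple[str, str], float]:
    """Offline step: average each metric separately for each job."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[(s.job_id, s.metric)].append(s.value)
    return {key: mean(values) for key, values in grouped.items()}


# A list plays the role of the common time series database.
store: List[Sample] = []
store += collect("node01", "job-1", lambda t: {"cpu_util": 50.0 + t}, 0, 3)
store += collect("node02", "job-2", lambda t: {"gpu_util": 10.0}, 0, 3)
summary = summarize_by_job(store)
```

Tagging every point with a job identifier at collection time is what makes the later per-application breakdown possible; system-wide tools that record only per-node totals cannot recover it afterwards.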

Key words and phrases: cluster, high performance computing, application monitoring, CPU/GPU event counters, time series database.

UDC: 004.451
BBK: 32.972.11

MSC: Primary 65Y05; Secondary 68M20, 68M99

Received: 26.01.2021
29.03.2021
Accepted: 05.06.2021

DOI: 10.25209/2079-3316-2021-12-2-73-103


