Abstract:
Standard monitoring tools for cluster computing systems make it possible to
assess the performance of the whole system, but not to analyze the
performance of individual applications. A monitoring system for measuring the
resources requested by each application separately was developed at Skoltech for the
high-performance Zhores cluster. The monitoring system collects both the usual
metrics of CPU and GPU utilization and the CPU and GPU event counters,
which allow a more detailed analysis of the resources requested by the application.
Service programs deployed on each node in the cluster send measurements to a
common time series database at one-second intervals. These data are analyzed
offline to isolate the characteristics associated with the use of computing resources
by each application. This should reveal suboptimal applications, allow fine-tuning
of the cluster functions, and improve the HPC system overall.
Key words and phrases: cluster, high performance computing, application monitoring, CPU/GPU event counters, time series database.
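
The offline step described in the abstract, isolating per-application characteristics from one-second samples, might be sketched as follows. This is only an illustration and not the Zhores implementation: the record layout, the field names (`time`, `node`, `job_id`, `metrics`), the metric names and the function `per_application_means` are all hypothetical, and the sketch assumes that each sample has already been attributed to a job.

```python
from collections import defaultdict
from statistics import mean


def per_application_means(samples):
    """Group one-second samples by job ID and average each metric.

    `samples` is an iterable of dicts with a hypothetical layout:
    {"time": int, "node": str, "job_id": str, "metrics": {name: value}}.
    """
    by_job = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        for name, value in sample["metrics"].items():
            by_job[sample["job_id"]][name].append(value)
    # Reduce every per-job list of values to its mean.
    return {
        job: {name: mean(values) for name, values in metrics.items()}
        for job, metrics in by_job.items()
    }


# Made-up samples standing in for rows read from the time series database.
samples = [
    {"time": 0, "node": "n01", "job_id": "job-a",
     "metrics": {"cpu_util": 90.0, "gpu_util": 10.0}},
    {"time": 1, "node": "n01", "job_id": "job-a",
     "metrics": {"cpu_util": 70.0, "gpu_util": 30.0}},
    {"time": 0, "node": "n02", "job_id": "job-b",
     "metrics": {"cpu_util": 20.0, "gpu_util": 80.0}},
]
summary = per_application_means(samples)
```

A per-job summary of this kind is one simple way to flag applications that request resources they do not use. Event counters could be aggregated the same way by adding them as further entries in `metrics`.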