Abstract:
Most companies have their own IT infrastructure, which consists of complex systems and services. The stability of these systems and services is important, because problems with them lead to lost resources and staff time. It is therefore important to analyze previous IT service outages in order to identify and correct the most critical and vulnerable elements of the infrastructure, those most prone to breakage or failure. The research objective is to develop a new algorithm that improves the stability of a company's IT infrastructure by analyzing and taking into account the statistics of previous service outages. As a result, a new algorithm is proposed that identifies and fixes problems in IT services before they lead to serious consequences and reduces the time needed to find the source of a problem. The algorithm is based on two new metrics, availability and reliability, whose distinctive feature is that they take into account the statistics of previous failures and outages in the system. The architecture of a high-performance software tool that allows real-time monitoring and evaluation of IT service stability metrics is presented. The effectiveness of the proposed algorithm is demonstrated by implementing it in this software tool and observing the growth of the stability indicators (availability and reliability) after a weak link in the IT services is detected and eliminated. Use of the developed algorithm reduced the time during which the company's material and human resources were idle by 25%. The practical significance of the presented algorithm was tested in a large industrial information technology company with more than 10,000 employees. Based on the information obtained with the developed software, recommendations were derived for improving the stability of the company's IT services.
Keywords: metrics, availability, reliability, stability, IT infrastructure, outage, monitoring.