Abstract:
Machine learning methods use tree structures to organize and store data. Each such structure has certain advantages and can improve the quality of a particular algorithm. A tree in which every node has at most two children is called binary; its main advantage is the efficiency with which search and sorting algorithms can be implemented. In this regard, it is important to note that the dendrograms produced by hierarchical agglomerative clustering methods are also binary trees, reflecting the taxonomy of the elements of a data set. Any cluster that is not a singleton can be divided into subclusters, which allows us to form a hierarchical structure in a metric space (a metric tree) with additional properties: for example, the height of the tree can be set automatically by requiring, by definition, that the number of levels at which its nodes are located coincide with the number of options for dividing the sample set into clusters, subclusters, subclusters of subclusters, and so on. This problem can be solved using approximation-estimation tests, whose sensitivity can be varied through a trend coefficient to obtain different clustering options. In the computational experiments, a synthetic set of points on the Euclidean plane was used, and the results of centroid-based clustering were studied. Markov moments of stopping the clustering process were determined using an approximation-estimation test. The results obtained in the numerical modeling were verified by changing the step size of the trend coefficient.
Keywords: metric tree, agglomerative clustering, Markov moment, least squares method.
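As an illustrative sketch of the idea summarized above (not the authors' implementation), the following minimal Python example performs centroid-based agglomerative clustering on a small synthetic set of points in the Euclidean plane. Each merge of two clusters creates one internal node of the binary dendrogram, so a set of n points yields n - 1 merge levels, each corresponding to one option for partitioning the set; all names here (`centroid`, `agglomerate`) are illustrative, and the stopping-moment logic of the paper is not reproduced.

```python
# Minimal sketch of centroid-based agglomerative clustering on 2-D points.
# Assumption: a naive O(n^3) merge loop is enough for illustration.
import math

def centroid(cluster):
    """Arithmetic mean of the points in a cluster."""
    xs, ys = zip(*cluster)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def agglomerate(points):
    """Return the list of merges as (height, merged cluster) pairs.

    Each merge is one internal node of the binary dendrogram; the
    sequence of merges encodes the hierarchy of partitions.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters whose centroids are closest
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((d, merged))
    return merges

# synthetic set: two compact groups on the Euclidean plane
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
merges = agglomerate(pts)
print(len(merges))  # n - 1 = 5 binary merges
```

Cutting the sequence of merges at different heights yields the different clustering options mentioned in the abstract; the paper's approximation-estimation test selects the Markov stopping moment at which to make that cut.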