JOURNALS // Sistemy i Sredstva Informatiki [Systems and Means of Informatics] // Archive

Sistemy i Sredstva Inform., 2008 special issue, Pages 6–15 (Mi ssi151)

On a simulation approach to cluster stability validation

Zeev Barzily, Mati Golani, Zeev Volkovich

Software Engineering Department, ORT Braude College of Engineering

Abstract: In this paper we outline a new approach to the problem of determining the “true number of clusters”. Our method combines the stability and density concentration approaches. In the spirit of density estimation methodology, we consider each cluster as an island of “high” density of items in a sea of “low” density. In addition, following the cluster steadiness concept, we suggest that these islands are “resistant” to random noise. In other words, we assume that adding noise to the attributes of the data elements does not change the cluster structure. A second novelty of our approach is the proposal to measure the similarity between source-data clusters and noisy-data clusters by means of two-sample test statistics, represented by probability metrics (distances). Such a pair of samples appears to be an appropriate basis for determining the true number of clusters. Because of the high resemblance between these samples within the partitions, the similarity is expected to be amplified under the true number of clusters. According to our model, the true number of clusters corresponds to the empirical distance distribution that is most concentrated at zero. Thus, our procedure can be viewed as the construction of an empirical normalized distance distribution, followed by a test of its concentration at zero. This test is carried out by means of the sample mean and the sample first quartile.
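
The procedure outlined in the abstract might be sketched roughly as follows. This is an illustration under several assumptions that the abstract does not confirm: one-dimensional data, plain k-means as the clustering algorithm, Gaussian noise, and the energy distance as the two-sample probability metric. The normalization step mentioned in the abstract is not specified there and is left out, so the summaries below are not directly comparable across different numbers of clusters. All function names (`kmeans_1d`, `energy_distance`, `noisy_distances`, `concentration_summary`) are hypothetical.

```python
import random
import statistics


def kmeans_1d(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of floats; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        # Move each center to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


def energy_distance(xs, ys):
    """Energy distance between two 1-D samples (assumed choice of metric)."""
    def mean_abs(a, b):
        return sum(abs(u - v) for u in a for v in b) / (len(a) * len(b))
    return 2 * mean_abs(xs, ys) - mean_abs(xs, xs) - mean_abs(ys, ys)


def noisy_distances(data, k, noise_sd, n_rep, rng):
    """For each repetition: add noise, cluster source and noisy items together,
    and record the largest within-cluster source-vs-noisy distance."""
    n = len(data)
    dists = []
    for _ in range(n_rep):
        noisy = [x + rng.gauss(0.0, noise_sd) for x in data]
        labels = kmeans_1d(data + noisy, k, rng)
        worst = 0.0
        for j in range(k):
            src = [p for p, lab in zip(data, labels[:n]) if lab == j]
            nsy = [p for p, lab in zip(noisy, labels[n:]) if lab == j]
            if src and nsy:
                worst = max(worst, energy_distance(src, nsy))
        dists.append(worst)
    return dists


def concentration_summary(dists):
    """Sample mean and sample first quartile of the distances."""
    return statistics.mean(dists), statistics.quantiles(dists, n=4)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data with three well-separated groups (illustrative only).
    data = [rng.gauss(c, 0.3) for c in (0.0, 5.0, 10.0) for _ in range(30)]
    for k in range(2, 6):
        mean_d, q1_d = concentration_summary(noisy_distances(data, k, 0.1, 10, rng))
        print(k, mean_d, q1_d)
```

Under the model described in the abstract, smaller values of both summaries indicate a distance distribution more concentrated at zero; a faithful implementation would first normalize the distances as the authors describe in the full paper.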

Language: English


