
Mat. Biolog. Bioinform., 2022 Volume 17, Issue 2, Pages 230–249 (Mi mbb487)


Information and Computer Technologies in Biology and Medicine

Application of Benford's law for quality assessment of preventive screening data

O. A. Starunova, S. G. Rudnev, A. E. Ivanova, V. G. Semenova, V. I. Starodubov

Russian Research Institute of Health, Moscow, Russia

Abstract: Benford's law, an empirical law describing the probability of occurrence of particular first significant digits in many real-life distributions, is used to identify anomalies in various kinds of data. Our aim was to test Benford's law as a tool for assessing the quality of mass preventive screening data, using bioelectrical impedance analysis (BIA) data from Moscow health centers as an example. As shown earlier, such data are characterized by a high level of contamination with artificially generated and falsified records. The compiled 2010–2019 database of BIA measurements contained 1361019 measurement records for examinees aged 5 to 96 years. Application of the expert quality assessment algorithm, which was used as the reference for evaluating the effectiveness of Benford analysis, revealed a high percentage of incorrect data (66.5%), dominated by falsified data. To characterize the degree of compliance of the data with Benford's law, the mean absolute deviations of the frequency distributions of the first and of the first two significant digits from their expected values, together with the chi-squared statistics, were assessed for each health center for the tenth powers of the standardized resistance, reactance, and resistance index values. A significant correlation was observed between the deviation of the data from Benford's law and the percentage of incorrect data as determined by the expert quality assessment algorithm ($\rho_{\mathrm{max}}$ = 0.66 and 0.62 for the mean absolute deviations and the $\chi^2$ statistics, respectively, based on the resistance value and the first significant digit). It is suggested that deviation of the BIA data from Benford's law serves as a sufficient, but not a necessary, condition for their contamination. For those health centers in which most of the incorrect data consisted of multiple measurements of the same person presented as measurements of different persons, the data were in good agreement with Benford's law. If the incorrect data were instead dominated by measurements of the calibration block, software emulations of BIA measurements, and outliers, then Benford's law made it possible to rank health centers effectively by the level of data authenticity.
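
The two first-digit deviation measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard form of Benford's law, in which the first significant digit $d$ occurs with probability $\log_{10}(1 + 1/d)$, and it takes the mean absolute deviation as the average over the nine digits of the absolute difference between observed and expected frequencies. The function names are hypothetical, and the sketch does not reproduce the authors' preprocessing (standardization and raising to the tenth power) or their two-digit analysis.

```python
import math
from collections import Counter


def first_digit(x):
    """Return the first significant digit (1-9) of a nonzero number."""
    # Scientific notation puts the first significant digit in position 0.
    return int(format(abs(x), ".15e")[0])


def benford_expected():
    """Expected first-digit probabilities under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_deviation(values):
    """Return (MAD, chi-squared) of first-digit counts vs. Benford's law.

    Assumed definitions (not taken from the paper):
      MAD  = mean over d of |observed frequency - expected probability|
      chi2 = sum over d of (observed count - expected count)^2 / expected count
    """
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero values to analyze")
    counts = Counter(digits)
    expected = benford_expected()
    mad = sum(abs(counts[d] / n - expected[d]) for d in range(1, 10)) / 9
    chi2 = sum(
        (counts[d] - n * expected[d]) ** 2 / (n * expected[d])
        for d in range(1, 10)
    )
    return mad, chi2


# Illustrative comparison: powers of 2 are known to follow Benford's law
# closely, whereas uniformly distributed first digits do not.
mad_pow2, chi2_pow2 = benford_deviation([2 ** k for k in range(100)])
mad_unif, chi2_unif = benford_deviation(list(range(1, 10)) * 10)
print(f"powers of 2:    MAD={mad_pow2:.4f}, chi2={chi2_pow2:.2f}")
print(f"uniform digits: MAD={mad_unif:.4f}, chi2={chi2_unif:.2f}")
```

In a per-center analysis of the kind described above, such a function would be applied separately to each health center's measurements, and the resulting deviations compared with the share of records flagged as incorrect.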

Key words: health centers, preventive screening, big data, bioelectrical impedance analysis, data quality, expert quality assessment algorithm, Benford's law.

Received 31.10.2021, 19.10.2022, Published 05.11.2022

DOI: 10.17537/2022.17.230


