
Mendeleev Commun., 2025, Volume 35, Issue 2, Pages 224–227 (Mi mendc7256)

Communications

How to stop worrying and love multiple citation experimental data

Ya. V. Timofeevab, A. M. Mrasovab, M. V. Panovaa, F. N. Novikova, I. V. Svitankoa

a N. D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, 119991 Moscow, Russian Federation
b Department of Chemistry, M. V. Lomonosov Moscow State University, 119991 Moscow, Russian Federation

Abstract: Numerous public databases now collect and disseminate biological activity data from literature and patents, forming the basis for chemogenomics and novel scoring functions. However, data quality is often compromised due to multiple citations of values across different studies with varying protocols. To address this issue, we used the XGBoost model in combination with a BERT-based NLP approach and a distance-based out-of-distribution (OOD) data detection method to enhance classification accuracy and exclude review articles.

Keywords: ChEMBL database, MEDLINE, machine learning, NLP, OOD detection, biological activity.

Received: 13.12.2024
Accepted: 20.01.2025

Language: English

DOI: 10.71267/mencom.7710


