
Mendeleev Commun., 2025, Volume 35, Issue 2, Pages 224–227 (Mi mendc7256)

Communications

How to stop worrying and love multiple citation experimental data

Ya. V. Timofeev [a,b], A. M. Mrasov [a,b], M. V. Panova [a], F. N. Novikov [a], I. V. Svitanko [a]

[a] N. D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, 119991 Moscow, Russian Federation
[b] Department of Chemistry, M. V. Lomonosov Moscow State University, 119991 Moscow, Russian Federation


Abstract: Numerous public databases now collect and disseminate biological activity data from literature and patents, forming the basis for chemogenomics and novel scoring functions. However, data quality is often compromised due to multiple citations of values across different studies with varying protocols. To address this issue, we used the XGBoost model in combination with a BERT-based NLP approach and a distance-based out-of-distribution (OOD) data detection method to enhance classification accuracy and exclude review articles.
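The abstract does not specify how the distance-based OOD detection is implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each document is already represented by a fixed-length embedding vector (random synthetic vectors stand in for BERT embeddings here), scores a query by its mean distance to the k nearest training embeddings, and flags it as OOD when that score exceeds a quantile of the training scores. The function names, the value of k, and the 95% quantile are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_distance(train, queries, k=5, exclude_self=False):
    """Mean Euclidean distance from each query to its k nearest training points."""
    # Pairwise distances, shape (n_queries, n_train).
    diffs = queries[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)
    # When scoring the training set against itself, skip the zero self-distance.
    start = 1 if exclude_self else 0
    return dists[:, start:start + k].mean(axis=1)


def fit_threshold(train, k=5, quantile=0.95):
    """Threshold chosen as a quantile of the training set's own kNN scores (assumed rule)."""
    scores = knn_distance(train, train, k=k, exclude_self=True)
    return float(np.quantile(scores, quantile))


def is_ood(train, queries, threshold, k=5):
    """Flag queries whose kNN distance to the training set exceeds the threshold."""
    return knn_distance(train, queries, k=k) > threshold


# Synthetic stand-ins for document embeddings (assumption: 8-dimensional vectors).
rng = np.random.default_rng(0)
train_emb = rng.normal(0.0, 1.0, size=(200, 8))   # "in-distribution" documents
in_dist = rng.normal(0.0, 1.0, size=(5, 8))       # new documents of the same kind
far_away = rng.normal(10.0, 1.0, size=(5, 8))     # documents far from the training data

threshold = fit_threshold(train_emb)
flags_in = is_ood(train_emb, in_dist, threshold)
flags_far = is_ood(train_emb, far_away, threshold)
```

In a pipeline like the one the abstract outlines, records flagged in this way could be set aside before or after a classifier such as XGBoost makes its prediction; where exactly the filter sits is not stated in the abstract.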

Keywords: ChEMBL database, MEDLINE, machine learning, NLP, OOD detection, biological activity.

Received: 13.12.2024
Accepted: 20.01.2025

Publication language: English

DOI: 10.71267/mencom.7710




