D. S. Serenko, E. D. Terentev, D. V. Zubarev, I. V. Sochenkov, “Architecture of an information collection and extraction system for an intelligent search and analytical platform”, Proceedings of ISP RAS, 2025, Volume 37, Issue 2, Pages 263

Architecture of an information collection and extraction system for an intelligent search and analytical platform

D. S. Serenko^ab, E. D. Terentev^ab, D. V. Zubarev^a, I. V. Sochenkov^cdb

^a Federal Research Center "Computer Science and Control" of Russian Academy of Sciences
^b Peoples' Friendship University of Russia named after Patrice Lumumba
^c Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)
^d Ivannikov Institute for System Programming of the RAS

Abstract: Internet data serves as the foundation for a wide range of tasks, from information retrieval to analytical processing. With the rapid growth of data volumes, efficient metadata extraction from dynamic web resources has become critically important. Traditional information collection and extraction methods based on static templates are largely ineffective when processing interactive content. This paper presents the architecture of an adaptive information collection and extraction system that integrates standard data extraction techniques with machine learning technologies. The system has a modular structure comprising the following subsystems: task management, monitoring and logging, crawling, link management, and metadata extraction. The crawling subsystem processes both static and dynamic content through browser emulation. A hybrid approach combining structured rules and machine learning is used for metadata extraction. Experiments demonstrated successful metadata extraction from various web resources, including pages with dynamic content and complex structures. The system exhibited high accuracy and resilience to changes in data formats while strictly adhering to ethical data collection standards, such as complying with robots.txt directives and applying reasonable request intervals. Thus, the proposed solution represents a significant step toward the development of universal data collection and extraction systems for modern information environments. The developed software tools have been used to populate the index databases of the Neopoisk system.
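The ethical collection constraints named in the abstract (robots.txt compliance and reasonable request intervals) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `PoliteFetchPolicy`, its method names, the `demo-bot` user agent, and the one-second default interval are all hypothetical. Only the standard-library `urllib.robotparser` calls are real.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteFetchPolicy:
    """Hypothetical helper: decides whether and when a URL may be requested."""

    def __init__(self, user_agent, min_interval=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_interval = min_interval  # assumed default gap between requests, in seconds
        self._clock = clock
        self._robots = {}        # host -> parsed robots.txt
        self._last_request = {}  # host -> time of the last recorded request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that the caller has already downloaded."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def is_allowed(self, url):
        parser = self._robots.get(urlparse(url).netloc)
        # Assumption: a host with no loaded robots.txt is treated as allowed.
        return True if parser is None else parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url):
        """Return how long to wait before the next request to this URL's host."""
        host = urlparse(url).netloc
        interval = self.min_interval
        parser = self._robots.get(host)
        if parser is not None:
            delay = parser.crawl_delay(self.user_agent)
            if delay is not None:
                # Honor Crawl-delay when it is stricter than our own minimum.
                interval = max(interval, float(delay))
        last = self._last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, interval - (self._clock() - last))

    def record_request(self, url):
        self._last_request[urlparse(url).netloc] = self._clock()
```

A crawler built along these lines would call `is_allowed` before scheduling a URL, sleep for `seconds_to_wait`, and call `record_request` once the request has been sent. Keeping the policy separate from the fetching code lets the same checks apply to both plain HTTP requests and browser-emulated page loads.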

Keywords: intelligent search and analytical systems, information collection and extraction system, metadata extraction, web crawling, dynamic content, machine learning, automated data collection, browser emulation, MarkupLM.

DOI: 10.15514/ISPRAS-2025-37(2)-20