
Proceedings of ISP RAS, 2024 Volume 36, Issue 5, Pages 153–162 (Mi tisp929)

Automatic construction of information extraction rules for news websites

S. S. Dubovitskii (a), P. A. Bedrin (b, a), A. K. Yatskov (b, a), M. I. Varlamov (a)

a Ivannikov Institute for System Programming of the RAS
b Lomonosov Moscow State University

Abstract: This paper presents a method for the automatic generation of information extraction rules (sitemaps) for news websites. Given a set of news pages from a single site, the proposed approach generates a sitemap that enables attribute extraction from arbitrary news pages on that site. The method applies a fine-tuned neural network model, MarkupLM, to extract information from individual web pages, and then generalizes the model’s predictions at the site level, creating universal rules for attribute extraction. Experimental results show that sitemaps generated with the fine-tuned model outperform both existing open-source tools and the fine-tuned MarkupLM applied at the individual page level. The developed method can be extended to other domains if relevant data for model fine-tuning is available.

Keywords: information extraction, web scraping, news websites, neural networks.

DOI: 10.15514/ISPRAS-2024-36(5)-11
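
The abstract describes generalizing page-level predictions into site-level extraction rules but does not give the procedure. The sketch below is only an illustration of that general idea, not the authors' algorithm: it assumes each page yields one predicted XPath per attribute and keeps the most frequent XPath as the site-wide rule. The function name `build_sitemap`, the `min_support` threshold, and the example XPaths are all hypothetical.

```python
from collections import Counter


def build_sitemap(page_predictions, min_support=0.5):
    """Aggregate per-page XPath predictions into one rule per attribute.

    page_predictions: list of dicts, one per page, mapping an attribute
    name to the XPath predicted for it on that page (assumed input format).
    min_support: hypothetical threshold; a rule is kept only if its XPath
    was predicted on at least this fraction of the pages.
    """
    votes = {}
    for prediction in page_predictions:
        for attribute, xpath in prediction.items():
            votes.setdefault(attribute, Counter())[xpath] += 1

    n_pages = len(page_predictions)
    sitemap = {}
    for attribute, counter in votes.items():
        # Majority vote: take the XPath predicted most often for this attribute.
        xpath, count = counter.most_common(1)[0]
        if count / n_pages >= min_support:
            sitemap[attribute] = xpath
    return sitemap


# Hypothetical predictions for three pages of the same site.
predictions = [
    {"title": "/html/body/article/h1", "date": "/html/body/article/time"},
    {"title": "/html/body/article/h1", "date": "/html/body/article/time"},
    {"title": "/html/body/div[2]/h2", "date": "/html/body/article/time"},
]
sitemap = build_sitemap(predictions)
```

Under these assumptions, a single mispredicted page (the third `title` above) is outvoted by the other pages, which is one plausible reason site-level rules could be more robust than page-level predictions.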


