
Proceedings of ISP RAS, 2024 Volume 36, Issue 3, Pages 189–202 (Mi tisp896)

Recovering text layer from PDF documents with complex background

M. V. Zagorodnikov (a), A. A. Mikhailov (a, b)

a Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences, Irkutsk
b Ivannikov Institute for System Programming of the RAS

Abstract: The article considers PDF as a format for storing and exchanging documents, with particular attention to the problem of converting data from PDF back to its original format. The study is relevant because PDF is widely used in the electronic document management of modern organizations. Despite the convenience of PDF, extracting information from such documents can be difficult because of the way the format stores information and the lack of effective tools for reverse conversion. The paper proposes a solution based on analyzing the text information in the output stream of a PDF document. This allows text in PDF documents to be recognized automatically, even if they contain non-standard fonts, complex backgrounds, or damaged encoding. The research is of interest to specialists in electronic document management and to software developers who create tools for working with PDF.

Keywords: encoding, PDF, documents, CNN, extraction, text

DOI: 10.15514/ISPRAS-2024-36(3)-13


