A. V. Ogaltsov, O. Y. Bakhteev, “Automatic metadata extraction from scientific PDF documents”, Inform. Primen., 2018, Volume 12, Issue 2, Pages 75

This article is cited in 4 papers

Automatic metadata extraction from scientific PDF documents

A. V. Ogaltsov^ab, O. Y. Bakhteev^cb

^a National Research University Higher School of Economics, 20 Myasnitskaya Str., Moscow 101000, Russian Federation
^b Antiplagiat JSC, 33 Varshavskoe Shosse, Moscow 117105, Russian Federation
^c Moscow Institute of Physics and Technology, 9 Institutskiy Per., Dolgoprudny, Moscow Region 141700, Russian Federation

Abstract: The authors investigate the task of metadata extraction from scientific PDF documents in Russian. PDF has a rich layout, which makes metadata difficult to extract. The authors propose a method that treats the blocks produced by a PDF parser as objects in a machine learning task. The feature space combines text statistics with the formatting and positioning features of each block. The authors performed computational experiments and compared their approach with a baseline.
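
The idea of treating parser blocks as machine learning objects can be illustrated with a minimal sketch. It is not the authors' implementation: the `Block` fields, the `block_features` function, and the specific features chosen are all hypothetical, picked only to show how text statistics, formatting, and positioning might sit side by side in one feature vector.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block as a PDF parser might emit it (assumed fields)."""
    text: str
    x: float          # left edge, in page units
    y: float          # top edge, measured from the top of the page
    width: float
    height: float
    font_size: float
    bold: bool


def block_features(block: Block, page_width: float, page_height: float) -> dict:
    """Map one block to a feature dict mixing text, formatting, and position."""
    n_chars = len(block.text)
    letters = [c for c in block.text if c.isalpha()]
    return {
        # Text statistics (illustrative choices, not taken from the paper)
        "n_tokens": len(block.text.split()),
        "digit_ratio": sum(c.isdigit() for c in block.text) / n_chars if n_chars else 0.0,
        "upper_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "has_at_sign": "@" in block.text,
        # Formatting features
        "font_size": block.font_size,
        "bold": block.bold,
        # Positioning features, normalised by the page size
        "rel_x_center": (block.x + block.width / 2) / page_width,
        "rel_y_top": block.y / page_height,
        "rel_width": block.width / page_width,
    }


# Usage: a large, bold, centred block near the top of a 600 x 800 page.
title = Block("Automatic metadata extraction", x=100, y=80,
              width=400, height=30, font_size=18, bold=True)
features = block_features(title, page_width=600, page_height=800)
```

Each block yields one such dictionary, and a labelled collection of them (title, author, affiliation, other) could then be passed to any standard classifier. Which classifier and which features the authors actually used is not stated in the abstract.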

Keywords: metadata extraction; natural language processing; layout features; information retrieval; metadescriptions.

Received: 20.12.2017

DOI: 10.14357/19922264180211