V. A. Polezhaev, “Automated citation graph building from a corpora of scientific documents”, Computer Research and Modeling, 2012, Volume 4, Issue 4, Pages 707

NUMERICAL METHODS AND THE BASIS FOR THEIR APPLICATION

Automated citation graph building from a corpora of scientific documents

V. A. Polezhaev

RUKONT-PhysTech Laboratory, CMAM department, MIPT, 9 Institutskii per., Dolgoprudny, Moscow Region, 141700, Russia

Abstract: This paper treats the automated construction of a citation graph from a collection of scientific documents as a sequence of machine learning tasks. It describes an overall data processing pipeline consisting of six stages: preprocessing, metainformation extraction, extraction of bibliography lists, splitting of bibliography lists into separate bibliography records, standardization of each bibliography record, and record linkage. The goal of the paper is to survey the approaches and algorithms suitable for each stage, motivate the choice of the best combination of algorithms, and adapt some of them to the processing of multilingual bibliographies. For some of the tasks, new algorithms and heuristics are proposed and evaluated on mixed corpora of English and Russian documents.
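
As a rough illustration of how the last three stages (splitting, standardization, and record linkage) fit together, the following minimal Python sketch builds a toy citation graph. It is not the method of the paper: the function names `split_records`, `normalize`, and `build_graph`, the input layout, and the regex and substring heuristics are all assumptions made for this example, standing in for the learned models the abstract refers to.

```python
import re
from collections import defaultdict


def split_records(bib_text):
    """Split a bibliography list into records.

    Toy assumption: one record per line, optionally prefixed by "[1]" or "1.".
    """
    records = []
    for line in bib_text.splitlines():
        line = re.sub(r"^\s*\[?\d+\]?[.)]?\s*", "", line).strip()
        if line:
            records.append(line)
    return records


def normalize(text):
    """Toy standardization: lowercase and keep only word characters.

    \\w is Unicode-aware in Python 3, so Cyrillic text is handled as well.
    """
    return " ".join(re.findall(r"\w+", text.lower()))


def build_graph(docs):
    """Link bibliography records to documents by normalized title.

    docs maps a document id to {"title": str, "bibliography": str}
    (a hypothetical layout). Returns a dict: citing id -> set of cited ids.
    """
    title_index = {normalize(d["title"]): doc_id for doc_id, d in docs.items()}
    graph = defaultdict(set)
    for doc_id, d in docs.items():
        for record in split_records(d["bibliography"]):
            norm = normalize(record)
            for title, target in title_index.items():
                # Naive linkage: the cited title appears inside the record.
                if target != doc_id and title in norm:
                    graph[doc_id].add(target)
    return dict(graph)


docs = {
    "A": {"title": "Parsing citations",
          "bibliography": "[1] Ivanov I. Graph mining. 2009."},
    "B": {"title": "Graph mining", "bibliography": ""},
}
graph = build_graph(docs)
```

Here `graph` contains a single edge from document `A` to document `B`. Exact substring matching is brittle against abbreviations, transliteration, and typos, which is why the record linkage stage is framed as a learning task rather than a lookup.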

Keywords: text mining, machine learning, information extraction, citation graph, bibliography, matching, record linkage, labeling, segmentation, conditional random fields.

UDC: 004.852

Received: 06.09.2012

DOI: 10.20537/2076-7633-2012-4-4-707-719