
Inform. Primen., 2014 Volume 8, Issue 4, Pages 94–109 (Mi ia348)

This article is cited in 4 papers

Methods of entity resolution and data fusion in the ETL-process and their implementation in the Hadoop environment

A. E. Vovchenko (a), L. A. Kalinichenko (a, b), D. Yu. Kovalev (a)

a Institute of Informatics Problems, Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
b Faculty of Computational Mathematics and Cybernetics, M. V. Lomonosov Moscow State University, 1-52 Leninskiye Gory, GSP-1, Moscow 119991, Russian Federation

Abstract: Extraction of entities, their transformation, and their loading into an integrated repository are the central problems of data integration. These steps form the ETL (extract–transform–load) process. An entity is a digital representation of a real-world object (for example, information about a person). Entity resolution covers duplicate detection, deduplication, record linkage, object identification, reference matching, and other ETL-related tasks. After the entity resolution step, the related entities should be merged into a single reference entity containing information from all of them. This data fusion is the final step of the data integration process. The paper gives an overview of entity resolution and data fusion methods and presents techniques for programming these methods when implementing the ETL process in the Hadoop environment. The High-Level Integration Language (HIL), a declarative language focused on the resolution and fusion of entities in the Hadoop infrastructure, is used in this part of the paper.
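The two concepts the abstract describes can be illustrated with a minimal sketch: entity resolution groups records that likely refer to the same real-world object, and data fusion merges each group into one reference entity. This is not the paper's HIL/Hadoop implementation; the records, field names, and similarity threshold below are hypothetical.

```python
# Illustrative sketch of entity resolution + data fusion (not the
# paper's HIL/Hadoop code; all data and thresholds are made up).
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Approximate string match: decide whether two key values
    plausibly refer to the same real-world object."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def resolve(records, key="name"):
    """Entity resolution: cluster records whose key fields match
    approximately (greedy single-pass clustering for brevity)."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if similar(rec[key], cluster[0][key]):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def fuse(cluster):
    """Data fusion: merge a cluster into one reference entity,
    keeping the first non-empty value seen for each attribute."""
    reference = {}
    for rec in cluster:
        for field, value in rec.items():
            if value and field not in reference:
                reference[field] = value
    return reference

records = [
    {"name": "John Smith", "city": "Moscow", "phone": ""},
    {"name": "Jon Smith",  "city": "",       "phone": "555-0100"},
    {"name": "Anna Lee",   "city": "Berlin", "phone": ""},
]
reference_entities = [fuse(c) for c in resolve(records)]
```

In a real ETL pipeline on Hadoop, the pairwise matching would be distributed (e.g., with blocking to avoid comparing all record pairs), and the fusion rules would encode per-attribute conflict-resolution policies rather than a simple first-non-empty rule.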

Keywords: data integration; ETL; entity resolution; data fusion; big data; Hadoop; Jaql; HIL.

Received: 09.11.2014

DOI: 10.14357/19922264140412


