Abstract:
Extracting entities, transforming them, and loading them into an integrated repository are the central problems of data integration. These actions form the ETL (extract, transform, load) process. An entity is a digital representation of a real-world object (for example, information about a person). Entity resolution covers duplicate detection, deduplication, record linkage, object identification, reference matching, and related ETL tasks. After entity resolution, the related entities should be merged into a single reference entity that contains information from all of them. This merging step, data fusion, is the final step in the data integration process. The paper gives an overview of entity resolution and data fusion methods. It also presents techniques for programming these methods to implement the ETL process in the Hadoop environment. This part of the paper uses the High-Level Integration Language (HIL), a declarative language focused on the resolution and fusion of entities in the Hadoop infrastructure.
Keywords: data integration; ETL; entity resolution; data fusion; big data; Hadoop; Jaql; HIL.