A. Andreev, D. Berezkin, I. Kozlov, K. Simakov, “Multicriteria method for detecting near-duplicates in a stream of text messages”, Sistemy i Sredstva Inform., 2015, Volume 25, Issue 1, Pages 34

This article is cited in 1 paper

Multicriteria method for detecting near-duplicates in a stream of text messages

A. Andreev, D. Berezkin, I. Kozlov, K. Simakov

Bauman Moscow State Technical University, 5 Baumanskaya 2nd Str., Moscow 105005, Russian Federation

Abstract: The problem of near-duplicate detection in a stream of text messages is considered. A model of a text document and a multicriteria duplicate identification method are proposed. The model can be adjusted flexibly for different domains. The method is based on binary classification using a support vector machine. The paper also provides a candidate prefiltering method to keep the approach efficient. Several experiments were carried out on data obtained from a stream of news articles. The results show the feasibility of the suggested approach.
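The pipeline outlined in the abstract (prefilter candidates, score each pair on several criteria, classify the pair) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: the two criteria (title word overlap and body shingle overlap), the shared-word prefilter, the document fields `id`, `title` and `body`, and all function names. In particular, `is_near_duplicate` is a hand-set linear decision function standing in for a trained support vector machine; its weights and bias are made up.

```python
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(doc_a, doc_b):
    """One similarity score per criterion (hypothetical criteria)."""
    return [
        jaccard(shingles(doc_a["title"], 1), shingles(doc_b["title"], 1)),
        jaccard(shingles(doc_a["body"], 2), shingles(doc_b["body"], 2)),
    ]


def prefilter(new_doc, index, min_shared=2):
    """Return ids of earlier documents sharing at least min_shared body words."""
    counts = defaultdict(int)
    for word in set(new_doc["body"].lower().split()):
        for doc_id in index.get(word, ()):
            counts[doc_id] += 1
    return sorted(d for d, c in counts.items() if c >= min_shared)


def is_near_duplicate(features, weights=(0.5, 0.5), bias=-0.4):
    """Linear decision rule standing in for a trained SVM (made-up parameters)."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score > 0


def process_stream(docs):
    """Return (new_id, earlier_id) pairs flagged as near-duplicates."""
    index = defaultdict(set)  # body word -> ids of documents containing it
    seen = {}                 # id -> document
    duplicates = []
    for doc in docs:
        # Only candidates that pass the cheap prefilter reach the classifier.
        for cand_id in prefilter(doc, index):
            if is_near_duplicate(pair_features(doc, seen[cand_id])):
                duplicates.append((doc["id"], cand_id))
        seen[doc["id"]] = doc
        for word in set(doc["body"].lower().split()):
            index[word].add(doc["id"])
    return duplicates
```

In a real setting the decision function would be replaced by a classifier trained on labeled pairs, and the criteria would be chosen per domain. The prefilter matters because it keeps each incoming message from being compared against every earlier one.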

Keywords: near-duplicate detection; similarity measure; binary classification.

Received: 30.12.2014

DOI: 10.14357/08696527150103