Sistemy i Sredstva Informatiki [Systems and Means of Informatics]

Sistemy i Sredstva Inform., 2015 Volume 25, Issue 1, Pages 34–53 (Mi ssi392)

This article is cited in 1 paper

Multicriteria method for detecting near-duplicates in a stream of text messages

A. Andreev, D. Berezkin, I. Kozlov, K. Simakov

Bauman Moscow State Technical University, 5 Baumanskaya 2nd Str., Moscow 105005, Russian Federation

Abstract: The problem of near-duplicate detection in a stream of text messages is considered. A model of a text document and a multicriteria duplicate identification method are proposed. The model can be adjusted flexibly for different domains. The method is based on binary classification using a support vector machine. The paper also provides a candidate prefiltering method to keep the approach efficient. Several experiments were carried out on data obtained from a stream of news articles. The results show the feasibility of the suggested approach.
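As an illustration only, the general shape of such a pipeline might be sketched as follows: a cheap prefilter selects candidate messages from the stream, and each surviving pair is described by a vector of similarity criteria that a binary classifier could then label as duplicate or not. The abstract does not list the criteria, the prefilter, or the thresholds the authors use, so every name and value here (`shingles`, `criteria`, `prefilter`, `min_word_overlap`, the three criteria themselves) is a hypothetical stand-in, and the classifier is left out.

```python
from typing import List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def criteria(doc_a: str, doc_b: str) -> List[float]:
    """Hypothetical criteria vector: word overlap, shingle overlap, length ratio.

    These three measures are assumptions chosen for illustration; they are
    not taken from the paper.
    """
    words_a, words_b = doc_a.lower().split(), doc_b.lower().split()
    longest = max(len(words_a), len(words_b))
    length_ratio = min(len(words_a), len(words_b)) / longest if longest else 1.0
    return [
        jaccard(set(words_a), set(words_b)),
        jaccard(shingles(doc_a), shingles(doc_b)),
        length_ratio,
    ]


def prefilter(new_doc: str, stream: List[str],
              min_word_overlap: float = 0.3) -> List[int]:
    """Return indices of earlier messages that share enough words with new_doc.

    The threshold is an assumed value, not one reported by the authors.
    """
    new_words = set(new_doc.lower().split())
    return [
        i for i, old in enumerate(stream)
        if jaccard(new_words, set(old.lower().split())) >= min_word_overlap
    ]
```

In a setup like this, the vectors returned by `criteria` for labeled pairs would serve as training input for a support vector machine (for example `sklearn.svm.SVC`, if scikit-learn is available), and only pairs that pass `prefilter` would be scored at run time.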

Keywords: near-duplicate detection; similarity measure; binary classification.

Received: 30.12.2014

DOI: 10.14357/08696527150103


