Informatika i Ee Primeneniya [Informatics and its Applications]

Inform. Primen., 2013 Volume 7, Issue 3, Pages 2–13 (Mi ia267)

This article is cited in 1 paper

Unsupervised approach to web wrapper maintenance

A. M. Andreev, D. V. Berezkin, I. A. Kozlov, K. V. Simakov

Bauman Moscow State Technical University

Abstract: HTML wrapper applications rely on formatting regularities of the websites they target, so maintaining such applications requires detecting changes in the markup of web pages. This article describes an unsupervised approach to this problem. The proposed detection method consists of two parts: a real-time part based on clustering, which treats each HTML document as a vector of features, and a time-lagged part based on comparing the distributions of those features over the learning and testing sets of HTML documents. Several experiments were carried out with data obtained from a real wrapper. The results indicate that the suggested approach is feasible.

Keywords: wrapper maintenance; web-site parsing; clustering; HTML-markup statistical processing.

DOI: 10.14357/19922264130301
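
The two-part detection scheme summarized in the abstract can be illustrated with a minimal sketch. The abstract does not name the features, the clustering algorithm, or the statistical comparison used, so everything concrete here is an assumption made for illustration: relative tag frequencies serve as features, a single-centroid distance check stands in for the clustering step, and a two-sample Kolmogorov-Smirnov statistic stands in for the distribution comparison. All function names and the `slack` parameter are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
import math


class TagCounter(HTMLParser):
    """Counts opening tags; a stand-in for the paper's unspecified features."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def features(html, vocab):
    """Relative frequency of each tag in vocab (assumed feature choice)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values()) or 1
    return [parser.counts[t] / total for t in vocab]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_outlier(vec, train_vectors, slack=1.5):
    """Real-time check (assumed form): flag a page that lies farther from
    the training centroid than slack times the largest training distance."""
    c = centroid(train_vectors)
    radius = max(distance(v, c) for v in train_vectors)
    return distance(vec, c) > slack * radius + 1e-9


def ks_statistic(xs, ys):
    """Time-lagged check (assumed form): two-sample Kolmogorov-Smirnov
    statistic, the largest gap between the two empirical CDFs."""
    def ecdf(sample, p):
        return sum(1 for s in sample if s <= p) / len(sample)

    points = sorted(set(xs) | set(ys))
    return max(abs(ecdf(xs, p) - ecdf(ys, p)) for p in points)
```

In use, `is_outlier` would be applied to each newly fetched page as it arrives, while `ks_statistic` would be applied per feature to the values collected over a training batch and a later batch, with a large statistic suggesting that the markup has drifted. Thresholds for both checks would need to be tuned on real wrapper data.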




