Proceedings of ISP RAS, 2017 Volume 29, Issue 2, Pages 161–200 (Mi tisp214)

This article is cited in 15 papers

A survey and an experimental comparison of methods for text clustering: application to scientific articles

P. A. Parhomenko (a, b), A. A. Grigorev (b, c), N. A. Astrakhantsev (b)

a Lomonosov Moscow State University
b Institute for System Programming of the RAS
c National Research University Higher School of Economics (HSE)

Abstract: Text document clustering is used in many applications, such as information retrieval, exploratory search, and spam detection. The problem is the subject of many scientific papers, but how the specifics of scientific articles affect clustering efficiency remains insufficiently studied, in particular when all documents belong to the same domain or when the full texts of articles are unavailable. This paper presents an overview and an experimental comparison of text clustering methods as applied to scientific articles. We study methods based on bag of words, terminology extraction, topic modeling, and word and document embeddings obtained by artificial neural networks (word2vec, paragraph2vec).

Keywords: text documents clustering, bag of words, terminology extraction, topic modeling, word and document embedding, artificial neural networks.

DOI: 10.15514/ISPRAS-2017-29(2)-6


