JOURNALS // Modelirovanie i Analiz Informatsionnykh Sistem // Archive

Model. Anal. Inform. Sist., 2021 Volume 28, Number 3, Pages 250–259 (Mi mais748)

This article is cited in 1 paper

Theory of data

Comparison of style features for the authorship verification of literary texts

K. V. Lagutina

P. G. Demidov Yaroslavl State University, 14 Sovetskaya str., Yaroslavl 150003, Russia

Abstract: The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels, each about 50 000 characters long, with 40 fragments per author. The corpora cover 20 authors who wrote in English, Russian, and French, and 8 Spanish-language authors.
The study uses existing algorithms to calculate low-level features popular in computational linguistics and rhythm features common in literary texts. Low-level features include n-grams of words, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. The rhythm features include the frequency of each rhythm figure per 100 sentences, the number of unique words in the aspects of rhythm, and the percentages of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is treated as a binary classification problem: deciding whether a text belongs to a particular author or not. AdaBoost and a neural network with an LSTM layer are used as classifiers. The experiments demonstrate the effectiveness of rhythm features for verifying particular authors and show that, on average, combinations of feature types outperform single feature types. The best precision, recall, and F-measure values for the AdaBoost classifier exceed 90% when all three types of features are combined.
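As a rough illustration of the feature types named above, the following Python sketch computes a few low-level features, a per-100-sentences frequency for one rhythm figure, and binary labels for the verification setup. It is not the paper's implementation: the function names, the naive sentence splitter, and the simplified definition of anaphora (adjacent sentences that begin with the same word) are all assumptions made for this example.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased word tokens; \w+ also matches non-Latin letters in Python 3.
    return re.findall(r"\w+", sentence.lower())


def low_level_features(text):
    # A few low-level features: average lengths, comma rate, letter frequencies.
    sentences = split_sentences(text)
    tokens = [w for s in sentences for w in words(s)]
    letters = Counter(ch for ch in text.lower() if ch.isalpha())
    total_letters = sum(letters.values()) or 1
    return {
        "avg_word_len": sum(map(len, tokens)) / max(len(tokens), 1),
        "avg_sentence_len": len(tokens) / max(len(sentences), 1),
        "comma_per_char": text.count(",") / max(len(text), 1),
        "letter_freq": {ch: n / total_letters for ch, n in letters.items()},
    }


def anaphora_per_100_sentences(text):
    # Simplified anaphora (assumption): two adjacent sentences share their first word.
    sentences = split_sentences(text)
    firsts = [words(s)[0] for s in sentences if words(s)]
    hits = sum(1 for a, b in zip(firsts, firsts[1:]) if a == b)
    return 100.0 * hits / max(len(sentences), 1)


def binary_labels(fragment_authors, target_author):
    # Verification as binary classification: 1 if the fragment is by the target author.
    return [1 if a == target_author else 0 for a in fragment_authors]


sample = "We came. We saw. We won. Then it rained."
features = low_level_features(sample)
rate = anaphora_per_100_sentences(sample)
labels = binary_labels(["A", "B", "A"], "A")
```

Feature vectors built this way, one per fragment, together with the binary labels, could then be passed to any classifier, such as an AdaBoost implementation.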

Keywords: stylometry, natural language processing, style features, rhythm features, authorship verification.

UDC: 004.912

MSC: 68T50

Received: 04.05.2021
Revised: 20.08.2021
Accepted: 25.08.2021

Language: English

DOI: 10.18255/1818-1015-2021-3-250-259


