Theory of data
Comparison of style features for the authorship verification of literary texts
K. V. Lagutina
P. G. Demidov Yaroslavl State University, 14 Sovetskaya str., Yaroslavl 150003, Russia
Abstract:
The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels; each fragment is about 50 000 characters long, and there are 40 fragments for each author. 20 authors who wrote in English, Russian, French, and 8 Spanish-language authors are considered.
The study uses existing algorithms to calculate low-level features, which are popular in computational linguistics, and rhythm features, which are common in literary texts. Low-level features include word n-grams, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. These features include the frequency of occurrence of particular rhythm figures per 100 sentences, the number of unique words in the aspects of rhythm, and the percentage of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is treated as a binary classification problem: whether or not the text belongs to a particular author. AdaBoost and a neural network with an LSTM layer are used as classification algorithms. The experiments demonstrate the effectiveness of rhythm features in the verification of particular authors and, on average, the superiority of combinations of feature types over single feature types. The best values of precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three types of features are combined.
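As a rough illustration of the two feature families named above, the following Python sketch computes a few low-level features and one rhythm feature (anaphora frequency per 100 sentences) for a text fragment. It is not the algorithm used in the article, which the abstract does not specify. The function names, the naive sentence splitter, and the simplified definition of anaphora (two consecutive sentences that begin with the same word) are all assumptions made for the example.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: a naive splitter that breaks after '.', '!' or '?'
    # followed by whitespace. Real corpora need a proper tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased alphabetic tokens (Unicode-aware, digits excluded).
    return re.findall(r"[^\W\d_]+", sentence.lower())


def low_level_features(text):
    # A small, hypothetical subset of the low-level features listed above.
    sentences = split_sentences(text)
    tokens = [w for s in sentences for w in words(s)]
    if not tokens:
        return {}
    punct = Counter(c for c in text if c in ".,;:!?")
    return {
        "avg_word_len": sum(len(w) for w in tokens) / len(tokens),
        "avg_sentence_len": len(tokens) / len(sentences),
        "comma_freq": punct[","] / len(text),
    }


def anaphora_per_100_sentences(text):
    # Simplified anaphora: consecutive sentences sharing their first word.
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    firsts = [words(s)[0] for s in sentences if words(s)]
    hits = sum(1 for a, b in zip(firsts, firsts[1:]) if a == b)
    return 100.0 * hits / len(sentences)


sample = "We came. We saw. We won. Then it rained."
features = low_level_features(sample)
anaphora = anaphora_per_100_sentences(sample)
```

In a verification setting, such per-fragment values would be collected into a feature vector and passed to a binary classifier that decides whether the fragment belongs to a given author.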
Keywords:
stylometry, natural language processing, style features, rhythm features, authorship verification.
UDC:
004.912
MSC: 68T50
Received: 04.05.2021
Revised: 20.08.2021
Accepted: 25.08.2021
Language: English
DOI:
10.18255/1818-1015-2021-3-250-259