Comparison of Style Features for the Authorship Verification of Literary Texts

The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels; each fragment is about 50 000 characters long, and there are 40 fragments per author. 20 authors each who wrote in English, Russian, and French, and 8 Spanish-language authors are considered. The authors of this paper use existing algorithms for the calculation of low-level features, popular in computational linguistics, and rhythm features, characteristic of literary texts. Low-level features include n-grams of words, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. These features include the frequency of occurrence of particular rhythm figures per 100 sentences, the number of unique words in the aspects of rhythm, and the percentage of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is considered as a binary classification problem: whether the text belongs to a particular author or not. AdaBoost and a neural network with an LSTM layer are considered as classification algorithms. The experiments demonstrate the effectiveness of rhythm features for verifying particular authors, and the superiority, on average, of feature-type combinations over single feature types. The best precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three types of features are combined.

In the state of the art of authorship verification and related text classification tasks, there is no set of style features that is versatile across different texts. Some feature types, such as character-level, word-level, and syntactic features, appear in many investigations, but are often combined with more complex linguistic features [2,3]. Researchers admit that the influence of different types of features on the quality of text classification remains underexplored [4].
Rhythm features are a subtype of the linguistic features that most often describe the style of literary texts [5]. They can be applied to authorship verification [6], but are rarely compared with other feature types [2]. The goal of this paper is to compare how different feature types affect the quality of the authorship verification of literary texts. We analyse rhythm features and popular low-level features based on statistics of text elements. The comparison is performed on corpora of English, Russian, French, and Spanish literary texts.
1. State-of-the-art

The task of authorship verification is usually performed for texts from the Internet: news articles, emails, reviews, etc. [2,7].
In many cases the researchers modeled texts using only standard low-level features and experimented with classification. Halvani et al. [8] used stylometric features based on n-grams. Verification was realized by determining the proximity of the numerical feature vectors of the texts. Experiments were conducted in five European languages: Dutch, English, Greek, Spanish, and German. The F-measure varied from 67.37 % for Greek up to 83.33 % for Spanish. The method also showed good results at the PAN-2020 competition [9].
Potha and Stamatatos [7] introduced an intrinsic profile-based verification method that applies latent semantic indexing for topic modeling together with low-level features: word and character n-grams. The algorithm computes a text model that represents all texts of the same author as a common vector, and then identifies the author of a test text by searching for the closest vector among the authors' training vectors. In experiments the researchers compared corpora of prose, newspaper articles, and reviews in four languages: Dutch, English, Greek, and Spanish. The method achieved more than 80 % AUC.
Boenninghoff et al. [10] proposed a new neural network topology to identify whether two documents with unknown authors were written by the same author. This approach showed the best precision, recall, and F-measure of 84 % for short multi-genre social media posts.
Adamovic et al. [11] explored a wide range of word- and character-based language-independent stylistic text features. They then applied the SVM-RFE feature selection method to remove redundant and irrelevant characteristics. Authorship verification of articles in four languages (English, Greek, Spanish, and German) showed a high accuracy of over 90 %.
To improve the quality of authorship verification and take into account domain peculiarities and the authors' idiolect, researchers frequently apply linguistic features.
Al-Khatib and Al-qaoud [12] verified native and non-native speakers in online opinion articles. The feature set included statistical and linguistic features: number of unique words, complexity, the Gunning-Fog readability index, character space, letter space, average syllables per word, sentence count, average sentence length, and the Flesch-Kincaid readability score. The accuracy varied from 47 % to 77 % across text corpora.
Lagutina et al. [6] investigated the application of rhythm features to the authorship verification of artistic prose. They computed features based on repetitions of words and sentences (anaphora, epiphora, aposiopesis, etc.) and verified authors of English, Russian, French, and Spanish prose. The F-measure ranged from 60 % to 95 % for different authors and was about 80 % on average.

Literary texts are usually analysed not in authorship verification but in the close task of authorship attribution. For example, Stanisz et al. [13] created adjacency networks with words frequently appearing in texts as vertices and their co-occurrences as edge weights. The authors then computed various graph characteristics: clustering coefficients of vertices, average shortest path length, an assortativity coefficient, and modularity. The experiments showed an accuracy of 85-90 % for English and Polish books.

The analysis of the state-of-the-art papers shows a lack of comparison of different feature types with linguistic ones, especially for artistic texts. Authors usually rely on standard statistical features based on words and characters and try to extend them with a relatively small number of syntactic, topical, or other linguistic features. Deep linguistic features remain under-researched, most probably because of the complexity of their search, although such features directly characterize the author's style [5] and can be the most interpretable ones.

Style features
We compare three types of features: character-level, word-level, and rhythm ones. The first two feature types are popular, effective features from the state of the art. The rhythm features describe the specific style marks of the authors that frequently appear in literary texts.
Before feature calculation we search plain texts for the following elements:
• Top-40 unigrams and top-40 bigrams of words across the text corpora. They are used for computing frequencies of occurrence for n-grams.
• Lexico-grammatical rhythm figures. For each text we find the lists of the following figures: anaphora, epiphora, symploce, anadiplosis, diacope, epizeuxis, epanalepsis, chiasmus, polysyndeton, repeating exclamatory sentences, repeating interrogative sentences, and aposiopesis. Their definitions and search algorithms are taken from the works of Lagutina et al. [6,14]. The quality of figure search achieves 80-95 % precision.

We compute the following style features:
• Character-level features:
  - Average sentence length in characters, including punctuation marks and spaces.
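As an illustration, the basic statistics above can be sketched in plain Python. The helper names (`avg_sentence_length`, `top_ngrams`, `count_anaphora`) are ours, and the anaphora detector is a deliberately simplified stand-in for the full search algorithms of [6,14]: it only counts adjacent sentences that begin with the same word.

```python
import re
from collections import Counter

def sentences(text):
    # naive sentence splitter on ., !, ?
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def words(text):
    return re.findall(r"[^\W\d_]+", text.lower())

def avg_sentence_length(text):
    # average sentence length in characters, punctuation and spaces included
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(len(s) for s in sents) / len(sents)

def top_ngrams(corpus_texts, n=1, top=40):
    # most frequent word n-grams over the whole corpus
    counts = Counter()
    for t in corpus_texts:
        w = words(t)
        counts.update(tuple(w[i:i + n]) for i in range(len(w) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def count_anaphora(text):
    # simplified anaphora: adjacent sentences starting with the same word
    starts = [words(s)[0] for s in sentences(text) if words(s)]
    return sum(1 for a, b in zip(starts, starts[1:]) if a == b)
```

Per-text feature values are then obtained by counting each selected n-gram and each detected figure, normalized per 100 sentences as described above.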

Design of authorship verification
After feature extraction we get a matrix where rows are texts of particular authors and columns are features. We verify each author separately using the whole matrix for the author's language: his or her texts are labeled as belonging to the author, the rest as not belonging. Then binary classification is performed.
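A minimal sketch of this one-vs-rest labeling scheme (the function names and data layout are our assumptions, not taken from the paper's published code):

```python
def binary_labels(text_authors, target_author):
    """Label each text 1 if it belongs to target_author, else 0."""
    return [1 if a == target_author else 0 for a in text_authors]

def verification_tasks(text_authors):
    """Yield one binary classification task per author in the corpus."""
    for author in sorted(set(text_authors)):
        yield author, binary_labels(text_authors, author)
```

Each yielded task is then solved independently by the classifiers described below.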
Two classifiers are compared: AdaBoost and a bidirectional LSTM. They have already shown their quality in state-of-the-art text classification tasks [15].
The AdaBoost classifier combines the results of 50 decision tree classifiers. The bidirectional LSTM neural network contains a bidirectional LSTM layer with 64 units and a dense output layer with the sigmoid activation function. The loss function is categorical cross-entropy, the optimization algorithm is Adam, and the number of epochs is 100.
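The AdaBoost configuration can be sketched with Scikit-Learn; the toy feature matrix below is our illustration, not the paper's corpora, and perfectly separable on purpose:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# 50 decision-tree weak learners, matching the paper's configuration
clf = AdaBoostClassifier(n_estimators=50, random_state=0)

# toy feature matrix: rows = text fragments, columns = style features
X = np.array([[0.1, 2.0], [0.2, 1.9], [0.9, 0.1], [1.0, 0.2]])
y = np.array([1, 1, 0, 0])  # 1 = target author, 0 = other authors

clf.fit(X, y)
predictions = clf.predict([[0.15, 2.1], [0.95, 0.0]])
```

In the paper's setting, `X` would hold the character-level, word-level, and/or rhythm feature values for all fragments of one language corpus, and `y` the one-vs-rest labels for the author being verified.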
In order to estimate the stability of the classifiers, we apply the five-fold cross-validation technique: 80 % of texts are training samples, 20 % are test ones.
The estimation is performed with three standard measures: precision, recall, and F-measure [16], along with their standard deviations. The code for the feature selection and authorship verification is published at https://github.com/textprocessing/prose-rhythm-detector. It is written in Python and uses the Stanza 1.1.1 NLP library for text representation and part-of-speech tagging. For verification it uses Scikit-Learn 0.23.2 and Keras 2.4.3.
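For reference, the three measures can be computed from binary predictions as follows (a plain-Python sketch; the function name is ours):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F-measure for binary labels (1 = target author)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With five-fold cross-validation, these measures are computed on each test fold and then averaged; the standard deviations over folds indicate the classifier's stability.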

Text corpora
We compare literary texts in four languages: English, Russian, French, and Spanish. The corpora were created by manually collecting famous works of famous authors written in their native language.
In order to make the texts equal in size, we extracted 1-4 fragments of about 50 000 characters (including spaces) from each prose text. In this way each author is represented by 40 text fragments. The English, Russian, and French corpora contain texts of 20 famous authors of the 19th-21st centuries, 800 texts per corpus. The Spanish corpus has texts of 8 authors of the 19th-20th centuries, 320 texts in total.

Experiments
During the experiments we compare features of three types: 36-43 character-level features (the letter sets differ for corpora in different languages), 82 word-level features, and 17 rhythm features.
Comparing the two classifiers, we discover that AdaBoost outperforms the neural network by 10-15 % in precision, recall, and F-measure. Most probably this happens because the training sample is of insufficient size for better performance of the LSTM network, so the tables in this section contain classification quality for the AdaBoost algorithm. Table 1 describes authorship verification quality for all feature types and their combinations. Ch means character-level features, W word-level ones, Rh rhythm ones, + marks the combination of two feature types, and All the combination of all three feature types. Precision, recall, and F-measure are calculated as averages over all authors. Bold marks the lines with the best quality and the best F-measures.
From Table 1 we can see that rhythm features alone provide good classification quality. It is lower by 3-11 % of F-measure in most cases, but reaches quite high values of 78-87 %. Besides, the number of rhythm features is several times smaller than that of character- and word-level ones, so a relatively small number of specific style parameters allows achieving significant authorship verification quality.
Any combination of feature types improves quality by 2-14 %, and the combination of all three types is slightly better than any combination of two.
Authors of Russian, French, and Spanish texts are in most cases verified better than authors of English texts. For English and French texts the best single feature type is character-level; for Russian and Spanish texts it is word-level.