Automated Search and Analysis of the Stylometric Features That Describe the Style of the Prose of 19th–21st Centuries

This article compares stylometric features of several levels that serve as markers of prose style and analyzes stylistic changes in Russian and British prose of the 19th–21st centuries. The stylometric features include low-level features based on words and symbols and high-level features based on rhythm. These features model the style of a text and indicate the time when the text was created. All features are calculated fully automatically, which makes it possible to conduct large-scale experiments with voluminous literary works and speeds up the work of a linguist. To calculate the stylometric features, including those based on the results of searching for rhythm figures, the ProseRhythmDetector program is used. As a result, each text is represented by the same set of features of three levels: characters, words, and rhythm. Texts are grouped by decade, and the average values of the stylometric features are found for each decade. The resulting decade models are compared using standard similarity metrics, and the comparison results are visualized as heatmaps and dendrograms. Experiments with two corpora of Russian and British texts show that over the 19th–21st centuries there are general trends in style change common to both corpora, for example, a decrease in the number of rhythm figures per sentence, as well as trends particular to each language, for example, the dynamics of word and sentence lengths. Stylometric features of all levels reveal the similarity in style of texts published in the same century. Moreover, features of the three levels taken together demonstrate the uniqueness of each decade better than features of any single level. This study shows the importance of stylometric features as style markers of different eras and makes it possible to identify trends in style over several centuries.


INTRODUCTION
Text rhythm is defined as the regular repetition of similar units of speech [1]. It is noted in the literature that the rhythm of prose differs from the rhythm of poetry and requires its own methods of analysis, including methods that model it quantitatively and allow prose texts to be compared with each other [2].
The quantitative characteristics of text rhythm are included in an extensive group of stylometric features that describe the style of a text at different levels and make up its numerical model [3], on the basis of which it is possible to carry out statistical and comparative analysis of texts of various authors, genres, time periods, etc.
Despite the widespread prevalence of stylometric features in the field of processing natural language, they are usually used to solve specific classification problems, for example, determining the author or genre. Moreover, of the entire spectrum of features, the most studied features are low-level ones that model text at the word or symbol level. High-level or linguistic features, including rhythmic ones, are poorly studied parts of text style [4].
The authors set themselves the task of comparing high-level and low-level stylometric features in prose texts of the 19th–21st centuries by decade and statistically analyzing the dynamics of style changes in Russian and British prose. For this purpose, the ProseRhythmDetector program was used, which was presented in previous studies as a tool for finding rhythm figures in prose texts. The authors added to ProseRhythmDetector a module that calculates stylometric text features of the symbol, word, and rhythm levels. This article describes experiments in which this module is used to analyze texts by century and decade.

REVIEW OF RELATED WORKS
Stylometry is a scientific discipline that measures the stylistic features of texts with the purpose of their ordering, diagnostics, identification, parameterization, taxonomy, attribution, and periodization [5]. Stylometric features of prose texts change over time for literature in different languages, so they can serve as indicators of the era when the works were created [6]. Kumar et al. explore lexical style markers for the 18th-21st centuries, by which the date of publication can be determined. Using the text features of the word and phrase level, they reach an average error of 32 years; i.e., they fairly accurately determine the century when the text was published.
To solve such problems, simple features of the word and symbol level are usually taken. The authors of the article [7] built just such a model, but they recommended expanding text models with more complex linguistic features according to the results of the study.
The Semeval 2015 contest [8] was devoted to determining the time period for article publication ranging from the 18th to the 21st century. The best results of up to 86.8% accuracy were achieved by the participants who applied a wide range of stylometric features varying from lexical to grammatical ones, including even the meta-properties of a document.
Gopidi and Alam [9] showed that numerical grammatical features as well as features based on stress and rhyme varied across different 50-year periods for prose and poetry.
These results suggest that stylometric features of different levels model the style of a text well and can point to the specific era of its creation.
Stylometric features vary not only for different eras, but also for different languages. The authors of [10] clustered texts using the k-means algorithm, based on the occurrence of words and symbols. F-measure for such a classification algorithm turned out to be no higher than 53%. Using neural networks, classification of texts by language based on features of the symbol and word level can reach higher values of F-measure of 70-80% [11]. But such studies only touch on low-level features, leaving an open question of significance of linguistic features for style modeling and analysis.
The authors studied the variability of rhythmic features for different periods of time (19th-21st centuries) and languages (Russian and English) in the previous work [12], where they showed that all three centuries differed in rhythm. This article describes the results of the comparative analysis of stylometric features of several levels.

STYLOMETRIC FEATURES
Stylometric text analysis includes searching for and counting different stylometric features. Among these features, several categories can be distinguished:
1. symbol level;
2. word level;
3. rhythm level.
Rhythmic text features are determined based on the use of rhythm figures, which are based on repetition in a certain configuration, in a certain position, with a certain number of repeating elements. For this study, the following rhythm figures were selected:
1. Anaphora is connecting speech segments (parts of a phrase, verses) using repetition of a word or phrase in the starting position.
2. Epiphora is connecting speech segments (parts of a phrase, verses) using repetition of a word or phrase in the final position.
3. Simploka is a figure of syntactic parallelism in adjacent verses or phrases that have the same beginning and end but a different middle or, vice versa, a different beginning and end but the same middle.
4. Anadiplosis is a rhetorical figure in which the next sentence begins with the same words that are at the end of the previous sentence.
5. Epanalepsis is a figure of speech that is the repetition of the same word or phrase with slight variations.
6. Polysyndeton is a stylistic figure consisting in the deliberate increase in the number of conjunctions in a sentence, usually with the purpose of connecting homogeneous members.
7. Diakopa is a rhetorical term for repeating a word or phrase with one or a few intermediate words between the repetitions.
8. Episeuxis is a figure of speech that denotes the repetition of words without a gap between repetitions.
For these rhythm figures, the following numerical stylometric features were chosen:
1. The number of occurrences of a particular figure in the text divided by the number of sentences;
2. The number of occurrences of all figures in the text divided by the number of sentences;
3. The proportion of unique words among all words that make up the figures, where unique words are those that are repeated only once;
4. The proportions of nouns, adjectives, verbs, and adverbs among the words that make up the figures.
These figures were chosen for rhythm analysis, namely, for automated search and quantitative processing, because they are the most frequent rhythm figures used in prose texts. It is these that are distinguished as rhythm figures at the lexical and grammatical levels by most linguists who conduct research in the field of text rhythmization [12].
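As an illustration of the first rhythmic feature (occurrences of one figure divided by the number of sentences), the following is a minimal sketch for anaphora only. It is not the search algorithm of ProseRhythmDetector, which is described in [12]. It assumes a deliberately simplified definition: a run of adjacent sentences opening with the same word counts as one occurrence. The function names and the naive sentence splitter are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def first_word(sentence):
    """Return the lowercased first word of a sentence, or None if it has no word."""
    match = re.search(r"\w+", sentence)
    return match.group(0).lower() if match else None


def count_anaphora(sentences):
    """Count maximal runs of adjacent sentences that open with the same word."""
    count = 0
    in_run = False
    for prev, curr in zip(sentences, sentences[1:]):
        w_prev, w_curr = first_word(prev), first_word(curr)
        same = w_prev is not None and w_prev == w_curr
        if same and not in_run:
            count += 1  # a new run starts here
        in_run = same
    return count


def anaphora_per_sentence(text):
    """Feature: number of anaphora occurrences divided by the number of sentences."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return count_anaphora(sentences) / len(sentences)
```

A real detector would also need to handle repeated phrases, clause-level segments, and stop words; the sketch only shows how a per-sentence frequency is derived from a list of detected occurrences.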
The following features were selected as stylometric features at the symbol and word level.
At the symbol level:
1. the number of letters, both individual letters and their total number;
2. the number of symbols, both individual symbols and their total number;
3. the average length of a sentence in symbols.
At the word level:
1. the number of words;
2. the number of sentences;
3. the average length of sentences by the number of words;
4. the average word length.
These stylometric features at the symbol and word level were chosen because they are the most revealing in determining the author's style when studying a work [4].
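The symbol- and word-level features listed above can be sketched with the standard library alone. This is an illustrative sketch, not the ProseRhythmDetector code (which, according to the paper, relies on textblob for word and sentence counting). The function name, the dictionary keys, the naive sentence splitter, and the decision to treat every non-whitespace character as a "symbol" are assumptions.

```python
import re
from collections import Counter


def simple_features(text):
    """Compute symbol- and word-level features for one text."""
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters (Unicode-aware, so Cyrillic works as well).
    words = re.findall(r"[^\W\d_]+", text)
    letters = [ch for ch in text if ch.isalpha()]
    # Assumption: a "symbol" is any non-whitespace character.
    symbols = [ch for ch in text if not ch.isspace()]
    n_sent, n_words = len(sentences), len(words)
    return {
        "letters_total": len(letters),
        "letter_counts": Counter(ch.lower() for ch in letters),
        "symbols_total": len(symbols),
        "symbol_counts": Counter(symbols),
        "words": n_words,
        "sentences": n_sent,
        # Sentence length in symbols includes the spaces inside the sentence.
        "avg_sentence_len_symbols": (
            sum(len(s) for s in sentences) / n_sent if n_sent else 0.0
        ),
        "avg_sentence_len_words": n_words / n_sent if n_sent else 0.0,
        "avg_word_len": (
            sum(len(w) for w in words) / n_words if n_words else 0.0
        ),
    }
```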

The Main Stages of Experiments
Stylometric features of three different levels are calculated and visualized automatically. Experiments with these features were set up as follows:
• First, rhythm figures were identified in the texts. The algorithms for searching for rhythm figures were taken from the article [12].
• Stylometric features were calculated for the identified rhythm figures.
• In parallel with the calculation of the rhythm features, the stylometric features of the symbol and word level were calculated for the texts.
• Stylometric features of the texts were aggregated by decades, and decades were compared with each other.
• In the last step, the comparison results were visualized using heatmaps and hierarchical clustering.
At the first stage, rhythm figures are identified with an accuracy of 80-95%. Thus, the authors get a high-quality model of text rhythm.
Rhythmic and simple stylometric features are calculated based on the text and the model of its rhythm according to the exact rules described in the previous section. As a result, each text is presented as a vector of numerical features. The vectors are compared using similarity measures, on the basis of which the visualization of the results is organized.
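The four measures used in this study (Chebyshev, Euclidean, and Manhattan distances and the correlation coefficient) can be written down directly for two feature vectors. The sketch below uses only the standard library; the function names are hypothetical, and expressing the correlation measure as one minus the Pearson coefficient is an assumption, since the paper does not state how the coefficient is turned into a distance.

```python
import math


def chebyshev(u, v):
    """Largest absolute difference over all features."""
    return max(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute feature differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def correlation_distance(u, v):
    """1 - Pearson correlation (assumed form); 0 means perfectly correlated."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - sum(a * b for a, b in zip(du, dv)) / norm


def distance_matrix(vectors, metric):
    """Square matrix of pairwise distances, the input for a similarity heatmap."""
    return [[metric(u, v) for v in vectors] for u in vectors]
```

The Chebyshev distance is driven by the single feature with the largest gap, whereas the Euclidean and Manhattan distances accumulate differences over all features.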

Visualization of Stylometric Features
After the stylometric features for texts are calculated, the stylometric features for decades are estimated. For each decade, average values of features of the texts published during this period are taken. This yields vectors of features of decades that are of the same type as vectors for individual texts.
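The aggregation described above amounts to grouping the text vectors by decade of publication and averaging them feature by feature. A minimal sketch follows, assuming each text is given as a hypothetical `(publication_year, feature_vector)` pair and that all vectors have the same length.

```python
from collections import defaultdict


def decade_of(year):
    """Map a publication year to the first year of its decade, e.g. 1815 -> 1810."""
    return (year // 10) * 10


def decade_means(texts):
    """Average the feature vectors of all texts published in the same decade.

    texts: iterable of (publication_year, feature_vector) pairs.
    Returns a dict {decade: mean_vector}, ordered by decade.
    """
    groups = defaultdict(list)
    for year, vector in texts:
        groups[decade_of(year)].append(vector)
    means = {}
    for decade, vectors in sorted(groups.items()):
        n = len(vectors)
        # zip(*vectors) iterates over features (columns) across the texts.
        means[decade] = [sum(column) / n for column in zip(*vectors)]
    return means
```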
Stylometric features of decades and their comparison are visualized in three ways:
• In the form of heatmaps that describe the similarity of decades in style. These are square heatmaps, on the axes of which decades are located, and a shade in a cell denotes the degree of similarity of a pair of decades: the darker the shade, the closer the objects are to each other. Four popular metrics were used as a similarity measure: Chebyshev distance, correlation coefficient, Euclidean distance, and Manhattan distance.
• In the form of heatmaps that describe ranges of stylometric features. The names of specific features are located on the horizontal axis, and decades are located on the vertical axis. Map cells contain the value of a feature and also have a color whose shade indicates the magnitude of the value relative to others. The largest values are indicated by light shades, the smallest ones are indicated by dark ones. A bar with a range of values and shades for different values is displayed on the map on the right.
• In the form of dendrograms obtained as a result of clustering. Dendrogram leaves are decades; they are placed horizontally. The distances between clusters are denoted vertically in the form of horizontal segments at a certain level. The dendrogram is built using an agglomerative approach, from leaves to stem. The used similarity metrics are the same as for heatmaps. Three methods are used as functions of the distance between the clusters: single-linkage, medium-linkage, and complete-linkage methods.
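The agglomerative construction of a dendrogram can be sketched as follows. The code takes a precomputed matrix of pairwise distances between decades and repeatedly merges the two closest clusters, recording the height of every merge. It is an illustrative, unoptimized sketch rather than the tool's implementation; the function names are hypothetical, and mapping the medium-linkage method to "average" linkage (the mean of all pairwise distances) is an assumption.

```python
def linkage_distance(cluster_a, cluster_b, dist, method):
    """Distance between two clusters of item indices under a linkage method."""
    pairs = [dist[i][j] for i in cluster_a for j in cluster_b]
    if method == "single":      # closest pair of items
        return min(pairs)
    if method == "complete":    # farthest pair of items
        return max(pairs)
    if method == "average":     # mean over all pairs (assumed "medium-linkage")
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(dist, method="complete"):
    """Merge clusters from leaves to stem and return the merge history.

    dist: square symmetric matrix of pairwise distances between items.
    Each history entry is (members_a, members_b, merge_height).
    """
    clusters = [(i,) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], dist, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history
```

The merge heights in the returned history correspond to the levels at which the horizontal segments of the dendrogram are drawn.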
To make a comparison of decades, the stylometric features are preliminarily normalized: the average value of a given feature over the entire text corpus is subtracted from a specific value, and the resulting difference is divided by the standard deviation of this feature.
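This normalization is the usual z-score. A minimal sketch is given below; the function name is hypothetical, and the use of the population standard deviation (division by n) is an assumption, since the paper does not specify the variant. Features that are constant across the corpus are mapped to zero to avoid division by zero.

```python
import math


def zscore_columns(rows):
    """Normalize each feature (column) to zero mean and unit standard deviation.

    rows: list of feature vectors of equal length, one per text.
    """
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Population standard deviation (assumption: the source does not say which).
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
```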
All three visualization methods are quite illustrative and allow one to analyze both the dynamics of changes in stylometric features over decades and the similarity of decades to each other in terms of style.

Software Implementation and Corpus
The ProseRhythmDetector tool, which makes it possible to identify and calculate stylometric features, was developed in Python. The development also used the textblob library, which was especially useful for counting words and sentences in a text.
After the completion of the development, a number of experiments were carried out on the basis of two text corpora. One of them is in English and the other is in Russian. Each corpus includes 243 works by more than 90 famous authors. The date of publication is indicated for each of the texts: for texts in English from 1815 to 2019, and for texts in Russian from 1832 to 2019. Each text contains up to 425000 words.

Similarity Heatmaps
Based on the resulting vectors of features for the text corpus, four sets of heatmaps were built based on the similarity metrics of the Chebyshev, Euclidean, correlation, and Manhattan distances. Euclidean metrics and Manhattan distance showed almost identical results at all considered levels. Relatively noticeable differences are observed only when all levels are combined.
The correlation metric did not help much in analyzing the literature for the specified period; however, it demonstrated the similarity of the 2000s and 2010s in both corpora (see Fig. 1). In addition, the heatmap built on the corpus of Russian texts shows a special similarity of the 1830s and 1840s. The map also shows that the 21st century is farther from the 19th century than from the 20th century and that the 1960s and 1920s are close to the early 19th century. Finally, the map suggests that the literature of the late 19th century was very similar to the literature of the early 19th century, while the works of the mid-19th century are very different from them.
The heatmap built on the basis of the corpus of English texts with the calculation by the correlation metric makes it possible to understand that the works of most of the 19th century are close to each other. However, as is the case with Russian literature, English literature of the mid-19th century stands out against the background of the rest of the century, albeit not so much.
The most indicative results were obtained with the Chebyshev metric. The maps based on the Russian corpus clearly show how distinct the 1950s are (see Fig. 2a). They differ from other periods in literature at all levels, but a special contribution to this difference is made by the symbol level (see Fig. 2b): all works of the considered epochs have similarities at the symbol level, while the 1950s are strikingly different from literature of all periods. At the word level (see Fig. 3a), we can distinguish the difference between the texts of the 21st century and those of the 19th century, and it is also worth noting the difference between the 1950s and the second half of the 19th century. In addition, the similarity of the 1930s and 1940s can be distinguished at all levels. The 1870s stand out against the background of all 19th century literature at all levels, and the rhythm level (see Fig. 3b) makes them different from all literature for the period under review. Finally, at the symbol and word levels, we can note the proximity of the second half of the 20th century to the early 21st century; however, the rhythm level slightly weakens this similarity.
Experiments on the corpus of English texts have shown the following results. At the symbol level (see Fig. 4a), the similarity of the 1890s and 1900s as well as the 2000s and 2010s can be distinguished. In addition, the 1950s are especially distinct at this level in comparison with all other decades. At the word level (see Fig. 4b), the period from the 1890s to the 1920s as well as the 21st century are already more distinct. The rhythm level (see Fig. 5a) contributes to the identification of the period from the 1990s to the 2010s, as well as the period from the 1850s to the 1970s and from the 1890s to the 1900s. Thus, the map reflecting all features at once (see Fig. 5b) permits us to clearly distinguish the interval from the 1870s to the 1920s as well as the 21st century and the 1990s.

Range Heatmaps
The second type of heatmaps (Fig. 6) displays the normalized values of features (the difference between the real and average value over the entire corpus divided by standard deviation) for both languages.
The first 14 columns represent rhythmic features: eight specific figures, the total number of occurrences of figures, the proportion of words repeated only once (unique word), and the proportions of the four parts of speech. The remaining seven columns are low-level features: the numbers of letters, all symbols, words, and sentences, as well as the average lengths of sentences in symbols (avg by ch) and words (avg by word) and the average word length (avg by word).
The rhythm figures of both languages show the tendency revealed in the previous research based on smaller text corpora: over the centuries, the total number of rhythm figures decreases. This can be seen from the shades of colors on the map: for diakopa and polysyndeton, as the most frequent figures, and for the total number of figures, the shades are lighter in the 19th century and darker at the end of the 20th century and the beginning of the 21st century. This means that these features are above average in the 19th century and below average closer to our time.
In British texts, this trend correlates with the use of adjectives: the proportion of their occurrences in rhythm figures also decreases towards the 21st century. The proportions of the remaining parts of speech do not fluctuate so much. For Russian texts, the pattern is different: the shares of nouns and adjectives increase in the 20th and 21st centuries. As for simple stylometric features, they show for British texts how the average size of literary works changes: it gradually decreases by the middle of the 20th century and then increases again. For Russian texts, the trend is reversed. These patterns should rather be attributed to the way the corpus was formed: well-known works by popular authors were chosen. If the corpus is enlarged with more varied works, the trends may change.
Average sentence and word lengths are the features that better reflect the style of texts than the absolute numbers of text elements. For British texts, they all diminish over almost all decades. The average length of sentences in Russian texts increases towards the end of the 19th century-the beginning of the 20th century, decreases slightly in the first half of the 20th century, grows by the 1950s and then decreases again. The average word length increases throughout all decades.
Both the similarity heatmaps and range maps show the decades that stand out against the background of the rest. Moreover, the range maps can additionally show what stylometric features the decades differ in. For British texts, these are the 1930s and the 1980s. For Russian texts, these are the 1870s and 1940s-1950s.
Thus, the range heatmaps allow both to identify general trends in style change of works for the language as a whole and to detect individual decades, which stand out against the background of the rest.

Dendrograms
Dendrograms were built for both languages separately. For each language, texts were clustered hierarchically both on the basis of individual types of stylometric features and on the basis of all types of features to compare the division of texts only by rhythm with the division by all stylometric features.
Among the functions of the distance between the clusters, the most illustrative results were shown by the complete-linkage method; the medium-linkage method showed results that were close to it. The single-linkage method identified fewer clusters than the rest.
As for the metrics of proximity between the elements, the Manhattan distance and Euclidean distance showed similar results, as in the case of heatmaps. The correlation coefficient showed more chaotic clustering than other methods.
The Euclidean metrics turned out to be the most useful for rhythm figures (see Fig. 7). For British texts, specific rhythm is clearly featured by several small clusters containing neighboring decades: 1990-2010, 1940-1950, 1890-1920, and 1850-1870. The 1830s and 1980s decades were the most distant in rhythm from the rest. The 21st century is most similar to the middle of the 20th century: 1940-1950s and 1970s.
For Russian texts, the dendrogram shows smaller rhythm distances between decades than for British texts. It shows two large clusters; the first contains a large part of the decades of the 20th-21st centuries, and the second one includes the 19th century and the beginning of the 20th century: 1900s-1910s. The 21st century does not stand out in rhythm as obviously as it does in British texts. The 1870s and 1940s prove to be the farthest from the rest.
For all stylometric means, the most obvious results were shown by the Chebyshev metric (see Fig. 8).
For British texts, some pairs of neighboring decades again turn out to be close to each other in style: 1890 and 1900, 1880 and 1870, 1910 and 1920, etc. In general, the distances between the decades are shorter; the beginning and middle of the 20th century turn out to be in the 19th century cluster. The 21st century is again close in style to the mid-20th century.
For Russian texts, the 21st century is distinguished into a separate cluster as regards style, but this cluster also includes the 1970s. The 19th century is split into two smaller clusters and is more similar to the 20th century. The 1940s and 1870s are again the farthest from the rest, and the 1950s join them.
Thus, the dendrograms show that the centuries differ in rhythm features more strongly than in the totality of the stylometric features. The 19th century and 1990s-2010s can form separate clusters, and the 20th century turns out to be much less homogeneous in terms of rhythm as well as simpler stylometric features.

DISCUSSION
Automatic determination of a complex of low-level and high-level stylometric features makes it possible to quickly analyze a large number of voluminous works and draw qualitative conclusions about changes in style over time. This approach allows an expert to obtain a detailed model of the style of artistic text in a short time.
The experiments have shown that although decades can be successfully clustered by proximity to each other, each of them is unique in terms of the totality of rhythmic and simple stylometric features. This means that a model based on these features could potentially be used to classify texts by century and decade of creation or publication and to estimate the year when a text was published.
In addition, text analysis by means of clustering based on heatmaps and dendrograms allows one to identify trends of changes in style in literature as a whole as well as to compare literatures in different languages with each other. In particular, Russian and British literatures are detected to have a decrease in the number of rhythm figures per unit of text (in this case, a sentence). For Russian literature, it has been found that the average lengths of sentences change in a wavelike manner during the period under consideration, while the average lengths of words increase. In British literature, the average lengths of both words and sentences significantly decrease. If we compare the centuries of both literatures, the 20th century turns out to be the most heterogeneous in terms of style, and the 21st and 19th centuries differ from each other, but the decades within them are quite similar.
In addition to looking for general trends, heatmaps and dendrograms make it possible to discover specific periods of time that are significantly different from others. This can be interpreted as the fact that this period included one or more texts, which especially strongly differ in style from their contemporaries. Thus, works with a unique style can be revealed in a large corpus of texts.
If we compare the significance of the stylometric features of different levels, then we can conclude that both low-level and rhythmic ones are quite useful and can detect the same large clusters of decades. However, the rhythmic features are more heterogeneous; therefore, they are the best indicators of style uniqueness. Thus, automated text modeling using stylometric features of different levels permits literatures of different languages and the eras of their development to be analyzed and successfully compared with each other.

CONCLUSIONS
The authors have conducted a study of stylometric features of different levels (symbols, words, and rhythm) based on two corpora of literary works of Russian and English literature of the 19th-21st centuries. Stylometric features were calculated fully automatically using the ProseRhythmDetector tool. This tool makes it possible to automatically find and statistically process rhythm figures in combination with simple stylometric text features, which enables one to analyze the style of prose from different angles and explore large corpora of texts.
The analysis of stylometric features has made it possible both to identify the main trends in style change over the 19th-21st centuries and to identify the periods of time that differ most from others in rhythm and style. In addition, the study has shown the importance of rhythmic features as markers of the peculiarities of prose style.
The next stage in the study of stylometric text features, including rhythmic ones, can be their use to classify texts by century or era of publication, to determine the author as well as to analyze and compare literature styles in other languages.