Modeling and Analysis of Information Systems

Comparison of Style Features for the Authorship Verification of Literary Texts

https://doi.org/10.18255/1818-1015-2021-3-250-259

Abstract

The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels; each fragment is about 50,000 characters long, and there are 40 fragments for each author. 20 authors who wrote in English, Russian, and French, and 8 Spanish-language authors are considered.

The authors of this paper use existing algorithms to calculate low-level features, which are popular in computational linguistics, and rhythm features, which are common in literary texts. Low-level features include word n-grams, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. These features include the frequency of occurrence of particular rhythm figures per 100 sentences, the number of unique words in the aspects of rhythm, and the percentage of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is treated as a binary classification problem: whether or not a text belongs to a particular author. AdaBoost and a neural network with an LSTM layer are used as classification algorithms. The experiments demonstrate the effectiveness of rhythm features in the verification of particular authors and show that, on average, combinations of feature types outperform single feature types. The best values of precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three types of features are combined.
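To make the two feature families above more concrete, the following is a minimal Python sketch, not the authors' implementation. It computes a few low-level features (relative letter and punctuation frequencies, average word and sentence lengths) and one rhythm feature (anaphora per 100 sentences). The function names, the naive sentence splitter, and the simplified definition of anaphora (two adjacent sentences starting with the same word) are all assumptions made for illustration; the algorithms used in the paper may differ.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words_of(text):
    # Lowercased runs of Unicode letters, so non-English alphabets are handled too.
    return re.findall(r"[^\W\d_]+", text.lower())


def low_level_features(text):
    # Hypothetical helper: returns a dict of simple character- and word-level features.
    sentences = split_sentences(text)
    words = words_of(text)
    letters = [c for c in text.lower() if c.isalpha()]
    features = {
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
    }
    # Relative frequency of each letter among all letters.
    for ch, n in Counter(letters).items():
        features["letter_" + ch] = n / len(letters)
    # Relative frequency of common punctuation marks among all characters.
    for ch, n in Counter(c for c in text if c in ".,;:!?-").items():
        features["punct_" + ch] = n / len(text)
    return features


def anaphora_per_100_sentences(text):
    # Simplified anaphora (assumption): adjacent sentences that begin with the same word.
    sentences = split_sentences(text)
    firsts = []
    for s in sentences:
        ws = words_of(s)
        firsts.append(ws[0] if ws else None)
    hits = sum(1 for a, b in zip(firsts, firsts[1:]) if a is not None and a == b)
    return 100.0 * hits / len(sentences) if sentences else 0.0


sample = "We walked home. We talked all night. The rain fell. It was cold!"
features = low_level_features(sample)
features["anaphora_per_100"] = anaphora_per_100_sentences(sample)
```

In a verification setting, one such feature dictionary per text fragment could be turned into a numeric vector and passed to a binary classifier, with the label indicating whether the fragment belongs to the author in question.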

About the Author

Ksenia Vladimirovna Lagutina
P.G. Demidov Yaroslavl State University
Russian Federation

Postgraduate student.

14 Sovetskaya str., Yaroslavl 150003




For citations:


Lagutina K.V. Comparison of Style Features for the Authorship Verification of Literary Texts. Modeling and Analysis of Information Systems. 2021;28(3):250-259. https://doi.org/10.18255/1818-1015-2021-3-250-259


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-1015 (Print)
ISSN 2313-5417 (Online)