Keywords, morpheme parsing and syntactic trees: features for text complexity assessment
https://doi.org/10.18255/1818-1015-2024-2-206-220
About the Authors
Dmitry A. Morozov
Russian Federation
Ivan A. Smal
Russian Federation
Timur A. Garipov
Russian Federation
Anna V. Glazkova
Russian Federation
References
1. R. Flesch, “A new readability yardstick,” Journal of Applied Psychology, vol. 32, no. 3, p. 221, 1948.
2. E. Dale and J. S. Chall, “A formula for predicting readability: Instructions,” Educational Research Bulletin, vol. 27, pp. 37–54, 1948.
3. R. J. Senter and E. A. Smith, “Automated readability index,” AMRL TR, 5302480, 1967.
4. M. Solnyshkina, V. Ivanov, and V. Solovyev, “Readability Formula for Russian Texts: A Modified Version,” in Proceedings of the 17th Mexican International Conference on Artificial Intelligence, Part II, 2018, pp. 132–145, doi: 10.1007/978-3-030-04497-8_11.
5. A. Churunina, M. Solnyshkina, E. Gafiyatova, and A. Zaikin, “Lexical Features of Text Complexity: the case of Russian academic texts,” SHS Web of Conferences, vol. 88, no. 1, p. 01009, 2020, doi: 10.1051/shsconf/20208801009.
6. D. A. Morozov, A. V. Glazkova, and B. L. Iomdin, “Text complexity and linguistic features: Their correlation in English and Russian,” Russian Journal of Linguistics, vol. 26, no. 2, pp. 426–448, 2022, doi: 10.22363/2687-0088-30132.
7. N. Karpov, J. Baranova, and F. Vitugin, “Single-Sentence Readability Prediction in Russian,” in Analysis of Images, Social Networks and Texts, Cham, 2014, pp. 91–100, doi: 10.1007/978-3-319-12580-0_9.
8. V. V. Ivanov, M. I. Solnyshkina, and V. D. Solovyev, “Efficiency of Text Readability Features in Russian Academic Texts,” Komp’juternaja Lingvistika I Intellektual’nye Tehnologii, vol. 17, pp. 267–283, 2018.
9. O. Blinova and N. Tarasov, “A hybrid model of complexity estimation: Evidence from Russian legal texts,” Frontiers in Artificial Intelligence, vol. 5, p. 1008530, 2022, doi: 10.3389/frai.2022.1008530.
10. U. Isaeva and A. Sorokin, “Investigating the Robustness of Reading Difficulty Models for Russian Educational Texts,” in Recent Trends in Analysis of Images, Social Networks and Texts, Cham, 2021, pp. 65–77, doi: 10.1007/978-3-030-71214-3_6.
11. A. N. Laposhina, T. S. Veselovskaya, M. U. Lebedeva, and O. F. Kupreshchenko, “Lexical analysis of the Russian language textbooks for primary school: corpus study,” in Komp’juternaja Lingvistika I Intellektual’nye Tehnologii, 2019, vol. 18, pp. 351–363.
12. V. Solovyev, V. Ivanov, and M. Solnyshkina, “Readability formulas for three levels of Russian school textbooks,” Investigations on Applied Mathematics and Informatics. Part II-1, vol. 529, pp. 140–156, 2023.
13. A. N. Laposhina, M. Y. Lebedeva, and A. A. Berlin Khenis, “Word frequency and text complexity: an eye-tracking study of young Russian readers,” Russian Journal of Linguistics, vol. 26, no. 2, pp. 493–514, 2022, doi: 10.22363/2687-0088-30084.
14. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
15. A. Glazkova, Y. Egorov, and M. Glazkov, “A Comparative Study of Feature Types for Age-Based Text Classification,” in Analysis of Images, Social Networks and Texts, Cham, 2021, pp. 120–134, doi: 10.1007/978-3-030-72610-2_9.
16. F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
17. A. Kutuzov and E. Kuzmenko, “WebVectors: a toolkit for building web interfaces for vector semantic models,” in Analysis of Images, Social Networks and Texts, 2017, pp. 155–161, doi: 10.1007/978-3-319-52920-2_15.
18. D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
19. N. Reimers and I. Gurevych, “Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4512–4525, doi: 10.18653/v1/2020.emnlp-main.365.
20. N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992, doi: 10.18653/v1/D19-1410.
21. P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning, “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020, pp. 101–108, doi: 10.18653/v1/2020.acl-demos.14.
22. M. Korobov, “Morphological analyzer and generator for Russian and Ukrainian languages,” in International Conference on Analysis of Images, Social Networks and Texts, 2015, pp. 320–332, doi: 10.1007/978-3-319-26123-2_31.
23. E. Loper and S. Bird, “NLTK: The Natural Language Toolkit,” in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002, pp. 63–70.
24. A. V. Glazkova, D. A. Morozov, M. S. Vorobeva, and A. Stupnikov, “Keyphrase generation for the Russian-language scientific texts using mT5,” Modeling and Analysis of Information Systems, vol. 30, no. 4, pp. 418–428, 2023, doi: 10.18255/1818-1015-2023-4-418-428.
25. L. Xue et al., “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 483–498, doi: 10.18653/v1/2021.naacl-main.41.
26. C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
27. T. Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45, doi: 10.18653/v1/2020.emnlp-demos.6.
28. O. N. Lyashevskaya and S. A. Sharov, Chastotnyj slovar' sovremennogo russkogo yazyka: na materialah Nacional'nogo korpusa russkogo yazyka [Frequency Dictionary of the Modern Russian Language: Based on the Russian National Corpus]. Azbukovnik, 2009.
29. B. L. Iomdin, “How to Define Words with the Same Root?,” Russian Speech = Russkaya Rech,’ vol. 1, pp. 109–115, 2019, doi: 10.31857/S013161170003980-7.
30. A. Sorokin and A. Kravtsova, “Deep Convolutional Networks for Supervised Morpheme Segmentation of Russian Language,” in Artificial Intelligence and Natural Language, Cham, 2018, pp. 3–10, doi: 10.1007/978-3-030-01204-5_1.
31. E. I. Bolshakova and A. S. Sapin, “Comparing models of morpheme analysis for Russian words based on machine learning,” in Komp’juternaja Lingvistika I Intellektual’nye Tehnologii, 2019, vol. 18, pp. 104–113.
32. E. Bolshakova and A. Sapin, “Bi-LSTM Model for Morpheme Segmentation of Russian Words,” in Artificial Intelligence and Natural Language, Cham, 2019, pp. 151–160, doi: 10.1007/978-3-030-34518-1_11.
33. A. N. Tikhonov, Slovoobrazovatel’nyi slovar’ russkogo yazyka [Word-Formation Dictionary of the Russian Language]. Moscow: Russkiy yazyk, 1990.
34. T. Garipov, D. Morozov, and A. Glazkova, “Generalization ability of CNN-based Morpheme Segmentation,” in 2023 Ivannikov Ispras Open Conference (ISPRAS), 2024, pp. 58–62.
35. A. I. Kuznetsova and T. F. Efremova, Dictionary of Morphemes of the Russian Language. Firebird Publications, Incorporated, 1986.
36. T. M. Cover and J. A. Thomas, “Entropy, Relative Entropy, and Mutual Information,” in Elements of Information Theory, John Wiley & Sons, Ltd, 2005, pp. 13–55.
37. L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. Chapman and Hall/CRC, 1984.
38. A. Altmann, L. Tolosi, O. Sander, and T. Lengauer, “Permutation importance: A corrected feature importance measure,” Bioinformatics (Oxford, England), vol. 26, no. 10, pp. 1340–1347, 2010, doi: 10.1093/bioinformatics/btq134.
For citations:
Morozov D.A., Smal I.A., Garipov T.A., Glazkova A.V. Keywords, morpheme parsing and syntactic trees: features for text complexity assessment. Modeling and Analysis of Information Systems. 2024;31(2):206-220. (In Russ.) https://doi.org/10.18255/1818-1015-2024-2-206-220