Automated morpheme segmentation algorithms for the Belarusian language: comparison of approaches
https://doi.org/10.18255/1818-1015-2025-4-384-395
About the Authors
Dmitry A. Morozov
Russian Federation
Grigorii O. Feoktistov
Russian Federation
Anna V. Glazkova
Russian Federation
References
1. P. Gage, “A new algorithm for data compression,” The C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.
2. A. Matthews, G. Neubig, and C. Dyer, “Using Morphological Knowledge in Open-Vocabulary Neural Language Models,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018, pp. 1435–1445, doi: 10.18653/v1/N18-1130.
3. A. Nzeyimana and A. Niyongabo Rubungo, “KinyaBERT: a Morphology-aware Kinyarwanda Language Model,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5347–5363, doi: 10.18653/v1/2022.acl-long.367.
4. M. W. Kildeberg, E. A. Schledermann, N. Larsen, and R. van der Goot, “From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time.” 2025.
5. E. Asgari, Y. E. Kheir, and M. A. S. Javaheri, “MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies.” 2025.
6. S. O. Savchuk et al., “Russian National Corpus 2.0: New opportunities and development prospects,” Voprosy Jazykoznanija, no. 2, pp. 7–34, 2024, doi: 10.31857/0373-658X.2024.2.7-34.
7. M. Olbrich and Z. Žabokrtský, “Morphological Segmentation with Neural Networks: Performance Effects of Architecture, Data Size, and Cross-Lingual Transfer in Seven Languages,” in Text, Speech, and Dialogue, 2026, pp. 275–286, doi: 10.1007/978-3-032-02551-7_24.
8. D. Morozov, L. Astapenka, A. Glazkova, T. Garipov, and O. Lyashevskaya, “BERT-like Models for Slavic Morpheme Segmentation,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6795–6815, doi: 10.18653/v1/2025.acl-long.337.
9. M. Pranjić, M. Robnik-Šikonja, and S. Pollak, “LLMSegm: Surface-level Morphological Segmentation Using Large Language Model,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 10665–10674.
10. C. Anderson, M. Nguyen, and R. Coto-Solano, “Unsupervised, Semi-Supervised and LLM-Based Morphological Segmentation for Bribri,” in Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), 2025, pp. 63–76, doi: 10.18653/v1/2025.americasnlp-1.7.
11. K. Batsuren et al., “The SIGMORPHON 2022 Shared Task on Morpheme Segmentation,” in Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2022, pp. 103–116, doi: 10.18653/v1/2022.sigmorphon-1.11.
12. B. Peters and A. F. T. Martins, “Beyond Characters: Subword-level Morpheme Segmentation,” in Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2022, pp. 131–138, doi: 10.18653/v1/2022.sigmorphon-1.14.
13. S. Wehrli, S. Clematide, and P. Makarov, “CLUZH at SIGMORPHON 2022 Shared Tasks on Morpheme Segmentation and Inflection Generation,” in Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2022, pp. 212–219, doi: 10.18653/v1/2022.sigmorphon-1.21.
14. A. Sorokin, “Improving Morpheme Segmentation Using BERT Embeddings,” in Analysis of Images, Social Networks and Texts, 2022, pp. 148–161, doi: 10.1007/978-3-031-16500-9_13.
15. A. Sorokin and A. Kravtsova, “Deep Convolutional Networks for Supervised Morpheme Segmentation of Russian Language,” in Artificial Intelligence and Natural Language, 2018, pp. 3–10, doi: 10.1007/978-3-030-01204-5_1.
16. E. I. Bolshakova and A. S. Sapin, “Comparing models of morpheme analysis for Russian words based on machine learning,” in Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue,” 2019, vol. 18, pp. 104–113.
17. E. Bolshakova and A. Sapin, “Bi-LSTM Model for Morpheme Segmentation of Russian Words,” in Artificial Intelligence and Natural Language, 2019, pp. 151–160, doi: 10.1007/978-3-030-34518-1_11.
18. T. Garipov, D. Morozov, and A. Glazkova, “Generalization ability of CNN-based Morpheme Segmentation,” in Proceedings of the Ivannikov Ispras Open Conference (ISPRAS), 2024, pp. 58–62, doi: 10.1109/ISPRAS60948.2023.10508171.
19. D. Morozov, T. Garipov, O. Lyashevskaya, S. Savchuk, B. Iomdin, and A. Glazkova, “Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?,” Journal of Language and Education, vol. 10, no. 4, pp. 71–84, 2024, doi: 10.17323/jle.2024.22237.
20. L. S. Mormysh, A. M. Bardovich, and L. M. Shakun, School morpheme dictionary of the Belarusian language [Shkol'ny marfemny slovnik belaruskaj movy]. Minsk: Aversev, 2005.
21. A. Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.
22. D. Zmitrovich et al., “A Family of Pretrained Transformer Language Models for Russian,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 507–524.
23. M. Arkhipov, M. Trofimova, Y. Kuratov, and A. Sorokin, “Tuning Multilingual Transformers for Language-Specific Named Entity Recognition,” in Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 2019, pp. 89–93, doi: 10.18653/v1/W19-3712.
24. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
25. V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” 2020.
For citations:
Morozov D.A., Feoktistov G.O., Glazkova A.V. Automated morpheme segmentation algorithms for the Belarusian language: comparison of approaches. Modeling and Analysis of Information Systems. 2025;32(4):384-395. (In Russ.) https://doi.org/10.18255/1818-1015-2025-4-384-395