Modeling and Analysis of Information Systems

Automated morpheme segmentation algorithms for the Belarusian language: comparison of approaches

https://doi.org/10.18255/1818-1015-2025-4-384-395

Abstract

Automated morpheme segmentation for morphologically rich but low-resource languages, such as Belarusian, remains insufficiently studied. This paper presents the first large-scale comparative study of modern neural network approaches to morpheme segmentation on Belarusian data. We compared three approaches that have demonstrated high quality for other languages: algorithms based on convolutional neural networks (CNNs), algorithms based on LSTM networks, and fine-tuning of BERT-like models. Because few monolingual Belarusian models are available, we also included larger Russian and multilingual models in the comparison. The experiments were conducted on the openly available Slounik dataset using two strategies for splitting the data into training and test sets. In the first, the split was random; in the second, words were split by their roots so that words sharing a root never appeared in both the training and test sets. An ensemble of LSTM networks achieved the best performance, with a word accuracy of 91.42% on the random split and 73.89% on the root-based split. Fine-tuned multilingual and Russian BERT-like models achieved comparable results, highlighting the potential of applying large models, including those trained on closely related and higher-resource languages, to this task. An error analysis confirmed that, as with other Slavic languages, the majority of errors involve the identification of root boundaries.
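
The root-based split and the word-accuracy metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy entries, the function names `root_based_split` and `word_accuracy`, the one-root-per-word simplification, and the reading of word accuracy as "share of words whose whole segmentation matches the reference exactly" are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy data (NOT from the Slounik dataset):
# each entry is (word, root, segmentation with '-' between morphemes).
ENTRIES = [
    ("walker", "walk", "walk-er"),
    ("walking", "walk", "walk-ing"),
    ("walked", "walk", "walk-ed"),
    ("undo", "do", "un-do"),
    ("redo", "do", "re-do"),
    ("reader", "read", "read-er"),
    ("reread", "read", "re-read"),
    ("player", "play", "play-er"),
]


def root_based_split(entries, test_fraction=0.25, seed=0):
    """Split entries so that no root occurs in both train and test.

    Assumes one root per word; test_fraction is the share of
    distinct roots (not words) sent to the test set.
    """
    by_root = defaultdict(list)
    for entry in entries:
        by_root[entry[1]].append(entry)

    roots = sorted(by_root)               # sort first for reproducibility
    random.Random(seed).shuffle(roots)
    n_test = max(1, round(len(roots) * test_fraction))
    test_roots = set(roots[:n_test])

    train = [e for r in roots if r not in test_roots for e in by_root[r]]
    test = [e for r in roots if r in test_roots for e in by_root[r]]
    return train, test


def word_accuracy(gold, predicted):
    """Fraction of words whose full segmentation matches the reference."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


train, test = root_based_split(ENTRIES)
train_roots = {root for _, root, _ in train}
test_roots = {root for _, root, _ in test}
```

Because whole root families are held out, the model is evaluated on roots it has never seen, which is a plausible reason the reported accuracy is lower on the root-based split than on the random one.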

About the Authors

Dmitry A. Morozov
Novosibirsk National Research State University
Russian Federation


Grigorii O. Feoktistov
Novosibirsk National Research State University
Russian Federation


Anna V. Glazkova
University of Tyumen
Russian Federation


For citations:


Morozov D.A., Feoktistov G.O., Glazkova A.V. Automated morpheme segmentation algorithms for the Belarusian language: comparison of approaches. Modeling and Analysis of Information Systems. 2025;32(4):384-395. (In Russ.) https://doi.org/10.18255/1818-1015-2025-4-384-395


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-1015 (Print)
ISSN 2313-5417 (Online)