Comparison of pre-trained models for domain-specific entity extraction from student report documents
https://doi.org/10.18255/1818-1015-2025-1-66-79
Abstract
The authors propose a methodology for extracting domain-specific entities from student report documents in Russian using pre-trained transformer-based language models. This is a relevant task, since the extracted data can serve a range of purposes, from forming project teams to personalizing learning pathways; in addition, automating the document processing workflow reduces the labor costs of manual processing. Expert-annotated student report documents served as training material. These documents were created between 2019 and 2022 by students in information technology programs for project-based and practical courses and for theses. The domain-specific entity extraction task is approached as two subtasks: named entity recognition (NER) and annotated text generation. A comparative analysis was conducted of encoder-only models (ruBERT, ruRoBERTa) for NER and of encoder-decoder (ruT5, mBART) and decoder-only (ruGPT, T-lite) models for text generation. Model effectiveness was evaluated using the F1-score, together with an analysis of common errors. The highest F1-score on the test set was achieved by mBART (93.55%); this model also showed the lowest error rate in identifying domain-specific entities during annotated text generation. The NER models were less error-prone overall but tended to extract domain-specific entities in fragments. The results indicate that the examined models are applicable to the stated tasks, subject to the specific requirements of the problem.
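To make the NER formulation concrete, below is a minimal sketch of token classification with BIO tags using the HuggingFace Transformers library. The checkpoint name and the tag set are illustrative assumptions, not the authors' exact configuration, and the freshly initialized classification head would need to be fine-tuned on the annotated reports before its predictions are meaningful.

```python
# Minimal sketch of the NER formulation: per-token BIO tagging with ruBERT.
# The tag set and checkpoint are hypothetical; the paper's labels and data differ.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-SKILL", "I-SKILL", "B-TOOL", "I-TOOL"]  # hypothetical tag set
model_name = "DeepPavlov/rubert-base-cased"  # a public ruBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

text = "Разработано веб-приложение на Python и Django."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Map each subword token to its predicted BIO tag
# (random until the head is fine-tuned on annotated reports).
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    print(token, labels[pred_id])
```

In the alternative generation formulation, a sequence-to-sequence model such as mBART would instead be fine-tuned to reproduce the report text with entity markup inserted; in both settings, entity-level F1 (e.g., via the seqeval package) can then be computed against the expert annotation.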
Keywords
MSC2020: 68T50, 97B40
About the Authors
Antonina V. Melnikova
Russian Federation
Marina S. Vorobeva
Russian Federation
Anna V. Glazkova
Russian Federation
For citations:
Melnikova A.V., Vorobeva M.S., Glazkova A.V. Comparison of pre-trained models for domain-specific entity extraction from student report documents. Modeling and Analysis of Information Systems. 2025;32(1):66-79. (In Russ.) https://doi.org/10.18255/1818-1015-2025-1-66-79