Comparison of pre-trained models for domain-specific entity extraction from student report documents

Antonina Vladimirovna Melnikova; Marina Sergeevna Vorobeva; Anna Valeryevna Glazkova

doi:10.18255/1818-1015-2025-1-66-79

Abstract

The authors propose a methodology for extracting domain-specific entities (DSEs) from Russian-language student report documents using pre-trained transformer-based language models. Extracting DSEs from student work is a relevant task for two reasons: the extracted data can serve purposes ranging from forming project teams to personalizing learning paths, and automating document processing reduces the labor spent on manual processing. The models were fine-tuned on expert-annotated report documents written by students in information technology programs who enrolled between 2019 and 2022. The documents cover project-based courses, practical courses, and final qualifying works. DSE extraction is treated as two tasks: named entity recognition and generation of tagged text. The comparison covers encoder-only models (ruBERT, ruRoBERTa), which are used for named entity recognition, and encoder-decoder models (ruT5, mBART) and decoder-only models (ruGPT, T-lite), which are used for text generation. The models were evaluated with the F-measure, and their typical errors were analyzed. The mBART model achieved the highest F-measure on the test set (93.55%). The same model made the fewest errors in identifying DSEs when generating and tagging text. Named entity recognition models are less prone to errors but tend to extract DSEs in fragments. The results indicate that the models considered are applicable to the stated tasks, provided the specific requirements of each task are taken into account.
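The two task formulations and the F-measure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the inline `<e>...</e>` tag format, whitespace tokenization, the plain B/I/O label set, the example sentence, and the exact-match span variant of the F-measure are all assumptions made for illustration, and the helper names are hypothetical.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int]  # token indices, half-open interval [start, end)

# Hypothetical inline markup for the "generation of tagged text" formulation.
TAG = re.compile(r"<e>(.*?)</e>")


def spans_from_bio(labels: List[str]) -> Set[Span]:
    """Collect entity spans from per-token B/I/O labels (NER formulation)."""
    spans, start = set(), None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if start is not None and label != "I":
            spans.add((start, i))
            start = None
        if label == "B":
            start = i
    return spans


def spans_from_tagged(text: str) -> Set[Span]:
    """Collect entity spans from text with inline <e>...</e> markup."""
    spans, n_tokens, pos = set(), 0, 0
    for match in TAG.finditer(text):
        # Count whitespace-separated tokens that precede this entity.
        n_tokens += len(text[pos:match.start()].split())
        length = len(match.group(1).split())
        if length:
            spans.add((n_tokens, n_tokens + length))
        n_tokens += length
        pos = match.end()
    return spans


def f_measure(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match span F-measure: harmonic mean of precision and recall."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Gold annotation: "Django REST framework" and "PostgreSQL" are entities.
gold = spans_from_bio(["O", "O", "O", "O", "B", "I", "I", "O", "B"])

# A generative model that reproduces the sentence with correct markup.
generated = "built a service with <e>Django REST framework</e> and <e>PostgreSQL</e>"

# A tagger that splits the first entity into fragments.
fragmented = spans_from_bio(["O", "O", "O", "O", "B", "O", "B", "O", "B"])

score_generated = f_measure(gold, spans_from_tagged(generated))
score_fragmented = f_measure(gold, fragmented)
```

Under exact-match scoring, a fragmented prediction counts as several wrong spans rather than one partially correct span, so the fragmentary extraction described in the abstract lowers both precision and recall. A partial-match variant of the F-measure would penalize it less.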

Keywords

domain-specific entities, digital footprint, information extraction, natural language processing, pre-trained language models

MSC2020: 68T50, 97B40

About the authors

Antonina Vladimirovna Melnikova

Tyumen State University
Russia

Marina Sergeevna Vorobeva

Tyumen State University
Russia

Anna Valeryevna Glazkova

Tyumen State University
Russia

For citation:

Melnikova A.V., Vorobeva M.S., Glazkova A.V. Comparison of pre-trained models for domain-specific entity extraction from student report documents. Modeling and Analysis of Information Systems. 2025;32(1):66-79. (In Russ.) https://doi.org/10.18255/1818-1015-2025-1-66-79

Content is available under the Creative Commons Attribution 4.0 License.

ISSN 1818-1015 (Print)
ISSN 2313-5417 (Online)
