Extracting named entities from Russian-language documents with different expressiveness of structure
https://doi.org/10.18255/1818-1015-2023-4-382-393
EDN: NVTLNK
Abstract
This work addresses named entity recognition in Russian-language texts using a conditional random field (CRF) model. Two datasets were considered: refinancing documents, which have a well-defined structure, and court records, which are semi-structured texts. The model was tested with various sets of text features and CRF parameters (optimization algorithms). Averaged over all entities, the best F-measure was 0.99 for the structured documents and 0.86 for the semi-structured ones.
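To make the idea of "sets of text features" concrete, the following is a minimal sketch of how per-token features for a CRF tagger are commonly built. The abstract does not list the features the authors used, so the feature names, the context window of one token, and the example sentence are all illustrative assumptions, not the published configuration.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i].

    The feature set is hypothetical; the paper's actual features
    are not given in the abstract.
    """
    word = tokens[i]
    features = {
        "bias": 1.0,
        "lower": word.lower(),          # case-normalised word form
        "suffix3": word[-3:].lower(),   # crude proxy for Russian inflection
        "is_title": word.istitle(),     # capitalised, e.g. names and places
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
    }
    # Context features from the neighbouring tokens (window of one).
    if i > 0:
        prev = tokens[i - 1]
        features["prev_lower"] = prev.lower()
        features["prev_is_title"] = prev.istitle()
    else:
        features["BOS"] = True          # beginning of sentence
    if i < len(tokens) - 1:
        nxt = tokens[i + 1]
        features["next_lower"] = nxt.lower()
        features["next_is_title"] = nxt.istitle()
    else:
        features["EOS"] = True          # end of sentence
    return features


def sentence_features(tokens):
    """Return one feature dict per token of the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Made-up example sentence in the style of a court record.
sentence = ["Истец", "Иванов", "обратился", "в", "суд", "города", "Ярославля"]
feats = sentence_features(sentence)
```

A list of such dictionaries per sentence is the input shape that CRF toolkits such as sklearn-crfsuite accept. Comparing feature sets then amounts to adding or removing keys in `token_features` and retraining with each optimization algorithm in turn.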
About the Authors
Maria D. Averina
Russian Federation
Olga A. Levanova
Russian Federation
For citations:
Averina M.D., Levanova O.A. Extracting named entities from Russian-language documents with different expressiveness of structure. Modeling and Analysis of Information Systems. 2023;30(4):382-393. (In Russ.) https://doi.org/10.18255/1818-1015-2023-4-382-393. EDN: NVTLNK