Russian-Language Thesauri: Automated Construction and Application for Natural Language Processing Tasks
https://doi.org/10.18255/1818-1015-2018-4-435-458
Abstract
The paper reviews existing Russian-language thesauri available in digital form, together with methods for their automatic construction and application. The authors analyzed the main characteristics of open-access thesauri for scientific research and evaluated trends in their development and their effectiveness in solving natural language processing tasks. They studied statistical and linguistic methods of thesaurus construction that make it possible to automate development and reduce the labor costs of expert linguists. In particular, the authors considered algorithms for extracting keywords and semantic thesaurus relationships of all types, as well as the quality of the thesauri generated with these tools. To illustrate the features of various methods for constructing thesaurus relationships, the authors developed a combined method that generates a specialized thesaurus fully automatically from a text corpus in a particular domain and several existing linguistic resources. Using the proposed method, experiments were conducted on two Russian-language text corpora from two subject areas: articles about migrants and tweets. The resulting thesauri were evaluated with an integrated assessment developed in the authors' previous study, which makes it possible to analyze various aspects of a thesaurus and the quality of the generation methods. The analysis revealed the main advantages and disadvantages of various approaches to thesaurus construction and to the extraction of semantic relationships of different types, and made it possible to determine directions for future study.
About the Authors
Nadezhda S. Lagutina
Russian Federation
Ksenia V. Lagutina
Russian Federation
student
Aleksey S. Adrianov
Russian Federation
student
Ilya V. Paramonov
Russian Federation
associate professor, PhD
Review
For citations:
Lagutina N.S., Lagutina K.V., Adrianov A.S., Paramonov I.V. Russian-Language Thesauri: Automated Construction and Application for Natural Language Processing Tasks. Modeling and Analysis of Information Systems. 2018;25(4):435-458. (In Russ.) https://doi.org/10.18255/1818-1015-2018-4-435-458