<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">mais</journal-id><journal-title-group><journal-title xml:lang="ru">Моделирование и анализ информационных систем</journal-title><trans-title-group xml:lang="en"><trans-title>Modeling and Analysis of Information Systems</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1818-1015</issn><issn pub-type="epub">2313-5417</issn><publisher><publisher-name>Yaroslavl State University</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.18255/1818-1015-2022-4-334-347</article-id><article-id custom-type="elpub" pub-id-type="custom">mais-1750</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Theory of Data</subject></subj-group></article-categories><title-group><article-title>Классификация русскоязычных текстов по жанрам на основе современных эмбеддингов и ритма</article-title><trans-title-group xml:lang="en"><trans-title>Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1742-3240</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Лагутина</surname><given-names>Ксения Владимировна</given-names></name><name name-style="western" xml:lang="en"><surname>Lagutina</surname><given-names>Ksenia Vladimirovna</given-names></name></name-alternatives><email xlink:type="simple">lagutinakv@mail.ru</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives 
id="aff-1"><aff xml:lang="ru">Ярославский государственный университет им. П. Г. Демидова<country>Россия</country></aff><aff xml:lang="en">P. G. Demidov Yaroslavl State University<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2022</year></pub-date><pub-date pub-type="epub"><day>18</day><month>12</month><year>2022</year></pub-date><volume>29</volume><issue>4</issue><fpage>334</fpage><lpage>347</lpage><permissions><copyright-statement>Copyright &#x00A9; Лагутина К.В., 2022</copyright-statement><copyright-year>2022</copyright-year><copyright-holder xml:lang="ru">Лагутина К.В.</copyright-holder><copyright-holder xml:lang="en">Lagutina K.V.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://www.mais-journal.ru/jour/article/view/1750">https://www.mais-journal.ru/jour/article/view/1750</self-uri><abstract><p>В статье исследуются современные векторные модели текстов для решения задачи классификации русскоязычных текстов по жанрам. Модели включают эмбеддинги ELMo, языковую модель BERT с предобучением и комплекс числовых ритмических характеристик на основе лексико-грамматических средств. Эксперименты проводились на корпусе из 10 000 текстов пяти жанров: романы, научные статьи, отзывы, посты из социальной сети Вконтакте, новости из OpenCorpora. Визуализация и анализ статистики для ритмических характеристик позволили выделить как наиболее разнообразные по ритму жанры: романы и отзывы, так и наименее - научные статьи. Именно эти жанры были впоследствии классифицированы лучше всего с помощью ритма и нейросети-классификатора LSTM. Кластеризация и классификация текстов по жанрам с помощью эмбеддингов ELMo и BERT позволила отделить один жанр от другого с небольшим количеством ошибок. 
F-мера мультиклассификации достигла 99%. Исследование подтверждает эффективность современных эмбеддингов в задачах компьютерной лингвистики, а также позволяет выделить достоинства и ограничения комплекса ритмических характеристик на материале классификации по жанрам.</p></abstract><trans-abstract xml:lang="en"><p>The article investigates modern vector text models for genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features based on lexico-grammatical devices. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network Vkontakte, and news from OpenCorpora. Visualization and statistical analysis of the rhythm features identified the genres most diverse in rhythm (novels and reviews) and the least diverse (scientific articles). These same genres were subsequently classified best using rhythm features and an LSTM neural network classifier. Clustering and classification of texts by genre using ELMo and BERT embeddings separated the genres from one another with few errors. The multiclass classification F-score reached 99%. 
The study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the set of rhythm features for genre classification.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>стилометрия</kwd><kwd>обработка естественного языка</kwd><kwd>ритмические характеристики</kwd><kwd>жанры</kwd><kwd>классификация текстов</kwd><kwd>BERT</kwd><kwd>ELMo</kwd></kwd-group><kwd-group xml:lang="en"><kwd>stylometry</kwd><kwd>natural language processing</kwd><kwd>rhythm features</kwd><kwd>genres</kwd><kwd>text classification</kwd><kwd>BERT</kwd><kwd>ELMo</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">L. A. Kochetova and V. V. Popov, “Research of Axiological Dominants in Press Release Genre based on Automatic Extraction of Key Words from Corpus”, Nauchnyi dialog, no. 6, 2019, In Russian.</mixed-citation><mixed-citation xml:lang="en">L. A. Kochetova and V. V. Popov, “Research of Axiological Dominants in Press Release Genre based on Automatic Extraction of Key Words from Corpus”, Nauchnyi dialog, no. 6, 2019, In Russian.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">B. Kessler, G. Nunberg, and H. Schütze, “Automatic detection of text genre”, in Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, 1997, pp. 32-38.</mixed-citation><mixed-citation xml:lang="en">B. Kessler, G. Nunberg, and H. Schütze, “Automatic detection of text genre”, in Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, 1997, pp. 
32-38.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">A. Onan, “An ensemble scheme based on language function analysis and feature engineering for text genre classification”, Journal of Information Science, vol. 44, no. 1, pp. 28-47, 2018.</mixed-citation><mixed-citation xml:lang="en">A. Onan, “An ensemble scheme based on language function analysis and feature engineering for text genre classification”, Journal of Information Science, vol. 44, no. 1, pp. 28-47, 2018.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Z. Dai and R. Huang, “A Joint Model for Structure-based News Genre Classification with Application to Text Summarization”, in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 3332-3342.</mixed-citation><mixed-citation xml:lang="en">Z. Dai and R. Huang, “A Joint Model for Structure-based News Genre Classification with Application to Text Summarization”, in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 3332-3342.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">K. V. Lagutina, N. S. Lagutina, and E. I. Boychuk, “Text classification by genre based on rhythm features”, Modeling and Analysis of Information Systems, vol. 28, no. 3, pp. 280-291, 2021.</mixed-citation><mixed-citation xml:lang="en">K. V. Lagutina, N. S. Lagutina, and E. I. Boychuk, “Text classification by genre based on rhythm features”, Modeling and Analysis of Information Systems, vol. 28, no. 3, pp. 280-291, 2021.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">K. Lagutina, A. Poletaev, N. Lagutina, E. Boychuk, and I. 
Paramonov, “Automatic extraction of rhythm figures and analysis of their dynamics in prose of 19th-21st centuries”, Proceedings of the 26th Conference of Open Innovations Association FRUCT, pp. 247-255, 2020.</mixed-citation><mixed-citation xml:lang="en">K. Lagutina, A. Poletaev, N. Lagutina, E. Boychuk, and I. Paramonov, “Automatic extraction of rhythm figures and analysis of their dynamics in prose of 19th-21st centuries”, Proceedings of the 26th Conference of Open Innovations Association FRUCT, pp. 247-255, 2020.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep Contextualized Word Representations”, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2227-2237.</mixed-citation><mixed-citation xml:lang="en">M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep Contextualized Word Representations”, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2227-2237.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.</mixed-citation><mixed-citation xml:lang="en">J. Devlin, M.-W. Chang, K. Lee, and K. 
Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">C. Wang, P. Nulty, and D. Lillis, “A comparative study on word embeddings in deep learning for text classification”, in Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, 2020, pp. 37-46.</mixed-citation><mixed-citation xml:lang="en">C. Wang, P. Nulty, and D. Lillis, “A comparative study on word embeddings in deep learning for text classification”, in Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, 2020, pp. 37-46.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Y. Kuratov and M. Arkhipov, “Adaptation of deep bidirectional multilingual transformers for Russian language”, in Komp’juternaja Lingvistika i Intellektual’nye Tehnologii, 2019, pp. 333-339.</mixed-citation><mixed-citation xml:lang="en">Y. Kuratov and M. Arkhipov, “Adaptation of deep bidirectional multilingual transformers for Russian language”, in Komp’juternaja Lingvistika i Intellektual’nye Tehnologii, 2019, pp. 333-339.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">A. Kutuzov, L. Pivovarova, et al., “RuShiftEval: a shared task on semantic shift detection for Russian”, in Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2021), vol. 20, 2021, pp. 533-545.</mixed-citation><mixed-citation xml:lang="en">A. Kutuzov, L. 
Pivovarova, et al., “RuShiftEval: a shared task on semantic shift detection for Russian”, in Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2021), vol. 20, 2021, pp. 533-545.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">J. Rodina, Y. Trofimova, A. Kutuzov, and E. Artemova, “ELMo and BERT in semantic change detection for Russian”, in International Conference on Analysis of Images, Social Networks and Texts, Springer, 2020, pp. 175-186.</mixed-citation><mixed-citation xml:lang="en">J. Rodina, Y. Trofimova, A. Kutuzov, and E. Artemova, “ELMo and BERT in semantic change detection for Russian”, in International Conference on Analysis of Images, Social Networks and Texts, Springer, 2020, pp. 175-186.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">A. V. Glazkova, “Topical classification of text fragments accounting for their nearest context”, Automation and Remote Control, vol. 81, no. 12, pp. 2262-2276, 2020.</mixed-citation><mixed-citation xml:lang="en">A. V. Glazkova, “Topical classification of text fragments accounting for their nearest context”, Automation and Remote Control, vol. 81, no. 12, pp. 2262-2276, 2020.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">I. A. Batraeva, A. D. Nartsev, and A. S. Lezgyan, “Using the analysis of semantic proximity of words in solving the problem of determining the genre of texts within deep learning”, Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika, no. 50, pp. 14-22, 2020, In Russian.</mixed-citation><mixed-citation xml:lang="en">I. A. Batraeva, A. D. Nartsev, and A. S. 
Lezgyan, “Using the analysis of semantic proximity of words in solving the problem of determining the genre of texts within deep learning”, Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika, no. 50, pp. 14-22, 2020, In Russian.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">V. Bocharov, S. Alexeeva, D. Granovsky, E. Protopopova, M. Stepanova, and A. Surikov, “Crowdsourcing morphological annotation”, in Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”. Volume 1, 2013, pp. 109-114.</mixed-citation><mixed-citation xml:lang="en">V. Bocharov, S. Alexeeva, D. Granovsky, E. Protopopova, M. Stepanova, and A. Surikov, “Crowdsourcing morphological annotation”, in Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”. Volume 1, 2013, pp. 109-114.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">K. Lagutina, N. Lagutina, E. Boychuk, V. Larionov, and I. Paramonov, “Authorship verification of literary texts with rhythm features”, in 28th Conference of Open Innovations Association FRUCT, IEEE, 2021, pp. 240-251.</mixed-citation><mixed-citation xml:lang="en">K. Lagutina, N. Lagutina, E. Boychuk, V. Larionov, and I. Paramonov, “Authorship verification of literary texts with rhythm features”, in 28th Conference of Open Innovations Association FRUCT, IEEE, 2021, pp. 240-251.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
