Word-embedding Based Text Vectorization Using Clustering
https://doi.org/10.18255/1818-1015-2021-3-292-311
Abstract
In natural language processing tasks, representing texts as fixed-length vectors obtained from word-embedding models is known to work well only when the vectorized texts are short.
The longer the texts being compared, the worse this approach performs. The reason is that information is lost when the vector representations of the individual words are combined into a vector representation of the entire text, which usually has the same dimension as the vector of a single word.
This paper proposes an alternative way to use pre-trained word-embedding models for text vectorization. The idea is to merge semantically similar entries of the vocabulary of the given text corpus by clustering their embeddings, producing a new vocabulary that is smaller than the original and whose entries each correspond to one cluster. The original corpus is then rewritten in terms of this new vocabulary, after which the rewritten texts are vectorized with one of the dictionary-based approaches (TF-IDF is used in this work). The resulting text vector can be further enriched with vectors of the original vocabulary words, obtained for each cluster by reducing the dimension of their embeddings.
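The abstract describes this pipeline only in prose; the following is a minimal sketch of how such a pipeline could look, assuming a gensim KeyedVectors embedding model, k-means clustering, and scikit-learn's TfidfVectorizer. The function name build_cluster_tfidf, the cluster count, and other parameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of clustering-based text vectorization (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def build_cluster_tfidf(tokenized_texts, keyed_vectors, n_clusters=1000):
    # 1. Collect the corpus vocabulary covered by the embedding model.
    vocab = sorted({w for text in tokenized_texts for w in text
                    if w in keyed_vectors.key_to_index})
    emb = np.vstack([keyed_vectors[w] for w in vocab])

    # 2. Cluster the word embeddings: each cluster becomes one entry
    #    of the new, smaller vocabulary.
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(emb)
    word2cluster = {w: f"cl{c}" for w, c in zip(vocab, km.labels_)}

    # 3. Rewrite every text in terms of cluster labels.
    rewritten = [" ".join(word2cluster[w] for w in text if w in word2cluster)
                 for text in tokenized_texts]

    # 4. Vectorize the rewritten corpus with a dictionary approach (TF-IDF).
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(rewritten)
    return tfidf_matrix, vectorizer, word2cluster
```

In this sketch each word is replaced by the label of its cluster, so the resulting TF-IDF matrix has one column per cluster rather than one column per word of the original vocabulary.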
The paper describes a series of experiments to determine the optimal parameters of the method and compares the proposed approach with other text-vectorization methods on a text ranking problem: averaging word embeddings with and without TF-IDF weighting, and vectorization based on TF-IDF coefficients alone.
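For reference, the baseline of averaging word embeddings with TF-IDF weighting could be sketched as follows; normalization by the sum of weights and the handling of out-of-vocabulary words here are assumptions rather than details reported in the paper.

```python
# Sketch of a TF-IDF-weighted embedding-averaging baseline (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_average(tokenized_texts, keyed_vectors):
    # Texts are assumed to be pre-tokenized lists of words.
    tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
    weights = tfidf.fit_transform(tokenized_texts)
    features = tfidf.get_feature_names_out()
    dim = keyed_vectors.vector_size

    doc_vectors = np.zeros((len(tokenized_texts), dim))
    for i, row in enumerate(weights):
        total = 0.0
        for col, w in zip(row.indices, row.data):
            word = features[col]
            if word in keyed_vectors.key_to_index:   # skip OOV words
                doc_vectors[i] += w * keyed_vectors[word]
                total += w
        if total > 0.0:
            doc_vectors[i] /= total                  # weighted mean
    return doc_vectors
```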
About the Authors
Vitaly I. Yuferev
Russian Federation
Chief expert, Master of Science.
12 Neglinnaya str., Moscow 107016
Nikolai A. Razin
Russian Federation
Head of division, PhD.
12 Neglinnaya str., Moscow 107016
For citations:
Yuferev V.I., Razin N.A. Word-embedding Based Text Vectorization Using Clustering. Modeling and Analysis of Information Systems. 2021;28(3):292-311. (In Russ.) https://doi.org/10.18255/1818-1015-2021-3-292-311