Analysis of the influence of different word vector representations on the accuracy of text data classification
Abstract
Relevance. The growing volume of textual information available from the Internet and other sources creates a need to improve text processing methods for the efficient analysis and use of these data. The vector representation of words is a key element in this context, as it transforms words into numerical vectors while preserving semantic relations. With the development of modern machine learning methods, especially deep learning, word vector representations have become essential for improving the results of text data processing models, since such models require high-quality, semantically rich vector representations. All this determines the relevance of studying the impact of different word vector representations on text data processing and of identifying optimal methods for specific tasks.
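To illustrate the idea, the sketch below (a hypothetical example, assuming the gensim library and a tiny toy corpus; the hyperparameters are chosen only for illustration) trains a skip-gram Word2Vec model and shows how each word is mapped to a numerical vector whose geometry reflects semantic similarity.

# Minimal sketch, assuming gensim >= 4.0 and a toy corpus (not from the study).
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["a", "cat", "and", "a", "dog", "played"],
]

model = Word2Vec(
    sentences=corpus,   # tokenized sentences
    vector_size=50,     # dimensionality of the word vectors
    window=2,           # context window size
    min_count=1,        # keep every word in this tiny corpus
    sg=1,               # 1 = skip-gram, 0 = CBOW
    epochs=100,
)

vec = model.wv["cat"]                      # a 50-dimensional numeric vector
print(vec.shape)                           # (50,)
print(model.wv.similarity("cat", "dog"))   # cosine similarity of two words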
Objective. The purpose of this paper is to systematically analyze the impact of different word vectorization methods on the results of text data processing. The study aims to identify optimal approaches to word vector representation in order to improve the efficiency and accuracy of text processing models across various artificial intelligence and machine learning tasks.
Research methods. Analysis, experiment.
Results. It was found that, despite significant progress in machine learning technologies, the problem of handling semantics and context in text data processing persists. This problem affects the quality and accuracy of decisions made by machine learning-based systems and can lead to incorrect analysis and distorted data. Even modern transformer-based models can struggle to capture semantics and context, especially in complex and ambiguous scenarios.
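To make the role of context concrete, the following sketch (an illustrative example, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; the helper word_vector is hypothetical) extracts contextual vectors for the same word in two different sentences and shows that they differ, which is exactly where ambiguity can still mislead downstream classification.

# Minimal sketch, assuming torch and transformers are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the last-layer hidden state for the given word in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (enc["input_ids"][0] == word_id).nonzero()[0].item()
    return hidden[position]

v1 = word_vector("He deposited cash at the bank.", "bank")
v2 = word_vector("They had a picnic on the river bank.", "bank")
# The cosine similarity is below 1.0: the same word gets different vectors
# depending on its context, unlike static embeddings such as Word2Vec or GloVe.
print(torch.cosine_similarity(v1, v2, dim=0).item())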
Conclusions. Based on the study, it was concluded that the problem of semantics and context in text data processing is significant and requires further study. Existing methods and technologies, although they show good results on some tasks, may be insufficient in other, especially complex, situations. It is proposed to continue research in this area and to develop new methods and approaches capable of solving these problems effectively. It is also important to study how different contextual factors affect the semantics of textual data and how these influences can be taken into account when designing and using machine learning systems.