Analysis of the influence of different word vector representations on the accuracy of text data classification

Keywords: Machine learning, natural language processing, semantics, context, text data, neural networks, transformers, BERT, GPT-3, data mining, sentiment analysis, semantic analysis

Abstract

Relevance. The growing volume of textual information available from the Internet and other sources creates a need for better text processing methods so that this data can be analyzed and used efficiently. Vector representation of words is a key element in this context, as it transforms words into numerical vectors while preserving semantic relations. With the development of modern machine learning methods, especially deep learning, word vector representations have become an important means of improving model results in text data processing, since such models require high-quality, semantically rich vector representations. This determines the relevance of studying how different word vector representations affect text data processing and of identifying optimal methods for specific tasks.

Objective. This paper systematically analyzes the impact of different word vectorization methods on the results of text data processing. The study aims to identify optimal approaches to word vector representation in order to improve the efficiency and accuracy of text processing models in various artificial intelligence and machine learning tasks.

Research methods. Analysis, experiment.

Results. Despite significant progress in machine learning technologies, the problem of semantics and context in text data processing persists. This problem affects the quality and accuracy of decisions made by machine learning-based systems and can lead to incorrect analysis and data distortion. Even modern transformer-based models may face challenges in understanding semantics and context, especially in complex and ambiguous scenarios.

Conclusions. The study concludes that the problem of semantics and context in text data processing is significant and requires further study. Existing methods and technologies, although they show good results in some tasks, may be insufficient in other, especially complex, situations. It is proposed to continue research in this area and to develop new methods and approaches that can address these problems effectively. It is also important to study how different contextual factors affect the semantics of textual data and how these influences can be taken into account when designing and using machine learning systems.

Author Biographies

Ihor Malyha, V.N. Karazin Kharkiv National University, 4 Svobody Square, Kharkiv, Ukraine, 61022

postgraduate

Serhiy Shmatkov, V.N. Karazin Kharkiv National University, 4 Svobody Square, Kharkiv, Ukraine, 61022

Doctor of Science, Professor; Head of the Department of Theoretical and Applied System Engineering

Published
2023-10-30
How to Cite
Malyha, I., & Shmatkov, S. (2023). Analysis of the influence of different word vector representations on the accuracy of text data classification. Bulletin of V.N. Karazin Kharkiv National University, Series «Mathematical Modeling. Information Technology. Automated Control Systems», 59, 49-55. https://doi.org/10.26565/2304-6201-2023-59-05
Section
Articles