The spam-messages classification model in a medical information system

doi:10.26565/2304-6201-2024-64-03

Kateryna Volynets, V.N. Karazin Kharkiv National University, 6 Svobody sq., Kharkiv, Ukraine, 61022 https://orcid.org/0009-0003-7661-9758
Viktoriia Strilets, V.N. Karazin Kharkiv National University, 6 Svobody sq., Kharkiv, Ukraine, 61022 https://orcid.org/0000-0002-2475-1496
Danylo Yakovlev, V.N. Karazin Kharkiv National University, 6 Svobody sq., Kharkiv, Ukraine, 61022 https://orcid.org/0009-0005-4785-6361

Keywords: spam messages, medical information systems, machine learning, natural language processing, text data classification

Abstract

Relevance. Modern medical information systems generate a large number of text records every day, produced by the service itself, by doctors, and by staff. To work reliably, such systems need models and methods for analyzing and classifying text data, in particular for detecting and blocking spam messages. The development, improvement, and implementation of spam classification models and methods is therefore a relevant task.

Research objective: to increase the efficiency of spam message recognition in medical information systems by developing and implementing spam classification models based on machine learning methods.

Research methods: natural language processing methods, modeling, machine learning, classification methods, data analysis methods, statistical methods.

Results. Spam message classification models were built using three machine learning methods: logistic regression, the naive Bayes classifier, and the support vector machine. The models were trained on the SMS Spam Collection dataset, vectorized beforehand with CountVectorizer and TfidfVectorizer. All proposed models classified spam messages with high accuracy and correctly determined the type of message.
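The setup described in the Results can be sketched with scikit-learn, which provides CountVectorizer and TfidfVectorizer. This is a minimal illustration, not the authors' code: the toy messages below are hypothetical stand-ins for the SMS Spam Collection, the choice of LinearSVC for the support vector model and MultinomialNB for naive Bayes is an assumption, and all hyperparameters are library defaults rather than values from the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the paper trains on the SMS Spam Collection instead.
train_texts = [
    "WIN a free prize now, text CLAIM to 80082",
    "URGENT! Your account has won a cash award, call now",
    "Free entry in a weekly competition, reply WIN",
    "Cheap loans approved instantly, click the link",
    "Your appointment with the cardiologist is at 10:30 tomorrow",
    "Lab results for the patient are ready for review",
    "Please reschedule my visit to next Monday",
    "The doctor has updated your prescription",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = legitimate message

test_texts = [
    "Claim your free cash prize now",
    "Your lab results are ready, please call the clinic",
]
test_labels = [1, 0]

# Two vectorizers and three classifiers give six candidate models.
vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "naive_bayes": lambda: MultinomialNB(),
    "svm": lambda: LinearSVC(),  # assumed linear kernel; the paper does not specify
}


def evaluate_all():
    """Fit every vectorizer/classifier pair and return (accuracy, recall) per pair."""
    results = {}
    for vec_name, vec_cls in vectorizers.items():
        for clf_name, make_clf in classifiers.items():
            model = make_pipeline(vec_cls(), make_clf())
            model.fit(train_texts, train_labels)
            predicted = model.predict(test_texts)
            results[(vec_name, clf_name)] = (
                accuracy_score(test_labels, predicted),
                recall_score(test_labels, predicted, pos_label=1, zero_division=0),
            )
    return results


results = evaluate_all()
for key, (acc, rec) in sorted(results.items()):
    print(key, f"accuracy={acc:.2f}", f"recall={rec:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training messages only, so the test messages do not leak into feature extraction.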

Conclusions: The developed message classification models, based on machine learning and an NLP approach, successfully identified unwanted messages. By the quality metrics, the best model was the support vector machine with TF-IDF vectorization, which achieved the highest accuracy (98.75%) and a high recall (90.3%). Further refinement of the models and expansion of the training set may improve the quality of spam recognition further.
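The two metrics quoted in the Conclusions can be computed from confusion-matrix counts as in the sketch below. The counts are hypothetical and chosen only to show how accuracy can exceed recall when spam is the minority class; they are not the paper's results.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all messages that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Share of actual spam messages that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 100 spam messages and 900 legitimate ones.
tp, fn = 90, 10   # spam caught / spam missed
tn, fp = 880, 20  # legitimate kept / legitimate flagged as spam

print(f"accuracy = {accuracy(tp, tn, fp, fn):.2%}")  # accuracy = 97.00%
print(f"recall   = {recall(tp, fn):.2%}")            # recall   = 90.00%
```

Because legitimate messages dominate the counts, missing a tenth of the spam lowers accuracy only slightly, which is why recall is worth reporting alongside it.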

Author Biographies

Kateryna Volynets, V.N. Karazin Kharkiv National University, 6 Svobody sq., Kharkiv, Ukraine, 61022

Student at the Education and Research Institute of Computer Sciences and Artificial Intelligence

Viktoriia Strilets, V.N. Karazin Kharkiv National University, 6 Svobody sq., Kharkiv, Ukraine, 61022

Ph.D., associate professor of the Department of Computer Systems and Robotics, Education and Research Institute of Computer Sciences and Artificial Intelligence

Danylo Yakovlev, V.N. Karazin Kharkiv National University, 6 Svobody sq., Kharkiv, Ukraine, 61022

Student at the Education and Research Institute of Computer Sciences and Artificial Intelligence

References

Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media Inc. 2009. 504 p.

Wang, M., Sun, Z., Jia, M. et al. Intelligent virtual case learning system based on real medical records and natural language processing. BMC Med Inform Decis Mak 22, 60 (2022). https://doi.org/10.1186/s12911-022-01797-7.

Robert M. Cronin, Daniel Fabbri, Joshua C. Denny, S. Trent Rosenbloom, Gretchen Purcell Jackson. A comparison of rule-based and machine learning approaches for classifying patient portal messages. International Journal of Medical Informatics, 2017. Vol. 105. P. 110-120. https://doi.org/10.1016/j.ijmedinf.2017.06.004

Elbattah M., Arnaud É., Gignon M., Dequen G. The Role of Text Analytics in Healthcare: A Review of Recent Developments and Applications. Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021), Volume 5: HEALTHINF, 2021. P. 825-832. DOI: 10.5220/0010414508250832.

Turchin Alexander, Masharsky Stanislav, Zitnik Marinka. Comparison of BERT implementations for natural language processing of narrative medical documents. Informatics in Medicine Unlocked, 2022. Vol. 36. 101139. DOI: 10.1016/j.imu.2022.101139.

Zhou Binggui, Yang Guanghua, Shi Zheng, Ma Shaodan. Natural Language Processing for Smart Healthcare. IEEE Reviews in Biomedical Engineering, 2021. DOI: 10.48550/arXiv.2110.15803.

Jurafsky Daniel, Martin James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition. Prentice Hall, 2019. 621 p. URL: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf (Last accessed: 20.11.2024)

Almeida T., Hidalgo J. SMS Spam Collection [Dataset]. UCI Machine Learning Repository. 2011. https://doi.org/10.24432/C5CC84.
