Spam message classification model in a medical information system
Abstract
Relevance. Modern medical information systems generate a significant number of text records every day, produced by the service, doctors and staff. To work reliably, such systems need models and methods for analyzing and classifying text data, in particular for detecting and blocking spam messages. The development, improvement and implementation of models and methods for classifying spam messages is therefore a relevant task.
Research objective: to increase the efficiency of spam message recognition in medical information systems; to develop and implement spam classification models based on machine learning methods.
Research methods: natural language processing methods, modeling, machine learning, classification methods, data analysis methods, statistical methods.
Results. Spam message classification models were built using three machine learning methods: logistic regression, the naive Bayes classifier and the support vector machine. The models were trained on the SMS Spam Collection dataset, preprocessed with CountVectorizer and TfidfVectorizer. All proposed models showed high accuracy in spam message classification and the ability to correctly determine the type of message.
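The model and vectorizer combinations named in the results can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the six toy messages stand in for the SMS Spam Collection, the `build_pipelines` helper is hypothetical, and the choice of `MultinomialNB` and `LinearSVC` as the naive Bayes and support vector variants, along with all hyperparameters, is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the SMS Spam Collection (1 = spam, 0 = ham); illustrative only.
texts = [
    "Win a free prize now, call this number",
    "Urgent: claim your free cash reward today",
    "Free entry to win tickets, text WIN now",
    "Your appointment with the doctor is at 10 am",
    "Please bring your lab results to the clinic",
    "The nurse will call you about the prescription",
]
labels = [1, 1, 1, 0, 0, 0]


def build_pipelines():
    """Return one pipeline per (classifier, vectorizer) pair (hypothetical helper)."""
    vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
    classifiers = {
        "logreg": lambda: LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB,
        "svm": LinearSVC,
    }
    return {
        f"{clf_name}+{vec_name}": make_pipeline(make_vec(), make_clf())
        for vec_name, make_vec in vectorizers.items()
        for clf_name, make_clf in classifiers.items()
    }


pipelines = build_pipelines()
for pipe in pipelines.values():
    pipe.fit(texts, labels)  # vectorize the raw text, then train the classifier

# Classify an unseen message with every trained pipeline.
new_message = ["Call now to claim your free prize"]
predictions = {name: int(pipe.predict(new_message)[0]) for name, pipe in pipelines.items()}
for name, label in predictions.items():
    print(f"{name}: {'spam' if label == 1 else 'ham'}")
```

On the real dataset the messages would be split into training and test parts before fitting, so that the reported quality indicators reflect unseen data.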
Conclusions. The developed message classification models, based on machine learning and an NLP approach, successfully detected unwanted messages. The best model by quality indicators was the support vector machine with TF-IDF vectorization, which showed the highest accuracy (98.75%) and a high recall (90.3%). Further improvement of the models and expansion of the training set can further improve the quality of spam recognition.
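Accuracy and recall measure different things, which is why both are reported: accuracy is the share of all messages classified correctly, while recall is the share of actual spam that was caught. The sketch below computes both in plain Python. The label vectors are invented for illustration and are not the paper's test data, and `accuracy_and_recall` is a hypothetical helper.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Compute accuracy and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Recall is undefined without positive examples; report 0.0 in that case.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return accuracy, recall


# Invented, imbalanced example: 100 spam and 900 ham messages.
y_true = [1] * 100 + [0] * 900
# The classifier catches 90 spam, misses 10, and wrongly flags 2 ham messages.
y_pred = [1] * 90 + [0] * 10 + [1] * 2 + [0] * 898

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

Because ham messages dominate such a set, accuracy stays high even when a noticeable fraction of spam is missed, so recall is the more sensitive indicator of spam detection quality.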
References
Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media Inc. 2009. 504 p.
Wang, M., Sun, Z., Jia, M. et al. Intelligent virtual case learning system based on real medical records and natural language processing. BMC Med Inform Decis Mak 22, 60 (2022). https://doi.org/10.1186/s12911-022-01797-7.
Robert M. Cronin, Daniel Fabbri, Joshua C. Denny, S. Trent Rosenbloom, Gretchen Purcell Jackson. A comparison of rule-based and machine learning approaches for classifying patient portal messages. International Journal of Medical Informatics, Vol. 105, P. 110-120, 2017. https://doi.org/10.1016/j.ijmedinf.2017.06.004
Elbattah M., Arnaud É., Gignon M., Dequen G. The Role of Text Analytics in Healthcare: A Review of Recent Developments and Applications. Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021), Volume 5: HEALTHINF, P. 825-832, 2021. DOI: 10.5220/0010414508250832.
Turchin Alexander, Masharsky Stanislav, Zitnik Marinka. Comparison of BERT implementations for natural language processing of narrative medical documents. Informatics in Medicine Unlocked, Vol. 36, 2022. Article 101139. DOI: 10.1016/j.imu.2022.101139.
Zhou Binggui, Yang Guanghua, Shi Zheng, Ma Shaodan. Natural Language Processing for Smart Healthcare. IEEE Reviews in Biomedical Engineering, 2021. DOI: 10.48550/arXiv.2110.15803.
Jurafsky Daniel, Martin James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition. Prentice Hall, 2019, 621 p. URL: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf (Last accessed: 20.11.2024)
Almeida T., Hidalgo J. SMS Spam Collection [Dataset]. UCI Machine Learning Repository. 2011. https://doi.org/10.24432/C5CC84.