Comparative analysis of YOLOv5 and MobileNetV3 models for real-time image recognition
Abstract
Relevance: With the growing need for fast and accurate real-time object recognition, especially on mobile and embedded systems, choosing an optimal AI model becomes a key question. Comparisons of lightweight and high-accuracy architectures such as YOLOv5 and MobileNetV3 are therefore important both for building efficient computer vision systems and for exploring the principles of hybrid model construction.
Purpose: To compare the YOLOv5 and MobileNetV3 architectures, analyze their efficiency for real-time object recognition applications, and verify that hybrid models can improve performance on these tasks.
Research methods: image preprocessing; deep neural network training; measurement of accuracy, inference speed, and resource usage; comparative analysis of the results to assess model effectiveness.
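To make the measurement procedure concrete, the following minimal Python sketch times single-image inference for both models. It assumes PyTorch, torchvision, and the public Ultralytics YOLOv5 hub model are available; the input sizes and iteration count are illustrative choices, not the study's exact protocol.

import time
import torch
from torchvision.models import mobilenet_v3_small

# Load a small YOLOv5 variant from the Ultralytics hub and a pretrained
# MobileNetV3-Small classifier from torchvision.
yolo = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
mnet = mobilenet_v3_small(weights="DEFAULT").eval()

def mean_latency_ms(model, dummy, iters=50):
    """Average forward-pass latency over `iters` runs, after one warm-up."""
    with torch.no_grad():
        model(dummy)                              # warm-up pass
        start = time.perf_counter()
        for _ in range(iters):
            model(dummy)
        return (time.perf_counter() - start) / iters * 1000

# Typical input resolutions: 640x640 for YOLOv5, 224x224 for MobileNetV3.
print("YOLOv5s:       %.1f ms" % mean_latency_ms(yolo, torch.zeros(1, 3, 640, 640)))
print("MobileNetV3-S: %.1f ms" % mean_latency_ms(mnet, torch.zeros(1, 3, 224, 224)))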
Results: The experimental study showed that YOLOv5 achieves better overall accuracy on the COCO test set but requires more computing resources. MobileNetV3, by contrast, provides faster inference and runs efficiently on low-power devices at some cost in accuracy. Both models have thus proven suitable for real-world applications, and the choice between them depends on the required balance between speed, accuracy, and platform constraints. Combining the two models yields better object recognition results, although this may increase model size and resource consumption.
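The abstract does not specify how the two models are combined, so the sketch below shows only one plausible, hypothetical hybrid pattern: YOLOv5 proposes bounding boxes, and MobileNetV3 acts as a lightweight second-stage classifier on each detected crop. The function names and pipeline here are our own illustration, not the paper's method.

import torch
import torchvision.transforms.functional as TF
from PIL import Image
from torchvision.models import mobilenet_v3_small, MobileNet_V3_Small_Weights

detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
weights = MobileNet_V3_Small_Weights.DEFAULT
preprocess = weights.transforms()                 # resize, center-crop, normalize
classifier = mobilenet_v3_small(weights=weights).eval()

def hybrid_predict(image_path):
    # Stage 1: YOLOv5 detection; boxes come back in original-image pixels.
    results = detector(image_path)
    boxes = results.xyxy[0]                       # (N, 6): x1, y1, x2, y2, conf, cls
    img = TF.to_tensor(Image.open(image_path).convert("RGB"))
    refined = []
    with torch.no_grad():
        for x1, y1, x2, y2, conf, _ in boxes.tolist():
            if x2 - x1 < 2 or y2 - y1 < 2:        # skip degenerate boxes
                continue
            crop = img[:, int(y1):int(y2), int(x1):int(x2)]
            # Stage 2: re-classify the crop with MobileNetV3-Small.
            logits = classifier(preprocess(crop).unsqueeze(0))
            refined.append(((x1, y1, x2, y2), logits.argmax(1).item(), conf))
    return refined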
Conclusions: The study compared YOLOv5, MobileNetV3, and a hybrid model on the object recognition task. The hybrid model demonstrated better accuracy and a better balance between processing speed and resource utilization than either individual model. This indicates that hybrid approaches are a viable way to improve the efficiency of computer vision systems in real-world conditions, making the hybrid model a promising direction for further research and practical implementation.