Analysis of Modern Neural Network Methods for Visual Information Processing in High-Speed UAV Navigation Systems
Abstract
Relevance. The rapid evolution of Unmanned Aerial Vehicles (UAVs) from remotely piloted systems to fully autonomous high-speed aerial robots has intensified the demand for advanced onboard perception and navigation methods. This need is particularly acute in scenarios where computational latency, sensor noise, and environmental complexity undermine the reliability of classical computer-vision pipelines. Despite recent progress in deep learning, existing approaches to visual information processing (especially CNN-based detectors, Transformer-based semantic models, and learning-enhanced SLAM modules) remain fragmented and insufficiently adapted to the strict Size, Weight, and Power (SWaP) constraints of embedded platforms such as the NVIDIA Jetson series. This motivates a comprehensive analysis of modern neural architectures suitable for real-time, high-velocity UAV operations.
Purpose. The purpose of this study is to analyze state-of-the-art neural network methods for secondary visual processing in UAV navigation systems, compare the applicability of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), evaluate their integration into SLAM pipelines, and determine the requirements for hybrid architectures capable of supporting fully autonomous, high-speed flight.
Methods. The research employs a comparative analysis of recent deep-learning approaches, including CNN-based detectors (YOLO family), Transformer-based visual models, deep-learning–enhanced SLAM components, and Deep Reinforcement Learning (DRL) control policies. Evaluation criteria include latency, semantic robustness, dynamic-scene handling, edge-hardware compatibility, quantization performance, pruning potential, and TensorRT optimization efficiency on NVIDIA Jetson devices.
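To illustrate how the latency criterion can be evaluated in practice, the following minimal sketch (assuming PyTorch with CUDA support, as available on Jetson-class devices; `model` is a hypothetical stand-in for any detector under test, such as a YOLO variant) measures mean per-frame inference time with explicit GPU synchronization:

```python
# Minimal latency-benchmark sketch; not the study's actual evaluation harness.
# Assumes PyTorch with CUDA (e.g., an NVIDIA Jetson); "model" is any detector.
import time
import torch

def measure_latency(model, input_shape=(1, 3, 640, 640), warmup=20, runs=100):
    """Return the mean per-frame inference latency in milliseconds."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up stabilizes clocks and caches
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()     # flush queued GPU work before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()     # wait for the last kernels to finish
        end = time.perf_counter()
    return (end - start) / runs * 1000.0
```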
Results. The study establishes that CNNs provide superior real-time performance and remain indispensable for high-frequency reflexive perception, while Vision Transformers offer stronger global context reasoning and robustness to occlusion but suffer from significant computational overhead on embedded GPUs. Deep-learning-based SLAM methods improve feature stability and dynamic-object rejection but require careful integration to maintain real-time constraints. Hardware analysis reveals that quantization, pruning, and TensorRT acceleration are critical for deploying deep models on Jetson-class platforms, although ViTs exhibit limited INT8 quantization tolerance. Based on these findings, the work formulates a conceptual hybrid architecture that combines CNN-driven reflexive processing with Transformer-driven cognitive reasoning.
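As a concrete example of the hardware-aware deployment path these findings point to, the sketch below (a hedged illustration; the file names and the 640×640 input size are placeholder assumptions, not the study's pipeline) exports a trained detector to ONNX, from which a TensorRT engine can be built on the target device with the stock trtexec tool:

```python
# Hedged export sketch: PyTorch -> ONNX -> TensorRT engine on the Jetson.
# Model and file names are placeholders, not the paper's code.
import torch

def export_for_tensorrt(model: torch.nn.Module, onnx_path: str = "detector.onnx"):
    """Export a detector to ONNX as the input for TensorRT engine building."""
    model.eval()
    dummy = torch.randn(1, 3, 640, 640)   # representative input resolution
    torch.onnx.export(
        model, dummy, onnx_path,
        input_names=["images"], output_names=["outputs"],
        opset_version=17,
    )
    # On the target device an engine can then be built with TensorRT's
    # command-line tool, e.g.:
    #   trtexec --onnx=detector.onnx --fp16 --saveEngine=detector.engine
    # INT8 builds additionally require representative calibration data, which
    # is where ViTs tend to lose more accuracy than CNNs (limited INT8
    # quantization tolerance, as noted above).
```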
Conclusions. The results confirm the necessity of developing hybrid neuro-architectures that integrate the speed and hardware efficiency of CNNs with the semantic depth of Transformer-based models. Such architectures represent a promising pathway toward reliable, fully autonomous high-speed UAV navigation. The proposed design principles emphasize hierarchical control, asynchronous perception loops, and hardware-aware optimization as key enablers for next-generation aerial robotic systems.
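To make the asynchronous-perception-loop principle concrete, the following sketch (hypothetical names throughout: `fast_detector`, `semantic_model`, and `camera` stand in for a CNN detector, a Transformer-based model, and a frame source; the loop rates are illustrative) runs a high-rate reflexive loop alongside a low-rate cognitive loop that share only the latest frame:

```python
# Hedged sketch of the asynchronous two-rate perception pattern; illustrative
# rates and names only, not the proposed system's actual implementation.
import threading
import time

class TwoRatePerception:
    def __init__(self, fast_detector, semantic_model):
        self.fast = fast_detector    # low-latency reflexive path (CNN)
        self.slow = semantic_model   # high-latency cognitive path (ViT)
        self.latest_frame = None
        self.semantic_map = None     # refreshed asynchronously by the slow loop
        self.lock = threading.Lock()
        self.running = True

    def fast_loop(self, camera, period=0.02):     # ~50 Hz reflexive loop
        while self.running:
            frame = camera()
            with self.lock:
                self.latest_frame = frame
            obstacles = self.fast(frame)          # immediate avoidance cues
            # ...fuse obstacles with the (possibly stale) semantic map here...
            time.sleep(period)

    def slow_loop(self, period=0.2):              # ~5 Hz cognitive loop
        while self.running:
            with self.lock:
                frame = self.latest_frame
            if frame is not None:
                semantics = self.slow(frame)      # global-context reasoning
                with self.lock:
                    self.semantic_map = semantics
            time.sleep(period)

# Usage: each loop runs in its own thread, e.g.:
#   p = TwoRatePerception(cnn_detector, vit_model)
#   threading.Thread(target=p.fast_loop, args=(camera,), daemon=True).start()
#   threading.Thread(target=p.slow_loop, daemon=True).start()
```

The design choice worth noting is that the reflexive loop always proceeds with the most recent, possibly stale, semantic map rather than blocking on the Transformer, which is what keeps the high-frequency control path within real-time bounds.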
References
Sheng, Y., Liu, H., Li, J., & Han, Q. (2024). UAV autonomous navigation based on deep reinforcement learning in highly dynamic and high-density environments. Drones, 8(9), 516. https://doi.org/10.3390/drones8090516
Scherbinin, V. V., Khusainov, N. S., & Kravchenko, P. P. (2014). Combined correlation-extremal navigation system to identify AV location by terrain relief and landscape objects with the use of the stereo photogrammetry method. Middle-East Journal of Scientific Research, 19(4), 479–486. https://doi.org/10.5829/idosi.mejsr.2014.19.4.13693
Mukhina, M. P., & Seden, I. V. (2014). Analysis of modern correlation extreme navigation systems. Electronics and Control Systems, 1(39), 95–101. https://doi.org/10.18372/1990-5548.39.7343
Sotnikov, A., Tiurina, V., Petrov, K., Lukyanova, V., Lanovyy, O., Onishchenko, Y., Gnusov, Y., Petrov, S., Boichenko, O., & Breus, P. (2024). Using the set of informative features of a binding object to construct a decision function by the system of technical vision when localizing mobile robots. Eastern-European Journal of Enterprise Technologies, 3(9(129)), 60–69. https://doi.org/10.15587/1729-4061.2024.303989
Seeed Studio. (2023, March 30). YOLOv8 performance benchmarks on NVIDIA Jetson devices. Seeed Studio Blog. https://www.seeedstudio.com/blog/2023/03/30/yolov8-performance-benchmarks-on-nvidia-jetson-devices/
Du, D., et al. (2019). VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2019) (pp. 213–226). IEEE. https://doi.org/10.1109/ICCVW.2019.00030
Zhang, J. (2023). Towards a high-performance object detector: Insights from drone detection using ViT and CNN-based deep learning models. In Proceedings of the 2023 IEEE International Conference on Sensors, Electronics and Computer Engineering (ICSECE) (pp. 141–147). IEEE. https://doi.org/10.1109/ICSECE58870.2023.10263514
Liu, T., Wang, Y., Yang, C., Zhang, Y., & Zhang, W. (2025). A lightweight hybrid CNN-ViT network for weed recognition in paddy fields. Mathematics, 13(17), 2899. https://doi.org/10.3390/math13172899
Shen, S., Yu, G., Zhang, L., Yan, Y., & Zhai, Z. (2025). LandNet: Combine CNN and transformer to learn absolute camera pose for the fixed-wing aircraft approach and landing. Remote Sensing, 17(4), 653. https://doi.org/10.3390/rs17040653
Xue, H., Tang, Z., Xia, Y., Wang, L., & Li, L. (2025). HCTD: A CNN-transformer hybrid for precise object detection in UAV aerial imagery. Computer Vision and Image Understanding, 259, 104409. https://doi.org/10.1016/j.cviu.2025.104409
Favorskaya, M. N. (2023). Deep learning for visual SLAM: The state-of-the-art and future trends. Electronics, 12(9), 2006. https://doi.org/10.3390/electronics12092006
Luo, L., Peng, F., & Dong, L. (2024). Improved multi-sensor fusion dynamic odometry based on neural networks. Sensors, 24(19), 6193. https://doi.org/10.3390/s24196193
Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., Ling, H., et al. (2022). Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11), 7380–7399. https://doi.org/10.1109/TPAMI.2021.3119563
Mohiuddin, M. B., Boiko, I., Tran, V. P., et al. (2025). Reinforcement learning for end-to-end UAV slung-load navigation and obstacle avoidance. Scientific Reports, 15, 34621. https://doi.org/10.1038/s41598-025-18220-6
Meimetis, D., Daramouskas, I., Patrinopoulou, N., Lappas, V., & Kostopoulos, V. (2025). Comparative analysis of object detection models for edge devices in UAV swarms. Machines, 13(8), 684. https://doi.org/10.3390/machines13080684