Analysis of the effectiveness of the Resemblyzer library for short-command voice authentication
Abstract
Relevance. Voice interaction is widely used in Internet of Things systems and autonomous embedded devices, but its practical deployment is constrained by security and privacy requirements and by the limited computational resources of edge platforms. This creates a demand for fully local voice authentication solutions capable of operating without reliance on cloud services.
Goal. The objective of this study is to evaluate the capabilities of the open-source Python library Resemblyzer for implementing autonomous user voice authentication based on short voice commands, without access to cloud computing and under limited hardware resources.
Research methods. The study was conducted using several audio datasets of varying duration, quality, and file size. Voice embeddings generated by the Resemblyzer library served as the feature representation. Similarity between recordings was quantified with the cosine similarity metric in scenarios comparing recordings from the same speaker and from different speakers.
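The comparison step described above can be sketched as follows. With Resemblyzer, embeddings are typically obtained via `VoiceEncoder().embed_utterance(preprocess_wav(path))`; here, to keep the sketch self-contained, synthetic 256-dimensional vectors stand in for real embeddings, and the 0.75 decision threshold is an illustrative assumption that would need calibration on enrollment data.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(sim: float, threshold: float = 0.75) -> bool:
    # Illustrative threshold; in practice it is tuned on enrollment data.
    return sim >= threshold

# Synthetic stand-ins for speaker embeddings (real ones come from Resemblyzer).
rng = np.random.default_rng(0)
ref = rng.normal(size=256)                     # reference embedding
probe_same = ref + 0.1 * rng.normal(size=256)  # small perturbation: same speaker
probe_other = rng.normal(size=256)             # independent vector: other speaker

print(cosine_similarity(ref, probe_same))   # high, close to 1
print(cosine_similarity(ref, probe_other))  # low, near 0
```

The same-speaker probe yields a similarity close to 1, while an independent vector lands near 0, which is the separation the thresholded decision relies on.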
Results. The results demonstrate that reliable voice authentication is achieved for audio recordings with a duration of at least 2.63 seconds and a file size of no less than 495 kB. Short fragments with durations of 1-1.5 seconds were found to be insufficiently informative for stable speaker discrimination, particularly when compared against a high-quality reference recording. A clear dependence of authentication performance on the amount of acoustic information contained in the voice signal was identified.
Conclusions. The obtained results confirm the applicability of Resemblyzer for the development of fully autonomous real-time voice biometric authentication systems. Practical requirements for the minimum duration and informational richness of voice commands are formulated, which may be interpreted as technical constraints on the entropy of voice passwords in secure IoT applications.
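The empirical thresholds reported above (at least 2.63 seconds of audio and at least 495 kB of data) can be turned into a simple pre-check that rejects recordings too short to authenticate reliably. This is a minimal sketch assuming uncompressed WAV input; the function name and structure are illustrative, not part of the study's implementation.

```python
import os
import wave

MIN_DURATION_S = 2.63          # minimum duration reported in the study
MIN_SIZE_BYTES = 495 * 1024    # minimum file size (495 kB) reported in the study

def meets_requirements(path: str) -> bool:
    """Pre-check a WAV recording against the study's empirical thresholds."""
    if os.path.getsize(path) < MIN_SIZE_BYTES:
        return False
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return duration >= MIN_DURATION_S
```

A recording that fails this check would be re-prompted rather than passed to the embedding stage, avoiding unstable similarity scores on under-informative fragments.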
References
A. Choudhary, Internet of Things: a comprehensive overview, architectures, applications, simulation tools, challenges and future directions. Discov. Internet Things. 2024. Vol. 4. P. 31. https://doi.org/10.1007/s43926-024-00084-3.
M. Lombardi, F. Pascale, D. Santaniello, Internet of Things: A General Overview between Architectures, Protocols and Applications. Information. 2021. Vol. 12. P. 87. https://doi.org/10.3390/info12020087.
L. Atzori, A. Iera, G. Morabito, The Internet of Things: A survey. Computer Networks. 2010. Vol. 54. P. 2787-2805. https://doi.org/10.1016/j.comnet.2010.05.010.
M. Hoy, Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants. Med. Ref. Serv. Q. 2018. Vol. 37. P. 81-88. https://doi.org/10.1080/02763869.2018.1404391.
M. Benzeghiba, R. De Mori, O. Deroo et al., Automatic speech recognition and speech variability. Speech Commun. 2007. Vol. 49. P. 763-786. https://doi.org/10.1016/j.specom.2007.02.006.
A. Javed, K. Malik, H. Malik, A. Irtaza, Voice spoofing detector: a unified anti-spoofing framework. Expert Systems Applic. 2022. Vol. 198. P. 116770. https://doi.org/10.1016/j.eswa.2022.116770.
N. Ahmed, J. Khan, N. Sheta et al., Detecting Replay Attack on Voice-Controlled Systems using Small Neural Networks. 2022 IEEE 7th Forum on Research and Technologies for Society and Industry Innovation (RTSI), Paris, France. 2022. P. 50-54. https://doi.org/10.1109/RTSI55261.2022.9905158.
Z. Wu, N. Evans, T. Kinnunen et al., Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015. Vol. 66. P. 130-153. https://doi.org/10.1016/j.specom.2014.10.005.
T. Kinnunen, Z. Wu, K. Lee et al., Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan. 2012. P. 4401-4404. https://doi.org/10.1109/ICASSP.2012.6288895.
A. Poddar, M. Sahidullah, G. Saha, Speaker verification with short utterances: a review of challenges, trends and opportunities. IET Biometrics. 2018. Vol. 7. P. 403-411. https://doi.org/10.1049/iet-bmt.2017.0065.
N. Dehak, P. Kenny, R. Dehak et al., Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing. 2011. Vol. 19. P. 788-798. https://doi.org/10.1109/TASL.2010.2064307.
D. Snyder, D. Garcia-Romero, G. Sell et al., X-Vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada. 2018. P. 5329-5333. https://doi.org/10.1109/ICASSP.2018.8461375.
A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inform. Proces. Syst. 2020. Vol. 33. P. 12449-12460. https://doi.org/10.48550/arXiv.2006.11477.
L. Wan, Q. Wang, A. Papir, I. Moreno, Generalized end-to-end loss for speaker verification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018. P. 4879-4883. https://doi.org/10.48550/arXiv.1710.10467.
M. Ravanelli, T. Parcollet, P. Plantinga et al., SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624. 2021. https://doi.org/10.48550/arXiv.2106.04624.
H. Bredin, R. Yin, J. Coria, pyannote.audio: neural building blocks for speaker diarization. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020. P. 7124-7128. https://doi.org/10.48550/arXiv.1911.01255.
Y. Jia, Y. Zhang, R. Weiss et al., Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Adv. Neural Inform. Proces. Syst. 2018. arXiv:1806.04558. https://doi.org/10.48550/arXiv.1806.04558.
Q. Wang, C. Downey, L. Wan et al., Speaker diarization with LSTM. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018. P. 5239-5243. https://doi.org/10.48550/arXiv.1710.10468.