Latent diffusion model for speech signal processing

  • Andrii Ivaniuk, National University of "Kyiv-Mohyla Academy", Faculty of Computer Sciences, 2 Skovorody st., Kyiv, Ukraine, 04655 https://orcid.org/0000-0002-4189-3787
Keywords: audio modeling, artificial neural networks, speech synthesis

Abstract

Relevance. The development of generative models for audio synthesis, including text-to-speech (TTS), text-to-music, and text-to-audio applications, largely depends on the models' ability to handle complex and varied input data. This paper centers on latent diffusion modeling, a versatile approach that leverages stochastic processes to generate high-quality audio outputs.

Key goals. This study aims to evaluate the efficacy of latent diffusion modeling for TTS synthesis on the EmoV-DB dataset, which features multi-speaker recordings across five emotional states, and to contrast it with other generative techniques.

Research methods. We applied latent diffusion modeling to TTS synthesis specifically and evaluated its performance using metrics that assess intelligibility, speaker similarity, and emotion preservation in the generated audio signal.
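
As an illustration of how a speaker-similarity metric of this kind is commonly computed, the sketch below scores the cosine similarity between two speaker embeddings. This is an assumption for illustration only: the abstract does not state which similarity measure or speaker encoder was used, and the function name `cosine_similarity` and the toy 4-dimensional vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; a real speaker encoder would produce
# vectors with hundreds of dimensions from the audio itself.
reference = [0.2, 0.7, -0.1, 0.4]   # embedding of the reference recording
generated = [0.1, 0.8, -0.2, 0.3]   # embedding of the synthesized audio

score = cosine_similarity(reference, generated)
print(f"speaker similarity: {score:.3f}")
```

A score close to 1 indicates that the synthesized voice lies near the reference speaker in embedding space; in practice such scores are typically averaged over many utterance pairs.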

Results. The study shows that although the proposed model is reasonably effective at preserving speaker characteristics, it is outperformed by the discrete autoregressive model xTTS v2 on all assessed metrics. Notably, the studied model exhibits deficiencies in emotion classification accuracy, suggesting a potential misalignment between the emotional intent encoded by the embeddings and the emotion expressed in the speech output.

Conclusions. The findings suggest that further refinement of the encoder's ability to process and integrate emotional data could enhance the performance of the latent diffusion model. Future research should focus on optimizing the balance between speaker and emotion characteristics in TTS models to achieve a more holistic and effective synthesis of human-like speech.

Author Biography

Andrii Ivaniuk, National University of "Kyiv-Mohyla Academy", Faculty of Computer Sciences, 2 Skovorody st., Kyiv, Ukraine, 04655

PhD Student

Published
2024-05-27
How to Cite
Ivaniuk, A. (2024). Latent diffusion model for speech signal processing. Bulletin of V.N. Karazin Kharkiv National University, Series «Mathematical Modeling. Information Technology. Automated Control Systems», 61, 44-51. https://doi.org/10.26565/2304-6201-2024-61-05
Section
Articles