Latent diffusion model for speech signal processing
Abstract
Topicality. The development of generative models for audio synthesis, including text-to-speech (TTS), text-to-music, and text-to-audio applications, depends largely on their ability to handle complex and varied input data. This paper centers on latent diffusion modeling, a versatile approach that learns to reverse a gradual noising process in a compressed latent space in order to generate high-quality audio.
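As a rough illustration of the latent diffusion idea (not the paper's actual model), the forward process gradually corrupts a latent representation toward Gaussian noise; a network is then trained to reverse this process. The schedule, step count, and latent shape below are arbitrary assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                               # number of diffusion steps (hypothetical)
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (hypothetical)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

z0 = rng.standard_normal((16, 64))   # stand-in for an encoded audio latent

def q_sample(z0, t):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(a_bar_t) * z_0, (1 - a_bar_t) * I)."""
    noise = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return zt, noise

# As t grows, z_t interpolates between the data latent and pure noise;
# the denoiser would be trained to predict `noise` from (z_t, t).
zT, _ = q_sample(z0, T - 1)
```

In a real TTS system the denoiser is conditioned on text (and here, emotion) embeddings, and the denoised latent is decoded to a waveform by a vocoder.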
Key goals. This study aims to evaluate the efficacy of latent diffusion modeling for TTS synthesis on the EmoV-DB dataset, which features multi-speaker recordings across five emotional states, and to contrast it with other generative techniques.
Research methods. We applied latent diffusion modeling to TTS synthesis and evaluated its performance using metrics that assess intelligibility, speaker similarity, and emotion preservation in the generated audio signal.
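One of these metrics, speaker similarity, is commonly computed as the cosine similarity between speaker embeddings of the reference and synthesized utterances. The sketch below uses a random projection as a hypothetical stand-in for a real speaker encoder (in practice a pretrained neural model would be used); only the metric computation itself is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8000, 192))      # fake "encoder" weights (assumption)

def embed(wav: np.ndarray) -> np.ndarray:
    """Placeholder speaker encoder: project to 192 dims and L2-normalize."""
    v = wav @ W
    return v / np.linalg.norm(v)

def speaker_similarity(ref_wav: np.ndarray, syn_wav: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values near 1 indicate the same speaker."""
    return float(embed(ref_wav) @ embed(syn_wav))

ref = rng.standard_normal(8000)                 # stand-in reference audio
syn = ref + 0.1 * rng.standard_normal(8000)     # mildly perturbed "synthesis"
sim = speaker_similarity(ref, syn)
```

Intelligibility is typically measured analogously by transcribing the synthesized audio with an ASR model and computing word error rate against the input text, and emotion preservation by running an emotion classifier on the output.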
Results. The study reveals that while the proposed model performs reasonably well at preserving speaker characteristics, it is outperformed by the discrete autoregressive model xTTS v2 on all assessed metrics. Notably, the model under study exhibits deficiencies in emotion classification accuracy, suggesting a potential misalignment between the emotional intent encoded in the embeddings and the emotion expressed in the synthesized speech.
Conclusions. The findings suggest that further refinement of the encoder's ability to process and integrate emotional data could enhance the performance of the latent diffusion model. Future research should focus on optimizing the balance between speaker and emotion characteristics in TTS models to achieve a more holistic and effective synthesis of human-like speech.