Phoneme-level Prosody Modeling for Autoregressive Zero-shot Speech Synthesis

Affiliation: School of Electrical and Information Engineering, Tianjin University
Fund Project: The National Natural Science Foundation of China (General Program, Key Program, Major Research Plan)


Abstract:

In text-to-speech (TTS) synthesis, prosody modeling is crucial to the naturalness of synthesized speech. To improve the naturalness and robustness of synthesized prosody, this paper proposes an autoregressive TTS model based on phoneme-level prosody modeling. The model improves prosody modeling in two respects: word-level pauses and phoneme durations. To improve the diversity and accuracy of word-level pauses, a pause prediction module is added to the text front end; it predicts multi-class pause labels from the raw text, giving the synthesizer an accurate reference for modeling pause durations. To improve the naturalness of phoneme durations, a duration prediction module predicts a Gaussian mixture distribution for each phoneme and obtains diverse durations by random sampling. To stabilize phoneme duration modeling in the autoregressive model, an attention discrimination module is applied at every autoregressive time step and uses attention and discrimination mechanisms to avoid alignment disorder. Experimental results show that the three proposed modules effectively improve the naturalness and robustness of prosody modeling, and thereby the quality of the synthesized speech.
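The duration sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the mixture is parameterized, so everything below is an assumption made for illustration: the function name `sample_phoneme_durations`, the `(weight, mean, std)` tuple layout, the use of log-frame units for the Gaussian components, and the `min_frames` floor are all hypothetical and are not taken from the paper.

```python
import math
import random


def sample_phoneme_durations(mixtures, rng, min_frames=1):
    """Sample one duration (in frames) per phoneme from a Gaussian mixture.

    mixtures: one entry per phoneme; each entry is a list of
              (weight, mean, std) tuples. Means and stds are assumed to be
              in the log-frame domain (an assumption of this sketch).
    rng:      a random.Random instance, so results are reproducible.
    """
    durations = []
    for components in mixtures:
        weights = [w for w, _, _ in components]
        # Pick one mixture component in proportion to its weight.
        _, mean, std = rng.choices(components, weights=weights, k=1)[0]
        # Draw a log-duration from the chosen Gaussian, then map it to frames.
        log_duration = rng.gauss(mean, std)
        frames = max(min_frames, round(math.exp(log_duration)))
        durations.append(frames)
    return durations


# Hypothetical mixtures for two phonemes; in a real system these parameters
# would come from the duration prediction module.
example_mixtures = [
    [(0.7, math.log(5), 0.2), (0.3, math.log(12), 0.3)],
    [(1.0, math.log(8), 0.1)],
]
print(sample_phoneme_durations(example_mixtures, random.Random(0)))
```

Because each call draws a fresh component and a fresh value, repeated synthesis of the same text yields different but plausible durations, which is the diversity the abstract attributes to random sampling.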

History

Received: 2024-02-04
Last revised: 2024-04-02
Accepted: 2024-04-08
