Autoregressive Zero-shot Speech Synthesis Based on Phoneme-level Prosody Modeling

Journal of Hunan University (Natural Sciences), 2025, Vol. 52, No. 4, pp. 114-123

Authors: YUE Huanjing, WANG Jiawei, YANG Jingyu†
(School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China)



Abstract:

To improve the naturalness and stability of synthesized prosody, an autoregressive speech synthesis model based on phoneme-level prosody modeling is proposed. The model improves prosody modeling in two respects: word-level pauses and phoneme durations. To improve the diversity and accuracy of word-level pauses, a pause prediction module is introduced in the text front end. This module predicts multi-class pause labels from the raw text, giving the synthesizer an accurate reference for modeling pause durations. To improve the naturalness of phoneme durations, a duration prediction module is proposed. This module predicts a Gaussian mixture distribution for each phoneme and obtains diverse phoneme durations by random sampling. To stabilize phoneme duration modeling in the autoregressive model, an attention discrimination module is proposed. This module is applied at every autoregressive time step and uses attention and discrimination mechanisms to avoid alignment disorder. Experimental results show that the three proposed modules effectively improve the naturalness and stability of prosody modeling, and thereby the quality of the synthesized speech.

Key words: speech synthesis; prosody modeling; pause prediction
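
The abstract states that the duration prediction module predicts a Gaussian mixture for each phoneme and obtains durations by random sampling. The sketch below illustrates only that sampling step and is not the authors' implementation. It assumes, hypothetically, that each phoneme's mixture is given as a list of `(weight, mean, std)` tuples and that durations are measured in frames; the function name `sample_phoneme_durations`, the `min_frames` clamp, and the example parameters are all illustrative.

```python
import random


def sample_phoneme_durations(mixtures, rng, min_frames=1):
    """Sample one duration (in frames) per phoneme from its Gaussian mixture.

    mixtures: one entry per phoneme, each a list of (weight, mean, std) tuples.
    rng: a random.Random instance, passed in so that sampling is reproducible.
    min_frames: assumed lower bound so that no phoneme receives zero frames.
    """
    durations = []
    for components in mixtures:
        weights = [weight for weight, _, _ in components]
        # Step 1: pick one mixture component in proportion to its weight.
        _, mean, std = rng.choices(components, weights=weights, k=1)[0]
        # Step 2: draw from the chosen Gaussian, then round and clamp.
        frames = max(min_frames, round(rng.gauss(mean, std)))
        durations.append(frames)
    return durations


# Hypothetical mixture parameters for three phonemes (not taken from the paper).
example_mixtures = [
    [(0.7, 5.0, 1.0), (0.3, 9.0, 2.0)],
    [(1.0, 3.0, 0.5)],
    [(0.5, 12.0, 3.0), (0.5, 6.0, 1.5)],
]

durations = sample_phoneme_durations(example_mixtures, random.Random(0))
print(durations)
```

Because a component is drawn first and a value second, repeated calls with different seeds give different but plausible duration sequences for the same text, which is the diversity the abstract refers to.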


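Cite this article

YUE Huanjing, WANG Jiawei, YANG Jingyu. Autoregressive zero-shot speech synthesis based on phoneme-level prosody modeling[J]. Journal of Hunan University (Natural Sciences), 2025, 52(4): 114-123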

History

Published online: 2025-04-28
