Abstract: In text-to-speech (TTS), prosody modeling is crucial to the naturalness of synthesized speech. To improve the naturalness and robustness of synthesized prosody, this paper proposes an autoregressive TTS model based on phoneme-level prosody modeling. The model strengthens prosody modeling in two respects: inter-word pauses and phoneme durations. To improve the diversity and accuracy of inter-word pauses, a pause prediction module is added to the text frontend. This module predicts multiple pause labels from the original text, providing accurate references for pause duration modeling during synthesis. To make phoneme durations more natural, a duration prediction module is proposed. This module predicts a Gaussian mixture distribution for each phoneme and obtains diverse phoneme durations through random sampling. To stabilize phoneme duration modeling in the autoregressive model, an attention-based discrimination module is proposed. This module is applied at each time step of the autoregressive process and uses attention and discrimination mechanisms to avoid alignment disorder. Experimental results show that the three proposed modules improve the naturalness and robustness of prosody modeling and thereby the quality of the synthesized speech.
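
As an illustration of the duration sampling step mentioned in the abstract, the following is a minimal sketch of drawing one duration per phoneme from a Gaussian mixture. It is not the paper's implementation: the names `PhonemeGMM`, `sample_duration`, and `sample_durations` are hypothetical, the mixture parameters are assumed to be given in frames (in practice they would come from the duration prediction module, possibly in another domain such as log-duration), and clamping to a minimum of one frame is an added assumption.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PhonemeGMM:
    """Hypothetical per-phoneme Gaussian mixture over duration (in frames)."""
    weights: Sequence[float]  # mixture weights, expected to sum to 1
    means: Sequence[float]    # component means, in frames (assumed unit)
    stds: Sequence[float]     # component standard deviations, in frames


def sample_duration(gmm: PhonemeGMM, rng: random.Random, min_frames: int = 1) -> int:
    # Pick one mixture component with probability proportional to its weight.
    k = rng.choices(range(len(gmm.weights)), weights=gmm.weights, k=1)[0]
    # Draw from the chosen Gaussian and round to a whole number of frames.
    value = rng.gauss(gmm.means[k], gmm.stds[k])
    # Assumption: every phoneme occupies at least `min_frames` frames.
    return max(min_frames, round(value))


def sample_durations(gmms: List[PhonemeGMM], seed: int = 0) -> List[int]:
    # A seeded generator makes a given sampling run reproducible.
    rng = random.Random(seed)
    return [sample_duration(g, rng) for g in gmms]
```

Calling `sample_durations` with different seeds yields different duration sequences for the same mixtures, which is the source of diversity that random sampling provides in this sketch.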