To improve the naturalness and robustness of synthesized prosody, an autoregressive speech synthesis model based on phoneme-level prosody modeling is proposed. The model strengthens prosody modeling in two respects: inter-word pauses and phoneme durations. To improve the diversity and accuracy of inter-word pauses, a pause prediction module is added to the text frontend. This module predicts multiple pause labels from the original text, providing accurate references for pause duration modeling during synthesis. To improve the naturalness of phoneme durations, a duration prediction module is proposed. This module predicts a Gaussian mixture distribution for each phoneme and obtains diverse phoneme durations by random sampling. To stabilize phoneme duration modeling in the autoregressive model, an attention-based discrimination module is proposed. This module is applied at each time step of the autoregressive process and avoids alignment disorder through attention and discrimination mechanisms. Experimental results demonstrate that the three proposed modules enhance the naturalness and robustness of prosody modeling, thereby improving the quality of the synthesized speech.
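
The duration sampling step described above can be illustrated with a minimal sketch. It assumes that each phoneme's predicted mixture is given as a list of (weight, mean, standard deviation) components and that durations are measured in whole frames. Neither detail is stated in the text, and the function name, data layout, and numeric values are hypothetical.

```python
import random


def sample_phoneme_durations(mixtures, rng, min_frames=1):
    """Draw one duration (in frames) per phoneme from its Gaussian mixture.

    mixtures: one entry per phoneme; each entry is a list of
              (weight, mean, std) components (assumed layout).
    rng:      a random.Random instance, so sampling is reproducible.
    """
    durations = []
    for components in mixtures:
        weights = [weight for weight, _, _ in components]
        # Pick one component in proportion to its mixture weight.
        _, mean, std = rng.choices(components, weights=weights, k=1)[0]
        # Sample from the chosen Gaussian, round to whole frames,
        # and clamp so every phoneme lasts at least min_frames.
        value = rng.gauss(mean, std)
        durations.append(max(min_frames, round(value)))
    return durations


# Hypothetical mixture predictions for three phonemes.
predicted = [
    [(0.7, 5.0, 1.0), (0.3, 9.0, 2.0)],
    [(1.0, 12.0, 1.5)],
    [(0.5, 3.0, 0.5), (0.5, 6.0, 1.0)],
]

durations = sample_phoneme_durations(predicted, random.Random(0))
print(durations)
```

Repeating the call with a different seed generally yields different durations for the same text, which is the source of the prosodic diversity the abstract refers to. Always taking the mean of the most heavily weighted component would instead give one fixed duration per phoneme.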