To study the overall bearing capacity of concrete-filled fire-resistant square steel tubular (CFFRSST) columns under axial compression, six CFFRSST columns were designed and tested. The tests examined the ultimate bearing capacity and the strain development at key locations of specimens with different welded residual stresses, slenderness ratios, width-to-thickness ratios, and steel strength grades. In addition, finite element models (FEM) of the CFFRSST columns were established that account for the welded residual stress. Based on the validated FEM, further analyses were conducted to evaluate the influence of the welded residual stress, slenderness ratio, and width-to-thickness ratio on the axial compression behavior. The results show that the overall stability coefficients of CFFRSST columns are higher than those of conventional-strength steel concrete-filled square steel tubular (CSSCFSST) columns, because fire-resistant steel has lower welded residual stress than conventional-strength steel. Furthermore, the difference in overall stability coefficient between the CFFRSST and CSSCFSST columns first increases and then decreases as the slenderness ratio increases. Short columns are more prone to strength failure, and their ultimate bearing capacity is significantly influenced by the interaction between the steel tube and the concrete. As the width-to-thickness ratio decreases, the concrete compressive strength improves due to the constraint effect, leading to a higher stability coefficient. In contrast, the failure of medium and long columns is primarily governed by second-order effects, and the constraint effect is less pronounced. The stability coefficient of these columns is primarily affected by the welded residual compressive stress, which increases as the width-to-thickness ratio increases. The ultimate bearing capacities obtained from the parametric analysis were compared with predictions from current design codes.
The results show that the Chinese and American codes underpredict the parametric analysis results by about 32% and 9%, respectively, while the European code underestimates them by about 7%, providing the most accurate predictions among the three.