Attackers can easily deceive neural networks by adding human-imperceptible adversarial noise to natural samples, causing misclassification. Previous research on defending against such adversarial perturbations has predominantly concentrated on single-modal tasks, leaving multimodal scenarios insufficiently explored. This paper therefore aims to improve the robustness of multimodal RGB-skeleton action recognition. We introduce a robust action recognition framework based on a Feature Interaction Module (FIM), which extracts global information from adversarial samples to learn inter-modal joint representations that calibrate the multimodal features. We also develop a loss function tailored to this framework. Experimental results show that, against the CW attack, our method achieves an RI of 25.14% and an average robust accuracy of 48.99% on the NTURGB+D dataset, outperforming the latest SimMin+ExFMem method by 8.55 and 23.79 percentage points, respectively. These findings indicate that our approach surpasses the compared methods in enhancing robustness while balancing accuracy.