To address several key challenges that existing tone mapping techniques face in practice, including unstable mapping results, difficulty in preserving the natural appearance of images, and limited adaptability to complex lighting and diverse scene types, this paper proposes a tone mapping method based on multimodal learning. The method acquires cross-modal supervisory information through the shared semantic space of text and images, with the goal of achieving more accurate, natural, and broadly applicable tone mapping. By leveraging the text-image matching signal of large text-image models to guide unsupervised training, it effectively suppresses underexposed and overexposed regions while avoiding the training instability and complexity of generative adversarial and contrastive learning approaches. Experiments show that the proposed method achieves superior performance on multiple public benchmark datasets. Compared with existing mainstream tone mapping algorithms, it not only preserves the overall lighting atmosphere of the image but also suppresses overexposed regions more effectively, enhances underexposed regions, retains rich color detail, and improves visual hierarchy, with stronger adaptability to varied lighting conditions and scene types. This work also confirms the significant potential of multimodal learning for fundamental vision tasks.
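
To make the core idea concrete, the sketch below shows how a frozen text-image model could supply cross-modal supervision for exposure quality during unsupervised training. The choice of OpenAI's CLIP (ViT-B/32), the prompt wording, the cross-entropy formulation, and the loss weight are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a text-image matching loss for tone mapping training.
# Assumptions (not from the paper): OpenAI's CLIP as the text-image model,
# hand-written prompt pair, and a cross-entropy objective that pushes the
# tone-mapped output toward the "well exposed" prompt.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP stays frozen; only the tone mapping network is trained

# Hypothetical prompt pair: index 0 is the target ("positive") description.
prompts = ["a well exposed photo with natural colors",
           "an overexposed or underexposed photo"]
text_tokens = clip.tokenize(prompts).to(device)
with torch.no_grad():
    text_feat = clip_model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# CLIP's standard input normalization constants.
_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_exposure_loss(ldr: torch.Tensor) -> torch.Tensor:
    """Text-image matching loss for a batch of tone-mapped images in [0, 1], shape (B, 3, H, W).

    Resizing and normalization use differentiable ops so gradients flow
    back into the tone mapping network.
    """
    x = F.interpolate(ldr, size=(224, 224), mode="bilinear", align_corners=False)
    x = ((x - _MEAN) / _STD).to(clip_model.dtype)
    img_feat = clip_model.encode_image(x)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    # Similarity logits between each image and the two prompts.
    logits = (clip_model.logit_scale.exp() * img_feat @ text_feat.t()).float()
    # Encourage every image to match the "well exposed" prompt (class 0).
    target = torch.zeros(ldr.size(0), dtype=torch.long, device=device)
    return F.cross_entropy(logits, target)

# Hypothetical usage inside a training step (tone_mapper is the network being trained):
# ldr = tone_mapper(hdr_input).clamp(0, 1)
# loss = reconstruction_loss(ldr, hdr_input) + 0.1 * clip_exposure_loss(ldr)
```

Because the text encoder and the prompt features are computed once and the image branch only scores the rendered output, this kind of supervision avoids the adversarial min-max training and the negative-pair construction that make GAN-based and contrastive objectives harder to stabilize.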