Accommodating audio modality in CLIP for multimodal processing

Published in Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Recommended citation: L. Ruan, A. Hu, Y. Song, L. Zhang, S. Zheng, Q. Jin. "Accommodating Audio Modality in CLIP for Multimodal Processing." Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023, pp. 7-14. https://ojs.aaai.org/index.php/AAAI/article/view/26153/25925

Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and the other modalities as well as the inner characteristics of the audio modality itself. Moreover, we design an audio type token to dynamically learn different audio information types for different scenarios, since general audio conveys heterogeneous information, both verbal and nonverbal. Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning, and achieves state-of-the-art performance on the benchmark datasets MSR-VTT, VATEX, and AudioCaps.
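
To make the inter-modal and intra-modal contrastive objectives concrete, below is a minimal sketch of how such losses are commonly implemented. This is an illustration only, not the authors' released code: the function names, the temperature value, and the equal weighting of the loss terms are assumptions for the example.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(feats_a: torch.Tensor, feats_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D)."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def pretraining_loss(text_f, vis_f, aud_f, aud_aug_f):
    """Hypothetical combination of the objectives described in the abstract:
    inter-modal (text-audio, vision-audio) plus intra-modal contrastive
    learning between an augmented audio view and the original audio."""
    inter_modal = contrastive_loss(text_f, aud_f) + contrastive_loss(vis_f, aud_f)
    intra_modal = contrastive_loss(aud_f, aud_aug_f)
    return inter_modal + intra_modal
```

In this sketch each modality's encoder produces one embedding per clip in the batch; matching (same-index) pairs act as positives and all other in-batch pairs as negatives, mirroring CLIP-style training extended to the audio modality.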

Figure 1: An overview of our CLIP4VLA model, which consists of three backbones: Text Encoder, Vision Encoder, and Audio Encoder. After encoding a batch of text, vision, and audio features, we pre-train the model with contrastive learning objectives on text-audio, video-audio, and augmented audio-original audio pairs. The black squares are not included in the calculation.