TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

Résumés déjà disponibles dans d'autres langues : en

Auteurs : Md Kamrul Hasan, Md Saiful Islam, Sangwu Lee, Wasifur Rahman, Iftekhar Naim, Mohammed Ibrahim Khan, Ehsan Hoque

arXiv: 2303.15430v2 - DOI (cs.CL)

Licence : CC BY 4.0

Résumé : Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model can not be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) can be integrated with language. Jointly modeling multiple modalities significantly increases the model complexity, and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive, both in terms of time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when they are presented in textual form. We present a way to convert the acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks while achieving superior (multimodal sarcasm detection) or near SOTA (multimodal sentiment analysis and multimodal humor detection) performance. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in a low-resource setting.

Soumis à arXiv le 27 Mar. 2023

Posez des questions sur cet article à notre assistant IA

Vous pouvez aussi discutez avec plusieurs papiers à la fois ici.

Instructions pour utiliser l'assistant IA ?

Résultats du processus de synthèse de l'article arXiv : 2303.15430v2

Résumé Complet
Points clés
Résumé vulgarisé
Article de blog

Le résumé n'est pas encore prêt

Les points clés ne sont pas encore prêts

Le résumé vulgarisé n'est pas encore prêt

L'article de blog n'est pas encore prêt

Créé le 05 Avr. 2023

Disponible dans d'autres langues : en

Évaluez la qualité du contenu généré par l'IA en votant

Note : 0

Le résumé précédent a été créé il y a plus d'un an et peut être réexécuté (si nécessaire) en cliquant sur le bouton Exécuter ci-dessous.

TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

Posez des questions sur cet article à notre assistant IA

Résultats du processus de synthèse de l'article arXiv : 2303.15430v2

Articles similaires résumés avec nos outils d'IA