Reducing Vision-Answer Biases for Multiple-Choice VQA
Multiple-choice visual question answering (VQA) is a challenging task because it requires thorough multimodal understanding and complicated inter-modality relationship reasoning. To meet this challenge, previous approaches usually resort to different multimodal interaction modules. Despite their effectiveness, we find that existing methods may exploit a newly discovered bias (vision-answer bias) to predict answers, leading to suboptimal VQA performance and poor generalization. To address these issues, we propose a Causality-based Multimodal Interaction Enhancement (CMIE) method, which is model-agnostic and can be seamlessly incorporated into a wide range of VQA approaches in a plug-and-play manner. Specifically, CMIE contains two key components: a causal intervention module and a counterfactual interaction learning module. The former removes the spurious correlation between visual content and answers caused by the vision-answer bias, and the latter helps capture discriminative inter-modality relationships by directly supervising multimodal interaction training via an interactive loss. Extensive experimental results on three public benchmarks and one reorganized dataset show that the proposed method significantly improves seven representative VQA models, demonstrating the effectiveness and generalizability of CMIE.
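The abstract names an "interactive loss" that directly supervises multimodal interaction learning, but gives no formula. Purely as an illustrative sketch (not the paper's actual formulation), one common way to realize such counterfactual supervision is a margin contrast between the interaction score of the true vision-question-answer triple and the score obtained when the visual input is swapped for a mismatched (counterfactual) image; the function name, margin form, and score inputs below are all assumptions:

```python
import numpy as np

def interactive_loss(factual_scores, counterfactual_scores, margin=1.0):
    """Margin-based counterfactual interaction loss (illustrative sketch).

    Pushes the score of each factual vision-question-answer interaction
    to exceed, by at least `margin`, the score produced when the visual
    input is replaced with a mismatched image, so a model cannot rely on
    vision-answer shortcuts alone.
    """
    # Hinge: penalize only pairs where the factual score does not beat
    # the counterfactual score by the required margin.
    gap = margin - (factual_scores - counterfactual_scores)
    return float(np.mean(np.maximum(0.0, gap)))
```

In practice such a term would be added to the base VQA classification loss with a weighting coefficient; the loss is zero whenever every factual interaction already dominates its counterfactual counterpart by the margin.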
Media type: E-Article
Year of publication: 2023
Published: 2023
Contained in: Complete record - volume:32
Contained in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society - 32(2023), from: 09., pages 4621-4634
Language: English
Contributors: Zhang, Xi [author]
Notes: Date Revised 16.08.2023; published: Print-Electronic; Citation Status PubMed-not-MEDLINE
DOI: 10.1109/TIP.2023.3302162
PPN (catalog ID): NLM360559506
LEADER 01000naa a22002652 4500
001    NLM360559506
003    DE-627
005    20231226083253.0
007    cr uuu---uuuuu
008    231226s2023 xx |||||o 00| ||eng c
024 7  |a 10.1109/TIP.2023.3302162 |2 doi
028 52 |a pubmed24n1201.xml
035    |a (DE-627)NLM360559506
035    |a (NLM)37556338
040    |a DE-627 |b ger |c DE-627 |e rakwb
041    |a eng
100 1  |a Zhang, Xi |e verfasserin |4 aut
245 10 |a Reducing Vision-Answer Biases for Multiple-Choice VQA
264  1 |c 2023
336    |a Text |b txt |2 rdacontent
337    |a Computermedien |b c |2 rdamedia
338    |a Online-Ressource |b cr |2 rdacarrier
500    |a Date Revised 16.08.2023
500    |a published: Print-Electronic
500    |a Citation Status PubMed-not-MEDLINE
520    |a Multiple-choice visual question answering (VQA) is a challenging task because it requires thorough multimodal understanding and complicated inter-modality relationship reasoning. To meet this challenge, previous approaches usually resort to different multimodal interaction modules. Despite their effectiveness, we find that existing methods may exploit a newly discovered bias (vision-answer bias) to predict answers, leading to suboptimal VQA performance and poor generalization. To address these issues, we propose a Causality-based Multimodal Interaction Enhancement (CMIE) method, which is model-agnostic and can be seamlessly incorporated into a wide range of VQA approaches in a plug-and-play manner. Specifically, CMIE contains two key components: a causal intervention module and a counterfactual interaction learning module. The former removes the spurious correlation between visual content and answers caused by the vision-answer bias, and the latter helps capture discriminative inter-modality relationships by directly supervising multimodal interaction training via an interactive loss. Extensive experimental results on three public benchmarks and one reorganized dataset show that the proposed method significantly improves seven representative VQA models, demonstrating the effectiveness and generalizability of CMIE
650  4 |a Journal Article
700 1  |a Zhang, Feifei |e verfasserin |4 aut
700 1  |a Xu, Changsheng |e verfasserin |4 aut
773 08 |i Enthalten in |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society |d 1992 |g 32(2023) vom: 09., Seite 4621-4634 |w (DE-627)NLM09821456X |x 1941-0042 |7 nnns
773 18 |g volume:32 |g year:2023 |g day:09 |g pages:4621-4634
856 40 |u http://dx.doi.org/10.1109/TIP.2023.3302162 |3 Volltext
912    |a GBV_USEFLAG_A
912    |a GBV_NLM
951    |a AR
952    |d 32 |j 2023 |b 09 |h 4621-4634