Reducing Vision-Answer Biases for Multiple-Choice VQA
Multiple-choice visual question answering (VQA) is a challenging task because it requires thorough multimodal understanding and complicated inter-modality relationship reasoning. To meet this challenge, previous approaches usually resort to different multimodal interaction modules. Despite their effectiveness, we find that existing methods may exploit a newly discovered bias (vision-answer bias) to predict answers, leading to suboptimal VQA performance and poor generalization. To address these issues, we propose a Causality-based Multimodal Interaction Enhancement (CMIE) method, which is model-agnostic and can be seamlessly incorporated into a wide range of VQA approaches in a plug-and-play manner. Specifically, CMIE contains two key components: a causal intervention module and a counterfactual interaction learning module. The former removes the spurious correlation between visual content and answers caused by the vision-answer bias, and the latter helps capture discriminative inter-modality relationships by directly supervising multimodal interaction training via an interactive loss. Extensive experimental results on three public benchmarks and one reorganized dataset show that the proposed method significantly improves seven representative VQA models, demonstrating the effectiveness and generalizability of CMIE.
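The abstract names an "interactive loss" that directly supervises multimodal interaction learning, but gives no formula. Purely as an illustrative sketch (not the paper's actual formulation), one common way to realize such counterfactual supervision is a margin contrast between the interaction score of the true vision-question-answer triple and the score obtained when the visual input is swapped for a mismatched (counterfactual) image; the function name, margin form, and score inputs below are all assumptions:

```python
import numpy as np

def interactive_loss(factual_scores, counterfactual_scores, margin=1.0):
    """Margin-based counterfactual interaction loss (illustrative sketch).

    Pushes the score of each factual vision-question-answer interaction
    to exceed, by at least `margin`, the score produced when the visual
    input is replaced with a mismatched image, so a model cannot rely on
    vision-answer shortcuts alone.
    """
    # Hinge: penalize only pairs where the factual score does not beat
    # the counterfactual score by the required margin.
    gap = margin - (factual_scores - counterfactual_scores)
    return float(np.mean(np.maximum(0.0, gap)))
```

In practice such a term would be added to the base VQA classification loss with a weighting coefficient; the loss is zero whenever every factual interaction already dominates its counterfactual counterpart by the margin.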
Media type: E-Article
Year of publication: 2023
Published: 2023
Contained in: Complete record - volume:32
Contained in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society - 32(2023), from: 09., pages 4621-4634
Language: English
Contributors: Zhang, Xi [author]
Notes: Date Revised 16.08.2023; published: Print-Electronic; Citation Status PubMed-not-MEDLINE
DOI: 10.1109/TIP.2023.3302162
PPN (catalog ID): NLM360559506
LEADER 01000naa a22002652 4500
001    NLM360559506
003    DE-627
005    20231226083253.0
007    cr uuu---uuuuu
008    231226s2023 xx |||||o 00| ||eng c
024 7  |a 10.1109/TIP.2023.3302162 |2 doi
028 52 |a pubmed24n1201.xml
035    |a (DE-627)NLM360559506
035    |a (NLM)37556338
040    |a DE-627 |b ger |c DE-627 |e rakwb
041    |a eng
100 1  |a Zhang, Xi |e verfasserin |4 aut
245 10 |a Reducing Vision-Answer Biases for Multiple-Choice VQA
264  1 |c 2023
336    |a Text |b txt |2 rdacontent
337    |a Computermedien |b c |2 rdamedia
338    |a Online-Ressource |b cr |2 rdacarrier
500    |a Date Revised 16.08.2023
500    |a published: Print-Electronic
500    |a Citation Status PubMed-not-MEDLINE
520    |a Multiple-choice visual question answering (VQA) is a challenging task because it requires thorough multimodal understanding and complicated inter-modality relationship reasoning. To meet this challenge, previous approaches usually resort to different multimodal interaction modules. Despite their effectiveness, we find that existing methods may exploit a newly discovered bias (vision-answer bias) to predict answers, leading to suboptimal VQA performance and poor generalization. To address these issues, we propose a Causality-based Multimodal Interaction Enhancement (CMIE) method, which is model-agnostic and can be seamlessly incorporated into a wide range of VQA approaches in a plug-and-play manner. Specifically, CMIE contains two key components: a causal intervention module and a counterfactual interaction learning module. The former removes the spurious correlation between visual content and answers caused by the vision-answer bias, and the latter helps capture discriminative inter-modality relationships by directly supervising multimodal interaction training via an interactive loss. Extensive experimental results on three public benchmarks and one reorganized dataset show that the proposed method significantly improves seven representative VQA models, demonstrating the effectiveness and generalizability of CMIE
650  4 |a Journal Article
700 1  |a Zhang, Feifei |e verfasserin |4 aut
700 1  |a Xu, Changsheng |e verfasserin |4 aut
773 08 |i Enthalten in |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society |d 1992 |g 32(2023) vom: 09., Seite 4621-4634 |w (DE-627)NLM09821456X |x 1941-0042 |7 nnns
773 18 |g volume:32 |g year:2023 |g day:09 |g pages:4621-4634
856 40 |u http://dx.doi.org/10.1109/TIP.2023.3302162 |3 Volltext
912    |a GBV_USEFLAG_A
912    |a GBV_NLM
951    |a AR
952    |d 32 |j 2023 |b 09 |h 4621-4634