Latent Attention Network With Position Perception for Visual Question Answering

To explore the complex relative position relationships among multiple objects referenced by position prepositions in the question, we propose a novel latent attention (LA) network for visual question answering (VQA), in which LA with position perception is extracted by a novel LA generation module (LAGM) and encoded along with absolute and relative position relations by our proposed position-aware module (PAM). The LAGM reconstructs the original attention into LA by capturing the tendency of visual attention to shift according to the position prepositions in the question. The LA accurately captures the complex relative position features of multiple objects and helps the model direct attention to the correct object or region. The PAM adopts the latent state and relative position relations to enhance the model's ability to comprehend multiobject correlations. In addition, we propose a novel gated counting module (GCM) to strengthen the model's sensitivity to quantitative knowledge, effectively improving performance on counting questions. Extensive experiments demonstrate that our method achieves excellent performance on VQA and outperforms state-of-the-art methods on the widely used VQA v2 and VQA v1 datasets.
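The abstract describes the gated counting module (GCM) only at a high level, so no implementation details are available here. As an illustration only, the general idea of gating quantitative information into a feature vector can be sketched with a standard sigmoid-gated fusion; this is a generic mechanism, not the authors' actual GCM, and all names, shapes, and weights below are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(base_feat, count_feat, W_g, b_g):
    """Generic gated fusion sketch (hypothetical, not the paper's GCM):
    a sigmoid gate decides, per dimension, how much quantitative
    (counting) information to mix into the base feature."""
    g = sigmoid(np.concatenate([base_feat, count_feat]) @ W_g + b_g)  # gate values in (0, 1)
    return g * count_feat + (1.0 - g) * base_feat  # elementwise convex combination

# Toy usage with random features and weights (all hypothetical).
rng = np.random.default_rng(0)
d = 8
base = rng.standard_normal(d)            # e.g. fused question-image feature
count = rng.standard_normal(d)           # e.g. quantitative/counting feature
W_g = rng.standard_normal((2 * d, d)) * 0.1
b_g = np.zeros(d)
fused = gated_fusion(base, count, W_g, b_g)
print(fused.shape)
```

Because the output is an elementwise convex combination, each fused value lies between the corresponding base and counting feature values, which is what makes the gate an interpretable mixing weight.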

Media type:

E-article

Year of publication:

2024

Published:

2024

Contained in:

IEEE Transactions on Neural Networks and Learning Systems - PP (2024), 26 March

Language:

English

Contributors:

Zhang, Jing [Author]
Liu, Xiaoqiang [Author]
Wang, Zhe [Author]

Links:

Full text

Subjects:

Journal Article

Notes:

Date Revised 26.03.2024

published: Print-Electronic

Citation Status Publisher

DOI:

10.1109/TNNLS.2024.3377636

PPN (catalog ID):

NLM370202848