Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching

Mainstream image-sentence matching studies currently focus on fine-grained alignment between image regions and sentence words. However, these methods overlook a crucial fact: the correspondence between images and sentences arises not simply from alignments between individual regions and words, but from alignments between the phrases they respectively form. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching that models the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled scheme for training and inference that relieves the trade-off in bi-directional retrieval: image-to-sentence matching is executed in the textual semantic space, while sentence-to-image matching is executed in the visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin and can compete with methods that introduce external knowledge.
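The decoupled bi-directional retrieval described in the abstract can be illustrated with a minimal PyTorch sketch: each modality is projected into both a textual and a visual semantic space, and the retrieval direction decides which space the similarity is computed in. All class names, feature dimensions, and projection layers below are hypothetical illustrations under these assumptions, not the authors' implementation; the paper's phrase-attention modules are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledRetrievalHead(nn.Module):
    """Sketch of direction-dependent matching in two semantic spaces."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        # Projections into a shared textual semantic space.
        self.img_to_txt = nn.Linear(img_dim, embed_dim)
        self.txt_to_txt = nn.Linear(txt_dim, embed_dim)
        # Projections into a shared visual semantic space.
        self.img_to_vis = nn.Linear(img_dim, embed_dim)
        self.txt_to_vis = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat, txt_feat, direction):
        # Pick the semantic space by retrieval direction, as the abstract
        # describes: image-to-sentence matching runs in the textual space,
        # sentence-to-image matching in the visual space.
        if direction == "image_to_sentence":
            q = F.normalize(self.img_to_txt(img_feat), dim=-1)
            g = F.normalize(self.txt_to_txt(txt_feat), dim=-1)
        elif direction == "sentence_to_image":
            q = F.normalize(self.txt_to_vis(txt_feat), dim=-1)
            g = F.normalize(self.img_to_vis(img_feat), dim=-1)
        else:
            raise ValueError(f"unknown direction: {direction}")
        # Cosine similarities: rows are queries, columns are gallery items.
        return q @ g.T

# Usage: rank 5 candidate sentences for 2 query images.
head = DecoupledRetrievalHead()
imgs = torch.randn(2, 2048)
txts = torch.randn(5, 768)
scores = head(imgs, txts, "image_to_sentence")  # shape (2, 5)
```

Because each direction uses its own space, neither retrieval task has to compromise its embedding for the other, which is one plausible reading of how the decoupling "relieves the trade-off" in bi-directional retrieval.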

Media type:

E-article

Year of publication:

2024

Published:

2024

Contained in:

IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society - 33 (2024), pages 1326-1337

Language:

English

Contributors:

Shi, Zhangxiang [Author]
Zhang, Tianzhu [Author]
Wei, Xi [Author]
Wu, Feng [Author]
Zhang, Yongdong [Author]

Links:

Full text

Subjects:

Journal Article

Notes:

Date Revised 14.02.2024

published: Print-Electronic

Citation Status PubMed-not-MEDLINE

DOI:

10.1109/TIP.2022.3197972

PPN (catalog ID):

NLM344954587