Prototype local–global alignment network for image–text retrieval
Abstract: Image–text retrieval is a challenging task due to the requirement of thorough multimodal understanding and precise inter-modality relationship discovery. However, most previous approaches resort to doing global image–text alignment and neglect fine-grained correspondence. Although some works explore local region–word alignment, they usually suffer from a heavy computing burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval by jointly performing the fine-grained local alignment and high-level global alignment. Specifically, our PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter helps perceive hierarchical global semantics by exploring multi-scale global correlations between the image and text. Overall, the local and global alignment modules can boost their performances for each other via the unified model. Quantitative and qualitative experimental results on Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods.
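The abstract describes the two alignment modules only in outline. As a rough illustration of the idea, and not the authors' implementation, the following minimal Python sketch scores an image–text pair by comparing how regions and words distribute over a shared set of prototypes (local term) and by comparing mean-pooled features (a single-scale stand-in for the multi-scale global term). All function names, the softmax temperature, the mean pooling, and the unweighted sum of the two terms are assumptions made for this example; none of them come from the record.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def prototype_profile(features, prototypes, temperature=0.1):
    """Average soft assignment of local features (regions or words) to prototypes.

    Hypothetical helper: the temperature value is an assumption.
    """
    profile = [0.0] * len(prototypes)
    for f in features:
        weights = softmax([cosine(f, p) / temperature for p in prototypes])
        for k, w in enumerate(weights):
            profile[k] += w / len(features)
    return profile

def mean_pool(features):
    """Element-wise mean of a list of vectors (assumed global descriptor)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def match_score(regions, words, prototypes):
    """Assumed combination: local prototype-level similarity plus global similarity."""
    local = cosine(prototype_profile(regions, prototypes),
                   prototype_profile(words, prototypes))
    global_ = cosine(mean_pool(regions), mean_pool(words))
    return local + global_

# Toy 2-D features, invented purely for illustration.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.8, 0.2]]
matching_words = [[1.0, 0.0], [0.7, 0.3]]
unrelated_words = [[0.0, 1.0], [0.1, 0.9]]
print(match_score(regions, matching_words, prototypes) >
      match_score(regions, unrelated_words, prototypes))
```

In this sketch the regions and words are never compared with each other directly; each side is compared only with the prototypes, so a prototype profile can be computed once per image or caption and reused. That is one plausible reading of why the abstract calls prototype-based matching efficient, but the record itself does not spell out the mechanism.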
Media type: | Article |
Year of publication: | 2022 |
Published: | 2022 |
Contained in: | International journal of multimedia information retrieval - 11(2022), no. 4, 6 Oct., pp. 525-538 |
Language: | English |
Contributors: | Meng, Lingtao [author] |
Links: | Full text [licence required] |
Subjects: | Global alignment |
Notes: | © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. |
DOI: | 10.1007/s13735-022-00258-1 |
PPN (catalogue ID): | OLC2080172514 |
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | OLC2080172514 | ||
003 | DE-627 | ||
005 | 20240405160100.0 | ||
007 | tu | ||
008 | 230131s2022 xx ||||| 00| ||eng c | ||
024 | 7 | |a 10.1007/s13735-022-00258-1 |2 doi | |
035 | |a (DE-627)OLC2080172514 | ||
035 | |a (DE-He213)s13735-022-00258-1-p | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
082 | 0 | 4 | |a 004 |a 660 |a 070 |a 020 |q VZ |
084 | |a 54.87 |2 bkl | ||
084 | |a 54.64 |2 bkl | ||
100 | 1 | |a Meng, Lingtao |e verfasserin |4 aut | |
245 | 1 | 0 | |a Prototype local–global alignment network for image–text retrieval |
264 | 1 | |c 2022 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia | ||
338 | |a Band |b nc |2 rdacarrier | ||
500 | |a © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. | ||
520 | |a Abstract Image–text retrieval is a challenging task due to the requirement of thorough multimodal understanding and precise inter-modality relationship discovery. However, most previous approaches resort to doing global image–text alignment and neglect fine-grained correspondence. Although some works explore local region–word alignment, they usually suffer from a heavy computing burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval by jointly performing the fine-grained local alignment and high-level global alignment. Specifically, our PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter helps perceive hierarchical global semantics by exploring multi-scale global correlations between the image and text. Overall, the local and global alignment modules can boost their performances for each other via the unified model. Quantitative and qualitative experimental results on Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods. | ||
650 | 4 | |a Image–text retrieval | |
650 | 4 | |a Local alignment | |
650 | 4 | |a Global alignment | |
650 | 4 | |a Prototype | |
700 | 1 | |a Zhang, Feifei |0 (orcid)0000-0002-8153-9977 |4 aut | |
700 | 1 | |a Zhang, Xi |4 aut | |
700 | 1 | |a Xu, Changsheng |4 aut | |
773 | 0 | 8 | |i Enthalten in |t International journal of multimedia information retrieval |d Springer London, 2012 |g 11(2022), 4 vom: 06. Okt., Seite 525-538 |w (DE-627)684132834 |w (DE-600)2647391-4 |w (DE-576)9684132832 |x 2192-6611 |7 nnns |
773 | 1 | 8 | |g volume:11 |g year:2022 |g number:4 |g day:06 |g month:10 |g pages:525-538 |
856 | 4 | 1 | |u https://doi.org/10.1007/s13735-022-00258-1 |z lizenzpflichtig |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a SYSFLAG_A | ||
912 | |a GBV_OLC | ||
912 | |a SSG-OLC-PHA | ||
936 | b | k | |a 54.87 |j Multimedia |j Multimedia |q VZ |
936 | b | k | |a 54.64 |j Datenbanken |j Datenbanken |q VZ |
951 | |a AR | ||
952 | |d 11 |j 2022 |e 4 |b 06 |c 10 |h 525-538 |