Evaluating automatic sentence alignment approaches on English-Slovak sentences
© 2023. The Author(s).
Parallel texts are a valuable resource in many applications of natural language processing. The fundamental step in creating a parallel corpus is alignment. Sentence alignment is the task of finding correspondences between source sentences and their equivalent translations in the target text. A number of automatic sentence alignment approaches have been proposed, including ones based on neural networks; they can be divided into length-based, lexicon-based, and translation-based approaches. In our study, we used five different aligners, namely Bilingual Sentence Aligner (BSA), Hunalign, Bleualign, Vecalign, and Bertalign. We evaluated the accuracy of Bertalign against the aligners employed up to now, and compared all aligners with one another, for the English-Slovak language pair. We created a custom corpus consisting of texts collected in 2021 and 2022. Vecalign and Bertalign performed statistically significantly better than the other aligners, and BSA performed worst. Hunalign and Bleualign achieved the same performance in terms of F1 score. However, Bleualign showed the greatest variability in performance.
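The abstract reports aligner performance as an F1 score but does not say how that score was computed. As a minimal sketch only, the following assumes the common strict scheme in which a predicted alignment counts as correct when it matches a gold alignment exactly. The names `bead` and `alignment_scores` and the toy data are hypothetical and are not taken from the study.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# One alignment ("bead"): a set of source sentence indices paired with
# a set of target sentence indices. Hypothetical representation.
Bead = Tuple[FrozenSet[int], FrozenSet[int]]


def bead(src: Iterable[int], tgt: Iterable[int]) -> Bead:
    """Build a hashable alignment from source and target indices."""
    return (frozenset(src), frozenset(tgt))


def alignment_scores(gold: Set[Bead], predicted: Set[Bead]) -> Tuple[float, float, float]:
    """Return strict precision, recall, and F1 (exact bead matches only)."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (assumed): the gold alignment has a 2-to-1 bead that the
# aligner wrongly splits into a 1-to-1 bead and a 1-to-0 bead.
gold = {bead([0], [0]), bead([1, 2], [1]), bead([3], [2])}
predicted = {bead([0], [0]), bead([1], [1]), bead([2], []), bead([3], [2])}

precision, recall, f1 = alignment_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Strict matching penalises an aligner twice for a wrongly split bead, because the split lowers both precision and recall. A lenient variant that gives credit for partial overlap would yield higher scores on the same output.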
Field | Value |
---|---|
Media type | E-article |
Year of publication | 2023 |
Published | 2023 |
Contained in | Scientific reports - 13(2023), 1, 17 Nov., page 20123 (volume:13) |
Language | English |
Contributors | Forgac, Frantisek [author] |
Notes | Date Completed 20.11.2023; Date Revised 24.11.2023; published: Electronic; Citation Status MEDLINE |
DOI | 10.1038/s41598-023-47479-w |
PPN (catalog ID) | NLM364698039 |
LEADER | 01000naa a22002652 4500 | ||
---|---|---|---|
001 | NLM364698039 | ||
003 | DE-627 | ||
005 | 20231226211205.0 | ||
007 | cr uuu---uuuuu | ||
008 | 231226s2023 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1038/s41598-023-47479-w |2 doi | |
028 | 5 | 2 | |a pubmed24n1215.xml |
035 | |a (DE-627)NLM364698039 | ||
035 | |a (NLM)37978270 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
100 | 1 | |a Forgac, Frantisek |e verfasserin |4 aut | |
245 | 1 | 0 | |a Evaluating automatic sentence alignment approaches on English-Slovak sentences |
264 | 1 | |c 2023 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
500 | |a Date Completed 20.11.2023 | ||
500 | |a Date Revised 24.11.2023 | ||
500 | |a published: Electronic | ||
500 | |a Citation Status MEDLINE | ||
520 | |a © 2023. The Author(s). | ||
520 | |a Parallel texts represent a very valuable resource in many applications of natural language processing. The fundamental step in creating parallel corpus is the alignment. Sentence alignment is the issue of finding correspondence between source sentences and their equivalent translations in the target text. A number of automatic sentence alignment approaches were proposed including neural networks, which can be divided into length-based, lexicon-based, and translation-based. In our study, we used five different aligners, namely Bilingual sentence aligner (BSA), Hunalign, Bleualign, Vecalign, and Bertalign. We evaluated both, the performance of the Bertalign in terms of accuracy against the up to now employed aligners as well as among each other in the language pair English-Slovak. We created our custom corpus consisting of texts collected in 2021 and 2022. Vecalign and Bertalign performed statistically significantly best and BSA the worst. Hunalign and Bleualign achieved the same performance in terms of F1 score. However, Bleualign achieved the most diverse results in terms of performance. | ||
650 | 4 | |a Journal Article | |
650 | 4 | |a Research Support, Non-U.S. Gov't | |
700 | 1 | |a Munkova, Dasa |e verfasserin |4 aut | |
700 | 1 | |a Munk, Michal |e verfasserin |4 aut | |
700 | 1 | |a Kelebercova, Livia |e verfasserin |4 aut | |
773 | 0 | 8 | |i Enthalten in |t Scientific reports |d 2011 |g 13(2023), 1 vom: 17. Nov., Seite 20123 |w (DE-627)NLM215703936 |x 2045-2322 |7 nnns |
773 | 1 | 8 | |g volume:13 |g year:2023 |g number:1 |g day:17 |g month:11 |g pages:20123 |
856 | 4 | 0 | |u http://dx.doi.org/10.1038/s41598-023-47479-w |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a GBV_NLM | ||
951 | |a AR | ||
952 | |d 13 |j 2023 |e 1 |b 17 |c 11 |h 20123 |