De-identifying Spanish medical texts - named entity recognition applied to radiology reports
BACKGROUND: Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although several anonymization strategies currently exist for English, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts that can be adapted to other languages.
RESULTS: We tested four neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. In addition, we developed a randomization algorithm that substitutes the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were then tested on the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%.
CONCLUSIONS: The proposed strategy, combining named entity recognition with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it could be easily extended to other languages and medical texts, such as electronic health records.
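The substitution step mentioned under RESULTS (replacing each detected entity with a new one from the same category) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function name `randomize_entities`, the category labels (`NAME`, `DATE`, `HOSPITAL`), the surrogate pools and the character-offset span format are all assumptions made for illustration, and a real system would draw surrogates from much larger dictionaries.

```python
import random
from typing import Dict, List, Tuple

# Assumed span format: (start, end, category), character offsets into the text.
Entity = Tuple[int, int, str]

# Hypothetical surrogate pools, one per entity category (illustrative values only).
SURROGATES: Dict[str, List[str]] = {
    "NAME": ["María García", "Juan Martínez", "Lucía Fernández"],
    "DATE": ["03/05/2017", "21/11/2018", "14/02/2019"],
    "HOSPITAL": ["Hospital San Rafael", "Hospital Santa Elena"],
}


def randomize_entities(text: str, entities: List[Entity], rng: random.Random) -> str:
    """Replace every detected span with a random surrogate of the same category."""
    pieces: List[str] = []
    cursor = 0
    # Process spans left to right so untouched text can be copied in between.
    for start, end, category in sorted(entities):
        if start < cursor:
            raise ValueError("overlapping entity spans are not supported")
        pieces.append(text[cursor:start])
        original = text[start:end]
        pool = SURROGATES[category]
        # Prefer a surrogate that differs from the original value when one exists.
        candidates = [s for s in pool if s != original] or pool
        pieces.append(rng.choice(candidates))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    report = "Paciente Ana Ruiz, explorado el 01/01/2020."
    name_start = report.index("Ana Ruiz")
    date_start = report.index("01/01/2020")
    detected = [
        (name_start, name_start + len("Ana Ruiz"), "NAME"),
        (date_start, date_start + len("01/01/2020"), "DATE"),
    ]
    print(randomize_entities(report, detected, random.Random(0)))
```

In this sketch the spans are assumed to come from an upstream named entity recognition model; replacing rather than masking keeps the text readable while hiding which values were real.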
Field | Value |
---|---|
Media type | E-article |
Year of publication | 2021 |
Published | 2021 |
Contained in | Journal of biomedical semantics, volume 12 (2021), issue 1, 29 March, page 6 |
Language | English |
Contributors | Pérez-Díez, Irene [author]; Pérez-Moraga, Raúl [author]; López-Cerdán, Adolfo [author]; Salinas-Serrano, Jose-Maria [author]; la Iglesia-Vayá, María de [author] |
Topics | Journal Article |
Notes | Date Completed 28.10.2021; Date Revised 31.03.2024; published: Electronic; Citation Status MEDLINE |
DOI | 10.1186/s13326-021-00236-2 |
PPN (catalog ID) | NLM323371485 |
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | NLM323371485 | ||
003 | DE-627 | ||
005 | 20240331233138.0 | ||
007 | cr uuu---uuuuu | ||
008 | 231225s2021 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1186/s13326-021-00236-2 |2 doi | |
028 | 5 | 2 | |a pubmed24n1358.xml |
035 | |a (DE-627)NLM323371485 | ||
035 | |a (NLM)33781334 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
100 | 1 | |a Pérez-Díez, Irene |e verfasserin |4 aut | |
245 | 1 | 0 | |a De-identifying Spanish medical texts - named entity recognition applied to radiology reports |
264 | 1 | |c 2021 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
500 | |a Date Completed 28.10.2021 | ||
500 | |a Date Revised 31.03.2024 | ||
500 | |a published: Electronic | ||
500 | |a Citation Status MEDLINE | ||
650 | 4 | |a Journal Article | |
650 | 4 | |a Research Support, Non-U.S. Gov't | |
650 | 4 | |a Medical texts | |
650 | 4 | |a Named entity recognition | |
650 | 4 | |a Natural language processing | |
650 | 4 | |a Radiology reports | |
650 | 4 | |a Spanish | |
700 | 1 | |a Pérez-Moraga, Raúl |e verfasserin |4 aut | |
700 | 1 | |a López-Cerdán, Adolfo |e verfasserin |4 aut | |
700 | 1 | |a Salinas-Serrano, Jose-Maria |e verfasserin |4 aut | |
700 | 1 | |a la Iglesia-Vayá, María de |e verfasserin |4 aut | |
773 | 0 | 8 | |i Enthalten in |t Journal of biomedical semantics |d 2010 |g 12(2021), 1 vom: 29. März, Seite 6 |w (DE-627)NLM199466343 |x 2041-1480 |7 nnns |
773 | 1 | 8 | |g volume:12 |g year:2021 |g number:1 |g day:29 |g month:03 |g pages:6 |
856 | 4 | 0 | |u http://dx.doi.org/10.1186/s13326-021-00236-2 |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a GBV_NLM | ||
951 | |a AR | ||
952 | |d 12 |j 2021 |e 1 |b 29 |c 03 |h 6 |