Publication details - Applying Artificial Intelligence Methods for the Estimation of Disease Incidence

Applying Artificial Intelligence Methods for the Estimation of Disease Incidence : The Utility of Language Models

Copyright © 2020 Zhang, Walecki, Winter, Bragman, Lourenco, Hart, Baker, Perov and Johri.

Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text, to estimate disease incidence.
Methods: We used a class of machine learning models, called language models, to extract contextual information relating to disease incidence. We evaluated three different language models: BioBERT, Global Vectors for Word Representation (GloVe), and the Universal Sentence Encoder (USE), as well as an approach which uses all jointly. The output of these models is a mathematical representation of the underlying data, known as "embeddings." We used these to train neural network models to predict disease incidence. The neural networks were trained and validated using data from the Global Burden of Disease study, and tested using independent data sourced from the epidemiological literature.
Findings: A variety of language models can be used to encode contextual information of diseases. We found that, on average, BioBERT embeddings were the best for disease names across multiple tasks. In particular, BioBERT was the best performing model when predicting specific disease-country pairs, whilst a fusion model combining BioBERT, GloVe, and USE performed best on average when predicting disease incidence in unseen countries. We also found that GloVe embeddings performed better than BioBERT embeddings when applied to country names. However, we also noticed that the models were limited in view of predicting previously unseen diseases. Further limitations were also observed with substantial variations across age groups and notably lower performance for diseases that are highly dependent on location and climate.
Interpretation: We demonstrate that context-aware machine learning models can be used for estimating disease incidence. This method is quicker to implement than traditional epidemiological approaches. We therefore suggest it complements existing modeling efforts, where data is required more rapidly or at larger scale. This may particularly benefit AI-driven digital health products where the data will undergo further processing and a validated approximation of the disease incidence is adequate.
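The pipeline summarized in the abstract (embed each disease with several language models, combine the embeddings, then train a model to predict incidence) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the random vectors stand in for real BioBERT, GloVe, and USE embeddings, fusion is assumed to be plain concatenation (the abstract does not say how the fusion model combines them), a linear model trained by gradient descent stands in for the neural networks used in the study, and the target values are synthetic rather than Global Burden of Disease data. The names `random_embedding`, `fuse`, and `train` are hypothetical.

```python
import random

random.seed(0)

def random_embedding(dim):
    """Return a random vector standing in for a real language-model embedding."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def fuse(*vectors):
    """Fuse several embeddings of one item by concatenating them (an assumption)."""
    fused = []
    for v in vectors:
        fused.extend(v)
    return fused

def predict(weights, bias, x):
    """Linear prediction: bias plus the dot product of weights and features."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def mse(weights, bias, X, y):
    """Mean squared error of the linear model on (X, y)."""
    return sum((predict(weights, bias, x) - t) ** 2 for x, t in zip(X, y)) / len(X)

def train(X, y, lr=0.005, steps=300):
    """Full-batch gradient descent on mean squared error, starting from zeros."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errors, X)) for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Hypothetical dimensions; real BioBERT, GloVe, and USE vectors are far larger.
DIMS = (6, 4, 5)
n_diseases = 30
X = [fuse(*(random_embedding(d) for d in DIMS)) for _ in range(n_diseases)]

# Synthetic target standing in for log incidence; NOT real epidemiological data.
true_w = [random.gauss(0.0, 1.0) for _ in range(sum(DIMS))]
y = [sum(w * xi for w, xi in zip(true_w, x)) + random.gauss(0.0, 0.1) for x in X]

# Hold out the last six "diseases" to mimic evaluation on unseen items.
X_train, y_train = X[:24], y[:24]
X_test, y_test = X[24:], y[24:]

weights, bias = train(X_train, y_train)
print("train MSE:", mse(weights, bias, X_train, y_train))
print("held-out MSE:", mse(weights, bias, X_test, y_test))
```

Holding out whole items, as in the last step, loosely mirrors the unseen-disease and unseen-country evaluations mentioned in the abstract; the numbers printed here say nothing about the published results.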

Media type:	E-article

Year of publication:	2020
Published:	2020

Contained in:	Go to parent record - volume:2
Contained in:	Frontiers in digital health - 2(2020), dated: 28., page 569261

Language:	English

Contributors:	Zhang, Yuanzhao [author]; Walecki, Robert [author]; Winter, Joanne R [author]; Bragman, Felix J S [author]; Lourenco, Sara [author]; Hart, Christopher [author]; Baker, Adam [author]; Perov, Yura [author]; Johri, Saurabh [author]

Links:	Full text

Subjects:	Deep learning; Disease incidence; Health statistic data; Journal Article; Machine learning; Natural language processing

Notes:	Date Revised 30.10.2021; published: Electronic-eCollection; Citation Status PubMed-not-MEDLINE

doi:	10.3389/fdgth.2020.569261


PPN (catalog ID):	NLM332513564

Internal format


LEADER	01000naa a22002652 4500
001	NLM332513564
003	DE-627
005	20231225215727.0
007	cr uuu---uuuuu
008	231225s2020 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.3389/fdgth.2020.569261 \|2 doi
028	5	2	\|a pubmed24n1108.xml
035			\|a (DE-627)NLM332513564
035			\|a (NLM)34713043
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Zhang, Yuanzhao \|e verfasserin \|4 aut
245	1	0	\|a Applying Artificial Intelligence Methods for the Estimation of Disease Incidence \|b The Utility of Language Models
264		1	\|c 2020
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 30.10.2021
500			\|a published: Electronic-eCollection
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Copyright © 2020 Zhang, Walecki, Winter, Bragman, Lourenco, Hart, Baker, Perov and Johri.
520			\|a Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text, to estimate disease incidence. Methods: We used a class of machine learning models, called language models, to extract contextual information relating to disease incidence. We evaluated three different language models: BioBERT, Global Vectors for Word Representation (GloVe), and the Universal Sentence Encoder (USE), as well as an approach which uses all jointly. The output of these models is a mathematical representation of the underlying data, known as "embeddings." We used these to train neural network models to predict disease incidence. The neural networks were trained and validated using data from the Global Burden of Disease study, and tested using independent data sourced from the epidemiological literature. Findings: A variety of language models can be used to encode contextual information of diseases. We found that, on average, BioBERT embeddings were the best for disease names across multiple tasks. In particular, BioBERT was the best performing model when predicting specific disease-country pairs, whilst a fusion model combining BioBERT, GloVe, and USE performed best on average when predicting disease incidence in unseen countries. We also found that GloVe embeddings performed better than BioBERT embeddings when applied to country names. However, we also noticed that the models were limited in view of predicting previously unseen diseases. Further limitations were also observed with substantial variations across age groups and notably lower performance for diseases that are highly dependent on location and climate. Interpretation: We demonstrate that context-aware machine learning models can be used for estimating disease incidence. 
This method is quicker to implement than traditional epidemiological approaches. We therefore suggest it complements existing modeling efforts, where data is required more rapidly or at larger scale. This may particularly benefit AI-driven digital health products where the data will undergo further processing and a validated approximation of the disease incidence is adequate
650		4	\|a Journal Article
650		4	\|a deep learning
650		4	\|a disease incidence
650		4	\|a health statistic data
650		4	\|a machine learning
650		4	\|a natural language processing
700	1		\|a Walecki, Robert \|e verfasserin \|4 aut
700	1		\|a Winter, Joanne R \|e verfasserin \|4 aut
700	1		\|a Bragman, Felix J S \|e verfasserin \|4 aut
700	1		\|a Lourenco, Sara \|e verfasserin \|4 aut
700	1		\|a Hart, Christopher \|e verfasserin \|4 aut
700	1		\|a Baker, Adam \|e verfasserin \|4 aut
700	1		\|a Perov, Yura \|e verfasserin \|4 aut
700	1		\|a Johri, Saurabh \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t Frontiers in digital health \|d 2019 \|g 2(2020) vom: 28., Seite 569261 \|w (DE-627)NLM319100871 \|x 2673-253X \|7 nnns
773	1	8	\|g volume:2 \|g year:2020 \|g day:28 \|g pages:569261
856	4	0	\|u http://dx.doi.org/10.3389/fdgth.2020.569261 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a GBV_NLM
951			\|a AR
952			\|d 2 \|j 2020 \|b 28 \|h 569261
