Omicron detection with large language models and YouTube audio data
Abstract: Publicly available audio data presents a unique opportunity for the development of digital health technologies with large language models (LLMs). In this study, YouTube was mined to collect audio data from individuals with self-declared positive COVID-19 tests as well as those with other upper respiratory infections (URI) and healthy subjects discussing a diverse range of topics. The resulting dataset was transcribed with the Whisper model and used to assess the capacity of LLMs for detecting self-reported COVID-19 cases and performing variant classification. Following prompt optimization, LLMs achieved accuracies of 0.89 and 0.97, respectively, in the tasks of identifying self-reported COVID-19 cases and other respiratory illnesses. The model also obtained a mean accuracy of 0.77 at identifying the variant of self-reported COVID-19 cases using only symptoms and other health-related factors described in the YouTube videos. In contrast to past studies, which used scripted, standardized voice samples to capture biomarkers, this study focused on extracting meaningful information from public online audio data. This work introduced novel design paradigms for pandemic management tools, showing the potential of audio data in clinical and public health applications.
Media type: Preprint
Year of publication: 2024
Published: 2024
Contained in: bioRxiv.org (2024), 29 March
Language: English
Contributors: Anibal, James T. [author]
Links: Full text [free of charge]
doi: 10.1101/2022.09.13.22279673
PPN (catalog ID): XBI037325965
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | XBI037325965 | ||
003 | DE-627 | ||
005 | 20240330124734.0 | ||
007 | cr uuu---uuuuu | ||
008 | 220920s2024 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1101/2022.09.13.22279673 |2 doi | |
035 | |a (DE-627)XBI037325965 | ||
035 | |a (biorXiv)10.1101/2022.09.13.22279673 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
100 | 1 | |a Anibal, James T. |e verfasserin |4 aut | |
245 | 1 | 0 | |a Omicron detection with large language models and YouTube audio data |
264 | 1 | |c 2024 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
520 | |a Abstract Publicly available audio data presents a unique opportunity for the development of digital health technologies with large language models (LLMs). In this study, YouTube was mined to collect audio data from individuals with self-declared positive COVID-19 tests as well as those with other upper respiratory infections (URI) and healthy subjects discussing a diverse range of topics. The resulting dataset was transcribed with the Whisper model and used to assess the capacity of LLMs for detecting self-reported COVID-19 cases and performing variant classification. Following prompt optimization, LLMs achieved accuracies of 0.89, 0.97, respectively, in the tasks of identifying self-reported COVID-19 cases and other respiratory illnesses. The model also obtained a mean accuracy of 0.77 at identifying the variant of self-reported COVID-19 cases using only symptoms and other health-related factors described in the YouTube videos. In comparison with past studies, which used scripted, standardized voice samples to capture biomarkers, this study focused on extracting meaningful information from public online audio data. This work introduced novel design paradigms for pandemic management tools, showing the potential of audio data in clinical and public health applications. | ||
650 | 4 | |a Biology |7 (dpeaa)DE-84 | |
650 | 4 | |a 570 |7 (dpeaa)DE-84 | |
700 | 1 | |a Landa, Adam J. |4 aut | |
700 | 1 | |a Hang, Nguyen T. T. |4 aut | |
700 | 1 | |a Song, Miranda J. |0 (orcid)0000-0003-4448-277X |4 aut | |
700 | 1 | |a Peltekian, Alec K. |4 aut | |
700 | 1 | |a Shin, Ashley |4 aut | |
700 | 1 | |a Huth, Hannah B. |4 aut | |
700 | 1 | |a Hazen, Lindsey A. |4 aut | |
700 | 1 | |a Christou, Anna S. |4 aut | |
700 | 1 | |a Rivera, Jocelyne |4 aut | |
700 | 1 | |a Morhard, Robert A. |4 aut | |
700 | 1 | |a Bagci, Ulas |4 aut | |
700 | 1 | |a Li, Ming |4 aut | |
700 | 1 | |a Bensoussan, Yael |4 aut | |
700 | 1 | |a Clifton, David A. |4 aut | |
700 | 1 | |a Wood, Bradford J. |4 aut | |
773 | 0 | 8 | |i Enthalten in |t bioRxiv.org |g (2024) vom: 29. März |
773 | 1 | 8 | |g year:2024 |g day:29 |g month:03 |
856 | 4 | 0 | |u http://dx.doi.org/10.1101/2022.09.13.22279673 |z kostenfrei |3 Volltext |
912 | |a GBV_XBI | ||
951 | |a AR | ||
952 | |j 2024 |b 29 |c 03 |