Publication details - Omicron detection with large language models and YouTube audio data

Omicron detection with large language models and YouTube audio data

Abstract: Publicly available audio data presents a unique opportunity for the development of digital health technologies with large language models (LLMs). In this study, YouTube was mined to collect audio data from individuals with self-declared positive COVID-19 tests, from individuals with other upper respiratory infections (URI), and from healthy subjects discussing a diverse range of topics. The resulting dataset was transcribed with the Whisper model and used to assess the capacity of LLMs for detecting self-reported COVID-19 cases and performing variant classification. Following prompt optimization, LLMs achieved accuracies of 0.89 and 0.97, respectively, in the tasks of identifying self-reported COVID-19 cases and other respiratory illnesses. The model also obtained a mean accuracy of 0.77 at identifying the variant of self-reported COVID-19 cases using only symptoms and other health-related factors described in the YouTube videos. In contrast to past studies, which used scripted, standardized voice samples to capture biomarkers, this study focused on extracting meaningful information from public online audio data. This work introduced novel design paradigms for pandemic management tools, showing the potential of audio data in clinical and public health applications.

Media type:	Preprint

Year of publication:	2024
Published:	2024

Contained in:	bioRxiv.org - (2024), 29 March

Language:	English

Contributors:	Anibal, James T. [author]; Landa, Adam J. [author]; Hang, Nguyen T. T. [author]; Song, Miranda J. [author]; Peltekian, Alec K. [author]; Shin, Ashley [author]; Huth, Hannah B. [author]; Hazen, Lindsey A. [author]; Christou, Anna S. [author]; Rivera, Jocelyne [author]; Morhard, Robert A. [author]; Bagci, Ulas [author]; Li, Ming [author]; Bensoussan, Yael [author]; Clifton, David A. [author]; Wood, Bradford J. [author]

Links:	Full text [free of charge]

Subjects:	570 Biology

doi:	10.1101/2022.09.13.22279673


PPN (catalog ID):	XBI037325965

Internal format


LEADER	01000caa a22002652 4500
001	XBI037325965
003	DE-627
005	20240330124734.0
007	cr uuu---uuuuu
008	220920s2024 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1101/2022.09.13.22279673 \|2 doi
035			\|a (DE-627)XBI037325965
035			\|a (biorXiv)10.1101/2022.09.13.22279673
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Anibal, James T. \|e verfasserin \|4 aut
245	1	0	\|a Omicron detection with large language models and YouTube audio data
264		1	\|c 2024
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
520			\|a Abstract Publicly available audio data presents a unique opportunity for the development of digital health technologies with large language models (LLMs). In this study, YouTube was mined to collect audio data from individuals with self-declared positive COVID-19 tests as well as those with other upper respiratory infections (URI) and healthy subjects discussing a diverse range of topics. The resulting dataset was transcribed with the Whisper model and used to assess the capacity of LLMs for detecting self-reported COVID-19 cases and performing variant classification. Following prompt optimization, LLMs achieved accuracies of 0.89, 0.97, respectively, in the tasks of identifying self-reported COVID-19 cases and other respiratory illnesses. The model also obtained a mean accuracy of 0.77 at identifying the variant of self-reported COVID-19 cases using only symptoms and other health-related factors described in the YouTube videos. In comparison with past studies, which used scripted, standardized voice samples to capture biomarkers, this study focused on extracting meaningful information from public online audio data. This work introduced novel design paradigms for pandemic management tools, showing the potential of audio data in clinical and public health applications.
650		4	\|a Biology \|7 (dpeaa)DE-84
650		4	\|a 570 \|7 (dpeaa)DE-84
700	1		\|a Landa, Adam J. \|4 aut
700	1		\|a Hang, Nguyen T. T. \|4 aut
700	1		\|a Song, Miranda J. \|0 (orcid)0000-0003-4448-277X \|4 aut
700	1		\|a Peltekian, Alec K. \|4 aut
700	1		\|a Shin, Ashley \|4 aut
700	1		\|a Huth, Hannah B. \|4 aut
700	1		\|a Hazen, Lindsey A. \|4 aut
700	1		\|a Christou, Anna S. \|4 aut
700	1		\|a Rivera, Jocelyne \|4 aut
700	1		\|a Morhard, Robert A. \|4 aut
700	1		\|a Bagci, Ulas \|4 aut
700	1		\|a Li, Ming \|4 aut
700	1		\|a Bensoussan, Yael \|4 aut
700	1		\|a Clifton, David A. \|4 aut
700	1		\|a Wood, Bradford J. \|4 aut
773	0	8	\|i Enthalten in \|t bioRxiv.org \|g (2024) vom: 29. März
773	1	8	\|g year:2024 \|g day:29 \|g month:03
856	4	0	\|u http://dx.doi.org/10.1101/2022.09.13.22279673 \|z kostenfrei \|3 Volltext
912			\|a GBV_XBI
951			\|a AR
952			\|j 2024 \|b 29 \|c 03
