GPT for RCTs? Using AI to determine adherence to reporting guidelines

Abstract

Importance: Adherence to established reporting guidelines can improve clinical trial reporting standards, but attempts to improve adherence have produced mixed results.

Objective: This exploratory study aimed to determine how accurately Large Language Model generative AI systems (AI-LLM) could determine reporting guideline compliance in a sample of clinical trial reports.

Design, Setting, and Participants: In this cross-sectional study, the OpenAI GPT-4 and Meta Llama 2 AI-LLMs were evaluated for their ability to determine reporting guideline adherence in a sample of 113 published sports medicine and exercise science clinical trial reports. For each paper, the GPT-4-Turbo and Llama 2 70B models were prompted to answer a series of nine reporting guideline questions about the text of the article. The GPT-4-Vision model was prompted to answer two additional reporting guideline questions about the participant flow diagram in a subset of articles. The dataset was randomly split (80/20) into TRAIN and TEST datasets. Hyperparameter tuning and fine-tuning were performed using the TRAIN dataset. The Llama 2 model was fine-tuned using the data from the GPT-4-Turbo analysis of the TRAIN dataset.

Main Outcome: Model performance (F1-score, classification accuracy) was assessed using the TEST dataset.

Results: Across all questions about the article text, the GPT-4-Turbo AI-LLM demonstrated acceptable performance (F1-score = 0.89, accuracy [95% CI] = 90% [85-94%]). Accuracy for all reporting guidelines was > 80%. The Llama 2 model's accuracy was initially poor (F1-score = 0.63, accuracy [95% CI] = 64% [57-71%]) and improved with fine-tuning (F1-score = 0.84, accuracy [95% CI] = 83% [77-88%]). The GPT-4-Vision model accurately identified all participant flow diagrams (accuracy [95% CI] = 100% [89-100%]) but was less accurate at identifying when details were missing from the flow diagram (accuracy [95% CI] = 57% [39-73%]).

Conclusions and Relevance: Both the GPT-4 and fine-tuned Llama 2 AI-LLMs showed promise as tools for assessing reporting guideline compliance. Next steps should include developing an efficient, open-source AI-LLM and exploring methods to improve model accuracy.

Key Points

Question: How accurately can Large Language Models determine adherence to clinical trial reporting guidelines?

Findings: In this cross-sectional study, the GPT-4 Large Language Model accurately (∼90%) determined reporting guideline adherence in a sample of 113 randomized clinical trials. Following fine-tuning, the open-source Llama 2 70B model achieved acceptable overall accuracy (∼84%).

Meaning: A Large Language Model such as GPT-4 could be used by journals, peer reviewers and authors to quickly and accurately check clinical trial reporting guideline adherence.
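The abstract reports model performance as F1-score plus classification accuracy with a 95% confidence interval. A minimal sketch of how such metrics could be computed, assuming binary adherent/non-adherent labels per question and a Wilson score interval for the accuracy CI (the preprint does not specify the interval method, so this is illustrative only):

```python
import math

def f1_and_accuracy(y_true, y_pred):
    """Binary F1-score and accuracy (1 = guideline item reported, 0 = not reported)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, accuracy

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion, e.g. classification accuracy."""
    p = successes / n
    adj = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / adj
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / adj
    return centre - half, centre + half
```

For example, a model that answers 90 of 100 adherence questions correctly would have `wilson_ci(90, 100)` span roughly 83-94%, similar in width to the intervals quoted in the abstract.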

Media type:

Preprint

Year of publication:

2024

Published:

2024

Contained in:

bioRxiv.org - (2024), 27 March

Language:

English

Contributors:

Wrightson, J.G. [Author]
Blazey, P. [Author]
Moher, D. [Author]
Khan, K.M. [Author]
Ardern, C.L. [Author]

Links:

Full text [free]

Topics:

570
Biology

DOI:

10.1101/2023.12.14.23299971

Funding:

Funding institution / project title:

PPN (catalogue ID):

XBI041887107