GPT for RCTs? Using AI to determine adherence to reporting guidelines

Abstract

Importance: Adherence to established reporting guidelines can improve clinical trial reporting standards, but attempts to improve adherence have produced mixed results.

Objective: This exploratory study aimed to determine how accurately Large Language Model generative AI systems (AI-LLM) could determine reporting guideline compliance in a sample of clinical trial reports.

Design, Setting, and Participants: In this cross-sectional study, the OpenAI GPT-4 and Meta Llama 2 AI-LLMs were evaluated for their ability to determine reporting guideline adherence in a sample of 113 published sports medicine and exercise science clinical trial reports. For each paper, the GPT-4-Turbo and Llama 2 70B models were prompted to answer a series of nine reporting guideline questions about the text of the article. The GPT-4-Vision model was prompted to answer two additional reporting guideline questions about the participant flow diagram in a subset of articles. The dataset was randomly split (80/20) into TRAIN and TEST datasets. Hyperparameter tuning and fine-tuning were performed using the TRAIN dataset. The Llama 2 model was fine-tuned using the data from the GPT-4-Turbo analysis of the TRAIN dataset.

Main Outcome: Model performance (F1-score, classification accuracy) was assessed using the TEST dataset.

Results: Across all questions about the article text, the GPT-4-Turbo AI-LLM demonstrated acceptable performance (F1-score = 0.89, accuracy [95% CI] = 90% [85-94%]). Accuracy for all reporting guidelines was > 80%. The Llama 2 model's accuracy was initially poor (F1-score = 0.63, accuracy [95% CI] = 64% [57-71%]) and improved with fine-tuning (F1-score = 0.84, accuracy [95% CI] = 83% [77-88%]). The GPT-4-Vision model accurately identified all participant flow diagrams (accuracy [95% CI] = 100% [89-100%]) but was less accurate at identifying when details were missing from the flow diagram (accuracy [95% CI] = 57% [39-73%]).

Conclusions and Relevance: Both the GPT-4 and fine-tuned Llama 2 AI-LLMs showed promise as tools for assessing reporting guideline compliance. Next steps should include developing an efficient, open-source AI-LLM and exploring methods to improve model accuracy.

Key Points

Question: How accurately can Large Language Models determine adherence to clinical trial reporting guidelines?

Findings: In this cross-sectional study, the GPT-4 Large Language Model accurately (∼90%) determined reporting guideline adherence in a sample of 113 randomized clinical trials. Following fine-tuning, the open-source Llama 2 70B model achieved acceptable overall accuracy (∼84%).

Meaning: A Large Language Model such as GPT-4 could be used by journals, peer reviewers and authors to quickly and accurately check clinical trial reporting guideline adherence.
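The abstract reports model performance as F1-score plus classification accuracy with a 95% confidence interval. A minimal sketch of how such metrics could be computed, assuming binary adherent/non-adherent labels per question and a Wilson score interval for the accuracy CI (the preprint does not specify the interval method, so this is illustrative only):

```python
import math

def f1_and_accuracy(y_true, y_pred):
    """Binary F1-score and accuracy (1 = guideline item reported, 0 = not reported)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, accuracy

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion, e.g. classification accuracy."""
    p = successes / n
    adj = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / adj
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / adj
    return centre - half, centre + half
```

For example, a model that answers 90 of 100 adherence questions correctly would have `wilson_ci(90, 100)` span roughly 83-94%, similar in width to the intervals quoted in the abstract.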

Media type:

Preprint

Year of publication:

2024

Published:

2024

Contained in:

bioRxiv.org - (2024), 27 March

Language:

English

Contributors:

Wrightson, J.G. [Author]
Blazey, P. [Author]
Moher, D. [Author]
Khan, K.M. [Author]
Ardern, C.L. [Author]

Links:

Full text [free]

Topics:

570
Biology

DOI:

10.1101/2023.12.14.23299971

Funding:

Funding institution / project title:

PPN (catalogue ID):

XBI041887107