Enhancing Early Detection of Cognitive Decline in the Elderly through Ensemble of NLP Techniques: A Comparative Study Utilizing Large Language Models in Clinical Notes

Abstract:

Summary: We found that the LLM, traditional machine learning, and deep learning models had diverse error profiles when identifying cognitive decline from clinical notes, and that an ensemble of the three achieved state-of-the-art performance.

Background: Early detection of cognitive decline in elderly individuals facilitates clinical trial enrollment and timely medical interventions. This study aims to apply, evaluate, and compare advanced natural language processing techniques for identifying signs of cognitive decline in clinical notes.

Methods: This study, conducted at Mass General Brigham (MGB), Boston, MA, included clinical notes from the 4 years prior to an initial mild cognitive impairment (MCI) diagnosis in 2019 for patients aged ≥ 50 years. Note sections regarding cognitive decline were labeled manually. A random sample of 4,949 note sections filtered with cognitive function-related keywords was used for traditional AI model development, and a random subset of 200 sections was used for LLM and prompt development; another random sample of 1,996 note sections without keyword filtering was used for testing. Prompt templates for large language models (LLMs), Llama 2 on Amazon Web Services and GPT-4 on Microsoft Azure, were developed with multiple prompting approaches to select the optimal LLM-based method. Baseline comparisons were made with XGBoost and a hierarchical attention-based deep neural network model. An ensemble of the three models was then constructed using majority vote.

Results: GPT-4 demonstrated superior accuracy and efficiency compared with Llama 2. The ensemble model outperformed the individual models, achieving a precision of 90.3%, a recall of 94.2%, and an F1-score of 92.2%. Notably, the ensemble model markedly improved precision (from a 70%-79% range to above 90%) compared with the best-performing single model. Error analysis revealed that 63 samples were wrongly predicted by at least one model; however, only 2 cases (3.2%) were errors shared by all models, indicating diverse error profiles among them.

Conclusion: Our findings indicate that LLMs and traditional models exhibit diverse error profiles. The ensemble of LLMs and locally trained machine learning models on EHR data was found to be complementary, enhancing performance and improving diagnostic accuracy.
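The following is a minimal sketch, not taken from the paper, of the majority-vote ensembling and the precision/recall/F1 evaluation described in the abstract; all function and variable names (majority_vote, precision_recall_f1, llm_preds, etc.) are illustrative assumptions.

    # Minimal sketch (assumed, not from the paper): majority vote over three
    # binary classifiers (LLM-based, XGBoost, deep neural network) and
    # positive-class precision/recall/F1.
    from collections import Counter

    def majority_vote(predictions_per_model):
        """Combine per-model binary predictions (0/1) by majority vote."""
        ensembled = []
        for sample_preds in zip(*predictions_per_model):
            ensembled.append(Counter(sample_preds).most_common(1)[0][0])
        return ensembled

    def precision_recall_f1(y_true, y_pred):
        """Compute precision, recall, and F1 for the positive class (1)."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Toy usage: three models' predictions on five hypothetical note sections.
    llm_preds = [1, 0, 1, 1, 0]
    xgb_preds = [1, 0, 0, 1, 0]
    dnn_preds = [0, 0, 1, 1, 1]
    labels    = [1, 0, 1, 1, 0]

    ensemble = majority_vote([llm_preds, xgb_preds, dnn_preds])
    print(ensemble)                                # [1, 0, 1, 1, 0]
    print(precision_recall_f1(labels, ensemble))   # (1.0, 1.0, 1.0) on this toy data

As a consistency check on the reported results, F1 = 2PR/(P + R) = 2(0.903)(0.942)/(0.903 + 0.942) ≈ 0.922, matching the 92.2% F1-score stated in the abstract.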

Media type:

Preprint

Year of publication:

2024

Published:

2024

Contained in:

bioRxiv.org (2024), 08 Apr 2024

Language:

English

Contributors:

Du, Xinsong [Author]
Novoa-Laurentiev, John [Author]
Plasek, Joseph M. [Author]
Chuang, Ya-Wen [Author]
Wang, Liqin [Author]
Chang, Frank [Author]
Datta, Surabhi [Author]
Paek, Hunki [Author]
Lin, Bin [Author]
Wei, Qiang [Author]
Wang, Xiaoyan [Author]
Wang, Jingqi [Author]
Ding, Hao [Author]
Manion, Frank J. [Author]
Du, Jingcheng [Author]
Zhou, Li [Author]

Links:

Full text [free of charge]

Topics:

570
Biology

DOI:

10.1101/2024.04.03.24305298

Funding:

Funding institution / project title:

PPN (catalog ID):

XBI043174183