A Deep Learning Approach for Transgender and Gender Diverse Patient Identification in Electronic Health Records

ABSTRACT
Background: Accurate identification of gender identity in the electronic health record (EHR) is crucial for providing equitable health care, particularly for transgender and gender diverse (TGD) populations, but it remains challenging because gender information in structured EHR fields is often incomplete.
Objective: To develop a deep learning classifier that accurately identifies patient gender identity using patient-level EHR data, including free-text notes.
Methods: This study included adult patients in a large healthcare system in Boston, MA, between 4/1/2017 and 4/1/2022. To identify relevant information in a large volume of clinical notes and to reduce noise, we compiled a list of gender-related keywords through expert curation, literature review, and expansion via a fine-tuned BioWordVec model. This keyword list was used to pre-screen potential TGD individuals and to create two datasets for model training, testing, and validation. Dataset I was a balanced dataset that contained clinician-confirmed TGD patients and cases without keywords. Dataset II contained cases with keywords. The performance of the deep learning model was compared with that of traditional machine learning and rule-based algorithms.
Results: The final keyword list consisted of 109 keywords, of which 58 (53.2%) were added through expansion by the BioWordVec model. Dataset I contained 3,150 patients (50% TGD), while Dataset II contained 200 patients (90% TGD). On Dataset I, the deep learning model achieved an F1 score of 0.917, a sensitivity of 0.854, and a precision of 0.980; on Dataset II, it achieved an F1 score of 0.969, a sensitivity of 0.967, and a precision of 0.972. The deep learning model significantly outperformed rule-based algorithms.
Conclusion: This is the first study to show that deep learning algorithms can accurately identify gender identity using EHR data. Future work should leverage and evaluate additional diverse data sources to generate more generalizable algorithms.
Graphical abstract: [figure]
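The F1 score, sensitivity, and precision reported in the abstract are linked by the standard definition of F1 as the harmonic mean of precision and sensitivity (recall). The following is a minimal sketch, not taken from the preprint, that recomputes F1 from the reported precision and sensitivity. The function name `f1_score` and the dictionary layout are illustrative assumptions, and recomputed values need not match reported ones exactly if the authors averaged over folds or rounded intermediate values.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0  # convention when both metrics are zero
    return 2 * precision * sensitivity / (precision + sensitivity)


# Values copied from the abstract; nothing here is new data.
reported = {
    "Dataset I": {"precision": 0.980, "sensitivity": 0.854, "f1": 0.917},
    "Dataset II": {"precision": 0.972, "sensitivity": 0.967, "f1": 0.969},
}

for name, m in reported.items():
    recomputed = f1_score(m["precision"], m["sensitivity"])
    print(f"{name}: recomputed F1 = {recomputed:.3f}, reported F1 = {m['f1']:.3f}")
```

For Dataset II the recomputed value rounds to 0.969, which agrees with the reported F1. For Dataset I the recomputed value is about 0.913 against a reported 0.917; the abstract does not say how the metrics were aggregated, so this small gap may simply reflect the averaging scheme.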

Media type:

Preprint

Year of publication:

2024

Published:

2024

Contained in:

bioRxiv.org - (2024), 23 Apr.

Language:

English

Contributors:

Hua, Yining [Author]
Wang, Liqin [Author]
Nguyen, Vi [Author]
Rieu-Werden, Meghan [Author]
McDowell, Alex [Author]
Bates, David W. [Author]
Foer, Dinah [Author]
Zhou, Li [Author]

Links:

Full text [license required]
Full text [free of charge]

Subjects:

570
Biology

doi:

10.1101/2023.06.07.23290988

funding:

Funding institution / Project title:

PPN (catalog ID):

XBI039856771