Publication details - A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets

A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets

Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10^-76, r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference-data-driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.

Media type:	E-article

Year of publication:	2021
Published:	2021

Contained in:	To the parent record - volume:12
Contained in:	Genes - 12(2021), 6, 10 June

Language:	English

Contributors:	Doddahonnaiah, Deeksha [author]; Lenehan, Patrick J [author]; Hughes, Travis K [author]; Zemmour, David [author]; Garcia-Rivera, Enrique [author]; Venkatakrishnan, A J [author]; Chilaka, Ramakrishna [author]; Khare, Apoorv [author]; Kasaraneni, Akhil [author]; Garg, Abhinav [author]; Anand, Akash [author]; Barve, Rakesh [author]; Thiagarajan, Viswanathan [author]; Soundararajan, Venky [author]

Links:	Full text

Subjects:	Journal Article; Natural language processing; Research Support, Non-U.S. Gov't; Single cell genomics

Notes:	Date Completed 21.09.2021; Date Revised 24.11.2021; published: Electronic; Citation Status MEDLINE

doi:	10.3390/genes12060898


PPN (catalog ID):	NLM327473924

Internal format


LEADER	01000caa a22002652 4500
001	NLM327473924
003	DE-627
005	20231226203233.0
007	cr uuu---uuuuu
008	231225s2021 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.3390/genes12060898 \|2 doi
028	5	2	\|a pubmed24n1091.xml
035			\|a (DE-627)NLM327473924
035			\|a (NLM)34200671
035			\|a (PII)898
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Doddahonnaiah, Deeksha \|e verfasserin \|4 aut
245	1	2	\|a A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets
264		1	\|c 2021
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Completed 21.09.2021
500			\|a Date Revised 24.11.2021
500			\|a published: Electronic
500			\|a Citation Status MEDLINE
520			\|a Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10^-76, r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference-data-driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data
650		4	\|a Journal Article
650		4	\|a Research Support, Non-U.S. Gov't
650		4	\|a natural language processing
650		4	\|a single cell genomics
700	1		\|a Lenehan, Patrick J \|e verfasserin \|4 aut
700	1		\|a Hughes, Travis K \|e verfasserin \|4 aut
700	1		\|a Zemmour, David \|e verfasserin \|4 aut
700	1		\|a Garcia-Rivera, Enrique \|e verfasserin \|4 aut
700	1		\|a Venkatakrishnan, A J \|e verfasserin \|4 aut
700	1		\|a Chilaka, Ramakrishna \|e verfasserin \|4 aut
700	1		\|a Khare, Apoorv \|e verfasserin \|4 aut
700	1		\|a Kasaraneni, Akhil \|e verfasserin \|4 aut
700	1		\|a Garg, Abhinav \|e verfasserin \|4 aut
700	1		\|a Anand, Akash \|e verfasserin \|4 aut
700	1		\|a Barve, Rakesh \|e verfasserin \|4 aut
700	1		\|a Thiagarajan, Viswanathan \|e verfasserin \|4 aut
700	1		\|a Soundararajan, Venky \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t Genes \|d 2011 \|g 12(2021), 6 vom: 10. Juni \|w (DE-627)NLM220446326 \|x 2073-4425 \|7 nnns
773	1	8	\|g volume:12 \|g year:2021 \|g number:6 \|g day:10 \|g month:06
856	4	0	\|u http://dx.doi.org/10.3390/genes12060898 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a GBV_NLM
951			\|a AR
952			\|d 12 \|j 2021 \|e 6 \|b 10 \|c 06
