Publication details - UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets

UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating, and preventing COVID-19. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emerging large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
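The clustering idea named in the abstract can be illustrated with a minimal, dependency-free sketch. It rests on several assumptions that are not taken from the record: each sample is modeled as a set of mutation labels, each sample's feature vector is its row of Jaccard distances to all samples, and the toy labels `m1` to `m8` and all function names are hypothetical. The dimension-reduction step (PCA, t-SNE, or UMAP) is deliberately omitted, and the deterministic farthest-point initialization is a simplification chosen for reproducibility rather than a claim about the authors' method.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def jaccard_features(samples):
    """Assumed representation: each sample becomes its row of distances."""
    return [[jaccard_distance(s, t) for t in samples] for s in samples]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic init: start at points[0], then repeatedly add the
    point farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: two groups of samples sharing no mutations.
samples = [
    {"m1", "m2", "m3"},
    {"m1", "m2", "m3", "m4"},
    {"m1", "m2"},
    {"m5", "m6", "m7"},
    {"m5", "m6", "m7", "m8"},
    {"m5", "m6"},
]
features = jaccard_features(samples)
# A dimension-reduction step would be applied to `features` here.
labels = kmeans(features, k=2)
print(labels)
```

On this toy input the first three samples land in one cluster and the last three in the other. With realistic sample counts the distance rows become very high-dimensional, which is where a reduction step before $k$-means would come in.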

Media type:	Preprint

Year of publication:	2020
Published:	2020

Contained in:	arXiv.org, 30 Dec. 2020

Language:	English

Contributors:	Hozumi, Yuta [author]; Wang, Rui [author]; Yin, Changchuan [author]; Wei, Guo-Wei [author]

Links:	Full text [free of charge]

Funding institution / Project title:

PPN (catalog ID):	XAR019650175

Internal format


LEADER	01000caa a22002652 4500
001	XAR019650175
003	DE-627
005	20230429062658.0
007	cr uuu---uuuuu
008	210104s2020 xx \|\|\|\|\|o 00\| \|\|eng c
035			\|a (DE-627)XAR019650175
035			\|a (DE-599)arXiv2012.15268
035			\|a (arXiv)2012.15268
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
082	0		\|a 570 \|q DE-84
100	1		\|a Hozumi, Yuta \|e verfasserin \|4 aut
245	1	0	\|a UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets
264		1	\|c 2020
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
520			\|a Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for the COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
700	1		\|a Wang, Rui \|e verfasserin \|4 aut
700	1		\|a Yin, Changchuan \|e verfasserin \|4 aut
700	1		\|a Wei, Guo-Wei \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t arXiv.org \|g (2020) vom: 30. Dez.
773	1	8	\|g year:2020 \|g day:30 \|g month:12
856	4	0	\|u https://arxiv.org/abs/2012.15268 \|z kostenfrei \|3 Volltext
912			\|a GBV_XAR
912			\|a SSG-OLC-PHA
951			\|a AR
952			\|j 2020 \|b 30 \|c 12
953			\|2 045F \|a 570
