UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating, and preventing COVID-19. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we find that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
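The abstract names two ingredients, Jaccard distance-based features and $k$-means clustering, without giving implementation details. The following is a minimal sketch of how those two pieces might fit together, using only the Python standard library. The mutation labels (`m1` to `m7`), the function names, and the deterministic farthest-point initialization are assumptions made for illustration and are not taken from the paper. The dimension-reduction step itself (PCA, t-SNE, or UMAP) is omitted here; in practice it would presumably be applied to the feature matrix before clustering, for example with `umap.UMAP(n_components=2).fit_transform(features)` from the umap-learn package.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def jaccard_features(samples):
    """Describe each sample by its vector of Jaccard distances to all samples."""
    return [[jaccard_distance(s, t) for t in samples] for s in samples]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def farthest_point_init(points, k):
    """Deterministic initialization (an assumption, not the paper's choice):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy input: each isolate is a set of hypothetical mutation labels.
samples = [
    {"m1", "m2", "m3"},
    {"m1", "m2", "m4"},
    {"m1", "m2", "m3", "m4"},
    {"m5", "m6"},
    {"m5", "m6", "m7"},
    {"m6", "m7"},
]

features = jaccard_features(samples)
labels = kmeans(features, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The first three toy isolates share no mutations with the last three, so every cross-group Jaccard distance is 1 and the two groups separate cleanly. Real mutation datasets are far larger and noisier, which is the setting in which the abstract reports that reducing the dimension first improves clustering.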
Media type: |
Preprint |
---|
Year of publication: |
2020 |
---|---|
Published: |
2020 |
Contained in: |
arXiv.org - (2020), dated 30 Dec. To the parent record - year:2020 |
---|
Language: |
English |
---|
Contributors: |
Hozumi, Yuta [author] |
---|
Links: |
Full text [free of charge] |
---|
Funding institution / Project title: |
|
---|
PPN (catalog ID): |
XAR019650175 |
---|
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | XAR019650175 | ||
003 | DE-627 | ||
005 | 20230429062658.0 | ||
007 | cr uuu---uuuuu | ||
008 | 210104s2020 xx |||||o 00| ||eng c | ||
035 | |a (DE-627)XAR019650175 | ||
035 | |a (DE-599)arXiv2012.15268 | ||
035 | |a (arXiv)2012.15268 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
082 | 0 | |a 570 |q DE-84 | |
100 | 1 | |a Hozumi, Yuta |e verfasserin |4 aut | |
245 | 1 | 0 | |a UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets |
264 | 1 | |c 2020 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
520 | |a Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for the COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates. | ||
700 | 1 | |a Wang, Rui |e verfasserin |4 aut | |
700 | 1 | |a Yin, Changchuan |e verfasserin |4 aut | |
700 | 1 | |a Wei, Guo-Wei |e verfasserin |4 aut | |
773 | 0 | 8 | |i Enthalten in |t arXiv.org |g (2020) vom: 30. Dez. |
773 | 1 | 8 | |g year:2020 |g day:30 |g month:12 |
856 | 4 | 0 | |u https://arxiv.org/abs/2012.15268 |z kostenfrei |3 Volltext |
912 | |a GBV_XAR | ||
912 | |a SSG-OLC-PHA | ||
951 | |a AR | ||
952 | |j 2020 |b 30 |c 12 | ||
953 | |2 045F |a 570 |