TRCMGene: A two-step referential compression method for the efficient storage of genetic data
BACKGROUND: The massive quantities of genetic data generated by high-throughput sequencing pose challenges for data storage, transmission and analysis. Data compression addresses these problems by reducing the storage size of the data and speeding up its transmission. Several options are available for compressing and storing genetic data. However, most of them either do not provide sufficient compression rates or require considerable time for decompression and loading.
RESULTS: Here, we propose TRCMGene, a lossless genetic data compression method that uses a referential compression scheme. TRCMGene introduces a novel two-step compression method that builds an index structure using K-means and k-nearest neighbours. Evaluation with several real datasets showed that the compression factor of TRCMGene ranges from 9 to 21. TRCMGene offers a good balance between compression factor and reading time: on average, reading the compressed data takes 60% of the time needed to read the uncompressed data. Thus, TRCMGene saves not only disc space but also file access time, and it speeds up data loading. Together, these effects improve genetic data storage and transmission in the current hardware environment and render system upgrades unnecessary. TRCMGene, its user manual and demos are freely available at https://github.com/tangyou79/TRCM. The data mentioned in this manuscript can be downloaded from https://github.com/tangyou79/TRCM/wiki.
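The abstract does not describe the algorithm in detail, so the following is only a minimal sketch of the general idea behind referential compression, not TRCMGene's actual implementation. It assumes genotypes are stored as equal-length integer vectors, picks the closest reference by Hamming distance (a simple stand-in for the K-means and k-nearest-neighbour index), and stores only the positions where the target differs from that reference. All function names here are hypothetical.

```python
from typing import List, Sequence, Tuple

# A diff is a list of (position, value-in-target) pairs.
Diff = List[Tuple[int, int]]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_reference(target: Sequence[int],
                      references: Sequence[Sequence[int]]) -> int:
    """Index of the reference closest to the target (1-nearest neighbour)."""
    return min(range(len(references)),
               key=lambda i: hamming(target, references[i]))


def encode(target: Sequence[int], reference: Sequence[int]) -> Diff:
    """Keep only the positions where the target deviates from the reference."""
    if len(target) != len(reference):
        raise ValueError("target and reference must have the same length")
    return [(i, t) for i, (t, r) in enumerate(zip(target, reference)) if t != r]


def decode(diff: Diff, reference: Sequence[int]) -> List[int]:
    """Losslessly rebuild the target from the reference and the stored diff."""
    out = list(reference)
    for pos, val in diff:
        out[pos] = val
    return out


def compression_factor(original_size: float, compressed_size: float) -> float:
    """Ratio of original size to compressed size (higher is better)."""
    return original_size / compressed_size


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    references = [[0, 1, 2, 0, 1, 0, 0, 2],
                  [2, 2, 1, 0, 0, 1, 1, 0]]
    target = [0, 1, 2, 0, 1, 1, 0, 2]

    idx = nearest_reference(target, references)
    diff = encode(target, references[idx])
    assert decode(diff, references[idx]) == target  # round trip is lossless
    print(idx, diff)
```

The closer the chosen reference is to the target, the shorter the stored diff, which is why an index for finding a good reference quickly matters for both the compression factor and the reading time.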
Field | Value |
---|---|
Media type | E-article |
Year of publication | 2018 |
Contained in | PloS one, volume 13 (2018), number 11, day 29, page e0206521 |
Language | English |
Contributors | Tang, You [author] |
Notes | Date Completed 22.04.2019; Date Revised 22.04.2019; published: Electronic-eCollection; Citation Status MEDLINE |
doi | 10.1371/journal.pone.0206521 |
PPN (catalogue ID) | NLM290288444 |
LEADER | 01000naa a22002652 4500 | ||
---|---|---|---|
001 | NLM290288444 | ||
003 | DE-627 | ||
005 | 20231225064415.0 | ||
007 | cr uuu---uuuuu | ||
008 | 231225s2018 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1371/journal.pone.0206521 |2 doi | |
028 | 5 | 2 | |a pubmed24n0967.xml |
035 | |a (DE-627)NLM290288444 | ||
035 | |a (NLM)30395579 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
100 | 1 | |a Tang, You |e verfasserin |4 aut | |
245 | 1 | 0 | |a TRCMGene |b A two-step referential compression method for the efficient storage of genetic data |
264 | 1 | |c 2018 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
500 | |a Date Completed 22.04.2019 | ||
500 | |a Date Revised 22.04.2019 | ||
500 | |a published: Electronic-eCollection | ||
500 | |a Citation Status MEDLINE | ||
650 | 4 | |a Journal Article | |
650 | 4 | |a Research Support, Non-U.S. Gov't | |
700 | 1 | |a Li, Min |e verfasserin |4 aut | |
700 | 1 | |a Sun, Jing |e verfasserin |4 aut | |
700 | 1 | |a Zhang, Tao |e verfasserin |4 aut | |
700 | 1 | |a Zhang, Jicheng |e verfasserin |4 aut | |
700 | 1 | |a Zheng, Ping |e verfasserin |4 aut | |
773 | 0 | 8 | |i Enthalten in |t PloS one |d 2006 |g 13(2018), 11 vom: 29., Seite e0206521 |w (DE-627)NLM167327399 |x 1932-6203 |7 nnns |
773 | 1 | 8 | |g volume:13 |g year:2018 |g number:11 |g day:29 |g pages:e0206521 |
856 | 4 | 0 | |u http://dx.doi.org/10.1371/journal.pone.0206521 |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a GBV_NLM | ||
951 | |a AR | ||
952 | |d 13 |j 2018 |e 11 |b 29 |h e0206521 |