Beware the Jaccard : the choice of similarity measure is important and non-trivial in genomic colocalisation analysis
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets. All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure.
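As a rough illustration of the abstract's central point, the sketch below computes the Jaccard index and the Forbes coefficient for two pairs of binary tracks. It uses the standard textbook definitions of both measures. The helper names (`contingency`, `jaccard`, `forbes`, `make_track`) and the toy tracks are assumptions made for this example and are not taken from the paper or from the linked HyperBrowser pages. In both pairs the overlap equals what independence would predict, and only the number of covered positions differs.

```python
from typing import Iterable, List, Sequence, Tuple


def contingency(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d): both set, only x set, only y set, neither set."""
    if len(x) != len(y):
        raise ValueError("tracks must have the same length")
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if yi and not xi)
    d = len(x) - a - b - c
    return a, b, c, d


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        raise ValueError("Jaccard index is undefined for two empty tracks")
    return a / (a + b + c)


def forbes(x: Sequence[int], y: Sequence[int]) -> float:
    """Forbes coefficient: observed overlap divided by the overlap expected under independence."""
    a, b, c, d = contingency(x, y)
    n = a + b + c + d
    if (a + b) == 0 or (a + c) == 0:
        raise ValueError("Forbes coefficient is undefined when a track is empty")
    return a * n / ((a + b) * (a + c))


def make_track(n: int, covered: Iterable[int]) -> List[int]:
    """Hypothetical helper: binary vector of length n with ones at the given positions."""
    covered_set = set(covered)
    return [1 if i in covered_set else 0 for i in range(n)]


n = 100
# Small datasets: 10 covered positions each, overlapping in 1 position (10 * 10 / 100).
sparse_x = make_track(n, range(0, 10))
sparse_y = make_track(n, range(9, 19))
# Large datasets: 50 covered positions each, overlapping in 25 positions (50 * 50 / 100).
dense_x = make_track(n, range(0, 50))
dense_y = make_track(n, range(25, 75))

print("sparse:", jaccard(sparse_x, sparse_y), forbes(sparse_x, sparse_y))
print("dense: ", jaccard(dense_x, dense_y), forbes(dense_x, dense_y))
```

In this toy setting the Forbes coefficient is 1 for both pairs, while the Jaccard index rises from 1/19 to 1/3 purely because the tracks cover more positions. The tetrachoric correlation mentioned in the abstract is not covered here, since it requires fitting a bivariate normal model.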
Media type: | E-article |
---|---|
Year of publication: | 2020 |
Published: | 2020 |
Contained in: | To the parent record - volume:21 |
Contained in: | Briefings in bioinformatics - 21(2020), no. 5, 25 Sept., pages 1523-1530 |
Language: | English |
Contributors: | Salvatore, Stefania [author] |
Subjects: | Fold enrichment |
Notes: | Date Completed 23.09.2021; Date Revised 23.09.2021; published: Print; Citation Status MEDLINE |
doi: | 10.1093/bib/bbz083 |
PPN (catalogue ID): | NLM302311939 |
LEADER | 01000naa a22002652 4500 | ||
---|---|---|---|
001 | NLM302311939 | ||
003 | DE-627 | ||
005 | 20231225110444.0 | ||
007 | cr uuu---uuuuu | ||
008 | 231225s2020 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1093/bib/bbz083 |2 doi | |
028 | 5 | 2 | |a pubmed24n1007.xml |
035 | |a (DE-627)NLM302311939 | ||
035 | |a (NLM)31624847 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
100 | 1 | |a Salvatore, Stefania |e verfasserin |4 aut | |
245 | 1 | 0 | |a Beware the Jaccard |b the choice of similarity measure is important and non-trivial in genomic colocalisation analysis |
264 | 1 | |c 2020 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
500 | |a Date Completed 23.09.2021 | ||
500 | |a Date Revised 23.09.2021 | ||
500 | |a published: Print | ||
500 | |a Citation Status MEDLINE | ||
520 | |a © The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com. | ||
520 | |a The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets. All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure | ||
650 | 4 | |a Journal Article | |
650 | 4 | |a Review | |
650 | 4 | |a fold enrichment | |
650 | 4 | |a genomic track similarity | |
650 | 4 | |a similarity indices | |
650 | 4 | |a similarity measures | |
650 | 4 | |a statistical genomics | |
700 | 1 | |a Dagestad Rand, Knut |e verfasserin |4 aut | |
700 | 1 | |a Grytten, Ivar |e verfasserin |4 aut | |
700 | 1 | |a Ferkingstad, Egil |e verfasserin |4 aut | |
700 | 1 | |a Domanska, Diana |e verfasserin |4 aut | |
700 | 1 | |a Holden, Lars |e verfasserin |4 aut | |
700 | 1 | |a Gheorghe, Marius |e verfasserin |4 aut | |
700 | 1 | |a Mathelier, Anthony |e verfasserin |4 aut | |
700 | 1 | |a Glad, Ingrid |e verfasserin |4 aut | |
700 | 1 | |a Kjetil Sandve, Geir |e verfasserin |4 aut | |
773 | 0 | 8 | |i Enthalten in |t Briefings in bioinformatics |d 2000 |g 21(2020), 5 vom: 25. Sept., Seite 1523-1530 |w (DE-627)NLM11366883X |x 1477-4054 |7 nnns |
773 | 1 | 8 | |g volume:21 |g year:2020 |g number:5 |g day:25 |g month:09 |g pages:1523-1530 |
856 | 4 | 0 | |u http://dx.doi.org/10.1093/bib/bbz083 |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a GBV_NLM | ||
951 | |a AR | ||
952 | |d 21 |j 2020 |e 5 |b 25 |c 09 |h 1523-1530 |