Needles in the Haystack: Identifying Individuals Present in Pooled
Genomic Data
Recent publications have described and applied a novel metric that quantifies the
genetic distance of an individual with respect to two population samples, and
have suggested that the metric makes it possible to infer the presence of an
individual of known genotype in a sample for which only the marginal allele
frequencies are known. However, the assumptions, limitations, and utility of
this metric remained incompletely characterized. Here we present empirical tests
of the method using publicly accessible genotypes, as well as analytical
investigations of the method's strengths and limitations. The results
reveal that the null distribution is sensitive to the underlying assumptions,
making it difficult to accurately calibrate thresholds for classifying an
individual as a member of the population samples. As a result, the
false-positive rates obtained in practice are considerably higher than
previously believed. However, despite the metric's inadequacies for
identifying the presence of an individual in a sample, our results suggest
potential avenues for future research on tuning this method to problems of
ancestry inference or disease prediction. By revealing both the strengths and
limitations of the proposed method, we hope to elucidate situations in which
this distance metric may be used in an appropriate manner. We also discuss the
implications of our findings in forensics applications and in the protection of
GWAS participant privacy.
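The abstract does not spell out the metric, so the following is only a minimal sketch, assuming the per-SNP form D_j = |y_j - ref_j| - |y_j - pool_j| and the standardized mean T used by Homer et al. (2008). All function names and simulation settings (sample sizes, allele-frequency range, independent SNPs, a single shared population) are illustrative assumptions, not the authors' code. It simulates the idealized null, which is precisely the setting whose assumptions the abstract reports to be fragile in practice.

```python
import math
import random
import statistics


def distance_statistic(genotype, ref_freq, pool_freq):
    """Standardized mean of D_j = |y_j - ref_j| - |y_j - pool_j| over SNPs."""
    d = [abs(y - r) - abs(y - p) for y, r, p in zip(genotype, ref_freq, pool_freq)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def draw_genotype(freqs, rng):
    """Draw a diploid genotype coded as an allele fraction (0, 0.5, or 1) per SNP."""
    return [((rng.random() < f) + (rng.random() < f)) / 2 for f in freqs]


def sample_freqs(people):
    """Marginal allele frequencies of a sample given as a list of genotypes."""
    n = len(people)
    return [sum(column) / n for column in zip(*people)]


# Hypothetical simulation settings (assumptions, not taken from the paper).
rng = random.Random(0)
n_snps, n_people = 10000, 50
true_freq = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

# Pool and reference samples drawn from the same population, independent SNPs.
pool = [draw_genotype(true_freq, rng) for _ in range(n_people)]
ref = [draw_genotype(true_freq, rng) for _ in range(n_people)]
pool_freq, ref_freq = sample_freqs(pool), sample_freqs(ref)

member = pool[0]                          # individual who is in the pool
outsider = draw_genotype(true_freq, rng)  # individual in neither sample

t_member = distance_statistic(member, ref_freq, pool_freq)
t_outsider = distance_statistic(outsider, ref_freq, pool_freq)
print(f"T (pool member): {t_member:.2f}")
print(f"T (outsider):    {t_outsider:.2f}")
```

Under these idealized conditions the pool member should score well above the outsider, whose statistic behaves approximately like a standard normal draw. Departures from the assumptions, such as an outsider drawn from a different population or correlated SNPs, shift the null distribution, which is why thresholds calibrated on the idealized null are hard to trust.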
Published in:
Needles in the Haystack: Identifying Individuals Present in Pooled
Genomic Data. PLoS Genet 5(10): e1000668. doi:10.1371/journal.pgen.1000668
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pgen.1000668
Tags
Genetics, Reproductive medicine
Article published in
PLOS Genetics
2009, Issue 10