Re-Ranking Sequencing Variants in the Post-GWAS Era for Accurate Causal Variant Identification
Published in:
Re-Ranking Sequencing Variants in the Post-GWAS Era for Accurate Causal Variant Identification. PLoS Genet 9(8): e1003609. doi:10.1371/journal.pgen.1003609
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pgen.1003609
Summary
Next-generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair-level information at costs that are increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet identifying causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low-coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better-genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP and toward the tag SNP and SNPs in high LD with the tag. Together, these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher-density genotyping arrays, can in certain common scenarios exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci, along with several additional possible causal SNPs at these loci that might otherwise have been overlooked. The method is simple to implement using R scripts provided on the authors' website.
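The first effect described above, evidence drifting toward better-genotyped SNPs, can be illustrated with a rough back-of-the-envelope sketch. It is not the authors' re-ranking procedure. It assumes the common approximation that the expected 1-df test noncentrality at a SNP scales with the squared correlation between its called genotypes and the true causal genotypes, and that genotyping error is independent of LD so the two squared correlations multiply. The function name `expected_ncp` and all numeric values are hypothetical, chosen only for illustration.

```python
def expected_ncp(ncp_causal, ld_r2, genotype_r2):
    """Approximate expected 1-df chi-square noncentrality at one SNP.

    ncp_causal:  noncentrality if the causal SNP were genotyped without error
    ld_r2:       squared correlation (LD) between this SNP and the causal SNP
                 (1.0 for the causal SNP itself)
    genotype_r2: squared correlation between called/imputed and true genotypes
                 at this SNP (1.0 for error-free genotyping)

    Assumption (not from the article): errors are independent of LD, so the
    two squared correlations simply multiply.
    """
    return ncp_causal * ld_r2 * genotype_r2


# Hypothetical scenario: the causal SNP is imputed with moderate quality,
# while a nearby tag SNP in high LD was directly genotyped on the array.
ncp = 40.0
snps = {
    "causal_imputed": expected_ncp(ncp, ld_r2=1.0, genotype_r2=0.6),
    "tag_genotyped": expected_ncp(ncp, ld_r2=0.8, genotype_r2=1.0),
}

# Rank SNPs by expected evidence for association, strongest first.
ranking = sorted(snps, key=snps.get, reverse=True)
print(ranking, snps)
```

Under these assumed values the directly genotyped tag SNP has the larger expected test statistic, so a naive ranking would tend to place it above the true causal SNP.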
Tags
Genetics, Reproductive medicine
Article published in
PLOS Genetics
2013 Issue 8