Genome-Wide Inference of Ancestral Recombination Graphs


The unusual and complex correlation structure of population samples of genetic sequences presents a fundamental statistical challenge that pervades nearly all areas of population genetics. Historical recombination events produce an intricate network of intertwined genealogies, which impedes demography inference, the detection of natural selection, association mapping, and other applications. It is possible to capture these complex relationships using a representation called the ancestral recombination graph (ARG), which provides a complete description of coalescence and recombination events in the history of the sample. However, previous methods for ARG inference have not been adequately fast and accurate for practical use with large-scale genomic sequence data. In this article, we introduce a new algorithm for ARG inference that has vastly improved scaling properties. Our algorithm is implemented in a computer program called ARGweaver, which is fast enough to be applied to sequences megabases in length. With the aid of a large computer cluster, ARGweaver can be used to sample full ARGs for entire mammalian genome sequences. We show that ARGweaver performs well in simulation experiments and demonstrate that it can be used to provide new insights about both demographic processes and natural selection when applied to real human genome sequence data.
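To make the idea of an ARG concrete, here is a minimal sketch of how coalescence and recombination events can be stored in one graph and how the local genealogy then differs from position to position. It is purely illustrative and is not ARGweaver's data structure or algorithm: the `Node` class, the helper functions `parent_at`, `lineage` and `tmrca`, and the toy two-sample history are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """One event in a toy ARG (hypothetical layout, for illustration only)."""
    name: str
    age: float                           # time of the event, in arbitrary units
    parent: Optional[str] = None         # used by leaf and coalescence nodes
    breakpoint: Optional[int] = None     # set only on recombination nodes
    left_parent: Optional[str] = None    # ancestor for positions < breakpoint
    right_parent: Optional[str] = None   # ancestor for positions >= breakpoint


def parent_at(node: Node, pos: int) -> Optional[str]:
    """Return the parent of `node` that is ancestral at sequence position `pos`."""
    if node.breakpoint is None:
        return node.parent
    return node.left_parent if pos < node.breakpoint else node.right_parent


def lineage(arg: Dict[str, Node], leaf: str, pos: int) -> List[str]:
    """Trace one sample back to the root, following the ancestry at `pos`."""
    path = [leaf]
    while True:
        nxt = parent_at(arg[path[-1]], pos)
        if nxt is None:
            return path
        path.append(nxt)


def tmrca(arg: Dict[str, Node], a: str, b: str, pos: int) -> float:
    """Age of the most recent common ancestor of samples a and b at `pos`."""
    ancestors_a = set(lineage(arg, a, pos))
    for name in lineage(arg, b, pos):
        if name in ancestors_a:
            return arg[name].age
    raise ValueError("samples share no ancestor at this position")


# Toy history for two samples: the lineage of A recombines at position 50,
# so A and B coalesce at C1 to the left of 50 and at C2 to the right of it.
arg = {
    "A": Node("A", 0.0, parent="R"),
    "B": Node("B", 0.0, parent="C1"),
    "R": Node("R", 1.0, breakpoint=50, left_parent="C1", right_parent="C2"),
    "C1": Node("C1", 2.0, parent="C2"),
    "C2": Node("C2", 3.0),
}

left_time = tmrca(arg, "A", "B", 10)    # genealogy left of the breakpoint
right_time = tmrca(arg, "A", "B", 60)   # genealogy right of the breakpoint
print(left_time, right_time)
```

In this toy graph a single recombination event is enough to give the two halves of the sequence different coalescence times, which is the kind of position-dependent correlation the abstract describes.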

Published in: Genome-Wide Inference of Ancestral Recombination Graphs. PLoS Genet 10(5): e1004342. doi:10.1371/journal.pgen.1004342
Category: Research Article
DOI: https://doi.org/10.1371/journal.pgen.1004342
