Inference of Population Structure using Dense Haplotype Data
The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. 
The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/.
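As a rough illustration of the painting idea when markers are treated as independent, the sketch below builds a toy coancestry matrix. It is not ChromoPainter's implementation. The function name `toy_coancestry`, the 0/1 haplotype encoding, and the rule of sharing each site equally among donors carrying the recipient's allele are all assumptions made for this example.

```python
import numpy as np

def toy_coancestry(haps):
    """Toy coancestry matrix for unlinked markers (illustrative assumption,
    not the ChromoPainter algorithm).

    haps: (n_haplotypes, n_sites) array of 0/1 alleles.
    Returns an (n, n) matrix whose row i holds the expected number of sites
    that recipient i copies from each donor j; the diagonal stays zero.
    """
    haps = np.asarray(haps)
    n, n_sites = haps.shape
    coancestry = np.zeros((n, n))
    for i in range(n):                       # each haplotype takes a turn as recipient
        for s in range(n_sites):
            matches = haps[:, s] == haps[i, s]
            matches[i] = False               # a recipient never copies from itself
            k = matches.sum()
            if k == 0:
                # No donor carries this allele: spread the site over all others.
                matches = np.ones(n, dtype=bool)
                matches[i] = False
                k = n - 1
            coancestry[i, matches] += 1.0 / k
    return coancestry

# Two pairs of identical haplotypes: each recipient copies only from its twin.
demo = np.array([[0, 0, 1],
                 [0, 0, 1],
                 [1, 1, 0],
                 [1, 1, 0]])
matrix = toy_coancestry(demo)
print(matrix)
```

Each row sums to the number of sites, so an entry can be read as the share of a recipient's genome attributed to a given donor. With linked markers, the approach described above would instead assign contiguous chunks to donors, which this toy version does not attempt.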
Published in:
Inference of Population Structure using Dense Haplotype Data. PLoS Genet 8(1): e1002453. doi:10.1371/journal.pgen.1002453
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pgen.1002453
Tags
Genetics, Reproductive Medicine
Article published in
PLOS Genetics
2012, Issue 1