Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis
We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.
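The matrix-factorization view described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, simulation settings, and the symbols `Q` (admixture proportions) and `F` (allele frequencies) are assumptions chosen for illustration. It simulates genotypes whose expected value is the low-rank product `2 * Q @ F`, with each row of `Q` non-negative and summing to one (an admixture-style constraint), and then builds the PCA-style factorization from a truncated SVD of the column-centered genotype matrix, where the factors are instead constrained to be orthonormal.

```python
import numpy as np


def simulate_admixed_genotypes(n=60, p=200, k=3, seed=0):
    """Simulate an n x p genotype matrix (entries 0, 1, 2) from a toy admixture model.

    All settings here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    # Admixture-style constraint: each row of Q is non-negative and sums to 1.
    Q = rng.dirichlet(alpha=np.full(k, 0.5), size=n)   # n x k
    # Allele frequencies of the k hypothetical ancestral populations.
    F = rng.uniform(0.05, 0.95, size=(k, p))           # k x p
    # Each genotype is Binomial(2, individual-specific allele frequency).
    G = rng.binomial(2, Q @ F)                         # n x p
    return G, Q, F


def pca_factorization(G, k):
    """Rank-k factorization of the column-centered genotype matrix via SVD.

    Returns loadings (n x k), factors (k x p, orthonormal rows), and the
    column means, so that G is approximated by mu + loadings @ factors.
    """
    mu = G.mean(axis=0)
    U, s, Vt = np.linalg.svd(G - mu, full_matrices=False)
    loadings = U[:, :k] * s[:k]
    factors = Vt[:k]
    return loadings, factors, mu


def relative_error(G, approx):
    """Frobenius-norm error of an approximation, relative to the norm of G."""
    return np.linalg.norm(G - approx) / np.linalg.norm(G)


G, Q, F = simulate_admixed_genotypes()
loadings, factors, mu = pca_factorization(G, k=3)

admixture_error = relative_error(G, 2 * Q @ F)
pca_error = relative_error(G, mu + loadings @ factors)
print(f"admixture-style approximation error: {admixture_error:.3f}")
print(f"PCA-style approximation error:       {pca_error:.3f}")
```

Both approximations are products of an n x k matrix and a k x p matrix; they differ only in the constraints placed on those matrices. Sparse factor analysis, the approach the paper introduces, would replace these constraints with a sparsity-inducing prior, which this sketch does not attempt to implement.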
Published in:
Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis. PLoS Genet 6(9): e1001117. doi:10.1371/journal.pgen.1001117
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pgen.1001117
Tags: Genetics, Reproductive medicine
Article published in: PLOS Genetics, 2010, Issue 9