Reconstructing Native American Migrations from Whole-Genome and Whole-Exome Data


There is great scientific and popular interest in understanding the genetic history of populations in the Americas. We wish to understand when different regions of the continent were inhabited, where settlers came from, and how current inhabitants relate genetically to earlier populations. Recent studies have unraveled parts of the genetic history of the continent using genotyping arrays and uniparental markers. The 1000 Genomes Project provides a unique opportunity for improving our understanding of population genetic history by providing over a hundred sequenced low-coverage genomes and exomes from Colombian (CLM), Mexican-American (MXL), and Puerto Rican (PUR) populations. Here, we explore the genomic contributions of African, European, and especially Native American ancestry to these populations. Estimated Native American ancestry is in MXL, in CLM, and in PUR. Native American ancestry in PUR is most closely related to populations surrounding the Orinoco River basin, confirming the South American ancestry of the Taíno people of the Caribbean. We present new methods to estimate the allele frequencies in the Native American fraction of the populations, and model their distribution using a demographic model for three ancestral Native American populations. These ancestral populations likely split in close succession: the most likely scenario, based on a peopling of the Americas 16 thousand years ago (kya), supports that the MXL ancestors split kya, with a subsequent split of the ancestors to CLM and PUR kya. The model also features effective populations of in Mexico, in Colombia, and in Puerto Rico. Modeling identity-by-descent (IBD) and ancestry tract length, we show that post-contact populations also differ markedly in their effective sizes and migration patterns, with Puerto Rico showing the smallest effective size and the earlier migration from Europe. Finally, we compare IBD and ancestry assignments to find evidence for relatedness among European founders to the three populations.

Published in the journal: Reconstructing Native American Migrations from Whole-Genome and Whole-Exome Data. PLoS Genet 9(12): e32767. doi:10.1371/journal.pgen.1004023
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1004023

Summary

Introduction

The 1000 Genomes Project [1] released sequence data for 66 Mexican-American (MXL), 60 Colombian (CLM), and 55 Puerto Rican (PUR) individuals using an array of technologies including low-coverage whole-genome sequence data, high-coverage exome capture data, and OMNI 2.5 genotyping data. These data provide a unique window into the settlement of the Americas that complements archaeological data and the more limited genetic data previously available. Here we interpret these data to answer basic questions about the pre- and post-Columbian demographic history of the Americas.

People reached the Americas by crossing Beringia during the Last Glacial Maximum, likely between 16 and 20 kya (see, e.g., [2], [3], [4], [5]). The presence of early South American sites such as Monte Verde [6] suggests a rapid occupation of the continent, which is also supported by recent mitochondrial DNA studies [7]. A coastal route has been proposed to explain this rapid expansion (e.g., [8],[6],[7]), but other migration routes, possibly concurrent, have also been proposed (see, e.g., [5],[9], and references therein). This original peopling of the Americas, followed by European contact starting in 1492 and a substantial African slave trade starting in 1502, has created a diverse genetic heritage in American populations.

The initial settlement of the Caribbean has been much debated (e.g. [10],[11],[12] and references therein). People reached the islands around 7 kya, probably from a Mesoamerican source [11]. Around 4.5 kya, a second wave of migrants probably reached the islands, likely coming from the Orinoco Delta or the Guianas in South America and speaking Arawakan languages (see [13] and references therein). By approximately 1.3 kya, they had established large Taíno communities through the Greater Antilles, including Puerto Rico.

The earliest available account reports 600,000 Native Americans in Puerto Rico at the time of European arrival, not counting women and children (Vázquez de Espinosa 1629). More conservative estimates suggest 110,000 individuals [14], and as few as 30,000 inhabitants in 1508 [15]. All references agree that the Native American population was subsequently largely decimated through disease, forced labor, emigration, and war. Despite the bottleneck at contact, admixture and the subsequent population growth on the Island resulted in a Native American genetic contribution averaging of the modern population of million [16].

The MXL were sampled in Los Angeles, USA and the CLM in Medellin, Colombia. These panels represent urban populations, but recent urbanization means that they derive ancestry from larger geographic areas. Among respondents to the 2005 Colombia Census in Medellin, were born in the city, and were born in another part of Colombia, with a sizable proportion from the surrounding Department of Antioquia. Given this high rate of within-country migration, but a relatively low rate of migration from outside Colombia, we can think of the sample as representing a diverse sample from Antioquia. Similarly, the 1.2M Angelenos of Mexican origin in the 2010 US census represent the added contributions of multiple waves of migrations starting with the city's foundation in 1781 and received contributions from diverse states.

The use of genetic data to study Native American history is well established. The bulk of these studies rely on Y chromosome [17],[18],[19],[20],[21],[22],[23],[24] and mitochondrial DNA (mtDNA) [25],[26],[27],[28],[29],[30],[31],[22],[32],[7],[33],[34],[35], with a number of studies using increasingly dense sets of autosomal markers [22],[36],[37],[38],[39],[40]. Such studies provided evidence for a bottleneck recovery into the Americas 16–12 kya (e.g., [34],[35]), and for complex models of migrations and admixture within Native groups [40].

In this article, we use the 1000 Genomes data and a diversity of population genetic tools to delve deeper into the founding of the Puerto Rican, Mexican, and Colombian populations. To propose models for Native American demography, we must first quantify the African, European, and Native American contributions to these populations. Because of strong sex-asymmetric migrations, autosomal and sex-linked markers exhibit substantial differences in ancestry proportions [41],[42],[43],[44],[45],[46]. Focusing on the autosomal regions, we infer the locus-specific pre-Columbian continental ancestry in each sample, and estimate the timing and intensity of different migration waves that contributed to these populations. Using identity-by-descent analysis, we identify relatedness among the different ancestral groups and estimate recent effective population sizes.

We also propose a three-population model based on the diffusion approximation to study the distribution of allele frequencies across the Native American ancestors of the MXL, PUR, and CLM. We present statistical methods that take advantage of admixture linkage patterns to disentangle the histories of each continental group. The large sample of sequence data allows for the joint inference of split times and effective population sizes among the Native ancestors to the three panels. Finally, through an expectation maximization (EM) framework, we estimate genome-wide allele frequencies in the inferred Native components of MXL, CLM, and PUR genomes.

A broad summary of the data and analysis pipelines used in this article is displayed in Figure 1.

**Fig. 1. Schematic of the data and analysis pipelines used in this article.**

Results

Global ancestry proportions and clustering

To estimate the global proportions of African, European, and Native American ancestry in the CLM, MXL, and PUR, we combined them with YRI, CEU, and a panel of Native American samples [40] and performed an admixture [47] analysis (Figure 2(a)) and principal component analysis (Figure S1). Dense genotyping arrays allow for inference of ancestry at the level of individual loci, using software such as RFMix [48]. Trio-phased OMNI data was used to generate such locus-specific ancestry calls for 66 CLM, 68 MXL, and 64 PUR individuals, including all sequenced individuals, as part of the 1000 Genomes Project. Summing up the local ancestry contribution inferred by RFMix provides an alternate estimate of ancestry proportions.
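The idea of summing local ancestry calls into a global estimate can be sketched as follows. This is a minimal illustration, not the 1000 Genomes or RFMix pipeline: the tract format `(start_cM, end_cM, label)`, the function name `global_ancestry`, and the example values are all hypothetical.

```python
from collections import defaultdict

def global_ancestry(tracts):
    """Length-weighted ancestry proportions from local-ancestry tracts.

    `tracts` is an iterable of (start_cM, end_cM, label) tuples pooled over
    both haplotypes and all chromosomes of one individual (assumed format).
    """
    length_by_label = defaultdict(float)
    for start, end, label in tracts:
        if end < start:
            raise ValueError("tract end precedes start")
        length_by_label[label] += end - start
    total = sum(length_by_label.values())
    if total == 0:
        raise ValueError("no ancestry tracts with positive length")
    return {label: length / total for label, length in length_by_label.items()}

# Toy individual: one 100 cM chromosome, two haplotypes (made-up tracts).
example = [
    (0.0, 40.0, "EUR"), (40.0, 70.0, "NAT"), (70.0, 100.0, "EUR"),  # haplotype 1
    (0.0, 20.0, "AFR"), (20.0, 100.0, "EUR"),                       # haplotype 2
]
proportions = global_ancestry(example)
```

Averaging such per-individual proportions over a panel gives a population-level estimate that can be compared with a model-based clustering estimate.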

**Fig. 2. Genome-wide ancestry patterns.**

Using admixture, we find Native American proportions being in PUR, in CLM, and in MXL (Figure 2a). RFMix finds values falling within percentage points of these values, and within one percentage point of the values inferred in the 1000 Genomes project through related methods [1]. Estimates of African ancestry showed a larger difference across methods, with admixture (RFMix) estimates at in PUR, in CLM, and in MXL.

The inferred Native American ancestry proportions are in good agreement with results from the GALA study [49], which reported proportions of in Puerto Rico and in Mexico. The PUR result is also comparable to the of Native ancestry inferred in a different Puerto Rican sample [16]. By contrast, none of the populations from Colombia in [37] show median ancestry proportions quite similar to the CLM sample from Medellin, the closest being the sample from the surrounding Department of Antioquia, with Native, African and European.

Figure 2(c–d) shows a principal component analysis restricted to segments of inferred Native ancestry [50]. We find that the MXL individuals cluster primarily with southern Mexican Native groups (mostly Mixe), and the CLM cluster primarily with the Embera, Kogii, and Wayu, all of which were sampled in Colombia north-west of the Andes, where Medellin is also located. The PUR cluster principally with populations south-east of the Andes, surrounding the Guianas and the Orinoco River basin (Ticuna, Guahibo, Palikur, Jamamadi, Piapoco), although a few populations from further south are also close in PCA space, particularly the Guaraní and the Chané, together with some Kaqchikel, Toba, and Wichi individuals. The Piapoco and the Palikur speak Arawakan languages. The other groups with known Arawakan-speaking ancestors in our panel are the Chané, whose ancestors spoke Arawakan and likely originated in Guiana [51], and the Guaraní, through gene flow from the Chané [52]. Taken together, these clustering patterns support a demic diffusion of the Arawakan/Taínos into Puerto Rico from a South American route, and reduced gene flow between Native American groups living in the Andes or to the west, and groups living east of the Andes.

Ancestry tracts analysis

Because continuous tracts of local ancestry are progressively broken down by recombination, the length distribution of continuous ancestry tracts can reveal details of the timing and mode of the migration processes. We used RFMix to infer ancestry tracts (Text S1), and the software tracts [53] to infer the migration rates and model likelihoods under different scenarios. Tracts can predict the distribution of ancestry block length for arbitrary models of time-varying migration, under the assumptions that the migrants are themselves not admixed, and that the admixed population follows Wright-Fisher reproduction. Since admixture only begins after two populations are in contact, the admixed population is founded when the second population arrives. Tracts determines the time and ancestry proportions at the onset of admixture and the time and magnitude of subsequent migrations by maximum likelihood. Because of limited statistical power, we start with a simple model in which each population contributes a single pulse of migration. We then progressively introduce models with additional periods of migration when justified by information criteria, as described in Text S1. The models that best describe the data are shown in Figures 3 and S2. Parameters for these, together with confidence intervals obtained through bootstrap over individuals, are provided in Table S1 in the Text S1 file.
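To make the link between admixture time and tract length concrete, here is a minimal sketch of the simplest case, a single migration pulse. It assumes a commonly used approximation in which tracts of an ancestry with proportion m, introduced T generations ago, have exponentially distributed lengths with rate (1 − m)(T − 1) per Morgan, and it ignores chromosome ends. It is not the tracts software, and the function names and parameter values are illustrative assumptions.

```python
import math

def tract_end_rate(generations, proportion):
    """Rate (per Morgan) at which a tract of the focal ancestry ends.

    Assumption: single pulse, Markov approximation, rate (1 - m) * (T - 1).
    """
    return (1.0 - proportion) * (generations - 1)

def bin_fractions(generations, proportion, edges_cM):
    """Expected fraction of tracts whose length falls in each bin."""
    rate = tract_end_rate(generations, proportion)
    edges_M = [edge / 100.0 for edge in edges_cM]  # centimorgans to Morgans
    return [math.exp(-rate * lo) - math.exp(-rate * hi)
            for lo, hi in zip(edges_M[:-1], edges_M[1:])]

edges = [0, 5, 10, 20, 50, 100]          # bin edges in cM (illustrative)
older = bin_fractions(15, 0.4, edges)    # pulse 15 generations ago
recent = bin_fractions(7, 0.4, edges)    # pulse 7 generations ago
```

An older pulse shifts weight toward short tracts and a recent pulse toward long ones, which is the signal a likelihood fit over binned tract counts exploits.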

**Fig. 3. Ancestry tract length distribution in PUR (a) and CLM (b) compared to the predictions of the best-fitting migration model.**

For MXL, we considered a model introduced in [54]: three populations start contributing migrants at the same time, but Europeans and Native Americans keep contributing at a constant rate. The best-fitting model has an onset of admixture 15.1 generations ago (ga), with a CI of , in good agreement with [54] despite a different genotyping chip and local ancestry inference method.

In PUR, we found evidence for two periods of European and African migration, the first ga ( CI ) and the most recent period at 6.8 ga ( CI 5.9–8.8). This model is in excellent agreement with historical records, which suggest that isolated Native populations contributed little gene flow to the colony after the initial contact period, and that substantial slave trade and European immigration continued until the second half of the 19th century. We do not mean to imply that migrations actually occurred in exactly two distinct pulses; we do not have the resolution to distinguish more than two pulses per population. However, the inference of a migration pulse 6.8 ga indicates that migrations occurred during a period spanning this date. This complex scenario, with multiple waves of migration from African and European individuals, is consistent with the observation that European and African ancestries vary across the island, whereas no evidence of such variation was found in Native ancestry [16].

The inferred onset of admixture in CLM is 13.0 ga ( CI ), significantly later than that in both MXL and PUR and consistent with later European settlement in western Colombia compared to Mexico and Puerto Rico. We also find evidence for a small but statistically significant second wave of Native American migration, 4.8 ga ( CI 4–6). As above, this does not necessarily indicate a single, punctual event, but probable contact between an admixed population and Native American individuals during that period. By contrast, we find no evidence for continuing African gene flow in CLM.

Identity by descent analysis

We used germline [55] and the trio-phased OMNI data above to identify segments identical-by-descent (IBD) within and across populations (see Text S1). Not surprisingly, we found more IBD segments within populations (23936) compared to across populations (1440), and within-population segments were longer (Figure S3).

The MXL population exhibits significantly less within-population IBD compared to the other two panels (Figure 4). The amount of IBD among unrelated individuals can be used to infer the underlying population size under the assumption of panmixia: the larger a population, the more distant the expected relationship between any two individuals [56]. Using IBD segments longer than 4 cM, we infer effective population sizes of 140,000 in MXL, 15,000 in CLM, and 10,000 in PUR. As we will show, these largely reflect post-admixture population sizes.
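The inverse relationship between population size and IBD sharing can be illustrated with a textbook-style calculation. The sketch assumes a constant-size, panmictic diploid population of size N: the pairwise coalescence time t (in generations) is exponential with rate 1/(2N), and the segment around a site is Erlang-2 distributed with rate 2t per Morgan, which gives an expected IBD fraction of (1 + 8Nu)/(1 + 4Nu)² for segments longer than u Morgans. This is not necessarily the estimator used in [56], and the function names are hypothetical.

```python
def ibd_fraction(n_diploid, min_len_cM):
    """Expected fraction of the genome shared IBD by two haplotypes in
    segments longer than min_len_cM (constant size, panmixia assumed)."""
    x = 4.0 * n_diploid * (min_len_cM / 100.0)
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2

def infer_size(observed_fraction, min_len_cM, lo=1.0, hi=1e9):
    """Invert ibd_fraction by bisection; the fraction decreases with size."""
    for _ in range(200):
        mid = (lo * hi) ** 0.5  # geometric midpoint suits the wide range
        if ibd_fraction(mid, min_len_cM) > observed_fraction:
            lo = mid  # too much sharing predicted, so the size must be larger
        else:
            hi = mid
    return (lo * hi) ** 0.5

# Sizes quoted in the text, used here only as illustrative inputs.
for label, size in [("MXL", 140_000), ("CLM", 15_000), ("PUR", 10_000)]:
    print(label, ibd_fraction(size, 4.0))
```

Under these assumptions, a larger population yields a smaller expected IBD fraction, so an observed fraction maps to a unique size estimate.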

**Fig. 4. Number of IBD tracts by length bin in the three panel populations (independent of ancestry estimations), normalized by the number of individual pairs.**

We expect long IBD segments to be inherited from a recent common ancestor, and therefore to have identical continental ancestry. Comparing the RFMix ancestry assignments on chromosomes that have been identified as IBD by germline thus provides a measure of the consistency of the two methods (see [57] for a related metric). Rates of IBD-Ancestry mismatch ranged from in segments of to less than for segments longer than 40 Mb (Figure S4).

Patterns of ancestry in IBD segments within a population differ markedly from those across populations (Figure 5): IBD segments within populations contain many ancestry switches. This indicates that many common ancestors lived after contact, and that the effective population sizes estimated using IBD largely reflect post-contact demography. The IBD patterns in cross-population IBD segments exhibited fewer ancestry switches than a random control (Figure S5), as may be expected if common ancestors often predate the onset of admixture. Cross-population IBD segments were also found to be overwhelmingly of European origin: among the 120 longest cross-population IBD segments, 117 are in European-inferred segments, two are among Native segments, and one is among African segments. This is not due to overall ancestry proportions, as can be observed by considering the alternate (non-IBD) haplotypes at the same positions (Figure S5). This is likely a result of the colonization history, in which European colonists rapidly spread from a relatively specific region over a large continent. This interpretation is supported by the admixture analysis (Figure S6), showing a common cluster of ancestry for the European component dominant in PUR, CLM, MXL, and Andean populations, but not in CEU, Eskimo-Aleut, and Na-Dene.

Finally, we were interested in testing whether the relationship between IBD and ancestry can be used to date recombination events. The ancestry within an IBD segment represents the ancestry state of the most recent common ancestor. The shorter the IBD segment, the older the ancestor, and the less time available since the onset of admixture to create ancestry switch points through recombination. Indeed, we find that the density of ancestry switch-points on IBD tracts increases with IBD tract length in PUR (bootstrap , see Text S1) and in MXL (bootstrap ), whereas the results are not significant in CLM. Thus we can use ancestry patterns in admixed populations not only to recognize recombination events but also to help date most recent common ancestors and recombination events (see Text S1 for details). The small amount of cross-population IBD among Native American tracts tells us that the ancestral Native populations were not as closely related as European founders, consistent with historical and anthropological data.

**Fig. 5. Continental origin of IBD segments.**

Demographic inference from sequence data

To infer split times and population sizes of the Native ancestors, we consider the joint site frequency spectrum (SFS). The SFS is informative of demography because stochastic differences in allele frequencies accumulate over time and at a rate that depends on population sizes. We use the diffusion-approximation framework implemented in [58] to perform the inference. We focus on synonymous sites in the 1000 Genomes exome capture data of 60 CLM, 66 MXL, and 55 PUR individuals because the high coverage reduces sequencing artifacts and synonymous sites are less affected by selection compared to non-synonymous sites. A complete model with admixture would require at least one European, one African, and three Native American populations, which is beyond the 3-population limit of this framework. We therefore focus on variants within Native American backgrounds.
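As a reminder of what the joint SFS records, the following minimal sketch tabulates sites by their derived-allele counts in three populations. The input format and the toy counts are hypothetical. A real analysis would also need to handle the fact that the number of Native American haplotypes varies from site to site (for example by projecting to a fixed sample size), which this sketch omits.

```python
from collections import Counter

def joint_sfs(site_counts):
    """Count sites by their tuple of derived-allele counts across populations."""
    return Counter(tuple(counts) for counts in site_counts)

# Made-up derived-allele counts per site, ordered as (MXL, CLM, PUR).
sites = [(1, 0, 0), (0, 0, 1), (1, 0, 0), (2, 1, 0), (0, 0, 0)]
sfs = joint_sfs(sites)

# Sites variable only in the first population: drift since the split makes
# such population-private entries more common.
private_to_mxl = sum(n for (m, c, p), n in sfs.items()
                     if m > 0 and c == 0 and p == 0)
```

Demographic inference then compares this observed table with the table expected under a model with given split times and population sizes.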

Unfortunately, trio-phased sequencing data was not available for most samples. Because of phasing uncertainty, the actual ancestry assignment for variants at ancestry-heterozygous loci is uncertain. To overcome this, we introduce a negative ascertainment scheme, in which we only consider variable sites that have not been observed in any of the non-Native populations in the 1000 Genomes data set. The effect of this ascertainment scheme is to remove the majority of variants that predate the split of Native Americans from the rest of the populations. An additional benefit of this approach is that the impact of European and African tracts incorrectly assigned as Native American will be substantially reduced. We hypothesized that the effect of negative ascertainment could be approximately modeled by a strict bottleneck at the Native/non-Native split time. This was confirmed through simulations (see S1).
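The negative ascertainment step amounts to a simple filter, sketched below under stated assumptions: each site is a dictionary with a hypothetical `alt_counts` field mapping population labels to alternate-allele counts, and the list of non-Native panels is supplied by the caller. The field names, site identifiers, and counts are illustrative, not taken from the 1000 Genomes file formats.

```python
def negative_ascertainment(sites, non_native_pops):
    """Keep only sites whose variant allele is unobserved in every
    non-Native reference population."""
    kept = []
    for site in sites:
        seen_elsewhere = any(site["alt_counts"].get(pop, 0) > 0
                             for pop in non_native_pops)
        if not seen_elsewhere:
            kept.append(site)
    return kept

# Toy sites with made-up counts.
sites = [
    {"id": "site_a", "alt_counts": {"MXL": 3, "CEU": 0, "YRI": 0}},
    {"id": "site_b", "alt_counts": {"MXL": 5, "CEU": 2, "YRI": 0}},
    {"id": "site_c", "alt_counts": {"PUR": 1, "CLM": 1}},
]
kept_ids = [s["id"] for s in negative_ascertainment(sites, ["CEU", "YRI"])]
```

Because variants shared with non-Native panels are discarded, variants on European or African tracts that were mislabeled as Native American are also mostly removed.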

We considered a simple 3-population demographic model starting with a constant population of size . At time the population size changes to . From this population of size , population diverged with size at time and populations and diverge at a later time with respective sizes and . We considered all three split orderings, with . In the optimal model, illustrated on Figure 6, we have , , . This model is a vast oversimplification of the historical demographic processes. However, given the limited statistical power to reconstruct time-dependent demographic histories using allele frequency data (e.g. [59]), such simple models with step-wise constant population sizes provide useful coarse-grained pictures of human demography. The population sizes in this model are effective population sizes: they are the size of Wright-Fisher populations that best explain the observed patterns of polymorphism. They differ from census sizes because of population size fluctuations, overlapping generations, sex bias, offspring number dispersion, and other departures from the Wright-Fisher assumptions. The ratio is expected to converge to large values to reflect both the negative ascertainment scheme (see Methods) and the expansion post-founding of the Americas. The current data does not enable us to model these two effects separately, so the recovery time can be thought of as an interpolation between the two events. When performing likelihood optimization, tended to slowly increase without bound. Beyond a value of 100, this had minimal impact on the likelihood function and other parameter estimates. We therefore fixed this value to to facilitate optimization and prevent numerical instabilities. All other parameters, and the order of population splits, were chosen to maximize the model likelihood.

**Fig. 6. An illustration of the maximum likelihood demographic model for the Native American ancestors to the CLM, MXL, and PUR panels.**

We find dramatic differences in the inferred population sizes of the Native Ancestors to the MXL, CLM, and PUR (see Table 1), with the MXL showing by far the largest effective population size at 64,000, times larger than the CLM and 32 times larger than the PUR. Given the many sources of uncertainty and model limitations, these ratios are in good qualitative agreement with pre-Columbian populations estimated at 14M in central Mexico [60], 3M in Colombia [60], and somewhat over 110,000 in Puerto Rico [61]. This could largely be a coincidence, given that the Native ancestors to the MXL and CLM were not panmictic populations over present-day political divisions. Another possible explanation for the differences in effective population sizes is a serial founder model after the crossing of Beringia: CLM and PUR would have experienced stricter and longer bottlenecks compared to MXL due to greater distances traveled from Beringia. The crossing to Puerto Rico is likely to have introduced intense bottlenecks in PUR, resulting in a smaller recent effective population size.

**Tab. 1. Parameter estimates for the model displayed on Figure 6, assuming a bottleneck at the foundation of the Americas 16,000 years ago.**

The model suggests that PUR and CLM ancestral populations did not share serial founding events past the split with the MXL ancestors and split well before the expected arrival of the Arawak people of the Caribbean. Indeed, the first and second split times ( and , respectively) are remarkably close to each other, with (bootstrap CI: , see S1, Figure S7, and Table 1). This corresponds to a difference of about 500 years, 12,000 years ago. In fact, the splits are so close that it is impossible to distinguish which population split first, with bootstrap instances supporting all three orderings: the Taíno ancestry does not appear much more closely related to either CLM or MXL Native ancestors. This is also consistent with the PCA results shown in Figure 2, showing a clear distinction between Native American groups in eastern and western Colombia.

Despite strong historical evidence for extensive population bottlenecks suffered by Native American populations following the arrival of Europeans [62], we could not detect the presence of such bottlenecks through allele frequency analysis. However, the presence of such bottlenecks may affect our interpretation of effective population sizes. To quantify this, we fixed the timing and magnitudes of bottlenecks using non-genetic sources, and re-inferred model parameters. Dobyns [62] proposed a maximum population reduction of in the Native American population after European contact, but this number is expected to vary from location to location. Because we are studying admixed populations, the size of the bottleneck is related to the number of individuals that contributed to the admixed population, thus Dobyns' estimate may not apply. In PUR, where the decline was particularly abrupt, we considered a decline of spanning years (see S1). We found that inferred parameters were little affected by the existence of such a bottleneck, with the exception of the effective population size in the pre-bottleneck PUR population, which would be 3.9 times larger than in the no-bottleneck model. Assuming an additional bottleneck in the CLM population led to similar 4-fold increase in inferred pre-bottleneck CLM population size, with little effect on inferred split times. These are significant effects, but are less than the inferred differences in effective population sizes. Thus, in the absence of extreme differences in the recent bottlenecks experienced by the three populations, the observed differences in population sizes likely point to differences in pre-Columbian demography.

By calibrating our results using , towards the most recent end of the range of plausible values for the peopling of the Americas (see e.g., [6] and references therein), we find a mutation rate of (bootstrap CI: ), within the range of recently published human mutation rates [63]. The narrowest confidence interval reported in [63] was , obtained from a de novo exome sequencing study [64]. Our sampling confidence interval is narrower than this value, but the main source of uncertainty here is the degree to which the bottleneck in our model reflects the bottleneck at the founding of the Americas, or the earlier split with the ancestors to the Chinese (CHB) and Japanese (JPT) sample, as well as uncertainty with respect to the timing of these two events (see Figure 7). The effect of changing the founding time or mutation rate assumptions would be to scale all parameters and confidence intervals according to Thus the absolute uncertainty on individual parameters is larger than the sampling uncertainty suggests.
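The dependence of all parameters on the assumed founding time can be illustrated with the standard scaling of diffusion-based inference. This is a sketch under assumptions, not a reproduction of the scaling expression in the original article: it assumes the fitted model returns a population-scaled mutation parameter θ = 4·N_ref·μ·L and times in units of 2·N_ref generations. The function name, the generation time, and every numerical value below are hypothetical.

```python
def rescale(theta, tau_founding, n_sites, founding_years, generation_years):
    """Convert scaled estimates to physical units by fixing the founding time.

    theta:         scaled mutation parameter, 4 * N_ref * mu * L (assumed convention)
    tau_founding:  founding time in units of 2 * N_ref generations
    n_sites:       number of sites L contributing to theta
    """
    founding_generations = founding_years / generation_years
    n_ref = founding_generations / (2.0 * tau_founding)
    mu = theta / (4.0 * n_ref * n_sites)  # per site, per generation
    return n_ref, mu

# Hypothetical inputs, chosen only to show the scaling behaviour.
n_ref_16, mu_16 = rescale(theta=500.0, tau_founding=0.05, n_sites=5e6,
                          founding_years=16_000, generation_years=25)
n_ref_32, mu_32 = rescale(theta=500.0, tau_founding=0.05, n_sites=5e6,
                          founding_years=32_000, generation_years=25)
```

Under this convention, doubling the assumed founding time doubles the reference population size (and hence all sizes and times) and halves the inferred mutation rate, which is why the founding time and mutation rate cannot be pinned down independently from the frequency spectrum alone.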

**Fig. 7. Plausible parameter range for the human mutation rate and the founding time of the Native American populations.**

Estimating Native American allele frequencies

There is scarce publicly available, genome-wide data about Native American genomic diversity. The 1000 Genomes dataset offers the opportunity to provide a diversity resource for Native American genomics by reconstructing the genetic makeup of Native American populations ancestral to the PUR, CLM, and MXL. This is particularly interesting in the case of the Puerto Rican population, where such reconstruction may be the only way to understand the genetic make-up of the pre-Columbian inhabitants of the Islands. Using the expectation maximization method presented in the Methods section, we estimated the allele frequencies in the Native-American-inferred part of the genomes of the sequenced individuals. These estimates are available at http://genomes.uprm.edu/Taino/.
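The Methods section with the actual expectation maximization procedure is not part of this excerpt, so the following is only a simplified sketch of how such an estimate can work at a single site. It assumes two ancestry classes (Native and non-Native), unphased genotypes, and known local ancestry. When an individual carries one Native and one non-Native haplotype and is heterozygous, the E-step assigns the alternate allele to the Native haplotype with posterior weight p(1 − q)/(p(1 − q) + q(1 − p)). All names and the toy data are hypothetical.

```python
def em_native_frequency(individuals, iterations=50):
    """Estimate alternate-allele frequencies p (Native) and q (non-Native).

    `individuals` is a list of (n_native_haplotypes, alt_allele_count) pairs,
    each value in {0, 1, 2}, for one site (assumed input format).
    """
    p = q = 0.5
    native_haps = sum(n for n, _ in individuals)
    other_haps = sum(2 - n for n, _ in individuals)
    for _ in range(iterations):
        native_alt = other_alt = 0.0
        for n_native, genotype in individuals:
            if n_native == 2:
                native_alt += genotype
            elif n_native == 0:
                other_alt += genotype
            elif genotype == 2:          # one alt allele on each haplotype
                native_alt += 1.0
                other_alt += 1.0
            elif genotype == 1:          # ambiguous: which haplotype is alt?
                on_native = p * (1.0 - q)
                on_other = q * (1.0 - p)
                total = on_native + on_other
                weight = on_native / total if total > 0 else 0.5
                native_alt += weight
                other_alt += 1.0 - weight
        if native_haps:
            p = native_alt / native_haps
        if other_haps:
            q = other_alt / other_haps
    return p, q

# Toy site: the allele is fixed on Native haplotypes and absent elsewhere.
p_hat, q_hat = em_native_frequency([(2, 2), (2, 2), (0, 0), (0, 0), (1, 1)])
```

The ancestry-homozygous individuals anchor the frequencies, and the ambiguous heterozygote is progressively attributed to the ancestry in which the allele is common.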

Figure 8 shows the distribution of the number of Native American haplotypes per site and the resulting confidence intervals for allele frequency in each population for exome capture target regions. Absolute confidence intervals are narrow for rare variants, and reach a maximum for SNPs at intermediate frequency; the leftmost peak in the bimodal distribution corresponds to the large number of rare variants, whereas the rightmost peak encompasses a broader range of frequencies.

**Fig. 8. Estimating Native American allele frequencies.**

Focusing on the variants with observations in all populations and within the exome capture regions, where coverage and accuracy were highest, the most significantly different among Native groups is rs11183610 on chromosome 12, with an estimated frequency of in MXL Native ancestry, in CLM Native ancestry, and in PUR Native Ancestry. The MXL-PUR difference remains significant after Bonferroni correction (bootstrap , see Methods). The bulk of the differentiation among populations is likely due to genetic drift, but such sub-continental ancestry informative markers are also interesting candidates for further selection scans.

Discussion

The bottleneck at the founding of the Americas provides a unique opportunity to obtain precise estimates of the human autosomal mutation rate, as reported in Table 1 and Figure 7. One remaining challenge in interpretation is whether the ‘founding time’ studied here corresponds to the bottleneck at the founding of the Americas, or the split time of the Native Americans with the Asian populations. Fortunately, this uncertainty can be addressed by sequencing either trio-phased populations from the Americas, or individuals of Native American ancestry without large amounts of recent European and African ancestry. In either case, the dramatic events that led to the initial peopling of the Americas, together with the early dates of South American archaeological sites, provide us with estimates of the human mutation rate that are more precise than pedigree-based estimates. A more thorough study of the robustness of these estimates to model assumptions is therefore desirable.

We find a substantially larger effective population size in Mexico than in the other two populations through IBD-based and allele-frequency-based estimates. These methods are sensitive to different time-scales: IBD analysis largely reflects post-Columbian events, as evidenced by the large number of mixed ancestry IBD segments in Figure 5(a). Allele frequencies reflect older events as well, and we showed that recent bottlenecks alone are unlikely to be responsible for the much larger effective MXL population size. To interpret the population size differences, we must consider the recent histories of the populations studied here. The MXL panel was recruited in Los Angeles among Mexican-American individuals, who may come from different regions in Mexico, a much wider geographical region than Puerto Rico, thus likely more populated. A natural question is whether the larger effective population sizes in MXL reflect a large panmictic population in Mexico, or a large number of small, previously isolated populations. Figure 2 and references [65],[40] provide compelling evidence that there is substantial population structure within Native groups of Mexico. However, Figure 2 also shows that the Native component of the MXL forms a relatively homogeneous cluster together with populations from southern Mexico. The much larger Native populations in central and southern Mexico are likely to have contributed the most to the Native American ancestry of Mexican mestizos, and thus Mexican-Americans. Even though the MXL may have ancestors in different parts of Mexico, their Native genetic origins likely reflect the demographic history of the areas in Mexico with the highest Native American population sizes.

Because Puerto Rico is an island, building a relatively complete population genetic model for the population may be more tractable. Clearly, our model of single idealized pre-Columbian Native American, European, and African populations joining to form a panmictic admixed population is an oversimplification. African and European ancestry proportions vary along the island [16] and eastern parts of Puerto Rico, with elevated proportions of African ancestry, are underrepresented in this study. By contrast, we do not have evidence for variation in the amount or composition of the Native American ancestry across the island, and it is likely that the conclusions about the pre-Columbian Native American fraction of the population are robust to sampling ascertainment. Interestingly, we find that the distribution of ancestry tract length in a sample of individuals of Puerto Rican descent in south Florida gave very similar results, despite different location, sequencing platform, and local ancestry inference method [50]. Historical gene flow inference using individuals of Colombian descent in south Florida provided comparable estimates of the time of admixture onset, but different patterns of recent gene flow; as is typical in demographic inference, inference of recent events is more sensitive to population structure.

Our analyses largely rely on accurate estimates of local ancestry patterns along the genome obtained through RFMix. This method has been shown to provide more than accuracy on three-way admixture using comparable reference panels [48], an accuracy level that enables accurate estimation of genome-wide diversity [54]. To ensure that our results are robust to residual errors, we further took into account the difficulty of calling short ancestry tracts in our migration estimates, and performed negative ascertainment of non-Native American alleles in the demographic inference. Some of these results can be independently verified by sequencing contemporary or ancient individuals with more uniform ancestry. However, understanding the genetic history of admixed populations will continue to rely on statistically picking apart the contributions of different ancestral populations, and the development of improved statistical methods, particularly for admixture that is ancient or between closely related populations, remains highly desirable.

The genetic heterogeneity in continental ancestry proportions among populations of the Americas is well appreciated [66],[67],[43]. Our results emphasize more fine-scale aspects of this diversity: because of the similarity between European founders of different populations and the high divergence among the Native American ancestors, populations that appear similar under classical tests such as or principal component analysis may still harbor population-specific Native American haplotypes that must be carefully accounted for when performing rare-variant association testing in cosmopolitan cohorts. Similarly, the choice of a replication cohort for an identified risk variant should be guided by the ancestral background on which the variant is found. The PUR may be an excellent replication cohort for a result found in CLM if the background is European. If the background is Native American, a different cohort with related Native American ancestry would likely be much more appropriate. Understanding the genetics of the different ancestral populations of the Americas, and the relatedness among these ancestral groups, will therefore facilitate the development of association methods that account for and take advantage of this rich diversity.

Methods

Negative ascertainment

Ideally, we would have been able to directly model the joint site-frequency spectrum (SFS) of all the ancestral populations to the PUR, CLM, and MXL. However, because we are interested in distinguishing the Native American ancestries to the three populations, this would require modeling at least 5 populations, which is beyond the scope of current methods. We would like to use the inferred local ancestry to focus on the Native American ancestry only, but this is difficult because most Native American haplotypes are in segments heterozygous for ancestry. Because of phasing errors, allele-specific ancestry can be incorrectly assigned. To minimize the impact of such mis-assigned ancestry and to ensure that we focused on variants of genuine Native American ancestry, we discarded all variants observed in 1000 Genomes individuals of African, European, and Asian ancestry, as well as variants observed in Hispanic/Latino populations in segments with no Native American ancestry inferred.

We then considered all remaining variable sites that were assigned Nat/Nat or Nat/Eur diploid ancestry, and calculated the expected frequency distribution under the assumption of perfect negative ascertainment, that is, that all remaining variants were on the Native American background. Because the European backgrounds are expected to carry a number of singletons, this assumption results in an overestimate of the number of singletons in the Native American ancestry. Fortunately, this bias is easy to estimate empirically: we first choose segments of Eur/Eur ancestry to mimic the European haplotypes in our sample. After applying the negative ascertainment scheme to these genotypes, we can directly estimate the bias it introduces. In practice, this correction is very small except for singletons, as expected. The number of excess singletons was 129 for CLM, 73 for PUR, and 40 for MXL. The largest non-singleton correction is 1.3 for doubletons in CLM.
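The filtering and correction steps described above can be illustrated with a minimal sketch. It assumes that variants are keyed by a site identifier, that the outgroup panel is a plain set of such identifiers, and that the bias estimated from Eur/Eur segments is subtracted directly from the observed counts; the function names and the toy data are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter

def negative_ascertainment(candidate_counts, outgroup_variants):
    """Keep only variants that were never observed in the outgroup panel.

    candidate_counts: dict mapping site id -> allele count on putatively
    Native American segments (hypothetical input format).
    outgroup_variants: set of site ids seen in non-Native samples or segments.
    """
    return {site: n for site, n in candidate_counts.items()
            if site not in outgroup_variants}

def count_spectrum(variant_counts):
    """Number of variants observed once, twice, and so on."""
    return Counter(variant_counts.values())

def correct_spectrum(observed, bias):
    """Subtract the empirically estimated excess in each frequency class.

    bias: dict mapping allele count -> expected number of European variants
    that survive the filter (assumed to be estimated from Eur/Eur segments).
    """
    return {k: max(observed[k] - bias.get(k, 0.0), 0.0) for k in sorted(observed)}

# Toy example: five candidate variants, one of which is seen in the outgroup.
native_counts = {"s1": 1, "s2": 1, "s3": 2, "s4": 3, "s5": 1}
outgroup = {"s4"}
kept = negative_ascertainment(native_counts, outgroup)
observed = count_spectrum(kept)
corrected = correct_spectrum(observed, {1: 1.0})  # one excess singleton assumed
```

Flooring the corrected counts at zero is a design choice of this sketch, made only to keep the toy spectrum non-negative.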

Because negative ascertainment removes a significant proportion of the variants that were present at the Native American split from other populations, we hypothesized that this effect could be well-approximated by a severe bottleneck at the time of split between non-Native and Native American ancestry.

Figure 9 provides a simulated example, wherein a marginal spectrum (top) is compared to a spectrum negatively ascertained using 100 diploid individuals from the ‘outgroup’ population (middle) and to a bottleneck approximation equivalent (bottom). More quantitatively, we simulated a sample from two populations that diverged 12.1 kya, negatively ascertained it using a population that diverged 16.5 kya, and attempted to model the result as a two-population model with an early bottleneck. The inferred bottleneck timing was within of the split time with the outgroup, and the three population sizes and split time between populations 1 and 2 were within of the correct value. These biases are well within the acceptable range given other biases and uncertainties.

**Fig. 9. Illustration of the negative ascertainment scheme, with simulation.**

Allele frequencies in Native American segments

We wish to estimate the allele frequencies at each site among segments of Native American origin, but we have to contend with a finite sample and inaccurate phasing. We therefore choose to model the underlying population frequency $\vec f$ across all populations using Bayes' rule

$$P(\vec f \mid g, \gamma) = \frac{P(g \mid \vec f, \gamma)\, P(\vec f \mid \gamma)}{P(g \mid \gamma)}, \tag{1}$$

where $g$ is the observed genotype data, $g \in \{0,1,2\}^n$, and $\gamma$ is the diploid local ancestry calls (e.g., $\gamma \in \{AA, AB, BB\}^n$ for populations A and B). From this distribution we can calculate expected frequency and confidence intervals. We report inferred frequencies and confidence intervals at non-monomorphic sites.

To estimate $P(g \mid \vec f, \gamma)$, we write $\vec f = (f_A, f_B)$ as the frequencies of the non-reference allele in populations $A$ and $B$. We have $P(g_i = 1 \mid \vec f, \gamma_i = AA) = 2 f_A (1 - f_A)$; for ancestry and genotype heterozygous segments, $P(g_i = 1 \mid \vec f, \gamma_i = AB) = f_A (1 - f_B) + f_B (1 - f_A)$, and so forth. To estimate $P(\vec f \mid \gamma)$, we first observe that because we are considering population frequencies, rather than sample frequencies, $\vec f$ is independent of $\gamma$: $P(\vec f \mid \gamma) = P(\vec f)$. This suggests the use of a self-consistent, expectation-maximization procedure. We estimate the underlying frequency distribution as

$$P(\vec f) = \frac{1}{S} \sum_{s=1}^{S} P(\vec f \mid g_s, \gamma_s), \tag{2}$$

the normalized sum over the estimated probabilities at each of the $S$ sites. We can thus iterate Equations (1) and (2) until self-consistency is reached to estimate both allele frequency distributions and single-site allele frequencies in each population.

A final caveat is that the sum runs over all sites, including monomorphic ones. If we only observe the subset of sites that are polymorphic, an additional step is needed. If $M$ is the number of monomorphic (unobserved) sites (denoted as $g_0$), and $\sum'_s$ represents the sum over polymorphic sites, we have

$$S\, P(\vec f) = {\sum_s}' P(\vec f \mid g_s, \gamma_s) + M\, \frac{P(g_0 \mid \vec f)\, P(\vec f)}{P(g_0)} \tag{3}$$

and, therefore, since $P(g_0) = M/S$,

$$P(\vec f) = \frac{{\sum_s}' P(\vec f \mid g_s, \gamma_s)}{S \left[ 1 - P(g_0 \mid \vec f) \right]}.$$

Intuitively, we are correcting for the proportions of sites at every frequency that might have gone undetected. Results are reported using 20 EM iterations, for sites where all individuals had both ancestry and genotype calls, and data can be downloaded at http://genomes.uprm.edu/Taino/.
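A minimal sketch of the self-consistent iteration may help make the procedure concrete. It assumes two source populations A and B, random mating within each ancestry, a discrete grid of candidate frequencies, and that all sites (including monomorphic ones) are observed, so the correction for undetected sites is omitted. The grid size, data layout, and function names are illustrative assumptions rather than the implementation used for the reported results.

```python
from itertools import product

def genotype_prob(g, anc, fA, fB):
    """P(genotype g | frequencies, diploid ancestry call), g in {0, 1, 2}."""
    p, q = {"AA": (fA, fA), "AB": (fA, fB), "BB": (fB, fB)}[anc]
    return [(1 - p) * (1 - q), p * (1 - q) + q * (1 - p), p * q][g]

def site_likelihood(site, fA, fB):
    """Likelihood of one site: product over individuals (genotype, ancestry)."""
    like = 1.0
    for g, anc in site:
        like *= genotype_prob(g, anc, fA, fB)
    return like

def em_frequencies(sites, n_grid=21, n_iter=20):
    """Alternate between per-site posteriors and the shared frequency prior."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    points = list(product(grid, grid))            # candidate (fA, fB) pairs
    likes = [[site_likelihood(s, fA, fB) for fA, fB in points] for s in sites]
    prior = [1.0 / len(points)] * len(points)     # start from a flat prior
    posteriors = []
    for _ in range(n_iter):
        posteriors = []
        for like in likes:
            weights = [l * p for l, p in zip(like, prior)]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # New prior: average of the per-site posteriors.
        prior = [sum(col) / len(sites) for col in zip(*posteriors)]
    return points, prior, posteriors

def posterior_mean(points, posterior):
    """Expected (fA, fB) under a posterior over grid points."""
    mean_a = sum(w * fA for (fA, _), w in zip(points, posterior))
    mean_b = sum(w * fB for (_, fB), w in zip(points, posterior))
    return mean_a, mean_b

# Toy data: site 1 carries the allele on A backgrounds, site 2 on B backgrounds.
site1 = [(2, "AA"), (0, "BB"), (1, "AB"), (2, "AA"), (0, "BB")]
site2 = [(0, "AA"), (2, "BB"), (1, "AB"), (0, "AA"), (2, "BB")]
points, prior, posteriors = em_frequencies([site1, site2], n_grid=11)
mean_a, mean_b = posterior_mean(points, posteriors[0])
```

Confidence intervals could be read off the same per-site posterior by accumulating probability mass along each frequency axis.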

To test this method, we considered 84 diploid individuals, each formed by drawing two chromosomes (without replacement) from 84 CEU and 84 YRI individuals, resulting in a simulated 50–50 admixture proportion. We considered 100,000 sites on chromosome 22, and performed the EM inference as described.

Among the 85,677 sites that were found to be polymorphic, only 13 had a sample allele frequency departing from the confidence interval for the European ancestry, and 51 among the African ancestry. Confidence intervals encompass much more than of sample allele frequencies, emphasizing that the width of the confidence interval largely reflects the uncertainty about the population frequency given a fixed sample frequency, rather than the phasing uncertainty.

Optimizing the demographic model

Because the demographic model considered here does not involve migrations between Native groups, we considered the composite likelihood of three pairwise two-population allele frequency distributions, rather than the full three-population spectrum. This allows for much faster inference and better convergence of the numerical optimization. In principle, it also enables the joint inference of more than three populations. We showed through simulations that the use of a composite likelihood had an effect on inferred parameters that was much smaller than other sources of uncertainty. We used grids of 20, 40, and 60 grid points per population, and projected Native American allele frequencies to sample sizes of 10 in PUR, 20 in CLM, and 40 in MXL.
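The two ingredients named above, projection to a smaller sample size and a composite likelihood over pairwise spectra, can be sketched as follows. This sketch assumes hypergeometric down-sampling for the projection and independent Poisson-distributed spectrum entries for the likelihood, which is a common choice in diffusion-based inference [58] but is an assumption here; the function names and toy spectra are hypothetical. Projection is shown for a one-dimensional spectrum; a two-population spectrum would apply the same weights along each axis.

```python
import math

def project_sfs(sfs, m):
    """Project a 1D spectrum sfs[0..n] down to m chromosomes.

    Each variant with i copies among n chromosomes contributes to count j
    among m chromosomes with hypergeometric probability.
    """
    n = len(sfs) - 1
    out = [0.0] * (m + 1)
    for i, count in enumerate(sfs):
        for j in range(max(0, m - (n - i)), min(i, m) + 1):
            weight = math.comb(i, j) * math.comb(n - i, m - j) / math.comb(n, m)
            out[j] += count * weight
    return out

def poisson_loglik(model, data):
    """Poisson log-likelihood of a 2D spectrum given expected entries."""
    ll = 0.0
    for model_row, data_row in zip(model, data):
        for mu, d in zip(model_row, data_row):
            if mu <= 0:
                if d > 0:
                    return float("-inf")  # observed counts where none expected
                continue
            ll += d * math.log(mu) - mu - math.lgamma(d + 1)
    return ll

def composite_loglik(model_pairs, data_pairs):
    """Sum of pairwise log-likelihoods, ignoring their correlation."""
    return sum(poisson_loglik(model_pairs[k], data_pairs[k]) for k in data_pairs)

# Toy pairwise spectra for the three population pairs (tiny 2x2 arrays).
data = {
    ("PUR", "CLM"): [[0, 3], [2, 1]],
    ("PUR", "MXL"): [[0, 4], [1, 2]],
    ("CLM", "MXL"): [[0, 2], [5, 1]],
}
flat_model = {k: [[1.0, 1.0], [1.0, 1.0]] for k in data}
ll_flat = composite_loglik(flat_model, data)
ll_best = composite_loglik(data, data)  # model equal to the data
projected = project_sfs([0, 10, 0, 0, 0], 2)
```

In an actual fit, the model spectra would be recomputed from the demographic parameters at each optimization step and the composite log-likelihood maximized over those parameters.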

Supporting Information

References

1. 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.

2. O'Rourke DH, Raff JA (2010) The human genetic history of the Americas: the final frontier. Curr Biol 20: R202–R207.

3. Luis Lanata J, Martino L, Osella A, Garcia-Herbst A (2008) Demographic conditions necessary to colonize new spaces: the case for early human dispersal in the Americas. World Archaeology 40: 520–537.

4. Goebel T, Waters MR, O'Rourke DH (2008) The late Pleistocene dispersal of modern humans in the Americas. Science 319: 1497–1502.

5. Dillehay TD (2009) Probing deeper into first American studies. Proc Natl Acad Sci USA 106: 971–978.

6. Dillehay TD, Ramírez C, Pino M, Collins MB, Rossen J, et al. (2008) Monte Verde: seaweed, food, medicine, and the peopling of South America. Science 320: 784–786.

7. Bodner M, Perego UA, Huber G, Fendt L, Rock AW, et al. (2012) Rapid coastal spread of First Americans: Novel insights from South America's Southern Cone mitochondrial genomes. Genome Res 22: 811–820.

8. Hurst CT (1943) A Folsom site in a mountain valley of Colorado. American Antiquity 8: 250–253.

9. Meltzer DJ (2009) First Peoples in a New World. Colonizing Ice Age America. Univ of California Press.

10. Rouse I (1992) The Tainos: Rise and decline of the people who greeted Columbus. Yale University Press.

11. Rodríguez-Ramos R (2010) Rethinking Puerto Rican Precolonial History. University of Alabama Press.

12. Veloz Maggiolo M (1991) Panorama histórico del caribe precolombino. Quinto Centenario del Descubrimiento de América Banco Central de la República Dominicana.

13. Hopper R (2008) Taino Indians: Settlements of the Caribbean. Lambda Alpha Journal 38: 62–69.

14. Moscoso F (2008) Caciques, aldeas y población taína de Boriquén (Puerto Rico), 1492–1582. Academia Puertorriqueña de la Historia.

15. Alegría RE, Quiñones ER (1999) Historia y cultura de Puerto Rico: Desde la época pre-colombina hasta nuestros días. Fundación Francisco Carvajal.

16. Via M, Gignoux CR, Roth LA, Fejerman L, Galanter J, et al. (2011) History Shaped the Geographic Distribution of Genomic Admixture on the Island of Puerto Rico. PLoS ONE 6: e16513.

17. Underhill PA, Jin L, Zemans R, Oefner PJ, Cavalli-Sforza LL (1996) A pre-Columbian Y chromosome-specific transition and its implications for human evolutionary history. Proc Natl Acad Sci USA 93: 196–200.

18. Lell JT, Brown MD, Schurr TG, Sukernik RI, Starikovskaya YB, et al. (1997) Y chromosome polymorphisms in native American and Siberian populations: identification of native American Y chromosome haplotypes. Hum Genet 100: 536–543.

19. Bianchi NO, Catanesi CI, Bailliet G, Martinez-Marignac VL, Bravi CM, et al. (1998) Characterization of ancestral and derived Y-chromosome haplotypes of New World native populations. Am J Hum Genet 63: 1862–1871.

20. Karafet TM, Zegura SL, Posukh O, Osipova L, Bergen A, et al. (1999) Ancestral Asian source(s) of new world Y-chromosome founder haplotypes. Am J Hum Genet 64: 817–831.

21. Bortolini MC, Salzano FM, Thomas MG, Stuart S, Nasanen SPK, et al. (2003) Y-chromosome evidence for differing ancient demographic histories in the Americas. Am J Hum Genet 73: 524–539.

22. Mesa NR, Mondragón MC, Soto ID, Parra MV, Duque C, et al. (2000) Autosomal, mtDNA, and Y-chromosome diversity in Amerinds: pre- and post-Columbian patterns of gene flow in South America. Am J Hum Genet 67: 1277–1286.

23. Bortolini MC, Salzano FM, Bau CHD, Layrisse Z, Petzl-Erler ML, et al. (2002) Y-chromosome biallelic polymorphisms and Native American population structure. Ann Hum Genet 66: 255–259.

24. Bailliet G, Ramallo V, Muzzio M, García A, Santos MR, et al. (2009) Brief communication: Restricted geographic distribution for Y-Q* paragroup in South America. Am J Phys Anthropol 140: 578–582.

25. Torroni A, Schurr TG, Cabell MF, Brown MD, Neel JV, et al. (1993) Asian affinities and continental radiation of the four founding Native American mtDNAs. Am J Hum Genet 53: 563–590.

26. Achilli A, Perego UA, Bravi CM, Coble MD, Kong QP, et al. (2008) The Phylogeny of the Four Pan-American MtDNA Haplogroups: Implications for Evolutionary and Disease Studies. PLoS ONE 3: e1764.

27. Kumar S, Bellis C, Zlojutro M, Melton PE, Blangero J, et al. (2011) Large scale mitochondrial sequencing in Mexican Americans suggests a reappraisal of Native American origins. BMC Evol Biol 11: 293.

28. Malhi RS, Cybulski JS, Tito RY, Johnson J, Harry H, et al. (2010) Brief communication: mitochondrial haplotype C4c confirmed as a founding genome in the Americas. Am J Phys Anthropol 141: 494–497.

29. Perego UA, Achilli A, Angerhofer N, Accetturo M, Pala M, et al. (2009) Distinctive Paleo-Indian migration routes from Beringia marked by two rare mtDNA haplogroups. Curr Biol 19: 1–8.

30. Perego UA, Angerhofer N, Pala M, Olivieri A, Lancioni H, et al. (2010) The initial peopling of the Americas: a growing number of founding mitochondrial genomes from Beringia. Genome Res 20: 1174–1179.

31. Tamm E, Kivisild T, Reidla M, Metspalu M, Smith DG, et al. (2007) Beringian standstill and spread of Native American founders. PLoS ONE 2: e829.

32. Sandoval K, Buentello-Malo L, Peñaloza-Espinosa R, Avelino H, Salas A, et al. (2009) Linguistic and maternal genetic diversity are not correlated in Native Mexicans. Hum Genet 126: 521–531.

33. Bonatto SL, Salzano FM (1997) A single and early migration for the peopling of the Americas supported by mitochondrial DNA sequence data. Proc Natl Acad Sci USA 94: 1866–1871.

34. Mulligan CJ, Kitchen A, Miyamoto MM (2008) Updated three-stage model for the peopling of the Americas. PLoS ONE 3: e3199.

35. Fagundes NJR, Kanitz R, Eckert R, Valls ACS, Bogo MR, et al. (2008) Mitochondrial population genomics supports a single pre-Clovis origin with a coastal route for the peopling of the Americas. Am J Hum Genet 82: 583–592.

36. Wang S, Lewis CM, Jakobsson M, Ramachandran S, Ray N, et al. (2007) Genetic variation and population structure in native Americans. PLoS Genet 3: e185.

37. Rojas W, Parra MV, Campo O, Caro MA, Lopera JG, et al. (2010) Genetic make up and structure of Colombian populations by means of uniparental and biparental DNA markers. Am J Phys Anthropol 143: 13–20.

38. Yang NN, Mazieres S, Bravi C, Ray N, Wang S, et al. (2010) Contrasting patterns of nuclear and mtDNA diversity in Native American populations. Ann Hum Genet 74: 525–538.

39. Scliar MO, Soares-Souza GB, Chevitarese J, Lemos L, Magalhães WCS, et al. (2012) The population genetics of quechuas, the largest native south american group: Autosomal sequences, SNPs, and microsatellites evidence high level of diversity. Am J Phys Anthropol 147: 443–451.

40. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, et al. (2012) Reconstructing Native American population history. Nature 488: 370–374.

41. Martínez-Cortés G, Salazar-Flores J, Fernández-Rodríguez LG, Rubi-Castellanos R, Rodríguez-Loya C, et al. (2012) Admixture and population structure in Mexican-Mestizos based on paternal lineages. J Hum Genet 57: 568–574.

42. Rubi-Castellanos R, Martínez-Cortés G, Muñoz-Valle JF, González-Martín A, Cerda-Flores RM, et al. (2009) Pre-Hispanic Mesoamerican demography approximates the present-day ancestry of Mestizos throughout the territory of Mexico. Am J Phys Anthropol 139: 284–294.

43. Bedoya G, Montoya P, García J, Soto I, Bourgeois S, et al. (2006) Admixture dynamics in Hispanics: a shift in the nuclear genetic ancestry of a South American population isolate. Proc Natl Acad Sci USA 103: 7234–7239.

44. Martínez-Cruzado JC, Toro-Labrador G, Viera-Vera J, Rivera-Vega MY, Startek J, et al. (2005) Reconstructing the population history of Puerto Rico by means of mtDNA phylogeographic analysis. Am J Phys Anthropol 128: 131–155.

45. Bolnick DA, Bolnick DI, Smith DG (2006) Asymmetric male and female genetic histories among Native Americans from Eastern North America. Mol Biol Evol 23: 2161–2174.

46. Carvajal-Carmona LG, Soto ID, Pineda N, Ortíz-Barrientos D, Duque C, et al. (2000) Strong Amerind/White Sex Bias and a Possible Sephardic Contribution among the Founders of a Population in Northwest Colombia. Am J Hum Genet 67: 1287–1295.

47. Alexander DH, Novembre J, Lange K (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19: 1655–1664.

48. Maples BK, Gravel S, Kenny EE, Bustamante CD (2013) RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. Am J Hum Genet 93: 278–288.

49. Galanter JM, Fernandez-Lopez JC, Gignoux CR, Barnholtz-Sloan J, Fernandez-Rozadilla C, et al. (2012) Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas. PLoS Genet 8: e1002554.

50. Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, et al. (2013) Reconstructing the Population Genetic History of the Caribbean. arXiv.

51. Moseley C (2004) Encyclopedia of the World's Endangered Languages. Routledge.

52. Combes I, Lowrey K (2006) Slaves without Masters? Arawakan Dynasties among the Chiriguano (Bolivian Chaco, Sixteenth to Twentieth Centuries). Ethnohistory 53: 689–714.

53. Gravel S (2012) Population genetics models of local ancestry. Genetics 191: 607–619.

54. Kidd JM, Gravel S, Byrnes J, Moreno-Estrada A, Musharoff S, et al. (2012) Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation. Am J Hum Genet 91: 660–671.

55. Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, et al. (2008) Whole population, genome-wide mapping of hidden relatedness. Genome Res 19: 318–326.

56. Palamara PF, Lencz T, Darvasi A, Pe'er I (2012) Length distributions of identity by descent reveal fine-scale demographic history. Am J Hum Genet 91: 809–822.

57. Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, et al. (2012) Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 28: 1359–1367.

58. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet 5: e1000695.

59. Myers S, Fefferman C, Patterson N (2008) Can one learn history from the allelic spectrum? Theor Popul Biol 73: 342–348.

60. Salzano FM, Bortolini MC (2005) The Evolution and Genetics of Latin American Populations. Cambridge University Press.

61. Moscoso F (2008) Caciques, aldeas y población taína de Boriquén (Puerto Rico), 1492–1582. Academia Puertorriqueña de la Historia.

62. Dobyns HF (1966) An appraisal of techniques with a new hemispheric estimate. Current Anthropology 7: 395–416.

63. Scally A, Durbin R (2012) Revising the human mutation rate: implications for understanding human evolution. Nat Rev Genet 13: 745–753.

64. Sanders SJ, Murtha MT, Gupta AR, Murdoch JD, Raubeson MJ, et al. (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485: 237–241.

65. Gorostiza A, Acunha-Alonzo V, Regalado-Liu L, Tirado S, Granados J, et al. (2012) Reconstructing the history of Mesoamerican populations through the study of the mitochondrial DNA control region. PLoS ONE 7: e44666.

66. Bryc K, Velez C, Karafet T, Moreno-Estrada A, Reynolds A, et al. (2010) Colloquium paper: genome-wide patterns of population structure and admixture among Hispanic/Latino populations. Proc Natl Acad Sci USA 107 Suppl 2: 8954–8961.

67. Wang S, Ray N, Rojas W, Parra MV, Bedoya G, et al. (2008) Geographic patterns of genome admixture in Latin American Mestizos. PLoS Genet 4: e1000037.

68. Tremblay M, Vézina H (2000) New estimates of intergenerational time intervals for the calculation of age and origins of mutations. Am J Hum Genet 66: 651–658.