Dissection of a Complex Disease Susceptibility Region Using a Bayesian Stochastic Search Approach to Fine Mapping

Download PDF České info

Genetic association studies have identified many DNA sequence variants that associate with disease risk. By exploiting the known correlation that exists between neighbouring variants in the genome, inference can be extended beyond those individual variants tested to identify sets within which a causal variant is likely to reside. However, this correlation, particularly in the presence of multiple disease causing variants in relative proximity, makes disentangling the specific causal variants difficult. Statistical approaches to this fine mapping problem have traditionally taken a stepwise search approach, beginning with the most associated variant in a region, then iteratively attempting to find additional associated variants. We adapted a stochastic search approach that avoids this stepwise process and is explicitly designed for dealing with highly correlated predictors to the fine mapping problem. We showed in simulated data that it outperforms its stepwise counterpart and other variable selection strategies such as the lasso. We applied our approach to understand the association of two immune-mediated diseases to a region on chromosome 10p15. We identified a model for multiple sclerosis containing two variants, neither of which was found through a stepwise search, and functionally linked both of these to the neighbouring candidate gene, IL2RA, in independent data. Our approach can be used to aid fine mapping of other disease-associated regions, which is critical for design of functional follow-up studies required to understand the mechanisms through which genetic variants influence disease.

Published in the journal: Dissection of a Complex Disease Susceptibility Region Using a Bayesian Stochastic Search Approach to Fine Mapping. PLoS Genet 11(6): e32767. doi:10.1371/journal.pgen.1005272
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1005272

Summary

Introduction

Genome-wide association studies have been very successful at identifying disease-associated variation by exploiting linkage disequilibrium (LD), meaning that only a subset of markers need be surveyed to detect association. However, this same LD, particularly when combined with the presence of multiple disease causal variants in relative proximity, makes disentangling the specific causal variants difficult. Mapping the likely causal variants in regions associated with complex traits as precisely as possible is becoming increasingly important for two reasons. Firstly, as detailed chromatin annotations become available, more precise maps of probable causal variants will allow researchers to fully exploit these resources through integrative analyses targeted at identifying the specific genes, molecules and cells underlying disease association [1]. Secondly, the functional studies which are ultimately required to confirm causal mechanisms are laborious and need to focus, from the outset, on the smallest yet most complete set of plausible causal variants.

It is widely recognised that the most associated single nucleotide polymorphism (SNP) in a region is not necessarily the causal variant [2], yet attempts at fine mapping disease signals typically proceed by successive conditioning on the most associated SNPs, a form of stepwise regression that may produce incorrect results, particularly when causal variants are correlated, i.e. in linkage disequilibrium (LD) [3]. Bayesian methods have been used for fine mapping association signals in dense and imputed genotyping data and generating credible sets of variants that are likely to contain the causal variant, analogous to credible intervals for odds ratios [4]. However, these approaches typically make the simplifying assumption that exactly one causal variant exists at any individual region [4] because accounting for multiple causal variants leads to exponential increase in the number of models that need to be considered with the number of causal variants considered. Bayesian methods that summarise evidence across SNPs in a region to assess either enrichment of signals in chromatin states [5] or colocalisation between association signals for different traits [6] make a similar single causal variant assumption. As stepwise approaches often obtain evidence for additional, independent association signals in a region [7–10], this assumption is unrealistic. One exception is BimBam [11] which can fit multi SNP models, but considers all possible models up to a specified maximum number of causal SNPs. As the number of potential models grows exponentially, BimBam is limited to regions with relatively few SNPs for computational reasons.

Monte Carlo methods can avoid limitations on the number of causal variants by sampling the model space rather than visiting all possible models. Here we adapt a Bayesian evolutionary stochastic search algorithm, GUESS [12, 13], to the fine mapping problem. This method, and its fast computational implementation, is tailored to efficiently explore the multimodal space created by multiple SNP models. However, the very dense SNP map that is required for fine mapping leads to extreme LD, which presents two specific challenges for GUESS. The first is that SNPs in extremely tight LD can cause numerical instability in model fitting, so we use minimal tagging to explore the model space and then expand all the tag models initially selected by GUESS (Fig 1). Second, posterior support is diluted across SNPs in tight LD, potentially preventing direct inference on the importance of individual SNPs. We therefore use posterior model probabilities and patterns of LD to define sets of SNPs which have strong joint posterior support for the hypothesis that one member of the set is causal for the trait. These are analogous to the credible sets generated in the Bayesian fine mapping framework which assumes a single causal variant per region [4], but allow for multiple causal variants.

**Fig. 1. Overview of the fine mapping tailored stochastic search strategy in GUESSFM.**

Our adaptions around GUESS are available in an R package, GUESSFM (https://github.com/chr1swallace/GUESSFM). We used GUESSFM to fine map the association of multiple sclerosis (MS) and type 1 diabetes (T1D) to an established susceptibility region for immune-mediated diseases on chromosome 10p15 (S1 Fig), which contains the candidate causal gene IL2RA. IL2RA encodes a subunit CD25 (IL-2RA) of the interleukin 2 (IL-2) receptor that is essential for the high affinity binding of IL-2 [14]. The region has provided some prime examples of evolving genetic research. Initially associated with T1D susceptibility using a candidate gene and tag SNP approach [15], further studies revealed association to other autoimmune diseases, including MS [16] and rheumatoid arthritis [17], and demonstrated that multiple genetic markers were needed to explain the T1D association [18]. Genotype to phenotype studies have demonstrated that disease-associated variants in the region also associate with IL2RA mRNA expression and CD25 protein expression on the surface of naive and memory CD4⁺ T cells [19–21] and sensitivity of memory CD4⁺ T and activated T regulatory cells to IL-2 [22].

Results

The use of different genotyping panels for different diseases has presented a challenge for comparative analysis across diseases. Here, we refined the T1D and MS association signals in the associated region on chromosome 10p15 by taking advantage of a unified dense SNP map provided by the ImmunoChip, a custom Illumina 200K Infinium High-Density array covering 186 distinct autoimmune susceptibility risk regions [7, 23]. We used genotypes on a total of 52,637 samples (S1 Table), and supplemented the ImmunoChip variants by imputation from the 1000 Genomes Project data. This gave a total of 667 SNPs and small indels with minor allele frequency > 0.005 in controls for analysis after genotype quality control.

To allow for the effects of LD, traditional conditional regression is often supplemented by identifying subsets of SNPs with some minimum LD (e.g. r² > 0.8) with individual SNPs selected by a forward stepwise procedure, within which it is suggested the causal variant may lie. We compared our proposed approach to traditional conditional regression by simulating phenotypes conditional on multiple “causal variants” selected randomly from within these genetic data, and confirmed that the proposed approach both recovered a higher proportion of the true effects and simultaneously detected fewer false positive results (Fig 2). GUESSFM requires specification of the a priori expected number of causal variants in a region, n_exp. We considered either setting n_exp = 3 or n_exp = 5, and found the performance decreased when n_exp was less than the true value, but not when it was greater than the true value, suggesting that for recovery of the correct causal variants it is better to set n_exp higher rather than lower. We also considered three regularised regression approaches: the lasso, the elastic net and the group lasso. The ROC curves of those methods (for decreasing penalties) are intermediate to the stepwise and stochastic GUESSFM search approaches (Fig 2). However, at traditional optimization thresholds (minimising the ten fold cross validation), regularized regression approaches were extremely anti-conservative, with a very high false discovery rate, compared to stepwise (stopping at p < 10⁻⁸ or p < 10⁻⁶) or GUESSFM for suitable thresholds on posterior probabilities (S2 Table). Picking a regularisation parameter according to the number of predictors included (three or five) produced a more competitive false discovery rate, but discovery rates were still lower than with GUESSFM (S2 Table).

**Fig. 2. Comparison of of several multivariate methods for fine mapping using simulated data.**

We applied our approach to the MS and T1D data sets. Posterior support for the number of SNPs required to model the association is strongly peaked, favouring a two-SNP model in MS and a four SNP model in T1D (S2 Fig). We summarised inference for SNPs over multiple models by considering the support for a SNP group according to the sum of posterior probabilities (PP) over all models containing a SNP from that group. We term this the group marginal posterior probability of inclusion (gMPPI), which can be understood as the probability that one SNP in the group is causal for the trait of interest. For T1D, there is high confidence (gMPPI > 0.8) for four SNP groups that we denote according to their order in S3 Fig: A (indexed by rs12722496), C (rs11594656), E (rs6602437) and F (rs41295159). As the T1D data have previously been analysed using stepwise regression [23], we compared our results to those found by forward stepwise analysis of the same data (Table 1). We saw a direct correspondence in terms of LD (r² > 0.6) between the SNPs identified by the two approaches (Table 2). However, models found by GUESSFM had larger log likelihoods for a given number of SNPs, indicating that these models offered a better explanation of T1D-associated genetic variation in the region.

**Tab. 1. Comparison of results from stochastic GUESSFM and stepwise searches.**

LD (<i>r</i><sup>2</sup>) between SNPs in selected models for T1D using stochastic GUESSFM and stepwise searches. — **Tab. 2. LD (r²) between SNPs in selected models for T1D using stochastic GUESSFM and stepwise searches.**

In contrast, for MS, there were no SNP groups with high gMPPI. We instead saw two distinct models, which together accounted for > 80% of the posterior support amongst the 514,476 visited models: either model M1, consisting of SNP groups A (indexed by rs12722496) and D (rs56382813), or model M2 consisting only of SNP group B (rs2104286). SNPs from M1 and M2 were rarely selected together (total PP = 0.039 across the visited models). rs2104286 was reported as the single associated SNP in the region in the original MS ImmunoChip analysis after forward stepwise regression [7] (Table 1). rs2104286 is in low to moderate LD with both of the M1 index SNPs (r² ≃ 0.3, Table 3).

LD (<i>r</i><sup>2</sup>) between SNPs in selected models for MS using stochastic GUESSFM and stepwise searches. — **Tab. 3. LD (r²) between SNPs in selected models for MS using stochastic GUESSFM and stepwise searches.**

Haplotype analysis showed that whilst haplotypes carrying the rs2104286:C allele are protective for MS, so is a less common haplotype carrying the rs2104286:T allele and the protective allele at rs56382813 (Table 4). Indeed, the addition of rs2104286 to a model consisting of the haplotypes formed by M1 was non-significant (p = 0.196) whilst the addition of the haplotypes formed by the M1 SNPs to a model consisting of only rs2104286 was significant (p = 2.64 × 10⁻⁶). However, the posterior support favours model M1 over M2 only by a factor of 1.7 and this is dependent on our prior expectation for the number of causal variants: had we expected only a single causal variant in the region, we would have favoured M2 by a factor of 1.7, with M1 being favoured only with a prior expectation of 1.8 or more causal variants (S4 Fig). Under any reasonable prior, the posterior support for the two models is so close that we cannot choose between them on statistical evidence alone: we must consider them as plausible alternative explanations for MS association in the region. Note that the M1 SNPs were not within the credible set created under the single causal variant assumption, which consists of rs2104286 alone [7].

**Tab. 4. Haplotype analysis of MS association using the index SNPs for the one and two SNP models selected.**

We selected all high confidence SNP groups for more detailed exploration (Fig 3). The T1D signals are located in (1) intron 1 of IL2RA—SNP group A, (2) intergenic between IL2RA and RBM17—C, (3) 5’ of RBM17—E, and (4) 5’ of RBM17 to intron 2 of PFKFB3—F. Under the model M1 for MS, SNP group A was also associated with MS, with the same alleles protective for both (Table 5) whilst the second M1 signal (SNP group D) physically overlapped, but was not in LD with, SNPs from group C. Under the model M2, the sole-MS associated SNP (B) is located in intron 1 of IL2RA, neighbouring the T1D-associated SNP group A, but there was only weak LD between A and B (r² = 0.3).

**Fig. 3. Six sets of SNPs can best explain the association of T1D and MS in the chromosome 10p15 region.**

**Tab. 5. SNPs selected by GUESSFM mapping of MS and T1D associations.**

Since we were unable to distinguish between the competing M1 and M2 models for MS using SNP-disease association data alone, we sought corroborating evidence to support SNPs in either model using CD25 flow cytometric expression data. The M2 SNP rs2104286 has been associated with the age-dependent proportion of naïve T cells that express CD25 on their cell surface and rs12722495 (within SNP group A, located in intron 1 of IL2RA) with the total expression of CD25 on memory T cells [19, 20]. rs12722495 has also been associated with IL2RA mRNA expression in T cells, both resting [19] and activated by 48 hour culture with anti-CD3+CD28 [21]. We selected a single tag SNP from each of the credible sets and examined which could best explain CD25 protein expression phenotypes using previously published data from 179 samples [19], plus an additional 30 samples. Both phenotypes were best explained by a single SNP: rs12722495 from group A again showed the strongest association with intensity of CD25 expression on memory T cells (p = 5.50 × 10⁻¹⁰, S5 Fig). For the proportion of naive CD4⁺ cells that express CD25, the M1 SNP rs41295055 from group D, whose association with CD25 expression was not previously tested, was preferred to the M2 SNP rs2104286 (p = 3.45 × 10⁻⁸ versus p = 2.56 × 10⁻⁶, ΔBIC = 8.43, which is interpreted as “strong” evidence in favour of rs41295055 [24]; Fig 4), supporting the hypothesis that rs2104286 is not itself functional, but merely tags other functional variants which are causal for MS. The A and D SNP groups coincide with regions in IL2RA that contain DNase I sensitivity sites indicating open chromatin available to bind transcription factors under appropriate conditions (Fig 3). In addition, the existence of RNA-seq reads from resting and stimulated CD4⁺ T cells within intron 1, where group A SNPs lie (Fig 3), support the regulatory nature of this region [25].

The proportion of naive CD4<sup>+</sup> T cells that express CD25 (log scale) increases with age. — **Fig. 4. The proportion of naive CD4⁺ T cells that express CD25 (log scale) increases with age.**

Discussion

The 10p15 region contains at least five apparently distinct associations to the immune-mediated diseases, T1D and MS. Our results are the first to support a four SNP model in T1D, which most likely reflects a combination of increased sample sizes and the increased variant density available due to ImmunoChip and imputation, as this model was supported by both stepwise regression and GUESSFM. A previous comparative study of MS and T1D susceptibility in this region using conditional analysis of tag SNPs identified three groups of SNPs [26]: their group I matches our group A, their group II our group C, and their group III our group B. The results showed that the minor allele of SNPs in all these groups was protective for T1D, but that group I (A) was not associated with MS and that group II (C) was susceptible for MS. Our results, based on larger sample sizes and more extensive variant coverage, suggest that the minor allele of SNPs in group A, shared between T1D and MS, are associated with the same protective effect for both diseases. This emphasises the need for surveying the most complete set of variants possible in order to make cross disease comparisons.

Variants in this region have previously been linked to several aspects of IL-2R signalling in T1D and MS patients [27] and the MS-associated variants we identify (whether under the M1 or M2 models) all showed some evidence of association with expression of CD25 on the surface of T cells, linking IL2RA in a primary way into the etiology of this disease. For T1D, we were only able to link the IL2RA intron 1 signal, shared with MS under the M1 model, to CD25 expression on memory T cells. Neither the intergenic T1D SNP group (C), despite physical overlap with the MS associated SNP group D, indexed by rs41295055, nor the sets near RBM17 and PFKFB3 (E and F) have yet been linked to IL2RA expression. These signals could relate to CD25 expression on other cell subsets or under specific conditions. The cell-specific regulation of IL2RA expression and its role in modulating the immune system are both complex and only partially understood. For example, in addition to T cells, IL2RA is also expressed on many other cell types, especially under inflammatory conditions [28]. We have data on only a subset of IL-2R phenotypes that have been previously reported, and previous studies have genotyped only a subset of the disease associated SNPs described in this paper. Further genotyping of large samples with IL-2R related phenotypes is warranted to properly assess any other potential effects of the disease signals we identify on IL-2R signalling. Alternatively, it may be that other genes in the region, which have not been as well studied as IL2RA, are also causal for T1D, either directly or through interaction with IL2RA. For example, PFKFB3, an inducible 6-phosphofructo-2-kinase/fructose-2, 6-bisphosphatase isoform, allows rapid responses to energy requirements and insufficiency can lead to apoptosis and an inability to undergo autophagy in CD4+ T cells in RA [29]. PFKFB3 has also been implicated in regulating insulin secretion in pancreatic beta cells [30], which could explain the T1D specific signals across this gene. All three genes, IL2RA, PFKFB3 and RBM17 are upregulated in CD4⁺ T cells upon ex vivo activation (S3 Table) and two candidate SNPs, from groups E and F, sit between two DNase I sites close to the RBM17 promoter. Furthermore, gene regulation can extend over 100 kb [31] and further studies will be needed to confirm the gene(s) through which these other SNPs influence T1D risk.

Two or more independent signals are commonly observed in fine mapped autoimmune disease regions using stepwise regression [7–10]. Despite long established doubts regarding the validity of stepwise regression [3], its use has continued to dominate in GWAS because of the number of SNPs measured simultaneously. Stepwise regression addresses the questions: ‘which is the single variant that can best explain the maximum trait variance?’ and, conditional on this single variant, ‘do any other variants explain additional trait variance?’ This is not equivalent to asking what set of variants best jointly explain trait variance. Often, as we have observed for T1D, stepwise regression may be expected to identify the same signals as a stochastic search approach. However, our analyses demonstrate that different results may be produced by stepwise and full multi-variant search approaches in some cases. In particular, we highlight and prefer a two SNP model for MS association to the IL2RA region, one of the SNPs shared with T1D (group A, indexed by rs12722495) and associated with CD25 expression on memory T cells, the other (group D, indexed by rs41295055) a better predictor for CD25 expression on the surface of naive CD4⁺ T cells than the SNP identified by stepwise regression (group C, rs2104286).

Methods that search the multimodal model space formed by multiple SNPs without conditioning on univarate association signals can therefore reveal a more accurate picture of the likely disease causal variants in any region. Other Bayesian [32], stochastic search [33] and penalised regression approaches such as lasso [34, 35] have been proposed on a GWAS scale, but these aim primarily to detect association in data sets with relatively sparse genotype coverage. Our focus is the adaptation of a Bayesian variable selection method to the fine mapping problem, and it is likely that an alternative method, such as piMASS [36], could have been substituted for GUESS in the model selection step. A detailed and direct comparison of piMASS with GUESS showed a similar recovery of models in simulated data, but indicated that GUESS was more computationally efficient [13]. We chose to use GUESS for its computation efficiency, but also because its g-prior formulation is specifically designed to deal with highly correlated predictors, a beneficial feature for fine mapping.

Applying the regularised regression approaches to our simulated data showed that the ROC curve for lasso with decreasing penalty out-performed stepwise analysis, but optimising the regularization parameter by minimising the cross validation error led to very high false discovery rates, suggesting that in this context this criterion is highly anti-conservative. Previous studies have also indicated that while regularised regression has strengths in terms of prediction, it may have weaknesses in terms of model selection [37], particularly for correlated predictors [38]. Elastic net [39] and the group lasso [40] have been proposed as regularised approaches tailored to correlated predictors, but in our hands did not outperform the simple lasso plus construction of a set of plausible SNPs according to r² > 0.8. We considered that the information supplied to GUESSFM as a prior expected number of causal variants could be included equivalently in a regularised approach by setting the regularisation parameter according to the number of predictors selected. Although this produced a more competitive false discovery rate, discovery rates remained lower than with GUESSFM (S2 Table).

Our multidimensional analysis strategy, tailored to the high genotyping coverage (and, hence, high LD) required for fine mapping causal variants can provide a more complete picture of the likely causal variants in a region. Any fine mapping study is limited by the set of variants included for study. While we attempted to survey the fullest possible set of SNPs and small indels by using dense genotyping data, plus imputation to the 1000 Genomes Project data, we cannot be sure we included all variants that might affect gene function without sequencing of cases and controls [41]. Further, larger indels, VNTRs and microsatellites remain particularly difficult to genotype with accuracy yet may contribute to disease. Therefore, it is important to bear in mind that all claims that a particular variant is causal are conditional on the true causal variants being accurately genotyped and included in the study.

It is well known that functional biological interactions may exist between variants [42], and therefore the possibility of statistical interactions needs to be considered in fine mapping studies. However, the need to fit interaction terms depends on the evolutionary history of the region. In the IL2RA region, there has been very little historical recombination (S1 Fig) and D′ between markers is nearly always 1, also indicating a lack of recombination [43]. As a consequence, only three of the possible four haplotypes that may be formed from any pair of SNPs are observed, and statistical interaction parameters are inestimable. This also implies that there is a one to one transformation between a haplotype model and a model expressing the log odds of disease as a linear function of single SNP effects [44], as we have employed. Thus, for the IL2RA region, we may neglect statistical interactions without making any assumptions about the existence of biological interactions. For our method to be applied to regions in which D′ < 1, it would need extension to include statistical interaction terms. One simple approach, if the number of SNPs is not to large, would be to generate additional variables representing interactions between SNPs with D′ < 1, but extending GUESS to fit interaction terms is another direction for further research.

An alternative approach would be to attempt to study disease risk across haplotypes directly [45], although these need to be inferred probabilistically when recombination has occurred. Haplotype-based analysis has become less common as marker density has increased. One reason for this is that haplotypes are particularly useful when marker density is low, because a haplotype may tag a causal variant better than any single variant, and haplotype studies have thus fallen out of fashion as denser genotype data have become available. This is exemplified by a study that found multiple statistical interactions underlying gene expression [46] (statistical interactions may be thought of as haplotype models) using SNPs on a genomewide array. Subsequent analysis found that all interactions which replicated in an external dataset with whole genome sequencing, and thus with more complete coverage of potential causal variants, could be better explained by a single SNP [47]. This suggests, perhaps, that statistical interactions between SNPs will prove rare. However, if biological interactions between variants separated by LD do exist, and the interaction depends on the phase of the variants, i.e., the diplotype risk differs from the two locus genotype risk, haplotype methods will again be required to infer likely causal variants.

One way to understand the means through which the effects of a disease-associated variant are mediated is to perform functional assays to examine its effects on a gene’s function [48] or to identify potentially intermediate phenotypes with which it is also associated [19]. Improved identification of likely causal variants should lead to more powerful and reliable follow-up studies, an important factor when many of these experiments require fresh primary cells and laborious wet lab protocols. Similarly, comparison with summary results from eQTL studies would be facilitated by the application of multi modal search strategies to the eQTL data set to ensure the effects of genetic variation are mapped as accurately as possible [49]. Even with large samples, and careful fine-mapping analyses, the causal candidacy of SNPs in high LD cannot be resolved through statistical association alone. Functional studies designed to directly address the confounding induced by linkage disequilibrium, such as allele-specific expression using rare haplotypes that distinguish SNPs in the same tag group, may be helpful in refining further the likely causal variants.

Informative approaches to understanding disease etiology have recently been developed based on looking for enrichment of the cell-specific chromatin marks localising to likely causal variants, and these are being used to highlight disease relevant cells [23, 50]. Here, again, more accurate identification of causal variants will lead to more powerful and precise comparisons, and our approach, which is associated to a more accurate estimation of posterior probabilities for the SNPs in the considered region, should be readily adaptable to the growing set of methods that aim to examine enrichment [5] or colocalisation [6] by model averaging over the posterior probabilities that a SNP is causal.

Materials and Methods

Definition of target region

For fine mapping, we require dense coverage of genetic polymorphisms in the region. We targeted the ImmunoChip fine mapping region centred on IL2RA, namely chr10 : 6030000-6220000 (hg19). This region is bounded by recombination hot spots (S1 Fig).

Genotype data

Genotype data for this region comes from the MS [7] and the T1D [23] ImmunoChip studies. Quality control measures were applied to SNPs and samples as previously described [7]. As MS samples were derived from multiple international cohorts and we found allele frequencies for SNPs in the region varied between cohorts, we manually inspected plots of the first five principal components formed from ancestry informative markers [7] and additionally excluded samples lying outside the main cluster in each cohort.

Imputation

We used IMPUTE2 [51] to impute untyped SNPs using the 1000 Genomes Phase I reference panel. We included a 500 kb window either side of our target region to allow variants in the less densely genotyped neighbouring regions to contribute information on the untyped SNPs.

Adapted “shotgun” GUESS analysis

Tagging

Although GUESS is designed to work with correlated variables, the extreme, and occasionally perfect, LD in very dense genotyping data can lead to unstable results. We found, through a process of trial and error, that tagging so that no two SNPs remained with r² > 0.99 retained good convergence of the stochastic GUESS algorithm. This reduced the number of SNPs from 667 to 443 (T1D) and 453 (MS). For each SNP that was removed, we noted its index SNP, the remaining SNP which had the highest r² with it. We call a set of SNPs sharing a single index SNP a tagset; the number and size of tagsets is shown in S6 Fig.

GUESS analysis

We ran GUESS in parallel on each disease using the index SNPs defined above. For MS, which included internationally derived samples, we included country of origin as a categorical covariate. Monitoring plots for each GUESS run, generated by R2GUESS [52] (http://cran.r-project.org/web/packages/R2GUESS), are shown in S7 and S8 Figs. We saved the top 30,000 models visited for each disease and for each saved model, we generated an expanded set of models, by adding each model that could be formed by replacing any index SNP with any of the SNP(s) in its tagset (Fig 1). This expanded the set of models under consideration, ℳ, ten fold, to 514,476. For each trait, we derived a posterior weight for each model in ℳ. To be precise, we generated approximate Bayes factors for each model, m ∈ ℳ, for each trait, using the Bayesian Information Criterion (BIC) approximation:

where

L_m and k_m represent the logistic likelihood of the data and the number of parameters under model m, n is the number of samples and 0 represents the null model. Finally we combined these Bayes factors with priors for each model, π_m, to derive a posterior probability for each model

Choice of priors

Priors could conceivably be generated on the basis of individual SNP content, for example, to prioritise those overlapping genomic annotations of particular interest. Such an approach has been adopted in a hierarchical framework, albeit with the single causal variant assumption [5]. For simplicity, and because we were interested to discover likely causal variants without predefining their likely mechanism, our model priors were determined only by the number of SNPs contained in a model, N_m. A natural prior is then binomial or beta binomial. Given that, for a fixed expected value, a beta binomial puts larger weight on implausibly large models, we chose to use a binomial model, and, given published data on T1D, set the expected number of SNPs at three. For N SNPs in the target region, this means the prior for a model containing N_m SNPs is

where q is set so the expected value of the binomial distribution was 3, ie q = 3 N.

Priors and posteriors for the number of causal SNPs are shown in S2 Fig.

SNP groups for candidate causal variants

We do not expect that we will be able to discriminate, statistically, between highly correlated variants. Instead, the posterior support is likely to be diluted across such sets of variants. To group such SNPs, we used the marginal posterior probabilities of inclusion (MPPI) for each SNP, and applied the following algorithm:

Pick the index SNP with maximum MPPI
Order remaining SNPs by decreasing r² with index SNP
Exclude SNPs which co-occur in models with the index SNP (joint MPPI > 0.02)
Step away from the index SNP in order of decreasing r², adding SNPs to its group until MPPI < 0.001 for two SNPs in a row (NB, these SNPs will not be added to the SNP group), or until r² < 0.5
Remove this set of SNPs and return to step 1 until no SNP remains with MPPI > 0.01.

We summarize the support for any group of SNPs by the gMPPI, the sum of the posterior probabilities over all models containing a SNP in that group. The complete algorithm is available in the function snp.picker from the R package GUESSFM (https://github.com/chr1swallace/GUESSFM).

Model averaged effect estimates

We produced effect estimates for each SNP i as

where β i ( m ) is the estimated effect of SNP i in model m, I(i ∈ m) is an indicator function taking the value 1 if SNP i is included in model m, and 0 otherwise, and P(m) is the posterior probability of m ∈ ℳ.

Simulation analysis

We used simulated data to compare the performance of our proposed method, traditional stepwise regression, the lasso [53] and the elastic net [39]. For each simulation, we selected a random subset of 2,000 T1D control samples genotyped in the IL2RA region and a random set of between two and five “causal variants” from amongst the SNPs with MAF > 0.01. We simulated a Gaussian phenotype for which the causal variants acted in an additive manner to jointly explain 10% of the phenotypic variance. We conducted a total of 1,000 simulations for each scenario.

The data were analysed in parallel using the four different methods. We performed forward stepwise regression, and selected index SNPs at a given p value threshold, α, as those SNPs selected at any stage of the stepwise process with p < α. For each index SNP, we created pseudo “credible sets” as the set of SNPs with r² > 0.8 with the selected index SNPs. Note, by this definition, simulated causal SNPs may appear in more than one SNP group. We calculated the false discovery rate as the proportion of SNP groups which did not contain a causal variant, and we calculated the discovery rate as the proportion of causal variants found in at least one selected SNP group.

We used the snp.picker function from GUESSFM (http://github.com/chr1swallace/GUESSFM) with default settings to define credible sets for causal SNPs. By definition of the algorithm, SNPs may only be members of at most one SNP group. We selected SNP groups according to varying thresholds for the gMPPI, and calculated false discovery and discovery rates as above.

For lasso and elastic net, we used the R package glmnet [54], and optimised the regularisation parameter λ by minimising the ten fold cross validation error using the cv.glmnet() function. For elastic net, we kept the folds constant across different values of α ∈ [0.1,1] and chose the pair of values (α, λ) which minimised the ten fold cross validation error overall. To examine the path of elastic net solutions, we fixed α at this value and varied λ. We used the R package gglasso [55] to implement the grouped lasso, predefining groups of variables as those with r² > 0.8 with each other using heirarchical clustering. We also considered the discovery and false discovery rates with the first three or first five predictors selected, as something analagous to the prior information given to GUESSFM about our expectation of the number of causal variants.

Association with cell surface expression of CD25

Naive CD4⁺ T cells from 209 donors were gated as previously described [19, 56]. Association with index SNPs from disease-associated groups was assessed through linear regression. All possible one, two and three SNP models were considered, and the model with the minimum BIC reported. Expression was log transformed to reduce right skew of the phenotype.

DNase hypersensitivity data

We downloaded DNase I hypersensitivity for CD4⁺ T cells from the Roadmap Consortium [57] from http://vizhub.wustl.edu/VizHub/hg19, accessed 19 September.

RNA seq gene expression data

We measured gene expression in the studied region in two pooled samples (n = 3 and n = 4 individuals / pool) comparing unstimulated and stimulated CD4⁺ T cells. For each individual, 250,000 CD4⁺ T cells (93–99% pure, RosetteSep Human CD4⁺ T Cell Enrichment Cocktail, StemCell Technologies) were stimulated with anti-CD3/CD28 T-activator beads (Dynabeads Life Technologies) at a ratio of 0.3 beads / cell for four hours at 37°C in X-VIVO-15 (Lonza) + 1% AB serum (Lonza) and penicillin/ streptomycin (Life Technologies). Unstimulated CD4⁺ T cells were cultured in medium alone for four hours. Cells were harvested directly into Qiagen lysis buffer (Qiagen) and stored at -80°C until RNA isolation.

RNA was isolated using RNeasy micro kit including gDNA depletion (Qiagen). RNA integrity and concentration was evaluated using the Bioanalyzer platform (Agilent), with all samples showing an RNA integrity score (RIN) > 9.8. 750 ng of total RNA were used for the preparation of cDNA libraries using the Illumina TruSeq (Illumina) platform with a low-cycle-number PCR protocol, and was followed by transcriptome sequencing on an Illumina HiSeq 2000. This yielded four libraries with ∼ 38 million 100 bp paired-end reads each.

We trimmed raw reads to remove primer and adapter contamination, which affected 2% of our sequences, using HTSeq [58]. Reads were aligned to the reference genome Ensembl GRCh37.p13 using STAR [59]. Removal of low quality and unpaired reads, indexing, IL2RA region extraction, and depth counting were performed using SAMtools [60]. We employed HTSeq and DESeq2 [61] to carry out a differential expression analysis between the two conditions based on normalised read counts. We only considered paired-end reads that featured a total and unambiguous overlap with genomic sequence assigned to genes, around 73% of the initial raw sequences. Mapped read counts at each position in each sample are in S4 Table.

This study was approved by local Institutional Review Boards (IRBs) and written informed consents were obtained from all participants in the study. For JDRF/Wellcome Trust Diabetes and Inflammation Laboratory the IRBs are: NRES Committee East of England—Cambridge Central (ref: 08/H0308/153), statistical analysis; and NRES Committee East of England—Norfolk (ref: 05/Q0106/20), for gene expression work. This study was approved by the University of Virginia Institutional Review Board (IRB number 17457); the Type 1 Diabetes Genetics Consortium collection sites obtained approvals for all subject collections and written consent and/or assent was obtained from all participants or their surrogates in the study.

Supporting Information

Zdroje

1. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, et al. (2012) The accessible chromatin landscape of the human genome. Nature 489 : 75–82. doi: 10.1038/nature11232 22955617

2. McCarthy MI, Hirschhorn JN (2008) Genome-wide association studies: potential next steps on a genetic journey. Hum Mol Genet 17: R156–R165. doi: 10.1093/hmg/ddn289 18852205

3. Miller AJ (1984) Selection of subsets of regression variables. Journal of the Royal Statistical Society Series A (General) 147: pp. 389–425. doi: 10.2307/2981576

4. Wellcome Trust Case Control Consortium, Maller JB, McVean G, Byrnes J, Vukcevic D, et al. (2012) Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet 44 : 1294–1301. doi: 10.1038/ng.2435 23104008

5. Pickrell JK (2014) Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am J Hum Genet 94 : 559–573. doi: 10.1016/j.ajhg.2014.03.004 24702953

6. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, et al. (2014) Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet 10: e1004383. doi: 10.1371/journal.pgen.1004383 24830394

7. International Multiple Sclerosis Genetics Consortium (IMSGC) (2013) Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet 45 : 1353–1360. doi: 10.1038/ng.2770 24076602

8. Trynka G, Hunt KA, Bockett NA, Romanos J, Mistry V, et al. (2011) Dense genotyping identifies and localizes multiple common and rare variant association signals in celiac disease. Nat Genet 43 : 1193–1201. doi: 10.1038/ng.998 22057235

9. Eyre S, Bowes J, Diogo D, Lee A, Barton A, et al. (2012) High-density genetic mapping identifies new susceptibility loci for rheumatoid arthritis. Nat Genet 44 : 1336–1340. doi: 10.1038/ng.2462 23143596

10. Hinks A, Cobb J, Marion MC, Prahalad S, Sudman M, et al. (2013) Dense genotyping of immune-related disease regions identifies 14 new susceptibility loci for juvenile idiopathic arthritis. Nat Genet 45 : 664–669. doi: 10.1038/ng.2614 23603761

11. Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3: e114. doi: 10.1371/journal.pgen.0030114 17676998

12. Bottolo L, Richardson S (2010) Evolutionary stochastic search for bayesian model exploration. Bayesian Analysis 5 : 583–618. doi: 10.1214/10-BA523

13. Bottolo L, Chadeau-Hyam M, Hastie DI, Zeller T, Liquet B, et al. (2013) GUESS-ing polygenic associations with multiple phenotypes using a GPU-based evolutionary stochastic search algorithm. PLoS Genet 9: e1003657. doi: 10.1371/journal.pgen.1003657 23950726

14. Malek TR, Castro I (2010) Interleukin-2 receptor signaling: at the interface between tolerance and immunity. Immunity 33 : 153–165. doi: 10.1016/j.immuni.2010.08.004 20732639

15. Vella A, Cooper JD, Lowe CE, Walker N, Nutland S, et al. (2005) Localization of a type 1 diabetes locus in the IL2RA/CD25feng region by use of tag single-nucleotide polymorphisms. Am J Hum Genet 76 : 773–779. doi: 10.1086/429843 15776395

16. International Multiple Sclerosis Genetics Consortium (IMSGC) (2007) Risk alleles for multiple sclerosis identified by a genomewide study. N Engl J Med 357 : 851–862. doi: 10.1056/NEJMoa073493 17660530

17. Barton A, Thomson W, Ke X, Eyre S, Hinks A, et al. (2008) Rheumatoid arthritis susceptibility loci at chromosomes 10p15, 12q13 and 22q13. Nat Genet 40 : 1156–1159. doi: 10.1038/ng.218 18794857

18. Lowe CE, Cooper JD, Brusko T, Walker NM, Smyth DJ, et al. (2007) Large-scale genetic fine mapping and genotype-phenotype associations implicate polymorphism in the IL2RA region in type 1 diabetes. Nat Genet 39 : 1074–1082. doi: 10.1038/ng2102 17676041

19. Dendrou CA, Plagnol V, Fung E, Yang JHM, Downes K, et al. (2009) Cell-specific protein phenotypes for the autoimmune locus IL2RA using a genotype-selectable human bioresource. Nat Genet 41 : 1011–1015. doi: 10.1038/ng.434 19701192

20. Orrù V, Steri M, Sole G, Sidore C, Virdis F, et al. (2013) Genetic variants regulating immune cell levels in health and disease. Cell 155 : 242–256. doi: 10.1016/j.cell.2013.08.041 24074872

21. Ye CJ, Feng T, Kwon HK, Raj T, Wilson MT, et al. (2014) Intersection of population variation and autoimmunity genetics in human T cell activation. Science 345 : 1254665. doi: 10.1126/science.1254665 25214635

22. Garg G, Tyler JR, Yang JHM, Cutler AJ, Downes K, et al. (2012) Type 1 diabetes-associated IL2RA variation lowers IL-2 signaling and contributes to diminished CD4+ CD25+ regulatory T cell function. J Immunol 188 : 4644–4653. doi: 10.4049/jimmunol.1100272 22461703

23. Onengut-Gumuscu S, Chen WM, Burren O, Cooper NJ, Quinlan AR, et al. (in press) Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers. Nat Genet.

24. Kass R, Raftery A (1995) Bayes factors. Journal of the American Statistical Association 90 : 773–795. doi: 10.1080/01621459.1995.10476572

25. Mousavi K, Zare H, Dell’orso S, Grontved L, Gutierrez-Cruz G, et al. (2013) eRNAs promote transcription by establishing chromatin accessibility at defined genomic loci. Mol Cell 51 : 606–617. doi: 10.1016/j.molcel.2013.07.022 23993744

26. Maier LM, Lowe CE, Cooper J, Downes K, Anderson DE, et al. (2009) IL2RA genetic heterogeneity in multiple sclerosis and type 1 diabetes susceptibility and soluble interleukin-2 receptor production. PLoS Genet 5: e1000322. doi: 10.1371/journal.pgen.1000322 19119414

27. Cerosaletti K, Schneider A, Schwedhelm K, Frank I, Tatum M, et al. (2013) Multiple autoimmune-associated variants confer decreased IL-2R signaling in CD4+ CD25(hi) T cells of type 1 diabetic and multiple sclerosis patients. PLoS One 8: e83811. doi: 10.1371/journal.pone.0083811 24376757

28. Dendrou CA, Wicker LS (2008) The IL-2/CD25 pathway determines susceptibility to T1D in humans and NOD mice. J Clin Immunol 28 : 685–696. doi: 10.1007/s10875-008-9237-9 18780166

29. Yang Z, Fujii H, Mohan SV, Goronzy JJ, Weyand CM (2013) Phosphofructokinase deficiency impairs ATP generation, autophagy, and redox balance in rheumatoid arthritis T cells. J Exp Med 210 : 2119–2134. doi: 10.1084/jem.20130252 24043759

30. Arden C, Hampson LJ, Huang GC, Shaw JAM, Aldibbiat A, et al. (2008) A role for PFK-2/FBPase-2, as distinct from fructose 2,6-bisphosphate, in regulation of insulin secretion in pancreatic beta-cells. Biochem J 411 : 41–51. doi: 10.1042/BJ20070962 18039179

31. Davison LJ, Wallace C, Cooper JD, Cope NF, Wilson NK, et al. (2012) Long-range dna looping and gene expression analyses identify DEXI as an autoimmune disease candidate gene. Hum Mol Genet 21 : 322–333. doi: 10.1093/hmg/ddr468 21989056

32. Carbonetto P, Stephens M (2013) Integrated enrichment analysis of variants and pathways in genome-wide association studies indicates central role for IL-2 signaling genes in type 1 diabetes, and cytokine signaling genes in Crohn’s disease. PLoS Genet 9: e1003770. doi: 10.1371/journal.pgen.1003770 24098138

33. Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ (2008) Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet 4: e1000130. doi: 10.1371/journal.pgen.1000130 18654633

34. Wu TT, Chen YF, Hastie T, Sobel E, Lange K (2009) Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25 : 714–721. doi: 10.1093/bioinformatics/btp041 19176549

35. Ayers KL, Cordell HJ (2010) SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol 34 : 879–891. doi: 10.1002/gepi.20543 21104890

36. Guan Y, Stephens M (2011) Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann Appl Stat 5 : 1780–1815. doi: 10.1214/11-AOAS455

37. Leng C, Lin Y, Wahba G (2006) A note on the lasso and related procedures in model selection. Stat Sin.

38. Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7 : 2541–2563.

39. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Statist Soc B 67 : 301–320. doi: 10.1111/j.1467-9868.2005.00503.x

40. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Series B Stat Methodol 68 : 49–67. doi: 10.1111/j.1467-9868.2005.00532.x

41. Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, et al. (2015) RNA splicing. the human splicing code reveals new insights into the genetic determinants of disease. Science 347 : 1254806.

42. Chen H, Wilkins LM, Aziz N, Cannings C, Wyllie DH, et al. (2006) Single nucleotide polymorphisms in the human interleukin-1b gene affect transcription according to haplotype context. Hum Mol Genet 15 : 519–529. doi: 10.1093/hmg/ddi469 16399797

43. Lewontin RC (1964) The interaction of selection and linkage. i. general considerations; heterotic models. Genetics 49 : 49–67. 17248194

44. Chapman JM, Cooper JD, Todd JA, Clayton DG (2003) Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Hum Hered 56 : 18–31. doi: 10.1159/000073729 14614235

45. Morris AP (2006) A flexible bayesian framework for modeling haplotype association with disease, allowing for dominance effects of the underlying causative variants. Am J Hum Genet 79 : 679–694. doi: 10.1086/508264 16960804

46. Hemani G, Shakhbazov K, Westra HJ, Esko T, Henders AK, et al. (2014) Detection and replication of epistasis influencing transcription in humans. Nature 508 : 249–253. doi: 10.1038/nature13005 24572353

47. Wood AR, Tuke MA, Nalls MA, Hernandez DG, Bandinelli S, et al. (2014) Another explanation for apparent epistasis. Nature 514: E3–5. doi: 10.1038/nature13691 25279928

48. Downes K, Pekalski M, Angus KL, Hardy M, Nutland S, et al. (2010) Reduced expression of IFIH1 is protective for type 1 diabetes. PLoS ONE 5: e12646. doi: 10.1371/journal.pone.0012646 20844740

49. Bottolo L, Petretto E, Blankenberg S, Cambien F, Cook SA, et al. (2011) Bayesian detection of expression quantitative trait loci hot spots. Genetics 189 : 1449–1459. doi: 10.1534/genetics.111.131425 21926303

50. Okada Y, Wu D, Trynka G, Raj T, Terao C, et al. (2014) Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506 : 376–381. doi: 10.1038/nature12873 24390342

51. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529. doi: 10.1371/journal.pgen.1000529 19543373

52. Liquet B, Bottolo L, Campanella G, Richardson S, Chadeau-Hyam M (in press) R2GUESS: GPU-based R package for Bayesian variable selection regression of multivariate responses. J Stat Softw.

53. Tibshirani R (1996) Optimal reinsertion:regression shrinkage and selection via the lasso. J R Statist Soc B 58 : 267–288.

54. Friedman JH, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 : 1–22. 20808728

55. Yang Y, Zou H (2014) A fast unified algorithm for solving group-lasso penalize learning problems. Stat Comput: 1–13. doi: 10.1080/00949655.2014.975226

56. Pekalski ML, Ferreira RC, Coulson RMR, Cutler AJ, Guo H, et al. (2013) Postthymic expansion in human CD4 naive T cells defined by expression of functional high-affinity IL-2 receptors. J Immunol 190 : 2554–2566. doi: 10.4049/jimmunol.1202914 23418630

57. Romanoski CE, Glass CK, Stunnenberg HG, Wilson L, Almouzni G (2015) Epigenomics: Roadmap for regulation. Nature 518 : 314–316. doi: 10.1038/518314a 25693562

58. Anders S, Pyl PT, Huber W (2014) HTSeq—a Python framework to work with high-throughput sequencing data. bioRxiv: 10.1101/002824.

59. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29 : 15–21. doi: 10.1093/bioinformatics/bts635 23104886

60. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25 : 2078–2079. doi: 10.1093/bioinformatics/btp352 19505943

61. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. bioRxiv: 10.1101/002832.