Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval

Authors: Emily A. King ^aff001; J. Wade Davis ^aff001; Jacob F. Degner ^aff001
Authors place of work: Department of Computational Genomics, AbbVie, North Chicago, Illinois, United States of America ^aff001
Published in the journal: Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet 15(12): e1008489. doi:10.1371/journal.pgen.1008489
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1008489

Summary

Despite strong vetting for disease activity, only 10% of candidate new molecular entities in early stage clinical trials are eventually approved. Analyzing historical pipeline data, Nelson et al. 2015 (Nat. Genet.) concluded pipeline drug targets with human genetic evidence of disease association are twice as likely to lead to approved drugs. Taking advantage of recent clinical development advances and rapid growth in GWAS datasets, we extend the original work using updated data, test whether genetic evidence predicts future successes and introduce statistical models adjusting for target and indication-level properties. Our work confirms drugs with genetically supported targets were more likely to be successful in Phases II and III. When causal genes are clear (Mendelian traits and GWAS associations linked to coding variants), we find the use of human genetic evidence increases approval by greater than two-fold, and, for Mendelian associations, the positive association holds prospectively. Our findings suggest investments into genomics and genetics are likely to be beneficial to companies deploying this strategy.

Keywords:

Gene mapping – Genome-wide association studies – Human genetics – Drug research and development – Drug discovery – Genetics of disease – Genetic linkage – Catalogs

Introduction

The cost of developing new molecular entities (NMEs) into approved therapies continues to increase, with cost per launched NME ranging from $3 billion to more than $10 billion across major research-based pharmaceutical companies [1]. Despite strong vetting for disease activity, only 5-10% of candidate NMEs in early stage clinical trials are eventually approved, and this probability of approval has a direct relationship to total cost per approved drug [1, 2]. Thus, to maintain a sustainable drug development process, there is a critical need to increase the number of successful NMEs while reducing the number of failures.

Analyzing historic data of the progress of drug compounds through the drug development pipeline, Nelson et al. 2015 [3] concluded pipeline drug targets with human genetic evidence of disease association are twice as likely to lead to approved drugs. The specific claim of doubled approval probability, if true, could lead to fewer failed clinical programs thereby lowering drug development costs. Indeed, using the estimated impact of genetics from Nelson et al. [3], increasing the fraction of NMEs in development with genetic support from the current value of 15% to 50% is predicted to decrease the direct R&D cost per launched drug by 22 ± 13% [4].

Several recent successes have corroborated the power of leveraging genetic data to predict the success of new drug targets [5]. For example, gain-of-function mutations in PCSK9 [6–9], which cause familial hypercholesterolemia and coronary artery disease, led to the launch of evolocumab (Amgen) and alirocumab (Regeneron). How widely the pharmaceutical industry can expect genetics and genomics to yield increased success rates beyond these more narrowly defined examples, which have unambiguous causal genes and multiple verified Mendelian mutations, remains to be determined. If the association between human genetic evidence and approved drugs is genuine and continues to hold for present-day drug development, we expect better variant-to-gene mapping methods and more sophisticated predictive approaches will further improve our ability to prioritize drug targets. Because of the foundational nature of the Nelson et al. work [3], it is important to determine whether the reported association holds prospectively, and whether it replicates on independent data subsets not used in the original model construction.

Three years have passed since the publication by Nelson et al. and five years have passed since the data freeze used for analysis occurred [3]. The results may now be validated using drug progression events to which Nelson et al. were completely blinded at the time. Similarly, ongoing efforts in discovering disease-associated variants in increasingly large patient samples have rapidly grown the number of potential gene trait links. For example, a public central repository of genetic association studies (GWAS Catalog [10], https://www.ebi.ac.uk/gwas/) has grown by four-fold [11, 12]. Additionally, the quantity and quality of links between noncoding SNPs and genes has expanded with the development of GTEx [13]. Here we report revised estimates of the impact of genetic evidence on drug target success and extend Nelson’s observations into a model that can be deployed by other companies and academics to predict the likelihood of success of targets of interest to them.

Results

Identifying validation sets

Nelson et al. [3] estimated a twofold increase in approval probability for Phase I drug targets with genetic evidence using drug pipeline data from Informa Pharmaprojects along with genetic data from a variety of sources, all obtained in 2013. This estimate comes from historical rather than experimental data, so a direct replication is not possible. However, we can obtain updated sources of pipeline and genetic data and use the data subsets not used in the Nelson et al. study to validate its claims. Fig 1A shows how updated pipeline (Informa Pharmaprojects [14]) and genetic association (GWAS Catalog, OMIM [15]) datasets may be split into discrete subsets, several of which were not used in the original analysis. We call these sets validation sets. In addition to genetic associations and pipeline progression events added after 2013 (New Genetic and Pipeline Progression sets), we identified a large subset of pipeline data that was available to Nelson et al., but that was excluded from analysis because Pharmaprojects reported an inactive status, most commonly “No Development Reported”. Instead of directly using Pharmaprojects development status, we use other fields in the database to label drugs with a latest historical development phase (see Methods, S2 Text), enabling us to use 83% of this data in our analysis.

**Fig. 1. Estimated effect of evidence from human genetic studies on the probability of advancing in clinical development.**

Following Nelson, we aggregate data at the level of gene target-indication pair, the unit on which genetic evidence is computed. In total, we mapped 21934 gene target-indication pairs to a highest pipeline phase, in contrast to 8853 pairs labelled with a known phase in the Nelson et al. analysis. 5513 pairs could be tested for progression to a more advanced clinical phase since 2013, and 14759 pairs either absent or inactive in the 2013 data set could now be assigned a highest historical pipeline phase. Two validation sets (New Pipeline, and new GWAS associations) are larger than the original datasets used in Nelson et al. giving us sufficient power to test predictions.

Our replication analysis occurred in three steps. In the first step, we took labels of genetic evidence directly from Nelson et al. 2015 and tested how these labels predict pipeline outcomes in the New Pipeline and Pipeline Progression validation sets. Second, we repeated the analysis using both updated pipeline data and updated genetic association datasets and determined whether genetic evidence labels constructed from associations reported after 2013 are positively associated with historical progression. This analysis uses the New Genetic validation set, defined as GWAS data added after May 2013 and OMIM data added after October 2013. Third, we determine whether genetic labels constructed from the full set of updated GWAS and OMIM genetic associations are linked to improved pipeline outcomes over the entire updated Pharmaprojects dataset (See Methods for more details). We refer to this analysis as Full Data.

Estimated effect of genetic evidence on validation sets

Of the many results from the original Nelson et al. publication, we focus on determining whether the probability of progressing along the development pipeline is greater for gene target-indication pairs with genetic evidence as this most directly impacts business decision-making (S8, S9, S11 and S12 Figs show replication of other results). A gene target-indication pair is said to have genetic evidence if there is human genetic evidence of association between the gene target and a trait sufficiently similar to the indication, as measured by semantic similarity in the MeSH vocabulary (see Methods and S4 Text). Fig 1B shows estimates and 95% confidence intervals for the ratio of the probability of progression for gene target-indication pairs with and without genetic evidence computed on the three validation sets and the full set of new data each plotted against values computed from Nelson et al. supplementary tables.

Across all three validation sets (Pipeline Progression, New Genetic, and New Pipeline), we consistently see a marked difference between the effect of genetic evidence derived from the OMIM database and genetic evidence derived from the GWAS Catalog. Estimated effects of OMIM genetic evidence are comparable to or greater than previously reported values [3], except for progressions from Phase I to Phase II, which are lower using new data. Notably, we see a positive and significant effect of OMIM genetic evidence on the probability of progression from Phase II to Phase III since 2013 (Pipeline Progression validation set). With the exception of progressing from Phase III to Approval, estimated effects from GWAS Catalog-derived genetic evidence are consistently lower than the originally reported values. Our estimated effects of GWAS genetic evidence in the New Genetic validation set are often significantly lower than the originally reported values. In validation sets, all estimates of the effect of GWAS evidence overlap one (no effect), except in the Pipeline Progression validation set, where we estimate a negative effect of GWAS evidence on Phase II to III progression (Fig 1B).

In both GWAS and OMIM datasets, our estimates of the effect of genetic evidence on Phase I to II progression probabilities are lower than originally reported, and confidence intervals sometimes exclude original estimates. With some exceptions (e.g. oncology studies), Phase I trials assess safety in healthy volunteers, not efficacy, so their success may be less closely linked to human genetic evidence for target involvement in disease. Validation sets may also differ systematically from the 2013 training data. For example, it is possible that there are systematic differences in the types of associations discovered before and after 2013 (New Genetic validation set). Later associations may be biased towards those with smaller effect sizes or rarer variants only detectable in larger cohorts, and could also be less predictive of drug efficacy. Using the complete updated dataset (Full Data), including all Pharmaprojects drugs and pre and post 2013 genetic associations, we find the estimated effect of GWAS genetic evidence on Phase I to Approval is still significantly positive, and the effect of OMIM genetic evidence is greater than originally reported.

Statistical modeling of genetic effect on drug approval

The effect of GWAS genetic evidence on approval was considerably reduced and lacked statistical significance in the New Genetic dataset. In reanalyzing the original data, we found the estimated effect of GWAS genetic evidence was highly sensitive to the choice of trait-indication similarity cutoff used to determine whether or not a drug target had a genetic association (S3 Fig). Learning from this analysis, we sought to build a model relating genetic evidence to the probability of drug approval in the full dataset.

We fit multivariate logistic regression models predicting target-indication pair approval using several independent variables. The first was a measure of (continuous) genetic evidence, defined as the maximum semantic similarity to the indication across all traits linked to the drug target through human genetic evidence. The remaining independent variables are target and indication-level properties that could confound the relationship between genetic evidence and approval. Previous work has shown that approved drug targets tend to be more conserved than genes linked to GWAS associations [16], so we included residual variant intolerance score (RVIS) [17], measuring the amount of common functional variation in each gene relative to the amount of neutral variation, as a predictor. We also included the amount of time each target is known to have been under development as a predictor, with the rationale that if accumulating genetic evidence informs drug development, targets supported by genetic evidence might be newer on average. Finally, we included gene ontology (GO) terms and high level MeSH terms for each indication as predictors to control for known differences [18, 19] in approval probability among indication and target classes.

Under this model, approval is positively associated with trait similarity for supporting GWAS and OMIM associations, with 95% credible intervals excluding zero (Fig 2A). When associated traits are sufficiently similar (for GWAS, roughly the similarity between Stomach Neoplasms and Colorectal Neoplasms), gene target-indication pairs with GWAS or OMIM associations are more likely to be approved. The data also show that gene target-indication pairs whose target has a genetic association only with a dissimilar disease are less likely to be approved than pairs with no known genetic association. This negative association is a novel finding.

**Fig. 2. Estimated odds ratio of gene target-indication pair attaining approval, as a function of similarity between drug indication and the most similar trait associated with the target.**

GWAS genetic evidence has a smaller positive effect on approval than does OMIM genetic evidence, and we only find a small beneficial effect of GWAS genetic evidence in the New Genetic validation set. One possible explanation is that most GWAS associations are to noncoding variants, and determining function from these associations will require more advanced methodology [20]. Indeed, when we only consider GWAS Catalog SNPs in high LD (R² ≥ 0.9) to a missense variant or other variant predicted to be moderately or highly deleterious [21], the estimated effect of GWAS genetic evidence on drug target approval approaches that of OMIM. Moreover, for missense variants, we see a larger estimated effect of genetic evidence when using a more stringent LD cutoff to the lead SNP (Fig 2B).

Discussion

Pharmaceutical companies are investing in the creation and analysis of genomics data in the hope of improving target selection and decreasing failures due to lack of efficacy [22] or adverse effects [23]. Previous work by Nelson et al. 2015 [3] supported this investment, showing gene target-indication pairs with genetic evidence are approximately twice as likely to progress from Phase I to approval. This quantitative estimate is the product of many decisions, for example how to identify similar traits in genomics and pipeline databases, that, although reasonable, could have been made differently. Additionally, the results were based on a large historical set of approved drugs and might not hold for present-day target selection. This motivated us to replicate the analysis using 5 years of data that has accumulated since their data freeze in 2013.

In the replication study, we recovered a robust association between OMIM genetic evidence and drug approval of a similar or greater magnitude to that originally reported [3] across several independent test sets. GWAS genetic evidence also is generally positively associated with progressing in clinical development, but the magnitude of the association is smaller and not clearly different from zero in any independent replication set. One possible reason is that recently reported GWAS variants have smaller reported effect sizes. We find evidence for this claim, but do not detect an effect of GWAS evidence effect size on approval (S13 Fig, S22 Table). There appears to be some confounding due to GWAS genes having different properties than approved drug targets. When this is controlled for using logistic regression, GWAS-supported target-indication pairs are more likely to be approved than those without a GWAS-linked gene target. This highlights the need for predictive models including target properties, work that is beginning to emerge [24].

The OMIM database provides expert-curated gene-trait links, bypassing the need to assign noncoding SNPs to genes, a major source of uncertainty for present GWAS methods. Better methods for linking GWAS SNPs to causal genes may improve performance. This is supported by the strong and statistically significant positive associations we found between GWAS genetic evidence and drug success when considering only the highest confidence SNP-gene links, characterized as having a lead SNP with R² ≥ 0.9 to a variant predicted to be highly or moderately deleterious. However, OMIM’s focus on Mendelian phenotypes also means its genetic variants will tend to have larger effect sizes than those for the quantitative traits or conditions prominent in the GWAS Catalog, a difference that is unlikely to be addressed by improved computational methods.

Because OMIM is a manually curated database, it is possible that known drug mechanisms influence OMIM entries, creating a positive association between OMIM genetic evidence and approval. However, we observe a positive effect of OMIM genetic associations reported by Nelson et al. 2015 on progression events occurring after data were collected for that paper, which is inconsistent with this reverse causal hypothesis. It is also possible these progression events are not truly independent of pre-2013 approvals, because they may represent approval for an indication similar to the original indication. However, the positive effect of OMIM genetic evidence on 2013-18 progression remains significant when targets with pre-2013 approvals for similar indications are excluded (S11 and S12 Tables). Another possibility is that the success of OMIM is due to treatments such as protein replacement therapies for monogenic diseases, which may have higher success rates as a whole [25]. However, we still find a large positive effect of OMIM genetic evidence when we exclude hereditary diseases and MeSH terms mapped to OMIM phenotypes from the analysis (S21 Table, S27 Fig). We conclude the predictive effect of OMIM genetic evidence is not a statistical artifact, and is more likely to reflect the value of well-defined disease biology to drug development.

Due to the MeSH ontology structure, current methods require manual similarity assignments to recognize relationships between most quantitative traits and diseases. The high sensitivity of key results to MeSH similarity motivates treating similarity as a continuous variable and suggests improvements to its quantification. While expert curation can be advantageous in identifying closely related traits, it also leaves more room for human input to bias the analysis outcome. To assess this, we removed manually assigned similarities. Positive associations between GWAS genetic evidence and approval remain, though in some cases they are greatly reduced in magnitude (S19 and S26 Figs); OMIM is minimally impacted as it contains few quantitative traits. We expect improved methods for automatically identifying phenotypes similar to drug indications will expand our ability to use genomics data in predictive models.

Our results highlight the importance of similarity between associated trait and drug indication in determining which gene target-indication pairs are likely to lead to approved drugs. Our finding that genetic associations for highly dissimilar traits reduce the probability of approval is new and could be of significance once the reason is better understood. A possible explanation is an increased incidence of side effects due to involvement in unrelated disease mechanisms. It suggests that when target disease links are known, genetic data can improve the drug development process through improved indication selection.

Our analysis of the last five years of drug development data validates the results of Nelson et al. and indicates that the positive association between genetic evidence and drug success is not just a historical phenomenon. Using logistic regression to control for target and indication level properties, and quantifying genetic evidence on a continuous scale, we also demonstrated that associations to disparate phenotypes are a negative predictor of approval. With these algorithmic developments, we have built a Shiny [26] app that others can use to evaluate target-indication pairs of interest. As mechanistic understanding of genetic associations increases, our data suggest the reliability of genetic predictions of drug targets will continue to improve. In closing, public and private investments into genomics for the purpose of improving the fraction of successful drug targets appear to be well warranted.

Materials and methods

Data sources

Pipeline data

Data on drug gene targets, indications, latest development phase, and approvals by country were collected from the Informa Pharmaprojects database (accessed January 25, 2018) [14]. For each drug, Pharmaprojects provides country-level, indication-level, and global development status. The latter is the latest development status across indications for any country. A drug was considered US/EU approved for an indication if it was approved in the US or EU and approved for that indication (so if a drug is US/EU approved for one but not all of its approved indications, we will incorrectly assign some approvals). We infer this was also the approach of Nelson et al., as they mention no source other than Pharmaprojects for drug approval data and Pharmaprojects does not provide drug-indication-country level approval data.

To calculate phase-specific progression probabilities by genetic evidence, we must assign a latest historical development phase to Pharmaprojects drug-indication pairs that are not in active development using other database fields. Country status gives the latest phase for single-indication and preclinical drugs. Other drug-indication phases are determined through assessing the presence or absence of key events and clinical details matching the trial phase and the disease name. Clinical details were only used when other sources were unavailable because this field may contain information about planned or anticipated trials. Details are provided in S2 Text.

Pharmaprojects gene targets were mapped from Entrez to Ensembl IDs. Drugs with non-human and xMHC targets were excluded (following the original analysis), as were a small number of drugs with non-protein-coding targets.

Genetic data

Genetic association data was obtained from the GWAS Catalog [10] downloaded 2018-11-18. OMIM data was downloaded from [15] on 2018-11-18. GWAS Catalog associations with reported p-value greater than 10⁻⁸ were excluded, following the original analysis, as were OMIM provisional associations, drug response associations, and somatic variant associations.

OMIM reports gene-trait links, but the GWAS Catalog reports SNP-trait links, which must be converted to gene-trait links via SNP-gene links. Although methods for creating SNP-gene links have since advanced [20], we closely follow the approach of [3] with updated data sources to reduce our degrees of freedom for overfitting to new data and to make our new estimates of the effect of genetic evidence comparable to the original estimates. An LD expansion of GWAS Catalog reported variants was performed using an LD threshold of 0.5 in the 1000 Genomes Phase 3 EUR super population [27]. A distance-based gene-trait association was established when an LD SNP was within 5000 b.p. of the gene in hg38 as annotated by SNPEff [21]. An eQTL-based gene-trait link was established when an LD SNP was reported associated with a gene with nominal p-value less than 10⁻⁶ in any GTEx tissue [13]. Using a cutoff of 10⁻¹² makes little difference to results (S20 Table). A DHS-based gene-trait link was established when an LD SNP was located in a DNase I hypersensitivity site correlated with gene expression with one-sided permutation p-value 0.001 (from 1000 replicates) [28]. All linked genes were mapped to Ensembl IDs, and links to genes not annotated as protein coding by Ensembl were removed from the dataset. Additional details are available in S3 Text.
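The three linking rules described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the study: the record fields (`dist_to_gene_bp`, `gtex_min_p`, `dhs_correlated`) and the function name are hypothetical, the LD expansion is assumed to have been done already, and treating "within 5000 b.p." as an inclusive bound is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Set

MAX_DISTANCE_BP = 5000   # distance window stated in the text
EQTL_P_CUTOFF = 1e-6     # nominal GTEx p-value cutoff stated in the text


@dataclass
class LdSnpGene:
    """One (LD-expanded SNP, candidate gene) record; field names are hypothetical."""
    dist_to_gene_bp: int          # 0 if the SNP lies inside the gene
    gtex_min_p: Optional[float]   # smallest eQTL p-value across tissues, None if untested
    dhs_correlated: bool          # SNP lies in a DHS correlated with the gene's expression


def link_types(rec: LdSnpGene) -> Set[str]:
    """Return the set of rules under which this SNP links a trait to the gene."""
    links: Set[str] = set()
    if rec.dist_to_gene_bp <= MAX_DISTANCE_BP:   # inclusive bound is an assumption
        links.add("distance")
    if rec.gtex_min_p is not None and rec.gtex_min_p < EQTL_P_CUTOFF:
        links.add("eqtl")
    if rec.dhs_correlated:
        links.add("dhs")
    return links
```

A gene-trait link exists if any rule fires for any LD SNP of a reported variant, so in this sketch a non-empty return value is enough to create the link.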

Genetic evidence

Trait-indication similarities

Pharmaprojects indications and GWAS Catalog and OMIM traits were mapped to MeSH headings to link traits and indications by a common vocabulary. We mapped as many terms as possible automatically by string matching to MeSH terms and their synonyms, and the remainder were manually assigned to the most specific MeSH heading encompassing the term. The MeSH vocabulary consists of MeSH headings, which are organized in a hierarchy, and supplementary concepts, which are not. We did not map to MeSH supplementary concepts as the lack of structure means we cannot compute similarities between these concepts and other terms. However, each supplementary concept is assigned one or more mapped headings, and so terms matching a supplementary concept were assigned to the mapped heading. This set of MeSH term mappings was used in the full replication with new genetic data sources.

When testing predictions from the 2013 genetic association data, it was important that MeSH headings mapped to Pharmaprojects indications be consistent with the original analysis by Nelson et al., both to correctly identify common pairs between datasets for which progression can be tested and to ensure that our New Pipeline test set contained truly novel pairs. Nelson et al. provided mappings for many Pharmaprojects indications in a supplementary dataset. Terms without provided mappings were mapped to maximize the number of Nelson et al. gene target-indication pairs also present in our dataset, subject to the mapping being biologically justifiable. Standardized mapping increased the percent of Nelson et al. gene target-indication pairs present in our dataset from 88% (using our independently mapped terms) to 98%.

Resnik [29] and Lin [30] similarities between MeSH headings were computed in R using the ontologySimilarity package [31], standardized to have a maximum value of 1 for each trait, and averaged to compute a similarity between each pair of MeSH headings (S4 Text). Two traits are considered similar if the similarity is greater than or equal to a critical value. Our assigned similarities are not identical to those of Nelson et al. because we used a different version of MeSH (2017 versus 2009), but they were correlated with those originally reported (R² = 0.86, S17 Fig). We determined that a critical value of 0.73 in our analysis corresponded to the critical value of 0.7 used in the original analysis, and used this to determine similar traits in our replication study. Manually assigned similarities were taken from the supplement of [3]. Manual assignment was performed because the MeSH ontology makes few connections between diseases and closely related quantitative phenotypes, for example osteoporosis and bone density.
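To make the similarity computation concrete, the following sketch computes Resnik and Lin similarities on a toy hierarchy and averages them. It is not the ontologySimilarity implementation: the hierarchy, the information content (IC) values, and the function names are hypothetical, and standardizing the Resnik score by a term's self-similarity is only one possible reading of "maximum value of 1 for each trait" (the exact procedure is in S4 Text). Real MeSH headings can also have several parents, which this single-parent toy ignores.

```python
from typing import Dict, Optional, Set

# Hypothetical toy hierarchy (child -> parent) with made-up IC values.
PARENT: Dict[str, Optional[str]] = {
    "Diseases": None,
    "Neoplasms": "Diseases",
    "Digestive System Neoplasms": "Neoplasms",
    "Stomach Neoplasms": "Digestive System Neoplasms",
    "Colorectal Neoplasms": "Digestive System Neoplasms",
    "Osteoporosis": "Diseases",
}
IC: Dict[str, float] = {
    "Diseases": 0.0,
    "Neoplasms": 1.0,
    "Digestive System Neoplasms": 2.0,
    "Stomach Neoplasms": 4.0,
    "Colorectal Neoplasms": 4.0,
    "Osteoporosis": 3.0,
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    out: Set[str] = set()
    current: Optional[str] = term
    while current is not None:
        out.add(current)
        current = PARENT[current]
    return out


def resnik(a: str, b: str) -> float:
    """IC of the most informative common ancestor."""
    return max(IC[t] for t in ancestors(a) & ancestors(b))


def lin(a: str, b: str) -> float:
    """Resnik similarity rescaled by the IC of both terms (already in [0, 1])."""
    denom = IC[a] + IC[b]
    return 2.0 * resnik(a, b) / denom if denom > 0 else 0.0


def combined_similarity(a: str, b: str) -> float:
    """Average of Lin and Resnik, the latter standardized by IC(a) (an assumption)."""
    resnik_std = resnik(a, b) / IC[a] if IC[a] > 0 else 0.0
    return 0.5 * (resnik_std + lin(a, b))
```

With these toy values the two neoplasm headings share an ancestor of IC 2.0 and receive a combined similarity of 0.5, whereas a heading compared with itself scores 1 and headings that share only the root score 0.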

Defining genetic evidence

We formalize and extend the concept of genetic evidence used by Nelson et al. We first define a similarity function operating on two gene-trait pairs. Define the function S from (G × T)² to [0, 1], where G is the space of genes and T is the space of traits, as

$$S\big((g_1, t_1), (g_2, t_2)\big) = \begin{cases} S_T(t_1, t_2) & \text{if } g_1 = g_2, \\ 0 & \text{otherwise,} \end{cases}$$

where S_T : T × T → [0, 1] is a trait similarity function (in the Nelson et al. analysis and here, computed from Resnik and Lin similarities). Let A be a set of gene-trait pairs with elements in G × T obtained from genetic data sources (for example, when analyzing the effect of OMIM genetic evidence, A is the set of gene-trait pairs in OMIM). Genetic evidence according to Nelson et al. 2015 is a function E_D from G × T to {0, 1},

$$E_D(g, t) = I\Big(\max_{a \in A} S\big((g, t), a\big) \geq 0.7\Big).$$

However, trait similarity is a real number in [0, 1], so we can define another genetic evidence function E_C from G × T to [0, 1],

$$E_C(g, t) = \max_{a \in A} S\big((g, t), a\big),$$

so that E_D(g, t) = 1 if and only if E_C(g, t) ≥ 0.7.
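The two evidence functions can be illustrated with a short sketch. It assumes that only associations involving the same gene contribute to the similarity and that a gene with no associations receives E_C = 0. The gene names, traits, and similarity values below are invented for illustration and are not taken from the study.

```python
from typing import Callable, Dict, Iterable, Tuple

GeneTrait = Tuple[str, str]

# Made-up trait similarities; a real analysis would use MeSH-based values.
TOY_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("Hypercholesterolemia", "Hyperlipidemias"): 0.8,
    ("Hypercholesterolemia", "Coronary Disease"): 0.4,
}


def toy_trait_similarity(t1: str, t2: str) -> float:
    """Symmetric lookup; identical traits score 1, unknown pairs score 0."""
    if t1 == t2:
        return 1.0
    return TOY_SIMILARITY.get((t1, t2), TOY_SIMILARITY.get((t2, t1), 0.0))


def continuous_evidence(
    gene: str,
    indication: str,
    associations: Iterable[GeneTrait],
    trait_sim: Callable[[str, str], float],
) -> float:
    """E_C: largest similarity between the indication and any trait linked to the gene."""
    sims = [trait_sim(indication, trait) for g, trait in associations if g == gene]
    return max(sims, default=0.0)  # 0 when the gene has no association (assumption)


def discrete_evidence(
    gene: str,
    indication: str,
    associations: Iterable[GeneTrait],
    trait_sim: Callable[[str, str], float],
    cutoff: float = 0.7,
) -> int:
    """E_D: 1 if E_C reaches the similarity cutoff, else 0."""
    return int(continuous_evidence(gene, indication, associations, trait_sim) >= cutoff)


# Hypothetical association set A.
A = [("GENE_A", "Hyperlipidemias"), ("GENE_A", "Coronary Disease"), ("GENE_B", "Osteoporosis")]
```

Here `continuous_evidence("GENE_A", "Hypercholesterolemia", A, toy_trait_similarity)` picks the more similar of the two GENE_A traits, and `discrete_evidence` simply thresholds that value.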

Statistical analysis

All statistical analyses are performed on pipeline data collapsed to one row per gene target-indication pair, as this is the unit on which genetic evidence is measured. The latest phase of a gene target-indication pair (g, i) is the most advanced pipeline phase attained by any drug with target g for indication i. Of several results of [3], we are most interested in the claim that target-indication pairs supported by genetic evidence are more likely to advance than those without. In the first part of the analysis, we quantify this association as a risk ratio, attempting to replicate the original Nelson et al. analysis as closely as possible. Second, we introduce a logistic regression model for the relationship between approval and genetic evidence, adjusting for covariates at the target and indication levels.

Two-by-two tables

Let D be a vector of gene target-indication-phase triplets with elements (g_i, t_i, h_i), i = 1, …, n. Here h_i ∈ {0, …, 4} is an ordered categorical variable giving the latest phase each gene target-indication pair has achieved (0 = Preclinical, 1 = Phase I, 2 = Phase II, 3 = Phase III, and 4 = US/EU Approved).
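Collapsing drug-level records to one latest phase per gene target-indication pair amounts to taking a maximum over the ordered phase codes. The sketch below assumes a simple list of (target, indication, phase) rows; the row layout, phase labels, and example records are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Ordered phase codes as defined in the text.
PHASE_RANK: Dict[str, int] = {
    "Preclinical": 0,
    "Phase I": 1,
    "Phase II": 2,
    "Phase III": 3,
    "Approved": 4,  # US/EU approved
}


def latest_phase_by_pair(
    rows: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], int]:
    """Map each (target, indication) pair to the highest phase code of any drug."""
    latest: Dict[Tuple[str, str], int] = {}
    for target, indication, phase in rows:
        key = (target, indication)
        rank = PHASE_RANK[phase]
        if rank > latest.get(key, -1):
            latest[key] = rank
    return latest


# Hypothetical drug-level rows: two drugs share the first pair.
rows = [
    ("GENE_A", "Asthma", "Phase I"),
    ("GENE_A", "Asthma", "Phase III"),
    ("GENE_A", "Psoriasis", "Preclinical"),
    ("GENE_B", "Asthma", "Approved"),
]
```

In this example the pair (GENE_A, Asthma) is assigned phase code 3 because one of its two drugs reached Phase III.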

Risk ratios for progressing from Phase x to Phase y, x < y, were computed as

$$RR_{x \to y} = \frac{N_{g,y} / N_{g,x}}{N_{g',y} / N_{g',x}},$$

where $N_{g,x} = \sum_{i=1}^{n} E_D(g_i, t_i)\, I(h_i \geq x)$ is the number of gene target-indication pairs in Phase x or later with genetic evidence and $N_{g',x} = \sum_{i=1}^{n} \big(1 - E_D(g_i, t_i)\big)\, I(h_i \geq x)$ is the number of gene target-indication pairs in Phase x or later without genetic evidence. We required at least 5 reported genetic associations for similar traits. Phase progression probability calculations usually exclude in-progress development [18], but here we include such programs for consistency with Nelson et al. Confidence intervals were computed using the riskratio.boot function in the epitools R package [32]. We ensured consistency of this approach with that of Nelson et al. by verifying our code could reproduce their results from supplemental materials (S1 Text). Drugs approved only outside the US and EU and drugs with unknown latest phase were excluded from this analysis.
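As a worked illustration, the sketch below computes the ratio of progression probabilities from the counts N defined above and attaches a percentile bootstrap interval. It is not a reimplementation of `riskratio.boot` from epitools: resampling whole pairs and using simple percentiles are assumptions made here, and the counts are invented.

```python
import random
from typing import List, Sequence, Tuple


def risk_ratio(evidence: Sequence[int], phase: Sequence[int], x: int, y: int) -> float:
    """Ratio of P(reach phase y | reached phase x) with vs. without genetic evidence."""
    n_g_x = sum(1 for e, h in zip(evidence, phase) if e == 1 and h >= x)
    n_g_y = sum(1 for e, h in zip(evidence, phase) if e == 1 and h >= y)
    n_n_x = sum(1 for e, h in zip(evidence, phase) if e == 0 and h >= x)
    n_n_y = sum(1 for e, h in zip(evidence, phase) if e == 0 and h >= y)
    return (n_g_y / n_g_x) / (n_n_y / n_n_x)


def bootstrap_ci(
    evidence: Sequence[int],
    phase: Sequence[int],
    x: int,
    y: int,
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling pairs with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(evidence, phase))
    stats: List[float] = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        e, h = zip(*sample)
        try:
            stats.append(risk_ratio(e, h, x, y))
        except ZeroDivisionError:
            continue  # skip resamples where a denominator is zero
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Invented example: 10 pairs with genetic evidence (4 approved), 40 without (8 approved).
evidence = [1] * 10 + [0] * 40
phase = [4] * 4 + [1] * 6 + [4] * 8 + [1] * 32
rr = risk_ratio(evidence, phase, x=1, y=4)  # (4/10) / (8/40)
```

With these counts the Phase I to approval probabilities are 0.4 and 0.2, so the point estimate is a twofold ratio; the bootstrap interval around it is wide because the evidence group is small.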

Bayesian logistic regression

Let i index gene target-indication pairs (g_i, t_i), i = 1, …, N. Let y_i ∈ {0, 1} be 1 if pair i is found in at least one US/EU approved drug and 0 otherwise. Let X be an N × d design matrix, where d is the number of non-genetic predictors, with i-th row x_i′.

where

Our choice of p = 2 is supported by WAIC [33, 34]. Predictors in X were top-level MeSH category, target class, estimated time the target has been under development, and RVIS score [17]. Details are provided in S5 Text. Priors were

All models were fit in Stan [35] using four chains with default initialization and run settings.

Prior parameters μ_a = -2.2 and σ_a = 0.75 were chosen to reflect prior knowledge that approximately 10% of Phase I compounds become approved [18], and prior standard deviations σ_b = 2 and σ_g = 2 were chosen to reflect the prior belief that observed effect sizes should be moderate. Note that α, for which we have chosen a nonzero mean prior, controls the baseline approval probability, not the effect of genetic evidence. Continuous covariates in X were standardized to have mean 0 and standard deviation 1, as was E_C.
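The link between μ_a = -2.2 and a baseline approval rate of roughly 10% can be checked by applying the inverse logit. The sketch below assumes α has a normal prior on the log-odds scale with mean μ_a and standard deviation σ_a; the variable names are illustrative.

```python
import math


def inv_logit(z: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


MU_A = -2.2      # prior mean of the intercept, from the text
SIGMA_A = 0.75   # prior standard deviation of the intercept, from the text

# Baseline approval probability implied by the prior mean.
baseline = inv_logit(MU_A)

# Central 95% prior range, assuming a normal prior on the log-odds scale.
lower = inv_logit(MU_A - 1.96 * SIGMA_A)
upper = inv_logit(MU_A + 1.96 * SIGMA_A)

print(f"baseline={baseline:.3f}, 95% prior range=({lower:.3f}, {upper:.3f})")
```

Under that assumption the prior centers the baseline approval probability near 0.10 while still allowing values from a few percent up to about one third.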

In this analysis we depart from the original Nelson et al. approach and exclude all drugs assigned an active development phase by Pharmaprojects, as it is unknown whether these development programs will ultimately lead to approval. This decision is consistent with other work estimating clinical success probabilities [18] [24]. We include unapproved drugs with unknown latest historical phase. A total of 20292 gene target-indication pairs were associated with at least one US/EU approved or inactive drug and included in the analysis.
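The inclusion rule described above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' pipeline: the record layout (gene, indication, status) and the three status labels are hypothetical simplifications of the Pharmaprojects fields.

```python
def build_outcomes(drug_records):
    """Map each (gene, indication) pair to y in {0, 1}.

    drug_records: iterable of (gene, indication, status) tuples, where status is
    one of "approved" (US/EU approval), "inactive", or "active" (hypothetical labels).
    Drugs in active development are skipped, so a pair supported only by active
    drugs does not appear in the result.
    """
    outcomes = {}
    for gene, indication, status in drug_records:
        if status == "active":
            continue  # final outcome unknown, so the drug is excluded
        key = (gene, indication)
        approved = 1 if status == "approved" else 0
        outcomes[key] = max(outcomes.get(key, 0), approved)
    return outcomes

# Gene and indication names below are placeholders, not real data.
toy_records = [
    ("GENE_A", "indication_1", "approved"),
    ("GENE_A", "indication_1", "inactive"),
    ("GENE_B", "indication_2", "inactive"),
    ("GENE_C", "indication_3", "active"),
]
toy_outcomes = build_outcomes(toy_records)
```

A pair is coded 1 if any of its non-active drugs is approved, which mirrors the definition of y_i as membership in at least one US/EU approved drug.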

Code availability

Code and data tables required to reproduce the main text figures are provided on GitHub (https://github.com/AbbVie-ComputationalGenomics/genetic-evidence-approval). The git repository also contains instructions for running a Shiny app that displays model predictions.

Supporting information

S1 Text [pdf]
Reanalysis of Nelson et al. supplementary datasets.

S2 Text [pdf]
Updating pipeline data: Supplementary methods and results.

S3 Text [pdf]
Updated genetic dataset: Supplementary methods and results.

S4 Text [pdf]
Trait-indication similarity: Supplementary methods and results.

S5 Text [pdf]
Modeling drug success probability: Supplementary methods and results.

S1 Fig [pdf]
Enrichment of approved drug targets among genes with human genetic evidence recreated from Nelson et al. 2015 supplementary tables.

S2 Fig [pdf]
Enrichment of approved drug targets among genes with human genetic evidence recreated from Nelson et al. 2015 supplementary tables.

S3 Fig [pdf]
Sensitivity of the effect of genetic evidence to trait similarity.

S4 Fig [pdf]
Sensitivity of the effect of genetic evidence to minimum number of genetic associations.

S5 Fig [pdf]
Concordance between known Pharmaprojects statuses and assigned phases.

S6 Fig [pdf]
Latest change date to Pharmaprojects entry by development status.

S7 Fig [pdf]
First event date in Pharmaprojects by development status.

S8 Fig [pdf]
Enrichment of approved drug targets among genes with human genetic evidence using updated pipeline data and genetic associations from Nelson et al. 2015.

S9 Fig [pdf]
Percent of targets with human genetic evidence by pipeline phase using updated pipeline data and genetic associations from Nelson et al. 2015.

S10 Fig [pdf]
Concordance between dates added to GWAS Catalog and appearance in Nelson et al. 2015.

S11 Fig [pdf]
Enrichment of approved drug targets among genes with human genetic evidence using updated pipeline and genetic association data.

S12 Fig [pdf]
Percent of targets with human genetic evidence by pipeline phase using updated pipeline and genetic association data.

S13 Fig [pdf]
Decline in effect size of newly reported GWAS associations through time.

S14 Fig [pdf]
Agreement between estimated effect size in earliest GWAS studies and later replications.

S15 Fig [pdf]
Decline in effect size of newly reported GWAS associations through time, using later estimates from larger studies.

S16 Fig [pdf]
Computing information content from MeSH.

S17 Fig [pdf]
Relationship between semantic similarities computed in this study and computed by Nelson et al. 2015.

S18 Fig [pdf]
Main text figure 1b, using similarity cutoff 0.7 instead of 0.73.

S19 Fig [pdf]
Effect of manually assigned similarities on estimated effect of genetic evidence.

S20 Fig [pdf]
Target approval probability and GWAS associations by indication.

S21 Fig [pdf]
Target approval probability and GWAS associations by target class.

S22 Fig [pdf]
Target approval by RVIS score.

S23 Fig [pdf]
Target approval by development time.

S24 Fig [pdf]
Coefficient estimates.

S25 Fig [pdf]
Effect of including predictors on genetic evidence effect estimates from logistic regression.

S26 Fig [pdf]
Effect of manually assigned similarities on genetic evidence effect estimates from logistic regression.

S27 Fig [pdf]
Effect of excluding congenital disease indications on genetic evidence effect estimates from logistic regression.

S1 Table [pdf]
Schematic of two-by-two table used in odds ratio calculation.

S2 Table [pdf]
Association between genetic evidence and clinical progression recreated from Nelson et al. 2015 supplementary tables.

S3 Table [pdf]
Coverage of Pharmaprojects development status information.

S4 Table [pdf]
Concordance of Pharmaprojects development status inferred from different fields.

S5 Table [pdf]
Association between genetic evidence and clinical progression using updated pipeline data and genetic associations from Nelson et al. 2015.

S6 Table [pdf]
Progression 2013-2018 by phase.

S7 Table [pdf]
Approval or progression 2013-2018.

S8 Table [pdf]
Association between genetic evidence and clinical progression in New Pipeline test set.

S9 Table [pdf]
Association between genetic evidence and clinical progression using only new target-indication pairs.

S10 Table [pdf]
Association between genetic evidence and clinical progression using only 2013 inactive target-indication pairs.

S11 Table [pdf]
Progression 2013-2018 excluding pairs with similar 2013 approved mechanism.

S12 Table [pdf]
Progression 2013-2018 excluding pairs without previous approvals for the target.

S13 Table [pdf]
Approval by OMIM genetic evidence using MeSH supplementary concepts.

S14 Table [pdf]
GWAS annotation data sources.

S15 Table [pdf]
Overlap between Nelson et al. 2015 genetic associations and current analysis.

S16 Table [pdf]
Overlap between Nelson et al. 2015 genetic associations and current analysis conditional on common SNP association.

S17 Table [pdf]
Overlap between Nelson et al. 2015 genetic associations and current analysis by eQTL status.

S18 Table [pdf]
Overlap between Nelson et al. 2015 genetic associations and current analysis by evidence source.

S19 Table [pdf]
Association between genetic evidence and clinical progression using updated pipeline and genetic association data.

S20 Table [pdf]
Sensitivity of association between genetic evidence and clinical progression to eQTL p-value cutoff.

S21 Table [pdf]
Association between OMIM genetic evidence and pipeline progression by indication class.

S22 Table [pdf]
Association between GWAS genetic evidence and pipeline progression by case-control odds ratio.

S23 Table [pdf]
Computing semantic similarity from MeSH.

References

1. Schuhmacher A, Gassmann O, Hinder M. Changing R&D models in research-based pharmaceutical companies. Journal of Translational Medicine. 2016;14(1):105. doi: 10.1186/s12967-016-0838-4 27118048

2. Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, et al. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nature Reviews Drug Discovery. 2010;9(3):203. doi: 10.1038/nrd3078 20168317

3. Nelson MR, Tipney H, Painter JL, Shen J, Nicoletti P, Shen Y, et al. The support of human genetic evidence for approved drug indications. Nature Genetics. 2015;47(8):856. doi: 10.1038/ng.3314 26121088

4. Hurle MR, Nelson MR, Agarwal P, Cardon LR. Trial watch: Impact of genetically supported target selection on R&D productivity; 2016.

5. Plenge RM, Scolnick EM, Altshuler D. Validating therapeutic targets through human genetics. Nature Reviews Drug Discovery. 2013;12(8):581–594. doi: 10.1038/nrd4051 23868113

6. Cohen J, Pertsemlidis A, Kotowski IK, Graham R, Garcia CK, Hobbs HH. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nature Genetics. 2005;37(2):161. doi: 10.1038/ng1509 15654334

7. Abifadel M, Varret M, Rabès JP, Allard D, Ouguerram K, Devillers M, et al. Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nature Genetics. 2003;34(2):154. doi: 10.1038/ng1161 12730697

8. Kotowski IK, Pertsemlidis A, Luke A, Cooper RS, Vega GL, Cohen JC, et al. A spectrum of PCSK9 alleles contributes to plasma levels of low-density lipoprotein cholesterol. The American Journal of Human Genetics. 2006;78(3):410–422. doi: 10.1086/500615 16465619

9. Cohen JC, Boerwinkle E, Mosley TH Jr, Hobbs HH. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. New England Journal of Medicine. 2006;354(12):1264–1272. doi: 10.1056/NEJMoa054013 16554528

10. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research. 2017;45(D1):D896–D901. doi: 10.1093/nar/gkw1133 27899670

11. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research. 2013;42(D1):D1001–D1006. doi: 10.1093/nar/gkt1229 24316577

12. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research. 2016;45(D1):D896–D901. doi: 10.1093/nar/gkw1133 27899670

13. GTEx Consortium, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science. 2015;348(6235):648–660. doi: 10.1126/science.1262110 25954001

14. Informa’s Pharmaprojects. https://pharmaintelligence.informa.com/products-and-services/data-and-analysis/pharmaprojects.

15. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD). Online Mendelian Inheritance in Man, OMIM®. https://omim.org/.

16. Cao C, Moult J. GWAS and drug targets. BMC Genomics. 2014;15(4):S5. doi: 10.1186/1471-2164-15-S4-S5 25057111

17. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS genetics. 2013;9(8):e1003709. doi: 10.1371/journal.pgen.1003709 23990802

18. Hay M, Thomas DW, Craighead JL, Economides C, Rosenthal J. Clinical development success rates for investigational drugs. Nature Biotechnology. 2014;32(1):40–51. doi: 10.1038/nbt.2786 24406927

19. Shih HP, Zhang X, Aronov AM. Drug discovery effectiveness from the standpoint of therapeutic mechanisms and indications. Nature Reviews Drug Discovery. 2018;17(1):19. doi: 10.1038/nrd.2017.194 29075002

20. Gallagher MD, Chen-Plotkin AS. The post-GWAS Era: from association to function. The American Journal of Human Genetics. 2018;102(5):717–730. doi: 10.1016/j.ajhg.2018.04.002 29727686

21. Cingolani P, Platts A, Coon M, Nguyen T, Wang L, Land SJ, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6(2):80–92. doi: 10.4161/fly.19695 22728672

22. Cook D, Brown D, Alexander R, March R, Morgan P, Satterthwaite G, et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nature Reviews Drug Discovery. 2014;13(6):419. doi: 10.1038/nrd4309 24833294

23. Nguyen PA, Born DA, Deaton AM, Nioi P, Ward LD. Phenotypes associated with genes encoding drug targets are predictive of clinical trial side effects. Nature communications. 2019;10(1):1579. doi: 10.1038/s41467-019-09407-3 30952858

24. Yao J, Hurle MR, Nelson MR, Agarwal P. Predicting clinically promising therapeutic hypotheses using tensor factorization. bioRxiv. 2018; p. 272740.

25. Gorzelany JA, de Souza MP. Protein replacement therapies for rare diseases: A breeze for regulatory approval? Science translational medicine. 2013;5(178):178fs10–178fs10. doi: 10.1126/scitranslmed.3005007 23536010

26. Chang W, Cheng J, Allaire J, Xie Y, McPherson J. shiny: Web Application Framework for R; 2018. Available from: https://CRAN.R-project.org/package=shiny.

27. 1000 Genomes Project Consortium, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393

28. Sheffield NC, Thurman RE, Song L, Safi A, Stamatoyannopoulos JA, Lenhard B, et al. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Genome Research. 2013;23(5):777–788. doi: 10.1101/gr.152140.112 23482648

29. Resnik P, et al. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res (JAIR). 1999;11:95–130. doi: 10.1613/jair.514

30. Lin D, et al. An information-theoretic definition of similarity. In: ICML. vol. 98. Citeseer; 1998. p. 296–304.

31. Greene D, Richardson S, Turro E. ontologyX: a suite of R packages for working with ontological data. Bioinformatics. 2017;33(7):1104–1106. doi: 10.1093/bioinformatics/btw763 28062448

32. Aragon TJ. epitools: Epidemiology Tools; 2017. Available from: https://CRAN.R-project.org/package=epitools.

33. Vehtari A, Gabry J, Yao Y, Gelman A. loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models; 2018. Available from: https://CRAN.R-project.org/package=loo.

34. Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research. 2010;11(Dec):3571–3594.

35. Stan Development Team. RStan: the R interface to Stan; 2018. Available from: http://mc-stan.org/.