Precision medicine, sometimes referred to as “personalized medicine,” is an increasingly common term that remains difficult to precisely define, in spite of its popularity. The general notion is that patients should be viewed individually, rather than strictly as members of some larger general population, and that their specific genetic background, environment, and lifestyle choices should be considered throughout drug development to the point of treatment and continuing patient care.1–3 The term is perhaps best encapsulated by US President Barack Obama’s framing of “delivering the right treatments, at the right time, every time to the right person.”4
The term originally rose to prominence after the completion of the Human Genome Project5 as a call to action in the face of wide disparities between treatment outcomes within the same clinical presentation; however, the exact meaning has changed as technology has continued to advance. As researchers increased their ability to gather large amounts of genetic data, they gained the ability to study the genome as a whole, giving rise to the field of genomics. This led to the development of other “omics,” such as transcriptomics, proteomics, and metabolomics, providing a more in-depth picture of complex biological mechanisms. Modern advances in “omics,” as well as biomarker development, clinical trial design, and mathematical modelling (including artificial intelligence), have revolutionized our ability to precisely define or “stratify” patient populations, splitting them into groups based on their characteristics. This stratification allows patients to receive custom treatments depending on the exact nature of their condition, which can improve patient outcomes and reduce the cost burden of unnecessary therapeutic interventions.
In this article, we focus on the contribution of genomics to the development of modern precision medicine through the use and interpretation of genome-wide association studies (GWAS). Out of the millions of genetic variants present in human populations, GWAS are designed to uncover those associated with specific traits or diseases. Their use, as well as their integration into other types of data, has enabled the development of more precise therapies. This article aims to provide an overview rather than an exhaustive summary of the topic; interested readers may wish to dive deeper into the included references. We aim to improve general understanding of the theory underlying GWAS, some of the challenges surrounding their analysis, and how precision medicine may expand through the use of related technologies moving forward.
The Role of Genomics in Precision Medicine
The rise of genomics, more than any other “omics” technology, has had a significant impact on the understanding and practice of precision medicine. This is likely due in part to the relatively early development of genomics as a discipline, but also speaks to the ability of the genome to influence phenotype.6,7 Early successes in the identification of causative mutations in diseases caused by a single gene (monogenic or Mendelian diseases) such as cystic fibrosis and some types of cardiomyopathies8 gave rise to the hope that the genetic underpinning of disease was within reach. However, in practice, many conditions are complex (non-Mendelian), with contributions from hundreds to thousands of variants and the significant influence of non-genetic factors (the so-called “missing heritability” problem, which we shall discuss in greater depth later).9 Polygenic diseases are much more common, and thousands of genetic variants with very small effects could impact their phenotype.8 Regardless of the type of disease being evaluated, genomics remains a critical component of “translational precision medicine,” being utilized in every stage from early drug discovery to direct patient care.2
Understanding the Role of Mutations
The next logical question to ask surrounds the impact of genetic variation. Some mutations lead to a direct effect on the physiology (phenotype) of an organism, while others are “silent” and have no discernable effects. Although silent mutations do not lead to any changes in the amino acid sequence of a protein, they may change the efficiency of the translation process and protein folding. Genetic variation can also be classified into “coding” and “non-coding” mutations. Coding mutations have a more direct effect through direct alteration of a gene product. In contrast, non-coding mutations have effects on gene expression, transcript stability, and the physical state of the DNA itself (e.g., by altering its accessibility to transcriptional machinery). As we will discuss later, the largest proportion of identified variation in GWAS is in non-coding regions of the genome, which complicates direct interpretation.
Inheritance and Recombination Events
Mendelian inheritance explains how certain traits are passed from parents to their children following a very specific pattern of recessive and dominant traits. However, most traits do not follow Mendel’s laws. For instance, a specific trait may be the result of the interaction of several genes, or traits may be influenced by genes carried on specific chromosomes. The complexities of inheritance can also lead to scenarios where alleles at different chromosome positions (“loci”) that are close together on a chromosome tend to be inherited together. This concept is called linkage disequilibrium (LD), and it refers to the nonrandom association of alleles across loci, such that individual alleles are correlated.
The processes giving rise to LD are numerous but, importantly, once established, the rate at which this linkage decreases is dependent on the recombination rate between the loci.11 Analyses of real populations reveal the presence of large segments (on the order of tens to hundreds of kilobases) in strong LD. These segments are often referred to as “haplotype blocks,” regions of the genome where genetic variants are inherited together more often than expected by chance, and are delineated by areas of relatively frequent recombination.11 One consequence of LD, which we will discuss further below, is that individual DNA base changes (single nucleotide polymorphisms or SNPs) can act as markers for other SNPs or genetic alterations within the same haplotype block. The association of SNPs within haplotype blocks can be used to reduce the total number of genetic variants that need to be detected in genetic studies, as a single SNP can serve as a marker for all the variants within the same block.
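To make the idea of "nonrandom association" concrete, the following is a minimal sketch of two commonly used LD measures for a pair of biallelic loci: the disequilibrium coefficient D and the squared correlation r². It assumes phased haplotypes with alleles coded 0 or 1; the function name `ld_stats` and the toy data are illustrative and are not taken from the article or from any particular software package.

```python
def ld_stats(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    `haplotypes` is a list of (allele_at_locus_1, allele_at_locus_2) pairs,
    one pair per phased chromosome, with each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus 2
    # Frequency of chromosomes carrying allele 1 at both loci.
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    # D is the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic in the sample.
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared


# Toy data (made up): the two alleles always travel together.
linked = [(1, 1)] * 5 + [(0, 0)] * 5
# Toy data (made up): every allele combination is equally common.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

d_linked, r2_linked = ld_stats(linked)
d_unlinked, r2_unlinked = ld_stats(unlinked)
print(r2_linked, r2_unlinked)
```

In the first toy sample, knowing the allele at one locus fully predicts the other, which is the situation in which a single SNP can stand in for its neighbours; in the second, the loci carry no information about each other.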
Genome-Wide Association Studies (GWAS)
There have been several sequencing efforts throughout the years that have significantly increased genomics knowledge. For instance, the 1000 Genomes Project revealed over 88 million variants, including SNPs, short insertions/deletions (indels), and structural variants that are part of groups of genes inherited together, also known as haplotypes.12 However, finding new variants and their location in the genome is not enough. The aim of precision medicine is to understand how these genetic variations are associated with specific diseases or mechanisms and use this information to select the best route of treatment for a patient. The use of GWAS has allowed researchers to incorporate genomics knowledge into drug discovery and development, making both more efficient and accelerating the generation of new knowledge.
GWAS are able to evaluate the presence of statistical associations between genetic variants and phenotypes by studying large numbers of individuals. Depending on the genotyping technology used, millions of genetic variations can be analyzed using this method. Those that are statistically significant (usually a p-value lower than 5×10⁻⁸, a threshold chosen to correct for multiple testing) have the potential to provide insight into novel mechanisms, disease risk, and new biomarkers.13,14 Moreover, their unbiased design and the flexibility in how study populations are selected make them highly relevant to the implementation of precision medicine. Drugs with genetically supported targets are more likely to be successful in phase II and III clinical trials, and drugs with GWAS support are at least two times more likely to be approved, especially if the variant detected alters the amino acid sequence of a relevant protein.15,16
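As an illustration of what a single-variant association test and the genome-wide threshold look like in practice, here is a minimal sketch. It assumes a simple case-control design and an allelic chi-square test with one degree of freedom. Real analyses typically use regression models with covariates, so this is a simplification rather than the method used in the cited studies. The threshold is written as a Bonferroni correction of 0.05 for roughly one million independent tests, which is the usual rationale for 5×10⁻⁸. The function name and the allele counts are hypothetical.

```python
import math

# Bonferroni-style correction: 0.05 spread over ~1,000,000 independent tests.
GENOME_WIDE_ALPHA = 0.05 / 1_000_000


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value). Inputs are counts of alternate and reference
    alleles among cases and controls.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = (
        (case_alt + case_ref)
        * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt)
        * (case_ref + ctrl_ref)
    )
    if denominator == 0:
        # A row or column is empty, so no association can be tested.
        return 0.0, 1.0
    chi2 = numerator / denominator
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the alternate allele is enriched among cases.
chi2, p_value = allelic_chi2(case_alt=600, case_ref=400, ctrl_alt=400, ctrl_ref=600)
is_genome_wide_significant = p_value < GENOME_WIDE_ALPHA
print(chi2, p_value, is_genome_wide_significant)
```

In a GWAS, a test of this kind is repeated for every genotyped or imputed variant, which is why the per-variant threshold has to be so much stricter than the conventional 0.05.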
In addition, GWAS are valuable in understanding the variability in drug response and adverse drug reactions. For instance, they have revealed the association between CYP2C9 and the incidence of Stevens-Johnson syndrome/toxic epidermal necrolysis in patients treated with phenytoin and confirmed the effects of vitamin K epoxide reductase (VKORC1) variants on warfarin dose.17,18 GWAS have also unveiled strong associations between the human leucocyte antigen (HLA) complex and drug hypersensitivities. Although genes located in the HLA complex are particularly difficult to genotype due to their high degree of variation, the use of new statistical methods, such as those that take LD into account, has improved the quality of GWAS evaluating this region.19
Challenges and Opportunities
Defining a “Condition”
One of the outcomes of the continued drive for precision medicine has been the ability to recognize more detailed and nuanced subgroups within traditional patient populations who have the same diagnosis. Increasingly, this subgrouping is driven by molecular data, such as gene mutations and biomarkers, and is usually referred to as “stratification.”1 The identification of optimal subgroups is challenging in that many variables can be chosen but only some may truly be informative of the true subgroup structure; simply adding every variable one has access to can introduce so much complexity and irrelevant information, or “noise,” that the true subgroups are no longer discernable.1
Given that GWAS relies on detecting the association between genome variation and specific traits, this problem takes on an almost circular nature, as the identification of significant genetic variants depends on the correct definition of the population, which in turn depends on the identified genetic variants. Significantly associated variants, which could act as variables in stratification, should only be evident within the specific population in which they are actually a marker for the trait in question. But, without this a priori knowledge of association, choosing the right population may be difficult or even impossible. Often, the first hint that subgroups are present will be in their clinical presentation or treatment response,1 and so this is likely not as large a hurdle in practice. However, it does highlight the challenge in sufficiently defining a “condition,” both for research and clinical practice.
One of the criticisms of GWAS to date has been the problem of “missing heritability,” or the inability of significantly associated genetic variants to account for all of the variation within a trait.8 In some cases, this is likely due to larger environmental effects, which are not always incorporated in the models used to analyze GWAS data. For example, the incidence of obesity has increased over a time period so short that it cannot possibly be due solely to changes at the genomic level.3 Another possible explanation is that we are not always testing for the right associations. Put another way, if a tested trait is actually an amalgamation of two or more specific traits with similar phenotypes, then it should not be surprising that few significant associations are found. Similarly, co-morbidities or co-occurring traits can further confound the analysis by adding noise.8
As our understanding of disease stratification in the context of precision medicine improves, it is likely that we will uncover genetic associations that were not previously apparent.
Expanding the Utility of GWAS
To date, most GWAS data has been gathered using microarrays, with downstream analysis and SNP imputation. Although informative, this approach is sure to miss some variants. Arrays are designed to capture known SNPs, and as such, their ability to detect variants is limited to those that have been identified from other sources of historical data. To date, these sources have been heavily biased towards European populations.14 Additionally, other kinds of genomic variation such as insertions, deletions, translocations, etc. will not be detected unless the array is designed to detect them.
Another primary limitation of SNP-based GWAS data is a limited detection threshold for ultra-rare variants; for example, a variant with a frequency of 1 in 100,000 individuals would be nearly impossible to detect.14 This is a critical factor when the underlying causative mutation(s) for a condition represent rare variants with comparatively large effect sizes, as these are more likely to represent recent mutations limited in scope to small segments of the population.7 Incentives for the development of therapeutics for rare diseases, also referred to as “orphan drugs,” will help to spur further development in these areas and improve treatment options for individuals living with rare diseases.
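A back-of-the-envelope calculation shows why such variants are so hard to study. The sketch below assumes that carriers occur independently at a fixed carrier frequency, so that the number of carriers in a cohort follows a binomial distribution; the cohort sizes are arbitrary illustrations, not figures from the article.

```python
def expected_carriers(carrier_freq, cohort_size):
    """Expected number of carriers in a cohort (binomial mean)."""
    return carrier_freq * cohort_size


def prob_at_least_one_carrier(carrier_freq, cohort_size):
    """Probability that the cohort contains at least one carrier."""
    return 1.0 - (1.0 - carrier_freq) ** cohort_size


carrier_freq = 1 / 100_000  # the "1 in 100,000" example from the text

for cohort_size in (10_000, 100_000, 500_000):  # arbitrary cohort sizes
    print(
        cohort_size,
        expected_carriers(carrier_freq, cohort_size),
        prob_at_least_one_carrier(carrier_freq, cohort_size),
    )
```

Even in a cohort of half a million participants, only about five carriers would be expected under these assumptions, which is far too few for a single-variant association test to reach genome-wide significance unless the effect is extreme.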
As the cost of whole exome sequencing (WES) and whole genome sequencing (WGS) continues to decrease, these methods may become more prevalent in GWAS, provided that the necessary computational resources and expertise are available. Direct sequencing is better suited to the detection of all kinds of genomic variants, including ultra-rare variants. One estimate suggests that sequencing can identify ten times as many relevant variants as SNP typing and imputation.8 In addition, all variants can be directly observed, rather than tied through LD to common SNPs included in arrays.
GWAS data can also help researchers overcome one of the biggest limitations of observational studies: confounding factors. The presence and influence of confounding factors obscure the causal relationship between exposure and outcome in a population. For instance, when evaluating whether or not exposure to high levels of low-density lipoprotein cholesterol causes coronary heart disease, factors such as lifestyle and comorbidities can hide the actual effect. Moreover, randomizing people into separate groups based on the type of exposure evaluated is not always feasible, or even possible. Alternatively, genetic variants identified through GWAS can be used to infer causality. This process, known as Mendelian randomization, leverages SNP genotypes as proxies for an exposure to evaluate its causal effect on complex traits using observational data.10 With more GWAS data being collected over time, it has become easier to uncover factors causing complex diseases, opening new ground for biological discovery.
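The core arithmetic of Mendelian randomization can be sketched with two standard summary-statistic estimators: the Wald ratio for a single SNP and the fixed-effect inverse-variance weighted (IVW) estimate across several SNPs. This is a minimal sketch, not the analysis used in any cited study. It assumes the SNPs are valid instruments (associated with the exposure, independent of confounders, and affecting the outcome only through the exposure) and are mutually independent, and it uses a first-order approximation for the standard error. The function names and effect sizes are made up for illustration.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Causal estimate from one SNP: SNP-outcome effect / SNP-exposure effect.

    The standard error uses the first-order approximation that ignores
    uncertainty in the SNP-exposure estimate.
    """
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def ivw_estimate(instruments):
    """Fixed-effect inverse-variance weighted estimate across SNPs.

    `instruments` is a list of (beta_exposure, beta_outcome, se_outcome)
    tuples, one per independent SNP.
    """
    numerator = sum(bx * by / se**2 for bx, by, se in instruments)
    denominator = sum(bx**2 / se**2 for bx, by, se in instruments)
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics: (SNP-exposure beta, SNP-outcome beta, outcome SE).
instruments = [
    (0.20, 0.10, 0.02),
    (0.40, 0.20, 0.05),
    (0.10, 0.06, 0.03),
]

single_estimate, single_se = wald_ratio(*instruments[0])
pooled_estimate, pooled_se = ivw_estimate(instruments)
print(single_estimate, single_se, pooled_estimate, pooled_se)
```

Because genotypes are assigned at conception, they are not influenced by lifestyle or comorbidities, which is what allows the ratio of the two SNP effects to be read as a causal effect when the instrument assumptions hold.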
Ethnic Variation and Clinical Utility of GWAS
Despite the advancements in precision medicine, there are still disparities in the relevance of genetic variants associated with drug responses or specific diseases across different populations. Most of the GWAS data that were initially generated were focused on European populations, creating a knowledge gap for other ethnicities. For instance, despite more recent efforts to increase study population diversity, roughly 93% of the summary statistics in the NHGRI-EBI GWAS catalogue are generated from European-only samples.20 Since the basis of precision medicine is providing the right therapy to each patient, it is necessary to take ethnic variation into account, as undersampling of certain ethnic groups or other populations underpowers approaches aimed at identifying new genetic associations.20
When applied to a global view, inequity extends to the relative wealth of different world regions. Although the cost of sequencing a human genome has fallen from approximately $300 million to less than $1,500,21 it is still relatively expensive as a means of capturing genetic risk of disease compared with traditional means such as a family history assessment.3 This places a burden on public health systems and further expands inequities in the private space, where individuals of a higher socioeconomic status can more easily leverage sequencing as part of their clinical workup.21
In addition to questions surrounding equitable distribution, some have raised questions around the relative cost-effectiveness of precision medicine as compared to other approaches, especially in the realm of disease prevention. As an example, the rise of diabetes globally, and especially in North American populations, cannot feasibly be linked solely to new genetic alterations (given the relatively short time span). Therefore, the increased incidence is more likely associated with lifestyle changes, which may interact with an individual’s genetic background.3 This is backed by polygenic risk scores (PRS) of individuals for obesity, which, to date, have had limited predictive ability.8 Although certain high risk individuals could benefit from specific interventions, it is arguable that the most cost-effective measures, at least from a public health perspective, would be those affecting behavioural choices.3 However, the issue of cost-effectiveness is complex and involves many factors; as such, precision medicine may still be cost-effective in some scenarios.22
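For readers unfamiliar with polygenic risk scores, the basic calculation is a weighted sum: each risk allele an individual carries is multiplied by the effect size estimated for that variant in a GWAS. The sketch below shows this additive form only; it omits steps that real PRS pipelines usually include, such as LD-aware variant selection and standardization against a reference population. The variant identifiers, weights, and genotypes are hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Additive PRS: sum over variants of effect size x allele dosage.

    `dosages` maps variant ID -> number of effect alleles carried (0, 1, or 2).
    `weights` maps variant ID -> per-allele effect size from a GWAS.
    Variants missing from either mapping are skipped.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[variant] * dosages[variant] for variant in shared)


# Hypothetical per-allele effect sizes (for example, log odds ratios).
weights = {"snp_a": 0.10, "snp_b": 0.30, "snp_c": -0.05}

# Hypothetical genotypes for one individual.
dosages = {"snp_a": 2, "snp_b": 0, "snp_c": 1, "snp_untested": 2}

score = polygenic_risk_score(dosages, weights)
print(score)
```

Because each weight is typically very small, a score of this kind only separates individuals well when the trait has a strong genetic component that the included variants capture, which is consistent with its limited predictive ability for traits heavily shaped by lifestyle.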
Within the context of precision medicine, GWAS represents one arm within an increasingly large armament of potential tools. Continued development of multi-omics approaches within the translational landscape from initial development to patient care will expand the volume and heterogeneity of data available.2
One major challenge that arises in this new reality is how to integrate and analyze these data. In addition to the aforementioned heterogeneity, some data will represent point observations, such as genetic variants, while others will represent continuous observations with a temporal aspect, such as patient data from wearables.2 Machine learning (ML) approaches have the potential to integrate such data, allowing the use of “multi-modal” data within the precision medicine landscape. This can incorporate various types of data such as the full range of “omics” data, physiological parameters, and clinical information. However, the success of these approaches is highly dependent on the quality and appropriateness of the data, the selection of the right machine learning algorithms, and the correct interpretation of the results.23 By incorporating diverse data, it may be possible to further refine existing patient stratification and to identify new groupings within traditional disease definitions. This will in turn help to refine further research, including GWAS, and power a cyclical process wherein patient-level data can further improve early stage discoveries.1
The use of complementary approaches to GWAS such as phenome-wide association studies (PheWAS) can also strengthen and generate new hypotheses in drug discovery. Unlike GWAS, which start from a specific trait and test which genetic variants are associated with it, PheWAS start from a single genetic variant and test its association with a wide variety of traits in a population. Combining GWAS and PheWAS data, or performing one after the other, offers more clarity in cases where a single genetic change affects multiple traits (pleiotropy) and provides further insights into target discovery. Since PheWAS usually require the use of complex, heterogeneous, and often incomplete electronic health record (EHR) data, the use of ML can help with data structuring and content standardization.24
Other, more direct, applications of ML in the GWAS space are also becoming increasingly common. For SNP-based associations, identification of the relevant gene remains challenging, especially in cases where the variant lies outside of a coding region. ML models can incorporate additional evidence, such as expression and protein quantitative trait loci (eQTLs and pQTLs) and epigenomic data, to provide better predictions for downstream validation.25 Other methods, such as those based on the construction and analysis of networks that incorporate GWAS data, have also been investigated and appear promising.26 For example, methods that use network analysis to identify clusters of related genetic variants or phenotypes can provide a more holistic view of the genetic architecture of complex traits and diseases, potentially revealing new insights that would not be apparent from traditional single-variant analyses. Such methods should work equally well, if not better, in the analysis of GWAS associations based on sequencing data, and will shed new light on the molecular underpinnings of a variety of traits.
Since the first GWAS was published in 2005, there have been more than 50,000 significant variant-trait associations identified.14 As an unbiased method for genetic variant detection, GWAS holds the promise of identifying new genes and gene regulation mechanisms underlying a variety of conditions, including complex diseases. With the increasing use of sequencing technologies, development of increasingly complex models to analyze GWAS data, and an increasing appreciation of stratified groups within traditionally labelled conditions, the number of associations is likely to continue to increase. Importantly though, our ability to capture data has, to some extent, outstripped our capacity to infer understanding from it. As such, continued efforts towards connecting associations to disease mechanisms and then to downstream treatment options should be a priority moving forward.
4. Obama White House Archive. Remarks by the President on Precision Medicine. whitehouse.gov https://obamawhitehouse.archives.gov/the-press-office/2015/01/30/remarks-president-precision-medicine (2015).