Genetic Association–Guided Analysis of Gene Networks for the Study of Complex Traits

The wealth of data measuring genetic variations and complex phenotypes in large cohorts provides a substantial resource, but it also raises significant challenges. The number of loci with strong statistical support for an association with one or more phenotypes has grown, but we are still not able to account for the genetic basis of many common phenotypes. New methods use gene networks to identify additional genes associated with phenotypes. Here we briefly review recently developed methods and categorize them by their analytic approach. We also highlight a factor that can confound network-based approaches: genes with more measured single-nucleotide polymorphisms (SNPs) tend to also be more connected across many types of networks. Finally, we highlight new techniques that use nominally significant, as opposed to genome-wide significant, associations to guide the analysis of functional relationship networks. In contrast with network-based methods that use documented associations to reallocate weight or to adjust P values, these methods do not require documented associations with the underlying phenotype. We provide an example of such techniques, the network-wide association study (NetWAS), and discuss how such methods complement the analytic toolkit available to modern geneticists. We conclude by discussing remaining challenges in the field and new application areas for network-based methods for analysis of association data.

Challenges for Genome-Wide Association Studies
In a landmark study in 2005, Klein et al 1 reported the discovery, via a genome-wide association study (GWAS), of a variant in the complement factor H gene that was associated with age-related macular degeneration. Since that time, GWAS has been used to identify variants associated with numerous traits, from complex diseases 2 to preferences for cilantro. 3 Recent work has demonstrated that GWAS discoveries provide a fruitful means to reposition drugs 4 and that the presence of a GWAS hit in a drug target is positively associated with the chance that a drug will progress through clinical trials and ultimately be approved. 5 Each GWAS produces a phenotype association P value for each variant, or for each gene in gene-based tests, 6-10 and variants or genes with P values below a certain threshold are considered associated with the phenotype.
Although GWASs have identified numerous genetic variants associated with diverse phenotypes, these variants generally do not explain the majority of phenotypic variance suspected to be genetic. 11 The unexplained genetic component of complex traits has been termed the missing heritability, 12 and there are numerous potential explanations for it. These explanations include gene-gene interactions, 13,14 heterogeneity (allelic 15 and phenotypic 16 ), rare variants of strong effect, 17 parent of origin, 18 transgenerational, 19 and gene-environment effects, 20,21 in which an environmental trigger induces risk in specific individuals with certain genetic variants. These factors can render a study that is adequately powered for common variants of reasonable effect size underpowered, leading to an unexpected increase in the false-negative rate (ie, the proportion of true associations that are not discovered). The practice of requiring both a highly stringent significance threshold and replication for publication of an association, while reducing false positives substantially, can further compromise the ability of GWASs to identify associated variants and genes. 22 This practice leads to many true associations that are nominally significant but fall short of genome-wide significance. 22 To extract these associations, we need methods capable of using orthogonal sources of information, for example, pathways or biological networks, to effectively reanalyze GWAS results.
Encyclopedia of Genes and Genomes, 27 the Gene Ontology, 28 or other pathway resources. 29,30 Pathway-based methods, 31-33 which have been recently reviewed, 34,35 aim to identify pathways associated with the phenotype or disease of interest. 36 A primary strength of this class of methods is the interpretability of the results: most produce a list of pathways associated with a phenotype of interest. Although it may be challenging to identify a drug that will be efficacious directly from a single genetic variant or gene, the pathways identified by these approaches may suggest targeted agents or drug repositioning opportunities. 37,38 A primary limitation of these approaches is that, for most resources, pathways and gene sets are constructed based on curation of published literature. This limits the breadth and diversity of curations because annotations do not represent a random selection of true biological relationships. 39 In addition, annotation of genes to multiple related pathways can lead to a small set of genes driving associations for multiple distinct pathways, which are then difficult to untangle. 40 Finally, the pathway-based approaches use statistics based on gene sets and, hence, do not consider connections between genes, only membership within the pathway.
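As a minimal sketch of the gene set statistics described above, the following computes a one-sided over-representation P value for a single pathway. The hypergeometric test is used here only as one common choice; it is an assumption of this sketch, not a description of any specific cited method, and the function name and toy counts are hypothetical. Note that the calculation depends only on set membership, which illustrates why such statistics ignore connections between genes.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_hits, n_overlap):
    """One-sided over-representation P value, P(X >= n_overlap).

    X is the number of pathway members among n_hits genes drawn without
    replacement from n_universe measured genes, of which n_pathway are
    annotated to the pathway. Only set membership enters the calculation.
    """
    max_overlap = min(n_pathway, n_hits)
    total = comb(n_universe, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy counts: 20 measured genes, 5 in the pathway,
# 4 associated genes, 3 of which fall in the pathway.
p_value = hypergeom_enrichment_p(n_universe=20, n_pathway=5, n_hits=4, n_overlap=3)
```

In practice the gene universe should be restricted to genes that were actually measured, and P values across many pathways need correction for multiple testing.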
In contrast with pathway-based approaches, network-based methods incorporate connectivity measures between genes, either from curated resources, interaction databases, or integrative resources that combine multiple data sources and types. The intuition behind the use of network-based approaches is simple: because genes do not act in isolation, we expect to observe associated genes either connected to or participating in common phenotype-specific subnetworks. 41 The means of encoding this expectation into an underlying analytical method varies by method. 42,43 In general, there are 2 types of approaches: (1) stringent evidence methods that identify genes or variants within neighborhoods with genes that possess a documented association to the phenotype; or (2) permissive evidence methods that identify network neighborhoods with an overabundance of associated variants or genes. Although these are the predominant groups that we focus on, some techniques generate scores independent of disease information using network topology, which are combined in a post hoc step with GWAS associations. 44 The stringent and permissive evidence methods both use disease or phenotype-specific associations to guide an analysis of networks to identify genetic factors underlying the phenotype of interest.
Network-based approaches can incorporate diverse types of networks. Example types of biological networks that are available include protein-protein interaction networks, 45,46 miRNA target networks, 47,48 transcription factor networks, 49,50 and shared function networks. 51,52 These networks can be either curated by experts from the literature or inferred based on available data. The benefit of curated networks is that they are expected to provide high fidelity for the edges that exist, whereas the benefit of inferred networks is that they are expected to provide a more complete representation of the underlying biological networks. A recent analysis revealed that individual network types exhibited varying performance for the prediction of gene-disease associations. 53 Networks that captured common directional responses to perturbations exhibited the highest mean performance across phenotypes for single-type networks, but an integrated model provided the highest overall performance. 53

Stringent Evidence Methods
For stringent evidence methods, high-confidence gene-phenotype associations are used as a building block to identify promising network regions. In some cases, these associations are derived from literature or database support. 54-56 In other cases, these may be derived from a set of hits from one or more GWASs that passed a stringent statistical threshold, indicating genome-wide significance. 51,53,57 In each case, the objective of these algorithms is to identify network neighborhoods associated with a high density of prior disease annotations. Depending on the approach, an implicit or explicit assumption of guilt-by-association then allows new genes and variants to be prioritized based on their own connectivity patterns.
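The guilt-by-association idea can be illustrated with a deliberately simple sketch: each unannotated gene is scored by the fraction of its total edge weight that connects it to genes with a documented association. This scoring rule, the function name, and the toy network are assumptions made for illustration; the published methods cited above use more elaborate models.

```python
from collections import defaultdict


def guilt_by_association(weighted_edges, known_genes):
    """Rank unannotated genes by the share of their edge weight that
    connects them to genes with a documented phenotype association."""
    neighbors = defaultdict(dict)
    for (gene_a, gene_b), weight in weighted_edges.items():
        neighbors[gene_a][gene_b] = weight
        neighbors[gene_b][gene_a] = weight  # undirected network

    scores = {}
    for gene, nbrs in neighbors.items():
        if gene in known_genes:
            continue  # only prioritize genes without a documented association
        total = sum(nbrs.values())
        to_known = sum(w for nbr, w in nbrs.items() if nbr in known_genes)
        scores[gene] = to_known / total if total > 0 else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy network; genes A and B stand in for documented associations.
toy_edges = {
    ("A", "B"): 1.0,
    ("A", "C"): 0.5,
    ("B", "C"): 1.0,
    ("C", "D"): 0.2,
    ("D", "E"): 1.0,
}
ranking = guilt_by_association(toy_edges, known_genes={"A", "B"})
```

Gene C, which is connected mainly to the two documented genes, is ranked first, whereas D and E have no direct edge to a documented gene and score zero.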
For example, a recently developed approach 53 uses stringent evidence of associations, collected from the GWAS catalog, 58 to integrate multiple types of networks. The intuition behind this approach is that no single source of data is likely to contain sufficient information to accurately predict gene-disease relationships. In this scenario, integrating multiple networks with different types of information can improve performance beyond that achievable with single networks. Integrating and appropriately weighting these diverse evidence types presents a challenge. To address this, the authors used machine learning to automatically learn the relative importance of multiple biological networks. They demonstrate that this integration dramatically outperforms individual networks.

Permissive Evidence Methods
The second type of method analyzes association results, including those that do not meet genome-wide significance, from an individual study in the context of existing networks. Because the information used is from a single study, these methods are easier to evaluate: concordance with documented associations is unlikely to be driven by knowledge bias specific to the phenotype. To identify genes with evidence for an association with the disease, these methods have used permissive significance thresholds. 59-62 An example of this type of approach is protein interaction network-based pathway analysis. 59,60 Protein interaction network-based pathway analysis identifies gene modules that have an enrichment of genes with nominally significant P values. Once these modules are identified, the biological significance of modules can be evaluated through gene set enrichment analysis, for example, by identifying Gene Ontology terms overrepresented in each module. The recently developed integrative protein-interaction-network-based pathway analysis extension adds a step that allows signal to diffuse from enriched nodes along network edges before modules are identified. 61
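To make the notion of signal diffusing along network edges concrete, the sketch below spreads seed scores over a weighted network with a random walk with restart. This particular formulation, together with the restart probability and the toy chain network, is an assumption chosen for illustration and may differ from the diffusion step used in the cited extension.

```python
def diffuse_scores(neighbors, seed_scores, restart=0.5, n_iter=50):
    """Spread seed scores over a network by a random walk with restart.

    neighbors: dict mapping each gene to a dict {neighbor: edge weight};
               assumed symmetric, with every neighbor also present as a key.
    seed_scores: dict of nonnegative starting scores (at least one > 0).
    """
    genes = list(neighbors)
    total_seed = sum(seed_scores.get(g, 0.0) for g in genes)
    start = {g: seed_scores.get(g, 0.0) / total_seed for g in genes}
    current = dict(start)
    for _ in range(n_iter):
        # A fraction `restart` of the mass returns to the seeds each step.
        updated = {g: restart * start[g] for g in genes}
        for gene in genes:
            degree = sum(neighbors[gene].values())
            if degree == 0:
                updated[gene] += (1 - restart) * current[gene]
                continue
            # The rest moves to neighbors in proportion to edge weight.
            for nbr, weight in neighbors[gene].items():
                updated[nbr] += (1 - restart) * current[gene] * weight / degree
        current = updated
    return current


# Hypothetical chain network A - B - C - D with the signal seeded at A.
chain = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
diffused = diffuse_scores(chain, seed_scores={"A": 1.0})
```

The diffused score decays with network distance from the seed, so genes without a nominally significant P value can still receive signal from enriched neighbors before modules are called.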

New Methods Integrate Elements of Both Strategies
An example of a hybrid approach that incorporates elements from both types of approaches is the NetWAS. 62 NetWAS uses techniques common to stringent evidence algorithms, but performs an analysis that uses permissive evidence. Specifically, NetWAS has 2 analytic similarities to stringent evidence methods: NetWAS uses tissue-specific networks in reprioritization, which have demonstrated consistently better performance than tissue-naïve networks, 56,62 and NetWAS, like the heterogeneous network method, 53 uses a machine learning strategy to identify network characteristics predictive of disease relevance. NetWAS learns which genes in the network are connected to the associations identified by a study. Consistent with permissive evidence methods, NetWAS performs an analysis based on the patterns of association observed within an individual GWAS instead of a summarization of discovered associations. The final outcome of a NetWAS analysis is a network-based ranking of all genes for association with the disease, based on their connectivity in the selected network. This hybrid approach can be used to reanalyze a GWAS in the context of a network specific to a tissue of interest to produce a prioritized list of candidates with both tissue and phenotype specificity.

Potential Confounding From Varying SNP Abundances
Network-based approaches provide an important new avenue to analyze genetic association data to reveal the genetic basis of common diseases. It is important for users and developers of these methods to be aware that some features are correlated between GWAS data and networks and can therefore confound some network-based approaches.
Pioneering approaches collapsed variant-level P values to gene-level P values by using only the minimum P value for variants within each gene. 51,59 This process leads to genes with more measured variants receiving lower P values, potentially for no reason other than the gene's size. 63 This confounding effect is compounded because genes with more measured variants also tend to be more connected in networks that are commonly used by these methods (Figure 1; code released into the public domain at https://github.com/dhimmel/snplentiful 65 ). Fortunately, the development of gene-based methods 6-10 that consider the nonuniform mapping of measured variants to genes has alleviated this challenge. When applying network and pathway-based methods, it is important to use algorithms that either use gene- or pathway-based tests or permutation of case-control status to account for this feature of the data. The example approach that we examine in depth in the next section, NetWAS, uses results from gene-based association tests to address this issue.
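The size bias of the minimum P value can be illustrated under a simplifying assumption that variants within a gene are independent, which ignores linkage disequilibrium. The sketch below compares the chance that a gene with no true association still attains a minimum P value below 0.05 for small and large genes, and includes a Sidak-style adjustment as a stand-in for a size-aware statistic. It is not the procedure used by the gene-based tests cited above; the function names, gene sizes, and number of simulated genes are hypothetical.

```python
import random


def prob_min_p_below(alpha, n_variants):
    """Chance that the smallest of n independent null P values is below alpha."""
    return 1.0 - (1.0 - alpha) ** n_variants


def sidak_adjust(min_p, n_variants):
    """Sidak-style adjustment of a minimum P value for the number of
    variants tested (assumes independent variants, ie, ignores LD)."""
    return 1.0 - (1.0 - min_p) ** n_variants


def simulate_null_min_p(n_variants, n_genes=2000, alpha=0.05, seed=0):
    """Fraction of simulated null genes whose minimum P value is below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        min_p = min(rng.random() for _ in range(n_variants))
        if min_p < alpha:
            hits += 1
    return hits / n_genes


# Hypothetical gene sizes: 2 versus 50 measured variants, no true signal.
small_gene_rate = simulate_null_min_p(n_variants=2)
large_gene_rate = simulate_null_min_p(n_variants=50)
expected_small = prob_min_p_below(0.05, 2)
expected_large = prob_min_p_below(0.05, 50)
```

Under these assumptions, most null genes with 50 measured variants pass a nominal threshold on minimum P value alone, which is why an unadjusted minimum can masquerade as network signal when larger genes are also more connected.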

An Illustration of Network-Based Methods Using NetWAS As an Example
NetWAS 62 operates by identifying phenotype-associated patterns in biological networks. The uncovered patterns are then used to rank each gene for association with the phenotype. The specific steps of NetWAS are to identify genes with a nominal association; to use those genes to guide a machine learning analysis of tissue-specific networks; and to use the results of the network analysis to rank all genes in the network based on the model constructed by machine learning (Figure 2 and discussed in detail below).
In the first step of NetWAS, a gene-based test is applied to convert SNP association P values into scores for each gene. The goal of this step is to generate a positive set of genes that are enriched for true associations. In Greene et al, 62 the versatile gene-based association study 6 method was used, but any gene-based test that effectively controls for the number of variants in genes can be applied. Genes are then selected based on a lenient statistical threshold, for example, P<0.01 in Greene et al, but genes could alternatively be selected based on a permissive false discovery rate threshold. A negative set is constructed consisting of genes that show no evidence of association. The goal of the negative set is to define the universe of genes that were measured but not identified as significant, and so in practice, this could be set to encompass only genes with little evidence for association (eg, P≥0.2). In Greene et al, this was constructed as simply the complement of the positive set (genes having P≥0.01). The negative set allows NetWAS to be readily applied to platforms that are not genome-wide, for example, the Immunochip 66 or Metabochip 67 platforms. Once the positive and negative sets are constructed, they are overlaid on a selected network (Figure 2A).
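A minimal sketch of the labeling step follows, assuming gene-level P values are already available from a gene-based test. The function name and the toy P values are hypothetical. The default reproduces the complement construction (negatives have P≥0.01), and the optional second threshold reproduces the stricter alternative mentioned above (eg, P≥0.2).

```python
def build_label_sets(gene_p_values, positive_threshold=0.01, negative_threshold=None):
    """Split measured genes into nominally associated positives and negatives.

    gene_p_values: dict mapping gene name to a gene-based P value.
    negative_threshold: if None, negatives are the complement of the
        positives; otherwise only genes with P >= negative_threshold are
        negatives and genes in between stay unlabeled.
    """
    if negative_threshold is None:
        negative_threshold = positive_threshold
    if negative_threshold < positive_threshold:
        raise ValueError("negative_threshold must be >= positive_threshold")
    positives = {g for g, p in gene_p_values.items() if p < positive_threshold}
    negatives = {g for g, p in gene_p_values.items() if p >= negative_threshold}
    return positives, negatives


# Hypothetical gene-based P values.
toy_p_values = {
    "GENE_A": 0.0004,
    "GENE_B": 0.008,
    "GENE_C": 0.05,
    "GENE_D": 0.31,
    "GENE_E": 0.92,
}
positives, negatives = build_label_sets(toy_p_values)
strict_pos, strict_neg = build_label_sets(toy_p_values, negative_threshold=0.2)
```

Genes absent from the input are never labeled, which matches the treatment of unmeasured genes on platforms that are not genome-wide.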
The 2 sets of genes, nominally associated positives and unassociated negatives, are used to guide an analysis of tissue-specific networks. To perform this analysis, the network is converted into an adjacency matrix, as depicted in Figure 2B. This analysis uses weighted networks, but for illustrative purposes, binary networks (connected, unconnected) are depicted in Figure 2. A machine learning algorithm is then applied to derive weights associated with connectivity to each gene (Figure 2C). In Greene et al, a support vector machine algorithm was applied, though the framework is amenable to any machine learning-based classifier. The algorithm identifies nodes in the network that tend to be either more or less connected to genes with a nominally significant association than to unassociated genes. From this point forward, the status of significant/nonsignificant in the GWAS is no longer considered. A predictor is constructed by using the weights assigned to each node by the machine learning algorithm (Figure 2D).
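The second step can be sketched as follows, under stated assumptions. The network is converted to a symmetric weighted adjacency matrix, each labeled gene contributes its row of edge weights as a feature vector, and a linear classifier learns one weight per column (gene). A bare-bones linear support vector machine trained by subgradient descent on the hinge loss is used here purely for illustration; the hyperparameters, function names, and toy network are hypothetical, and a real analysis would rely on an established, tested implementation.

```python
def adjacency_matrix(genes, weighted_edges):
    """Symmetric weighted adjacency matrix as a list of rows."""
    index = {gene: i for i, gene in enumerate(genes)}
    matrix = [[0.0] * len(genes) for _ in genes]
    for (gene_a, gene_b), weight in weighted_edges.items():
        i, j = index[gene_a], index[gene_b]
        matrix[i][j] = weight
        matrix[j][i] = weight
    return matrix


def train_linear_svm(rows, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Linear SVM fitted by per-sample subgradient descent on the
    L2-regularized hinge loss. Labels must be +1 or -1."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            for k in range(len(weights)):
                grad = reg * weights[k]
                if margin < 1:  # sample violates the margin
                    grad -= y * x[k]
                weights[k] -= learning_rate * grad
            if margin < 1:
                bias += learning_rate * y
    return weights, bias


# Hypothetical toy network: H neighbors the positives, K the negatives.
genes = ["P1", "P2", "N1", "N2", "H", "K"]
edges = {
    ("P1", "H"): 1.0,
    ("P2", "H"): 0.9,
    ("N1", "K"): 1.0,
    ("N2", "K"): 0.8,
    ("H", "K"): 0.1,
}
matrix = adjacency_matrix(genes, edges)
labels_by_gene = {"P1": 1, "P2": 1, "N1": -1, "N2": -1}
train_rows = [matrix[genes.index(g)] for g in labels_by_gene]
train_labels = list(labels_by_gene.values())
gene_weights, _ = train_linear_svm(train_rows, train_labels)
weight_of = dict(zip(genes, gene_weights))
```

In this toy example the column for H receives a positive weight and the column for K a negative weight, mirroring the description of nodes that are more or less connected to nominally significant genes than to unassociated genes.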
In the final stage of NetWAS, the predictor trained in the second step is applied to all genes (Figure 2E). The predictor contains a weight for each gene that reflects the extent to which it is connected primarily to nominally significant, as opposed to unassociated, genes. The values of each gene's edges to each other gene are then multiplied by these weights and summed to produce a prediction for that gene (Figure 2, NetWAS Score). This prediction captures the extent to which the gene's network connectivity indicates consistency with the nominally significant set.
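The scoring rule described above amounts to a weighted sum over each gene's row of the adjacency matrix. The sketch below applies it to a small example; the adjacency values and learned weights are hypothetical numbers chosen for illustration, and the function name is not taken from any released implementation.

```python
def network_scores(genes, matrix, gene_weights):
    """Score each gene as the sum over its edges of
    (edge weight to neighbor) * (learned weight of that neighbor)."""
    scores = {}
    for i, gene in enumerate(genes):
        scores[gene] = sum(
            matrix[i][j] * gene_weights[other] for j, other in enumerate(genes)
        )
    return scores


# Hypothetical symmetric adjacency matrix (rows and columns in `genes` order).
genes = ["G1", "G2", "G3", "G4"]
matrix = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.2],
    [0.1, 0.0, 0.0, 0.8],
    [0.0, 0.2, 0.8, 0.0],
]
# Hypothetical learned weights: positive means connectivity to this gene
# is characteristic of nominally significant genes.
gene_weights = {"G1": 1.0, "G2": 0.5, "G3": -0.5, "G4": -1.0}

scores = network_scores(genes, matrix, gene_weights)
ranked_genes = sorted(scores, key=scores.get, reverse=True)
```

Here G2 ranks first because its strongest edge leads to the gene with the largest positive weight, and G3 ranks last because it is connected mainly to the gene with the most negative weight. Every gene in the network receives a score, regardless of its own GWAS P value.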
Focusing specifically on cardiovascular phenotypes, a NetWAS-predicted gene set outperformed the underlying GWAS on key measures in an analysis of multiple phenotypes related to hypertension. 40 Specifically, for each phenotype, NetWAS ranked genes with a documented role in hypertension more highly than the corresponding GWAS; it ranked genes annotated to hypertension-specific Gene Ontology 28 processes more highly than the GWAS; and it ranked genes targeted by antihypertensive drugs more highly than the GWAS. In addition, the NetWAS top-ranked genes exhibited literature support for involvement in hypertension. Although our discussion here focuses on cardiovascular genetics, an analysis of publicly available data revealed strong performance across GWASs of multiple phenotypes. 62

Conclusions
Although GWAS has not fully revealed the genetic basis of common human disease, it is now becoming clear that these data can be useful in a large-scale data mining framework. Network-based methods provide a powerful means to identify the mechanistic basis of complex phenotypes from GWAS results. To take advantage of this resource, we will need to continue to develop, evaluate, and apply algorithms that take into account the complexities of the underlying phenotypes in one or more tissues. In addition to analytical methods that leverage the multitissue to phenotype mapping, methods that incorporate related phenotypes also represent an area of considerable potential. Phenome-wide association studies 68 have complemented GWAS by mapping the associations of variants across multiple phenotypes. Results with NetWAS for distinct hypertension-related phenotypes revealed that performance aggregated across multiple phenotypes outperformed single NetWASs. This suggests that phenome-wide application of network-based methods represents a promising area for new algorithm development and applications. Methods capable of integrating multiple related phenotypes simultaneously to identify a common genetic basis could improve power and help to disentangle the genetic basis of complex diseases.
Figure 2. Schematic diagram of the Network-wide Association Study (NetWAS) procedure. In step A, nominally significant genome-wide association study (GWAS) hits (above blue line) are combined with a tissue-specific network to create a labeled network (blue, nominally significant; red, not nominally significant; white, not measured). In step B, this network is converted to a gene-gene adjacency matrix (dark squares indicate edge from the network). The adjacency matrix is symmetrical. In step C, a weight is calculated for each gene related to the extent to which it interacts with positive genes using a machine learning algorithm, for example, support vector machines. Scores can range from specifically interacting with positives (high positive weight, dark blue plus) to specifically interacting with negatives (high negative weight, dark red plus). In step D, labels are removed and only weights associated with columns are considered. In step E, genes represented as rows are scored based on the weights of the genes that they are connected to. This generates degrees of positive scores (blue) or negative scores (red). These scores are the output of the NetWAS procedure.
Although approaches developed to date have delivered promising results that have added value to existing GWAS, we anticipate that this active area of research will continue to advance. We expect that methods that improve our ability to mine these networks, methods that integrate across multiple tissues or cell lineages, and methods that integrate across multiple phenotypes will each contribute to advances in our understanding of the genetic basis of complex cardiovascular phenotypes. We expect that combined approaches that integrate across multiple aspects simultaneously will also provide new opportunities to discover the genetic basis of complex cardiovascular traits.