Heuristic Methods for Finding Pathogenic Variants in Gene Coding Sequences

These are exciting times, with a plethora of new technologies that are expediting discovery of the genetic underpinnings of human disease. Comprehensive resequencing of the human genome is now feasible and affordable, allowing each person's entire genetic makeup to be revealed. The major focus of attention in genetics studies has been the small portion (1%) of the human genome that comprises the protein-coding sequences in genes (the "exome"), and the majority of causal disease-associated variants identified to date have been located in these regions. 1 A remarkable extent of genetic variation in the protein-coding regions has been found, with at least 20 000 single-nucleotide polymorphisms (SNPs) present even in normal healthy subjects. 2, 3 Half these SNPs are nonsynonymous changes that result in an amino acid substitution that could potentially affect protein function. The greatest challenge now facing investigators is data interpretation and the development of strategies to identify the minority of gene-coding variants that actually cause or confer susceptibility to disease. To address this problem, bioinformatics tools have been developed to predict the likelihood of pathogenicity. A bewildering array of options is available, and users need to be aware of the programs most suited to their needs as well as the strengths and weaknesses of the various methods employed.
Here, we provide an introductory overview of some commonly used pathogenicity prediction programs as well as a set of illustrative cardiac examples. This article is tailored for readers who are not bioinformatics experts and is relevant to cardiovascular researchers undertaking human genetics studies as well as to clinicians performing genetic testing. For comprehensive reviews of available methods, 4-8 detailed technical explanations of the bioinformatics and validation of individual programs, 9-21 and comparative analyses in large variant data sets, 22-28 we refer the reader to excellent articles published elsewhere. The important "take-home" message is that although bioinformatics prediction programs are extremely useful, the results cannot necessarily be taken at face value because all programs have inherent limitations, and additional supporting evidence is required to confirm that predicted deleterious variants have a role in disease processes.

Importance of Gene Coding Sequence Variants in Human Disease
The Human Gene Mutation Database (HGMD) 1 currently lists more than 120 000 variants in more than 4400 genes that have been associated with human diseases. Disease-associated variants include nonsense variants (changes that convert an amino acid codon into a stop codon), variants that create or abolish splice donor or acceptor sites, and insertions or deletions (indels) that shift the protein reading frame. All these types of variants have a high probability of altering protein function. Interpretation of missense SNPs (that change an amino acid but do not result in a stop codon) is far less straightforward because of the range of effects they can impart. Missense SNPs in critical residues can have disastrous consequences on protein function or structure. However, missense SNPs may be benign when the amino acid is substituted for another with similar biochemical properties, if the substitution occurs in an evolutionarily nonconserved position, or when the residue is not in a critical structural or functional domain of the protein. The average white individual has approximately 10 000 missense SNPs in their exome, of which approximately 200 are novel. 3 Experimentally elucidating the consequences of each variant using in vitro studies and animal models is the best way to demonstrate functional effects, but this is impractical on a large scale. Reliable and high-throughput methods for evaluating missense SNPs are clearly required.
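The distinction between synonymous, missense, and nonsense substitutions can be made concrete with a short Python sketch. It is a minimal illustration that assumes the standard genetic code and handles single-codon substitutions only (not splice variants or indels); the function name `classify_substitution` and the example codons are our own choices and are not taken from any of the programs discussed here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...
# ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop) encoded
    if alt_aa == "*":
        return "nonsense"     # amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop-loss"    # stop codon becomes an amino acid codon
    return "missense"         # one amino acid replaced by another


# Illustrative codons only; they are not tied to any variant in this article.
for ref, alt in [("CGG", "CAG"), ("CGA", "TGA"), ("CGG", "CGA")]:
    print(ref, "->", alt, classify_substitution(ref, alt))
```

Only the "missense" class requires the prediction programs discussed in this review; the other classes are usually interpreted directly from the predicted change to the protein.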

Steps in Sequence Analysis
A number of different strategies may be used in genetics studies, and the choice of method depends on the population under investigation and the specific questions being addressed. Studies of Mendelian traits in large family kindreds have traditionally involved linkage analysis to define a chromosomal disease locus, followed by resequencing of candidate genes that are located within the interval. In cohorts of small families in which linkage analysis cannot be performed, resequencing of selected candidate genes is often performed. These approaches have led to the discovery of numerous disease genes for a wide range of cardiac (and extracardiac) disorders and have provided a basis for commercial genetic testing (discussed in a later section). Whole-genome and whole-exome massively parallel sequencing platforms are now rapidly gaining popularity for discovery of new disease genes and for identification of variants in known disease genes in families. In cohorts of unrelated patients, resequencing of single genes and genome-wide association studies with SNP arrays have been used to look for rare and common variants that affect disease risk. Although cost is still a factor in large cohort studies, next-generation sequencing will undoubtedly be used increasingly in this setting.
Irrespective of the sequencing method used, the principles of sequence analysis are essentially the same (Figure 1). First, the sequencing output needs to be aligned to a human reference assembly to determine whether there are any differences with the "normal" sequence and to determine the location of variations (gene exon, gene intron, intergenic). Second, the potential effects of variants on the encoded protein need to be determined (eg, nonsynonymous or synonymous amino acid substitution, splice variant, indel, etc). Third, a search is made of publicly available databases, such as dbSNP, 1000 Genomes, and the Exome Sequencing Project, and in some cases, a cohort of healthy control DNA samples may be genotyped to determine whether variants are novel or have been previously reported and the prevalence of the variant allele. Some inferences then need to be made about potential functional effects. For cardiovascular diseases, variants in genes that are expressed in the heart or vasculature and that have relevant functions for the trait under study can be prioritized. However, it is important not to disregard the possibility that cardiac expression or function of some genes may not be recognized. Even after these filtering methods are employed, a long list of "suspicious" variants is likely to remain, and prediction tools have a key role in shortlisting these for further analysis. Bioinformatics tools are heuristic, that is, they combine various types of parameters from multiple sources to infer likely pathogenicity when detailed experimental evaluation of individual variants is unavailable.
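The filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline of any particular tool: the `Variant` record, its field names, the placeholder gene labels, and the 1% allele-frequency cutoff are all assumptions chosen for the example. Genes without recognized cardiac expression are ranked lower rather than discarded, in keeping with the caution noted above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    gene: str
    consequence: str               # eg, "missense", "synonymous", "nonsense"
    allele_freq: Optional[float]   # frequency in reference databases; None if novel
    cardiac_expressed: bool        # gene known to be expressed in heart or vasculature


# Consequence classes retained for further analysis (an assumption for this sketch).
PROTEIN_ALTERING = {"missense", "nonsense", "splice", "frameshift"}


def shortlist(variants: List[Variant], max_allele_freq: float = 0.01) -> List[Variant]:
    """Keep protein-altering variants that are novel or rare, cardiac genes first."""
    kept = [
        v for v in variants
        if v.consequence in PROTEIN_ALTERING
        and (v.allele_freq is None or v.allele_freq <= max_allele_freq)
    ]
    # Stable sort: variants in cardiac-expressed genes come first, but others
    # are kept because cardiac expression may simply not be recognized yet.
    return sorted(kept, key=lambda v: not v.cardiac_expressed)


# Hypothetical input; gene labels are placeholders, not real findings.
candidates = [
    Variant("GENE_A", "missense", None, False),
    Variant("GENE_B", "synonymous", None, True),
    Variant("GENE_C", "missense", 0.25, True),
    Variant("GENE_D", "nonsense", 0.001, True),
]
print([v.gene for v in shortlist(candidates)])
```

The variants that survive this kind of filtering are the ones that would then be submitted to the prediction programs.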

Prediction Methods Available
In this review, we have looked at 8 of the currently available prediction tools for nonsynonymous variants to highlight aspects of how these types of programs work and their relative performance. The methods used and parameters assessed in these 8 programs are summarized in Table 1, with some useful notes about inputs and outputs in Table 2.
Genome sequences that are highly conserved during evolution are thought to be important for protein function, and disease-associated mutations tend to be abundant at these sites. 4,5 Many programs, including PANTHER (Protein Analysis Through Evolutionary Relationships) 9,10 and SIFT (Sorts Intolerant From Tolerant amino acid substitutions), 11-13 rely primarily on the extent of sequence conservation of a specific residue, which is assessed by looking at an alignment of the sequences of this region of the protein across a wide range of different species, that is, a multiple sequence alignment (MSA). Many programs take factors in addition to evolutionary conservation into consideration. In Align-GVGD, 14 the Grantham Deviation (GD) score reflects the biochemical distance between variant and wild-type amino acids at a given residue. Several programs, including PMut, 16 SNPs3D, 17,18 and PolyPhen-2, 19 use varying combinations of sequence-based and protein structure-based features, such as the effect of a variant on protein folding and accessible surface area of the amino acid residue. MutPred 20 is an extension of SIFT that differs most significantly from other programs by its incorporation of predicted functional sites, including DNA-binding residues, catalytic residues, calmodulin-binding targets, and predicted posttranslational modification (phosphorylation, methylation, ubiquitination, glycosylation) sites. A broad range of additional parameters are also included in SNPs&GO, 21 with evaluation of evolutionary data from PANTHER, the sequence environment of a residue (including 18 residues on either side of the variant residue), and a gene ontology (GO) score that derives information about the biological processes, cellular components, and molecular functions of gene products in different species from the GO database. These prediction tools have been benchmarked on large mutation data sets, and although developed for use in classifying human mutations, some of these programs can be applied to bacteria, plants, and other organisms. 29

Table footnote: GD indicates Grantham deviation; GV, Grantham variation; HGMD, Human Gene Mutation Database 1; MSA, multiple sequence alignment; SNP, single-nucleotide polymorphism. *PolyPhen-2 uses 8 sequence-based and 3 structure-based features, including position-specific independent count score of wild-type allele, differences in this score between the wild-type and variant alleles, number of residues observed at the position in the MSA, residue side-chain volume change, variant position with respect to a protein domain defined by Pfam, variant allele congruency to MSA, sequence identity with closest homologue deviating from wild-type allele, normalized accessible surface area of amino acid residue, crystallographic B-factor, and change in accessible surface area propensity for buried residues. †SIFT score, Pfam profile score, and transition frequency (likelihood of observing a given SNP in the UniRef80 database and Protein Data Bank). ‡Predicted secondary structure, solvent accessibility, transmembrane helices, coiled-coil structure, stability, B-factor, and intrinsic disorder.

Table footnote: General ("g") score indicates probability that an amino acid substitution is deleterious; MSA, multiple sequence alignment; property ("p") score, statistical likelihood (P value) that structural and functional properties will be altered; Pdel, deleterious probability; PHD, Profile fed neural network systems from Heidelberg; PSI-BLAST, Position-Specific Iterated Basic Local Alignment Search Tool; subPSEC, substitution position-specific evolutionary conservation score, estimated from the negative logarithm of the probability ratio of wild-type and mutant amino acids at a specific position; WT, wild type. *Except for 7 tumor-related genes in program library.

Example Variants
To further illustrate some of the features of these programs, we used them to make predictions about 18 missense variants that we selected as examples, including 9 rare variants that have robust genetic or functional evidence to implicate them as disease causing in various cardiomyopathies and arrhythmias, 30-37 and 9 common variants implicated in disease susceptibility (Table 3). 38-46 The results of these predictions are shown in Table 4. For the 9 rare variants, the number of variants that were accurately predicted as likely to be deleterious ranged from 2 using PANTHER (22%, although predictions could be made for only 4 variants) to 8 (89%) with SIFT, PolyPhen-2, MutPred, and SNPs&GO. The greatest variability was seen with 2 programs, PANTHER and Align-GVGD, and 3 variants, R403Q MYH7, R92Q TNNT2, and D175N TPM1. For the 9 common variants, with a few exceptions, predictions were overwhelmingly neutral. A closer examination of the factors on which the predictions are based helps to explain these results.

Key Role of Amino Acid Conservation in Predicting Pathogenicity
As noted above, sequences that are highly conserved across species are often functionally important, and high prediction success has been achieved for algorithms that predominantly use evolutionary-based information. 9-13 Sequence-based methods do have their limitations, 47 and this is demonstrated by the predictions generated by PANTHER and Align-GVGD. Although PANTHER is generally reliable when predictions are obtained, 26 it failed to generate predictions for 6 of the 18 variants in our example data set. This may occur if the sequence alignment is poor or when a variant is located at a residue that is not present in a majority of species and hence cannot be modeled in a hidden Markov model. In Align-GVGD, we found wide discordance between sequence conservation (GV) and biochemical change (GD) components for several variants that resulted in a neutral prediction. Sequence conservation appeared to have relatively less weighting than biochemical change because neutral predictions were more likely to be obtained when the GV scores were high and the GD scores were zero (eg, R403Q MYH7, S532P MYH7), rather than the converse situation with low GV and high GD scores (eg, N195K LMNA, Y315S KCNQ1). As a general concept, adding protein structural or functional parameters should provide greater predictive accuracy than consideration of sequence conservation alone, 27 but this only applies when protein structure or function is known and the relevant databases are up to date. Quite commonly, this information is incomplete or lacking, and the predictions have to rely predominantly on the evolutionary conservation component.

The Importance of MSAs in Predictions
The number of species in an MSA and the evolutionary distance between them heavily influence algorithm accuracy. Evolutionary depth in MSAs is recommended because this potentially provides more information about the extent of conservation. If sequences in the MSA are too similar (eg, dog, pig, human), then variants not normally imparting a functional consequence on the protein will tend to be classified as pathogenic. On the other hand, comparing a broader range of species, such as small rodents (rat, mouse), zebrafish, fly, and worm, may strengthen the case for a variant in a highly conserved residue being pathogenic, but may also produce false negatives if there is divergence in the protein sequences and biological functions of more distantly related species. 7 Similarly, there are no clear indications about whether inclusion of different protein isoforms and different members of the same protein family will strengthen or weaken predictions. In 1 comparative study, PolyPhen-2 appeared to be least susceptible to differences in the MSAs, whereas Align-GVGD was highly susceptible and had a propensity to call variants as neutral when large numbers of sequences were utilized. 27 It has been noted that programs do not always perform best with their own program-generated MSA and can have more accurate results with gene-specific MSAs that have been optimized by the user. PANTHER, SNPs3D, MutPred, and SNPs&GO generate MSAs internally and do not allow users to create and submit their own MSAs. SIFT and PMut internally generate an alignment but also permit user-generated alignments. The Web-server version of PolyPhen-2 has its own alignment pipeline, but user-generated alignments can be submitted to the stand-alone software version, which can be downloaded onto a local computer. Align-GVGD has a very limited set of alignments, so users mostly need to supply their own. Supplying a user-generated alignment gives greater control over the sequences included and the flexibility to add or remove sequences in the MSA, but entails considerable additional work to obtain and align the relevant protein sequences. There is also the real possibility of skewing the results by variations in the numbers and types of species selected to be included in the MSA.
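The influence of alignment depth on apparent conservation can be illustrated with a toy calculation. The Python sketch below scores a residue simply as the fraction of aligned sequences that share the reference amino acid; this is a deliberately crude stand-in for the conservation measures used by the programs discussed here, and the function name and the two small alignments are hypothetical.

```python
from typing import List


def residue_conservation(msa: List[str], position: int) -> float:
    """Fraction of non-gap residues in an alignment column that match the
    first (reference) sequence. Assumes the reference has no gap there."""
    column = [seq[position] for seq in msa]
    reference = column[0]
    observed = [residue for residue in column if residue != "-"]
    return sum(residue == reference for residue in observed) / len(observed)


# Hypothetical alignments: closely related species only, then the same
# sequences plus more distantly related ones (reference sequence first).
shallow_msa = ["MKRLA", "MKRLA", "MKRLA"]
deep_msa = shallow_msa + ["MKQLA", "MRKLG", "MK-LA"]

for name, msa in [("shallow", shallow_msa), ("deep", deep_msa)]:
    print(name, residue_conservation(msa, 2))
```

In the shallow alignment the third residue appears invariant, so any substitution there would look suspicious, whereas the deeper alignment shows that other amino acids are tolerated at this position. The same column can therefore support different conclusions depending on which sequences are included.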
MSAs can be obtained from the Pfam (protein families) database 48 or manually curated and then aligned using freely available online alignment tools, such as the widely used programs ClustalW2, 49 MAFFT, 50 MUSCLE, 51 PROMALS, 52 and T-Coffee. 53 Alignments produced by the different programs for specific regions can differ, however, and it has been suggested that more than 1 MSA program may be required, particularly for sequences that contain deletions or insertions. A number of scoring systems have been devised to assess the quality of MSAs, with the overall conclusion that, as with the protein prediction programs, no single flawless method is available. 54-56

Location, Location, Location
Significant discrepancies between bioinformatics predictions and experimentally validated effects often arise because the functional characteristics of the region in which a variant is located are inadequately taken into account. Amino acid changes that have modest pathogenicity predictions may [...] Examples of the importance of the protein "neighborhood" are provided by the R403Q MYH7, R92Q TNNT2, and D175N TPM1 variants. The Arg403Gln mutation in the gene encoding myosin heavy chain (MYH7) causes hypertrophic cardiomyopathy in humans and in mice. 31 The R403 residue is located in the myosin head adjacent to the actin-binding site and is invariant in myosin heavy chains in the heart and other tissues across a range of species from human to amoeba. 31 Although this high degree of sequence conservation and the biophysical effects of loss of an arginine can be assessed by the prediction algorithms, none of the programs would have considered the key role of the 403 residue in actin-myosin interaction, calcium sensitivity, and energy utilization. A similar argument can be made for the R92Q TNNT2 variant, which is in the elongated tail domain of cardiac troponin T at the site where the tropomyosin monomers overlap. This variant has been shown to have distinct effects on calcium sensitivity and thin filament sliding speed in vitro and results in a hypertrophic cardiomyopathy phenotype in mice, 32 yet only 4 of the 8 programs used predicted it to be probably (n=3) or possibly (n=1) deleterious. The D175N TPM1 variant, located in the troponin T-binding site in tropomyosin, was also only identified by 5 of the 8 programs as probably (n=4) or possibly (n=1) deleterious despite robust genetic and in vivo functional evidence of pathogenicity. 35

Table 4 (continued) footnote: GD indicates Grantham deviation; GV, Grantham variation; HD, HumDiv; HMM, hidden Markov model; HV, HumVar; Pdel, probability of deleterious effect; sens, sensitivity; spec, specificity; SVM, support vector machine; subPSEC, substitution position-specific evolutionary conservation. *Probability of deleterious outcome is indicated by cell color: high, red; intermediate, orange; low, green. Predictions were obtained using the web browsers Firefox 5.0.1 (or Safari 5.0.5 for Align-GVGD) using all default settings of the programs. For PANTHER, SIFT, PMut, SNPs3D, PolyPhen-2, MutPred, and SNPs&GO, where MSAs are program generated, WT protein sequences were submitted. For Align-GVGD, alignments were user generated. Alignments for Align-GVGD were manually curated using the T-Coffee Advanced tool and the program's specifications for an appropriate MSA.

Rare Versus Common Variants
Genetic variation is being recognized increasingly to play a role in many cardiovascular disorders. 57,58 At one end of the spectrum, single-gene variants that have a large functional effect have been considered sufficient to cause disease in families with Mendelian patterns of inheritance. These variants are typically rarely present in the general population, and many are "private" mutations seen only in 1 family. Single rare variants have been associated with numerous heritable cardiomyopathies and arrhythmias, including familial hypertrophic cardiomyopathy, familial dilated cardiomyopathy, arrhythmogenic right ventricular cardiomyopathy and long QT syndrome. In contrast, commonly occurring genetic variants have been associated with complex traits such as hypertension, coronary artery disease, diabetes, and atrial fibrillation (the common disease, common variant hypothesis). Common SNPs can be identified by genome-wide association studies in large cohorts of affected and unaffected individuals. These types of variants are potentially important because of their relatively high-population frequencies, although the risks associated with each variant may only be modest. Recently, human genome sequencing studies have heightened interest in the potential role of rare variants in common diseases. 3,59-63 A new paradigm has been proposed in which the cumulative burden of unique personal combinations of rare variants may contribute substantially to the heritable component of complex disease.
These perspectives on the role of genetics need to be kept in mind when considering the performance of gene variant functional predictions. A striking finding in our example variants was the difference between predictions for rare and common variants. Whereas the known functional rare variants were correctly predicted by a majority of programs as deleterious, the common variants were mostly predicted as nondeleterious. There are several factors that might explain this discrepancy. First, it is important to note that common SNPs that show significant associations with disease in genome-wide association studies are usually not the causal variants themselves but are markers for a pathogenic SNP that is coinherited in the same haplotype. For example, A1101V MYH6 was significantly associated with heart rate, and to a lesser extent with PR interval, in a study of more than 20 000 individuals. 38 The uniformly neutral predictions for A1101V MYH6 may in fact be correct if the trait is not directly attributable to this SNP. Patients carrying the M235T AGT SNP have increased plasma angiotensinogen levels and increased risk of hypertension. 39 Although 1 program, SNPs3D, had a pathogenic prediction, the same argument can be made that M235T AGT might only be a marker of a risk allele. In contrast to the A1101V MYH6 and M235T AGT SNPs, several of the variants in genes encoding cardiac ion channels have had direct experimental validation of deleterious effects. For example, K897T KCNH2 changes the biophysical properties of the IKr current and also creates a new phosphorylation site for Akt protein kinase that inhibits channel activity. 41,42 Despite these findings, only 2 of the 8 programs (SIFT, PMut) predicted pathogenic effects. Even MutPred, which includes posttranslational modification site prediction, did not call this SNP as pathogenic. S38G KCNE1 has loss-of-function effects on IKs, 43 whereas H558R SCN5A is a potent modifier of INa, with effects that vary with different genetic backgrounds. 44 SNPs&GO predicted S38G KCNE1 as pathogenic, but all other programs predicted both variants to be neutral. These differences between predictions and experimental data for ion channel variants may be a result of the locations of these variants in gene-specific functional domains that are not taken into consideration by prediction algorithms (as noted above). Alternatively, these findings may indicate that bioinformatics tools are relatively better at predicting pathogenic rare variants that have large functional effects than common variants that have more modest functional effects.

Which Method Is Best?
Most of the prediction programs have been benchmarked by their curators using large variant data sets and have been shown to perform well (Table 1). However, there are relatively few studies that have systematically compared the predictive accuracy of different programs in the same test data set. This can be a difficult exercise because the various types of outputs may not be readily standardized. In addition, because each of the programs obtains sequence and/or structural information from different databases, there may be confounding factors of conflicting or missing information. Also, if a data set for testing a program's accuracy is similar to its training data set, bias occurs, and misleading inferences of a program's superior performance can arise. The creators of PMut even state that its algorithm was trained using alignments in the Pfam database, so better prediction performance is expected toward Pfam alignments. 16

The results of 5 comparative studies are shown in Table 5. Chan and colleagues 22 evaluated 254 missense variants using SIFT, PolyPhen, Align-GVGD, and the BLOSUM62 matrix. The overall accuracies (based on the sum of true-positive and true-negative rates) for single programs were not dissimilar, ranging from 73% (Align-GVGD) to 82% (SIFT). It was noted that the programs with higher sensitivity detected more deleterious variants but had lower specificity, whereas programs with lower sensitivity but high specificity better predicted neutral variants and had fewer false positives for deleterious variants. Wei and colleagues 24 looked at 204 variants with 6 programs and concluded that SIFT and PolyPhen were the overall top predictors, followed by nsSNPAnalyzer. Hicks and colleagues 27 found that SIFT, Align-GVGD, PolyPhen-2, and Xvar had similar overall accuracy when optimal MSAs were provided for each program. Align-GVGD had a very low median sensitivity (10%) and high median specificity (>95%), but these results were considered unreliable, given the bias for negative predictions with large MSAs. Because Align-GVGD performed best with a manually curated MSA, it was considered less suitable for use in large-scale sequencing analyses. The speed of the program and the number of variants that can be inputted are other criteria that limit the suitability of most programs for use in next-generation sequencing analysis. To meet these needs, Schwarz and colleagues have developed MutationTaster. 25 When compared with PANTHER, PolyPhen, PolyPhen-2, PMut, and SNAP, in a training set of 1000 disease-linked variants and 1000 SNPs, MutationTaster was found to have the highest accuracy (86%) and was substantially faster than the other programs studied. In the most comprehensive analysis to date, Thusberg and colleagues 26 utilized 9 programs to evaluate more than 40 000 variants in several databases, including dbSNP, PhenCode, LSDBs (locus-specific mutation databases), and IDbases (LSDBs for immunodeficiency-causing mutations). These authors concluded that no single method could be rated as best by all parameters but that SNPs&GO and MutPred were overall superior to other programs tested.
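For readers less familiar with these performance measures, the following Python sketch shows how sensitivity, specificity, and overall accuracy are conventionally derived from counts of correct and incorrect calls. The definitions are the standard ones and may differ in detail from those used in the individual comparative studies; the function name and the counts are invented for illustration and do not come from Table 5.

```python
from typing import Tuple


def performance(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, accuracy) from a 2x2 table in which
    'positive' means that a variant is called deleterious."""
    sensitivity = tp / (tp + fn)                 # deleterious variants detected
    specificity = tn / (tn + fp)                 # neutral variants called neutral
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all correct calls
    return sensitivity, specificity, accuracy


# Hypothetical program that rarely calls a variant deleterious, tested on
# 100 known deleterious and 100 known neutral variants.
sens, spec, acc = performance(tp=10, fp=4, tn=96, fn=90)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

In this made-up case the program has high specificity but misses most deleterious variants, which shows why a single summary figure can be misleading when sensitivity and specificity diverge.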

Consensus Predictions
Several groups have proposed that using the consensus predictions of a number of programs may be more reliable than using a single program. 22-24 For example, in the analysis by Chan and colleagues, 22 the 4 programs tested gave concordant results for only 63% of the variants. However, when this occurred, the overall predictive value increased to 88%. Similarly, Wei and colleagues 24 found that when different combinations of programs were used, the consensus of 5 programs (SNPs3D excluded) gave the best total accuracy (73%). In our example variants, we found that no program predicted all rare variants as pathogenic. Seven of the 9 rare variants had consensus predictions by SIFT and PolyPhen-2, and all 9 rare variants were identified correctly as deleterious when other combinations of 2 methods were used, for example, SIFT and PolyPhen-2 or MutPred or SNPs&GO. For the 9 common variants, with the exception of G389R ADRB1, the combined predictions of multiple programs did not increase the number of positive predictions.
Although confidence in a result may be increased if concordant results are obtained with a number of programs, some pathogenic variants may be missed. On the other hand, having less stringent criteria, such as requiring any 1 program to predict a deleterious effect, will increase the chances that all the true positives will be detected but may also result in more false-positive results. A further consideration is that output similarities may be consequences of the similarity of inputs for some combinations of programs and do not necessarily equate with greater prediction accuracy.
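The trade-off between stringent and lenient combination rules can be expressed in a short Python sketch. The three rule names ("consensus", "majority", "any"), the function `combine_calls`, and the example calls are our own illustrative choices; they do not correspond to the scoring scheme of any published tool or to the entries in Table 4.

```python
from typing import Dict, Optional


def combine_calls(calls: Dict[str, Optional[bool]], rule: str = "consensus") -> Optional[bool]:
    """Combine per-program calls (True = deleterious, False = neutral,
    None = no prediction). Returns None when no combined call can be made."""
    made = [call for call in calls.values() if call is not None]
    if not made:
        return None
    n_deleterious = sum(made)
    if rule == "consensus":          # every program that made a call must agree
        if n_deleterious == len(made):
            return True
        if n_deleterious == 0:
            return False
        return None                  # discordant predictions
    if rule == "majority":           # more than half call deleterious (ties are neutral)
        return n_deleterious * 2 > len(made)
    if rule == "any":                # a single deleterious call is enough
        return n_deleterious >= 1
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical calls for one variant.
example = {"SIFT": True, "PolyPhen-2": True, "PANTHER": None, "Align-GVGD": False}
for rule in ("consensus", "majority", "any"):
    print(rule, combine_calls(example, rule))
```

With these example calls the strict rule returns no combined call, whereas the two lenient rules flag the variant. This mirrors the point made above: stricter rules raise confidence in the variants that are flagged at the cost of missing some, and lenient rules do the reverse.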
The comparative studies outlined above have been benchmarked using variants that have been predetermined to be deleterious or benign. The performance of these methods on a genome-wide scale in which there are many thousands of variants of unknown function has been less extensively evaluated. Chun and Fay compared SIFT and PolyPhen with their likelihood ratio test (LRT) in an evaluation of 3 human genomes. 23 Surprisingly, 76% of variants were predicted as deleterious by only 1 program, and only 5% of variants were predicted as deleterious by all 3 programs. These authors proposed that it was the small proportion of variants with consensus predictions that was most likely to be functionally significant. This is a very important point that warrants further validation. Although using multiple prediction programs for each variant is desirable, this is time consuming and impractical on a large scale. To address this issue, Liu and colleagues have recently developed dbNSFP (database for nonsynonymous SNPs' functional prediction). 28 This method integrates pathogenicity predictions from SIFT, PolyPhen-2, LRT, and MutationTaster into a single application.

Recommendations
The selection of pathogenicity prediction programs depends very much on the situation and the type of data being interrogated. When there are only a small number of specific variants under consideration, for example, in a family that has undergone linkage analysis and sequencing of candidate genes in a disease interval or with a family in which genetic testing of known disease genes has been performed, a detailed analysis is warranted, and it is highly recommended that a number of prediction programs be used. We have routinely used SIFT, PolyPhen-2, PMut, and SNPs&GO and have recently added MutationTaster to our suite of preferred programs. The selection of programs is probably less critical than looking at consensus predictions (when all programs agree) or majority predictions (when most programs agree). At present, only a subset of programs (including SIFT, PolyPhen-2, and MutationTaster) have batch modes that allow multiple variants to be simultaneously inputted and are suitable for analyzing large next-generation sequencing data sets. In the next few years, it is likely that many more programs will be adapted for this use.

Use of Gene Variant Prediction Programs in Genetic Testing
Genetics studies in families have generally been performed by research groups seeking to decipher molecular mechanisms of disease. As a result of these studies, lists of disease genes have been established for many of the inherited cardiomyopathies and arrhythmias. Commercial genetic testing of subsets of the more common of these disease genes is now available, and expert consensus recommendations for indications for genetic testing have recently been compiled by the Heart Failure Society of America, the Heart Rhythm Society, and the European Heart Rhythm Association. 64,65 Healthcare professionals are now empowered to send off patient DNA samples for genetic testing, and informed interpretation of the results is crucial.
If the results for a family proband DNA sample come back as positive, showing a variant in gene X that is "probably pathogenic," it cannot necessarily be assumed that this is the disease-causing mutation in the family, and a number of questions need to be asked initially along the lines of the flowchart in Figure 1. One needs to know whether the variant is novel, rare, or commonly present in a population whose ethnicity is similar to that of the family being studied. As noted above, disease-causing mutations are nearly always rare and are often novel. The genes on genetic testing panels have all been preselected on the basis of known associations with cardiac disease, but it is useful to know whether the same variants, other variants at the same amino acid residue, or variants in neighboring residues in these genes have previously been identified with the same disorder or other cardiac disorders. This information can be obtained by searching mutation databases or the published literature. Bioinformatics tools have undoubtedly been used to come to the "probably pathogenic" annotation, and it is useful to know which programs and how many programs were used and the criteria used to define pathogenicity. We now know that every individual carries hundreds of novel potentially pathogenic variants, 3,66 and so additional steps should be taken to make a case for a particular variant being disease causing. Determining whether a variant cosegregates with disease status in a family is a key factor in assessing its likely role in disease. Clinical evaluation of all first-degree relatives of an index case with suspected heritable disease should be performed and blood samples taken for DNA analysis. The presence or absence of a variant in family DNA samples can be readily ascertained by simple tests, such as polymerase chain reaction and sequencing.
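Once clinical status and genotype are known for each relative, a first-pass check of cosegregation amounts to a simple cross-tabulation. The sketch below is illustrative only: the `Relative` record, the `segregation_summary` function, and the pedigree labels are hypothetical, and the rule used (every affected relative carries the variant) is a deliberately simple assumption, not a formal linkage or likelihood analysis.

```python
from dataclasses import dataclass


@dataclass
class Relative:
    name: str       # pedigree label, e.g. "II-2" (hypothetical)
    affected: bool  # clinical status after evaluation
    carrier: bool   # variant present on PCR and sequencing


def segregation_summary(relatives):
    """Cross-tabulate clinical status against variant status in one family."""
    affected_carriers = [r.name for r in relatives if r.affected and r.carrier]
    # Affected relatives without the variant argue against cosegregation.
    affected_noncarriers = [r.name for r in relatives if r.affected and not r.carrier]
    # Unaffected carriers are listed separately for clinical review.
    unaffected_carriers = [r.name for r in relatives if not r.affected and r.carrier]
    return {
        "affected_carriers": affected_carriers,
        "affected_noncarriers": affected_noncarriers,
        "unaffected_carriers": unaffected_carriers,
        # Simple rule (an assumption): all affected relatives carry the variant.
        "cosegregates": bool(affected_carriers) and not affected_noncarriers,
    }


# Hypothetical family of 4 genotyped relatives.
family = [
    Relative("I-1", affected=True, carrier=True),
    Relative("II-1", affected=True, carrier=True),
    Relative("II-2", affected=False, carrier=True),
    Relative("II-3", affected=False, carrier=False),
]
summary = segregation_summary(family)
```

The output flags relatives whose status needs closer review rather than delivering a verdict; in this example, relative II-2 carries the variant but is clinically unaffected.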
Factors such as variable expressivity, incomplete penetrance, and phenocopies need to be taken into account when assessing variant segregation in a family. Even if a variant does cosegregate with the family phenotype, however, this cannot be regarded as definitive evidence of disease causation. The final interpretation of clinical significance relies on a considered balance of probabilities and is ideally performed in the setting of a multidisciplinary clinic in which pretest and posttest genetic counseling is provided. The role of genetics in clinical practice is likely to increase exponentially in the near future as whole-genome sequencing to document personal genomes becomes more readily available. This type of information will take genetics beyond looking for rare disease-causing variants in families to assessment of a single patient's risk of developing common diseases and responses to drug therapies. 67

Future Directions
This is a rapidly moving field, and the need for faster and more comprehensive prediction tools is growing in parallel with the exponential growth in the use of next-generation sequencing. In the short term, submission inputs/outputs for prediction programs need to be streamlined, database resources need to be updated and maintained, quantitative and standardized measures of accuracy and reliability are required, and gene-specific functional domain information should be taken into account. In addition to refining methods to assess nonsynonymous variants, there is an ongoing need to look at other types of variants and parameters. VAAST, recently developed by Yandell and colleagues, 68 was designed specifically to analyze next-generation sequencing data and includes scoring of a broad range of coding and noncoding genetic variants, as well as incorporation of pedigree data. Comprehensive programs such as this will be invaluable for looking at the role of rare variants in both rare and common disorders. A generic limitation of all programs is the focus on single variants, and future refinements of genomic prediction tools would ideally incorporate evaluation of clusters of variants and their interactions. 8,69 The extent to which the cardiac "environment" can affect gene variant effects is also an important question. 70 The development of integrative strategies that can delineate unique individual cardiac substrates for disease is a daunting task but will ultimately be required to successfully implement personalized approaches to medical diagnosis and management.