Integration of Large-Scale Genomic Data Sources With Evolutionary History Reveals Novel Genetic Loci for Congenital Heart Disease

Supplemental Digital Content is available in the text.


Background
Congenital heart disease (CHD) is the most prevalent birth defect in humans, occurring in approximately 8 per 1000 live births, and consisting of malformation of the heart and/or the great vessels 1 . Around 20% of all CHDs can be attributed to chromosomal imbalances such as Down and Turner, and 22q11 deletion syndromes; around 80% occur as sporadic, non-syndromic CHD.
In such cases, CHD behaves overall as a genetically complex trait with moderate heritability.
Previous genome-wide investigations into CHD have found evidence for rare causative copy number variants (CNVs) and single nucleotide variants (SNVs); and associations with common SNVs in GWAS [2][3][4][5][6] . It has been estimated in previous studies that several hundred genes may be involved in polygenic CHD susceptibility; therefore, many remain to be discovered 7 .
CNVs are 1 kilobase (kb) to several megabase (Mb) sized regions of duplication (DUP) and deletion (DEL) in the genome. A 2014 meta-analysis of CNVs in 1694 non-syndromic CHD cases identified 79 chromosomal regions in which 5 or more CHD cases had overlapping imbalances 5 . The estimated prevalence of pathogenic CNVs in non-syndromic CHD patients is 4-14%, whereas in syndromic CHD patients it is 15-20% (the most common being 22q11 deletion syndrome) 3,8,9 . There are multiple mechanisms by which a CNV may lead to disease including the disruption of chromosome structure, alteration of gene expression due to disruption of regulatory elements, and changes of the relative amounts of dosage-sensitive genes 10 .
The dosage-balance model postulates that, for genes that are in stoichiometric relationships (for example forming protein complexes with other genes), any perturbation in their relative ratios will tend to be deleterious 10 . In the early course of vertebrate evolution, around 500 million years ago, two whole-genome duplications (WGD), during which gene stoichiometry throughout the genome was preserved, as all genes were duplicated, took place. Periods of gene loss followed each of these events, resulting in the retention of some WGD paralogs in the genome (termed "ohnologs") and the loss of others. The dosage-balance model would predict that ohnologs should be enriched for dosage-sensitive genes. 11 Ohnologs, of which there are around 7,000 in the human genome, have indeed been shown to exhibit characteristics consistent with dosage-sensitivity: for example, ohnologs are enriched for haploinsufficient genes 11,12 ; and Makino et al. reported, based on CNV data in healthy individuals from the Database of Genomic Variants (DGV), that genomic regions (~2Mb in size) near ohnologs are CNV deserts, indicating that those regions are dosage-balanced 13 .
The formation and fixation of gene duplications within the genome is subject to different evolutionary mechanisms -small scale duplications (SSD) involving relatively few genes, and WGD. A strong relationship between the evolutionary mechanism of duplication and phenotypic consequences, including heritable diseases, has been previously shown 14 15, 16 . Ohnologs have a significant association with certain human genetic diseases; for example 12 out of 16 reported candidate genes within the Down syndrome critical region (21q22.12, 21q22.13 and 21q22.2) are dosage-balanced ohnologs 11 . By contrast, genes arising from SSDs lack enrichment for disease association 17 . In addition, ohnologs are enriched for genes involved in signalling and gene regulation, key cardiovascular developmental processes 11 . These considerations led us to hypothesise that ohnologs may be enriched among CHD causative genes.
We tested this hypothesis in a meta-analysis of CNV data including 4,634 non-syndromic CHD cases, and integrated these data with a whole-exome sequencing (WES) study of 829 cases of Tetralogy of Fallot (TOF), the commonest cyanotic CHD phenotype, which has been previously shown to have a significant aetiological contribution from CNVs 6 . These were compared with control data, which were derived from large-scale genomic resources [18][19][20][21] .

Methods
The appropriate institutional review bodies approved all recruitment of human participants in this study. The study corresponded with the stipulations of the Declaration of Helsinki, and all participants (or their parents, if affected probands were children too young to themselves consent) provided informed consent. Data from consortia were accessed subject to the applicable data-sharing agreements. Summary data, analytic methods, and summary study materials will be made available to other researchers for purposes of reproducing the results or replicating the analyses reported here, on request to the corresponding authors. Full Materials and Methods are available in the Data supplement of the article.

Update of CHD CNV dataset and generation of control CNV dataset
We updated the previous meta-analysis of CNVs in non-syndromic CHD cases 5  A control CNV dataset was generated by acquiring CNV data from individuals not explicitly identified as having a developmental disorder, who were enrolled in the 1000 Genome Project Phase 3, DGV, DECIPHER, and published studies 21,27,28,33,34 . The control CNV dataset resulted in 256,511 DEL CNVs, 84,343 DUP CNVs and 6,403 BOTH CNVs, i.e. either DEL or DUP. gnomAD CNVs 35 were incorporated into the analysis as they became available, and resulted in an additional 51,420 DUP CNVs and 198,611 DEL CNVs.

Comparison of CHD CNV regions with control CNV regions
All CHD DEL and DUP CNV regions (coordinates hg19) were compared against control DEL and DUP CNV regions, respectively. Any CHD CNV regions overlapping control CNV regions were excluded. As a result, we identified DEL and DUP CNV regions only seen in nonsyndromic CHD cases. The genes located in those regions were annotated using the Ensembl database. There were a number of genes that already had an assigned phenotype (OMIM) 36 ; among these, 59 had been previously associated with CHD pathogenesis such as ZIC3, NKX2-6, GATA4, JAG1, GJA1 and TBX5. All genes with OMIM assigned phenotypes were excluded from further analysis.
Novel genes found in the CNV regions only seen in CHD cases were then compared to an in-house list of 12,771 genes with novel or rare SNVs (either absent from ExAC or with frequency of <0.01) from WES data in 829 TOF cases 6 . Genes supported by both CNV and WES data were included for further analysis. In total, 3,082 genes in DEL CNVs, 4,297 genes in DUP CNVs and 3,068 in BOTH CNVs (i.e. genes found in DEL and DUP CNVs) were also found in the TOF WES data with either high (nonsense variants, frameshift, splice variants) or medium (missense, splice variants) impact SNVs. This intersection of CNV and WES data led to an overall reduction of ~60% in the number of candidate genes for CHD ( Figure 2).

Ohnologs are highly enriched in CHD cases whereas small-scale duplications (SSD) and
singleton genes are not.
Ohnologs (N=7,023) were identified using data from Singh et al (2015) 37 , available at http://ohnologs.curie.fr/. SSDs (N=7,014) were extracted from Ensembl gene trees 12 . Any remaining genes that were neither found in the ohnolog dataset nor identified as having a direct paralog were considered for the purpose of this study to be singletons. The frequencies of ohnologs, SSDs and singletons among the candidate CHD genes were compared with their frequency in the human genome. Novel genes supported by the CNV data in CHD cases were found to be enriched for ohnologs (14.65% vs 12.05%, χ 2 test, p<0.0001, OR=1.253, 95% CI: 1.199-1.309,) ( Figure 3A). There were no differences in SSDs ( Figure 3B) and an underrepresentation for singletons ( Figure 3C) compared to the human genome. There was a 2.3-fold increased enrichment of ohnologs in the genes supported by both CNV and WES data in CHD cases (χ 2 test, p< 0.0001, OR=3.751, 95% CI: 3.574-3.937). In this instance, SSDs were also enriched in CHD cases compared to the human genome (χ 2 test, p< 0.0001, OR=1.437, 95% CI: 1.356-1.905). However, ohnologs were 2-times elevated compared to SSD genes (33.94% versus 16.43%). Additionally, we assessed our methodology by applying it to a group of genes with strong a priori evidence for pathogenicity. The crowd-sourced Genomics England "PanelApp" gene list for CHD (available at https://panelapp.genomicsengland.co.uk/panels/212/), which represents a consensus view of causative genes, was highly enriched for ohnologs (76.6% vs 12.05%, χ 2 test, p<0.0001, OR=23.89, 95% CI: 12.33-46.18). We therefore used ohnolog status as an additional candidate gene filter.

Candidate genes supported by both CNV and WES data of CHD cases
In order to further refine our candidate genes, we integrated additional genomic resources including the top 5% ExAC CNV intolerance scores, probability of haploinsuffieciency (pHI) 38 , probability of loss-of-function intolerance (pLI) 19 , and RNAseq expression data from mouse embryonic hearts 39 . Lastly, we incorporated ohnolog status. Genes from BOTH CNVs were analysed twice; once with the metrics used for genes from DEL CNVs and once with the metrics used for genes from DUP CNVs (Figure 4). This led to the identification of 9 candidate genes from DEL and BOTH CNVs: BRWD1, DIP2C, EYA3, GRB10, HNRNPC, RC3H2, SLIT3, TLN1 and UBASH3B. All 9 have the following properties: a) loss-of-function (LoF) variants in the WES data, b) found in DEL or BOTH CNV regions only seen in non-syndromic CHD cases, c) top 5% of ExAC DEL CNV intolerance scores, d) haploinsufficient (pHI≥0.65) and/or unable to tolerate LoF variants (pLI≥0.9), e) in the top 25% of highly expressed genes in mouse heart at E9.5 and/or E14.5, f) ohnolog, g) not present in the list of genes curated from the DDD study, h) not classified as human non-essential genes from the Sudmant study 21 (Table 2).
In addition, we found 45 candidate genes from DUP and BOTH CNVs, which had the following properties: a) high or medium impact SNVs in the WES data, b) found in DUP and BOTH CNV regions only seen in non-syndromic CHD cases, c) top 5% of ExAC DUP CNV intolerance scores, d) in the top 25% of highly expressed genes in mouse heart at E9.5 and/or E14.5, e) ohnolog, f) not present in the list of genes curated from the DDD study, g) not in the list of non-essential human genes from the Sudmant study 21 (Table 2).

Pathway enrichment and gene ontology analysis
We performed pathway enrichment analysis, using the Reactome Pathways Analysis tool 40 , on the final 54 candidate genes supported by both CNV and WES data in non-syndromic CHD cases. This resulted in 11 pathways, where >5 of our candidate genes were involved in those pathways ( Table 3). The top 3 pathways based on entities ratio (entities found/total entities) from Reactome were ''axon guidance'', ''signalling by receptor tyrosine kinases'' and ''cellular responses to external stimuli''. In addition, Ingenuity pathway analysis (IPA) was also used with the only pathway including >5 genes being ''axon guidance signalling''. Gene ontology analysis 41  and SLIT3 are involved ( Figure 5).

SLIT2 and SLIT3 variants in CHD
SLIT2 and SLIT3 were the most strongly supported genes found both by pathway analysis and gene ontology ( Figure 5). Therefore, we further explored the phenotypic associations of these genes within our population.
In the present study, individuals with CNVs including SLIT3 were reported with malformation of the heart and great vessels (n=1), VSD (n=1), atrial septal defect (n=3) and TOF (n=1) whereas individuals with SLIT2 CNVs were reported with malformation of the heart and great vessels (n=1), VSD (n=2) and double outlet right ventricle (n=1). In addition, 20 missense SNVs and 3 splice-site SNVs in SLIT3 were found in 24 out of 829 TOF cases (2.9%, 95% CI: 1.91%-4.35%) and SLIT2 had 12 missense SNVs and 2 splice-site SNVs in 14 out of 829 TOF cases (1.7%, 95% CI: 0.9%-2.9%). Probands were available for 12 SLIT3 variants and 5 SLIT2 variants which were confirmed by Sanger sequencing. Remaining variants were confirmed to have good coverage using Integrative Genomics Viewer (IGV). Samples from both parents were available for 9 probands with SLIT3 variants and were analysed for variant inheritance; 2 of the 9 SLIT3 variants tested were identified as de novo. Samples from both parents were available for 5 probands with SLIT2 variants and were all either maternally or paternally inherited.

Discussion
Here, we performed a large-scale genome-wide meta-analysis study of non-syndromic CHD cases and identified 54 novel candidate genes for CHD. In addition to the large size of our dataset, we incorporated a novel analysis strategy incorporating the evolutionary origin of gene duplications. Ohnologs tend not to be observed in CNVs in vertebrate genomes 13  Pathway analysis and gene ontology analysis concordantly identified the SLIT2 and SLIT3 genes, which have recently received increasing attention in heart development 42 . In vertebrates, the Slit family comprises of 3 known members (SLIT1-3), which are highly conserved secreted proteins that bind to Roundabout (ROBO) receptors. SLIT2 and SLIT3 are expressed during mouse embryonic development and interact with ROBO1 and ROBO2 43,44 .
They both encode proteins that consist of 4 LRR domains (leucine-rich repeats) also called D1-D4, 8 EGF repeats (epidermal growth factor) and 1 Laminin-G-like domain 43 . Slit3 is expressed early in murine cardiogenesis in the cardiac crescent at E7.5 and linear heart tube at E8.5, later expression being restricted to the myocardium of the atria and OFT but not in the cardiac cushions or valves 44,45 . Slit2 is strongly expressed in the pharyngeal region at E8.5-E9.5 and later in the ventricular trabecular myocardium, epicardium, aortic semilunar valves and the mesenchyme surrounding the caval veins 44,46 . Slit3 -/mutant mice exhibit ventricular septal defect (VSD), thick atrioventricular valves and hypoplastic posterior aortic semilunar leaflet with Slit2 -/mutant mice exhibiting bicuspid aortic valves and immature semilunar valves 44 . Robo1 -/mutant mice also exhibit VSD with down-regulation of NOTCH signalling, suggesting a potential mechanism for the underlying defects 44 . In another study, Slit3 -/mice also exhibit congenital diaphragmatic hernia 47 . Congenital heart defects ranging from bicuspid aortic valves to septal and outflow tract defects are therefore observed in variety of animal models in which genes involved in the SLIT/ROBO pathway have been inactivated.
We identified SLIT2 and SLIT3 heterozygous SNVs in 2.9% and 1.7% non-syndromic TOF patients, respectively. All SNVs were novel or rare (either absent from ExAC or with frequency of <0.01) and predicted with in silico tools to be pathogenic. The majority of the SNVs in both genes were missense, although a few splice site SNVs were also found. Their functional relevance will be of interest in future studies. Both genes were also present in CNVs in CHD patients with varying phenotypes including septal defects and malformation of the great arteries. This is the first study to find an association of SLIT2 and SLIT3 with predisposition to human CHD, although, of note, a recent study identified ROBO1 LOF SNVs in cases with TOF and septal defects 48 .

Limitations
This study has certain limitations. The databases and publications included in this analysis incorporated different CNV detection platforms and analysis algorithms 49 . Irrespective of the method used in the studies identifying pathogenic CNVs, we only included studies that used the same genotyping method between cases and controls and confirmed their CNV detection by an additional methodology like qPCR, which to a degree addresses this limitation. Another potential limitation is the fact that during our filtering strategy we might have missed some important genes crucial for cardiac development. Though we accept that all important genes will not have been captured, we detected 54 strong candidate genes supported by multiple lines of evidence as having a causative role in non-syndromic CHD. Further research in much larger numbers of comprehensively genetically characterised CHD cases is warranted to establish the magnitude of the contribution of these genes, and to discover novel loci.
In conclusion, we show that ohnologs are over-represented in CHD cases and that incorporation of the evolutionary origins of genes is useful in refining candidate genes emerging from large-scale genetic evaluations of CHD. We also observe that CNVs and SNVs in SLIT2 and SLIT3 are associated with CHD involving TOF, septal defects and outflow tract defects, supporting the importance of the SLIT-ROBO signalling pathway in heart development.
https://decipher.sanger.ac.uk/ and via email from decipher@sanger.ac.uk. Funding for the DECIPHER project was provided by the Wellcome Trust 20 .   Table 2. Candidate genes supported by both CNV and WES data of CHD cases. 54 protein-coding candidate genes supported by CNV and WES data in non-syndromic CHD cases. All genes in the list are strict ohnologs. Data presented includes the Ensembl ID and the nature of the chromosomal imbalance for which the gene is either deleted (DEL), duplicated (DUP) or deleted/duplicated (BOTH).   ohnologs B) small scale duplications (SSD) and C) singletons. Statistical significance was tested using two-tailed Chi-square test with Yates's correction, p<0.05 was considered statistically significant.

Figure 4:
Filtering process using large-scale genomic data resources. Both graphs are in logarithmic scale and represent the consecutive filtering of the genes using the different metrics for A) deleted (DEL) and both CNV genes B) duplicated (DUP) and both CNV genes. There is approximately 70% reduction in the number of candidate genes when we apply the evolutionary duplication metric -ohnolog. Also, none of our candidates were present in the list of homozygous deleted genes (non-essential) from the Sudmant study as well as not present in the list of genes curated from the DDD study. genes were supported by multiple lines of evidence.