Validation of Polygenic Risk Scores for Coronary Heart Disease in a Middle Eastern Cohort Using Whole Genome Sequencing

Background: Enthusiasm for using polygenic risk scores (PRSs) in clinical practice is tempered by concerns about their portability to diverse ancestry groups, thus motivating genome-wide association studies in non-European ancestry cohorts. Methods: We conducted a genome-wide association study for coronary heart disease in a Middle Eastern cohort using whole genome sequencing and assessed the performance of 6 PRSs developed with methods including LDpred (PGS000296), metaGRS (PGS000018), Pruning and Thresholding (PGS000337), and an EnsemblePRS we developed. Additionally, we evaluated the burden of rare variants in lipid genes in cases and controls. Whole genome sequencing at 30× coverage was performed in 1067 coronary heart disease cases (mean age=59 years; 70.3% males) and 6170 controls (mean age=40 years; 43.5% males). Results: The majority of PRSs performed well; odds ratio (OR) per 1 SD increase (OR1sd) was highest for PGS000337 (OR1sd=1.81, 95% CI [1.66–1.98], P=3.07×10−41). EnsemblePRS performed better than individual PRSs (OR1sd=1.8, 95% CI [1.66–1.96], P=5.89×10−44). The OR for the 10th decile versus the remaining deciles was >3.2 for PGS000337, PGS000296, PGS000018, and reached 4.58 for EnsemblePRS. Of 400 known genome-wide significant loci, 33 replicated at P<10−4. However, the 9p21 locus did not replicate. Six suggestive (P<10−5) new loci/genes with plausible biological function were identified (eg, CORO7, RBM47, PDE4D). The burden of rare functional variants in LDLR, APOB, PCSK9, and ANGPTL4 was greater in cases than controls. Conclusions: Overall, we demonstrate that PRSs derived from European ancestry genome-wide association studies performed well in a Middle Eastern cohort, suggesting these could be used in the clinical setting while ancestry-specific PRSs are developed.

the first 20 PCs. A Wald test was performed on the most significant SNVs to calculate odds ratios and confidence intervals. Other models with various combinations of covariates were also performed. The standalone version of LocusZoom 47 was used to generate regional plots that show the linkage disequilibrium between SNVs (extracted from our data) and association results.
For the most significant SNVs, expression quantitative trait locus (eQTL) analysis was performed using GTEx V8.0 data in several relevant tissues (see below Supplemental Methods for more details). Evidence of association between SNVs and expression of neighboring genes within 1 MB was considered.

Haplotype analysis for the 9p21 locus
Fifteen SNVs at the 9p21 locus, reported to be associated with CHD in European ancestry cohorts, were considered in this analysis; these are located on chromosome 9 between 22,040,765 and 22,125,913 and include: rs1333037, rs1333036, rs1333039, rs1333040, rs10757272, rs4977574, rs2891168, rs1333042, rs1333043, rs1333045, rs1333046, rs1333047, rs1333048, rs1333049, and rs1333050. Haplotypes were obtained using a sliding window of seven SNVs, using PLINK v1.07. 48 The association test for each haplotype (presence or absence) was performed using GMMAT's Wald test, including the same covariates and accounting for relatedness.

Genomic loci associated with CHD
A list of previously identified CHD loci from all studied ancestry groups was extracted from the GWAS catalog and then supplemented with results from recent relevant publications. 10,[49][50][51] The final set contained 936 unique SNVs (Supplemental Table VI

Association analysis for rare variants
We limited rare variant analysis to lipoprotein metabolism genes: APOB, LDLR, and PCSK9 in the LDLcholesterol pathway, APOA5, APOC2, LPL, LIPA, APOC3, APOE, and ANGPTL4 in the triglyceride pathway, and lipoprotein(a) LPA. Bi-and multi-allelic variants within these genes were annotated using the Ensembl Variant Effect Predictor (VEP; https://grch37.ensembl.org/Tools/VEP) to predict the impact of variants ('high' and 'moderate'). Additionally, we assessed the prevalence of pathogenic/likely pathogenic (P/LP) variants in these genes (ClinVar;https://www.ncbi.nlm.nih.gov/clinvar/;May 5, 2021) in cases and controls. We selected P/LP, 'high', and 'moderate' variants with MAF<0.05, GATK filter value of 'PASS', 'ExcessHet'<54, 'InbreedingCoeff'> -0.2, and missing genotype rate <0.05. MAF was compared between cases and controls at a variant-level, and at a gene-level by aggregating variants within genes. The burden of rare variants was calculated as the sum of the MAF of variants within each gene.
The case to control ratio (CCR) of the rare variant burden was calculated as CCR = ∑MAF ,CHD ∑MAF ,Controls where represents the variants. SKAT and burden test were performed using the SMMAT R package. 46

Supplemental Table Legends
Supplemental Table I: PRS results for various models with different combinations of covariates   Supplemental Table II: The details about the nine GWAS summary statistics used for metaanalysis in the replication stage Supplemental Table III: Meta-analysis of the most significant SNVs in public GWAS summary statistics described in Supplementary Table II where meta-analysis P < 10 -3  Table III: Meta-analysis of the most significant SNVs in public GWAS summary statistics described in Supplemental Table II where meta-analysis P < 10 -3 Genes previously discovered in GWAS

Supplemental
Rep: "Y" if SNV was replicated with meta-analysis P < 10 -3 . eQTL: SNV that was associated with the corresponding gene expression (P <0.05) in GTEx without FDR adjustement; eQTL P : eQTL P between the SNV and the gene expression; GTEx tissue: tissue where evidence of eQTL was observed; CHD risk factors: a subset of relevant CHD risk factors that were associated with the corresponding gene as shown in GWAS catalog; Biological function relevant to CHD: biological function of the corresponding gene that may be linked to CHD. Detailed gene functions can be found in Supplementary Material. For genes that were previously reported to be associated with CHD, CHD risk factors and biological functions were not reported. List of SNVs were split in two parts: the potentially novel genes and genes previously discovered in GWAS of CHD. The SNVs were ordered by chromosome and position.    (https://gatk.broadinstitute.org/hc/en-us). Raw sequencing reads were aligned to human reference build 37 (hs37d5) using BWA-MEM (v0.7.12-r1039). Aligned reads were further processed using the GATK bestpractices pipeline (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variantdiscovery-SNPs-Indels-) to obtain the genotype calls in Variant Call Format (VCF). The bioinformatics pipeline used to generate VCF files is described elsewhere 39,40 .

Data quality control
The VCF files were split at multi-allelic sites and left-normalized using bcftools version 1.9. Low-quality genotypes were then converted to missing (thus keeping high quality genotypes only) using three criteria: allelic balance, coverage, and the genotype quality (GQ) measure from GATK. For the allelic balance, genotypes were set to missing for heterozygous calls if the ratio of reads harboring the reference allele to the total reads was outside the interval [0.25 -0.75] 52 . They were set to missing for homozygous reference calls if the ratio was < 0.75, and for homozygous alternative alleles if the ratio was > 0.25. They were also set as missing when the coverage was < 5 or > 1.5 times the mean coverage (across individuals), or if GQ was < 20. The following filtering steps were performed on the obtained data and removed: -individuals with a heterozygotes to non-reference homozygotes ratio >3: the heterozygosity ratio as described and explored in Samuels et al. 53 and described as follows: "The heterozygosity ratio is the number of heterozygous sites in an individual divided by the number of non-reference homozygous sites and is strongly affected by the degree of genetic admixture of the population and varies across human populations. Unlike quantifications of runs of homozygosity (ROH), the heterozygosity ratio is not sensitive to the density of genotyping performed." In our cohort, the Q-AFR group has particularly high values. As can be seen in Table 1 of the cited paper, African populations typically have the highest values and may exceed 2. Members of the Admixed group typically fall between Q-SAS and either the Persian or Gulf subgroups. Overall, the Admixed group has a higher heterozygosity ratio than the populations the admixture is drawn from, as expected.

Figure:
Individual heterozygosity rates (heterozygotes to non-reference homozygotes) stratified by ancestry variants that did not have a GATK filter value of 'PASS' variants with an 'ExcessHet' value >54 variants with an 'InbreedingCoeff' < -0.2 variants with a missing genotype rate >0.05 individuals with missing genotype rates >0.05

Figure:
Individual missing rate stratified by ancestry variants that deviate from Hardy-Weinberg Equilibrium with a P<10 -8 in cases and controls combined, P<10 -7 in CHD patients, and P<10 -5 in controls, for SNVs where only observed heterozygotes were larger than expected individuals with ambiguous genetic sex using X-chromosome heterozygosity rate, F (0.33<F<0.93), estimated with PLINK v1.9 42 , The calculation of F for the X chromosome was performed after splitting off the pseudo-autosomal region (PLINK --split-x) and then running LD-based pruning. The bounds were determined by taking a rough bound of 0.8 between the sexes (so < 0.8 considered to be female and > 0.8 considered male) because the male class shows less variation and 0.8 is a commonly used boundary for them, e.g. it is the default boundary used by PLINK for the male class. Then the final bounds were determined by using the mean of each class +/-3 standard deviations When there was a discordance between the reported sex and genetic sex, genetic sex was used.
After these QC steps, 7,023 individuals remained (

Relatedness estimation
Kinship coefficients were calculated to quantify relatedness between subjects using a relatively independent set of 265,250 SNVs obtained by LD-pruning SNVs with MAF >0.05. The pruning was performed using the PLINK '--indep 50 5 1.05' command. PLINK kinship coefficients were used to flag duplicated or highly related individuals during QC steps. For association analysis, the model included the genetic relationship matrix (GRM) constructed using GCTA 43 .

Association analysis for common variants
Association analysis was conducted using a generalized linear mixed model that accounts for relatedness between individuals, age, BMI, sex, and 20 PCs. Association analysis was performed using GMMAT's score test (https://cran.r-project.org/web/packages/GMMAT/index.html). 46 A Wald test was performed on the most significant SNVs to calculate their odds ratios and confidence intervals.

Known genomic loci associated with CHD
A list of previously identified CHD loci from all studied ancestry groups was extracted from the GWAS Catalog. The initial list was extracted by taking "Coronary Heart Disease" and its child traits. Only those with a "mapped trait" of "coronary heart disease" or "coronary artery disease" were retained. This list was

Expression quantitative trait loci (eQTL) analysis
To seek biological validation of the most significant SNVs, we performed eQTL analysis using GTEx We searched for evidence of association between SNVs and expression of neighboring genes within 1 MB. repression of Coronin-7 can prevent LDLR-A18-GFP from being transported away from the Golgi apparatus and that is likely also true for unmodified LDLR. CORO7 has been found to be associated with systolic blood pressure 57 and waist-to-hip ratio 58 , which are risk factors of CHD 59 .

rs7690530 (RBM47, P=2.94x10 -6 ):
RBM47 is a paralog of A1CF and is also a cofactor for mRNA editing of ApoB (APOB) by APOBEC (APOBEC1) 60 to produce the shortened ApoB-48. ApoB, and in particular ApoB-100 or the ratio of ApoB-100 to ApoA have been considered even stronger markers for coronary atherosclerosis than LDL 61 . rs12950395 (HS3ST3B1, P=9.3x10 -6 ) and rs114906338 (HS3ST1, P=4.39x10 -6 ): HS3ST3B1 and HS3ST1 are involved in the synthesis of Heparan-Sulfates. Heparan-Sulfates are a major component of some tissues and are found intra and extra-cellularly as well as at the cell surface and play an essential role in many signaling pathways including TGF-Beta 62 , Angiopoietin 62 , and FGF 63 .
HS3ST1 may result in Heparan-Sulfate that binds Antithrombin III (SERPINC1), or it goes on to be processed to Heparin in Mast cells. Heparin is a potent and widely used anti-coagulant that is often still harvested from biological sources rather than produced synthetically. Two of our top SNVs (rs2480948, P= 2.6x10 -5 and rs513479, P= 2.91x10 -5 ) are near coagulation factors 7 and 10 (F7 and F10). Low dose heparin itself has been effective in reducing cardiac events and deaths, and heparin and dextran-sulfates are effective in lowering LDL and total cholesterol respectively according to Gustafsen et al. 64 . In Smits et al. 65 , they observed an association between a SNV in HS3ST1 and the severity of coronary artery disease and atherosclerotic cardiovascular events.

rs6959554 (TNS3, P=1.56x10 -6 ):
Tensin 3 is often found at focal adhesion sites. It is involved in the dissociation of the integrin-tensinactin complex, cell migration, and potentially bone development. TNS3 is known to interact with p130Cas/BCAR1 66 , a gene known to be associated with coronary artery disease (rs999675, 10 -12 ) 28 . TNS3 may also be involved in the regulation of RAC1 and RhoA 67,68 , both of which are associated with coronary artery disease in the GWAS Catalog. TNS3 has a region homologous to PTEN with actin binding sites, Ia and Ib, and a focal adhesion binding site while the other end has an SH2 binding site, a