Gene expression in African Americans, Puerto Ricans and Mexican Americans reveals ancestry-specific patterns of genetic architecture

We analyzed data from a total of 2,733 participants from the GALA II16 and SAGE17 asthma case–control studies who self-identified as African American (AA; n = 757), Puerto Rican (PR; n = 893), Mexican American (MX; n = 784) or other Latino American (LA; n = 299) (Table 1 and Supplementary Table 1). The median age of the participants varied from 13.2 (PR) to 16.0 years (AA). Genome-wide, or global, genetic ancestry proportions were estimated for all participants (Fig. 1). The median global African ancestry was highest in AA (82.6%), followed by PR (19.7%), and lowest in MX (3.5%).

Table 1 Study participants
Fig. 1: Study overview.
figure 1

This study included TOPMed WGS and whole transcriptome data generated from whole-blood samples of SAGE AA and GALA II Latino individuals (n = 2,733). We compared elements of the genetic architecture gene expression, such as cis-heritability and genetic variance, across participant groups defined based on self-identified race/ethnicity and genetic ancestry. We performed eQTL mapping and identified eQTLs that were specific to AFR or IAM ancestry. Finally, we developed genetic prediction models of whole-blood transcriptomes and performed comparative TWASs using GWAS summary statistics generated from the PAGE study and the UKB. Figure created with

Heritability of gene expression in admixed populations

We compared the heritability (h2) and genetic variance (VG) of whole-blood gene expression attributed to common genetic variation (minor allele frequency (MAF) ≥ 0.01) within the cis-region across self-identified race/ethnicity groups and subpopulations defined based on genetic ancestry (see Methods). Across 17,657 genes, cis-heritability (Fig. 2a) was significantly higher in AA (median h2 = 0.097) compared with PR (h2 = 0.072; PWilcoxon = 2.2 × 10−50) and MX participants (h2 = 0.059; P = 3.3 × 10−134) and in PR compared with MX participants (PWilcoxon = 2.2 × 10−25) (Supplementary Table 2). Genetic variance (Fig. 2b) of whole-blood transcript levels in AA participants (median VG = 0.022) was higher than in PR participants (VG = 0.018; PWilcoxon = 4.0 × 10−19) and in MX participants (VG = 0.013; PWilcoxon = 5.6 × 10−135). The results remained unchanged when the sample size was fixed to n = 600 in all populations (Extended Data Fig. 1).

Fig. 2: Cross-population comparison of cis-heritability (h2) and genetic variance (VG) of whole-blood transcript levels.
figure 2

a,b, Violin plots showing the distribution of h2 (a) and VG (b) for all available genes in each population. The box plots extend from the 25th to 75th percentiles and median values are annotated. The analyses were stratified by self-identified race/ethnicity and compared AA, PR and MX participants. c,d, Split violin plots showing the distribution of h2 (c) and VG (d) in individuals with >50% global genetic ancestry (high; darker shading) versus those with <10% of the same ancestry (low; lighter shading). Analyses were conducted separately for AFR and IAM ancestry. e,f, The local ancestry at the transcription start site of each gene was used to compare h2 (e) and VG (f) in participants with 100% (AFR/AFR or IAM/IAM; darker shading) versus 50% (AFR/EUR or IAM/EUR; lighter shading) local ancestry. Statistical significance in all panels was determined by Wilcoxon test and all P values are two sided.

Next, we compared the distribution of h2 (Fig. 2c) and VG (Fig. 2d) between participants grouped based on proportions of global genetic ancestry (Supplementary Table 3 and Supplementary Fig. 1). Among participants with >50% African ancestry (AFRhigh; n = 721), cis-heritability (h2 = 0.098) and genetic variance (VG = 0.022) were higher than in n = 1,011 participants with <10% global African ancestry (AFRlow; h2 = 0.060 (PWilcoxon = 9.6 × 10−126) and VG = 0.013 (PWilcoxon = 7.6 × 10−106)). Among individuals with >50% Indigenous American (IAM) ancestry (IAMhigh; n = 610), cis-heritability (h2 = 0.059) and genetic variance (VG = 0.012) were lower than in participants with <10% IAM ancestry (IAMlow; h2 = 0.084 (P = 3.1 × 10−103) and VG = 0.020 (PWilcoxon = 3.1 × 10−158)). The results remained consistent when partitioning h2 and VG by coarse MAF bins, with larger differences in h2 and VG among 0.01 ≤ MAF ≤ 0.10 variants (Extended Data Fig. 2).

We also investigated the impact of ancestry at the locus level, defined as the number of alleles (zero, one or two) derived from each ancestral population at the transcription start site. Heritability was higher in individuals with homozygous local African ancestry (AFR/AFR) compared with AFR/European (EUR) ancestry (h2 = 0.096 versus h2 = 0.084, respectively; PWilcoxon = 1.4 × 10−14) and lower in participants with homozygous Indigenous American ancestry (IAM/IAM) compared with IAM/EUR ancestry (h2 = 0.055 versus 0.064, respectively; P = 1.6 × 10−7) (Fig. 2e). Differences in VG by local ancestry were statistically significant for AFR/AFR versus AFR/EUR (PWilcoxon = 2.0 × 10−7) and IAM/IAM versus IAM/EUR (PWilcoxon = 1.6 × 10−8) (Fig. 2f). The results were also consistent for VG comparisons within self-identified race/ethnicity groups (Supplementary Table 4).

We calculated heritability using linkage disequilibrium adjusted kinships (LDAKs). This method assumes that SNP-specific variance is inversely proportional not only to the MAF, but also to linkage disequilibrium tagging18. Estimates obtained using the LDAK-Thin model and genome-wide complex trait analysis (GCTA) were nearly identical across populations based on self-identified race/ethnicity (h2 = 0.094 for AA, 0.071 for PR and 0.059 for MX) and genetic ancestry (h2 = 0.104 for AFRhigh, 0.066 for AFRlow, 0.062 for IAMhigh and 0.093 for IAMlow; Supplementary Table 5).

Lastly, we tabulated the number of heritable genes for which global and/or local ancestry was significantly associated (false discovery rate (FDR) < 0.05) with transcript levels (Supplementary Fig. 2 and Supplementary Table 6). Global AFR ancestry was associated with the expression of 326 (2.4%) and 589 (4.5%) heritable genes in AA and PR, respectively. Associations with local, but not global, AFR ancestry were more common (8.9% in AA and 10.9% in PR). Local IAM ancestry was associated with the expression of 9.8% of genes in MX, compared with 2.8% for global IAM ancestry.

Assessment of ancestry-specific eQTLs

To understand the control of gene expression at a more granular level, we performed cis-eQTL analysis. A total of 19,567 genes with at least one cis-eQTL (eGenes) were found in the pooled sample. The largest number of eGenes was detected in AA (n = 17,336), followed by PR (n = 16,975) and MX participants (n = 15,938) (Supplementary Table 7 and Supplementary Fig. 3). The number of eGenes was similar in AFRhigh (n = 17,123) and AFRlow (n = 17,146) groups. When the sample size was fixed to n = 600 for all populations, the number of eGenes observed in AFRhigh (n = 16,100) was higher than in AFRlow (n = 14,344). The numbers of eGenes detected in the IAMlow (n = 14,866) and IAMhigh groups (n = 14,419) were similar. The number of linkage disequilibrium-independent (r2 < 0.10) cis-eQTLs per gene was significantly higher in the AFRhigh group than the AFRlow group (PWilcoxon = 2.7 × 10−246) and lower in the IAMhigh group compared with the IAMlow group (PWilcoxon = 2.8 × 10−33) (Extended Data Fig. 3).

To characterize ancestry-related differences in the regulation of gene expression, we developed a framework for identifying ancestry-specific eQTLs (anc-eQTLs) (see Methods, Fig. 3 and Supplementary Tables 8 and 9). For heritable protein-coding genes, we first compared the overlap in 95% credible sets of cis-eQTLs identified in participants with >50% global ancestry (AFRhigh and IAMhigh) and those with <10% of the same global ancestry (AFRlow and IAMlow). For genes with nonoverlapping 95% credible sets, we distinguished between population differences in MAF (tier 1) and linkage disequilibrium (tier 2). For genes with overlapping 95% credible sets, eQTLs were further examined for evidence of effect size heterogeneity between ancestry groups (tier 3).

Fig. 3: Framework for the classification of anc-eQTLs.
figure 3

a, eQTL mapping analyses were conducted in participant groups derived based on genetic ancestry. Associations identified in participants with >50% global African ancestry (AFRhigh; n = 721) were compared with eQTLs detected in participants with <10% African ancestry (AFRlow; n = 1,011), and eQTLs identified in participants with >50% global Indigenous American ancestry (IAMhigh; n = 610) were compared with associations detected in participants with <10% Indigenous American ancestry (IAMlow; n = 1,257). b, Decision tree for the classification of anc-eQTLs. Tier 1 is based on MAF differences and represents the most ancestry-specific class. Tier 2 anc-eQTLs were identified using additional fine-mapping with PESCA22. Tier 3 anc-eQTLs were detected in both ancestry groups, but with statistically significant effect size heterogeneity based on Cochran’s Q test. c, Prevalence of anc-eQTLs in each ancestry group. Figure created with

Tier 1 anc-eQTLs were only common (MAF ≥ 0.01) in individuals with >50% AFR or IAM ancestry and were thus considered to be the most ancestry specific. Over 28% (n = 2,695) of genes contained at least one tier 1 AFRhigh anc-eQTL, while 7% (n = 562) of genes contained a tier 1 IAMhigh anc-eQTL (Supplementary Table 9). A representative example of a tier 1 AFRhigh anc-eQTL is rs3211938 (CD36), which has an MAF of 0.077 in the AFRhigh group and an MAF of 0.002 in the AFRlow group (Fig. 4a). This variant has been linked to high-density lipoprotein (HDL) cholesterol levels in several multi-ancestry GWASs that included African Americans19,20,21. Tier 2 anc-eQTLs, with ancestry-specific linkage disequilibrium patterning, had an MAF of 0.01 in both high (>50%) and low (<10%) global ancestry groups and were further fine-mapped using PESCA22. There were 109 genes (1.1%) that contained eQTLs with a posterior probability (PP) of being specific to AFRhigh of >0.80 and 33 genes (0.4%) matching the same criteria for IAMhigh (Supplementary Table 9). For instance, two lead eQTLs in low linkage disequilibrium were detected for TRAPPC6A in AFRhigh (rs12460041) and AFRlow (rs7247764) populations (Fig. 4b). These variants belonged to nonoverlapping credible sets and PESCA estimated that rs12460041 was specific to AFRhigh with a PP of 0.87 (Fig. 4c). Over 50% of heritable protein-coding genes (n = 5,058 for AFR and 5,355 for IAM) had overlapping 95% credible sets of eQTLs between high and low ancestry groups. Among these shared signals, a small proportion of eQTLs exhibited significant effect size heterogeneity (tier 3; 2.0% for AFRhigh and 1.0% for IAMhigh). For example, the KCNK17 eQTLs rs34247110 and rs3734618 were detected in the AFRhigh and AFRlow groups, but with significantly different effect sizes (Cochran’s Q P value = 1.8 × 10−10) in each population (Fig. 4d). Consistent with tier 3 eQTLs being observed in multiple ancestries, rs34247110 has been associated with type 2 diabetes in Japanese and multi-ancestry (European, African American, Hispanic and Asian) populations23,24.

Fig. 4: Examples of anc-eQTLs.
figure 4

a, The tier 1 anc-eQTL (rs3211938) for CD36 is present in individuals with >50% African ancestry (AFRhigh) with an MAF of 0.077 and in individuals with <10% African ancestry (AFRlow) with an MAF of 0.002. CD36 expression by rs3211938 genotype is shown for each individual and summarized using split violin plots and box plots, which extend from the 25th to the 75th percentile. b, Regional plots of FastQTL association results for TRAPPC6A show two independent eQTL signals in AFRhigh and AFRlow participants, with linkage disequilibrium r2 < 0.2 in each population. Variants are colored based on linkage disequilibrium r2 with respect to the index variant, as indicated by the diamond. Hollow circles show variants that were not associated with TRAPPC6A expression at the gene-specific P value threshold. kb, kilobases; TSS, transcription start site. c, Left, fine-mapping using CAVIAR identified nonoverlapping 95% credible sets in AFRhigh and AFRlow groups. The lead AFRhigh eQTL, rs12460041, had a PP of 0 in AFRlow. Right, fine-mapping using PESCA to account for linkage disequilibrium differences between populations confirmed that rs12460041 is a tier 2 anc-eQTL with a PP of >0.80 in AFRhigh. d, Tier 3 anc-eQTLs rs34247110 and rs3734618 were included in the 95% credible set for KCNK17 in AFRhigh and AFRlow groups, but have different eQTL effect sizes in each population, as indicated by the nonoverlapping shaded 95% CIs.

The prevalence of any tier 1, 2 or 3 anc-eQTL was 30% (n = 2,961) for AFR ancestry and 8% (n = 679) for IAM ancestry. Overall, 3,333 genes had anc-eQTLs for either ancestry. The remaining genes (n = 6,648 for AFR and 7,836 for IAM) did not contain eQTLs with ancestry-related differences in MAF, linkage disequilibrium or effect size as outlined above. Increasing the global ancestry cut-off to >70% did not have an appreciable impact on anc-eQTLs in the AFRhigh group (28.1% overall and 27.3% for tier 1), but decreased the number of anc-eQTLs in the IAMhigh group (3.3% overall and 3.3% for tier 1), probably due to a greater reduction in sample size in this group (n = 212 versus n = 610, respectively; Supplementary Table 10). Considering all protein-coding genes without filtering based on heritability (n = 13,535), the prevalence of anc-eQTLs was 22% for the AFRhigh group, 5% for the IAMhigh group and 25% overall. The observation that anc-eQTLs were more common in participants with >50% AFR ancestry aligns with the higher h2 and VG values in this population and a greater number of linkage disequilibrium-independent cis-eQTLs (Extended Data Fig. 3). Among genes with tier 1 and 2 anc-eQTLs, 83% had higher h2 estimates in the AFRhigh group than in the AFRlow group, while this was observed for 57% of genes without any ancestry-specific eQTLs (Extended Data Fig. 4).

We detected 70 unique anc-eQTLs associated with 84 phenotypes from the NHGRI-EBI GWAS catalog25, most of which were tier 3 anc-eQTLs (59%) that mapped to blood cell traits, lipids and plasma protein levels (Supplementary Table 11). Colocalization with GWAS results from the multi-ancestry Population Architecture Using Genomics and Epidemiology (PAGE) study21 identified 78 eQTL–trait pairs with strong evidence of a shared genetic signal (PP4 > 0.80), 16 of which were anc-eQTLs (Supplementary Table 12). One illustrative example is rs7200153, an AFRhigh tier 1 anc-eQTL for the haptoglobin (HP) gene, which colocalized with total cholesterol (PP4 = 0.997; Extended Data Fig. 5). The 95% credible set included rs7200153 (PPSNP = 0.519) and rs5471 (PPSNP = 0.481; linkage disequilibrium r2 = 0.75), with rs5471 probably being the true causal variant given its proximity to the HP promoter, stronger effect of HP expression and decreased transcriptional activity of rs5471-C in West African populations26,27,28, which is supported by previous literature20,29,30.

We also performed trans-eQTL analyses that largely recapitulated our cis-eQTL results, with a larger number of trans-eGenes and independent (linkage disequilibrium r2 < 0.10) trans-eQTLs in populations with greater levels of AFR ancestry and lower levels of IAM ancestry (see Methods and Supplementary Table 13).

Performance of TWAS models trained in admixed populations

Following the PrediXcan approach7, we developed gene expression imputation models from the pooled GALA II and SAGE population (n = 2,733) for 11,830 heritable genes with a mean cross-validation of 0.157 (Supplementary Table 13 and Supplementary Fig. 4). We also generated population-specific models for African Americans (10,090 genes; cross-validation r2 = 0.180), Puerto Ricans (9,611 genes; cross-validation r2 = 0.163) and Mexican Americans (9,084 genes; cross-validation r2 = 0.167). Adjusting for local ancestry did not improve predictive performance (Supplementary Table 14).

We validated GALA II/SAGE TWAS models and compared with GTEx v8 in the Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE)31—a study of 598 African American adults (Supplementary Fig. 5). The prediction accuracy was proportional to the degree of alignment in ancestry between the training and testing study samples. Across 5,254 genes with models available in all studies, the median Pearson’s correlation between genetically predicted and observed transcript levels was highest for pooled (r = 0.086) and AA (r = 0.083) models and lowest for GTEx (r = 0.049).

To evaluate the performance of the GALA II/SAGE TWAS models for gene discovery in admixed populations, we applied them to GWAS summary statistics for 28 traits from the PAGE study21 and compared them with TWASs using GTEx v8 (refs. 2,7) and the Multi-Ethnic Study of Atherosclerosis (MESA)3. GTEx v8 whole-blood models are based on 670 participants of predominantly European ancestry (85%)2, while MESA models impute monocyte gene expression3 from African American and Hispanic/Latino individuals (MESAAFHI; n = 585). The number of genes with available TWAS models was 39–82% higher in GALA II/SAGE compared with GTEx (n = 7,249) and MESAAFHI (n = 5,555). Restricting to 3,143 genes shared across all three studies, the cross-validation r2 was significantly higher in GALA II/SAGE compared with GTEx (PWilcoxon = 4.6 × 10−159) and MESAAFHI (PWilcoxon = 1.1 × 10−64) (Fig. 5a). TWAS models generated in GALA II/SAGE AA (n = 757) attained a higher cross-validation r2 than GTEx (PWilcoxon = 2.2 × 10−103), which had a comparable training sample size (Fig. 5b).

Fig. 5: Comparison of transcriptome imputation model performance and TWAS results.
figure 5

TWAS models were developed using data from GALA II and SAGE participants following the PredictDB pipeline and compared with TWAS models generated in GTEx v8 and MESA using the same approach. a,b, Violin plots showing the distribution of internal cross-validation r2 values for genes with TWAS models available in each study. The box plots extend from the 25th to the 75th percentiles of the r2 distributions, with median r2 values indicated by squares. Predictive performance was compared between TWAS models trained in the pooled GALA II/SAGE population and MESA models trained in African American and Hispanic participants (MESAAFHI) (a) and between TWAS models trained in African Americans (b). Statistical significance was determined by Wilcoxon test and the P values for differences in r2 are two sided. c, Summary of TWAS associations with FDR < 0.05 across 28 traits in PAGE. BMI, body mass index; BP, blood pressure; CRP, C-reactive protein; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; LDL, low-density lipoprotein; WHR, waist-to-hip ratio. d,e, Correlation between TWAS z scores based on GALA II/SAGE pooled models and z scores using GTEx (d) or MESAAFHI models (e) for the union of genes with FDR < 0.05 detected with either model. Associations between TWAS z scores from each model are visualized by linear regression lines with shaded 95% CIs. Genes highlighted in orange had FDR < 0.05 using GALA II/SAGE models but did not reach nominal significance (TWAS P value > 0.05) using GTEx or MESA models.

TWASs using GALA II/SAGE models across 28 PAGE traits identified a larger number of significant gene–trait pairs (n = 380; FDR < 0.05) than MESAAFHI (n = 303) and GTEx (n = 268), with only 35 significant gene–trait pairs detected in all three analyses (Fig. 5c). GALA II/SAGE models yielded a larger number of associated genes than MESA in 80% of analyses (binomial test; P = 0.012) and a larger number than GTEx in 79% of analyses (binomial test; P = 0.019). Of the 330 genes with an FDR of <0.05 in GALA II/SAGE, 143 (43%) were not present in GTEx and 199 (60%) were not present in MESAAFHI. For genes that were significant in at least one TWAS, z scores in GALA II/SAGE were highly correlated with GTEx (r = 0.74; P = 3.5 × 10−64; Fig. 5c) and MESAAFHI (r = 0.55; P = 8.5 × 10−27; Fig. 5d), suggesting that most genes have concordant effects even if they are not significant in both analyses.

HDL cholesterol exhibited one of the largest differences in TWAS associations (Fig. 6a), with over 60% more significant genes identified using GALA II/SAGE models (n = 29) than GTEx predictions (n = 11). TWAS models for several associated genes, including those with established effects on cholesterol transport and metabolism, such as CETP, were not available in GTEx. The top HDL-associated gene, CD36 (z score = −10.52; PTWAS = 6.9 × 10−26) had tier 1 AFRhigh anc-eQTLs (rs3211938) that were rare in European ancestry populations (MAF = 0.00013). The difference in MAF may explain why CD36 was not detected using GTEx (z score = 0.057; PTWAS = 0.95), even though all 43 variants from the GTEx model were available in PAGE summary statistics.

Fig. 6: Comparison of TWAS results for selected traits.
figure 6

a, A TWAS of HDL was performed by applying GALA II/SAGE pooled models and GTEx v8 models to GWAS summary statistics from the multi-ancestry PAGE study (n = 33,063). b, A TWAS of neutrophil counts was performed by applying GALA II/SAGE models trained in African Americans and GTEx v8 to summary statistics from a GWAS meta-analysis of individuals of African ancestry (n = 13,476) by Chen et al.32. All genes with FDR < 0.05 are labeled, except for chromosome 1 in b due to the large number of statistically significant associations. Significantly associated genes for which expression levels could not be predicted using GTEx v8 elastic net models are highlighted in red.

Although GALA II/SAGE multi-ancestry TWAS models showed robust performance, in some cases population-specific models may be preferred. For instance, benign neutropenia is a well-described phenomenon in persons of African ancestry and is attributed to variation in the 1q23.2 region. Applying GALA II/SAGE AA models to a meta-analysis of 13,476 individuals of African ancestry32 identified ACKR1 (PTWAS = 1.5 × 10−234), the atypical chemokine receptor gene that is the basis of the Duffy blood group system (Fig. 6b). This causal gene was missed by GTEx and MESAAFA. After conditioning on the Duffy-null rs2814778-CC genotype, no statistically significant TWAS associations remained on chromosome 1 (Supplementary Fig. 6). GALA II/SAGE AA models also detected seven genes outside of 1q23.2 that were not previously reported in GWASs of neutrophil counts: CREB5 (PTWAS = 1.5 × 10−14), DARS (PTWAS = 2.9 × 10−8), CD36 (PTWAS = 1.1 × 10−5), PPT2 (PTWAS = 1.3 × 10−5), SSH2 (PTWAS = 4.7 × 10−5), TOMM5 (PTWAS = 2.9 × 10−4) and ARF6 (PTWAS = 3.4 × 10−4).

Next, we performed TWASs of 22 blood-based biomarkers and quantitative traits using summary statistics from the UK Biobank (UKB). Ancestry-matched TWASs of UKB AFR (median GWAS n = 6,190) identified 56 gene–trait associations (FDR < 0.05), whereas GTEx detected only five genes (Extended Data Fig. 6). TWAS z scores for associated genes were modesty correlated (r = 0.37; 95% confidence interval (CI) = −0.01–0.66). TWASs in UKB EUR (median GWAS n = 400,223) also illustrated the advantage of ancestry-matched analyses, but the difference was less dramatic, with a 15% decrease in the number of genes that reached an FDR of <0.05 using GALA II/SAGE AA models and strong correlation between z scores (r = 0.77; 95% CI = 0.76–0.78). Concordance between significant associations across all traits was 28%, ranging from 32.7% for height to 7.6% for hemoglobin.

2023-05-25 15:32:28