An atlas of genetic correlations across human diseases and traits, 2015, Bulik-Sullivan et al.

tralfamadorian97
Citation: Bulik-Sullivan, Brendan, et al. "An atlas of genetic correlations across human diseases and traits." Nature genetics 47.11 (2015): 1236-1241.

Authors: Brendan Bulik-Sullivan, Hilary K Finucane, Verneri Anttila, Alexander Gusev, Felix R Day, Po-Ru Loh, ReproGen Consortium, Psychiatric Genomics Consortium, Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium, Laramie Duncan, John R B Perry, Nick Patterson, Elise B Robinson, Mark J Daly, Alkes L Price & Benjamin M Neale

Link: https://www.nature.com/articles/ng.3406

Abstract

Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual-level genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique—cross-trait LD Score regression—for estimating genetic correlation that requires only GWAS summary statistics and is not biased by sample overlap. We use this method to estimate 276 genetic correlations among 24 traits. The results include genetic correlations between anorexia nervosa and schizophrenia, anorexia and obesity, and educational attainment and several diseases. These results highlight the power of genome-wide analyses, as there currently are no significantly associated SNPs for anorexia nervosa and only three for educational attainment.
 
A recently-posted Genome-Wide Association Study of Hypermobile Ehlers-Danlos Syndrome (hEDS) (Discussion) used cross-trait LD score regression to estimate genetic correlations between hEDS and other conditions. The authors found a strong genetic correlation between ME/CFS and hEDS. Genetic correlation was also referenced by @forestglip and @ME/CFS Science Blog in the discussion of the DecodeME results.

I do not have a biostatistics background, so I did not understand this. I read the Bulik-Sullivan et al. paper to improve my understanding by learning about genetic correlation and cross-trait LD score regression.

I’ve made some notes below. If you are like me and are also confused about genetic correlation and LD score regression, I hope my notes provide useful context. On the other hand, if you are familiar with these concepts, feel free to correct any errors I might have made.


What is genetic correlation?
Suppose we have two continuous phenotypes, P1 and P2. Construct a simple additive model in which each phenotype is determined by the sum of all genetic influences G and all environmental influences E.

[Image: the additive model, P1 = G1 + E1 and P2 = G2 + E2]


The genetic correlation between phenotypes 1 and 2 is then corr(G1,G2), the Pearson Correlation of their genetic influences.

For discrete phenotypes, the definition of genetic correlation is similar, but involves a link function.
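
To make the definition concrete, here is a minimal simulation sketch of the additive model above, using only the Python standard library. The sample size, the built-in genetic correlation of 0.6, and the equal genetic and environmental variances are all assumptions chosen for illustration; none of these numbers come from the paper.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable
n_people = 50_000        # hypothetical sample size
r_g = 0.6                # genetic correlation built into the simulation (assumed)

G1, G2, P1, P2 = [], [], [], []
for _ in range(n_people):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    g1 = a
    g2 = r_g * a + math.sqrt(1 - r_g ** 2) * b   # corr(g1, g2) = r_g by construction
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)    # independent environmental influences
    G1.append(g1)
    G2.append(g2)
    P1.append(g1 + e1)                           # P = G + E
    P2.append(g2 + e2)

genetic_corr = pearson(G1, G2)
phenotypic_corr = pearson(P1, P2)
print(f"corr(G1, G2) = {genetic_corr:.3f}")
print(f"corr(P1, P2) = {phenotypic_corr:.3f}")
```

In this setup the genetic correlation should come out near 0.6, while the correlation between the phenotypes themselves should be near 0.3, because half of each phenotype's variance is independent environmental noise. In real data G1 and G2 are not observed directly, which is why estimation methods are needed.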


When will two phenotypes be genetically correlated?
The most obvious circumstance in which two phenotypes will have high genetic correlation is when they share an underlying genetic architecture: that is, they are influenced by the same genetic polymorphisms in the same direction. Thus the finding that two diseases have high genetic correlation is often taken to imply that they are driven by related pathological mechanisms. However, see the “caveats” section below for some exceptions to this implication.


How does genetic correlation compare to other approaches for computing genetic commonality between phenotypes?
An intuitive alternative approach to finding the genetic commonality between two phenotypes is to list all the SNPs found to be individually statistically significant predictors of each phenotype, and then compute the intersection of these two lists. The drawback of this approach is that it neglects all signal from SNPs below the threshold of statistical significance. In contrast, genetic correlation aggregates signal across all SNPs.

In essence, even if few SNPs are individually statistically significant, the aggregate information from all SNPs may produce a statistically significant genetic correlation.
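
The contrast between the two approaches can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions: SNPs are treated as independent (real SNPs are in LD), every SNP has a tiny true effect on both traits, and the two studies share no samples. The SNP count and effect sizes are made up for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
n_snps = 50_000      # hypothetical number of independent SNPs
r_g = 0.5            # correlation of true effects across the two traits (assumed)
signal_var = 0.2     # variance of the true signal per SNP, small next to noise variance 1

z1, z2 = [], []
for _ in range(n_snps):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    s1 = math.sqrt(signal_var) * a
    s2 = math.sqrt(signal_var) * (r_g * a + math.sqrt(1 - r_g ** 2) * b)
    z1.append(s1 + rng.gauss(0, 1))   # observed z-score = true signal + sampling noise
    z2.append(s2 + rng.gauss(0, 1))

# Approach 1: intersect the lists of genome-wide significant SNPs (p < 5e-8, two-sided).
threshold = -NormalDist().inv_cdf(5e-8 / 2)
hits1 = {j for j, z in enumerate(z1) if abs(z) > threshold}
hits2 = {j for j, z in enumerate(z2) if abs(z) > threshold}
shared_hits = hits1 & hits2

# Approach 2: aggregate over all SNPs by correlating the z-scores,
# then test the correlation with a Fisher z-transform.
r = pearson(z1, z2)
fisher_z = math.atanh(r) * math.sqrt(n_snps - 3)
p_value = math.erfc(abs(fisher_z) / math.sqrt(2))   # two-sided normal p-value

print(f"significant SNPs: trait 1 = {len(hits1)}, trait 2 = {len(hits2)}, shared = {len(shared_hits)}")
print(f"z-score correlation r = {r:.3f}, test statistic = {fisher_z:.1f}, p = {p_value:.2e}")
```

Under these assumptions the intersection of significant SNPs is expected to be empty, yet the correlation across all SNPs is detected with very high confidence. Note that the raw z-score correlation here is diluted by sampling noise and is not itself an estimate of the genetic correlation.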


What are some examples of genetic correlation?
The authors of the present paper use their technique to estimate genetic correlations between a variety of traits and diseases. They produce the following genetic correlation matrix:

[Image: genetic correlation matrix among the traits analyzed in the paper]


Many of the phenotype groupings in this matrix make at least intuitive sense: for instance, the grouping of Crohn’s disease with ulcerative colitis.


What is cross-trait LD score regression?
Traditional methods for estimating the genetic correlation between phenotype 1 and phenotype 2 require access to individual-level data from both a GWAS of phenotype 1 and a GWAS of phenotype 2. Privacy protection rules often make this requirement prohibitive. Bulik-Sullivan et al. propose cross-trait LD score regression, a method that instead relies only on publicly accessible summary statistics from the two GWAS.

This enables a researcher without access to the underlying individual-level GWAS data to nevertheless compute genetic correlations.
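
As I understand the paper, the central idea is that the expected product of a SNP's z-scores from the two GWAS grows linearly with that SNP's LD score, with a slope proportional to the genetic covariance, while sample overlap only shifts the intercept. The sketch below simulates summary statistics under that relationship and recovers the genetic correlation with plain least squares. It is a toy illustration, not the authors' software: the real method uses a weighted regression and a block jackknife for standard errors, and every number here (sample sizes, heritabilities, LD score range, overlap term) is an assumption. SNPs are also simulated as independent draws, which real LD-correlated SNPs are not.

```python
import math
import random


def ols(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(2)
M = 100_000                  # number of SNPs (hypothetical)
N1 = N2 = 50_000             # GWAS sample sizes (hypothetical)
h2_1, h2_2 = 0.5, 0.5        # SNP heritabilities (assumed)
r_g_true = 0.5               # genetic correlation built into the simulation
rho_g = r_g_true * math.sqrt(h2_1 * h2_2)   # genetic covariance
overlap_term = 0.1           # stands in for the sample-overlap intercept (assumed)

ld, prod12, chisq1, chisq2 = [], [], [], []
for _ in range(M):
    l = rng.uniform(1, 200)                   # LD score of this SNP (made up)
    v1 = N1 * h2_1 * l / M + 1                # expected z1^2 given the LD score
    v2 = N2 * h2_2 * l / M + 1                # expected z2^2 given the LD score
    c = math.sqrt(N1 * N2) * rho_g * l / M + overlap_term   # expected z1 * z2
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    z1 = math.sqrt(v1) * a                    # draw (z1, z2) with variances v1, v2, covariance c
    z2 = (c / math.sqrt(v1)) * a + math.sqrt(v2 - c * c / v1) * b
    ld.append(l)
    prod12.append(z1 * z2)
    chisq1.append(z1 * z1)
    chisq2.append(z2 * z2)

# Slopes of the three regressions on LD score.
slope12, _ = ols(ld, prod12)
slope1, _ = ols(ld, chisq1)
slope2, _ = ols(ld, chisq2)

# Convert slopes back to genetic covariance and heritabilities.
rho_g_hat = slope12 * M / math.sqrt(N1 * N2)
h2_1_hat = slope1 * M / N1
h2_2_hat = slope2 * M / N2
r_g_hat = rho_g_hat / math.sqrt(h2_1_hat * h2_2_hat)

print(f"estimated genetic correlation = {r_g_hat:.3f} (simulated truth = {r_g_true})")
```

Because the overlap term is constant across SNPs, it is absorbed by the regression intercept and leaves the slope untouched, which is the sense in which the estimate is not biased by sample overlap.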


How reliable is the cross-trait LD score regression approach to estimating genetic correlation?
I skimmed through the algebraic derivation of cross-trait LD score regression in the supplementary material to this paper, and noticed that it relies on some strong modeling assumptions that seemed unlikely to hold. However, the authors also validate cross-trait LD score regression empirically on real data, and show a good match between genetic correlation estimated from individual-level data and genetic correlation estimated from summary statistics. Use of summary statistics does increase the standard error of the genetic correlation estimates, but not so much as to make those estimates useless.

This suggests that cross-trait LD score regression is a viable technique, and that strictly satisfying its modeling assumptions may not be crucial.


What are some caveats to the interpretation of genetic correlation?
  1. Even though genetic correlation is immune to direct environmental confounding, the finding that phenotypes A and B are genetically correlated does not unambiguously determine their causal relationship. For instance, if G is the shared genetics of two phenotypes, we could have A←G→B or G→A→B or G→B→A or A←G→X→B for some X.
  2. When we compute the genetic correlation between phenotype A studied in GWAS 1 and phenotype B studied in GWAS 2, we are approximately asking: “How much commonality is there between the SNPs distinguishing phenotype A from GWAS 1 controls and the SNPs distinguishing phenotype B from GWAS 2 controls?” In theory, differences in control selection criteria between GWAS 1 and GWAS 2 could bias the result. (I believe this bias could be corrected with access to individual-level GWAS data.)
  3. In theory, two phenotypes could have high genetic correlation not because of truly shared genetics, but because the genes that influence them are in linkage disequilibrium (i.e. they tend to be inherited together).
  4. @forestglip linked to an interesting discussion of some other un-intuitive causes of high genetic correlation: https://gcbias.org/2016/04/19/what-is-genetic-correlation/
 
A recently-posted Genome-Wide Association Study of Hypermobile Ehlers-Danlos Syndrome (hEDS) (Discussion) used cross-trait LD score regression to estimate genetic correlations between hEDS and other conditions. The authors found a strong genetic correlation between ME/CFS and hEDS. Genetic correlation was also referenced by @forestglip and @ME/CFS Science Blog in the discussion of the DecodeME results.

The discussion of the hEDS paper fizzled out but there do seem to be important unaddressed questions raised by it and it would be good to understand how a technique like cross-trait LD score regression might work to help.

Maybe the problem with hEDS is that it is not one trait but two:
1. bendiness
2. enough fatigue and general pain to take you to see a physician who diagnoses hEDS
(Bendy people without pain and fatigue don't get diagnosed and may not even meet criteria for hEDS.)

2 is pretty much bound to overlap with ME/CFS-related pathways.

I am getting the impression that we have two different questions that may be best answered in very different ways.
1. Variants of exactly which gene with which functions contribute causation to ME/CFS?
2. To what extent do different diseases share causal genetic risk elements - which in reality may be segments of DNA containing several genes in tight LD.

We may be able to decide whether ME/CFS has common contributory pathway elements with autism, major depressive disorder, widespread pain or psoriasis without knowing anything about which gene theoretically contributes most*. Note that this does not mean that any core causal pathway elements that define each disease in itself are shared. The core pathway element in migraine is the bit that gives the headache, and however many contributory genetic factors other diseases may share with migraine, if they don't have headaches they don't share the core pathway.

*Edit: If we want to know that diseases that share a risk segment of DNA involve that segment facilitating a similar class of process (however defined) then we would expect there to be a rogue gene with function relevant to both.

Arguably, to draw any useful conclusion we want to identify the gene. But the reality of practical medicine often means that we become reasonably confident that we can draw conclusions from data like this without knowing everything we would like - confident enough to justify drug trials that work. When the link between B27 and ankylosing spondylitis was found, for many years we were uncertain whether the effect was from B27 or a gene in linkage disequilibrium. Moreover, even now we have no clear idea why B27 predisposes to disease (and also protects against other disease). When B27 cropped up in uveitis and other forms of arthritis, that allowed us to build useful stories about mechanisms without us knowing all we wanted to know.

Basically, the likely scenario would be this: suppose we found that a segment of DNA in tight LD conferred risk to four or five illnesses/diseases that shared one or more features, which might not at first seem central to the illness but which were rather clearly defined (true migraine for instance). Then it would be reasonable to take as the working model that the risk(s) from the segment worked in the same way. (If the segment contained a number of genes with similar functions it might genuinely be impossible to dissociate the risks and blame one gene anyway.)
 
Maybe the problem with hEDS is that it is not one trait but two:
1. bendiness
2. enough fatigue and general pain to take you to see a physician who diagnoses hEDS
(Bendy people without pain and fatigue don't get diagnosed and may not even meet criteria for hEDS.)
That’s an interesting theory, and certainly a very possible explanation for the hEDS GWAS findings. I expect that digging more closely into the definitions of the case and control cohorts used in the hEDS GWAS may elucidate this further.

I am getting the impression that we have two different questions that may be best answered in very different ways.
1. Variants of exactly which gene with which functions contribute causation to ME/CFS?
2. To what extent do different diseases share causal genetic risk elements - which in reality may be segments of DNA containing several genes in tight LD.
I agree. And although (1.) would be more scientifically impactful, using techniques like genetic correlation to study (2.) is probably more feasible in the short term, and could still provide valuable leads to ME/CFS researchers. For this reason, I intend to read more about genetic correlation and related techniques.
 