Preprint Initial findings from the DecodeME genome-wide association study of myalgic encephalomyelitis/chronic fatigue syndrome, 2025, DecodeMe Collaboration

The protocol says that exclusion criteria are "(ii) any alternative diagnoses including major psychiatric illness (e.g. bipolar disorder or schizophrenia) that can result in chronic fatigue, as explicit in the Canadian Consensus and IOM/NAM criteria[4, 14]." So in my head that would include Hashimoto's, Graves', lupus, MS, Sjögren's etc. So quite a few B-cell autoimmune diseases that have an HLA link, if I'm not mistaken. If I remember correctly, it wasn't possible to exclude such people in the control sample?

Assuming these conditions were excluded, it might not have been possible to screen them fully out of DecodeME either. For one thing, people have the HLA types involved without ever getting autoimmune disease. There could also be participants who both have them and are destined to develop an autoimmune condition, but they were included because they gave their DNA sample before signs of it became apparent.

I don't know whether people with autoimmune disease were excluded or not, though. I have one (psoriatic disease) which gets referred to as autoimmune and auto-inflammatory so interchangeably that I've no idea what either means any more, but I was allowed to give a sample.
 
We used FUMA (v1.8.0) to annotate genetic associations (30). FUMA is an integrative web platform that performs extensive functional annotation for DNA variants in genomic areas identified by lead variants using multiple resources.
I figured out how to upload the summary statistics to FUMA, which also does MAGMA analyses, so I tried to see if I could replicate the brain enrichment.

There are a lot of customizable options, so I was not able to get the same exact results. I also had to convert all the SNPs from GRCh38 to GRCh37 to be able to use the tool. There are various methods to convert coordinates, and some coordinates are difficult or not possible to map. I used UCSC liftOver. Out of 8,902,782 variants, 28,169 (or around 0.3%) could not be mapped, so that may play a part in the difference.
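As a quick check of the liftOver loss quoted above, here is the arithmetic in Python. The two counts are the ones given in the post; nothing else is assumed.

```python
# Counts quoted in the post above (GRCh38 -> GRCh37 conversion with UCSC liftOver).
total_variants = 8_902_782
unmapped_variants = 28_169

# Share of variants lost in the conversion, and how many remain for FUMA.
unmapped_fraction = unmapped_variants / total_variants
mapped_variants = total_variants - unmapped_variants

print(f"unmapped: {unmapped_fraction:.3%}")
print(f"mapped:   {mapped_variants:,}")
```

So "around 0.3%" holds, and roughly 8.87 million variants went into the FUMA upload.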

Also MAGMA requires setting values for the distance on either side of genes where SNPs are considered as related to a gene. I don't know what value they used, so I used the default of 0 (only SNPs actually within a gene region are considered).
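To make the window setting concrete, here is a minimal sketch of how a window changes which SNPs count towards a gene. This is not MAGMA's code. The function name and all coordinates are made up for illustration.

```python
def snps_assigned_to_gene(snp_positions, gene_start, gene_end, window_bp=0):
    """Return SNP positions falling inside the gene plus a window on each side."""
    lower = gene_start - window_bp
    upper = gene_end + window_bp
    return [pos for pos in snp_positions if lower <= pos <= upper]


# Hypothetical gene and SNP coordinates on a single chromosome.
gene_start, gene_end = 100_000, 120_000
snps = [95_000, 99_999, 100_000, 110_500, 120_000, 125_000, 131_000]

# Window of 0: only SNPs inside the gene boundaries are used.
within_gene = snps_assigned_to_gene(snps, gene_start, gene_end, window_bp=0)

# A 10 kb window also pulls in nearby SNPs just outside the gene.
with_10kb = snps_assigned_to_gene(snps, gene_start, gene_end, window_bp=10_000)

print(len(within_gene), len(with_10kb))
```

A wider window picks up nearby regulatory variants but also assigns the same SNP to several neighbouring genes, so a different window from the one the study used could plausibly shift the gene p-values a little.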

Here's the FUMA-created Manhattan plot of the data I uploaded:
1755349054962.png

I got the same 13 significant MAGMA genes (though with slightly different p-values):
1755349139263.png

And here is the MAGMA tissue enrichment:
magma_exp_gtex_v8_ts_avg_log2TPM_FUMA_jobs651031.png

For reference, this is the official enrichment from the study:
1755349384985.png

The tissues are generally the same. All the significant tissues are brain regions, and the first five are still in the same order. The order of the rest is a little different, and a couple of brain regions are now not significant, while the pituitary is.

To prove that the brain enrichment isn't dependent on those 13 genes, I deleted them. By that I mean that I deleted all data for the SNPs within 50kb on either side of those 13 genes, so that they couldn't possibly play a part. I uploaded the filtered data and reran the analysis. As expected, there are now no significant MAGMA genes:
geneManhattan_FUMA_jobs651164.png

But the tissue enrichment is still almost identical (a few brain regions swapped positions and the heights just barely changed):
magma_exp_gtex_v8_ts_avg_log2TPM_FUMA_jobs651164.png
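The filtering step described above (dropping every SNP within 50kb of the 13 genes before re-uploading) could be sketched roughly like this. It is only an illustration of the idea, not the script actually used. The function names, gene coordinates and SNP rows are all hypothetical.

```python
def build_exclusion_intervals(genes, flank_bp=50_000):
    """Map chromosome -> list of (start, end) intervals padded by flank_bp."""
    intervals = {}
    for chrom, start, end in genes:
        padded = (max(0, start - flank_bp), end + flank_bp)
        intervals.setdefault(chrom, []).append(padded)
    return intervals


def drop_snps_near_genes(snps, intervals):
    """Keep only SNP rows that fall outside every exclusion interval."""
    kept = []
    for chrom, pos, pvalue in snps:
        near_gene = any(lo <= pos <= hi for lo, hi in intervals.get(chrom, []))
        if not near_gene:
            kept.append((chrom, pos, pvalue))
    return kept


# Hypothetical gene coordinates (chromosome, start, end).
genes = [("1", 1_000_000, 1_050_000), ("15", 500_000, 520_000)]

# Hypothetical summary-statistics rows (chromosome, position, p-value).
snps = [
    ("1", 940_000, 0.20),    # more than 50 kb upstream: kept
    ("1", 960_000, 1e-6),    # inside the 50 kb flank: dropped
    ("1", 1_020_000, 3e-9),  # inside the gene: dropped
    ("1", 1_100_001, 0.45),  # just beyond the downstream flank: kept
    ("15", 530_000, 1e-4),   # inside the flank on chromosome 15: dropped
    ("6", 510_000, 0.01),    # different chromosome: kept
]

filtered = drop_snps_near_genes(snps, build_exclusion_intervals(genes))
print(filtered)
```

Matching on chromosome as well as position matters here, otherwise a SNP at the same coordinate on another chromosome would be removed by mistake.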
 
FUMA also has exactly the MAGMA cell type enrichment I was hoping for.

There are hundreds of different cell-type expression datasets to choose from to do analyses like the tissue enrichment above. I don't really know how to choose from them, or even to examine which cell-types are included before running the analysis, but these are the 5 I tested:
PsychENCODE_Adult
GSE104276_Human_Prefrontal_cortex_all_ages
GSE67835_Human_Cortex
GSE168408_Human_Prefrontal_Cortex_level1_Fetal
576_Xu_Human_2023_Lymph_node_ThoracicLymphNode_level1

Here are the enrichment plots for these. Note that the red bars are for cell types that were significant after Bonferroni correction within one dataset. None of the cell types were significant when correcting across all five datasets.
PsychENCODE_Adult_FUMA_celltype651043.png
GSE104276_Human_Prefrontal_cortex_all_ages_FUMA_celltype651043.png
GSE67835_Human_Cortex_FUMA_celltype651043.png
GSE168408_Human_Prefrontal_Cortex_level1_Fetal_FUMA_celltype651043.png
576_Xu_Human_2023_Lymph_node_ThoracicLymphNode_level1_FUMA_celltype651043.png

For the first chart, the most significant cell-type is "Ex8". I don't know exactly what that is, other than I think it's a subtype of excitatory neuron, based on some quick searches. For the second, it's "GABAergic neurons", and for the third it's "neurons".

It occurs to me how easy p-hacking would be with this. Someone could easily test hundreds of different datasets, dredging for significant cell-types, then only report a few. Makes me feel that pre-registration for such studies should be more common, and that the cell-type datasets they plan to test should be in there.
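To show why the bar rises as more datasets are tested, here is a small sketch of Bonferroni correction within one dataset versus across all of them. The dataset names, cell-type counts and the p-value are invented for illustration. The real counts are not given above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Hypothetical number of cell types in each of five datasets.
celltypes_per_dataset = {
    "dataset_A": 10,
    "dataset_B": 25,
    "dataset_C": 6,
    "dataset_D": 15,
    "dataset_E": 24,
}

alpha = 0.05
p_value = 0.003  # hypothetical p-value for one cell type in dataset_A

# Correcting only for the tests inside dataset_A.
within = bonferroni_threshold(alpha, celltypes_per_dataset["dataset_A"])

# Correcting for every cell type tested across all five datasets.
total_tests = sum(celltypes_per_dataset.values())
across = bonferroni_threshold(alpha, total_tests)

print(f"within-dataset threshold: {within:.5f}, significant: {p_value < within}")
print(f"across-dataset threshold: {across:.6f}, significant: {p_value < across}")
```

The same p-value passes within one dataset and fails across five. Someone who tested hundreds of datasets and reported only the within-dataset result for a few would be hiding that second threshold.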

Edit: On that note, I did first accidentally do the enrichment analysis on every single brain-related cell-type dataset, not realizing that you were supposed to pick specific datasets from within that section. I barely even glanced at the results, before redoing it with specific datasets. I don't remember what the results were, but just saying it for full transparency.
 
Oh, last thing for now. Like the DecodeME study, FUMA ran a MAGMA gene-set analysis on various curated gene sets. And like the study, it didn't return any significant gene sets after Bonferroni correction. But it might be interesting to look at the gene sets that had the lowest p-values:

1755352152607.png

I don't know what the first four are, but the next three are familiar. There have been several synapse discussions, such as in relation to the genes prioritized in the Zhang HEAL2 paper. From that paper:
As highlighted in our network analysis, ME/CFS genes participate in biological pathways associated with synaptic function
 
  • GOBP_PEPTIDYL_LYSINE_ACETYLATION → Acetylation of lysine residues in proteins.
  • GOBP_PROTEIN_ACETYLATION → More general acetylation of proteins.
  • GOMF_UBIQUITIN_LIKE_PROTEIN_LIGASE_BINDING → Binding to enzymes that attach ubiquitin or ubiquitin-like proteins to targets.
  • GOBP_PROTEIN_ACYLATION → Covalent addition of acyl groups to proteins (which includes acetylation but also other acylations).
I think we should investigate whether protein degradation is a factor here. From machine learning and network analysis performed in 2018:


Screenshot 2025-08-16 at 17.50.13.png

Recall also that the paper by Zhang et al (https://www.s4me.info/threads/disse...ing-powered-genome-analysis-2025-zhang.43705/) has the following related information:

Screenshot 2025-08-16 at 17.53.58.png
 
Another thought: Would it be worthwhile to create a list of the hypotheses we think are worth pursuing, both by members here and by others? In a members only post or even a private group. We could then speculate on what experiments could be done to validate/falsify these hypotheses.
This sounds like an excellent idea to my foggy, non-scientific brain...

Eta I've created this thread for anything that emerges...
 
This is fascinating @forestglip ! I'm quite a bit behind in my understanding of MAGMA and FUMA compared to you, but I will try to catch up.

Based on what you posted, the evidence seems quite persuasive that the differences found in DecodeME (not just the 8 hits) point to something happening in the brain. And there is less persuasive but still interesting evidence pointing towards neurons and their synapses.

Because MAGMA avoids trying to pinpoint specific genes and just works with correlations/probabilities of all SNPs and their correction to all genes, it might give a more global view of where the problem lies?

EDIT: to me, this seems more persuasive than the pathways of the 40+ potential genes that FUMA/coloc identified because these still point in many possible directions.
 
I'm quite a bit behind in my understanding of MAGMA and FUMA compared to you, but I will try to catch up.
Let me know if you run into issues. There were a few annoying roadbumps in the process.

Because MAGMA avoids trying to pinpoint specific genes and just works with correlations/probabilities of all SNPs and their correction to all genes, it might give a more global view of where the problem lies?
Maybe. I imagine the eight main loci might be related to brain, might not. Maybe 6 are, and 2 are related to the immune system. But MAGMA is more like, if you average out all of the genetic signal, what does it point to.
 
Based on what you posted, the evidence seems quite persuasive that the differences found in DecodeME (not just the 8 hits) point to something happening in the brain. And there is less persuasive but still interesting evidence pointing towards neurons and their synapses.
Yes, combined with Zhang's synapse findings, it seems more and more likely to me that the brain is part of the picture. I can't remember if we have other reason to think about synapses specifically.

Maybe your earlier suggestion of thinking about which of the genes from the top loci have an obvious, strong connection to brain/synapses might be a good starting point.

Edit: I also previously did GSEA based on the rare variant associations with CFS from the UK BioBank:
So I did preranked GSEA using the -log10(SKATO p value) for ranking from the Genebass page for CFS. [...]

But looking at the cellular component report, there seem to be a lot of neuron-related components near the top.
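For anyone unfamiliar with the preranked input mentioned above, here is a rough sketch of turning gene-level p-values into a ranked list using -log10(p). The gene names and p-values are placeholders, not Genebass values, and the function name is my own.

```python
import math


def rank_genes_by_neg_log10_p(gene_pvalues):
    """Return (gene, -log10(p)) pairs sorted from strongest to weakest signal."""
    scored = [(gene, -math.log10(p)) for gene, p in gene_pvalues.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder gene-level p-values (e.g. from a SKAT-O burden test).
gene_pvalues = {"GENE_A": 0.5, "GENE_B": 1e-4, "GENE_C": 0.02}

ranked = rank_genes_by_neg_log10_p(gene_pvalues)
for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

One caveat with this metric is that it has no sign, so the ranking only says how strong an association is, not its direction.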
 
FUMA also has exactly the MAGMA cell type enrichment I was hoping for.

Would you be able to answer my earlier question to Prof Ponting:

This non-scientist's understanding would benefit from knowing the variables involved in the gene-set analyses:
Z = B0 + C1.B1 + ... + CnBn + e


... is Z the 13 gene-analysis ones, or is it all 18k?
... is C1 a binary 0/1 for membership of each modeled gene in the gene-set (set of genes expressed in a tissue)

Hugely impressed you have done all that work and can get close to the study results - wrangling the actual data gives a better feel for what was actually done.
 
See here (different letters used but same idea):
To identify tissue specificity of the phenotype, FUMA performs MAGMA gene-property analyses to test relationships between tissue specific gene expression profiles and disease-gene associations. The gene-property analysis is based on the regression model,

Z ∼ β0 + Et·βE + A·βA + B·βB + ϵ

where Z is a gene-based Z-score converted from the gene-based P-value, B is a matrix of several technical confounders included by default. Et is the gene expression value of a testing tissue type c and A is the average expression across tissue types in a data set [...]

We performed a one-sided test (βE>0) which is essentially testing the positive relationship between tissue specificity and genetic association of genes.

The tissue gene-property analysis is a linear regression of all genes. Z is a gene's score from the GWAS and Et is a gene's expression in a tissue. Both of which are continuous, not binary.

For the gene-set analysis (the ubiquitin, synapse gene sets, etc), there's a binary variable on the right side instead - a gene is either in the gene set or not. The z-score on the left is still continuous.
 
Fantastic - thanks so much. The paper confused me:
We considered 54 tissue types and identified significant enrichment of these genes’ expression for 13 (p < 0.05/54), all of which were brain regions

it wasn't clear what "these" referred to.
 
This lecture is interesting and relevant to our discussion:
MPG Primer: Linking SNPs with genes in GWAS (2022)

Don't understand everything, but there's some discussion that eQTL data and GWAS hits often do not match very well. Genes that are likely to be causally related to disease often do not have a lot of eQTL data.

This makes sort of sense because eQTL data is mostly about turning the gene expression on and off in different degrees, like a volume knob. But genes that are causally related to disease in GWAS will often be fine-tuned because turning the knob too high or too low becomes pathological. In other words, those with a lot of eQTL data are often those where the expression doesn't have a damaging effect on the organism, so perhaps not the ones we're interested in.

I suspect this mostly applies to diseases/conditions with clear hits and higher effect size, but perhaps it also applies to our quest to find the causal variants in DecodeME. For the hit on chromosome 1, for example, the paper highlights RABGAP1L because it has high coloc probability based on eQTL data in many different tissues (see Figure 4 in the paper). But as the graph below shows, there are many other potential genes in the region, most of which are closer to the hit.
1755457894317.png

In the lecture, they mention that the closest gene is certainly not always the causal one but it is significantly more likely to be so than further away genes. So perhaps it would be worthwhile to highlight the closest 1-2 genes for each of the hits, as these are more likely to be relevant than others.
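Picking the closest 1-2 genes per hit, as suggested above, is simple to do once gene coordinates are available. Here is a minimal sketch, with hypothetical gene names and positions standing in for a real annotation file.

```python
def distance_to_gene(snp_pos, gene_start, gene_end):
    """Distance in bp from a SNP to a gene; 0 if the SNP lies inside the gene."""
    if gene_start <= snp_pos <= gene_end:
        return 0
    return min(abs(snp_pos - gene_start), abs(snp_pos - gene_end))


def nearest_genes(snp_pos, genes, top_n=2):
    """Return the top_n genes closest to the SNP as (name, distance) pairs."""
    ranked = sorted(genes, key=lambda g: distance_to_gene(snp_pos, g[1], g[2]))
    return [
        (name, distance_to_gene(snp_pos, start, end))
        for name, start, end in ranked[:top_n]
    ]


# Hypothetical genes on one chromosome: (name, start, end).
genes = [
    ("GENE_A", 100_000, 150_000),
    ("GENE_B", 210_000, 260_000),
    ("GENE_C", 400_000, 480_000),
]
lead_snp = 200_000  # hypothetical lead variant position

closest = nearest_genes(lead_snp, genes)
print(closest)
```

This measures distance to the nearest gene boundary. Measuring to the transcription start site instead is another common choice and can change the ranking for long genes.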
 
So perhaps it would be worthwhile to highlight the closest 1-2 genes for each of the hits, as these are more likely to be relevant than others.
The example locus you gave might be one of the harder ones to do this with because there are so many genes around the locus. There's a good chance the causal variant isn't the top hit, so one of the other variants near another gene might be causal.
 
Highlighting gene UNC13C, which seems the closest to the hits on chromosome 15. The gene card reads as follows:
Predicted to enable calmodulin binding activity and syntaxin-1 binding activity. Predicted to be involved in glutamatergic synaptic transmission and regulated exocytosis. Predicted to be located in presynaptic active zone. Predicted to be active in several cellular components, including axon terminus; presynaptic membrane; and synaptic vesicle membrane.
UNC13C Gene - GeneCards | UN13C Protein | UN13C Antibody

EDIT: added the image below

1755591849124.png
 
Another gene that hasn't been discussed yet but that seems the closest to the hit on chromosome 6q is POU3F2
This gene encodes a member of the POU-III class of neural transcription factors. The encoded protein is involved in neuronal differentiation and enhances the activation of corticotropin-releasing hormone regulated genes. Overexpression of this protein is associated with an increase in the proliferation of melanoma cells.
POU3F2 Gene - GeneCards | PO3F2 Protein | PO3F2 Antibody

EDIT: added the image below

1755591888952.png
 
This makes sort of sense because eQTL data is mostly about turning the gene expression on and off in different degrees, like a volume knob. But genes that are causally related to disease in GWAS will often be fine-tuned because turning the knob too high or too low becomes pathological. In other words, those with a lot of eQTL data are often those where the expression doesn't have a damaging effect on the organism, so perhaps not the ones we're interested in.
Great point—also the fact that a mutation could often be relevant for a reason that doesn't affect expression levels at all, but rather how it affects the binding affinity or accessibility of certain domains to ligands, regulatory enzymes and molecules, etc etc etc.

A particular mutation could be extremely relevant but have no eQTL data because the thing it does mechanistically is swap out an amino acid residue that can no longer get phosphorylated/acetylated/what have you and as a result that protein can’t get activated as strongly as it should. But the total amount of that gene’s transcripts or protein might remain relatively unchanged. So eQTLs provide information on one possible way that a SNP could be biologically relevant, but that’s about it.
 