Preprint Initial findings from the DecodeME genome-wide association study of myalgic encephalomyelitis/chronic fatigue syndrome, 2025, DecodeME Collaboration

Thanks @Chris Ponting, I look forward to seeing what more there is to come, and echo @Hoopoe in thanking you and the rest of the team for all you’ve done so far. The more I’m learning, the more possibilities and scope there seem to be. Exciting times!

Really useful answers in general but also in helping my understanding of the process. I wasn’t sure if I’d just missed something wrt FUMA/MAGMA (it also seems they briefly supported hg38 but now don’t).

I’ve just finished running magma 1.10 standalone from the command-line on gwas-1 using hg38 (amazing what you can do on a raspberry pi!). Although tbh it’s difficult to know what else I may have missed or got wrong in the process and it uses the same reference data (https://cncr.nl/research/magma/) so may have the same problems. But I’ll try to write up my steps so others here can have a prod at it.
 
Ah yes apologies. I think I'm a bit confused because those low p-values shown in the graph are in purple and for the category: 0.01 ≤ MAF < 0.04, but I thought that the hits found had MAF > 10%?
Oh good point. If I'm understanding correctly, the furthest right, uppermost dot is 20:48914387:T:TA (the lead hit in the whole study), but its MAF is around 0.34.
 
Hi. Yes, we needed to LiftOver DNA variants from human genome build 38 (hg38) to the previous version (GRCh37 (hg19)), which FUMA uses for, e.g., MAGMA analysis. You're correct that some SNPs are lost in the LiftOver, but only a few. More variants are lost because FUMA's reference panel of variants is incomplete. Yes, the GWAS model we used in REGENIE is an additive linear model. But we are also applying non-additive, non-linear models separately (TarGene; will report when we have results). Yes - as with nearly all GWAS - our lead SNPs are not in protein-coding sequence. And, yes, there is scope for a lot more analyses both by us (DecodeME) and any researchers who receive the consented and pseudonymised data via the Data Access Committee. The preprint's title starts with "Initial findings..." and we mean this. Yet more to come.
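As an illustration only (not necessarily the exact pipeline used for the paper), the generic LiftOver step looks something like this with UCSC's liftOver tool and chain file; the file names are just placeholders:

Code:
# grab the hg38 -> hg19 chain file from UCSC
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz

# build a BED file of hg38 positions from the REGENIE output (BED is zero-based, half-open)
awk 'NR!=1 {print "chr"$1"\t"$2-1"\t"$2"\t"$3}' gwas_1.regenie > hg38_positions.bed

# liftOver writes mapped intervals to one file and failures to another
liftOver hg38_positions.bed hg38ToHg19.over.chain.gz hg19_positions.bed unmapped.bed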
Hi Chris, sincere thanks for all your hard work and advocacy on our behalf. Exciting to hear more analyses of the data are in the works!

I was wondering if there was any sort of timeframe for the publication of the HLA analysis? And am curious whether the Manhattan plot published with the initial data shows the HLA data points or whether they were removed from the plot whilst the analysis is ongoing?
 
No timeframe for completion of the HLA analysis. The Manhattan plot in the preprint includes all variants including those in the HLA. But note that in the HLA analyses we do not test each DNA variant, rather we test combinations of variants that are commonly coinherited, i.e. "HLA alleles". So even if there isn't a "signal" in the Manhattan plot this does not immediately mean that we won't see association to an HLA allele.
 
Don't think we have discussed this plot yet:
Tried to recreate the plot using the whole dataset:

[attached image: recreated plot]

I thought it started to deviate from the red line (which has a slope of 1) quite early, from an observed -log10 p-value of about 3. But the graph is a bit misleading because the scale is logarithmic, so the vast majority of points sit in the region to the left of this (there are so many overlapping points that you can't see the difference). For example, more than 90% of points have an observed -log10 p-value below 4.

So it's mostly the bottom-left region of the graph that matters for interpreting inflation, and that seems to be close to the red line.
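As a rough sanity check on that 90% figure, an awk one-liner like this can count the fraction of variants with a low observed -log10 p-value (assuming the REGENIE output keeps its header row and a LOG10P column; adjust the file name to whichever copy you're working from):

Code:
awk 'NR==1 {for(i=1;i<=NF;i++) if($i=="LOG10P") c=i; next}
     $c < 4 {n++}
     END {printf "%.1f%% of %d variants have -log10(p) < 4\n", 100*n/(NR-1), NR-1}' gwas_1.regenie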
 
Oh good point. If I'm understanding correctly, the furthest right, uppermost dot is 20:48914387:T:TA (the lead hit in the whole study), but its MAF is around 0.34.
Looks like a labelling issue in the graph. Here's what I got when trying to recreate it. The most significant p-values (highest -log10 values) belong to the SNPs with high MAF, showing that it was mostly this group that produced the significant hits.

@Chris Ponting

[attached image: recreated plot with MAF groups]
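For anyone who'd rather eyeball the MAF/p-value relationship without plotting, a quick sketch along these lines should do it (the bin edges are just illustrative and it assumes the REGENIE header has A1FREQ and LOG10P columns):

Code:
awk 'NR==1 {for(i=1;i<=NF;i++){if($i=="A1FREQ") f=i; if($i=="LOG10P") p=i}; next}
     {maf = ($f > 0.5 ? 1-$f : $f);    # fold allele frequency into MAF
      bin = (maf < 0.01 ? "MAF<0.01" : maf < 0.04 ? "0.01-0.04" : maf < 0.10 ? "0.04-0.10" : "MAF>=0.10");
      if ($p > best[bin]) best[bin] = $p}    # keep the strongest -log10(p) per bin
     END {for (b in best) printf "%-12s best -log10(p) = %.2f\n", b, best[b]}' gwas_1.regenie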
 
As mentioned above, something @forestglip and I have been working on recently is running MAGMA locally on the DecodeME summary statistics.

I wouldn’t necessarily recommend anyone do this or follow these instructions. There may be things wrong with this method and it seems to end up with only 94% of variants mapped while @forestglip gets over 99% using the LiftOver and FUMA method mentioned earlier.

But it was a useful exercise to learn a bit about the process and in sharing this we hope others can learn or indeed point out errors, which helps us learn!

A recipe for gene enrichment with DecodeME and MAGMA
You will need the following ingredients:
- a computer with a shell (performed on Linux but can also be done on Mac or Windows)
- an understanding of installing and running tools from the command-line
- enough space to download lots of data (15GB or so)

Prepare your tools
- As well as standard tools like awk and zcat, you will need to install sort-bed, bedmap and magma itself
- MAGMA binaries and source are available from https://cncr.nl/research/magma/
- The apt package bedops provides sort-bed and bedmap; see https://github.com/bedops/bedops for more info

Prepare your data
- DecodeME summary stats files and the list of variants that passed QC are available here: https://osf.io/rgqs3/files/osfstorage
- You will also need gene locations (build 38) and reference data (European) from https://cncr.nl/research/magma/
- Molecular Signatures Database (MSigDB) files from https://www.gsea-msigdb.org/gsea/msigdb (the complete Human MSigDB v2025.1.Hs release, described as the Human Gene Set GMT file set (ZIPped))
- the UCSC genome annotation database (the snp151 table) for the GRCh38/hg38 assembly

We recommend organising these into tools and data folders for easy management, especially if you’ll be running multiple passes with different databases or options. Most need unzipping although the snp151 database is very large so best left zipped.
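For example, fetching the big UCSC table from the command line can be done like this (the path below is UCSC's standard downloads area for the hg38 database tables, but do check it still matches the current release; the MAGMA reference data and DecodeME files are downloaded manually from the links above):

Code:
# UCSC snp151 table for the hg38 assembly (several GB; leave it compressed)
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp151.txt.gz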

Then perform the following steps

Filter the main gwas file to include just those variants which passed quality control
Code:
awk -v OFS='\t' 'FNR==NR {ids[$1]++; next} NR==1 || ($3 in ids)' gwas_qced.var gwas_1.regenie > filtered_regenie_file.txt

Add rsid information to these summary stats
First we create and then sort a BED-formatted file of chromosomes and positions (zero-indexed) from the above
Code:
awk -v OFS='\t' 'NR!=1 { print "chr"$1, $2-1, $2, $0 }' filtered_regenie_file.txt > query.bed
sort-bed query.bed > sorted_query.bed

And filter the UCSC genome annotation database to include only single nucleotide variations and again output this in bed format
Code:
zcat snp151.txt.gz | awk -v OFS='\t' '{if($12=="single") print $2,$3,$4,$5}' > snp151_snp.bed

Then we use bedmap to match rsids to the positions in the summary stats BED file
Code:
bedmap --echo --echo-map-id --delim '\t' sorted_query.bed snp151_snp.bed > sorted_query_with_rsids.bed

Now we prepare the data for MAGMA
First create a SNP location file (rsid, chromosome, position) from the above
Code:
awk -v OFS='\t' '$21 != "" {print $21,$4,$5}' sorted_query_with_rsids.bed > decodeme.magma_input.id.chr.genpos.txt

Then a p-value file (converting REGENIE's LOG10P back to a p-value with 10^-LOG10P)
Code:
awk -v OFS='\t' '$21 != "" {print $21,10^(-$19)}' sorted_query_with_rsids.bed > decodeme.magma_input.id.p.txt

Now we can use MAGMA to annotate genes with their SNPs
Code:
snploc=./decodeme.magma_input.id.chr.genpos.txt
ncbi38=./NCBI38.gene.loc
magma --annotate \
      --snp-loc ${snploc} \
      --gene-loc ${ncbi38} \
      --out decodeme

Then perform gene-based analysis
Code:
ref=./g1000_eur/g1000_eur
magma \
    --bfile $ref \
    --pval ./decodeme.magma_input.id.p.txt N=275488 \
    --gene-annot decodeme.genes.annot \
    --out decodeme

And finally gene-set analysis
Code:
geneset=msigdb_v2025.1.Hs_GMTs/c5.all.v2025.1.Hs.entrez.gmt
magma \
    --gene-results decodeme.genes.raw \
    --set-annot ${geneset} \
    --out decodeme

After a short wait you will have some delicious data to consume with friends!

Thanks to those who made and documented these tools
https://cncr.nl/research/magma/
https://bedops.readthedocs.io/en/latest/

As well as the wise and informative internet for its advice on how to use them
https://cloufield.github.io/GWASTutorial/09_Gene_based_analysis/
https://hanbin973.github.io/bioinformatics/2023/02/20/rsid.html


There’s also a handy bash script attached, which will ask for paths to the relevant locations and run it all for you. This also saves configurations and allows reuse of some intermediate files so makes multiple runs with slightly different settings or data easier. The script was put together with help from Gemini Pro for all the boring scaffolding.
 


As mentioned above, something @forestglip and I have been working on recently is running MAGMA locally on the DecodeME summary statistics.

I wouldn’t necessarily recommend anyone do this or follow these instructions. There may be things wrong with this method and it seems to end up with only 94% of variants mapped while @forestglip gets over 99% using the LiftOver and FUMA method mentioned earlier.
Thanks for delving into this!

It looks so complicated. I expected that there would be nice packages in R and Python that wrap this into a handy user interface and provide the correct reference data (for LD, for example) automatically. Quite disappointing that the workflow mostly consists of command-line instructions and extracting the right data and tools from the web.

Yesterday I tried doing the LDSC and stumbled into one problem after another (it only runs on Python 2, you need to convert GRCh38 to rsIDs, download the correct reference data online, etc.). I ran out of energy but hope to try again later.
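For reference, the rough shape of what I was attempting is below. It's very much a sketch under assumptions rather than something I've got working: it presumes the summary stats have already been converted to rsIDs (the input file name is just a placeholder) and uses the standard LDSC downloads (the w_hm3.snplist allele list and the eur_w_ld_chr LD scores).

Code:
# munge the summary stats into LDSC's expected format (needs rsIDs and a sample size)
python2 munge_sumstats.py \
    --sumstats decodeme_with_rsids.txt \
    --N 275488 \
    --merge-alleles w_hm3.snplist \
    --out decodeme_munged

# estimate SNP heritability with LD score regression
python2 ldsc.py \
    --h2 decodeme_munged.sumstats.gz \
    --ref-ld-chr eur_w_ld_chr/ \
    --w-ld-chr eur_w_ld_chr/ \
    --out decodeme_h2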
 
Thanks for delving into this!

It looks so complicated. I expected that there would be nice packages in R and Python that wrap this into a handy user interface and provide the correct reference data (for LD, for example) automatically. Quite disappointing that the workflow mostly consists of command-line instructions and extracting the right data and tools from the web.

Yesterday I tried doing the LDSC and stumbled into one problem after another (it only runs on Python 2, you need to convert GRCh38 to rsIDs, download the correct reference data online, etc.). I ran out of energy but hope to try again later.
Unfortunately a lot of bioinformatics pipelines are pretty hodgepodge and you need to be a Jack of all trades to get something new up and running. It’s true that all of coding is 90% troubleshooting, but for bioinformatics especially, where almost none of the packages/tools were designed by people explicitly trained in software engineering, it can be hell in a handbasket with entire weeks wasted. I started out in research with only a little bit of self-taught R experience and immediately got thrown headfirst into trying to understand what awk was, with no help except Stack Exchange—I’m glad I had already developed the ability to parse through complete gibberish and not get overwhelmed, otherwise it would have discouraged me from science altogether.

Nextflow has improved things somewhat—it’s a crowdsourced automated pipeline collection that basically handles all the intermediate steps for you. But when something breaks or if you need to customize anything, it’s a nightmare and a half. And you’re limited to the few highly used pipelines that have been developed and tested by a kind soul.

Huge props to folks on this thread for diving right into it and being so quick on the uptake!
 
It looks so complicated. I expected that there would be nice packages in R and Python that wrap this into a handy user interface and provide the correct reference data (for LD, for example) automatically.
Note that the script hotblack posted includes a lot of extras for convenience, like remembering settings for future runs. These are the only required steps we used:
Bash:
# First store all variants from gwas_qced.var in ids array.
# Then loop over every row in gwas_1.regenie and save to filtered_regenie_file.txt if it's the first row (header) or if the third column (the ID) is in the ids array.
awk 'FNR==NR {ids[$1]++; next} NR==1 || ($3 in ids)' gwas_qced.var gwas_1.regenie > filtered_regenie_file.txt

# For all but first row, print the chromosome, position-1, position, and entire original line to query.bed
# Need to subtract one because .bed files are zero indexed
awk 'NR!=1 { print "chr"$1, $2-1, $2, $0 }' filtered_regenie_file.txt > query.bed

# Sort bed file
sort-bed query.bed > sorted_query.bed

# Filter rsid reference file to only include SNVs.
# Pipe decompressed file to awk
# Set output separator to tab
# If the twelfth column is "single" (SNV), output chr, start, end, rsid (BED format)
zcat snp151.txt.gz | awk -v OFS='\t' '{if($12=="single") print $2,$3,$4,$5}' > snp151_snp.bed

# Add matching rs ids to the positions in the summary stats bed file
bedmap --echo --echo-map-id --delim '\t' sorted_query.bed snp151_snp.bed > sorted_query_with_rsids.bed

# Create SNP position file from summary statistics for MAGMA
# The format is 'rsid    chromosome    position'
awk '$21 != "" {print $21,$4,$5}' sorted_query_with_rsids.bed > decodeme.magma_input.id.chr.genpos.txt

# Create SNP p value file from summary statistics for MAGMA
# The format is 'rsid    p-value'
awk '$21 != "" {print $21,10^(-$19)}' sorted_query_with_rsids.bed > decodeme.magma_input.id.p.txt

# Annotate genes with SNPs
snploc=./decodeme.magma_input.id.chr.genpos.txt
ncbi38=./NCBI38.gene.loc
magma --annotate \
      --snp-loc ${snploc} \
      --gene-loc ${ncbi38} \
      --out decodeme

# Do gene-based analysis to create score for each gene
ref=./g1000_eur/g1000_eur
magma \
    --bfile $ref \
    --pval ./decodeme.magma_input.id.p.txt N=275488 \
    --gene-annot decodeme.genes.annot \
    --out decodeme

# Run gene-set analysis
geneset=msigdb_v2025.1.Hs_GMTs/c5.all.v2025.1.Hs.entrez.gmt
magma \
    --gene-results decodeme.genes.raw \
    --set-annot ${geneset} \
    --out decodeme

FUMA does make MAGMA much easier (though it requires converting to GRCh37). I've been looking for other online tools for analyzing GWAS data. The following each have several automatic analysis tools where you mainly just need to upload the summary stats (though unfortunately the data needs to have rs ids for at least the second website). I haven't tried any of the tools yet, but the second one looks like it has LDSC.

SysBiol PostGWAS

Complex-Traits Genetics Virtual Lab

The main struggle seems to be either mapping the variants to rs ids or converting from GRCh38 to GRCh37, and it doesn't seem that either is a trivial task to do exactly right. I'm not sure we did the rs id mapping perfectly since we based it only on positions and not letters. Surprisingly hard to find good explanations.

Edit: Also we found later that "single" in the rsid reference file doesn't only mean SNVs. I think it includes all single letter insertions or deletions as well, so not sure how that affects things.
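For what it's worth, a quick way to see which values that class column actually takes (column 12, the one the filter above compares to "single") is:

Code:
# count how often each class value appears in the snp151 table
zcat snp151.txt.gz | cut -f12 | sort | uniq -c | sort -rn | head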
 
The main struggle seems to be either mapping the variants to rs ids or converting from GRCh38 to GRCh37, and it doesn't seem that either is a trivial task to do exactly right. I'm not sure we did the rs id mapping perfectly since we based it only on positions and not letters. Surprisingly hard to find good explanations.
It might not help with the practical problems—hard to know because I don't understand any of it!—but there is a catalogue of GWAS studies at the Biometrics Institute.

Thought I'd link it in case it hadn't already come up.

 
I’ve gained a whole new level of appreciation for what you and other people in bioinformatics do…
I appreciate it, though I’ll be honest it’s partially a mess of our own making. Not a month goes by where I don’t witness a PhD bioinformaticist doing something utterly baffling like hardcoding 16 separate subsets of one data table instead of creating one variable to split the table by… (and I wish that was an exaggeration, not a real example)
 
For those with strength, courage and coding skills:

I've noticed that the Dutch authors of MAGMA have a new method called FLAMES. It combines multiple approaches to finding the effector gene within a significant GWAS locus using a machine-learning framework.
Prioritizing effector genes at trait-associated loci using multimodal evidence | Nature Genetics
GitHub - Marijn-Schipper/FLAMES: FLAMES: Accurate gene prioritization in GWAS loci

Recently, integrative methods have been developed that aim to combine many different levels of functional data to predict the effector gene in a GWAS locus [1–4]. There are two main strategies to do so. The first strategy prioritizes genes using locus-based SNP-to-gene data. Examples of this are chromatin interaction mapping, quantitative trait loci (QTLs) mapping or selecting the closest gene to the lead variant. These annotations can then be combined by linear regression or machine learning to merge several types of SNP-to-gene data into a single prediction [1,3]. The second strategy assumes that all GWAS signal converges into shared, underlying biological pathways and networks. These methods prioritize genes in a locus based on gene features enriched across the entire GWAS [4]. However, no current method leverages these two strategies together to make well-calibrated predictions of the effector genes in a locus. We designed a new framework, called FLAMES. This framework integrates SNP-to-gene evidence and convergence-based evidence into a single prediction for each fine-mapped GWAS signal.

[attached image]

If I understand correctly, the distance to the gene is still one of the most important pieces of information.
In our benchmarks, we found that the only competitive method to FLAMES is selecting the closest gene to lead SNP if it also has the highest PoPS in the locus. Generally, this produces slightly higher precision (3.1% across all benchmarks) than FLAMES would yield, without the need for running the FLAMES annotation and scoring steps, reducing the computational complexity of gene prioritization. However, prioritizing genes using FLAMES at the recommended threshold will yield approximately 33.2% more correctly identified causal genes.
 
This paper gives some data and background on eQTL not being as useful in identifying relevant genes as many anticipated.
The missing link between genetic association and regulatory function | eLife

Using diseases and genes where the mechanisms are relatively well understood, they found that GWAS hits fall in the regulatory regions of those genes, as expected. Only a small fraction (around 8%) of those GWAS loci, however, show colocalization with eQTLs in the relevant tissue. In other words, the non-coding variants are real and likely regulatory, but current eQTL datasets and methods often fail to show which gene they regulate.

DecodeME mainly relied on eQTL colocalization to define Tier 1 genes, so I think it is worth exploring some other methods. I've mainly checked the closest protein-coding gene to the locus but there must be better methods (such as FLAMES above).
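For anyone who wants to reproduce a simple closest-gene check, a minimal sketch with bedtools might look like this (the input file names are placeholders you'd build from the lead variants and a GRCh38 gene annotation, and both BED files need to be position-sorted):

Code:
# report the nearest protein-coding gene to each lead variant, with the distance (-d)
bedtools closest -a lead_snps.bed -b protein_coding_genes.bed -d > closest_genes.txt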
 
Something I’ve only really started to appreciate going through this process is quite what is being done and quite how clever and meaningful it is… and how those dismissing it are showing their own ignorance and how their ideas about genetics are decades out of date.

I’m still pretty ignorant, the complex statistics and biology are beyond me, but the arguments against what has been found just seem ludicrous if you spend a bit of time trying to understand what DecodeME has done. And will do.

I hadn’t appreciated quite how layered this is. Even after reading the paper and overviews. But just in what’s been done already we have:

- GWAS - looks at SNPs and their individual significance; this is individual letter changes in the genome, a p-value per individual genetic difference found
- Gene-based analysis - looks at the significance of genes, combining the p-values from all SNPs (all the genetic differences within and near a gene) to generate a p-value for that gene
- Gene-set analysis - looks at the significance of predefined groups of genes (a set, such as those known to be involved in a particular pathway), so generates a p-value for that set

So already there is a sort of movement from letters to words to sentences as we get closer to a meaning. Or thinking about layers and the treasure map analogy, perhaps a geological or mining way of looking at it may fit, searching for precious metals?

First we do an aerial view and see some signs, some formations on the surface which look interesting.
So then we go to that location and we get down and examine and analyse things and we find certain rock types which indicate where to go next.
So we start to dig then once underground we find the different strata and seams which guide us in the right direction until…
Bam! We find what we’re after.

And those who said ‘oh well you didn’t find all the gold on the surface when you did that aerial reconnaissance’, well, they look silly don’t they.
 
For those with strength, courage and coding skills:

I've noticed that the Dutch authors of MAGMA have a new method called FLAMES. It combines multiple approaches to finding the effector gene within a significant GWAS locus using a machine-learning framework.
Looks interesting. I tried to see if I could do anything, but it's too much stuff I don't know how to do, like the part about creating credible set files.



In other news, based on a suggestion by @hotblack, I tried to use the UK Biobank reference panel for FUMA instead of the 1000 Genomes reference as I did before. It looks like that was the main reason my results were somewhat different from the paper's results.

Now the tissue enrichment is almost identical (first chart is DecodeME). In my previous post, the highest -log10 p-value was around 7, and now it's around 8.5 like the study. There are still two pairs of tissues that swapped positions, so it's not exactly the same, but the p-values are all now very close to the study's values.

[attached images: tissue enrichment plots (DecodeME paper first, recreation second)]

Here are the updated top ten gene sets:
[attached image: table of updated top ten gene sets]

Links to descriptions for these:

The first mention of synapse (GOCC_SYNAPTIC_MEMBRANE) moved down to rank 31 (out of 17,006 gene sets).

I also reran the cell-type analysis, testing the same brain region datasets as last time. Even more cell types are significant now!

There's something like a three-step process. First, it shows all the cell-types that showed significant enrichment of the DecodeME genes:
[attached image: significantly enriched cell-types]

Then it removes redundant cell-types from within a dataset if multiple cell-types from one dataset are very similar to each other:
[attached image: after removing within-dataset redundant cell-types]

Then it looks for redundant cell-types between different datasets. I don't really know how to interpret this, but if anyone wants to have a go, the FUMA tutorial describes this analysis:
[attached image: between-dataset comparison of cell-types]

It looks like it now includes neurons from two new areas of the cortex (and one from before is gone), GABAergic neuron from the cerebellum, neuron from white matter, neuron from cerebral nuclei, and many specific cells (mostly subtypes of excitatory neurons, and one subtype of oligodendrocyte) from the primary motor cortex.

But I think the last image is showing that a lot of these cell-types are very correlated to each other, so many non-interesting neurons might just be showing up because they're so similar to a cell-type of interest, not because they all play a part.

Edit: I probably wouldn't put too much stock in the top gene sets. When I plot all the p-values, it looks to be basically the uniform distribution you'd expect if almost all gene sets were not true effects. While there might be some real enriched gene sets in there, there are probably too many false positives to know which they are. Nothing is significant even with the less strict FDR. It makes sense that they didn't report any significant gene sets.
[attached image: distribution of gene-set p-values]
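If you've run the local MAGMA recipe from earlier in the thread, a similar sanity check on its gene-set output can be done from the command line (assuming the results landed in decodeme.gsa.out with a header column named P, which is where MAGMA normally puts them):

Code:
# crude text histogram of gene-set p-values in 0.1-wide bins
awk '/^#/ {next}
     !c {for(i=1;i<=NF;i++) if($i=="P") c=i; next}
     {b=int($c*10); if(b>9) b=9; h[b]++}
     END {for(i=0;i<10;i++) printf "%.1f-%.1f  %d\n", i/10, (i+1)/10, h[i]}' decodeme.gsa.out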
 