Preprint Initial findings from the DecodeME genome-wide association study of myalgic encephalomyelitis/chronic fatigue syndrome, 2025, DecodeME Collaboration

I’ve been thinking more about this and it would be really useful if we could get hold of that UKB reference data to use with standalone MAGMA and see if it helps align our results. I wonder if it’s worth asking them?
Just a guess, but I think the UK Biobank data might not be open access like 1000 Genomes is. Maybe worth checking though.
 
Hi Woolie, great to see you again - though I may have missed you previously.

That's true - here's one at the top of the tree on Google

At the same time, participation in research studies is always strongly skewed towards higher SES, among other things. Yet for 15 years, GWAS have been finding biological differences linked to disease that either explain symptoms or chime with existing known mechanisms. The question is: are the differences that could be attributed to ME/CFS strong enough to explain anything like 8 loci at p < 5 × 10^-8?

I don't know if the team looked at things like SES, but it should be easy to do. I wonder if anyone here could look at e.g. known SES loci/genes vs ME/CFS?

(There seems to be no limit to what some people here can do, it is so impressive. I can't help wondering if any other part of the ME/CFS research community is making as much use of the data, though Chris did say in the webinar that there have been 42 downloads of the summary stats.)


DecodeME restricted the initial analysis to those with European ancestry, and within that spent a lot of effort controlling for population ancestry, which can create such effects (immune ones are the strongest, especially on HLA genes, as particular common variants have proved protective or risky in historical pandemics). It helps that UK Biobank is so big, making it much easier to find controls well matched by ancestry. So such an effect is unlikely. I think @ME/CFS Science Blog also posted a graph showing minimal such effects for more common variants (though larger for less common ones), and noted that the more common variants were the more prominent ones in the findings.
Thanks for the reply, @Simon M!

I do see that the proof of the pudding will be in the eating, as it were. As an exercise that can generate new research questions, this study is amazing!

I also get your point that we could identify SNPs that could be linked to certain possible confounding variables, and look at those. Nice idea.

I just think we should take care when making the "biological proof" argument based just on these data, because it's a bit fragile.
 
Just a guess, but I think the UK Biobank data might not be open access like 1000 Genomes is. Maybe worth checking though.
Yes, it looks like that’s probably the case. The eligibility criteria seem to make that clear, and I can see them being able to use the data in a publicly accessible tool like FUMA but not being able to share it. A shame, as it limits our ability to remove variables that may be causing differences in results.
 
Some more thoughts on the local MAGMA results above. The difference in the gene-based analysis is worth looking at too. It would be useful to understand why (discussing with @forestglip, we wondered if it was the difference in SNP conversion methods and assemblies used, or perhaps the reference panels, but we don’t really know) and whether this tells us anything.

The following were in the official DecodeME results but not the local MAGMA results
  • DNAH10OS (couldn’t find anywhere in my results, maybe lost in translation?)

And these were in the local MAGMA results (in groups mentioned) but not the official DecodeME ones (apart from in table S3 where mentioned)
  • ATG12 (gwas1_male only)
  • CACNA1E (both gwas1 and gwas1_female) (also in table S3)
  • DCC (gwas1_female only) (also in table S3)
  • PLCL1 (gwas_1 and gwas_2) (also in table S3)
  • ZPBP (gwas1_female only)

While these were in both

Do the similarities mean we can be more certain of those results? Should we discount the differences in the local MAGMA analysis (the fact that half were also in supplemental table S3 from the paper is perhaps reassuring)? Or are both results telling us something? What do the subgroup findings tell us?

I’d like to better understand if/what is wrong in my method, and accounting for differences seems a good place to start. It may be that we can recreate things with FUMA and get the full output to compare results more closely (maybe differences are subtle, with some things falling just inside or outside the p-value thresholds). If anyone who knows these tools has any thoughts, please do chip in. Now it’s all set up, it should at least be trivial to change variables and rerun.
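In the meantime, a quick way to compare the two gene lists directly might be something like the below (assuming each run's significant genes are exported one symbol per line; the file names are just placeholders):

Bash:
# Lines unique to each list: column 1 = only in the official results,
# column 2 = only in the local MAGMA results (-3 suppresses genes common to both).
comm -3 <(sort official_sig_genes.txt) <(sort local_magma_sig_genes.txt)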

Edit: noticed a mistake, HIST1H4H is an alias for H4C8 so it is in both sets!
 
Some more thoughts on the local MAGMA results above. The difference in the gene-based analysis is worth looking at too. It would be useful to understand why (discussing with @forestglip, we wondered if it was the difference in SNP conversion methods and assemblies used, or perhaps the reference panels, but we don’t really know) and whether this tells us anything.
I see a few reasons:

- Different reference panel (1000G vs UK Biobank).
- Liftover to another assembly only loses 0.3% of variants, while our position-to-rsid mapping method lost 6% (though it's unclear how many more are lost later in both methods when MAGMA refers to the LD reference panels).
- I don't think we mapped rsids entirely correctly. Most are probably right, but different rsids can refer to the same position with different alleles, so if we only map by position we might pick up an rsid that refers to a different allele change at the same position.
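One quick way to get a feel for how often that happens would be to query a slice of the dbSNP reference VCF mentioned below and count positions that carry more than one record. A sketch only (not run), assuming tabix is installed and using the RefSeq name for chromosome 1 on GRCh38:

Bash:
# Positions in a 1 Mb window of chr1 (NC_000001.11 in GRCh38) that appear in
# more than one dbSNP record, i.e. where a position-only lookup is ambiguous.
tabix GCF_000001405.40.gz NC_000001.11:1000000-2000000 | cut -f2 | sort | uniq -d | wc -l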

I'm looking into a tool called bcftools annotate for adding rsids to the data. Some links about it being used for this: 1, 2, 3, 4, 5.

I think the main steps are:
1. Download the huge (28 GB) dbSNP reference file from this link (I think GCF_000001405.40 is a synonym for GRCh38, while the other is for 37) and the index for it (the tbi file).
2. Convert the summary stats file to VCF format.
3. I think use the bcftools sort command to sort the summary stats.
4. Bgzip-compress and index the summary stats.
5. Run bcftools annotate using the syntax from the links above.

I haven't tried it yet, that's just my understanding so far.
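Roughly, in shell form (untested; the file names are placeholders, and the regenie column order and allele orientation are assumptions worth checking):

Bash:
# 1. dbSNP reference VCF for GRCh38 plus its .tbi index (the ~28 GB
#    GCF_000001405.40 file mentioned above), downloaded from NCBI.

# 2. Build a minimal sites-only VCF from the QC-filtered summary stats
#    (assumes columns CHROM GENPOS ID ALLELE0 ALLELE1 ..., with ALLELE0 as REF
#    and ALLELE1 as ALT - worth double-checking against the regenie docs).
( echo '##fileformat=VCFv4.2'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  zcat filtered_regenie_file.txt.gz | awk 'NR > 1 {print $1"\t"$2"\t.\t"$4"\t"$5"\t.\t.\t."}'
) > sumstats.vcf

# 3. and 4. Sort, bgzip-compress and index
bcftools sort sumstats.vcf -Oz -o sumstats.sorted.vcf.gz
bcftools index -t sumstats.sorted.vcf.gz

# 5. Copy rsids from dbSNP into the ID column; bcftools matches alleles as well
#    as positions, which should avoid the position-only ambiguity above.
#    Caveat: the dbSNP file uses RefSeq contig names (NC_000001.11 etc.), so
#    chromosome names will probably need renaming on one side first, e.g. with
#    bcftools annotate --rename-chrs and a two-column mapping file.
bcftools annotate -a GCF_000001405.40.gz -c ID -Oz -o sumstats.rsid.vcf.gz sumstats.sorted.vcf.gz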

I'm not really that worried about replicating MAGMA locally. (If you want to do that, I think you'd be better off doing liftover instead of mapping to rsids. I tried the pyliftover library mentioned previously and it wasn't too hard to use, though I'm not sure whether using a position ID instead of an rsid will work with MAGMA and the 1000G reference panel without some changes. Maybe that's not possible, or maybe just using the better rsid mapping technique would be enough to make it match.) I mainly want to add rsids so I can upload to the other post-GWAS analysis tools I mentioned previously.

If you're interested in pyliftover, here's a script I had Claude write. The input file is the same format as the original summary stats file, just filtered for qced variants.
Run this to create the filtered file:
Bash:
# Keep only variants whose ID (3rd column) appears in gwas_qced.var,
# plus the header line of the regenie file (FNR==1 on the second input).
awk 'FNR==NR {ids[$1]++; next} FNR==1 || ($3 in ids)' gwas_qced.var <(zcat gwas_1.regenie.gz) | gzip > filtered_regenie_file.txt.gz

Then run the following script to liftover:
Python:
import pandas as pd
from pyliftover import LiftOver

def liftover_gwas_sumstats_chunked(input_file, output_file, chunk_size=50000):
    """
    Convert GWAS summary statistics from GRCh38 to GRCh37/hg19 using chunked processing
 
    Args:
        input_file: Path to input summary statistics file
        output_file: Path to output lifted-over file
        chunk_size: Number of rows to process at once (default: 50000)
    """
 
    # Initialize liftover (GRCh38/hg38 to GRCh37/hg19)
    print("Initializing liftover...")
    lo = LiftOver('hg38', 'hg19')
 
    # Process file in chunks
    first_chunk = True
    processed_count = 0
    successful_conversions = 0
    failed_conversions = 0
 
    print(f"Processing in chunks of {chunk_size}...")
 
    # Read and process file in chunks
    chunk_reader = pd.read_csv(input_file, sep=' ', chunksize=chunk_size)
 
    for chunk_num, chunk in enumerate(chunk_reader):
        print(f"Processing chunk {chunk_num + 1} (rows {processed_count + 1} to {processed_count + len(chunk)})...")
 
        # Process this chunk
        processed_chunk = process_chunk(chunk, lo)
 
        # Count successful conversions in this chunk
        chunk_success = len(processed_chunk)
        chunk_failed = len(chunk) - chunk_success
 
        successful_conversions += chunk_success
        failed_conversions += chunk_failed
 
        print(f"  Chunk {chunk_num + 1}: {chunk_success} successful, {chunk_failed} failed")
 
        # Write to output file
        if first_chunk:
            # Write with header for first chunk
            processed_chunk.to_csv(output_file, sep='\t', index=False, mode='w')
            first_chunk = False
        else:
            # Append without header for subsequent chunks
            processed_chunk.to_csv(output_file, sep='\t', index=False, mode='a', header=False)
 
        processed_count += len(chunk)
 
        print(f"  Progress: {processed_count} variants processed")
 
    print(f"\nConversion summary:")
    print(f"Total variants processed: {processed_count}")
    print(f"Successfully converted: {successful_conversions}")
    print(f"Failed conversions: {failed_conversions}")
    print(f"Success rate: {successful_conversions/processed_count*100:.2f}%")
    print(f"Output saved to: {output_file}")
 
    return successful_conversions, failed_conversions

def process_chunk(chunk, liftover_obj):
    """
    Process a single chunk of data
 
    Args:
        chunk: DataFrame chunk to process
        liftover_obj: LiftOver object
 
    Returns:
        DataFrame with successfully converted variants
    """
 
    # Create lists to store results
    new_chromosomes = []
    new_positions = []
    conversion_success = []
 
    # Process each variant in the chunk
    for idx, row in chunk.iterrows():
        chrom = str(row['CHROM'])
        pos = int(row['GENPOS'])
 
        # Add 'chr' prefix if not present
        if not chrom.startswith('chr'):
            chrom = 'chr' + chrom
 
        # Convert coordinate (subtract 1 because pyliftover is 0-based)
        result = liftover_obj.convert_coordinate(chrom, pos - 1)
 
        if result and len(result) > 0:
            # Successful conversion (add 1 back to convert to 1-based)
            new_chrom = result[0][0].replace('chr', '')  # Remove 'chr' prefix
            new_pos = result[0][1] + 1
            new_chromosomes.append(new_chrom)
            new_positions.append(new_pos)
            conversion_success.append(True)
        else:
            # Failed conversion
            new_chromosomes.append(None)
            new_positions.append(None)
            conversion_success.append(False)
 
    # Add new coordinate columns
    chunk['CHROM_hg19'] = new_chromosomes
    chunk['GENPOS_hg19'] = new_positions
    chunk['liftover_success'] = conversion_success
 
    # Filter to only successfully converted variants
    chunk_success = chunk[chunk['liftover_success'] == True].copy()
 
    if len(chunk_success) > 0:
        # Update the CHROM and GENPOS columns to the new coordinates
        chunk_success['CHROM'] = chunk_success['CHROM_hg19']
        chunk_success['GENPOS'] = chunk_success['GENPOS_hg19'].astype(int)
 
        # Update the ID column to reflect new coordinates
        chunk_success['ID'] = (chunk_success['CHROM'].astype(str) + ':' +
                              chunk_success['GENPOS'].astype(str) + ':' +
                              chunk_success['ALLELE0'].astype(str) + ':' +
                              chunk_success['ALLELE1'].astype(str))
 
        # Drop helper columns
        chunk_success = chunk_success.drop(['CHROM_hg19', 'GENPOS_hg19', 'liftover_success'], axis=1)
 
    return chunk_success

if __name__ == "__main__":
    input_file = "filtered_regenie_file.txt.gz"
    output_file = "decodeme_grch37.tsv"
 
    success, failed = liftover_gwas_sumstats_chunked(input_file, output_file)
 
It might not help with the practical problems (hard to know because I don't understand any of it!), but there is a catalogue of GWAS studies at the European Bioinformatics Institute.

Thought I'd link it in case it hadn't already come up.
There's also this gene atlas, which provides lots of data from previous GWAS, including Manhattan and QQ plots for comparison. Unfortunately, it looks like they stopped updating it in 2019.
 
It also makes it possible to check whether genes have been found in previous GWAS. For example, it helped me find that OLFM4 was a hit in this GWAS on insomnia:
Apparently CA10 was associated with morningness, in the same study.

I suspect this would be useful: to systematically search whether the DecodeME genes show up in other GWAS.
 
Also noted that Posthuma's group received a large grant (BRAINSCAPES) of around 20 million euros to work out the biological mechanisms behind genetic findings in brain disorders.

It might be useful for DecodeME to make a connection with this group to see if ME/CFS could be included in their work.
The aim of BRAINSCAPES is to map in detail the biological mechanisms underlying multiple brain disorders ('brainscaping').
Recent genetic discovery studies have provided unprecedented insight into the genes involved in brain disorders. The next step is to use this knowledge for gaining mechanistic disease insight. In BRAINSCAPES we will develop novel analytic and experimental tools to study the functional consequences of risk genes on the function of specific cells, their circuits and functional output. We aim to provide insight into the molecular and cellular basis of complex brain disorders that can be used to design novel treatments.

 
Apparently CA10 was associated with morningness, in the same study.

I suspect this would be useful: to systematically search whether the DecodeME genes show up in other GWAS.
It doesn’t look like there’s a good method of doing that in the Atlas. Maybe a Google Scholar search would be a start on coverage, something like OLFM4 "genome wide" OR gwas, or here’s a search for CA10 "genome wide" OR gwas?
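A slightly more systematic option might be to download the "all associations" TSV from the NHGRI-EBI GWAS Catalog and filter its mapped-gene column for the DecodeME genes. A sketch only, not tried; the file name and column headers below are from memory, so worth checking against the actual download:

Bash:
# List Catalog associations whose mapped gene matches any gene of interest
# (example genes taken from this thread; extend the list as needed).
GENES='OLFM4|CA10|CACNA1E|PLCL1'

awk -F'\t' -v genes="$GENES" '
NR == 1 {
    # find the relevant columns by header name (names assumed from memory)
    for (i = 1; i <= NF; i++) {
        if ($i == "MAPPED_GENE")   g = i
        if ($i == "DISEASE/TRAIT") t = i
        if ($i == "P-VALUE")       p = i
        if ($i == "PUBMEDID")      id = i
    }
    next
}
$g ~ ("(^|[^A-Za-z0-9])(" genes ")($|[^A-Za-z0-9])") {
    print $g "\t" $t "\t" $p "\t" $id
}' gwas_catalog_associations.tsv

The reported-genes column could be checked the same way, and of course this only covers studies that made it into the Catalog.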

And I remember @forestglip was interested in this a while back and working on https://sickgenes.xyz/
 