> This lecture is interesting and relevant to our discussion: MPG Primer: Linking SNPs with genes in GWAS (2022)

Oh thanks for this! I was just listening to this one from the same institute.
> A lambda of 1.066 looks fine.

I also read that the intercept of an LD score regression can be interpreted in a similar way (as an indicator of inflation). But I don't see this measure reported in the DecodeME preprint: has anyone seen it somewhere?
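The intercept needs a full LDSC run, but lambda itself is easy to recompute from the public summary statistics: it is the median association chi-square divided by 0.4549, the median of a chi-square with 1 d.f. A rough awk sketch, assuming the REGENIE file keeps its standard CHISQ header column; this is not the preprint's own pipeline:

awk 'NR==1 {for (i=1; i<=NF; i++) if ($i == "CHISQ") c = i; next} {print $c}' gwas_1.regenie | sort -g | \
  awk '{v[NR] = $1} END {m = (NR % 2 ? v[(NR+1)/2] : (v[NR/2] + v[NR/2+1]) / 2); printf "lambda GC = %.3f\n", m / 0.4549}'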
> Because the observed p-values go all the way up to 10^-11, I assume this shows the full data before filtering based on quality control took place?

Why do you say that? The top hits on chr20 were around that p-value.
> Like @forestglip and @ME/CFS Science Blog I've spent some time digging into some tools, including MAGMA/FUMA and ANNOVAR, and learnt a bit about the myriad hurdles and technical issues bioinformatics students and researchers must live with. After a week much of my understanding is likely way off, but I have a few questions. Maybe they're best for @Chris Ponting and the authors, but perhaps others with more experience can chime in?
>
> Were the GWAS summary statistics (hg38) or other/raw data used for the MAGMA/FUMA analysis? What reference panel was used? These tools and their default panels seem to depend on different human genome versions/reference assemblies than hg38; was LiftOver used, or re-calling? I'm also unsure about some other settings, like the window size, which could affect things quite a lot.
>
> Knowing more about the methods, data and settings here would help reproducibility and understanding, I think. And I wonder what risks/losses there are with conversions between formats?
>
> This study used an additive model, I think; would it be worth exploring results from recessive/dominant models too?
>
> I have learned new words like exonic, intronic and intergenic, and as far as I can tell most findings (88%) from GWAS studies are in these intronic and intergenic regions. Although these don't directly change the protein, it seems they can modify things a lot by affecting expression, e.g. through the regulation of transcription.
>
> There seems lots of scope for more analysis here: understanding what is in LD with the identified SNPs, and using more tools like annotation to help get an idea of potential mechanisms. Or did the eQTL and risk-loci analyses cover this? I've not started looking at that yet and don't understand the pros/cons of the different methods.
>
> Is this sort of wider analysis planned? Is the hope that others may take it up? Would it be possible to make more of the data available to help others do this?
>
> A bit of a mishmash of questions and thoughts...

Hi. Yes, we needed to LiftOver DNA variants from human genome build 38 (hg38) to the previous version (GRCh37/hg19), which FUMA uses for, e.g., MAGMA analysis. You're correct that some SNPs are lost in the LiftOver, but only a few. More variants are lost because FUMA's reference panel of variants is incomplete. Yes, the GWAS model we used in REGENIE is an additive linear model, but we are also applying non-additive, non-linear models separately (TarGene; we will report when we have results). Yes, as with nearly all GWAS, our lead SNPs are not in protein-coding sequence. And, yes, there is scope for a lot more analyses, both by us (DecodeME) and by any researchers who receive the consented and pseudonymised data via the Data Access Committee. The preprint's title starts with "Initial Results..." and we mean this. Yet more to come.
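On the dominant/recessive question above: for anyone who later obtains individual-level access, REGENIE itself offers non-additive coding through its --test option. If I am reading its documentation correctly, a step 2 run would look roughly like this; all file names are hypothetical and this is a sketch, not DecodeME's pipeline:

regenie --step 2 \
  --bgen genotypes.bgen --sample genotypes.sample \
  --phenoFile pheno.txt --bt \
  --pred step1_pred.list \
  --test recessive \
  --bsize 400 \
  --out gwas_recessive

Note this requires individual-level genotypes; it cannot be run on the public summary statistics.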
> Why do you say that? The top hits on chr20 were around that p-value.

Ah yes, apologies. I think I'm a bit confused because those low p-values shown in the graph are in purple, the 0.01 ≤ MAF < 0.04 category, but I thought that the hits found had MAF > 10%?
> Ah yes, apologies. I think I'm a bit confused because those low p-values shown in the graph are in purple, the 0.01 ≤ MAF < 0.04 category, but I thought that the hits found had MAF > 10%?

Oh good point. If I'm understanding correctly, the furthest-right, uppermost dot is 20:48914387:T:TA (the lead hit in the whole study), but its MAF is around 0.34.
> Hi. Yes, we needed to LiftOver DNA variants from human genome build 38 (hg38) to the previous version (GRCh37/hg19), which FUMA uses for, e.g., MAGMA analysis. [...] The preprint's title starts with "Initial Results..." and we mean this. Yet more to come.

Hi Chris, sincere thanks for all your hard work and advocacy on our behalf. Exciting to hear more analyses of the data are in the works!

I was wondering if there was any sort of timeframe for the publication of the HLA analysis? And I am curious whether the Manhattan plot published with the initial data shows the HLA data points, or whether they were removed from the plot whilst the analysis is ongoing?

> I was wondering if there was any sort of timeframe for the publication of the HLA analysis? And I am curious whether the Manhattan plot published with the initial data shows the HLA data points, or whether they were removed from the plot whilst the analysis is ongoing?

No timeframe for completion of the HLA analysis. The Manhattan plot in the preprint includes all variants, including those in the HLA region. But note that in the HLA analyses we do not test each DNA variant; rather, we test combinations of variants that are commonly co-inherited, i.e. "HLA alleles". So even if there isn't a "signal" in the Manhattan plot, this does not immediately mean that we won't see association to an HLA allele.
> Don't think we have discussed this plot yet:

Tried to recreate the plot using the whole dataset:
> Oh good point. If I'm understanding correctly, the furthest-right, uppermost dot is 20:48914387:T:TA (the lead hit in the whole study), but its MAF is around 0.34.

Looks like a labelling issue in the graph. Here's what I got trying to recreate it. The most significant p-values belong to the SNPs with high MAF values, showing that it was mostly this group that produced the significant hits.
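A quick way to check this against the summary statistics themselves is to tabulate the strongest association per MAF bin. A sketch, assuming the filtered REGENIE file keeps the standard header names A1FREQ and LOG10P:

awk 'NR==1 {for (i=1; i<=NF; i++) {if ($i == "A1FREQ") f = i; if ($i == "LOG10P") p = i}; next}
     {maf = ($f < 0.5 ? $f : 1 - $f);
      bin = (maf < 0.01 ? "MAF<0.01" : maf < 0.04 ? "0.01<=MAF<0.04" : maf < 0.10 ? "0.04<=MAF<0.10" : "MAF>=0.10");
      if ($p > best[bin]) best[bin] = $p}
     END {for (b in best) printf "%s\tmax LOG10P = %.1f\n", b, best[b]}' filtered_regenie_file.txt

If the legend really is mislabelled, the MAF >= 0.10 bin, not the purple 0.01-0.04 one, should carry the largest LOG10P.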
As mentioned above, something @forestglip and I have been working on recently is running MAGMA locally on the DecodeME summary statistics.

awk -v OFS='\t' 'FNR==NR {ids[$1]++; next} NR==1 || ($3 in ids)' gwas_qced.var gwas_1.regenie > filtered_regenie_file.txt
awk -v OFS='\t' 'NR!=1 { print "chr"$1, $2-1, $2, $0 }' filtered_regenie_file.txt > query.bed
sort-bed query.bed > sorted_query.bed
zcat snp151.txt.gz | awk -v OFS='\t' '{if($12=="single") print $2,$3,$4,$5}' > snp151_snp.bed
bedmap --echo --echo-map-id --delim '\t' sorted_query.bed snp151_snp.bed > sorted_query_with_rsids.bed
awk -v OFS='\t' '$21 != "" {print $21,$4,$5}' sorted_query_with_rsids.bed > decodeme.magma_input.id.chr.genpos.txt
awk -v OFS='\t' '$21 != "" {print $21,10^(-$19)}' sorted_query_with_rsids.bed > decodeme.magma_input.id.p.txt
snploc=./decodeme.magma_input.id.chr.genpos.txt
ncbi38=./NCBI38.gene.loc
magma --annotate \
--snp-loc ${snploc} \
--gene-loc ${ncbi38} \
--out decodeme
ref=./g1000_eur/g1000_eur
magma \
--bfile $ref \
--pval ./decodeme.magma_input.id.p.txt N=275488 \
--gene-annot decodeme.genes.annot \
--out decodeme
geneset=msigdb_v2025.1.Hs_GMTs/c5.all.v2025.1.Hs.entrez.gmt
magma \
--gene-results decodeme.genes.raw \
--set-annot ${geneset} \
--out decodeme
> As mentioned above, something @forestglip and I have been working on recently is running MAGMA locally on the DecodeME summary statistics.

Thanks for delving into this!
I wouldn't necessarily recommend anyone do this or follow these instructions. There may be things wrong with this method, and it seems to end up with only 94% of variants mapped, while @forestglip gets over 99% using the LiftOver and FUMA method mentioned earlier.
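For anyone who wants to try the manual LiftOver route rather than going through FUMA, the UCSC command-line tool takes the BED file from the earlier step. A sketch, assuming the usual chain-file location on the UCSC mirror (output names hypothetical):

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
liftOver sorted_query.bed hg38ToHg19.over.chain.gz query_hg19.bed unmapped.txt
# variants that failed to lift end up in unmapped.txt, preceded by '#' reason lines
grep -vc '^#' unmapped.txt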
> Thanks for delving into this!

Unfortunately a lot of bioinformatics pipelines are pretty hodgepodge, and you need to be a jack of all trades to get something new up and running. It's true that all of coding is 90% troubleshooting, but in bioinformatics especially, where almost none of the packages/tools were designed by people explicitly trained in software engineering, it can be hell in a handbasket, with entire weeks wasted. I started out in research with only a little bit of self-taught R experience and immediately got thrown headfirst into trying to understand what awk was, with no help except Stack Exchange. I'm glad I had already developed the ability to parse through complete gibberish and not get overwhelmed, otherwise it would have discouraged me from science altogether.
It looks so complicated. I expected that there would be nice packages in R and Python that wrap this into a handy user interface and provide the correct reference data (for LD, for example) automatically. Quite disappointing that the workflow mostly consists of command-line instructions and extracting the right data and tools from the web.
Yesterday I tried doing the LDSC and stumbled into one problem after another (it only runs on Python 2, you need to convert GRCh38 positions to rsIDs, download the correct reference data online, etc.). I ran out of energy but hope to try again later.
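For next time, in case it saves someone the same stumbles: once the Python 2 environment and the European LD-score reference files are in place, the run itself is only two commands. A sketch following the ldsc README (github.com/bulik/ldsc); the rsID-annotated input file here is hypothetical, and munge_sumstats.py expects SNP/A1/A2/P columns plus a signed effect column:

conda env create --file environment.yml && conda activate ldsc
python munge_sumstats.py --sumstats decodeme_with_rsids.txt --N 275488 --out decodeme
python ldsc.py --h2 decodeme.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ \
  --out decodeme_h2
# decodeme_h2.log reports lambda GC, mean chi^2, and the LDSC intercept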
> It looks so complicated. I expected that there would be nice packages in R and Python that wrap this into a handy user interface and provide the correct reference data (for LD, for example) automatically.

Note that the script hotblack posted includes a lot of extras for convenience, like remembering settings for future runs. These are the only required steps we used:
# First store all variants from gwas_qced.var in ids array.
# Then loop over every row in gwas_1.regenie and save to filtered_regenie_file.txt if it's the first row (header) or if the third column (the ID) is in the ids array.
awk 'FNR==NR {ids[$1]++; next} NR==1 || ($3 in ids)' gwas_qced.var gwas_1.regenie > filtered_regenie_file.txt
# For all but first row, print the chromosome, position-1, position, and entire original line to query.bed
# Need to subtract one from the start because BED intervals are zero-based, half-open
awk 'NR!=1 { print "chr"$1, $2-1, $2, $0 }' filtered_regenie_file.txt > query.bed
# Sort bed file
sort-bed query.bed > sorted_query.bed
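# Note (not one of the original steps): snp151.txt.gz below is UCSC's dbSNP 151 table dump
# for hg38. Assuming the usual mirror layout, it can be fetched (it is a multi-GB file) with:
# wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp151.txt.gz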
# Filter rsid reference file to only include SNVs.
# Pipe decompressed file to awk
# Set output separator to tab
# If the twelfth column is "single" (an SNV), output chrom, start, end, rsid (BED format)
zcat snp151.txt.gz | awk -v OFS='\t' '{if($12=="single") print $2,$3,$4,$5}' > snp151_snp.bed
# Add matching rs ids to the positions in the summary stats bed file
bedmap --echo --echo-map-id --delim '\t' sorted_query.bed snp151_snp.bed > sorted_query_with_rsids.bed
# Create SNP position file from summary statistics for MAGMA
# The format is 'rsid chromosome position'
awk '$21 != "" {print $21,$4,$5}' sorted_query_with_rsids.bed > decodeme.magma_input.id.chr.genpos.txt
# Create SNP p value file from summary statistics for MAGMA
# The format is 'rsid p-value'
awk '$21 != "" {print $21,10^(-$19)}' sorted_query_with_rsids.bed > decodeme.magma_input.id.p.txt
# Annotate genes with SNPs
snploc=./decodeme.magma_input.id.chr.genpos.txt
ncbi38=./NCBI38.gene.loc
magma --annotate \
--snp-loc ${snploc} \
--gene-loc ${ncbi38} \
--out decodeme
# Do gene-based analysis to create score for each gene
ref=./g1000_eur/g1000_eur
magma \
--bfile $ref \
--pval ./decodeme.magma_input.id.p.txt N=275488 \
--gene-annot decodeme.genes.annot \
--out decodeme
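# Optional peek at the strongest gene-level results (not required for the next step);
# this assumes the p-value is the 9th column of MAGMA's .genes.out, after the header
( head -1 decodeme.genes.out; awk 'NR>1' decodeme.genes.out | sort -g -k9,9 | head )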
# Run gene-set analysis
geneset=msigdb_v2025.1.Hs_GMTs/c5.all.v2025.1.Hs.entrez.gmt
magma \
--gene-results decodeme.genes.raw \
--set-annot ${geneset} \
--out decodeme
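To see which sets come out on top: MAGMA writes the competitive gene-set results to decodeme.gsa.out. If I have the format right, its comment lines start with '#', a header line follows, and the set p-value is the seventh column, so something like this lists the ten most significant sets:

grep -v '^#' decodeme.gsa.out | tail -n +2 | sort -g -k7,7 | head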