Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Jonathan Edwards · Apr 18, 2025

Sasha said:
If so, does this Zhang study allow us to do that, or do we need more replication?

Depends on the buns and lemonade a bit.

Hutan · Apr 18, 2025

Jonathan Edwards said:
This Zhang study may be the first to hit the public domain where we can truly say there must be a biological causation because there is an identifiable genetic component.

I'm a bit surprised that you are so positive.

Their model built on associations of rare gene variants from two sources seemed to perform fairly well when tested with another very small source of ME/CFS genetic data (a testing cohort of 36 cases and 21 controls). But, the 115 genetic variations that they identified as differentiating did not replicate when tested with the UK Biobank data. See my post above.

Of course, there could be problems with the accuracy of ME/CFS diagnoses in the UK Biobank. But, equally, there may be issues with the selection of the fairly small number of samples used for the Zhang analysis. Maybe a slew of irrelevant gene variants swamped the signal from some relevant ones? The presentation of the data in the UK Biobank comparison didn't get down to the granularity of individual variants.

It's definitely interesting, but I don't think it is definitive. Very happy to be convinced otherwise, of course.

Jonathan Edwards · Apr 18, 2025

Hutan said:
It's definitely interesting, but I don't think it is definitive. Very happy to be convinced otherwise, of course.

OK, I am not in a position to judge. @jnmaciuch seems to think the associations must be real. More opinions welcome.

jnmaciuch · Apr 18, 2025

Jonathan Edwards said:
OK, I am not in a position to judge. @jnmaciuch seems to think the associations must be real. More opinions welcome.

Like @Hutan I don’t think it’s definitive.

But given the rarity of their variants and the small sample size, the independent cohort validation actually carries a lot of weight for me.

If it was all (or even partly) a fluke, you’d expect that test cohort AUC to be barely above 0.5. The validation shows that despite looking at very rare variants in a small group of people, the same pattern of rare variants was replicable in another small group.
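To make the "barely above 0.5" point concrete, here is a minimal sketch, not the authors' code, of what a model with no real signal would score on a test cohort of 36 cases and 21 controls. The scores are simulated at random, and `auc` and `null_auc_distribution` are hypothetical helper names.

```python
import random


def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control (ties count half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def null_auc_distribution(n_cases=36, n_controls=21, n_sims=2000, seed=0):
    """Sorted AUCs obtained when scores carry no information about case status."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_sims):
        cases = [rng.random() for _ in range(n_cases)]
        controls = [rng.random() for _ in range(n_controls)]
        aucs.append(auc(cases, controls))
    return sorted(aucs)


null_aucs = null_auc_distribution()
mean_auc = sum(null_aucs) / len(null_aucs)
upper_95 = null_aucs[int(0.95 * len(null_aucs))]
print(f"mean null AUC: {mean_auc:.3f}, 95th percentile: {upper_95:.3f}")
```

With a cohort this small, chance AUCs scatter noticeably around 0.5, so what matters is how far the reported test AUC sits above that scatter.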

But I’d wait to see the overlap with DecodeME before getting too excited. My concern is not that the results are unreliable, but rather that they’re only a small part of the story.

[Edit: I honestly don’t know what to make of the lack of replicability in UK BioBank, but like [edit: others] have already alluded to, the strength of replicability in an association such as this relies on whether the endpoint you’re associating with is actually similar between the two studies. It’s more likely for things to get drowned out in BioBank without stringent diagnostic criteria].

Jonathan Edwards · Apr 18, 2025

jnmaciuch said:
But given the rarity of their variants and the small sample size, the independent cohort validation actually carries a lot of weight for me.

But if I get Hutan's argument right, by the same token there should have been replication in a cohort of maybe 2000 in the UK Biobank?

Reliability is the issue here. There are bound to be caveats. The UK Biobank might have been a bad sample, as might the others for various reasons. But if there is a discrepancy that doesn't seem to make statistical sense, that is a worry. Or is it that they didn't actually test for replication for ME/CFS? As Hutan implies, that would be a bit odd.

Hutan · Apr 18, 2025

jnmaciuch said:
Like @Hutan I don’t think it’s definitive.

But given the rarity of their variants and the small sample size, the independent cohort validation actually carries a lot of weight for me.

If it was all (or even partly) a fluke, you’d expect that test cohort AUC to be barely above 0.5. The validation shows that despite looking at very rare variants in a small group of people, the same pattern of rare variants was replicable in another small group.

But I’d wait to see the overlap with DecodeME before getting too excited. My concern is not that the results are unreliable, but rather that they’re only a small part of the story.

[Edit: I honestly don’t know what to make of the lack of replicability in UK BioBank, but like you’ve already alluded to, the strength of replicability in an association such as this relies on whether the endpoint you’re associating with is actually similar between the two studies. It’s more likely for things to get drowned out in BioBank without stringent diagnostic criteria].

I agree completely with this.

I got excited when I read about independent cohort validation. But was disappointed with the UK Biobank failure to replicate - and I was surprised that the authors did not even mention the failure to replicate in the Results. It seemed to be swept under the carpet. The second strongest association with the Zhang 115 gene variants according to the chart was a set of people labelled 'Covid-19 controls'! I don't know if that set includes or excludes people reporting Long Covid.

I haven't even finished reading the results yet, let alone the Discussion, so perhaps they do talk about it later. But, this illustrates my point that there is probably too much covered in this paper. We need more detail to properly evaluate the findings. The UK Biobank comparison is really a paper in itself, a replication, it should not just be a paragraph and a generalised chart.

jnmaciuch · Apr 18, 2025

Jonathan Edwards said:
But if I get Hutan's argument right, by the same token there should have been replication in a cohort of maybe 2000 in the UK Biobank?

Reliability is the issue here. There are bound to be caveats. The UK Biobank might have been a bad sample, as might the others for various reasons. But if there is a discrepancy that doesn't seem to make statistical sense, that is a worry. Or is it that they didn't actually test for replication for ME/CFS? As Hutan implies, that would be a bit odd.

I may be misinterpreting the methods (they're a bit vague on details), but I suspect it would be because they are actually comparing p-values between their own cohort and BioBank. How loosely BioBank defined ME/CFS would make a profound difference to how strong the association is with any given gene.

What (I think) they're doing in the UK BioBank analysis is simply checking whether the highly associated genes in their data set are similarly highly associated (above background) [edit: in published studies for] other conditions. So if the UK BioBank GWAS are providing genes that are more broadly associated with general fatigue due to insufficiently stringent diagnostic criteria, the top genes in Zhang et al. may or may not come up among the top genes in BioBank.

However, the validation with the Cornell cohort in this paper is not a comparison of p-values after-the-fact, it's actually looking at the participant-gene-level data for the new group and seeing if the same combination of genes is similarly predictive of outcome.

[Edit: @Hutan, to your point, I think Zhang et al. would be substantially limited by whatever labels were already applied by UK BioBank studies, including that vague 'Covid-19 controls' label. If it was that vague, though, I would've just left it out of the paper.]

Hutan · Apr 18, 2025

jnmaciuch said:
[Edit: @Hutan, to your point, I think Zhang et al. would be substantially limited by whatever labels were already applied by UK BioBank, including that vague 'Covid-19 controls' label]

I don't have a problem with labels.

The Figure 5 caption gives this explanation of some of the covid related labels

COVID19 A2, B2, and C2 indicates severe covid vs. population, hospitalized covid vs. population, and covid vs. population [42], respectively; long COVID19_1, long COVID19_2, long COVID19_3, and long COVID19_4 indicate strict case vs. broad control, broad case vs. broad control, strict case vs. strict control, and broad case vs. strict control [38], respectively.

I think they could have explained better what the 'Covid19:_C2_v2_England_controls' means though. C2 just means the group of people who got Covid regardless of whether they got it severely, were hospitalised or had a mild infection; perhaps controls means people who had not got Covid, by some particular date? Or perhaps it means people who got Covid but who didn't get it severely? Either way, it's not clear if people with a genetic tendency to get Long Covid were included in the group or excluded.

Notably, we found that ME/CFS was genetically correlated with various complex diseases and traits in rare variants (P < 0.05, one-sided Wilcoxon rank-sum test; Fig. 5A), including depression, irritable bowel syndrome (IBS), and COVID-19 susceptibility (C2).

That's a very opaque and confusing sentence. I don't think they found that 'ME/CFS' was genetically correlated with anything there - they were testing whether their set of rare genetic variants was genetically correlated with anything. And they seem to be suggesting that their set of 115 variants was found to be correlated with the genetics of people with susceptibility to Covid-19. But, .... their chart seems to indicate that the significant correlation was with a control group.

It's messy.

forestglip · Apr 18, 2025

Hutan said:
But was disappointed with the UK Biobank failure to replicate

I think this paper is using their fancy non-linear HEAL2 algorithm to predict disease risk in the first two cohorts, while they had to rely on the traditional statistical tests for the Biobank, so I don't think it should be too concerning that it didn't replicate.

Apart from that, I'd consider depression being the top disease hit a semi-replication. There's a good chance many people in the ME/CFS cohort of this study and the depression cohort of the UK Biobank have a similar condition. For 15 years I was diagnosed with depression before getting an ME/CFS diagnosis, so that's how I would have signed up, and I'm guessing there are many similar cases of people who think they have depression but actually have ME/CFS. And even if the cohorts are perfectly diagnosed, depression is probably one of the conditions I'd rank in the top three if asked for conditions similar to ME/CFS, so seeing it be the top hit from the same genes as ME/CFS is interesting.

jnmaciuch · Apr 18, 2025

Hutan said:
And they seem to be suggesting that their set of 115 variants was found to be correlated with the genetics of people with susceptibility to Covid-19.

I think they're specifically saying that their highly associated genes have a greater-than-expected-by-chance overlap with the set of genes that another GWA study had already found to be associated with XYZ condition. As in, I don't think they actually did any direct analysis of the UK BioBank data, which is an important distinction.
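A "greater-than-expected-by-chance overlap" between two gene lists is usually tested with a hypergeometric tail, which is the same thing as a one-sided Fisher's exact test on the 2x2 table. Here is a minimal sketch of that idea, not the authors' code. The background size, the disease gene set size and the overlap count are made-up numbers for illustration, and `overlap_p_value` is a hypothetical helper name.

```python
from math import comb


def overlap_p_value(n_background, n_set_a, n_set_b, n_overlap):
    """P(overlap >= n_overlap) if set A were drawn at random from the background.

    Equivalent to a one-sided Fisher's exact test for enrichment.
    """
    total = comb(n_background, n_set_a)
    upper = min(n_set_a, n_set_b)
    tail = sum(
        comb(n_set_b, k) * comb(n_background - n_set_b, n_set_a - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 115 genes of interest,
# a 300-gene disease set, and 6 genes shared between the two lists.
expected_overlap = 115 * 300 / 20000
p_enrich = overlap_p_value(20000, 115, 300, 6)
print(f"expected overlap by chance: {expected_overlap:.2f}, P(overlap >= 6) = {p_enrich:.3g}")
```

The point of the sketch is only that this kind of test works on published gene lists; it never touches participant-level data.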

Results said:
First, leveraging rare variant association studies [37] for 4,529 diseases and traits in the UK Biobank (UKBB), we assessed the distribution of SKAT-O P-values of our ME/CFS genes per disease or trait (Methods), defining a genetic correlation if this distribution was significantly shifted from the background.

Results said:
Next, we explored the common-variant-based genetic correlations based on the genome-wide association studies (GWASs) on 61 complex diseases and traits using a similar procedure (Methods).

Methods said:
Genetic correlation analysis
For rare variant association study and GWAS data, we compared the P-values of ME/CFS genes with those of all background genes using a one-sided Wilcoxon rank-sum test. The Bonferroni procedure was adopted for P-value adjusting when available; otherwise the raw P-values were reported. For Mendelian disorder gene sets, we used a one-sided Fisher’s exact test to evaluate the enrichment of ME/CFS genes within each disease gene set, followed by the Bonferroni correction.

I say "I think" here because I am truly guessing. I'm having a hard time figuring out exactly what they meant.

I agree it's extremely messy, though I think part of that is because they could only use the same terms that were defined in other papers. They should have done a much better job clarifying whatever they referenced, though.
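For what it's worth, this is how I read the quoted Methods paragraph, as a minimal sketch rather than the authors' code: take the published per-gene P-values for one UKBB trait, and ask whether the P-values of the ME/CFS genes tend to be smaller than those of the background genes. The P-values below are simulated, `rank_sum_less` is a hypothetical helper, and it uses the plain normal approximation with no tie or continuity correction.

```python
import random
from statistics import NormalDist


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_less(gene_set_p, background_p):
    """One-sided rank-sum P-value that gene_set_p tends to be smaller."""
    n1, n2 = len(gene_set_p), len(background_p)
    ranks = average_ranks(list(gene_set_p) + list(background_p))
    w = sum(ranks[:n1])  # rank sum of the gene set
    mean_w = n1 * (n1 + n2 + 1) / 2
    var_w = n1 * n2 * (n1 + n2 + 1) / 12
    z = (w - mean_w) / var_w ** 0.5
    return NormalDist().cdf(z)


rng = random.Random(1)
background = [rng.random() for _ in range(5000)]     # simulated null P-values
gene_set = [rng.random() ** 2 for _ in range(115)]   # simulated, skewed towards small P-values
p_shift = rank_sum_less(gene_set, background)
print(f"one-sided rank-sum P = {p_shift:.2g}")
```

Note that the inputs are summary P-values per gene for some other trait, so a result here depends entirely on how that trait was defined in the source study.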

[Edit: cross posted the same thought with @forestglip]

Hutan · Apr 18, 2025

jnmaciuch said:
I think they're specifically saying that their highly associated genes have a greater-than-expected-by-chance overlap with the set of genes that another GWA study had already found to be associated with XYZ condition. As in, I don't think they actually did any direct analysis of the UK BioBank data, which is an important distinction.

As far as I can tell, there were separate studies. One compared the prevalence of their identified rare variants against identified rare variant data recorded for UK Biobank groups as per my post #44 above and Figure 5.

First, leveraging rare variant association studies [37] for 4,529 diseases and traits in the UK Biobank (UKBB), we assessed the distribution of SKAT-O P-values of our ME/CFS genes per disease or trait (Methods), defining a genetic correlation if this distribution was significantly shifted from the background. Notably, we found that ME/CFS was genetically correlated with various complex diseases and traits in rare variants (P < 0.05, one-sided Wilcoxon rank-sum test; Fig. 5A), including depression, irritable bowel syndrome (IBS), and COVID-19 susceptibility (C2). Similar results were obtained based on the burden tests [37] (Fig. 5B). Our results provide a rare-variant-based genetic linkage between ME/CFS and depression.

I guess it's possible that there was no rare variant association data for ME/CFS in the UK Biobank, although I would be surprised, when there appears to be rare variant association data for having had one body part x-rayed. I'm assuming each UK Biobank participant has had their genetics investigated with rare variants noted as well as being given disease and trait labels. And so the UK Biobank database can pull out the significant rare variants for all of the disease and trait labels there are. If there was no UK Biobank ME/CFS rare variant data, it would have been helpful if they had noted that.
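On the "P < 0.05" in that quoted passage: the rare-variant result is reported without the word "adjusted", so it may be a raw P-value. If so, a rough bit of arithmetic puts it in context. This is only a sketch under the assumption that all 4,529 traits were tested at a raw 0.05 threshold and that none of them is truly related; the variable names are mine.

```python
n_traits = 4529   # diseases and traits, taken from the quoted Results
alpha = 0.05      # raw significance threshold, assumed to apply to every trait

# Threshold each trait would need to pass under a Bonferroni correction.
bonferroni_threshold = alpha / n_traits

# Number of traits expected to pass raw P < alpha if no trait were truly related.
expected_false_hits = alpha * n_traits

print(f"Bonferroni threshold: {bonferroni_threshold:.2e}")
print(f"traits expected to pass by chance alone: {expected_false_hits:.0f}")
```

So a handful of named traits passing an unadjusted threshold would not tell us much on its own; it would help to know how many traits passed in total.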

There's a reference there that might help us work out what they did, and I still haven't got to the Methods section.

Another separate study looked at GWAS studies of diseases and traits. They only found something about sleep duration. They looked at GWAS of covid phenotypes, and that is when one of the Long covid phenotypes was found to be associated. See my post #45 above and Figure 5.

Next, we explored the common-variant-based genetic correlations based on the genome-wide association studies (GWASs) on 61 complex diseases and traits using a similar procedure (Methods). Interestingly, ME/CFS exhibited the strongest genetic correlation with sleep duration (adjusted P < 0.05, one-sided Wilcoxon rank-sum test followed by Bonferroni correction; Fig. 5C). When linked to COVID-19 phenotypes, we observed a significant common-variant-based genetic correlation between ME/CFS and long COVID-19 [38] (strict case definition; P < 0.05, one-sided Wilcoxon rank-sum test; Fig. 5D; Methods). This result is consistent with the symptom similarities between long COVID-19 and ME/CFS [39,40], but provides a genetic perspective.

jnmaciuch · Apr 18, 2025

Hutan said:
As far as I can tell, there were separate studies.

I think we're on the same page, some signals are just getting lost in transmission!

Hutan · Apr 19, 2025

Hutan said:
There's a reference there that might help us work out what they did.

(I've now skimmed the Methods section of the Zhang paper but it is almost as if they haven't got around to finishing that section. There is very little there about the later studies in the paper including the UK Biobank comparisons.)

Ref #37 Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes, 2022

There's a great looking interface to the UK Biobank database that presumably makes assessing genetic relationships easier. From what I can see, genetic variants have been identified for all of the disease and trait labels that had more than 200 cases in the biobank, nearly 5000 labels. AI tells me that more than 1800 people have a CFS label in the biobank. So, I think Zhang et al should have been able to assess the relationship between their identified 115 variants and those of the people with the CFS label.

From the Supplementary material of Ref 37:

[Image: Screen Shot 2025-04-19 at 10.45.10 am.png]

I think a good question to ask Zhang et al. is: did they explore the relationship between their set of variants and the genetic information of people labelled with CFS in the UK Biobank? And, if not, why not?

Jonathan Edwards · Apr 19, 2025

I am struggling to get my head around this, while trying to write something else, but I wonder if the apparent failure to replicate with the Leeds UK Biobank reflects different methodologies. Do Leeds have a cohort that has had whole genome sequencing and documentation for ME/CFS status? There was the preliminary GWAS study of people who reported having ME/CFS, but is that usable? I don't know, but I wonder if there is a reason why they do not say there is a negative result per se.

Simon M · Apr 19, 2025

Jonathan Edwards said:
But if I get Hutan's argument right, by the same token there should have been replication in a cohort of maybe 2000 in the UK Biobank?

I think there is a question mark about UK biobank diagnostic reliability. People were asked either if they had ever had a diagnosis of chronic fatigue syndrome, or of myalgic encephalomyelitis/chronic fatigue syndrome. It’s too easy for people with a diagnosis of chronic fatigue, which is pretty common, to answer yes to these questions. And Luis Nacul did work in Canada showing this is what happened in a large BC cohort identified in a general population with a broad question: a more detailed follow-up questionnaire established that many of those who answered positively didn’t have ME.

DecodeME is different because not only was there a detailed follow-up questionnaire, but most people were recruited from the ME community, rather than from the general public. I think it’s unlikely that substantial numbers of people with chronic fatigue are in the ME community. But again, we should soon be able to compare results from UK biobank with DecodeME.

But the lack of replicability versus UK biobank doesn’t concern me at this stage.
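A rough way to see how much a loose case definition can wash out an association, as a sketch with made-up numbers rather than anything from the paper: suppose only a fraction of the people labelled as cases truly have ME/CFS, and the rest carry a given variant at the same frequency as controls. `diluted_odds_ratio` is a hypothetical helper, and the carrier frequency (1%) and true odds ratio (3) are purely illustrative.

```python
def diluted_odds_ratio(carrier_freq_controls, true_odds_ratio, true_case_fraction):
    """Odds ratio observed when only a fraction of labelled cases are true cases.

    Mislabelled cases are assumed to carry the variant at the control frequency.
    """
    odds_controls = carrier_freq_controls / (1 - carrier_freq_controls)
    odds_true_cases = true_odds_ratio * odds_controls
    freq_true_cases = odds_true_cases / (1 + odds_true_cases)
    # Carrier frequency among labelled cases is a mixture of the two groups.
    freq_labelled = (true_case_fraction * freq_true_cases
                     + (1 - true_case_fraction) * carrier_freq_controls)
    odds_labelled = freq_labelled / (1 - freq_labelled)
    return odds_labelled / odds_controls


for fraction in (1.0, 0.5, 0.25):
    observed = diluted_odds_ratio(0.01, 3.0, fraction)
    print(f"true cases {fraction:.0%}: observed odds ratio {observed:.2f}")
```

The observed effect shrinks towards 1 as the share of true cases falls, so a weak or absent signal in a loosely defined cohort is not by itself strong evidence against the original finding.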

Jonathan Edwards · Apr 19, 2025

Simon M said:
I think there is a question mark about UK biobank diagnostic reliability.

I agree, but I am confused, because this was a cohort that has been trawled for SNPs as per usual GWAS rather than completely sequenced, was it not? And if the Zhang findings depend on rare genes from whole sequencing, should we expect a replication to even be possible?

I would like to think that it is almost certain that Zhang et al. have picked up some genetic signals, even if their approach giving 115 genes may be much more difficult to interpret than what is likely to come from DecodeME by tracking commoner SNP variants across big numbers.

Hutan was concerned that Zhang failed to find what they should have found if their data were reliable. I am still unclear whether this is so. Even if the UK Biobank 'ME/CFS' cohort was dilute, I would expect with 2000 cases for there to be some degree of agreement.

I must say that I find the way the Zhang paper is written much less transparent than the way Edinburgh do things.

hotblack · Apr 19, 2025

Will the underlying dataset from DecodeME have rare variants and other data used here? Or is that not coming until SequenceME?

Presumably if it is present replication would be fairly straightforward with access to the model (a shame it hasn’t been made available)?

Or even a wider attempt to take the DecodeME data and rerun the recipe from this paper to train or just finetune the model and validate it on a larger and more consistent dataset?

Andy · Apr 19, 2025

hotblack said:
Will the underlying dataset from DecodeME have rare variants and other data used here? Or is that not coming until SequenceME?

No, DecodeME won't have data on rare variants. Genome-wide association studies such as DecodeME only look at common variants in specific locations on the genome. Whole genome analysis studies, such as this one and the proposed SequenceME, 'read' the whole genome of each sample, and this is then used to look for rare variants.

hotblack · Apr 19, 2025

forestglip said:
I think this paper is using their fancy non-linear HEAL2 algorithm to predict disease risk in the first two cohorts, while they had to rely on the traditional statistical tests for the Biobank, so I don't think it should be too concerning that it didn't replicate.

That’s my understanding. It’s not a replication but looking for genetic correlations between ME/CFS and other diseases. They’re saying “okay here are our identified 115 genes, do these pop up for other diseases too”.

However if the genes are relevant and the people in the (Leeds) UK Biobank actually have ME/CFS you’d assume the genes would also show up in some way no?

Hutan said:
I think a good question to ask Zhang et al. is: did they explore the relationship between their set of variants and the genetic information of people labelled with CFS in the UK Biobank? And, if not, why not?

Good questions. Anyone feel comfortable asking them?

hotblack · Apr 19, 2025

Andy said:
No, DecodeME won't have data on rare variants.

Thanks Andy.
