Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Discussion in 'ME/CFS research' started by SNT Gatchaman, Apr 17, 2025.

  1. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    17,346
    Location:
    London, UK
    The Zhang study indicates that ME/CFS is partly caused by genetic susceptibility - which fits with there being a familial clustering but provides direct evidence for such causation.

    Several genes have come up and the precise genes to come up may depend on technicalities but they point to several specific types of process. Some of those are to do with the regulation of the way DNA is read for making proteins (transcription) and some to do with metabolism and cyclic AMP, which has a very general role in cell signalling. All of these are interesting but don't necessarily point us to anything very specific to a cell or organ.

    Genes have also come up that point to two specific areas. One is a T cell dependent immune response. Again that makes sense with viral triggers. Exactly what type of T cell or pathway is a bit difficult to know from the few relevant genes here because of technical complexities, especially for HLA-C. But this is pretty strong evidence that there is an immune process involved - again as we suspected but here is the evidence. Fluge's group had produced some data on this that I think did not replicate but DecodeME has looked at a very large number and will be analysing HLA so hopefully the exact story will become clear. The other area is nerve synapses. Yet again, this is no surprise, but we had less to go on previously for specifically invoking nerve cell events. Now we have a strong indication that susceptibility can come from genetic factors controlling nerve synapses.

    Something that a number of people have been wondering about is whether these two factors - T cell responses and synapses - fit into a single story or whether possibly they belong to two separate stories that in some but not all people with ME/CFS interact to varying degrees.

    The upshot is that a genetic study has provided causal evidence for the sort of factors that we have been thinking about for a while. The paper that Simon McGrath, Adrian Baldwin, Mark Livingstone, Andrew Kewley and I published in 2016 on the 'Biological Challenge of ME/CFS' after discussions here on the forum concluded that it is likely that both immune and neurological signalling systems are involved. But we could not be more specific and we had no evidence. Now we have evidence and further research homing in on the relevant areas ought to reveal the specifics.
     
  2. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,307
    The preprint says all the genetic data is on Synapse.org (have to request access, not sure how easy that is), so maybe someone who knows how can test that.

    Edit: The note about the data was added in version 2 of the preprint.
     
    Last edited: May 17, 2025 at 8:43 PM
  3. Yann04

    Yann04 Senior Member (Voting Rights)

    Messages:
    2,238
    Location:
    Romandie (Switzerland)
    Oh sorry, I had deleted my post because I realised I misread and thought that might have made my suggestion irrelevant. But if my suggestion is still relevant, hopefully someone can test.
     
  4. Braganca

    Braganca Senior Member (Voting Rights)

    Messages:
    405
    Thank you @Jonathan Edwards. I just edited down to your key points for myself so I can try to remember.

    ==

    1. The Zhang study indicates that ME/CFS is partly caused by genetic susceptibility.

    2. Several genes point to several key processes, such as making proteins, metabolism, and cell signalling. None necessarily point to any specific cell or organ.

    3. There is strong evidence that susceptibility to ME/CFS is driven by genes regulating two key areas — a T-cell dependent immune process, and nerve synapses.

    4. It’s possible these two factors, T cell responses and synapses, fit into a single story, or belong to two separate stories that interact to varying degrees in some but not all people with ME/CFS.
     
  5. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    17,346
    Location:
    London, UK
    That's about right @Braganca, but I would reword 2 to say:

    2. Several genes point to several basic processes, such as making proteins, metabolism, and cell signalling without necessarily pointing to any specific cell or organ.

    Number 3 then gives the genes that do point to specific cells.
     
  6. Hutan

    Hutan Moderator Staff Member

    Messages:
    32,485
    Location:
    Aotearoa New Zealand
    Note from the moderation team:
    Regarding DecodeME updates, we have created a thread in Research News that can be watched, in order to receive an alert when a DecodeME preprint is posted:
    2025 DecodeME News Alert thread

    When you watch the thread, you can choose to get an email alert.

    We have deleted some posts about this on this thread, as off-topic.
     
    Last edited: May 17, 2025 at 11:04 PM
  7. Nightsong

    Nightsong Senior Member (Voting Rights)

    Messages:
    1,158
    And a few more:

    Changes in DNA methylation profiles of myalgic encephalomyelitis/chronic fatigue syndrome patients reflect systemic dysfunctions (Helliwell et al., 2020) - differences were identified in the DNA methylation patterns of ME/CFS patients that clearly distinguished them from the healthy controls

    SWATH-MS analysis of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome peripheral blood mononuclear cell proteomes reveals mitochondrial dysfunction (Sweetman et al., 2020) - the proteome analysis presented here, while further establishing the disturbance in regulation of immune and inflammatory biological pathways in ME/CFS, also provides evidence of histone methylation and proteasome activation

    Altered endothelial dysfunction-related miRs in plasma from ME/CFS patients (Blauensteiner et al., 2021) - histone deacetylase 1, a protein responsible for epigenetic regulations, represented the most relevant node within the network

    Increased HDAC in association with decreased plasma cortisol in older adults with chronic fatigue syndrome (Jason et al., 2011) - findings suggest increased histone deacetylase activity

    Exercise responsive genes measured in peripheral blood of women with Chronic Fatigue Syndrome and matched control subjects (Whistler et al., 2005) - exercise-responsive genes differed between CFS patients and controls. These were in genes classified in chromatin and nucleosome assembly, cytoplasmic vesicles, membrane transport, and G protein-coupled receptor ontologies
     
  8. Hutan

    Hutan Moderator Staff Member

    Messages:
    32,485
    Location:
    Aotearoa New Zealand
    I'm really impressed that you did all that work @forestglip. That shows considerable intellect and determination. The authors of this paper should have had you on board.

    So, as we suspected, there was a UK Biobank CFS grouping that the authors could have used to investigate how well their 115 identified genes of interest were represented in the genetic variation of the UK Biobank CFS group. Presumably they did investigate that, so the absence of any report that the matching was nowhere near a significant level is hopefully something that will be fixed in their published version.

    In fact, given that we have concluded that it's really the pathways that the rare variants are pointing to that are the useful thing (given the tiny cohorts and the, well, rarity of rare variants), rather than the individual genes, the analysis comparing identified pathways between the UK Biobank CFS group and the Zhang analysis is the one that is important and should have been reported on in this paper. See Forestglip's post here for those analyses.

    There are interesting things in Forestglip's report of the ranking of the 115 genes in the UK Biobank CFS group.
     
    Last edited: May 18, 2025 at 5:41 AM
  9. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    722
    Location:
    USA
    Agreed!
     
  10. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,307
    Thanks! This kind of activity is one of my favorite things to do. I wish I could spend hours per day learning about this kind of statistics related work, but I'm energy limited to around 10 minutes per day on average, with occasional overreaches like this when I get excited about an idea.

    Yes, I do have the question of: if testing the genes this way in Genebass isn't at all predictive of ME/CFS in another much larger sample (even if it's not a well-defined sample, it's still the closest phenotype to ME/CFS in that database), why would there be an assumption that it would provide any clues about other diseases being genetically related to ME/CFS? But I'll give them the benefit of the doubt that maybe to be able to get a signal from these genes in this way from the Biobank, it requires a much larger sample than the ~2000 cases of CFS. But I also feel like this might be too simplistic of a test if it ignores sample size and is biased towards conditions with a lot of cases.

    Also, I'm not sure why they wouldn't do multiple test correction on this. They did around 4500 statistical tests. The lowest p value was around .009, which doesn't seem unlikely to have come up by chance with that number of tests. They say the following, but I'm not really sure what it means. Why was the Bonferroni procedure not "available"?
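    As a rough illustration of the multiple-testing concern above, here is a minimal Python sketch. It takes the approximate figures from this post (about 4500 tests, lowest p value about .009) and assumes, purely for illustration, a conventional alpha of 0.05 and independent tests; neither assumption comes from the preprint.

```python
# Sketch only: the test count and lowest p value are the approximate
# figures quoted in the post; alpha = 0.05 and independence between
# tests are assumptions made for illustration.

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def prob_min_p_at_most(p: float, n_tests: int) -> float:
    """Chance that at least one of n independent null tests gives a
    p value <= p, i.e. 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n_tests


n_tests = 4500      # "around 4500 statistical tests"
lowest_p = 0.009    # "around .009"

threshold = bonferroni_threshold(0.05, n_tests)
expected_hits = n_tests * lowest_p   # null tests expected at or below lowest_p
chance = prob_min_p_at_most(lowest_p, n_tests)

print(f"Bonferroni threshold: {threshold:.2e}")
print(f"Expected null tests with p <= {lowest_p}: {expected_hits:.1f}")
print(f"Chance of at least one such p value under the null: {chance:.6f}")
```

    Under those assumptions, a lowest p value of .009 is far above the Bonferroni threshold, and dozens of tests would be expected to reach it by chance alone.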

    Yeah, it's a really exciting tool. My memory is hazy, but I think I originally wanted to see if synapse related genes were enriched using the whole list of ranked genes from the Zhang model. I thought I was going to have to go searching for a list of synapse related genes somewhere and then do like a test to see if they're ranked higher than would be expected by chance. Then I found out about GSEA which has thousands of these gene lists for different components and pathways and diseases, and does the calculations for you, so that was cool. Then when I found out the Genebass data was publicly available, it made more sense to use the rankings from there instead of attention scores from the model.
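    The "test to see if they're ranked higher than would be expected by chance" idea can be sketched with a simple permutation test on mean rank. This is not the GSEA algorithm (GSEA uses a weighted running-sum statistic), just a simplified stand-in; the gene names, set membership and permutation count below are all hypothetical.

```python
import random

# Simplified stand-in for a rank-based enrichment test. NOT the GSEA
# algorithm; all gene names and sets here are made up for illustration.


def mean_rank(ranked_genes, gene_set):
    """Mean 1-based rank of the gene-set members found in the ranked list."""
    ranks = [i + 1 for i, g in enumerate(ranked_genes) if g in gene_set]
    if not ranks:
        raise ValueError("no gene-set members found in the ranked list")
    return sum(ranks) / len(ranks)


def permutation_p_value(ranked_genes, gene_set, n_perm=2000, seed=0):
    """One-sided p value: how often does a random set of the same size
    have a mean rank at least as good (as low) as the observed one?"""
    rng = random.Random(seed)
    observed = mean_rank(ranked_genes, gene_set)
    k = sum(1 for g in ranked_genes if g in gene_set)
    n = len(ranked_genes)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(range(1, n + 1), k)
        if sum(sample) / k <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy ranking of 200 hypothetical genes, best-ranked first.
ranked = [f"GENE{i}" for i in range(1, 201)]
top_heavy_set = {"GENE1", "GENE3", "GENE5", "GENE8", "GENE12"}
spread_out_set = {"GENE20", "GENE60", "GENE100", "GENE140", "GENE180"}

p_top = permutation_p_value(ranked, top_heavy_set)
p_spread = permutation_p_value(ranked, spread_out_set)
print(f"top-heavy set p = {p_top:.4f}, spread-out set p = {p_spread:.4f}")
```

    Running many gene sets through a test like this brings back the multiple-testing issue, which is one reason to pick a small number of sets in advance.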

    There are lots more interesting gene set collections to see what else might be enriched in the Biobank cases, but I've been restraining myself from just testing anything and everything and increasing multiple testing error. But I assume the much more experienced study investigators could do something more targeted and thought-through, maybe focusing on gene sets related to synapses and proteasomes, which their model suggested were important.
     
  11. Hutan

    Hutan Moderator Staff Member

    Messages:
    32,485
    Location:
    Aotearoa New Zealand
    Last edited: May 18, 2025 at 7:58 AM
  12. chillier

    chillier Senior Member (Voting Rights)

    Messages:
    252
    Enjoying the commentary on this paper a lot. My feeling is - like with PrecisionLife - that since the method is a bit of a black box it would be nice to see it validated on a disease where more of the biology is known but where a GWAS/rare variant analysis on its own doesn't provide the power to see it. As @jnmaciuch and others have pointed out, the potential circularity in the analysis, where STRING is used as an input to their model but they also later do Fisher tests of gene modules, makes me a bit uneasy. STRING is a database of experimentally validated physical protein interactions, but can also include text-mined associations such as two genes being mentioned in the same paper. I don't know which they used here.

    The AUC being kind of low I don't think is a problem because I don't think using this as a prediction tool (a biomarker) is very important. It's the inference of important genes that matters. The genes that come up are super interesting, though I wouldn't feel comfortable fully trusting them on the basis of this study alone.

    I was looking back through the 14 PrecisionLife ME/CFS genes, and despite them having various supposed molecular functions (metabolism, viral immunity etc), 10 out of 14 of them are predominantly expressed in neurons or their glia in the brain according to the single-cell data from the Human Protein Atlas. I don't know exactly how this data was generated and analyzed of course but I thought this was pretty striking nonetheless. Here's an example for USP6NL https://www.proteinatlas.org/ENSG00000148429-USP6NL/single+cell which appears to be predominantly expressed in microglia and oligodendrocyte precursor cells from 'brain' tissue.

    upload_2025-5-18_13-55-56.png

    As an aside, there's a couple of genes that seem to have high expression in spermatids (again from Human Protein Atlas single-cell data, with whatever problems that may or may not have). S100PBP and AKAP1 from PrecisionLife have very high spermatid expression specificity. ADCY10 from Zhang et al. as well, for instance. Is there something that neuron function and spermatozoa have in common?
     
  13. Utsikt

    Utsikt Senior Member (Voting Rights)

    Messages:
    2,954
    Location:
    Norway
    Are those involved in the production of sperm? Isn’t it a bit curious if something involved in the production of sperm pops up in a disease with a female bias?
     
  14. chillier

    chillier Senior Member (Voting Rights)

    Messages:
    252
    It would be weird, and maybe it is some artefact of the way they've combined the data. It's surely vanishingly unlikely sperm have anything to do with ME/CFS, but maybe similar sets of proteins could be used for something that both neurons and sperm need to do? Like organize the cytoskeleton to help form their cell shapes.
     
  15. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,307
    I was wondering why they wouldn't do this, and instead used some kind of simulated dataset, but then I thought maybe it's because the STRING database basically has all the "answers" for well-characterized diseases. So maybe if the model is looking at a well-studied disease, and sees high expression of 1 or 2 genes known to be involved in that disease, the STRING database tells it every other gene known to be involved in the disease, which get incorporated into the model. And accuracy on the test set would overestimate the accuracy to be expected when looking at novel diseases. Just speculation, I don't know a lot about STRING or how their model works.
     
  16. Simon M

    Simon M Senior Member (Voting Rights)

    Messages:
    1,108
    Location:
    UK
    Like you, I find the discussion fascinating, especially after 30 years with little happening in research.

    But maybe this is the time to raise some methodological concerns. I haven't been able to read the whole paper or all this thread, and I'm likely to be wrong about a lot of this. There are three areas of questions and concerns:

    1. Matching of case and control cohorts.
    2. Apparent failure of Heal2 to discriminate any better than Heal between cases and controls.
    3. The high (to me, possibly in ignorance) rate of loss-of-function genes in cohorts that have all related individuals removed.

    The biggest issue is that Heal2 doesn't appear to improve the ability to separate cases from controls.

    In the dataset that simulates non-linear gene data (exactly what Heal2 is designed to detect) by adding artificial gene differences, Heal2 does well on AUC. With real ME/CFS data, it does not. I've used AUROC, but the results of AUPRC are similar.

    Simulated data:
              ME/CFS   Controls   diff
    AUROC     0.891    0.598      +0.393

    Real ME/CFS data:
              ME/CFS   Controls   diff
    AUROC     0.677    0.668      +0.009

    So, the non-linear model appears (to me) to do no better than the original linear model. If so, can we rely on inferences from non-linear features (and other features unique to Heal2) to determine which genes are important? (I'm unsure what proportion of results and findings rely on the Heal2-only model.)

    Overall, the AUC on real data for both models seems pretty good for genetic findings in a disease with heritability of around 0.12, especially as all related individuals were removed from the cohorts in QC.
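    For anyone unsure what the AUROC figures above measure: AUROC is the probability that a randomly chosen case gets a higher model score than a randomly chosen control, with 0.5 meaning no discrimination. A minimal Python sketch follows; the scores are invented toy values, not numbers from the preprint.

```python
# Sketch only: AUROC computed directly from its pairwise definition.
# The score lists below are invented toy values, not preprint data.


def auroc(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case scores higher;
    ties count as half a win."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# A model that separates the groups reasonably well.
separated = auroc([0.9, 0.8, 0.4, 0.7], [0.3, 0.5, 0.6, 0.2])

# A model whose scores carry no information about case status.
uninformative = auroc([0.1, 0.5], [0.1, 0.5])

print(f"separated: {separated:.3f}, uninformative: {uninformative:.3f}")
```

    Comparing two models then comes down to comparing this quantity on the same people, ideally people the models were not trained on.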
     
  17. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,307
    On the independent ME/CFS dataset, though, there is more of a difference (first quote is the simulated independent dataset, +0.358 improvement over HEAL, second quote is the ME/CFS independent dataset +0.124):
     
  18. Simon M

    Simon M Senior Member (Voting Rights)

    Messages:
    1,108
    Location:
    UK
    Could you just spell that out, please.

    My first quote is from the para which begins with a discussion of simulations and concludes with the data on an "independent dataset" you quote. The next part begins, "To evaluate the performance of Heal2 on real ME/CFS data..." Are you saying the independent data and real ME/CFS are the same, though the results are different? I am confused.
     
  19. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,307
    The first paragraph you're talking about is the discovery cohort, on which the model was trained. It's made up of the Stanford and CureME cohorts. After training the model, it's tested on the independent dataset, the second paragraph, which is made up of different people, the Cornell cohort. The "to evaluate" sentence refers to both steps.

    Edit: Another way to think of it is that both HEAL and HEAL2 were approximately equally able to classify cases after being trained on those same cases. But when it came time to look at brand new data, HEAL lost some accuracy, while HEAL2 was still just about as accurate.
     
    Last edited: May 18, 2025 at 3:57 PM
  20. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    722
    Location:
    USA
    I agree with @forestglip's thoughts. HEAL2 seems to be better at picking out relevant pathways, whereas HEAL picks out relevant one-offs.

    In general, across -omics analyses, there’s a crisis of replication at the individual analyte level. There’s simply too much intra-group variation, confounders, etc. to ever have the same analytes in the same weights recapitulate in an independent cohort when considered on their own. What does seem to be a lot more stable is pathway-level analysis. E.g. individual androgenic steroid metabolites may not recapitulate in a different cohort, but if you look at the class of androgenic steroid metabolites as a whole, you’ll get a robust validation.

    What that means is that bioinformatics is increasingly becoming reliant on how analytes are grouped together in public gene/analyte set collections like GO, KEGG, Hallmark, etc. Some of them are very well curated (like Hallmark), some of them less so (like GO:BP). As I’ve mentioned in the context of STRING, no matter the quality of curation, you’re still heavily reliant on what other people have chosen to study before.

    If your disease happens to be driven by one alternative not-well-characterized pathway that involves some NOTCH ligands, a couple cytokines, and ECM proteins, you’re going to get weak hits in all three but none of them may pass multiple testing correction. So it requires a good scientist to go into the gene rankings themselves, cross reference with literature that wasn’t considered by gene set curators, and formulate hypotheses. Sometimes, researchers will create a custom gene-set a priori to detect this kind of thing, but doing that tends to eschew the “unbiased discovery” justification for doing -omics experiments in the first place.
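    To put toy numbers on the "weak hits that don't pass multiple testing correction" scenario above, here is a minimal Python sketch of a one-sided hypergeometric over-representation test. Every number in it (background size, hit-list size, gene-set size, overlap, number of gene sets tested) is hypothetical and chosen only to illustrate the point.

```python
from math import comb

# Sketch only: one-sided hypergeometric over-representation test.
# All counts below are hypothetical illustration values.


def hypergeom_sf(overlap, set_size, n_hits, n_background):
    """P(X >= overlap) when n_hits genes are drawn at random from
    n_background genes, of which set_size belong to the gene set."""
    total = comb(n_background, n_hits)
    upper = min(set_size, n_hits)
    tail = sum(
        comb(set_size, x) * comb(n_background - set_size, n_hits - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


n_background = 20000   # hypothetical number of genes tested
n_hits = 115           # hypothetical size of the top-gene list
set_size = 100         # hypothetical curated gene-set size
overlap = 3            # hypothetical hits landing in that set
n_sets_tested = 50     # hypothetical number of gene sets tested

p_value = hypergeom_sf(overlap, set_size, n_hits, n_background)
bonferroni_cutoff = 0.05 / n_sets_tested

print(f"nominal p = {p_value:.4f}, Bonferroni cutoff = {bonferroni_cutoff:.4f}")
print("passes correction:", p_value < bonferroni_cutoff)
```

    With these made-up numbers the overlap looks nominally interesting but does not survive correction, so a pathway split across several curated sets could produce several such near-misses and nothing "significant".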

    All this is a long tangent to say that pathway/network level analysis is nearly always stronger, and works rather well considering there are many many well-characterized pathways, but one should always keep in mind this inherent bias towards replicating what has already been well studied.

    Which is why my biggest concern with this study is not weakness of predictive capability or potential skewing of cohort selection, but simply what is being overlooked and what ends up overemphasized. But that’s really just an unavoidable problem, and doesn’t discredit the findings that did show up strongly here. Especially if we only care about the data inferentially, rather than using it for patient classification (which is where these concerns would be much more important).
     
    Last edited: May 18, 2025 at 5:10 PM
