Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Discussion in 'ME/CFS research' started by SNT Gatchaman, Apr 17, 2025.

  1. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,245
    Thread for a paper on NLGN1 and risk of suicide:

Genome-wide association study meta-analysis of suicide death and suicidal behavior (2022, Molecular Psychiatry)
     
    Hutan, Deanne NZ, Kitty and 3 others like this.
  2. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    671
    Location:
    USA
I reached out to them to get the full list of their top genes, since that was in a supplemental table that wasn’t included with the preprint. I didn’t hear back, but they might respond to someone else.
     
    ME/CFS Skeptic, Hutan, Lilas and 6 others like this.
  3. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,245
    I emailed about the same thing! No response.

    The paper references a Supplementary Table 2 that should list the genes, but that's not on MedRxiv, so I thought maybe they forgot to upload it.
     
    Hutan, Lilas, MeSci and 6 others like this.
  4. Sasha

    Sasha Senior Member (Voting Rights)

    Messages:
    5,564
    Location:
    UK
    It's possible to post a comment on the preprint on MedRxiv. How about asking there about it?
     
    Hutan, MeSci, Deanne NZ and 3 others like this.
  5. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,245
    Good idea. I added a comment asking if they could upload the supplementary tables or provide the list of genes. I think MedRxiv has to approve the comment before it becomes visible though.
     
    Hutan, MeSci, Deanne NZ and 6 others like this.
  6. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    671
    Location:
    USA
    That’s why I’d really like to see the full list of top genes—I’m hoping the list will include some indication of their relative contribution to the model. Their independent cohort was extremely small but they managed to get the same AUC on a rare variant analysis. To me, that indicates that there’s a small number of genes that are really driving the predictive power of the model, and those are the same ones coming up in the independent cohort. [Edit: it says in their methods that they performed input feature prioritization, so they should have some list of weights for each feature even if they didn't do feature selection.]

    Specifically in their analysis, I would have liked to see some sort of feature selection process where they try to pare down their model to the fewest number of genes with the most predictive power—that would go a long way towards addressing the possibility of overfitting on a small cohort (though if it was an issue, what you’d usually see is that the AUC on the validation cohort tanks to 0.5).

    It’s possible they already tried that and weren’t able to recapitulate their score with a smaller feature set, which would indicate that it is truly a diffuse and broad biological process at play. I don’t have strong doubts about the robustness of the statistical associations, though I think the story needs to be narrowed down significantly to be a useful hint towards biological mechanism.
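(Side note for anyone following along: AUC is the probability that a randomly chosen case gets a higher model score than a randomly chosen control, so 0.5 is coin-flip performance and 1.0 is perfect separation. A minimal sketch of the concept with made-up scores — not the paper's implementation:)

```python
def auc(case_scores, control_scores):
    """Fraction of (case, control) pairs where the case scores higher
    (ties count as half). 1.0 = perfect separation, 0.5 = chance."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

auc([0.9, 0.8, 0.4], [0.3, 0.2, 0.1])  # every case beats every control -> 1.0
```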
     
    Last edited: May 2, 2025
    Hutan, Lilas, SNT Gatchaman and 6 others like this.
  7. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,245
    It's unclear if all the participants had ME/CFS pre-COVID. If some of them developed ME/CFS as a part of long COVID, aren't all the COVID associations to be expected?

    The cohorts:
    The referenced 2023 study doesn't say whether the participants were pre-COVID ME/CFS.
     
    Hutan, Lilas, Deanne NZ and 4 others like this.
  8. Kitty

    Kitty Senior Member (Voting Rights)

    Messages:
    8,083
    Location:
    UK
    I've been wondering how people could proceed if they find multiple associations that look good, but none have strong predictive power. I guess re-running the model on a different cohort would help, but might it also be an issue with the approach itself? It might be an excellent model, just not the best one for this problem.

    I don't know anything about genomics or computational modelling, so I've struggled with the paper. I can see the results look compelling and exciting, but it's hard to get a grasp of how people would go about using them to design experiments.
     
  9. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    671
    Location:
    USA
    From what I can tell, their method (HEAL2) is quite the improvement over standard regression analyses. If anything, I think the problem would be small sample size limiting the rare variants that are even represented in their cohort.

    The model that HEAL2 gave them consists of all the genes with detected rare variants plus the weight that they contribute to the prediction--meaning that what's stored in the model is essentially a complicated way of saying:
    "If you have a rare deleterious mutation in gene X1, you are Y1 times more likely to be labeled "ME/CFS" instead of "control" in the data set, and if you have a mutation gene X2, you are Y2 times more likely, and if you have a mutation in gene X3...."
    And that basically all gets summed up across all the detected genes, with the vast majority of genes ending up as "equal likelihood of being ME/CFS or control", in which case they're ignored by the model.
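A toy sketch of that kind of additive gene-level model (the gene names, weights, and bias below are all invented for illustration — this is not the paper's actual HEAL2 code):

```python
import math

# Hypothetical per-gene log-odds weights "learned" by a model.
# A gene with weight ~0 contributes nothing and is effectively ignored.
weights = {"GENE_A": 1.2, "GENE_B": 0.8, "GENE_C": 0.0}

def predict_risk(genes_with_rare_variant, weights, bias=-1.0):
    """Sum the log-odds contribution of every gene in which this
    participant carries a rare deleterious variant, then squash the
    total to a 0-1 probability with a sigmoid."""
    logit = bias + sum(weights.get(g, 0.0) for g in genes_with_rare_variant)
    return 1.0 / (1.0 + math.exp(-logit))

# A participant with rare variants in GENE_A and GENE_C:
p = predict_risk({"GENE_A", "GENE_C"}, weights)  # ~0.55: GENE_C adds nothing
```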

They could potentially rerun their HEAL2 method on a much larger cohort to increase the total number of rare variants detected, though for WGS that would be very costly. I think the best bet is just to see if any of the identified genes overlap with what DecodeME finds--though DecodeME might miss a lot of these if they aren't part of the genome regions being screened.

    If it is a situation where there's just a whole slate of weakly associated genes, then the next step is trying to see what process, if anything, connects as many of the puzzle pieces as possible. That will probably be difficult for any genes that are 12 steps removed from the actual pathological mechanism of ME, whatever it is. But the hope is that a good puzzle solver would be able to draw a thread between at least a couple of the strongest associations and go from there.

    That's why I'm particularly interested to see the feature weights from their model--if they ended up with 115 significant genes but only one or two dozen of them are really contributing to the model's predictive power, that's a much better jumping off point for narrowing things down. If it really is 100+ different genes with equally weak predictive power and no apparent connections between them, then it is probably just back to the drawing board at that point.
     
    janice, ukxmrv, Turtle and 6 others like this.
  10. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    671
    Location:
    USA
    And for what it's worth, generally responding to some old comments on this thread:
    between the cross validation in the training cohort and the completely independent test cohort validation, I don't think model overfitting is a huge concern here.

    Though I'd still like to see some feature selection. Testing the predictive power of a small number of features on yet another independent cohort would really make me confident.
     
    ukxmrv, Lilas, hotblack and 2 others like this.
  11. Utsikt

    Utsikt Senior Member (Voting Rights)

    Messages:
    2,852
    Location:
    Norway
    How costly? Is it something someone like Google could be convinced to join for some PR benefits?
Can you ELI5 overfitting? I don’t think most people are aware of what it is, and I’m not too confident in my own understanding.
     
    hotblack, Deanne NZ and Kitty like this.
  12. Kitty

    Kitty Senior Member (Voting Rights)

    Messages:
    8,083
    Location:
    UK
    Thanks for the explanation, @jnmaciuch, it's really helpful.
     
  13. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    671
    Location:
    USA
The number quoted to me several years ago was around $1000 per genome. It's probably available for less now, though perhaps at the expense of accuracy and quality. And ideally you'd get at least 1000 participants, so an upper estimate is US$1 million, potentially half that if decent cheaper options have been developed. I'm definitely not up-to-date on that.

Maybe it's something that one rich philanthropist with a personal connection to ME/CFS would be willing to do. From knowing a couple of folks who have worked at big tech companies, I'm doubtful that they can be convinced to step up.
     
  14. Kitty

    Kitty Senior Member (Voting Rights)

    Messages:
    8,083
    Location:
    UK
    It's something that's already being planned (SequenceME), though they haven't raised all the funding yet. But it wasn't enormous in research terms, maybe around £800,000? I've got to go out now so no time to check, but it's on a thread here.

    Anyway if it goes ahead, this model presumably could be run on the data.
     
    hotblack, Deanne NZ and jnmaciuch like this.
  15. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    671
    Location:
    USA
Sure! [Edit: the one-sentence version:] Overfitting is when a "big data" model meant to predict a phenomenon ends up being driven by latent factors that are unique to your data set.

    [Edit: longer explanation below]
    For example, let's say you did a big metabolomics study to predict something like susceptibility to seasonal infections. Your cohort happened to include a bunch of young adults who go out partying frequently, a bunch of elderly folks recruited from a local nursing home, and some schoolteachers who saw a recruitment flier.

If you allow your model to take many, many data points into account, it's going to end up generating a model that is highly predictive because it's really predicting all the little latent factors that might lead to a high infection rate in the specific situations represented in your cohort. In this example, it might pick up on a bunch of metabolites related to recent alcohol consumption, AND various markers of aging, AND things that correspond with a high-stress job where you end up eating lots of fruit snacks.

    You end up with a trained model (i.e. "for every unit increase in metabolite X there's a Y increase in infections", which then gets summed over all the included metabolites to produce one final "predicted infection frequency" score for each participant). But if you try to use that same model to generate "infection scores" for a new cohort, it'll be much less accurate, because the model was overfit to factors that were only relevant in the first cohort.

    You can address this by a couple methods:
    1) Cross validation: you split your training cohort into a bunch of random subsets and rerun the model several times, keeping only the features that were highly predictive across the random subsets.
    2) Feature selection: limit the number of features the model can take into account. The fewer features it can include, the more it has to focus on features that were predictive across all your participants, rather than accounting for every little quirk of each subset.

    And after the fact, you can retroactively assess overfitting by running the trained model on the test cohort and seeing how well it predicts your outcome on data that wasn't involved in training.
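A quick runnable illustration of the overfitting idea above (entirely synthetic data, nothing to do with the paper): when you have many random features and few participants, some "predictor" will always fit the training cohort by pure chance, and its performance collapses on independent data.

```python
import random

random.seed(0)

n_train, n_test, n_features = 30, 100, 500

def make_cohort(n):
    """Labels and features are ALL random coin flips: there is no real signal."""
    labels = [random.randint(0, 1) for _ in range(n)]
    feats = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(n)]
    return feats, labels

def accuracy(feats, labels, j):
    # How often does feature j, on its own, match the label?
    return sum(row[j] == y for row, y in zip(feats, labels)) / len(labels)

train_X, train_y = make_cohort(n_train)
test_X, test_y = make_cohort(n_test)

# "Training" = cherry-pick the single feature that best fits the training
# cohort. With 500 random features and only 30 participants, something
# will fit well by pure chance.
best = max(range(n_features), key=lambda j: accuracy(train_X, train_y, j))

train_acc = accuracy(train_X, train_y, best)  # looks impressive...
test_acc = accuracy(test_X, test_y, best)     # ...but falls back toward 0.5
```

The gap between `train_acc` and `test_acc` is exactly what checking against an independent test cohort is meant to expose.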
     
    Last edited: May 2, 2025
    janice, ukxmrv, Hutan and 5 others like this.
  16. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    17,231
    Location:
    London, UK
    What I find difficult to follow is how you can get a meaningful estimate of weightings for rare gene alleles from a small sample. I would not expect more than one case to have any particular rare allele, so how do you weight that?
     
    Kiristar, janice, ukxmrv and 4 others like this.
  17. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    671
    Location:
    USA
    I'm pretty sure their model made predictions at the gene level. They focused on rare variants [edit: since they're more likely to cause loss-of-function], but then assigned weightings based on having any rare deleterious variant for that gene.
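A toy illustration of that gene-level collapsing step (the variant annotations below are invented, and the real pipeline's criteria for "deleterious" are more involved):

```python
# Hypothetical variant calls for one participant: (gene, is_rare, is_deleterious)
variants = [
    ("GENE_A", True, True),    # rare AND deleterious -> counts
    ("GENE_A", True, False),   # rare but benign -> ignored
    ("GENE_B", False, True),   # deleterious but common -> ignored
]

# Collapse to one binary feature per gene: does this participant carry
# ANY rare deleterious variant in that gene?
gene_feature = {}
for gene, is_rare, is_deleterious in variants:
    gene_feature.setdefault(gene, False)
    if is_rare and is_deleterious:
        gene_feature[gene] = True

# gene_feature -> {"GENE_A": True, "GENE_B": False}
```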
     
    Last edited: May 2, 2025
  18. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,245
    Previously I had emailed one author about the list of genes. Today I emailed the other corresponding author, and they responded to say they'll have to look into whether they're allowed to share it. If they send me the list and give me permission to post, I'll do so.
     
    Binkie4, janice, Ariel and 14 others like this.
  19. wigglethemouse

    wigglethemouse Senior Member (Voting Rights)

    Messages:
    1,175
    Grabbing some quotes from the paper to describe the "cytotoxic" CD4 cell finding

    From Abstract
    Paper text for Supplementary Fig. 6 - it seems to be saying they took the 115 genes from the HEAL2 analysis and examined their expression in the scRNA-seq data from the Hanson/Grimson Cornell dataset.
    Text continues to talk about what they found and shown in Fig 4G,H,I and in Supplementary table 3. It's the only time the paper uses the word strikingly.
    Here are the figures 4G-I
     
    hotblack, MeSci, Hutan and 5 others like this.
  20. wigglethemouse

    wigglethemouse Senior Member (Voting Rights)

    Messages:
    1,175
    I revisited my earlier post in this thread about Mark Davis highlighting an "autoimmune" CD4 cell subset in ME/CFS that was elevated in some patients. Here is the slide
    View attachment 25850
    It is titled KIR+ CD8 cells and autoimmune CD4 cells. After listening to that talk, I wonder if Mark Davis's "autoimmune CD4" = the "cytotoxic CD4" of this paper, and his KIR+ CD8 = the cytotoxic CD8. I.e., did he mean that both the CD4 and CD8 cells in the slide were "cytotoxic"? His presentation showed both being elevated in a cyclical manner in autoimmune disease.

    The researcher that did the work left academia a few years ago so it is not easy to ask Stanford for clarification on how the "autoimmune CD4" cells were identified.
     
    Last edited: May 3, 2025
    ukxmrv, hotblack, MeSci and 4 others like this.
