Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Thread for a paper on NLGN1 and risk of suicide:

Genome-wide association study meta-analysis of suicide death and suicidal behavior (2022, Molecular Psychiatry)
Neurexin 1 [NRXN1, pre-synaptic protein that the post-synaptic neuroligins bind to] variants were previously implicated as risk factors for SD [suicide death] [67, 68]. The top associated variant rs73182688 in NLGN1 in this study is nominally associated with BMI (p = 0.0006), depression (p = 0.004 in FinnGen R5), and personality disorder (p = 0.004 in FinnGen R5) (Supplementary Table S21). Other variants (SNVs and CNVs) in NLGN1 and/or other family members NLGN3 and NLGN4 were previously associated with PTSD, autism, obsessive-compulsive disorder, and depression [69,70,71,72,73,74,75].

The rs6779753 variant in NLGN1, previously associated with PTSD as well as with the intermediate phenotypes of higher startle response and greater hemodynamic responses of the amygdala and orbitofrontal cortex to fearful face stimuli (assessed using functional MRI), was not in LD with the variant identified herein. In our study, rs6779753 was only suggestively associated with SD (p = 0.06).

NLGN1 was also implicated in a preclinical model of depression [76]. In addition, presynaptic neurexins and cytoplasmic partners such as SHANK have also been implicated in autism, schizophrenia, and mental retardation [73, 77,78,79,80,81,82,83].

Overall, there is substantial genetic evidence implicating the NRXN-NLGN pathway in suicide and other psychiatric conditions.
 
I reached out to them to get the full list of their top genes, since that was in a supplemental table that wasn’t included with the preprint. I didn’t hear back. Perhaps they’d respond to someone else.
I emailed about the same thing! No response.

The paper references a Supplementary Table 2 that should list the genes, but that's not on MedRxiv, so I thought maybe they forgot to upload it.
 
The chat on this paper has died down. I remain puzzled. If they have found evidence of genetic causation in ME/CFS, this is a major milestone. If everyone is unsure whether the data are statistically robust, then it would be nice to have a clear idea of why.

There seems to be a problem that 'too many' gene sites have come up, making it unclear which to focus on. I wish I understood how this could happen. My intuition is that if we really think a gene is relevant, then it will tell us something useful about mechanism. In the diseases I am familiar with, the genetic links all make sense very easily - MHC Class I and II, PTPN22, common cytokine receptors, etc. It is such a pity that the presentation of this paper is so opaque.
That’s why I’d really like to see the full list of top genes—I’m hoping the list will include some indication of their relative contribution to the model. Their independent cohort was extremely small but they managed to get the same AUC on a rare variant analysis. To me, that indicates that there’s a small number of genes that are really driving the predictive power of the model, and those are the same ones coming up in the independent cohort. [Edit: it says in their methods that they performed input feature prioritization, so they should have some list of weights for each feature even if they didn't do feature selection.]

Specifically in their analysis, I would have liked to see some sort of feature selection process where they try to pare down their model to the fewest number of genes with the most predictive power—that would go a long way towards addressing the possibility of overfitting on a small cohort (though if it was an issue, what you’d usually see is that the AUC on the validation cohort tanks to 0.5).

It’s possible they already tried that and weren’t able to recapitulate their score with a smaller feature set, which would indicate that it is truly a diffuse and broad biological process at play. I don’t have strong doubts about the robustness of the statistical associations, though I think the story needs to be narrowed down significantly to be a useful hint towards biological mechanism.
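To make the idea concrete, here is a minimal sketch of that kind of feature-selection check. Everything here is made up for illustration (it is not the paper's cohort, genes, or the HEAL2 code): rank genes by a crude case/control burden difference on a training half, then see whether a small top-k gene set still predicts on a held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data (an assumption for illustration, not the paper's cohort):
# 200 samples x 50 binary gene-burden features, with signal only in the
# first 5 genes.
n, p, k_true = 200, 50, 5
X = rng.binomial(1, 0.2, size=(n, p))
true_logit = X[:, :k_true].sum(axis=1) - 0.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

def auc(scores, labels):
    """Rank-based AUC: probability a random case outscores a random control."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for v in np.unique(scores):          # average the ranks of tied scores
        ranks[scores == v] = ranks[scores == v].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Crude feature selection: rank genes by the case/control burden difference
# in a training half, then score the held-out half using only the top-k genes.
train, test = np.arange(0, n, 2), np.arange(1, n, 2)
diff = np.abs(X[train][y[train] == 1].mean(0) - X[train][y[train] == 0].mean(0))
for k in (5, p):
    top = np.argsort(diff)[::-1][:k]
    score = X[test][:, top].sum(axis=1)  # unweighted burden over selected genes
    print(f"top-{k} genes: held-out AUC = {auc(score, y[test]):.3f}")
```

If the held-out AUC holds up with only a handful of genes, those genes are carrying the predictive power; if it collapses toward 0.5, the signal really is diffuse.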
 
Last edited:
Notably, we found that ME/CFS was genetically correlated with various complex diseases and traits in rare variants (P < 0.05, one-sided Wilcoxon rank-sum test; Fig. 5A), including depression, irritable bowel syndrome (IBS), and COVID-19 susceptibility (C2).
When linked to COVID-19 phenotypes, we observed a significant common-variant-based genetic correlation between ME/CFS and long COVID-19 (strict case definition; P < 0.05, one-sided Wilcoxon rank-sum test; Fig. 5D; Methods). This result is consistent with the symptom similarities between long COVID-19 and ME/CFS, but provides a genetic perspective.
It's unclear if all the participants had ME/CFS pre-COVID. If some of them developed ME/CFS as a part of long COVID, aren't all the COVID associations to be expected?

The cohorts:
Cornell ME/CFS cohort
The Cornell cohort was composed of 66 individuals with ME/CFS and 47 healthy controls who consented to provide blood for WGS. Almost all were participants in the cohort described by Moore et al.75, though a few provided blood but did not participate in exercise testing.
The referenced 2023 study doesn't say whether the participants were pre-COVID ME/CFS.
UK CureME ME/CFS cohort
The UK CureME cohort74 is a comprehensive resource designed to advance research into ME/CFS and is part of the UK ME/CFS Biobank initiative (https://cureme.lshtm.ac.uk/). Participants undergo thorough clinical assessments to confirm diagnoses and provide detailed phenotypic data. Biological samples, primarily blood, are collected at multiple time points to facilitate longitudinal analyses. From the UK CureME cohort, we received blood samples for 190 individuals with ME/CFS and 30 healthy controls who provided consent for WGS (RFL Biobank Ethical Reference number: NC2021.24). The cohort adheres to rigorous protocols to ensure high-quality, anonymized data and samples, which are stored at the UCL/RFH Biobank in London, UK.
Stanford ME/CFS cohort
The Stanford ME/CFS cohort comprises 364 participants (208 cases and 156 controls), including members from 22 families totaling 74 participants, with 9 identical twins discordant for ME/CFS. Participants received ME/CFS diagnoses from specialized clinicians in the Bay Area and underwent comprehensive clinical assessments to confirm diagnoses based on ICC and IOM criteria69,71. Detailed phenotypic data were collected for all participants, ensuring a rich dataset for analysis. Blood samples were collected over several years at multiple Stanford locations and, for very severe and extremely severe cases, at patients’ homes. Samples were collected using 10 ml K2 EDTA vacutainers, processed immediately, snap-frozen in liquid nitrogen, and stored at −80°C. The research was approved by the Institutional Review Board at Stanford University (40146), adhering to rigorous protocols to ensure high-quality, anonymized data and samples.

We further increased the number of negative controls by incorporating samples from the iPOP (N = 110) and hPOP (N = 268) cohorts, which are described elsewhere72,73. This yielded the final Stanford ME/CFS cohort of 208 patients and 534 controls.
 
I would have liked to see some sort of feature selection process where they try to pare down their model to the fewest number of genes with the most predictive power

I've been wondering how people could proceed if they find multiple associations that look good, but none have strong predictive power. I guess re-running the model on a different cohort would help, but might it also be an issue with the approach itself? It might be an excellent model, just not the best one for this problem.

I don't know anything about genomics or computational modelling, so I've struggled with the paper. I can see the results look compelling and exciting, but it's hard to get a grasp of how people would go about using them to design experiments.
 
I've been wondering how people could proceed if they find multiple associations that look good, but none have strong predictive power. I guess re-running the model on a different cohort would help, but might it also be an issue with the approach itself? It might be an excellent model, just not the best one for this problem.
From what I can tell, their method (HEAL2) is quite the improvement over standard regression analyses. If anything, I think the problem would be small sample size limiting the rare variants that are even represented in their cohort.

The model that HEAL2 gave them consists of all the genes with detected rare variants plus the weight that they contribute to the prediction--meaning that what's stored in the model is essentially a complicated way of saying:
"If you have a rare deleterious mutation in gene X1, you are Y1 times more likely to be labeled "ME/CFS" instead of "control" in the data set, and if you have a mutation gene X2, you are Y2 times more likely, and if you have a mutation in gene X3...."
And that basically all gets summed up across all the detected genes, with the vast majority of genes ending up as "equal likelihood of being ME/CFS or control", in which case they're ignored by the model.
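The "sum of per-gene contributions" description above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the authors' HEAL2 code; the genes and weights are made up:

```python
import numpy as np

# Hypothetical sketch (not the authors' HEAL2 code): each feature is a
# 0/1 flag for "carries at least one rare deleterious variant in gene X",
# and each gene has a learned log-odds weight.
def predict_me_cfs_score(burden, weights, intercept=0.0):
    """Sum per-gene log-odds contributions and map to a probability."""
    log_odds = intercept + np.dot(burden, weights)
    return 1.0 / (1.0 + np.exp(-log_odds))  # sigmoid -> P(label == "ME/CFS")

# Three made-up genes: the first two contribute risk, the third has
# weight 0 and is effectively ignored by the model.
weights = np.array([1.2, 0.8, 0.0])
carrier = np.array([1, 1, 1])      # rare variants in all three genes
non_carrier = np.array([0, 0, 1])  # rare variant only in the ignored gene

print(predict_me_cfs_score(carrier, weights))      # ~0.881, well above 0.5
print(predict_me_cfs_score(non_carrier, weights))  # exactly 0.5
```

In this picture, "equal likelihood of being ME/CFS or control" is just a weight of zero, which is why most genes drop out of the prediction.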

They could potentially rerun their HEAL2 method on a much larger cohort to increase the total amount of rare variants detected, though for WGS that would be very costly. I think the best bet is just to see if any of the identified genes overlap with what DecodeME finds--though DecodeME might miss a lot of these if they aren't part of the genome regions that are being screened.

If it is a situation where there's just a whole slate of weakly associated genes, then the next step is trying to see what process, if anything, connects as many of the puzzle pieces as possible. That will probably be difficult for any genes that are 12 steps removed from the actual pathological mechanism of ME, whatever it is. But the hope is that a good puzzle solver would be able to draw a thread between at least a couple of the strongest associations and go from there.

That's why I'm particularly interested to see the feature weights from their model--if they ended up with 115 significant genes but only one or two dozen of them are really contributing to the model's predictive power, that's a much better jumping off point for narrowing things down. If it really is 100+ different genes with equally weak predictive power and no apparent connections between them, then it is probably just back to the drawing board at that point.
 
And for what it's worth, generally responding to some old comments on this thread:
Between the cross-validation in the training cohort and the completely independent test-cohort validation, I don't think model overfitting is a huge concern here.

Though I'd still like to see some feature selection. Testing the predictive power of a small number of features on yet another independent cohort would really make me confident.
 
They could potentially rerun their HEAL2 method on a much larger cohort to increase the total amount of rare variants detected, though for WGS that would be very costly.
How costly? Is it something someone like Google could be convinced to join for some PR benefits?
I don't think model overfitting is a huge concern here.
Can you ELI5 overfitting? I don’t think most people are aware of what it is, and I’m not too confident in my own understanding.
 
How costly? Is it something someone like Google could be convinced to join for some PR benefits?
The number quoted to me several years ago was around $1000 each. It's probably available for cheaper now, though perhaps at the expense of accuracy and quality. And ideally you'd get at least 1000 participants, so an upper estimate is $1M, potentially half that if decent cheaper options have been developed. I'm definitely not up-to-date on that.

Maybe it's something that one rich philanthropist with a personal connection to ME/CFS would be willing to do. From knowing a couple of folks who have worked at big tech companies, I'm doubtful that they can be convinced to step up.
 
Can you ELI5 overfitting? I don’t think most people are aware or what it is, and I’m not too confident in my own understanding.
Sure! [Edit: the one-sentence version:] The basic meaning of overfitting is when you have a "big data" model meant to predict a phenomenon, but it ends up being driven by latent factors that are unique to your data set.

[Edit: longer explanation below]
For example, let's say you did a big metabolomics study to predict something like susceptibility to seasonal infections. Your cohort happened to include a bunch of young adults who go out partying frequently, a bunch of elderly folks recruited from a local nursing home, and some schoolteachers who saw a recruitment flier.

If you allow your model to take many, many data points into account, it's going to end up highly predictive because it's really predicting all the little latent factors that might lead to a high infection rate for all the specific situations represented in your cohort. In this example, it might pick up on a bunch of metabolites related to recent alcohol consumption, AND various markers of aging, AND things that correspond with a high-stress job where you end up eating lots of fruit snacks.

You end up with a trained model (i.e. "for every unit increase in metabolite X there's a Y increase in infections", which then gets summed over all the included metabolites to produce one final "predicted infection frequency" score for each participant). But if you try to use that same model to generate "infection scores" for a new cohort, it'll be much less accurate, because the model was overfit to factors that were only relevant in the first cohort.

You can address this by a couple methods:
1) Cross-validation: you split your training cohort into a bunch of random subsets and rerun the model several times, keeping only the features that were highly predictive across the random subsets.
2) Feature selection: limit the number of features the model can take into account. The fewer features it can include, the more it's going to focus on features that were predictive across all your participants, rather than accounting for every minutia of every subset.

And after the fact, you can retroactively assess overfitting by running the trained model on the test cohort and seeing how well it predicts your outcome on data that wasn't involved in training.
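That train-versus-test gap is easy to demonstrate on pure noise. Everything below is made-up data for illustration (nothing to do with the paper): since no feature is truly related to the outcome, any apparent accuracy on the training set is overfitting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure-noise data (an assumption for illustration): features and labels
# are independent, so there is nothing real to learn.
n_train, n_test, p = 100, 100, 500
X_train, X_test = rng.normal(size=(n_train, p)), rng.normal(size=(n_test, p))
y_train, y_test = rng.integers(0, 2, n_train), rng.integers(0, 2, n_test)

# Naive "model": weight each feature by its case-minus-control mean
# difference on the training set, then score samples by the weighted sum.
weights = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)

def auc(scores, labels):
    """Rank-based AUC (continuous scores, so no ties to worry about)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# With 500 free parameters and 100 samples, the model memorizes the
# training set: training AUC looks excellent, held-out AUC is chance.
print(f"training AUC: {auc(X_train @ weights, y_train):.3f}")
print(f"held-out AUC: {auc(X_test @ weights, y_test):.3f}")
```

This is exactly why the independent test cohort matters: an overfit model keeps its impressive training AUC but falls back to roughly 0.5 on data it never saw.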
 
What I find difficult to follow is how you can get a meaningful estimate of weightings for rare gene alleles from a small sample. I would not expect more than one case to have any particular rare allele, so how do you weight that?
I'm pretty sure their model made predictions at the gene level. They focused on rare variants [edit: since they're more likely to cause loss-of-function], but then assigned weightings based on having any rare deleterious variant for that gene.
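Concretely, "gene-level" means collapsing variant-level calls into a per-gene carrier flag before any weighting. A hypothetical sketch (not the authors' pipeline; the gene names and records are made up):

```python
# Hypothetical sketch of gene-level collapsing (not the authors' pipeline).
# Each record: (sample, gene, is_rare_and_predicted_deleterious).
variant_calls = [
    ("S1", "NLGN1", True),   # two qualifying variants in the same gene...
    ("S1", "NLGN1", True),
    ("S1", "SHANK3", False), # common/benign variant: does not count
    ("S2", "SHANK3", True),
]

# Collapse to a 0/1 burden flag per (sample, gene): carrying *any*
# qualifying rare variant sets the flag, no matter how many there are.
burden = {}
for sample, gene, qualifies in variant_calls:
    if qualifies:
        burden[(sample, gene)] = 1

print(burden.get(("S1", "NLGN1"), 0))   # 1
print(burden.get(("S1", "SHANK3"), 0))  # 0
print(burden.get(("S2", "SHANK3"), 0))  # 1
```

This sidesteps the "each rare allele appears in only one case" problem: the weighting applies to the gene's carrier flag, which many cases can share even if no two share the same variant.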
 
Previously I had emailed one author about the list of genes. Today I emailed the other corresponding author, and they responded to say they'll have to look into whether they're allowed to share it. If they send me the list and give me permission to post, I'll do so.
 
Grabbing some quotes from the paper to describe the "cytotoxic" CD4 cell finding

From Abstract
Patient-derived multi-omics data implicate reduced expression of ME/CFS risk genes within ME/CFS patients, including in the plasma proteome, and the transcriptomes of B and T cells, especially cytotoxic CD4 T cells, supporting their disease relevance.

Paper text for Supplementary Fig. 6 - it seems to be saying they took the 115 genes from the HEAL2 analysis and examined their expression in the scRNA-seq data from the Hanson/Grimson Cornell dataset.
To explore the functional changes of the 115 ME/CFS genes with increased granularity, we performed GSEA in cell subsets generated with high resolution clustering of T lymphoid cells from single-cell RNA sequencing (scRNA-seq) of PBMCs derived from ME/CFS patients and controls33,34 (Supplementary Fig. 6).

Starting with data from the T cell subset pseudobulked by sample and cell cluster, we again compared expression levels in ME/CFS versus controls for the 115 genes and significant modules. Due to the sparsity of scRNA-seq data, we filtered the genes included in the GSEA to the top quartile to avoid comparing low count genes between groups.

The text goes on to describe what they found, shown in Fig. 4G-I and Supplementary Table 3. It's the only time the paper uses the word "strikingly".
Strikingly, the set of 115 ME/CFS genes were significantly downregulated in the cytotoxic CD4 T cell cluster (adjusted P < 0.02, NES = -1.77; 61/115 genes were included in GSEA; Fig. 4G-H), identified as such due to its substantially higher expression of effector genes like CCL5 and GZMA compared to conventional CD4 T cell subsets35 (Fig. 4I). 18 of these genes were found in the core enrichment driving the downregulation in ME/CFS patients (Supplementary Table 3).

Here are the figures 4G-I
 
Talk by Mark Davis on the rare CD4 T cell type is here starting at 4hrs 48mins
https://videocast.nih.gov/watch=42563
He talks about classical autoimmune diseases - celiac disease and mice with EAE autoimmune disease, where they saw bursts of CD4 and gamma delta T cells. The gamma delta T cells expressed IL17, and apparently not much is known about them; they saw that about 1 in 1000 CD4 cells were this particular rare type. Then he moves on to lupus and MS patients, and finally ME/CFS. So he covers how they discovered these cells in autoimmune diseases and then how they showed up in some ME/CFS patients.

I revisited my earlier post in this thread about Mark Davis highlighting an "autoimmune" CD4 cell subset in ME/CFS that was elevated in some patients. Here is the slide
It is titled "KIR+ CD8 cells and autoimmune CD4 cells". After listening to that talk, I wonder if Mark Davis's "autoimmune CD4" = the "cytotoxic CD4" of this paper, and KIR+ CD8 = cytotoxic CD8, i.e. did he mean both the CD4 and CD8 cells in the slide were "cytotoxic"? His presentation showed both being elevated in a cyclical manner in autoimmune disease.

The researcher that did the work left academia a few years ago so it is not easy to ask Stanford for clarification on how the "autoimmune CD4" cells were identified.
 