Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Sure, but isn't that what's interesting? The networks of related genes, even if they include genes not actually very useful on their own in this cohort. Which networks did the model "pull higher up" based on its assessment that these networks are useful for classification?
Oh sorry that's an implicit bit of information that I didn't think to specify--the networks in STRING are not discrete like lists in gene sets. Think of STRING more as one giant nearest neighbor graph, if you're more familiar with that. All the known protein interactions are encoded as edges between protein nodes, and "neighborhood/network" is just a relative term for a cluster of tightly connected nodes among that big graph. Those networks don't tend to be isolated, though, so if you wanted to define actual discrete clusters from that graph, that would be an a posteriori approximation with a graph partition algorithm like Louvain.

My sense is that HEAL2 is cross referencing individual gene associations with ME/CFS with this massive graph, and then the attention mechanism takes into account the individual association of the gene with ME/CFS as well as the associations of its neighbors, which after several iterations also includes information about the neighbor's neighbors and the neighbor's neighbor's neighbors and so on.
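To make the "neighbor's neighbors" idea concrete, here is a minimal sketch of information spreading over a graph by repeated neighbor averaging. It is not HEAL2's architecture: a real attention mechanism learns a different weight for each neighbor, whereas this toy uses a plain unweighted mean. The four-node chain and the starting signal are made up for illustration.

```python
# Toy chain graph A - B - C - D (hypothetical, not from STRING).
edges = [("A", "B"), ("B", "C"), ("C", "D")]
nodes = ["A", "B", "C", "D"]

neighbors = {n: set() for n in nodes}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Hypothetical per-gene signal: only A carries any association at the start.
signal = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}


def propagate(values, neighbors):
    """One round: each node averages its own value with its neighbors' values."""
    updated = {}
    for node, nbrs in neighbors.items():
        pool = [values[node]] + [values[m] for m in nbrs]
        updated[node] = sum(pool) / len(pool)
    return updated


# history[k] holds the values after k rounds of propagation.
history = [signal]
for _ in range(3):
    history.append(propagate(history[-1], neighbors))

for k, values in enumerate(history):
    print(k, {n: round(v, 3) for n, v in values.items()})
```

After one round only B (a direct neighbor of A) has picked up any signal, after two rounds C has, and after three rounds D has, which is the "several iterations reach further out" behaviour described above.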

Because it's not discrete categorizations, there would be no easy label that could be transferred in HEAL2's output. It would be more of a "vague hand wave at [this] part of the graph" situation.

Maybe the divide is that you're talking about the issue of using GSEA to figure out pathways enriched in this cohort of people in the study, while I'm interested in what pathways were enriched in the finished model that were able to allow it to replicate on an independent dataset.

But I think I understand your initial point better now--GSEA might be able to recapitulate those networks in a discrete way if there does happen to be overlap between how gene sets are defined and protein-protein interactions in STRING.

It's just a bit of a shot in the dark since the gene sets and STRING are defined in very different ways. STRING is looking at experimentally validated protein-protein interactions, whereas gene sets are basically saying "Okay so we looked at several good-quality experiments in the literature that knocked out the TGF-b receptor and this long list of genes all repeatedly came up as differential, so we're calling that list 'Response to TGF-b'".

But I'd still be interested to see what comes up if you do that analysis.
 
On that note, I was wondering if that's what the p/q values in the list of genes are. Are they just showing the independent difference between cases and controls for each individual gene? If that's the case, those could actually be used for looking at enrichment in these specific people (though it might be too small of a sample to return much).
I actually couldn't figure that out from the text. My guess is that it might be based on the actual attention score with some Kolmogorov-Smirnov-like test compared to random permutations, but I definitely don't know the specifics. I think the low number of mutations for each gene in this dataset would preclude any method that isn't based on attention somehow.

[Edit: Ah I see it now, it was in another section]
 
The paper focuses on the genes which may be involved. But the model uses knowledge of protein interactions too. I wonder if it’s possible to get details of which protein interactions were deemed to be important. Can the network be examined to pull out these? I think the paper talks about modules or perhaps parts of the network which lit up, but can we/they be more specific?

edit: reading recent posts I think I get that the proteins are the nodes and interactions the edges in the GNN. So working backwards to find protein interactions may be possible but individual proteins probably not useful?
 
Because it's not discrete categorizations, there would be no easy label that could be transferred in HEAL2's output. It would be more of a "vague hand wave at [this] part of the graph" situation.
Ok yes thank you that makes sense.

Those networks don't tend to be isolated, though, so if you wanted to define actual discrete clusters from that graph, that would be an a posteriori approximation with a graph partition algorithm like Louvain.
Yes, that'd be ideal, and actually seems like what they did for their module enrichment of the top 115 genes. I don't understand a lot of the terms like Louvain, but it seems like they made discrete modules from STRING, then used Enrichr on the best matches. I doubt I'd be able to figure out how to do that though.

Anyway, makes sense that it might be a shot in the dark with random other gene sets, but we'll see what happens.
 
Yes, that'd be ideal, and actually seems like what they did for their module enrichment of the top 115 genes. I don't understand a lot of the terms like Louvain, but it seems like they made discrete modules from STRING, then used Enrichr on the best matches. I doubt I'd be able to figure out how to do that though.
Louvain is basically what you as a human would do if you printed out a graph and then drew circles around nodes to group them together discretely. The algorithm just applies a certain logic for the best way to "cut" the graph.

It looks like what they did is pull the graph information from STRING for those top 115 genes and then "cut" using Louvain to generate those gene modules. I'd expect that it worked well precisely because those top genes are pretty interconnected to begin with--if you tried to do that for the whole list of genes, you might get some funky results since Louvain is exhaustive and will force groupings that fit every node. Even just for their 115 genes, it looks like at least one module ended up as a "miscellaneous dump bucket."
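For anyone curious what "the best way to cut" means in practice: Louvain greedily searches for the partition with the highest modularity, a score that rewards edges inside groups relative to what you'd expect by chance. The sketch below does not implement Louvain itself, it only computes modularity for three hand-picked partitions of a toy graph (two triangles joined by one edge), so you can see which "circles" the score prefers. The graph and labels are invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # summed node degrees per community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles (a, b, c) and (d, e, f) bridged by the single edge c - d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_triangles = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
awkward_pairs = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
one_big_blob = {n: 0 for n in "abcdef"}

q_triangles = modularity(edges, two_triangles)
q_pairs = modularity(edges, awkward_pairs)
q_blob = modularity(edges, one_big_blob)
print(q_triangles, q_pairs, q_blob)
```

The two-triangle split scores highest, which matches where a human would draw the circles. Note that every node still has to belong to some group in any partition, which is why leftover nodes can end up lumped into a "miscellaneous" module.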

For just characterizing the whole list, I'd say GSEA is probably your best bet [edit: with the caveats already discussed]!
 
It looks like what they did is pull the graph information from STRING for those top 115 genes and then "cut" using Louvain to generate those gene modules.
I got the impression they made 1261 modules using the entire STRING network, then tested each one against the 115 genes and got 4 significant modules.
We downloaded the human protein-protein interactions (PPIs) from STRING [30] v12.0, comprising 19,622 proteins and 6,857,702 interactions. High-confidence PPIs (combined score >700) were extracted for downstream analysis, including 16,185 proteins and 236,000 interactions. To mitigate bias from hub proteins [81], we applied the random walk with restart (RWR) algorithm with a restart probability of 0.5. This produced a smoothed network after retaining the top 5% predicted edges (N = 6,243,766). Next, we utilized the Louvain method [82] to decompose the network into different modules. Following algorithm convergence, we obtained 1,261 modules with an average size of 13 nodes.
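As a rough illustration of the random walk with restart step in that passage, here is a minimal pure-Python sketch on a made-up five-node graph with one hub. Only the restart probability of 0.5 comes from the quoted methods; the graph, the function name and the convergence settings are assumptions, and how the authors turned RWR scores into "predicted edges" is not something this sketch tries to reproduce.

```python
def rwr(neighbors, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a random walk that jumps back
    to `seed` with probability `restart` at every step."""
    nodes = list(neighbors)
    p = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(max_iter):
        # Restart mass goes to the seed; the rest is spread along edges.
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        if max(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical graph: H is a hub; A and B are also directly connected.
neighbors = {
    "A": ["B", "H"],
    "B": ["A", "H"],
    "C": ["H"],
    "D": ["H"],
    "H": ["A", "B", "C", "D"],
}

scores = rwr(neighbors, seed="A")
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

Starting from A, node B ends up with a higher score than C or D, even though all three are reachable through the hub, because B also has its own direct link to A. That is the sense in which RWR scores closeness by the whole local neighborhood rather than by a single shared hub.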

The enrichment of genes of interest within each module was tested using one-sided Fisher’s exact test. Modules with adjusted P < 0.05 based on Benjamini-Hochberg (BH) correction were considered significant.
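The enrichment test in that last quoted paragraph can be sketched with nothing but the standard library. The function names and the three overlap counts below are hypothetical, and using the 16,185 network proteins as the background is my assumption, since the quoted text doesn't say what background they used. Also note the real analysis would correct across all 1,261 modules, not three.

```python
from math import comb


def fisher_one_sided(k, module_size, n_interest, n_background):
    """P(overlap >= k) when a module of `module_size` is drawn at random from
    a background containing `n_interest` genes of interest (hypergeometric
    upper tail, i.e. a one-sided Fisher's exact test)."""
    upper = min(module_size, n_interest)
    favourable = sum(
        comb(n_interest, i) * comb(n_background - n_interest, module_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_background, module_size)


def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the same order as the input."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


N_BACKGROUND = 16185  # assumption: high-confidence network proteins
N_INTEREST = 115      # top genes mentioned in the thread
MODULE_SIZE = 13      # average module size from the quoted methods

# Hypothetical overlap counts for three made-up modules.
overlaps = {"module_a": 3, "module_b": 1, "module_c": 0}

pvals = {
    name: fisher_one_sided(k, MODULE_SIZE, N_INTEREST, N_BACKGROUND)
    for name, k in overlaps.items()
}
adjusted = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))
for name in overlaps:
    print(name, pvals[name], adjusted[name])
```

With modules this small, even an overlap of two or three genes is very unlikely by chance, which is how a handful of modules can reach significance against a list of only 115 genes.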
 
I got the impression they made 1261 modules using the entire STRING network, then tested each one against the 115 genes and got 4 significant modules.
You're exactly right, it seems I was conflating details from different sections. Using the bigger graph to begin with would alleviate some of the "junk module" concern since you ideally wouldn't be left with too many miscellaneous stragglers. Technically you could use those modules as custom GSEA gene sets, you'd just need to convert the protein names into the corresponding genes. I would offer to do the Louvain clustering and gene name mapping myself but it would be a bit time intensive and I'm already procrastinating a bit on other work projects (shame on me).
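The "modules as custom gene sets" step is mostly bookkeeping. Here is a minimal sketch, assuming you already have the modules as lists of protein IDs and a protein-to-gene lookup table. Every ID and gene name below is a placeholder, not a real STRING identifier, and the function name is made up. The output is the tab-separated GMT layout (set name, description, then genes) that GSEA tools generally accept for custom collections.

```python
import io

# Placeholder modules keyed by name; the protein IDs are NOT real STRING IDs.
modules = {
    "module_1": ["9606.ENSP_A", "9606.ENSP_B"],
    "module_2": ["9606.ENSP_C", "9606.ENSP_UNMAPPED"],
}

# Placeholder lookup table from protein ID to gene symbol.
protein_to_gene = {
    "9606.ENSP_A": "GENE1",
    "9606.ENSP_B": "GENE2",
    "9606.ENSP_C": "GENE3",
}


def modules_to_gmt(modules, protein_to_gene, handle):
    """Write one GMT line per module: name, description, then gene symbols.
    Proteins without a mapping are skipped; empty modules are dropped."""
    for name, proteins in modules.items():
        genes = sorted({protein_to_gene[p] for p in proteins if p in protein_to_gene})
        if genes:
            handle.write("\t".join([name, "na", *genes]) + "\n")


buffer = io.StringIO()  # swap for open("modules.gmt", "w") to write a file
modules_to_gmt(modules, protein_to_gene, buffer)
gmt_text = buffer.getvalue()
print(gmt_text)
```

Silently skipping unmapped proteins keeps the sketch short, but in a real run it would be worth counting how many get dropped, since a poor mapping could hollow out some modules.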
 
You're exactly right, it seems I was conflating details from different sections. Using the bigger graph to begin with would alleviate some of the "junk module" concern since you ideally wouldn't be left with too many miscellaneous stragglers. Technically you could use those modules as custom GSEA gene sets, you'd just need to convert the protein names into the corresponding genes. I would offer to do the Louvain clustering and gene name mapping myself but it would be a bit time intensive and I'm already procrastinating a bit on other work projects (shame on me).
Oh absolutely spend time on the important things. Maybe eventually I'll try to figure out how exactly they did it.
 
Oh absolutely spend time on the important things. Maybe eventually I'll try to figure out how exactly they did it.
If you know R I could probably give you a rough outline on how to do it.

[Edit: also, sadly, much of the work that I have to do for school/research is much less productive and useful than the ME/CFS rabbitholes I'd prefer to spend my time on...such is life]
 
The paper focuses on the genes which may be involved. But the model uses knowledge of protein interactions too. I wonder if it’s possible to get details of which protein interactions were deemed to be important.
I think we can only go by the scores in the final model, and it'd be tough to get any specific individual protein interactions from that.
 
What do you all make of the assertion "Our results provide a rare-variant-based genetic linkage between ME/CFS and depression."?
I suspect that is heavily influenced by the fact that most sick people will score higher on a depression scale because the scales are terrible, and that many with ME/CFS will also get a depression diagnosis, either by mistake or as a secondary result of severe illness with little to no help, or even active harm being caused against them.

In short, «depression» is such an unreliable marker that it is probably positively correlated with any physical illness.
 
What do you all make of the assertion "Our results provide a rare-variant-based genetic linkage between ME/CFS and depression."?
I'm not sure it's the strongest evidence, but if, as @Utsikt says, there is overlap in diagnosis, and it's not considered a totally unrelated condition, I think that supports the idea that it got number 1 most related because it actually does have genes in common with the ME/CFS group.
 
I wonder if, in time, we might discover that ME/CFS and depression share some mechanisms. By that, I don't mean that they are the same thing, but that the shared or similar symptoms may be caused by the same pathways, and perhaps that infection can be a cause of depression itself (assuming the label survives in the future).
 
I wonder if, in time, we might discover that ME/CFS and depression share some mechanisms.

I agree.
I think it is worth remembering that whereas we speculate that there might be more than one process under the roof of 'ME/CFS' there is no doubt whatsoever that 'depression' includes several completely different processes. To the extent that some of them make it hard to get to sleep and others make you wake up early. Some come with delusions, others don't. Depression is a popular term for what psychiatrists tend to call 'depressive illness' when trying to be precise (which they rarely succeed at but that isn't always their fault). And a depressive illness is pretty much anything with an inhibitory effect on mind. Moreover, it can include bipolar disorder with inhibited and hyperactive phases.

ME/CFS has an inhibitory effect on the mind so I think it very plausible that it will turn out to have some common mechanisms with some depressive illness types. And they are likely to be the 'biological' types I think.

All in all I don't think there should be any worry that the genetics will mean that ME/CFS physicians need to take lessons from psychiatrists, or even immunologists for that matter. I rather suspect that ME/CFS researchers will have some lessons for the psychiatrists - and maybe even for the immunologists. When I listen to Chris Ponting talking I hear a lot more sense than I read on X posts from self-appointed Long Covid immunoglitterati.
 
I suspect that is heavily influenced by the fact that most sick people will score higher on a depression scale because the scales are terrible, and that many with ME/CFS will also get a depression diagnosis, either by mistake or as a secondary result of severe illness with little to no help, or even active harm being caused against them.

In short, «depression» is such an unreliable marker that it is probably positively correlated with any physical illness.
Indeed, I agree. People got misdiagnosed and labelled, sometimes without their knowing. Most people don't really know what depression is and isn't, and years ago, before PEM was better described, there was no chance. It can take years on the more complex rollercoaster of ME/CFS before someone realises it's not depression.

I don't know what we can do to clean up old records, or to have a consistent policy for cases where, in hindsight, things weren't right.
 
Ok, I've run GSEA on the Zhang genes ranked by attention scores with the hallmark and canonical pathways collections:

Hallmark: [image: hallmark.png]

Canonical Pathways: [image: c2.png]

I used collapsePathways to reduce the number of pathways, and it removes about half of them. Interestingly, the first two, which seem very similar, are both kept. For other very similar pairs I saw one of the two removed, so it does seem to be working; the threshold for exclusion may just be high.
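For anyone wondering what "enrichment" and "leading edge" mean mechanically, here is a minimal sketch of the classic GSEA running-sum statistic on a preranked list. It is not the code I ran. The ten genes and scores are a small hand-picked subset of the attention scores from the pathway lists in this post, so the number it prints says nothing about the real ranking. The function name and the toy gene set are made up, and no permutation p-value is computed.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum enrichment score and leading-edge genes.
    `ranked` is a list of (gene, score) pairs sorted by descending score."""
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_norm = sum(abs(s) ** weight for (_, s), h in zip(ranked, hits) if h)

    running, best, peak = 0.0, 0.0, -1
    for i, ((_, score), hit) in enumerate(zip(ranked, hits)):
        # Step up (by score share) on a set member, step down on a non-member.
        running += abs(score) ** weight / hit_norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best, peak = running, i

    if best >= 0:  # leading edge: set members up to and including the peak
        leading = [g for (g, _), h in zip(ranked[: peak + 1], hits) if h]
    else:          # negative score: set members from the trough onwards
        leading = [g for (g, _), h in zip(ranked[peak:], hits[peak:]) if h]
    return best, leading


# Toy ranking: a few genes and scores copied from the lists in this post.
ranked = [
    ("GRB2", 0.552), ("PTPN11", 0.492), ("RHOA", 0.449), ("NLGN2", 0.403),
    ("NOTCH1", 0.345), ("PSEN2", 0.262), ("PSEN1", 0.261), ("DTX2", 0.257),
    ("MAPK1", 0.124), ("MTOR", 0.066),
]
toy_set = {"GRB2", "PTPN11", "RHOA", "MAPK1"}  # hypothetical mini gene set

es, leading_edge = enrichment_score(ranked, toy_set)
print(round(es, 4), leading_edge)
```

The running sum peaks after the third gene, so the leading edge is GRB2, PTPN11 and RHOA, while MAPK1 is in the set but sits too far down the ranking to contribute to the peak. That is why the per-pathway lists in this post stop at different score cutoffs.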

Below are the top 10 for canonical pathways, with the leading edge genes for each and the attention score for each gene.

WP_NOTCH_SIGNALING_WP268
NOTCH1 - 0.345
PSEN2 - 0.262
PSEN1 - 0.261
DTX2 - 0.257
DVL2 - 0.247
NOTCH3 - 0.246
NOTCH4 - 0.238
ADAM17 - 0.235
HDAC1 - 0.233
DLL4 - 0.228
NOTCH2 - 0.228
RBPJL - 0.224
RFNG - 0.212
HES1 - 0.2
APH1B - 0.185
MFNG - 0.184
DLL3 - 0.182
LFNG - 0.179
DTX3 - 0.178
RBPJ - 0.175
DTX3L - 0.17
DTX4 - 0.166
JAG1 - 0.162
DLL1 - 0.162
DTX1 - 0.158
CREBBP - 0.152
APH1A - 0.149
DVL3 - 0.145
KCNJ5 - 0.144
HES5 - 0.133
DVL1 - 0.125
NCSTN - 0.123
MAML1 - 0.114
NUMBL - 0.113
HDAC2 - 0.101

KEGG_NOTCH_SIGNALING_PATHWAY
NOTCH1 - 0.345
PSEN2 - 0.262
PSEN1 - 0.261
DTX2 - 0.257
DVL2 - 0.247
NOTCH3 - 0.246
NOTCH4 - 0.238
ADAM17 - 0.235
HDAC1 - 0.233
DLL4 - 0.228
NOTCH2 - 0.228
RBPJL - 0.224
RFNG - 0.212
HES1 - 0.2
EP300 - 0.192
PSENEN - 0.188
MFNG - 0.184
DLL3 - 0.182
LFNG - 0.179
DTX3 - 0.178
RBPJ - 0.175
DTX3L - 0.17
DTX4 - 0.166
JAG1 - 0.162
DLL1 - 0.162
DTX1 - 0.158
CREBBP - 0.152
APH1A - 0.149
DVL3 - 0.145
SNW1 - 0.139
HES5 - 0.133
DVL1 - 0.125
NCSTN - 0.123
MAML1 - 0.114
NUMBL - 0.113
HDAC2 - 0.101
MAML2 - 0.1

WP_DISRUPTION_OF_POSTSYNAPTIC_SIGNALING_BY_CNV
NLGN2 - 0.403
SHANK1 - 0.379
DLG2 - 0.346
SYNGAP1 - 0.343
NRXN1 - 0.31
NLGN1 - 0.304
MAPK3 - 0.278
DLGAP1 - 0.277
NRXN2 - 0.276
CAMK2A - 0.24
HOMER1 - 0.229
GRM1 - 0.219
GRIN2B - 0.164
RYR2 - 0.143
GRIN2C - 0.134
CAMK2B - 0.13
GRIN2D - 0.127
MAPK1 - 0.124
GRIN2A - 0.111
NRXN3 - 0.106

KEGG_MEDICUS_REFERENCE_REGULATION_OF_GF_RTK_RAS_ERK_SIGNALING_PTP
MAPK3 - 0.278
DUSP8 - 0.215
PTPN7 - 0.193
DUSP5 - 0.188
DUSP4 - 0.185
DUSP3 - 0.185
PTPRR - 0.168
DUSP16 - 0.168
DUSP10 - 0.144
PTPN5 - 0.142
DUSP2 - 0.14
DUSP1 - 0.14
DUSP7 - 0.133
MAPK9 - 0.124
MAPK1 - 0.124
MAPK8 - 0.118

PID_RET_PATHWAY
GRB2 - 0.552
PTPN11 - 0.492
RHOA - 0.449
BCAR1 - 0.331
PIK3CA - 0.306
SRC - 0.296
MAPK3 - 0.278
IRS1 - 0.272
HRAS - 0.253
SOS1 - 0.243
SHANK3 - 0.239
GRB10 - 0.227
PIK3R1 - 0.223
CRK - 0.221
RAP1A - 0.217
PRKACA - 0.176
PRKCA - 0.175
RET - 0.171
SHC1 - 0.163
GRB7 - 0.161
PTK2 - 0.15
IRS2 - 0.149
MAPK1 - 0.124
MAPK8 - 0.118
PXN - 0.097
FRS2 - 0.093
RASA1 - 0.092
DOK6 - 0.09

PID_IGF1_PATHWAY
GRB2 - 0.552
PTPN11 - 0.492
PRKCZ - 0.412
BCAR1 - 0.331
PIK3CA - 0.306
YWHAE - 0.278
IRS1 - 0.272
HRAS - 0.253
CRKL - 0.247
SOS1 - 0.243
GRB10 - 0.227
PIK3R1 - 0.223
CRK - 0.221
PTPN1 - 0.209
AKT1 - 0.181
SHC1 - 0.163
PTK2 - 0.15
IRS2 - 0.149
PXN - 0.097
PRKD1 - 0.093
BAD - 0.091

KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
INS - 0.397
PIK3CA - 0.306
MAPK3 - 0.278
IRS1 - 0.272
PIK3R1 - 0.223
PRKCA - 0.175
KRAS - 0.175
PRKCG - 0.161
PIK3CB - 0.15
IRS2 - 0.149
ATP1A2 - 0.145
IGF1 - 0.136
ATP1A1 - 0.134
PRKCB - 0.13
PIK3R5 - 0.129
PIK3CD - 0.125
MAPK1 - 0.124
PIK3R3 - 0.123
PIK3R2 - 0.123
ATP1A4 - 0.122
FXYD2 - 0.12
NR3C2 - 0.117
HSD11B2 - 0.105
ATP1B3 - 0.103
ATP1B1 - 0.103
ATP1A3 - 0.09
ATP1B2 - 0.083
SFN - 0.067
INSR - 0.066

WP_NOTCH_SIGNALING_WP61
NOTCH1 - 0.345
CUL1 - 0.316
SRC - 0.296
PSEN2 - 0.262
PSEN1 - 0.261
NOTCH3 - 0.246
NOTCH4 - 0.238
ADAM17 - 0.235
HDAC1 - 0.233
DLL4 - 0.228
NOTCH2 - 0.228
PIK3R1 - 0.223
HEY2 - 0.213
HES1 - 0.2
EP300 - 0.192
PSENEN - 0.188
HEY1 - 0.185
APH1B - 0.185
DLL3 - 0.182
AKT1 - 0.181
RBPJ - 0.175
JAK2 - 0.166
JAG1 - 0.162
DLL1 - 0.162
DTX1 - 0.158
APH1A - 0.149
STAT3 - 0.144
SNW1 - 0.139
CCND1 - 0.136
HES5 - 0.133
RING1 - 0.125
NCSTN - 0.123
PIK3R2 - 0.123
MAML1 - 0.114
NUMBL - 0.113
HDAC2 - 0.101
MAML2 - 0.1
MYC - 0.097
SPEN - 0.095
CDKN1A - 0.086
NUMB - 0.072
MTOR - 0.066
NCOR1 - 0.064
JAG2 - 0.063
MAML3 - 0.058
ITCH - 0.057
PTCRA - 0.057

REACTOME_SIGNALING_BY_ALK
PIK3CA - 0.306
SRC - 0.296
IRS1 - 0.272
SIN3A - 0.238
HDAC1 - 0.233
PIK3R1 - 0.223
EP300 - 0.192
SHC1 - 0.163
PIK3CB - 0.15
STAT3 - 0.144
PIK3R2 - 0.123
PTPN6 - 0.121
PLCG1 - 0.112
DNMT1 - 0.108
HDAC2 - 0.101
MYC - 0.097
FRS2 - 0.093
JAK3 - 0.089

PID_TRKR_PATHWAY
GRB2 - 0.552
PTPN11 - 0.492
RHOA - 0.449
PRKCZ - 0.412
PIK3CA - 0.306
NRAS - 0.302
MAPK3 - 0.278
HRAS - 0.253
PRKCI - 0.25
CRKL - 0.247
SOS1 - 0.243
PIK3R1 - 0.223
CRK - 0.221
RAP1A - 0.217
KRAS - 0.175
SHC1 - 0.163
STAT3 - 0.144
CCND1 - 0.136
SH2B1 - 0.127
MAPK1 - 0.124
NGF - 0.115
PLCG1 - 0.112
NTRK1 - 0.107
SHC3 - 0.103
ABL1 - 0.1
RASGRF1 - 0.1
FRS2 - 0.093
BDNF - 0.093
RASA1 - 0.092
ELMO1 - 0.08
RAPGEF1 - 0.079
SQSTM1 - 0.078
MAP2K1 - 0.075
GAB1 - 0.072
SHC2 - 0.068
NTRK2 - 0.061
GAB2 - 0.06
TIAM1 - 0.059
 