jnmaciuch
Senior Member (Voting Rights)
Oh sorry that's an implicit bit of information that I didn't think to specify--the networks in STRING are not discrete like lists in gene sets. Think of STRING more as one giant nearest neighbor graph, if you're more familiar with that. All the known protein interactions are encoded as edges between protein nodes, and "neighborhood/network" is just a relative term for a cluster of tightly connected nodes among that big graph. Those networks don't tend to be isolated, though, so if you wanted to define actual discrete clusters from that graph, that would be an a posteriori approximation with a graph partition algorithm like Louvain.
Sure, but isn't that what's interesting? The networks of related genes, even if they include genes not actually very useful on their own in this cohort. Which networks did the model "pull higher up" based on its assessment that these networks are useful for classification?
My sense is that HEAL2 is cross referencing individual gene associations with ME/CFS with this massive graph, and then the attention mechanism takes into account the individual association of the gene with ME/CFS as well as the associations of its neighbors, which after several iterations also includes information about the neighbor's neighbors and the neighbor's neighbor's neighbors and so on.
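To make the "neighbor's neighbors" point concrete, here's a toy sketch of how repeated rounds of neighbor aggregation spread information across a graph. To be clear, this is not HEAL2's code: it swaps the attention mechanism for a plain unweighted mean, and the gene names, the starting scores, and the `propagate` helper are all made up for illustration.

```python
def propagate(adjacency, scores, rounds):
    """Blend each node's score with the mean score of its neighbors.

    Plain mean aggregation, used here as a stand-in for attention
    (an assumption for illustration, not how HEAL2 weights neighbors).
    """
    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for node, neighbors in adjacency.items():
            if neighbors:
                neighbor_mean = sum(current[n] for n in neighbors) / len(neighbors)
            else:
                neighbor_mean = 0.0
            # Half from the node itself, half from its direct neighbors.
            updated[node] = 0.5 * current[node] + 0.5 * neighbor_mean
        current = updated
    return current


# Hypothetical chain of interactions: A - B - C - D
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

# Hypothetical per-gene association scores: only D carries any signal.
scores = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 1.0}

after_one = propagate(adjacency, scores, rounds=1)
after_three = propagate(adjacency, scores, rounds=3)

print(after_one)    # only C (D's direct neighbor) has picked up signal
print(after_three)  # A, three hops away, is now nonzero too
```

After one round only D's direct neighbor has picked anything up; it takes three rounds for the signal to reach A, three hops away. That's the sense in which each extra iteration widens the part of the graph a node "sees".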
Because these aren't discrete categories, there would be no easy label that could be carried over into HEAL2's output. It would be more of a "vague hand wave at [this] part of the graph" situation.
Maybe the divide is that you're talking about the issue of using GSEA to figure out pathways enriched in this cohort of people in the study, while I'm interested in what pathways were enriched in the finished model that were able to allow it to replicate on an independent dataset.
But I think I understand your initial point better now--GSEA might be able to recapitulate those networks in a discrete way if there does happen to be overlap between how gene sets are defined and protein-protein interactions in STRING.
It's just a bit of a shot in the dark since the gene sets and STRING are defined in very different ways. STRING is looking at experimentally validated protein-protein interactions, whereas gene sets are basically saying "Okay so we looked at several good-quality experiments in the literature that knocked out the TGF-b receptor and this long list of genes all repeatedly came up as differential, so we're calling that list 'Response to TGF-b'".
But I'd still be interested to see what comes up if you do that analysis.