Circulating cell-free RNA signatures for the characterization and diagnosis of myalgic encephalomyelitis/chronic fatigue syndrome, 2025, Gardella+

I'm no machine learning expert and no statistician, and I don't have access to this exact paper, but we're seeing this approach a lot in ME/CFS research. In other papers that I did have access to, there were often large concerns about this approach (not just related to the biology behind things, but the basic merits of the method). In those papers it typically had the flavour of designing "the best lockpicking lock by optimizing it for one lock" and then reporting "we can classify ME/CFS in a test set with high accuracy".

The approach:
Split your data into two sets, a training set and a test set, and choose the model that does best on your test set.

My concerns:
Overfitting to the test set (primarily because the set of models to choose from has become large and easily accessible to everyone).

The standard fix:
Split into a training set, a validation set and a test set. Use the training set to train the models. Use the validation set to tune the models, select features and choose between them. Then, once everything is finalised, run the chosen model on the test set once. (There are also fancier approaches, something like nested cross-validation.)
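To make that concrete, here's a minimal sketch of what I mean. It's hypothetical scikit-learn code with random placeholder data and two arbitrary candidate models; nothing in it comes from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data: X is a feature matrix, y the case/control labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

# Three-way split: the test set is locked away until the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

# Model selection happens on the validation set only.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
val_auc = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_auc[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
best_name = max(val_auc, key=val_auc.get)

# The chosen model is refit on train + validation and evaluated on the
# untouched test set exactly once.
final_model = candidates[best_name].fit(X_trainval, y_trainval)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(best_name, round(test_auc, 3))
```

The point is just that the test set appears exactly once, at the very end, after every choice has already been made.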

This seems fairly basic to me, so I'm surprised to see it happen and wonder if I misinterpreted something. Happy to hear anybody's thoughts, especially on what was specifically done here.
 
I think platelet-derived cfRNA was decreased in ME/CFS.

EDIT: so that would mean that the platelets are less likely to rupture in ME/CFS?
Ah sorry--yes I suppose? I have no idea what biological difference would result in less rupturing in ME/CFS though. So I think it is more likely to be a confounder in sample processing.
 
This is what their methods state:
The sample metadata and count matrices were first split into a training set (70%) and a test set (30%), which were evenly partitioned across phenotype, sex, and test site. We repeated the split 100 times, for 100 different iterations of the train and test groups. Within each of the 100 training sets, feature selection was performed using differential abundance analysis which selected genes with a base mean of greater than 50 and a P-value less than 0.001. These genes were used to train 15 machine learning algorithms using fivefold cross-validation and grid search hyperparameter tuning. For each iteration of each model, accuracy, sensitivity, and AUC-ROC were used to measure test performance.

And the relevant results text:
Unsupervised clustering based on the differentially abundant features demonstrated separation between cases and controls (Fig. 2B). To build machine learning classifiers, we used Monte Carlo cross-validation by repeatedly partitioning the dataset into training (70%) and test (30%) sets, while balancing for phenotype, sex, and test site. This process was repeated 100 times, generating 100 unique train–test splits (Materials and Methods). From each training set, we selected features using differential abundance criteria (P-value < 0.001 and base mean > 50) and trained 15 machine learning models (Fig. 2C). This approach yielded approximately 1,500 models. We evaluated performance based on test set accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC), assessing both the average metrics and the best-performing seed for each model (Fig. 2D).
As expected, tree-based algorithms such as ExtraTrees, Random Forest (RF), and C5 exhibited strong training performance, reflecting their capacity to fit complex patterns. However, these models tended to overfit the training data as evidenced by poor performance on the test set. Models with robust performance demonstrated high accuracy and AUC-ROC values for both the training and test sets. Across all models, each sample was included in the test set approximately 450 times (~30% of all iterations).
We observed variability in individual classification rates, with some samples classified correctly >90% of the time, while others classified correctly as low as 11% of the time (Fig. 2E). This result suggests that certain samples possessed unique features which drove consistent correct or incorrect classification.

So yes, it does seem like the first scenario you describe.
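If I'm reading the methods right, the overall shape of the pipeline is roughly the sketch below. To be clear, this is my own hedged reconstruction, not their code: the data are random placeholders, the differential abundance step is stubbed out, and only two of the fifteen model families are included.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

def select_features(X_train, y_train):
    # Stand-in for the differential abundance step (base mean > 50, P < 0.001);
    # here it simply keeps every column.
    return np.arange(X_train.shape[1])

# Placeholder data, not the paper's counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

# Two stand-in models instead of the paper's fifteen.
models = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
}
test_auc = {name: [] for name in models}

for seed in range(100):  # 100 Monte Carlo train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    cols = select_features(X_tr, y_tr)  # feature selection inside the training set
    for name, (estimator, grid) in models.items():
        search = GridSearchCV(estimator, grid, cv=5)  # fivefold CV + grid search
        search.fit(X_tr[:, cols], y_tr)
        scores = search.predict_proba(X_te[:, cols])[:, 1]
        test_auc[name].append(roc_auc_score(y_te, scores))

# Reporting the best-performing seed is where the test sets stop being a
# neutral check and start being used to pick the most flattering split.
for name, aucs in test_auc.items():
    print(name, "mean AUC:", round(np.mean(aucs), 3),
          "best seed AUC:", round(np.max(aucs), 3))
```

Each model does get tuned with cross-validation inside each training set, which is fine. The part that worries me is that the 100 test sets are then used both to compare the models and to report a "best-performing seed".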
 
I think your concerns are definitely warranted @EndME. I've gotten to learn under several machine learning experts over the past few years, and the thing they drill into me is that the test set cannot be used for picking the best model. If you do that, it's basically just another validation set rather than an actual test set. Any time I had to choose between models, it was always done based on cross-validation in the training cohort; the test set doesn't even get touched until we know which model we're moving forward with for downstream analysis. [Edit: and, frankly, that's why my current PI prefers unsupervised models over anything else if there's a chance they'll turn up a relevant signal]
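In code, that practice looks something like the sketch below. Same idea as the earlier sketch, but with cross-validation standing in for a separate validation split; the data and candidate models are placeholders, not anyone's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Placeholder data, not from the paper.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

# The test set is split off once and not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Candidate models are compared with cross-validation inside the training cohort only.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}
cv_auc = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(cv_auc, key=cv_auc.get)

# Single, final evaluation of the chosen model on the held-out test set.
final_model = candidates[best_name].fit(X_train, y_train)
print(best_name, "test accuracy:", round(final_model.score(X_test, y_test), 3))
```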

I also am generally quite skeptical about using big data models for ME/CFS diagnosis. The primary and most useful purpose of big data fishing expeditions is to pick out patterns with potential biological relevance that we didn't know to look for. That's why all my projects are collaborations with clinicians and specialized biologists--I find the patterns, I theorize about what it could mean, they tell me whether that makes sense or whether there's something I missed, and then it's up to them to validate the phenomenon directly.

I truly don't think we're ever going to get a good diagnostic marker this way. Sure, it can discriminate between ME/CFS and healthy controls, but it's probably going to get very hairy once you start throwing in any other diagnosis encountered in a clinical setting. If you're searching for a biomarker, it needs to be a biomarker of the illness, not a set of hundreds of genes that are different in ME/CFS compared to a selected group of controls driven by an unknown biological process.
 
If anyone has paywall access, I'd love to see which specific markers they determined to be "plasmacytoid dendritic, monocyte, and T cell–derived"
First-time poster here! I do have paywall access through an institution, but I'm not familiar with most of the terminology, so apologies if this isn't the correct response. This is what I can quote from the full paper:
Using this approach, we identified six cell types that differed significantly between cases and controls (P-adj < 0.05, Fig. 3C). In order of significance, these included plasmacytoid dendritic cells (pDCs), monocytes (Fig. 3B; cluster 26), naïve CD8+ T cells, T cells (Fig. 3B; cluster 16), mucosal-associated invariant T (MAIT) cells, and platelets (Fig. 3D). To understand deconvolution differences by sex, we first analyzed only female subjects (n = 126) and identified six cell types with significant differential abundance (Fig. 3E). While statistical comparisons between female and male ME/CFS cases remained challenging due to sample size, we identified some cell types with patterns that suggested potential sex-based differences in immune responses. For instance, we found effector/memory (E/M) CD8+ T cells (Fig. 3B; cluster 5) to be significantly elevated only in female cases versus controls and not significantly different when we included both sexes. Additionally, we saw classical monocytes as elevated in both female cases and controls compared to their male counterparts (Fig. 3E).


Among T cell subtypes, we found significant differences in naive CD8+ T cells, MAIT cells, and other T cell clusters. Naive CD8+ T cells are essential for adaptive immunity, and their increased cfRNA contribution in cases, although unexpected as these are an unactivated and relatively stable cell type, may suggest altered T cell homeostasis or impaired differentiation into effector and memory subsets. MAIT cells, which comprise a T cell subset involved in mucosal immunity and microbial defense, have been increasingly recognized for their role in chronic inflammatory diseases (11). We also observed an increased contribution of E/M CD8+ T cells in females with ME/CFS, suggesting potential sex-specific immune alterations. E/M CD8+ T cells play a crucial role in long-term immune surveillance and rapid responses to antigen re-exposure, and their elevation may indicate impaired resolution of inflammation and/or chronic immune activation in female cases.
 
