ME/CFS Science Blog
Senior Member (Voting Rights)
"…makes platelets more likely to rupture in ME/CFS…"
I think platelet-derived cfRNA was decreased in ME/CFS.
EDIT: so that would mean that the platelets are less likely to rupture in ME/CFS?
Ah sorry--yes, I suppose? I have no idea what biological difference would result in less rupturing in ME/CFS though, so I think it is more likely to be a confounder in sample processing.
Yes, looks like you're right. It seems that they highlight the GLMNET lasso model because it did well on the test set (while selection of the model should have happened before that).
I'm no machine learning expert and also no statistician, and I don't have access to this exact paper, but we're seeing this approach a lot in ME/CFS research. In other papers, which I did have access to, there were often large concerns about this approach (not just related to the biology behind things, but to its basic merits). There it typically had the flavour of designing "the best lockpick by optimizing it for one lock" and then reporting "we can classify ME/CFS in a test set with high accuracy".
The approach:
Split your data into two sets, a training set and a test set, and choose the model that does best on your test set.
My concerns:
Overfitting to the test set (primarily because the set of models to choose from has become large and easily accessible to everyone).
The standard fix:
Split into a training set, a validation set and a test set. Use the training set to train the models. Use the validation set to tune them, select features and choose a model. Then, once everything is finalised, run the chosen model over the test set. (There are also fancier approaches, something like nested cross-validation.) A rough sketch of what I mean is below.
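Roughly what I mean, as a minimal sketch (random placeholder data and scikit-learn, obviously not the pipeline from the paper):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))    # placeholder "gene" features
y = rng.integers(0, 2, size=300)  # placeholder case/control labels

# 60/20/20 split into train, validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

candidates = {
    "lasso_logistic": LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Model choice (and any tuning/feature selection) only ever sees the validation set
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
best = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, for the already-chosen model
test_auc = roc_auc_score(y_test, candidates[best].predict_proba(X_test)[:, 1])
print(best, round(val_scores[best], 3), round(test_auc, 3))
```

The point is simply that the number reported at the end comes from data that never influenced which model was chosen or how it was tuned.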
This seems fairly basic to me, so I'm surprised to see it happen and wonder if I misinterpreted something. Happy to hear anybody's thoughts, especially on what was specifically done here.
This is what their methods state:
The sample metadata and count matrices were first split into a training set (70%) and a test set (30%), which were evenly partitioned across phenotype, sex, and test site. We repeated the split 100 times, for 100 different iterations of the train and test groups. Within each of the 100 training sets, feature selection was performed using differential abundance analysis which selected genes with a base mean of greater than 50 and a P-value less than 0.001. These genes were used to train 15 machine learning algorithms using fivefold cross-validation and grid search hyperparameter tuning. For each iteration of each model, accuracy, sensitivity, and AUC-ROC were used to measure test performance.
Unsupervised clustering based on the differentially abundant features demonstrated separation between cases and controls (Fig. 2B). To build machine learning classifiers, we used Monte Carlo cross-validation by repeatedly partitioning the dataset into training (70%) and test (30%) sets, while balancing for phenotype, sex, and test site. This process was repeated 100 times, generating 100 unique train–test splits (Materials and Methods). From each training set, we selected features using differential abundance criteria (P-value < 0.001 and base mean > 50) and trained 15 machine learning models (Fig. 2C). This approach yielded approximately 1,500 models. We evaluated performance based on test set accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC), assessing both the average metrics and the best-performing seed for each model (Fig. 2D).
As expected, tree-based algorithms such as ExtraTrees, Random Forest (RF), and C5 exhibited strong training performance, reflecting their capacity to fit complex patterns. However, these models tended to overfit the training data as evidenced by poor performance on the test set. Models with robust performance demonstrated high accuracy and AUC-ROC values for both the training and test sets. Across all models, each sample was included in the test set approximately 450 times (~30% of all iterations).
We observed variability in individual classification rates, with some samples classified correctly >90% of the time, while others classified correctly as low as 11% of the time (Fig. 2E). This result suggests that certain samples possessed unique features which drove consistent correct or incorrect classification.
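If I'm reading the methods right, the core loop is roughly the sketch below (placeholder data and scikit-learn, not the authors' code; their feature selection is a differential-abundance test on the count matrix, which I've stood in for with a simple univariate filter, and the two models here are just examples of the 15):

```python
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 500))   # placeholder cfRNA features
y = rng.integers(0, 2, size=200)  # placeholder case/control labels

models = {
    "lasso": (LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
              {"C": [0.01, 0.1, 1.0]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100]}),
}

test_auc = defaultdict(list)
correct = np.zeros(len(y))
tested = np.zeros(len(y))

for split in range(100):  # 100 repeated ("Monte Carlo") train/test splits, as in the paper
    train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.3,
                                           stratify=y, random_state=split)
    # Feature selection inside the training set only (stand-in for the differential-abundance step)
    selector = SelectKBest(f_classif, k=50).fit(X[train_idx], y[train_idx])
    X_tr, X_te = selector.transform(X[train_idx]), selector.transform(X[test_idx])
    for name, (estimator, grid) in models.items():
        # 5-fold cross-validation plus grid search for hyperparameters, as in the methods
        fit = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc").fit(X_tr, y[train_idx])
        pred = fit.predict(X_te)
        test_auc[name].append(roc_auc_score(y[test_idx], fit.predict_proba(X_te)[:, 1]))
        correct[test_idx] += (pred == y[test_idx])  # track per-sample classification rates
        tested[test_idx] += 1

print({name: round(float(np.mean(aucs)), 3) for name, aucs in test_auc.items()})
print("per-sample correct-classification rate:", np.round(correct / np.maximum(tested, 1), 2)[:10])
```

The concern raised above is about what happens after a loop like this: if a model (or the best-performing seed) is then highlighted because of its test-set numbers, the test set has effectively been used for model selection.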
If anyone has paywall access, I'd love to see which specific markers they determined to be "plasmacytoid dendritic, monocyte, and T cell–derived".
First time poster here! I do have paywall access through an institution, but I'm not familiar with most of the terminology, so apologies if this isn't the correct response. This is what I can quote from the full paper:
Using this approach, we identified six cell types that differed significantly between cases and controls (P-adj < 0.05, Fig. 3C). In order of significance, these included plasmacytoid dendritic cells (pDCs), monocytes (Fig. 3B; cluster 26), naïve CD8+ T cells, T cells (Fig. 3B; cluster 16), mucosal-associated invariant T (MAIT) cells, and platelets (Fig. 3D). To understand deconvolution differences by sex, we first analyzed only female subjects (n = 126) and identified six cell types with significant differential abundance (Fig. 3E). While statistical comparisons between female and male ME/CFS cases remained challenging due to sample size, we identified some cell types with patterns that suggested potential sex-based differences in immune responses. For instance, we found effector/memory (E/M) CD8+ T cells (Fig. 3B; cluster 5) to be significantly elevated only in female cases versus controls and not significantly different when we included both sexes. Additionally, we saw classical monocytes as elevated in both female cases and controls compared to their male counterparts (Fig. 3E).
Among T cell subtypes, we found significant differences in naive CD8+ T cells, MAIT cells, and other T cell clusters. Naive CD8+ T cells are essential for adaptive immunity, and their increased cfRNA contribution in cases, although unexpected as these are an unactivated and relatively stable cell type, may suggest altered T cell homeostasis or impaired differentiation into effector and memory subsets. MAIT cells, which comprise a T cell subset involved in mucosal immunity and microbial defense, have been increasingly recognized for their role in chronic inflammatory diseases (11). We also observed an increased contribution of E/M CD8+ T cells in females with ME/CFS, suggesting potential sex-specific immune alterations. E/M CD8+ T cells play a crucial role in long-term immune surveillance and rapid responses to antigen re-exposure, and their elevation may indicate impaired resolution of inflammation and/or chronic immune activation in female cases.