Recursive ensemble feature selection provides a robust mRNA expression signature for ME/CFS, 2021, Metselaar et al

Sly Saint

Senior Member (Voting Rights)
Recursive ensemble feature selection provides a robust mRNA expression signature for myalgic encephalomyelitis/chronic fatigue syndrome
Abstract

Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a chronic disorder characterized by disabling fatigue. Several studies have sought to identify diagnostic biomarkers, with varying results. Here, we innovate this process by combining both mRNA expression and DNA methylation data. We performed recursive ensemble feature selection (REFS) on publicly available mRNA expression data in peripheral blood mononuclear cells (PBMCs) of 93 ME/CFS patients and 25 healthy controls, and found a signature of 23 genes capable of distinguishing cases and controls.

REFS highly outperformed other methods, with an AUC of 0.92. We validated the results on a different platform (AUC of 0.95) and in DNA methylation data obtained from four public studies on ME/CFS (99 patients and 50 controls), identifying 48 gene-associated CpGs that predicted disease status as well (AUC of 0.97). Finally, ten of the 23 genes could be interpreted in the context of the derailed immune system of ME/CFS.
https://www.nature.com/articles/s41598-021-83660-9
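For anyone wondering what "recursive feature selection" means mechanically, here is a toy Python sketch. This is not the authors' REFS pipeline (which aggregates votes from an ensemble of classifiers across cross-validation folds); the scoring function and the gene names/values below are invented purely for illustration: features are repeatedly scored and the weakest one dropped until a target number remain.

```python
# Toy recursive feature elimination: score each feature by the absolute
# difference of its class means, drop the weakest, repeat.
# Illustration only -- the paper's REFS method is an ensemble/CV procedure.

def class_mean_gap(values, labels):
    """Score a feature by |mean(patients) - mean(controls)|."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    ctrls = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(ctrls) / len(ctrls))

def recursive_elimination(X, y, keep):
    """X: dict of feature name -> list of values; y: 0/1 labels.
    Repeatedly drop the lowest-scoring feature until `keep` remain."""
    features = dict(X)
    while len(features) > keep:
        scores = {name: class_mean_gap(vals, y) for name, vals in features.items()}
        weakest = min(scores, key=scores.get)
        del features[weakest]
    return sorted(features)

# Made-up expression values for three hypothetical genes, 4 samples each
X = {
    "GENE_A": [5.0, 5.1, 1.0, 1.2],   # separates classes well
    "GENE_B": [2.0, 2.1, 2.0, 2.2],   # uninformative
    "GENE_C": [4.0, 4.5, 2.0, 2.1],   # moderately informative
}
y = [1, 1, 0, 0]  # 1 = patient, 0 = control

print(recursive_elimination(X, y, keep=2))  # → ['GENE_A', 'GENE_C']
```

The real method also has to guard against overfitting, which is why the paper validates the 23-gene signature on a second platform and in independent methylation data.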
 
This used the CAMDA data set, which used the Reeves et al. (2005) very odd operationalisation of the Fukuda et al. criteria, the so-called empiric criteria. These gave a population prevalence of 2.54%. A lot of those who meet them probably have primary psychiatric problems rather than ME/CFS. I consider them probably worse than the Oxford criteria.
 
I'd guess that the selection of criteria came from the author in the department of psychology in Amsterdam as his contribution.
The whole CAMDA dataset used the Reeves et al. (2005) criteria. I'm not sure whether it was possible to select patients in any other way using this dataset. Every paper looks like it used the criteria.

It was frustrating. The data was analysed in a special conference at the time with lots of numerate people analysing it, but I couldn't have any confidence in any of the results.
 
The whole CAMDA data set used the Reeves et al. (2005) criteria. I'm not sure whether it was possible to select patients in any other way using this dataset. Every paper looks like it used the criteria.

It was frustrating. The data was analysed in a special conference at the time with lots of numerate people analysing it, but I couldn't have any confidence in any of the results.

Yes, I misspoke when calling the selection a selection of criteria. My memory can't even keep in mind what I've read past the moment I read it. I wonder whether there were any other datasets available to them, or was this the only choice? I confess I don't have any idea what these things even are.
 
The whole CAMDA data set used the Reeves et al. (2005) criteria. I'm not sure whether it was possible to select patients in any other way using this dataset. Every paper looks like it used the criteria.

The good news is that Aletta Kraneveld was involved in the whole Dutch biomedical research agenda process. She went to the Invest in ME 2019 conference. I think it's important for us Dutch patients to remind them of better criteria to use. They did, however, mention this limitation in the study:

Our results return a promising gene signature for ME/CFS that needs to be validated in a well characterized clinical cohort to study its use as a diagnostic tool
 
FWIW:

In this 2011 article by Leonard Jason et al., the relative ability of the empiric and CCC criteria to identify patients from controls did not seem wildly different (79% vs 87%), but it raises the question of how they determined who was a "true" patient in the first place.

Data mining: comparing the empiric CFS to the Canadian ME/CFS case definition
https://pubmed.ncbi.nlm.nih.gov/21823124/

Abstract
This article contrasts two case definitions for myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS). We compared the empiric CFS case definition (Reeves et al., 2005) and the Canadian ME/CFS clinical case definition (Carruthers et al., 2003) with a sample of individuals with CFS versus those without. Data mining with decision trees was used to identify the best items to identify patients with CFS. Data mining is a statistical technique that was used to help determine which of the survey questions were most effective for accurately classifying cases. The empiric criteria identified about 79% of patients with CFS and the Canadian criteria identified 87% of patients. Items identified by the Canadian criteria had more construct validity. The implications of these findings are discussed.
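The "data mining with decision trees" step in that abstract can be illustrated with a one-level sketch (a decision "stump"). The study itself used proper decision-tree software on real survey data; the item names, answers, and labels below are entirely invented. The idea is simply to find which single yes/no survey item best separates patients from controls:

```python
# One-level decision "stump": find the single binary survey item that
# classifies patients vs controls most accurately. Invented toy data;
# Jason et al. used full decision-tree data mining on real questionnaires.

def stump_accuracy(answers, labels):
    """Accuracy when predicting 'patient' iff the item was answered yes (1)."""
    correct = sum(1 for a, y in zip(answers, labels) if a == y)
    return correct / len(labels)

def best_item(items, labels):
    """items: dict of item name -> list of 0/1 answers across respondents."""
    return max(items, key=lambda name: stump_accuracy(items[name], labels))

# Hypothetical questionnaire answers for 6 respondents
items = {
    "post_exertional_malaise": [1, 1, 1, 0, 0, 0],
    "unrefreshing_sleep":      [1, 1, 0, 1, 0, 0],
    "headaches":               [1, 0, 1, 1, 1, 0],
}
labels = [1, 1, 1, 0, 0, 0]  # 1 = patient, 0 = control

print(best_item(items, labels))  # → 'post_exertional_malaise'
```

A full decision tree repeats this selection recursively on each branch; "construct validity" in the abstract refers to whether the items selected make clinical sense, not just whether they classify well.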
 
how they determined who was a "true" patient in the first place
Diagnosing CFS
At the end of Stage 2, a team of physicians was responsible for making final diagnoses. Two physicians independently rated each file according to the current U.S. definition of CFS, ICF, Exclusionary for CFS due to medically/psychiatrically explained chronic fatigue (Fukuda et al., 1994), or Control (participants with no exclusionary illness and less than 6 months of fatigue). Reviewing physicians had access to all information gathered on each participant during each of the phases of the study. Physicians were not blind to Wave 1 status because they needed to be fully appraised of the medical history. The review panel was also provided with all results from the physical exam. If a disagreement occurred regarding whether a participant should receive a diagnosis of CFS, ICF, Exclusionary due to medically/psychiatrically explained chronic fatigue, or Controls during the physician review process, the participant’s file was rated by a third physician reviewer, and the diagnosis was determined by majority rule. We used refinements of the Fukuda et al. criteria as recommended by an International Research Group and the CDC (Reeves et al., 2003).

Full text available here, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3228898/
 
This protein expression stuff is exciting I think, and it looks like we have a number of teams interested in it. @DMissa

I am tempted to leap into googling each gene whose expression was found here to be down-regulated, to find out what it does. I've been looking at the Human Protein Atlas just now - the link I have given there is to CORO6, one of the genes found to have down-regulated expression.
It feels a little bit like reading a newspaper astrology item though, where the information for a protein is vague enough that it's easy to find something that seems to fit with what we think is happening.

One interesting thing from the Human Protein Atlas is that it underlines the variability of expression in different tissues. See for example the screenshot of a bit of the list of tissues where CORO6 is expressed. It's not expressed much in blood cells at all, and even there, it's more highly expressed in eosinophils and basophils (which weren't looked at in this study) than in monocytes (which were).

[Screenshot: Human Protein Atlas list of tissue-specific expression for CORO6]

So, if you thought the finding of down-regulation of e.g. CORO6 in patients compared to controls was worth following up, it might be useful to look at different tissues, maybe muscle, or maybe the eosinophils. And of course, the protein might be doing different, and important, things in the different tissues, even at low concentrations. It feels a bit like a needle-in-a-haystack sort of problem, but surely the needles are there in the haystack. People surely can't feel as bad as we do without some proteins being expressed differently. And the sifting of the straw has begun.
 
The good news is that Aletta Kraneveld was involved in the whole Dutch biomedical research agenda process. She went to the Invest in ME 2019 conference. I think it's important for us Dutch patients to remind them of better criteria to use. They did however mention this in the study.

Just a little bit more patience, the research agenda will be public within a few months.
 
Way above my head. However, this study highlights the possibilities inherent in publicly available datasets that other teams can dive into, in this case with machine learning algorithms.

In this study they only used a few and quite small datasets. Yet for machine learning I would have thought the more (good) data the merrier. I wonder what the limitation was, lack of funding* to do more or no additional datasets publicly available?

* They appear to have been quite creative to get whatever funding they had:
The funding for the study was provided by the division of Pharmacology, Department of Pharmaceutical Sciences, Faculty of Science, Utrecht University and Tytgat Institute for Liver and Intestinal Research, AGEM, Amsterdam UMC.
 
Way above my head. However, this study highlights the possibilities inherent in publicly available datasets that other teams can dive into, in this case with machine learning algorithms.

In this study they only used a few and quite small datasets. Yet for machine learning I would have thought the more (good) data the merrier. I wonder what the limitation was, lack of funding* to do more or no additional datasets publicly available?

* They appear to have been quite creative to get whatever funding they had:

Way above my head too.

I'm not sure if this post by @Jonathan Edwards is in any way related to this publication [they both suggest immune dysregulation I guess]:
https://www.s4me.info/threads/me-cf...maintains-status-quo.12949/page-2#post-228586

The other thing is that I don't think Jonathan's a big fan of studies which use peripheral blood mononuclear cells (PBMCs) - from memory his view is that PBMCs are not doing a lot.

Interesting to see what Jonathan's take on this is.
 
This protein expression stuff is exciting I think, and it looks like we have a number of teams interested in it. @DMissa

The analysis here looks very nice!

With regard to gene products, the study here is looking at mRNA levels. In my opinion it's hard to infer downstream or functional consequences from mRNA levels without protein data alongside them (the authors are responsible about this, in my opinion, and focus on the diagnostic aspects, which I think is good; they don't overclaim). I found this out firsthand, e.g. in the paper we just published: if we only had transcriptome (mRNA) data it would have looked like the electron transport chain was down-regulated in expression, when the reality is that it's down-regulated at the transcript level but still up-regulated at the protein level (likely due to specific dysregulation of the signalling pathways involved).

The best-performing diagnostic parts here look nice. It looks like the mRNA measurements might generally achieve better specificity than sensitivity (the mRNA ROC curves ramp upwards from the Y axis); I've actually seen this trend when exploring transcriptome data, so that's very interesting. Sensitive tests lend themselves well to an initial screening step, since you are less likely to miss any true-positive patients. Specific tests lend themselves well to a subsequent confirmatory test after initial screening, since you can better filter out false positives.
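The screening-vs-confirmation trade-off above can be made concrete with a small Python sketch (the scores and labels are invented; nothing here is data from the paper): the same classifier scores give different sensitivity and specificity depending on where the decision threshold is set.

```python
# Sensitivity and specificity of a score-based test at a chosen threshold.
# Toy scores, not data from the study: higher score = more "patient-like".

def sens_spec(scores, labels, threshold):
    """Predict 'patient' when score >= threshold; return (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   1,   0,   0,   0,   0]

# A low threshold favours sensitivity (good for screening) ...
print(sens_spec(scores, labels, 0.35))  # → (1.0, 0.75)
# ... a high threshold favours specificity (good for confirmation).
print(sens_spec(scores, labels, 0.75))  # → (0.5, 1.0)
```

Sweeping the threshold across all values traces out the ROC curve; the AUC figures quoted in the paper (0.92-0.97) summarise that whole curve in one number.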
 
The whole CAMDA dataset used the Reeves et al. (2005) criteria. I'm not sure whether it was possible to select patients in any other way using this dataset. Every paper looks like it used the criteria.
The CAMDA dataset can be downloaded here: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/CFS/

Do you happen to have a link or reference that explains that the CAMDA dataset was based on the CDC study that used the empirical criteria?
 
The whole CAMDA dataset used the Reeves et al. (2005) criteria. I'm not sure whether it was possible to select patients in any other way using this dataset. Every paper looks like it used the criteria.

The CAMDA dataset can be downloaded here: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/CFS/

Do you happen to have a link or reference that explains that the CAMDA dataset was based on the CDC study that used the empirical criteria?

I'm afraid I'm not sure how easily I could do that at this stage.
It relates to this cohort:
Chronic Fatigue Syndrome – A clinically empirical approach to its definition and study
https://bmcmedicine.biomedcentral.com/articles/10.1186/1741-7015-3-19
This population-based case control study enrolled 227 adults identified from the population of Wichita with: (1) CFS (n = 58); (2) non-fatigued controls matched to CFS on sex, race, age and body mass index (n = 55); (3) persons with medically unexplained fatigue not CFS, which we term ISF (n = 59); (4) CFS accompanied by melancholic depression (n = 27); and (5) ISF plus melancholic depression (n = 28). Participants were admitted to a hospital for two days and underwent medical history and physical examination, the Diagnostic Interview Schedule, and laboratory testing to identify medical and psychiatric conditions exclusionary for CFS.
At the time I used to look at the CAMDA papers that came out, and they all used the empiric criteria as far as I could see, so I eventually stopped reading them.

Lots of people missed it at the time because it was described as an operationalisation of the Reeves et al. (2003) criteria, basically the Fukuda et al. (1994) criteria. However, the Reeves et al. (2005) criteria are a very weird operationalisation of those criteria.
 