Cochrane Review: 'Exercise therapy for chronic fatigue syndrome', Larun et al. - New version October 2019 and new date December 2024

So they argued that a change of 4 was clinically significant
As their source they refer to:
Which reads:
Because the Chalder fatigue scale is relatively new, there is no published definition of equivalence. The researchers in this trial include several of those involved in developing and testing the instrument. Our consensus view was that a difference of less than four, using a Likert scale, is not important
Ok that's it. I don't trust a word of what Larun et al. or Cochrane say anymore ....

I really thought they would have checked that the right figures were used, not only those that favour the results for exercise therapy.


EDIT: The last sentence has been edited for clarification.
 
I really thought they would have checked that the right figures were used, not only those that favour the results for exercise therapy.
To clarify what I mean: I did a quick PubMed search using the terms: (Chalder Fatigue Scale) AND (minimally important difference OR clinically significant). It gives only 10 results, including the relevant paper by Sabes-Figuera et al. and the paper on lupus which the authors used in the updated Cochrane review (Goligher et al. 2008). This gives me the impression that the authors should have been able to find the Sabes-Figuera study, even with a limited search.

The PACE trial (White et al. 2011) says:
A clinically useful difference between the means of the primary outcomes was defined as 0·5 of the SD of these measures at baseline, equating to 2 points for Chalder fatigue questionnaire and 8 points for short form-36.
So Larun et al. should probably have said that a clinically significant difference is estimated at 2-4 points on the Chalder Fatigue Scale for patients with CFS, based on previous studies.
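The half-SD rule quoted above can be sketched in a couple of lines. Note that the baseline SDs below are back-calculated from the trial's own thresholds (0.5 × SD = 2 for the CFQ and 8 for SF-36), so they are inferences for illustration, not values quoted directly by White et al.:

```python
# Sketch of the "0.5 x baseline SD" rule PACE used to define a clinically
# useful difference (CUD). The baseline SDs are back-calculated from the
# published thresholds (2 points CFQ, 8 points SF-36), not quoted directly.

def clinically_useful_difference(baseline_sd: float) -> float:
    """Half of the baseline standard deviation, per the PACE definition."""
    return 0.5 * baseline_sd

cfq_sd = 4.0    # implied: 0.5 * 4 = 2 points on the Chalder scale
sf36_sd = 16.0  # implied: 0.5 * 16 = 8 points on SF-36 physical function

print(clinically_useful_difference(cfq_sd))   # 2.0
print(clinically_useful_difference(sf36_sd))  # 8.0
```

As the Lancet letter quoted further down in the thread notes, substituting a general-population SD of 24 for SF-36 would raise that threshold from 8 to 12.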
 
The PACE trial (White et al. 2011) says:
A clinically useful difference between the means of the primary outcomes was defined as 0·5 of the SD of these measures at baseline, equating to 2 points for Chalder fatigue questionnaire and 8 points for short form-36.
So Larun et al. should probably have said that a clinically significant difference is estimated at 2-4 points on the Chalder Fatigue Scale for patients with CFS, based on previous studies.

This letter was published in the Lancet on what was done in the PACE Trial:

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(11)60689-2/fulltext
Correspondence | Volume 377, Issue 9780, P1831, May 28, 2011
The PACE trial in chronic fatigue syndrome

Published: May 28, 2011. DOI: https://doi.org/10.1016/S0140-6736(11)60689-2

In their randomised trial of treatments for patients with chronic fatigue syndrome, Peter White and colleagues (March 5, p 823)[1] define a clinically useful difference between the means of the primary outcomes as “0·5 of the SD of these measures at baseline, equating to 2 points for Chalder fatigue questionnaire and 8 points for short form-36”. They cite achieving a mean clinically useful difference in the graded exercise therapy or cognitive behaviour therapy groups, compared with specialist medical care alone, as evidence that these interventions are “moderately effective treatments”.

The source for this definition of clinically useful difference states that such a method has a “fundamental limitation”: “estimates of variability will differ from study to study…if one chooses the between-patient standard deviation, one has to confront its dependence on the heterogeneity of the population under study”.[2]

In White and colleagues' study, we do not have heterogeneous samples on the Chalder fatigue questionnaire and short-form 36 physical function subscale, since both are used as entry criteria.[1] Patients had to have scores of 65 or less on short-form 36 to be eligible for the study.[1]

However, most, in practice, would probably need to have scores of 30 or more to be able to participate in this clinic-based study. Indeed, only four of 43 participants in a previous trial of graded exercise therapy scored less than 30.[3,4]
Guyatt and colleagues[2] suggest that “an alternative is to choose the standard deviation for a sample of the general population”, which White and colleagues have given as 24.[1] An SD of 24 gives a clinically useful difference of 12; both graded exercise therapy and cognitive behaviour therapy fail to reach this threshold. Whether they “moderately improve outcomes”, as claimed,[1] is therefore questionable.

I am chair of a myalgic encephalomyelitis support and advice group—an unpaid voluntary position.

References
1. White PD, Goldsmith KA, Johnson AL, et al, on behalf of the PACE trial management group. Comparison of adaptive pacing therapy, cognitive behaviour therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome (PACE): a randomised trial. Lancet. 2011; 377: 823-836.
2. Guyatt GH, Osaba D, Wu AW, et al. Methods to explain the clinical significance of health status measures. Mayo Clinic Proc. 2002; 77: 371-383.
3. Fulcher KY. Physiological and psychological responses of patients with chronic fatigue syndrome to regular physical activity. Loughborough University of Technology, Loughborough; 1997. http://hdl.handle.net/2134/6777 (accessed March 4, 2011).
4. Fulcher KY, White PD. Randomised controlled trial of graded exercise in patients with the chronic fatigue syndrome. BMJ. 1997; 314: 1647-1652.
 
So it is pretty routine to describe an effect of 0.6 as moderate.
An SMD of 0.64 seems pretty large compared to a 3.4-point difference on a 33-point scale.

Just a thought: Is it possible that the SMD was inflated because the standard deviation in the studies was low? Some trials used the 11-point version of the Chalder Fatigue Scale, which has ceiling effects. So perhaps most participants in these trials had near-maximum scores with very little variation, and even a small change would look big with so little background noise.
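That intuition can be illustrated with a toy calculation. The SD values here are invented for illustration only, not taken from the included trials:

```python
# Toy illustration: the same raw difference on the Chalder scale produces
# very different standardised mean differences (SMD) depending on the
# pooled standard deviation. The SDs below are invented for illustration.

def smd(mean_difference: float, pooled_sd: float) -> float:
    """Standardised mean difference, Cohen's d style."""
    return mean_difference / pooled_sd

raw_difference = 3.4  # points on the 33-point Chalder scale

# Little between-patient variation (scores bunched near the ceiling):
print(round(smd(raw_difference, 5.0), 2))   # 0.68 - looks "moderate"

# Wider variation, identical raw change:
print(round(smd(raw_difference, 10.0), 2))  # 0.34 - looks small
```

So halving the background variation doubles the SMD without any change in what patients actually reported.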
 
This letter was published in the Lancet on what was done in the PACE Trial:
To clarify for others:

Larun et al. set the minimal important difference (MID) for SF-36 physical function at 7 points, citing two studies, one on patients with rheumatoid arthritis and one on patients with heart disease:
  • Ward MM, Guthrie LC, Alba MI. Clinically important changes in short form 36 health survey scales for use in rheumatoid arthritis clinical trials: the impact of low responsiveness. Arthritis Care & Research 2014;66:1783-9.
  • Wyrwich KW, Metz SM, Kroenke K, Tierney WM, Babu AN, Wolinsky FD. Triangulating patient and clinician perspectives on clinically important differences in health-related quality of life among patients with heart disease. Health Services Research 2007;42:2257-74.
The results of the Cochrane review showed that post-treatment, mean physical functioning scores in the exercise group were 13.10 points higher. So they argue that this is a clinically significant effect.

The threshold of 7 points as MID, however, seems quite low. The letter by Jane Giakoumakis argued for a MID of 12 points, based on half the standard deviation of the SF-36 measured in the general population. I've found this study on patients with idiopathic pulmonary fibrosis which estimated the MID for SF-36 physical function at 13.9 points.

Since SF-36 physical function is a frequently used measure, I suspect there will be more of these estimates of the MID. It would be useful to get a broad overview to see what range there is and whether Larun et al.'s choice of 7 was adequate or not.
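To make the comparison concrete, here is a quick sketch lining up the review's 13.10-point difference against the MID estimates mentioned in this thread; the threshold values come from the sources cited above, and the labels are informal:

```python
# Compare the review's 13.10-point SF-36 physical function difference
# against the MID estimates discussed in this thread. The labels and
# grouping are informal; the numbers come from the sources cited above.

observed_difference = 13.10

mid_estimates = {
    "Larun et al. (RA and heart disease studies)": 7.0,
    "Giakoumakis letter (0.5 x general-population SD of 24)": 12.0,
    "Idiopathic pulmonary fibrosis study": 13.9,
}

for label, mid in mid_estimates.items():
    verdict = "exceeds" if observed_difference >= mid else "falls short of"
    print(f"13.10 {verdict} the MID of {mid} ({label})")
```

So whether the 13.10-point effect counts as clinically significant depends entirely on which MID estimate one accepts.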
 
So they argued that a change of 4 was clinically significant, which is higher than the 3.4-point effect size reported by Larun et al.
An SMD of 0.64 seems pretty large compared to a 3.4-point difference on a 33-point scale.

Just a thought: Is it possible that the SMD was inflated because the standard deviation in the studies was low? Some trials used the 11-point version of the Chalder Fatigue Scale, which has ceiling effects. So perhaps most participants in these trials had near maximum scores with very little variation.
The minimum score on the Chalder scale is effectively 11, making it a 23-point scale. A 3.4-point difference strikes me as credible for 0.5 SD, given that 0.5 SD isn't actually a very big difference (see the graph in my previous post).

You're right that a ceiling effect would constrain the baseline standard deviation used to calculate SMD. But there is a counter argument that people on the ceiling could "improve" without showing a fall in their score, since they are already off the scale.

Given also that, as you point out, other estimates of a minimally important difference are in the range 2-4, I am not sure the threshold used for MID is an obvious flaw in the review. Certainly not compared with all the other problems.

(As @Dolphin points out in that letter to the Lancet, issues of an artificial baseline standard deviation are probably more important for the SF-36 physical function scale, but, surprisingly, exercise doesn't seem to have improved physical function according to this review)
 
But there is a counter argument that people on the ceiling could "improve" without showing a fall in their score, since they are already off the scale.
Do you mean that some patients had such severe fatigue that they still would have the maximum score even after an improvement in fatigue?

That could be, but I suspect that in the trials changes on the CFQ were mostly determined by response bias, placebo effects and other non-clinical effects (for example: not being willing to admit that 12 weeks of therapy was pointless) rather than actual changes in health. So people fill in the questionnaire a little 'better.'

In that case bias causes a small but consistent change on the questionnaire which looks moderate when expressed as SMD because patients pretty much all have the same score (near the maximum) causing little background noise.

Just a thought. I should really look at the data more closely to see if it makes sense...
 
Do you mean that some patients had such severe fatigue that they still would have the maximum score even after an improvement in fatigue?

That could be, but I suspect that in the trials changes on the CFQ were mostly determined by response bias, placebo effects and other non-clinical effects (for example: not being willing to admit that 12 weeks of therapy was pointless) rather than actual changes in health. So people fill in the questionnaire a little 'better.'

In that case bias causes a small but consistent change on the questionnaire which looks moderate when expressed as SMD because patients pretty much all have the same score (near the maximum) causing little background noise.

Just a thought. I should really look at the data more closely to see if it makes sense...
We have both the FINE and PACE Trial individual scores (though not by item), if anyone ever wants to look a bit more at individual patient data.

ETA: Also I found Fulcher’s PhD somewhere, probably on ETHOS which also has similar data.
 
Weird. I do not consider any change in the CFQ to be clinically significant. None at all. At best it's a secondary measure and a very poor one over a secondary dimension of this disease. It is not a measure of anything other than the researchers' own misunderstanding of the problem and promotion of their personal beliefs above reality. It's as meaningless an argument as which precise skullcap sizes or brow distance are evidence of genius or criminality.

Objective measures or bust. If we're "healthy", that means a normal life with no limitations whatsoever, enough with this "recovery is hard to define" crapfest. Anything else is just arguing over the precise fabric of the shoes of the angels dancing on a hairpin and its adherent properties on the metal pinhead. What a waste of everything, meaningless conversations over made-up nonsense.

Not to rain on the discussion here; sadly, because this nonsense is imposed on us like a meteor crashing down, we do have to discuss it. But holy crap is this all dumb and disastrous. People (not you here, the idiots trying to make the CFQ a relevant thing) shouting over which preferred arbitrary measurement on an imaginary scale means something, about a subset that isn't even meaningful by itself. Fools asking the wrong questions and wasting millions of lives arguing which imaginary answer is more meaningful than other imaginary answers, or the precise cutoff at which this imaginary answer means something else.

At least when people were shouting and fighting each other in the Middle Ages over which fictitious spirit or demon was responsible for something, they had ignorance to blame.
 
The letter by Jane Giakoumakis argued for a MID of 12 points, based on half the standard deviation of the SF-36 measured in the general population.
I get the point about using the general population rather than the presumably narrower patient group. But we've seen that the distribution of the SF-36 is definitely not a normal distribution. Is it possible to derive something meaningful from the standard deviation of a non-normal distribution?
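A small numerical sketch of that worry, using invented data (a heavily skewed toy sample, not real SF-36 scores): for a normal distribution roughly 68% of values fall within one SD of the mean, but for a skewed sample the SD is dominated by the tail, so a "0.5 SD" threshold has no fixed percentile meaning:

```python
# Toy demonstration that the SD of a skewed sample is hard to interpret.
# Invented data: 90 "typical" scores of 1 plus 10 outliers of 100.
from statistics import mean, pstdev

data = [1] * 90 + [100] * 10

m = mean(data)     # 10.9
sd = pstdev(data)  # ~29.7, dominated by the 10 outliers

within_one_sd = sum(m - sd <= x <= m + sd for x in data) / len(data)
print(round(m, 1), round(sd, 1), within_one_sd)
# For a normal distribution ~68% of values lie within 1 SD of the mean;
# here 90% do, and half an SD (~14.9) dwarfs the typical score of 1.
```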
 
Thanks for all the discussion on this.

I remember when I first read about the concept of a 'clinically significant difference'/MID/etc. It seemed like a potentially useful way of assessing patient views on the value of changes in questionnaire scores in nonblinded trials, once patients had been told about the various problems with bias. Then I read how these concepts were actually defined in papers, and it seemed like it was often just another way of making it seem that researchers were doing more valuable work than they truly were.

I find it difficult to believe that many patients would consider a 3.4 change to be important in trials of the sort assessed in this review. But who knows? It seems no-one bothered to ask us.

edit: And if patients are asked to indicate the MID on a questionnaire with a huge range of items of different importance, then that doesn't yield a general MID score; it just shows that those exact items indicate an important difference. If different items lead to the same score, that does not mean they are also viewed as an important difference.
 
I find it difficult to believe that many patients would consider a 3.4 change to be important in trials of the sort assessed in this review. But who knows? It seems no-one bothered to ask us.

I had never heard of this MID before getting into ME studies - and I had spent years working on trials and acting as an expert witness on trials in the law courts. As far as I can see it has no relevance to clinical importance. I used a slide of the 3.4-point change in PACE for my NICE presentation with the full 33 points of the y-axis. It just looks pathetic. None of this pseudo-statistics has any bearing on reality.
 
Again, I stop by with a trivial point. I'm sure this has been pointed out already, but still this limitation of the review is not sufficiently addressed in the abstract, and not at all in the conclusion.

Selection criteria

We included randomised controlled trials (RCTs) about adults with a primary diagnosis of CFS, from all diagnostic criteria, who were able to participate in exercise therapy.

To not specify the features of patients not meeting these selection criteria for the sample (being unable to participate in exercise therapy) seems to me a grave neglect.

(Hope this is understandable & also why I think this matters, not able to explain at the moment.)

Edited to add:
About the exercise participants had to be able to participate in:
"Seven studies used aerobic exercise therapies such as walking, swimming, cycling or dancing, provided at mixed levels in terms of intensity of the aerobic exercise from very low to quite rigorous, and one study used anaerobic exercise."
 
The following changes were proposed but rejected:

1) Objective outcomes
Tom Kindlon and Robert Courtney noted that with the exception of health resource use, Larun et al. have not reported on objective outcomes. The randomized trials included in the review had data on outcomes such as exercise testing, a fitness test, the six minute walking test, employment status and disability payments. Objective outcomes tend to be less influenced by bias due to a lack of blinding. The analysis by Vink & Vink-Niese showed that, with some exceptions, objective outcomes generally have not significantly improved following exercise therapy. Back in 2015, the authors responded that “the protocol for this review did not include objective measurements." But they did seem to agree that objective measures should be carefully considered in an update. No extra objective outcomes were reported in the amended review.

2) Compliance:
Kindlon also asked about data on compliance: information on whether the trial participants really followed the exercise therapy as prescribed. He wrote: “it would be interesting if you could obtain some unpublished data from activity logs, records from heart-rate monitors, and other records to help build up a picture of what exercise was actually performed and the level of compliance.” Again, the authors seemed to agree that this is an important point that should be considered in an update of the review. No information is provided on compliance in the 2019 amendment.

3) Selective reporting in the PACE trial
Tom Kindlon and Robert Courtney both argued that the PACE trial should not be rated as low risk of bias for selective reporting. They referred to the Cochrane tool for assessing risk of bias (RoB 1), where the low risk of bias was explained as “The study protocol is available and all of the study’s pre-specified (primary and secondary) outcomes that are of interest in the review have been reported in the pre-specified way.” Kindlon and Courtney argued that this was not the case for the PACE trial and that therefore the trial should not be rated as low risk of bias. Their comments were supported by Cochrane editor Nuala Livingstone during an internal audit of Courtney’s complaint to Cochrane. In their 2015 response, Larun et al. acknowledged that changes were made to planned analysis specified in the protocol of the PACE trial but argued that “these changes were drawn up before the analysis commenced and before examining any outcome data.” In the 2019 amendment all risk of bias judgements have remained the same, including the low risk of bias on selective reporting for the PACE trial. The authors justify this as follows: “The protocol and the statistical analysis plan were not formally published prior to recruitment of participants, and some readers, therefore, claim the study should be viewed as being a post hoc study. The study authors oppose this, and have published a minute from a Trial Steering Committee (TSC) meeting stating that any changes made to the analysis since the original protocol was agreed by TSC and signed off before the analysis commenced.”

4) Proposal to analyze the excluded data from Jason et al.
For the outcome of physical function at follow-up, the study by Jason et al. was excluded because of large baseline differences: the exercise group had much lower (39) physical function scores than the relaxation group (54). Kindlon noted that “It would be good if other methods could be investigated (e.g. using baseline levels as covariates) to analyse such data.” The authors responded that this would make the analysis very complicated and that this can be more easily addressed in a review based on individual patient data. The 2019 amendment does not use an alternative method to include the results of Jason et al. on physical function at follow-up.

5) Downgrading fatigue post-treatment to low-quality evidence
From a publicly released email exchange, we know that the previous Cochrane Editor-in-Chief David Tovey strongly objected to the results for fatigue post-treatment being rated as moderate quality. He wrote: “the conclusion that this is moderate certainty evidence seems indefensible to me.” Tovey argued that it could be further downgraded for inconsistency (because of considerable heterogeneity, reflected by an I² of 80%) or imprecision (because the confidence interval of the effect crosses the line of no longer being clinically significant). The authors - represented by officials of the Norwegian Institute of Public Health (NIPH) - argued that heterogeneity was mostly due to the study by Powell et al.: when it was removed, the heterogeneity became acceptable while the effect size remained moderate. Regarding imprecision, they argued that GRADE only advises downgrading when the confidence interval crosses the line of no effect, not the line of a clinically significant effect. In the email correspondence, the authors did seem to agree that these were both borderline cases and open to interpretation. They, therefore, proposed the following compromise, as explained by Atle Fretheim from the NIPH: “I proposed a compromise: We simply grade the evidence for this outcome as Low-moderate. The authors have accepted to use the term ‘may’ (usually indicating low certainty evidence) when describing the certainty of the evidence, rather than the term ‘probably’ (usually indicating moderate certainty). They have also accepted not to use any categorization of the effect size.” An alternative solution proposed was to use the term “low to moderate quality evidence”. The 2019 amendment, however, uses the words “probably” and “moderate-certainty evidence”.

EDIT: The changes made to the Cochrane review are not an update (which would include a new literature search and new studies) but an amendment. This has now been changed in the text.
 
We have both the FINE and PACE Trial individual scores (though not by item), if anyone ever wants to look a bit more at individual patient data.

ETA: Also I found Fulcher’s PhD somewhere, probably on ETHOS which also has similar data.

Dolphin, can you please upload the FINE trial individual scores, or do you have a link to them? Thank you.
 
The following changes were proposed but rejected:

1) Objective outcomes
Tom Kindlon and Robert Courtney noted that with the exception of health resource use, Larun et al. have not reported objective outcomes. The randomized trials included in the review had data on outcomes such as exercise testing, a fitness test, the six minute walking test, employment status and disability payments. Objective outcomes tend to be less influenced by bias due to a lack of blinding. The analysis by Vink & Vink-Niese showed that, with some exceptions, objective outcomes generally have not improved significantly following exercise therapy. Back in 2015, the authors responded that “the protocol for this review did not include objective measurements.” But they seemed to agree that objective measures should be carefully considered in an update. Despite this, no extra objective outcomes were reported in the updated review.

2) Compliance:
Kindlon also asked about data on compliance: information on whether the trial participants really followed the exercise therapy as prescribed. He wrote: “it would be interesting if you could obtain some unpublished data from activity logs, records from heart-rate monitors, and other records to help build up a picture of what exercise was actually performed and the level of compliance.” Again, the authors seem to agree that this is an important point that should be considered in an update of the review. Nonetheless no information is provided on compliance in the 2019 update.

3) Selective reporting in the PACE trial
Tom Kindlon and Robert Courtney both argued that the PACE trial should not be rated as low risk of bias for selective reporting. They referred to Cochrane's tool for assessing risk of bias (RoB 1), where the low risk of bias was explained as “The study protocol is available and all of the study’s pre-specified (primary and secondary) outcomes that are of interest in the review have been reported in the pre-specified way.” Kindlon and Courtney argued that this was not the case for the PACE trial and that therefore the trial should not be rated as low risk of bias. Their comments were supported by Cochrane editor Nuala Livingstone during an internal audit of Courtney’s complaint to Cochrane. In their 2015 response, Larun et al. acknowledged that changes were made to planned analysis specified in the protocol of the PACE trial but argued that “these changes were drawn up before the analysis commenced and before examining any outcome data.” In the 2019 update all risk of bias judgements have remained the same, including the low risk of bias on selective reporting for the PACE trial. The authors justify this as follows: “The protocol and the statistical analysis plan were not formally published prior to recruitment of participants, and some readers, therefore, claim the study should be viewed as being a post hoc study. The study authors oppose this, and have published a minute from a Trial Steering Committee (TSC) meeting stating that any changes made to the analysis since the original protocol was agreed by TSC and signed off before the analysis commenced.”

4) Proposal to analyze the excluded data from Jason et al.
For the outcome of physical function at follow-up, the study by Jason et al. was excluded because of large baseline differences: the exercise group had much lower (39) physical function scores than the relaxation group (54). Kindlon noted that “It would be good if other methods could be investigated (e.g. using baseline levels as covariates) to analyse such data.” The authors responded that this would make the analysis very complicated and that this can be more easily addressed in a review based on individual patient data. The 2019 update does not use an alternative method to include the results of Jason et al. on physical function at follow-up.

5) Downgrading fatigue post-treatment to low-quality evidence
From a publicly released email exchange we know that the previous Cochrane Editor-in-Chief David Tovey strongly objected to the results for fatigue post-treatment being rated as moderate quality. He wrote: “the conclusion that this is moderate certainty evidence seems indefensible to me.” Tovey argued that it could be further downgraded for inconsistency (heterogeneity reflected by an I² of 80%) or imprecision (because the confidence interval of the effect crosses the line of no longer being clinically significant). The authors - represented by officials of the Norwegian Institute of Public Health (NIPH) - argued that heterogeneity was mostly due to the study by Powell et al.: when it was removed, the heterogeneity became acceptable while the effect size remained moderate. Regarding imprecision, they argued that GRADE only advises downgrading when the confidence interval crosses the line of no effect, not the line of a clinically significant effect. In the email correspondence they did seem to agree that these were both borderline cases and open to interpretation. They therefore proposed the following compromise, as explained by Atle Fretheim from the NIPH: “I proposed a compromise: We simply grade the evidence for this outcome as Low-moderate. The authors have accepted to use the term ‘may’ (usually indicating low certainty evidence) when describing the certainty of the evidence, rather than the term ‘probably’ (usually indicating moderate certainty). They have also accepted not to use any categorization of the effect size.” An alternative solution proposed was to use the term “low to moderate quality evidence”. The 2019 update, however, uses the words “probably” and “moderate-certainty evidence”.
Excellent!

I have not seen any justification for those from Cochrane in the review and commentary. Did I miss them? Every single one of those points is damning. Combined they frankly amount to malpractice, "justified" or not. Simply describing the problem is not a proper justification. Neither is "this would be too hard".

This simply does not amount to serious work. Except they are all competent professionals, which suggests a much worse reason to produce something so ridiculously bad, especially knowing all the scrutiny it would be subjected to and how much is already documented.
 
None of this is going to change without legal action!!

The Lancet won't change, the BMJ won't, and Cochrane won't either. I doubt NICE will throw themselves under the bus either, especially with Cochrane now claiming there is evidence of efficacy.

This needs to go to the Supreme Court. Simple as that.
 