Rethinking the treatment of chronic fatigue syndrome—A reanalysis and evaluation of findings from a recent major trial of graded exercise and CBT

But - even if that is the belief - what the data keeps showing is that even if you change what people think, it doesn’t actually change their behaviour. They aren’t more active, they don’t get back to work, they don’t use fewer resources in terms of benefits etc... so that must be seen as a failure.

I don’t think non-complaining sick people is usually the goal, if the research is sponsored by the DWP! Fine for the health service, but not otherwise.
Non-complaining sick people = higher rates of ‘cure’ that can be claimed.

So, no, not useful for society, resources, govt funding, etc etc (those who fund the research).
But useful for the people doing the ‘research’: we got funded, we did something, it got published as a ‘success’, we’ll get more funding for ‘research into effective treatments’. (My inner cynic wants to say: better outcome for these ‘researchers’ than a cure would be).

The scammed are not just patients but funding bodies.

The sad thing is that when you have been scammed, it’s often very hard to ‘give up’ and admit that all your ‘good work’ has been a fail. Easier to rationalise.
This is human.

What we need is modern scientific method: where every negative result adds to our body of knowledge.
The old quote about finding a filament for the lightbulb: (something like) not 100 failures, but 100 things we now know don't work.


ETA: What I'm trying to say here is that studies like PACE can have real value. We all (especially the people involved in publishing PACE) need to appreciate the value of a result that disconfirms your hypothesis, or yields a null result. If mistakes are made we need to review them and learn from them.
A conclusion that something is not significantly effective is a useful conclusion. A conclusion that there are better ways to run a trial is also a useful conclusion - so long as you make future trials better.

This is what is so brilliant about this reanalysis: we are looking at what was really found, and what was not. The ONLY way to make progress.
 
Esther12 said:
Larun response to Kindlon:

"Bimodal versus Likert scoring in Wearden et al. 2010
To enable pooling of as many studies as possible in a mean difference meta-analyses, we used the 33-scale results reported by Wearden. You suggest that the decision to use the 33-point fatigue scores in our analysis may bias the results because there is no statistically significant difference at the 11-point data at 70 weeks. This statement suggests that there is a statistically significant difference when using the 33-point data, but if you look into analysis 1.2 that is not the case. At 70 week we report MD -2.12 (95% CI -4.49 to 0.25) for the FINE trial, i.e. not statistically significant."

Larun reply to Courtney:

Dear Robert Courtney

Thank you for your detailed comments on the Cochrane review ’Exercise Therapy for Chronic Fatigue Syndrome’. We have the greatest respect for your right to comment on and disagree with our work.

We take our work as researchers extremely seriously and publish reports that have been subject to rigorous internal and external peer review. In the spirit of openness, transparency and mutual respect we must politely agree to disagree.

The Chalder Fatigue Scale was used to measure fatigue. The results from the Wearden 2010 trial show a statistically significant difference in favour of pragmatic rehabilitation at 20 weeks, regardless whether the results were scored bi-modally or on a scale from 0-3. The effect estimate for the 70 week comparison with the scale scored bi-modally was -1.00 (CI-2.10 to +0.11; p =.076) and -2.55 (-4.99 to -0.11; p=.040) for 0123 scoring. The FINE data measured on the 33-point scale was published in an online rapid response after a reader requested it. We therefore knew that the data existed, and requested clarifying details from the authors to be able to use the estimates in our meta-analysis. In our unadjusted analysis the results were similar for the scale scored bi-modally and the scale scored from 0 to 3, i.e. a statistically significant difference in favour of rehabilitation at 20 weeks and a trend that does not reach statistical significance in favour of pragmatic rehabilitation at 70 weeks. The decision to use the 0123 scoring does not affect the conclusion of the review.
Regards,

Lillebeth Larun
So it looks like Larun have moved from the -2.12 (not statistically significant) finding to the -2.55 (statistically significant) finding? Or am I reading it incorrectly? I don't think the latter finding/data is mentioned anywhere else in the review.
 
So it looks like Larun have moved from the -2.12 (not statistically significant) finding to the -2.55 (statistically significant) finding? Or am I reading it incorrectly? I don't think the latter finding/data is mentioned anywhere else in the review.

I don't think I'd noticed that before.

"-2.55 (-4.99 to -0.11; p=.040) for 0123 scoring"

Those are the figures from the FINE RR: http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0

So Larun is contrasting those "effect estimates" with her "unadjusted analysis", which is what was used in the Cochrane review?

So it seems that the difference was a result of adjustments made in the FINE analysis? I've forgotten how FINE data was analysed now.
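
For anyone reading along who is less used to these numbers: a mean difference is conventionally called statistically significant at the 5% level when its 95% confidence interval excludes zero. Here is a minimal sketch using the two FINE estimates quoted above:

```python
# A quick check of whether a 95% confidence interval excludes zero, using the
# two FINE mean-difference estimates quoted in the posts above.

estimates = {
    "Cochrane analysis 1.2, 70 weeks":             (-2.12, -4.49, 0.25),
    "FINE rapid response, 0123 scoring, 70 weeks": (-2.55, -4.99, -0.11),
}

for label, (md, lower, upper) in estimates.items():
    significant = lower > 0 or upper < 0  # CI lies entirely on one side of zero
    print(f"{label}: MD {md}, 95% CI ({lower}, {upper}) -> significant: {significant}")
# The first interval crosses zero (not significant at the 5% level);
# the second does not (significant), which is the contrast being discussed above.
```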
 
Those who were added under the likert scheme scored bimodally as follows: 8 (1 patient), 7 (18 patients), 6 (26 patients), 5 (20 patients), and 4 (23 patients).
Thanks for that really clear explanation, @Valentijn . Wow. So that's 88 patients whose bimodal scores indicate abnormal fatigue being counted as recovered under the Likert switch. I'm guessing their bimodal scores hadn't reduced by 50% either.

For people like me who need to see these things again and again, from the PACE protocol:
"We will use the 0,0,1,1 item scores [bimodal] to allow a possible score of between 0 and 11. A positive outcome will be a 50% reduction in fatigue score, or a score of 3 or less, this threshold having been previously shown to indicate normal fatigue [27]."

Reference 27 was
Chalder T, Berelowitz G, Hirsch S, Pawlikowska T, Wallace P, Wessely S, Wright D: Development of a fatigue scale. J Psychosom Res 1993, 37:147-153.
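
To make the two scoring schemes concrete, here is a minimal sketch of Chalder Fatigue Questionnaire scoring. The response pattern is invented for illustration, and the 18-or-less Likert cut-off in the final comment is the published "normal range" threshold as I recall it:

```python
# Minimal sketch of the two Chalder Fatigue Questionnaire scoring schemes.
# Each of the 11 items is answered on a 4-point scale:
# 0 = less than usual, 1 = no more than usual,
# 2 = more than usual, 3 = much more than usual.

def cfq_scores(responses):
    """Return (bimodal, likert) totals for 11 item responses coded 0-3."""
    assert len(responses) == 11
    bimodal = sum(1 for r in responses if r >= 2)  # 0,0,1,1 scoring, range 0-11
    likert = sum(responses)                        # 0,1,2,3 scoring, range 0-33
    return bimodal, likert

# Illustrative (invented) response pattern: six items "more than usual",
# five items "no more than usual".
bimodal, likert = cfq_scores([2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1])
print(bimodal, likert)  # -> 6 17
# Likert 17 sits within the published "normal range" cut-off of 18 or less,
# yet bimodal 6 is well above the protocol's normal-fatigue threshold of 3 or less.
```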
 
It was actually in this linked rapid response: http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0

The FINE authors have since gone on to use these results in presentations, although they were never peer-reviewed. Their figures are also contradicted by the Cochrane review on exercise therapy, whose authors (after being challenged by Robert Courtney over the use of results that did not seem to have been reported elsewhere, despite the claim to have used only results from published papers) reported having access to FINE data, but said that their own analysis showed the Likert results were not significant.

Thanks for that link, @Esther12 . My FINE folder is filling up.

This playing with numbers until they tell the story you want makes me suspicious that the SF36-PF may find itself usurped by a measure that is deemed more sensitive to change, because it often stubbornly suggests little or no change in physical function.

Although this reminds me of a piece I was rereading yesterday from Collin & Crawley's 2017 paper, where they seem to be suggesting that objective measures will show more improvement than subjective:

Anecdotally, services report overall high patient satisfaction, which may appear to be at odds with a minority of patients experiencing substantial improvement. This apparent paradox may be explained in part by the difficulty of measuring long-term outcomes in a complex chronic illness [16, 25], a problem which could perhaps be addressed by using objective rather than subjective measures [26].

Objective measures haven't quite played ball so far, though, have they. (Reference 26 is a study of a small sample of Australians who did a 12 week CBT/GET/graded cognitive activity intervention and reportedly improved on objective measures of cognitive performance but not subjective.)

I find it baffling that there's no awareness of how flimsy "patient satisfaction" is. It's like going on a date and judging its success based on whether the person says "I had a nice time tonight" at the end rather than whether they call to arrange a second date, and show up for the second date, and arrange a third. Very few people end a date by saying "I don't like you. I have no intention of ever having any contact with you ever again." And they certainly wouldn't if the person they went on a date with was in charge of their healthcare and the gateway to their only source of income.
 
Comparison of protocol primary outcomes with published primary outcomes (did the switching matter?)

I thought I'd bring things together to show the impact of changing primary outcomes, using the results from the new paper. Spoiler alert: the published outcomes always look better than protocol-specified ones.

First, how the authors revised reporting of primary outcomes:
The protocol looked at the proportion of patients who improved on CFQ, SF36, or both, with a 2-3x higher improvement rate than in the control group required for a treatment to count as clinically effective.

The published version switched to measuring mean (average) score differences between whole groups (e.g. CBT vs control group), rather than how many patients improved, and so no longer required both CFQ and SF36 to improve. It also changed from a bimodal scoring of CFQ (0-11) to a 0-33 scoring system that is more sensitive to small changes.

A "clincially useful difference" was set at 0.5D, which worked out at -2 points for the CFQ and +8 points for the SF36. The smallest possible change on these scales are 1 and 5 respectively, so the new definition of "clinically useful" effectively means "more than the smallest possible positive change".

Overall results

[Chart: PACE primary outcomes, published versus protocol-specified results]

So while the protocol results showed that only half of outcomes were statistically significant (3/6), all of the published results were comfortably statistically significant. And instead of 2/6 being clinically meaningful for the protocol, most (3/4) were for the published primary outcomes.

More info in the thumbnail:
[Chart: published versus protocol-specified primary outcomes, additional detail]


Curiously, the PACE Lancet paper didn't come out and say that the effect of CBT reached a clinically useful difference on CFQ scores but not on SF36 scores. Instead, the start of the discussion section includes this:
Mean differences between groups on primary outcomes almost always exceeded predefined clinically useful differences for CBT and GET when compared with APT and SMC. In all comparisons of the proportions of participants who had either improved or were within normal ranges for these outcomes, CBT and GET did better than did APT or SMC alone
Note the eliding of the more positive secondary outcome of proportion improving, and the post hoc outcome of normal range, with the primary outcomes. More details in the box:
Results: for the primary outcomes... participants had less fatigue and better physical function after CBT and GET than they did after APT or SMC alone

It's not obvious in the original published figure, so here it is with a line showing that CBT fails to reach the threshold for a clinically useful difference (SMC is the control group, APT the pseudo-pacing one).
[Chart: PACE Lancet figure 3, annotated with a line at the clinically useful difference threshold, followed by the original unannotated figure 3]

To sum up:
  • The protocol results are much worse than those published: CBT is not an effective overall treatment for CFQ fatigue and SF36 function. GET is, but the results are weak (at 20% improvement vs 10% of controls).
  • In any case, the long-term results show that these improvements don't persist. This new paper unpicks the PACE authors' optimistic theory that the non-difference at long-term was because people had CBT/GET after the trial to 'catch up'. (The difference between groups also disappeared amongst those that did not have further treatment.)
The authors of the new paper put this nicely into context:
Wilshire and colleagues said:
It would have been perfectly acceptable first to report the protocol-specified primary outcome analysis, and then to explore the data using methods that are more sensitive to smaller effects – for example, analysis of the individual, continuous outcome measures. However, instead, the researchers chose to omit the former analysis altogether, and report only the latter.

[they were saying this in relation to the PACE authors' stated expectation that CBT/GET would outperform the control group by 5-6x but it's a similar point]
 
Here are a couple of graphs that reveal how far the PACE authors moved the goalposts for the primary outcomes, comparing the protocol primary outcome of overall improver rates from this new paper with the published improver rates and the published recovery rates.

Both definitions are based largely on self-reported fatigue and function.

Far more patients “improved” with the published improver definition compared with the protocol version.
[Chart: overall improver rates, protocol-specified versus published definitions]

The published improver rates are now a secondary outcome, but they are based on how many patients improved by a clinically useful difference on both of the revised primary outcome measures.

Surprisingly, similar rates of patients “recovered“ using the published definition as improved overall with the protocol one.

[Chart: protocol-specified overall improver rates versus published recovery rates]

Note that it won’t be all the same people in both improved and recovered groups. The improvers would include those who had an initial low score e.g. SF36=30 and substantial improvements, while the “recovered“ group will include those who had a high initial score e.g. SF36=60 (which already matches the recovery definition for function) and relatively minor improvements.

Neither group needed to improve on objective outcomes.
 
Far more patients “improved” with the published improver definition compared with the protocol version.

Surprisingly, similar rates of patients “recovered“ using the published definition as improved overall with the protocol one.

So while the protocol results showed that only half of outcomes were statistically significant (3/6), all of the published results were comfortably statistically significant. And instead of 2/6 being clinically meaningful for the protocol, most (3/4) were for the published primary outcomes.

Thanks so much for putting this information in such an easy-to-grasp format, @Simon M , really impactful. Gobsmacking stuff.

And this is the clincher, really, isn't it:

Neither group needed to improve on objective outcomes.
 
I agree with this. My belief is that they overrode the protocol by getting the stats plan approved (and the stats plan doesn't discuss why changes were made). My guess is that they never got explicit approval for the protocol changes, but their claims are based on approvals of the stats plan. We've been blocked from knowing what happened in the committees (an information tribunal decided to keep the minutes private), so we will never know. But why else be so sensitive about them?
Whether or not their 'committee' gave them approval is neither here nor there from the perspective of evaluating the research. If you change what you specify in the protocol, you need a good reason, one that's a whole lot better than 'our mates on the committee agreed so it was okay'.

All that matters from the point of view of the science is that they changed various outcomes and analyses they promised to do in the trial protocol, and that these changes are not scientifically justified - whoever did or did not agree with them at the time is entirely irrelevant.
 
Were the participants of the PACE trial informed about the recovery criteria? If so, were they warned that the outcomes were changed mid-trial?
No, and that is something for people concerned with ethics and patients' rights. From the point of view of the science, which was the only thing under scrutiny in the current paper, this has little bearing.
 
To sum up: using the stats plan correction method, GET had a statistically significant effect on the overall improvement rate and was at the bottom end of a "clinically important difference". CBT had no stat sig effect overall.

CBT had a sig, clinically important effect on CFQ alone; GET had a sig effect on SF36 score but it wasn't clinically important.

You could add that even the GET result means you have to treat 10 patients to get one overall improver, and other analysis in the paper shows that these improvements vs no treatment don't last.
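
For anyone who wants the arithmetic behind that "treat 10 to get one improver" figure, here is a minimal sketch using the rounded improver rates quoted above:

```python
# The arithmetic behind the "treat 10 patients to get one extra overall improver"
# figure, using the rounded improver rates quoted above.

improver_rate_get = 0.20  # ~20% overall improvers with GET (protocol definition)
improver_rate_smc = 0.10  # ~10% with standard medical care alone

risk_difference = improver_rate_get - improver_rate_smc  # 0.10
nnt = 1 / risk_difference
print(nnt)  # -> 10.0 patients treated per additional overall improver
```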
=====

Assuming I've got my numbers right, that's the key point. It gets a bit more complicated looking at the protocol.

The protocol specifies 6 contrasts but no method for statistical correction. They would have had to choose something, but would have had a choice of methods. Bonferroni is the strictest option and the most obvious choice, but there are other methods, which might make the GET overall result significant. I'd appreciate @Carolyn Wilshire's opinion on this, both for accuracy and plausibility. (See quote box for a boring exploration of this.)
Yes, in fact we present this analysis in the paper, and we obtained the same results as you did. We actually corrected for 5 AND for 6 total comparisons (because the protocol specifies 6 and the stat plan specifies 5), and presented results for both. The conclusions were the same either way, and accorded with yours.

The problem with using the stats plan's Bonferroni approach is that the plan has been criticised for being late and changing the outcome measures so it can't be taken as definitive (hey, maybe they went ultra-strict to make up for watering down the outcomes). Certainly it's reasonable to apply Bonferroni but it might be reasonable to use other approaches too.

One option would be the Bonferroni-Holm method, which applies successively easier thresholds to each p value, starting at the fully-corrected one for the smallest p value and ending up with just p<0.05. But by my calculation, this method would make the GET overall result significant along with the GET SF36 one, but none of the CBT results.
I'm not sure who suggested this, but yes, Bonferroni is pretty conservative and FDR (False discovery rate aka Bonferroni-Holm) is more lenient, and probably slightly preferable. The main results might have just passed FDR threshold.

Bonferroni is the only method of correction described or used in any of the PACE papers, including the stats plan. So I think it's a reasonable assumption that that's what they would have used. I agree there's not much in it - the results are borderline, and they get through based on some thresholds not others. The truth is probably that people self-rated a little better on CFQ and/or SF36 physical function scale after GET and CBT. But there is some value in showing that the results are less impressive than they appear after the outcome switch. It showed that researcher outcome selection introduced a source of bias into the study.
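
For readers unfamiliar with the difference, here is a minimal sketch contrasting plain Bonferroni with the Holm step-down procedure, using hypothetical p values rather than the PACE data:

```python
# Minimal sketch contrasting plain Bonferroni with the Holm step-down procedure.
# The p values are hypothetical, not taken from the PACE data.

def bonferroni(p_values, alpha=0.05):
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        # Threshold relaxes from alpha/m (smallest p) up to alpha (largest p).
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # once one test fails, all larger p values fail too
    return significant

p = [0.004, 0.012, 0.015, 0.030, 0.200]
print(bonferroni(p))  # [True, False, False, False, False]  (flat threshold 0.01)
print(holm(p))        # [True, True, True, False, False]
```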
 
Thanks for these replies.
Yes, in fact we present this analysis in the paper, and we obtained the same results as you did. We actually corrected for 5 AND for 6 total comparisons (because the protocol specifies 6 and the stat plan specifies 5), and presented results for both. The conclusions were the same either way, and accorded with yours.
Maybe my post wasn't clear - I'm quoting the figures from your paper. The NNT was my own, and I thought the protocol wasn't clear as to whether or not the "clinically important difference" measure applied to just the overall improvers, or also applied to improvers on CFQ or SF36 alone.

Your figures give a p value of 0.010 for GET overall, which I'm assuming does pass the Stat plan version (i.e. marginally under 0.010) but not the protocol version (6 contrasts, + Bonferroni as per stats plan).


I'm not sure who suggested this, but yes, Bonferroni is pretty conservative and FDR (False discovery rate aka Bonferroni-Holm) is more lenient, and probably slightly preferable. The main results might have just passed FDR threshold.
I asked, because I thought it's just the sort of defence that the PACE authors might advance. But I wasn't sure if using FDR, or other such less-conservative measures than Bonferroni, was normal on such a small number of comparisons: I thought you would only consider that for > 10 comparisons, but I have little experience and was hoping you could throw some light on it.

Also, I realise now that the 6 comparisons of the protocol include the APT ones and without those I couldn't apply the Bonferroni-Holm correction properly myself.

I agree there's not much in it - the results are borderline, and they get through based on some thresholds not others. The truth is probably that people self-rated a little better on CFQ and/or physical function after GET and CBT. But there is some value in showing that the results are less impressive than they appear after the outcome switch. It showed that researcher outcome selection introduced a source of bias into the study.
Well, that's the huge value of the paper! The paper talks about "statistically unreliable results and modest effects" - I was just trying to get at what we can say in the most concrete terms that lay people might understand. And I thought that using the more generous stat plan version was less open to any defence from the PACE authors.

As you say in the conclusion, these are "modest effects" on self-report measures. Though there appears to be no overall effect of CBT even on self-report, and only a modest overall effect of GET.

By the way, the link to the PACE primary outcomes on the Wolfson site doesn’t work any more (ref 22), at least not today.
 
When I read about the PACE trial and its illusory outcomes, I always come back to the one thing I know, like I really really know ...

I first met my wife in 1977, and we have been together ever since; she went down with ME in 2006. She has always had a very strong drive to do as much as she is capable of, physically and mentally, and she has never ever lost that. We used to do lots of walking in the country, my wife loves gardening, she does quilting and is currently doing a distance-learning course on it, having already done a good many previously. My wife is very self-motivated, and as physically fit as her ME allows her to be (I cannot imagine she is deconditioned to any significant degree, if at all).

So what is the discrepancy between what I know must be true, and what the PACE authors claim to be true? For me it always comes back to the one cardinal issue we all know and agree on - the discrepancy between subjective versus objective outcomes, and the PACE authors' pathological faith in the former and rejection of the latter. This tells me, intuitively, that the difference is very significant. And the PACE data mean the science tells us the same, no matter what the authors might try to hide behind.

I think if my wife had participated in the PACE trial, it is highly likely she would have self-reported much the same as others. Why? Because the person she would have felt she was most letting down, if she didn't demonstrate the will to live up to the goals and expectations she had been convinced to set for herself, would have been - herself. It would not simply have been about letting others down but letting herself down. And I've a feeling the PACE investigators may have exploited that personality trait in the participants.

I know we clarify here that the outcomes under discussion are self-reported, but I'd hate for us to lose sight of how very significant that is.
 
My concern is using the 6 contrasts of the protocol and applying the Bonferroni correction from the stats plan - that kind of mix and match approach might have a defence against it - hence my question to @Carolyn Wilshire (@Tom Kindlon).

Protocol with stats plan correction for multiple comparisons says:
"CBT has no overall effect, GET does and reaches the threshold for clinical importance (but the effect doesn't persist long-term) and CBT has an effect on fatigue only - again on the margin of clinical importance."

Also, you need to treat 10 patients to get one overall GET improver, and the overall improver rates are low: 10% for no treatment, 20% for GET (self-report scoring, not real improvement).
The paper reports results based on both models - 6 comparisons versus 5 comparisons. So readers can actually see what difference this made.

But can I just say again the protocol is the protocol is the protocol. It's fine to produce a stats plan that elaborates on the protocol, but that stats plan cannot contradict the protocol in any way. Well, because it's not the protocol.
 
Your figures give a p value of 0.010 for GET overall, which I'm assuming does pass the Stat plan version (i.e. marginally under 0.010) but not the protocol version (6 contrasts, + Bonferroni as per stats plan).
Yes, Bruce Levin was very emphatic on the point that values that equal the critical value of p count as significant. I looked it up, and this is definitely correct.
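
A minimal sketch of that boundary case, using the p value quoted above and the two candidate Bonferroni corrections (5 contrasts per the stats plan, 6 per the protocol):

```python
# The boundary case discussed above: with a Bonferroni correction the critical
# value is alpha / number_of_comparisons, and a p value exactly equal to the
# critical value counts as significant (p <= threshold, not p < threshold).

p_get_overall = 0.010  # GET overall improvement p value, as quoted above

for n_comparisons in (5, 6):  # stats plan specifies 5 contrasts; protocol specifies 6
    threshold = 0.05 / n_comparisons
    print(n_comparisons, round(threshold, 4), p_get_overall <= threshold)
# 5 comparisons: threshold 0.01   -> significant (exactly equal counts)
# 6 comparisons: threshold 0.0083 -> not significant
```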
 
I find it baffling that there's no awareness of how flimsy "patient satisfaction" is. It's like going on a date and judging its success based on whether the person says "I had a nice time tonight" at the end rather than whether they call to arrange a second date, and show up for the second date, and arrange a third. Very few people end a date by saying "I don't like you. I have no intention of ever having any contact with you ever again." And they certainly wouldn't if the person they went on a date with was in charge of their healthcare and the gateway to their only source of income.
This is brilliant, @Evergreen!:laugh:
 
It's like going on a date and judging its success based on whether the person says "I had a nice time tonight" at the end rather than whether they call to arrange a second date, and show up for the second date, and arrange a third. Very few people end a date by saying "I don't like you. I have no intention of ever having any contact with you ever again." And they certainly wouldn't if the person they went on a date with was in charge of their healthcare and the gateway to their only source of income.

We are not even friends on benefits with them.
 
I asked, because I thought it's just the sort of defence that the PACE authors might advance. But I wasn't sure if using FDR, or other such less-conservative measures than Bonferroni, was normal on such a small number of comparisons: I thought you would only consider that for > 10 comparisons, but I have little experience and was hoping you could throw some light on it.
I've only seen FDR used in the context of many tests (although I admit I might not have paid much attention to what researchers use in areas outside my own). FDR is very common in neuroscience work - where you perform thousands of tests, so correcting appropriately - neither too much nor too little - really matters. But lately there have been calls there that FDR is too lenient and controls poorly for false positives. People are now recommending permutation thresholding as the gold standard there (no-one is recommending Bonferroni in that field, it is considered too conservative when applied over so many tests, and likely to miss genuine effects).

The other consideration here is that this is a clinical trial. So its purpose is different from just advancing knowledge; its purpose is to demonstrate treatment effectiveness. So you would want to choose a correction method that is bulletproof to the possibility that some of your results might have been false positives. This is just the situation where you might choose Bonferroni. Then no one gets on your case about whether your outcomes are real or not.
Also, I realise now that the 6 comparisons of the protocol include the APT ones and without those I couldn't apply the Bonferroni-Holm correction properly myself.
I don't have access to any stats software right now, but I've attached a file with the raw data in it for overall improvement rates (protocol-specified definition). It has two data columns, depending upon whether you go for intention-to-treat (counting drop-outs as non-improvers) or available cases (excluding dropouts from the analysis).
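
In case it helps anyone working with the file, here is a purely hypothetical sketch of the two drop-out conventions; the invented records below are for illustration only and the attached file's actual layout may differ:

```python
# Hypothetical illustration of the two ways of handling drop-outs described
# above. The per-patient records are invented; the attached file's actual
# layout may differ.

patients = [
    {"improved": True,  "dropped_out": False},
    {"improved": False, "dropped_out": False},
    {"improved": False, "dropped_out": False},
    {"improved": None,  "dropped_out": True},   # outcome missing at follow-up
]

# Intention-to-treat: drop-outs stay in the denominator and count as non-improvers.
itt_rate = sum(1 for p in patients if p["improved"]) / len(patients)

# Available cases: drop-outs are excluded from numerator and denominator alike.
available = [p for p in patients if not p["dropped_out"]]
available_rate = sum(1 for p in available if p["improved"]) / len(available)

print(itt_rate, available_rate)  # 0.25 versus roughly 0.33
```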

By the way, the link to the PACE primary outcomes on the Wolfson site doesn’t work any more (ref 22), at least not today.
How very interesting!
 
