Esther12
The link still works for me. Anyway, here it is: View attachment 1744
Thanks Tom. That link now works for me too, when yesterday it had some error message instead of the image. Maybe it had just glitched out?
Seems like there were: I think it was 13% who would have met one of the criteria at entry (specifically the SF-36 or CFQ ones). Clearly no one met the CGI ones, as this is a 'how much better do you feel after the trial' measure, or, if you didn't tell them, 'how much better did the assessor think you felt after the trial'.
The Oxford criteria one is just strange in the way they introduced the thresholds, which change quite a lot. I can't remember whether there were non-Oxford participants who were also able to meet the trial criteria at the end.
I'm bravely/recklessly going to try to help (the first part of this post, more complex stuff further down). It's simplest if we use the correction method specified in the stats plan: 5 contrasts and the Bonferroni method of correction. Here's a handy summary of the results, showing what is statistically significant.

Does this mean that we can/can't say "using the prespecified analysis for the trial's primary outcome there was no significant treatment effect for [CBT and/or GET]"?
Or do we once again have to deal with niggling complexities which prevent a nice simple statement (ideally one suited to those of us not used to discussing Bonferroni correction)?
So here are the protocol results. The problem with using the stats plan's Bonferroni approach is that the plan has been criticised for being late and for changing the outcome measures, so it can't be taken as definitive (hey, maybe they went ultra-strict to make up for watering down the outcomes). It's certainly reasonable to apply Bonferroni, but it might be reasonable to use other approaches too.
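For anyone not used to the mechanics: with five contrasts, plain Bonferroni just divides the 0.05 threshold by five, so each contrast is tested at p < 0.01. A minimal sketch in Python, with made-up p values purely for illustration (not the trial's actual figures):

```python
# Plain Bonferroni: with 5 contrasts, each one is tested at 0.05 / 5 = 0.01.
# These p values are invented placeholders, not results from the trial.
alpha = 0.05
p_values = {"CBT overall": 0.030, "CBT CFQ": 0.020, "CBT SF-36": 0.200,
            "GET overall": 0.008, "GET SF-36": 0.005}

threshold = alpha / len(p_values)  # 0.05 / 5 = 0.01
for contrast, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{contrast}: p = {p:.3f} -> {verdict} at corrected threshold {threshold}")
```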
One option would be the Bonferroni-Holm method, which applies successively easier thresholds to each p value, starting at the fully-corrected one for the smallest p value and ending up with just p<0.05. But by my calculation, this method would make the GET overall result significant along with the GET SF36 one, but none of the CBT results.
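And a sketch of that step-down procedure, since it's less familiar than plain Bonferroni (again with invented p values):

```python
# Holm-Bonferroni: sort the p values; the smallest is tested at alpha/m (the
# fully corrected threshold), the next at alpha/(m-1), and so on up to
# alpha/1 = 0.05 for the largest.
def holm(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one test fails, all larger p values fail too
    return rejected

# Invented example: only the smallest p value survives the step-down.
print(holm([0.008, 0.013, 0.021, 0.045, 0.200]))
# [True, False, False, False, False]
```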
do we once again have to deal with niggling complexities which prevent a nice simple statement
That would seem to support the idea that "improvements" are due to bias built into the therapies. CBT is about denying your fatigue, so fatigue questionnaire scores improve. GET is about believing you can do more, so physical functioning questionnaire scores improve. The effect on the SF36-PF was even more pronounced with the Lightning Process, which is extremely heavy-handed in pushing patients to believe that they can do more.

CBT has a stat sig effect on the rate of fatigue improvers, but not SF36. Conversely, GET had a stat sig effect on SF36 improvers, but not on fatigue.
Or a no, if we go with the Stats plan approach.

So that would be a 'yes' [to niggling complexities re Bonferroni] then?
I'm not sure that argument would impress a neutral. Saying it's the "margins of clinical importance" might be a better approach.

I also liked Wilshire's point that 2x is not 'between' 2x and 3x, but it also feels a bit cheeky to use that in a debate (unless they first try to claim that they had reached this prespecified criterion for clinical significance).
If you look at the %age improvements below, that's true of CBT, but both CFQ and SF36 improve for GET; the CFQ just falls below the margin for significance. As Carolyn said, they would need to test the improvement in fatigue vs the improvement in SF36, which they didn't, and it probably would not be significant.

That would seem to support the idea that "improvements" are due to bias built into the therapies. CBT is about denying your fatigue, so fatigue questionnaire scores improve. GET is about believing you can do more, so physical functioning questionnaire scores improve.
And after other critical papers, like Wiborg and FINE.
The increase on the SF-36 physical function for the pragmatic rehab arm was 29.84 to 43.27, i.e. about 13, not 3.

In FINE, by 70 weeks the pragmatic rehab arm had changed by about 3 points on the SF-36 PF, the supportive listening arm by about 5, and the GP treatment-as-usual arm by about 10 points.
Those scores you quote for FINE are for bimodal scoring, not the 33-point CFQ (I've added a quick sketch of the quoted thresholds after the quote below).

In Table 1 of your paper you show that PACE changed their definition of improvement to "At least an 8 point increase in the 100-point SF-36 physical function scale" and "At least a 2 point decrease on the 33-point CFQ".
[..]
In FINE, by 70 weeks the pragmatic rehab arm was the only one to change by almost 2 points on the Chalder Fatigue Scale; the other two arms changed by less than 1 point.
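To make the quoted thresholds concrete, here's a minimal sketch. The scores are invented, and I'm assuming improvement requires meeting both criteria, which the quoted snippet itself doesn't actually specify:

```python
# Sketch of the revised PACE improvement thresholds quoted above.
# Assumes both criteria must be met (an assumption); example scores are invented.
def pace_improver(sf36_before, sf36_after, cfq_before, cfq_after):
    sf36_improved = (sf36_after - sf36_before) >= 8  # 100-point SF-36 PF scale
    cfq_improved = (cfq_before - cfq_after) >= 2     # 33-point Likert CFQ
    return sf36_improved and cfq_improved

print(pace_improver(35, 45, 28, 25))  # True: +10 on SF-36, -3 on CFQ
```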
Just to point out: in this case "Wilshire" just refers to a point in this thread, not a point in the paper itself.

Esther12 said: I also liked Wilshire's point that 2x is not 'between' 2x and 3x, but it also feels a bit cheeky to use that in a debate (unless they first try to claim that they had reached this prespecified criterion for clinical significance).
I'm not sure that argument would impress a neutral. Saying it's the "margins of clinical importance" might be a better approach.
UPDATE: By my calculation, while the overall effect of GET is between two and three, that for CBT on CFQ is fractionally below 2 (as is the non-significant effect of CBT overall).
You're absolutely right, a typo; I've edited my post to correct this. Thanks!

The increase on the SF-36 physical function for the pragmatic rehab arm was 29.84 to 43.27, i.e. about 13, not 3.
Again, you're right, and I've edited my post to point this out. Thanks!

Those scores you quote for FINE are for bimodal scoring, not the 33-point CFQ.
I think the issue is that the threshold on the Likert scale wasn't directly comparable to the threshold on the bimodal scale. The result was that it did lower the bar a bit for recovery, though I'm not sure how much practical effect that had by itself. I can take a look at my copy of the data set later if there's no definitive answer posted yet (it might have been addressed in one of the publications).

And does the FOIA dataset indicate that findings would have been different if the bimodal scoring of CFQ had been retained as a primary efficacy measure?
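For anyone unfamiliar with the two scoring systems: each of the 11 CFQ items has four response options, and both totals are derived from the same answers. A minimal sketch, with invented item responses:

```python
# Each of the 11 CFQ items has four response options, coded 0-3 here.
responses = [2, 3, 1, 2, 2, 3, 0, 1, 2, 2, 3]  # invented for illustration

# Likert scoring (0,1,2,3): sum the raw codes, giving a 0-33 scale.
likert_total = sum(responses)

# Bimodal scoring (0,0,1,1): the two lower options score 0, the two higher
# options score 1, giving a 0-11 scale.
bimodal_total = sum(1 for r in responses if r >= 2)

print(likert_total, bimodal_total)  # 21 and 8 for the responses above
```

Because several different Likert totals collapse onto the same bimodal total, a threshold on one scale has no exact equivalent on the other, which is why the two thresholds aren't directly comparable.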
They made the change because the FINE trial had found null results with the bimodal scoring but significant results in a post-hoc analysis using Likert scoring. They didn't mention that reasoning in PACE, of course. It made no sense not to provide both analyses, since they were already providing the Likert as a secondary analysis anyway. They obviously figured they might get a significant finding by switching to the Likert, and then they could hide the bimodal finding if it turned out to provide null results, like it did in FINE. They've never provided a satisfactory answer to why they did this and never, as far as I've seen, acknowledged that they did it specifically because they saw the FINE findings. That, of course, would have required them to mention FINE and point out that it basically had null results. They managed not to mention that anywhere in PACE as well.
The result was that it did lower the bar a bit for recovery, though I'm not sure how much practical effect that had by itself. I can take a look at my copy of the data set later if there's no definitive answer posted yet (it might have been addressed in one of the publications).
I can’t see any mention of a post-hoc analysis using Likert in the FINE trial paper itself – am I missing something (I do skip over things thanks to brain fog) or was this mentioned somewhere else? Or do we know this because someone has the FINE data and has done the post-hoc analysis? (I did note this line in the FINE trial paper: “Data sharing: We will be happy to make our dataset available to researchers, once we have finished reporting our findings. Please contact the corresponding author.”)
I haven't understood that point from the Cochrane reviewers: why are they saying the Likert findings are not significant when the post-hoc analysis from the FINE team said they were? I haven't looked closely at the numbers, but where is that contradiction coming from?
"Bimodal versus Likert scoring in Wearden et al. 2010
To enable pooling of as many studies as possible in a mean difference meta-analysis, we used the 33-point scale results reported by Wearden. You suggest that the decision to use the 33-point fatigue scores in our analysis may bias the results because there is no statistically significant difference in the 11-point data at 70 weeks. This statement suggests that there is a statistically significant difference when using the 33-point data, but if you look into analysis 1.2 that is not the case. At 70 weeks we report MD -2.12 (95% CI -4.49 to 0.25) for the FINE trial, i.e. not statistically significant."
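As a side note on reading those figures: a 95% CI that crosses zero implies p > 0.05, and you can roughly back-calculate the p value from the MD and CI using a normal approximation (a standard trick for reading meta-analysis output):

```python
# Rough back-calculation of a two-sided p value from a mean difference and its
# 95% CI, assuming a normal approximation.
from math import erf, sqrt

def p_from_ci(md, lower, upper):
    se = (upper - lower) / (2 * 1.96)  # standard error from the CI width
    z = abs(md) / se                   # z statistic
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided p

# The 70-week FINE figure quoted above: MD -2.12 (95% CI -4.49 to 0.25)
print(round(p_from_ci(-2.12, -4.49, 0.25), 3))  # ~0.08, i.e. not significant
```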
"Query re use of post-hoc unpublished outcome data: Scoring system for the Chalder fatigue scale, Wearden 2010.
I would like to highlight what appears to be a discrepancy within the Cochrane review [1] with respect to the analysis of data from Wearden 2010 [2,3].
Throughout the Cochrane review (please see details below), the impression is given that only protocol-defined and published data or outcomes were used for the Cochrane analysis of the Wearden 2010 study.
However, this does not appear to be the case: to the best of my knowledge, instead of using protocol-defined or published data, the Cochrane analyses of fatigue for the Wearden 2010 study appear to have used an alternative, unpublished set of data.
The relevant analyses of fatigue in the Cochrane review are: Analyses: 1.1, 1.2, 2.1 and 2.3. Each of these analyses states that the “0,1,2,3” scoring system was used for the Chalder fatigue questionnaire. This scoring system is known as the Likert scoring system and uses a fatigue scale of 0-33 points.
However, to the best of my knowledge, data or analyses using this scoring system were not proposed in the Wearden 2010 trial protocol [3], were not included in Wearden 2010 [2], and have not previously been formally (i.e. via peer review) published by Wearden et al. A post-hoc analysis using this data was informally released by Wearden et al. as a BMJ Rapid Response comment [4].
In the Cochrane review, the analyses using the 0,1,2,3 scoring system contradict text within the section “Characteristics Of Studies”, in relation to Wearden 2010: Under “Outcomes”, it is stated that Chalder fatigue was measured using the 0,0,1,1 scoring system using a scale from 0-11 points: “Fatigue (Fatigue Scale, FS; 11 items; each item was scored dichotomously on a 4-point scale (0, 0, 1 or 1)”.
Wearden 2010 pre-specified Chalder fatigue questionnaire scores as a primary outcome at 70 weeks, and as a secondary outcome immediately after treatment at 20 weeks. The scoring, in both cases, used the 0,0,1,1 system, with a scale of 0-11. This scoring system was described both in the trial protocol [3] and the main results paper published in 2010 [2].
The Likert (0,1,2,3) scoring system was neither proposed in the trial protocol nor formally published, and so the Likert scores should be considered post-hoc. Even if it is argued that the Chalder fatigue questionnaire (irrespective of the scoring system) was pre-defined as a primary outcome measure, data using the Likert scoring system was neither proposed nor published, and so the data itself, and any outcome analyses based on it, must surely be considered post-hoc.
Simply changing a scoring system may, at first glance, appear not to be a significant or major adjustment; however, we do not know what difference it made, because a sensitivity analysis has not been published.
I cannot find any explanation within the Cochrane review of why pre-defined, published data have been replaced with an unpublished, post-hoc set of data.
Is it normal practice for a Cochrane meta-analysis to selectively ignore the pre-defined primary outcome data for a trial, and to selectively include and analyse post-hoc data? I wonder if some clarity could be shed on this situation?
I suggest that the post-hoc data are replaced with the original published data. Otherwise, the post-hoc data should be clearly labelled as such and the risk of bias analysis amended accordingly; and an explanation should be included in the review explaining why an apparently adequate pre-defined set of data has been replaced with an apparently novel set of post-hoc data.
Also, I suggest that the discrepancies I outline below should be corrected where necessary: either the analyses (1.1, 1.2, 2.1 and 2.3) should be amended, or the description of the data should be amended so that it is not incorrectly labelled as protocol-defined, published data with a "low risk" of bias.
Discrepancies within the text of the Cochrane Analysis.
Please note that all page numbers used below are pertinent to the current version (version 4) of the Cochrane review in PDF format.
1. On page 28 of the Cochrane review [1], in section “Potential biases in the review process”, under the heading “Potential bias in the review process”, in relation to the review in general, it is stated that: "For this updated review, we have not collected unpublished data for our outcomes..." However, as explained above, this is not the case for the Wearden 2010 fatigue data, for which unpublished data has been used in the Cochrane analysis.
2. On page 45 of the review, in section “Characteristics Of Studies”, specifically in relation to Wearden 2010 [2,3], it is stated that only protocol-defined outcomes were used: "all relevant outcomes are reported in accordance with the protocol". "Selective reporting (reporting bias)" is rated as "low risk". However, as explained above, this is not the case, because the Wearden 2010 fatigue data (used in the Cochrane analysis) was not proposed in the protocol. If the data is post-hoc, then the “low-risk” category will need to be revised.
3. On page 44 of the review, in section “Characteristics Of Studies”, in relation to Wearden 2010 [2,3], under “Outcomes”, it is stated that Chalder fatigue was measured using the 0,0,1,1 scoring system using a scale from 0-11 points: “Fatigue (Fatigue Scale, FS; 11 items; each item was scored dichotomously on a 4-point scale (0, 0, 1 or 1)”. Wearden 2010 did indeed use the 0,0,1,1 scoring system for the Chalder fatigue scale: This scoring system was proposed in the trial protocol and published with the main outcome data in Wearden 2010. However, as explained above, this scoring system has not been used in the Cochrane analysis.
4. If, after any amendments to the review, figures 2 and 3 also contain discrepancies, they should be amended accordingly.
There may be other related discrepancies and inaccuracies in the text that I haven’t noticed.
I thank the Cochrane team in advance for giving this submission careful consideration, and for making amendments to the analysis, and providing explanations, where appropriate. I hope you will agree that clarity, transparency and accuracy in relation to the analysis is paramount.
References:
1. Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. Exercise therapy for chronic fatigue syndrome. Cochrane Database Syst Rev. 2016; CD003200.
2. Wearden AJ, Dowrick C, Chew-Graham C, et al. Nurse led, home based self help treatment for patients in primary care with chronic fatigue syndrome: randomised controlled trial. BMJ. 2010; 340:c1777.
3. Wearden AJ, Riste L, Dowrick C, et al. Fatigue Intervention by Nurses Evaluation – The FINE Trial. A randomised controlled trial of nurse led self-help treatment for patients in primary care with chronic fatigue syndrome: study protocol. BMC Med. 2006; 4:9.
4. Wearden AJ, Dowrick C, Chew-Graham C, et al. Fatigue scale. BMJ Rapid Response. 2010. http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0 (accessed April 16, 2016).
Dear Robert Courtney,
Thank you for your detailed comments on the Cochrane review ’Exercise Therapy for Chronic Fatigue Syndrome’. We have the greatest respect for your right to comment on and disagree with our work.
We take our work as researchers extremely seriously and publish reports that have been subject to rigorous internal and external peer review. In the spirit of openness, transparency and mutual respect we must politely agree to disagree.
The Chalder Fatigue Scale was used to measure fatigue. The results from the Wearden 2010 trial show a statistically significant difference in favour of pragmatic rehabilitation at 20 weeks, regardless of whether the results were scored bimodally or on a scale from 0 to 3. The effect estimate for the 70-week comparison with the scale scored bimodally was -1.00 (95% CI -2.10 to 0.11; p = 0.076), and -2.55 (95% CI -4.99 to -0.11; p = 0.040) for 0,1,2,3 scoring. The FINE data measured on the 33-point scale was published in an online rapid response after a reader requested it. We therefore knew that the data existed, and requested clarifying details from the authors so that we could use the estimates in our meta-analysis. In our unadjusted analysis the results were similar for the scale scored bimodally and the scale scored from 0 to 3, i.e. a statistically significant difference in favour of rehabilitation at 20 weeks and a trend in favour of pragmatic rehabilitation at 70 weeks that does not reach statistical significance. The decision to use the 0,1,2,3 scoring does not affect the conclusion of the review.
Regards,
Lillebeth Larun