5) Fatigue post-treatment should be rated as low instead of moderate quality evidence
The other two factors GRADE uses to downgrade the quality of evidence are inconsistency and imprecision. I would like to look at these more closely because they are at the heart of David Tovey’s argument that the outcome for fatigue post-treatment should be rated low instead of moderate quality. In an email of 27 May, Tovey wrote: “I can see three possible reasons for a downgrade: lack of blinding/subjective outcomes, imprecision, and inconsistency, so the conclusion that this is moderate certainty evidence seems indefensible to me, and as we know, I am not alone in this.”
Inconsistency refers to an unexplained heterogeneity of results. The GRADE handbook writes: “Criteria to determine whether to downgrade for inconsistency can be applied when results are from more than one study and include:
- Wide variance of point estimates across studies (note: direction of effect is not a criterion for inconsistency)
- Minimal or no overlap of confidence intervals (CI), which suggests variation is more than what one would expect by chance alone
- Statistical criteria, including tests of heterogeneity which test the null hypothesis that all studies have the same underlying magnitude of effect, have a low p value (p<0.05) indicating to reject the null hypothesis.”
It also refers to the I2 statistic, a measure of heterogeneity. An I2 of 30-60% is seen as moderate heterogeneity, 50-90% as substantial, and 75-100% as ‘considerable heterogeneity’.
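As a quick illustration of how I2 relates to the heterogeneity test: I2 is derived from Cochran’s Q statistic and the degrees of freedom (number of studies minus one). The values below are chosen purely for illustration and are not taken from the review, although with 8 studies a Q of 35 happens to yield the 80% figure reported there.

```python
def i_squared(q, df):
    """Higgins' I^2 (in percent) from Cochran's Q and degrees of freedom.

    df = k - 1, where k is the number of studies in the meta-analysis.
    I^2 is truncated at 0 when Q is no larger than expected by chance.
    """
    if q <= df:
        return 0.0
    return 100.0 * (q - df) / q

# Illustrative values only: 8 studies (df = 7) and Q = 35 give I^2 = 80%.
print(i_squared(35.0, 7))  # 80.0
```

With Q close to its expected value under homogeneity (Q ≈ df), I2 drops to zero, which is why a low Q p-value and a high I2 tend to go together.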
The forest plot below shows the analysis of fatigue post-treatment.

As one can see, I2 is 80%, which indicates considerable heterogeneity. The heterogeneity test clearly rejects the null hypothesis that all studies have the same underlying magnitude of effect. There is a wide variance of point estimates, going from an SMD of -0.27 in the FINE trial to -1.52 in the trial by Powell et al. 2001. That trial shows some overlap with the small studies of Fulcher and Moss-Morris, but not when compared to the other large studies.
So I think we can say there is large heterogeneity in this review for the outcome of fatigue post-treatment. The authors acknowledged this. In the results section they write: “The analysis suffered from considerable heterogeneity (I2 = 80%, P < 0.0001) that we explored in sensitivity analysis.”
For downgrading the quality of evidence, heterogeneity has to be unexplained. GRADE advises doing sensitivity analyses to check whether the heterogeneity can be explained by differences in populations, interventions, outcomes or study methods. In the Cochrane review the heterogeneity did not remain unexplained: it was mostly caused by the trial by Powell et al. 2001. Excluding this trial leads to an acceptable level of heterogeneity (I2 = 26%, P = 0.24).
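To make concrete what “mostly caused by one trial” means, here is a minimal sketch of pooling SMDs with and without an extreme study. The SMDs and standard errors below are invented for illustration, not the values from the review, and the sketch uses simple fixed-effect inverse-variance weighting rather than the random-effects model the review used.

```python
def pooled_smd(studies):
    """Fixed-effect inverse-variance pooled SMD.

    studies: list of (smd, standard_error) tuples.
    Each study is weighted by 1/SE^2, so large, precise studies dominate.
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    return sum(w * smd for (smd, _), w in zip(studies, weights)) / sum(weights)

# Hypothetical studies: three moderate estimates and one large outlier.
studies = [(-0.27, 0.18), (-0.43, 0.20), (-0.52, 0.22), (-1.52, 0.25)]

with_outlier = pooled_smd(studies)
without_outlier = pooled_smd(studies[:-1])  # drop the extreme estimate
```

In a small meta-analysis, one heavily weighted extreme study can pull the pooled estimate a long way; removing it here visibly shrinks the pooled effect, which mirrors the shift from -0.66 to -0.44 described in the text.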
In their justification for not downgrading for inconsistency, Larun et al. write:
we chose not to downgrade because all studies gave the same direction and because the observed heterogeneity (80%) was mainly caused by a single outlier. The estimate remains consistent with a non-zero effect size (SMD −0.44; 95% CI -0.63 to -0.24) also when the outlier is excluded.
GRADE however explicitly says that
“differences in direction, in and of themselves, do not constitute a criterion for variability in effect if the magnitude of the differences in point estimates is small.” It doesn’t matter whether the effect sizes are all on the same side of the border between a positive and a negative effect for an intervention; that border is arbitrary in statistical terms. What matters is the size of the difference. Similarly, the GRADE handbook doesn’t say that heterogeneity is not an issue if it is due to only one outlier. What matters is whether that outlier significantly alters the results. In a review with only 8 RCTs, most of which are really small, one big study can have a large impact on the results. And that seems to be the case here: if the trial by Powell et al. is removed, the effect size for fatigue post-treatment is reduced by a third, from an SMD of -0.66 to an SMD of -0.44.
In the email correspondence the authors argue that an SMD of -0.44 is still a moderate effect, so this doesn’t change much. But the terms ‘small’, ‘moderate’ and ‘large’ for effect sizes are arbitrary, and there are different rules for applying them (Jacob Cohen, who introduced these terms, said an SMD of 0.5 should be considered moderate). These are just names to help interpret statistical data; they do not play an important role in the GRADE handbook. More important is that excluding the outlier reduces the effect size by a third. So I think there’s an argument for downgrading the quality of evidence for fatigue post-treatment for inconsistency.
.................................................................................................................................................................................................................................
Imprecision is the fifth reason for downgrading the quality of evidence in GRADE. Results are imprecise when studies include relatively few patients and few events and thus have a wide confidence interval (CI) around the estimate of the effect. So one can have a moderate SMD that indicates that the intervention reduces fatigue, but if the confidence interval is wide and includes values that suggest the intervention doesn’t work at all, the quality of evidence can be downgraded for imprecision.
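The link between study size and confidence interval width can be made concrete with the common large-sample approximation for the standard error of an SMD. The sample sizes below are hypothetical; only the point estimate of -0.66 is taken from the text.

```python
import math

def smd_ci(d, n1, n2, z=1.96):
    """Approximate 95% confidence interval for an SMD (Cohen's d).

    Uses the standard large-sample variance approximation:
    var(d) = (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
    """
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Same point estimate, very different precision:
small_trial = smd_ci(-0.66, 30, 30)    # e.g. 30 patients per arm
large_trial = smd_ci(-0.66, 150, 150)  # e.g. 150 patients per arm
```

With 30 patients per arm the interval stretches from a large effect to a trivial one; with 150 per arm it is far narrower. That is exactly the imprecision problem: the point estimate alone says little when few patients were studied.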
David Tovey argued that this is the case for fatigue post-treatment in the review on GET. The SMD was -0.66, indicating a moderate effect, but the confidence interval (CI −1.01 to −0.31) includes values for which there is no longer a clinically significant effect. The authors, however, argued that the confidence interval does not cross the line of no effect and that this is what matters. Atle Fretheim explained:
“Our reading of the GRADE-handbook tells us that in a case like ours, if the 95% CI crosses the line of no effect, downgrading is warranted. You opine that downgrading is warranted if the 95% CI crosses the line for minimal clinically important difference.”
It’s true that the GRADE handbook advises not to downgrade when the 95% CI excludes no effect, but it gives that advice for dichotomous outcomes. These usually express the risk that something (usually something bad) will happen such as a stroke or infection. The patient-reported questionnaires used in the GET review are not dichotomous but continuous outcomes. For continuous outcomes, GRADE writes:
“Whether you will rate down for imprecision is dependent on the choice of the difference (Δ) you wish to detect and the resulting sample size required. Again, the merit of the GRADE approach is not that it ensures agreement between reasonable individuals, but that the judgements being made are explicit.”
In short: I think the GET review meets that sample size requirement, so there is an argument for not downgrading for imprecision, as Larun et al. did. Tovey’s argument, however, makes some sense as well. For guideline panels GRADE advises that “the decision to rate down the quality of evidence for imprecision is dependent on the threshold that represents the basis for a management decision.” Given that the authors themselves have defined the minimal important difference as 2.3 points on the Chalder Fatigue Scale, and the lower bound of the confidence interval corresponds to a difference of 1.6 points, this would suggest downgrading the quality of evidence for a guideline panel such as NICE.
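The back-of-the-envelope conversion behind that comparison can be sketched as follows. An SMD is translated into raw scale points by multiplying by a standard deviation; the SD of 5.2 used here is an assumption back-calculated from the numbers in the text (1.6 points / 0.31 SMD), not a figure reported in the review.

```python
# Assumed pooled SD on the Chalder Fatigue Scale, back-calculated from
# the text's own numbers (1.6 points / 0.31 SMD); NOT taken from the review.
CHALDER_SD = 5.2
MID_POINTS = 2.3  # minimal important difference defined by the review authors

def smd_to_points(smd, sd=CHALDER_SD):
    """Convert an SMD to an absolute difference in raw scale points."""
    return abs(smd) * sd

lower_bound = smd_to_points(-0.31)    # CI bound nearest to no effect
point_estimate = smd_to_points(-0.66)
```

Under this assumed SD, the point estimate corresponds to roughly 3.4 points (above the 2.3-point threshold), while the lower bound of the confidence interval corresponds to about 1.6 points (below it), which is the substance of Tovey’s objection.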
For authors of a systematic review, however, GRADE advises simply focusing on the sample size calculation, because it’s not the job of reviewers to define a clinically useful threshold or to weigh the economic costs and the trade-off between desirable and undesirable consequences. But I think one could argue that in cases where the effect size is so close to the clinically significant difference, there is no need for such complex considerations, and it is reasonable to downgrade for imprecision. After all, what is the point of saying there is moderate-quality evidence that exercise therapy reduces fatigue if the size of that reduction is quite likely not clinically significant?
..........................................................................................................................................................................................................................
So for both inconsistency and imprecision there are reasons to downgrade (for the first more than for the second, in my opinion), but it’s not a clear case; it is more a matter of judgement. The authors acknowledged this in the email correspondence by proposing the term ‘low to moderate quality of evidence’ as a consensus. The GRADE handbook, however, writes that if there is a borderline case to downgrade for two factors, the authors should downgrade for at least one of them and explain the situation in a footnote. It writes:
“If, for instance, reviewers find themselves in a close-call situation with respect to two quality issues (risk of bias and, say, precision), we suggest rating down for at least one of the two.”
Actually this is what Larun et al. did for physical function post-treatment. Both inconsistency and imprecision were borderline cases, so they rated down with one level. They explain it as follows:
Imprecision/inconsistency (certainty downgraded by -1): the confidence interval ranges from a large positive to a small benefit. There is variation in the effect size across available studies, but the heterogeneity is in part caused by a single outlier. When excluding the outlier, the pooled estimate is reduced to (mean difference −7.27, 95% CI −13.51 to −1.23).
The same could be said about fatigue post-treatment, only here the confidence interval is narrower; the problem is rather that it includes values with no clinically significant effect.
I think there’s a case that the quality of evidence for fatigue post-treatment is low, not moderate. Even if we set aside the main problem of using subjective outcomes in unblinded trials, there are so many issues here – ceiling effects on the Chalder Fatigue Scale, heterogeneity caused by the outlier of Powell et al., an effect size barely crossing the threshold of clinical significance – that I think it’s wrong to state that “exercise therapy probably has a positive effect on fatigue.”