Review: Interventions for the management of long covid (post-covid condition): living systematic review, 2024, Zeraatkar, Flottorp, Garner, Busse+

Thanks, Dave. Your letter ends:
In a related matter, the peer reviews for the review have not yet been posted. Policy at The BMJ is to post them within five days. As far as I know, no reason for this lapse has been offered. Does The BMJ plan to post the peer reviews? If so, when? And if not, why not? This lapse undermines the claim that “The BMJ has fully open peer review.”

My Rapid Response, submitted in December, highlighting the absence of any peer review documents, was never published, so I’m pleased you are chasing this up.
I’ve submitted a brief Rapid Response highlighting the absence of any peer review documents and asking for that to be rectified.
 
Another reason why the request that I should submit a Rapid Response in order to trigger responses (from the authors and from the journal) doesn't wash.

My only concern, in re-reading the passage at the end of my letter, is that I used the word "lapse" twice so close together without realizing it. Bummer! Bad style choice!!! When I read it now, it makes me cringe!!
 
There is a new reply by first author Dena Zeraatkar to previous rapid responses, but unfortunately it does not address most of the points raised.
I'm mostly interested in this issue of imprecision, which I wrote a blog post on. The authors reply to this point by writing:

When the point estimate exceeded the MID, we rated certainty of an important effect; otherwise, we rated certainty of a trivial or no effect. According to this approach, reviewers may downgrade for imprecision if confidence intervals cross either the null or the MID—we chose the null.
Perhaps others can check, but this seems like a contradiction to me. If you're rating the certainty of an important effect, then this means not just any effect that is larger than 0, but an effect that is clinically significant and thus larger than the MID. So if you want to rate the certainty of that, you need to check the confidence intervals in relation to the MID, not the null.

To take an example: one of the effects was 0.04 (95% CI: 0 to 0.08) points, with the MID being 0.04. This confidence interval (if we assume it's true and has no bias) says that we're pretty confident that the effect is larger than 0 and that the intervention group does better. So if one were rating an effect, regardless of whether it's important or not, one should not rate down for imprecision. If one wants to rate the certainty of an important effect, however, then one would have to rate down, because the confidence interval indicates it's quite likely that the true effect is lower than the MID.
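To make the two options concrete, here's a rough sketch (my own, just assuming the simple rule that you rate down whenever the 95% CI crosses the chosen threshold) applied to that example:

```python
def rate_down_for_imprecision(ci_low, ci_high, threshold):
    """Simplified rule: rate down if the 95% CI crosses the chosen threshold."""
    return ci_low < threshold < ci_high

# The example above: effect 0.04 (95% CI 0 to 0.08), MID = 0.04
ci_low, ci_high = 0.0, 0.08
null, mid = 0.0, 0.04

print(rate_down_for_imprecision(ci_low, ci_high, null))  # False: CI sits at/above the null
print(rate_down_for_imprecision(ci_low, ci_high, mid))   # True: CI crosses the MID
```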
 
From the response:
Did lack of reporting of actigraphy results in the CBT trial introduce selective reporting bias (7)?

This concern reflects bias due to missing evidence (10). We did not consider this a concern because actigraphy was not an eligible outcome in our review, which focuses on patient-reported outcomes.
I’m sorry, what? There is no logic to this statement.
 
An extension of that logic:

Question: Half the patients died in your clinical trial. Why did you not report this?

Answer: We did not consider this a concern because death was not an eligible outcome in our review, which focuses on patient-reported outcomes.
 
Perhaps others can check, but this seems like a contradiction to me. If you're rating the certainty of an important effect, then this means not just any effect that is larger than 0, but an effect that is clinically significant and thus larger than the MID. So if you want to rate the certainty of that, you need to check the confidence intervals in relation to the MID, not the null.
I think what you're saying here makes sense. I haven't looked into this thoroughly, but they cite "Core GRADE 2: choosing the target of certainty rating and assessing imprecision" (2025, BMJ), which states "The location of the point estimate of effect in relation to the chosen threshold determines the target. For instance, using the MID thresholds, a point estimate greater than the MID suggests an important effect and less than the MID, an unimportant or little to no effect. Users then rate down for imprecision if the 95% confidence interval crosses the MID for benefit or harm."

ie. so if the chosen threshold is the MID, and you're rating certainty in whether that effect is over that threshold, then you downgrade if the confidence interval crosses the MID.

That guidance does indeed say that the null can be used as a threshold, but that is only for determining if there is a treatment effect, not whether it is an important effect. It doesn't make sense, from what I see in the guidance, to use the MID as a threshold, and then use the null as the threshold for imprecision.

However, I think I may understand what they are trying to say. What they're saying seems to correspond to a somewhat unclear section in that guidance titled "Assessing whether a true underlying treatment effect exists (using null as threshold)", which is where the null is still the threshold chosen, but the MID is used to some extent to determine whether you're rating certainty in an "unimportant" effect, ie. one that is very close to the null, and therefore probably of negligible importance.

But, based on my understanding of this, if you use the null as the threshold, and the point estimate isn't close enough to the null to say you're rating an "unimportant" effect, you are still then just rating your certainty in a "true underlying treatment effect", which is not the same as rating certainty in an important effect. To do that, you would need to use the MID as your threshold (and so rate down for imprecision if the CIs cross the MID).
 
Doesn't seem like a contradiction so much as Calvinball:
Calvinball is a fictional game from the Calvin and Hobbes comic strip where the rules are made up and changed constantly by the players as they play.
The standard in this type of research is that they almost never get more than the tiniest blip above pure chance, so they simply shift the line as they need it. When, through bias alone, they manage to squeeze enough blood out of a stone to get above the lowest statistical significance, that's what they emphasize. When they manage to cheat enough to make it seem like it even crosses MID, they emphasize that.

For the rest they either go for "hey, it may be better than nothing, if you really want it and believe in it, it's in your hands", or "we have to try again, we will try again".

Of course they don't want to rate the certainty of effect, there is no effect. They only want to make it appear as if there is, because it works, so they do whatever works for them in any given context. That's why they went from decades of "this is not just our opinion, many trials have proved this" to "we don't need trials, we know better from years of experience delivering the same thing trials have failed to prove": because almost no one in the profession cares whether it's true or not. They aren't just allowed to lie and cheat, they are expected to, otherwise it would be embarrassing. They'd rather get it wrong 100% of the time than be embarrassed about ever getting it wrong, is the sad standard.
 
I think what you're saying here makes sense. I haven't looked into this thoroughly, but they cite "Core GRADE 2: choosing the target of certainty rating and assessing imprecision" (2025, BMJ), which states "The location of the point estimate of effect in relation to the chosen threshold determines the target. For instance, using the MID thresholds, a point estimate greater than the MID suggests an important effect and less than the MID, an unimportant or little to no effect. Users then rate down for imprecision if the 95% confidence interval crosses the MID for benefit or harm."

ie. so if the chosen threshold is the MID, and you're rating certainty in whether that effect is over that threshold, then you downgrade if the confidence interval crosses the MID.
Thanks for checking. I think figure 5b forms an illustrative example.

The text says:
For (b) in figure 5, decisions about rating down certainty will differ depending on the threshold. When using the null, as the CI does not cross the threshold, Core GRADE users will not rate down their certainty for imprecision. When using the MID, as the CI crosses the threshold, they will rate down for imprecision.

[Attached image: figure 5 from the Core GRADE imprecision guidance]
Zeraatkar says: "we rated certainty of an important effect". That means that they should rate down for imprecision if the confidence intervals cross the MID.

Zeraatkar, however, says that they didn't do so because: "reviewers may downgrade for imprecision if confidence intervals cross either the null or the MID—we chose the null." This is incorrect. If they look at whether confidence intervals cross the null, then they are rating the precision of a non-null effect, rather than an important effect. You cannot choose the null if you have already chosen the MID as the threshold of interest.

So there is a contradiction in their statement.
 
Paper said:
To enable imprecision to be judged, we considered whether effect estimates met or exceeded the minimal important difference (MID)—the smallest difference in an outcome that patients find important.[77] When the point estimate met or exceeded the MID, we rated the certainty of there being an important effect. Conversely, when the point estimate was between the MID and the null, we rated the certainty of there being no important effect.
I'm trying to understand this.

So if, say, the MID is 1, and a study A has a result of 1.1, CI: 0.9-1.3 (same shape as b in fig. 5 above), then because the point estimate exceeds the MID, they downgrade for imprecision because the CI goes below the MID.

But if a study B result is 0.8, CI: 0.5-1.1, then because the point estimate is below the MID, they do not downgrade because the CI does not go below null.

Am I understanding correctly? Aren't they then comparing different studies using different criteria per study? Study A gets a low score because we're not certain it's showing an "important" effect. Study B gets a high score because we're pretty certain it's showing at least a positive nonzero effect.

Why, for the first two outcomes in the vortioxetine study, did they downgrade for imprecision if the CI didn't cross the null?
[Attached screenshot: the vortioxetine outcomes from the review]

Edit: In the plain language summary of the outcomes above, it says "Probably little or no important effect on depression."

But in the paper, it doesn't say "important" effect, it just says "effect".
moderate certainty evidence suggests that vortioxetine probably has little or no effect on depressive symptoms and quality of life.
 
This also doesn't seem correct.
Why not downgrade for imprecision when treatment effects were informed by fewer than 800 participants (1, 2)?

GRADE suggests 800 participants as a “rule of thumb” when the optimal information size (OIS)—the sample size a trial would need to detect a modest effect—is uncertain (3, 4). In practice, the OIS varies by outcome, the observed variability, and the magnitude of effect deemed important. For example, one trial investigating cognitive behavioral therapy (CBT) randomized 114 patients; detecting a 9-point difference on the Checklist Individual Strength-Fatigue (CIS-Fatigue) with 80% power would result in an OIS of 65 participants (5), assuming an SD of 13; the trial’s sample size surpassed this number (5, 6). Further, Core GRADE guidance now advises reviewers only consider OIS in cases of implausibly large effects (4), which we did not identify.
The sample size in this CBT trial (114, or 57 per group) is so far off from what GRADE recommends as sufficient for good precision (800, or 400 per group) as a rule of thumb for continuous outcomes. So something has gone wrong here, and I think it's how they used the CIS-Fatigue scale.

Because of ceiling effects the standard deviation (SD) for this scale can change a lot, from only 3-4 points for ME/CFS patients at baseline to 12-13 points in patients with other conditions or after treatment. So if you want to calculate an effect size by dividing the mean difference by the SD, you can choose to make it look big or small depending on which SD you use.

In their supplementary material, Zeraatkar used an MID of 3 points for this scale, arguing that this represents half a standard deviation (which would then be 6). They use this low SD to argue that the effect estimate is big enough and fully above the MID (95% CI: -13.18 to -5.42), so they don't have to rate down for imprecision. When asked whether they should not rate down for having too low a sample size (OIS) for the effect found, they switch to another SD of 13. They used this to argue that the effect found wasn't actually large enough to be considered implausible (9.3 points for an SD of 13).

If they had used a consistent SD estimate (big or small), they would have rated down for imprecision in both scenarios. If they used the large SD of 13, then the MID would be approximately 6.5, which is higher than the lower bound of the confidence interval. If they used an SD of 6, then the treatment effect would be massive (Cohen's d of 1.55) and so they would rate down for the OIS criterion.
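As a rough check on those numbers, here's my own back-of-the-envelope sketch using the standard two-group sample size formula (nothing here is taken from the paper's code):

```python
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Two-group sample size for a continuous outcome:
    n per group = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * (SD / delta)^2"""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * (sd / delta) ** 2

# The reply's OIS argument: 9-point difference on CIS-Fatigue, SD = 13, 80% power
print(round(2 * n_per_group(9, 13)))  # ~66 participants in total, close to the 65 they quote

# But the MID of 3 points was justified as half an SD, implying SD ~ 6
print(round(9.3 / 6, 2))              # Cohen's d ~ 1.55 if SD = 6 (an implausibly large effect)
print(0.5 * 13)                       # MID ~ 6.5 if SD = 13; the CI (5.42 to 13.18 in absolute terms) crosses it
```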
 
When the point estimate met or exceeded the MID, we rated the certainty of there being an important effect. Conversely, when the point estimate was between the MID and the null, we rated the certainty of there being no important effect.
@forestglip
I take this to mean:

If the point > MID, we rate the certainty of the evidence for rejecting the null hypothesis.

If the MID > the point > null, we rate the certainty of the evidence for not rejecting the null hypothesis.

But that does not make sense.
 
So if, say, the MID is 1, and a study A has a result of 1.1, CI: 0.9-1.3 (same shape as b in fig. 5 above), then because the point estimate exceeds the MID, they downgrade for imprecision because the CI goes below the MID.

But if a study B result is 0.8, CI: 0.5-1.1, then because the point estimate is below the MID, they do not downgrade because the CI does not go below null.

Am I understanding correctly? Aren't they then comparing different studies using different criteria per study? Study A gets a low score because we're not certain it's showing an "important" effect. Study B gets a high score because we're pretty certain it's showing at least a positive nonzero effect.
Here's my understanding of the GRADE approach:

In study B the point estimate (0.8) is lower than the MID (1) so GRADE recommends rating the evidence of there being NO effect. In that case you cannot choose the null and have to compare to the MID. The confidence interval (0.5-1.1) includes values that are higher than the MID, so it's still possible that there is an important effect. Therefore one should downgrade the certainty that there is no effect.

In study A the point estimate (1.1) is higher than the MID (1) so GRADE recommends rating the evidence of there being an effect. In that case reviewers can choose their threshold of interest: they can rate the certainty of an important effect (higher than the MID) or a non-null effect (higher than 0). If they choose the null, then they should not rate down because the CI (0.9-1.3) is fully above it. But if they choose to rate an important effect, they should downgrade because some values of the CI are below the MID.
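Putting that reading of the rule into a small sketch (my own interpretation of the Core GRADE logic, not the authors' method), with the hypothetical studies A and B and an MID of 1:

```python
def rate_certainty(point, ci_low, ci_high, mid, use_null_for_effect=False):
    """Sketch of the rule as I read it: the point estimate picks the target,
    then rate down if the 95% CI crosses the chosen threshold."""
    if point >= mid:
        target = "important effect"
        threshold = 0.0 if use_null_for_effect else mid  # the disputed choice
    else:
        target = "no important effect"
        threshold = mid  # here the MID is the only sensible threshold
    return target, ci_low < threshold < ci_high  # (target, rate down?)

print(rate_certainty(1.1, 0.9, 1.3, mid=1))                            # study A: ('important effect', True)
print(rate_certainty(1.1, 0.9, 1.3, mid=1, use_null_for_effect=True))  # study A, the review's choice: ('important effect', False)
print(rate_certainty(0.8, 0.5, 1.1, mid=1))                            # study B: ('no important effect', True)
```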
 
Here's my understanding of the GRADE approach:

In study B the point estimate (0.8) is lower than the MID (1) so GRADE recommends rating the evidence of there being NO effect. In that case you cannot choose the null and have to compare to the MID. The confidence interval (0.5-1.1) includes values that are higher than the MID, so it's still possible that there is an important effect. Therefore one should downgrade the certainty that there is no effect.

In study A the point estimate (1.1) is higher than the MID (1) so GRADE recommends rating the evidence of there being an effect. In that case reviewers can choose their threshold of interest: they can rate the certainty of an important effect (higher than the MID) or a non-null effect (higher than 0). If they choose the null, then they should not rate down because the CI (0.9-1.3) is fully above it. But if they choose to rate an important effect, they should downgrade because some values of the CI are below the MID.
I see, thanks.
In study B the point estimate (0.8) is lower than the MID (1) so GRADE recommends rating the evidence of there being NO effect.
Do you mean no important effect? We see that there's likely a nonzero effect since the CI doesn't cross null. And that's how they worded the plain language summary for vortioxetine.

My brain is swimming trying to fully grasp it, but it seems weird to compare evidence for no important effect for some interventions with evidence for any effect at all in other interventions.

With MID of 1:
Study C with 1.1, CI: 0.1-2.1 - Good evidence of an effect
Study D with 0.95, CI: 0.85-1.05 - Moderate certainty evidence of no important effect.

Maybe it's the brain fog, but it seems overly confusing to throw these findings together in a paper. If we're interested in any effect at all for some reason, Study D with its small CI seems more convincing [edit: for there being any effect at all] than Study C.
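For what it's worth, here's the same comparison in numbers (my own sketch, same hypothetical MID of 1):

```python
# Hypothetical studies C and D from above, MID = 1
studies = {"C": (1.1, 0.1, 2.1), "D": (0.95, 0.85, 1.05)}

for name, (point, lo, hi) in studies.items():
    print(name,
          "| CI width:", round(hi - lo, 2),
          "| crosses null:", lo < 0 < hi,
          "| crosses MID:", lo < 1 < hi)
# C | CI width: 2.0 | crosses null: False | crosses MID: True
# D | CI width: 0.2 | crosses null: False | crosses MID: True
```

Both exclude the null and both cross the MID; the only thing separating them is which side of the MID the point estimate falls on, which is what flips the target of the rating.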
 
Rating (im)precision is different from null-hypothesis testing though, so I wouldn't phrase it like that.
Good point.

I found this, which says that if it's «important effect», it's the MID; if it's just «effect», it's in relation to the null.
Box 1
Possible threshold(s) of interest and target of certainty rating in minimally, partially, or fully contextualized approach

  • Using a minimally contextualized approach (typically in systematic reviews), authors consider only one outcome at a time. Authors rate their certainty in relation to the null—rating their certainty that an effect is truly present—or in relation to a minimally important difference (MID)—rating their certainty that an important effect is truly present.

If @forestglip's observation earlier is correct, they are applying two different criteria. One focuses on the null, and the other on the MID. It seems like they are supposed to pick one.
 