GETSET letters in The Lancet

Regression to the mean is an interesting possibility.

It's a certainty if they subgroup the data like that! Those with low scores will tend to improve; those with high scores will tend to get worse (before any treatment effects). They don't seem to have selected on PF this time, but that will have happened to a certain extent anyway, by nature of the selection criteria for CFS.
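To illustrate how strong that artefact can be, here is a minimal simulation sketch (Python, with entirely made-up numbers; none of this is the trial's data). Even with zero treatment effect, splitting on a noisy baseline measure makes the low scorers appear to improve and the high scorers appear to deteriorate:

```python
# Illustrative sketch only -- all numbers are invented, not GETSET data.
# Simulate regression to the mean when a noisy measure is subgrouped
# on its baseline value, with NO treatment effect at all.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
true_score = rng.normal(50, 10, n)           # each patient's stable "true" level
baseline = true_score + rng.normal(0, 8, n)  # baseline = true level + measurement noise
followup = true_score + rng.normal(0, 8, n)  # follow-up: same true level, new noise

low = baseline <= 45                          # subgroup on the noisy baseline
change = followup - baseline
print(f"low-baseline subgroup mean change:  {change[low].mean():+.1f}")   # positive
print(f"high-baseline subgroup mean change: {change[~low].mean():+.1f}")  # negative
```

The size of the artefact depends only on how noisy the measure is relative to the spread of true scores; no treatment is needed to produce it.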

They also said:
The original protocol had only one primary outcome measure, the SF-36 PF. However, when some eligible participants were found to have high SF-36 PF scores at randomisation (because of their illness affecting cognitive or social functions but not physical function), we decided to also include fatigue, using the CFQ, as a co-primary outcome. This decision was made mid-way through trial recruitment...

So it looks like they did have some kind of cut-off originally (at least in principle), to allow themselves room for improvement. In some ways, it's good they weren't selecting patients on the measure they were going to use at outcome. But they would have been using selection measures that were highly correlated with it.

I'm not 100% sure what you're saying. They are arguing that because the higher group didn't do as well, that's likely because they were already near the maximum (ceiling effects). There could be situations where this arises.

However, based on the figures here, we have enough information to know that the patients in the higher-functioning group improved by a maximum of 2.7 points and may actually have decreased. A final score of around 55 is not near a maximum score, unless one accepts that graded exercise therapy can only bring people up to a very low level, which the authors have never conceded.

Unlike us, the authors have the figures. If they gave them, it would be pretty clear that their interpretation is not valid. But by not giving the figures, people could be taken in by it.

But didn't they concede that by saying that there might be a ceiling effect? I think that's where I'm getting confused, because otherwise I don't understand what they haven't admitted to. To anyone who knows the PF scale, the improvement they reported is truly tiny.

I certainly agree that it would be useful to see the data. A nice set of histograms of the baseline and outcome measures in the 2 groups would make things much clearer.
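Something along these lines would do it. This is only a sketch of the sort of figure meant, using simulated scores (in 5-point steps, like the SF-36 PF), since we obviously don't have the trial's individual data:

```python
# Sketch of the kind of figure that would settle this: baseline and outcome
# histograms for the two baseline subgroups. Scores are simulated purely
# for illustration; the trial's actual distributions are not public.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# SF-36 PF moves in 5-point steps between 0 and 100, so round to multiples of 5.
baseline = np.clip(np.round(rng.normal(45, 15, 200) / 5) * 5, 0, 100)
outcome = np.clip(baseline + np.round(rng.normal(8, 12, 200) / 5) * 5, 0, 100)
low = baseline <= 40  # the paper's subgrouping: <=40 vs >=45

fig, axes = plt.subplots(2, 2, sharex=True, figsize=(8, 6))
for row, (scores, label) in enumerate([(baseline, "Baseline"), (outcome, "Outcome")]):
    axes[row, 0].hist(scores[low], bins=np.arange(0, 105, 5))
    axes[row, 0].set_title(f"{label}, SF-36 PF <= 40 at baseline")
    axes[row, 1].hist(scores[~low], bins=np.arange(0, 105, 5))
    axes[row, 1].set_title(f"{label}, SF-36 PF >= 45 at baseline")
plt.tight_layout()
plt.show()
```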
 
But didn't they concede that by saying that there might be a ceiling effect? I think that's where I'm getting confused, because otherwise I don't understand what they haven't admitted to. To anyone who knows the PF scale, the improvement they reported is truly tiny.
They are not conceding anything by saying there is a ceiling effect in my view. 100 can be a ceiling. There could be a situation where people were at an average of 90 initially and you can't expect them to get that much better, so there is a ceiling effect there. (Say, a weight-loss program where some people are already close to a healthy weight, while others have more to lose.) So that is what they seem to be suggesting: the group that started off high couldn't improve much as their scores were already pretty good.

However, when you actually look at the specifics, the subgroup with the initially higher scores either improved by a tiny amount or actually decreased, and the final score averages around 54.9. We only know that from information about the other group; it is not explicitly given in the paper. Their argument about a ceiling effect in this specific case could only be true if there is a very low ceiling. They haven't said that explicitly anywhere, and I doubt they would concede it. Some people unfamiliar with the scale might think the ceiling that was reached represents normal functioning or something close to it.

Peter White thinks people can recover with GET and should be turned down for disability payments unless they have tried it. He hasn't said anything to suggest people could only get scores of 55-60 or so.
 
They are not conceding anything by saying there is a ceiling effect in my view. 100 can be a ceiling.

Ah - OK. There are two different types of ceiling effect. One could be in the scale itself - as you say, starting at 90 when 100 is the max. But there's also a treatment ceiling effect, whereby a treatment is acknowledged as only able to produce, say, a 10-point improvement.
I read it as a treatment ceiling, not a scale ceiling.

But it does seem that they've confused the two as well if they think that a score of 55 constitutes some kind of treatment ceiling.

ETA: But then that doesn't make sense if they say it is only relevant in those who had high scores at baseline. As you say, we need to know what those 'high' scores were.

ETASM: And you can't set a 'ceiling' at the mean.
 
Ah - OK. There are two different types of ceiling effect. One could be in the scale itself - as you say, starting at 90 when 100 is the max. But there's also a treatment ceiling effect, whereby a treatment is acknowledged as only able to produce, say, a 10-point improvement.
I read it as a treatment ceiling, not a scale ceiling.

But it does seem that they've confused the two as well if they think that a score of 55 constitutes some kind of treatment ceiling.

ETA: But then that doesn't make sense if they say it is only relevant in those who had high scores at baseline. As you say, we need to know what those 'high' scores were.

ETASM: And you can't set a 'ceiling' at the mean.
If one crunches the numbers, the group with the lower initial scores (A) increased by an average of 16.9 points or more, while the group with the higher initial scores (B) increased by a maximum of 2.7 points and could even have decreased.

So a treatment ceiling (i.e. that they could only improve so much within the timeframe) doesn't justify the poor results for group B.

I asked for the scores so they would be put on record and people wouldn't need to trust my analysis. But I believe we have enough information to say that the ceiling effect doesn't justify the poor results for group B.

ETA: Only a small proportion of group B could have had high (e.g. 85+) scores on completion, given the average of around 54.9.
 
We have already acknowledged the small size of the effect on physical functioning (0·20), but our finding that the effect size was greater in those with the worst baseline physical functioning suggests this might represent a ceiling effect.
It is disappointing that they are again making this point without sharing information about the other subgroup, i.e. the one with the higher baseline scores.
It's like an unfinished sentence - "but our finding that the effect size was greater in those with the worst baseline physical functioning suggests this might represent a ceiling effect ..." - that they deliberately avoid completing. Crucial to their results is what that ceiling effect might be.
 
What has always concerned me, though, is that The Lancet seem to have no mechanisms to ensure that authors have actually answered the questions asked of them. I saw this countless times when I used to edit Correspondence, and it frustrated the hell out of me. It's the reason why I don't think Correspondence is a particularly good method for "correcting the record". Once a paper has been published, that's sort of it really, unless you can demonstrate fraud. Everything else is just column inches. Sorry for being so cynical, but it's why I didn't leap to my keyboard when Tom called for responses.
Another reason is documentation and showing disagreement. The authors' responses to criticism are nearly always worthless in this field.
 
Claiming there could be a ceiling effect without providing a histogram/visualisation or a statistical test of the effect leaves the claim lacking credibility. You'd lose marks for doing this on an undergraduate project; it's simply not acceptable from professionals.
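For what it's worth, the standard way to put such a claim on a statistical footing would be a baseline-by-treatment interaction in a regression of the outcome. Here is a hedged sketch with simulated data (the variable names and effect sizes are invented, not the trial's):

```python
# Sketch only: simulated data standing in for individual patient scores,
# which we do not have. A negative treated:baseline interaction would
# support "less benefit at higher baselines"; inspecting fitted values near
# the top of the scale would show whether any ceiling is a scale ceiling
# (near 100) or merely a low treatment ceiling.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "baseline": np.clip(rng.normal(45, 15, n), 0, 100),
    "treated": rng.integers(0, 2, n),
})
# Invented outcome: a treatment effect that shrinks as baseline rises.
df["outcome"] = (df["baseline"] + rng.normal(5, 10, n)
                 + df["treated"] * np.maximum(0, 15 - 0.2 * df["baseline"]))

model = smf.ols("outcome ~ treated * baseline", data=df).fit()
print(model.summary().tables[1])  # look at the treated:baseline coefficient
```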
 
If one crunches the numbers, the group with the lower initial scores (A) increased by an average of 16.9 points or more, while the group with the higher initial scores (B) increased by a maximum of 2.7 points and could even have decreased.

So a treatment ceiling (i.e. that they could only improve so much within the timeframe) doesn't justify the poor results for group B.

I asked for the scores so they would be put on record and people wouldn't need to trust my analysis. But I believe we have enough information to say that the ceiling effect doesn't justify the poor results for group B.

ETA: Only a small proportion of group B could have had high (e.g. 85+) scores on completion, given the average of around 54.9.

I think it looks more like regression to the mean. If they looked at the control group in the same way, they should see exactly the same pattern, although without the 'treatment' effect the increases/decreases on each side would be more balanced (those below the mean would increase by about the same amount as those above the mean would decrease).

If only a few people were close to the top of the PF scale at baseline, I don't expect it would really make much difference, particularly as the overall mean values are so low.
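Continuing the earlier simulation sketch (again with invented numbers, not trial data), adding a uniform treatment shift to one arm shows exactly that pattern: in the no-treatment arm the subgroup changes roughly mirror each other, while the shift simply moves both treated subgroups up:

```python
# Sketch: regression to the mean in a control arm vs a treated arm.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
n = 1000

def subgroup_changes(shift):
    true = rng.normal(50, 10, n)
    base = true + rng.normal(0, 8, n)
    follow = true + shift + rng.normal(0, 8, n)  # shift = assumed treatment effect
    low = base <= 45
    change = follow - base
    return change[low].mean(), change[~low].mean()

for name, shift in [("control", 0.0), ("treated", 5.0)]:
    lo, hi = subgroup_changes(shift)
    print(f"{name}: low subgroup {lo:+.1f}, high subgroup {hi:+.1f}")
```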
 
They are not conceding anything by saying there is a ceiling effect in my view.
Not so long as they don't state what ceiling effect they mean.

I suppose that if some participants had little physical-function disability at baseline (as @Lucibee notes in post #41), so were close to good physical function anyway, then they would have started off close to a natural ceiling. Does it make sense to include such people in an assessment of PF improvement anyway? (Genuine question.) If all trial arms were the same, then maybe? If they meant this as a ceiling effect, then the authors might tacitly concede their trial design was questionable - and not wish to do so?

But as Tom says (if I understand right), if it's more likely the "ceiling effect" was simply that the treatment had run out of what little steam it had, surely they would be very chary of conceding that. Indeed that would seem like calling something a ceiling effect when in fact it's more of an outcome?
 
ETA: Only a small proportion of group B could have had high (e.g. 85+) scores on completion, given the average of around 54.9.
Note that this group had to have a baseline score of ≥45, which limits the proportion of high scores.

Here is what they said about possible ceiling effects in the main text:
Our finding that GES was more useful in those with worse physical functioning is reassuring and has been reported previously, but further exploration is necessary because it might be related to a ceiling effect in those with good physical functioning at baseline. This ceiling effect might also explain the relatively smaller difference in the effect size for physical function, which would reduce the overall difference between study groups.
 
If anyone felt inclined, it would be interesting to get the baseline scores for the SF-36 physical function questionnaire using a freedom of information request.

These could be used to question their claim:
Our finding that GES was more useful in those with worse physical functioning is reassuring and has been reported previously, but further exploration is necessary because it might be related to a ceiling effect in those with good physical functioning at baseline. This ceiling effect might also explain the relatively smaller difference in the effect size for physical function, which would reduce the overall difference between study groups.

The patients were divided up into two groups:
severity of disability according to the Short-Form 36 physical function subscale (SF-36 PF, ≤40 and ≥45).17

We know that the average final score for the lower group was 56.9. We also know that the average total final score was 55.7.
If we got the baseline scores, we could calculate the change scores for both groups. It looks like the higher initial group may even have decreased on average; it certainly didn't improve much, given that the other group jumped from an average score of 40 or less.

The baseline scores of the higher initial group would probably be mainly in the 45-65 range, which would show they could easily have improved.
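To make the arithmetic explicit, here is the back-of-envelope calculation as a short Python sketch. The two final means and the cut-offs are as quoted above; the subgroup sizes are not figures we have, so the values below are placeholders chosen only to show the method:

```python
# Back-of-envelope sketch. Final means (56.9 lower subgroup, 55.7 overall)
# and cut-offs (<=40, >=45) are as quoted in the thread; n_low and n_high
# are PLACEHOLDERS -- the real subgroup sizes would be needed.
n_low, n_high = 80, 120        # assumed sizes, for illustration only
final_low = 56.9               # reported final mean, baseline <= 40 subgroup
final_overall = 55.7           # reported final mean, whole group

# The overall mean is the size-weighted average of the two subgroup means:
final_high = (final_overall * (n_low + n_high) - final_low * n_low) / n_high
print(f"implied final mean, baseline >= 45 subgroup: {final_high:.1f}")

# Bounds on the change scores follow from the cut-offs alone:
print(f"lower subgroup improved by at least {final_low - 40:.1f} points")
print(f"higher subgroup improved by at most {final_high - 45:.1f} points")
```

The 16.9-point minimum for the lower subgroup falls straight out of the cut-off. The bound for the higher subgroup from the cut-off alone is conservative; the 2.7-point maximum quoted earlier in the thread presumably rests on tighter information about that subgroup's actual baseline mean.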
 