I would like to comment on the use of the Chalder Fatigue Scale (referred to below as the Chalder Fatigue Questionnaire or CFQ) in the Cochrane review on
Exercise therapy for chronic fatigue syndrome.1
The review states that fatigue is “measured at end of treatment (12‐26 weeks)” and “after 52‐70 weeks” with “3 different versions of the Chalder Fatigue Scale (0‐11; 0‐33, or 0‐42 points)”, and that a “low score means less fatigue”.
This final statement, that a “low score means less fatigue”, presupposes that the fatigue scales used are absolute measures of fatigue, and that there is no inherent change in the way the scales are used between baseline and outcome.
However, the assumption that a low score at outcome means less fatigue is not safe in the case of the CFQ, because the comparison timepoint a patient uses may differ when the CFQ is used as a diagnostic tool at baseline from when it is used as an endpoint/outcome measure. In addition, different measures of fatigue (CFQ vs FSS) may not be equivalent at outcome, even if they were well correlated at baseline.
There are other points, which have been made by others,2,3 concerning the problems of equating different versions of the CFQ (bimodal vs Likert), which I will not address here. That is not to say that those points are unimportant.
Background
The CFQ was designed primarily as a diagnostic tool,4 and as such, seems to be useful in the diagnosis of those with a wide range of fatiguing conditions.
Although it has been validated by the tool authors as a diagnostic tool in some populations,5 it has not been adequately (or independently) assessed as an outcome measure or a repeated measures tool.
There are issues with the construction and content of the tool that I would like to comment on, with reference to its use as an outcome measure in particular.
Scaling
First, I will focus on the scaling of the tool. The tool asks the patient to respond to questions about “feeling tired, weak or lacking in energy in the last month” by ticking one of the following answers: “less than usual”, “no more than usual”, “more than usual”, or “much more than usual”.
The scaling lets the tool down as an outcome measure, because it is skewed towards worsening rather than being balanced between improvement and worsening: there are two degrees of worsening but only one of improvement. This makes it difficult to record incremental improvements, which in turn affects how the tool can be used to compare results between individuals, and creates a problem for both researcher and patient when trying to record perceived progress.
I have been told of instances in which patients have been inadvertently coached to complete the tool in a particular way to suggest improvement where there has been no improvement. I don’t believe this is necessarily done deliberately or fraudulently, just that it may seem to be the right thing to do to get around the failings of the instrument.
For example, at outcome, a patient may be told to compare themselves with the start of the trial or treatment, rather than the start of their illness. If there has been no change in their condition, the patient will record a score of 11 (“no more than usual” on all items), which on the face of it seems entirely reasonable, but will result in an improvement being logged if their original (baseline) score was higher than this, which it will be if, for instance, a score of at least 18 is required for inclusion in a trial.
There then only needs to be a slight imbalance between groups for this to have a substantial effect on any differences reported in a trial setting.
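The arithmetic behind this artefact can be sketched as follows. This is an illustrative sketch only, assuming Likert scoring of the 11-item CFQ (0–3 per item, range 0–33) and the response wording quoted above; the response sets shown are hypothetical.

```python
# Illustrative sketch: Likert-scored CFQ has 11 items, each scored 0-3,
# so ticking "no more than usual" on every item yields 11 * 1 = 11.

N_ITEMS = 11
LIKERT = {
    "less than usual": 0,
    "no more than usual": 1,
    "more than usual": 2,
    "much more than usual": 3,
}

def cfq_likert_score(responses):
    """Sum the Likert scores over the 11 CFQ items."""
    assert len(responses) == N_ITEMS
    return sum(LIKERT[r] for r in responses)

# Baseline: the patient compares with "when last well" and ticks
# "more than usual" on every item -> 22, clearing a >=18 entry threshold.
baseline = cfq_likert_score(["more than usual"] * N_ITEMS)

# Outcome: the condition is unchanged, but the patient now compares with
# the start of the trial, so every item is "no more than usual" -> 11.
outcome = cfq_likert_score(["no more than usual"] * N_ITEMS)

print(baseline, outcome, outcome - baseline)  # 22 11 -11
```

The 11-point drop is recorded as an improvement even though, by construction, nothing about the patient's fatigue has changed; only the comparison point has.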
Ceiling effect
The ceiling effect may produce a similar result, in that the recording of a maximum score at baseline may affect how the tool is completed on subsequent occasions.
If a patient scores the maximum (33), or close to it, at baseline, and their condition then worsens, there is a temptation (because of the scaling limitations of the tool) to reset and use the start of the trial as the comparison point for subsequent form completions. If their fatigue has then got slightly worse (“more than usual”, but not “much more than usual”, on most items), the lower recorded score will give the false impression of improvement when there has in fact been a slight worsening of their condition.
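The ceiling scenario can be made concrete with the same assumed Likert scoring (0–3 per item over 11 items); the response patterns below are hypothetical illustrations, not data.

```python
# Illustrative sketch of the ceiling effect (assumed Likert scoring,
# 11 items scored 0-3 each, maximum total 33).

LIKERT = {
    "less than usual": 0,
    "no more than usual": 1,
    "more than usual": 2,
    "much more than usual": 3,
}

def cfq_likert_score(responses):
    return sum(LIKERT[r] for r in responses)

# Baseline: maximum score, comparing with "when last well".
baseline = cfq_likert_score(["much more than usual"] * 11)  # 33 (ceiling)

# Later: the condition worsens slightly, but having already hit the
# ceiling, the patient "resets" and compares with the start of the trial,
# ticking "more than usual" on each item.
outcome = cfq_likert_score(["more than usual"] * 11)  # 22

print(outcome < baseline)  # True
```

A genuine worsening is thus recorded as an 11-point improvement, because the instrument leaves no headroom above the baseline score.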
Inconsistency and ambiguity in timepoints
The questionnaire text mentions at least three separate timepoints in describing how the patient should complete it. They are asked about any problems “in the last month”, and the questions relate to what is “usual”. In addition, there is a conditional clause referring to “a long while” ago and “when you were last well”. The use of a conditional clause in particular (“If you have been feeling tired for a long while, then compare yourself to how you felt when you were last well”) means that it is crucial to know which comparison point each patient is using every time they complete the form.
It is a shame that the tool’s devisers did not foresee this as an issue, because it would have been easy to record the comparison point on the form itself.
There is also a subjective judgement to be made about how “for a long while” should be interpreted. The use of multiple time periods in the questionnaire (“last month”, “for a long while”, “usual”) increases the ambiguity and makes it very hard to assume that every patient will have interpreted it in the same way.6 The researchers have also assumed an equivalence between what was experienced a long time ago and what is “usual”, which again may be problematic. What is usual for one’s condition over the past month may not be usual for the period before the illness arose. Having to extrapolate that across the multiple timepoints at which the CFQ is used in the following months of a trial adds further layers of complexity.
There may also be the possibility that the patient makes a comparison with how they felt at the beginning of the last month, particularly if their condition has fluctuated during that period.
For these reasons, I do not believe that it is safe to assume that every patient will be comparing themselves to when they were last well, or even that “when you were last well” is equivalent between patients, on every occasion that they complete the form. For example, does “when you were last well” actually mean “before the illness arose”, or could it also be interpreted as “during my last period of remission” or “before my last episode of PEM (post-exertional malaise)”?
Change in fatigue versus absolute fatigue
Ultimately, the CFQ measures change in fatigue each time it is used. It does not provide an absolute measure of fatigue, which makes comparison between timepoints problematic.
By comparison, the Fatigue Severity Scale (FSS), which rates fatigue symptoms over the course of a week, provides a more absolute score, and patients are not required to make comparisons between one timepoint and another when completing the questionnaire. Asking patients to consider their experience over just the past week is conceptually easier than having to recall an average over the course of a month.6 The FSS has other issues with the content of its question items, which may bias responses under certain interventions; I will not discuss those here.
I would hypothesise that if both scales (CFQ and FSS) were used together in a non-intervention study of naïve patients with stable fatiguing conditions over a period of some months, the scales would broadly correlate at baseline but diverge some months later when used as outcome measures. I would predict that although both scales would record high scores at baseline, at outcome only the FSS would maintain those scores; the CFQ would tend towards scores of 11 as patients record what is most “usual”, unless they have been specifically told to do otherwise.
If such a trial were to make this finding, then it would confirm that the CFQ is not safe to use as an outcome measure or repeated measures tool without some form of modification to address these issues.
References
1. Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. Exercise therapy for chronic fatigue syndrome. Cochrane Database of Systematic Reviews 2019, Issue 10: CD003200. https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub8/
2. Comments on the review made by Tom Kindlon and Robert Courtney. https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub8/read-comments
3. Wilshire C, Kindlon T, Matthees A, McGrath S. Can patients with chronic fatigue syndrome really recover after graded exercise or cognitive behavioural therapy? A critical commentary and preliminary analysis of the PACE trial. Fatigue: Biomedicine, Health & Behaviour 2017; 5: 43-56 https://www.tandfonline.com/doi/full/10.1080/21641846.2017.1259724
4. Chalder T, Berelowitz G, Hirsch S, Pawlikowska T, Wallace P, Wessely S. Development of a fatigue scale. Journal of Psychosomatic Research 1993; 37: 147–153 [PubMed]
5. Cella M, Chalder T. Measuring fatigue in clinical and community settings. Journal of Psychosomatic Research 2010; 69: 17–22 [PubMed]
Note – Cella & Chalder only looked at discriminative validity in a GP population, and only as a diagnostic tool. Morriss et al. (1998) looked at construct validity, concluding that the 11-point questionnaire was the better tool. Morriss RK, Wearden AJ, Mullis R. Exploring the validity of the Chalder Fatigue scale in chronic fatigue syndrome. Journal of Psychosomatic Research 1998; 45: 411–417 [PubMed]
6. Streiner DL, Norman GR. Ambiguity. Chapter 5: Selecting the items. In: Health measurement scales: a practical guide to their development and use (3rd edn). Oxford: OUP, 2003: p62.
Note – See also the subsequent chapter (6) on Biases in responding – particularly with regard to asking respondents to recall how they felt a while ago.