Defining a High-Quality Myalgic Encephalomyelitis/Chronic Fatigue Syndrome cohort in UK Biobank, 2025, Samms & Ponting

Where is missing data missing from, and why?

Summary: All UKB participants have baseline assessment data (with self-reported serious illnesses), and almost all have hospital records (with diagnosis codes). Nearly half have GP records (with diagnosis codes), and a third completed the Pain questionnaire, which asks about diagnoses of specific illnesses.

The paper often says that the lack of concordance between cohorts is due to missing data, but it isn't clear how this happens.

Much of this is simply because the four cohorts are defined by four UKB data fields, and a small majority of people have no data in those fields:
upload_2025-5-23_8-40-37.png
  • Every UKB participant was asked at the initial assessment if they had a serious illness or disability diagnosed by a doctor, and asked what that was, so there is no missing data here. (C1)
  • Every participant gave consent to share medical records. UKB has hospital data for 89%. (C3)
  • However, UKB only has primary care data for 46% due to problems accessing data. (C4)
  • And only 33% completed the Pain Questionnaire. (C2)
So, most people have no data for the pain question about ME/CFS or primary care data for diagnosis codes. However, those in the pain questionnaire cohort (C2) have less missing data because, while only 46% have data for GP records, everyone has baseline assessment data, and 89% have hospital records.
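As a rough sketch of the coverage figures above, the share of participants with no data in each field is simply one minus the coverage. The dictionary keys and variable names are my own labels, not UKB field names.

```python
# Share of UKB participants with data for each cohort-defining field,
# as quoted in the post above (fractions of the whole of UKB).
coverage = {
    "C1 baseline self-report": 1.00,
    "C2 pain questionnaire": 0.33,
    "C3 hospital records": 0.89,
    "C4 primary care records": 0.46,
}

# The missing share is the complement of the coverage.
missing = {field: round(1.0 - share, 2) for field, share in coverage.items()}

# Fields where more than half of participants have no data.
mostly_missing = [field for field, share in missing.items() if share > 0.5]

for field, share in missing.items():
    print(f"{field}: {share:.0%} with no data")
print("More than half missing:", mostly_missing)
```

On these figures, only the pain questionnaire and primary care fields are missing for more than half of participants.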

Data not missing at random

The missing hospital records appear to be missing at random, at least with respect to CFS status. However, pain questionnaire data is not missing at random: 41% of those who reported CFS at baseline assessment have PQ data, compared with 33% for all of UKB. People with health problems are more likely to complete a voluntary questionnaire about health issues than those without.
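One quick way to express this comparison is as a ratio of completion rates. This is only a sketch using the two percentages quoted above; the underlying counts are not given here, so it says nothing about statistical significance.

```python
# Pain questionnaire (PQ) completion rates quoted in the post.
pq_rate_cfs_at_baseline = 0.41  # those who reported CFS at baseline
pq_rate_all_ukb = 0.33          # all UKB participants

# A ratio of 1.0 is what we would expect if PQ data were missing at
# random with respect to baseline CFS status.
completion_ratio = pq_rate_cfs_at_baseline / pq_rate_all_ukb

print(f"Completion ratio: {completion_ratio:.2f}")
```

A ratio of about 1.24 means those reporting CFS at baseline were roughly a quarter more likely to have PQ data than UKB participants overall.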
 

Cohort quality assessed by cohort overlap
Those in C3 (hospital G93.3 code) come out best, C2 (PQ ME/CFS) worst

The simplest way the study assesses cohort quality is by how far a diagnosis in one cohort is supported by a diagnosis in another data field (i.e. in another cohort). I've graphed this here using the data from Table 3.
upload_2025-5-26_9-18-51.png

C3 (G93.3 hospital records) is the best, with the lowest proportion of cases found in that cohort only and the highest proportion found in three cohorts, while C2 (PQ) is the worst.

This does not allow for data missingness, as discussed above:
  • Almost everyone in PQ (C2) could be in at least 3 cohorts (11% with no hospital records can be in 2 or 3). Yet most are only in the PQ cohort.
  • Everyone in the G93.3 (C3) cohort could be in 2 cohorts, but it is much harder to be in 3, as most have no PQ (C2) data and half have no GP (C4) data. Yet it has 27% in 3 cohorts (and only 28% in C3 alone).
This means that C3 is almost certainly better than the data here suggests, and C2 is worse.

ADDED
It would be interesting to see the proportion of a cohort that matched with 2 or 3 other cohorts adjusted for missing data (as has been done in the text for matches between two specific cohorts).
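To make the idea concrete, here is a minimal sketch of one possible adjustment: when asking what share of a cohort matches at least k other cohorts, drop from the denominator anyone who lacks data for k other fields, since they could never have matched. The records below are hypothetical toy data, not UKB data, and the function name and adjustment rule are my own assumptions rather than the paper's method.

```python
# Hypothetical toy records, NOT real UKB data. For each person, each field is
# True (ME/CFS recorded), False (data present, no ME/CFS) or None (no data).
people = [
    {"C1": True,  "C2": True, "C3": True,  "C4": None},
    {"C1": False, "C2": True, "C3": False, "C4": True},
    {"C1": True,  "C2": True, "C3": None,  "C4": None},
    {"C1": False, "C2": True, "C3": False, "C4": False},
]

def share_matching(people, cohort, others, k, adjust):
    """Share of `cohort` members who are also in at least k of `others`.

    With adjust=True, members who have data for fewer than k of the other
    fields are dropped from the denominator (they could not have matched).
    Returns None if nobody is left in the denominator.
    """
    members = [p for p in people if p[cohort] is True]
    if adjust:
        members = [
            p for p in members
            if sum(p[o] is not None for o in others) >= k
        ]
    if not members:
        return None
    hits = sum(sum(p[o] is True for o in others) >= k for p in members)
    return hits / len(members)

others = ["C1", "C3", "C4"]
raw = share_matching(people, "C2", others, k=2, adjust=False)
adjusted = share_matching(people, "C2", others, k=2, adjust=True)
print(f"raw: {raw:.2f}, adjusted: {adjusted:.2f}")
```

In the toy data the third person has data for only one other field, so they are excluded from the adjusted denominator and the matched share rises from 1 in 4 to 1 in 3.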

Note: I am ignoring cases common to all four cohorts as there are only 95 of them.
 
The PEM questions
this study said:
The final 3 questions are related to post-exertional malaise, with possible responses: ‘yes’ (considered ‘affected’), ‘no’, ‘do not know’ or ‘prefer not to answer’ (all considered ‘unaffected’). These questions were only asked to participants who answered ‘yes’ to question (5) above: ‘Do you have persistent or recurrent tiredness, weariness or fatigue that has lasted for at least 6 months?’ including to 1,876 (69%) of those in Cohort 2.

These 3 questions were:
(13) “Do you get tired after minimal physical or mental exertion?” (“Tired after minimal physical or mental exertion”);
(14) “Does this tiredness, weariness or fatigue go away when you rest?” (“Fatigue goes away when resting”); and,
“Is this tiredness, weariness or fatigue happening only because you have been exercising and/or working too much?” (“Fatigue only because of exercising or working too much”).

Those do not seem to me to be likely to pick out PEM.
Being tired after exertion is very nonspecific and if someone is feeling generally tired they might easily say yes. And tiredness tends not to go away with rest. Exhaustion or exercise fatigue does, but not tiredness I think.
The last question I find hard to understand - how is the person to know what caused the tiredness? I would guess that someone with PEM could very well say no because it isn't the exercising that caused it so much as the ME/CFS being there that caused trivial exercise to be followed by tiredness.

Second question
"Does this tiredness, weariness or fatigue go away when you rest" ('yes' considered affected)

There are diagnostic criteria that require people to say 'I have fatigue that is unalleviated by rest' to be labelled with ME/CFS. I've complained about that, because resting often does help. But, the opposite is problematic too.
There are diagnostic criteria that require people to say 'I wake up feeling unrefreshed' to be labelled with ME/CFS.

The question is particularly problematic as a test of the existence of PEM because the screening question asking about fatigue lasting 6 months primes people to think about the illness as a whole, not the granularity of a PEM episode.

Third question
"Is this tiredness, weariness or fatigue happening only because you have been exercising and/or working too much?" ('yes' considered affected)
I genuinely expected to find, when I went to the paper to check, that for someone to be put into an ME/CFS cohort, the required answer to this third question would be NO.

Surely, having answered YES to 'Do you have persistent or recurrent tiredness, weariness or fatigue that has lasted for at least 6 months?', most people with ME/CFS would answer 'No, the fatigue I feel is not because I have been exercising or working too much'?

A mother without ME/CFS who continues to work outside the home and has two children under 5 might say 'yes, I've been tired for 6 months and yes, it's because I've been working too much'. A person with ME/CFS probably has cut back their work hours, and stopped going to the gym or playing in their social sports team. A person with severe ME/CFS is lying in bed most of the day with meals being brought to them. I think most people with ME/CFS would reject the idea that the reason they are fatigued is because they are doing too much.

I don't think people answering that question would think it was a question about PEM. That's for a lot of reasons: PEM is not really about fatigue, it's about feeling ill, whereas someone with ME/CFS can feel fatigued quite a bit of the time. As for question 2, the preceding question asking about having fatigue for at least 6 months would prime respondents to be thinking the question is not about day to day PEM, but more about the overall cause of the fatigue (as Jonathan noted).

There is a Crawley paper analysing data from a large longitudinal cohort study of young people. I'm pretty sure that young people whose parent reported that their child's chronic tiredness was due to playing too much sport were excluded from a CFS diagnosis.

***
I don't doubt that trying to find useful questions in the UK Biobank data to answer the question 'does this person have PEM?' was difficult. But, I'd be much more inclined to think the people who answered 'no' to questions 2 and 3 have ME/CFS, rather than do not have ME/CFS.
 
I'd be much more inclined to think the people who answered 'no' to questions 2 and 3 have ME/CFS, rather than do not have ME/CFS.
The paper says that answering these three questions “yes, no, no” is consistent with PEM. Aren’t you agreeing?

I agree that those questions aren’t very helpful in identifying PEM, but they are relevant to ME/CFS and could be useful in assessing the status of those in other cohorts who also answered the PQ.
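For reference, the “yes, no, no” pattern can be written down as a small rule. This is only a sketch of my reading of the quoted coding, where any answer other than 'yes' counts as 'unaffected'; the function and parameter names are hypothetical, and whether the paper combines the three answers exactly this way is an assumption.

```python
def is_affected(answer: str) -> bool:
    """Only 'yes' counts as affected; 'no', 'do not know' and
    'prefer not to answer' are all treated as unaffected."""
    return answer.strip().lower() == "yes"

def pem_consistent(tired_after_exertion: str,
                   goes_away_with_rest: str,
                   only_from_overexertion: str) -> bool:
    """True for the 'yes, no, no' pattern discussed in the thread."""
    return (
        is_affected(tired_after_exertion)
        and not is_affected(goes_away_with_rest)
        and not is_affected(only_from_overexertion)
    )

print(pem_consistent("yes", "no", "no"))   # the pattern described above
print(pem_consistent("yes", "yes", "no"))  # fatigue goes away with rest
```

Note that under this coding 'do not know' to the second or third question is treated the same as 'no', which may or may not be what a respondent meant.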
 
The paper says that answering these three questions “yes, no, no” is consistent with PEM. Aren’t you agreeing?
In that case, yes.

But the paper (also?) says:
The final 3 questions are related to post-exertional malaise, with possible responses: ‘yes’ (considered ‘affected’), ‘no’, ‘do not know’ or ‘prefer not to answer’ (all considered ‘unaffected’). These questions were only asked to participants who answered ‘yes’ to question (5) above: ‘Do you have persistent or recurrent tiredness, weariness or fatigue that has lasted for at least 6 months?’ including to 1,876 (69%) of those in Cohort 2.

These 3 questions were:
(13) “Do you get tired after minimal physical or mental exertion?” (“Tired after minimal physical or mental exertion”);
(14) “Does this tiredness, weariness or fatigue go away when you rest?” (“Fatigue goes away when resting”); and,
“Is this tiredness, weariness or fatigue happening only because you have been exercising and/or working too much?” (“Fatigue only because of exercising or working too much”).
 
Some comments on the validity of each cohort, adding to what the authors say and what has been posted here to date.

General point

The cohort prevalence rates (all diagnosed cases) are mostly pretty high, ranging from 0.31% for cases with a G93.3 code recorded at hospital admission to 1.63% for those responding to the pain questionnaire. The best estimate we have for the diagnosis rate is about 0.2% from the Samms/Ponting G93.3 HES study (primarily outpatient cases, which are not covered in UKB data). UKB is biased towards healthy individuals, so we might expect a lower actual rate, yet all four cohorts are above this.

I will return to this post with a summary when I’m done with each cohort (don't hold your breath).
 
C2: Self-report of ME/CFS from the pain questionnaire, 2,720 people.

I'm starting with this cohort as it's the most extreme and also has data that can be useful for qualifying matched cases from other cohorts.

The best thing about this cohort is that it asks people if they have ever been told by a doctor that they have “chronic fatigue syndrome/myalgic encephalitis”. This is better than asking about chronic fatigue syndrome alone, which is so easily understood as chronic fatigue. Also, there are a lot of relevant supplementary questions that can be used to qualify the cohort.

Other than that, it doesn't look good.

The prevalence of 1.63% is too high to be credible. Only 18% of those with hospital data have a G93.3 code, rising to a still-low 34% with an ME/CFS code among the 46% who have primary care records. Most have no further evidence of having ME/CFS in any of the other cohorts.

31% of cases don't even have chronic fatigue (fatigue for at least the last six months) and so can't have ME/CFS. And about 30% of the cohort do have CF but provide other answers that don’t fit with ME/CFS, such as not tiring after minimal exertion. Overall, only 1,073 cases (39% of the cohort of 2,720) provide ME-credible answers.
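The arithmetic behind these shares can be checked from the counts quoted in this thread (2,720 in the cohort, 1,876 with chronic fatigue, 1,073 with ME-credible answers). This is just a sketch; the variable names are mine.

```python
cohort_total = 2720          # C2: self-reported ME/CFS on the pain questionnaire
with_chronic_fatigue = 1876  # answered 'yes' to the 6-month fatigue question
me_credible = 1073           # answers consistent with ME/CFS

# Cases with no chronic fatigue at all.
no_cf_share = 1 - with_chronic_fatigue / cohort_total

# Cases with chronic fatigue whose other answers do not fit ME/CFS.
cf_other = with_chronic_fatigue - me_credible
cf_other_share_of_cohort = cf_other / cohort_total
cf_other_share_of_cf = cf_other / with_chronic_fatigue

credible_share = me_credible / cohort_total

print(f"No chronic fatigue: {no_cf_share:.0%}")
print(f"CF but other answers don't fit: {cf_other} "
      f"({cf_other_share_of_cohort:.0%} of cohort, "
      f"{cf_other_share_of_cf:.0%} of those with CF)")
print(f"ME-credible answers: {credible_share:.0%}")
```

The "about 30%" figure corresponds to a share of the whole cohort (803 of 2,720); as a share of those with chronic fatigue it would be about 43%.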

It goes on: most cases (69%, 1,870) did not report they had a diagnosis at the baseline interview. Given the average age at baseline was 56 years, it's pretty unlikely so many have picked up a diagnosis in the decade since (the PQ was mostly done in 2019).

We know from other surveys (e.g. the British Columbia Generations Project, 1.1%, and CDC analysis of the National Health Interview Survey, 1.3%) that asking a general population if they have CFS or ME generates suspiciously high prevalence figures. In the case of the British Columbia survey, Louis Nacul showed that 66% of those who said they had a diagnosis did not meet diagnostic criteria for ME/CFS. As someone here pointed out, many people will probably assume that chronic fatigue syndrome is the fancy medical name for chronic fatigue.

A possible explanation for the high rate of self-reported ME/CFS in the PQ cohort is that asking people struggling with generally poor health if they had chronic fatigue syndrome will generate a lot of false positives. Chronic fatigue is common across illnesses, not least in chronic pain. This could go a long way to explaining the implausibly high prevalence figure of 1.63%.

The paper recommends restricting this cohort for use in research either to the 69% who also have chronic fatigue or to those who also have other answers consistent with ME/CFS (39%, 1,073), which still gives a prevalence rate of 0.64%.
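The restricted prevalence follows from scaling the full-cohort figure by the share of cases kept. A minimal sketch, assuming the same pain questionnaire denominator applies to both figures (the denominator itself is not quoted here, so it is not used):

```python
full_cohort_cases = 2720        # all C2 cases
restricted_cases = 1073         # cases with ME-credible answers
full_cohort_prevalence = 1.63   # percent, as quoted for C2

# Same denominator, fewer cases: prevalence scales with the case count.
restricted_prevalence = full_cohort_prevalence * restricted_cases / full_cohort_cases

# For comparison, the diagnosis-rate estimate of about 0.2% quoted earlier.
reference_rate = 0.2
ratio_to_reference = restricted_prevalence / reference_rate

print(f"Restricted prevalence: {restricted_prevalence:.2f}%")
print(f"Ratio to 0.2% estimate: {ratio_to_reference:.1f}x")
```

Even the restricted selection sits at roughly three times the 0.2% estimate.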

Summary: the data in the paper - not least the resulting prevalence figure of 1.63% - indicates that most C2 cases do not have ME/CFS, but a restricted selection has potential.
 
Maybe «considered affected» refers to affected by what the question asks about, and not affected by PEM?
Yes, probably. It's just pretty odd, with that construction.
The final 3 questions are related to post-exertional malaise, with possible responses: ‘yes’ (considered ‘affected’)

I note that those questions supposedly determining if someone has PEM all only ask about fatigue. I haven't read this paper in detail yet, but I'm assuming that someone other than the authors developed the questions and they are just trying to make use of them.

A lot of the problems in trying to determine whether people have ME/CFS are related to past (and present) assumptions that the illness is just chronic fatigue.
 