Use of EEfRT in the NIH study: Deep phenotyping of PI-ME/CFS, 2024, Walitt et al

OK @Evergreen one more version (let me know if the rose works OK)

Because I'd had the warm-up rounds in grey and lost that while playing with the background, I've reinstated the greying for the first 4 rounds. Of course there are pros and cons, and it could be that the added contrast removes any benefit the greying would otherwise provide :)

edited to update table with paler pink

walitt big table hard choices paler pink.png
 

Attachment: walitt big table hard choices with muted and warm-up grey.png

This thread has also gotten awfully long and most of us are probably struggling to keep up with even half of the messages. So we should probably ask ourselves: where do we stand right now? What do we now know with certainty, what exactly don't we know yet, and what should we still take a closer look at?

I feel satisfied that I understand the test fails on 3 levels:

1. It is conceptually inappropriate to use in ME/CFS, a physical disease where effort preference isn't a legitimate scientific question, and in which the test hasn't been validated
2. The high failure rate of ME/CFS patients on hard tasks renders the measure invalid as a measure of preference. This argument is the strongest one because it flies even if you accept EEfRT as a good and legitimate test: It fails on its own terms.
3. The exclusion of healthy volunteer F's data is necessary to make the primary endpoint significant. It can't be rationally justified, but it can be justified based on precedent.

A to-do list from here might be to see what was pre-specified in terms of exclusions. Did they consider the possibility that ME/CFS patients would fail the hard task at such high rates?
Then to communicate the problems. Ask for an erratum? A retraction? Fight a rearguard action in the press?
 
Below I will be presenting my first collection of plots. For the most part this shows what the brilliant plots made by @Murph, @bobbler and @Karen Kirke have already shown. All of these plots and calculations were done in R, so the colours, formatting and presentation can easily be changed, and the plots, code and calculations (mean, SD etc.) are suitable for publication once the appropriate changes are made.

I will be presenting every plot once for all the participants and once excluding HV F (his data is pretty irrelevant for what is presented here, but I've included it anyway); the same applies to the calculations. All of these plots centre on the completion rates of hard rounds; easy rounds and their completion rates have been ignored for now.

The first plot is similar to what @Karen Kirke first presented, but I chose a slightly different graphical presentation and also included mean values (red is for ME/CFS, blue is for HV; the dotted lines are the mean number of hard trials successfully completed, while the continuous lines are the mean number of hard trials chosen), which I believe already tell us a lot. (@Simon M first brought up the idea of including these, I believe. I have stuck to ranking participants alphabetically for now, but will still look at whether ranking them by things such as SF-36 could reveal something interesting.) The ratio of the different colours directly shows the percentage of hard rounds successfully completed, instead of this being a separate piece of information.
View attachment 21308
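(Not the poster's actual R code - just a minimal sketch of how a plot like the one above could be built, assuming a hypothetical trial-level data frame `trials` with columns `participant`, `group`, `trial_no`, `choice` and `completed`; the toy data below stands in for the real extracted trial data.)

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the extracted trial-level data: one row per round
trials <- data.frame(
  participant = rep(c("HV A", "HV B", "ME/CFS A", "ME/CFS B"), each = 10),
  group       = rep(c("HV", "HV", "ME/CFS", "ME/CFS"), each = 10),
  trial_no    = rep(1:10, times = 4),
  choice      = sample(c("hard", "easy"), 40, replace = TRUE),
  completed   = sample(c(TRUE, FALSE), 40, replace = TRUE)
)

# Per participant: hard rounds chosen and hard rounds successfully completed
hard <- trials %>%
  filter(choice == "hard") %>%
  group_by(group, participant) %>%
  summarise(chosen = n(), completed = sum(completed), .groups = "drop")

# Group means: the solid (chosen) and dotted (completed) reference lines
group_means <- hard %>%
  group_by(group) %>%
  summarise(mean_chosen = mean(chosen), mean_completed = mean(completed), .groups = "drop")

ggplot(hard, aes(x = participant)) +
  geom_col(aes(y = chosen), fill = "grey85") +      # all hard rounds chosen
  geom_col(aes(y = completed, fill = group)) +      # the successfully completed portion
  geom_hline(data = group_means, aes(yintercept = mean_chosen, colour = group)) +
  geom_hline(data = group_means, aes(yintercept = mean_completed, colour = group),
             linetype = "dotted") +
  scale_fill_manual(values = c("HV" = "steelblue", "ME/CFS" = "firebrick")) +
  scale_colour_manual(values = c("HV" = "steelblue", "ME/CFS" = "firebrick")) +
  facet_wrap(~ group, scales = "free_x") +
  labs(x = NULL, y = "Hard rounds")
```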

The same applies when HV F is excluded, see below.
View attachment 21309


These plots clearly indicate the massive difference between the percentage of pwME and HV that complete hard tasks. Note also how far pwME stray from the mean; there is a lot of deviation. This has all been discussed at length already.

The second set of plots is analogous to the plots above, but only for the first 35 rounds of trials (excluding the practice rounds). These plots are more "interesting" than the ones above, as they ensure that all participants have played the same number of rounds and also limit in-game fatiguing effects. Focusing on a fixed number of rounds has been done in several different EEfRT studies; the choice of 35 is somewhat arbitrary and varies from paper to paper (large enough to be meaningful, but not so large that in-game fatigue occurs).
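(A minimal sketch of that restriction, assuming the same hypothetical `trials` layout as in the sketch above, plus a logical `practice` flag marking the warm-up rounds - an assumption, since the extracted data may encode this differently.)

```r
library(dplyr)

# Keep only the first 35 real rounds per participant, practice rounds excluded
first35 <- trials %>%
  filter(!practice) %>%
  group_by(participant) %>%
  arrange(trial_no, .by_group = TRUE) %>%
  slice_head(n = 35) %>%
  ungroup()
```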

View attachment 21310

The same applies when HV F is excluded, see below.
View attachment 21311

What we see here is just as striking as what we had already seen before.

Now come the plots of the trial rounds:

View attachment 21312

The same applies when HV F is excluded, see below.

View attachment 21313

This plot clearly shows us the fundamental problem in the trial. In the trial rounds, fatigue caused purely by the game and motivation/strategy will play less of a role. Yet pwME cannot complete the tasks. We cannot know whether the results of the study would have been different if the trial rounds had served as calibration rounds, but failing to use them as such means we cannot appropriately interpret the results of the EEfRT. Note that here pwME actually choose the hard task more often than the controls; this is the only time this is the case. Unfortunately, due to the small sample size it's hard to conclude anything from that, apart from the fact that not appropriately calibrating the EEfRT has led to inconclusive results. The median of HV playing hard rounds during the trial rounds appears to be proportional to that in the first 35 rounds.


Finally, I have plotted the hard rounds as the game progresses, divided into two halves: the first half of the total rounds played per player and the second half (i.e. player HV A plays 48 rounds in total, so the first half is the first 24 rounds and the second half the next 24 rounds, and we then look at the hard rounds in each half separately; if someone plays an uneven number of rounds, the extra round is part of the first half of the game).
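(A sketch of that halving rule, again on the assumed `trials` layout.)

```r
library(dplyr)

# Assign each round to a half per participant; with an uneven number of rounds
# the extra round falls into the first half (ceiling of n/2), as described above
halves <- trials %>%
  group_by(participant) %>%
  arrange(trial_no, .by_group = TRUE) %>%
  mutate(half = if_else(row_number() <= ceiling(n() / 2),
                        "first half", "second half")) %>%
  ungroup()

# Completion rate of chosen hard rounds, per group and half
halves %>%
  filter(choice == "hard") %>%
  group_by(group, half) %>%
  summarise(pct_hard_completed = mean(completed), .groups = "drop")
```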

View attachment 21314

View attachment 21315

I believe this hasn't been plotted or discussed yet at all. It strongly suggests that the mix of learning effects + saturation effects + fatigue as the rounds progress actually dominates the completion of hard tasks in HV more than it does in pwME. Both groups are still opting for hard rounds just as much as at the beginning of the game, but even HV are now completing them less often. This has been seen in several other studies and is one of the reasons a universal cut-off is typically used in combination with using the trial rounds as calibration.

In terms of completing hard tasks: after having played the first half of the rounds, HV are now closer to the point pwME were at, at the beginning of the game. Would their behaviours eventually converge? We would need far more data to know more about that...


The same naturally applies if HV F is excluded (see below).

View attachment 21316 View attachment 21317


In the next collection of plots I'll consider the above data in light of the participants' SF-36 data, and also the different reward and probability levels of hard trials.

Interesting and really well laid out to get these points across. It builds/layers each finding in a way that means, I think, most people will very clearly get what you are demonstrating. I like the approach.

I am yet to catch up with the next posts, so apologies if I jump the gun on you saying this in those. But before I forget it forever I'm bunging something down on these thoughts. :)

Is the dotted line the median? Because assuming it is, these charts indeed make it clear just how different the mean and median are for ME/CFS, yet on some of the charts they are almost the same for HVs. (I will look back, because now I'm intrigued as to whether that 'slips' when looking at the later rounds; that would be an interesting nod towards a fatiguing effect for HVs - their field beginning to separate on this - and where it might end.)

Anyway, I point this out because a lot of the statistics that the EEfRT relies on for analysis require looking at the distribution, and you are basically comparing a normal distribution where the mean and median are as tight as the HVs' with one where that is certainly not the case. (I'm thinking it is tri-modal on the capability front, to be honest: the group in the middle are coping with the effects of the task being undoable in a different way from those playing it as per the HVs but with the disability just 'showing' its non-completion effects; but then who knows, because that will overlap with the additional effects the illness has on how people can approach the test itself.)

This is where I'd have to re-acquaint myself with all the ins and outs of all possible tests, but surely that is an issue for quite a lot of analyses, and we all know the whole parametric vs non-parametric debate and so on - how can you do the tests you'd want to when the distributions you are 'comparing' are so different? And then of course I'm unsure (you've probably seen a hint of it) about using an on-off approach rather than a scale of ME/CFS-'ness' of different types - which could have been interesting if the data had been calibrated beforehand, because different aspects within those scales could have pointed to different issues - when there is such an obvious split even if you just look at the ME/CFS participant data.

Sorry I'm rambling on now! o_O
 
Got a reply from Treadway on EEfRT. He seems suitably cautious about the utility of using it in this population.

Hi,

Sorry for the delay. The task has not been specifically validated as a measure of fatigue. My understanding of the paper is that they are trying to understand the clinical features of this illness, which are not well known. In that sense, it seems appropriate to me to use the EEfRT as an exploratory measure to determine whether it is sensitive to PI-ME. In other words, I would look at this study as an initial attempt in the validation of the EEfRT for this type of population. Viewed in that way, I think this is a valid use of the task.

I hope that helps.

Best,

Michael
 
1. My memory of this research program is that they brought each patient in sequentially. Each person was at NIH for around a week, one after another (which is why it took so long). Somebody way upthread was pondering this too, but there's no way for subjects to have spoken to each other irl.
2. Because this research is funded directly by the US federal government, it should be possible to get our hands on information. The Freedom of Information Act covers NIH, and if they won't give the information nicely, someone can pursue them for it. An interesting request might be "All correspondence and minutes of meetings relating to the use of the EEfRT and subsequent analysis of said data" - see if Nath and Walitt had any disputes about how far to push it.
https://www.nih.gov/institutes-nih/...public-liaison/freedom-information-act-office. Probably a job for a US citizen to do the actual filing!

This first point is a useful insight regarding the way the test ended up not being re-run or calibrated, when surely, even if you are so minded (to confuse disability with other concepts), as soon as you look at how far off a significant chunk of the ME/CFS group were the whole way through from even completing hard tasks, before having done many rounds, it would have been flagged as an issue. Surely.

Unless you are stuffed because you've been doing it one by one and it has all gone seemingly swimmingly for the first x participants, so you just don't 'see' it building in the way that would have happened if you'd been able to run a few testers on a sample representative of the future participants.

I'm interested in the experience @andrewkq notes of running similar tests, and whether, under circumstances where testing (I'm assuming normally) involves more participants taking it within a closer timeframe of each other, there was always an eye out/a chance that, if strange things cropped up, such information would be fed back in at that point, prompting a look and a pause before deciding the way forward?

Tricky when it might be a test you can't re-run on the same participants for various reasons, and the point of your trial is studying the participants 'in depth', versus a trial where the tool could perhaps be the only test, or one of only a few tests, being run past larger numbers of people.

However, I'm aware that it was 8 years in the making, and perhaps it might have been possible for better validation to have taken place outside the trial before those with ME were drafted in (noting there seems to have been more than one 'trip' to the centre) for said tests?
 
Got a reply from Treadway on EEfRT. He seems suitably cautious about the utility of using it in this population.

Hi,

Sorry for the delay. The task has not been specifically validated as a measure of fatigue. My understanding of the paper is that they are trying to understand the clinical features of this illness, which are not well known. In that sense, it seems appropriate to me to use the EEfRT as an exploratory measure to determine whether it is sensitive to PI-ME. In other words, I would look at this study as an initial attempt in the validation of the EEfRT for this type of population. Viewed in that way, I think this is a valid use of the task.

I hope that helps.

Best,

Michael

I cross-posted with the post above. But this is interesting. Thanks for posting it.
 
3. The exclusion of healthy volunteer F's data is necessary to make the primary endpoint significant. It can't be rationally justified, but it can be justified based on precedent.

Just adding on, this isn't a problem with the use of the EEfRT in this study, it is a problem with the test in general. The fact that the strategy that makes the most money is different from the strategy that takes the most effort is a conflict.

If participants don't actually care about the money, then this is a moot point. However, if you really did want a larger payout, then it is an issue that a lower-effort strategy achieves that goal. If a very highly motivated participant can score as having "low effort preference" just because they wanted more money, then there is an issue with the test. Yes, you could just throw out any of the data that appears to follow this strategy. However, it is possible that others attempted to do this in more subtle or less efficient ways that change the outcome.

It is also silly because this would be an easy thing to fix. Just ensure that the highest payouts occur with the most effort.
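(A toy illustration of the conflict described in this post - every number here is made up, not the study's actual rewards, trial timings or payment rule. The point is only that when easy trials are much quicker, an "all easy" strategy can fit in enough extra trials to rival or beat an "all hard" strategy in expected payout.)

```r
# Illustrative assumptions only
easy_secs <- 8;  easy_pay <- 1.00   # quick, low-value trial
hard_secs <- 25; hard_pay <- 2.50   # slow, higher-value trial
win_prob  <- 0.5                    # same win probability for both, also assumed
session_secs <- 15 * 60             # fixed-length session, also assumed

expected_all_easy <- (session_secs / easy_secs) * win_prob * easy_pay
expected_all_hard <- (session_secs / hard_secs) * win_prob * hard_pay
c(all_easy = expected_all_easy, all_hard = expected_all_hard)
# With these made-up numbers: all_easy = 56.25, all_hard = 45
```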

As I mentioned earlier, I have a feeling these points won't be well understood by the researchers, though.
 
The EEfRT is also terrible because it takes a complex decision-making task, with many hundreds or thousands of inputs, and simplifies it to a single measure of "effort preference". If we think about the reasons why people might choose one task over another, there are obvious ones: the level of fatigue, monetary desires, understanding of the instructions, intelligence, and the influence of PEM. However, there are many other, more subtle factors: what they ate for breakfast, how comfortable their clothing is, what they thought about the researchers, what time of day the tests were performed, how well they slept, whether they needed to use the bathroom, what they would spend the money on, etc.

While on their own these small factors may have little influence, they all impact our decision making on a day-to-day basis. The researchers trying to simplify decision making in this way is like saying the US landed on the moon before Russia because it had a stronger effort preference to do so. It just makes no sense to frame the problem in those terms.
 
I feel satisfied that I understand the test fails on 3 levels:

1. It is conceptually inappropriate to use in ME/CFS, a physical disease where effort preference isn't a legitimate scientific question, and in which the test hasn't been validated
2. The high failure rate of ME/CFS patients on hard tasks renders the measure invalid as a measure of preference. This argument is the strongest one because it flies even if you accept EEfRT as a good and legitimate test: It fails on its own terms.
3. The exclusion of healthy volunteer F's data is necessary to make the primary endpoint significant. It can't be rationally justified, but it can be justified based on precedent.

You are all doing a great job on this @Murph.

The multiple levels of failure are worth itemising.

I would just advise against any suggestion that ME/CFS is a 'physical disease' that excludes consideration of effort. Diseases involving events known to us as thoughts are just as physical - they must be to have physical effects. It is quite legitimate to study effort in ME/CFS as long as results are interpreted in a plausible way. That is why I suggested the need to distinguish 'B causes A and C' from 'B causes A, which causes C'.

The 'physical disease' argument is the easiest of all for the BPS people to shoot down and win in medical circles. And psychiatric diseases are just as physical and disabling as ME so it brings in a prejudice we can do without.

I don't think exclusion of HVF is justified, even if somebody's respecified rules allow it. Treadway may be of help in that he has identified various issues with the test that need careful handling, but there is no guarantee that the test is any use even with that care. My impression from what I have seen of its use is that you simply cannot draw the sort of conclusions psychologists would like to draw from overlapping scatter plots with weak correlations and modest p values.
 
The 'physical disease' argument is the easiest of all for the BPS people to shoot down and win in medical circles. And psychiatric diseases are just as physical and disabling as ME so it brings in a prejudice we can do without.

Going down this route will also allow them to focus on the stigmatising bit and give a lecture on dualism, ignoring the stronger and most important point: that pwME/CFS were unable to complete the hard tasks, and therefore the measure is not a measure of preference.
 
Is the dotted line the median?

The solid line is the mean number of hard tasks chosen and the dotted line is the mean number of hard tasks successfully completed. IMO these are the two most important basic values we are looking at in our argument (I have also calculated variance for each group in each plot, but didn't want to overload the plots/the post for now).
 
I will play devils advocate for a moment:

Something that is quite striking, and something I believe still has to be discussed, is that in the second half of the game both HV and ME/CFS participants chose to play hard more often than in the first half, even though, in percentage terms, both complete it even less often than in the first half (I will additionally cut the games in half after I have specified a fixed number of games played, to ensure that everyone gets to play the same number of games; this is typically what other studies have done).

The ratio of successful completions of hard to choices of hard in the second half of the game for HV gets closer to the ratio with which patients with ME/CFS start the game.

The main argument of Walitt et al, in its simplified version, is “pwME choose hard less often, which means they prefer not to use effort”. Now we are arguing “that is an incomplete assessment, because pwME have lower chances of completing hard tasks, and as such some in-game learning effects, rather than the intrinsic nature of pwME, might lead to choosing hard less often”.

The problem with this argument of ours, and something that Walitt et al can argue, is that comparing the second half of the game data to the first half of the game data shows that striking out more often doesn’t lead to choosing hard less often. I haven’t seen a convincing counterargument of our own against such a counterargument. @Murph @bobbler @andrewkq
 
The problem with this argument of ours, and something that Walitt et al can argue, is that comparing the second half of the game data to the first half of the game data shows that striking out more often doesn’t lead to choosing hard less often.

At the risk of proposing an unfalsifiable argument: as well as experiencing increasing fatigue as the task goes on, participants are learning about what they can do, so perhaps they become less cautious. Alternatively, as time runs down, participants see the finish line in sight and become less cautious. For me these multiple possibilities just illustrate the problems of interpreting anything meaningful from this task.

If you don’t know in advance what the cost of the exertion is going to be it makes sense to focus initially on the easy tasks but as time runs down to then try more hard ones.
 
At the risk of proposing an unfalsifiable argument: as well as experiencing increasing fatigue as the task goes on, participants are learning about what they can do, so perhaps they become less cautious. Alternatively, as time runs down, participants see the finish line in sight and become less cautious. For me these multiple possibilities just illustrate the problems of interpreting anything meaningful from this task.

If you don’t know in advance what the cost of the exertion is going to be it makes sense to focus initially on the easy tasks but as time runs down to then try more hard ones.

I understand your point and certainly agree that the EEfRT is far too unrobust to tell us anything about ME/CFS, but this argument would be based purely on hypotheticals and ideas, not on any data. That is not convincing to me, especially if the other side has data to back up their argument (even if this data is iffy and not robust).
 
my first collection of plots.
These are great – the mean lines (which Simon M wisely suggested for Karen Kirke’s graph) are so helpful for seeing what is going on. And I agree that the practice rounds contain important info.

Can I suggest some tweaks?
· Add a legend for the dotted lines.
· Instead of True vs False, Yes vs No. Or a descriptive Successfully completed vs Not successfully completed or Successfully completed vs Failed
· Change “First 4” to “Practice rounds” to distinguish them from the initial rounds of the real task.
· Make the “False”/not successfully completed bar colour a little darker as it’s hard to see.
 
Pre-test calibration
Right off the bat we have what may be the dealbreaker - Treadway recommended individual calibration of what constitutes a hard task in this schizophrenia study:
I mentioned that back a bit - the review paper I was talking about noted that there were a number of studies that did this calibration. I'll paste the link to the study here - it's one of the ones that bobbler noted.
Examining the reliability and validity of two versions of the Effort-Expenditure for Rewards Task (EEfRT) | PLOS ONE
I think it's worth noting that some studies in the literature actually tested participants' ability to perform the tapping before the experiment and then adjusted the targets accordingly, so the results supposedly weren't disrupted by differences in physical capability. That the investigators in the Walitt et al study didn't do this when dealing with a patient cohort reporting reduced ability to perform repetitive actions could be criticised. Even that sort of modification is problematic, though, because the healthy and patient cohorts will have different responses to prolonged tapping.
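(A sketch of what such a per-participant calibration step could look like - the baseline idea and the 80% fraction are illustrative assumptions, not values taken from any particular paper.)

```r
# Set an individual hard-task target as a fraction of a participant's maximum
# tapping count measured in a short baseline window before the task proper
calibrate_hard_target <- function(baseline_taps, fraction = 0.8) {
  round(fraction * baseline_taps)
}

calibrate_hard_target(90)   # e.g. 90 taps at baseline gives an individual hard target of 72
```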


post-hoc data selection
The response to why exactly participant HV F was excluded has remained somewhat vague to me, or more precisely I haven't yet received precise information on the defined exclusion criteria apart from "Review of data for performance and task validity is part of the pre-existing way of handling the data; the evaluation of invalid performance and task validity takes place after the data is collected.". So it seems they indeed had prespecified criteria, rather than judging the data post-hoc. He mentions that this participant clearly had no problem completing certain tasks but rather chose not to (which we can all agree with), which to them indicates that this is either an invalid performance or an invalid task administration.
  • It would be nice to know what exactly the prespecified criteria are, I will ask him whether these can be provided, but I also don't think the authors would have to tell us these.
I think there's quite a lot of vagueness in that response from Walitt. I don't think the 'pre-existing way of handling the data' necessarily means that they had pre-specified criteria for valid data. I think it just means that they had always intended to look at the data and throw out what they didn't like. And the following phrase 'the evaluation of invalid performance and task validity takes place after the data is collected' makes me think it was all post-hoc even more. I remain amazed that the investigators didn't foresee participants doing what HVF did, and take some steps to make their experiment better.


Summary of arguments
1. It is conceptually inappropriate to use in ME/CFS, a physical disease where effort preference isn't a legitimate scientific question, and in which the test hasn't been validated
2. The high failure rate of ME/CFS patients on hard tasks renders the measure invalid as a measure of preference. This argument is the strongest one because it flies even if you accept EEfRT as a good and legitimate test: It fails on its own terms.
3. The exclusion of healthy volunteer F's data is necessary to make the primary endpoint significant. It can't be rationally justified, but it can be justified based on precedent.
Excellent. I think we need to add in there the ridiculously small size of the experiment, especially compared with much larger experiments where the investigators were still concerned about noise in the data. Perhaps it's part of argument 3, in that one person pursuing their own strategy could make the difference between a significant and a non-significant difference. I wonder what would happen if we took one or two of the ME/CFS participants' data out, or replicated one of the ME/CFS participants' data? How much fiddling, adding in or subtracting a participant, would we need to do to make the result non-significant again?
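(A sketch of the kind of leave-one-out check being suggested here, on the assumed `trials` layout from earlier in the thread. It uses a simple stand-in outcome - each participant's proportion of hard choices, compared between groups with a Wilcoxon test - rather than the paper's actual repeated-measures model; the point is only the shape of the sensitivity analysis.)

```r
library(dplyr)

# Per-participant proportion of hard choices
per_person <- trials %>%
  group_by(group, participant) %>%
  summarise(prop_hard = mean(choice == "hard"), .groups = "drop")

# Drop each participant in turn and record the between-group p-value
loo_p <- sapply(per_person$participant, function(p) {
  d <- filter(per_person, participant != p)
  wilcox.test(prop_hard ~ group, data = d)$p.value
})

data.frame(dropped = per_person$participant, p_value = round(loo_p, 3))
```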


why participants keep trying to do hard tasks after repeatedly failing to complete
I will play devils advocate for a moment:
Something that is quite striking, and something I believe still has to be discussed, is that in the second half of the game both HV and ME/CFS participants chose to play hard more often than in the first half, even though, in percentage terms, both complete it even less often than in the first half (I will additionally cut the games in half after I have specified a fixed number of games played, to ensure that everyone gets to play the same number of games; this is typically what other studies have done).

The ratio of successful completions of hard to choices of hard in the second half of the game for HV gets closer to the ratio with which patients with ME/CFS start the game.

The main argument of Walitt et al, in its simplified version, is “pwME choose hard less often, which means they prefer not to use effort”. Now we are arguing “that is an incomplete assessment, because pwME have lower chances of completing hard tasks, and as such some in-game learning effects, rather than the intrinsic nature of pwME, might lead to choosing hard less often”.

The problem with this argument of ours, and something that Walitt et al can argue, is that comparing the second half of the game data to the first half of the game data shows that striking out more often doesn’t lead to choosing hard less often. I haven’t seen a convincing counterargument of our own against such a counterargument. @Murph @bobbler @andrewkq
I think I'd argue that participants had already bagged their $1 from the easy tasks. The rules of the game had been explained to the participants and they had time to think about it before they even started. So, most of them would have known that doing more $1 easy tasks was not going to increase their chances of getting a higher payout. Instead, to increase the number of higher value rewards in their pool for later selection, they would have to try to get some of the hard task rewards. There was nothing lost by trying and failing. And some of the participants (ME/CFS- A and ME/CFS - D) did manage to complete a hard task after trying and failing. I think it should be very hard to argue that those people weren't putting effort in.
 
let me know if the rose works OK
I think having a contrast between the practice and real rounds is good. I still need to look at the graph on low brightness to be able to tell whether it works, or whether everything needs to be paler/more muted. If the rose were paler and the background cells were paler, it would be easier. Thank you so much for all the experimentation. Do stop when it makes sense to.
 
I haven't been able to follow all this discussion, and don't want to add to the burden of too many posts to read, so I'll try to make this brief.

Are participants told before the task that they are being assessed for their effort preference? Or for anhedonia, or something else? If not, what are they told?

I am imagining if I were a participant with ME/CFS and I were asked to perform this task, I would think it remarkably stupid and not worth expending effort on. The ME/CFS participants were there for biomedical testing, not silly mind games where they are trying to second guess strategies. My preference would be to opt out and conserve my energy for the worthwhile stuff.
 
I think I'd argue that participants had already bagged their $1 from the easy tasks. The rules of the game had been explained to the participants and they had time to think about it before they even started. So, most of them would have known that doing more $1 easy tasks was not going to increase their chances of getting a higher payout. Instead, to increase the number of higher value rewards in their pool for later selection, they would have to try to get some of the hard task rewards. There was nothing lost by trying and failing. And some of the participants did manage to complete a hard task after trying and failing. I think it would be very hard to argue that those people weren't putting effort in.

I'm not fully understanding this point, as I don't see how that is a counterargument to their argument "pwME try hard less often because they prefer to exert less effort". In any case, if having already bagged the $1 twice were our counterargument, one would have to look at whether the data backs such an argument, i.e. split the data into "before everyone has bagged $1 twice" and "after everyone has bagged $1 twice", which I don't think anybody has done yet, and I expect (but could be wrong) that one would still see pwME choosing hard less often in both splits.
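(One way such a split could be operationalised on the assumed `trials` layout, treating "bagged the $1" loosely as having completed an easy round - the probabilistic win draw is ignored here, which is a simplification.)

```r
library(dplyr)

# Flag, for each participant, the rounds played after their second completed
# easy round, then compare hard-choice rates before and after that point
split <- trials %>%
  group_by(group, participant) %>%
  arrange(trial_no, .by_group = TRUE) %>%
  mutate(easy_done = cumsum(choice == "easy" & completed),
         phase = if_else(lag(easy_done, default = 0L) >= 2,
                         "after two easy completions", "before two easy completions")) %>%
  ungroup()

split %>%
  group_by(group, phase) %>%
  summarise(prop_hard_chosen = mean(choice == "hard"), .groups = "drop")
```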

Something that wouldn't be a direct counterargument, but rather an alternative interpretation of the data (for which one would have to do some analysis to see how well it holds up), is this: they roughly argue that "HV and pwME can be separated to some degree by how often they choose to do a hard task". If we show that "percentage of hard tasks completed" offers a better separation, and that this separation is not equivalent (or strongly correlated) to theirs, one has grounds for an alternative interpretation of the data that could in fact be better than theirs, which makes their whole argument look insufficient, but wouldn't falsify it.
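(A sketch of how that comparison of separations could be run on the assumed `trials` layout, using the rank-based AUC - equivalent to the Mann-Whitney statistic - as a simple measure of how well each variable separates HV from ME/CFS.)

```r
library(dplyr)

# Per participant: proportion of hard tasks chosen (roughly their measure) and
# percentage of chosen hard tasks completed (the proposed alternative)
per_person <- trials %>%
  group_by(group, participant) %>%
  summarise(prop_chosen   = mean(choice == "hard"),
            pct_completed = mean(completed[choice == "hard"]),   # NaN if no hard choices; would need handling
            .groups = "drop")

# Rank-based AUC: probability a randomly chosen HV scores higher than a randomly chosen pwME
auc <- function(score, is_hv) {
  r  <- rank(score)
  n1 <- sum(is_hv)
  n0 <- sum(!is_hv)
  (sum(r[is_hv]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

with(per_person, c(
  auc_prop_chosen   = auc(prop_chosen,   group == "HV"),
  auc_pct_completed = auc(pct_completed, group == "HV")
))
```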
 