Use of EEfRT in the NIH study: Deep phenotyping of PI-ME/CFS, 2024, Walitt et al

I ran these and both tests show that ability to complete the hard trials is correlated with Proportion of Hard-Task Choices, which means effort preference is officially confounded with ability.

That’s awesome. Damn distribution assumptions…

Curious if you included any data from the Trials -4 through -1. I may have missed conversations about the practice trials. So apologies if I’m being redundant.

Depending on how practice was run they may have meaningful information on PHTC. But I’d definitely position them as meaningful to ability to complete tasks

(Below regards that time period but focused more on questions about PHTC. But I also noticed completion rates were very low out of the gate. Again sorry if I missed whole conversation on this already!)

(PwME “win” five of the first six trials (including practice). If this wasn’t directed, how does it show HVs having more Effort Preference? If it was directed, why so biased towards fatiguing PwME?)

(in case you need another bullet for your gun…)

This table shows Probability Hard Task is Chosen (PHTC), OR for the first six trials, including trials immediately before the 15-minute timer starts:
[Attached image: table of PHTC / OR for the first six trials]

edited to add:
Just saw your comment
Let me know if I'm missing something or misapplying the stats here.

It’s been a while, I’m rusty. But, I’ll check out the stats.

I also think it’s worth the wait. They said they don’t accept after 18 months. You have plenty of time.
 
This means that the correlation should be tested with a non-parametric test that does not assume normality, like Spearman's rho or Kendall's tau. I ran these and both tests show that ability to complete the hard trials is correlated with Proportion of Hard-Task Choices, which means effort preference is officially confounded with ability.
self-reported physical dysfunction on SF-36 is also correlated with ability to complete the hard trials, which means the more disabled you are, the harder it is for you to complete the EEfRT hard trials.
What a clearly-written post. Assuming the analyses are appropriate (not something I can judge), I think you're right, this will be the heart of the response. Great call on going back and checking.

Curious if you included any data from the Trials -4 through -1...
Depending on how practice was run they may have meaningful information on PHTC. But I’d definitely position them as meaningful to ability to complete tasks
Agree, I think they're worth a look and a mention. We can't assume that participants were giving their all on each of the trial rounds, but for people who chose 2-3 hards in the trial rounds and failed each time, it seems likely this would feed into their strategy for the real rounds that followed, because they were likely to have been trying their hardest at least once. And as I think @ME/CFS Skeptic's draft showed, there was a big difference in the % of hards completed in the trial rounds between HVs and pwME.

It’s been a while, I’m rusty. But, I’ll check out the stats.
I also think it’s worth the wait. They said they don’t accept after 18 months. You have plenty of time.
I agree, take all the time you (plural) need to get the argument right. When PEM derails you, that can be a good time to email the draft to others for comments, so that by the time your brain is back in action, you'll have a ready-made to-do list. And don't worry about word count too much at this point, because this version is just being emailed to the authors (per this journal's policy Use of EEfRT in the NIH study: Deep phenotyping of PI-ME/CFS, 2024, Walitt et al). If, after that, a version is going to be submitted to the journal, then it would need to be altered a little to address any points the authors make in their email response.
 
This description, if it accurately represents what happened, makes a total mockery of any comparison between the patients and the healthy volunteers. They effectively were undertaking very different activities.
If so, it's significant for two reasons off the top of my head:

1. any inference by the authors that ideas of pacing or potential harm came from anything ‘internal’ to the pwME, or were a concept related to ‘a condition’, versus being ‘instructed’ externally by someone in charge (or said as a perceived threat/warning, i.e. coercion)

2. it makes it clear not only that it was foreseen that the design of layering multiple tasks into short spaces of time would mean putting people into PEM before other tasks (thereby influencing what they were actually testing in pwME), but also that they were aware they’d created a ‘marathon of sprints’ über-task for pwME while comparing to healthy controls who were able to recover between tasks (if they didn’t get the warning). Which surely means different descriptions of said tests were required, as it is like asking an HC to do a memory test after, e.g., being awake for 2 days, or doing a bleep test with heavy metal music vs without. Quite different tests.

edit to add: how does, e.g., Walitt think that is replicable given the unusual tranche of tests in this study, which included a sample too small to validate an EEfRT on its own (15 in the pwME sample)? And what does it even mean, even if we do know the severity for each individual and the load of exertion they would be carrying from travel, the stay and prior tests?
 
I think we finally have our smoking gun. Our argument has been weakened so far by lack of statistical evidence that ability to complete the hard trials is related to Proportion of Hard-Task Choices (PHTC) aka "effort preference". When I tested for this previously, I used a Pearson correlation and a linear regression and they were both non-significant. We had a lot of things we were looking into at the time so I just moved on to the next question without much thought. However, on closer inspection, these were not the correct tests to do because they assume normal distributions. Pearson's correlation assumes normality in both variables being compared, and linear regression assumes normality in the residuals from the regression model. But the percentage of hard tasks completed is highly negatively skewed (skewness = -1.37)! Here's a histogram showing the skewness:

View attachment 21560

This means that the correlation should be tested with a non-parametric test that does not assume normality, like Spearman's rho or Kendall's tau. I ran these and both tests show that ability to complete the hard trials is correlated with Proportion of Hard-Task Choices, which means effort preference is officially confounded with ability.

View attachment 21561
View attachment 21562

And just in case that isn't sweet enough, the icing on the cake is that self-reported physical dysfunction on SF-36 is also correlated with ability to complete the hard trials, which means the more disabled you are, the harder it is for you to complete the EEfRT hard trials.

View attachment 21563

Unfortunately SF-36 isn't correlated with PHTC, but you can only ask for so much in a severely underpowered study.

View attachment 21564
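For anyone who wants to poke at this themselves, here's a minimal sketch of the kind of checks described above (skewness, then rank-based correlations instead of Pearson). The file and column names are hypothetical placeholders, not the actual data export or analysis script.

```python
# Minimal sketch of the checks described above (not the actual analysis script).
# Assumes a hypothetical per-participant CSV with columns:
#   phtc           - proportion of hard-task choices
#   hard_completed - proportion of hard tasks completed
#   sf36_physical  - SF-36 physical functioning score
import pandas as pd
from scipy import stats

df = pd.read_csv("eefrt_participants.csv")  # hypothetical file name

# Skewness of the hard-task completion rate; a strongly negative value
# suggests Pearson's normality assumption is shaky.
print("skewness:", stats.skew(df["hard_completed"]))

# Rank-based correlations, which don't assume normality.
rho, p_rho = stats.spearmanr(df["phtc"], df["hard_completed"])
tau, p_tau = stats.kendalltau(df["phtc"], df["hard_completed"])
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3f}")
print(f"Kendall tau  = {tau:.2f}, p = {p_tau:.3f}")

# Same idea for SF-36 physical functioning vs hard-task completion.
rho2, p_rho2 = stats.spearmanr(df["sf36_physical"], df["hard_completed"])
print(f"SF-36 vs completion: Spearman rho = {rho2:.2f}, p = {p_rho2:.3f}")
```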


Let me know if I'm missing something or misapplying the stats here. It's been a while since my research methods class. It's going to take me a bit longer to incorporate these findings into the letter @EndME @Jonathan Edwards so that's going to be a bit delayed, but I think this makes our argument much stronger.
Yes! Brilliant news if so.
 
I ran these and both tests show that ability to complete the hard trials is correlated with Proportion of Hard-Task Choices,
Would you share results of Kendall’s tau? (No need for graph.)

Did short refresher today and will try again later this week, depending on rest.

Both of these methods first rank the values (PHTC and completion rate) and compare the ranked values to calculate the correlation coefficient.

Kendall’s analysis is more resilient to ranking ties. Because there are a lot of participants with 100% completion (resulting in many tied rankings), Kendall’s might be a better fit.

I’ll look again later, as I’m able. But that’s my thinking as of now.
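To illustrate why ties matter here, a toy example with made-up numbers (not study data): several participants tied at 100% completion produce tied ranks, and SciPy's kendalltau defaults to tau-b, which includes a correction for ties.

```python
# Toy illustration of tied ranks (made-up numbers, not study data).
from scipy import stats

phtc       = [0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80]
completion = [0.50, 0.70, 1.00, 1.00, 1.00, 1.00, 1.00]  # many ties at 100%

rho, p_rho = stats.spearmanr(phtc, completion)
tau, p_tau = stats.kendalltau(phtc, completion)  # tau-b by default, corrects for ties
print(f"Spearman rho  = {rho:.2f} (p = {p_rho:.2f})")
print(f"Kendall tau-b = {tau:.2f} (p = {p_tau:.2f})")
```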
 
One new two-way interaction, the interaction of PI-ME/CFS diagnosis and trial number, was tested as well in order to determine whether rate of fatigue differed by diagnostic group.
So they looked at fatigue according to trial number. But easy trials were 7 seconds long and hard trials were 21 seconds long, and we know that pwME chose fewer hard trials/more easy trials. If fatigue is measured by trial number, then pwME would have been measured earlier (timewise) than healthies. Doing it by trial number makes sense only if the only fatiguing element is considered to be making the choice between easy and hard. If the fatiguing element is rapidly pressing buttons, or a combination of central and muscle fatigue, then fatigue should have been measured by time or number of button presses.
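As a rough illustration of what measuring by time or button presses could look like, here's a sketch that rebuilds the "fatigue so far" axis from trial-level data. The file and column names are hypothetical, and the 7-second/21-second durations are simply the task lengths quoted above.

```python
# Sketch: re-express "how far into the session" as cumulative time or cumulative
# button presses instead of trial number.
# Hypothetical trial-level columns: participant, trial, choice ("easy"/"hard"),
# presses (button presses made on that trial).
import pandas as pd

trials = pd.read_csv("eefrt_trials.csv")  # hypothetical file

# Task durations quoted above: easy 7 s, hard 21 s.
trials["duration_s"] = trials["choice"].map({"easy": 7, "hard": 21})

trials = trials.sort_values(["participant", "trial"])
trials["cum_time_s"]  = trials.groupby("participant")["duration_s"].cumsum()
trials["cum_presses"] = trials.groupby("participant")["presses"].cumsum()

# Either cum_time_s or cum_presses could replace trial number as the
# "fatigue so far" axis when modelling choice or completion.
print(trials[["participant", "trial", "cum_time_s", "cum_presses"]].head())
```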

Has someone on this thread looked into this issue already? Could you point me to it if so?

Edited to correct the lengths of easy and hard tasks to 7s and 21 seconds respectively. Which makes it worse, right?
Tagging people who have seen my erroneous task lengths and might be working on it @EndME @Kitty @JoClaire
 
So they looked at fatigue according to trial number. But easy trials were 21 seconds long and hard trials were 30 seconds long, and we know that pwME chose fewer hard trials/more easy trials. If fatigue is measured by trial number, then pwME would have been measured earlier (timewise) than healthies. Doing it by trial number makes sense only if the only fatiguing element is considered to be making the choice between easy and hard. If the fatiguing element is rapidly pressing buttons, or a combination of central and muscle fatigue, then fatigue should have been measured by time or number of button presses.

Has someone on this thread looked into this issue already? Could you point me to it if so?

That is an excellent point! I don't think anybody ran the explicit analysis of how these results would otherwise look (based on what I write below, my first guess would be that it wouldn't reveal anything).

I don't think measuring by time would be a good measure (because 2 easy tasks don't seem much more fatiguing than 1 hard task). Number of button presses seems somewhat more appropriate, but it doesn't account for the duration of time passed, which might be somewhat fatiguing itself, or for the fact that button presses with one finger are more fatiguing than with the other. I don't see how one could easily come up with something appropriate.

However, since our main argument is the lower physical ability of pwME, I don't know how one would incorporate such an analysis into our argument anyway. Someone with lower physical abilities is likely to have a lower threshold for fatigue, so comparing the two groups by saying "so-and-so many clicks is a measure of a certain fatigue" doesn't really make for a coherent story in my eyes, especially as it would artificially reduce the fatigue pwME experience simply because they have to go for easy if they can't do hard, which results in fewer clicks per time spent.

I recall that when I did two different split-half analyses (the plots are somewhere earlier in this thread), once splitting each individual player's rounds at the halfway point of the total number of rounds that player played, and once splitting the game into two even halves of 17 rounds each, the results differed drastically. In particular, the first set of plots didn't match the fatigue curve in the study, whilst the second set is very akin to measuring fatigue on a round-by-round basis (Figure 3A), which is what was done in the study. That points to something interesting happening when the behaviour is dominated by those who play more rounds, but it's hard to interpret this data because we already know that fatigue influences and drives the results, yet there's no data to prove that pwME were more fatigued once the game started and that this explains the difference in PHTC. (According to the first split-half analysis I did, in the second half of the EEfRT HVs and pwME looked very similar, with the success rate on hard tasks for HVs approaching that of pwME; however, using a number-of-clicks metric, HVs would also have been more fatigued at this time point, whilst in reality the opposite was more likely to be true.)
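For reference, here's a sketch of the two split definitions described above (hypothetical file and column names, not the original analysis code):

```python
# Sketch of the two split-half definitions described above (not the original code).
# Hypothetical trial-level columns: participant, group ("HV"/"pwME"),
# trial (1..n within participant), chose_hard (0/1).
import numpy as np
import pandas as pd

trials = pd.read_csv("eefrt_trials.csv")  # hypothetical file

# Split 1: each participant's own halfway point (half of the rounds they actually played).
own_halfway = trials.groupby("participant")["trial"].transform("max") / 2
trials["own_half"] = np.where(trials["trial"] <= own_halfway, "first", "second")

# Split 2: the common window cut into two even halves of 17 rounds (1-17 vs 18-34).
trials["fixed_half"] = np.where(trials["trial"] <= 17, "first", "second")

# Proportion of hard-task choices per half, per group, under each definition.
print(trials.groupby(["group", "own_half"])["chose_hard"].mean())
print(trials.groupby(["group", "fixed_half"])["chose_hard"].mean())
```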

In conclusion: well spotted, and I agree that the analysis they did by trial number is not an adequate measure of fatigue, but I don't know how one would do anything sensible given the above reasons. Do you have some suggestions?
 
Just highlighting an edit to my post above, so that anyone who read my error will also see the correction:
Edited to correct the lengths of easy and hard tasks to 7s and 21 seconds respectively. Which makes it worse, right?

So the corrected post reads:
So they looked at fatigue according to trial number. But easy trials were 7 seconds long and hard trials were 21 seconds long, and we know that pwME chose fewer hard trials/more easy trials. If fatigue is measured by trial number, then pwME would have been measured earlier (timewise) than healthies. Doing it by trial number makes sense only if the only fatiguing element is considered to be making the choice between easy and hard. If the fatiguing element is rapidly pressing buttons, or a combination of central and muscle fatigue, then fatigue should have been measured by time or number of button presses.
 
However, since our main argument is the lower physical ability of pwME,

I have not read published responses in Nature, and I don't have a strategy for what would be most impactful.

His argument centers around attributing differences to “effort preference” (and “pacing”) and (ahem) “demonstrating” no difference in fatigue.

Modeling hard tasks chosen over time (or number of clicks) could demonstrate a fatigue difference, depending on what we find.

Has someone on this thread looked into this issue already? Could you point me to it if so?

I haven’t seen it mentioned. I was going to plot residuals from @ME/CFS Skeptic model against these metrics. But my capacity hasn’t allowed me to do this yet.

In conclusion: well spotted, and I agree that the analysis they did by trial number is not an adequate measure of fatigue, but I don't know how one would do anything sensible given the above reasons. Do you have some suggestions?

In my experience there’s never a perfect answer. To your point, time and clicks both fail in some respects. But both are arguably better than trial number, in the areas of weakness you call out.

Trial number also leaves the highest trial numbers influenced by a small set of participants. So either of these could have a more even distribution of participants represented at tail ends.

If I’m able to plot residuals, I’ll share. I’m not able to model results at this time.
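In case it helps whoever picks this up, a bare-bones sketch of the residual-plot idea; the arrays are synthetic placeholders standing in for the fitted model's residuals and the trial-level cumulative metrics:

```python
# Bare-bones sketch of plotting model residuals against alternative fatigue metrics.
# The arrays below are synthetic placeholders; in practice they'd come from the
# fitted model's residuals and the trial-level cumulative metrics.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
cum_time_s  = np.sort(rng.uniform(0, 900, 200))   # placeholder cumulative time (s)
cum_presses = np.sort(rng.uniform(0, 3000, 200))  # placeholder cumulative presses
residuals   = rng.normal(0, 1, 200)               # placeholder model residuals

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
axes[0].scatter(cum_time_s, residuals, s=10)
axes[0].set_xlabel("cumulative task time (s)")
axes[0].set_ylabel("model residual")
axes[1].scatter(cum_presses, residuals, s=10)
axes[1].set_xlabel("cumulative button presses")
fig.suptitle("Residuals vs alternative fatigue metrics")
plt.tight_layout()
plt.show()
```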
 
If it's true one of the major variables between groups (cognitive and physical fatigue) can't actually be measured, surely it isn't a valid exercise in the first place?

True, but doesn't this pretty much apply to any ME/CFS study ever conducted? Who has ever been able to untangle and measure the differences between fatigue, exertional intolerance and PEM both cognitive and physical?

His argument centers around attributing differences to “effort preference” (and “pacing”) and (ahem) “demonstrating” no difference in fatigue.

As far as I've understood it, their argument is not that there is no difference in fatigue between the two cohorts (because they didn't measure this, and there is likely a massive difference anyway); their argument is that differences between the two groups arising purely from “in-game fatigue” are not the driving force behind the PHTC as trial rounds progress. Figure 3A summarizes this part of their analysis.

And that's pretty much the crucial point already. The thing about Figure 3A is that if you shift the HV graph to the left (by somewhere around 20 rounds), the graphs of HVs and pwME look pretty much identical; alternatively, if you extrapolate missing values and shift the pwME graph to the right (by somewhere around 20 rounds), the graphs of HVs and pwME overlap almost perfectly.

That may mean that HVs start behaving like pwME after 20 rounds because pwME have something equivalent to a 20-round fatigue deficit to begin with, or it may be due to something else.

Because the initial fatigue of the participants is not measured as part of the trial, it may be that the results are driven by initial levels of fatigue, or alternatively by the lower level of physical ability of the participants, which probably amounts to something similar in this set-up but is easier to prove, because we have measures for it which, according to the brilliant analysis of @andrewkq, show exactly that.
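Purely to show the mechanics of that eyeballed ~20-round shift (with synthetic placeholder curves, not digitised Figure 3A data), a sketch:

```python
# Sketch of the "shift one curve by ~20 rounds" comparison described above.
# The curves here are synthetic placeholders, purely to show the mechanics;
# real values would come from the per-round P(hard chosen) in each group.
import numpy as np
import matplotlib.pyplot as plt

rounds = np.arange(1, 51)
p_hard_pwme = 0.7 * np.exp(-(rounds + 20) / 60)  # placeholder pwME curve
p_hard_hv   = 0.7 * np.exp(-rounds / 60)         # placeholder HV curve

shift = 20  # rounds, per the eyeballed estimate above
plt.plot(rounds, p_hard_pwme, label="pwME")
plt.plot(rounds - shift, p_hard_hv, label=f"HV shifted left by {shift} rounds")
plt.xlabel("trial round")
plt.ylabel("P(hard task chosen)")
plt.legend()
plt.show()
```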

In my experience there’s never a perfect answer. To your point, time and clicks both fail in some respects. But both are arguably better than trial number, in the areas of weakness you call out.

I honestly have no idea. It all seems pretty much equally bad to me...I think it's a very good idea for someone to look at this, but I simply don't know what a good way to do this would be.

Trial number also leaves the highest trial numbers influenced by a small set of participants. So either of these could have a more even distribution of participants represented at tail ends.

That is certainly true and a good point, and one of the many reasons why one should probably be using a cut-off (somewhere around 34 rounds).

If any of you (@JoClaire, @Evergreen or anyone else) end up looking at this and are able to find a good metric for "in-game fatigue", it would also be highly sensible in my eyes to look at the decline in button-press rate not per trial round but per whatever this metric for "in-game fatigue" is (a rough sketch of that alternative slope follows the quote below), since the main argument of the authors surrounding pacing and the validity of their set-up is:
There was no group difference in the probability of completing easy tasks but there was a decline in button-pressing speed over time noted for the PI-ME/CFS participants (Slope = −0.008, SE = 0.002, p = 0.003; Fig. 3b). This pattern suggests the PI-ME/CFS participants were pacing to limit exertion and associated feelings of discomfort16. HVs were more likely to complete hard tasks (OR = 27.23 [6.33, 117.14], p < 0.0001) but there was no difference in the decline in button-press rate over time for either group for hard tasks (Fig. 3b).
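A rough sketch of that alternative slope: a plain per-group OLS of button-press rate against a cumulative metric, much simpler than whatever model the authors actually fitted, with hypothetical file and column names throughout.

```python
# Sketch: slope of button-press rate against a cumulative "in-game fatigue" metric
# instead of trial number, estimated separately per group. A plain OLS stand-in,
# much simpler than the study's own model; all file and column names hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical columns: group ("HV"/"pwME"), presses, duration_s,
# cum_presses (whatever in-game fatigue metric ends up being chosen).
trials = pd.read_csv("eefrt_trials.csv")  # hypothetical file

trials["press_rate"] = trials["presses"] / trials["duration_s"]

for group, sub in trials.groupby("group"):
    fit = smf.ols("press_rate ~ cum_presses", data=sub).fit()
    print(group, "slope per cumulative press =", round(fit.params["cum_presses"], 6))
```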
 
However, since our main argument is the lower physical ability of pwME,
I'm not sure that is our main argument. I think that's what Walitt appears so eager to disprove.

I think our main argument is that we are too sick, with too many persistent symptoms, to have any meaningful QoL or to engage in any substantive effort for long without having to stop because of those symptoms. That does not have to conform with lower physical ability.

Some may literally have lower physical ability, but I think that is not necessarily a phenotype characteristic.

Nausea can be a symptom. If it were one of mine, and some ass said I couldn't possibly be nauseous, I'd tell them to prove it. The most they can do is say they don't believe me.
 
I'm not sure that is our main argument. I think that's what Walitt appears so eager to disprove.

I think our main argument is that we are too sick, with too many persistent symptoms, to have any meaningful QoL or to engage in any substantive effort for long without having to stop because of those symptoms. That does not have to conform with lower physical ability.

Some may literally have lower physical ability, but that is not necessarily a phenotype characteristic.

"Our main argument" refers in its largest part to the brilliant analysis done by @andrewkq which, if it is correct, shows exactly that. The physical inability to complete hard tasks predicts the lower choice of hard tasks and this correlates with the physical inability of the participants measured by the physical SF-36 subscale.
 
Understood. And well done. I'm just suggesting this shouldn't have to be our fight.

We have many symptoms. One of my worst is balance. This can be objectively tested for. Fine.

But there are others that cannot.

Walitt is making a case based on one claim that he really is making, while ignoring all the other symptoms that are with us 24/7, many of which cannot be quantified. It's a selective BS stance that he is shoehorning into a phenotype study.
 
Btw, I am aware he is making this our fight, and am relieved and empowered that we have members far smarter than I that are willing to take him square on and highlight the holes in his specious "theory".

I just think we should be responding to him with "Hey, kid, you're in the wrong class" as well.
 
That’s awesome. Damn distribution assumptions…

Curious if you included any data from the Trials -4 through -1. I may have missed conversations about the practice trials. So apologies if I’m being redundant.

Depending on how practice was run they may have meaningful information on PHTC. But I’d definitely position them as meaningful to ability to complete tasks

(Below regards that time period but focused more on questions about PHTC. But I also noticed completion rates were very low out of the gate. Again sorry if I missed whole conversation on this already!)

(PwME “win” five of the first six trials (including practice). If this wasn’t directed, how does it show HVs having more Effort Preference? If it was directed, why so biased towards fatiguing PwME?)

(in case you need another bullet for your gun…)



edited to add:
Just saw your comment


It’s been a while, I’m rusty. But, I’ll check out the stats.

I also think it’s worth the wait. They said they don’t accept after 18 months. You have plenty of time.

I'm sorry I haven't replied to your messages @JoClaire, working on the draft has been taking all of my spoons and then some. I really appreciate all of the contributions you've been making. Let me know if there are any open questions that I missed.

I think it's really concerning that they appear to have not standardized the practice trials. It seems like that could bias participants towards different behavior in the actual task. I didn't include the practice trials in any of my analyses. Based on my sense of norms in psych research, I don't think it would be seen as legitimate to include them in the analyses we include in the letter. There tends to be a dogmatic belief that whatever happens in practice rounds on behavioral tasks isn't relevant for the actual task because participants knew it was practice. I think it's a good thing to keep on the back burner if we ever want to write up a full list of all the issues with their use of the task though.

Would you share results of Kendall’s tau? (No need for graph.)

Did short refresher today and will try again later this week, depending on rest.

Both of these methods first rank the values (PHTC and completion rate) and compare the ranked values to calculate the correlation coefficient.

Kendall’s analysis is more resilient to ranking ties. Because there are a lot of participants with 100% completion (resulting in many tied rankings), Kendall’s might be a better fit.

I’ll look again later, as I’m able. But that’s my thinking as of now.

Kendall's tau gives similar results. I also log-transformed the data and ran the Pearson correlations, and they were also significant. I'm leaning towards just reporting the Spearman and having the other tests in the back pocket in case Walitt et al. try to claim that a different test should be used. This website (based on this article) recommends using Spearman for smaller datasets with weak to moderate correlations, which would fit our situation. Luckily it shouldn't matter too much, since the results from all appropriate methods are significant.

PHTC vs Hard Task Completion Rate

Log transformed Pearson: r(29) = -.41, p = .021
Spearman: r(29) = .38, p = .033
Kendall: r(29) = .30, p = .027

Physical Functioning vs Hard Task Completion Rate

Log transformed Pearson: r(29) = -.55, p = .0013
Spearman: r(29) = .52, p = .0028
Kendall: r(29) = .38, p = .0087
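For completeness, a sketch of the log-transform route mentioned above. The exact transform used isn't stated, so the reflect-and-log below is just one common choice for negatively skewed proportions; note that reflecting flips the sign of the resulting correlation.

```python
# Sketch of a log-transform followed by Pearson. The exact transform used above
# isn't stated; reflect-and-log is one common choice for negatively skewed
# proportions and is shown here as an assumption only.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("eefrt_participants.csv")  # hypothetical file

# Reflect so the long tail sits on the right, then log; the small constant avoids log(0).
# Note: reflecting flips the sign of the correlation relative to the raw variable.
reflected_log = np.log((1.0 + 1e-6) - df["hard_completed"])

r, p = stats.pearsonr(df["phtc"], reflected_log)
print(f"Pearson on transformed completion: r = {r:.2f}, p = {p:.3f}")
```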
 
I’ve already connected with the agent processing my FOIA claim on this. If I get back an abundance of material to sort through, would anyone here be willing to help me go through it? Just asking in advance.

(For reference on FOIA filed, from @Murph)

"2. Because this research is funded by the US federal government directly it should be possible to get our hands on information. Freedom of Information Act covers NIH and if they won't give info nicely someone can pursue them for it. An interesting request might be All correspondence and minutes of meetings relating to the use of EEfRT and susbequent analysis of said data - see if Nath and Wallitt had any disputes about how far to push it.
https://www.nih.gov/institutes-nih/...public-liaison/freedom-information-act-office. Probably a job for a US citizen to do the actual filing!"

Sharing an update on my FOIA claim today from the individual working on it, @Murph:

"Even though we narrowed the request so much, there are still a massive number of search results to go through. I have started going through the results, there’s just a lot, and there will be several additional reviews once I’ve done the initial one before the files are ready for release. It’s going to take a while, I expect several months. Naturally, we will try to finish quickly, but it’s going to be a while. Please feel free to continue to reach out for updates. I’ll be able to provide more accurate timing estimates as the case continues to progress."
 
Was this already shared? Wasn't sure, so leaving it here (I left it in the NIH channel, but am also cascading it in this channel. Mods, feel free to move if you feel that is best).

'We are excited to announce the upcoming PI-ME/CFS Symposium, which will be held on May 2, 2024 at the NIH Clinical Center in FAES Classrooms 3 & 4.' (Presentation Topics and Panel Discussions embedded)

'Registration now open'

https://mregs.nih.gov/channels/F1P5-F2L3
 
Thanks, Andrew. I've been out for a few weeks myself.

Appreciate you using your energy on the letter and sharing your thoughts.

if we ever want to write up a full list of all the issues

I'm curious if there are plans for this?

PI-ME/CFS Symposium
Also wondering if we will ask our key question at the symposium or reserve it for when we share the letter.
 
Thank you to all who are using their energy and expertise to unravel this. This has been very helpful.

If you do plan on asking a question at the NIH symposium, I wonder if it would be useful to submit the question beforehand.

It can be very difficult to get a specific question answered when it's asked during the meeting - NIH can selectively pick the questions to answer, it's hard to frame it in a few words, etc. Maybe ask them beforehand to make sure they address it in the presentation, and also maybe make your request public to help ensure it's addressed.
 