Low-Dose Naltrexone restored TRPM3 ion channel function in Natural Killer cells from long COVID patients, 2025, Martini et al

Here's how I understand it. Assuming the null hypothesis is true, you're equally likely to get any p value between 0 and 1.

https://davidlindelof.com/how-are-p-values-distributed-under-the-null/


It doesn't actually matter what the sample size is, p values are always uniformly distributed under the null hypothesis. So there's a 1 in 10,000 chance of getting a p value of 0.9999 or higher if there's no real difference between the groups. Such a high p value isn't an indicator that the two groups are similar.
Agree.

It's an indicator that the means of the two groups are extraordinarily close considering the high variance within the groups, and such a situation would happen by chance only about 1 in 10,000 times. That, or there could have been an error in the analysis, like accidentally comparing one group to itself.

To be honest, I don't know (not judging your statement in any way) if that conclusion can be drawn directly from this test.
Usually the probability of the null hypothesis being true isn't talked about in frequentist statistics, since that's a Bayesian idea and things get "funky".

This covers the same topic:
https://stats.stackexchange.com/que...e-for-or-interpretation-of-very-high-p-values
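As a quick illustration of the uniform-under-the-null point above, here is a simulation sketch with made-up data (not this paper's): run many two-sample t-tests where both groups truly come from the same distribution and look at how the p-values fall.

```python
# Sketch: under the null hypothesis, p-values are roughly uniform on [0, 1].
# Simulate many two-sample t-tests where the two groups do not differ at all.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 100_000

# 30 observations per group per simulation; both groups drawn from N(0, 1).
a = rng.normal(0, 1, size=(30, n_sims))
b = rng.normal(0, 1, size=(30, n_sims))
pvals = stats.ttest_ind(a, b, axis=0).pvalue

print("fraction with p < 0.05   :", np.mean(pvals < 0.05))    # ~0.05
print("fraction with p > 0.9999 :", np.mean(pvals > 0.9999))  # ~0.0001
```

Roughly 5% of the simulated tests land below 0.05 and roughly 1 in 10,000 land above 0.9999, matching the reasoning above.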
 
I don't know much about Bayesian statistics, and I don't see any new insights on that page (though I don't totally understand all the probability terminology). But one comment has a similar view:
About usefulness of very large p-values, I've got p-values near 1 in t-tests that failed to meet the normality of means assumption. In general, a p-value=1 should be seen at least as a warning that something might be wrong.
 
For what it's worth (as non-simulated further validation of your earlier analysis), when I do differential gene expression analyses where I'm running ~10K tests, a portion of those will always come up as p>0.99 (and that is with a test that does not assume normality). Spot checking my most recent analysis, it was around 180 out of 13000 comparisons, so roughly 1%.

Which just speaks to your earlier point of equal likelihood of any p-value under the null hypothesis. But my understanding was that for [edit: any one specific] test, it will never tell you anything other than whether you can reject the null hypothesis. The logic of the test is not reciprocal in that way.

I've also gotten a 0.999 p-value when I was just doing a single comparison and it seemed unlikely that any of the assumptions were violated. I think it is sometimes just a luck of the draw thing, [Edit: though >0.9999 being reported twice in the results seems to indicate it's not just luck of the draw unless we're all witnessing a once-in-a-lifetime event. I agree it's most likely an assumptions thing]
 
But my understanding was that for a specific test, it will never tell you anything other than whether you can reject the null hypothesis. The logic of the test is not reciprocal in that way.
That's my understanding as well.

I've also gotten a 0.99 p-value when I was just doing a single comparison and it seemed unlikely that any of the assumptions were violated. I think it is sometimes just a luck of the draw thing, though I agree that assumptions should be checked anyways.
Yeah, definitely not impossible that they're the lucky 1 in 10,000. (Though technically even less of a chance than that since it's ">.9999" which could be any number between that and 1.)

Out of curiosity I searched "p>.9999" and there are plenty of papers, though I suppose with millions of papers that have been written, that's to be expected.
 
Sorry, I think I added my edit right after you quoted me. I had just realized that they reported >0.9999 twice, which makes this extremely unlikely to be a luck-of-the-draw thing.
 
I was thinking about that. I agree that makes it even less likely, but I guess there is still the possibility that the results from the two tests are extremely correlated to each other, in which case the p values should also be similar. I don't know anything about these tests though. But I'm guessing the correlation would have to be very, very high for this to work out, and it'd probably make sense to look for other explanations.
 
1/10,000 * 1/10,000 = 1/100,000,000

They ran more experiments than just two, though, so the probability that at least two p-values come out above 0.9999 (i.e. above 1 − 1/10,000) by chance is somewhat higher than that.
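To put a rough number on that, here is a sketch of the binomial calculation; the total number of comparisons (12) is a made-up figure for illustration, not taken from the paper.

```python
# Sketch: chance that at least two of k independent null comparisons
# land above p = 0.9999 (k = 12 is a hypothetical number of comparisons).
from math import comb

p = 1 / 10_000   # P(p-value > 0.9999) for one test under the null
k = 12           # made-up total number of comparisons

p_at_least_two = 1 - (1 - p) ** k - comb(k, 1) * p * (1 - p) ** (k - 1)
print(p_at_least_two)  # roughly 6.6e-07: higher than 1e-08, but still tiny
```

So even allowing for a dozen comparisons, two chance p-values above 0.9999 would still be less than a one-in-a-million event if the tests were independent.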
 
Side note @forestglip since you mentioned you were unfamiliar with Bayesian statistics:

Our thought process about the two >0.9999 p values is exactly the intuition behind Bayes’ theorem.

Given that we’re seeing two >0.9999 p-values in a research paper (data), and knowing how likely it is to get p > 0.9999 to begin with (prior probability), is it more likely that what we’re seeing is a result of 1) a random happenstance or 2) an error in the statistical analysis (posterior probability)?

Apologies if I’m explaining something you already know; I just thought it was a neat, very intuitive example, so it seemed worth pointing out.
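A toy version of that calculation, with every number invented purely for illustration (none of them are estimates about this paper): give an analysis error some small prior probability, assume such an error would often produce a near-1 p-value, and compare that to the chance route.

```python
# Toy Bayes' theorem sketch. All numbers are invented for illustration.
# H_error  = "some error in the statistical analysis"
# H_chance = "no error; the two near-1 p-values are just chance"
prior_error = 0.01                 # made-up prior probability of an error
prior_chance = 1 - prior_error

lik_error = 0.5                    # assume an error often yields p close to 1
lik_chance = (1 / 10_000) ** 2     # two independent null tests both > 0.9999

posterior_error = (lik_error * prior_error) / (
    lik_error * prior_error + lik_chance * prior_chance
)
print(posterior_error)  # ~0.999998: the error explanation dominates
```

Even with a deliberately small prior on an error, the posterior lands almost entirely on the error explanation, which is the intuition described above.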
 
Thanks, basically all I know is that Bayes' theorem incorporates prior knowledge about how likely something is to occur, from before even running the experiment. I hear the term often and it seems interesting. So many things I want to learn about and too little energy and time, but it's in the queue!
 
Now published:

Low-Dose naltrexone restored TRPM3 ion channel function in natural killer cells from long COVID patients

Etianne Martini Sasso, Natalie Eaton-Fitch, Peter Smith, Katsuhiko Muraki, Sonya Marshall-Gradisnik

Introduction
Long COVID is a multisystemic condition that includes neurocognitive, immunological, gastrointestinal, and cardiovascular manifestations, independent of the severity or duration of the acute SARS-CoV-2 infection. Dysfunctional Transient Receptor Potential Melastatin 3 (TRPM3) ion channels are associated with the pathophysiology of long COVID due to reduced calcium (Ca2+) influx, negatively impacting cellular processes in diverse systems. Accumulating evidence suggests the potential therapeutic benefits of low-dose naltrexone (LDN) for people suffering from long COVID. Our study aimed to investigate the efficacy of LDN in restoring TRPM3 ion channel function in natural killer (NK) cells from long COVID patients.

Methods
NK cells were isolated from nine individuals with long COVID, nine healthy controls, and nine individuals with long COVID who were administered LDN (3–4.5 mg/day). Electrophysiological experiments were conducted to assess TRPM3 ion channel functions modulated by pregnenolone sulfate (PregS) and ononetin.

Results
The findings from this current research are the first to demonstrate that long COVID patients treated with LDN have restored TRPM3 ion channel function and validate previous reports of TRPM3 ion channel dysfunction in NK cells from individuals with long COVID not on treatment. There was no significant difference in TRPM3 currents between long COVID patients treated with LDN and healthy controls (HC), in either PregS-induced current amplitude (p > 0.9999) or resistance to ononetin (p > 0.9999).

Discussion
Overall, our findings support LDN as a potentially beneficial treatment for long COVID patients by restoring TRPM3 ion channel function and reestablishing adequate Ca2+ influx necessary for homeostatic cellular processes.

Link | PDF (Front. Mol. Biosci.) [Open Access]
 
So I think the p>.9999 (which actually shows up 3 times in the full text) is a result of multiple test adjustment.

Paper said:
Statistical comparisons between groups for noncategorical variables (agonist and antagonist amplitudes) were conducted using the independent nonparametric Kruskal–Wallis test (Dunn’s multiple comparisons). Categorical variables (sensitivity to ononetin) were analyzed using Fisher’s exact test (Bonferroni method).

Bonferroni adjustment can be done by dividing the significance threshold by the number of tests and comparing each p value to that smaller threshold, but as described here, you can also keep the p=.05 threshold and instead multiply each calculated p value by the number of tests. If the starting p value multiplied by the number of tests is greater than 1 (e.g. p=.45 and three tests were done), the software would return p=1.0000, which may then be reported as p>.9999.

https://www.ibm.com/support/pages/calculation-bonferroni-adjusted-p-values
Statistical textbooks often present Bonferroni adjustment (or correction) in the following terms. First, divide the desired alpha-level by the number of comparisons. Second, use the number so calculated as the p-value for determining significance. So, for example, with alpha set at .05, and three comparisons, the LSD p-value required for significance would be .05/3 = .0167.

SPSS and some other major packages employ a mathematically equivalent adjustment. Here's how it works. Take the observed (uncorrected) p-value and multiply it by the number of comparisons made. What does this mean in the context of the previous example, in which alpha was set at .05 and there were three pairwise comparisons? It's very simple. Suppose the LSD p-value for a pairwise comparison is .016. This is an unadjusted p-value. To obtain the corrected p-value, we simply multiply the uncorrected p-value of .016 by 3, which equals .048. Since this value is less than .05, we would conclude that the difference was significant.

Finally, it's important to understand what happens when the product of the LSD p-value and the number of comparisons exceeds 1. In such cases, the Bonferroni-corrected p-value reported by SPSS will be 1.000. The reason for this is that probabilities cannot exceed 1. With respect to the previous example, this means that if an LSD p-value for one of the contrasts were .500, the Bonferroni-adjusted p-value reported would be 1.000 and not 1.500, which is the product of .5 multiplied by 3.

And a page describing the other multiple comparison method they used, Dunn's, which uses the same adjustment:
Multiply the uncorrected P value computed in step 2 by K. If this product is less than 1.0, it is the multiplicity adjusted P value. If the product is greater than 1.0 the multiplicity adjusted P value is reported as > 0.9999
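A minimal sketch of that capping behaviour (an illustration only, not the authors' actual code): multiply each raw p value by the number of comparisons and clip at 1, so a raw p of .45 with three comparisons becomes 1.0, which is then displayed using the "> 0.9999" convention quoted above.

```python
# Sketch of Bonferroni-style multiplicity adjustment with capping at 1
# (illustration only, not the authors' code or any package's API).
def bonferroni_adjust(raw_pvalues):
    """Multiply each raw p value by the number of comparisons, cap at 1."""
    k = len(raw_pvalues)
    return [min(1.0, p * k) for p in raw_pvalues]

raw = [0.016, 0.20, 0.45]       # three hypothetical comparisons
print(bonferroni_adjust(raw))   # approximately [0.048, 0.60, 1.0]
```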
 
I think they may have used non-independent samples, though, which would artificially decrease the p-values. There were 9 people per group, but they used multiple cells per person, and cells from the same person are expected to be correlated with each other. Technically, you could get a p value as low as you want by using a very high number of cells from each person. [Edit: More accurately: greatly increase the chances of a low p value.]
In the electrophysiological experiments, we included nine participants in each group and analyzed recordings from 61, 65, and 63 independent cells for PregS effects from long COVID, HC, and long COVID receiving LDN groups, respectively. In addition, to assess ononetin effects in the presence of PregS, 52 independent recordings from NK cells in the long COVID group, 53 in NK cells from HC, and 53 recordings from NK cells in the long COVID group receiving LDN.

I don't see any indication in the paper's description of methods that they accounted for correlated samples.

See explanation of the issue of pseudoreplication here: Pseudoreplication in physiology: More means less (Eisner, 2021, Journal of General Physiology)

Edit: This is regarding the p values they reported that are very low:
As reported in earlier studies, we confirmed a reduction in ononetin amplitude (p = 0.0021) and the number of cells sensitive to ononetin (p < 0.0001) when compared to the HC and long COVID group. In contrast, NK cells from the long COVID group receiving LDN had a significant elevation in amplitude (p = 0.0005) and sensitivity (p < 0.0001) to ononetin compared with the long COVID group.
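To make the pseudoreplication concern concrete, here is a rough simulation sketch. The variance numbers and cell counts are invented, a t-test stands in for the paper's Kruskal–Wallis test, and it is only meant to show the general effect: with no true group difference, testing cell-by-cell gives far more false positives than testing per-patient means.

```python
# Sketch: treating multiple cells per patient as independent observations
# inflates false positives. All numbers here are invented for illustration,
# and a t-test stands in for the paper's Kruskal-Wallis test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_patients, cells_per_patient = 2000, 9, 7
sd_between, sd_within = 1.0, 0.5   # patient-to-patient vs cell-to-cell noise

false_pos_cells = false_pos_means = 0
for _ in range(n_sims):
    groups = []
    for _ in range(2):             # two groups, no true difference between them
        patient_means = rng.normal(0, sd_between, n_patients)
        cells = rng.normal(patient_means[:, None], sd_within,
                           (n_patients, cells_per_patient))
        groups.append(cells)
    a, b = groups
    false_pos_cells += stats.ttest_ind(a.ravel(), b.ravel()).pvalue < 0.05
    false_pos_means += stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue < 0.05

print("false positive rate, cell-level test   :", false_pos_cells / n_sims)  # well above 0.05
print("false positive rate, patient-level test:", false_pos_means / n_sims)  # ~0.05
```

The cell-level comparison treats ~60 correlated recordings as if they were that many independent patients, which is the issue the Eisner paper describes.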
 
While we are talking about statistical intuitions, I find adjustment for multiple comparison to feel very weird.

You take a bunch of p-values and just multiply them by a big number. It's effective at making lots of possibly significant results go away. A lot of bathwater gets thrown out, and an unknown number of babies with it.

If you called your experiments different experiments and published them in separate papers, nobody would demand adjustment for multiple comparison. If they are done in one batch, it's expected.

This notable plot from Hanson is generated by multiple comparison adjustment (edit: not Bonferroni, as I originally wrote, but another kind of correction). The distribution of the fold change of a lot of the metabolites on the x-axis is similar, but the q-value on the y-axis of the right-side plot is made zero by the adjustment. (I was so flabbergasted by these two plots that I taught myself multiple comparison correction, downloaded the data, and reran the analysis; the plots are accurate. But I do find myself wondering what meaning is left after adjustment.)

[Attached image: the two plots from Hanson discussed above]
 
I'm not sure if there's an intuitive meaning for the actual value of p values after Bonferroni adjustment.

What Bonferroni does is maintain the same overall rate of false positives. When you do a single test, there's a 5% chance of getting a p-value below 0.05 even when the null is actually true. So in 100 studies that are each testing something [edit: that is not a real effect], you'd expect around 5 studies to be false positives.

But if you do many tests at once in a study, you're increasing the number of false positives in that one study. For example, if you test 100 things [edit: that in reality have no effect] at once in a study, there's no longer only a 5% chance of reporting a false positive. Now you're almost guaranteed to report [edit: at least one] positive result. Bonferroni adjusts the p value threshold or the p value itself so that there is still only a 5% chance of reporting one or more false positives.

It does make it harder for real positive results to cross the significance threshold, but it's a tradeoff to not report lots of false positive results as well.
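A quick check of those numbers (a generic sketch with 100 tests, not anything from this paper): with 100 truly null tests at the usual 0.05 threshold, the chance of at least one false positive is about 99.4%, and dividing the threshold by 100 brings that back to roughly 5%.

```python
# Sketch: family-wise error rate for 100 independent, truly null tests.
m, alpha = 100, 0.05

p_any_false_pos_uncorrected = 1 - (1 - alpha) ** m
p_any_false_pos_bonferroni = 1 - (1 - alpha / m) ** m

print(round(p_any_false_pos_uncorrected, 3))  # ~0.994
print(round(p_any_false_pos_bonferroni, 3))   # ~0.049
```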
 
This notable plot from Hanson is generated by Bonferroni adjustment. The distribution of the fold change of a lot of the metabolites on the x-axis is similar, but the q-value on the y-axis of the right-side plot is made zero by Bonferroni. ( I was so flabbergasted by these two plots that I taught myself multiple comparison and downloaded the data and reran the analysis, the plots are accurate. But I do find myself wondering what meaning is left after adjustment).
I remember also being confused why on earth they would use Bonferroni for an -omics analysis. You almost always want to use Benjamini-Hochberg (FDR correction), because if you’re doing Bonferroni for so many analytes you’re basically throwing the baby out with the bath water.

Like @forestglip gets at, you only want to use Bonferroni when it’s really important to not report false positives. If you can tolerate a small proportion of false positives in the interest of not losing your true positives, which is the case in nearly all big data analyses, FDR is the way to go.
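For what it's worth, here is a small sketch contrasting the two on toy p-values, with a hand-rolled Benjamini-Hochberg step-up rather than any particular package's function: Bonferroni multiplies every p value by the total number of tests, while BH only scales by rank, so moderate signals are more likely to survive.

```python
# Sketch: Bonferroni vs Benjamini-Hochberg adjusted p-values on toy data.
# Hand-rolled implementations for illustration, not any package's API.
import numpy as np

def bonferroni(p):
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p):
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)                         # smallest p first
    adj = p[order] * m / np.arange(1, m + 1)      # scale by m / rank
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

pvals = [0.001, 0.008, 0.02, 0.04, 0.30]   # toy p-values
print(bonferroni(pvals))                   # ~[0.005, 0.04, 0.10, 0.20, 1.00]
print(benjamini_hochberg(pvals))           # ~[0.005, 0.02, 0.033, 0.05, 0.30]
```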
 
Um, you know what, I think maybe it was Benjamini-Hochberg. Yep, they mention that. I misremembered.
 
If you called your experiments different experiments and published them in separate papers, nobody would demand adjustment for multiple comparison. If they are done in one batch, it's expected.

For this part, I've thought about this same thing many times, and I haven't developed an intuition for why it's done the way it is, and I question it for the same reason you do. And apparently so have others:

To adjust, or not to adjust, for multiple comparisons (2025, Journal of Clinical Epidemiology)
The Bonferroni correction in particular, and controlling the FWER more generally, are beloved by that archetypal figure many of us can conjure from our publishing experience – “Reviewer 2” – whom we can readily imagine urging us to conduct a “proper analysis” adjusting for the multiple comparisons in our manuscript [3]. Though such a request comes from a good place, it is worth understanding why and when one might push back on it. Rothman was dismissive [4]. Many others have pointed out the logical difficulties that become apparent when we start asking where this wave of adjustment – this protection against making errors – is to end.

Should we apply a correction to all the results in all of the papers that contributed to the same programme of research? To all of the results in all of the papers that analysed data from the same routine database that we used? To all of the results in all of the papers in that issue of the journal where our paper is published? To all of the results in all of the papers in every issue of that journal ever published? To all of the results in all of the papers that we, personally, have published in our lifetime? None of these things seem to make any sense [1,3]. All of this militates against authors making adjustments to their P values for multiple testing in the more restricted domain of one paragraph of their results section.
 
fig 2a-c would benefit from showing the individual points, particularly 2a
It was too much work for me to understand exactly what they're doing, but I got the impression figs 2D-F are the same data as 2A and 2B, just shown as dots. So LC in fig 2A would be the first column of dots in fig 2D. HC in 2A would be the first column in fig 2E. I might be wrong about this.
 