800 scientists say it’s time to abandon “statistical significance”

Alvin

For too long, many scientists’ careers have been built around the pursuit of a single statistic: p<.05.

In many scientific disciplines, that’s the threshold beyond which study results can be declared “statistically significant,” which is often interpreted to mean that it’s unlikely the results were a fluke, a result of random chance.

This isn’t what it actually means in practice, though. “Statistical significance” is too often misunderstood — and misused. That’s why a trio of scientists writing in Nature this week are calling “for the entire concept of statistical significance to be abandoned.”

https://www.vox.com/latest-news/2019/3/22/18275913/statistical-significance-p-values-explained
I have not read the article, but I do agree based on previous experience.
 
I think the important point this piece is trying to make is that institutions need to reward properly-executed studies regardless of the result - rather than incentivizing researchers to manufacture 'splashy' or otherwise 'desirable' findings.

We've talked here about the need for every experiment to be published, regardless of result. 'Disappointing' negative studies can't just be buried - they must be part of the literature to avoid publication bias and perhaps show what other biases are in play through comparison of methods between studies.

I don't think p-values and 'statistical significance' are that hard to understand and use. If scientists today don't understand these concepts or can't use them responsibly, I don't think that's a problem with the concepts themselves. If you take these concepts away, people will find some other benchmark to abuse.
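
To make the basic concept concrete, here is a minimal simulation sketch (my own toy example, assuming two groups of 50 drawn from the same distribution): when there is no real effect, p-values are uniformly distributed, so roughly 5% of such studies will still clear the 0.05 threshold purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sims, n = 20_000, 50
false_positives = 0
for _ in range(sims):
    a = rng.normal(0.0, 1.0, n)   # both groups drawn from the same distribution,
    b = rng.normal(0.0, 1.0, n)   # i.e. the null hypothesis is true
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(false_positives / sims)     # comes out close to 0.05
```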
 
But have they proposed an alternative?
The Vox article lists alternatives such as concentrating on effect sizes or confidence intervals, or simply lowering the significance threshold for p-values from 0.05 to 0.005, as suggested by a group of scientists in 2016. I think I'm in favor of the latter. At least in the field of ME/CFS, scientists seem to focus too much on results that are just below the 0.05 threshold but are probably not relevant.
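
For anyone unfamiliar with what the first alternative looks like in practice, here is a rough sketch with made-up data (not from any real study): report the size of the difference and a confidence interval around it, rather than just whether p cleared a threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, 50)   # made-up data for illustration only
treated = rng.normal(0.5, 1.0, 50)

# Effect size: Cohen's d = difference in means divided by the pooled standard deviation
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

# 95% confidence interval for the difference in means (equal-variance t interval)
diff = treated.mean() - control.mean()
se = pooled_sd * np.sqrt(1 / 50 + 1 / 50)
t_crit = stats.t.ppf(0.975, df=98)
print(f"d = {cohens_d:.2f}, 95% CI: [{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}]")
```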

The tool by Kristoffer Magnusson is interesting. If you assume a difference of half a standard deviation and 50 people in each of the experimental and control groups, then a p-value between 0.03 and 0.05 would only appear in less than 8% of cases. But it seems to happen in a lot of studies, and that can't be right.
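
That figure is easy to check with a quick simulation (a sketch under the same assumptions: a true effect of half a standard deviation, 50 per group, and a standard two-sample t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group, effect, sims = 50, 0.5, 50_000
in_band = 0
for _ in range(sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(effect, 1.0, n_per_group)   # true effect of 0.5 SD
    if 0.03 <= stats.ttest_ind(treated, control).pvalue < 0.05:
        in_band += 1
print(in_band / sims)   # roughly 0.07, i.e. under 8% of simulated studies
```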
 
If you assume a difference of half a standard deviation and 50 people in each of the experimental and control groups, then a p-value between 0.03 and 0.05 would only appear in less than 8% of cases. But it seems to happen in a lot of studies, and that can't be right.
Is this because people "reverse engineer" their findings via p-hacking? Really need a method that is resistant to hacking, or maybe it's more about the researchers.
 
Is this because people "reverse engineer" their findings via p-hacking? Really need a method that is resistant to hacking, or maybe it's more about the researchers.
I would think it also has to do with publication bias plus other ways researchers might (intentionally or unintentionally) bias the study to meet the target threshold.

There are protocols to prevent p-hacking by correcting the significance threshold for multiple comparisons. (The simplest - and most conservative - one, the Bonferroni correction, is to simply divide the threshold by the number of comparisons being made - i.e. if 20 comparisons are being made and .05 is the starting threshold, the new threshold is .0025 for all comparisons.) If researchers fail to use one of these corrective methods, I would think that makes them pretty incompetent.
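
A minimal sketch of that correction, with made-up p-values:

```python
# Bonferroni correction sketch with made-up p-values: divide the overall alpha by
# the number of comparisons and apply the stricter threshold to every test.
alpha, n_comparisons = 0.05, 20
corrected_threshold = alpha / n_comparisons      # 0.05 / 20 = 0.0025
p_values = [0.001, 0.004, 0.03, 0.20]            # hypothetical results
survives = [p for p in p_values if p < corrected_threshold]
print(corrected_threshold, survives)             # 0.0025 [0.001]
```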
 
This is a pretty good article; it makes a good fist of presenting the problem simply and accurately.

And I totally agree that the problem is with the culture that rewards p values less than .05.

People so often forget scientists are humans, and have all the biases and unsavoury personal motivations that regular people have. So yes, any system that rewards anything other than research quality will be open to abuse.

The problem is deciding what system to replace this with. One reason the p-value system has been embraced so warmly is that it provides a set of objective standards for deciding whether or not you can draw any positive conclusions from your research. The idea of Bayes factors - which assess how likely your result would be under several different hypotheses - is good in principle. But then it's left up to the researcher to evaluate the results correctly. You can see right away what will happen in research areas where there are lots of unscrupulous people: with no objective standards to adhere to, a probability anywhere above 0.5 will be talked up as a positive outcome.
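
For anyone who hasn't met them, here is a toy illustration of a Bayes factor (my own made-up example, using two point hypotheses so the Bayes factor reduces to a simple likelihood ratio; real analyses put a prior over the alternative's parameters and average over it):

```python
from scipy import stats

# Made-up data: 62 heads in 100 coin flips.
heads, flips = 62, 100
likelihood_h1 = stats.binom.pmf(heads, flips, 0.6)   # H1: biased coin, p = 0.6
likelihood_h0 = stats.binom.pmf(heads, flips, 0.5)   # H0: fair coin,  p = 0.5
bf_10 = likelihood_h1 / likelihood_h0                # how many times better H1 predicts the data
print(f"BF10 = {bf_10:.1f}")
```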

(Confidence intervals don't solve the problem because they are generally used and interpreted in a similar way to p-values. That is, if your confidence interval doesn't cross zero, your result is "significant"; otherwise it's non-significant. They're useful for other reasons, though - for example, future researchers can use them in their meta-analyses.)
 
There are protocols to prevent p-hacking by correcting the significance threshold for multiple comparisons. (The simplest - and most conservative - one, the Bonferroni correction, is to simply divide the threshold by the number of comparisons being made - i.e. if 20 comparisons are being made and .05 is the starting threshold, the new threshold is .0025 for all comparisons.) If researchers fail to use one of these corrective methods, I would think that makes them pretty incompetent.
The problem is, sometimes it's not possible to count the number of comparisons that were performed, because some are simply never reported. That's the real problem with p-hacking. They test lots of hypotheses and then simply don't mention the tests that didn't yield a significant result.

Sometimes, p-hacking works across a whole body of research. The researcher does a series of consecutive studies that address much the same question, and just buries the results of those that "didn't work".
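
A quick toy simulation of why unreported comparisons matter (my own sketch, assuming 20 independent outcome measures with no real effect in any of them): if you only report the best-looking result, a "significant" finding appears in well over half of the simulated studies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sims, n_tests, n = 5_000, 20, 50
at_least_one_hit = 0
for _ in range(sims):
    # 20 outcome measures, none of which has any real treatment effect
    p_min = min(
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
        for _ in range(n_tests)
    )
    if p_min < 0.05:                 # report only the "best" result
        at_least_one_hit += 1
print(at_least_one_hit / sims)       # around 0.64, i.e. 1 - 0.95**20
```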

I suspect the answer might be in dividing all research into "exploratory" and "confirmatory". So any research that is not pre-registered is labelled as "exploratory", and the researchers are not permitted to draw firm conclusions from it. To do that, they must design a replication study - and publish the protocol they will use even before they start. And no cheating, like the PACE researchers did. :whistle:

(Edited for typos)
 