When the statistical tail wags the scientific dog

Should we “redefine” statistical significance?

Richard D. Morey
Nov 26, 2017

[Part two of this series is here]

Recently, a zillion-author manuscript came down the pipe suggesting a change to the common practice of using p<.05 as a criterion for statistical significance. The authors include theoretical and applied statisticians as well as scientists, and the paper has clearly made an impression, both in the science press and on the many authors of the various replies.

The authors’ primary argument is Bayesian (and indeed, many of the authors are Bayesian). The core of the “Redefine Statistical Significance” paper (henceforth referred to as “RSS”) is a claim, common in Bayesian circles, that significance tests overstate the evidence against the null hypothesis. In order to calibrate p values to Bayesian evidence, the authors suggest tightening the evidential threshold from α=.05 to α=.005. In a series of blog posts, I will examine their arguments and show why they are flawed and even potentially counterproductive.
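To give a concrete flavor of the kind of calibration the authors have in mind, here is a minimal sketch (mine, not the RSS authors' own analysis) of one widely cited upper bound on the Bayesian evidence a p value can carry, due to Sellke, Bayarri, and Berger (2001): for p < 1/e, the Bayes factor against the null can be no larger than 1/(−e · p · ln p). The function name bf_bound is just for illustration.

```python
import math

def bf_bound(p):
    """Upper bound on the Bayes factor against the null implied by a
    p value (Sellke, Bayarri, & Berger, 2001); valid only for p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.01, 0.005):
    print(f"p = {p:.3f}  ->  Bayes factor against the null is at most {bf_bound(p):.1f}")
# p = 0.050  ->  Bayes factor against the null is at most 2.5
# p = 0.010  ->  Bayes factor against the null is at most 8.0
# p = 0.005  ->  Bayes factor against the null is at most 13.9
```

Under this bound, p=.05 corresponds to a Bayes factor of at most about 2.5 against the null, which most Bayesian conventions would call weak evidence, while p=.005 corresponds to at most about 14, roughly the “strong evidence” range the RSS authors appeal to. Whether this is the right way to calibrate p values is the question I take up in the next post.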

First, I should say I don’t have much investment in α=.05. It was an arbitrary choice when it was made for convenience decades ago, and it remains arbitrary. I suspect that scientific life would go on just fine if people cast a more skeptical eye toward p values between .005 and .05. This is, indeed, part of what the authors are trying to achieve, and they certainly wouldn’t be the first to suggest that people should not take such p values so seriously.

However, the arguments we use to support a position matter. Even if using α=.005 would, by itself, have no negative effects — and I have no evidence that it will — the authors’ argument should be evaluated on its merits. In a series of three blog posts, I’ll address three topics:

  • A central premise: the concept of a statistical criterion for “discovery” (this post)
  • The core arguments: Bayesian evidence, false discovery “rates”, and correlations (post 2)
  • Responses to RSS: We can’t justify our α, and we shouldn’t abandon significance testing (post 3)

Statistics and “discovery” of effects

The RSS authors’ suggestion is this:

“For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005…Results that would currently be called significant but do not meet the new threshold should instead be called suggestive…We restrict our recommendation to claims of discovery of new effects.”

I’m going to restrict discussion to experimental results, because the RSS authors restrict their discussion to “effects”.

An α criterion for “discovery” of an effect is the statistical tail wagging the scientific dog

What I find surprising — and indicative of the problems we face across the sciences — is that no one (not yet, at least) has pushed back against the central premise that scientific discovery of a new effect can be established by a statistical criterion. Any discovery of a new effect should rest on multiple experiments under varied, but controlled, conditions. These experiments must establish, through manipulation, the conditions under which the effect is apparent and those under which it is not. The experimenters should establish that they can parametrically manipulate an effect. The resulting scientific knowledge should lead to predictions that other researchers can use when they perform similar experiments. This is how scientific knowledge grows. Discovery of a new effect is a matter for a research programme, not a single experiment. There is no statistical criterion that can establish a “discovery”.

If a researcher claims an effect based on a single experiment, how does the researcher know what the effect is? The discovery of an effect must, in some sense, entail a generalization beyond the experiment. Some generalizations can be statistical, under assumptions of random sampling, for instance. But the crucial generalizations — those that represent real scientific knowledge — will be based on a theoretical understanding of the phenomenon in question that can only be the result of a sustained research programme. The totality of the evidence will not be assessable by a purely statistical criterion, p value or otherwise.

The RSS authors, unfortunately, perpetuate the mistaken notion that an “effect” can be established by a statistical criterion. This is a large part of what is causing the replication crisis: a belief that a simple criterion, applied to a small number of experiments, can determine whether an effect is “real”. Scientists, under pressure to publish more and to inflate their findings, fail to establish and understand effects and their limits. But reproducibility is not a simple matter of larger sample sizes or smaller p values: it is about scientists putting in the necessary work before they make claims, and being humble about the limitations of their methods.

So I agree with the authors that the conditions for “discovering” an effect must be redefined. But redefining statistical significance to α=.005 is at best irrelevant to this, and at worst a dangerous perpetuation of the very cause of the replication crisis. Slower, more systematic science (or, as I prefer to call it, just “science”) is the only way out of the crisis. To define “discovery” based on p<.05 or p<.005 is to let the statistical tail wag the scientific dog.

If one understands all this, the argument of the RSS authors becomes a minor statistical point. Given the ubiquity of the use of p<.05 to make decisions, it is a minor statistical point that nonetheless has the potential for major impact. In the next post, I will discuss the statistical argument itself.

[Read on: part two]
