What we lose if we abandon p values

I have spent a fair amount of time thinking and writing on both sides of the frequentist/Bayesian divide. In recent years, thanks to reading (and, importantly, re-reading) the perspectives of Mayo, Cox, Spanos, and Wasserman, I’ve become convinced of the importance of a frequentist perspective on statistical evidence, but I’ve not articulated precisely why. This post represents my first written draft doing so.

This topic is controversial, but for no good reason. Why? I’m convinced that most experienced scientists and statisticians have internalized statistical insights that frequentist statistics attempts to formalize: how you can be fooled by randomness; how what we see…

Why the push for replacing “power” with “precision” is misguided

[Code for all the plots in this post is available in this gist.]

One of the common claims of anti-significance-testing reformers is that power analysis is flawed, and that we should be planning for study “precision” instead. I think this is wrong for several reasons that I will outline here. In summary:

  • “Precision” is not itself a primitive theoretical concept. It is an intuition that is manifest through other more basic concepts, and it is those more basic concepts that we must understand.
  • Precision can be thought of as the ability to avoid confusion between nearby regions of the parameter…

“I am very sorry, Pyrophilus…”

Robert Boyle’s “Unsuccessfulness of Experiments,” two essays about what we would call failures to replicate

The seventeenth century was an exciting time in Europe. A new empirical philosophy, what we now call science, was emerging. Oldenburg founded the Philosophical Transactions of the Royal Society, now the oldest scientific journal still being published, to communicate people’s observations and ideas. People like Newton, Halley, Hooke, and Boyle, names that are now immortal, were performing experiments and developing theories that would change the world.

We learn about their ideas in our science classes, filtered through centuries of additional learning, as fully-formed. We know, however, that science does not work like this: every useful regularity is underpinned by…

How scientists successfully reason about sampling distributions

Rink Hoekstra (@rinkhoekstra) and I are pleased to release a new pre-print, “Use of significance test logic by scientists in a novel reasoning task”. We test the ability of a large, diverse group of researchers to reason from sampling distributions, a core aspect of significance logic. Although it is common to claim that significance testing logic is “backward” and hence difficult to understand, we use a novel experimental task to show that scientists can use significance testing logic to make correct inferences with high probability. …

Ensuring our critiques of power are relevant, clear, and based on good statistics

On Twitter recently, I criticised the common practice (some might say “shorthand”) of claiming that an experiment or study is “underpowered”. This ended up being a popular tweet, which I wasn’t expecting (it is a rather arcane statistical point), but given its centrality to some of the recent issues in experimental design, I thought it was worth revisiting in a blog post.

First, we have to talk about what power is, because — ironically — this is very much misunderstood by everyone, including the statistical reform movement. …

A Christmas-themed statistical game and experiment

(The experiment is over, but you can see a version of it on GitHub)

The Jinglies and the Sparklies face off in a toy-making competition. Who is faster?

It’s nearly Christmas, but Santa has a problem: his elves have stopped working! The elves have divided into two gangs, the Jinglies and the Sparklies, each believing that their gang can make toys faster. Santa has promised them that the faster team can take next Christmas season as a holiday; but now, the elves have demanded an authoritative answer to who is faster. Santa has called on your expertise to help settle the question. …

Notes on Tour 1 of “Statistical Inference as Severe Testing”

Anyone who’s had any contact with statistical methods recently knows that there’s a battle being fought over the future of statistical methods. Actually, more than one; the big ones are significance testing vs confidence intervals and Bayes vs frequentism. The so-called “replication crisis” in the various sciences has provided an opportunity for people to advocate various solutions to the issues that plague statistical practice. These issues are real, and the stakes are high: bad choices could mean another 40 years wandering in the desert of bad methodology, as opposed to cleaning up some of the mess in various fields.

I was happy…

Scientists defend two researchers trying to clean up the scientific literature

[Update: the paper in question was officially retracted while this letter was being circulated, but OSU’s complaint is still outstanding. The good news is that both Markey’s and Elson’s universities have declined to act on OSU’s complaint. To date, however, they have not made any public statement in support of Markey and Elson. Vice’s Motherboard has a story about the situation: Two Researchers Challenged a Scientific Study About Violent Video Games — and Took a Hit for Being Right]


If you follow Retraction Watch’s coverage of the psychological literature, you may recall a recent article about coding errors — and…

Part two of a three-part series

[Part one of this series is here]

At the heart of the RSS paper are a number of statistical arguments. There are three, and I will address each of them in this (rather long) post. They are: 1) two-sided p values around .05 are evidentially weak, 2) using a lower α would decrease the so-called “false discovery rate,” and 3) empirical evidence from large-scale replications shows that studies with p<.005 are more likely to replicate.
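
To make the second of these arguments concrete, here is a minimal sketch of the usual textbook calculation behind a “false discovery rate”: the share of significant results that come from true nulls. The function name and the inputs (power of 0.8, 90% of tested nulls being true) are hypothetical values chosen for illustration, not figures from the RSS paper, and the sketch assumes power stays the same when α changes, which will not hold at a fixed sample size.

```python
def false_discovery_rate(alpha, power, prior_null):
    """Share of significant results that arise from true null hypotheses.

    alpha      -- significance criterion (Type I error rate)
    power      -- probability of significance when the null is false
    prior_null -- assumed proportion of tested nulls that are true
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)


# Hypothetical inputs, for illustration only.
POWER = 0.8        # assumed to be unaffected by the choice of alpha
PRIOR_NULL = 0.9   # assumed proportion of true nulls

for alpha in (0.05, 0.005):
    fdr = false_discovery_rate(alpha, POWER, PRIOR_NULL)
    print(f"alpha = {alpha}: false discovery rate = {fdr:.3f}")
```

Under these assumptions the rate falls from 0.360 at α=.05 to about 0.053 at α=.005. The result depends heavily on the assumed proportion of true nulls and on the assumed power, so it illustrates the form of the argument rather than settling it.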

As I said in the previous post, I don’t have a problem with α=.005 per se. I believe that the arguments for it are…

Should we “redefine” statistical significance?

[Part two of this series is here]

Recently, a zillion-author manuscript came down the pipe suggesting a change to the common practice of using p<.05 as a criterion for statistical significance. The authors include theoretical and applied statisticians and scientists, and the paper has clearly made an impression, both in the science press and on the many authors of the various replies.

The authors’ primary argument is Bayesian (and indeed, many of the authors are Bayesian). The core of the “Redefine Statistical Significance” paper (henceforth referred to as “RSS”) is a claim, common in Bayesian circles, that significance tests overstate…

Richard D. Morey

Statistical modeling and Bayesian inference, cognitive psychology, and sundry other things
