New paper: “Why most of psychology is statistically unfalsifiable”
Daniël Lakens and I have been working on a new paper [pdf] that serves as a comment on the Reproducibility Project: Psychology and Patil, Peng, and Leek’s (2016) use of prediction intervals to analyze the results. Our point of view on the RP:P echoes Etz and Vandekerckhove (2016): neither the original studies nor the replications were, on the whole, particularly informative.
We differ from Etz and Vandekerckhove in that we use a straightforward classical statistical analysis of differences between the studies, and we reveal that for most of the study pairs, even very large differences between the two results 1) cannot be detected by the design, and 2) cannot be rejected in light of the data. The reason is, essentially, that the resolution of the findings is simply lacking. Even high-powered replications, by themselves, will not help to assess the robustness of the psychological literature, because the original studies are so imprecise that one cannot call them into question with a new set of results.
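To make the point concrete, here is a minimal sketch (not the paper's exact analysis) of the kind of classical difference test involved: a two-sided z-test on the gap between two independent standardized mean differences (Cohen's d), using the usual large-sample variance approximation for d with equal per-group sample sizes. With n = 20 per group, even a gap of 0.6 standard deviations between an original study and a replication cannot be rejected.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def difference_test(d1, n1, d2, n2):
    """Two-sided z-test for the difference between two independent
    Cohen's d estimates, each from a two-group study with n per cell.
    Uses the large-sample approximation var(d) ~= 2/n + d^2 / (4n).
    Returns (z, p)."""
    var1 = 2.0 / n1 + d1**2 / (4.0 * n1)
    var2 = 2.0 / n2 + d2**2 / (4.0 * n2)
    z = (d1 - d2) / sqrt(var1 + var2)
    p = 2.0 * (1.0 - norm_cdf(abs(z)))
    return z, p

# Original reports d = 0.6; replication reports d = 0.0; 20 per group each.
z, p = difference_test(0.6, 20, 0.0, 20)
print(z, p)  # z ~ 1.33, p ~ 0.18: a large difference, not rejectable
```

The numbers are illustrative, but the qualitative lesson matches the post: at typical sample sizes the standard error of the difference is so large that very different results are statistically indistinguishable.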
In light of this fact, all the discussion of moderators as a possible explanation for failures to replicate is over-interpreting noise. There might be differences between the studies. These differences might be small, and they might be large. For the vast majority of the studies, we just don’t know.
This has dramatic implications for the cumulative nature of science, because the logic of learning from a replication, and asking whether perhaps moderators can account for the difference, is no different from the logic of learning from any other pair of studies. Do two studies seem to show a different pattern of results? Is one significant and the other not? Have you ever written a discussion section that explains such differences? As with the RP:P, if sample sizes are small, there will often be large observed differences between studies, even when the true difference is zero (or small). I suspect — and I know others do too — that much of the theorizing that happens in psychological science is interpreting noise.
The ultimate culprits are publication bias combined with common misconceptions about power, which we address in the paper. We also suggest a way of powering future experiments: power your experiment such that you, or someone else, can conduct a similarly sized experiment and have high power to detect an interesting difference from your study. We need to stop thinking about studies as if they are one-offs, to be interpreted once in light of the hypotheses of the original authors. That mindset does not support cumulative science.
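Under the same normal-approximation assumptions as above, the idea of powering for a difference can be sketched as follows. The helper names (`power_for_difference`, `n_for_difference`) are hypothetical, not from the paper; the variance approximation var(d) ≈ 2/n deliberately ignores the small d-dependent term.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_for_difference(delta, n_per_group, crit_z=1.96):
    """Approximate power of a two-sided z-test to detect a true
    difference `delta` between two Cohen's d estimates, each from a
    two-group study with n_per_group per cell (var(d) ~= 2/n)."""
    se = sqrt(2.0 * (2.0 / n_per_group))  # two studies, variance 2/n each
    z = delta / se
    return 1.0 - norm_cdf(crit_z - z) + norm_cdf(-crit_z - z)

def n_for_difference(delta, target=0.80):
    """Smallest per-group n such that a same-sized follow-up study has
    `target` power to detect a difference of `delta` (hypothetical helper)."""
    n = 2
    while power_for_difference(delta, n) < target:
        n += 1
    return n

# With 20 per group, power to detect a half-SD difference is only ~20%;
# well over 100 per group is needed to reach 80% power for that difference.
print(power_for_difference(0.5, 20))
print(n_for_difference(0.5))
```

The design choice here mirrors the post's suggestion: the planning target is not the original effect, but the difference a same-sized second study could detect.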
Other things to note:
- All code has been released on GitHub
- The manuscript is supported by two interactive applications: one to allow exploration of the RP:P data set with respect to power, and one to help with understanding how to power for differences.