New pre-print: Use of significance test logic by scientists in a novel reasoning task

How scientists successfully reason about sampling distributions

Richard D. Morey
5 min read · Aug 3, 2019

Rink Hoekstra (@rinkhoekstra) and I are pleased to release a new pre-print, “Use of significance test logic by scientists in a novel reasoning task”. We test the ability of a large, diverse group of researchers to reason from sampling distributions, a core aspect of significance testing logic. Although it is common to claim that significance testing logic is “backward” and hence difficult to understand, we use a novel experimental task to show that scientists can use it to make correct inferences with high probability. We believe that previous claims that scientists do not understand significance testing logic have been hasty.

Briefly, our argument is that previous vignette-style studies may not have tapped important aspects of participants’ significance testing reasoning. In our task — which resembles a reasoning game more than a survey — participants were asked to solve a statistical puzzle: which of two groups of elves (“Sparklies” or “Jinglies”) made toys faster? Participants could perform experiments until they were satisfied they had an answer (see video).

An example of the experimental “interface”. Fictitious experimental results were displayed in an “evidence by sample size” space, but participants were intentionally kept in the dark about the underlying numerical values and their interpretation.

Our experimental trick was that the task was unlike what you might expect from a statistics task: it contained no numbers at all. The general setup would have been completely unfamiliar to participants. They had to use their abstract statistical reasoning instead of relying on heuristics about p values, confidence intervals, effect sizes, or the like. Participants were free to sample from a null distribution, or not. Crucially, the only way of discovering the null distribution was by actively requesting samples from it. And the null distribution was the only way of making sense of the magnitude of the results.
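To make that structure concrete, here is a toy version of the task’s inference loop — our own sketch, not the code from the actual task; the effect size, sample size, and function names are invented for illustration:

```r
## Toy sketch of the task's inference loop. The reasoner can request
## "experiment" draws or "null" draws of an unlabelled test statistic,
## and must decide which group is faster using only where draws fall
## relative to one another.
set.seed(42)
true_effect <- 0.5                          # hidden from the reasoner

## Standardized mean difference at sample size n (illustrative statistic)
request_experiment <- function(n) sqrt(n) * mean(rnorm(n, mean = true_effect))
request_null       <- function(n) sqrt(n) * mean(rnorm(n, mean = 0))

n     <- 25
obs   <- request_experiment(n)              # one experimental "result"
nulls <- replicate(50, request_null(n))     # actively requested null draws

## How extreme is the result among the null draws? Only ordinal
## comparisons are needed -- no numerical labels on the evidence axis.
mean(nulls >= obs)
```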

The critical experimental manipulation was that we randomly assigned participants to one of two conditions that differed only in the visualization of the null sampling distribution: visually wide or visually narrow (though, under the hood, the two were the same). In statistics, an apparently extreme observed deviation might be qualified by a high standard error; likewise, in our experiment, a visually wide null sampling distribution should make people more sceptical given the same visual “result”. This is exactly what we found (manuscript, Figure 4). Moreover, many participants told us explicitly that this was how they performed the task (see our shiny app to read people’s descriptions of their strategies). Importantly, we did not instruct the participants in how to do this. By and large, they simply did it, and came to correct conclusions on the basis of the statistical evidence.
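The sketch below is a rough mock-up of the idea — not the actual stimuli; the spreads and the on-screen result position are invented. The same underlying null draws are rendered with two different visual spreads, while the “result” sits at the same screen position in both panels:

```r
## Mock-up of the manipulation: identical null draws, two visual spreads,
## and a "result" fixed at the same on-screen position in both panels.
set.seed(1)
null_draws <- rnorm(200)      # same underlying null in both conditions
obs_screen <- 2               # fixed visual position of the result

op <- par(mfrow = c(1, 2))
for (spread in c(2, 0.5)) {   # visually wide vs visually narrow null
  plot(null_draws * spread, rep(0, length(null_draws)),
       xlim = c(-4, 4), yaxt = "n", ylab = "",
       xlab = "evidence (no numbers shown to participants)",
       main = if (spread > 1) "visually wide null" else "visually narrow null")
  abline(v = obs_screen, lwd = 2)
}
par(op)
## Against the wide null the result sits inside the cloud (warranting
## scepticism); against the narrow null the same visual deviation is extreme.
```

A reasoner using the sampling distribution should treat the two panels differently even though the result’s screen position is identical — which is what Figure 4 of the manuscript shows participants doing.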

Manuscript, Figure 4. Top: Participants were sensitive to the sampling distribution of the test statistic; their decisions were qualified by the visual narrowness of the statistic’s distribution under the null hypothesis. Bottom: Participants were much less sensitive to the non-diagnostic, merely visual extremeness of the evidence. Decisions were driven by the sampling distribution, not the more salient, but irrelevant, visual impression. See the manuscript for more details.

Another strength of the inference from this experimental task is that by destroying all numerical information and giving only ordinal information about the test statistic, we made the task essentially impossible using a likelihood or Bayesian strategy (see Supplement A, section 3). Our inference that people are using significance testing logic is bolstered not only by participants’ own reports but also by the structure of the task itself.
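One way to see why ordinal information suffices for significance logic but not for likelihood-based strategies — sketched below with our own illustrative numbers — is that a rank-based decision is invariant to any monotone distortion of the hidden evidence axis, whereas a likelihood evaluation is not:

```r
## Rank-based decisions survive monotone distortions of the evidence scale;
## likelihood values do not. All numbers here are illustrative.
set.seed(7)
nulls <- rnorm(50)
obs   <- 2.1

rank_p  <- function(x, ref) mean(ref >= x)       # ordinal comparisons only
distort <- function(x) sign(x) * sqrt(abs(x))    # arbitrary monotone transform

rank_p(obs, nulls)                               # rank-based "p"
rank_p(distort(obs), distort(nulls))             # identical: order is preserved

dnorm(obs)                                       # a normal likelihood value...
dnorm(distort(obs))                              # ...changes under the transform
```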

We believe the results suggest several things:

  • Previous claims that researchers cannot understand significance testing may have been hasty.
  • To the extent that people do understand significance testing, it might be good to take this into account in statistical reforms by building on this knowledge, instead of purging significance testing from education and practice.
  • The statistical education situation may not be as dire as we have thought. This is not to say that we should be complacent — certainly statistical practice is full of issues — but for those of us who teach statistics these results may be a ray of hope.

What we are not arguing:

  • We are not arguing that all or most researchers understand significance testing. Obviously, this would require a random sample of researchers, which would be impossible to obtain. It is better to understand our argument in the context of previous arguments from vignette surveys. Even if there is something special about our sample, the mere existence of a large, diverse sample that appears to understand significance testing logic calls into question previous dire assessments of researchers’ knowledge, which also weren’t based on random samples. Even if our sample is special, that raises the question of why, which is already progress.
  • We are not arguing that just because people appear to understand inference from sampling distributions, that there aren’t problems in practical research situations. The value of our experiment is in isolating statistical understanding from things like research incentives (e.g., fallaciously accepting the null hypothesis because it is rhetorically convenient, rather than because one misunderstands statistics), so obviously we can’t say anything about what happens in actual research.
  • We are not arguing that approaches other than significance testing might not be superior in practice. Our interest was in testing researchers’ use of basic classical statistical reasoning. One could still argue that even if people can use significance testing, other methods might be better. However, such a claim would require a positive argument, not just “how could anything else be worse?”

We hope that our results lead people to question the received wisdom that researchers have difficulty with significance testing logic. We also hope that our novel experimental method can lead to new ways of assessing students’ knowledge and perhaps new statistical educational tools. If you’re interested in working on such things, please let us know.

Transparency

I am excited to say that this is my most transparent piece of research. We’re releasing a number of companions to the paper to make it more transparent:

  • A website containing links to all resources
  • An online demonstration of the experimental task
  • Two supplements containing methodological details and additional analyses
  • A GitHub repository containing an R package with the data, materials, analyses, and the reproducible manuscripts/supplements. Obtaining everything is as easy as downloading the repository or installing the R package (see the sketch after this list). As far as I know, this represents the largest judgement and decision-making experiment ever performed with scientists. The data set is quite rich and amenable to secondary analyses.
  • A shiny app that allows exploration of individuals’ behaviour and responses
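For the R route, installation would look something like the following — the repository path here is a placeholder; use the actual location linked from the website above:

```r
## Placeholder path: substitute the real GitHub repository from the website.
# install.packages("remotes")
remotes::install_github("<user>/<repo>")
```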

We hope these tools are useful! You can contact me (richarddmorey@gmail.com) or Rink Hoekstra (r.hoekstra@rug.nl) with any questions.
