Wednesday, June 6, 2018

Why I Hate P-Values, Why You Should Too, and What to Do Instead

In my previous blog post on the empirical revolution in economics, when I discussed publication bias, I suggested ditching p-values. That's a somewhat extreme position, since we social scientists (and many others) rely almost exclusively on p-values and null hypothesis significance testing (NHST) to assess the importance of our results. But I have spent a lot of time trying to understand the problem that p-values and NHST are trying to solve, both in order to teach them in my Econometrics of Program Evaluation class and in order to make sense of what I was doing when reporting the results of my own research. I've come to the conclusion that not only are p-values and NHST not useful for solving the problem we actually have (namely, assessing sampling noise), but that they are the worst tools we have for solving it, tools with so many nefarious consequences that I'm now convinced the best course of action for us social scientists (and others) is to stop using them altogether.

In order to make my case, I am going to need six more posts. In the first one, I present the problem of sampling noise as I understand it. In the second post of the series, I present how p-values solve the problem of sampling noise. In the third post, I present my problems with p-values. In the fourth post, I detail the problems that p-values raise for science, with a focus on publication bias and precision. In the fifth post, I present my proposal. It is NOT a perfect solution, and I'm pretty sure there are other approaches that can work even better. In the sixth post, to be written after the complete series is released, I'll compile the remarks I receive on my proposal and list the alternative solutions that people bring to my attention.

If you do not have the patience to read through the six blog posts, here is my case in a nutshell:
  1. p-values are bad because they react discretely to sampling noise instead of reporting it accurately as a continuous measure. As a consequence, people misinterpret the true precision of their results. When a result is statistically significant, they tend to interpret it as saying that the truth is exactly equal to their estimate. When a result is not statistically significant, they tend to interpret it as evidence of no effect. Both interpretations are deeply wrong. They end up generating bad behavior, with researchers focusing exclusively on statistically significant results, leading to publication bias and overestimation of effects. (The first code snippet after this list illustrates how sharply the significant/not significant verdict reacts to sampling noise.)
  2. I propose instead to report the estimated effect as "x±y/2," with x the main estimate and y the width of the 95% (or 99%) confidence interval around it, and to interpret which plausible magnitudes of the effect our experiment rules out. This gives sampling noise the same importance as the main estimate and focuses on a continuous interpretation of it. Compare, for example, these two ways of reporting the same barely statistically significant result: 0.1 (p-value=0.049) versus 0.1±0.099. The first gives you the impression of a positive result equal to 0.1. The second gives you the impression of a very imprecise result, one that could very well be zero or double what you've just found. It could even be slightly negative (remember, we are dealing in probabilities). What is sure is that the true effect is probably not very negative and probably not much bigger than 0.2. In between, everything is possible. (The second snippet after this list shows how to turn an estimate and its p-value into this kind of report.)
  3. In order to refine the analysis further, I recommend expressing the magnitudes of possible effects in units of σ, the standard deviation of your outcome (i.e., Cohen's d). What is nice with Cohen's d is that it encodes orders of magnitude of effects: a large effect is of the order of 0.8σ, a medium effect of the order of 0.5σ, a small effect of the order of 0.2σ, and a very small effect of the order of 0.02σ. Because this vocabulary of effect magnitudes is standardized, you can interpret your results as excluding some magnitudes. For example, with a standard deviation σ of 1, the previous result tells you that there is probably not a large, medium, or even small negative effect of your treatment, nor a large or medium positive effect, but that your results are compatible with a very small negative effect and with a very small or small positive effect. Which of these is true in reality, your experiment is not precise enough to tell. This is not nothing: you have learned something valuable from this experiment. But you have also acknowledged its limitations and lack of precision. (The third snippet after this list screens a confidence interval against these conventional magnitudes.)
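
To make the first point concrete, here is a toy Python illustration (my own, with made-up numbers, not taken from any specific study) of how the significant/not-significant verdict reacts discretely to sampling noise. Two hypothetical studies estimate the same effect with the same precision; their estimates differ by far less than one standard error, yet one clears the 5% threshold and the other does not.

```
from scipy.stats import norm

def two_sided_p(estimate, se):
    """Two-sided p-value for H0: effect = 0, under a normal approximation."""
    return 2 * norm.sf(abs(estimate) / se)

# Two hypothetical studies with made-up numbers: same precision, estimates
# that differ by far less than one standard error.
studies = {"A": (0.20, 0.10), "B": (0.16, 0.10)}  # (estimate, standard error)

for name, (est, se) in studies.items():
    p = two_sided_p(est, se)
    verdict = "statistically significant" if p < 0.05 else "not statistically significant"
    print(f"study {name}: estimate {est:.2f}, p = {p:.3f} -> {verdict}")

# Study A comes out "significant" (p ≈ 0.046) and study B does not (p ≈ 0.110),
# even though the two estimates are statistically indistinguishable from each other.
```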
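Here is a minimal sketch of the reporting format from point 2, assuming the p-value comes from a two-sided test under the usual normal approximation (the function name pm_report is mine, for illustration only).

```
from scipy.stats import norm

def pm_report(estimate, p_value, level=0.95):
    """Report an estimate as 'x ± y/2', with y the width of the confidence
    interval implied by a two-sided p-value under a normal approximation."""
    implied_se = abs(estimate) / norm.isf(p_value / 2)   # standard error implied by the p-value
    half_width = norm.isf((1 - level) / 2) * implied_se  # half-width of the chosen interval
    return f"{estimate:.3f} ± {half_width:.3f}"

# The running example: a barely significant result.
print(pm_report(0.1, 0.049))  # roughly "0.100 ± 0.100", i.e. the 0.1±0.099 report above, up to rounding
```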
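And here is a rough sketch of the magnitude screening from point 3 (again my own illustration; only the benchmark values are Cohen's conventional ones). It uses a strict inside-the-interval check, so benchmarks that sit just outside the interval, like the small positive effect of 0.2σ in the running example, show up as "ruled out" even though, as noted above, they remain somewhat plausible.

```
# Cohen's conventional benchmarks, in units of the outcome's standard deviation.
BENCHMARKS = {"very small": 0.02, "small": 0.2, "medium": 0.5, "large": 0.8}

def screen_magnitudes(lower, upper, sigma=1.0):
    """Sort the signed conventional effect sizes into those lying inside
    the interval [lower, upper] and those lying outside it."""
    compatible, ruled_out = [], []
    for name, d in BENCHMARKS.items():
        for sign, label in ((-1, "negative"), (+1, "positive")):
            effect = sign * d * sigma
            bucket = compatible if lower <= effect <= upper else ruled_out
            bucket.append(f"{name} {label} ({effect:+.2f})")
    return compatible, ruled_out

# The running example: the 95% interval [0.001, 0.199] from 0.1 ± 0.099, with sigma = 1.
compatible, ruled_out = screen_magnitudes(0.001, 0.199)
print("inside the interval: ", compatible)  # only the very small positive effect
print("outside the interval:", ruled_out)   # everything else, though +0.20 and -0.02 only barely
```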
