Wednesday, June 6, 2018

My Problems with p-values

This post is the fourth in a series of six posts in which I am arguing against the use of p-values for reporting the results of statistical analysis. You can find a summary of my argument and links to the other posts in the first post of the series. In this post, I present my problems with p-values. 

What are my problems with p-values? Oyy, where to start? Here are four reasons why I hate p-values and NHST:
  1. Science is not about making decisions each time you see a new sample. NHST and p-values are not designed for scientific inquiry but for industrial decisions.
  2. In general, scientists interpret statistically significant effects as equal to the true effect while non-statistically significant results are interpreted as zeroes. Both of these interpretations are deeply WRONG.
  3. Statistically significant results are biased.
  4. Marginally significant results are very imprecise.

NHST and p-values are not designed for scientific inquiry but for industrial decisions

NHST and p-values are not adapted to scientific inquiry but to industrial practice. In order to explain why this is the case, let me tell you a story. (I do not know if this story is 100% true or only apocryphal, but I'm pretty sure I read it in one of Stephen Stigler's books or in Hald's History of Mathematical Statistics. Whether the story is literally true or not does not really matter, though, since it is mostly there to illustrate the context in which NHST and p-values were meant to be used.)

The inventor of the significance test is William S. Gosset, better known under his pen name of Student (he was so humble that he considered himself a student of the great statisticians of his day, especially Karl Pearson). William Gosset wrote under a pen name because he was not an academic: he worked for a private firm, the famous Guinness brewery. Gosset designed the testing procedure in order to solve a very practical problem that he faced at his job. Every day, a new batch of grain would come in. Before sending the grain into production, Guinness employees would take a sample of the grain (let's say ten small samples taken from random parts of the batch) in order to assess its quality. They would estimate the quality of the grain in each of the samples for a characteristic important for brewing. Based on the sample, they would have to decide whether to discard the batch or put it into production. The problem is that the sample is a noisy estimate of the quality of the batch. If the batch was bad, but they wrongly decided to put it into production, they would lose money. If the batch was good and they decided to discard it, they would also lose money. You recognize here the errors of the second and first type from statistical testing.

So Gosset had to make a choice, every day, based on a sample, to discard or accept a batch of grain. He devised a procedure that would minimize the risk of discarding a good batch under a fixed probability of discarding a bad one. The procedure simply used a test statistic: compute the value of the test statistic under the assumption that the batch is good, compute its p-value, and discard the batch if the p-value is smaller than 0.05.
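Gosset's daily decision rule can be sketched in a few lines. Everything below is illustrative: the quality target and the ten measurements are made up, and the normal approximation stands in for Gosset's exact small-sample t tables.

```python
import math
from statistics import mean, stdev

def accept_batch(samples, target, alpha=0.05):
    """Decide whether to send a batch to production.

    H0: the batch meets the quality target. Discard only if the sample
    gives strong evidence (one-sided p < alpha) that quality falls short.
    """
    n = len(samples)
    se = stdev(samples) / math.sqrt(n)
    t = (mean(samples) - target) / se
    # One-sided p-value P(T <= t) under H0, using the normal approximation
    # (Gosset's own contribution was the exact Student t refinement of this step).
    p = 0.5 * math.erfc(-t / math.sqrt(2))
    return p >= alpha  # True: accept the batch; False: discard it

# Ten illustrative measurements from one batch, against a made-up target of 10.0
batch = [9.8, 10.1, 9.9, 10.2, 9.7, 10.0, 9.9, 10.1, 9.8, 10.0]
print(accept_batch(batch, target=10.0))
```

The point of the sketch is the shape of the procedure, not the numbers: a decision is forced on every sample, and the threshold exists only because a decision must be made.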

Gosset's procedure (and test statistics in general) makes a lot of sense in an industrial context. There is repeated sampling, and an actual decision is made at each sample repetition. Test statistics are perfectly adapted to this problem. Science is a very different problem altogether. There is no repeated sampling. We do not make a decision after each sample repetition. We do not need a procedure to help us make this decision. Fisher was the one who adapted Gosset's idea and translated it to scientific practice. He devised p-values as a means to estimate the strength of the evidence against an assumption. He suggested that below 5%, the evidence could be considered to weigh strongly against the assumption. But he never made this threshold a magical threshold. What made this threshold magical was the decision procedure attached to statistical testing, which Neyman and Pearson formalized after Gosset. But, again, this procedure was adapted to an industrial context of repeated decisions, not to scientific inquiry.

NHST and p-values give a false sense of certainty around a cutoff

The problem with using test statistics in science is that they focus our attention on the position of our results with respect to a cutoff. Have you ever noticed how much more excited you feel when your results cross the 5% significance threshold? How disappointed you are when they fall just short? We also tend to radically alter our reporting of a result depending on whether it is statistically significant. For example, if a coefficient is statistically different from zero, we are happy and we report it as a positive effect. If the result is not statistically different from zero, we report it as insignificant, and in general we treat it as good as zero. This is something every single one of us has felt.

And this is wrong. It reveals a deep misunderstanding of what sampling noise really is and what statistical testing is all about.

Look for example at samples 27 and 28 when N=1000 on the figure above. With sample 27, you have a treatment effect estimated at around 0.18, significantly different from zero at the 5% level. So two stars significance. Great. You tend to interpret this result as a 0.18 positive significant treatment effect, and you are going to remember 0.18. With sample 28, you have a treatment effect estimated at around 0.17, not significantly different from zero at the 5% level. You tend to interpret this result as a non-significant treatment effect, and in general you are going to remember it as a zero. But the two samples contain exactly the same objective information: the confidence interval for the effect is large, ranging from very small (zero or slightly negative) to large.

You cannot change your opinion on a program because some random noise has marginally changed your estimator so that its test statistic falls just above or just below 1.96. Nothing has objectively changed between these two samples. The only reason why we would need to choose a cutoff and change our minds when crossing this threshold is because we want to make a decision. But there is no decision to make. So we should consider the two samples as bringing exactly the same information: either a very small effect (positive or negative) or a very large positive one.

But the 95% confidence interval for sample 28 tells us that the effect might be negative, whereas that is not the case for sample 27. Doesn't that count for something? No, for two reasons. First, a very small effect, positive or negative, is just small and does not have important consequences in the real world. Second, even if it does, your precision does not allow you to conclude anything with certainty. Probability distributions are continuous here, and the change in the probability of the treatment effect being below zero from sample 27 to sample 28 is marginal, extremely small. You can see this if you use the 99% confidence interval instead: then the interval for sample 27 also contains zero and small negative effects.
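To make this concrete, here is a hypothetical version of samples 27 and 28. The estimates and standard errors below are assumed for illustration (they are not taken from the original simulation); they are chosen so that one t-statistic lands just above 1.96 and the other just below.

```python
# Two nearly identical results that a test treats in opposite ways.
estimates = {"sample 27": (0.18, 0.090), "sample 28": (0.17, 0.088)}

for name, (effect, se) in estimates.items():
    t = effect / se
    lo, hi = effect - 1.96 * se, effect + 1.96 * se
    verdict = "significant" if t > 1.96 else "not significant"
    print(f"{name}: t = {t:.2f} ({verdict}), 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Both confidence intervals run from roughly zero to roughly 0.35: the objective information is the same, even though the test verdicts differ.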

Look now at samples 18 and 19 of the same graph. The effects there are small and zero is well within the confidence bands, so a statistical test would just give you an insignificant estimate. In general, you will interpret this as a zero. But this would be wrong. Completely wrong actually, since the true treatment effect is actually 0.18. And the objective information from the confidence interval tells you just that: 0.18 is well within the confidence bands too.

Stick with the objective information. Tests focus your attention on details, marginal changes, and cutoff decisions instead of on an objective assessment of sampling noise. Tests are used as a way to gain false certainty in the face of sampling noise. No statistical test can get rid of sampling noise.

Statistically significant treatment effects are biased

One very annoying property of statistically significant results is that they are always biased upwards, especially if sampling noise is large (and thus especially if sample size is small). Look again at the figure above. With N=100, the estimates that are statistically different from zero at the 5% level are 2 to 2.5 times bigger than the true effect. With N=1000, not all statistically significant results overestimate the true effect, but most do.

That statistically significant results are biased upwards is a mechanical consequence of NHST. In order to shed more light on this, let's compute the p-values for the two-sided t-test that the treatment effect is zero for all our Monte Carlo samples using the CLT-based estimates of sampling noise. The figure below presents the results.

You can see that with N=100, treatment effects are significant at the 5% level only when they are bigger than roughly 0.3 (the 5% threshold is the blue line on the graph, the red line is the threshold for 1% significance). Remember that the true effect is 0.18! With N=1000, samples with an estimated effect smaller than 0.1 are never significant at the 5% level. As a consequence, the average of the statistically significant effects overestimates the true effect by a large amount: with N=100, statistically significant effects are on average double the truth, whereas with N=1000, they are roughly 50% bigger.
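This significance filter is easy to reproduce. The sketch below assumes a true effect of 0.18 and two standard errors picked purely for illustration (they are not the ones from the original simulation): one where noise is large relative to the effect and one where it is moderate. Averaging only the significant estimates overstates the truth in both cases, and more so when noise is larger.

```python
import random

random.seed(42)
TRUE_EFFECT = 0.18

def mean_significant_estimate(se, n_sims=100_000):
    """Average estimate among Monte Carlo draws that reject H0: effect = 0."""
    significant = [
        est for est in (random.gauss(TRUE_EFFECT, se) for _ in range(n_sims))
        if abs(est) / se > 1.96  # two-sided 5% test
    ]
    return sum(significant) / len(significant)

for label, se in [("low precision (se=0.155)", 0.155),
                  ("moderate precision (se=0.100)", 0.100)]:
    avg = mean_significant_estimate(se)
    print(f"{label}: significant estimates average {avg:.3f}, "
          f"{avg / TRUE_EFFECT:.1f}x the true effect of {TRUE_EFFECT}")
```

The mechanism is exactly the one described above: the test only lets through estimates that cleared the 1.96-standard-error bar, so the surviving estimates are a truncated, upward-shifted version of the full sampling distribution.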

Note that there is no such problem for larger sample sizes, where all results are statistically significantly different from zero. We should still expect some estimates to be close to zero, but the probability that this happens is so small that it has no practical effect. People accustomed to p-values are thrown off when using large sample sizes where everything is significant at conventional levels. This is actually a funny consequence of not understanding sampling noise and test statistics: the fact that everything is significant means that uncertainty about parameter values has decreased and that you can finally look at the magnitudes of the coefficients, not just at whether they are different from zero.

Marginally significant results are very imprecise

Another related and very unfortunate consequence of p-values is that results that are marginally significant at the 5% level are very imprecise: their signal to noise ratio is equal to 0.5, meaning that there is twice as much noise as there is signal. This is a very simple consequence of using NHST. Remember that scientists consider a result to be significant at 5% when the ratio t=|x|/se(x) is greater than 1.96. Now, the signal to noise ratio can be defined as s/n=|x|/y, with y the width of the 95% confidence interval. Remember that with a normal distribution, we have se(x)=y/(2*1.96). As a consequence, t~1.96 -> s/n~0.5.
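The arithmetic in the paragraph above, spelled out:

```python
z = 1.96  # critical value of the two-sided test at the 5% level
t = 1.96  # a marginally significant result: t exactly at the cutoff

# se(x) = y / (2 * 1.96), with y the width of the 95% confidence interval,
# so s/n = |x| / y = t * se(x) / (2 * 1.96 * se(x)) = t / (2 * 1.96).
signal_to_noise = t / (2 * z)
print(signal_to_noise)  # 0.5: twice as much noise as signal
```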

A related consequence of using NHST is that, when choosing sample size using a power analysis based on a similar testing procedure, the signal to noise ratio under the default settings for size (5%) and power (80%) is also small. Indeed, one can show that the signal to noise ratio of a power analysis for a one-sided test (for a two-sided test, replace alpha by alpha/2) is equal to:

s/n = (z_{1-alpha} + z_{kappa}) / (2*z_{(1+delta)/2}),

with kappa the power, z_q the quantile of order q of the standard normal distribution, and delta the confidence level used to build an estimate of sampling noise. With delta=95% and a two-sided test, the signal to noise ratio of a usual power analysis is 0.71, so that there is still 1.4 times more noise than signal.
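We can check this number directly with the standard normal quantile function from the standard library; alpha, power, and delta are the conventional defaults named above.

```python
from statistics import NormalDist

def power_snr(alpha=0.05, power=0.80, delta=0.95, two_sided=True):
    """Signal to noise ratio implied by a standard power analysis.

    Signal: minimum detectable effect, (z_{1-a} + z_{power}) * se.
    Noise: width of the delta-level confidence interval, 2 * z_{(1+delta)/2} * se.
    The se cancels, leaving a ratio that depends only on the test settings.
    """
    q = NormalDist().inv_cdf
    a = alpha / 2 if two_sided else alpha
    return (q(1 - a) + q(power)) / (2 * q((1 + delta) / 2))

print(round(power_snr(), 2))  # 0.71 with the default 5% size and 80% power
```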
