Wednesday, June 6, 2018

How p-values Work in Practice


This post is the third in a series of six posts in which I advocate against the use of p-values to report the results of statistical analysis. You can find a summary of my argument and a link to all the posts in the series in the first post of the series.

In the previous post of the series, I explained what sampling noise is. In this post, I am going to explain how p-values are used to report the statistical significance of results. I'll describe the practice in economics, but I believe most fields, especially within the social sciences, use p-values in a similar way.

Researchers generally check for statistical significance in the following way: they divide their estimated effect x by its standard error se(x) (in general, se(x) is equal to y/(2*1.96), where y is the width of the 95% confidence interval, the estimate of sampling noise I used in my previous post), take the absolute value and compare the resulting ratio to the following thresholds:
  • If the ratio is above 2.57, the p-value is below 1%, we say the effect is significant at 1% and some statistical software puts three stars in front of the estimate.
  • If the ratio is above 1.96, the p-value is below 5%, we say the result is significant at 5% and some statistical software puts two stars in front of the estimate.
  • If the ratio is above 1.68, the p-value is below 10%, we say that the effect is significant at 10% and some statistical software puts one star in front of the estimate.
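To make the recipe concrete, here is a minimal Python sketch of the starring procedure (the function name is mine; real statistical packages implement this internally):

```python
def significance_stars(estimate, ci_width):
    """Star notation from an estimate and the width of its 95% confidence
    interval, using the thresholds listed above."""
    se = ci_width / (2 * 1.96)        # recover the standard error from the CI width
    ratio = abs(estimate) / se
    if ratio > 2.57:
        return "***"                  # significant at 1%
    if ratio > 1.96:
        return "**"                   # significant at 5%
    if ratio > 1.68:
        return "*"                    # significant at 10%
    return ""                         # not significant at conventional levels

print(significance_stars(0.1, 0.19992))  # the borderline example discussed below: "**"
```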

The rationale behind this approach is based on Null Hypothesis Significance Testing (NHST). Before seeing the data, you commit to rejecting the null hypothesis of zero effect if and only if the probability of observing an estimate at least as extreme as yours, when the true effect is zero, is low enough. The procedure I describe above is a two-sided procedure, in which you do not commit ex ante to a privileged direction for the effect of your treatment: it can be either positive or negative, you want to be able to detect effects in both directions, and you allocate the same rejection probability to each tail.

The graph below illustrates this procedure with an example taken from my class. The true distribution of the effect across random samples is in black (and nicely approximated by a normal distribution in blue) and is centered around the true value of the effect, in red (0.18). The problem of sampling noise is that you do not know where the true distribution is centered: you only have access to one point from the black distribution. Thanks to the central limit theorem, you can estimate the width of the black distribution and approximate it by a normal, but you still do not know where to center it. This is where NHST comes in. With NHST, you assume that the true distribution is centered at zero (the green distribution in my graph), and you compute the threshold t for which the probability of falling above t or below -t is equal to, say, 0.05 when the distribution is centered at zero. This gives you the two green dashed lines around 0.1 and -0.1. The idea of NHST is that if the estimated parameter falls above t or below -t, it is very unlikely that it comes from the green distribution (over sampling replications, this will only happen 5% of the time if the green distribution is the true one).
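For the numerically inclined, the threshold t follows directly from the standard error. Here is a small Python sketch (the function name is mine; the standard error of 0.051 is the one implied by the example further below):

```python
from statistics import NormalDist

def rejection_threshold(se, alpha=0.05):
    """Two-sided NHST threshold t: under a null distribution centered at zero
    with standard error se, an estimate falls above t or below -t with
    probability alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2) * se

# With a standard error of 0.051, the threshold lands at roughly 0.1,
# matching the dashed lines at 0.1 and -0.1 in the graph.
print(round(rejection_threshold(0.051), 3))
```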


Based on the notation I used in the previous post of the series, with x the effect of the treatment and y the (95%) sampling noise, you can compute the p-value associated with the two-sided test as follows: 2*(1-Phi(|x|/(y/(2*1.96)))), with Phi the cumulative distribution function of the standard normal. With the example I used in my introductory post, we have x=0.1 and y/2=0.09996, so that the standard error is equal to 0.051 and the computed p-value is 0.0498. The result is thus said to be statistically significant at 5%.
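The formula above translates directly into code. A minimal Python sketch, using only the standard library:

```python
from statistics import NormalDist

def two_sided_p_value(x, y):
    """p-value of the two-sided test from the estimate x and the (95%)
    sampling noise y: 2*(1 - Phi(|x| / se)), with se = y/(2*1.96)."""
    se = y / (2 * 1.96)
    return 2 * (1 - NormalDist().cdf(abs(x) / se))

p = two_sided_p_value(0.1, 2 * 0.09996)
print(round(p, 3))  # about 0.05: the borderline case from the example above
```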

What is Sampling Noise?

This post is the second in a series of six posts in which I am arguing against the use of p-values for reporting the results of statistical analysis. You can find a summary of my argument and links to the other posts in the first post of the series. In this post, I introduce the problem of sampling noise, the issue that p-values are trying to solve. It is only when I started to understand better what sampling noise really is that I started to hate p-values so much.

What is sampling noise? Let’s take the example of a Randomized Controlled Trial (RCT). In an RCT, we randomly allocate individuals to two groups: one that receives the treatment of interest (drug, job training program, ...) and one that does not. We interpret the difference in average outcomes between the treatment and control groups as the causal effect of the treatment. The common intuition behind an RCT is that the treatment and control groups are identical in every respect except for whether they receive the treatment. But actually, when you run an RCT, the treatment and control groups are identical only when their sizes are infinite. In real life applications, with finite sample sizes, treatment and control groups differ. Some confounding variables are distributed differently in the treatment and control samples, and they bias the estimator of the treatment effect. Thanks to randomization, there is no systematic direction to this bias, and it is null on average over sample replications, a property that we call unbiasedness. But in a given sample, the very sample that you might have inherited and might be using, the size and direction of the bias are unknown. Knowing that it is zero on average is poor consolation: you are not dealing in averages, you are dealing with the sample that you have.
 

Here is an illustration from my class. In order to build it, I generated 1000 random allocations to a treatment and a control group for four different samples of increasing size taken from the same population (i.e. governed by the same model). For each random allocation, I computed the difference in average outcomes between the treatment and control groups. The histogram presents the distribution of these estimates. In red is the true effect. (The histogram actually presents the results of drawing a different sample at each replication on top of a different treatment allocation. Both graphs are extremely similar; I just happen to have this one readily available in a suitable format. The comparison with the graph obtained by keeping the same sample can be found in Lecture 0 of my course.)
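You can reproduce the flavor of this illustration with a few lines of Python. This is a sketch under assumed parameters (a true effect of 0.18 and standard normal outcomes; the exact model used in my course may differ):

```python
import random
import statistics

def simulate_estimates(n, n_replications=1000, true_effect=0.18, seed=1234):
    """Redraw the sample and the random allocation many times and collect the
    difference in average outcomes between treatment and control groups."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replications):
        treated = [true_effect + rng.gauss(0, 1) for _ in range(n // 2)]
        control = [rng.gauss(0, 1) for _ in range(n // 2)]
        estimates.append(statistics.mean(treated) - statistics.mean(control))
    return estimates

est = simulate_estimates(100)
wrong_sign = sum(e < 0 for e in est) / len(est)
# With N = 100, the spread of the estimates is large and a sizable share
# of the replications get the sign of the effect wrong.
print(statistics.stdev(est), wrong_sign)
```

Quadrupling the sample size halves the spread of the estimates, which is the square-root law discussed below.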
With a small sample size (N=100), sampling noise is large and estimates stemming from a given random allocation are extremely imprecise, to the point that almost a quarter of the estimates have the wrong sign. In my class, I formally define sampling noise y as the width of the 99% (or 95%) symmetric confidence interval around the true value. You can also use the standard deviation of this distribution as an estimate of sampling noise. Indeed, with normal distributions, sampling noise is roughly 5 (or 4 for the 95% CI) times the standard deviation. Here are the estimates of sampling noise for the examples above:
As you can see, sampling noise is large with a small sample size and decreases as sample size increases (it actually decreases with the square root of the sample size). With a small sample size, sampling noise is large, precision is low, and many values of the estimated effect might be due to sampling noise alone: we are not going to be able to rule out many possible true values of the effect, because noise affects our estimates too much. With a very large sample size, noise is trivial and the order of magnitude of the true effect is much more clearly estimated.

So the question then becomes: what can we do when there is sampling noise? At least two things:
  1. Make sampling noise as small as possible (the best approach)
  2. Quantify the size of sampling noise (when you cannot do 1.).

Let me talk about quantifying sampling noise, because this is what p-values are about. Several of the most important tools in statistics enable you to compute an estimate of sampling noise using information from only one sample. Think about how beautiful this is: you can recover an estimate of an unobserved quantity defined over replications of samples from one unique sample! We have several ways to do that: 

  • Chebyshev’s inequality, which gives you an upper bound on sampling noise.
  • The Central Limit Theorem (CLT), which approximates the sampling noise of an average by a normal distribution. When combined with other tools such as Slutsky’s Theorem and the Delta Method, the CLT enables you to approximate sampling noise for estimators that are combinations of sample averages.
  • Resampling methods, which use the current sample as a population and mimic the sampling process from this pseudo-population: the bootstrap, randomization inference, subsampling...
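To illustrate the last two bullets, here is a Python sketch that estimates 95% sampling noise from a single (simulated) sample, once with the CLT and once with the bootstrap (the data-generating parameters and function names are mine, chosen for illustration):

```python
import random
import statistics

def clt_noise(treated, control, z=1.96):
    """CLT estimate of 95% sampling noise: the width of the normal
    confidence interval for the difference in means."""
    se = (statistics.variance(treated) / len(treated)
          + statistics.variance(control) / len(control)) ** 0.5
    return 2 * z * se

def bootstrap_noise(treated, control, n_boot=1000, seed=0):
    """Resampling estimate: treat the sample as the population, redraw from it
    with replacement, and take the width of the 2.5%-97.5% quantile range."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        t = [rng.choice(treated) for _ in treated]
        c = [rng.choice(control) for _ in control]
        reps.append(statistics.mean(t) - statistics.mean(c))
    reps.sort()
    k_lo = int(round(0.025 * (n_boot - 1)))
    k_hi = int(round(0.975 * (n_boot - 1)))
    return reps[k_hi] - reps[k_lo]

# One simulated sample with true effect 0.18 and standard normal outcomes.
rng = random.Random(1)
treated = [0.18 + rng.gauss(0, 1) for _ in range(500)]
control = [rng.gauss(0, 1) for _ in range(500)]
# Both estimates land close to the true noise 2*1.96*sqrt(2/500), about 0.25.
print(clt_noise(treated, control), bootstrap_noise(treated, control))
```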
Here is an example from my class where you can see the true sampling noise (in red) along with its estimate from one sample obtained here using the CLT (in blue).
Pretty impressive, right?

Now that we understand what sampling noise is and how to estimate it, we will see in the next installment of the series how p-values deal with it.

Why I Hate P-Values, Why You Should Too, and What to Do Instead

In my previous blog post on the empirical revolution in economics, when I discussed publication bias, I suggested ditching p-values. It's a somewhat extreme position, since we social scientists (and many others) rely almost exclusively on p-values and null hypothesis significance testing (NHST) to assess the importance of our results. But I have spent a lot of time trying to understand the problem that p-values and NHST are trying to solve, both in order to teach them in my Econometrics of Program Evaluation class and in order to make sense of what I was doing when reporting the results of my own research. I've come to the conclusion that p-values and NHST are not only useless for solving the problem that we actually have (namely, assessing sampling noise), but that they are the worst tools we have for it, with so many nefarious consequences that I'm now convinced the best course of action for us social scientists (and others) is to stop using them altogether.

In order to make my case, I am going to need six more posts. In the first one, I present the problem of sampling noise as I understand it. In the second post of the series, I present how p-values try to solve the problem of sampling noise. In the third post, I present my problems with p-values. In the fourth post, I detail the problems that p-values raise for science, with a focus on publication bias and precision. In the fifth post, I present my proposal. My proposal is NOT a perfect solution, and I'm pretty sure there are other approaches that can work even better. In the sixth post, to be written after the complete series is released, I'll compile the remarks I receive on my proposal and list the alternative solutions that people have brought to my attention.

If you do not have the patience to read through the six blog posts, here is my case in a nutshell:
  1. p-values are bad because they react discretely to sampling noise, instead of accurately reporting it as a continuous measure. As a consequence, people misinterpret the true precision of their results. When a result is statistically significant, they tend to interpret it as saying that the truth is exactly equal to their estimate. When a result is not statistically significant, researchers tend to interpret it as evidence of no effect. Both of these interpretations are deeply wrong. They end up generating bad behavior, with people focusing exclusively on statistically significant results, leading to publication bias and overestimation of effects.
  2. I propose to report the estimated effect as "x±y/2", with x the main estimate and y the width of the 95% (or 99%) confidence interval around it, and to interpret which plausible magnitudes of the effect our experiment rules out. This gives sampling noise the same importance as the main estimate and focuses on a continuous interpretation of sampling noise. Compare, for example, these two ways of reporting the same barely statistically significant result: 0.1 (p-value=0.049) versus 0.1±0.099. The first gives you the impression of a positive result equal to 0.1. The second gives you the impression of a very imprecise result, which could very well be zero or double the one you've just found. It could even be slightly negative (remember we are dealing in probabilities). What is clear is that the true effect is probably not very negative and probably not much bigger than 0.2. In between, everything is possible.
  3. In order to refine your analysis further, I would recommend expressing the magnitudes of possible effects in units of σ, the standard deviation of your outcome (i.e. Cohen's d). What's nice with Cohen's d is that it lets you encode orders of magnitude of effects: a large effect is of the order of 0.8σ, a medium effect of the order of 0.5σ, a small effect of the order of 0.2σ and a very small effect of the order of 0.02σ. With the vocabulary for effect magnitudes thus standardized, you can interpret your results as excluding some magnitudes. For example, with a standard deviation σ of 1, the previous result tells you that there is probably not a large, medium or even small negative effect of your treatment, nor a large or medium positive effect, but that your results are compatible with a very small negative effect and a very small or small positive effect. Which of these is true in reality, your experiment is not precise enough to tell. This is not nothing: you have learned something valuable from this experiment. But you have also acknowledged its limitations and lack of precision.
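The mapping from magnitudes to labels is easy to mechanize. A small Python sketch using the conventions above (the cutoffs are the orders of magnitude from the text, used here as hard thresholds, which is of course a simplification; the "negligible" label for effects below 0.02σ is mine):

```python
def magnitude_label(d):
    """Cohen-style label for an effect of d standard deviations."""
    a = abs(d)
    if a >= 0.8:
        return "large"
    if a >= 0.5:
        return "medium"
    if a >= 0.2:
        return "small"
    if a >= 0.02:
        return "very small"
    return "negligible"

# The 0.1±0.099 example with sigma = 1: labels for the two ends of the interval.
print(magnitude_label(0.1 - 0.099), magnitude_label(0.1 + 0.099))
```

Note that 0.199 falls just under the 0.2σ cutoff, which is why the interpretation above hedges between "very small" and "small": these labels are conventions, not sharp boundaries.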

Thursday, March 22, 2018

The Empirical Revolution in Economics: Taking Stock and Looking Ahead

I have been asked to write a piece for the excellent TSEconomist, the TSE student magazine. I took the opportunity to put together my ideas on where empirical economics stands, where it is going, and what we can do to make things even better.

The last 40 years have witnessed tremendous developments in empirical work in economics. In a recent paper, Josh Angrist and his coauthors show that the proportion of empirical work published in top journals in economics has moved from around 30% in 1980 to over 50% today. This is a very important and welcome trend. In this article, I want to take stock of what I see as the main progress in empirical research and look ahead at the remaining challenges.

In my opinion, the main achievement of the empirical revolution in economics is causal inference. Causal inference, or the ability to rigorously tease out the effect of an intervention, enables us to test theories rigorously and to evaluate public policies properly. The empirical revolution has focused attention on the credibility of the research design: which features of the data help identify the causal effect of interest?

With the empirical revolution, economics has grown a second empirical leg along its extraordinary first theoretical leg and is now able to move forward as a fully fledged social science, weeding out wrong theories, and as true social engineering, stopping inefficient policies and reinforcing the ones that resist empirical tests.

The achievements of the empirical revolution are outstanding, in my opinion on par with the most celebrated theoretical results in the field. It is obvious to me that in the coming years several of the contributors to the empirical revolution in economics will receive the Nobel prize: Orley Ashenfelter, Josh Angrist, David Card, Alan Krueger, Don Rubin, Guido Imbens, Esther Duflo, Michael Greenstone, David Autor. Just to mention a few important achievements, in no way an exhaustive list:
What I find extraordinary is how much empirical results have supported as well as falsified basic assumptions in economic theory such as functioning markets and rational agents. Sometimes agents behave rationally, sometimes they do not. Sometimes markets work, sometimes they do not. Sometimes it matters a lot, sometimes it does not. I think we are going to see more and more theory trying to tease out the context in which deviations might matter or not.


The empirical revolution has also brought about a new type of methodological research. The economists' empirical toolkit is now structured around five types of tools for causal inference: lab/controlled experiments; Randomized Controlled Trials (RCTs); natural experiments; observational methods; and structural models. Alongside the impressive continuing achievements of theoretical econometrics, we now see methodological work investigating the empirical properties of methods of causal inference:
But challenges lie ahead that have to be addressed head on. I am very optimistic, since I can see the first responses already taking shape, but the swifter our response to these challenges, the more credibility our field will have in the public's eye and the quicker our progress will be.


The first challenge that I see is an exclusive focus on causality. Science starts with observation, documenting facts about the world that are in need of an explanation. One of the most influential pieces of empirical work in the last decades is Thomas Piketty’s effort, along with coauthors, to document the rise of inequality in countries all around the world. Observing new facts should also be part of the empirical toolkit in economics.


The second and most important challenge that I see for empirical research in economics is that of publication bias. Publication bias occurs when researchers and editors only publish statistically significant results. When results are imprecise, publication bias leads to drastic overestimation of the magnitude of results. Publication bias has plagued entire research fields such as cancer research and psychology, fields that now both face a replication crisis. A recent paper by John Ioannidis and coauthors measures the extent of publication bias in empirical economics and finds it to be very large: “nearly 80% of the reported effects in [...] empirical economics [...] are exaggerated; typically, by a factor of two and with one-third inflated by a factor of four or more.” This is a critical problem. For example, estimates of the Value of Statistical Life that are used to evaluate policies are overestimated by a factor of 2, leading to incorrect policy decisions.

The third challenge is that of precision: most results in empirical economics are very imprecise. To illustrate this, I like to use the concept of signal-to-noise ratio. A result barely statistically significant at 5% has a signal-to-noise ratio of 0.5, meaning that there is twice as much noise as there is signal. Such a result is compatible with widely different true effects (from very, very small to very, very large). But things are actually worse than that. Ioannidis and coauthors estimate that the median power in empirical economics is 18%, which implies a signal-to-noise ratio of 0.26, meaning that the median result in economics contains four times more noise than signal. I attribute this issue to an exclusive focus on statistical significance at the expense of looking at actual sampling noise.
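For readers who want to check these numbers, here is a Python sketch of the computation, with signal defined as the true effect in standard-error units and noise as the width of the 95% confidence interval (my reading of the definitions used above; a barely significant estimate of 1.96 standard errors corresponds to 50% power when taken at face value):

```python
from statistics import NormalDist

def signal_to_noise(power, alpha=0.05):
    """Signal-to-noise ratio implied by the power of a two-sided test.
    Signal: the true effect, in standard-error units, that yields this power
    (ignoring the negligible far tail). Noise: the width of the 95% CI,
    i.e. 2 * 1.96 standard errors."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # 1.96 for alpha = 0.05
    effect = z + NormalDist().inv_cdf(power)     # effect size in se units
    return effect / (2 * z)

print(signal_to_noise(0.5))    # barely significant result (power 50%): 0.5
print(signal_to_noise(0.18))   # median power in economics: about 0.27, in line with the ~0.26 quoted
```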

How to address these challenges? In my opinion, we need to see at least three major evolutions in publishing, research and teaching.
  1. Editors have to take steps to encourage descriptive work and to curb publication bias. This requires:
  2. Researchers have to join their efforts to obtain much more precise results. This requires:
    • Take stock of where we stand: organize published results using meta-analysis in order to check which theoretical propositions in economics have been validated or refuted and with which level of precision.
    • Identify the critical remaining challenges: what are the 10 or 100 most important empirical questions in economics? Follow the example of David Hilbert stating the 23 problems of the century in mathematics.
    • Focus all of the profession’s efforts on trying to solve these challenges, especially by running very large and very precise critical experiments. Examples that come to mind here are physicists uniting to secure funding for and building the CERN and LIGO/VIRGO experiments required to test critical predictions of the standard model.
  3. Teach economics as an empirical science, by including empirical results on an equal footing with theoretical propositions. This would serve several purposes: identify what is the common core of empirically founded propositions in economics; identify the remaining challenges; help students learn the scientific method and integrate them into the exciting journey of scientific progress.

So many things to do. It is so exciting to see this revolution and to be able to contribute to it!

Corrected on May 23, 2018: addition of links to tests of and corrections for publication bias.