Thursday, September 20, 2018

Why We Cannot Trust the Published Empirical Record in Economics and How to Make Things Better

A string of recent results is casting doubt on the soundness of the published empirical record in economics. Economics is now undergoing a replication crisis similar to the one psychology and cancer research have gone through in the last ten years. This crisis is broad, in that it concerns all of the published empirical results, and severe, in that it might mean that most of them are wrong. Because we cannot trust any single result and do not know which ones hold and which ones do not, the whole empirical record of our field is in doubt. This is very serious business.

In this blog post, I want to briefly explain what the replication crisis in economics is and what its most likely causes are. I'll then provide some evidence, along with personal stories and anecdotes, that illustrates the bad behaviors generating the replication crisis. Then, I'm going to detail the set of solutions that I think we need in order to overcome the crisis. Finally, I will end with a teaser about a project that I am preparing with some colleagues, the Social Science Knowledge Accumulation Initiative, or SKY (yeah, I pronounce SSKAI SKY, can I do that?), that we hope is going to provide the infrastructure necessary to implement the required changes.

In a sense, I'm relieved and happy that this crisis is happening, because it is high time that we put the great tools of the empirical revolution to good use by ditching bad practices and incentives and starting to accumulate knowledge.

What is the replication crisis?


The replication crisis is the fact that we do not know whether the published empirical results would be replicated if one reproduced them exactly as the original authors did, with the same data collection procedure and the same statistical tools.

The replication crisis comes from editors choosing to publish only studies that show statistically significant results. Because statistically significant results can be obtained by chance or by using Questionable Research Practices (QRPs), the published record is populated by studies whose results are due to luck or to data wizardry.

The replication crisis is a very bad problem because it implies that there is no guarantee that the truth will emerge from the published record. When published results are selected so as to find effects that are just big enough to pass the infamous statistical significance thresholds, the published record is going to be severely biased. Small effects are going to look bigger than they really are, and zero effects (or very small effects) might look like they exist and are substantial. And we do not know which effects are concerned and how many of them there are. The replication crisis means that we cannot trust the published empirical record in economics.

The Four Horsemen of the Apocalypse: the Behaviors Behind the Replication Crisis


Before getting to the evidence, let me detail the four behaviors that make the published record unreliable: publication bias, the file drawer problem, p-hacking and HARKing. Let me describe them in turn with the help of an example.

Let's start with publication bias. Imagine that you have a theory saying that all coins are biased. The theory does not say the direction of the bias, only that all coins are biased: some might give more heads than tails and others more tails than heads. In order to test the theory, 100 research teams start throwing coins, one for each team. By sheer luck, sometimes, even unbiased coins will have runs with more heads than tails or more tails than heads. How do you decide whether a run is due to sheer luck or to a biased coin? Well, scientists rely on a procedure called a statistical test. The idea is to assume a balanced coin (we call that the Null Hypothesis) and derive what type of data we should see under that assumption. The way it works is as follows: assuming a fair coin, the distribution of, say, the proportion of heads over runs of N flips is going to be centered around 0.5, with some runs giving a proportion larger than 0.5 and some runs giving a lower one. The distribution is going to be symmetric around 0.5, with tails that get thinner and thinner as we move away from 0.5 on each side, because we are less likely to obtain 75% of heads than 55%. Now, the test works like this: each team compares its own proportion of heads to the theoretical distribution of the proportion of heads under the Null for a sample of size N. If the observed proportion of heads is far away in the tails of this distribution, on either side, we are going to say that the coin is biased, because it is unlikely that its run comes from the fair-coin Null distribution. But how do we decide precisely? By convention, we declare the result statistically significant, and the coin unfair, if the proportion of heads over the N flips is equal to or higher than the 97.5th percentile of the distribution of the fair coin under the Null, or equal to or lower than the 2.5th percentile of the same distribution. Why do we choose these values? Because they have the property that, under the Null, we are only going to conclude incorrectly that a coin is biased while it is actually fair 5% of the time, and 5% is considered a small quantity. Or, rather, it was chosen by the inventor of the procedure, Ronald Fisher, and it kind of became fetishized after that. Something like a papal decree (we have a lot of them in science, rather surprisingly). The precise values of the thresholds above or below which we are going to decide that a coin is not fair depend on sample size. They get closer to 0.5 as the number of flips increases, because with a larger number of flips, more and more samples are going to find values closer to 0.5 under the Null Hypothesis that the coin is fair. With larger sample sizes, a smaller deviation is going to be considered more surprising.
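To make that last point concrete, here is a minimal sketch of how the thresholds move with the number of flips; this is my own illustration, not code from any of the studies discussed here, and it assumes Python with scipy installed:

```python
# Under the Null of a fair coin, the number of heads over N flips follows a
# Binomial(N, 0.5) distribution. We reject fairness when the observed
# proportion of heads falls outside the central 95% of that distribution.
from scipy.stats import binom

for N in [20, 100, 1000, 10000]:
    lower = binom.ppf(0.025, N, 0.5) / N  # 2.5th percentile of the proportion of heads
    upper = binom.ppf(0.975, N, 0.5) / N  # 97.5th percentile of the proportion of heads
    print(f"N = {N:6d}: declare the coin biased if the proportion of heads "
          f"is below {lower:.3f} or above {upper:.3f}")

# As N grows, the thresholds tighten around 0.5: with more flips, smaller
# deviations from 0.5 become surprising under the Null.
```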

Even if all coins are fair and the theory is wrong, by the sheer property of our testing procedure, about 5 out of the 100 competing teams are going to report statistically significant evidence in favor of the biased coin theory on average: 2.5 that the coin is biased towards heads and 2.5 that the coin is biased towards tails. That would be fine if all 100 teams reported their results. The distribution of the proportion of heads over the 100 teams would be very similar to the distribution under the Null, and would thus invalidate our theory and vindicate the Null.
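Here is a minimal sketch of that aggregate mechanism, again my own illustration under the assumptions of the coin example (each of 100 teams flips a fair coin 200 times and applies the two-sided 5% test):

```python
# 100 teams, all flipping a fair coin: roughly 5 of them will still "find"
# statistically significant evidence of bias, purely by chance.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1234)
n_teams, n_flips = 100, 200

# Rejection thresholds in number of heads (approximate, because the binomial
# distribution is discrete).
lower = binom.ppf(0.025, n_flips, 0.5)
upper = binom.ppf(0.975, n_flips, 0.5)

heads = rng.binomial(n_flips, 0.5, size=n_teams)   # each team's number of heads
significant = (heads <= lower) | (heads >= upper)  # teams concluding the coin is biased

print(f"{significant.sum()} teams out of {n_teams} report a significant bias.")
# If only those teams get published, the record contains nothing but evidence
# in favor of the (false) biased-coin theory.
```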

But what would happen if only the statistically significant results got published? Well, the published record would only contain evidence in favor of a false theory, contradicting the correct Null hypothesis: the published record would say "All coins are unfair."

Wow, how is it possible that something like that happens? It seems that my colleagues and friends, who are referees and editors for and readers of scientific journals, hate publishing or reading about Null, non-significant results. Why is that? You might say that it looks completely foolish once I've presented you with how the procedure works. Sure, but I only came to what I hope is a clear understanding of the problem after a lot of effort: reading multiple books and articles about the replication crisis, simulating these coin tosses for myself, teaching the problem in my class and writing a series of blog posts on the topic. Here are the reasons I see. (i) Most scientists do not have a great mastery of statistical analysis (well, you already have to master your own discipline, and now you would need stats on top of that, that's obviously hard). The statistical education of scientists is a series of recipes to apply and of rules to follow, instead of an introduction to the main issue (that of sampling noise) and to the ways to respond to it. As a consequence of poor statistical training, most scientists interpret Null results as inconclusive, and statistically significant results as conclusive. (ii) Most scientists are not really aware of the aggregate consequences of massive science, with multiple teams pursuing the same goal, and of how that interacts with the properties of statistical tests. Most teams have not read John Ioannidis' paper on this (on which my example is loosely based). (iii) Researchers are like every other human being on the planet: they like a good story, they like to be surprised and entertained, they do not like boring. And most correct results are going to be boring: “Coins are fair? Boring! Who would like to know about that? Wait, what, coins are unfair? Here is the front page for you, buddy!” The problem is that the world is probably mostly boring or, stated otherwise, most of our theories are probably false. We are thus going to find a lot of Null results and we should know about them in order to weed out the wrong theories.

OK, now we understand publication bias. But there’s more. Scientists understand that editors and referees do not like insignificant results. As a consequence, they adopt behaviors that make the replication crisis more severe. These Questionable Research Practices (QRPs) - the file drawer problem, p-hacking and HARKing - increase the likelihood of finding a significant effect even if there is none, to levels much higher than 5%, sometimes as high as 60%. As a consequence, the published record is populated by zombie theories even if the truth is actually the good old Null hypothesis. The file drawer problem is scientists simply not bothering to convert non-significant results into papers because they know they’ll never get to publish them. As a consequence, the bulk of the evidence in favor of the Null is eradicated before even getting to the editors. The second problem is called p-hacking. P-hacking is when researchers use the degrees of freedom that they have in experimental design and data analysis in order to increase the likelihood of finding significant results. And there are unfortunately tons of ways to do just that. In my biased coin theory example, p-hacking could happen, for example, if researchers checked the result of the test periodically, after a few coin flips and then after collecting a few more, and decided to stop collecting data as soon as the test result became significant. Researchers might also decide to exclude some flips from the sample because they were not done in correct conditions. This might look very innocent and might not be done in bad faith: you just keep applying corrections to your data while your result is non-significant and stop doing them once it becomes significant. Finally, scientists might try out a host of different tests of the hypothesis and report only the statistically significant ones. For example, bias could be defined as the proportion of heads being different from 0.5, but it could also be that the coin has long runs of successive heads or of successive tails, even if it is unbiased on average. Or it might be that the coin is biased only when tossed in a certain way. If all teams try all these combinations and only report the statistically significant results, the number of teams out of 100 finding significant results would increase by a lot, sometimes up to 60, instead of the initial 5. HARKing, or Hypothesizing After the Results are Known, is the practice of formalizing the theory after seeing the data and reporting it as if the theory came first. For example, among the 100 teams investigating coin flips, the ones finding significantly more heads might end up publishing a paper with evidence in favor of a theory stating that coins are all biased towards heads. The other teams might publish a paper with evidence in favor of the theory that all coins are biased towards tails. Some teams might find that coins produce successive “hot hand” runs of heads or tails, some teams might publish evidence for theories that the way you throw the coin makes a difference to how biased it is, etc. Don’t laugh, that has happened repeatedly.
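To see how much damage a seemingly innocent QRP can do, here is a minimal sketch of p-hacking by optional stopping in the coin example; it is my own simulation, not taken from any of the studies cited here, and it assumes Python with numpy and a recent scipy:

```python
# A team flips a fair coin, peeks at a two-sided binomial test every 20 flips,
# and stops as soon as p < 0.05. Even though the coin is fair, the realized
# false-positive rate ends up well above the nominal 5%.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_simulations, max_flips, peek_every = 2000, 400, 20

false_positives = 0
for _ in range(n_simulations):
    flips = rng.integers(0, 2, size=max_flips)  # 0 = tails, 1 = heads, fair coin
    for n in range(peek_every, max_flips + 1, peek_every):
        p = binomtest(int(flips[:n].sum()), n, 0.5).pvalue
        if p < 0.05:          # stop collecting data as soon as it "works"
            false_positives += 1
            break

print(f"False-positive rate with optional stopping: {false_positives / n_simulations:.1%}")
# A single test run only once, after all 400 flips, would reject in about 5% of simulations.
```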


What is the evidence that the published record in economics is tainted by publication bias and Questionable Research Practices?


If my theory of the replication crisis is correct, what should we see? Well, it depends on whether we spend our time hunting for true effects or for Null effects. More precisely, it depends on the true size of the effects that we are hunting for and on the precision of our tests (i.e., the size of our samples). If we spend our time looking for small, very small or zero effects with low precision, i.e., small sample sizes, we should see the distribution under the Null, but truncated below the significance thresholds if there is publication bias. That is, we should see only significant results being published, with the bulk of them stacked close to the statistical significance threshold and the mass slowly thinning out as we move to higher values of the test statistics. If there is p-hacking or HARKing, we should see a bulging mass just above the significance thresholds. If, on the contrary, we are mostly looking at large effects with large sample sizes, the distribution of test statistics should look like a normal distribution only slightly censored on the left, where the missing mass of non-significant results falls. We should see a small bulging mass above the significance threshold if there is p-hacking or HARKing, but that should not be the main mass. The main mass should be very far away from the significance thresholds, providing indisputable evidence of large and/or precisely estimated effects.

So what do we see? The following graph is taken from a very recent working paper by Abel Brodeur, Nikolai Cook and Anthony Heyes. It plots in black the distribution of 13,440 test statistics for main effects reported in 25 top economics journals in 2015, by method (I'll come back to that). It also reports the theoretical distribution under the Null that all results are zero (dashed grey line) and the thresholds above which effects are considered statistically significant (vertical lines).



What we see is highly compatible with researchers mostly hunting for Null effects, with Null results massively missing from the published record, and with published effects that are due either to luck or to QRPs.
  1. There is no evidence of large and/or precisely estimated effects: in that case, test statistics would peak well above the significance thresholds. In the actual data, it's all downhill above the significance thresholds.
  2. We might interpret the plot for IV as evidence of medium-sized or not super precisely estimated effects.  Taken at face value, the peak around the significance thresholds of IV might tell the story of effects whose size places them just around the significance thresholds with the precision that we have in economics. That interpretation is very likely to be wrong though:
    1. It is just weird that our naturally occurring effects happen to align so nicely with the incentives that researchers face to get their results published. But, OK, let's accept for the moment that that's fate.
    2. Editors let some Null results pass, and there are actually too many of them. As a consequence, the distribution is not symmetric around its peak, the bulge around the significance thresholds is too sharp, and the mass of results at zero is too big to be compatible with a simple story of us hunting down effects that sit exactly around the detection thresholds.
    3. The quality of the evidence should not depend on the method. IV seems to provide evidence of effects, but the much more robust methods, RDD and RCT, show far fewer signs of a bulge around significance thresholds. It could be that they are used more for hunting Null effects, but it is hard to believe that users of IV have a magic silver bullet that tells them in advance where the large effects are, while RCT and RDD users are more stupid. No, most of the time, these are the same people. The difference is that it's much harder to game an RCT or an RDD.
  3. We see clear evidence of p-hacking or HARKing around significance thresholds, with a missing mass below the thresholds and an excess mass above. It is clearly apparent in RCTs, tiny in RDDs, huge in DID and so huge in IV that it actually erases all the mass until zero! 

OK, so my reading of this evidence is at best extremely worrying, at worst completely damning: yes, most statistically significant published results in economics might very well be false. There is no clear sign of an indisputable mass of precisely estimated and/or large effects far above the conventional significance thresholds. There are clear tell-tale signs of publication bias (missing mass at zero) and of p-hacking and HARKing (bulge around significance thresholds, missing mass below). So, we clearly do not know what we know. And we should question everything that has been published so far.

Some personal anecdotes now that are intended to illustrate why I'm deeply convinced that the evidence I've just presented corresponds to the worst case scenario:
  1. I p-hacked once. Yes, I acknowledge that I have cheated too. It is so innocent, so understandable, so easy, and so simple. I had a wonderful student who did an internship with me. He was so amazing, and our idea was so great, we were sure we were going to find something. The student worked so hard, he did an amazing job. And then, the effect was sometimes there, sometimes not there. We ended up selecting the nicely significant effect and publishing it. I wanted to publish the finding for the student. I also was sure that our significant effect was real. And I needed publications, because this is how my output is measured, and at the time I did not understand very well the problem of publication bias and how severe it was. I chose to send the paper to a low-tier journal though, in order not to make too much of this paper on my resume (it is not one of my two main papers). Yes, it was a little bit of a half-assed decision, but nobody's perfect ;)
  2. After that, I swore never to do it again, I felt so bad. I put in place extremely rigorous procedures (RCTs with preregistration, publication of everything that we are doing, even Null results, no selection of the best specifications). And, mind you, out of my 7 current empirical projects, 5 yield non-significant results.
  3. After presenting preliminary non-significant results at the first workshop of my ANR-funded project, a well-intentioned colleague told me that he was worried about how I was going to publish. "All these non-significant results, what are you going to do about them?" he said. He said it like it was a curse or a bad omen that I had so many insignificant results. He is an editor. It almost felt like I was wasting public research funds on insignificant results, a problem I could not have foreseen before starting the study. But see the problem: if I cannot convert my research into published results, I am going to fail as a researcher and fail to raise more money to fund my future research. But in order to publish, I need significant results, which are beyond my control, unless I start behaving badly and pollute the published empirical record with tainted evidence. This is the incentive structure that all of us are facing and it is deeply wrong.
  4. Turns out my colleague was right: the editor of the first journal to which we sent a paper with a non-significant result wrote back "very well conducted study, but the results are insignificant, so we are not going to publish it." I kid you not.
  5. In a seminar, I asked the presenter whether he had obtained his empirical results before or after crafting his theory. He looked at me, puzzled, took a long time to think about his answer and then said: "Do you know anyone who formulates theories before seeing the data?" I said "Yes, people who pre-register everything that they are doing." He replied "Is that a sin not to do it?" I said "It's kind of a sin, it's called HARKing. The main thing that I want to know is whether the evidence in favor of your theory came before or after your theory. It is much more convincing if it came after." After talking with him some more, he could delineate which results had inspired his theoretical work and which ones he had obtained afterwards. But he was unaware that presenting both sets of results on the same level was a QRP called HARKing (and I'm not pointing fingers here, we are all ignorant, we have all sinned).
  6. Recently on Twitter, a young researcher boasted that his high productivity in terms of published papers was due to the fact that he stops investigating unpromising research projects early. But if you stop as soon as you get non-significant results, you contribute to the file drawer problem.
  7. One of the most important current empirical researchers in economics said in an interview: “where we start out is often very far away from the papers that people see as the finished product. We work hard to try to write a paper that ex-post seems extremely simple: “Oh, it’s obvious that that’s the set of calculations you should have done.”” This leaves open the possibility that he and his team try out a lot of different combinations before zeroing in on the correct one, a behavior dangerously close to p-hacking or HARKing. I'm not saying that's what they do. I'm saying that no procedure ensures that they do not, and that the empirical track record of the field pushes us to lose trust when this kind of behavior is allowed.
  8. With my students last year, we examined roughly 100 papers in empirical environmental economics. I cannot count how many times we commented on a result that read like “the conclusion of this paper is that this policy saved 60000 lives plus or minus 60000.” The bewildered look in my students' eyes! “Then it could be anything?” Yes, it could. And, even more worrying, how come all of these results are so close to the significance threshold? My students were all wondering whether they really could trust them.
  9. I cannot express how much people in my field are obsessed with publication. It seems to be the end goal of our lives, the only thing we can talk about. We talk more about publications than we talk about scientific research. Careers are made and destroyed based on publications. The content itself is almost unimportant. Publications! Top journals! As if the content of our knowledge would follow directly from competition for top publications. Hard to avoid pointing out that neither Darwin, nor Newton, nor Einstein considered publication to be more important than content. And hard to avoid saying that we are then only rewarding luck (the luck of finding the few significant results stemming out of our Null distributions) or, worse, Questionable Research Practices. And that we are polluting the empirical record of evidence, ultimately betraying our own field and science and, I'd say, shooting ourselves in the foot. Why? Because when citizens, policy-makers and funders catch up with that problem (and they will), they will lose all confidence in our results. And they'll be right.

I hope now that you are as worried as I am about the quality of the empirical evidence in economics. Obviously, some people are going to argue that there is no such problem. That people like me are just frustrated, failed researchers trying to stifle the creativity of the most prolific researchers. How do I know that? It all happened before in psychology. And it is happening right now. There are people mocking Brodeur et al for using the very statistical significance tests that they claim are at the source of the problem (I'll let you decide whether this is an argument or just a funny deflection zinger. OK, it's just the latter). Some people point out that significant results are just results that signal that we need more research in this area, in order to confirm whether they exist or not. I think this is a more elaborate but eventually fully wrong argument. Why? For three reasons. First, it is simply not happening like that. When a significant result gets published, it is considered as the definitive truth, whether or not it is precise, whether or not it is close to the significance threshold. And there is NEVER an independent replication of that result. It is not seen as an encouragement for more research. It is just the sign that we have a new result out. Cases of dissent with the published record are extremely rare, worryingly rare. And the first published significant result then becomes an anchor for all successive results, a behavior already identified by Feynman in physics years ago that led him to propose blind data analysis. Second, there are no incentives for doing so: replications just do not get attention, and have a hard time being published. If they find the same results: boring! If they don't, then why? What did the authors do that led to a different result? No one believes that sampling noise alone can be responsible for something like a change in statistical significance, whereas it is actually the rule rather than the exception, especially when results are very close to the significance threshold. Third, this is simply a waste of time, effort and money: all of these significant results stem from the selection of the 5 significant results among 100 competing teams. Imagine all the time and effort that goes into that! We have just ditched the work of 95 teams in order to reward the 5 lucky ones that obtained the wrong result! This simply is crazy. It is even crazier when you add QRPs to the picture: following that suggestion, you just end up trying to reproduce the work of the sloppiest researchers. Sorry, but I would rather spend my life rigorously hunting down Null effects than chase the latest wild geese that some of my colleagues claim to have uncovered.

The existing solutions to the replication crisis


OK, so what should we do now? There are some solutions around that start being implemented. Here are some of the ones that I do not think solve the issue because they do not make the Null results part of the published record:
  1. Most referees of empirical research now require extensive robustness checks, that is, trying to obtain your results under various methods and specifications. This is, in a sense, a way of keeping you from doing too much p-hacking, because a p-hacked result should disappear under this type of scrutiny. And it is true, the most fragile p-hacked results might disappear, but some extreme results will remain, and some true results will disappear as well. Or maybe we will see more p-hacking and HARKing around the robustness checks. It is actually an interesting question for Brodeur et al: do papers including robustness checks exhibit fewer signs of p-hacking? Note also that this approach will not make the record include Null results, just a more selected set of significant ones. As a consequence, it does not address the main issue of publication bias.
  2. Some people have suggested decreasing the threshold for statistical significance from 5% to 0.5%. But we will still miss the Null results we need.
  3. Some people have suggested the creation of a Journal of Null results. Well, that's great in theory, but if Null results have no prestige or impact because people do not read them, or use them, or quote them, then no one will send papers there. And we will miss the Null results.
  4. Some journals have started requiring that datasets be posted online along with the code. It will for sure avoid the most blatant QRPs and correct some coding mistakes but it will not populate the empirical record with the Null results that we need in order to really be sure of what we do know.

There are some solutions that are more effective because they make the track record of Null results accessible:
  1. The AEA recently created a registry of experiments. This is a great first step, because it obliges you to pre-register all of your analysis before you do anything. No more QRPs with pre-registration. Pretty cool. There are three problems with pre-registration, though. First, it is voluntary. To make it really work, we need both funders and journals to agree that pre-registration is a prerequisite before submitting a paper or spending money on experiments. Second, it does not solve publication bias. We are still going to see only the significant results get published. It helps a little, because now we theoretically know the universe of all studies that have been conducted, if pre-registration is made a prerequisite, and we can estimate how many of them have eventually been published. If it is only 5%, we should just be super cautious. If it is also made mandatory to upload your results to the registry, we could collect all these unpublished Null results and get a truthful answer. But that comes with a lot of caveats. Third, for the moment, it does not apply to methods other than RCTs: natural experiments (DID, IV, RDD) and structural models are never pre-registered.
  2. An even better solution is that of registered reports. Registered reports are research proposals that are submitted to a journal and examined by referees before the experiment is run. Once the project is accepted, the journal commits to publishing the final paper whatever the results, significant or not. That's a great solution. Several journals in psychology now accept registered reports, along with one in economics. The problem is that it is not really adapted to some field research where things are decided in great urgency at the last minute (this happens a lot when working with non-research institutions).
  3. A great solution is simply to organize large replications of the most important results in the field. Psychologists have created the Psychological Science Accelerator (modeled after physicists' CERN) and the OSF, where hundreds of labs cooperate in choosing and conducting large, precise studies trying to replicate important results.
  4. A super cool solution is to start accumulating knowledge using meta-analysis. A meta-analysis simply aggregates the results of all the studies on the same topic. Meta-analysis can also be used to detect publication bias. First, you can compare the most precise results to the less precise ones. If the more precise results are smaller, that is a sign of publication bias. It does not mean that the most precise results are unbiased though; we would need pre-registered replications to say that. Second, you can also examine the distribution of the p-values of statistically significant results (it's called p-curving; see the sketch just after this list). It should be decreasing if there is a true effect (a super cool result), it is flat under the Null, and it is increasing with QRPs, with a bulge just below the significance thresholds. What if we were p-curving the Brodeur et al data? We would probably just find huge signs of publication bias, because it is clear that there is a bulge, and the upward part of the distribution is compatible with the Null. It would be great if Abel and his coauthors could actually p-curve their data to confirm my suspicion.
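Here is a minimal sketch of what p-curving amounts to; it is my own illustration on simulated z-tests, not code from any of the papers mentioned here, and it assumes Python with numpy and scipy:

```python
# Simulate many two-sided z-tests, keep only the statistically significant
# p-values, and bin them: with a true effect the p-curve is decreasing
# (lots of very small p-values); under the Null it is flat.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_studies = 100_000

def significant_p_values(true_effect_in_se_units):
    z = rng.normal(loc=true_effect_in_se_units, scale=1.0, size=n_studies)
    p = 2 * norm.sf(np.abs(z))   # two-sided p-values
    return p[p < 0.05]           # p-curving only uses significant results

bins = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05]
for label, effect in [("true effect of 2 standard errors", 2.0), ("Null, no effect", 0.0)]:
    counts, _ = np.histogram(significant_p_values(effect), bins=bins)
    print(label, np.round(counts / counts.sum(), 2))
# Expected pattern: decreasing shares across the bins with a true effect,
# roughly equal shares (a flat p-curve) under the Null.
```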

What is the track record of these changes? Let's see what happened when doctors moved to registered reports: they went from 57% of trials reporting a significant effect (i.e., an effective drug) to 8%!



When psychologists tried to replicate huge swaths of published findings, they found that 60% of them did not replicate.




When economists did the same (for a very small number of studies), they found that 40% did not replicate, and that effects were overestimated by a factor of 2! Below is a graph from a paper by Isaiah Andrews and Max Kasy in which they propose methods for detecting and correcting for publication bias and apply their tools to the data from this replication. The blue lines are the original estimates and the black lines the ones corrected for publication bias. You can see that originally borderline statistically significant results (with the confidence interval caressing the 0 line) actually contain zero, and that most of the presented results contain zero in their confidence interval and are thus compatible with zero or very small effects.


When meta-analyzing economics literatures, John Ioannidis and his colleagues found huge signs of publication bias, with effects overestimated by a factor of 2 to 4! Below is a graph from their paper showing the distribution of the inflation factor due to publication bias. The mode of the distribution is 300% inflation, i.e., effects that are 3 times too large. Note that this is a cautious estimate of the amount of publication bias in these literatures, since it uses the most precise estimates as an estimate of the truth, while these might very well be biased as well.


My proposal: the Social Science Knowledge Accumulation Initiative (SKY)


The problem with all these changes is that they are voluntary. That is, unless funders and journal editors decide to require pre-registration, for example, we will not see much progress on that front. What I'm proposing is to use meta-analysis to organize the existing literature, to evaluate the quality of the published empirical record bit by bit, and to use it to discipline fields, journals, universities and researchers by giving them the incentives to adopt the best research practices. This will be the first job of SKY. We are going to provide a tool that makes doing and reporting the results of meta-analyses easier.

What is nice about meta-analyses is that they summarize what we know and they help detect our likely biases. You can detect publication bias in a literature, obviously, but you can also apply p-curving to a journal, to an editor, to a university, a department, a team and even to an individual researcher. I'm not saying that we should, but the mere fact that we could should push everyone into adopting better practices. Moreover, it seems very hard to game meta-analysis and p-curving, because the only way to game them seems to me to be to expressly cheat.

The application of meta-analysis and p-curving to whole literatures should curb the incentives behind publication bias. Indeed, what will be judged and passed on to policy-makers and students are the results of meta-analyses, not the results of individual studies. Researchers should not compete for doing meta-analyses, they should cooperate. Anyone having published a paper included in the meta-analysis should be included as an author, so that they have an incentive to disclose everything that they are doing (even Null results will be valued now), and an incentive to keep an eye on what their fellow researchers are doing, because their results are going to be on the paper as well.

Meta-analyses will also push us towards the normalization of methods, so that the quality of methods is controlled and standardized, in order for studies to be included in a meta-analysis. We also need replicators who redo the statistical analysis of each separate study. Great candidates are master's and PhD students in methods classes, under the supervision of a PI.

We finally need a tool for making research more open that is more versatile than registered reports. I propose to make all of our lab notes open. That is, we will set up a researcher's log, where he reports everything that he is doing on a project. He will start with the idea. Then he will report the results of his first analysis, or his pre-analysis plan, and his iterations. That does not obviate the need for registered reports, but it works like a registered report, only less rigid. If the lab notes are made open, you can receive comments from colleagues at all stages in order to improve your analysis. I know that researchers are afraid of having their ideas stolen, but, first, this is a red herring (I have yet to meet someone to whom it has actually happened, but I have met dozens of researchers who have heard of someone to whom it has happened). And what better way not to be beaten to the post than publicizing your idea on a public, time-stamped forum? This will also be a very effective way to find co-authors.

Some last few thoughts in the form of an FAQ:
  1. Why not use the existing infrastructure developed in psychology, such as the Accelerator? I see SKY as complementary to the Accelerator, as a way to summarize all research, including the research conducted on the Accelerator. The Accelerator is extraordinary for lab research. Some empirical economists do lab research (and we should probably do much more of it), but most of us do research using Field Experiments (FE, a.k.a. Randomized Controlled Trials (RCTs)), Natural Experiments (NE) or Structural Models (SM). The Accelerator is not well adapted to either FE or NE, because it requires a very strict adherence to a protocol and a huge level of coordination that is simply not attainable in the short run in economics. When running FE, you have to choose features in agreement with the local implementers, and most of the details are decided at the last minute. When using NE, we are dependent on the characteristics of the data and on the exact features of the Natural Experiment that we are exploiting, which are sometimes revealed only after digging deep into the data. My hope is that, eventually, the coordination brought about by SKY around research questions will enable the gradual standardization of methods and questions that will allow us to run standardized FE and NE. SM generally aim to predict the consequences of policies or reforms. Their evaluation will entail pre-registration of the predictions before seeing the data, or access to the code so that anyone can alter the estimation and holdout samples.
  2. How does SKY contribute to the adoption of good research practices? My idea is that SKY will become the go-to site for anyone wanting to use results from economics research (and hopefully, eventually, from any type of research). Every result will be summarized and vetted by entire communities on SKY, so that the best up-to-date evidence will be there. Funders, policy-makers, journalists, students, universities and fellow researchers will come to SKY to learn what the agreed-upon consensus on a question is, whether the evidence is sound, what the unanswered questions are, etc. Researchers in fields with a track record of publication bias will risk losing their funding, their positions and their careers, so that they will be incentivized into adopting good research practices. Journals, facing the risk of their track record of publication bias being exposed, should start accepting Null results or risk losing readership because of a bad replication index.
  3. How would SKY edit knowledge? My idea is that whole communities should vet the meta-analyses that concern them. In practice, that means that the plan for the meta-analysis will be pre-registered on SKY, the code and data will be uploaded as well, and an interface for discussion will be in place. I imagine that interface to be close to the contributors' forum of a Wikipedia page, but with GitHub-like facilities, so that one can easily see the changes implied by an alternative analysis and code. The discussion among the contributors will go on as long as necessary. The editor in charge of the meta-analysis will be responsible for the final product, but it might be subject to changes, depending on the results of the discussion.
  4. Do you intend SKY only for economics research or for all types of social science research? To me, the social or behavioral sciences are a whole: they are the sciences that try to explain human behavior. SKY will be open to everyone from psychology, economics, medicine, political science, anthropology and sociology, as long as the research uses quantitative empirical methods. It might be great to eventually add a qualitative research component to SKY, and a users' section, for policy-makers for example, where they could submit topics of interest to them.
  5. What about methodological research and the standardization of empirical practice on SKY? Yep, that will be up there. Any new method will be vetted independently on simulated data, mostly by PhD students as part of their methods classes. I'll start by providing my own methods class with its simulations, but then anyone proposing that a new method be added to SKY will have to provide simulations and code so that it can be vetted. We will also provide the most advanced set of guidelines possible in view of the accumulated methodological knowledge.
  6. Once we have SKY, do we really need journals? Actually, I'm not sure. Journals edit content, providing you with the most important up-to-date results according to a team of editors. SKY does better: it gives you the accumulated knowledge of whole research communities. Why would you care for the latest flashy result, whose contribution to the literature will be to push the average meta-analyzed effect in a certain direction by a very limited amount? Of course, super novel results will open up new interpretations, or connect dots between different strands of the literature, or identify regularities (sets of conditions where effects differ) and confirm them experimentally. But the incentive of researchers will be to report these results in the place where everyone is looking: SKY. Why run the risk of having one's impact limited by paywalls?
  7. How do you measure the contributions of scientists and allocate research funds in the absence of the hierarchies accompanying journals? That's the best part: SKY will make this so easy. Researchers with a clean record contributing a lot to a meta-analyzed effect (by conducting a lot of studies or one large one) will be rewarded. Researchers opening up new fields or finding new connections and/or regularities will be easy to spot. Funding will be made easy as well: just look for the most important unanswered questions on SKY. Actually, researchers should after some time be able to coordinate within a community in order to launch large projects that tackle important questions in large teams, just like physicists do.
  8. Why are you, Sylvain, doing that? While the vision for SKY comes from Chris Chambers' book on the 7 Sins of Psychological Research, I feel this is the most important thing that I can do with my life right now. My scientific contributions will always be undermined and made almost inconsequential by the current system: (i) I cannot identify important questions because I do not know what we know; (ii) I cannot make my contributions known rigorously because Null results get ditched (there are so many more useful things I could be doing with my life right now, and applying my brains, motivation and energy to, rather than publishing biased results and contributing to polluting the published empirical record); (iii) Organizing the published empirical knowledge has a much higher impact on society than conducting another study, because the sample size of a whole literature is much larger than that of a given study, so that the precision and impact of the accumulated knowledge will be huge; (iv) I love science, and have ever since I was a kid, and I know that it has the power to make the world a better place. I want to contribute by nudging science in the right direction to achieve just that. There are so many useful theories and empirical facts in economics and social science that could be of use to everyone if only they had been properly vetted and organized; (v) I am lucky enough to have an employer that really cares about the public good and that supports me 100%. Lucky me :)
  9. How can I contribute to SKY? Well, drop me an email or a DM on Twitter if you want to get involved. We are just starting to look at the solutions and goals, so now is the best time to make a lasting impact on the infrastructure. We will need every single one of your contributions.

With SKY, I want to achieve what Paul Krugman asks economists to do in a recent post: "The important thing is to be aware of what we do know, and why." Couldn't agree more. Let's get to work ;)





Thursday, June 7, 2018

Why p-values are Bad for Science

This post is the fifth in a series of seven posts in which I am arguing against the use of p-values for reporting the results of statistical analysis. You can find a summary of my argument and links to the other posts in the first post of the series. In this post, I present why I think p-values are bad for science.

The problems that I have with p-values and NHST are not minor quibbles. They unfortunately cut to the core of scientific practice. They affect the quality of reported and published scientific evidence, they seriously disrupt the process of knowledge accumulation, and eventually they might undermine the very credibility of our scientific results.

I can see two major detrimental consequences to the use of p-values and NHST:
  1. Publication bias: published results overestimate the true effects by a large margin, being 1.5 to 2 times larger than the true effects on average.
  2. Imprecision: published results have low precision, with a median signal to noise ratio of 0.26.

Publication bias


Publication bias operates in the following way: if editors decide to publish only statistically significant results, then the record of published results will be an overestimate of the true effect, as we have seen in the previous post of the series. If a true effect is small and positive, published results will overestimate it.

This means that if the true effect is nonexistent, only the studies wrongly showing that it exists and is large, whether in the positive or the negative direction, will be published. We will get either conflicting results or, if researchers favor one direction of the effect, we might end up with apparent evidence for an effect that is not there.

There are actually different ways publication bias can be generated:
  • Teams of researchers compete for publication based on independent samples from the same population. If 100 teams compete, on average 5 of them will find significant results even if the true effect is non-existent. Depending on the proportion of true effects there is to discover, this might imply that most published research findings are false (see the sketch just after this list).
  • Specification search is another way a team can generate statistically significant results: for example, by choosing to stop collecting new data once the desired significance is reached, by adding control variables, or by changing the functional form. This does not have to be conscious fraud, but simply a result of the multiple degrees of freedom that researchers have, which generate what Andrew Gelman and Eric Loken call "the garden of forking paths." In a pathbreaking paper in psychology, Joseph Simmons, Leif Nelson and Uri Simonsohn showed that, by leveraging the degrees of freedom in research, it is very easy to obtain any type of result, even that listening to a given song decreases people's age.
  • Conflicts of interest, such as in medical sciences where labs have to show efficiency of drugs, might generate a file drawer problem, where insignificant or negative results do not get published.
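Here is a minimal sketch of the selection mechanism described in this list; it is my own simulation with made-up numbers, not data from any of the papers discussed here:

```python
# 100 teams estimate the same small true effect from independent noisy samples.
# If only the statistically significant estimates get "published", the
# published average badly overestimates the truth.
import numpy as np

rng = np.random.default_rng(7)
true_effect, standard_error, n_teams = 0.5, 1.0, 100

estimates = rng.normal(true_effect, standard_error, size=n_teams)
published = estimates[np.abs(estimates / standard_error) > 1.96]  # significant results only

print(f"True effect:                    {true_effect:.2f}")
print(f"Average over all estimates:     {estimates.mean():.2f}")
print(f"Average of published estimates: {published.mean():.2f}")
# The average over all estimates sits close to the truth; the published
# average is typically several times larger than the true effect.
```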
Do we have evidence that publication bias exists? Unfortunately yes: massive, uncontroversial evidence. The evidence on publication bias comes from replication attempts and from meta-analyses.

Evidence of publication bias from replications


A replication consists in conducting a study similar to a published one, but on a larger sample, in order to increase precision and decrease sampling noise. After the replication crisis erupted in their field, psychologists decided to conduct replication studies. The Open Science Collaboration published in Science in 2015 the results of 100 replication attempts. What they found was nothing short of devastating (my emphasis):
The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size.
Here is a beautiful graph summarizing the terrible results of the study:
The results in red are results that were statistically significant in the original study but that were not statistically significant in the replication.

It seems that classical results in psychology such as priming and ego depletion do not replicate well, at least under some forms. The replication of the ego depletion original study was a so-called multi-lab study where several labs ran a similar protocol and gathered their results. Here is the beautiful graph summarizing the results of the multi-lab replication with the original result on top:

The original study was basically an extreme observation drawn from a distribution centered at zero, or very close to zero, a clear sign of publication bias at play.

What about replication in economics? Well, there are several types of replications that you can run in economics. First, for Randomized Controlled Trials (RCTs), whether run in the lab or in the field, you can, as in psychology, run another set of experiments similar to the original. Colin Camerer and some colleagues did just that for 18 experimental results. Their results were published in Science in 2016 (emphasis is mine):
We find a significant effect in the same direction as the original study for 11 replications (61%); on average the replicated effect size is 66% of the original.
So, experimental economics suffers from some degree of publication bias as well, although apparently slightly less than psychology. Note, however, that the number of replications attempted is much smaller in economics, so things may look worse as more replications accumulate.

I am not aware of any replication in economics of the results of field experiments, but I'd be glad to update the post after being pointed to studies that I'm unaware of.

Other types of replication concern non-experimental studies. In that case, replication could mean re-running the same analysis with the same data, in search of a coding error or of debatable modeling choices. What I have in mind is rather trying to replicate a result by looking for it either with another method or in different data. In my own work, we are conducting a study using DID and another study using a discontinuity design in order to cross-check our results. I am not aware of attempts to summarize the results of such replications, if they have been conducted. Apart from cases where it is reported in the same paper, I am not aware of researchers actively trying to replicate the results of quasi-experimental studies with another method. Again, it might be that my knowledge of the literature is wanting.

Evidence of publication bias from meta-analysis

 

Meta-analyses are analyses that regroup the results of all the studies reporting measurements of the same effect. The graph just above, stemming from the multi-lab ego-depletion study, is a classical meta-analysis graph. At the bottom, it presents the average effect taken over studies, weighted by the precision of each study. What is nice about this average effect is that it is not affected by publication bias, since the results of all studies are presented. In order to guarantee that there is no selective publication, the authors of the multi-lab study preregistered all their experiments and committed to communicating the results of all of them.

But absent a replication study with pre-registration, estimates stemming from the literature might be affected by publication bias, and the weighted average of the published impacts might overestimate the true impact. How can we detect whether this is the case or not?

There are several ways to detect publication bias using meta-analysis. One approach is to look for bumps in the distribution of p-values or of test statistics around the conventional significance thresholds of 0.05 or 1.96. If we see an excess mass above the significance threshold, that would be a clear sign of missing studies with insignificant results, or of specification search transforming p-values from 0.06 into 0.05. Abel Brodeur, Mathias Lé, Marc Sangnier and Yanos Zylberberg plot the t-statistics for all empirical papers published in top journals in economics between 2005 and 2011:
There is a clear bump in the distribution of test statistics above 1.96 and a clear trough below, indicative that some publication bias is going on.


The most classical approach to detecting publication bias using meta-analysis is to draw a funnel plot. A funnel plot is a graph that relates the size of the estimated effect to a measure of its precision (e.g., its standard error). As publication bias is more likely to happen with imprecise results, a deficit of small imprecise results is indicative of publication bias. The first of the three plots below, on the left, shows a regular funnel plot, where the distribution of results is symmetric around the most precise effect (ISIS-2). The two other plots are irregular, showing clear holes at the bottom and to the right of the most precise effect, precisely where small imprecise results should be in the absence of publication bias.


 




More rigorous tests can supplement the eyeball examination of the funnel plot. For example, one can regress the effect size on precision or on sampling noise: a precisely estimated nonzero relationship would signal publication bias. Isaiah Andrews and Maximilian Kasy extend these types of tests to more general settings and apply them to the literature on the impact of the minimum wage on employment; they find, as previous meta-analyses already did, evidence of some publication bias in favor of a negative employment effect of the minimum wage.
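Here is a minimal sketch of that regression-based idea, in the spirit of Egger's test; the data are simulated and hypothetical (my own illustration, not the Andrews and Kasy procedure), and it assumes Python with numpy and statsmodels:

```python
# Simulate a "published" literature with selection on significance, then
# regress the published estimates on their standard errors. With no
# publication bias the slope should be close to zero; here it is clearly
# positive, because noisy studies only get published when they find big effects.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

true_effect = 0.2
standard_errors = rng.uniform(0.05, 0.5, size=2000)     # studies of varying precision
estimates = rng.normal(true_effect, standard_errors)    # each study's estimate
published = np.abs(estimates / standard_errors) > 1.96  # only significant results survive

X = sm.add_constant(standard_errors[published])
fit = sm.OLS(estimates[published], X).fit()
print(fit.params)  # intercept and slope; the slope on the standard error is well above zero
```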

Another approach to the detection of publication bias in meta-analysis is to compare the most precise effects to the less precise ones. If there is publication bias, the most precise effects should be closer to the truth and smaller than the less precise effects, which would give an indication of the magnitude of publication bias. In a recent paper, John Ioannidis, T. D. Stanley and Hristos Doucouliagos estimate the size of publication bias for:
159 empirical economics literatures that draw upon 64,076 estimates of economic parameters reported in more than 6,700 empirical studies.
They find that (emphasis is mine):
a simple weighted average of those reported results that are adequately powered (power ≥ 80%) reveals that nearly 80% of the reported effects in these empirical economics literatures are exaggerated; typically, by a factor of two and with one‐third inflated by a factor of four or more.

Still another approach is to draw a p-curve, a plot of the distribution of statistically significant p-values, as proposed by Joseph Simmons, Leif Nelson and Uri Simonsohn. The idea of p-curving is that if there is a real effect, the distribution of significant p-values should lean towards small values, because they are much more likely than larger values close to the 5% significance threshold. Remember the following plot from my previous post on significance testing:
Most of the p-values should be smaller than 0.05 if the true distribution is the black one, since most samples will produce estimates located to the right of the significance threshold. If there is no real effect, on the contrary, the distribution of p-values is going to be flat, by construction. Indeed, the probability of observing a p-value of 5% or less is 5%, while the probability of observing a p-value of 4% or less is 4%; hence, the probability of observing a p-value between 4% and 5% is exactly 1% in the absence of any real effect, and it is equal to the probability of observing a p-value between 3% and 4%, and so on. When applied to the (in)famous "Power Pose" literature, the p-curve is flat:


The flat p-curve suggests that there probably is no real effect of power posing, at least on hormone levels, which has also been confirmed by a failed replication. Results from a more recent p-curve study of Power Pose claim evidence of real effects, but Simmons and Simonsohn have raised serious doubts about that study.

There is to my knowledge no application of p-curving to empirical results in economics.

Underpowered studies


Another problem with p-values and Null Hypothesis Significance Testing (NHST) is that they are used to perform the power analysis that selects the adequate sample size before running a study. The usual practice is to choose the sample size so as to have an 80% chance of detecting an effect at least equal to an a priori postulated magnitude (a.k.a. the minimum detectable effect) using NHST at the 5% significance level.

The problem with this approach is that it focuses on p-values and test statistics and not on precision or sampling noise. As a consequence, the estimates obtained when following classical power analysis are not very precise. One can actually show that the corresponding signal to noise ratio is equal to 0.71, meaning that noise is still roughly 40% bigger than signal.
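Here is a minimal sketch reproducing these signal to noise figures; it assumes (my reading, not an explicit definition from this post) that the signal is the true effect targeted by the power analysis and the noise is the full width of the 95% confidence interval, an assumption consistent with the numbers quoted here and below:

```python
# Signal-to-noise ratio implied by a power analysis for a two-sided test at the
# 5% level: the detectable effect is (z_{0.975} + z_{power}) standard errors,
# while the width of the 95% confidence interval is 2 * z_{0.975} standard errors.
from scipy.stats import norm

def signal_to_noise(power, alpha=0.05):
    effect_in_se = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # detectable effect, in SE units
    noise_in_se = 2 * norm.ppf(1 - alpha / 2)                 # width of the 95% CI, in SE units
    return effect_in_se / noise_in_se

print(f"80% power: signal/noise = {signal_to_noise(0.80):.2f}")  # about 0.71
print(f"18% power: signal/noise = {signal_to_noise(0.18):.2f}")  # close to the 0.26 quoted below
```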

With power analysis, there is no incentive to collect precise estimates by using large samples. As a consequence, the precision of results published in the behavioral sciences has not increased over time. Here is a plot of the power to detect small effect sizes (Cohen's d=0.2) for 44 reviews of papers published in journals in the social and behavioral sciences between 1960 and 2011 collected by Paul Smaldino and Richard McElreath:


Not only is power very low (mean = 0.24), but it has not increased over time, and thus neither has precision. Note also that the figure shows that the actual realized power is much smaller than the postulated 80%. This might be because no adequate power analysis was conducted to help select the sample size for these studies, or because the authors selected medium or large effects as their minimum detectable effects. Whether we should expect small, medium or large effects in the social sciences depends on the type of treatment. But a "small" effect is already pretty big: it amounts to 20% of a standard deviation of the outcome under study.

John Ioannidis and his coauthors estimate that the median power in empirical economics is 18%, which implies a signal to noise ratio of 0.26, meaning that the median result in economics contains four times more noise than it has signal.

Wednesday, June 6, 2018

My Problems with p-values

This post is the fourth in a series of six posts in which I am arguing against the use of p-values for reporting the results of statistical analysis. You can find a summary of my argument and links to the other posts in the first post of the series. In this post, I present my problems with p-values. 

What are my problems with p-values? Oyy, where to start? Here are four reasons why I hate p-values and NHST:
  1. Science is not about making decisions each time you see a new sample. NHST and p-values are not designed for scientific inquiry but for industrial decisions.
  2. In general, scientists interpret statistically significant effects as equal to the true effect while non-statistically significant results are interpreted as zeroes. Both of these interpretations are deeply WRONG.
  3. Statistically significant results are biased.
  4. Marginally significant results are very imprecise.

NHST and p-values are not designed for scientific inquiry but for industrial decisions


NHST and p-values are not adapted to scientific inquiry but to industrial practice. In order to explain why this is the case, let me tell you a story. (I do not know whether this story is 100% true or apocryphal, but I'm pretty sure I read it in one of Stephen Stigler's books or in Hald's History of Mathematical Statistics. Whether the story is strictly true or not does not really matter, though, since it is mostly there to illustrate the context in which NHST and p-values were designed to be used.)

The inventor of the significance test is William S. Gosset, better known under his pen name, Student (he was so humble that he considered himself a student of the great statisticians of his day, especially Karl Pearson). Gosset wrote under a pen name because he was not an academic but worked for a private firm, the famous Guinness brewery. He designed the testing procedure in order to solve a very practical problem that he faced at his job. Every day, a new batch of grain would come in. Before sending the grain into production, Guinness employees would take a sample of the grain (say, ten small samples taken from random parts of the batch) in order to assess its quality. In each of the samples, they would measure a characteristic important for brewing. Based on the sample, they would have to decide whether to discard the batch or send it into production. The problem is that the sample gives only a noisy estimate of the quality of the batch. If the batch was bad but they wrongly decided to put it into production, they would lose money. If the batch was good and they decided to discard it, they would also lose money. You will recognize here the type II and type I errors of statistical tests. So Gosset had to make a choice, every day, based on a sample, to discard or accept a batch of grain. He devised a procedure that would fix the probability of discarding a good batch at a small level while keeping the risk of keeping a bad one as low as possible. The procedure simply uses a test statistic: compute the value of the test statistic under the assumption that the batch is good, compute its p-value, and discard the batch if the p-value is smaller than 0.05.
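To make the setting concrete, here is a minimal Python sketch of such an acceptance rule based on a one-sample t-test; the required quality level of 10 and the ten measurements per batch are invented for the illustration, not taken from the historical record.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)

REQUIRED_QUALITY = 10.0   # hypothetical minimum mean quality for a good batch

def accept_batch(measurements, alpha=0.05):
    """Discard the batch only if the sample gives strong evidence that its quality
    falls below the requirement (one-sided test of the 'batch is good' null)."""
    result = ttest_1samp(measurements, popmean=REQUIRED_QUALITY, alternative="less")
    return result.pvalue >= alpha

good_batch = rng.normal(10.2, 0.5, size=10)   # ten measurements from a good batch
bad_batch = rng.normal(9.0, 0.5, size=10)     # ten measurements from a bad batch
print(accept_batch(good_batch), accept_batch(bad_batch))   # expected: True False
```

The point is not the exact test but the setup: the same decision rule is applied day after day to a new sample, which is exactly the repeated-decision context described in the next paragraph.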

Gosset's procedure (and test statistics in general) makes a lot of sense in an industrial context: there is repeated sampling, and an actual decision has to be made after each sample. Test statistics are perfectly adapted to this problem. Science is a very different problem altogether. There is no repeated sampling, we do not take a decision after each new sample, and we do not need a procedure to help us make such a decision. Fisher was the one who adapted Gosset's idea and translated it to scientific practice. He proposed p-values as a means of gauging the strength of the evidence against an assumption (the Null Hypothesis), and suggested that a p-value below 5% could be taken as strong evidence against it. But he never made this threshold a magical one. What made the threshold magical was the decision procedure that Neyman and Pearson later attached to statistical testing, building on Gosset's problem. But, again, that procedure was designed for an industrial context of repeated decisions, not for scientific inquiry.

NHST and p-values give a false sense of certainty around a cutoff


The problem with using test statistics in science is that they focus our attention on the position of our results with respect to a cutoff. Have you ever noticed how much more excited you feel when your results cross the 5% significance threshold? How disappointed you are when they fall just short of it? We also tend to radically alter our reporting of a result depending on whether it is statistically significant. For example, if a coefficient is statistically different from zero, we are happy and we report it as a positive effect. If it is not statistically different from zero, we report it as insignificant and, in general, we treat it as a zero. This is something every single one of us has felt.

And this is wrong. It betrays a deep misunderstanding of what sampling noise really is and what statistical testing is all about.



Look for example at samples 27 and 28 when N=1000 on the figure above. With sample 27, you have a treatment effect estimated at around 0.18, significantly different from zero at the 5% level. So two stars of significance. Great. You tend to interpret this result as a 0.18 positive significant treatment effect, and you are going to remember 0.18. With sample 28, you have a treatment effect estimated at around 0.17, not significantly different from zero at the 5% level. You tend to interpret this result as a non-significant treatment effect, and in general you are going to remember it as a zero. But the two samples contain exactly the same objective information: the confidence interval for the effect is wide, ranging from very small (zero or slightly negative) to large.
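Here is a small Python illustration of the point, using two hypothetical estimates close to the ones in the figure (0.18 and 0.17) and a standard error of 0.09 that I chose so that one estimate just clears the 5% threshold and the other just misses it; these exact numbers are mine, not read off the original simulation.

```python
from scipy.stats import norm

se = 0.09   # assumed standard error, roughly matching the figure
for name, estimate in [("sample 27", 0.18), ("sample 28", 0.17)]:
    t = estimate / se
    p = 2 * (1 - norm.cdf(abs(t)))
    lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: estimate={estimate:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}], p={p:.3f} -> {verdict}")
```

The two confidence intervals are nearly identical, yet the NHST verdicts are opposite: that is the whole problem with cutoff thinking.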

You should not change your opinion about a program because some random noise has marginally moved your estimate so that its test statistic falls just above or just below 1.96. Nothing has objectively changed between these two samples. The only reason we would need to choose a cutoff and change our minds when crossing it is that we want to make a decision. But there is no decision to make. So we should treat the two samples as bringing exactly the same information: the effect could be anything from very small (positive or negative) to large and positive.

But the 95% confidence interval for sample 28 tells us that the effect might be negative, whereas that is not the case for sample 27. Doesn't that count for something? No, and for two reasons. First, a very small effect, positive or negative, is just small and does not have important consequences in the real world. Second, even if it does, your precision does not allow you to conclude anything with certainty. Probability distributions are continuous here, and the probability that the treatment effect is below zero changes only marginally between sample 27 and sample 28. You can see this by using the 99% confidence interval instead: then the interval for sample 27 also contains zero and small negative effects.

Look now at samples 18 and 19 of the same graph. The estimated effects there are small and zero is well within the confidence bands, so a statistical test would just give you an insignificant estimate. In general, you will interpret this as a zero. But this would be wrong. Completely wrong, actually, since the true treatment effect is 0.18. And the objective information from the confidence interval tells you just that: 0.18 is well within the confidence bands too.

Stick with the objective information. Tests focus your attention on details, marginal changes and cutoff decisions instead of on an objective assessment of sampling noise. Tests are used as a way to gain a false sense of certainty in the face of sampling noise. No statistical test can get rid of sampling noise.
 

Statistically significant treatment effects are biased


One very annoying property of statistically significant results is that they are biased upwards, especially when sampling noise is large (and thus especially when sample size is small). Look again at the figure above. With N=100, the estimates that are statistically different from zero at the 5% level are 2 to 2.5 times bigger than the true effect. With N=1000, not all statistically significant results overestimate the true effect, but most do.

That statistically significant results are biased upwards is a mechanical consequence of NHST. To shed more light on this, let's compute, for all our Monte Carlo samples, the p-values of the two-sided t-test of the null hypothesis that the treatment effect is zero, using the CLT-based estimates of sampling noise. The figure below presents the results.

You can see that with N=100, treatment effects are significant at the 5% level only when they are bigger than roughly 0.3 (the 5% threshold is the blue line on the graph, the red line is the threshold for 1% significance). Remember that the true effect is 0.18! With N=1000, samples with an estimated effect smaller than 0.1 are never significant at the 5% level. As a consequence, the average of the statistically significant effects overestimates the true effect by a large amount: with N=100, statistically significant effects are on average about double the truth, whereas with N=1000, they are roughly 50% bigger.
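A small Monte Carlo sketch in Python reproduces this significance-filter bias. The data generating process below (a treatment effect of 0.18 on an outcome with unit variance, two equal-sized arms) is my own stand-in for the simulations behind the figures, so the magnitudes will not match the figure exactly, but the pattern does: significant estimates overstate the true effect, and much more so in small samples.

```python
import numpy as np

rng = np.random.default_rng(4)
TRUE_EFFECT = 0.18

def simulate(n, n_sims=10_000):
    """Run many two-arm experiments of total size n; flag estimates significant at 5%."""
    estimates = np.empty(n_sims)
    significant = np.empty(n_sims, dtype=bool)
    for s in range(n_sims):
        treated = rng.normal(TRUE_EFFECT, 1, size=n // 2)
        control = rng.normal(0.0, 1, size=n // 2)
        diff = treated.mean() - control.mean()
        se = np.sqrt(treated.var(ddof=1) / (n // 2) + control.var(ddof=1) / (n // 2))
        estimates[s], significant[s] = diff, abs(diff / se) > 1.96
    return estimates, significant

for n in (100, 1000):
    est, sig = simulate(n)
    print(f"N={n}: mean of all estimates = {est.mean():.2f}, "
          f"mean of significant estimates = {est[sig].mean():.2f}")
```

With the small sample, conditioning on significance more than doubles the estimated effect; with the larger sample, the conditional bias is much smaller because most estimates are significant anyway.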

Note that there is no such problem for larger sample sizes, where all results are statistically significantly different from zero. In principle, some estimates could still land close enough to zero to be insignificant, but the probability of that happening is so small that it has no practical relevance. People accustomed to p-values are thrown off by large sample sizes where everything is significant at conventional levels. This is a funny consequence of not understanding sampling noise and test statistics: the fact that everything is significant means that uncertainty about parameter values has decreased, and that you can finally look at the magnitudes of the coefficients rather than at whether they differ from zero.


Marginally significant results are very imprecise


Another related and very unfortunate consequence of p-values is that results that are marginally significant at the 5% level are very imprecise: their signal to noise ratio is equal to 0.5, meaning that there is twice as much noise as there is signal. This is a very simple consequence of using NHST. Remember that scientists consider a result to be significant at the 5% level when the ratio t=|x|/se(x) is greater than 1.96. Now, the signal to noise ratio can be defined as s/n=|x|/y, with y the width of the 95% confidence interval. With a normal distribution, we have se(x)=y/(2*1.96), so that s/n=t/(2*1.96). As a consequence, t~1.96 -> s/n~0.5.

A related consequence of using NHST is that, when choosing sample size using a power analysis based on a similar testing procedure, the signal to noise ratio obtained under the default settings for size (5%) and power (80%) is also small. Indeed, one can show that the signal to noise ratio of a power analysis for a one-sided test (for a two-sided test, replace alpha by alpha/2) is equal to:

s/n = (z(1-alpha) + z(kappa)) / (2*z((1+delta)/2)),

with z(p) the p-th quantile of the standard normal distribution, kappa the power, and delta the confidence level used to build the estimate of sampling noise. With delta=95% and a two-sided test, the signal to noise ratio of a usual power analysis is (1.96+0.84)/(2*1.96)=0.71, so that there is still about 1.4 times as much noise as signal.
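As a quick check of this formula, here is a short Python snippet that simply evaluates the expression above at the default 80% power and at 18%, the median power in empirical economics mentioned earlier; the function name and defaults are mine.

```python
from scipy.stats import norm

def signal_to_noise(power, alpha=0.05, delta=0.95, two_sided=True):
    """Signal to noise ratio implied by a power analysis:
    (z(1-alpha) + z(power)) / (2 * z((1+delta)/2)), with alpha/2 for a two-sided test."""
    a = alpha / 2 if two_sided else alpha
    return (norm.ppf(1 - a) + norm.ppf(power)) / (2 * norm.ppf((1 + delta) / 2))

print(round(signal_to_noise(0.80), 3))   # 0.715 -> the 0.71 quoted for the default design
print(round(signal_to_noise(0.18), 3))   # 0.266 -> the ratio implied by a median power of 18%
```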