The best candidate for “paper that I would have loved to write” is a recent paper by Kvarven, Strømland and Johannesson published in Nature Human Behaviour. In this paper, the authors compare the results of meta-analyses to those of large multi-lab pre-registered replications for 15 experimentally estimated treatment effects, using the latter as a benchmark.
The paper makes two key contributions, in my opinion:
- It provides an estimate of the magnitude of publication bias. As I've been lamenting for some time, science has a replication problem, mostly due to publication bias. The biggest threat that publication bias poses is that we are unaware of its magnitude (even though we know its direction) and we do not know which results it affects more. Publication bias thus threatens the soundness of the empirical record in several sciences, among them all of the social sciences. This paper gives the first unbiased estimate of the size of publication bias for 15 different literatures.
- It provides an estimate of the performance of statistical methods correcting for publication bias. When I created SKY in order to make science more credible and collaborative, I built the whole edifice around meta-analysis, the aggregation of published results to obtain more precise and credible estimates. Since meta-analytical estimates are affected by publication bias, SKY rests on the assumption that we can find a way to correct for publication bias in meta-analysis, either ex-ante with a multi-lab pre-registered initiative, such as the Psychological Science Accelerator, or ex-post, using statistical techniques. For SKY, I was thinking of starting with the ex-post approach. I was influenced by a recent paper by Ioannidis, Stanley and Doucouliagos where they use their favorite method of ex-post correction for selection bias (PET-PEESE) to estimate the extent of publication bias in economics (and find it to be rather large, see here). Evaluating the credibility of statistical methods for correcting for publication bias in meta-analysis is thus of critical importance, especially in economics where conducting ex-ante coordinated initiatives is super hard (the reasons why are for another blog post).
What do we learn then?
Publication bias is large
The mean effect size is .42 of a standard deviation in the meta-analyses, against .15 in the preregistered multi-lab replications. The overestimation due to publication bias is thus .27 of a standard deviation, a very substantial amount. One could also say that meta-analyses overestimate effects by a factor of almost 3 (.42/.15 ≈ 2.8), but I’m not sure using multiplicative scales is useful here.
PET-PEESE is unbiased but noisy
Could we do better by implementing a bias correction procedure in each meta-analysis? Here, the authors look at three of the main bias correction methods: trim and fill, a parametric selection model and PET-PEESE. The first two methods completely fail at correcting for publication bias, giving in most cases the same result as the uncorrected estimator.
PET-PEESE works very differently. The bias of treatment effect estimation after PET-PEESE is .03 of a standard deviation. So here you go, you have an unbiased procedure for estimating effect sizes in meta-analysis. To see how effective the method is, note that it correctly states that an effect is not there around 85% of the time. To see it in action, look at the Sripada et al. paper on the figure below (taken from the Nature website, I apologize for the awful quality): the original uncorrected meta-analytic effect ("Random effects" on the figure) is around .6, but the true effect ("Replication") is at around zero. Starting from the biased data, PET-PEESE is able to recover the true effect of zero, while the two other bias correction methods fail. Looks like magic! The same magic happens for the Strack paper, and the Srull and the Mazar and the Jostmann papers. Tom Stanley and Hristos Doucouliagos can be super proud of their baby: it is really rare to see such a beautiful performance for a statistical correction procedure in the social sciences!
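For readers who have never seen PET-PEESE, here is a minimal sketch of the idea in plain Python. It is not the code used in the paper. PET regresses the published effect sizes on their standard errors with inverse-variance weights and takes the intercept as the estimate of the effect of a hypothetical infinitely precise study; PEESE does the same with the squared standard error as regressor. The conditional rule used below (switch to PEESE when a one-sided test at 10% rejects a zero PET intercept), the normal approximation for that test, the function names and the toy data are all my assumptions for illustration. The uncorrected benchmark here is a simple inverse-variance mean, not the random-effects estimator shown on the figure.

```python
import math
from statistics import NormalDist


def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x. Returns (a, se_a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    # Residual variance with n - 2 degrees of freedom
    rss = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    sigma2 = rss / (len(x) - 2)
    se_a = math.sqrt(sigma2 * (1 / sw + xbar ** 2 / sxx))
    return a, se_a, b


def inverse_variance_mean(effects, ses):
    """Uncorrected fixed-effect (inverse-variance weighted) mean."""
    w = [1 / s ** 2 for s in ses]
    return sum(wi * e for wi, e in zip(w, effects)) / sum(w)


def pet_peese(effects, ses, alpha=0.10):
    """Conditional PET-PEESE estimate. Returns (label, estimate, se)."""
    w = [1 / s ** 2 for s in ses]
    # PET: effect_i = a + b * se_i, the intercept a is the corrected effect
    pet_a, pet_se, _ = wls_line(ses, effects, w)
    # Assumption: one-sided test of a > 0 using a normal approximation
    p_one_sided = 1 - NormalDist().cdf(pet_a / pet_se)
    if p_one_sided < alpha:
        # PEESE: effect_i = a + b * se_i^2
        peese_a, peese_se, _ = wls_line([s ** 2 for s in ses], effects, w)
        return "PEESE", peese_a, peese_se
    return "PET", pet_a, pet_se


# Made-up data: the true effect is zero, but published effects grow with
# the standard error (roughly effect = 2 * se), as selection would produce.
ses = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
effects = [0.09, 0.21, 0.29, 0.41, 0.49, 0.61]

print("uncorrected mean:", round(inverse_variance_mean(effects, ses), 3))
print("PET-PEESE:", pet_peese(effects, ses))
```

On this toy data the uncorrected mean stays well above zero, while the PET intercept sits close to the true effect of zero, which is the kind of behavior the figure shows for the Sripada et al. paper.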
Unfortunately something else happens when you move down the list: noise increases. That is, the PET-PEESE estimator is less and less precise. What does that mean? With PET-PEESE we are often uncertain about the true magnitude of the treatment effect. How much so? Well, the authors do not give direct information to recover the level of noise in the various estimators. One indirect way to assess the level of noise is to use the MDE estimates presented by the authors in their table and the formula linking MDE and standard error of the estimate (divide MDE by 2.8 to obtain the standard error when using a two-sided test with size 0.05 and power 0.80, see my class slides for why here). What do we obtain? The basic meta-analysis without correction for publication bias has a standard error of 0.06, as the trim-and-fill method does, while PET-PEESE has a standard error of 0.16, almost three times higher. My feeling is that the loss in precision is the price to pay for publication bias: since we do not know its exact nature, we are sometimes unable to recover the true effect with precision.
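The factor of 2.8 is simply the sum of two standard normal quantiles, one for the size of the two-sided test and one for the power. Here is a small sketch of that conversion, assuming a normally distributed estimator. The function names are mine, and the MDE values printed at the end are back-computed from the rounded standard errors quoted above, not copied from the authors' table.

```python
from statistics import NormalDist


def mde_multiplier(alpha=0.05, power=0.80):
    """Ratio MDE / standard error for a two-sided test of size alpha."""
    z = NormalDist().inv_cdf
    return z(1 - alpha / 2) + z(power)


def se_from_mde(mde, alpha=0.05, power=0.80):
    """Standard error implied by a minimum detectable effect (MDE)."""
    return mde / mde_multiplier(alpha, power)


def mde_from_se(se, alpha=0.05, power=0.80):
    """Minimum detectable effect implied by a standard error."""
    return se * mde_multiplier(alpha, power)


print(round(mde_multiplier(), 2))  # about 1.96 + 0.84

# MDEs implied by the rounded standard errors quoted in the text
for name, se in [("basic meta-analysis", 0.06), ("PET-PEESE", 0.16)]:
    print(name, round(mde_from_se(se), 2))
```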
Where does this leave us? The basic meta-analysis and trim-and-fill have a precision of ±0.12 around the estimated effect, and a bias of 0.27. PET-PEESE has a precision of ±0.32 around the true effect and a bias of zero. What should we do? This is a classical bias/variance trade-off problem. The authors solve it by using a root mean squared error metric. They find an RMSE score of 0.31 for the basic meta-analysis, 0.28 for the trim-and-fill approach and 0.22 for the PET-PEESE approach. So, PET-PEESE shows a clear improvement along the RMSE metric: the gains in terms of bias are enough to compensate for the losses of precision.
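To make the trade-off concrete, recall that for a single estimator the mean squared error is the squared bias plus the variance. The back-of-the-envelope sketch below plugs the average bias and standard error quoted in this post into that identity. It is only an illustration under that assumption: it is not how the authors compute their scores (presumably they compare each meta-analysis with its replication), so the numbers need not match the RMSE values they report, although the ranking goes the same way.

```python
import math


def rmse(bias, se):
    """Root mean squared error of an estimator: sqrt(bias^2 + variance)."""
    return math.sqrt(bias ** 2 + se ** 2)


# Average bias and standard error as quoted in this post (rounded)
methods = {
    "basic meta-analysis": (0.27, 0.06),
    "PET-PEESE": (0.03, 0.16),
}

for name, (bias, se) in methods.items():
    print(name, round(rmse(bias, se), 2))
```

A large bias with a small standard error still loses to a small bias with a large standard error here, because the bias enters the RMSE squared.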
What do I take out of this paper?
Well, first, before more results come in, I’m going to discount the effect size of any non-preregistered study by .27 of a standard deviation. This is extreme because some studies actually have no bias while others have much larger bias, but what can I do? There is no way to tell in advance which ones are the good studies and which ones are the bad.
And second, I’m gonna trust PET-PEESE estimates much more than before. Even though I’m going to lose some precision, I’m at least going to be correct on average.
There are at least three things that I would have loved to see in the paper (I was not R2, otherwise they would be in the paper ;)
- Other methods for detecting and correcting for publication bias (p-curving, Andrews and Kasy procedure).
- A comparison of the estimated noise of the PET-PEESE estimates with the true one. It seems to me that the estimated noise is larger than the actual noise, so that the procedure may actually be more precise than its own estimates of precision suggest.
- A comparison of treatment allocations using the results of basic meta-analysis and of PET-PEESE.
I leave that for further research (and maybe further blog posts, if I have time).