Tuesday, June 25, 2019

Why I'm putting SKY on hold

I’ve decided to put SKY on hold and to focus on my research instead. It is not that I think SKY, or something like it, is unnecessary. On the contrary. But I can see myself trying to juggle SKY with my research and with being a parent, and I can see myself failing at all three, or doing only two well. And SKY has to go. 

I’ve learned a lot in the meantime. 

The first thing I’ve learned is that everyone agrees that SKY, or something like it, is needed and would be much better than what we have now. I’ve heard no one defend the current publication system as the best of all possible worlds. I’ve heard no one argue that paywalls are great or that they love inaccessible datasets and code. I’ve heard no one say that we do not need more aggregation, more collaboration, and less publication bias. And I’ve heard a lot of people express frustration with the current system of prestige publications, on both ends of the spectrum: those who are in and those who are out. 

The second thing I’ve learned is that it is feasible. We can preregister and publish individual research results, write comments and referee reports on them and vote for the ones we prefer, aggregate them with constantly updated meta-analysis and write collaborative consensus articles interpreting the results. We can do it. Together. 

The third thing I've learned is that I was not original in bringing up SKY. Some people had already thought about it, and some even started implementing something very similar. As always in research, independent discovery is a reminder that you should stay humble.

The fourth thing I’ve learned is how hard change is. Oh my god, little did I know how hard it was. In retrospect, I was completely, ridiculously naive about how to make change happen. I should have known better: no change has ever come easily. But I did not know better; wishful thinking, naiveté and probably lack of experience explain that. I’ve learned, and I’m happy I’ve learned. 

Change is hard for several reasons. First, it is not clear how to make it happen. What is the most politically savvy and most respectful way to do things? I’ve tried a lot and failed a lot in that respect. When I started with my first blog post on the empirical revolution in economics, I was sure that people would get the underlying part of my message, that I was worried about rampant publication bias and Questionable Research Practices in economics, and that we would have an “Aha” moment and start collectively thinking about how to make things better. But that part of the post went AWOL: people liked the self-congratulatory content, not the more difficult self-reflecting part. OK, so I wrote a series on the problems with p-values. People did not seem to read much of it, and got lost in the details, I guess. OK, so I wrote my blog post on why we cannot trust the published empirical findings in economics, and there I got two very different reactions. On the one hand, a lot of people, most of them outside of Econ, were on board and found what I was saying pretty clear and convincing. On the other hand, most people in Econ just ignored my post. No reaction. Deafening silence. I guess that’s the sound of being right but also annoying. OK, so then I created SKY, to show that it was not all talking and criticising, that there were solutions. Here, two types of reactions: enthusiastic from some (almost 200 likes 🙂) and positive but skeptical from most others. “It seems very nice in theory, but how can we make it happen?” they were saying. “With the current way of promoting researchers and allocating research funds, it is impossible to get there. You would need a complete overhaul; good luck making that happen.” 

Second, change requires a lot of work and a lot of available time. I do not have the time right now to put the work in, unless I decrease the quality of my research output, or decrease my quality as a dad, or burn out. I cannot decrease the quality of my research output, because that is the key for me to stay relevant within the field. So I have to either buckle down, accept the current publication system with all its flaws, and play along (or become irrelevant), or get out of research and prioritise SKY, at the risk of losing expertise and respect in the field, and of going against my intuition that SKY, or something like it, can only come from inside the research community. Maybe I’m wrong on that; I’m willing to listen. Maybe, in the transition from the current system to SKY or something like it, we need professionals working on it full time to get it off the ground. But then I would have to work on SKY full time, and as a consequence completely jeopardise my research. And I cannot do that. I love my research projects and I have to see them through. I cannot let them down right now. They are my way to better things within my research community.  

Third, change requires understanding the politics of human endeavours, and a good strategy, and I came to understand that in two key ways. First, I did not see that by claiming that “we could not trust the published empirical record in empirical economics”, I was striking a blow against my own house, and criticising the track record of my very community. Not a very clever move. Even if a community is prone to self-reflection (which econ partly is), the consequences of acknowledging your mistakes might be dismal. At the least, people are not too keen on doing it, because of the huge uncertainties as to how the general public will receive these kinds of statements. For example, I was only saying that, because of an unknown amount of publication bias, we do not really know what we know. I never said that everything in empirical economics is wrong; I said we do not know how much of it is. But a cursory reading of what I was saying might have led to that conclusion. And I was putting the entire house in peril because I had doubts. The key question is whether my doubts were about foundations or about details. But even if my doubts were foundational, the best way to address them is probably one at a time, starting by assessing the level of publication bias on one question. Then on another. Etc. Start slow, small and within the community, so as not to lose its trust. Message received. Second, I did not anticipate that by questioning the current track record and practices of empirical economists, I might alienate or upset them. I was expecting something like self-reflection: “Oh, I see, maybe we’re wrong; what could we do to make things better?” Yep, I was that naive. In retrospect, I’m amazed at my ignorance of what makes people tick. And that’s something I’ve learned repeatedly since: you have to factor in the human component, people’s fallibilities, imperfections and limitations. You cannot ask them to be their better selves all of the time, otherwise you’ll fail. 
Message received as well.

So eventually, is SKY the way to go? I’m pretty sure it is, and that 10, 20 or 30 years from now, maybe more, most research in the social sciences will look like SKY or something like it. But the way to get there is not to raze everything and rebuild from the ground up. I think it goes through careful, piecemeal change of the existing structure. And some people are already doing that, with the introduction of registered reports in economics, for example, the data and code requirements at most top journals, and the arrival of meta-analyses and of tests for publication bias in top journals. We are still a very long way from producing reliable, reproducible, aggregated evidence, but it is growing, and I’m going to contribute to that as much as I can with my own research.

OK, so thanks to all of those who have supported SKY and contributed to it. Thanks to everyone who gave me feedback on it. It has been a great opportunity for learning and growth. I come away more convinced that we need slow, piecemeal changes rather than a complete, sudden overhaul, even though I’m convinced that the end result will look a lot like what I (and a lot of others) have delineated.

If you feel like taking the flame, everything is open source, so you can just take over 😉

Friday, March 22, 2019

SKY: Towards Science 2.0

SKY, the Social Science Knowledge Accumulation Initiative (yep, SSKAI=SKY ;), announced in my previous blog post, is live. SKY aims to change the way we conduct research in the social sciences by leveraging the tools of the digital revolution to make science more accessible, open, collaborative, effective and credible.

Accessible: by providing the most up-to-date and collectively discussed summaries on research questions.

Open: by providing the code and data behind the results, at the level of both aggregate and individual results.

Collaborative: by enabling and leveraging the active participation of all the researchers working on a topic to come to a consensus on the interpretation of the evidence.

Effective: by enabling better coordination among researchers.

Credible: by discouraging publication bias.

In this blog post, I’m going to describe my vision for SKY, the services it will bring to its various users and why the current system does not provide them, how SKY works, and how you can contribute to it. One subsequent blog post will walk you through some examples, another will describe the links between SKY and the current publishing system, while a third will talk about the personal challenges that changing the status quo implies. A fourth blog post will be dedicated to the concerns some of you might have about the use of meta-analysis, and how SKY aims to alleviate them. 

The vision for SKY


SKY aims to provide:
  1. A collectively agreed upon set of scientific theories on human behavior that have been empirically validated and a set of major open questions where empirical validation is deemed essential and a priority.
  2. A set of interventions that have been proven to work in the past along with the precision surrounding the estimate of their impact, a comparison of their cost-effectiveness and a set of open questions. 
  3. A set of agreed upon methodological procedures and guidelines for testing theories and measuring the effects of interventions and a set of open methodological questions.

SKY is continuously updated in order to reflect the most recent developments on each question.

SKY is decentralized and open source, meaning that anyone can contribute to SKY.

SKY is usable by anyone under the terms of the GNU GPL license.  

Which services SKY will bring


SKY will serve many purposes that are not well served at the moment:
  1. The users of research (policy makers, charities, citizens, firms) will have instant access to the agreed-upon consensus on a topic. At the moment, knowledge is not aggregated and is behind paywalls, making it super hard to convey outside of the ivory tower. 
  2. Funders will be able to identify the core questions that they want to see solved and put their money where their mouth is. At the moment, long back-and-forths are needed between funders and researchers to identify key questions, and to a large extent, funders defer to an unrepresentative set of researchers to select the important questions to fund.
  3. Researchers will instantly identify open questions, find collaborators, preregister their analyses, report their results, and receive feedback constantly. Once validated, their results will be instantly incorporated into the summary meta-analysis that aggregates the whole knowledge on a question. Papers will be refereed continuously, by anyone, and especially by PhD students. Methods will be normalized, constantly validated, tested and discussed. At the moment, frontier questions are “in the air.” Refereeing is a closed process, extremely limited in time. There are still debates on which methods are most convincing and on how best to implement them. Corrections and replications are the exception.
  4. Precision will be a core focus, and replications will be deemed worthy as long as precision is insufficient to come to a useful conclusion for policy makers or for scientific progress. At the moment, precision is not taken seriously: only significant results get reported, with spurious interpretations, publication bias and Questionable Research Practices (QRPs).
  5. Publication bias will be a thing of the past. First, we will be able to test and correct for it on the past literature. Second, because all projects will eventually be included in the summary meta-analysis, irrespective of their statistical significance, there will be no incentive for publication bias anymore. 
  6. We will never lose knowledge, thanks to continuously updated meta-analyses. At the moment, the datasets built for a meta-analysis are often kept private by the authors and more often than not run the risk of being lost. Both of these issues prevent us from continuously updating our knowledge as new results come in.

The main components of SKY


In order to reach these goals, SKY uses the following tools:
  1. Consensus articles: articles conveying the scientific consensus on a question (think Wikipedia meets IPCC report). These articles are freely editable by all but they are based on quantified accumulated evidence mainly taking the form of a continuously updated meta-analysis. Each time the meta-analytic result changes in a major way, the article wording will change as well to reflect the evolving consensus. Each article will contain a set of important open questions to be answered. As with Wikipedia, a discussion chamber will eventually be added in order to coordinate discussions and achieve consensus.
  2. Continuously updated meta-analyses: meta-analyses are the statistically correct way to aggregate statistical evidence on a question. SKY maintains a database of estimates on each question of interest along with the code analyzing them. The results of the meta-analysis (the aggregated estimated effect size or cost-effectiveness ratio of a given intervention and its precision, along with tests for publication bias and questionable research practices) are published on the consensus article page and serve as the basis for the consensus evaluation of an intervention or theoretical proposition. Each time a new individual research result is validated, the database is updated, immediately updating the results of the meta-analysis.
  3. Dynamic research reports: the individual results in each meta-analysis are the outcome of individual research projects. These projects unfold on the website as follows: each researcher or team starts with a declaration of intent, which for the moment is public. Giving the option of keeping the declaration of intent private, or open to a subset of users, is in the works. The declaration of intent serves as a time-stamped pre-registration, and as such will help avoid most QRPs. Then the team reports on its advances and first results. All along the lifetime of the project, the team can receive comments and suggestions, so as to amend the analysis. Making the declaration of intent public makes it possible to receive suggestions at the early stages of a project, before data collection for example, which is when feedback is most useful. Once the results are posted, along with data and code, others will try to reproduce them. The most effective people to do that are probably masters and PhD students in methods classes, under the supervision of their teacher. Refereeing will thus be continuous and in depth, from the birth of a project onward, forever. Once the analysis in the paper has been reproduced, the entry will be added to the meta-analysis. For already published papers, a link to the paper webpage, and a summary of its main results, will be a start. 
  4. Methodological guidelines: for each method, we will have a set of agreed-upon guidelines on how to best implement them, which tests are required and why. This will be supplemented by an open source “Statistical Tools for Causal Inference” book that will introduce the basic way to implement the methods along with the statistical code. For frontier methodological research, consensus articles and dynamic research reports will also exist.
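The "continuously updated meta-analysis" in point 2 above can be sketched in a few lines of code. Here is a minimal, illustrative Python version, not SKY's actual code (SKY uses R and MySQL): the class name, the numbers and the choice of a fixed-effect (inverse-variance) model are assumptions for the sketch.

```python
class RunningMetaAnalysis:
    """Fixed-effect (inverse-variance) meta-analysis that can be
    updated one estimate at a time, without revisiting old entries."""

    def __init__(self):
        self.sum_weights = 0.0   # running sum of 1/se^2
        self.sum_weighted = 0.0  # running sum of effect/se^2

    def add(self, effect, se):
        """Incorporate a newly validated estimate and its standard error."""
        weight = 1.0 / se**2
        self.sum_weights += weight
        self.sum_weighted += weight * effect

    def summary(self):
        """Return the pooled effect and the standard error of the pool."""
        pooled = self.sum_weighted / self.sum_weights
        pooled_se = (1.0 / self.sum_weights) ** 0.5
        return pooled, pooled_se

# Two validated estimates are already in the database...
meta = RunningMetaAnalysis()
meta.add(effect=0.5, se=0.1)
meta.add(effect=0.3, se=0.2)
print(meta.summary())  # pooled effect 0.46, SE ~0.089

# ...and a new, less favorable result arrives: the summary updates instantly,
# whatever the sign or significance of the new estimate.
meta.add(effect=0.0, se=0.1)
print(meta.summary())  # pooled effect ~0.256
```

A real continuously updated meta-analysis would likely use a random-effects model to allow for heterogeneity across studies, but the principle is the same: the database has memory, and every validated entry moves the consensus figure immediately.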

Why SKY is needed


We are in dire need of a tool to accumulate knowledge in the social sciences for at least three reasons.

First, the social sciences are not currently organized along collectively agreed-upon principles. There are many disciplines that try to explain human behavior in society, from psychology to sociology, from anthropology to economics, from political science to biology, and even physics. We have several major organizing theoretical principles, from homo oeconomicus to evolutionary psychology, from psychological biases to social inscription. We have a lot of interventions (nudges, Cognitive Behavioral Therapy, taxes, quotas, subsidies, norms...). We have a lot of methods to help us assess whether theories are correct or whether our interventions work. We have a lot of empirical evidence on both theories and interventions. BUT: 
  1. The empirical evidence is not organized so as to answer the main questions: which theoretical principles have been refuted or upheld? Which interventions work best? At the moment, in most cases, we have review articles or chapters that contain a list of empirical results, some in favor of a given theory or intervention, some not, without any firm conclusion drawn. Sometimes, we have a review counting the number of significant results in each direction (an approach called vote counting). That is a step in the direction of providing a unique quantified answer, but it is NOT correct (vote counting is a BIASED way of aggregating scientific evidence). We then have some meta-analyses, which are the correct way to aggregate statistical results, but they are plagued by three serious problems. i) Publication bias: the extent of publication bias in the social sciences is such that the summary of the published knowledge might be extremely misleading. Sometimes it might not be, but the fact that we do not know whether that is the case is devastating. This is a disgrace and a major problem that we have to overcome collectively. ii) Researcher bias: each review, or ideally meta-analysis, is the product of a narrow team of researchers, not of a whole community. That means that we can have disagreement about the interpretations in the review article, and that the conclusions of the review article do not carry the full weight of the whole community. As a consequence, disagreement can persist without ever being solved, and progress stalls. iii) Absence of continuous updating: the database behind a meta-analysis is most of the time kept private and generally lost. Updating it requires recollecting the whole data, an awfully costly and long process. With SKY, the process would be automatic and would have memory.
  2. The empirical toolkit is not stabilised so as to determine which methods can answer which questions with which accuracy. We are starting to converge on a toolkit of methods for causal inference, but there are still a lot of debates and differences of opinion as to how to implement them, which tests are required and why. This generates endless back-and-forths during the refereeing process, especially for people who are not close to the main centres of the academic elite. Also, most of our evaluation of precision is based on the results of statistical tests, which is wrong, generates huge biases and triggers QRPs. Finally, there is no agreed-upon set of methodological questions to answer.

Second, there does not exist a place to build consensus on both individual papers and aggregate results. Interactions and discussions in academia, especially in the social sciences, use 20th century technology and have not at all taken advantage of the possibilities offered by the digital revolution. At the moment, interactions are extremely rare and are limited to seminars, conferences and, most of all, the refereeing process. We have inherited these interaction formats from the 20th century, an era in which most interactions were either in person or through mail. This is highly ineffective. Interactions are useful all along the life of a project, especially at its inception, and refereeing is needed even after a project is published, in order to catch mistakes. At the moment, interactions at the beginning of a project are limited to a handful of colleagues. Seamless interaction is all the more needed when writing overview articles that reflect the current views of the community on a question.

Third, the current tools have made us focus a lot of our efforts on publication instead of finding the truth. One especially big problem is that we measure productivity by the number of publications. More precisely:
  • We are focused on publication, and forget almost everything that occurs after, especially aggregation of knowledge and consensus building.
  • Our focus on publication is so strong that we compromise the exactitude of science by allowing for publication bias and Questionable Research Practices (QRPs).
  • Our knowledge is behind paywalls.
  • Knowledge is not aggregated, and when it is, it is not properly or only by part of the community.
  • We study the same objects with different principles.

SKY comes from my being at the same time inspired by and frustrated with the existing solutions that try to implement this vision, or part of it:
  1. Wikipedia proves that decentralized communities of interacting users can produce knowledge of high quality. But one core rule of Wikipedia is No Original Research, which precludes Wikipedia from aggregating scientific knowledge, as long as a meta-analysis is considered original research. Also, Wikipedia has amazing tools for interactively editing an entry, but no tools for interactively editing code and data.
  2. The several initiatives that conduct systematic meta-analyses on social science questions (Cochrane, Campbell, Washington State Institute for Public Policy, NICE) are amazing, but they have several limitations: they are not interactive; they are limited to a restricted team of researchers, meaning that they do not reflect the opinion of the community; they are sometimes paywalled and sometimes very hard to access; and their databases and code are not open, accessible and editable.
  3. The open social science initiatives (PsySciAcc, OSF) are awesome, but their goal is to conduct new studies, not to summarize the results of a whole literature on a question. The results of their studies will be great entries on SKY but they do not provide an easy way to aggregate results, update them and collaboratively write a consensus piece. It will be fine to link to them on SKY for information on an individual study, at least in a first stage. Probably slightly more detailed pieces will eventually be extracted from these articles for their SKY page.
  4. In economics, both J-PAL and IPA provide a first attempt at unifying methodological practice and have generated some content close to what SKY strives for. For the moment, neither J-PAL nor IPA proposes interactive tools to aggregate the knowledge of the community, nor an open dataset of results, nor meta-analyses.
  5. Twitter is a great tool to interact with scientists, but it is not structured along research topics lines and does not provide tools for editing text, data and code.
  6. Several researchers, myself included, have started to compile all the papers on one question in google spreadsheets. This is neat, but it provides neither aggregation of the results, nor an easily accessible unique place where one can find such datasets, nor a way to reach a consensus based on those datasets.

Why SKY is open source


Before describing how SKY works in practice, let me explain why I make SKY open source. The main reason is efficiency. SKY fulfills all five criteria put forward by Eric Raymond for the decentralized approach to project management to dominate the centralized approach:

  1. Reliability/ stability/ scalability are critical. 
  2. Correctness of design and implementation cannot readily be verified by means other than independent peer review. 
  3. The software is critical to the user’s control of his/her business. Here, users are funders, policy-makers and scientists. I’m betting on the fact that SKY is fulfilling an important need for these users, and maintaining it will therefore appear critical to them.
  4. The software establishes or enables a common computing and communications infrastructure. 
  5. Key methods (or functional equivalents of them) are part of common engineering knowledge.

Countless examples in the software industry, from Linux to Apache, from Firefox to, of course, Wikipedia, prove that the decentralized open source approach works better than the closed approach when these criteria are fulfilled.  

There are two other more philosophical reasons for SKY to be open source. 

First, science by design cannot be closed. Science is the daughter of the Enlightenment; it thrives on rational discourse and exchange. Anyone who has something to contribute, acting in good faith, bringing data, theory, questions or interpretations, is welcome. Science hates arguments from authority, and thus I cannot see myself acting on them. Actually, my feeling is that we need to be willing to engage with anyone, especially citizens, both those skeptical of our results and the science-loving ones. We need to show that we have nothing to hide, that we can address all criticisms and explain our position, how we know what we know, and be willing to reveal what we do not know, to anyone. Well, obviously, there will be rules of engagement that I’ll describe later, and anyone breaking these rules will not be welcome on SKY. We want rational, data-based discussion, not endless ideological debates. We want people acting in good faith, not proselytizers. 

Second, I hope that being open to anyone will leverage the intrinsic motivations of contributors: their intrinsic love for science and their positive emotions at contributing to the accumulation of scientific knowledge, rather than winning at the editorial lottery. There will be no badges to win on SKY other than contributing to the progressive accumulation of knowledge, no editorial positions to hold, no publications in major journals, nothing to win other than contributing to the collective scientific endeavor of understanding the world better to make it a better place.

That is not to say that there won’t be any rewards on SKY other than the feeling of contributing to science. It seems obvious that, as has happened with Wikipedia and Linux, contributors will earn social rewards from the respect and recognition of their peers. But my belief is that, on SKY, these rewards will be of a different nature, and will put value on different types of contributions: useful pieces of public software, useful new methods, a replication that makes an estimated effect more precise, will all be valued because they make the work of everyone in the community better. People good at organizing teams of researchers around questions will also be valued. But the aggregative nature of SKY, the fact that the main entry everyone will care about is the consensus article, a piece that aggregates the results of a lot of researchers and the interpretations of even more, will tone down the incentive for individualistic rewards such as publishing in big journals or holding editorial positions where you act as a gatekeeper. The community will be the gatekeeper; users of science will be the gatekeepers. By doing so, SKY, I hope, will realign the incentives of scientists towards doing good, sometimes boring, but truly life-changing science, and away from doing publishable, eye-catching, shiny but eventually shaky and biased science. 

Which technology makes SKY possible


Let me now describe how SKY works in practice. The current version of SKY is built around three main tools. A MySQL database stores the results of individual studies. These results are downloaded and analysed in R and transformed into a web page with graphs, tables and text using Markdown syntax. Results of the meta-analyses are also stored in the MySQL database for later use. The whole code generating SKY is publicly accessible in a GitHub repository. SKY is currently hosted on GitHub Pages: each time a modification to the code is pushed to the GitHub repo, SKY reflects it after a few minutes. 

If you are not familiar with the tools that enable the existence of SKY, here is a quick description (they are all free and open source):
  1. MySQL is a database management tool. It stores data on a repository in a clean and normalised way. Moreover, MySQL enables access to and modification of the data by a lot of users simultaneously. With MySQL, no need to make endless copies of the data. You just have one copy on a shared repository. 
  2. Git is a collaborative writing tool that is used mostly by coders. It tracks changes made to a text, merges two separately updated versions of a document, and identifies potential conflicts so you can solve them. Git is basically MSWord “track changes” for nerds. With Git, there is a centrally maintained version of the code, in general on a web-based repository such as GitHub, that users can download, modify and then push changes to.   
  3. Rmarkdown is a document format that combines R code for statistical analysis with the Markdown syntax for text editing to produce neatly formatted documents such as pdf, Word or html files (the format used on the web).
As an intermediate stage, SKY also uses Google spreadsheets to upload information about estimates for which there is no individual research report entry yet. Hopefully, as time passes, the number of such entries will decrease.
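To make the pipeline concrete, here is a self-contained sketch of the database-to-page step. SKY actually uses MySQL and R; below, Python with an in-memory sqlite database stands in for both, and the table schema, question name and numbers are all invented for illustration:

```python
import sqlite3

# Stand-in for SKY's MySQL estimates database (schema is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE estimates (question TEXT, study TEXT, effect REAL, se REAL)"
)
conn.executemany(
    "INSERT INTO estimates VALUES (?, ?, ?, ?)",
    [("job-training", "Study A", 0.5, 0.1),
     ("job-training", "Study B", 0.3, 0.2)],
)

# Pull the estimates for one question and pool them with
# inverse-variance weights, as a meta-analysis would.
rows = conn.execute(
    "SELECT effect, se FROM estimates WHERE question = ?", ("job-training",)
).fetchall()
weights = [1.0 / se**2 for _, se in rows]
pooled = sum(w * e for w, (e, _) in zip(weights, rows)) / sum(weights)

# Render the consensus page as Markdown, ready for the static site build.
page = (f"## Job training\n\n"
        f"Pooled effect: {pooled:.2f} from {len(rows)} estimates.\n")
print(page)
```

In SKY itself, this rendering step is an Rmarkdown document, and pushing it to the GitHub repo is what triggers the GitHub Pages rebuild.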

How to use and contribute to SKY


There are different ways to use SKY. Obviously, consulting the results of meta-analyses on questions that interest you is a first. A second is to link, on SKY, to a meta-analysis, to related studies, or to a google doc listing them, so that you never lose the information, you share it, and you stand a chance of having someone update and enrich it. Finally, you can also use SKY to pre-register an idea and receive feedback on it.  

You can start contributing to SKY right now. The easiest way to contribute is to make me aware of a meta-analysis on a topic. You can for example cc me when you tweet about one, or when you find one on Twitter. You can also drop me an email. As soon as I receive the link, I'll upload it to SKY.

If you download the whole project from the GitHub repo and start tinkering with it, you can create the link yourself. Once you have made modifications, commit them with Git and push them to GitHub. At the moment, I have to accept your additions for them to appear online. That will evolve over time, either by adding more master users with permission to accept changes or by giving master rights to everyone. If you need to access the SQL database, you’ll need an ID and a password. Email me and I’ll send them to you, and you’ll be all set. 

When contributing, please describe as clearly as possible what you have done in the commit message, so that everyone can easily identify your changes and what they are doing.

I might ask you to open a new branch when your changes are substantial, so that we can see what the changes look like on our computers before putting them online.
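Put together, the contribution workflow looks like the commands below. This is a network-free sketch: the repository, branch and file names are made up, and a local `git init` stands in for cloning the actual SKY repo.

```shell
set -e
# Stand-in for "git clone <SKY repo URL>": create a local repo to work in
git init -q sky
cd sky
git -c user.name=You -c user.email=you@example.com \
    commit -q --allow-empty -m "initial"

# Open a dedicated branch for a substantial change
git checkout -q -b add-minwage-dataset

# ...edit or add files (here, a hypothetical dataset of estimates)...
printf "study,effect,se\nStudyA,0.5,0.1\n" > minwage.csv

# Commit with a clear message describing what the change does
git add minwage.csv
git -c user.name=You -c user.email=you@example.com \
    commit -q -m "Add minimum wage meta-analysis dataset"

# In the real workflow, you would now push the branch for review:
# git push origin add-minwage-dataset
git log --oneline -1
```

The clear commit message is what lets everyone identify your changes, and the branch keeps them out of the live site until they are accepted.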

I've written a succinct document describing how to get started with the tools behind SKY. I still have to update it for Rmarkdown (for the moment it describes how to use knitr, but resources for Rmarkdown are easy to find).


What can you contribute to SKY


Meta-analysis


The easiest way to contribute, if you have done, or have knowledge of, a meta-analysis, is to make me aware of it. You can for example cc me when you tweet about one, or when you find one on Twitter. You can also drop me an email. As soon as I receive the link, I'll upload it to SKY.

If you have access to the data behind the meta-analysis, it would be great if you could create a Google spreadsheet containing the data and create an article summarising your interpretation of it. Again, if you do not have the time or the knowledge to do it, I can do it for you if you send me the dataset.

If you’re interested in a question but have not done a meta-analysis, please start a Google doc with the few estimates you know. We will improve the coverage over time with additions by others, especially by original authors who want their papers to be included in SKY. You can already see how SKY changes the way we conduct a meta-analysis: we can start small and aim for exhaustiveness only in the limit, with the combined efforts of many people.

Try as much as possible to add an individual research report for each entry in your dataset, linking to the initial study and explaining how you went from the figures in the study to the entry in the dataset.

If you do not have time to do all of that, just start with something. Make a guess. Explain how you made it. Other people, or you yourself when you have more time, will come back to it and improve it.

We need to agree on aggregation principles, especially for the estimates. I propose to aggregate using Odds Ratios (OR), Effect Sizes (ES), elasticities or ratios of ES, and Cost-Effectiveness or Benefit/Cost ratios, depending on what’s appropriate for a particular application. For interventions, Cost-Effectiveness ratios should be the ultimate goal, in that they enable a clear comparison between interventions aiming at the same goal. Other quantities of interest should be computed for each estimate as much as possible.
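To make the aggregation step concrete, here is a minimal sketch of the standard fixed-effect, inverse-variance aggregation of effect sizes. It is written in Python purely for illustration (SKY itself is built around R), the function name is mine, and the numbers are hypothetical:

```python
import math

def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted average of effect estimates.

    Each study is weighted by 1/se^2, so more precise studies
    count more. Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Three hypothetical effect sizes (ES) with their standard errors
pooled, se = fixed_effect_meta([0.30, 0.10, 0.25], [0.10, 0.05, 0.15])
print(round(pooled, 3), round(se, 3))  # → 0.149 0.043
```

A random-effects version would add a between-study variance term to each weight, but the inverse-variance logic stays the same, which is why starting with a simple shared spreadsheet of estimates and standard errors already gets us most of the way.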

Individual research reports


You can start a new research project on SKY. That will act as preregistration, and will be time-stamped on the Git server when you commit your changes.

You can link to an ongoing research project on another website such as the AEA registry, OSF or PsySciAcc, or to a registered report sent to or accepted by a journal.

Do not hesitate to create a page for the general question your study is trying to answer and to put a few words of motivation there.

If you’re aware of similar studies, mention them, even if only in passing. Better, provide a link to the project/paper. Better still, let the authors know that, if they want, they can put their estimates on SKY and review and improve your entry.

Methodological Research


You can also contribute to developing methodological guidelines. Anyone can, not only people doing methodological research.

If you’re doing methodological research, do not hesitate to create new research reports and synthetic entries. 

Finally, you can update and revise or even contribute directly to the Statistical Tools for Causal Inference Book. Its GitHub repo is here.

What are the rules of the game on SKY


Be bold 


Do not hesitate to create new questions and to feed them the little information that you have. If you have none, that’s fine. Just ask the question.

Do not strive for perfection at first


A stub, a line, a link, a back-of-the-envelope estimate are all better than nothing. Others, or you yourself later, will complete your work.

Act in good faith


When interpreting evidence and discussing results, be open to arguments.

Assume good faith


Be civil, be tolerant: errors and mistakes happen. Most of the time, they are made in good faith.

Use data to settle disputes


If various interpretations of the same data can be held, mention them, and mention which tests might be able to decide between them. When discussing methods, use theory, simulations or performance on real data to come to an agreement.

Do not proselytise


If you’re coming to SKY only to make a point, whatever the evidence, you’ll be frowned upon. Accept that some people are not convinced by the same evidence as you are, try to understand why, and try to find a way to answer them, or at least delineate the possible competing interpretations and how we could go about deciding between them.

Do not violate copyright 

When mentioning a published paper, always provide a link to the published version. It's fair to also provide a link to a freely accessible pre-print. If you quote from the paper, use quotation marks or the quotation environment. Refrain from extracting a figure or a table from a paper; simply provide a link to it on the publisher's webpage. I'm in discussion with publishers in order to understand the rules of the game.

How SKY is funded


SKY is open source, free and free of advertisement, and not for profit. SKY benefits from the work of researchers whose wages are already paid by their institutions. In the future, SKY will be organised as a not-for-profit organisation able to receive donations, crowdfunding and subsidies to fund servers. For the moment, SKY is funded only by me, at the cost of maintaining an online MySQL server.

SKY is supported by my employers, TSE and Inra.

What’s in the cards for SKY


The major evolution that I foresee for SKY in the medium run is to move to a full cloud-based solution. This will enable several key improvements:
  1. Cloud-based computing avoids having to download and run the code for the whole website on each user’s computer. At the moment, the only cloud-based parts of SKY are the SQL (and Google spreadsheets) database and the GitHub repository for the code. In order to make modifications, a user has to download the full SKY project, which might at some point become a burden. It is not one at the moment, since the website is small, but if and when SKY grows, this might become an issue. There are ways around it that do not require moving to full cloud-based computing (such as using R Markdown’s “cache” option to avoid re-running expensive pieces of code as long as their results do not change, and even putting the cached databases on GitHub), but they are dominated by a full cloud-based solution.
  2. Cloud-based computing would probably also be more welcoming to users of statistical software other than R and to users not familiar with git.

The solution that I am exploring at the moment is to use CodeOcean’s amazing infrastructure. I’m going to talk with them and see if we can work something out.

Another important evolution is to add to SKY a discussion chamber, much like the “discussion” page of a Wikipedia project. Such a chamber is crucial to prevent cycles of edits by editors who hold opposing views and to streamline discussions and the sharing of work.

Another important addition to SKY is a refereeing suggestion and voting infrastructure that, much like answers on Stack Overflow, will enable you to suggest edits and modifications to a project and to vote for already suggested modifications.

Another important component of SKY will be an anonymous reporting infrastructure where behaviour deemed unethical or in violation of the rules of SKY can be reported. Individuals appearing there multiple times for different interactions will be warned by SKY representatives that their behaviour must change if they want to remain on the platform. Public shaming and exclusion from the platform are possible sanctions for repeated misbehaviour despite warnings.

Finally, offering the option of maintaining private repos is also in the cards.

OK, now, let's roll :)

Thursday, September 20, 2018

Why We Cannot Trust the Published Empirical Record in Economics and How to Make Things Better

A string of recent results is casting doubt on the soundness of the published empirical results in economics. Economics is now undergoing a replication crisis similar to the one psychology and cancer research have undergone over the last ten years. This crisis is so broad that it concerns all of the published empirical results, and it is so severe that it might mean that most of them are wrong. The mere fact that we cannot trust any of them and do not know which ones hold and which ones do not casts doubt on all of the empirical results in our field. This is very serious business.

In this blog post, I want to briefly explain what the replication crisis in economics is and what its most likely causes are. I'll then provide some evidence, along with personal stories and anecdotes, that illustrates the bad behaviors that generate the replication crisis. Then, I'm going to detail the set of solutions that I think we need in order to overcome the crisis. Finally, I will end with a teaser about a project that I am preparing with some colleagues, the Social Science Knowledge Accumulation Initiative, or SKY (yeah, I pronounce SSKAI SKY, can I do that?), that we hope is going to provide the infrastructure necessary to implement the required changes.

In a sense, I'm relieved and happy that this crisis occurs, because it is high time that we put the great tools of the empirical revolution to good use by ditching bad practices and incentives and start accumulating knowledge.

What is the replication crisis?


The replication crisis is the fact that we do not know whether the published empirical results could be replicated if one tried to reproduce them exactly in the same way as the original authors did, with the same data collection procedure and the same statistical tools.

The replication crisis comes from editors choosing to publish only studies that show statistically significant results. Because statistically significant results can be obtained by chance or by using Questionable Research Practices (QRPs), the published record is populated by studies whose results are due to luck or to data wizardry.

The replication crisis is a very bad problem because it implies that there is no guarantee that the truth will emerge from the published record. When published results are selected in order to find effects that are just big enough to pass the infamous statistical significance thresholds, the published record is going to be severely biased. Small effects are going to look bigger than they really are; zero effects (or very small effects) might look like they exist and are substantial. And we do not know which effects are concerned and how many of them there are. The replication crisis means that we cannot trust the published empirical record in economics.

The Four Horsemen of the Apocalypse: the Behaviors Behind the Replication Crisis


Before getting to the evidence, let me detail the four behaviors that make the published record unreliable: publication bias, the file drawer problem, p-hacking and HARKing. Let me describe them in turn with the help of an example.

Let's start with publication bias. Imagine that you have a theory saying that all coins are biased. The theory does not say the direction of the bias, only that all coins are biased: some might give more heads than tails and others more tails than heads. In order to test the theory, 100 research teams start throwing coins, one for each team. By sheer luck, sometimes, even unbiased coins will have runs with more heads than tails or more tails than heads. How do you decide whether the run is due to sheer luck or to a biased coin? Well, scientists rely on a procedure called a statistical test. The idea is to assume a balanced coin (we call that the Null Hypothesis) and derive what type of data we should see under that assumption. The way it works is as follows: assuming a fair coin, the distribution of, let's say, the proportion of heads over runs of N flips is going to be centered around 0.5, with some runs giving a proportion larger than 0.5 and some runs giving a lower one. The distribution is going to be symmetric around 0.5, with tails that get thinner and thinner as we move away from 0.5 on each side, because we are less likely to obtain 75% of heads than 55%. Now, the test works like this: each team compares its own proportion of heads to the theoretical distribution of the proportion of heads under the Null for a sample of size N. If the observed proportion of heads is far away in the tails of this distribution, on either side, we are going to say that the coin is biased, because it is unlikely that its run comes from the fair-coin Null distribution. But how do we decide precisely? By convention, we declare the result statistically significant if the proportion of heads over N flips is equal to or higher than the 97.5th percentile of the distribution under the Null, or equal to or smaller than the 2.5th percentile of the same distribution. Why do we choose these values? Because they have the property that, under the Null, we are only going to conclude incorrectly that a coin is biased while it is actually fair 5% of the time, and 5% is considered a small quantity. Or, rather, it was chosen by the inventor of the procedure, Ronald Fisher, and it kind of became fetishized after that. Something like a papal decree (we have a lot of them in science, rather surprisingly). The precise values of the thresholds above or below which we are going to decide that a coin is not fair depend on sample size. They are closer to 0.5 as the number of flips increases, because with a larger number of flips, more and more samples are going to find values closer to 0.5 under the Null Hypothesis that the coin is fair. With larger sample sizes, a smaller deviation is going to be considered more surprising.
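The whole procedure fits in a few lines of code. The sketch below (Python, purely illustrative; the function and variable names are mine) simulates many teams flipping a *fair* coin and testing it with the normal approximation to the binomial; roughly 5% of the teams wrongly reject the fair-coin Null, exactly as designed:

```python
import random

def coin_is_significant(n_flips, rng):
    """Flip a *fair* coin n_flips times and test H0: P(heads) = 0.5
    with a two-sided z-test (normal approximation to the binomial)."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    p_hat = heads / n_flips
    se = (0.25 / n_flips) ** 0.5       # sd of p_hat under the Null
    z = (p_hat - 0.5) / se
    return abs(z) > 1.96               # significant at the 5% level?

rng = random.Random(1)
teams = 2_000                          # independent research teams
false_positives = sum(coin_is_significant(500, rng) for _ in range(teams))
print(false_positives / teams)         # close to the nominal 0.05
```

Nothing is broken here: a 5% false-positive rate is the price of the test. The trouble only starts when the 5% of lucky teams are the only ones who get published.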

Even if all coins are fair, and the theory is wrong, by the sheer properties of our testing procedure, 5 out of the 100 competing teams are going to report statistically significant evidence in favor of the biased-coin theory on average: 2.5 that the coin is biased towards heads and 2.5 that it is biased towards tails. That would be fine if all 100 teams reported their results. The distribution of the proportion of heads over the 100 teams would be very similar to the distribution under the Null, and would thus invalidate our theory and vindicate the Null.

But what would happen if only the statistically significant results got published? Well, the published record would only contain evidence in favor of a false theory, invalidating the correct Null hypothesis: the published record would say "All coins are unfair."

Wow, how is it possible that something like that happens? It seems that my colleagues and friends, who are referees and editors for, and readers of, scientific journals, hate publishing or reading about non-significant Null results. Why is that? You might say that it looks completely foolish to do so once I've presented you with how the procedure works. Sure, but I came to what I hope is a clear understanding of the problem after a lot of effort: reading multiple books and articles about the replication crisis, simulating these coin tosses for myself, teaching the problem in my class and writing a series of blog posts on the topic. (i) Most scientists do not have a great mastery of statistical analysis (well, you already have to master your own discipline, and now you would need stats on top of that; that's obviously hard). The statistical education of scientists is a series of recipes to apply and rules to follow, instead of an introduction to the main issue (that of sampling noise) and to the ways to respond to it. As a consequence of poor statistical training, most scientists interpret Null results as inconclusive, and statistically significant results as conclusive. (ii) Most scientists are not really aware of the aggregate consequences of massive science, with multiple teams pursuing the same goal, and of how that interacts with the properties of statistical tests. Most teams have not read John Ioannidis' paper on this (on which my example is loosely based). (iii) Researchers are like every other human being on the planet: they like a good story, they like to be surprised and entertained, and they do not like boring. And most correct results are going to be boring: “Coins are fair? Boring! Who would like to know about that? Wait, what, coins are unfair? Here is the front page for you, buddy!” The problem is that the world is probably mostly boring or, stated otherwise, most of our theories are probably false. We are thus going to find a lot of Null results, and we should know about them in order to weed out the wrong theories.

OK, now we understand publication bias. But there’s more. Scientists understand that editors and referees do not like non-significant results. As a consequence, they adopt behaviors that make the replication crisis more severe. These Questionable Research Practices (QRPs) - the file drawer problem, p-hacking and HARKing - increase the likelihood of finding a significant effect even if there is none to levels much higher than 5%, sometimes as high as 60%. As a consequence, the published record is populated by zombie theories even if the truth is actually the good old Null hypothesis. The file drawer problem is scientists simply not bothering to convert non-significant results into papers because they know they’ll never get to publish them. As a consequence, the bulk of evidence in favor of the Null is eradicated before even getting to the editors. The second problem is called p-hacking. P-hacking is when researchers use the degrees of freedom that they have in experimental design in order to increase the likelihood of finding significant results. And there are unfortunately tons of ways to do just that. In my biased-coin example, p-hacking could happen, for example, if researchers periodically checked the result of the test after a few coin flips, and then after collecting a few more, and decided to stop collecting data when the test result became significant. Researchers might also decide to exclude some flips from the sample because they were not done in correct conditions. This might look very innocent and might not be done in bad faith: you’re just making corrections to your data when your result is non-significant, and you stop making most of them when it is significant. Finally, scientists might try out a host of various tests of the hypothesis and report only the statistically significant results. For example, bias could be defined as the proportion of heads being different from 0.5, but it could also be that the coin has long runs of successive heads or of successive tails, even if it is unbiased on average. Or it might be that the coin is biased only when tossed in a certain way. If all teams tried all these combinations and only reported the statistically significant results, the proportion of the 100 teams finding significant results would increase by a lot, sometimes up to 60, instead of the initial 5. HARKing, or Hypothesizing After the Results are Known, is the practice of formalizing the theory after seeing the data and reporting it as if the theory came first. For example, among the 100 teams investigating coin flips, the ones finding significantly more heads might end up publishing a paper with evidence in favor of a theory stating that coins are all biased towards heads. The other teams might publish a paper with evidence in favor of the theory that all coins are biased towards tails. Some teams might find that coins produce successive “hot hand” runs of heads or tails, some teams might publish evidence for theories that the way you throw the coin changes how biased it is, etc. Don’t laugh, that has happened repeatedly.
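A short simulation makes the danger of one of these QRPs, optional stopping, concrete. In this sketch (Python, purely illustrative; names are mine), each team flips a fair coin, peeks at the test every 20 flips, and stops as soon as the result crosses the significance threshold; the false-positive rate climbs far above the nominal 5%:

```python
import random

def stops_with_significance(max_flips, rng, look_every=20):
    """Flip a fair coin, peeking at a two-sided z-test every `look_every`
    flips, and stop as soon as it looks 'significant'. Returns True if
    the team ends up reporting a spurious significant result."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        if n % look_every == 0:
            z = (heads / n - 0.5) / (0.25 / n) ** 0.5
            if abs(z) > 1.96:
                return True            # stop collecting and "publish"
    return False

rng = random.Random(2)
teams = 2_000
rate = sum(stops_with_significance(1_000, rng) for _ in range(teams)) / teams
print(rate)                            # well above the nominal 0.05
```

The nominal 5% guarantee only holds if the sample size is fixed in advance; repeated peeking gives the noise many chances to cross the threshold, which is exactly why preregistering the stopping rule matters.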


What is the evidence that the published record in economics is tainted by publication bias and Questionable Research Practices?


If my theory of the replication crisis is correct, what should we see? Well, it depends on whether we spend our time hunting for true effects or for Null effects. More precisely, it depends on the true size of the effects that we are hunting for and on the precision of our tests (i.e. the size of our samples). If we spend our time looking for small, very small or zero effects with low precision, i.e. low sample size, we should see the distribution under the Null, but truncated below the significance thresholds if there is publication bias. That is, we should see only significant results being published, with the bulk of them stacked close to the statistical significance threshold and the mass slowly thinning as we move to higher values of the test statistics. If there is p-hacking or HARKing, we should also see a bulging mass just above the significance thresholds. On the contrary, if we are mostly looking at large effects with large sample sizes, the distribution of test statistics should look like a normal only slightly censored on the left, where the missing mass of non-significant results falls. We should see a small bulging mass above the significance threshold if there is p-hacking or HARKing, but that should not be the main mass. The main mass should be very far away from the significance thresholds, providing indisputable evidence of large and/or precisely estimated effects.
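The publication-bias half of this prediction is easy to simulate. In the sketch below (Python, purely illustrative), every study estimates a true Null effect, so its z-statistic is a standard normal draw, and only "significant" studies get published; the published distribution then piles up just above the threshold:

```python
import random

rng = random.Random(3)

# z-statistics of studies of a true Null effect: standard normal draws
drawn = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# publication bias: only |z| > 1.96 makes it into the journals
published = [z for z in drawn if abs(z) > 1.96]

# most published statistics sit just above the threshold
near_threshold = sum(1.96 < abs(z) <= 2.5 for z in published) / len(published)
print(round(near_threshold, 2))        # roughly 0.75
```

With a true Null, about three quarters of the published z-statistics land between 1.96 and 2.5 in absolute value: mass stacked right at the threshold and thinning quickly above it, which is the signature to look for in the data.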

So what do we see? The following graph is taken from a very recent working paper by Abel Brodeur, Nikolai Cook and Anthony Heyes. It plots in black the distribution of 13440 test statistics for main effects reported in 25 top economics journals in 2015, by method (I'll come back to that). It also reports the theoretical distribution under the Null that all results are zero (dashed grey line) and the thresholds above which effects are considered statistically significant (vertical lines).



What we see is highly compatible with researchers hunting mostly for Null effects, of massively missing Null effects and of published effects due either to luck or to QRPs.
  1. There is no evidence of large and/or precisely estimated effects: in that case, test statistics would peak well above the significance thresholds. In the actual data, it's all downhill above the significance thresholds.
  2. We might interpret the plot for IV as evidence of medium-sized or not super precisely estimated effects. Taken at face value, the peak around the significance thresholds for IV might tell the story of effects whose size places them just around the significance thresholds, given the precision that we have in economics. That interpretation is very likely to be wrong though:
    1. It is just weird that our naturally occurring effects happen to align nicely with the incentives that researchers face to get their results published. But, OK, let's accept for the moment that that's fate.
    2. Editors let some Null results pass, and there are actually too many of them. As a consequence, the distribution is not symmetric around the main peak, the bulge around the significance thresholds is too sharp, and the mass of results at zero is too big to be compatible with a simple story of us hunting down effects that sit exactly around detection thresholds.
    3. The quality of the evidence should not depend on the method. IV seems to provide evidence of effects, but the much more robust RDD and RCT methods show much fainter signs of a bulge around significance thresholds. It could be that they are used more for hunting Null effects, but it is hard to believe that users of IV have a magic silver bullet that tells them in advance where the large effects are, while RCT and RDD users do not. No, most of the time, these are the same people. The difference is that it's much harder to game an RCT or an RDD.
  3. We see clear evidence of p-hacking or HARKing around significance thresholds, with a missing mass below the thresholds and an excess mass above. It is clearly apparent in RCTs, tiny in RDDs, huge in DID, and so huge in IV that it actually erases all the mass down to zero!

OK, so my reading of this evidence is at best extremely worrying, at worst completely damning: yes, most statistically significant published results in economics might very well be false. There is no clear sign of an indisputable mass of precisely estimated and/or large effects far above the conventional significance thresholds. There are clear tell-tale signs of publication bias (missing mass at zero) and of p-hacking and HARKing (a bulge around significance thresholds, with missing mass below). So, we clearly do not know what we know. And we should question everything that has been published so far.

Some personal anecdotes now that are intended to illustrate why I'm deeply convinced that the evidence I've just presented corresponds to the worst case scenario:
  1. I p-hacked once. Yes, I acknowledge that I have cheated too. It is so innocent, so understandable, so easy, and so simple. I had a wonderful student who did an internship with me. He was so amazing, and our idea was so great, we were sure we were going to find something. The student worked so hard; he did an amazing job. And then, the effect was sometimes there, sometimes not there. We ended up selecting the nicely significant effect and published it. I wanted to publish the finding for the student. I was also sure that our significant effect was real. I also needed publications, because this is how my output is measured, and at the time I did not understand very well the problem of publication bias and how severe it was. I chose to send the paper to a low-tier journal though, so as not to feature this paper too prominently on my resume (it is not one of my two main papers). Yes, it was a little bit of a half-assed decision, but nobody's perfect ;)
  2. After that, I swore never to do it again, I felt so bad. I put in place extremely rigorous procedures (RCTs with preregistration, publication of everything that we are doing, even Null results, no selection of the best specifications). And, mind you, out of my 7 current empirical projects, 5 yield non-significant results.
  3. After presenting preliminary non-significant results at the first workshop of my ANR-funded project, a well-intentioned colleague told me that he was worried about how I was going to publish. "All these non-significant results, what are you going to do about them?" he said. He said it like it was a curse or a bad omen that I had so many insignificant results. He is an editor. It almost felt like I was wasting public research funds on insignificant results, a problem I could not have foreseen before starting the study. But see the problem: if I cannot convert my research into published results, I am going to fail as a researcher and fail to raise more money for my future research. But in order to publish, I need significant results, which are beyond my control, unless I start behaving badly and pollute the published empirical record with tainted evidence. This is the incentive structure that all of us are facing, and it is deeply wrong.
  4. Turns out my colleague was right: the editor of the first journal to which we sent a paper with a non-significant result wrote back "very well conducted study, but the results are insignificant, so we are not going to publish it." I kid you not.
  5. In a seminar, I asked the presenter whether he had obtained his empirical results before or after crafting his theory. He looked at me, puzzled, took a long time to think about his answer, and then said: "Do you know anyone who formulates theories before seeing the data?" I said "Yes, people who pre-register everything that they are doing." He replied "Is it a sin not to do it?" I said "It's kind of a sin, it's called HARKing. The main thing that I want to know is whether the evidence in favor of your theory came before or after your theory. It is much more convincing if it came after." After talking more with him, he could delineate which results had inspired his theoretical work and which ones he had obtained after. But he was unaware that presenting both sets of results on the same level was a QRP called HARKing (and I'm not pointing fingers here, we are all ignorant, we have all sinned).
  6. Recently on Twitter, a young researcher boasted that his high productivity in terms of published papers was due to the fact that he stopped investigating non-promising research projects early. If you stop at once when you have non-significant results, then you contribute to the file drawer problem.
  7. One of the most important current empirical researchers in economics said in an interview “where we start out is often very far away from the papers that people see as the finished product. We work hard to try to write a paper that ex-post seems extremely simple: “Oh, it’s obvious that that’s the set of calculations you should have done.”” This leaves the possibility open that he and his team try out a lot of different combinations before zeroing in on the correct one, a behavior dangerously close to p-hacking or HARKing. I'm not saying that's what they do. I'm saying that no procedures ensure that they do not, and that the empirical track record in the field pushes us to lose trust when this kind of behavior is allowed.
  8. With my students last year, we examined roughly 100 papers in empirical environmental economics. I cannot fathom how many times we commented on a result that was like “the conclusion of this paper is that this policy saved 60000 lives plus or minus 60000.” The bewildered look in my students' eyes! “Then it could be anything?” Yes it could. And even more worrying, how come all of these results are all so close to the significance threshold? My students were all wondering whether they really could trust them.
  9. I cannot express how much people in my field are obsessed with publication. It seems to be the end goal of our lives, the only thing we can talk about. We talk more about publications than we talk about scientific research. Careers are made and destroyed based on publications. The content itself is almost unimportant. Publications! Top journals! As if the content of our knowledge would follow directly from competition for top publications. Hard to avoid pointing out that neither Darwin, nor Newton, nor Einstein considered publication to be more important than content. And hard to avoid saying that we are then only rewarding luck (the luck of finding the few significant results stemming out of our Null distributions) or, worse, questionable research practices. And that we are polluting the empirical record of evidence and ultimately betraying our own field, betraying science and, I'd say, shooting ourselves in the foot. Why? Because when citizens, policy-makers and funders catch up with that problem (and they will), they will lose all confidence in our results. And they'll be right.

I hope now that you are as worried as I am about the quality of the empirical evidence in economics. Obviously, some people are going to argue that there is not such a problem. That people like me are just frustrated failed researchers that try to stifle the creativity of the most prolific researchers. How do I know that? It all happened before in psychology. And it is happening right now. There are people mocking Brodeur et al for using the very statistical significance tests that they claim are at the source of the problem (I let you decide whether this is an argument or just a funny deflection zinger. OK, it's just the latter). Some people point out that significant results are just results that signal that we need more research in this area, in order to confirm whether they exist or not. I think this is a more elaborate but eventually fully wrong argument. Why? For two reasons. First, it is simply not happening like that. When a significant result gets published, it is considered as the definitive truth, whether or not it is precise, close to the significance threshold or not. And there is NEVER an independent replication of that result. This is not seen as an encouragement for more research. It is just the sign that we have a new result out. Cases of dissent with the published record are extremely rare, worryingly rare. And the first published significant result now becomes an anchor for all successive results, a behavior already identified by Feynman in physics years ago that lead him to propose blind data analysis. Second, there are no incentives for doing so: replications just do not get attention, and have a hard time being published. If they find the same results: boring! If they don't, then why? What did the authors do that lead to a different result? 
No-one believes that sampling noise alone can be responsible for something like a change in statistical significance, whereas it is actually the rule rather than the exception, especially when results are very close to the significance threshold. Third, this is simply a waste of time, effort and money: all of these significant results stem from the selection of 5 significant results among 100 competing teams. Imagine all the time and effort that is going into that! We have just ditched the work of 95 teams just to reward the 5 lucky ones that obtained the wrong result! This simply is crazy. It is even crazier when you add QRPs to the picture: following that suggestion, you just end up trying to reproduce the work of the sloppiest researchers. Sorry, I'd rather spend my life rigorously hunting down Null effects than chase the latest wild geese that some of my colleagues claim to have uncovered.

The existing solutions to the replication crisis


OK, so what should we do now? There are some solutions around that start being implemented. Here are some of the ones that I do not think solve the issue because they do not make the Null results part of the published record:
  1. Most referees of empirical research now require extensive robustness checks, that is, trying to obtain your results under various methods and specifications. This is in a sense a way to keep you from doing too much p-hacking, because a p-hacked result should disappear under this type of scrutiny. It is true that the most fragile p-hacked results might disappear, but some extreme results will remain, and some true results will disappear as well. Or maybe we will see more p-hacking and HARKing around the robustness checks. It is actually an interesting question for Brodeur et al: do papers including robustness checks exhibit fewer signs of p-hacking? Note also that this approach will not make the record include Null results, just a more selected set of significant ones. So, as a consequence, it does not address the main issue of publication bias.
  2. Some people have suggested decreasing the threshold for statistical significance from 5% to 0.5%. We will still miss the Null results we need.
  3. Some people have suggested the creation of a Journal of Null Results. Well, that's great in theory, but if Null results have no prestige or impact because people do not read them, or use them, or quote them, then no-one will send papers there. And we will miss the Null results.
  4. Some journals have started requiring that datasets be posted online along with the code. This will certainly prevent the most blatant QRPs and correct some coding mistakes, but it will not populate the empirical record with the Null results that we need in order to really be sure of what we know.

There are some solutions that are more effective because they make the track record of Null results accessible:
  1. The AEA recently created a registry of experiments. This is a great first step, because it obliges you to pre-register all your analyses before you do anything. No more QRPs with pre-registration. Pretty cool. Three problems with pre-registration, though. First, it is voluntary. To make it really work, we need both funders and journals to agree that pre-registration is a prerequisite before submitting a paper or spending money on experiments. Second, it does not solve publication bias. We are still going to see only the significant results get published. It helps a little, because now we theoretically know the universe of all studies that have been conducted, if pre-registration is made a prerequisite, and we can estimate how many of them have been published eventually. If it is only 5%, we should just be super cautious. If it is also mandatory to upload your results to the registry, we could collect all these unpublished Null results and get a truthful answer. But that's with a lot of caveats. Third, for the moment it does not apply to methods other than RCTs: natural experiments (DID, IV, RDD) and structural models are never pre-registered.
  2. An even better solution is that of registered reports. Registered reports are research proposals that are submitted to a journal and examined by referees before the experiment is run. Once the project is accepted, the journal commits to publishing the final paper whatever the results, significant or not. That's a great solution. Several journals in psychology now accept registered reports, along with one in economics. The problem is that it is not really adapted to some field research where things are decided in great urgency at the last minute (which happens a lot when working with non-research institutions).
  3. A great solution is simply to organize large replications of the most important results in the field. Psychologists have created the Psychological Science Accelerator (modeled after physicists' CERN) and the OSF, where hundreds of labs cooperate in choosing and conducting large, precise studies trying to replicate important results.
  4. A super cool solution is to start accumulating knowledge using meta-analysis. Meta-analyses simply aggregate the results of all the studies on the same topic. Meta-analyses can also be used to detect publication bias. First, you can compare the most precise results to the less precise ones. If the more precise results are smaller, that is a sign of publication bias. It does not mean that the most precise results are unbiased, though. We would need pre-registered replications to say that. Second, you can also examine the distribution of the p-values of statistically significant results (it's called p-curving). It should be decreasing if there is a true effect (a super cool result), flat under the Null, and increasing with QRPs, with a bulge just below significance thresholds. What if we p-curved the Brodeur et al data? We would probably just find huge signs of publication bias, because it is clear that there is a bulge, and the upward part of the distribution is compatible with the Null. It would be great if Abel and his coauthors could actually p-curve their data to confirm my suspicion.

What is the track record of these changes? Let's see what happened when doctors moved to registered reports: they went from 57% of trials reporting a significant effect (i.e. an effective drug) to 8%!



When psychologists tried to replicate huge swaths of published findings, they found that 60% of them did not replicate.


When economists did the same (for a very small number of studies), they found that 40% did not replicate, and effects were overestimated by a factor of 2! Below is a graph from a paper by Isaiah Andrews and Max Kasy, where they propose methods for detecting and correcting for publication bias and apply their tools to the data from this replication. The blue lines are the original estimates and the black lines the ones corrected for publication bias. You can see that originally borderline statistically significant results (with the confidence interval caressing the 0 line) actually contain zero, and that most of the presented results contain zero in their confidence interval, and are thus compatible with zero or very small effects.


When meta-analyzing economics literatures, John Ioannidis and his colleagues found huge signs of publication bias, with effects overestimated by a factor of 2 to 4! Below is a graph from their paper showing the distribution of the inflation factor due to publication bias. The mode of the distribution is 300% inflation, that is, effects that are 3 times too large. Note that this is a cautious estimate of the amount of publication bias in these literatures, since it uses the most precise estimates as an estimate of the truth, while they might very well be biased as well.


My proposal: the Social Science Knowledge Accumulation Initiative (SKY)


The problem with all these changes is that they are voluntary. That is, unless funders and journal editors decide to require pre-registration, for example, we will not see huge progress on that front. What I'm proposing is to use meta-analysis to organize the existing literature, to evaluate the quality of the published empirical record bit by bit, and to use it to discipline fields, journals, universities and researchers by giving them the incentives to adopt the best research practices. This will be the first job of SKY. We are going to provide a tool that makes doing and reporting the results of meta-analyses easier.

What is nice about meta-analyses is that they summarize what we know and help detect our likely biases. You can detect publication bias in a literature, obviously, but you can also apply p-curving to a journal, an editor, a university, a department, a team and even an individual researcher. I'm not saying that we should, but the mere fact that we could should push everyone into adopting better practices. Moreover, it seems very hard to game meta-analysis and p-curving, because the only way to game them seems to me to be to expressly cheat.

The application of meta-analysis and p-curving to whole literatures should curb the incentive for publication bias. Indeed, what will be judged and passed on to policy-makers and students are the results of meta-analyses, not the results of individual studies. Researchers should not compete to do meta-analyses; they should cooperate. Anyone having published a paper included in the meta-analysis should be included as an author. That way, they have an incentive to disclose everything they are doing (even Null results will be valued now), and incentives to control what their fellow researchers are doing, because their results are going to be on the paper as well.

Meta-analysis will also push us towards the normalization of methods, so that the quality of methods is controlled and standardized, in order for studies to be included in a meta-analysis. We also need replicators who redo the statistical analysis of each separate study. Great candidates are master's and PhD students in methods classes, under the supervision of a PI.

We finally need a tool to make research more open that is more versatile than pre-registered reports. I propose to make all of our lab notes open. That is, we will set up a researcher's log, where he reports everything that he is doing on a project. He will first start with the idea. Then he will report the results of the first analysis, or his pre-analysis plan, and his iterations. That does not obviate the need for pre-registered reports, but it works like a pre-registered report, only less rigid. If the lab notes are made open, you can receive comments from colleagues at all stages in order to improve your analysis. I know that researchers are afraid of having their ideas stolen, but this is a red herring (I have yet to meet someone to whom it has actually happened, though I have met dozens of researchers who have heard of someone to whom it has happened). What better way not to be beaten to the post than publicizing your idea on a public time-stamped forum? This will also be a very effective way to find co-authors.

Some last few thoughts in the form of an FAQ:
  1. Why not use the existing infrastructure developed in Psychology, such as the Accelerator? I see SKY as complementary to the Accelerator, as a way to summarize all research, including the research conducted on the Accelerator. The Accelerator is extraordinary for lab research. Some empirical economists do lab research (and we should probably do much more), but most of us do research using Field Experiments (FE, a.k.a. Randomized Controlled Trials (RCTs)), Natural Experiments (NE) or Structural Models (SM). The Accelerator is not well adapted to either FE or NE, because it requires a very strict adherence to a protocol and a level of coordination that is simply not attainable in the short run in economics. When running FE, you have to choose features in agreement with the local implementers, and most of the details are decided at the last minute. When using NE, we are dependent on the characteristics of the data and on the exact features of the Natural Experiment that we are exploiting, which are sometimes revealed only after digging deep into the data. My hope is that eventually, the coordination brought about by SKY around research questions will enable the gradual standardization of methods and questions that will enable us to run standardized FE and NE. SM generally intend to predict the consequences of policies or reforms. Their evaluation will entail pre-registration of the predictions before seeing the data, or access to the code so that anyone can alter the estimation and holdout samples.
  2. How does SKY contribute to the adoption of good research practices? My idea is that SKY will become the go-to site for anyone wanting to use results from economics research (and hopefully, eventually, any type of research). Every result will be summarized and vetted by entire communities on SKY, so that the best up-to-date evidence will be there. Funders, policy-makers, journalists, students, universities and fellow researchers will come to SKY to learn the agreed-upon consensus on a question, whether the evidence is sound, what the unanswered questions are, etc. Fields with a track record of publication bias will risk losing funding, positions and careers, so that they will be incentivized to adopt good research practices. Journals, facing the risk of their track record of publication bias being exposed, should start accepting Null results or risk losing readership because of a bad replication index.
  3. How would SKY edit knowledge? My idea is that whole communities should vet the meta-analyses that concern them. In practice, that means that the plan for the meta-analysis will be pre-registered on SKY, the code and data will be uploaded as well, and an interface for discussion will be in place. I imagine that interface to be close to the contributors' forum of a Wikipedia page, but with GitHub-like facilities, so that one can easily access the changes implied by an alternative analysis and code. The discussion among the contributors will go on as long as necessary. The editor in charge of the meta-analysis will be responsible for the final product, but it might be subject to changes, depending on the results of the discussion.
  4. Do you intend SKY only for economics research or for all types of social science research? To me, the social or behavioral sciences are a whole: they are the sciences that try to explain human behavior. SKY will be open to everyone from psychology, economics, medicine, political science, anthropology and sociology, as long as the research uses quantitative empirical methods. It might be great eventually to add a qualitative research component to SKY, and a users' section for policy-makers, for example, where they could submit topics of interest to them.
  5. What about methodological research and standardization of empirical practice on SKY? Yep, that will be up there. Any new method will be vetted independently on simulated data, mostly by PhD students as part of their methods classes. I'll start by providing my own methods class with its simulations, but then anyone proposing that a new method be added to SKY will provide simulations and code so that it can be vetted. We will also provide the most advanced set of guidelines possible in view of the accumulated methodological knowledge.
  6. Once we have SKY, do we really need journals? Actually, I'm not sure. Journals edit content, providing you with the most important up-to-date results according to a team of editors. SKY does better: it gives you the accumulated knowledge of whole research communities. Why would you care for the latest flashy results, whose contribution to the literature will be to push the average meta-analyzed effect in a certain direction by a very limited amount? Of course, super novel results will open up new interpretations, or connect dots between different strands of the literatures, or identify regularities (sets of conditions where effects differ) and confirm them experimentally. But the incentive of researchers will be to report these results in the place where everyone is looking: SKY. Why run the risk of having one's impact limited by paywalls?
  7. How do you measure the contributions of scientists and allocate research funds in the absence of the hierarchies accompanying journals? That's the best part: SKY will make this so easy. Researchers with a clean record contributing a lot to a meta-analyzed effect (by conducting a lot of studies or a large one) will be rewarded. Researchers opening up new fields or finding new connections and/or regularities will be easy to spot. Funding will be made easy as well: just look for the most important unanswered questions on SKY. Actually, researchers should after some time be able to coordinate within a community in order to launch large projects to tackle important questions in large teams, just like physicists do.
  8. Why are you, Sylvain, doing that? While the vision for SKY comes from Chris Chambers' book on the 7 sins of psychological research, I feel this is the most important thing that I can do with my life right now. My scientific contributions will always be parasitized and made almost inconsequential in the current system: (i) I cannot identify important questions because I do not know what we know; (ii) I cannot make my contributions known rigorously because Null results get ditched (there are so many more useful things I could be doing with my life right now, and apply my brains, motivation and energy to, rather than publishing biased results and contributing to polluting the published empirical record); (iii) Organizing the published empirical knowledge has a much higher impact on society than conducting another study, because the sample size of a literature is much larger than that of a given study, so that the precision and impact of the accumulated knowledge will be huge; (iv) I love science, and always have since I was a kid, and I know that it has the power to make the world a better place. I want to contribute by nudging science in the right direction to achieve just that. There are so many useful theories and empirical facts in economics and social science that could be of use to everyone if only they had been properly vetted and organized; (v) I am lucky enough to have an employer that really cares about the public good and that supports me 100%. Lucky me :)
  9. How can I contribute to SKY? Well, drop me an email or a DM on Twitter if you want to get involved. We are just starting to look at the solutions and goals, so now is the best time to make a lasting impact on the infrastructure. We will need every single one of your contributions.

With SKY, I want to achieve what Paul Krugman asks economists to do in a recent post: "The important thing is to be aware of what we do know, and why." Couldn't agree more. Let's get to work ;)





Thursday, June 7, 2018

Why p-values are Bad for Science

This post is the fifth in a series of seven posts in which I am arguing against the use of p-values for reporting the results of statistical analysis. You can find a summary of my argument and links to the other posts in the first post of the series. In this post, I present why I think p-values are bad for science.

The problems that I have with p-values and NHST are not minor quibbles. They unfortunately cut to the core of scientific practice. They affect the quality of reported and published scientific evidence, they perturb in major ways the process of accumulation of knowledge and eventually they might undermine the very credibility of our scientific results.

I can see two major detrimental consequences to the use of p-values and NHST:
  1. Publication bias: published results overestimate the true effects, by a large margin, with published results being 1.5 to 2 times larger than the true effects on average.
  2. Imprecision: published results have low precision, with a median signal to noise ratio of 0.26.

Publication bias


Publication bias operates in the following way: if editors decide to publish only statistically significant results, then the record of published results will be an overestimate of the true effect, as we have seen in the previous post of the series. If a true effect is small and positive, published results will overestimate it.

This means that if the true effect is nonexistent, only the positive and negative studies finding that it exists and is large will be published. We will get either conflicting results or, if researchers favor one direction of the effect, we might end up with evidence for an effect that is not there.

There are actually different ways publication bias can be generated:
  • Teams of researchers compete for publication based on independent samples from the same population. If 100 teams compete, on average, 5 of them will find significant results even if the true effect is non-existent. Depending on the proportion of true effects that there is to discover, it might imply that most published research findings are false.
  • Specification search is another way a team can generate statistically significant results: for example, by choosing to stop collecting new data once the desired significance is reached, adding control variables or changing the functional form. This does not have to be conscious fraud, but simply a result of the multiple degrees of freedom that researchers have, which generate what Andrew Gelman and Eric Loken call "the garden of forking paths." In a pathbreaking paper in psychology, Joseph Simmons, Leif Nelson and Uri Simonsohn showed that by leveraging these degrees of freedom, it is very easy to obtain any type of result, even that listening to a given song decreases people's age.
  • Conflicts of interest, such as in medical sciences where labs have to show efficiency of drugs, might generate a file drawer problem, where insignificant or negative results do not get published.
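The arithmetic behind the first mechanism is easy to check by simulation. Below is a minimal sketch (the sample size, seed and test are arbitrary choices of mine): 100 teams each test a true effect of exactly zero, and roughly 5 of them end up with a statistically significant estimate anyway.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
Z_CRIT = NormalDist().inv_cdf(0.975)  # 1.96, the two-sided 5% threshold

def team_finds_significant(n=100):
    """One team samples n observations from a population where the
    true effect is exactly zero, then runs a two-sided z-test."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    se = stdev(sample) / n ** 0.5
    return abs(mean(sample) / se) > Z_CRIT

significant_teams = sum(team_finds_significant() for _ in range(100))
print(significant_teams)  # close to 5: pure luck, yet publishable
```

If only those lucky teams publish, the published record for this literature reports a nonexistent effect with full statistical confidence.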
Do we have evidence that publication bias exists? Unfortunately yes, massive, uncontroversial evidence. The evidence on publication bias comes from replication attempts and meta-analyses.

Evidence of publication bias from replications


A replication consists in conducting a study similar to a published one, but on a larger sample, in order to increase precision and decrease sampling noise. After the replication crisis erupted in their field, psychologists decided to conduct replication studies. The Open Science Collaboration published in Science in 2015 the results of 100 replication attempts. What they found was nothing short of devastating (my emphasis):
The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size.
Here is a beautiful graph summarizing the terrible results of the study:
The results in red are results that were statistically significant in the original study but that were not statistically significant in the replication.

It seems that classical results in psychology such as priming and ego depletion do not replicate well, at least under some forms. The replication of the ego depletion original study was a so-called multi-lab study where several labs ran a similar protocol and gathered their results. Here is the beautiful graph summarizing the results of the multi-lab replication with the original result on top:

The original study was basically an extreme observation drawn from a distribution centered at zero, or very close to zero, a clear sign of publication bias at play.

What about replication in economics? Well, there are several types of replications that you can run in economics. First, for Randomized Controlled Trials (RCTs), either run in the lab or in the field, you can, as in psychology, run another set of experiments similar to the original. Colin Camerer and some colleagues did just that for 18 experimental results. Their results were published in Science in 2016 (emphasis is mine):
We find a significant effect in the same direction as the original study for 11 replications (61%); on average the replicated effect size is 66% of the original.
So, experimental economics suffers from some degree of publication bias as well, although apparently slightly smaller than in psychology. Note however that the number of replications attempted is much smaller in economics, so that things may get worse with more precision.

I am not aware of any replication in economics of the results of field experiments, but I'd be glad to update the post after being pointed to studies that I'm unaware of.

Other types of replication concern non-experimental studies. In that case, replication could mean redoing the same analysis with the same data, in search of a coding error or of debatable modeling choices. What I have in mind would rather be trying to replicate a result by looking for it either with another method or in different data. In my own work, we are conducting a study using DID and another study using a discontinuity design in order to cross-check our results. I am not aware of attempts to summarize the results of such replications, if they have been conducted. Apart from cases reported in the same paper, I am not aware of researchers trying to actively replicate the results of quasi-experimental studies with another method. Again, it might be that my knowledge of the literature is wanting.

Evidence of publication bias from meta-analysis

 

Meta-analyses are analyses that regroup the results of all the studies reporting measurements of a similar effect. The graph just above, stemming from the multi-lab ego-depletion study, is a classical meta-analysis graph. At the bottom, it presents the average effect taken over studies, weighted by the precision of each study. What is nice about this average effect is that it is not affected by publication bias, since the results of all studies are presented. In order to guarantee that there is no selective publication, the authors of the multi-lab study preregistered all their experiments and committed to communicating the results of all of them.
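The precision-weighted average at the bottom of such a graph is just an inverse-variance weighted mean. Here is a minimal sketch of a fixed-effect meta-analysis; the estimates and standard errors are made up for illustration, not taken from any actual study.

```python
# Minimal fixed-effect meta-analysis: inverse-variance weighted average.
# The estimates and standard errors below are invented for illustration.
estimates = [0.30, 0.10, 0.05, 0.22]
std_errors = [0.15, 0.05, 0.04, 0.10]

weights = [1 / se ** 2 for se in std_errors]            # precision weights
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5                   # SE of pooled effect

print(f"pooled effect = {pooled:.3f} (SE = {pooled_se:.3f})")
# -> pooled effect = 0.091 (SE = 0.029)
```

Note how the pooled effect is pulled towards the small, precise estimates and is itself far more precise than any single study: that is the accumulation of knowledge at work.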

But absent a replication study with pre-registration, estimates stemming from the literature might be affected by publication bias, and the weighted average of the published impacts might overestimate the true impact. How can we detect whether this is the case?

There are several ways to detect publication bias using meta-analysis. One approach is to look for bumps in the distribution of p-values or of test statistics around the conventional significance thresholds of 0.05 or 1.96. If we see an excess mass above the significance threshold, that would be a clear sign of missing studies with insignificant results, or of specification search transforming p-values from 0.06 into 0.05. Abel Brodeur, Mathias LĂ©, Marc Sangnier and Yanos Zylberberg plot the t-statistics for all empirical papers published in top journals in economics between 2005 and 2011:
There is a clear bump in the distribution of test statistics above 1.96 and a clear trough below, indicative that some publication bias is going on.


The most classical approach to detecting publication bias using meta-analysis is to draw a funnel plot. A funnel plot is a graph that relates the size of the estimated effect to its precision (e.g. the inverse of its standard error). As publication bias is more likely to happen with imprecise results, a deficit of small imprecise results is indicative of publication bias. The first of the three plots below, on the left, shows a regular funnel plot, where the distribution of results is symmetric around the most precise effect (ISIS-2). The two other plots are irregular, showing clear holes at the bottom and to the right of the most precise effect, precisely where small imprecise results should be in the absence of publication bias.


 




More rigorous tests can supplement the eyeball examination of the funnel plot. For example, one can regress the effect size on precision or on sampling noise. A precisely estimated nonzero correlation would signal publication bias. Isaiah Andrews and Maximilian Kasy extend these types of tests to more general settings and apply them to the literature on the impact of the minimum wage on employment, finding, as previous meta-analyses already did, evidence of some publication bias in favor of a negative employment effect of the minimum wage.
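One simple version of such a regression test (in the spirit of Egger's test; the simulated literature below is my own toy example, not real data) regresses published effect sizes on their standard errors. With a true effect of zero and a significance filter on publication, the slope comes out clearly positive:

```python
import random
from statistics import NormalDist

random.seed(2)
Z_CRIT = NormalDist().inv_cdf(0.975)  # 1.96

# Simulate a literature with a true effect of zero where only
# statistically significant estimates get published.
published = []
while len(published) < 200:
    se = random.uniform(0.05, 0.5)   # studies vary in precision
    est = random.gauss(0, se)        # the true effect is zero
    if abs(est / se) > Z_CRIT:       # the publication filter
        published.append((est, se))

# OLS slope of |effect| on SE: positive under publication bias,
# because imprecise studies need bigger effects to clear the filter.
xs = [se for _, se in published]
ys = [abs(est) for est, _ in published]
xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
print(f"slope = {slope:.2f}")  # clearly positive: larger SE, larger effect
```

In an unbiased literature, effect sizes and standard errors are uncorrelated and the slope hovers around zero.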

Another approach to the detection of publication bias in meta-analysis is to compare the most precise effects to the less precise ones. If there is publication bias, the most precise effects should be closer to the truth and smaller than the less precise effects, which would give an indication of the magnitude of publication bias. In a recent paper, John Ioannidis, T. D. Stanley and Hristos Doucouliagos estimate the size of publication bias for:
159 empirical economics literatures that draw upon 64,076 estimates of economic parameters reported in more than 6,700 empirical studies.
They find that (emphasis is mine):
a simple weighted average of those reported results that are adequately powered (power ≥ 80%) reveals that nearly 80% of the reported effects in these empirical economics literatures are exaggerated; typically, by a factor of two and with one‐third inflated by a factor of four or more.

Still another approach is to draw a p-curve, a plot of the distribution of statistically significant p-values, as proposed by Joseph Simmons, Leif Nelson and Uri Simonsohn. The idea of p-curving is that if there is a real effect, the distribution of significant p-values should lean towards small values, because they are much more likely than large values close to the 5% significance threshold. Remember the following plot from my previous post on significance testing:
Most of the p-values should be smaller than 0.05 if the true distribution is the black one, since most samples will produce estimates located to the right of the significance threshold. If there is no real effect, on the contrary, the distribution of p-values is going to be flat, by construction. Indeed, the probability of observing a p-value of 5% or less is 5%, while the probability of observing a p-value of 4% or less is 4%; hence, the probability of observing a p-value between 4% and 5% is exactly 1% in the absence of any real effect, and it is equal to the probability of observing a p-value between 3% and 4%, and so on. When applied to the (in)famous "Power Pose" literature, the p-curve is flat:


The flat p-curve suggests that there probably is no real effect of power pose, at least on hormone levels, which has also been confirmed by a failed replication. Results from a more recent p-curve study of Power Pose claim evidence of real effects, but Simmons and Simonsohn have raised serious doubts about the study.

There is to my knowledge no application of p-curving to empirical results in economics.
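The flat-versus-decreasing logic of the p-curve is easy to verify by simulation. A sketch (my own toy setup: estimates are drawn in standard-error units, tests are two-sided z-tests, and 2.8 corresponds to a true effect detected with roughly 80% power):

```python
import random
from statistics import NormalDist

random.seed(3)
nd = NormalDist()

def p_curve_bins(true_effect, n_draws=100_000):
    """Count significant p-values falling in (0, 0.01] vs (0.04, 0.05]."""
    small, large = 0, 0
    for _ in range(n_draws):
        z = random.gauss(true_effect, 1)   # estimate in SE units
        p = 2 * (1 - nd.cdf(abs(z)))       # two-sided p-value
        if p <= 0.01:
            small += 1
        elif 0.04 < p <= 0.05:
            large += 1
    return small, large

print(p_curve_bins(0.0))  # both bins roughly equal: flat curve under the Null
print(p_curve_bins(2.8))  # far more p <= 0.01: decreasing curve, real effect
```

With no effect, each 1%-wide bin captures about 1% of the draws, so the curve is flat; with a real effect, very small p-values dominate.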

Underpowered studies


Another problem with p-values and Null Hypothesis Significance Testing (NHST) is that they are used to perform a power analysis in order to select the adequate sample size before running a study. The usual practice is to choose sample size so as to have 80% chance of detecting an effect at least equal to a minimum a priori postulated magnitude (a.k.a. the minimum detectable effect) using NHST with a p-value of 5%.
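In code, the usual calculation looks like this (a standard two-arm formula for comparing means; the specific numbers are only illustrative):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(mde, power=0.80, alpha=0.05, sd=1.0):
    """Sample size per arm of a two-arm trial to detect a difference in
    means of at least `mde`, with the given power, two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * (sd * (z_alpha + z_beta) / mde) ** 2)

print(n_per_arm(0.2))  # 393 subjects per arm for a "small" effect (d = 0.2)
```

Note how the whole exercise is organized around clearing the significance threshold with a given probability, not around making the resulting estimate precise.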

The problem with this approach is that it focuses on p-values and test statistics and not on precision or sampling noise. As a result, the estimates obtained using classical power analysis are not very precise. One can actually show that the corresponding signal to noise ratio is equal to 0.71, meaning that noise is still roughly 40% bigger than signal.

With power analysis, there is no incentive to collect precise estimates by using large samples. As a consequence, the precision of results published in the behavioral sciences has not increased over time. Here is a plot of the power to detect small effect sizes (Cohen's d=0.2) for 44 reviews of papers published in journals in the social and behavioral sciences between 1960 and 2011 collected by Paul Smaldino and Richard McElreath:


There is not only very low power (mean=0.24) but also no increase over time in power, and thus no increase in precision. Note also that the figure shows that the actual realized power is much smaller than the postulated 80%. This might be because no adequate power analysis was conducted in order to help select the sample size for these studies, or because the authors selected medium or large effects as their minimum detectable effects. Whether we should expect small, medium or large effects in the social sciences depends on the type of treatment. But small effects are already pretty big: they are as large as 20% of the standard deviation of the outcome under study.

John Ioannidis and his coauthors estimate that the median power in empirical economics is 18%, which implies a signal to noise ratio of 0.26, meaning that the median result in economics contains four times more noise than it has signal.
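Under one reading of these signal-to-noise figures (my assumption, not spelled out in the original posts: signal is the true effect implied by the power level, in standard-error units, and noise is the full width of the 95% confidence interval), both numbers can be recovered from the quoted power levels:

```python
from statistics import NormalDist

def implied_snr(power, alpha=0.05):
    """Effect size implied by a given power, in standard-error units,
    divided by the full width of the 95% CI (2 x 1.96 SE).
    This definition of SNR is my reconstruction, an assumption."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96
    effect = z_alpha + NormalDist().inv_cdf(power)  # implied effect / SE
    return effect / (2 * z_alpha)

print(round(implied_snr(0.80), 2))  # 0.71: the classical 80%-power target
print(round(implied_snr(0.18), 2))  # about 0.27, close to the 0.26 quoted
```

Either way, the qualitative conclusion stands: even a study designed to the classical 80% standard carries more noise than signal, and the median economics study carries roughly four times more.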