
Jörg Peters

A Replication Crisis in the Making?

Jörg Peters on the lack of replicability of many publications in economics, the role of p-hacking and publication pressure, and reasons for cautious optimism in considering these issues
22 March 2021

Revised & extended crosspost of „Wissenschaft oder Fiktion?“, originally published in Frankfurter Allgemeine Zeitung. 

The new book Science Fictions by Scottish psychologist Stuart Ritchie of King’s College London paints a bleak picture of science. It is a polemic, but economists should take it seriously. Not only does it provide substantial evidence of systematic errors in science, but it also skillfully describes the underlying incentive problems in the system that are indeed particularly pronounced in economics.


Ritchie clarifies at the outset that he has set out “to praise science, not to bury it.” That is important in times when consolidated scientific evidence is rightly guiding policy while being challenged by people with dangerous political agendas. Indeed, Ritchie’s critique is concerned with that very word: consolidated. An empirical result is considered consolidated, and not a chance finding, if and only if it can be replicated. Replicability can refer to a re-analysis using the same data set, but particularly for studies with small sample sizes, it also means that a result must be observable not just once in a specific setting, but should be demonstrable under different conditions. If political decisions are based on research results that cannot be replicated, they go astray. This costs taxpayers’ money or, worse, human lives. Ritchie bases his argument on the observation that the replicability rate of influential scientific results is appallingly low, and that this is well known among experts. At the same time, critics accuse him of playing into the hands of climate sceptics. That critique falls short, because anthropogenic climate change is precisely what, according to Ritchie, other parts of empirical science are not: consolidated. How big the problems really are across disciplines is an open question. For economics, though, there are worrying signs of a replication crisis in the making (see e.g. Brodeur et al. 2020, Camerer et al. 2016, Ferraro & Shukla 2020, Ioannidis et al. 2017).

Ritchie’s book is an eloquent, sometimes sarcastic, but always analytical description of the scientific process. He relentlessly describes the allocation of research funds and academic positions, the publication process including peer review, and science communication for what they are: a man-made and therefore error-prone system. Science is a social construct – and so people and their vanities, their will to survive and their careerism, their hubris and their networks play an important role. Factoring this in is more important than ever in view of the growing societal influence of science.

In recent decades, economics has become increasingly empirical (Angrist et al. 2017). Empirical results are often interpreted as facts in the media and are also presented as such by scientists themselves. However, they arise from the social construct described above, which makes them much more prone to error than is usually portrayed. These errors happen systematically, because people, including researchers, prefer the spectacular to the unspectacular. Suppose I am studying the effects of air pollution on respiratory diseases – it is certainly more interesting to find an effect than not to find one. The mechanisms by which the scientific social construct turns this search for the spectacular into a systematic error are called p-hacking and publication bias. In short, publication bias and p-hacking reduce to absurdity the very statistical methods of which quantitative research is so proud. To understand these two mechanisms, we need to take a closer look at the publication process and at empirical work in practice.

What is publication bias?

Data availability has grown enormously over the past decades – not only through Big Data at tech companies, but also in socio-economic data sets of unprecedented scale. This is important because statistical methods can only establish with a certain probability that a correlation found in such data sets is not due to chance. That is, there is always a residual probability that the correlation occurred by chance and one might mistakenly believe a particular result to be true. Such a result would not be replicable and thus worthless. This residual probability is usually set at 5% by convention. So if I test my air pollution hypothesis in one data set while 99 other researchers test it in parallel in 99 other data sets, about five of us will find a significant correlation – even if it does not exist in reality. If all 100 attempts were published, there would be no problem: other scientists could then correctly recognize that the five “successful” attempts are due to chance. However, not all results are published. Peer reviewers and journal editors find the five significant attempts more interesting and will tend to publish those and reject the non-significant ones. The resulting published literature then presents a false picture of reality.
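To make the arithmetic concrete, here is a minimal simulation sketch in Python (hypothetical numbers and variable names, not taken from the article or from any of the cited studies): 100 research teams each test a true null effect of air pollution at the conventional 5% level, and only the “significant” estimates make it into print.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n_obs = 100, 200

all_slopes, published_slopes = [], []
for _ in range(n_studies):
    # True state of the world: air pollution has NO effect on the outcome.
    pollution = rng.normal(size=n_obs)
    outcome = rng.normal(size=n_obs)  # drawn independently of pollution
    result = stats.linregress(pollution, outcome)
    all_slopes.append(result.slope)
    if result.pvalue < 0.05:  # only "significant" studies get published
        published_slopes.append(result.slope)

print(f"Studies run:       {n_studies}")
print(f"Studies published: {len(published_slopes)} (every one a false positive)")
if published_slopes:
    print(f"Mean |effect| in the published literature: "
          f"{np.mean(np.abs(published_slopes)):.3f}")
print(f"Mean effect across all {n_studies} studies: {np.mean(all_slopes):.3f}")
```

On a typical run, around five of the 100 studies clear the 5% threshold; a literature consisting only of those five suggests a sizeable effect where, by construction, none exists.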

What is p-hacking? 

Related to this is p-hacking. It denotes what all empirical researchers know: unspectacular results can be made more spectacular. This is done by subtly or crudely altering the data analysis to improve the statistical significance level, expressed in terms of the p-value. In economics, p-hacking is especially problematic because the share of studies based on secondary data is much higher than in disciplines like psychology or medical research, where larger parts of the literature rest on laboratory experiments or Randomized Controlled Trials (RCTs). p-hacking does not necessarily involve outright fraud. Rather, any empirical investigation involves dozens, if not hundreds, of micro-decisions. These start with very fundamental choices, such as whether to study the effect of air pollution on respiratory disease or its effect on cardiovascular disease. If I see an association in the latter but not in the former, I pursue the cardiovascular hypothesis and drop the respiratory one. If there is no association there either, I try air pollution and headaches, or other medical conditions. At some point I will find a significant relationship, simply because of the 5% error probability. Such an approach is legitimate if all attempts are documented and published. But they usually are not, and so a false picture of reality is created once again. The micro-decisions reach further still, into choices that are hardly noticeable even to the researcher herself: how to measure air pollution in the first place, how to clean the raw data, which econometric specification to use. In many of these steps there is no single obviously right decision, leaving leeway to influence the results (Huntington-Klein et al. 2020, Simonsohn et al. 2020). The pressure to use that leeway in the direction of more interesting results is high.
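A similarly minimal sketch (again hypothetical, with made-up outcome names) illustrates the multiple-testing side of p-hacking: trying five unrelated outcomes at the 5% level already pushes the chance of at least one spurious “finding” to 1 - 0.95^5, roughly 23%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_obs = 500
outcomes = ["respiratory disease", "cardiovascular disease",
            "headaches", "sleep problems", "skin irritation"]

pollution = rng.normal(size=n_obs)
p_values = {}
for name in outcomes:
    # True state of the world: none of these outcomes depends on pollution.
    health_outcome = rng.normal(size=n_obs)
    p_values[name] = stats.linregress(pollution, health_outcome).pvalue

significant = {name: round(p, 3) for name, p in p_values.items() if p < 0.05}
print(f"Chance of at least one false positive with {len(outcomes)} tests: "
      f"{1 - 0.95 ** len(outcomes):.0%}")
print("Spuriously 'significant' outcomes in this run:", significant or "none")
```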

Publication pressure 

At the same time, academic careers in economics depend on journal publications. Especially in the early stages of a career, it is publish or perish. Those who publish well get good jobs, good research grants, and eventually tenure. Those who don’t disappear. The incentives are clear. In economics, this selection process is particularly harsh in that scientific performance is evaluated on the basis of a journal ranking that already drops off very steeply between the so-called top 5 and the top field journals, and even more steeply after that. Hence, a large proportion of journals are in fact irrelevant in terms of careers. At the same time, it is naive to expect that peer review could effectively ensure quality by catching p-hacking and those micro-decisions. If anything, due to publication bias and the reviewers’ penchant for spectacular results, peer review is even part of the problem.

How serious are these problems?

None of this is new. The renowned Stanford statistician and epidemiologist John Ioannidis published a much-cited paper as early as 2005 that summarizes Ritchie’s points in a nutshell (Ioannidis 2005). Ioannidis has also been involved in various studies that empirically demonstrate these errors in the scientific system – including in economics (Ioannidis et al. 2017). Criticism of the publication system, of the obsession with certain methods, and of the ‘social construct’ has also mounted from within the economics profession recently, including from Nobel laureates such as Angus Deaton (2020), James Heckman (Heckman and Moktan 2020), and George Akerlof (2020), and in the much-noted blog post “Economics is a disgrace” by Claudia Sahm. Numerous recent review studies point to considerable publication bias and p-hacking in the economics literature, as well as to systematic errors in influential publications and the widespread use of questionable or erroneous research practices, including in high-quality journals (Brodeur et al. 2016, Brodeur et al. 2020, Camerer et al. 2016, Dahal and Fiala 2020, Ferraro & Shukla 2020, Gallen 2020, Ioannidis et al. 2017, Kelly 2020, Mellon 2020, Young 2019).

There’s progress on transparency in economics, but…

In his book, Ritchie proposes reforms of the publication process that essentially rely on more transparency to trigger a cultural change. The good news is that these ideas are also well known in economics (Christensen and Miguel 2018, Miguel et al. 2014). Spearheaded by the Berkeley Initiative for Transparency in the Social Sciences and a few similar initiatives, important steps have been made towards pre-specification and pre-registration of research questions in pre-analysis plans (PAPs) to prevent p-hacking and publication bias. Many high-quality journals in economics pursue ambitious policies to make data available for reanalysis – and indeed, posting data online has increased sharply in recent years (Vilhuber 2020). Data availability is also the main component of a generally positive transparency trend documented in Christensen et al. (2020). However, PAPs are the norm only for RCTs (Christensen & Miguel 2018), a method that – by design – is already less prone to p-hacking and probably also to publication bias. Only a fraction of studies based on secondary data is pre-specified (Ofosu & Posner 2019).

…organized scepticism must become the norm  

Moreover, these positive transparency steps will only lead to more reliable empirical results if the economics profession develops a norm of replication, post-publication discussion based on re-analyses, and meta-science – in brief, organized scepticism, to use a term coined by Robert Merton in 1942 (Merton 1973). The trend towards PAPs is still associated with uncertainties and considerable leeway for researchers (Banerjee et al. 2020), so that even for RCTs it is as yet unclear to what extent it will hamper questionable research practices. Ofosu and Posner (2019) hence emphasize that “PAPs are unlikely to enhance research credibility without vigorous policing”. The same policing culture is required to unlock the potential of data transparency policies. Only if published papers are reanalysed will p-hacking and other questionable practices effectively be disincentivized. Re-analysis and post-publication discussion entail more than push-button replications that reproduce point estimates; oftentimes they also require qualitative deliberation, drawing on interdisciplinary expertise, about specification choices, the selection of outcome variables or, to stay with the air pollution example, which spirometer was used to measure pulmonary function.
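To give a flavour of what such a re-analysis could look like in practice, the following sketch (hypothetical data and variable names, using the Python statsmodels library rather than any particular study’s code) re-estimates the same question under several defensible specifications and reports the spread of estimates instead of a single point estimate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "pollution": rng.normal(size=n),
    "age": rng.integers(20, 80, size=n).astype(float),
    "smoker": rng.integers(0, 2, size=n).astype(float),
})
# Simulated outcome with a modest true pollution effect of -0.10.
df["lung_function"] = (-0.10 * df["pollution"] - 0.01 * df["age"]
                       - 0.20 * df["smoker"] + rng.normal(size=n))

# Alternative, equally defensible specifications a re-analysis might probe.
specifications = [
    "lung_function ~ pollution",
    "lung_function ~ pollution + age",
    "lung_function ~ pollution + age + smoker",
]
for spec in specifications:
    fit = smf.ols(spec, data=df).fit()
    print(f"{spec:45s} beta = {fit.params['pollution']:+.3f}, "
          f"p = {fit.pvalues['pollution']:.3f}")
```

A published paper that reports only the most favourable of such specifications is exactly what this kind of post-publication scrutiny is meant to catch.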

Replication in different settings is also key to establishing external validity, which is often limited in individual studies, especially but not only for RCTs (Esterling et al. 2021, Muller 2015, Peters et al. 2018, Vivalt 2020). Views differ on the extent to which replications of any kind are already being conducted. While some reviews observe decent or modest rates of published papers being replicated (Berry et al. 2017, Sukhtankar 2017), others diagnose negligible rates (Duvendack et al. 2017, Mueller-Langer et al. 2019).

It is probably undisputed, though, that replications are not what the culture in economics rewards. Why would we expect a young scholar to invest scarce research time in tracking PAPs or reproducing published papers if it doesn’t pay off for her career? Fecher et al. (2016) suggest including replications in PhD curricula and, like Ferraro & Shukla (2020), recommend clearer rewards for conducting replications. Transparent standards for when journals publish replications would be an important step. In any case, it seems this self-correction process still has a long way to go in economics. In the meantime, Ritchie suggests that policymakers and the public should not take individual scientific findings as incontrovertible truths – which also holds for economics. After all, analysing the data that depict the world is complex and prone to error, and it is carried out by people who have their own viewpoints and interests. Internalizing this should generally be part of economics expertise, as it is in other disciplines. It ultimately strengthens resilience, Ritchie argues, to the perfidious forces of science denialism.

Author info

Jörg Peters heads the research group “Climate Change in Developing Countries” at RWI and is a professor at the University of Passau. His research focuses on environmental, energy and development economics. In this context, he leads several projects in various African countries dealing with infrastructure development, climate policy and the diffusion of new technologies. He studied economics and statistics in Cologne and Paris and received his PhD from the Ruhr University in Bochum. His research has been published in leading journals, including the Journal of Health Economics, Nature Energy, and the World Bank Economic Review.

Digital Object Identifier (DOI)

https://doi.org/10.5281/zenodo.46234

Cite as

Peters, J. (2021). Empirical economics: A replication crisis in the making? Elephant in the Lab. DOI: https://doi.org/10.5281/zenodo.46234

References


Akerlof, G. A. (2020). Sins of Omission and the Practice of Economics. Journal of Economic Literature, 58(2), 405-18.

Angrist, J., Azoulay, P., Ellison, G., Hill, R., & Lu, S. F. (2017). Economic research evolves: Fields and styles. American Economic Review, 107(5), 293-97.

Berry, J., Coffman, L. C., Hanley, D., Gihleb, R., & Wilson, A. J. (2017). Assessing the rate of replication in economics. American Economic Review, 107(5), 27-31.

Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2016). Star wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1-32.

Brodeur, A., Cook, N., & Heyes, A. (2020). Methods matter: P-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11), 3634-60.

Camerer, C. F., Dreber, A., Forsell, E., Ho, T. H., Huber, J., Johannesson, M., … & Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433-1436.

Christensen, G., & Miguel, E. (2018). Transparency, reproducibility, and the credibility of economics research. Journal of Economic Literature, 56(3), 920-80.

Christensen, G., Wang, Z., Levy Paluck, E., Swanson, N., Birke, D., Miguel, E., & Littman, R. (2020). Open science practices are on the rise: The state of social science (3S) survey.

Dahal, M., & Fiala, N. (2020). What do we know about the impact of microfinance? The problems of statistical power and precision. World Development, 128, 104773.

Deaton, A. (2020). Randomization in the tropics revisited: a theme and eleven variations. National Bureau of Economic Research.

Duvendack, M., Palmer-Jones, R., & Reed, W. R. (2017). What is meant by “replication” and why does it encounter resistance in economics? American Economic Review, 107(5), 46-51.

Esterling, K., Brady, D., & Schwitzgebel, E. (2021). The Necessity of Construct and External Validity for Generalized Causal Claims. OSF Preprints, 27.

Fecher, B., Fräßdorf, M., & Wagner, G. G. (2016). Perceptions and practices of replication by social and behavioral scientists: Making replications a mandatory element of curricula would be useful.

Ferraro, P. J., & Shukla, P. (2020). Feature—Is a Replicability Crisis on the Horizon for Environmental and Resource Economics? Review of Environmental Economics and Policy, 14(2), 339-351.

Gallen, T. (2020). Broken Instruments. Available at SSRN 3671850.

Heckman, J. J., & Moktan, S. (2020). Publishing and promotion in economics: the tyranny of the top five. Journal of Economic Literature, 58(2), 419-70.

Huntington-Klein, N., Arenas, A., Beam, E., Bertoni, M., Bloem, J., Burli, P. H., … & Stopnitzky, Y. (2020). The Influence of Hidden Researcher Decisions in Applied Microeconomics.

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS medicine, 2(8), e124.

Ioannidis, J., Stanley, T. D., & Doucouliagos, H. (2017). The Power of Bias in Economics Research. Economic Journal, 127(605).

Kelly, M. (2020). Understanding persistence.

Mellon, J. (2020). Rain, Rain, Go away: 137 potential exclusion-restriction violations for studies using weather as an instrumental variable. Available at SSRN.

Merton, R. K. (1973). The sociology of science: Theoretical and empirical investigations. University of Chicago Press.

Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K. M., Gerber, A., … & Van der Laan, M. (2014). Promoting transparency in social science research. Science, 343(6166), 30-31.

Mueller-Langer, F., Fecher, B., Harhoff, D., & Wagner, G. G. (2019). Replication studies in economics – How many and which papers are chosen for replication, and why? Research Policy, 48(1), 62-83.

Muller, S. M. (2015). Causal interaction and external validity: Obstacles to the policy relevance of randomized evaluations. The World Bank Economic Review, 29(suppl_1), S217-S225.

Ofosu, G., & Posner, D. N. (2019). Pre-analysis Plans: A Stocktaking.

Peters, J., Langbein, J., & Roberts, G. (2018). Generalization in the tropics–development policy, randomized controlled trials, and external validity. World Bank Research Observer, 33(1), 34-64.

Ritchie, S. (2020). Science fictions: Exposing fraud, bias, negligence and hype in science. Random House.

Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification curve analysis. Nature Human Behaviour, 4(11), 1208-1214.

Sukhtankar, S. (2017). Replications in Development Economics. American Economic Review, 107(5), 32-36.

Vilhuber, L. (2020). Reproducibility and Replicability in Economics. Harvard Data Science Review, 2(4).

Vivalt, E. (2020). How much can we generalize from impact evaluations? Journal of the European Economic Association, 18(6), 3045-3089.

Young, A. (2019). Consistency without Inference: Instrumental Variables in Practical Application.
