Jörg Peters on the lack of replicability of many publications in economics, the role of p-hacking and publication pressure, and reasons for cautious optimism in considering these issues
Revised & extended crosspost of „Wissenschaft oder Fiktion?“, originally published in Frankfurter Allgemeine Zeitung.
The new book Science Fictions by Scottish psychologist Stuart Ritchie of King’s College London paints a bleak picture of science. It is a polemic, but economists should take it seriously. Not only does it provide substantial evidence of systematic errors in science, but it also skillfully describes the underlying incentive problems in the system that are indeed particularly pronounced in economics.
Ritchie clarifies at the outset that he is writing “to praise science, not to bury it.” That is important in times when consolidated scientific evidence is rightly guiding policy while being challenged by people with dangerous political agendas. Indeed, Ritchie’s critique is concerned with that very word: consolidated. An empirical result is considered consolidated, and not randomly triggered, if and only if it can be replicated. Replicability can refer to a re-analysis using the same data set, but particularly for studies with small sample sizes, it also means that the result must be observable not just once in a specific setting, but demonstrable under different conditions. If political decisions are based on research results that cannot be replicated, they go astray. This costs taxpayers money or, worse, human lives. Ritchie bases his argument on the observation that the replicability rate of influential scientific results is appallingly low, and that this is well known among experts. At the same time, critics accuse him of playing with fire with regard to climate sceptics. That critique falls short, because anthropogenic climate change is precisely what, according to Ritchie, other parts of empirical science are not: consolidated. How big the problems really are across disciplines is an open question. For economics, though, there are worrying signs of a replication crisis in the making (see e.g. Brodeur et al. 2020, Camerer et al. 2016, Ferraro & Shukla 2020, Ioannidis et al. 2017).
Ritchie’s book is an eloquent, sometimes sarcastic, but always analytical description of the scientific process. He relentlessly describes the allocation of research funds and scientific positions, the publication process including peer review, and science communication for what they are: a man-made and therefore error-prone system. Science is a social construct – and so people and their vanities, their will to survive and their careerism, their hubris and their networks play an important role. Factoring this in is more important than ever in view of the growing societal influence of science.
In recent decades, economics has become increasingly empirical (Angrist et al. 2017). Empirical results are often interpreted as facts in the media and are also presented as such by scientists themselves. However, they arise from the social construct described above, which makes them much more prone to error than is usually portrayed. These errors happen systematically, because people, including researchers, prefer the spectacular to the unspectacular. Suppose I am studying the effects of air pollution on respiratory diseases – it is certainly more interesting to find an effect than to not find one. The mechanisms by which the scientific social construct transfers this search for the spectacular into a systematic error are called p-hacking and publication bias. Publication bias and p-hacking, in short, reduce the statistical methods, which quantitative research is so proud of, to absurdity. To understand these two mechanisms, we need to take a closer look at the publication process and empirical work in practice.
What is publication bias?
Data availability has grown enormously over the past decades, not only through Big Data at tech companies but also through socio-economic data sets of unprecedented scale. This is important because statistical methods can only establish a correlation found in such data sets with a certain probability. That is, there is always a residual probability that the correlation occurred by chance and that one mistakenly believes a particular result to be true. Such a result would not be replicable and thus worthless. This residual probability is usually set at 5% by convention. So if I test my air pollution hypothesis in one data set and 99 other researchers test it in parallel in 99 other data sets, about five of the 100 attempts can be expected to find a significant correlation – even if it does not exist in reality. If all 100 attempts were published, there would be no problem. Other scientists could then correctly recognize that the roughly five successful attempts are due to chance. However, not all results are published. Peer reviewers and journal editors find the significant attempts more interesting and will tend to publish those and reject the non-significant ones. The resulting published literature then presents a false picture of reality.
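The thought experiment with 100 parallel studies can be sketched as a small simulation. This is only an illustration under stated assumptions, not an analysis from the book or the cited literature: the function names, the sample size of 200 per group, the normally distributed outcome with known standard deviation 1, and the simple two-sided z-test are all hypothetical choices made for the example. The true effect is set to zero, so every "significant" study is a false positive.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05  # conventional 5% significance threshold mentioned in the text


def run_study(rng, n=200, true_effect=0.0):
    """One hypothetical study comparing an exposed and an unexposed group.

    Assumption: outcomes are normal with known standard deviation 1,
    so a two-sided z-test on the difference in means is appropriate.
    Returns (estimated effect, p-value).
    """
    exposed = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    unexposed = [rng.gauss(0.0, 1.0) for _ in range(n)]
    estimate = mean(exposed) - mean(unexposed)
    std_error = (2.0 / n) ** 0.5  # sqrt(1/n + 1/n) with unit variance
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return estimate, p_value


def simulate_literature(n_studies=100, n=200, seed=1):
    """Run many studies of a non-existent effect; 'publish' only significant ones."""
    rng = random.Random(seed)
    studies = [run_study(rng, n=n) for _ in range(n_studies)]
    published = [s for s in studies if s[1] < ALPHA]
    return studies, published


if __name__ == "__main__":
    studies, published = simulate_literature()
    print(f"studies run: {len(studies)}")
    print(f"significant although the true effect is zero: {len(published)}")
    print(f"mean |estimate|, all studies: {mean(abs(e) for e, _ in studies):.3f}")
    if published:
        print(f"mean |estimate|, published only: {mean(abs(e) for e, _ in published):.3f}")
```

In this setup the full set of studies centres on zero, while the "published" subset contains only the largest chance deviations, which is the distortion described above. The exact count of significant studies varies with the seed.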
What is p-hacking?
Related to this is p-hacking. It denotes what all empirical researchers know: unspectacular results can be made more spectacular. This is done by subtly or crudely altering the data analysis to improve the statistical significance level, expressed in terms of the p-value. In economics, p-hacking is especially problematic because the share of secondary data-based studies is much higher than in other disciplines like psychology or medical research, where larger parts of the literature are based on laboratory experiments or Randomized Controlled Trials (RCTs). p-hacking does not necessarily involve outright fraud. Rather, any empirical investigation involves dozens, if not hundreds, of micro-decisions. These start with very fundamental decisions, such as whether to study the effect of air pollution on respiratory disease or that of air pollution on cardiovascular disease. If I see an association in the latter but not in the former, I pursue that cardiovascular hypothesis, but not the respiratory hypothesis. If there is no association there either, I try air pollution and headaches or other medical conditions. At some point I will find a significant relationship because of the 5% error probability. Such an approach is legitimate if all attempts are documented and published. But they usually are not; hence a false picture of reality is created again. But the micro-decisions reach further, into choices that are hardly noticeable even to the researcher herself, such as how to measure air pollution in the first place, how to clean the raw data, and which econometric specification to use. In many of these steps, there is not one obvious right decision, leaving leeway to influence the results (Huntington-Klein et al. 2020, Simonsohn et al. 2020). The pressure to do so in the direction of more interesting results is high.
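How quickly "at some point I will find a significant relationship" becomes likely can be made concrete with a short calculation. The sketch below assumes that the tested outcomes are independent and that none of them is truly affected, which is a simplification: in real data, health outcomes are correlated and the numbers would differ. Under these assumptions the chance of at least one false positive among k tests is 1 − 0.95^k, roughly 14% for three outcomes and 40% for ten. The function names are hypothetical.

```python
import random

ALPHA = 0.05  # conventional 5% significance threshold mentioned in the text


def chance_of_some_false_positive(n_outcomes, alpha=ALPHA):
    """Probability that at least one of n independent tests of a true null is significant."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


def researcher_finds_something(rng, n_outcomes, alpha=ALPHA):
    """Simulate one researcher trying several outcomes in turn.

    Assumption: under a true null hypothesis a p-value is uniform on [0, 1],
    so each test is represented by one uniform random draw.
    """
    return any(rng.random() < alpha for _ in range(n_outcomes))


def simulated_rate(n_outcomes, n_researchers=20000, seed=7):
    """Share of simulated researchers who end up with a 'significant' result."""
    rng = random.Random(seed)
    hits = sum(researcher_finds_something(rng, n_outcomes) for _ in range(n_researchers))
    return hits / n_researchers


if __name__ == "__main__":
    for k in (1, 3, 10, 20):
        analytic = chance_of_some_false_positive(k)
        simulated = simulated_rate(k)
        print(f"{k:2d} outcomes tried: analytic {analytic:.3f}, simulated {simulated:.3f}")
```

The same logic applies to the less visible micro-decisions such as data cleaning rules or model specifications: each additional variant that is tried and silently discarded raises the chance that the reported result is a product of chance.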
At the same time, academic careers in economics depend on journal publications. Especially in the early stages of a career, it’s publish or perish. Those who publish well get good jobs, good research grants, and eventually tenure. Those who don’t, disappear. The incentives are clear. In economics, this selection process is particularly harsh, in that scientific performance is evaluated on the basis of a journal ranking that already drops off very steeply between the so-called top 5 and the top field journals, and even more steeply after that. Hence, a large proportion of journals are in fact irrelevant in terms of careers. At the same time, it is naive to expect that peer review could effectively ensure quality by tracking down p-hacking and those micro-decisions. If anything, due to publication bias and the reviewers’ penchant for spectacular results, peer review is even part of the problem.
How serious are these problems?
None of this is new. The renowned Stanford statistician and epidemiologist John Ioannidis published a much-cited paper as early as 2005 that summarizes Ritchie’s points in a nutshell (Ioannidis 2005). Ioannidis has also been involved in various studies that empirically demonstrate these errors in the scientific system, including in economics (Ioannidis et al. 2017). Criticism of the publication system, of the obsession with certain methods, and of the ‘social construct’ has recently mounted from within the economics profession as well, including from Nobel laureates such as Angus Deaton (2020), James Heckman (Heckman and Moktan 2020), and George Akerlof (2020), and in the much-noted blog post “Economics is a disgrace” by Claudia Sahm. Numerous recent review studies point to considerable publication bias and p-hacking in the economics literature, as well as systematic errors in influential publications and the widespread use of questionable or erroneous research practices, also in high-quality journals (Brodeur et al. 2016, Brodeur et al. 2020, Camerer et al. 2016, Dahal and Fiala 2020, Ferraro & Shukla 2020, Gallen 2020, Ioannidis et al. 2017, Kelly 2020, Mellon 2020, Young 2019).
There’s progress on transparency in economics, but…
In his book, Ritchie proposes reforms of the publication process that essentially rely on more transparency to trigger a cultural change. The good news is that these ideas are also well known in economics (Christensen and Miguel 2018, Miguel et al. 2014). Spearheaded by the Berkeley Initiative for Transparency in the Social Sciences and a few similar initiatives, important steps are being taken towards pre-specification and pre-registration of research questions in pre-analysis plans (PAPs), to prevent p-hacking and publication bias. Many high-quality journals in economics pursue ambitious policies to make data available for reanalysis – and indeed, posting data online has increased sharply in recent years (Vilhuber 2020). Data availability is also the main component of a generally positive transparency trend documented in Christensen et al. (2020). However, PAPs are the norm only for RCTs (Christensen & Miguel 2018), a method that – by design – is already less prone to p-hacking and probably also to publication bias. Only a fraction of secondary data-based studies are pre-specified (Ofosu & Posner 2019).
…organized scepticism must become the norm
Moreover, these positive transparency steps will only lead to more reliable empirical results if the economics profession develops a norm of replication, post-publication discussion based on re-analyses, and meta-science – in brief, organized scepticism, to use a term coined by Robert Merton in 1942 (Merton 1973). The trend towards PAPs is still associated with uncertainties and considerable leeway for researchers (Banerjee et al. 2020), so that even for RCTs it is so far unclear to what extent it will hamper questionable research practices. Ofosu and Posner (2019) hence emphasize that “PAPs are unlikely to enhance research credibility without vigorous policing”. Likewise, this policing culture is required to realize the potential of data transparency policies. Only if published papers are reanalysed will p-hacking and other questionable practices be effectively disincentivized. Re-analysis and post-publication discussion entail more than push-button replications reproducing point estimates; oftentimes they also require qualitative deliberations drawing on interdisciplinary expertise about specification choices, outcome variable selection or, to stay with the air pollution example, which spirometer was used to measure pulmonary function.
Replication in different settings is also key to establishing external validity, which is often limited in individual studies, especially but not only for RCTs (Esterling et al. 2021, Muller 2015, Peters et al. 2018, Vivalt 2020). Views differ on the extent to which replications of any kind are already being conducted. While some reviews observe decent or modest rates of published papers being replicated (Berry et al. 2017, Sukhtankar 2017), others diagnose negligible rates (Duvendack et al. 2017, Mueller-Langer et al. 2019).
It is probably undisputed, though, that replications are not what the culture in economics rewards. Why would we expect a young scholar to invest scarce research time into tracking PAPs or reproducing published papers if it doesn’t pay off for her career? Fecher et al. (2016) suggest including replications in PhD curricula and, like Ferraro & Shukla (2020), recommend clearer rewards for conducting replications. Transparent standards for when journals publish replications would be an important step. In any case, it seems this self-correction process still has a long way to go in economics. In the meantime, Ritchie suggests that policymakers and the public should not take individual scientific findings as incontrovertible truths – which also holds for economics. After all, analysing the data that depict the world is complex and prone to error, and it is carried out by people who have their own viewpoints and interests. Internalizing this should generally be part of economics expertise, as it is in other disciplines. It ultimately strengthens resilience, Ritchie also argues, against the perfidious forces of science denialism.