Magnus Johannesson

The Reproducibility of Social Science

There is increasing concern about reproducibility in science, i.e. to what extent statistically significant published research findings are true positives or false positives. Factors contributing to a lack of reproducibility include low statistical power, the testing of hypotheses with low prior probability of being true, and publication bias. We launch a program for assessing and improving reproducibility in the social sciences. The first tool is systematic replication of published studies: we will introduce a systematic replication program of experimental social science studies published in the leading economics journals and the leading general science journals. The second tool is prediction markets, used to quantify the reproducibility of published research findings. Prediction markets allow us to estimate the probability that a tested hypothesis is true at different stages of scientific discovery: the probability can be derived before and after the replication, and it is even possible to derive the initial prior that the hypothesis is true before the publication of the original study. A first study on prediction markets linked to a systematic replication of published findings in psychology shows very promising results, with the prediction market outperforming a survey and correctly predicting about 90% of the replication outcomes (and only about 40% of the published studies in psychology replicate).
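The link between these probabilities can be illustrated with a minimal Bayesian sketch (illustrative only; the prior, power, and significance level below are assumed numbers, and the actual market-based estimation in Dreber et al. 2015 is more involved):

```python
# Minimal sketch of the Bayesian updating that connects the prior probability that a
# hypothesis is true with the probability after a statistically significant result.
# All numbers are illustrative assumptions, not estimates from the prediction markets.

def posterior_given_significant(prior, power, alpha=0.05):
    """P(hypothesis true | significant result in the predicted direction)."""
    true_positive = prior * power          # significant because the hypothesis is true
    false_positive = (1 - prior) * alpha   # significant although the hypothesis is false
    return true_positive / (true_positive + false_positive)

def prior_given_posterior(posterior, power, alpha=0.05):
    """Invert Bayes' rule to recover the prior consistent with a given posterior."""
    posterior_odds = posterior / (1 - posterior)
    prior_odds = posterior_odds * alpha / power   # divide out the likelihood ratio power/alpha
    return prior_odds / (1 + prior_odds)

# Example: a 10% prior and 80% power imply a posterior of 0.64 after an initial
# significant publication; inverting the calculation recovers the 10% prior.
posterior = posterior_given_significant(prior=0.10, power=0.80)
print(round(posterior, 2), round(prior_given_posterior(posterior, power=0.80), 2))
```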
Final report
This project has investigated various issues related to the reproducibility of the social sciences. The project largely followed the plan developed in the application. In our project plan we included five projects, listed as 4.1 to 4.5 in the application. We have completed four of these five projects and replaced the fifth (project 4.4 in the application) with another major project that has also been completed (NARPS). These five projects have been published in five papers (Camerer et al. 2016, 2018; Dreber et al. 2015; Forsell et al. 2019; Botvinik-Nezer et al. 2020). For NARPS, a second related paper documenting the data has also been published (Botvinik-Nezer et al. 2019). In addition, we have completed one further project on predicting replication results using machine learning techniques, which is also published (Altmejd et al. 2019). Together with many prominent researchers across the social sciences and the life sciences we have also published a paper proposing to lower the p-value threshold for what should be considered a statistically significant finding from 0.05 to 0.005 (Benjamin et al. 2018). We are also working on a number of related projects that have not yet been completed. Below we report the results of the three most important projects.


EERP

Testing whether published studies replicate is an important method for assessing the reproducibility of published research. We have carried out two large-scale replication projects as part of this overall project. Both projects replicated experimental studies using so-called direct replications: a direct replication uses the same experimental design and methods as the original study and collects new data in a sample similar to the one used in the original study. The first of the two replication projects was the Experimental Economics Replication Project (EERP). The EERP carried out a systematic replication of 18 laboratory experiments in economics published in two top economics journals, the American Economic Review and the Quarterly Journal of Economics, in 2011-2014. The project included studies testing main effects in between-subject designs. The average replication power was 92% to detect 100% of the original effect size at the 5% significance level. All replications and analyses were pre-registered and communicated to the original authors before the replications were carried out. 11 (61%) out of 18 original studies replicated in the sense of finding a statistically significant effect (p < 0.05) in the same direction as the original study. The mean effect size in the replication studies was about 60% of the mean effect size in the original studies. The study was published in Science in 2016 (Camerer et al. 2016).
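As an illustration of this type of power criterion (a minimal sketch, not the project's actual power analyses; the original effect size below is an assumed number), the required replication sample size for a two-sample comparison of means can be computed along these lines:

```python
# Hedged sketch: sample size per group needed for a replication to reach a target
# power for a given fraction of the original (standardized) effect size.
# Illustrative only; the EERP power analyses were tailored to each study's design.
from statsmodels.stats.power import TTestIndPower

def replication_n_per_group(original_d, fraction=1.0, power=0.92, alpha=0.05):
    """n per group for a two-sample t-test to detect fraction * original_d."""
    return TTestIndPower().solve_power(effect_size=fraction * original_d,
                                       power=power, alpha=alpha)

# Example with an assumed original effect size of d = 0.5:
# 92% power to detect 100% of the original effect at the 5% significance level.
print(round(replication_n_per_group(original_d=0.5)))
```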

SSRP

The Social Sciences Replication Project (SSRP) carried out a systematic replication of 21 social science experiments published in Nature and Science in 2010-2015. The project included papers that tested for an experimental treatment effect within or between subjects, where the subjects were students or part of some other accessible subject pool. Statistical power was substantially higher in the SSRP than in the EERP, to take into account that the effect sizes of original true-positive findings are likely to be inflated. The SSRP used a two-stage design for conducting the replications. In stage 1, the replication had 90% power to detect 75% of the original effect size at the 5% significance level. If the original result did not replicate in stage 1, the replication continued into stage 2 such that the pooled replication had 90% power to detect 50% of the original effect size at the 5% significance level. All replications were pre-registered and communicated to the original authors before they were carried out. 13 (62%) out of 21 original studies replicated in stage 2 in the sense of finding a statistically significant effect (p < 0.05) in the same direction as the original study. The mean effect size in the replications was about 50% of the mean effect size in the original studies. The study was published in Nature Human Behaviour in 2018 (Camerer et al. 2018).
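The two-stage power criterion can be sketched in the same illustrative spirit (the original effect size below is an assumption, and the actual SSRP power calculations were tailored to each study):

```python
# Hedged sketch of a two-stage power criterion for a two-sample design.
# Stage 1: 90% power for 75% of the original effect size; if the result is not
# significant, stage 2 tops up the sample so the pooled data have 90% power for 50% of it.
from statsmodels.stats.power import TTestIndPower

def two_stage_sample_sizes(original_d, alpha=0.05, power=0.90):
    solver = TTestIndPower()
    n_stage1 = solver.solve_power(effect_size=0.75 * original_d, power=power, alpha=alpha)
    n_pooled = solver.solve_power(effect_size=0.50 * original_d, power=power, alpha=alpha)
    return n_stage1, max(n_pooled - n_stage1, 0.0)  # stage-2 top-up per group

# Example with an assumed original effect size of d = 0.4:
n1, n2_extra = two_stage_sample_sizes(original_d=0.4)
print(round(n1), round(n2_extra))
```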

NARPS

In the Neuroimaging Analysis Replication and Prediction Study (NARPS), 70 research teams independently tested the same nine hypotheses using the same neuroimaging dataset. We first collected functional magnetic resonance imaging (fMRI) data for more than 100 participants in a risk preferences task. To assess the impact of analytical choices on fMRI results, this dataset was then independently analyzed by 70 teams of neuroscientists. The research teams were asked to analyze the data to test nine ex-ante hypotheses, each of which consisted of a description of significant activity in a specific brain region in relation to a particular feature of the experimental design. They were given up to 100 days (varying with the date they joined) to analyze the data and to report, for each hypothesis, whether they found statistically significant evidence in support of it (yes/no). The research teams were instructed to perform the analysis as they usually would in their own research groups. We found sizeable variation in reported results, and no two of the 70 research teams analyzed the data in an identical way. The fraction of teams reporting a significant result varied from 6% to 84% across the nine hypotheses. The extent of the variation across teams can be measured as the fraction of teams reporting a different result than the majority of teams; averaged over the nine hypotheses, 20% of the teams reported a result that differed from the majority. This is substantial variation, as the maximum possible value of this measure is 50%; the observed variation in results is thus roughly midway between complete consistency across teams and completely random results. This clearly shows that analytical choices (researcher degrees of freedom) at the analysis stage substantially affect reported results. For each of the nine hypotheses tested, there was some combination of analytical choices that led to a “statistically significant finding”. The study was published in Nature in 2020 (Botvinik-Nezer et al. 2020).
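The variation measure referred to above is simply the share of teams on the minority side for each hypothesis; a minimal sketch (with made-up reporting fractions, not the NARPS data):

```python
# Minimal sketch of the variation measure: for each hypothesis, the fraction of
# teams whose yes/no result differs from the majority result (bounded above by 0.5).
# The reporting fractions below are made up for illustration, not the NARPS data.

def minority_fraction(frac_yes):
    """Fraction of teams disagreeing with the majority result for one hypothesis."""
    return min(frac_yes, 1 - frac_yes)

frac_yes_per_hypothesis = [0.10, 0.20, 0.30, 0.80, 0.05, 0.15, 0.25, 0.70, 0.10]  # hypothetical
average_disagreement = sum(minority_fraction(f) for f in frac_yes_per_hypothesis) / 9
print(round(average_disagreement, 2))  # average share of teams differing from the majority
```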

CONCLUDING REMARKS

The replication results in the EERP and the SSRP give important information about the reproducibility of experimental studies published in top economics journals and of social science experiments published in two of the most prestigious scientific journals (Nature and Science). As the sample sizes of 18 and 21 studies in the EERP and SSRP are limited, one has to be cautious in generalizing these results. However, the results suggest that limited reproducibility is an important problem in the social sciences, with a high fraction of false positive results published in top journals. This is also in line with other large-scale replication projects in the social sciences in recent years, such as the Reproducibility Project: Psychology and the so-called Many Labs studies. Pooling the results from these projects suggests a replication rate of about 50%. This implies that results reported as statistically significant in favor of a tested hypothesis should not be interpreted as strong evidence until the result has been replicated (or until other changes in scientific practice have occurred that increase the plausibility of “statistically significant findings”). It should also be noted that these replication projects are based on experimental studies, and the replication rate may be even lower in studies based on observational data (where the researcher degrees of freedom in how to conduct the analysis are arguably larger).

NARPS provides important information on one factor behind the limited credibility of published results reported as statistically significant. It shows that different researchers make different analytical choices when deciding how to test a hypothesis in a specific dataset. This illustrates the researcher degrees of freedom in how to conduct an analysis, and this variability in analytical choices is not currently taken into account in the testing of scientific hypotheses (which means that the degree of evidence is exaggerated in statistical testing as currently practiced). It also implies a large scope for researchers to intentionally or unintentionally make analytical choices that bias the results towards statistically significant findings, so-called p-hacking. Moving forward, it is important to improve scientific practice to increase the credibility of scientific findings.
Grant administrator: Stockholm School of Economics
Reference number: NHS14-1719:1
Amount: SEK 12,641,000
Funding: New prospects for humanities and social sciences
Subject: Economics
Year: 2015