the post below was written by laura scherer following a brief interaction we had on the ISCON facebook page, followed by a few facebook messages back and forth. i think this is a great example of the kind of thoughtful contribution we could be seeing more of if we could find a way to have productive and pleasant discussions online. i realize pleasantness is not the most important factor in intellectual discussions, but the problem with unpleasantness is that it drives people away,* and then we miss out on some potentially fruitful discussions. i don't know what the solution is,** but some food for thought.
* also there are other problems with unpleasantness.
** blogs, obviously.
Much is being said about the Reproducibility Project’s failure to replicate the majority of 100 studies. Judging from the ensuing debate, there appears to be disagreement about virtually every aspect of the project, from whether it was properly designed and conducted to whether it has any implications for our science at all. The debate has been heated, especially in online forums. For example, someone used the word “Gestapo” in an ISCON Facebook discussion the other day.
One unfortunate product of this debate is that people seem to be polarizing, aligning themselves with one of two perspectives:
One perspective argues that psychology is in crisis. There is something wrong with the way that psychology has historically been conducted, and this has led to a problematic body of research that is filled with false positive results. The methodological issues include under-powered studies, selectively publishing studies that “worked”, p-hacking, and HARKing (I’m sure I’ve missed some, forgive me). According to this perspective, failed replications reflect the fact that a lot of published psychology research is wrong.
The other side argues that we are not facing a crisis, because there are lots of reasons why we wouldn’t, and shouldn’t, expect psychological findings to be exactly reproducible. Some people have proposed weak arguments to this effect, and these are best set aside. However, others have made the reasonable point that moderators could cause an exact replication to fail even when the finding in question is “true” (e.g., Lisa Feldman Barrett’s article in the New York Times). For example, an exact replication could fail to find the original effect because 1) researchers did not conduct pretests and manipulation checks to ensure that the original study materials are appropriate for the new sample, 2) the original findings are only true in certain cultures or contexts, or 3) further unknown moderators exist, such as differences in testing environment, RA characteristics, sample demographics, etc.
I would like to respectfully propose that we move beyond this debate. Choosing to attribute recent replication failures to one or the other of these explanations is unhelpful, because both explanations could be correct. Moreover, we have no way of knowing which explanation is correct for any given replication. The unfortunate fact is that we almost certainly do have a literature that is filled with false positive results. We know this because of p-curves and post hoc power analyses. This problem is not unique to psychology—a decade ago John Ioannidis concluded that “most published research findings are false” after looking at many disciplines, from psychology to prescription drug trials. However, it is also the case that exact replications may sometimes be inappropriate, and replication studies may need to reformulate the methods for different contexts and cultures. For example, the stimulus materials that are used to show ingroup favoritism in University of Michigan students may not work for students in California or China, because people who live in different places have different group identities. That is a completely reasonable and valid point.
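To make the p-curve logic concrete, here is a rough simulation sketch (Python with numpy and scipy; the effect size and sample sizes are arbitrary choices for illustration, not drawn from any real study). Among results that reached p < .05, p-values are roughly uniform when there is no true effect, but pile up near zero when the effect is real—which is the pattern p-curve analyses look for.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant_pvalues(effect, n=30, reps=20000):
    """Run many two-sample t-tests and keep only the p-values that reached p < .05."""
    a = rng.normal(0.0, 1.0, size=(reps, n))
    b = rng.normal(effect, 1.0, size=(reps, n))
    p = stats.ttest_ind(a, b, axis=1).pvalue
    return p[p < 0.05]

null_p = significant_pvalues(effect=0.0)  # no true effect: all significant results are false positives
real_p = significant_pvalues(effect=0.5)  # a real effect of d = 0.5

# Share of significant p-values below .025: about half under the null (flat p-curve),
# well above half when the effect is real (right-skewed p-curve).
print(round(float(np.mean(null_p < 0.025)), 2))
print(round(float(np.mean(real_p < 0.025)), 2))
```

A literature built on false positives should therefore produce a flat (or left-skewed, if p-hacked) curve of significant p-values, while a literature tracking real effects produces a right-skewed one.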
The fact that we cannot distinguish between these two explanations for failed replications is pretty problematic. This is because it makes it really hard to do cumulative science. Cumulative science depends on being able to closely replicate and extend each other’s work (hopefully this point is not controversial). Cumulative science is important, because it lets us develop theories. And yet, we work in a research domain where exact replications might fail, even when the original effect was not a false positive. This is what makes our science really hard. We can’t always just follow a recipe and expect to get a cake.
Nonetheless, if our goal is to create a cumulative science, then we need to identify ways of distinguishing between failed replications that are due to initial false positives, versus failed replications that are due to unidentified moderators. Again, it’s not beneficial to argue about which of these is a better perspective. A better use of time is to develop approaches that allow us to tell the difference.
Here are some ideas.
First, we can reduce the likelihood that false positives get into the literature in the first place. These changes are already being implemented by many researchers and journal editors:
- Conduct highly powered studies.
- Register hypotheses and ask for increased transparency about which analyses were planned (to deter HARKing).
- Make data publicly available (to deter p-hacking).
- When possible, conduct an exact replication before moving on to conceptual replications.
- I’m going to stop here, acknowledging that this could go on for a while.
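As a rough illustration of what “highly powered” demands in practice, here is a small Monte Carlo sketch (Python with numpy and scipy; the effect size d = 0.4 and the sample sizes are arbitrary illustrative choices): even a moderate effect needs on the order of a hundred participants per group to reach the conventional 80% power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(d, n, reps=5000, alpha=0.05):
    """Monte Carlo power of a two-sided two-sample t-test, n per group, true effect d."""
    a = rng.normal(0.0, 1.0, size=(reps, n))
    b = rng.normal(d, 1.0, size=(reps, n))
    p = stats.ttest_ind(a, b, axis=1).pvalue
    return float(np.mean(p < alpha))

# Power climbs slowly with n: d = 0.4 needs roughly 100 per group to reach ~80%.
for n in (25, 50, 100):
    print(n, round(power(0.4, n), 2))
```

Typical cell sizes of 20–30 per group, which were long the norm, leave such a study far below 80% power for effects of this size.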
Second, we should publish replications, but set a high bar so that we don’t end up with a bunch of false negatives that further muddy up the literature:
- Conduct highly powered replications.
- Establish that the original study materials are valid for the new sample using pretests and manipulation checks. This is particularly important when the original finding is older or the new sample is drawn from a very different population.
- Whenever possible, replication efforts should collaborate with the authors of the original study to ensure that the procedure is implemented fairly and competently.
- Replication efforts should consider, a priori, differences between their sample and the original study sample, and how those might contribute to differences in findings. When appropriate, the replication should include measures and manipulations that explore these hypothesized moderators. Original authors should also be given the opportunity to propose moderators a priori that replicators may consider adding to their study.
- Also see “The Replication Recipe” by Brandt et al. (2014).
Finally, we can follow a reasonable approach for drawing (tentative) conclusions about any set of studies. Obviously, it’s not okay to draw conclusions on the basis of one study, whether an original finding or a replication. Instead, we can use the strength of the evidence from the original study, and the strength of the replication study, as a guide. For example:
- If the original study was low quality, and the replication study was high quality (according to the criteria above), and the findings disagree, then the evidence favors the conclusion that the original effect was a false positive. Searching for moderators to explain the replication failure should be low priority unless the effect can be re-established in a subsequent high quality study.
- If both the original finding and the replication were of high quality, but the findings disagree, then moderators might be proposed as a reasonable explanation. Further replication efforts should pursue the most theoretically relevant moderators. Moderators that were proposed a priori but not addressed in the replication might be given more weight. Relatively trivial moderators should be given low priority.
- Regardless of the quality of the original study, if the replication is low quality, don’t publish it.
- Check out Simonsohn’s 2015 Small Telescopes paper for a way more sophisticated treatment of how to evaluate replications.
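One way to sketch the core of that approach (this is my rough rendering in Python with numpy and scipy, using a normal approximation to the standard error of Cohen’s d, so the numbers are approximate): the replication asks not whether the effect is zero, but whether its estimate is significantly smaller than the effect the original study would have had just 33% power to detect.

```python
import numpy as np
from scipy import stats, optimize

def power_two_sample(d, n, alpha=0.05):
    """Power of a two-sided two-sample t-test with n per group and true effect d."""
    df = 2 * n - 2
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    ncp = d * np.sqrt(n / 2)  # noncentrality parameter
    return 1 - stats.nct.cdf(tcrit, df, ncp)

def d33(n_orig):
    """Effect size that the original design had 33% power to detect."""
    return optimize.brentq(lambda d: power_two_sample(d, n_orig) - 1 / 3, 1e-6, 5)

def small_telescopes_p(d_rep, n_rep, n_orig):
    """One-sided p for 'the replication effect is smaller than d33'.
    Uses se(d) ~ sqrt(2/n), a rough small-effect approximation."""
    se = np.sqrt(2 / n_rep)
    z = (d_rep - d33(n_orig)) / se
    return float(stats.norm.cdf(z))

# Illustration: an original study with n = 20 per group, and a large
# replication (n = 200 per group) that estimates d = 0.10.
print(round(d33(20), 2))                            # the "small telescope" benchmark
print(round(small_telescopes_p(0.10, 200, 20), 4))  # replication effect is too small
```

The appeal of this framing is that a well-powered replication can say something informative even when it finds a nonzero but tiny effect: an effect the original study could never have detected is, in effect, evidence against the original result.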
There may be people who take issue with these proposals and there are certainly a number of important topics that I haven’t adequately addressed (too many to even list…). Nonetheless, I hope that this serves as one example of how we might move beyond debates and continue to make progress in improving our science.
Weak arguments include:
- Regression to the mean makes it harder to find significant effects in replications. Problem with this argument: why would regression necessarily go in the direction of smaller effects, unless there is publication bias?
- Failed replications are often the result of incompetent researchers who are motivated to find null effects. Problem with this argument: the original authors were presumably motivated to find a positive effect, and their results could have similarly resulted from incompetence.
- The “decline effect” causes effects to mysteriously go away or become smaller over time. Problem with this argument: isn’t that just another way of saying Type I error?
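The publication-bias point in the first bullet can be demonstrated with a quick simulation (Python with numpy and scipy; the true effect d = 0.2 and n = 30 per group are arbitrary illustrative choices): when only significant results are published, the published effect sizes systematically overestimate the true effect, so exact replications will “regress” downward even though nothing about the effect changed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulate many studies of the same modest true effect, then "publish"
# only the ones that reached p < .05.
n, d_true, reps = 30, 0.2, 20000
a = rng.normal(0.0, 1.0, size=(reps, n))
b = rng.normal(d_true, 1.0, size=(reps, n))
p = stats.ttest_ind(a, b, axis=1).pvalue
d_hat = b.mean(axis=1) - a.mean(axis=1)  # population sd = 1, so this approximates Cohen's d
published = p < 0.05

# Published effects are inflated relative to the truth; the full set of
# studies recovers the true effect. Replications of the published subset
# will therefore look like "shrinkage" on average.
print(round(float(d_hat[published].mean()), 2))  # well above the true 0.2
print(round(float(d_hat.mean()), 2))             # close to the true 0.2
```

Without the publication filter there is no systematic direction for regression to the mean to run in, which is exactly the author’s point.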
 And other methods that I don’t have the space or qualifications to write about.
Of course this raises the question: What is a trivial moderator? This could be another essay altogether, but for now I’ll just propose that at some point a lack of generalizability should cause us to rethink the importance of a finding. If a psychological effect only occurs for 19-year-old psychology majors in Missouri, then it may not be an effect that we want to hang our hats on.