the post below was written by laura scherer following a brief interaction we had on the ISCON facebook page, followed by a few facebook messages back and forth. i think this is a great example of the kind of thoughtful contribution we could be seeing more of if we could find a way to have productive and pleasant discussions online. i realize pleasantness is not the most important factor in intellectual discussions, but the problem with unpleasantness is that it drives people away,* and then we miss out on some potentially fruitful discussions. i don't know what the solution is,** but some food for thought.
-simine
* also there are other problems with unpleasantness.
** blogs, obviously.
-----------
Much is being said about the Reproducibility Project’s failure to replicate the majority of 100 studies. Judging from the ensuing debate, there appears to be disagreement about virtually every aspect of the project, from whether it was properly designed and conducted to whether it has any implications for our science at all. The debate that has emerged has been heated, especially in online forums. Someone used the word “Gestapo” in an ISCON Facebook discussion the other day (for example).
One unfortunate product of this debate is that people seem to be polarizing, aligning themselves with one of two perspectives:
One perspective argues that psychology is in crisis. There is something wrong with the way that psychology has historically been conducted, and this has led to a problematic body of research that is filled with false positive results. The methodological issues include under-powered studies, selective publication of studies that “worked”, p-hacking, and HARKing (I’m sure I’ve missed some, forgive me). According to this perspective, failed replications reflect the fact that a lot of published psychology research is wrong.
The other side argues that we are not facing a crisis, because there are lots of reasons why we wouldn’t, and shouldn’t, expect psychological findings to be exactly reproducible. Some people have proposed weak arguments to this effect, and these are best set aside.[1] However, others have made the reasonable point that moderators could cause an exact replication to fail, even when the finding in question is “true” (e.g. Lisa Feldman Barrett’s article in the New York Times). For example, an exact replication could fail to find the original effect because 1) researchers did not conduct pretests and manipulation checks to ensure that the original study materials are appropriate for the new sample, 2) the original findings are only true in certain cultures or contexts, and 3) further unknown moderators may exist, such as differences in testing environment, RA characteristics, sample demographics, etc.
I would like to respectfully propose that we move beyond this debate. Choosing to attribute recent replication failures to one or the other of these explanations is unhelpful, because both explanations could be correct. Moreover, we have no way of knowing which explanation is correct for any given replication. The unfortunate fact is that we almost certainly do have a literature that is filled with false positive results. We know this because of p-curves and post hoc power analyses.[2] This problem is not unique to psychology—a decade ago John Ioannidis concluded that “most published research findings are false” after looking at many disciplines, from psychology to prescription drug trials. However, it is also the case that exact replications may sometimes be inappropriate, and replication studies may need to reformulate the methods for different contexts and cultures. For example, the stimulus materials that are used to show ingroup favoritism in University of Michigan students may not work for students in California or China, because people who live in different places have different group identities. That is a completely reasonable and valid point.
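As a rough illustration of the Ioannidis-style arithmetic (a minimal sketch; the prior, power, and alpha values below are assumptions chosen for illustration, not estimates from any paper mentioned here), it takes surprisingly ordinary numbers to end up with a literature where a large share of significant findings are false:

```python
# Back-of-the-envelope calculation in the spirit of Ioannidis (2005).
# All three inputs are illustrative assumptions, not empirical estimates.

alpha = 0.05    # conventional false-positive rate per test
power = 0.35    # assumed typical power of an underpowered study
prior = 0.10    # assumed fraction of tested hypotheses that are actually true

true_positives = prior * power          # true effects that reach significance
false_positives = (1 - prior) * alpha   # null effects that reach significance anyway

ppv = true_positives / (true_positives + false_positives)
print(f"Share of significant (publishable) results that reflect true effects: {ppv:.2f}")
# With these assumptions, only about 44% of significant results are true positives,
# and that is before p-hacking or selective reporting makes things worse.
```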
The fact that we cannot distinguish between these two explanations for failed replications is pretty problematic. This is because it makes it really hard to do cumulative science. Cumulative science depends on being able to closely replicate and extend each other’s work (hopefully this point is not controversial). Cumulative science is important, because it lets us develop theories. And yet, we work in a research domain where exact replications might fail, even when the original effect was not a false positive. This is what makes our science really hard. We can’t always just follow a recipe and expect to get a cake.
Nonetheless, if our goal is to create a cumulative science, then we need to identify ways of distinguishing between failed replications that are due to initial false positives, versus failed replications that are due to unidentified moderators. Again, it’s not beneficial to argue about which of these is a better perspective. A better use of time is to develop approaches that allow us to tell the difference.
Here are some ideas.
First, we can reduce the likelihood that false positives get into the literature in the first place. These changes are already being implemented by many researchers and journal editors:
- Conduct highly powered studies (a minimal power-analysis sketch follows this list).
- Register hypotheses and ask for increased transparency about which analyses were planned (to deter HARKing).
- Make data publicly available (to deter p-hacking).
- When possible, conduct an exact replication before moving on to conceptual replications.
- I’m going to stop here, acknowledging that this could go on for a while.
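To make the first recommendation on this list concrete, here is a minimal a priori power-analysis sketch. The assumed effect size (d = 0.4) and the two-group t-test design are illustrative choices on my part, not part of the original recommendations:

```python
# Minimal a priori power analysis for a two-group design (assumed effect size).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.4,   # assumed true effect (Cohen's d)
                                   alpha=0.05,
                                   power=0.80,
                                   alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.0f}")
# For d = 0.4 this comes out to roughly 100 participants per group, considerably
# more than the cell sizes of many classic studies.
```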
Second, we should publish replications, but set a high bar so that we don’t end up with a bunch of false negatives that further muddy up the literature:
- Conduct highly powered replications.
- Establish that the original study materials are valid for the new sample using pretests and manipulation checks. This is particularly important when the original finding is older or the new sample is drawn from a very different population.
- Whenever possible, replication efforts should collaborate with the authors of the original study to ensure that the procedure is implemented fairly and competently.
- Replication efforts should consider, a priori, differences between their sample and the original study sample, and how those might contribute to differences in findings. When appropriate, the replication should include measures and manipulations that explore these hypothesized moderators. Original authors should also be given the opportunity to propose moderators a priori that replicators may consider adding to their study.
- Also see the Replication Recipe from Brandt et al. (2014).
Finally, we can follow a reasonable approach for drawing (tentative) conclusions about any set of studies. Obviously, it’s not okay to draw conclusions on the basis of one study, whether an original finding or a replication. Instead, we can use the strength of the evidence from the original study, and the strength of the replication study, as a guide. For example:
- If the original study was low quality, and the replication study was high quality (according to the criteria above), and the findings disagree, then the evidence favors the conclusion that the original effect was a false positive. Searching for moderators to explain the replication failure should be low priority unless the effect can be re-established in a subsequent high quality study.
- If both the original finding and the replication were of high quality, but the findings disagree, then moderators might be proposed as a reasonable explanation. Further replication efforts should pursue the most theoretically relevant moderators. Moderators that were proposed a priori but not addressed in the replication might be given more weight. Relatively trivial moderators should be given low priority.[3]
- Regardless of the quality of the original study, if the replication is low quality, don’t publish it.
- Check out Simonsohn’s 2015 Small Telescopes paper for a way more sophisticated treatment of how to evaluate replications (a rough sketch of the basic idea follows this list).
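For readers who want the flavor of the small-telescopes approach, here is a minimal sketch of its benchmark, d_33%: the effect size that the original study design would have detected only about a third of the time. The hypothetical original sample size (20 per group) and the t-test design are my assumptions, and this is a sketch of the idea rather than code from the paper:

```python
# Sketch of the "small telescopes" benchmark: find d_33%, the effect size the
# original design would have had only 33% power to detect.
# The original sample size (n = 20 per group) is an assumption for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d_33 = analysis.solve_power(nobs1=20,          # assumed original n per group
                            alpha=0.05,
                            power=1/3,
                            alternative='two-sided')
print(f"d_33%: {d_33:.2f}")   # roughly d = 0.5 for n = 20 per group
# In Simonsohn's framework, if the replication estimate is significantly smaller
# than this benchmark, the result is inconsistent with effects big enough for the
# original study to have detected, whatever the "true" effect may be.
```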
There may be people who take issue with these proposals and there are certainly a number of important topics that I haven’t adequately addressed (too many to even list…). Nonetheless, I hope that this serves as one example of how we might move beyond debates and continue to make progress in improving our science.
[1] Weak arguments include:
- Regression to the mean makes it harder to find significant effects in replications. Problem with this argument: why would regression necessarily go in the direction of smaller effects, unless there is publication bias? (A toy simulation after this list illustrates the point.)
- Failed replications are often the result of incompetent researchers who are motivated to find null effects. Problem with this argument: the original authors were presumably motivated to find a positive effect, and their results could have similarly resulted from incompetence.
- The “decline effect” causes effects to mysteriously go away or become smaller over time. Problem with this argument: isn’t that just another way of saying Type I error?
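To put numbers on the regression-to-the-mean point in the first bullet above, here is a toy simulation (the true effect size, sample size, and number of simulated studies are arbitrary assumptions). Unselected effect estimates average out to the true effect, so an unbiased replication has nothing to regress down from; it is only the estimates filtered for significance that are systematically inflated:

```python
# Toy simulation: published effects shrink on replication only under selection.
# All numbers (true effect, per-group n, number of studies) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, n_studies = 0.3, 20, 10_000   # assumed true effect and per-group sample size

def run_study():
    a = rng.normal(true_d, 1, n)   # treatment group
    b = rng.normal(0.0, 1, n)      # control group
    t, p = stats.ttest_ind(a, b)
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return d, p

results = [run_study() for _ in range(n_studies)]
all_d = np.array([d for d, p in results])
published_d = np.array([d for d, p in results if p < .05 and d > 0])  # significance filter

print(f"Mean effect across all simulated studies: {all_d.mean():.2f}")       # close to true_d
print(f"Mean effect among 'published' studies:    {published_d.mean():.2f}") # inflated
# The unselected average sits near the true effect, so there is no built-in reason
# for replications to find smaller effects. The selected estimates are inflated,
# and those are the ones that shrink when someone replicates them.
```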
[2] And other methods that I don’t have the space or qualifications to write about.
[3] Of course this raises the question: What is a trivial moderator? This could be another essay altogether, but for now I’ll just propose that at some point a lack of generalizability should cause us to rethink the importance of the finding. If a psychological effect only occurs for 19 year old psychology majors in Missouri, then it may not be an effect that we want to hang our hats on.
I have high hopes for today. This is the third very sensible post I've read today :P Your suggestions seem very reasonable to me, and I appreciate that you are trying to break that polarisation between "replicators" and "status-quoers". There is a middle ground here and that's what we should be after.
One thing I'd disagree on is that hidden moderators/mediators/confounds are a stronger argument than concerns about the competence of the research (be it the original or the replication). Incompetence and/or low quality experiments can very well affect the outcome, and I don't think it is as simple to judge that as saying "there is a large N and the original author approved the methods." If, as in a proper adversarial collaboration, the original authors were directly involved in the experiments and the outcome is nonetheless a failure to replicate, then I think it is hard to argue that poor quality is to blame. But in many other replication attempts that judgement call is a lot harder. As you rightly point out, though, the same argument applies to the original result too. So the take home message is that you shouldn't put too much faith in any set of two studies (original + replication), regardless of their power etc., unless you have very high confidence that one of them was of sufficient quality.
In my view, the hidden moderator defense is a much weaker argument. There will always be unknown factors. You need experiments to test under which conditions a finding generalises (if it replicates at all that is). Unless you can formulate a moderating factor and test it, this argument is meaningless.
Posted by: Sam Schwarzkopf | 21 September 2015 at 07:47 PM
Dear Laura Scherer,
Your recommendations for the future are good, but I disagree with the analysis of the current situation.
You write: "Choosing to attribute recent replication failures to one or the other of these explanations is unhelpful, because both explanations could be correct."
This is simply not correct, because it is possible to use the OSF reproducibility data to test these alternative hypotheses. The original studies show clear signs of bias that inflated the reported effect sizes by 50% or more. The unbiased replication studies revealed this bias. It is possible that for some individual findings moderators might have an effect, but the large discrepancy between 97% significant results in the original studies and 36% significant results in the replication studies is due to questionable research practices and publication bias in psychology.
Also, recommendations for improvement have been made for several years, but the power of original published studies in psychology in 2015 is not higher than it was in the 50 years before.
https://replicationindex.wordpress.com/
We can all hope that things will get better, but often an honest assessment of the current situation is needed before things improve. The moderator argument is not helpful because it hides the real cause of the problem: original studies are not credible and provide no protection against false-positive results, since published results are based on a selection of significant results from a set of underpowered studies.
Sincerely, Uli Schimmack
https://replicationindex.wordpress.com/
Posted by: Ulrich Schimmack | 21 September 2015 at 09:40 PM
Hi Laura,
I cannot disagree with your recommendations for replication studies; they're great. However, the way you build the narrative, it now appears you are suggesting that RP:P does not comply with these high standards. I hope you agree that this would be very wrong to suggest: one can refer to the procedure of RP:P and see that virtually all of your recommendations were implemented in RP:P.
Also, I was wondering... can you give examples of replication studies that do not meet the high standards and 'further muddy up the literature with a bunch of false negatives'? I can't think of many studies that would qualify as such in the recent literature... there aren't that many replication studies around to choose from.
What I *can* disagree with are the ideas about the role of hidden moderators in the application of the scientific method to produce scientific knowledge.
Assuming that we're talking about confirmatory studies, and that research questions are always based on some deductive chain of propositions that leads to a prediction of measurement outcomes
in terms of an observational constraint between at least a dependent and an independent variable...
Then it is absolutely valid to claim, after a failed direct replication study, that an uncontrolled confounding variable in the replication was responsible for the failure to replicate the original effect.
However, the consequence of claiming this was the case for the replication is that the original deductive chain/theory/claim was invalid as well!
It means that:
- the Ceteris Paribus clause is violated. There apparently were sensible/foreseeable moderators that systematically vary with the effect, which the original study did not explicitly attempt to control for.
- to claim that a failed direct replication was due to a hidden moderator and that the original study observed a true effect implies that randomisation failed in the original study and that the hidden moderator was controlled for 'by accident' or by selection bias. Random assignment of subjects to groups or conditions controls for non-systematic variability within / between subjects. A moderator variable implies systematic variation, something that randomisation can't resolve.
So either way, the interpretation of the results of the original study is problematic if one wishes to explain a failed replication by appeal to hidden moderators.
Luckily philosophers of science (Lakatos) have analysed these defensive strategies in great detail and here's the deal:
1. Progressive research programme: Acknowledge the hidden moderator and accept that there is a problem with the original study and the theory/deductive chain that predicted the effect. Amend the original claim or start from scratch.
2. Degenerative research programme: Point to the hidden moderator in order to protect the perceived veracity of the original result and theory. Do not amend or reject the original claim.
All the best,
Fred
Posted by: FredHasselman | 22 September 2015 at 04:43 AM
Thanks for the comments, these are great.
Sam, it's possible that I agree with everything that you said. The point about questioning replicators' competence was simply that, so long as published replications are high quality, it shouldn't be acceptable to question a replicator's competence. Further, it may have been too generous to give the moderator argument such credence without being clear that we need to consider moderators a priori (I said that somewhere, in passing, I think…). I think you said it best: we shouldn't put much faith in any set of studies unless we can be reasonably certain that at least one of them was high quality.
Uli, I think you and I agree that we currently have a problem with false positives in the literature and that this may take years to sort out. I see this as a long term effort. In the near term, weak findings that shouldn't have made it into the literature will be rejected as a result of failed replications that are high quality and that test reasonable moderators (proposed a priori, I hope). All of us need to be prepared to accept that some original findings were false positives--if we can't, then published findings will be like the walking dead (they refuse to die). That said, I'm going to stand by the point that moderators are an important thing to consider when conducting replications. I don't see these as mutually exclusive and that was part of the point of the post.
Fred, with regard to the first part of your comment, I want to be clear that I have extremely high regard for the RP:P and agree that they met the criteria on the list. Some people that I've talked to disagree with this sentiment, although it seems that the real issue is that 100 studies is just a lot to digest (making it hard to go through and scrutinize each one). With regard to the rest of your comment, when I was writing this piece I just knew that someone would comment with something far more sophisticated about philosophy of science :) So in response, I say we all decide to go with #1 from your comment and not #2, and agree that we should consider *reasonable*, *a priori* moderators and be prepared to toss out findings that we can't replicate after considering those moderators. It seems that we would all agree with this.
To me, these comments represent a huge step in the right direction relative to other degenerative arguments that I have witnessed recently. Thanks for the respectful tone and the thoughtful points. Very much appreciated.
Posted by: Laura | 22 September 2015 at 12:59 PM