there are so many (very good) stories already about the RP:P that it's easy to feel like we're overreacting. but there is a lot at stake. some people feel that the reputation of our field is at stake. i don't share that view. i trust the public to recognize that this conversation about our methods is healthy and normal for science. the public accepts that science progresses slowly - we waited over 40 years for the higgs boson, and, according to wikipedia, we're still not sure we found it. i don't think we're going to look that bad if psychologists, as a field, ask the public for some patience while we improve our methods. if anything, i think what makes us look bad is when psychology studies are reported in a way that is clearly miscalibrated, one that makes us sound much more confident than scientists have any right to be when they're just starting to investigate a new topic.
what i think is at stake is not the reputation of our field, but our commitment to trying out these new practices and seeing how our results look. in the press release, Gilbert is quoted as saying that the RP:P paper led to changes in policy at many scientific journals. that's not my impression. my impression is that the changes that happened came before the RP:P was published. i also haven't seen a lot of big changes. what i've seen is one or two journals make a big change, and a bunch make small changes. Gilbert's comment seems to imply that he thinks these changes should be rolled back. even if you don't accept that the RP:P is informative, it seems very strange to me to conclude that the replicability debate is over, and that journals should roll back their policy changes. if that's what's at stake, this is an incredibly important discussion.
it's tempting to stop there and just say, whatever we all believe about the RP:P, let's focus on the future and making things better. but i feel compelled to explain why Gilbert et al.'s critique of the RP:P leaves me unsatisfied, because if i had found it very compelling, i would want to know other people's reasons for not being compelled.
let me be clear. although i continue to believe the RP:P provides some evidence that our studies are not as replicable as i would like (with a wide margin of error around that estimate), i think it's pretty reasonable to draw different conclusions. specifically, i think many of the replication studies in the RP:P, like many of the original studies, were underpowered. for this and other reasons, i agree the RP:P was flawed, and the replicability rate was suppressed by these flaws. but i'm super uncomfortable with jumping from that conclusion to the conclusion that there is no replicability problem and that the discussion is over.
if you don't think the RP:P was informative, what has to come next is a proposal about what would be a more ideal test. it's not enough to say this replication project was flawed and that replication project was flawed, so let's not use replications to test our findings. the replicability of psychological science has to be open to empirical assessment. if not the RP:P and the RRRs and the Many Labs and the hugely-powered, careful replications of key effects, then what? i would really like to know what empirical evidence would convince the critics that we do have a replication problem.
i imagine one answer someone could give is that the empirical basis for their beliefs is all the conceptual replications that exist. the problem with this is that it answers 'why do you believe we don't have a replicability problem?' but it doesn't answer 'what would convince you that we do?'. to protect ourselves against confirmation bias, we have to be able to state what would falsify our beliefs. the existence of successful conceptual replications isn't an answer to the question i'm looking for, because if the conceptual replications fail, we aren't likely to take that as evidence that we have a problem. maybe we shouldn't, and that's ok, but we have to be able to name what would make us change our minds, and commit ourselves to that evidence, regardless of outcome.
i understand why people want the flexibility to interpret the results of a study after the fact, with freedom to use reasoning and critical thinking to consider possibilities that they didn't consider ahead of time. but we are human, and we know that we are very, very good at abusing reasoning and critical thinking in the service of confirmation bias. so we have to tie our hands to some extent. we have to pre-commit to accepting certain results. we have to be able to design a replication study that we ourselves feel comfortable committing to as a valid test. and, if we really don't want to accept the results that we had pre-committed to accepting, we can always run a new pre-registered study to demonstrate that our first pre-registered study was misguided.
indeed, the pre-commitment to accepting the results as informative is what greatly increases the informational value of the RP:P and RRRs and Many Labs projects. they are like conceptual replications of each other, but unlike most conceptual replications, they used pre-registration and open data & materials to tie their hands and reduce the opportunity to exploit flexibility in data collection and analysis. i trust these conceptual replications in large part because the process was so constrained and so transparent. i am on board with conceptual replications when they are conducted in a way that would force the authors to accept that any outcome should be treated as informative, success or failure.
after-the-fact scrutiny of the results is useful, but it has to be taken as less reliable, and more susceptible to human bias, than conclusions that were constrained by pre-registration. we can use these post-hoc explanations as a starting point for follow-up studies, but they are not an end point. if you believe the RP:P, or any other piece of evidence, is uninformative, it's great to point out the flaws, but then you also have to tell us what an informative study would look like, and that answer has to be something that could make you reconsider your beliefs.
think of it as an invitation. if Gilbert et al. laid out what empirical demonstration would convince them that the old way of doing things was problematic, i'm pretty sure the field would spring into action to conduct those tests.
but to be honest, i'm not sure there is any empirical evidence that would convince them, and my goal is not to convince them anyway. my goal is to explain why i am dissatisfied with the post-hoc critiques of each new piece of evidence that comes out suggesting that we have a problem. it's not that i don't think the critics make some valid points. it's that i don't see the critics taking the necessary next steps. they aren't making any risky predictions, or opening themselves up to possible disconfirmation.
turning that question around on myself, what empirical evidence would convince me that published social/personality psych studies are as replicable as we would like, that the old practices are sound? my answer is: people start pre-registering their study designs and analysis plans, they test similar questions, and the results look the same (same distribution of p-values, same effect sizes) as they have in the past. that would definitely convince me, and i would accept the arguments of those who want to go back to doing things the old way.
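here is a very crude sketch of one way to operationalize 'the results look the same', using made-up placeholder arrays of p-values rather than anything real (the data, the choice of a kolmogorov-smirnov test, all of it is just an illustration):

```python
# crude, hypothetical sketch (placeholder data, not real studies): compare the
# distribution of p-values from pre-registered studies to the distribution
# from the older literature. if they look the same, that would update me
# toward "the old practices were fine."
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# placeholder arrays -- in practice these would be p-values collected from
# the two literatures, matched on topic area
old_literature_p = rng.uniform(0.001, 0.05, size=200)   # hypothetical
preregistered_p = rng.uniform(0.001, 0.05, size=200)    # hypothetical

# a two-sample kolmogorov-smirnov test asks whether the two sets of p-values
# could plausibly have come from the same distribution
statistic, p_value = stats.ks_2samp(old_literature_p, preregistered_p)
print(f"KS statistic = {statistic:.3f}, p = {p_value:.3f}")
```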
i would love a similarly concrete answer from the Gilbert et al. crowd. but maybe i'm just too demanding.
"i think many of the replication studies in the RP:P, like many of the original studies, were underpowered. for this and other reasons, i agree the RP:P was flawed, and the replicability rate was suppressed by these flaws."
I am trying to understand this, but couldn't one always state that replication studies are under-powered after a "failed" replication attempt (assuming the effect is never exactly 0)?
Perhaps researchers need to come up with some new rules for designing replication studies (especially concerning power), and for judging them/deciding if a replication succeeded or failed.
I would love to hear more on how to exactly do this.
Posted by: Anonymous | 05 March 2016 at 05:14 AM
if you stick with the NHST framework, yes, you can't draw conclusions from null effects, including null replications. but in effect estimation or bayesian frameworks, you can (if you have enough precision/evidence). for an example of an effect estimation approach, i really like uri simonsohn's "small telescopes" approach: http://datacolada.org/wp-content/uploads/2016/03/26-Pscyh-Science-Small-Telescopes-Evaluating-replication-results.pdf
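to make that concrete, here is a rough sketch of the small telescopes logic as i understand it, with made-up numbers and a standard approximation for the standard error of cohen's d; none of this comes from the paper's own analyses:

```python
# rough sketch of the "small telescopes" logic (made-up numbers).
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# hypothetical original study: two-group design, n = 20 per cell
n_original = 20

# d_33%: the effect size the original study had only 33% power to detect.
# if the true effect were this small, the original study was too small a
# "telescope" to have been able to detect it in the first place.
d_33 = analysis.solve_power(nobs1=n_original, power=1/3, alpha=0.05,
                            ratio=1.0, alternative='two-sided')

# hypothetical replication result: d = 0.10 with n = 80 per cell
d_rep, n_rep = 0.10, 80

# common approximation for the standard error of cohen's d (two groups)
se_rep = np.sqrt((n_rep + n_rep) / (n_rep * n_rep) +
                 d_rep**2 / (2 * (n_rep + n_rep)))

# the replication "fails" in the small-telescopes sense if its effect is
# significantly smaller than d_33%, i.e., the upper bound of its 90% CI
# falls below d_33%
upper_90 = d_rep + 1.645 * se_rep
print(f"d_33% = {d_33:.2f}, replication 90% CI upper bound = {upper_90:.2f}")
print("significantly smaller than d_33%:", upper_90 < d_33)
```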
Posted by: Simine Vazire | 05 March 2016 at 05:16 AM
That's a good point. I get the impression that there's a lot of moving the goalposts these days and it's not clear what type of evidence it would take for some academics to change their position. Though I understand that it is a charged topic.
Speaking as an outsider to psychology, I think the replication problem doesn't make the field look bad, quite the opposite. I suspect a lot of other fields have the same problem, but aren't really taking steps to fix it. Coming from a statistics perspective, I find this movement to improve statistical methods, experimental design and replicability in psychology to be quite refreshing and exciting. No need to be negative.
Posted by: Maude Lachaine | 05 March 2016 at 05:57 AM
Great post, Simine. I think you're completely spot on when you ask about the falsifiability of the hypotheses. We should always do that and it certainly applies to this. What evidence could convince anyone that there is/isn't a problem? If the answer is 'none', it is pointless to continue discussing any results.
Whether or not psychology is in crisis is entirely subjective. But there are objective ways to quantify the validity of research and all people in this discussion should really get together and decide what level of replicability they think is needed.
As far as the RPP is concerned, a considerable proportion of the findings (both the original and the replications) are completely inconclusive and only a small proportion yield compelling support for the existence of these effects. I don't know about anyone else, but to me those stats aren't evidence of a healthy field. A large number of inconclusive findings implies that the research is not being done with sufficient power and sensitivity. The fact that even the replications suffer from this means that estimates of power are based on invalid assumptions, presumably at least in part because the original effect sizes are substantial overestimates.
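To make that last point concrete, here is a toy calculation with entirely made-up numbers (not RPP data): if the published effect size is inflated, a replication powered on it will be badly underpowered for the true effect.

```python
# Toy illustration (made-up numbers, not RPP data): powering a replication on
# an inflated published effect size leaves it underpowered for the true effect.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

d_published = 0.6   # hypothetical (inflated) effect size from the original paper
d_true = 0.3        # hypothetical true effect size

# sample size per group needed for 80% power, if d_published were accurate
n_planned = analysis.solve_power(effect_size=d_published, power=0.8, alpha=0.05)
print(f"planned n per group: {n_planned:.0f}")               # roughly 45

# actual power of that design against the true (smaller) effect
actual_power = analysis.power(effect_size=d_true, nobs1=n_planned, alpha=0.05)
print(f"actual power for d = {d_true}: {actual_power:.2f}")  # roughly 0.3
```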
The issue of methodological differences that could have been avoided is certainly another reason for some concern but again I must ask what a proponent of the original effects would accept as compelling evidence for the null hypothesis. As I said in my post (that single-handedly pissed off all of social psychology research apparently), if you can't make some a priori decisions on what a hypothesis implies then you end up chasing ghosts.
My main worry, and the reason why I was so sceptical of preregistration etc. for so long, is that if we only concentrate on strong effects that are robust to methodological differences and come out strongly even in preregistered designs, then we may bias science towards only strong effects and miss the more nuanced but potentially important ones. So we need to be wary of this. If an effect is a total snowflake, e.g. "Professor prime makes you smarter, Einstein makes you dumber, Einstein with the tongue out makes you smarter again, and generally it only works when the temperature is below 30C..." (I made some of these up), then I think you need to eventually accept that you are chasing a ghost. But if you think that there are modulating factors at play, which is a perfectly justified assumption, then you should test that and show it's robust.
The field should reflect this. We should stop reporting single findings that are unlikely to be robust to lots of confounding factors as if they were general mechanisms.
Posted by: Sam Schwarzkopf | 05 March 2016 at 06:12 AM