there are so many (very good) stories already about the RP:P that it's easy to feel like we're overreacting. but there is a lot at stake. some people feel that the reputation of our field is at stake. i don't share that view. i trust the public to recognize that this conversation about our methods is healthy and normal for science. the public accepts that science progresses slowly - we waited over 40 years for the higgs boson, and, according to wikipedia, we're still not sure we found it. i don't think we're going to look that bad if psychologists, as a field, ask the public for some patience while we improve our methods. if anything, i think what makes us look bad is when psychology studies are reported in a way that is clearly miscalibrated - a way that makes us sound much more confident than scientists have any right to be when just starting to investigate a new topic.
what i think is at stake is not the reputation of our field, but our commitment to trying out these new practices and seeing how our results look. in the press release, Gilbert is quoted as saying that the RP:P paper led to changes in policy at many scientific journals. that's not my impression. my impression is that the changes that happened came before the RP:P was published. i also haven't seen a lot of big changes. what i've seen is one or two journals make a big change, and a bunch make small changes. Gilbert's comment seems to imply that he thinks these changes should be rolled back. even if you don't accept that the RP:P is informative, it seems very strange to me to conclude that the replicability debate is over, and that journals should roll back their policy changes. if that's what's at stake, this is an incredibly important discussion.
it's tempting to stop there and just say: whatever we all believe about the RP:P, let's focus on the future and on making things better. but i feel compelled to explain why Gilbert et al.'s critique of the RP:P leaves me unsatisfied, because if the roles were reversed - if i did find the critique very compelling - i would want to know other people's reasons for not being compelled.
let me be clear: although i continue to believe the RP:P provides some evidence that our studies are not as replicable as i would like (with a wide margin of error around that estimate), i think it's pretty reasonable to draw different conclusions. specifically, i think many of the replication studies in the RP:P, like many of the original studies, were underpowered. for this and other reasons, i agree the RP:P was flawed, and that the replicability rate was suppressed by these flaws. but i'm super uncomfortable with jumping from that conclusion to the conclusion that there is no replicability problem and the discussion is over.
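to make "underpowered" concrete, here is a minimal sketch (my own illustration, not a calculation from the RP:P analyses) approximating the power of a two-sided, two-sample test under a normal approximation. the effect size and sample size below are hypothetical, chosen only to show that a typical-looking study can have roughly a coin flip's chance of detecting a true effect.

```python
# illustrative power calculation - effect size d and per-group n are
# hypothetical numbers, not values taken from the RP:P.
from math import sqrt
from statistics import NormalDist

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """approximate power of a two-sided, two-sample z-test for a
    standardized mean difference d with n_per_group per condition."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # e.g. ~1.96 for alpha = .05
    ncp = d * sqrt(n_per_group / 2)     # approximate noncentrality of the test statistic
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)

# a smallish effect (d = 0.4) with 50 participants per group:
print(round(approx_power(0.4, 50), 2))   # roughly 0.5 - about a coin flip
```

under these (made-up) numbers, a replication run at the original sample size has only about a 50% chance of succeeding even if the effect is real - which is why a failed replication, on its own, is ambiguous.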
if you don't think the RP:P was informative, what has to come next is a proposal for what a better test would look like. it's not enough to say this replication project was flawed and that replication project was flawed, so let's not use replications to test our findings. the replicability of psychological science has to be open to empirical assessment. if not the RP:P and the RRRs and the Many Labs projects and the hugely-powered, careful replications of key effects, then what? i would really like to know what empirical evidence would convince the critics that we do have a replication problem.
i imagine one answer someone could give is that the empirical basis for their beliefs is all the conceptual replications that exist. the problem with this is that it answers 'why do you believe we don't have a replicability problem?' but it doesn't answer 'what would convince you that we do?'. to protect ourselves against confirmation bias, we have to be able to state what would falsify our beliefs. the existence of successful conceptual replications isn't an answer to the question i'm asking, because if the conceptual replications fail, we aren't likely to take that as evidence that we have a problem. maybe we shouldn't - that's ok - but we have to be able to name what would make us change our minds, and commit ourselves to that evidence, regardless of outcome.
i understand why people want the flexibility to interpret the results of a study after the fact, with the freedom to use reasoning and critical thinking to consider possibilities they didn't anticipate ahead of time. but we are human, and we know that we are very, very good at abusing reasoning and critical thinking in the service of confirmation bias. so we have to tie our hands to some extent. we have to pre-commit to accepting certain results. we have to be able to design a replication study that we ourselves feel comfortable committing to as a valid test. and, if we really don't want to accept the results we had pre-committed to accepting, we can always run a new pre-registered study to demonstrate that our first pre-registered study was misguided.
indeed, the pre-commitment to accepting the results as informative is what greatly increases the informational value of the RP:P and RRRs and Many Labs projects. they are like conceptual replications of each other, but unlike most conceptual replications, they used pre-registration and open data & materials to tie their hands and reduce the opportunity to exploit flexibility in data collection and analysis. i trust these conceptual replications in large part because the process was so constrained and so transparent. i am on board with conceptual replications when they are conducted in a way that would force the authors to accept that any outcome should be treated as informative, success or failure.
after-the-fact scrutiny of the results is useful, but it has to be taken as less reliable, and more susceptible to human bias, than conclusions that were constrained by pre-registration. we can use these post-hoc explanations as a starting point for follow-up studies, but they are not an end point. if you believe the RP:P, or any other piece of evidence, is uninformative, it's great to point out the flaws, but then you also have to tell us what an informative study would look like, and that answer has to be something that could make you reconsider your beliefs.
think of it as an invitation. if Gilbert et al. laid out what empirical demonstration would convince them that the old way of doing things was problematic, i'm pretty sure the field would spring into action to conduct those tests.
but to be honest, i'm not sure there is any empirical evidence that would convince them, and my goal is not to convince them anyway. my goal is to explain why i am dissatisfied with the post-hoc critiques of each new piece of evidence that comes out suggesting that we have a problem. it's not that i don't think the critics make some valid points. it's that i don't see the critics taking the necessary next steps. they aren't making any risky predictions, or opening themselves up to possible disconfirmation.
turning that question around on myself, what empirical evidence would convince me that published social/personality psych studies are as replicable as we would like, and that the old practices are sound? my answer: if people pre-register their study designs and analysis plans, test similar questions, and the results look the same (same distribution of p-values, same effect sizes) as they have in the past. that would definitely convince me, and i would accept the arguments of those who want to go back to doing things the old way.
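a test like that could be made concrete by comparing the two empirical distributions directly. here is a minimal sketch (my own illustration, with invented p-values) of a two-sample Kolmogorov-Smirnov statistic - the largest gap between the two empirical CDFs - which would sit near zero if pre-registered results really do look like past results.

```python
# illustrative only: the p-values below are invented, not real data.
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in a + b:
        f_a = bisect_right(a, v) / len(a)   # empirical CDF of sample_a at v
        f_b = bisect_right(b, v) / len(b)   # empirical CDF of sample_b at v
        gap = max(gap, abs(f_a - f_b))
    return gap

# hypothetical p-values: published studies vs. pre-registered replications
published      = [0.01, 0.02, 0.03, 0.04, 0.04, 0.05]
pre_registered = [0.03, 0.20, 0.35, 0.50, 0.62, 0.80]
print(ks_statistic(published, pre_registered))  # a large gap - the distributions differ
```

with real data you would pair the statistic with a significance test rather than eyeball it, but the point is just that "the results look the same" is a checkable, pre-specifiable criterion, not a matter of post-hoc judgment.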
i would love a similarly concrete answer from the Gilbert et al. crowd. but maybe i'm just too demanding.