this blog post is an attempt to lay out my reasoning about why i think it's safe to conclude that p-hacking is a big problem, and false positives are a big problem, as clearly and bluntly as i can.
there have been grumblings online about a new Registered Replication Report (RRR) about to come out showing that the meta-analytic result of 20 pre-registered replications of an ego depletion study is pretty much zero.
it might seem like jumping the gun to write a blog post about it before it’s come out. that's because it is jumping the gun. but i’m doing it anyway, because i think the most important conclusion is not about ego depletion. the most important conclusion is that we need to accept that 50 million frenchmen can be wrong.
throughout the last few years, when i have talked to people,* one of the most strongly and frequently expressed reasons i’ve heard for not panicking is that it seems impossible that p-hacking is so rampant that even a phenomenon shown in 50 or 100 studies (e.g., ego depletion) could be a false positive. if a paradigm has been used over and over again, and dozens of papers have shown the effect, then it can’t all be a house of cards. the idea was that even though there were some problems, there was no reason to panic because of these conceptual replications. (also, meta-analysis.)
i wrote a blog post about this a while ago, pointing out the problem with this logic. i argued that if we test a few bricks in the wall (studies), and many of them turn out to be weak (fail to replicate), we can’t point to the rest of the wall as evidence that everything is ok. if even a small sample of bricks (studies) is tested and a high proportion turn out to be weak (fail to replicate), we need to consider the possibility that the wall is not sound.
the only way out of that conclusion is if we think the sample of bricks (studies) tested was biased (i.e., we only tried to replicate studies that had obvious signs of weakness), or that the process of testing the bricks (replicating studies) was flawed. i think the reproducibility project ruled out the first possibility by selecting a mostly-representative sample of studies published in our top journals. i think the RRR rules out the second possibility by having a very careful pre-registration process that involves both believers and skeptics and a multi-site replication effort. it’s a one-two punch. we can no longer be confident that an effect is real just because there are many studies showing the effect. or because there is a meta-analysis showing the effect. the existence of the wall (a large literature full of conceptual replications) can no longer be a source of comfort.
to me, the RRR is another exit door slamming shut, pushing us to the conclusion that p-hacking is rampant and can produce mountains of false positives. if we can no longer trust meta-analysis, or trust the existence of 100 studies in top journals showing similar effects, there is nowhere to retreat.
i can understand why we didn’t panic when the ‘false positive psychology’ paper came out. who knows how many of those p-hacking strategies people use, much less in what combination.
i can understand why we didn’t panic when the QRP paper came out. who knows whether people did these things just once or many times in their lives. and i think we can all agree that, whatever our beliefs about the prevalence of p-hacking are, we could design a survey that would produce responses consistent with those beliefs. which of the two QRP papers you think is more valid is predicted by your a priori beliefs about the prevalence of QRPs.**
i can even almost understand why not everyone panicked when the reproducibility project (RP:P) came out because, as we all teach our students, we never put much stock in single studies anyway.
ok, wait, no. i’m lying. i can’t understand why not everyone was worried about the RP:P results. i have resisted writing a blog post about the RP:P but i can’t hold it in anymore. here it is:
so, it seems to me that the only way to not be worried about what the RP:P results are telling us is to reject the idea that the replicability rate in the RP:P tells us something about the (inverse of the) false positive rate.
there is definitely noise and error (e.g., hidden moderators) that affect replication results, but there is still something we can learn from the RP:P, even with uncertainty around the conclusion. in the RP:P, 74.5% (41 out of 55) of social/personality studies failed to replicate by the “replication study reaches p < .05” standard.# let's assume that half of those failed replications were due to noise or error.## that still leaves about 20 out of 55 studies that failed to replicate, or 36%. so, i think it's reasonable to conclude that the RP:P suggests that a good estimate of the false positive rate in social/personality psych is 36-75%. that's a wide interval, but it's still way more precise than anything we had before. and it's way way higher than we would like.
# if you prefer a different metric for deciding what counts as a failed replication, here is the same calculation applied to the most generous metric of replication success: if we use “meta-analytic p < .05” (which i think is far far too generous because it gives the non-pre-registered original study as much credibility as the pre-registered second study), 27 out of 55 studies successfully replicated. that's 28 out of 55, or 51%, that didn't replicate, even by this very generous standard. now let's generously assume half of those failed replications were due to noise or error, and we're still left with 14 out of 55, a 25% false positive rate, as the lower bound of our estimate. best case scenario.
## note that i’m generously### assuming all noise/error leads to underestimating the effect. as sanjay has noted, noise and error (and hidden moderators) can lead to overestimating effects, too.
### so much generosity.
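(for the quantitatively inclined: the back-of-the-envelope estimates above can be sketched in a few lines of code. this is just my restatement of the arithmetic, not anything from the RP:P itself; the counts are the ones quoted above, and the 50% noise discount is the assumption made in the post.)

```python
# sketch of the false positive estimates from the RP:P numbers quoted above.
# the "noise_discount" (half of failures attributed to noise/error) is the
# post's assumption, not data.

def fp_rate(total, failed, noise_discount=0.5):
    """estimate a false positive rate after discounting some failures as noise."""
    genuine_failures = int(failed * (1 - noise_discount))  # round down, as in the post
    return genuine_failures / total

total = 55

# strict standard: replication study reaches p < .05 -> 41 of 55 failed
upper = 41 / total          # ~0.75, taking every failed replication at face value
lower = fp_rate(total, 41)  # ~0.36, after discounting half the failures as noise

# generous standard: meta-analytic p < .05 -> 27 of 55 replicated, so 28 failed
generous_lower = fp_rate(total, 28)  # ~0.25, the "best case scenario" bound

print(f"strict: {lower:.0%}-{upper:.0%}; generous lower bound: {generous_lower:.0%}")
```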
ok, back to the original blog post.
so, i can almost understand if people weren’t worried until now. but the RRR makes it very hard for me to imagine being ok with the state of the field.*** even if you look at your own practices and those of everyone you know, and you don't see much p-hacking going on, the evidence is becoming overwhelming that p-hacking is happening a lot. my guess is that the reason people can't reconcile that with the practices they see happening in their labs and their friends' labs is that we're not very good at recognizing p-hacking when it's happening, much less after the fact. we can't rely on our intuitions about p-hacking. we have to face the facts. and, in my view, the facts are starting to look pretty damning.
in conclusion, fifty million frenchmen can definitely be wrong. so can fifty JPSP papers.
(this situation reminds me a bit of the situation described in the big short, about the mortgage bubble - few economists and bankers thought it was possible that there was this huge crisis looming, partly because everyone thought that if there was this fundamental problem, it would be obvious to the experts.**** sometimes, things that seem super solid are actually way more shaky than almost anybody realizes.)
i know there are very smart people that i respect a lot who disagree with me. i very much welcome comments/responses.
* introvert gone wild.
** p < .000007
*** in case you were worried about running out of things to panic about when lying awake at 3 a.m.
**** i'm not sure why paramount hasn't called to acquire the rights to my blog. mark ruffalo already told me he'd play uri simonsohn.