this blog post is an attempt to lay out, as clearly and bluntly as i can, my reasoning for concluding that p-hacking is a big problem and false positives are a big problem.
there have been grumblings online about a new Registered Replication Report (RRR) about to come out showing that the meta-analytic result of 20 pre-registered replications of an ego depletion study is pretty much zero.
it might seem like jumping the gun to write a blog post about it before it’s come out. that's because it is jumping the gun. but i’m doing it anyway, because i think the most important conclusion is not about ego depletion. the most important conclusion is that we need to accept that 50 million frenchmen can be wrong.
throughout the last few years, when i have talked to people,* one of the most strongly and frequently expressed reasons i’ve heard for not panicking is that it seems impossible that p-hacking is so rampant that even a phenomenon shown in 50 or 100 studies (e.g., ego depletion) could be a false positive. if a paradigm has been used over and over again, and dozens of papers have shown the effect, then it can’t all be a house of cards. the idea was that even though there were some problems, there was no reason to panic because of these conceptual replications. (also, meta-analysis.)
i wrote a blog post about this a while ago, pointing out the problem with this logic. i argued that if we test a few bricks in the wall (studies), and many of them turn out to be weak (fail to replicate), we can’t point to the rest of the wall as evidence that everything is ok. if even a small sample of bricks (studies) is tested and a high proportion turn out to be weak (fail to replicate), we need to consider the possibility that the wall is not sound.
the only way out of that conclusion is if we think the sample of bricks (studies) tested was biased (i.e., we only tried to replicate studies that had obvious signs of weakness), or that the process of testing the bricks (replicating studies) was flawed. i think the reproducibility project ruled out the first possibility by selecting a mostly-representative sample of studies published in our top journals. i think the RRR rules out the second possibility by having a very careful pre-registration process that involves both believers and skeptics and a multi-site replication effort. it’s a one-two punch. we can no longer be confident that an effect is real just because there are many studies showing the effect. or because there is a meta-analysis showing the effect. the existence of the wall (a large literature full of conceptual replications) can no longer be a source of comfort.
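to make the bricks-in-the-wall logic concrete, here is a minimal sketch in python (assuming scipy is available; this is just my illustration, not a calculation from the RP:P paper itself) of what a representative sample of tested bricks tells us about the whole wall. the counts are the RP:P social/personality numbers i get to below (41 of 55 replications failing the p < .05 standard):

from scipy.stats import beta

def clopper_pearson(failures, n, alpha=0.05):
    # exact (clopper-pearson) confidence interval for the population
    # proportion of "weak bricks" (studies that would fail to replicate),
    # given a representative sample of n tested bricks
    lower = beta.ppf(alpha / 2, failures, n - failures + 1) if failures > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, failures + 1, n - failures) if failures < n else 1.0
    return lower, upper

lower, upper = clopper_pearson(failures=41, n=55)
print(f"observed failure rate: {41 / 55:.0%}")
print(f"95% CI for the failure rate in the wall: {lower:.0%} to {upper:.0%}")
# roughly 61% to 85% -- even the optimistic end is hard to square with
# "the rest of the wall is fine", unless the sample was biased or the
# replications themselves were flawed (the two escape routes above)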
to me, the RRR is another exit door slamming shut, pushing us to the conclusion that p-hacking is rampant and can produce mountains of false positives. if we can no longer trust meta-analysis, or trust the existence of 100 studies in top journals showing similar effects, there is nowhere to retreat.
i can understand why we didn't panic when the ‘false positive psychology’ paper came out. who knows how many of those p-hacking strategies people actually use, or in what combination.
i can understand why we didn’t panic when the QRP paper came out. who knows whether people did these things just once or many times in their lives. and i think we can all agree that, whatever our beliefs about the prevalence of p-hacking are, we could design a survey that would produce responses consistent with those beliefs. which of the two QRP papers you think is more valid is predicted by your a priori beliefs about the prevalence of QRPs.**
i can even almost understand why not everyone panicked when the reproducibility project (RP:P) came out because, as we all teach our students, we never put much stock in single studies anyway.
ok, wait, no. i’m lying. i can’t understand why not everyone was worried about the RP:P results. i have resisted writing a blog post about the RP:P but i can’t hold it in anymore. here it is:
---
so, it seems to me that the only way to not be worried about what the RP:P results are telling us is to reject the idea that the replicability rate in the RP:P tells us something about the (inverse of the) false positive rate.
noise and error (e.g., hidden moderators) definitely affect replication results, but there is still something we can learn from the RP:P, even if there's uncertainty around that conclusion. in the RP:P, 74.5% (41 out of 55) of social/personality studies failed to replicate by the “replication study reaches p < .05” standard.# let's assume that half of those failed replications were due to noise or error.## that still leaves 20 out of 55 studies that failed to replicate, or 36%. so, i think it's reasonable to conclude that the RP:P suggests that a good estimate of the false positive rate in social/personality psych is 36-75%. that's a wide confidence interval, but it's still way more precise than anything we had before. and it's way way higher than we would like.
# if you prefer a different metric for deciding what counts as a failed replication, here is the same calculation applied to the most generous metric of replication success: if we use “meta-analytic p < .05” (which i think is far far too generous because it gives the non-pre-registered original study as much credibility as the pre-registered second study), 27 out of 55 studies successfully replicated. that's 51% (28 out of 55) that didn't replicate, even by this very generous standard. now let's generously assume half of those failed replications were due to noise or error, and we're still left with a 25% false positive rate as the lower bound of our estimate. best case scenario.
## note that i’m generously### assuming all noise/error leads to underestimating the effect. as sanjay has noted, noise and error (and hidden moderators) can lead to overestimating effects, too.
### so much generosity.
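for anyone who wants to check the arithmetic in this interlude, here it is in plain python (a sketch using only the counts quoted above, with the same generous assumption that half of the failed replications were due to noise or error):

total = 55

# standard metric: replication study reaches p < .05
failed_strict = 41
upper = failed_strict / total                 # ~75%: all failures treated as false positives
lower = (failed_strict // 2) / total          # ~36%: half the failures written off as noise/error

# most generous metric: meta-analytic p < .05
failed_generous = total - 27                  # 28 studies still failed
generous_floor = (failed_generous // 2) / total   # ~25%: the best-case lower bound

print(f"estimated false positive rate: {lower:.0%} to {upper:.0%}")
print(f"lower bound under the most generous metric: {generous_floor:.0%}")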
---
ok, back to the original blog post.
so, i can almost understand if people weren’t worried until now. but the RRR makes it very hard for me to imagine being ok with the state of the field.*** even if you look at your own practices and those of everyone you know, and you don't see much p-hacking going on, the evidence is becoming overwhelming that p-hacking is happening a lot. my guess is that the reason people can't reconcile that with the practices they see happening in their labs and their friends' labs is that we're not very good at recognizing p-hacking when it's happening, much less after the fact. we can't rely on our intuitions about p-hacking. we have to face the facts. and, in my view, the facts are starting to look pretty damning.
in conclusion, fifty million frenchmen can definitely be wrong. so can fifty JPSP papers.
(this situation reminds me a bit of the situation described in the big short, about the mortgage bubble - few economists and bankers thought it was possible that there was this huge crisis looming, partly because everyone thought that if there was this fundamental problem, it would be obvious to the experts.**** sometimes, things that seem super solid are actually way more shaky than almost anybody realizes.)
i know there are very smart people that i respect a lot who disagree with me. i very much welcome comments/responses.
ps: as gloomy as this post sounds, i am, still, optimistic. also stubborn.
* introvert gone wild.
** p < .000007
*** in case you were worried about running out of things to panic about when lying awake at 3 a.m.
**** i'm not sure why paramount hasn't called to acquire the rights to my blog. mark ruffalo already told me he'd play uri simonsohn.
To me it seems that the only real solution to this is the Registered Reports format (https://osf.io/8mpji/wiki/FAQ%203:%20Design%20and%20Analysis/): pre-registration, high power, and no publication bias.
I would love to hear more solutions to the false-positive problem, but so far, if I understood things correctly, the best solution I have read about is the Registered Reports format. I wonder what your, and others', thoughts are on this.
More importantly, I wonder why journals don't adopt this format as the only acceptable one.
Posted by: Anonymous | 09 February 2016 at 12:55 AM
I understand why the RPP is worrying. The evidential value of most of the studies (including the replications) is pretty low, as Alex Etz's Bayes factor reanalysis shows (http://alexanderetz.com/2015/08/30/the-bayesian-reproducibility-project). Clearly we need more power at the outset of a study, and we need a culture of independent replications. I honestly don't know how to bring about this change. Even with all the developments in preregistration and replication efforts, the incentive structure is still opposed to this.
However, in this discussion I also worry about another thing. There is a lot of talk about how many of the RPP studies failed to replicate. But aren't we missing the point a little here? Surely as scientists we want to actually discover stuff and increase our understanding, rather than making really, really sure that some specific claim is true.
A while ago I had a look at some of the RPP studies that actually did replicate and which had strong effect sizes. This wasn't very systematic, but I already spotted one or two where I immediately thought "Well, this is almost certainly wrong." I can't remember what they were (and I realise this isn't very useful of me to say right now :P) but I'll look it up at a later point. The effect replicated beautifully, but the underlying hypothesis is probably still completely false. I do feel like the discussion about replicability often misses this issue (although, in their defense, the RPP authors discuss it very clearly in the paper).
In my view the best way out of this mess is to encourage both replicability and genuine discovery. Replicability is critical, but I believe we can pursue it without slowing down the progress of research.
Posted by: Sam Schwarzkopf | 09 February 2016 at 01:08 AM
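(an aside on the bayes factor reanalysis sam mentions: for anyone curious what that kind of check looks like mechanically, here is a minimal sketch using the pingouin package's default JZS bayes factor for a t-test. the t value and sample sizes below are invented for illustration; they are not taken from alex etz's reanalysis or from any RP:P study.)

from pingouin import bayesfactor_ttest

t_observed = 2.1     # hypothetical t statistic from a "just significant" original study
n1, n2 = 20, 20      # hypothetical group sizes (two-tailed p is about .04 here)

bf10 = bayesfactor_ttest(t_observed, n1, n2)
print("BF10 =", bf10)
# a just-significant result like this typically gives a BF10 only a little
# above 1, i.e., weak evidence for the effect even though p < .05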
What I also find interesting to think about is what ego-depletion researchers will do now, given the new information from the RRR.
I wonder if they will abandon their line of research. I doubt they will. I think we can look forward to multiple low-powered, possibly p-hacked studies by ego-depletion researchers showing all kinds of "moderators" and who knows what. Then a large replication project will no doubt show that there is really nothing going on. Then multiple new articles will appear again, etc. I fear it will be a never-ending cycle...
Posted by: Anonymous | 12 February 2016 at 12:28 AM
Why do you use capital letters so inconsistently and non-standardly? Please explain or it's going to drive me crazy. Thank you!
Posted by: Anonymous | 08 March 2016 at 06:53 AM