this idea keeps popping up: if you conduct a replication study and get a null result, you need to explain why the original study found a significant effect and you didn't.

what's wrong with this idea? a few things.

first, it seems to discount the possibility that the original finding was a fluke - a false positive that made it look like there is an effect when in fact there isn't. here's an analogy:

null hypothesis: my coin is fair
research hypothesis: my coin is weighted
original study: i flip the coin 20 times and get 15 heads (p = .041)
replication study: i flip the coin another 20 times and get 10 heads (p = 1.0)
do i need to explain why i got 15 heads the first time?

maybe. or maybe the first study was just a fluke. that happens sometimes (4.1% of the time, to be exact).
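just to check the arithmetic, here's a minimal python sketch (standard library only; the helper name two_sided_p is mine, not anything official) that computes the exact two-sided binomial p-values for those two coin studies:

```python
# exact two-sided binomial test against a fair coin, using only the
# standard library. the p-value is the probability, under a fair coin,
# of a result at least as far from 50/50 as the one observed.
from math import comb

def two_sided_p(heads, flips):
    k = abs(heads - flips / 2)
    return sum(comb(flips, h) for h in range(flips + 1)
               if abs(h - flips / 2) >= k) / 2 ** flips

print(two_sided_p(15, 20))   # original study:    ~.041
print(two_sided_p(10, 20))   # replication study: 1.0
```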
what if the replication study was: i flip the same coin 100 times and get 50 heads? now isn't the evidence pretty strong that the null is true, and the original study was just a fluke? when an original study had low power and the replication study had lots more power, we are kind of in this situation. the most likely explanation is not that there is a meaningful difference between the original study and the replication study that needs to be explained; the most likely explanation is that the original study was a fluke, and in fact there is no effect. (the little simulation sketched below makes this concrete.)

am i implying the original result was p-hacked??

no. if we assume that people sometimes p-hack, then a false positive becomes an even more likely explanation, but if you're uncomfortable with this assumption, that's ok.* it's still pretty plausible that a small, underpowered study that got a barely-significant result was a false positive, even without p-hacking.

where does the idea that we need to explain the original result come from?

i think one source is that we underestimate randomness. we think that a significant result is almost like an existence proof (this doesn't help). that is, we think that if someone found an effect once, the effect really happened, just like if someone saw a black swan once, black swans must really exist. this has been debunked many times, but i think even when people rationally understand why the analogy doesn't hold, we have a hard time shaking the gut feeling that it happened once, so it must have been true. but that's like saying that if i got 15/20 heads once, then the coin must have been weighted, at least at that time.
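to make the 'fluke' explanation concrete, here's the little simulation i mentioned (plain python; a toy version of the scenario, nobody's actual study): assume the coin is fair the whole time, run lots of 20-flip 'original studies', keep only the ones that come out significant, and give each of those a 100-flip replication.

```python
# simulating the low-power original / high-power replication scenario,
# assuming the null is true throughout (the coin is fair).
import random
from math import comb

def two_sided_p(heads, flips):
    # same exact binomial test as above
    k = abs(heads - flips / 2)
    return sum(comb(flips, h) for h in range(flips + 1)
               if abs(h - flips / 2) >= k) / 2 ** flips

random.seed(1)
n_studies = 10_000
significant_originals = 0
significant_replications = 0

for _ in range(n_studies):
    original = sum(random.random() < .5 for _ in range(20))          # 20-flip original
    if two_sided_p(original, 20) < .05:
        significant_originals += 1
        replication = sum(random.random() < .5 for _ in range(100))  # 100-flip replication
        if two_sided_p(replication, 100) < .05:
            significant_replications += 1

# about 4% of the 20-flip originals come out 'significant' even though the
# coin is fair, and the 100-flip replications of those flukes almost always
# come out null - there was never anything there to replicate.
print(significant_originals / n_studies)
print(significant_replications / max(significant_originals, 1))
```

nothing in this toy world needs a moderator to explain the 'failed replications' - it's just sampling error doing what sampling error does.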
why doesn't the analogy hold? because there is error in our studies, but not in the sighting of the black swan, or in any other existence proof. the better analogy would be if someone reported that they are 95% sure that they saw a black swan, but it's possible it was actually a really dirty white swan** - they can't be totally sure. (well, there are other reasons the analogy doesn't hold up, so don't think about it too hard, but that's the main one.)

but isn't it possible the original result was real and there is a theoretical explanation for the different result in the replication study (i.e., a moderator)?

absolutely. and it's also possible the replication is a type II error and the effect is real and robust. but my point is that it's also possible - very, very possible - that the original result was a fluke and there never was any effect.

how do we decide among these explanations (no real effect, a robust real effect, a real effect with boundary conditions/moderators)? we don't have to decide once and for all - we shouldn't treat a single study (original or replication) as definitive anyway. but we should weigh the evidence: if the replication study has no known flaws and was well-powered, and the original study had low power and found a barely-significant effect, then i would lean towards believing that the original was a fluke. if the replication study is demonstrably flawed and/or underpowered, then it should be discounted or weighed less. if both studies were well-powered and rigorous, then we should look for moderators and test them. (actually, i'm totally happy with looking for moderators and testing them anyway, as long as someone else does the work.)

my point is simply that when the original study was relatively modest to begin with, and the replication study is rigorous, there is no need for any other explanation than 'hey, look, that original finding may have been a false positive.' i'm not saying let's decide 100% for sure it was a false positive and never study the phenomenon again. i'm just saying let's not make the replicators come up with a theoretical explanation for the difference between their results and the original results. that would be overfitting the data to a more complicated model when a simpler one suffices: flukes happen. (for a nice demonstration of just how flukey p-values are, watch this.)

in a recent blog post, sam schwarzkopf made an argument related to the one i'm trying to debunk here. in it, he wrote: "the only underlying theory replicators put forth is that the original findings were spurious and potentially due to publication bias, p-hacking and/or questionable research practices. This seems mostly unfalsifiable."

it's absolutely falsifiable: run a well-powered, pre-registered direct replication of the original study, and get a significant result.
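(a quick aside on what 'well-powered' means in coin terms: here's a rough python sketch - the helper names are mine, and the 75% weighting is just an assumption, picked because it's the rate the original study happened to observe - showing how often a 20-flip study vs. a 100-flip study would detect a genuinely weighted coin.)

```python
# power of the exact two-sided test against a fair coin, under an assumed
# true heads rate. 'power' = probability the study comes out significant.
from math import comb

def two_sided_p(heads, flips):
    k = abs(heads - flips / 2)
    return sum(comb(flips, h) for h in range(flips + 1)
               if abs(h - flips / 2) >= k) / 2 ** flips

def power(flips, true_rate, alpha=.05):
    return sum(comb(flips, h) * true_rate ** h * (1 - true_rate) ** (flips - h)
               for h in range(flips + 1) if two_sided_p(h, flips) < alpha)

print(power(20, 0.75))   # ~.62 - 20 flips miss even a big weighting ~40% of the time
print(power(100, 0.75))  # ~1.0 - 100 flips would almost never miss it
print(power(20, 0.50))   # ~.04 - and if the coin is fair, this is the fluke rate
```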
in a comment on that blog post, uli schimmack put it best:

'[...] often a simple explanation for failed replications is that the original studies capitalized on chance/sampling error. There is no moderator to be found. It is just random noise and you cannot replicate random noise. There is simply no empirical, experimental solution to find out why a particular study at one particular moment in history produced a particular result.'

it seems like that should be all that needs to be said. i feel like this entire blog post should be unnecessary.*** but this idea that those who fail to replicate a finding always need to explain the original result keeps coming up, and i think it's harmful. i think it's harmful because it confuses people about how probability and error work. and it's harmful because it puts yet another burden on replicators, who are already, to my mind, taking way too much shit from way too many people for doing something that should be seen as a standard part of science, and is actually a huge service to the field.**** let's stop beating up on them, and let's stop asking them to come up with theoretical explanations for what might very well be statistical noise.

* it's not really ok.
** get your shit together, swan.
*** except for the underwater hippo.
**** if you think doing a replication is a quick and easy shortcut to fame, glory, and highly cited publications, call me. we need to talk.

this is not a hippo.