this idea keeps popping up: if you conduct a replication study and get a null result, you need to explain why the original study found a significant effect and you didn't.
what's wrong with this idea? a few things.

first, it seems to discount the possibility that the original finding was a fluke - a false positive that made it look like there is an effect when in fact there isn't. here's an analogy:

null hypothesis: my coin is fair
research hypothesis: my coin is weighted
original study: i flip the coin 20 times and get 15 heads (p = .041)
replication study: i flip the coin another 20 times and get 10 heads (p = 1.0)
do i need to explain why i got 15 heads the first time?

maybe. or maybe the first study was just a fluke. that happens sometimes (4.1% of the time, to be exact).
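if you want to check that number, here is one quick way to do it in R (just a sanity check - any stats package will do):

2 * pbinom(14, 20, 0.5, lower.tail = FALSE)   # prob. of 15 or more heads, or 5 or fewer, with a fair coin: ~ .041
binom.test(15, 20, 0.5)                       # same thing as a canned two-sided test: p ~ .041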
what if the replication study was: i flip the same coin 100 times and get 50 heads? now isn't the evidence pretty strong that the null is true, and the original study was just a fluke?

when an original study had low power and the replication study had lots more power, we are kind of in this situation. the most likely explanation is not that there is a meaningful difference between the original study and the replication study that needs to be explained; the most likely explanation is that the original study was a fluke, and in fact there is no effect.

am i implying the original result was p-hacked??

no. if we assume that people sometimes p-hack, then a false positive becomes an even more likely explanation, but if you're uncomfortable with this assumption, that's ok.* it's still pretty plausible that a small, underpowered study that got a barely-significant result was a false positive, even without p-hacking.

where does the idea that we need to explain the original result come from?

i think one source is that we underestimate randomness. we think that a significant result is almost like an existence proof (this doesn't help). that is, we think that if someone found an effect once, the effect really happened, just like if someone saw a black swan once, black swans must really exist. this has been debunked many times, but i think even when people rationally understand why the analogy doesn't hold, we have a hard time shaking the gut feeling that it happened once so it must have been true. but that's like saying that if i got 15/20 heads once, then the coin must have been weighted, at least at that time.
why doesn't the analogy hold? because there is error in our studies, but not in the sighting of the black swan, or in any other existence proof. the better analogy would be if someone reported that they are 95% sure that they saw a black swan, but it's possible it was actually a really dirty white swan** - they can't be totally sure. (well, there are other reasons the analogy doesn't hold up, so don't think about it too hard, but that's the main one).

but isn't it possible the original result was real and there is a theoretical explanation for the different result in the replication study (i.e., a moderator)?

absolutely. and it's also possible the replication is a type II error and the effect is real and robust. but my point is that it's also possible - very, very possible - that the original result was a fluke and there never was any effect.

how do we decide among these explanations (no real effect, a robust real effect, a real effect with boundary conditions/moderators)? we don't have to decide once and for all - we shouldn't treat a single study (original or replication) as definitive anyway. but we should weigh the evidence: if the replication study has no known flaws, was well-powered, and the original study had low power and found a barely-significant effect, then i would lean towards believing that the original was a fluke. if the replication study is demonstrably flawed and/or underpowered, then it should be discounted or weighed less. if both studies were well-powered and rigorous, then we should look for moderators, and test them. (actually, i'm totally happy with looking for moderators and testing them anyway, as long as someone else does the work.)

my point is simply that when the original study was relatively modest to begin with, and the replication study is rigorous, there is no need for any other explanation than 'hey, look, that original finding may have been a false positive.' i'm not saying let's decide 100% for sure it was a false positive and never study the phenomenon again. i'm just saying let's not make the replicators come up with a theoretical explanation for the difference between their results and the original results. that would be overfitting - choosing a more complicated model when a simple one suffices: flukes happen. (for a nice demonstration of just how flukey p-values are, watch this.)

in a recent blog post, sam schwarzkopf made an argument related to the one i'm trying to debunk here. in it, he wrote: "the only underlying theory replicators put forth is that the original findings were spurious and potentially due to publication bias, p-hacking and/or questionable research practices. This seems mostly unfalsifiable."

it's absolutely falsifiable: run a well-powered, pre-registered direct replication of the original study, and get a significant result.
in a comment on that blog post, uli schimmack put it best:

'[...] often a simple explanation for failed replications is that the original studies capitalized on chance/sampling error. There is no moderator to be found. It is just random noise and you cannot replicate random noise. There is simply no empirical, experimental solution to find out why a particular study at one particular moment in history produced a particular result.'

it seems like that should be all that needs to be said. i feel like this entire blog post should be unnecessary.*** but this idea that those who fail to replicate a finding always need to explain the original result keeps coming up, and i think it's harmful. i think it's harmful because it confuses people about how probability and error work. and it's harmful because it puts yet another burden on replicators, who are already, to my mind, taking way too much shit from way too many people for doing something that should be seen as a standard part of science, and is actually a huge service to the field.**** let's stop beating up on them, and let's stop asking them to come up with theoretical explanations for what might very well be statistical noise.

* it's not really ok.
** get your shit together, swan.
*** except for the underwater hippo.
**** if you think doing a replication is a quick and easy shortcut to fame, glory, and highly cited publications, call me. we need to talk.

this is not a hippo.
Thanks for this post and the invitation to comment! Please forgive any grammatical or typographical errors. Severe lack of time on my end...
First of all, I don't understand how you get the p-value in your coin flip example. A binomial test should give you p<0.01 in that situation? Simulation (parametric bootstrap, if you will) also gives me 0.01ish. I'm probably missing something...
Anyway, I think there is a very big and stubborn misunderstanding in this whole debate. As you will see in my blog post, I account for the possibility that the swan might just be really filthy (although my example was poor lighting). Of course results can be flukes. Most results probably are, especially if the experiment lacks power/sensitivity.
I never argued that replicators (henceforth, the "muggles") must explain their failure to replicate. I am arguing that *all* science should seek to explain observations. This includes the experiments by the original authors (the "wizards"). I advocate what Platt called "strong inference".
The null hypothesis is not a very strong hypothesis. It is certainly true in many situations but your argument is purely statistical. Accumulating a lot of evidence supporting the null is *not* a way to falsify the hypothesis. Instead you are merely doing what Psi researchers are doing when they collect lots of data that appears consistent with precognition or telepathy. You are not confirming the existence (or the absence) of anything. You are simply observing.
There is nothing wrong with observation. We need observation in science and it is perhaps even good to have more of it. As Tony Movshon said in some online debate I saw "I'd never be against the accumulation of knowledge." I just expect science to do more.
Returning to your coin flip example, as a muggle you simply cannot ever be sure that you aren't somehow doing it wrong. In fact, you can't even be sure that you are using the same coin as the wizard. Even if a lot of muggles get together they cannot really be sure that they aren't somehow doing it wrong - although I would concede that the more coin flips are being done, the stronger the evidence that the coin is fair. However, I would argue that it is a waste of time to do that for every single coin out there...
I did not say you need to explain why you failed to reproduce the results. I am arguing that when you design an experiment, you should try to make inferences about the world. Replication should not be the domain of muggles. Replication should be part of *all* studies. So whenever you design an experiment that is based on previous literature (which is true for almost all experiments), you should incorporate a replication in that study. This is really just a question of good experimental design. It is fair enough if that replication fails. You can then report the whole experiment like this:
"Previous research showed X.
Here we tested whether X could be due to Y.
We failed to replicate X.
When we tested Y, we found Z.
We reject our hypothesis that Y causes X.
Instead we learned that Y causes Z.
It is possible that X was a fluke."
It really mystifies me why nobody seems to understand that this is a better way to do science than what muggles are doing:
"Previous research showed X.
We don't believe X exists.
We tried but failed to find X.
X could have been a fluke unless we did something wrong.
We have learned nothing new about the world whatsoever."
It is perfectly acceptable to say "I don't believe this". It's even better if you can say why you don't believe it but that's perhaps optional. But if your working hypothesis is that X was a fluke (or due to QRPs), then there is probably no experiment that can confirm or even strongly support this hypothesis.
Posted by: Sam Schwarzkopf | 10 April 2015 at 01:22 AM
What's the p-value for the second picture not being a picture of a hippo? Does it depend on the quality of your vision? On your attention? On your knowledge of animals (and burgers)?
By the way, I agree with you: other people's experiments should not compel commentary - just, maybe, citations...
Wait a minute, I'm starting to think that the following sentence was actually neither a joke nor a sign of madness:
"If the experiment doesn't fit well with your theory, then do the experiment again..."
I feel like the sentence was missing its ending:
"...and hope that you reach a shiny enough p-value to decide whether you've hit an (anti-)fluke or a flawed theory."
Posted by: vincent | 10 April 2015 at 01:34 AM
Binomial probability of 15 out of 20 is 0.0148 (binopdf(15,20,0.5)). Where does your number come from?
P.S. people who live in glass houses shouldn't throw stones
Posted by: Anomynous | 10 April 2015 at 01:42 AM
the probability i'm referring to (p = .041) is the two-tailed probability of getting a result that extreme or more extreme (to make it parallel to psych studies). so i am including the probabilities of getting 15, 16, 17, 18, 19, 20, 5, 4, 3, 2, 1, and 0 heads.
but, i did do the calculation in an excel spreadsheet so i might be wrong. will double check my math when i'm out of my current meeting, but feel free to correct me if you beat me to the punch!
-simine
Posted by: simine | 10 April 2015 at 01:45 AM
Even as a two-tailed probability I believe it ought to be more like 0.02 (i.e., twice the one I mentioned, which was .009ish). But I may be wrong and definitely can't do it better in my head.
I wholeheartedly disagree with the comment about glass houses. Apart from the fact that this is really a minor point, scientists must be allowed to make mistakes. Part of our problem is that we still have a culture that prohibits people from admitting mistakes and accepting flukes. This must change if science is to improve!
Posted by: Sam Schwarzkopf | 10 April 2015 at 02:22 AM
All Simine is saying, as far as I can tell, is that there is no particular truth value to data collected in January as compared to May. Being first to collect data on a question doesn't make you right--it just makes you first.
If you have two studies in front of you, and you thought they were done simultaneously--one with a small N and p less than .05, and one with a large N and p greater than .50--what would you believe?
The order does NOT matter (indeed, see Eidelman & Crandall, 2014--is self-promotion allowed here?). It's a case of status quo bias, and it's just a bias.
However, there IS a term of art in this blog entry that deserves some attention, and that is the word "rigorous." By this, I think Simine means "well" or "properly" or "carefully recreating the original conditions" or the like. I object to the word "rigorous" because it connotes attention to Type I error over Type II error, and that is a value judgment I don't share. I suspect that connotation was inadvertent; Simine can say otherwise if she meant it that way.
But here's the thing--some effects require substantial skill to pull off. I doubt that a person with a touch of the Asperger's could ever pull off a good forced-compliance experiment. The effect (highly replicated and well-established) is not easy to pull off--it requires a certain social sensitivity to demonstrate. But to say that that undermines the reality of the phenomenon is akin to saying there's no such thing as a good salesperson or good sales technique. Of course there is, but not everyone can do it. And, I believe, good science is harder to do than car sales.
In sum: Order doesn't matter. Skill does.
Eidelman, S., & Crandall, C. S. (2014). The Intuitive Traditionalist: How Biases for Existence and Longevity Promote the Status Quo. Advances in Experimental Social Psychology, 50, 53-104.
Posted by: Chris C. | 10 April 2015 at 02:27 AM
Chris - I agree with pretty much everything you wrote. I do believe studies take some skill to run. I also think that if someone criticizes a researcher (usually an author of a replication study) for lacking the skill/expertise, they should be able to point to a flaw in the study (and this presumes the replication author made their materials/data available, which they should). Unless the replication author has no track record of being a good researcher/doing psych studies, I think it's not fair if critics can just say 'they lacked expertise' without pointing to any problem with the study. I think this happens, and I wish we put more of a burden on critics who say that a study failed because the researchers did something wrong to actually point to what was done wrong, or suggest some things at least.
Sam - re: probabilities. I think you're referring to the probability of getting exactly 15 heads (or that probability times two). I'm referring to the probability of getting at least 15 heads or at least 15 tails. I checked my math again and got the same answer, but I am definitely open to being wrong!
re: the more substantive point. I think we agree that the more information a study provides, the better, so if it can not only rule out one theory but also test another, that's even better. I think where we may disagree is that I still think there is a lot of value to a study that does nothing more than repeat another study but with even more power/precision. Especially if that first study has a lot riding on it (got a lot of attention, has a lot of implications for other theories, etc.).
Posted by: simine | 10 April 2015 at 02:43 AM
Re: "Order doesn't matter." Order actually does matter, but this works against the initial publishers of a claim. A standard regression to the mean argument, such as that taught to first year psychology students (also called the "significance filter", etc) suggests that if you're bothering to take a second look at an experiment, the first result was probably an over-estimate.
So, order doesn't matter *logically*, but it sure does matter when you're trying to evaluate why the first result is successful and the second isn't. There's nothing mysterious about it; it's just regression to the mean.
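To put rough numbers on it, here is a quick simulation sketch (made-up values: a small true effect of d = .2, n = 20 per group, and a normal approximation for the observed effect sizes):

set.seed(1)
k <- 100000; n <- 20; d_true <- 0.2        # 100,000 hypothetical 'original' studies
se <- sqrt(2 / n)                          # rough standard error of an observed d
d_orig <- rnorm(k, d_true, se)             # observed effect sizes
sig <- abs(d_orig / se) > 1.96             # the ones that clear p < .05
mean(d_orig[sig])                          # roughly .7 - inflated by the significance filter
mean(rnorm(sum(sig), d_true, se))          # replications of just those studies: back near .2

The originals that happen to reach significance overestimate the effect, and replications of exactly those studies drift back toward the true value - no moderator required.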
Posted by: Richard D. Morey | 10 April 2015 at 03:40 AM
another way in which order matters, in practice: if the second study is a direct replication of the first, the first study serves as quasi-pre-registration of the second - it constrains researcher degrees of freedom. so there is more possibility of p-hacking in the first study than the second.
-simine
Posted by: simine | 10 April 2015 at 03:41 AM
The underwater hippo is very cute :-), but there's also an elephant in the room here, namely the relative prestige and power of the authors of the original and replication studies. It ought not to matter whether the first is a named-chair full professor and the second is a grad student. In practice, it does. We're never going to be doing "science" properly until we stop doing the "comédie humaine".
Posted by: Nick Brown | 10 April 2015 at 04:58 AM
I think as usual this discussion is missing the point. Obviously doing good experiments takes skill. This is why you need to provide evidence that you can detect a *credible* effect. Naturally this effect cannot be the one that you are studying. That's a circular argument. So you cannot say "to replicate X you must have proven that you can show X".
But what you should do is this:
"I want to test X. Here I confirm a similar effect Y which is well established. I nevertheless found no evidence of X therefore I don't think X is true."
What most muggles are doing is this:
"I want to test X. I didn't find X therefore I don't think X is true"
Doesn't anyone see the difference? It's not about replicators vs famous authors. It's not about muggles vs wizards. It's about good experimental design. Good experiments contain a control condition.
Most direct replications do not.
Simine: I will look at the binomial stats thing later. I may very well be wrong. I usually am.
Posted by: Sam Schwarzkopf | 10 April 2015 at 06:20 AM
In related news, psychologists stink at probability. An R simulation of what Simine is doing:
f <- rbinom(100000, 20, .5)                  # simulate 100,000 sets of 20 fair coin flips
(sum(f > 14) + sum(f < 6)) / 100000          # proportion with 15 or more heads, or 5 or fewer
An excel exact computation:
=(1-binomdist(14, 20, .5, true))*2
An explanation:
First, we want to know the probability of 15 or greater. Taking the cumulative probability of 14 or less out of 20 trials gives me everything EXCEPT 15 or greater (this is about .979). Subtract that from 1 to get the probability for 15 or greater =~ .02. Multiply that times 2 to make it 2-tailed (i.e., 15 or greater OR 5 or less) to get ~.04.
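For what it's worth, the canned test in base R should give the same two-sided answer directly:

binom.test(15, 20, .5)   # two-sided p-value, should come out around .041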
Posted by: Ryne Sherman | 10 April 2015 at 06:34 AM
Regarding the order of findings I actually completely agree with Richard. It should be about testing theories/hypotheses, and who published what first is irrelevant to that.
The thing is that direct replications by definition seek to test findings, not theories. I have far greater confidence in a well-designed failed conceptual replication than in a failed direct replication. That is not to say that direct replications cannot support the null hypothesis - of course they can. But a good theoretical argument is far better than a direct replication - especially if you cannot be sure that the direct replication is valid.
Now cue the next person saying that I'm holding muggles to a different standard than wizards. I don't. I don't subscribe to that distinction. I don't believe there are wizards. There are only scientists. It should be a daisy chain. The original authors should have replicated a previous finding as part of *their* study.
The fact that this concept seems so difficult for people to grasp is the real problem our field is facing.
Posted by: Sam Schwarzkopf | 10 April 2015 at 06:41 AM
Sam, if I understand what you're calling a "control condition," I think you mean building in a way to validate the methods (both that they are generally valid and that they were correctly implemented in a given experiment). If that is missing from a direct replication then it was missing from the original. And it is just as much of a problem for the original study, possibly more so:
http://hardsci.wordpress.com/2014/07/07/failed-experiments-do-not-always-fail-toward-the-null/
Posted by: Hardsci | 10 April 2015 at 06:46 AM
Hi Sam: I don't think people have trouble grasping the ideal you espouse. Lots of people have expressed the desire to see original authors publish adequately powered direct replications and then extensions in multi-study packages. How many of those kinds of findings fail to replicate?
The point about the informativeness of a failure of well-designed conceptual replication is tricky. I predict a chorus of people pointing to the Duhem-Quine thesis. I just think it is a lot easier to judge the validity of a direct replication as opposed to the validity of a conceptual replication. (Note: I am not saying either is absolutely easy to judge).
Posted by: BrentDonnellan | 10 April 2015 at 06:58 AM
HardSci: "If that is missing from a direct replication then it was missing from the original"
Yes! Now we're getting somewhere. If you realise this, then perhaps you realise why most science is garbage :P. And I am not sure you need to replicate garbage.
BrentDonnellan: "adequately powered direct replications" are not control conditions. But they are admittedly good design.
Posted by: Sam Schwarzkopf | 10 April 2015 at 07:05 AM
As you will have seen on Twitter, I posted a reply to this (and others) on my blog, so we can continue any discussion there (although I should do some work...).
However, just checking back in to say that you are right about the probability. If it's two-tailed it's around 0.041. That's why you shouldn't do statistics in the pub.
Posted by: Sam Schwarzkopf | 10 April 2015 at 11:34 AM
@ Sam Schwarzkopf, that seems a bit bait-and-switch to me. Why do you think it doesn't make sense (or why are you unsure it does) to replicate garbage studies if they're widely cited and believed, and reviewers regularly compel you to pretend they're true? Because that is what everyone else is talking about.
If arguments about quality were all it took to remedy the above, do you think people would feel the need to replicate?
Posted by: genobollocks | 10 April 2015 at 06:37 PM
@genobollocks: It's the weekend now and I am losing the energy to repeat myself over and over. You are putting words in my mouth. Please read my rejoinder post about it:
https://neuroneurotic.wordpress.com/2015/04/10/black-coins-swan-flipping/
If it's still unclear after that, please post a comment there and ask me to clarify (can't promise I will though as next week will be hectic and after that I am off the grid).
I should also concede that the "garbage" comment might have been a somewhat imprecise reflection of my views as it was probably a little beer-fueled... You can blame Neuroskeptic, Neuroconscious and Neurobiblical for that. But not me. Never me :P
Posted by: Sam Schwarzkopf | 11 April 2015 at 06:36 AM