

Sam Schwarzkopf

Thanks for this post and the invitation to comment! Please forgive any grammatical or typographical errors. Severe lack of time on my end...

First of all, I don't understand how you get the p-value in your coin flip example. A binomial test should give you p<0.01 in that situation? Simulation (parametric bootstrap, if you will) also gives me 0.01ish. I'm probably missing something...

Anyway, I think there is a very big and stubborn misunderstanding in this whole debate. As you will see in my blog post, I account for the possibility that the swan might just be really filthy (although my example was poor lighting). Of course results can be flukes. Most results probably are, especially if the experiment lacks power/sensitivity.

I never argued that replicators (henceforth, the "muggles") must explain their failure to replicate. I am arguing that *all* science should seek to explain observations. This includes the experiments by the original authors (the "wizards"). I advocate what Platt called "strong inference".

The null hypothesis is not a very strong hypothesis. It is certainly true in many situations but your argument is purely statistical. Accumulating a lot of evidence supporting the null is *not* a way to falsify the hypothesis. Instead you are merely doing what Psi researchers are doing when they collect lots of data that appears consistent with precognition or telepathy. You are not confirming the existence (or the absence) of anything. You are simply observing.

There is nothing wrong with observation. We need observation in science and it is perhaps even good to have more of it. As Tony Movshon said in some online debate I saw "I'd never be against the accumulation of knowledge." I just expect science to do more.

Returning to your coin flip example, as a muggle you simply cannot ever be sure that you aren't somehow doing it wrong. In fact, you can't even be sure that you are using the same coin as the wizard. Even if a lot of muggles get together they cannot really be sure that they aren't somehow doing it wrong - although I would concede that the more coin flips are being done, the stronger the evidence that the coin is fair. However, I would argue that it is a waste of time to do that for every single coin out there...

I did not say you need to explain why you failed to reproduce the results. I am arguing that when you design an experiment, you should try to make inferences about the world. Replication should not be the domain of muggles. Replication should be part of *all* studies. So whenever you design an experiment that is based on previous literature (which is true for almost all experiments) you should incorporate a replication in that study. This is really just a question of good experimental design. It is fair enough if that replication fails. You can then report that whole experiment in this way:

"Previous research showed X.
Here we tested whether X could be due to Y.
We failed to replicate X.
When we tested Y, we found Z.
We reject our hypothesis that X causes Y.
Instead we learned that Y causes Z.
It is possible that X was a fluke."

It really mystifies me why nobody seems to understand that this is a better way to do science than what muggles are doing:

"Previous research showed X.
We don't believe X exists.
We tried but failed to find X.
X could have been a fluke unless we did something wrong.
We have learned nothing new about the world whatsoever."

It is perfectly acceptable to say "I don't believe this". It's even better if you can say why you don't believe it but that's perhaps optional. But if your working hypothesis is that X was a fluke (or due to QRPs), then there is probably no experiment that can confirm or even strongly support this hypothesis.


What's the p-value of the second picture not being the capture of a hippo? Does it depend on the quality of your vision? Of your attention? Of your knowledge about animals (and burgers)?

By the way, I agree with you, the experiments of others shall not impose commentaries -just maybe citations- ...

Wait a minute, I start to think that the following sentence was actually neither a joke nor a sign of madness:

"If the experiment doesn't fit well with your theory, then do the experiment again..."

I feel like the end of the sentence had been cut off:

"...and hope that you reach a shiny enough p-value to decide whether you've been hitting an (anti-)fluke or a flawed theory."


Binomial probability of 15 out of 20 is 0.0148 (binopdf(15,20,0.5)). Where does your number come from?

P.S. people who live in glass houses shouldn't throw stones


the probability i'm referring to (p = .041) is the two-tailed probability of getting a result that extreme or more extreme (to make it parallel to psych studies). so i am including probability of getting: 15, 16, 17, 18, 19, 20, 5, 4, 3, 2, 1, and 0 heads.
but, i did do the calculation in an excel spreadsheet so i might be wrong. will double check my math when i'm out of my current meeting, but feel free to correct me if you beat me to the punch!

Sam Schwarzkopf

Even as a two tailed probability I believe it ought to be more like 0.02 (ie twice the one I mentioned which was .009ish). But I may be wrong and definitely can't do it better in my head.

I wholeheartedly disagree with the comment about glass houses. Apart from the fact that this is really a minor point, scientists must be allowed to make mistakes. Part of our problems is that we still have a culture that prohibits people from admitting mistakes and accepting flukes. This must change if science is to improve!

Chris C.

All Simine is saying, as far as I can tell, is that there is no particular truth value to data collected in January as compared to May. Being first to collect data on a question doesn't make you right--it just makes you first.

If you have two studies in front of you, and thought them done simultaneously, one small N with p less than .05 and one large N with p greater than .50, what would you believe?

The order does NOT matter (indeed, see Eidelman & Crandall, 2014--is self-promotion allowed here?). It's a case of status quo bias, and it's just a bias.

However, there IS a term of art in this blog entry that deserves some attention, and that is the word "rigorous." By this, I think Simine means "well" or "properly" or "carefully recreating the original conditions" or the like. I object to the word "rigorous" because it connotes attention to Type I error over Type II error, and that is a value judgment I don't share. I suspect that connotation was inadvertent; Simine can say otherwise if she meant it that way.

But here's the thing--some effects require substantial skill to pull off. I doubt that a person with a touch of the Asperger's could ever pull off a good forced-compliance experiment. The effect (highly replicated and well-established) is not easy to pull off--it requires a certain social sensitivity to demonstrate. But to say that that undermines the reality of the phenomenon is akin to saying there's no such thing as a good salesperson or good sales technique. Of course there is, but not everyone can do it. And, I believe, good science is harder to do than car sales.

In sum: Order doesn't matter. Skill does.

Eidelman, S., & Crandall, C. S. (2014). The Intuitive Traditionalist: How Biases for Existence and Longevity Promote the Status Quo. Advances in Experimental Social Psychology, 50, 53-104.


Chris - I agree with pretty much everything you wrote. I do believe studies take some skill to run. I also think that if someone criticizes a researcher (usually an author of a replication study) for lacking the skill/expertise, they should be able to point to a flaw in the study (and this presumes the replication author made their materials/data available, which they should). Unless the replication author has no track record of being a good researcher/doing psych studies, I think it's not fair if critics can just say 'they lacked expertise' without pointing to any problem with the study. I think this happens, and I wish we put more of a burden on critics who say that a study failed because the researchers did something wrong to actually point to what was done wrong, or suggest some things at least.

Sam - re: probabilities. I think you're referring to the probability of getting exactly 15 heads (or that probability times two). I'm referring to the probability of getting at least 15 heads or at least 15 tails. I checked my math again and got the same answer, but I am definitely open to being wrong!

re: the more substantive point. I think we agree that the more information a study provides, the better, so if it can not only rule out one theory but also test another, that's even better. I think where we may disagree is that I still think there is a lot of value to a study that does nothing more than repeat another study but with even more power/precision. Especially if that first study has a lot riding on it (got a lot of attention, has a lot of implications for other theories, etc.).

Richard D. Morey

Re: "Order doesn't matter." Order actually does matter, but this works against the initial publishers of a claim. A standard regression to the mean argument, such as that taught to first year psychology students (also called the "significance filter", etc) suggests that if you're bothering to take a second look at an experiment, the first result was probably an over-estimate.

So, order doesn't matter *logically*, but it sure does matter when you're trying to evaluate why the first result is successful and the second isn't. There's nothing mysterious about it; it's just regression to the mean.
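The regression-to-the-mean argument can be illustrated with a small simulation. This is only a sketch under assumptions that are not from the discussion above: a true standardized effect of 0.2, 20 observations per study, a known standard deviation of 1 (so each study's estimate is drawn directly from its sampling distribution), and "publication" only when the first study is significant in the predicted direction. The function name `significance_filter` and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import mean

def significance_filter(true_effect=0.2, n=20, n_studies=20000, seed=1):
    """Compare original and replication estimates when only significant
    original studies get a second look (illustrative assumptions only)."""
    rng = random.Random(seed)
    se = 1 / sqrt(n)  # standard error of the mean, assuming sd = 1
    originals, replications = [], []
    for _ in range(n_studies):
        first = rng.gauss(true_effect, se)  # estimate from the original study
        if first / se > 1.96:               # kept only if significant (z > 1.96)
            originals.append(first)
            # an independent replication with the same n and the same true effect
            replications.append(rng.gauss(true_effect, se))
    return mean(originals), mean(replications), len(originals)

orig_mean, rep_mean, n_published = significance_filter()
print(f"published originals: {n_published}")
print(f"mean original estimate:    {orig_mean:.3f}")
print(f"mean replication estimate: {rep_mean:.3f}")
```

Under these assumptions the selected original estimates average well above the true effect, while the replications average close to it, even though nothing differs between the two sets of studies except the selection step.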


another way in which order matters, in practice: if the second study is a direct replication of the first, the first study serves as quasi-pre-registration of the second - it constrains researcher degrees of freedom. so there is more possibility of p-hacking in the first study than the second.

Nick Brown

The underwater hippo is very cute :-), but there's also an elephant in the room here, namely the relative prestige and power of the authors of the original and replication studies. It ought not to matter whether the first is a named-chair full professor and the second is a grad student. In practice, it does. We're never going to be doing "science" properly until we stop doing the "comédie humaine".

Sam Schwarzkopf

I think as usual this discussion is missing the point. Obviously doing good experiments takes skill. This is why you need to provide evidence that you can detect a *credible* effect. Naturally this effect cannot be the one that you are studying. That's a circular argument. So you cannot say "to replicate X you must have proven that you can show X".

But what you should do is this:
"I want to test X. Here I confirm a similar effect Y which is well established. I nevertheless found no evidence of X therefore I don't think X is true."

What most muggles are doing is this:
"I want to test X. I didn't find X therefore I don't think X is true"

Doesn't anyone see the difference? It's not about replicators vs famous authors. It's not about muggles vs wizards. It's about good experimental design. Good experiments contain a control condition.

Most direct replications do not.

Simine: I will look at the binomial stats thing later. I may very well be wrong. I usually am.

Ryne Sherman

In related news, psychologists stink at probability. An R simulation of what Simine is doing:

f <- rbinom(100000, 20, .5)
(sum(f > 14) + sum(f < 6)) / 100000

An excel exact computation:

=(1-binomdist(14, 20, .5, true))*2

An explanation:

First, we want to know the probability of 15 or greater. Taking the cumulative probability of 14 or less out of 20 trials gives me everything EXCEPT 15 or greater (this is about .979). Subtract that from 1 to get the probability for 15 or greater =~ .02. Multiply that times 2 to make it 2-tailed (i.e., 15 or greater AND 5 or less) to get ~.04.
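For anyone who wants to check the arithmetic without Excel, here is a minimal Python sketch of the same exact computation. It also shows where the smaller numbers earlier in the thread come from: the probability of exactly 15 heads versus the probability of 15 or more. The helper names `binom_pmf` and `two_tailed_p` are made up for illustration, and the two-tailed helper assumes a fair coin and k greater than n/2.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tailed_p(k, n):
    """Two-tailed p-value for a fair coin: P(X >= k) + P(X <= n - k).
    Assumes k > n / 2 so that the two tails do not overlap."""
    upper = sum(binom_pmf(i, n) for i in range(k, n + 1))      # k or more heads
    lower = sum(binom_pmf(i, n) for i in range(0, n - k + 1))  # n - k or fewer heads
    return upper + lower

exactly_15 = binom_pmf(15, 20)                               # point probability
at_least_15 = sum(binom_pmf(i, 20) for i in range(15, 21))   # one tail
p_two_tailed = two_tailed_p(15, 20)                          # both tails

print(round(exactly_15, 4), round(at_least_15, 4), round(p_two_tailed, 4))
# prints: 0.0148 0.0207 0.0414
```

The three values line up with the numbers quoted in this thread: about .015 for exactly 15 heads, about .02 for 15 or more, and about .041 for the two-tailed test.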

Sam Schwarzkopf

Regarding the order of findings I actually completely agree with Richard. It should be about testing theories/hypotheses and who published what first is irrelevant in that.

The thing is that direct replications by definition seek to test findings, not theories. I have far greater confidence in a well designed failed conceptual replication than a failed direct replication. That is not to say that direct replications cannot support the null hypothesis - of course they can. But a good theoretical argument is far better than a direct replication - especially if you cannot be sure that the direct replication is valid.

Now cue the next person saying that I'm holding muggles to a different standard than wizards. I don't. I don't subscribe to that distinction. I don't believe there are wizards. There are only scientists. It should be a daisy chain. The original authors should have replicated a previous finding as part of *their* study.

The fact that this concept seems so difficult for people to grasp is the real problem our field is facing.


Sam, if I understand what you're calling a "control condition," I think you mean building in a way to validate the methods (both that they are generally valid and that they were correctly implemented in a given experiment). If that is missing from a direct replication then it was missing from the original. And it is just as much of a problem for the original study, possibly more so:



Hi Sam: I don't think people have trouble grasping the ideal you espouse. Lots of people have expressed the desire to see original authors publish adequately powered direct replications and then extensions in multi-study packages. How many of those kinds of findings fail to replicate?

The point about the informativeness of a failure of well-designed conceptual replication is tricky. I predict a chorus of people pointing to the Duhem-Quine thesis. I just think it is a lot easier to judge the validity of a direct replication as opposed to the validity of a conceptual replication. (Note: I am not saying either is absolutely easy to judge).

Sam Schwarzkopf

HardSci: "If that is missing from a direct replication then it was missing from the original"

Yes! Now we're getting somewhere. If you realise this, then perhaps you realise why most science is garbage :P. And I am not sure you need to replicate garbage.

BrentDonnellan: "adequately powered direct replications" are not control conditions. But they are admittedly good design

Sam Schwarzkopf

As you will have seen on Twitter I posted a reply to this (and others) on my blog so we can continue any discussion there (although I should do some work...).

However, just checking back in to say that you are right about the probability. If it's two-tailed it's around 0.041. That's why you shouldn't do statistics in the pub.


@ Sam Schwarzkopf, that seems a bit bait-and-switch to me. Why do you not think (or are not sure) it makes sense to replicate garbage studies if they're widely cited, believed and reviewers regularly compel you to pretend they're true? Because that is what everyone else is talking about.

If arguments about quality were all it took to remedy the above, do you think people would feel the need to replicate?

Sam Schwarzkopf

@genobollocks: It's the weekend now and I am losing the energy to repeat myself over and over. You are putting words in my mouth. Please read my rejoinder post about it:

If it's still unclear after that, please post a comment there and ask me to clarify (can't promise I will though as next week will be hectic and after that I am off the grid).

I should also concede that the "garbage" comment might have been a somewhat imprecise reflection of my views as it was probably a little beer-fueled... You can blame Neuroskeptic, Neuroconscious and Neurobiblical for that. But not me. Never me :P
