


Jake Westfall

The preliminary results shared at APS, based on something like 30 of the eventual 180 or so studies to be run, were that -- depending on how you count what is considered a successful replication -- the replication rate is as low as 1/3 or as high as 2/3. The former estimate is based on the simple "Is the replication significant?" rule, while the latter is based on pooling the replication estimate together with the original estimate and then testing it all against 0. Which seems pretty questionable to me since we know the original estimates are likely inflated for various reasons. Anyway, it will be interesting to see how these estimates change (or not) as more studies come in and as we try different ways of evaluating the results statistically, such as Simonsohn's procedure and the Verhagen/Wagenmakers procedure.

Chris Crandall

I was present at the APS presentation. Simine, I wish to encourage you to re-specify your hypothesis. It was the case that about one-third of the replications crossed the p<.05 threshold.

But that's the wrong way to present the data, and the wrong way to draw conclusions. It is dichotomizing the data--a practice that cannot be supported as an analytic strategy.

By contrast, we did learn that, across the nearly 30 studies, the original effect sizes correlated with the replication effect sizes at r = .60. I ask you: with N = 30, how high could that correlation be expected to be? I have asked a dozen or more social psychologists about this, and NOT ONE of them predicted a correlation greater than .60 (and only one predicted as high as .60).

And so, instead of using dichotomous data, which are certain to underestimate the replicability (and certainly under-describe it), I encourage you to make an "effect-size" estimate, rather than a wins/losses estimate.

Chris Crandall

Oh, and by the way, p-curve analyses do NOT generate an estimate of false positives. Even when p-curve analysis generates an "alarm," it is only indicative of researchers stopping their research protocols when "significance" has been found, or of p-values being slightly nudged over an arbitrary line through dropping participants, ANCOVA, and the like. This is modestly correlated with false positives, but it's not the false positive itself. The p-curve analysis will almost certainly, if treated as evidence of false positives, over-estimate the prevalence of Type I error.


hi chris,
that's an interesting way to think about it! i would argue that, with an N of 30, the correlation could easily be .60 or even .90, because that is a very small N so it's easy to get fluky (high or low) correlations. the confidence interval around that .60 is quite big.
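for what it's worth, the width of that interval is easy to check with the standard Fisher z transformation. a quick sketch (the r = .60 and N = 30 figures are from the thread above; the formula is the textbook large-sample approximation):

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation,
    via the Fisher z transform."""
    z = math.atanh(r)                 # transform r to z, approximately normal
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

lo, hi = fisher_ci(0.60, 30)
print(f"95% CI around r = .60 with N = 30: ({lo:.2f}, {hi:.2f})")
```

the interval runs from roughly .31 to .79, so anything from a modest to a very strong population correlation is consistent with the observed .60.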
still, i take your point, and i agree that the pass/fail mentality is not a good one. thanks for pointing that out.
however, i have a bigger concern about the approach you propose, namely, that the original studies and the replication studies could have effect sizes that correlate perfectly, but the original studies could have effect sizes in the d = .5 to d = 2.0 range, while the replication studies could all have much smaller effects (i.e., correlations are not sensitive to differences in level/means/magnitude). so maybe a better statistic to use would be some kind of intraclass correlation that also takes into account differences in magnitude.
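to illustrate, here's a toy example (the numbers are hypothetical, and i'm using Lin's concordance correlation coefficient as a simple stand-in for an intraclass correlation) where original and replication effect sizes correlate perfectly but agree poorly in magnitude:

```python
def pearson(x, y):
    """Ordinary Pearson correlation: sensitive only to rank order / linear trend."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return sxy / (sx * sy)

def concordance(x, y):
    """Lin's CCC: also penalizes differences in mean level and scale."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (vx + vy + (mx - my) ** 2)

original = [0.5, 1.0, 1.5, 2.0]          # hypothetical original d's
replication = [d / 2 for d in original]  # every replication effect halved

print(pearson(original, replication))      # 1.0: rank order preserved perfectly
print(concordance(original, replication))  # 0.4: agreement in magnitude is poor
```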
more broadly, i'm not sure the right question is whether the rank-ordering of effect sizes in the original studies is replicable, which is what the correlation tests. i'll have to think more about this. thanks for bringing it up!

Michael Frank

Nice post! Is the egg metaphor meant to signify that we can't make an omelette without breaking some? I estimated my class's reproducibility rate in a post a little while ago (http://babieslearninglanguage.blogspot.com/2014/06/shifting-our-cultural-understanding-of.html). If the reproducibility rate signals not "things that are definitively false" but "things a smart, motivated grad student can't easily reproduce and build on" then I think we're in trouble, at least in some subfields...

Chris Crandall

Simine: You are quite correct, on all points. The effect sizes were indeed lower. But we're on the same page that a dichotomous measure isn't quite what we're seeking.

Effect sizes tend to shrink over the history of an effect, an observation that Jonathon Schooler has made (among others). (The "why" of this is still in dispute, see http://www.newyorker.com/magazine/2010/12/13/the-truth-wears-off).

One easy way to model this is with a regression model that includes the intercept and the beta weight. Both are likely to be significant, and both tell an interesting story about progress (or its lack) in science.
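For example (with made-up numbers), regressing replication effect sizes on original effect sizes yields both an intercept and a slope; a slope well below 1 and/or a negative intercept would each signal shrinkage:

```python
def ols(x, y):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope  # (intercept, slope)

original = [0.2, 0.4, 0.6, 0.8]         # hypothetical original effect sizes
replication = [0.05, 0.15, 0.25, 0.35]  # hypothetical replication effect sizes

intercept, slope = ols(original, replication)
print(f"replication = {intercept:.2f} + {slope:.2f} * original")
```

Here every replication effect is attenuated (slope = 0.5) and shifted downward (intercept = -0.05); both parameters tell part of the story.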

Sanjay Srivastava

Simine, super interesting post as always.

I think the reproducibility project will inform the discussion but I don't think it will be decisive, because the results of replications (both "successful" and "failed") always have multiple explanations. When replication results differ from original results, for example, that could be because of a variety of discrepancies of procedure. In fact, I think it could be interesting to meta-analyze the reproducibility dataset with moderator analyses for expert-coded variables like the perceived allegiances of the original and replicating authors to an effect, original and replicating experimenters' prior experience with the methods of the study, how thoroughly the original methods were described to the replicators (e.g. how much original authors were brought into auditing materials and procedures), etc.

And on the flip side, a "successful" replication can mean that errors in operationalizations (confounds etc.) are carried into the replication study. I do not think the "simpleminded" folks are only concerned with the replicability of empirical effects, they are also concerned with the validity of methods and with whether studies are designed to be good tests of theories -- and direct replication doesn't really address those things.

I thought Brent D had an interesting post along similar lines to yours a while back:

Brett Buttliere

A few small things (didn't closely read the comments):

a high-powered replication doesn't mean anything if the studies are not the same.

Mitchell has a point with his (Pearson's?) black swan. no matter how many other studies don't find a black swan, the original researchers did (unless we are willing to say and prove that it was not actually a black swan).

That is the real problem for me.



brett -
there are several excellent responses to jason mitchell's essay, including:

Aaron Weidman

Nice post Simine. Jake's comment above made me wonder about another method for assessing the success of a replication attempt--namely, set a 95% confidence interval around the original effect size, and see if the effect size from the replication attempt falls within that confidence interval. Then one would know whether the replication effect size could have plausibly been obtained using the same methods, sample, etc., as the original study. Is this method of evaluation ever used or considered in large-scale replication projects?



i like the spirit of your point - for a study to replicate, we don't need to find the exact same effect size. however, i wonder about computing a 95% confidence interval using the original study, because in a way that rewards underpowered studies (b/c they will have wider confidence intervals, and therefore it will be easier to conclude that a replication result is consistent with the original finding).
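to make that concrete, here's a sketch (the numbers are hypothetical, and the standard error formula for Cohen's d is the usual large-sample approximation): a small original study with d = 0.5 and 20 participants per group has a confidence interval so wide that even a replication d of 0.1 "succeeds" by this criterion.

```python
import math

def d_ci(d, n1, n2, z_crit=1.96):
    """Approximate 95% CI for Cohen's d in a two-group design."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z_crit * se, d + z_crit * se

# hypothetical original study: d = 0.5, n = 20 per group
lo, hi = d_ci(0.5, 20, 20)
print(f"95% CI for the original d: ({lo:.2f}, {hi:.2f})")
print("replication d = 0.1 inside the CI?", lo < 0.1 < hi)
```

the interval spans roughly -0.13 to 1.13, so almost any plausible replication result would count as "consistent" with the original.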

deciding whether a replication 'succeeded' is a really tough problem (and of course it's not a yes/no question, but people want a yes/no answer).

thanks for sharing your thoughts!



those are all great points. i agree that replication is not the answer to everything. both 'successful' and 'failed' replications can be hard to interpret, and there are other issues that replications can't address.

i also think it would be really interesting to identify moderators in the reproducibility project (assuming they do enough replications to have decent power).

all that said, i think replications are one of the best tools we have to provide an empirical test of the state of the field. it's funny to find myself defending direct replications because i've never conducted one and have some important reservations, but i also worry that it's too easy to write them off. i kind of think replications are like what churchill said about democracy - the worst form of self-correction [government] except all the other forms that have been tried.

and thanks for the pointer to brent's excellent blog post!


Aaron Weidman

Simine--just one other thought. Yes, a small n study has a wide confidence interval, and therefore is "easier" to replicate, in that more replication effect sizes will fall within the interval. But then, if one views both studies meta-analytically, the (larger n) replication study is weighted more heavily, pulling the overall estimate away from the original small n study. The original study becomes a single cloud in a blue sky--so to speak ;)
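A quick sketch of that weighting (hypothetical numbers; this is standard fixed-effect, inverse-variance pooling): the larger replication dominates the pooled estimate.

```python
def pooled_estimate(effects, ses):
    """Fixed-effect meta-analysis: weight each study by the inverse of its variance."""
    weights = [1.0 / se ** 2 for se in ses]
    return sum(w * d for w, d in zip(weights, effects)) / sum(weights)

# hypothetical: small original study vs. large replication
effects = [0.8, 0.1]  # original d, replication d
ses = [0.32, 0.10]    # the replication's larger n gives it a smaller standard error

print(f"pooled d = {pooled_estimate(effects, ses):.2f}")
```

The pooled estimate lands around 0.16, far closer to the replication than to the original, which is the "single cloud in a blue sky" effect.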

So, in the end, we still arrive at a conclusion closer to the truth, without having to resort to the language of "failed" and "successful" replications, which invariably causes simpleminded and muddleheaded people to get upset with one another.

Greg Francis

As noted above, a statistical analysis cannot identify a "false positive"; it can only suggest that something is not quite right between the reported analyses and the theoretical claims of a paper. I've used the test for excess success to explore this issue across papers in the journal Psychological Science. Details are at


The bad news is that even though the excess success test is quite conservative, 82% of the papers cannot pass it (according to standard criteria). This tells us that even if the effects are real and properly estimated, most of these studies should not fully replicate. I think panic is appropriate.

Lee Jussim

Most of the comments seem to focus on the statistics and numbers. What counts as a replication? How many false positives? How big are the confidence intervals?

That is all fine, we need to get the technicals right. But that is not what I see as Simine's main point. Instead, I see this as her main point:

"5. what empirical evidence would convince the simpleminded? what empirical evidence would convince the muddleheaded?

this is, to me, the fundamental question that each side has to answer. we should make our beliefs/intuitions falsifiable, by making concrete empirical predictions about what the world would look like if our intuitions are correct."

I concur with Meehl's ambivalence about the simpleminded and muddleheaded, but also with Meehl's point in the notes, that it is easier to move the simpleminded than the muddleheaded.

I suppose I lean towards the simpleminded, but I can be convinced by data. Despite arguing for decades that self-fulfilling prophecies are generally weak, fragile, and fleeting, I have also reported some of the most powerful SFP effects ever found by any social psychologist (Jussim et al, 1996, Adv. Exp. Social Psych, and reviewed them -- along with the abundant evidence of accuracy and weak stereotype and expectancy effects -- in my recent book -- Jussim (2012)).

The simpleminded can be persuaded by data, at least usually.

Which raises the question, what can persuade the muddleheaded? Anyone who identifies themselves as primarily interested in dramatic, world-changing findings, with BIG ideas, please let us know what could convince you, e.g., that:

1. Stereotypes are largely accurate and rational
2. Situations are no more powerful than persons
3. Unconscious processes are not more powerful than conscious processes
4. Stereotype threat findings explain only a small fraction, about 20%, of the Black/White difference in achievement test scores
5. Self-fulfilling prophecies are weak, fragile, and fleeting
6. Implicit prejudice has almost no relation to discrimination

I could go on, and I am sure there are lots of other examples.

My implicit assumption, and, I think, Simine's: That science requires the testing of disconfirmable hypotheses. HOWEVER, I am not at all sure there is consensus on this. Indeed, part of the fundamental divide may be that the simpleminded care about falsifiability, whereas the muddleheaded do not.

Are there really scientists who disdain falsifiability? I think there are, and have a blog entry on Brent Roberts' PIG-IE site to this effect:


So, please, I ask, even beg you to explicitly articulate: Do you believe social psychological claims can be disconfirmed by data?

If not, then what is your view of how science should be conducted?

If so, and you buy most of the longstanding claims of social psychology, please explicitly articulate what could lead you to change your mind?

Lee Jussim

Boy, I can't help but notice that asking readers to identify what could disconfirm their scientific beliefs was, apparently, a real conversation stopper.

Or, put differently, when Simine asked folks to do this, there were many replies, but none addressed this issue. When I pointed out that doing this was the main point of her post, it evoked a deafening silence.

Who are we?

I will start. I believe that the evidence -- scores of studies now -- shows that stereotype accuracy is one of the largest and most replicable effects in all of social psychology. You can go to my blog entries for references:


What could change my mind? Well, lots of things, but here is just one. If the next 20 studies of stereotype accuracy, addressing a wide range of stereotypes, all show that people's beliefs about groups correlate below r=.20 with groups' characteristics, I will change my mind.
(Instead, the current state of play is quite the opposite: only about 5% of all social psych studies produce effect sizes > r = .50, whereas half or more of all stereotype accuracy studies do.)

You believe in the "power of the situation" and that situations are far more powerful than individual differences? Fine. What could change your mind?

You believe that "but for stereotype threat, Black and White standardized test scores would be equal"? Fine, what could change your mind?

You believe that implicit prejudice is a more powerful predictor of behavior than explicit prejudice? What could change your mind?

You believe the conservatives are more anti-science than liberals? What could change your mind?

Absent answers to these sorts of questions, our field is at serious risk of being plausibly viewed not as a science, but as little more than a rhetorical forum for advocating our a priori beliefs and values under the guise of science.
