i have been sitting on this paul meehl gem for a few months now, ruminating on how it relates to our current situation:
"The two opposite errors to which psychologists, especially clinical psychologists, are tempted are the simpleminded and the muddleheaded (as Whitehead and Russell labeled each other in a famous dinner exchange). The simpleminded, due to their hypercriticality and superscientism and their acceptance of a variant of operationalist philosophy of science (that hardly any historian or logician of science has defended unqualifiedly for at least 30 years), tend to have a difficult time discovering anything interesting or exciting about the mind. The muddleheads, per contra, have a tendency to discover a lot of interesting things that are not so. I have never been able, despite my Minnesota “simpleminded” training, to decide between these two evils. At times it has seemed to me that the best solution is sort of like the political one, namely, we wait for clever muddleheads to cook up interesting possibilities and the task of the simpleminded contingent is then to sift the wheat from the chaff. But I do not really believe this, partly because I have become increasingly convinced that you cannot do the right kind of research on an interesting theoretical position if you are too simpleminded to enter into its frame of reference fully (see, e.g., Meehl, 1970b). One hardly knows how to choose between these two methodological sins." *here is what i have come up with (i am trying to fit what probably belongs in several separate blog posts into one because i think the points are interconnected. bear with me.)
1. another way to describe these groups is that the simpleminded are terrified of type I error while the muddleheaded are terrified of type II error. pick your paranoia.

2. this seems like a pretty accurate (if caricatured) way to describe the two extremes of the scientific integrity debate. throughout the rest of the post, i am going to use these labels. i apologize to alexa tullett for speaking in dichotomies, and to all simpleminded and muddleheaded people out there for calling you (us) names.

3. one thing that has struck me as i've observed discussions is that people at both extremes have incredibly strong intuitions about which is the bigger problem. the simpleminded are absolutely convinced that the literature is littered with false positives and that is the major threat to our field. the muddleheaded are equally certain that the proposed reforms would stifle scientific discovery and lead us to abandon ideas that are in fact correct and would significantly improve our understanding of human behavior. what is really striking is that not only are both sides full of conviction, they are so sure that they can't even believe someone would sincerely have the opposite intuition. i have seen people on both sides accuse those with opposite intuitions of being disingenuous - their own perception of reality seems so blatantly obvious that they think anyone who denies having that intuition is putting them on.**

i'll admit, i am pretty far on the simpleminded side of the continuum, and i have sometimes caught myself completely flabbergasted at the vast distance between my own intuitions and others'.

this is a terrible situation to be in. when we don't even believe that our colleagues sincerely hold the intuitions they profess to hold, we are at an impasse. when both sides believe that their (conflicting) assumptions are absurdly self-evident, attempts at reasoning with each other are futile.

4. so what can we do?

first, i think we need to give each other the benefit of the doubt that, at the very least, we all actually believe what we claim to believe, and that these beliefs do in fact feel intuitive to the belief holders. we should stop questioning each other's sincerity.

second, we need to find a way out that does not rely on intuitions. happily, we are in the business of not relying on intuitions. we need empirical evidence. this leads us to the question:

5. what empirical evidence would convince the simpleminded? what empirical evidence would convince the muddleheaded?

this is, to me, the fundamental question that each side has to answer. we should make our beliefs/intuitions falsifiable, by making concrete empirical predictions about what the world would look like if our intuitions are correct.

let me start with the simpleminded, since i feel more comfortable speaking for that side of the continuum.

for me, a good test of the 'false-positives-are-everywhere' paranoia that characterizes the simpleminded is brian nosek and the open science framework's reproducibility project. this project aims to conduct close replications of published studies in some of the top journals in psychology. with a large enough sample size, this could give us an estimate of the proportion of published studies that are replicable. assuming this estimate has a decent amount of validity, it can provide a test of the simpleminded world view. what results would be consistent with simpleminded intuitions? it probably depends on which simpleminded person you're talking to.
this simpleminded person would probably guess that the replication rate will be below 60%.*** call this my preregistration. if that is wrong, i will admit that the situation is not as dire as i thought, that my intuitions were wrong. of course even a 40% (or, for that matter, 20%) false positive rate should be alarming and would, in my view, justify some reforms. but some of the more drastic reforms proposed by the simpleminded rely on the assumption that the problem is Very Serious. so if the replication rate is much higher than 60%, we have to admit we were wrong, at least in the degree of our panic.

so, now i come to a question i can't answer: what would the muddleheaded think is a fair test of their intuitions, and what evidence would cause them to reconsider those assumptions?

i thought about writing an entirely separate blog post about this question, because it is one i've been wondering about for a while, but instead i will just add a heading and keep going:

bricks in the wall vs. the wall itself

a common view i've heard in response to single instances of rigorous, conclusive failed replications is that the particular study that failed to replicate was just one brick in the wall, and the wall (i.e., the evidence for the broader phenomenon) is made up of hundreds of bricks (i.e., studies that show the predicted effect).

that seems fair. pulling out a few bricks here and there does not do much to undermine the integrity of the entire wall, if it is made up of lots of bricks. but what if we sample 5% of all the bricks at random, and we find that most of them are faulty? then wouldn't we worry about the wall? it seems harsh to say that you can't test the wall by testing its bricks - there is no wall other than the bricks that make it up.**** so it seems like replication attempts - especially systematic ones like the reproducibility project - would be a good source of evidence about the soundness of the wall. i would be curious to know what replication rates the muddleheaded would consider consistent with their intuitions, and what results might lead them to rethink those intuitions.

another approach is to look at statistical summaries of the literature that give us a clue about the degree of bias (e.g., the p-curve). i wonder whether the muddleheaded would consider these tools appropriate tests of their intuitions about the (non)-prevalence of false positives.
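to make the brick-sampling argument above concrete, here is a minimal sketch. every number in it (the size of the wall, the 5% sample, the number of faulty bricks) is made up; the point is just that a modest random sample already tells you a lot about the wall.

```python
# a minimal sketch of the brick-sampling logic, with made-up numbers:
# if we check a random 5% of a 1,000-brick wall (n = 50) and 35 of those
# bricks are faulty, how uncertain is our estimate of the wall's faulty rate?
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(35, 50)
print(f"estimated faulty rate: 0.70, 95% CI [{lo:.2f}, {hi:.2f}]")
# roughly [0.56, 0.81]: even a 5% sample is enough to say the wall is in trouble
```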
of course, the intuition that false positives are not a huge problem is only one part of the muddleheaded worldview. another important part is the belief that the proposed reforms would stifle discovery (i.e., increase type II error). it would be important to do an empirical test of the severity of this problem under various practices/policies as well. how would we go about doing that? i don't know.

conclusion: because people at both ends of the continuum have such strong intuitions, i am not sure that talking at each other is going to get us anywhere (but i am still here typing, so obviously i have not lost all hope). we need empirical tests of: a) the prevalence of false positives in the published literature (which can be achieved by things like large-scale, systematic replication projects and by statistical estimates of bias like the p-curve), and b) the consequences of proposed reforms not only for false positive rates, but also for type II error ('misses').
* i left out the next part of the paragraph, which is this: "One thing I can say in favor of the simpleminded is that I have seen several cases of it get cured, by personal experience of psychoanalysis or by exposure to sufficiently bright, rational, and articulate intellects of opposite persuasion, or by just getting older, securer, and more “relaxed.” Simplemindedness is (not being correlated with stupidity among academicians) a curable condition. But I have, alas, never seen a muddlehead get well. I am inclined to believe that this condition has a hopeless prognosis." i left it out because i worry that it is unnecessarily inflammatory, but it's also an observation from one of our field's great methodologists, so i am keeping it in the footnotes, as i hear humanists like to do with things they know they should cut but can't bring themselves to.

** you know, like what abe thought god was doing.

*** full disclosure: i was recently at a conference where some preliminary results from the reproducibility project were shared. i did not attend the presentation (in my defense, it was early in the morning and the chickens kept me up the night before*****). i did, however, hear about it from others. i can't remember what they told me, but it's entirely possible that i unconsciously absorbed this information and it is influencing my estimate. in fact, that seems pretty likely. especially if i end up being right.

**** i am leaving aside the mortar for now.

***** in their defense, their eggs were delicious.

photo credits: rich lucas, me, & the chickens.
The preliminary results shared at APS, based on something like 30 of the eventual 180 or so studies to be run, were that -- depending on how you count what is considered a successful replication -- the replication rate is as low as 1/3 or as high as 2/3. The former estimate is based on the simple "Is the replication significant?" rule, while the latter is based on pooling the replication estimate together with the original estimate and then testing it all against 0. Which seems pretty questionable to me since we know the original estimates are likely inflated for various reasons. Anyway, it will be interesting to see how these estimates change (or not) as more studies come in and as we try different ways of evaluating the results statistically, such as Simonsohn's procedure and the Verhagen/Wagenmakers procedure.
Posted by: Jake Westfall | 29 July 2014 at 02:57 AM
I was present at the APS presentation. Simine, I wish to encourage you to re-specify your hypothesis. It was the case that about one-third of the replications crossed the p<.05 threshold.
But that's the wrong way to present the data, and the wrong way to draw conclusions. It is dichotomizing the data--a practice that cannot be supported as an analytic strategy.
By contrast, we did learn that, with the nearly 30 studies, the effect sizes (original) correlated with the effect sizes (replication) at r=.60. I ask you, with N=30, how high could that correlation be expected to be? I have asked a dozen or more social psychologists about this, and NOT ONE of them has predicted an effect greater than .60 (and only one predicted as high as .60).
And so, instead of using dichotomous data, which are certain to underestimate the replicability (and certainly under-describe it), I encourage you to make an "effect-size" estimate, rather than a wins/losses estimate.
Posted by: Chris Crandall | 29 July 2014 at 03:32 AM
Oh, and by the way, p-curve analyses do NOT generate an estimate of false positives. Even when p-curve analysis generates an "alarm," it is only indicative of researchers stopping their research protocols when "significance" has been found, or of p-values being slightly nudged over an arbitrary line through dropping participants, ANCOVA, and the like. This is modestly correlated with false positives, but it's not the false positive itself. The p-curve analysis will almost certainly, if treated as evidence of false positives, over-estimate the prevalence of Type I error.
Posted by: Chris Crandall | 29 July 2014 at 03:38 AM
hi chris,
that's an interesting way to think about it! i would argue that, with an N of 30, the correlation could easily be .60 or even .90, because that is a very small N so it's easy to get fluky (high or low) correlations. the confidence interval around that .60 is quite big.
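for concreteness, here is a minimal sketch of how wide that interval is, using the standard Fisher z approximation; the only inputs taken from the thread are r = .60 and N = 30, nothing here comes from the actual reproducibility project data.

```python
# a minimal sketch: the 95% confidence interval around an observed r = .60
# with N = 30, via the Fisher z transformation.
import math

r, n = 0.60, 30
z = math.atanh(r)            # Fisher z transform of r
se = 1 / math.sqrt(n - 3)    # standard error of z
lo = math.tanh(z - 1.96 * se)
hi = math.tanh(z + 1.96 * se)
print(f"95% CI for r = {r} with N = {n}: [{lo:.2f}, {hi:.2f}]")
# roughly [0.31, 0.79] -- consistent with anything from a modest to a very strong correlation
```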
still, i take your point, and i agree that the pass/fail mentality is not a good one. thanks for pointing that out.
however, i have a bigger concern about the approach you propose, namely, that the original studies and the replication studies could have effect sizes that correlate perfectly, but the original studies could have effect sizes in the d = .5 to d = 2.0 range, while the replication studies could all have much smaller effects (i.e., correlations are not sensitive to differences in level/means/magnitude). so maybe a better statistic to use would be some kind of intraclass correlation that also takes into account differences in magnitude.
more broadly, i'm not sure the right question is whether the rank-ordering of effect sizes in the original studies is replicable, which is what the correlation tests. i'll have to think more about this. thanks for bringing it up!
-simine
Posted by: simine | 29 July 2014 at 03:38 AM
Nice post! Is the egg metaphor meant to signify that we can't make an omelette without breaking some? I estimated my class's reproducibility rate in a post a little while ago (http://babieslearninglanguage.blogspot.com/2014/06/shifting-our-cultural-understanding-of.html). If the reproducibility rate signals not "things that are definitively false" but "things a smart, motivated grad student can't easily reproduce and build on" then I think we're in trouble, at least in some subfields...
Posted by: Michael Frank | 29 July 2014 at 04:00 AM
Simine: You are quite correct, on all points. The effect sizes were indeed lower. But we're on the same page that dichotomous isn't quite what we're seeking.
Effect sizes tend to shrink over the history of an effect, an observation that Jonathon Schooler has made (among others). (The "why" of this is still in dispute, see http://www.newyorker.com/magazine/2010/12/13/the-truth-wears-off).
One easy way to model this is with a regression model that includes the intercept and the beta weight. Both are likely to be significant, and both tell an interesting story about progress (or its lack) in science.
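A minimal sketch of that regression, on made-up effect sizes rather than the actual reproducibility-project estimates:

```python
# a minimal sketch of regressing replication effect sizes on original effect
# sizes, using simulated (made-up) numbers.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
original = rng.uniform(0.2, 0.8, size=30)                    # hypothetical original effect sizes (r)
replication = 0.6 * original + rng.normal(0, 0.15, size=30)  # hypothetical shrunken replication effects

fit = linregress(original, replication)
print(f"intercept = {fit.intercept:.2f}, slope = {fit.slope:.2f}, r = {fit.rvalue:.2f}")
# an intercept near zero with a slope well below 1 tells a "shrinkage" story:
# replication effects track the originals but come out systematically smaller.
```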
Posted by: Chris Crandall | 29 July 2014 at 04:15 AM
Simine, super interesting post as always.
I think the reproducibility project will inform the discussion but I don't think it will be decisive, because the results of replications (both "successful" and "failed") always have multiple explanations. When replication results differ from original results, for example, that could be because of a variety of discrepancies of procedure. In fact, I think it could be interesting to meta-analyze the reproducibility dataset with moderator analyses for expert-coded variables like the perceived allegiances of the original and replicating authors to an effect, original and replicating experimenters' prior experience with the methods of the study, how thoroughly the original methods were described to the replicators (e.g. how much original authors were brought into auditing materials and procedures), etc.
And on the flip side, a "successful" replication can mean that errors in operationalizations (confounds etc.) are carried into the replication study. I do not think the "simpleminded" folks are only concerned with the replicability of empirical effects, they are also concerned with the validity of methods and with whether studies are designed to be good tests of theories -- and direct replication doesn't really address those things.
I thought Brent D had an interesting post along similar lines to yours a while back:
http://traitstate.wordpress.com/2012/10/15/two-types-of-researchers/
Posted by: Sanjay Srivastava | 29 July 2014 at 06:30 AM
A few small things (didn't closely read the comments):
a high-powered replication doesn't mean anything if the studies are not the same.
Mitchell has a point with his (Pearson's?) black swan. no matter how many other studies don't find a black swan, the original researcher did (unless we are willing to say and prove that it was not actually a black swan).
That is the real problem for me.
Brett
Posted by: Brett Buttliere | 29 July 2014 at 07:42 AM
brett -
there are several excellent responses to jason mitchell's essay, including:
http://filedrawer.wordpress.com/2014/07/07/jason-mitchells-essay/
and
http://hardsci.wordpress.com/2014/07/07/failed-experiments-do-not-always-fail-toward-the-null/
and
http://scatter.wordpress.com/2014/07/09/the-bigfoot-black-swan-continuum-of-behavioral-science/
Posted by: simine | 29 July 2014 at 07:59 AM
Nice post Simine. Jake's comment above made me wonder about another method for assessing the success of a replication attempt--namely, set a 95% confidence interval around the original effect size, and see if the effect size from the replication attempt falls within that confidence interval. Then one would know whether the replication effect size could have plausibly been obtained using the same methods, sample, etc., as the original study. Is this method of evaluation ever used or considered in large-scale replication projects?
Posted by: Aaron Weidman | 29 July 2014 at 01:14 PM
aaron,
i like the spirit of your point - for a study to replicate, we don't need to find the exact same effect size. however, i wonder about computing a 95% confidence interval using the original study, because in a way that rewards underpowered studies (b/c they will have wider confidence intervals, and therefore it will be easier to conclude that a replication result is consistent with the original finding).
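to make that worry concrete, here is a minimal sketch; the effect size (r = .30) and the two sample sizes are made up, the point is only how much the interval's width depends on the original study's N.

```python
# a minimal sketch: the 95% CI around the same original effect (r = .30) at
# two hypothetical sample sizes. the smaller study's interval is far wider,
# so more replication results would "pass" its test.
import math

def ci_for_r(r, n, z_crit=1.96):
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

for n in (20, 200):
    lo, hi = ci_for_r(0.30, n)
    print(f"original r = .30, N = {n:>3}: 95% CI = [{lo:+.2f}, {hi:+.2f}]")
# N = 20 gives roughly [-0.16, +0.66]; N = 200 gives roughly [+0.17, +0.42]
```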
deciding whether a replication 'succeeded' is a really tough problem (and of course it's not a yes/no question, but people want a yes/no answer).
thanks for sharing your thoughts!
-simine
Posted by: simine | 29 July 2014 at 01:19 PM
sanjay-
those are all great points. i agree that replication is not the answer to everything. both 'successful' and 'failed' replications can be hard to interpret, and there are other issues that replications can't address.
i also think it would be really interesting to identify moderators in the reproducibility project (assuming they do enough replications to have decent power).
all that said, i think replications are one of the best tools we have to provide an empirical test of the state of the field. it's funny to find myself defending direct replications because i've never conducted one and have some important reservations, but i also worry that it's too easy to write them off. i kind of think replications are like what churchill said about democracy - the worst form of self-correction [government] except all the other forms that have been tried.
and thanks for the pointer to brent's excellent blog post!
-simine
Posted by: simine | 29 July 2014 at 01:27 PM
Simine--just one other thought. Yes, a small n study has a wide confidence interval, and therefore is "easier" to replicate, in that more replication effect sizes will fall within the interval. But then, if one views both studies meta-analytically, the (larger n) replication study is weighted more heavily, pulling the overall estimate away from the original small n study. The original study becomes a single cloud in a blue sky--so to speak ;)
So, in the end, we still arrive at a more true conclusion, without having to resort to language of "failed" and "successful" replications, which invariably cause simpleminded and muddleheaded people to get upset at one another.
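A minimal sketch of that meta-analytic weighting, with made-up numbers rather than any actual studies:

```python
# a small original study (r = .50, N = 25) pooled with a larger replication
# (r = .10, N = 250) via fixed-effect, inverse-variance weighting in Fisher z.
import math

studies = [(0.50, 25), (0.10, 250)]   # hypothetical (r, N) pairs
zs = [math.atanh(r) for r, _ in studies]
ws = [n - 3 for _, n in studies]      # inverse variance of Fisher z is n - 3
z_pooled = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
print(f"pooled r = {math.tanh(z_pooled):.2f}")
# about .14 -- pulled strongly toward the big replication, as described above
```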
Posted by: Aaron Weidman | 30 July 2014 at 05:09 AM
As noted above, a statistical analysis cannot identify a "false positive", but only suggest that something is not quite right between the reported analyses and theoretical claims of a paper. I've used the test for excess success to explore this issue across papers in the journal Psychological Science. Details are at
http://link.springer.com/article/10.3758%2Fs13423-014-0601-x
The bad news is that even though the excess success test is quite conservative, 82% of the papers cannot pass it (according to standard criteria). This is telling us that even if the effects are real and properly estimated by the studies, then most studies should not fully replicate. I think panic is appropriate.
Posted by: Greg Francis | 30 July 2014 at 07:20 AM
Most of the comments seem to focus on the statistics and numbers. What counts as a replication? How many false positives? How big are the confidence intervals?
That is all fine, we need to get the technicals right. But that is not what I see as Simine's main point. Instead, I see this as her main point:
"5. what empirical evidence would convince the simpleminded? what empirical evidence would convince the muddleheaded?
this is, to me, the fundamental question that each side has to answer. we should make our beliefs/intuitions falsifiable, by making concrete empirical predictions about what the world would look like if our intuitions are correct."
I concur with Meehl's ambivalence about the simpleminded and muddleheaded, but also with Meehl's point in the notes, that it is easier to move the simpleminded than the muddleheaded.
I suppose I lean towards the simpleminded, but I can be convinced by data. Despite arguing for decades that self-fulfilling prophecies are generally weak, fragile, and fleeting, I have also reported some of the most powerful SFP effects ever found by any social psychologist (Jussim et al, 1996, Adv. Exp. Social Psych, and reviewed them -- along with the abundant evidence of accuracy and weak stereotype and expectancy effects -- in my recent book -- Jussim (2012)).
The simpleminded can be persuaded by data, at least usually.
Which raises the question, what can persuade the muddleheaded? Anyone who identifies themselves as primarily interested in dramatic, world-changing findings, with BIG ideas, please let us know what could convince you, e.g., that:
1. Stereotypes are largely accurate and rational
2. That situations are no more powerful than persons
3. That unconscious processes are not more powerful than conscious processes
4. That stereotype threat findings explain a small fraction, about 20%, of the Black/White difference in achievement test scores
5. Self-fulfilling prophecies are weak, fragile, and fleeting
6. That implicit prejudice has almost no relation to discrimination
I could go on, and I am sure there are lots of other examples.
My implicit assumption, and, I think, Simine's: That science requires the testing of disconfirmable hypotheses. HOWEVER, I am not at all sure there is consensus on this. Indeed, part of the fundamental divide may be that the simpleminded care about falsifiability, whereas the muddleheaded do not.
Are there really scientists who disdain falsifiability? I think there are, and have a blog entry on Brent Roberts' PIG-IE site to this effect:
http://pigee.wordpress.com/2014/07/09/is-it-offensive-to-declare-a-social-psychological-claim-or-conclusion-wrong/
So, please, I ask, even beg you to explicitly articulate: Do you believe social psychological claims can be disconfirmed by data?
If not, then what is your view of how science should be conducted?
If so, and you buy most of the longstanding claims of social psychology, please explicitly articulate what could lead you to change your mind?
Posted by: Lee Jussim | 31 July 2014 at 07:38 AM
Boy, I can't help but notice that asking readers to identify what could disconfirm their scientific beliefs was, apparently, a real conversation stopper.
Or, put differently, when Simine asked folks to do this, there were many replies, but none addressed this issue. When I pointed out that doing this was the main point of her post, it evoked a deafening silence.
Who are we?
I will start. I believe that the evidence -- scores of studies now -- shows that stereotype accuracy is one of the largest and most replicable effects in all of social psychology. You can go to my blog entries for references:
http://www.psychologytoday.com/blog/rabble-rouser
What could change my mind? Well, lots of things, but here is just one. If the next 20 studies of stereotype accuracy, addressing a wide range of stereotypes, all show that people's beliefs about groups correlate below r=.20 with groups' characteristics, I will change my mind.
(instead, the current state of play is quite the opposite; only about 5% of all social psych studies produce effect sizes > r=.50, whereas half or more of all stereotype accuracy studies produce such effects).
You believe in the "power of the situation" and that
situations are far more powerful than individual differences? Fine. What could change your mind?
You believe that "but for stereotype threat, Black and White
standardized test scores would be equal"? Fine, what could change your mind.
You believe that implicit prejudice is a more powerful predictor of behavior than explicit prejudice? What could change your mind?
You believe the conservatives are more anti-science than liberals? What could change your mind?
Absent answers to these sorts of questions, our field is at serious risk of being plausibly viewed, not as a science, but as little more than a rhetorical forum for advocating for our a priori beliefs and values, under the guise of science.
Posted by: Lee Jussim | 09 August 2014 at 03:44 AM