With permission, an excerpt from John Doris's forthcoming book. I wanted to post this because I think it is an excellent summary of the current crisis in social/personality psychology from a well-informed 'outsider'. This will be review for many psychologists, but should be useful for new researchers, and for those outside of psychology who want some background and analysis from someone who is not in the trenches but has read and thought a lot about this literature and these issues.
Art by Melissa Dominiak
Doris, J. M. 2015. Talking to Our Selves: Reflection, Ignorance, and Agency. Oxford: Oxford University Press.
[…] At this writing, social psychology is being shaken by charges that many published findings, including numerous iconic findings, do not replicate when tested by independent investigators. Cyberspace is thick with skirmishes between Replicators, who broadcast the failed replications, and Finders, who insist that their findings are real. Viewed from a safe distance, it’s all good fun, the sort of academic kerfuffle that makes for diverting reads in those corners of the media where academic kerfuffles get covered. Unfortunately, I’m not at safe distance. Here, as elsewhere, I’ve repurposed psychological findings in philosophical argumentation. Awkward for me, if the findings are false.
Popular reports on RepliGate link the controversy to notorious cases of scientific misconduct (Bartlett 2013, Yong 2012). But the big problem’s not a few cheaters, or even a few more than a few cheaters; if the Replicators are right, the infirmity results from standard practice in experimental psychology. Consider the disciplinary norm for “statistical significance,” a “p-value” of .05 or less. When a finding is reported with this value, it means that if the experiment were run 100 times and the phenomenon were not real -- i.e., if the null hypothesis is true -- the finding would appear “by chance” five times or fewer. The .05 is supposed to warrant confidence that the effect in question is real, but it also entails that “about 5% of the time when researchers test a null hypothesis that is true (i.e., when they look for a difference that does not exist) they will end up with a statistically significant difference” (Pashler and Harris 2012: 531).
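The 5% figure can be made vivid by simulation. A minimal sketch (my illustration, not from the book): draw both experimental groups from the same distribution, so the null hypothesis is true by construction, and count how often a two-sided test still crosses the p < .05 threshold.

```python
import random

random.seed(42)

def false_positive_rate(n_experiments=10_000, n_per_group=30):
    """Both groups come from the SAME normal distribution (null is true),
    yet a two-sided z-test on the difference of means crosses the p < .05
    threshold (|z| > 1.96) in about 5% of experiments."""
    hits = 0
    for _ in range(n_experiments):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        se = (2 / n_per_group) ** 0.5  # known sigma = 1 in both groups
        if abs(diff / se) > 1.96:
            hits += 1
    return hits / n_experiments

print(false_positive_rate())  # close to 0.05
```

Every "hit" here is a publishable-looking difference between two groups of identical noise.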
This depiction perhaps underestimates the prevalence of false positives in the published literature. Some grounds for suspecting this are dauntingly technical, and involve complex models requiring contestable assumptions (Pashler and Harris 2012: 531-2; cf. Ioannidis 2005). Others -- the most prominent being the Publication Bias and File Drawer Effect -- are more easily grasped. The Publication Bias is a pervasive tendency for psychology journals to publish only effects that are novel and statistically significant, with the result that most investigators, unless they strike gold every time, are likely to have a File Drawer full of failures to find significant effects, very often in the immediate vicinity of the effects their publications indicate exist. In the worst case, assuming a significance threshold of .05, a researcher who only investigated phenomena that did not exist could have 95 failed experiments for every 5 successful ones they publish. If all researchers were worst case, every published finding could be a false positive capitalizing on chance, an alarming circumstance concealed in vast File Drawers of failure (Simonsohn 2012: 597).
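The worst-case arithmetic generalizes: how much of the published (significant) record is false depends on what share of tested hypotheses are true and on statistical power. A sketch of the standard conditional-probability calculation (the particular numbers are illustrative, not the book's):

```python
def false_finding_share(prior_true, power, alpha=0.05):
    """Among significant (hence publishable) results, the fraction that
    are false positives: alpha*(1 - prior) false hits vs. power*prior
    true hits."""
    false_pos = alpha * (1 - prior_true)
    true_pos = power * prior_true
    return false_pos / (false_pos + true_pos)

# Doris's worst case: nobody studies a real effect, so every
# significant finding is a false positive.
print(false_finding_share(prior_true=0.0, power=0.5))  # 1.0

# A less bleak mix: 10% of tested hypotheses true, 50% power --
# nearly half the significant findings are still false.
print(round(false_finding_share(prior_true=0.1, power=0.5), 2))  # 0.47
```

The second case shows why the problem survives even without universal worst-case behavior: modest power plus a low base rate of true hypotheses is enough.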
It’s probably not exactly like this, but neither is it completely unlike this. Experimenters start with a protocol, and if it doesn’t produce the hoped-for result, they tweak, and re-tweak, the protocol in order to find their finding. Along the way, some failed experiments that ought to be treated as non-occurrences of an effect are instead treated as imperfectly designed pilot studies and interred in File Drawers. No surprise, then, if there are considerable numbers of unpublished failures haunting many a published success.
That much might be good faith science; there’s a fine line between refining a protocol and the not-quite-kosher-not-quite-misconduct “questionable research practice” of “selectively reporting” only findings congenial to one’s research program in the papers one submits for publication. But other tricks of the trade, informally known as “p-hacking,” are rather shadier. For example, there’s “optional stopping,” the practice of continually checking results and halting an experiment as soon as p < .05 is reached, which substantially increases the probability that significant findings are a function of chance; during data collection, the p-value fluctuates, and even where there is no effect it will sometimes reach .05. Such practices are likely common: in an anonymous survey of some 2000 psychologists, over 50% admitted selective reporting, and over 20% admitted optional stopping (John et al. 2012: Figure 1).
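Why optional stopping inflates the error rate can be seen directly by simulation (my own sketch, assuming a simple one-sample z-test; the parameters are illustrative): a researcher who peeks after every participant and stops at the first p < .05 ends up far above the nominal 5% rate even when no effect exists.

```python
import random

random.seed(1)

def optional_stopping_rate(n_runs=2000, min_n=10, max_n=100):
    """The null is true throughout (data are pure noise), but the p-value
    is checked after every participant from min_n onward, and collection
    halts at the first |z| > 1.96. Each early stop is a false positive."""
    hits = 0
    for _ in range(n_runs):
        total = 0.0
        for i in range(1, max_n + 1):
            total += random.gauss(0, 1)  # another participant's score
            if i >= min_n and abs(total / i ** 0.5) > 1.96:
                hits += 1  # "significant" -- stop and write it up
                break
    return hits / n_runs

print(optional_stopping_rate())  # well above the nominal 0.05
```

Because the running p-value fluctuates, giving it many chances to dip below .05 guarantees that it sometimes will, exactly as the text describes.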
All this considered, one might predict that many published findings will fail to replicate (capitalizing on chance is a chancy business). Unfortunately, it’s hard to know exactly what’s going on: the Publication Bias militates against the appearance of failed replications, but it also militates against the appearance of successful replications, because journals typically require novel findings (though failed replications may now be getting published more frequently; e.g., Pashler et al. 2013). The $64,000 question concerns what molders in all those File Drawers: how many interesting but non-significant failed replications, and how many significant but boring successful replications?
The Replicators suspect there’s a lot of failure moldering, and if that’s right, tribulation is inevitable, when the published work gets checked. In fact, on a website serving as a repository for replication attempts, psychfiledrawer.org, there seems to be a fair bit of just that. Among the hardest hit are some studies most congenial to the arguments I’m making here, those involving subliminal “priming” of behavior.  The classic is an experiment by Bargh and colleagues (1996: 236-7) that found exposure to words invoking elderly stereotypes (e.g., grey, wrinkle, Florida) resulted in healthy young people walking more slowly, without being able to attribute their performance to the “semantic prime.”
The Infirm Words Effect was much celebrated; I celebrated it myself (Doris 2009: 57; Merritt et al. 2010: 374), since I thought the existence of such quirky influences on behavior undermined philosophically standard accounts of agency. My merrymaking was truncated, however, for along came a failed replication by Doyen and associates (2012), joining another failure to replicate by Pashler and colleagues (2011) that appeared on psychfiledrawer.org. In response, Bargh (2012) noted two replications of Infirm Words appearing in a top journal (Cesario et al. 2006; Hull et al. 2002), together with two replications on science television (!). Moreover, the effect has appeared in a range of domains: exposure to material associated with stereotypes of the elderly may also slow decision making (Dijksterhuis et al. 2001, Kawakami et al. 2002), decrease reading speed (Loersch and Payne, forthcoming), and weaken memory performance (Dijksterhuis et al. 2000; Levy 1996).
My hunch is that the priming studies have themselves been victims of a prejudice, the Incredulity Bias, which presumes that if a study reports a surprising finding, there must be something fishy. On the other hand, there’s also a Surprising Effect Bias, where editors favor incredible findings, in hopes of garnering attention for their journal. (Less cynically: if scientific findings weren’t surprising, why would we need experiments and publications?) In a perfect world, unlikely findings would be both published and scrutinized -- and maybe that world’s not so far from the world we have. Still, the evidence appears to be badly mixed; can any conclusion – save that we’ve got a mess on our hands – be safely drawn?
Hard to say, partly because the feuding parties feud about what should be counted as successful replication. In a direct replication, the new investigator tries to exactly copy what the original investigator did. Variation in time, place, and resources means the copies won’t be identical (the Pashler et al. attempted replication of Infirm Words is listed as “fairly exact” on psychfiledrawer.org), but if they’re close enough, they may bear quite closely on the question of whether the original finding can be trusted. Conceptual replications, on the other hand, don’t exactly follow the original; they aim to extend the effect, by testing whether the process at issue obtains in other domains. Thus, the extension of Infirm Words from walking to reading might be counted as a conceptual replication, suggesting not only that the original finding was tracking something real, but also that the finding may generalize beyond the original domain.
For the academic psychologist, conceptual replications are more professionally advantageous, since they, unlike direct replications, meet the novelty standard for publication. The problem with conceptual replication is that it ain’t replication; if a study’s different enough from the original to count as novel, its finding could be real while the original finding is not (or vice versa). Moreover, failed conceptual replications may be even more likely to end in File Drawers than failed direct replications. If an investigator fails to conceptually replicate a study with which he is sympathetic, he may be tempted to treat the failure as an injudiciously large deviation from the original study instead of failure to replicate, meaning a (non) finding that ought to raise questions about the original study doesn’t get treated as doing so. At the same time, successful conceptual replications will frequently be deemed publication worthy, so the practice of conceptual replication may be biased in favor of validating previously published effects (Pashler and Harris 2012: 533).
This doesn’t mean conceptual replications say nothing about the reality and generality of an effect; when a cluster of conceptual replications accrues around a finding, it should increase confidence that there’s something real around which the cluster is accruing. Of course, if you’re in sympathy with the Replicators, you might have doubts about many studies comprising the cluster. But something like the Publication Bias is arguably present in all scientific journals, so to require dismissing all published research where the bias is present “leads,” as one psychologist (himself a Replicator) put it, “to the absurd conclusion that all published scientific knowledge should be ignored” (Simonsohn 2012: 598).
Maybe the absurd conclusion isn’t so absurd. In a paper entitled “Why Most Published Research Findings Are False,” Ioannidis (2005) deployed statistical techniques to argue just that: the majority of science is not to be believed. Certainly, the tsuris goes beyond psychology. Scientific medicine and genetics have both been unsettled by controversies akin to RepliGate (e.g., Begley and Ellis 2012), with the debate in genetics involving strikingly similar concerns about direct vs. conceptual replication (e.g., Kang et al. 2011, Munafò et al. 2009, Risch et al. 2009). Does this mean the findings of these fields should be ignored? Absolutely -- the minute we have something better. No doubt there’s some trouble in genetics. But this doesn’t mean we’re better off consulting soothsayers about heredity than we are consulting geneticists. Likewise, there’s some trouble in medical research (which has been in a bit of a rut since identifying the importance of good hygiene in controlling infectious disease). But that doesn’t mean we’re better off consulting crystal healers about cancer than we are consulting oncologists.
And so too, there’s some trouble in psychology. But that doesn’t mean we’re better off trusting “common sense” than we are trusting the best available systematic study. There’s good, bad, and indifferent in psychology, just as in all of science, and the existence of the bad and indifferent shouldn’t dissuade us from figuring out what the good is. Doubtless, numbers of scientific findings should be discarded, but if all scientific findings were cast aside, we’d have a lot bigger problems than working out the right account of agency.
When the dust of RepliGate has settled, some currently venerated findings may be less venerated, and some currently commonplace research practices may be less commonplace. But we’re talking about an extensive and variegated field, and it’s not easy to predict what the casualties will be, and what forms retrenchment will take.
(Norms requiring larger sample sizes would make a good start: with more observations, random fluctuations average out, shrinking the margin of error [Fraley and Vazire, forthcoming]). Priming effects range over perception, behavior, and motivation, and in each domain, effects vary with both individual differences and situational variation (Loersch and Payne 2011). With this much diversity, indiscriminate statements like “priming studies have been challenged” don’t mean very much. Difficulty with one priming study does not entail difficulty for all studies in its domain, and difficulty with priming studies in one domain does not entail difficulty for all priming studies. Still less does difficulty with priming studies entail that all experimental psychology is in the soup.
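The parenthetical point about sample size is just the familiar square-root law (my sketch, not the book's): the 95% margin of error for a sample mean shrinks like 1/sqrt(n), so quadrupling the sample halves the error.

```python
def margin_of_error(sigma, n, z=1.96):
    """95% margin of error for a sample mean: z * sigma / sqrt(n)."""
    return z * sigma / n ** 0.5

print(margin_of_error(1.0, 25))   # about 0.392
print(margin_of_error(1.0, 100))  # about 0.196 -- 4x the sample, half the error
```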
When facing conditions of scientific uncertainty, caution is in order. This is nothing new: scientific conditions are always conditions of uncertainty, and caution is always in order. (Where certainty obtains, what call is there for science?) There’s no science without a story, and if a large body of perplexing fact is to have theoretical utility, the theorist is forced to make editorial decisions that may require taking sides in scientific controversy. Conscientiously assuming this risk requires identifying the drift of a literature, or better yet, a range of literatures, where we find the strongest trends in results, even as we acknowledge the existence of results, and the possibility of future results, that don’t conform to these trends.
Reservations about the Publication Bias notwithstanding, patterns of conceptual replication are one important way to get the drift. Also important are meta-analyses, which average the results of a large number of studies, and allow more confident assessment of effects than is possible with a single study. For example, a meta-analysis of 167 priming studies (Cameron et al. 2012) found moderate relationships between priming manipulations and both behavioral measures and explicit attitude measures, while a meta-analysis of 65 priming effects in social cognition found effect sizes ranging from “small to medium” to “medium to large” (DeCoster and Claypool 2004: 9). These meta-analyses give reason to think priming effects are real, even as individual priming studies receive unflattering scrutiny.
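The averaging a meta-analysis performs can be sketched as inverse-variance weighting, the standard fixed-effect model (the effect sizes and variances below are made up for illustration, not taken from the cited meta-analyses):

```python
def fixed_effect_meta(effects, variances):
    """Fixed-effect meta-analysis: pool study effect sizes by weighting
    each by the inverse of its variance, so precise (large-n) studies
    count for more. Returns the pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = (1.0 / sum(weights)) ** 0.5
    return pooled, se

# Three hypothetical priming studies: similar effects, differing precision.
est, se = fixed_effect_meta([0.45, 0.30, 0.38], [0.04, 0.02, 0.01])
print(round(est, 3), round(se, 3))
```

The pooled standard error is smaller than any single study's, which is why an aggregate can inspire confidence even when individual studies do not.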
In any event, the broader lesson remains: don’t lean too heavily on any one study, or one series of studies, in theory construction. All the more so, where there have been difficulties with replication. […]
According to Cameron et al. (2012), “[p]riming involves presenting some stimulus with the aim of activating a particular idea, category, or feeling and then measuring the effects of the prime on performance in some other task.” Not much in a name, methinks: this description looks to hold for a variety of experimental manipulations.