Check yourself before you wreck yourself
by Michael Inzlicht
Things have gone sideways in social psychology.
And I am not sure this is necessarily a recent trend. The rot in my chosen field might have been festering for a very long time. I say this because not a day goes by when I do not hear about one or another of our cherished findings falling into disrepute. You have all heard about the depressingly high number of failed replications, be they one-offs or large-scale, coordinated multi-lab attempts. You have also, no doubt, heard about problems with publication bias, low statistical power, and the widespread use of questionable research practices. Taken individually, none of these problems might be terribly upsetting; taken together, they are an unmitigated disaster.
Perhaps more upsetting than the real problems facing my beloved social psychology is the level of denial I see. I have heard suggestions that things aren’t so bad. “Yes, there are a few effects that are probably not real,” say the defenders, “but we have made real and extraordinary breakthroughs and most of the stuff we study is rock solid.” I have also heard that every twenty to thirty years our field goes through its routine hand-wringing, but that it always passes. Many people are waiting for our current hand-wringing to pass. Finally, I have heard whispers and intimations that all of this is a way for personality psychologists, who are some of our most vocal critics, to get even with social psychologists, who seemed (for a time, at least) to have won the person versus situation debate.
This denial needs to stop.
This denial stops us from reflecting on our own work and asking what we have done to contribute to the problem. It stops us from trying to make things truly better. Now is the time to stop making excuses and to take a long look in the mirror.
Starting with me.
Have I contributed to our current crisis? It is painful to wonder if my research is part of the problem, but such introspection is vital to recognizing our troubles and, critically, rectifying our field.
Despite how painful it has been to watch my troubled field, I have been listening to our critics; I have been open to change. After years of hearing critiques of the field by my friend, colleague, and pub-adversary, Uli Schimmack, I have made active attempts to get better. For example, I have tried to collect larger samples; I have tried to use more statistically powerful designs (e.g., within-subject and repeated-measures); and I have tried to look at neighboring fields to get a richer understanding of my phenomena of interest. By not denying our troubles, I have changed how I conduct research.
So: Is there some way to quantify the quality of my work? Is there some way to see if changes to my research practice made a difference?
Looking to my h-index or number of citations is pointless, as these say more about the popularity of my work than about the quality of my science. Instead, it is better to look toward metrics that speak to the presence of bias and that can ascertain whether my work has informational value.
Recent years have seen a few such metrics emerge. The most well-known of these is the p-curve, developed by Uri Simonsohn, Leif Nelson, and Joe Simmons, which can assess the evidential value of a set of findings. There is also the Replicability Index (R-Index), developed by Uli Schimmack, which tracks the replicability of findings, with higher values suggesting a higher probability that replications will be successful. Uli has also developed a second metric, the Test for Insufficient Variance (TIVA), which can be used to detect the use of questionable research practices.
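For readers who want a concrete feel for what these metrics compute, here is a rough sketch of how TIVA and the R-Index can be approximated from a set of focal two-tailed p-values. This is illustrative code of my own, not the official p-checker implementation; the function names and the example p-values at the bottom are hypothetical.

```python
# Illustrative sketch only; assumes one focal two-tailed p-value per study.
import numpy as np
from scipy import stats

def tiva(p_values):
    """Test for Insufficient Variance (after Schimmack).

    Converts two-tailed p-values to z-scores. Under unbiased reporting the
    variance of these z-scores should be about 1; selective reporting and
    p-hacking push it below 1. Returns the observed variance and a
    left-tailed chi-square p-value (small p = suspiciously little variance).
    """
    z = stats.norm.isf(np.asarray(p_values) / 2)     # p -> z (two-tailed)
    k = len(z)
    var_z = np.var(z, ddof=1)                        # sample variance of z
    p_insufficient = stats.chi2.cdf(var_z * (k - 1), df=k - 1)
    return var_z, p_insufficient

def r_index(p_values, alpha=0.05):
    """Replicability Index (after Schimmack): median observed power minus
    the 'inflation' of the success rate above that power."""
    z = stats.norm.isf(np.asarray(p_values) / 2)
    z_crit = stats.norm.isf(alpha / 2)
    observed_power = stats.norm.sf(z_crit - z)       # power implied by each z
    median_power = np.median(observed_power)
    success_rate = np.mean(np.asarray(p_values) < alpha)
    return median_power - (success_rate - median_power)

# Hypothetical focal p-values, one per study:
ps = [0.049, 0.032, 0.041, 0.021, 0.038, 0.045, 0.011, 0.029, 0.048, 0.026]
print(tiva(ps))      # variance well below 1 -> too good to be true
print(r_index(ps))   # low value -> replications unlikely to succeed
```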
More than a little afraid of what I would discover, I analyzed my research output using all three of the above metrics. I decided to separately examine my first ten empirical papers and then my last ten empirical papers to see if my research output has increased in scientific quality. I suspected that it would. So, what did my analyses reveal?
First, a quick word or two on how I conducted my analyses. I decided to examine the one statistic per study that corresponded to the main hypothesis of interest. This was typically an interaction effect, but not always. If the main hypothesis corresponded to a crossover interaction, I would typically examine the two simple effects. Now, because each study typically contains multiple statistics, often testing various parts of the main hypothesis, it was not always easy to decide which statistic to include. I generally selected the first relevant statistic to appear in the paper and tried my best to be unbiased, but I would not be surprised if different people came up with somewhat different results. Finally, my life was made very easy by Felix Schönbrodt’s amazing online p-checker app. I cannot recommend this app enough, and I see it as an essential tool for reviewers and editors wanting to improve the quality of our science.
So how did I do?
The statistics for my first ten papers are a mixed bag, but mostly unwelcome (see Table above). You can also recreate the analyses here. First, the good news: My p-curve analysis revealed no evidence of intense p-hacking and strong evidence that my first ten papers contain evidential value. This is where the good news ends, however.
While a positive p-curve result indicates that there is informational value to my studies, this does not necessarily mean they are devoid of bias. One’s work can have evidential value, can be informative, and still contain biased results. Do the results of my first ten papers contain bias? Sadly: yes. The results of the TIVA indicate that my results are not variable enough, suggesting that questionable research practices might have influenced things. While many of these practices are the norm in psychology, we now know that they are suspect and can lead to faulty conclusions. The R-index for my first ten papers is also worryingly low, clocking in just under 40%[1]. Although there are no strict cutoffs, an R-index below 50% raises doubts about the empirical support for the underlying conclusions, and mine suggests that some of my results might not replicate. I am in good company here, unfortunately. Uli Schimmack has analyzed a representative set of findings in social psychology and found that they achieved an R-index of 43%. He also reanalyzed a set of psychological studies appearing in Science, finding that they achieved a painfully low R-index of 33.9%. These low values raise concerns about the empirical support for the underlying articles.
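To make the arithmetic behind that number concrete, here is a purely hypothetical illustration rather than the exact breakdown of my papers: if every focal test were reported as significant (a 100% success rate) while the test statistics implied a median observed power of only 70%, the inflation would be 100% - 70% = 30 percentage points, and the R-index would be 70% - 30% = 40%, which is roughly where my early papers land.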
Looking more closely at my first ten papers reveals that the state of my (initial) science leaves much to be desired. My median sample size was 54 participants, and that is before excluding the occasional outlier. The most common design in my first ten papers was a 2x2 between-subject design, which means that on average I had a measly 13 to 14 participants per cell. I’m embarrassed to admit it, but in one study I collected only 28 participants in a two-cell between-subject design…but get this: it also included an individual difference variable as a moderator. Ouff!
Two of the three analyses revealed that questionable research practices might have influenced my initial results. These research practices are the norm and all too common, yet we now know how damaging they can be to our cumulative science. We now know, for example, that too much flexibility in data collection, too many experimenter degrees of freedom, allows practically anything to appear significant, even when it is not. If I look back at my early days and try to honestly assess where I went wrong, I cannot come up with any easy answers. I was slavishly focused on statistical significance and probably ran the smallest number of participants that I could get away with. Perhaps I too selectively reported the results confirming my hypotheses, overlooking those that did not. I most definitely ran underpowered studies. But to be honest, I simply do not know the precise reason that these analyses came out as they did. In that sense, I wonder if my bias was unconscious, with me fooling myself.
I can do better. And it is clear from my most recent papers that, once I stopped denying our problems, I did in fact do better.
Also in Table 1 are the analyses of my last ten papers, which you can recreate here. There are clear signs of improvement on every metric of scientific integrity. My R-index improved to a respectable 52.7%. While an R-index of 80% or higher is ideal (corresponding to Cohen’s recommendation that researchers achieve 80% power), an R-index between 50% and 80% can be considered trustworthy. I also saw significant improvements in my p-curve, TIVA, and median sample sizes (all Zs > 2.69, all ps < .007). And, yes, I just derived statistics on statistics of statistics, my first-ever meta-meta-analysis.
While the p-curve of my last ten studies reveals that my papers continue to have informational value, the TIVA suggests that these same papers did not rely on questionable research practices. I’m getting better!
I’ve already mentioned the steps I took to improve, but they are worth repeating. The most telling part of my analyses is that the median sample size of my studies increased twofold. I think this one simple step can remedy many (though not all) of our troubles. Instead of rushing through data collection until some arbitrary cutoff is reached (e.g., the 20-per-cell rule of thumb), I now routinely conduct power analyses with honest and conservative effect size estimates. As statistical power is determined not by sample size alone but also by experimental design (among other things), I have made a concerted effort to use better designs, and as a result, my science has become more robust. Inspired by the elegance and replicability of cognitive psychology, where within-subject and repeated-measures paradigms are the norm, four of my last ten papers included within-subject factors, and six of my last ten papers used repeated-measures dependent variables. These designs are extremely effective at reducing random error and between-person variability and thus produce results you can take to the bank. They have their limitations, and they cannot be used to address all questions, but I am convinced they can and should be used more often. And with a little imagination, I have been using them with great success. Most critical, perhaps, is that I have tried to be as transparent as possible (to others as well as to myself) by reporting results as fully as possible, even when they are unflattering.
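To give a flavor of what that kind of power analysis looks like, here is a minimal sketch using statsmodels. It is not the actual analysis behind any of my studies; the effect size and the assumed correlation between repeated measures are placeholder values.

```python
# Illustrative only: how design choice changes the required sample size.
from statsmodels.stats.power import TTestIndPower, TTestPower

alpha, power = 0.05, 0.80
d = 0.40  # assumed (conservative) standardized effect size

# Between-subjects, two independent groups: participants needed per group.
n_between = TTestIndPower().solve_power(effect_size=d, alpha=alpha, power=power)

# Within-subject (paired) design on the same comparison. Assuming the two
# measurements correlate at r = .5, the paired effect size works out to
# d / sqrt(2 * (1 - r)) = d, but every participant contributes to both
# conditions, so far fewer people are needed overall.
n_within = TTestPower().solve_power(effect_size=d, alpha=alpha, power=power)

print(f"Between-subjects: ~{n_between:.0f} per group ({2 * n_between:.0f} total)")
print(f"Within-subject:   ~{n_within:.0f} participants total")
```

With these placeholder values, the within-subject version requires far fewer participants in total than the between-subjects version does per group, which is precisely why I have leaned on these designs.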
I have been fortunate to have colleagues who have pushed me to become a better scientist (Thank you Uli! Thank you Liz!). Although I might have been reluctant at times, I have made concerted efforts to improve, and it is satisfying to see that I have. I haven’t stopped trying to get better, with my lab slowly embracing open science (see here, for example) and even experimenting with pre-registration. I am pushing to collect yet bigger samples.
I found the experience of checking myself to be humbling, yet gratifying. The results of my analyses suggest that at the outset of my career, I made mistakes and cut corners. But by being open to the critiques of my beloved field, and by not denying our problems, making excuses, or blaming outsiders, I have improved.
I no doubt continue to make mistakes, but hopefully I’m cutting fewer corners, and hopefully I’m helping improve social psychology. And as a quick reminder of how badly we need to improve, how badly we all need to check ourselves: when I perform the same set of analyses on papers that cross my desk, it doesn't take me long to flag papers that, to put it kindly, are simply non-robust. Many of these papers come from our elite universities, are published in our elite journals, and sometimes have their results appear in elite newspapers. We can do better.
[1] To re-create the results for the R-index on p-checker, one needs to select “omit nearly significant results.”
For those of you who want to see my self-analysis on p-checker itself, you can see my first ten papers here: http://bit.ly/1cwjjf1; and my last ten papers here: http://bit.ly/1yw88Nk
Posted by: Michael Inzlicht | 21 April 2015 at 02:02 AM
Excellent post!
Posted by: R. Chris Fraley | 21 April 2015 at 04:17 AM
Moving the field forward in theory and in practice. Awesome, Mickey!
Posted by: Erika Carlson | 21 April 2015 at 08:28 AM
Huzzah! Great post.
Posted by: Brent Roberts | 21 April 2015 at 10:10 AM
This is great. Thanks so much for posting! It means a lot when known and senior people in the field are honest in this way.
Posted by: SS | 22 April 2015 at 12:23 AM
Way to set the standard for transparency! This is what we all need to do.
Posted by: Steven Heine | 22 April 2015 at 02:36 AM
From The Economist, 10/19/13, article titled “Trouble at the Lab,” quoting Dr. Bruce Alberts, 12-year president of the National Academy of Sciences and former editor of Science:
“And scientists themselves, Dr Alberts insisted, ‘need to develop a value system where simply moving on from one’s mistakes without publicly acknowledging them severely damages, rather than protects, a scientific reputation.’”
I am slowly coming around to thinking our worst problems are not getting it wrong the first time (though for sure that happens and getting less wrong is important). But I think our worst problems are failure to self-correct.
According to Google Scholar, since 2014, Bargh, Chen, & Burrows (priming elderly stereotypes/walking slow) has been cited 387 times. The Doyen et al. failure to replicate has been cited 99 times.
Go here: the priming elderly stereotypes/walking slow phenomenon seems to exist almost entirely on the strength of p-hacking:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2381936.
(which is not to say "all priming is hogwash" -- the same paper shows otherwise).
Michael, we have never met. But I thank you for modeling the type of scientific behavior that offers hope that we can actually create a self-correcting psychological science.
Posted by: Lee Jussim | 23 April 2015 at 02:38 AM
I apologise for repeating myself, but this is precisely why I think we need a better search engine for scientific studies, be it PubMed, Google Scholar, or something else. You shouldn't take an individual study as final evidence but as the first piece of the puzzle.
In this system the DOI of any given study (like the Bargh et al one) should come with a whole tree of links to replications, exact and conceptual as well as related topics, so you can quickly identify the current state of the evidence. Ideally the platform should allow meta-analyses of the effects.
Each node in the tree should come with tags helping you to determine the directness of the replication and thus allowing you to refine the search. This will not only allow you to check if the effect replicates generally but also identify possible reasons why subsequent attempts might have failed which could inform further experiments that directly test those factors. If most replications are missing a crucial aspect of the original study, there would be a good reason to do such a follow up experiment. If the picture is very messy on the other hand, this makes it unlikely that the effect replicates at all.
I think implementing a system like this will take some initial effort but once things get going I don't think it's a major undertaking. It would be in the authors' own interest to ensure their effects are registered properly in the system.
Posted by: Sam Schwarzkopf | 19 May 2015 at 05:50 PM
Great discussion! It's impressive that you did this...great way to open up honest dialogue in the field!
Posted by: Dan Dolderman | 16 September 2015 at 10:05 AM