A Tale of Two Papers
By Michael Inzlicht
Change is afoot in psychology.
After years of bickering on social media and handwringing about whether our field is or is not in serious trouble, some consensus is emerging. Although we might not agree on the severity of our problems, almost no one doubts that our field needs improvement. And we’re now seeing the field take real steps toward that, with new editors stepping in with mandates for genuine and powerful change.
As an Associate Editor at a journal I’m very proud of, I have lived through some of this change. While the standards at the Journal of Experimental Psychology: General have always been high, they are more stringent now than when I began. Some people interpret this change as a turn toward conservatism, toward valuing safe work over creative work. While I appreciate this perspective, I disagree. Instead, I see it as a turn toward transparency, as a turn toward robustness. I still value creative research, and I demand it of myself and of the authors I work with, but now I also value transparency, which allows for robustness.
Transparency is critical, yet it has been lacking in the past. Transparency not only offers an honest reflection of how hard science is, it also allows for cumulative science, letting those who follow truly build on and extend past work, knowing precisely what steps to take and what pitfalls to avoid. It is amazing that for so long we have been writing papers backassward: We were taught to look at the data first and derive hypotheses after the fact, so that for far too long our papers have seemed perfect. The result of all these “perfect” papers is that our field produces findings that cannot be replicated. If, however, our work and our data were transparent, if we revealed all their ugly flaws, we might actually be able to build on them and create a robust science.
To illustrate, let me tell you the story of two papers I edited. The first paper contained 7 experiments, often with outliers removed and covariates included, and reported effect sizes in the medium-to-large range, all supporting the main hypothesis. The second paper contained 18 experiments, didn’t exclude anyone or add any covariates, and reported small effect sizes that sometimes ran contrary to the hypothesis. The first paper reported 7 significant results out of 7; the second reported 2 out of 18.
These were not two papers. They were two versions of the same paper.
The first was emblematic of the old way of doing business, with 7 studies scrubbed clean to be near-perfect. The second is emblematic of the new way we are trying to do business, with studies that were raw, unvarnished, and true.
This paper, authored by the very brave Mirjam Tuk, Kuangjie Zhang, and Steven Sweldens and titled “The propagation of self-control: Self-control in one domain simultaneously improves self-control in other domains” (volume 144, pages 639-654), is now my favorite as editor. I say this not necessarily because of the topic area (which is fascinating in its own right and deserving of its own blog post), but because it is a model of transparency and a template for the kinds of things we should be seeing more of in our top journals.
Despite high marks from reviewers enthralled with the topic, I decided to reject the first paper because of signs of non-robustness. One intrepid reviewer analyzed the paper using my friend Uli Schimmack’s Incredibility Index and found it to be excessively significant. Papers are excessively significant when they contain multiple studies and there is a discrepancy between the expected number of significant results (as determined by the studies’ power) and the actual number. Excessive significance, or so-called incredible results, undermines the trustworthiness of the reported findings and intimates that something else might be going on.
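To see the intuition, here is a rough sketch of the logic in Python. It is not Schimmack’s exact procedure, and the 60% power figure is a made-up illustrative value, but it captures the core question: how often would a set of modestly powered studies all come out significant?

```python
# Rough sketch of the excessive-significance logic (illustrative only,
# not Schimmack's exact Incredibility Index computation).
from scipy.stats import binom

n_studies = 7         # studies reported in the first paper
n_significant = 7     # all seven were significant
assumed_power = 0.60  # hypothetical median power; a made-up value

# Probability of observing at least this many significant results
# if each study truly had about 60% power.
p = binom.sf(n_significant - 1, n_studies, assumed_power)
print(f"P(at least {n_significant}/{n_studies} significant) = {p:.3f}")  # ~0.028
```

When that probability comes out very low, the set of results is “incredible” in Schimmack’s sense: more significant findings than the studies’ power should be able to deliver.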
In the case of the first paper, that something else was a rather large file drawer.
Not everyone considers file drawering studies to be a questionable research practice; however, almost everyone agrees that it is very problematic for the field. File drawering warps our sense of how real, robust, and large an effect is. Because researchers file drawer studies (whether by choice or because the review process leads them to), there is a growing sense that we cannot fully trust the published record. Even meta-analyses, seen as the best way of establishing the veracity of an effect, have become suspect because of things like the file drawer. Garbage in, garbage out, right? We are now witnessing a meta-analysis boom, with researchers re-doing old meta-analyses using sophisticated techniques to detect and correct for bias, chiefly the bias caused by file drawering.
So: file drawering is bad. And that first paper was guilty of doing it. But almost everyone is guilty of doing it. I know I am.
After I rejected the first paper because it appeared too good to be true, the authors came back to me saying that it was indeed too good to be true: they had a large file drawer. But here was the rub: when the authors included all of their studies, when they emptied the file drawer, the meta-analytic effect was robust, albeit small. Better still, the meta-analytic effect held when all participants were included (some of whom legitimately could have been dropped) and all covariates were excluded.
I invited the authors to submit this second paper, and after a few rounds of revision, it was accepted. Their meta-analytic effect, unlike almost all others in the field, is based on a body of data that is not warped by the file drawer. Of course, all data warrant independent replication; but to me, these data are trustworthy. I mean, only two of the studies crossed the magic p < .05 line, and a couple of the studies went in the wrong direction altogether. Check out their forest plot. This is what real data look like. The data are not always pretty, they have warts, but they are real.
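For readers who want to see how such a file-drawer-free estimate gets assembled, here is a minimal inverse-variance meta-analysis sketch. The study-level numbers are placeholders I invented for illustration; they are not Tuk and colleagues’ data.

```python
# Minimal fixed-effect (inverse-variance) meta-analysis sketch.
# The study-level values below are invented for illustration only.
import numpy as np

d  = np.array([0.35, -0.05, 0.28, 0.10, 0.31, 0.02])  # per-study Cohen's d
n1 = np.array([60, 80, 55, 90, 70, 65])                # group 1 sizes
n2 = np.array([60, 75, 55, 85, 70, 70])                # group 2 sizes

# Approximate sampling variance of d for a two-group design
var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

w = 1.0 / var_d                        # inverse-variance weights
d_meta = np.sum(w * d) / np.sum(w)     # pooled effect size
se_meta = np.sqrt(1.0 / np.sum(w))     # standard error of the pooled effect

print(f"meta-analytic d = {d_meta:.2f}, 95% CI half-width = {1.96 * se_meta:.2f}")
```

A forest plot is simply this table drawn out: each study’s estimate with its confidence interval, and the pooled estimate at the bottom.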
This paper is a model of transparency and will lead to better studies on this topic in the future. Because the unvarnished effect size from this study is quite small, d = 0.22, future follow-up studies will need 652 participants in a two-cell between-subjects design to achieve 80% power. Yes, this is a huge number. But, you know: research is hard.
If we want a robust science, we need to run high-powered studies. But the only way to correctly calculate power is with a realistic estimate of the underlying effect size. This second paper provides one, even if the usual caveats about independent replication apply. Power is about more than just sample size, however, and some of you might be relieved to hear that 80% power can be achieved with 166 participants in a simple pre-post repeated-measures design[1].
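For anyone who wants to check these numbers, here is a quick sketch using statsmodels’ power routines, assuming a two-tailed test at alpha = .05. The d-to-dz conversion is the standard one for a .5 pre-post correlation, and the exact participant counts can differ by one or two from those quoted above depending on rounding conventions.

```python
# Power calculations for d = 0.22 at 80% power, alpha = .05, two-tailed.
from math import ceil, sqrt
from statsmodels.stats.power import TTestIndPower, TTestPower

d = 0.22

# Two-cell between-subjects design: solve for participants per group.
n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"between-subjects: {ceil(n_per_group)} per group, "
      f"{2 * ceil(n_per_group)} total")              # ~326 per group, ~652 total

# Pre-post (repeated-measures) design: convert d to dz assuming r = .5
# between the pre and post measures (the footnote's assumption).
r = 0.5
dz = d / sqrt(2 * (1 - r))                           # equals d when r = .5
n_paired = TTestPower().solve_power(effect_size=dz, alpha=0.05, power=0.80)
print(f"pre-post: {ceil(n_paired)} participants")    # ~165, close to the 166 quoted
```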
I am a huge fan of this second paper. I love all my children, but I would be lying if I said that this wasn’t my favorite as editor. I love it because it is transparent; and because it is transparent, it allows for a robust science. This push for transparency, of revealing our warts, is exactly what the field needs.
Many of the new standards emerging in our field are means to ensure transparency. Scientists can still pursue creative, even bold ideas, but they should do so while being more sincere and more circumspect about the fruits of their efforts, even when those fruits reveal imperfections.
Science is hard. While it can sometimes reveal true beauty, it can only do so when we allow it to be ugly too.
[1] This assumes a correlation of .5 between the pre and post measures. Sample size goes down with higher correlations and up with lower ones.