[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.]
happy halloween
here's an argument i've heard against registered reports and results-blind reviewing: "judging studies based on their methods is like judging a baking contest based on the recipes." the implication being that this would be ridiculous.
i've been thinking a lot about this analogy and i love it. not because i agree with it, but because i think it gets at the crux of the disagreement about the value of negative (null) results. it's about whether we think the value of a study comes from its results or its methods.
the baking contest analogy rests on the assumption that the goal of science is to produce the best-tasting results. according to this logic, the more we can produce delicious, mouth-watering results, the better we're doing. accumulating knowledge is like putting together a display case of exquisite desserts. and being able to produce a delicious result is itself evidence that your methods are good. we know a good study when we see a juicy result. after all, you wouldn't be able to produce a delicious cake if your recipe was crap.
this analogy probably sounds reasonable in part because of how we talk about negative results - as failures. in baking that's probably apt - i don't want to eat your failed donut.* but in science, the negative result might be the accurate one - you can't judge the truthiness of the result from the taste it leaves in your mouth. we may not like negative results, but we can't just toss them in the bin.
here's what i think is a better analogy for science. producing scientific knowledge is like putting together a cookbook that allows other people to follow recipes to reliably produce specific outcomes. if someone wants their recipe included in the cookbook, we don't necessarily need for the recipe to produce something yummy, we want it to produce something predictable. maybe it's bland, maybe it's sour, maybe it tastes like cilantro. the point is that we know what it will produce most of the time (within some specified range of uncertainty), regardless of who the cook is. in other words, what we really want to know is "what happens when we follow this recipe?" and we want to know this for a wide range of recipes, not just the ones that produce delicious results, because the world is full of strange mixtures and combinations that are bound to occur, and we often want to know what the outcome is when those ingredients mix.
in the case of psychology, the recipe might be something like "measure intelligence and happiness in a bunch of college students in the US and correlate the two variables" and the outcome might be "tiny relationship." that's not as delicious as a large relationship, but it's important to know anyway, because it's a fact that helps us understand the world and now constrains future theories.
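to make the recipe concrete, here's a minimal sketch of what that analysis could look like (made-up data and numbers, purely for illustration - the point is that the script runs and reports whatever estimate comes out, tasty or not):

```python
# a minimal sketch of the "recipe": measure two variables in a sample,
# correlate them, and report the estimate whatever it turns out to be.
# all numbers here are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1031)
n = 300  # hypothetical sample of US college students

intelligence = rng.normal(100, 15, n)
# assume a tiny true relationship (about r = .05) plus noise
happiness = 0.05 * (intelligence - 100) / 15 + rng.normal(0, 1, n)

r, p = stats.pearsonr(intelligence, happiness)
print(f"r = {r:.2f}, p = {p:.3f}")  # goes in the cookbook either way
```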
we can definitely reserve the right to say that some recipes are not interesting enough to want to know what happens when we follow them (e.g., maybe it's not worth knowing what the correlation is between shoe size and liking mackerel**), but we can decide that based on the recipe, without knowing the result. every once in a while, there will be a recipe that looks boring but produces a result that is actually interesting (e.g., maybe inhaling dog hair cures cancer), and we should certainly have a mechanism for those kinds of studies to make it into the literature, too. but at least a good chunk of the time, it's the recipe that makes the study interesting, not the result.
what's the problem with choosing what to publish based on the tastiness of the outcome? first, if we reward the tastiness of the result, we are incentivizing people to take shortcuts to produce tasty results even if that means deviating from the recipe. that might work at a potluck (why no, i didn't use any baking powder in these eclairs!). but in science, if you take shortcuts to get an impressive result, and you don't report that (because it would no longer be impressive if you admitted that you had to use a little baking powder to get your choux pastry to rise), that corrupts the scientific record.
second, it gives too little weight to the quality of the methods. even putting aside questionable research practices like p-hacking and selective reporting, there's the problem of accurate interpretation. in science, unlike in baking,*** you can make scrumptious claims with crappy ingredients. if we don't pay close attention to the soundness of the methods, we risk letting in all kinds of overblown or misinterpreted findings. if we look closely at the recipe, we may be able to tell that it's not what it claims to be. for example, if you say you're going to make a carrot cake but you're using candy corn instead of carrots, it may taste delicious but it won't be a carrot cake. (in this analogy, the carrot is the construct you're claiming to be studying and the candy corn is the really crappy operationalization of that construct. keep up!) there are many ways to produce replicable findings that don't actually support the theoretical claims they're used to bolster.
third, publications are pieces of the scientific record, not prizes. and if we only record the successes, there is no hope of a cumulative record. often the same people who are against giving negative results a chance also say that we shouldn't worry about the replicability of individual studies because no one thinks that an individual study is definitive anyway. they appeal to our ability to accumulate evidence across a set of studies. but that's only a safeguard against flawed individual studies if the larger set of studies is unbiased - if it includes the negative results as well as the positive ones. science can't accumulate if we only record the successes (this problem is compounded if the successes can be exaggerated or p-hacked, but it's still a problem even in the absence of those distortions).
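here's a toy simulation of that accumulation problem (mine, nothing formal): a small true effect, lots of small studies, and a "literature" that only records the significant ones ends up with an average effect far larger than the truth.

```python
# toy illustration: why accumulating evidence only works if the record is unbiased.
# same set of studies, two literatures: one keeps everything, one keeps only "successes".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d = 0.2           # small true effect (Cohen's d)
n_per_group = 30       # typical small study
n_studies = 1000

observed_d, significant = [], []
for _ in range(n_studies):
    control = rng.normal(0, 1, n_per_group)
    treatment = rng.normal(true_d, 1, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    observed_d.append((treatment.mean() - control.mean()) / pooled_sd)
    significant.append(p < 0.05)

observed_d, significant = np.array(observed_d), np.array(significant)
print(f"true effect:          d = {true_d:.2f}")
print(f"all studies recorded: d = {observed_d.mean():.2f}")
print(f"only the 'successes': d = {observed_d[significant].mean():.2f}")
```

adding more studies to the biased pile doesn't fix this - it just makes us more confident in the wrong number.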
the response i usually get at this point is that many negative results are just failures of execution - poorly designed studies, unskilled experimenters, etc. i have two answers to this: 1) even if this is true, we need a way to identify the rare valid negative result, so that we have a chance to know when an effect really is zero. if the negative result itself is a reason to ignore the study, we'll never be able to correct false positives. 2) the same argument can be made about positive results. i know how to make any null association significant: measure both variables with the same method. or in the case of an experiment, throw in a confound or a demand characteristic. how do i identify these problems? not by saying it must be spurious because it's easy to mess up a study in the direction of producing a significant result, but by scrutinizing the method. the same should be done for studies with negative results that we suspect of being shoddy.
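to show what i mean about the "same method" trick, here's a toy simulation (again, made-up numbers): two constructs with zero true correlation, both measured with instruments that share a response-style bias, and the observed correlation comes out "significant" anyway.

```python
# toy demonstration: two constructs that are truly uncorrelated, both measured
# with the same method (e.g., self-report sharing a response-style bias).
# the shared method variance alone produces a "significant" observed correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 200

construct_a = rng.normal(size=n)       # true scores, independent
construct_b = rng.normal(size=n)       # of each other by construction
response_style = rng.normal(size=n)    # e.g., acquiescence or social desirability

# both measures pick up the same response style on top of their constructs
measure_a = construct_a + 0.7 * response_style + rng.normal(0, 0.5, n)
measure_b = construct_b + 0.7 * response_style + rng.normal(0, 0.5, n)

r_true, _ = stats.pearsonr(construct_a, construct_b)
r_obs, p_obs = stats.pearsonr(measure_a, measure_b)
print(f"true constructs:      r = {r_true:.2f}")
print(f"same-method measures: r = {r_obs:.2f}, p = {p_obs:.4f}")
```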
we must have a way of evaluating quality independent of the outcome of the study. that means requiring much more transparency about how the study was run and analyzed. in the absence of transparency, it's easy to use results as a crutch. (i do it, too. for example, i sometimes use p-values to gauge how robust the results are, because i often can't see everything i need to see (e.g., the a priori design and analysis plan) to tell whether the research was conducted rigorously.) but we must demand more information - open materials, pre-registration, open data, evidence of the validity of measures and manipulations, positive controls, etc. - so that we can separate the evaluation of the quality of the study from the outcome of the study.
some argue that committing to publishing some studies regardless of the outcome will lead to a bunch of boring papers because people will only test questions they can predict the answer to. i think the opposite is true. when we aren't evaluated based on our results, we can take more risks. it's when we need to produce significant results that we're incentivized to only test hypotheses we're pretty sure we can confirm. indeed, as i learned from one of Anne Scheel's tweets, Gigerenzer (1998) reported that Wallach and Wallach (1994, 1998) showed that most "theories" tested in JPSP and JESP papers are almost tautological.
in some fields, p-hacking is called "testing to a foregone conclusion." let's stop publishing only papers with foregone conclusions. contrary to what often gets repeated, the way to encourage creativity, risk-taking, and curiosity is by publishing rigorous studies testing interesting questions we don't yet know the answers to. we can have our cake and eat it, too – we just can't guarantee it'll be tasty.****
* i would probably eat your failed donut.
** spurious correlation, dutchness
*** if there's a way to hack your way to delicious food without using good methods, please let me know.
**** you thought i'd given up on the tortured analogy, didn't you?
this pumpkin has been hacked.
most excellent - I fully agree and I am not even in psychology! And yes, i would also eat your failed donut :-)
Posted by: koen pauwels | 01 November 2018 at 02:37 AM
Great post, completely agreed! :)
The answer to *** is eating out in a good restaurant.
Posted by: Sam Schwarzkopf | 01 November 2018 at 07:08 AM
I think the metaphor is both accurate and inaccurate. Given the same recipe, skilled chefs and unskilled chefs generate VERY different results. In baking this is exacerbated (see obligatory reference to Great British Baking Show/Bake-Off, which demonstrates this point amply). A recipe assumes a vast amount of background knowledge and skill--which is not apparent in the recipe. Ditto registered reports.
If you believe that some people are better at writing grants than they are at conducting research, then you're already on the same page as those who criticize the notion that a registered report is a sufficient basis for a publication (or whatever) decision. But all that is just dissecting the metaphor.
Some of the argument takes the form of "if you've p-hacked or otherwise mangled your results, the work is bad." This is true, but registered reports don't really solve *that* problem, and there are other, better-suited ways to address it (IMO).
But what really got me going was endorsing the notion that "Wallach (1994, 1998) showed that most "theories" tested in JPSP and JESP papers are almost tautological."
No, they did not. They made the argument. If anyone still cares, we differ:
Schaller, M., Crandall, C.S., Stangor, C. and Neuberg, S.L. (1995). "What kinds of social psychology experiments are of value to perform?" A reply to Wallach and Wallach (1994). Journal of Personality and Social Psychology, 69, 611-618.
Schaller, M. and Crandall, C.S. (1998). On the purposes served by psychological research and its critics. Theory and Psychology, 8, 205-212.
Crandall, C.S. & Schaller, M. (2002). Social psychology and the pragmatic conduct of science. Theory and Psychology, 11, 479-488.
NB: These papers are 15-20 years old, and we might change some of the examples or arguments, but the main thrust remains.
Posted by: Chris Crandall | 01 November 2018 at 09:11 AM
@Chris Crandall:
To me there is a difference between providing a sufficiently detailed recipe and being a great cook. When we say that a method section should provide the necessary information for anyone to replicate an experiment, I don't think anybody seriously means to imply that training is irrelevant. Obviously I won't be able to replicate a particle physics experiment even if you give me access to the LHC.
I think in the past not even that was a given, as methods sections, especially in many top impact journals, were far from detailed enough even for experts to replicate the methods. But that situation is fortunately improving.
The other aspect I believe relates to the level of training that should be required. Comparing particle physics and social psychology experiments is obviously a red herring. But shouldn't we expect a psychology researcher to be able to replicate psychology experiments?
I'd concede that even within a field there *can* be differences in expertise that could be an issue. There is probably no blanket judgement here, but you need to look at this case by case. Nevertheless, if a finding is strongly susceptible to experimenter effects, at the very least this implies the result is subtle and/or unreliable. Moreover, if it were me, I would then really want to know what actually made the difference.
Posted by: Sam Schwarzkopf | 01 November 2018 at 11:03 AM