[photo caption: keri russell plotting her next QRP]
i am teaching a seminar called 'oh you like that finding do you? well it's probably FALSE.'
the students are a little bit shell-shocked.
i am having the time of my life.
the hardest part is explaining how perfectly smart, truth-seeking scientists can repeatedly do idiotic things. happily, i have plenty of examples from my own life.
just this week, i almost p-hacked. the details are a little dry, but i think it's worth telling this tale of mundane p-hackery, because this is what p-hacking looks like in the wild. it is not sinister and dark like a good episode of the americans. it is super boring like the stories you read to your children at night.* in fact, if you run out of bedtime material, read them this.
once upon a time, there were some personality researchers who wanted to measure personality.
they collected self-reports and six peer reports from each of their participants.
they wrote a paper examining how the big five personality traits correlate with friendship satisfaction.
because the paper was due right when the data came in, and organizing the peer reports takes time, they only used self-reports of personality (bad personality researchers!).
the reviewers and editor said 'dude, we know you have peer reports. and you shouldn't just correlate self-reports (of personality) with self-reports (of friendship satisfaction).'**
so they compiled the data and ran the analyses with the peer reports of personality. not surprisingly, all the correlations with (self-reported) friendship satisfaction were weaker. still significant (yay large samples), and with a similar pattern as the self-reports, but weaker.
they started off by reporting all of their results, with the self-reports and the peer reports side by side.
then their tables got unwieldy, and they decided that they should just aggregate the self- and peer-reports into a composite measure for each personality trait. because obviously that is the best measure of personality. and it would make their tables easier to read.***
so now, gentle reader, we have a self-report and (up to) six peer reports for each participant. we want to aggregate them. what are we to do?
option a is to first aggregate the six peer reports, and then average that composite with the self-report, in which case the self-report is weighted as much as ALL the peer reports put together.
option b is to just average all seven reports (self-report and six peer reports) all at once, in which case the self-report is weighted only as much as any single peer report.****
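(for readers who want the formulas written out: here is a rough sketch, in python, of what the two options boil down to. the numbers and column names are made up, and this ignores missing peer reports and other item-level details.)

```python
import pandas as pd

# made-up example data: one row per participant, a self-report and six
# peer reports of the same trait, all on the same scale
df = pd.DataFrame({
    "self":  [4.0, 2.5, 3.5],
    "peer1": [3.0, 2.0, 4.0],
    "peer2": [3.5, 2.5, 3.0],
    "peer3": [4.0, 3.0, 3.5],
    "peer4": [2.5, 2.0, 4.5],
    "peer5": [3.0, 2.5, 3.0],
    "peer6": [3.5, 1.5, 4.0],
})
peer_cols = ["peer1", "peer2", "peer3", "peer4", "peer5", "peer6"]

# option a: average the six peer reports first, then average that composite
# with the self-report -- the self-report gets half the total weight
peer_composite = df[peer_cols].mean(axis=1)
option_a = (df["self"] + peer_composite) / 2

# option b: average all seven reports at once -- the self-report gets
# 1/7 of the weight, the same as any single peer report
option_b = df[["self"] + peer_cols].mean(axis=1)
```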
pop quiz hotshot: which one will give us "better" results?
option a would give us bigger effects. and many many researchers in this area***** would say option a is the better option because self-reports are special and should be weighted more than each individual peer report. so our protagonists would have cover if they went with option a.
option b, however, is what our protagonists have typically done in the past. partly because they are convention-busting mavericks and partly because they believe that the self is not that special - each of your close friends knows about as much about your personality as you do yourself.
so the noble researchers sat there and, for several minutes, contemplated what to do. 'contemplated' is the wrong word. it was immediately obvious to them what they should do. they even said 'this is p-hacking, what we're contemplating doing right now.' out loud. and still, they sat there for another minute. squirming.
then they did the right thing because that's what protagonists do and would i really be telling you this story if we p-hacked? come on.
the moral of the story is: not p-hacking is HARD. side effects include redness, swelling, and deep, deep frustration. and self-righteousness.
what's that? your kids are still awake? here is a fun little postscript:
another thing the reviewers said was 'look, i know your analyses were exploratory, but anyone with half a brain would have predicted finding ABC. stop dicking around and, in the introduction, tell your readers about how all existing personality theory and research would predict ABC.'
what are our protagonists to do? they did not predict ABC, but mostly because they were too damn lazy to make specific predictions (bad bad researchers!). they did not want to lie in the introduction. but they also agreed that only a dumbass would not predict ABC.
in a brilliant stroke of genius, they decided to write: 'previous theory and research on A and B and C would clearly suggest that ABC. although this would be a reasonable prediction, we did not actually make this prediction a priori' and they lived happily ever after.
the end.******
* i am extrapolating from the stories my dog makes me read at night. i hear parents love it when you compare their kids to dogs.
** this is called getting a taste of your own medicine. your kids will appreciate this lesson someday.
*** of course we are putting the data and disaggregated results on OSF. we're not total idiots.
**** the younger kids might get confused here. just write out the formulas for them.
***** almost all seven of them.
****** everything in this story is true (you know, without the dicking around and the dumbass bits), but in our defense, our paper (and methods) were a little more sophisticated than described here.
sweet dreams.
Here is an idea about what a researcher should do with self-ratings and informant ratings that deals with the problem of aggregation: DO NOT AGGREGATE!
The reason is that the error variance in self-ratings gets mixed up with the error variance in informant ratings, and it becomes impossible to say which variance components drive a correlation.
The best way to analyze these data is to use structural equation modeling. You can then correct for measurement error in personality and show the true strength of the relationship, and you can show that error variance in self-ratings is correlated with the criterion (shared method variance).
See Kim, Schimmack, & Oishi (JPSP, 2012) for an example.
Structural equation modeling was invented in the 1950s. Maybe, 60 years later, personality psychologists can start using this useful statistical tool, which avoids the problem of QRPs in aggregation.
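To make this concrete, here is a rough sketch of what such a model could look like in Python, using the semopy package (lavaan-style model syntax). The file and column names are hypothetical, and the exact specification would of course depend on the actual measures:

```python
import pandas as pd
from semopy import Model

# hypothetical data: one row per participant, with a self-report and six
# peer reports of a trait, plus self-reported friendship satisfaction
data = pd.read_csv("personality_ratings.csv")  # hypothetical file name

model_desc = """
# latent trait measured by self- and peer-reports (no aggregation needed)
trait =~ self_report + peer1 + peer2 + peer3 + peer4 + peer5 + peer6

# the latent trait predicts self-reported friendship satisfaction
satisfaction ~ trait

# shared method variance: residual covariance between the two self-reports
self_report ~~ satisfaction
"""

model = Model(model_desc)
model.fit(data)
print(model.inspect())  # parameter estimates, standard errors, p-values
```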
Posted by: Dr. R | 03 February 2015 at 06:09 AM
Okay now I'm confused. Perhaps it's because I'm not a personality researcher or because of my cursory reading over my alter ego's morning coffee (it's not always easy to concentrate when you share someone else's mind). But what does this story have to do with p-hacking?
Your protagonists were faced with a choice of which analysis protocol to follow. The option most commonly used in the literature would have given them the more striking results (you say "bigger effects", so in the context of p-hacking I assume that the p-values were lower for this statistical comparison?). The option most typically used in the protagonists' own research would have given them weaker effects. This isn't so much about p-hacking as it is about deciding what the most appropriate analysis should be.
The Crusaders (as I call them) will tell you that the protagonists should have just preregistered that analysis pipeline. This example serves again as a reminder why this doesn't actually work in practice: if the experiment had been preregistered, the preregistered analysis would not have included the peer reports. So prereg probably wouldn't have saved the protagonists from this dilemma.
Of course, the best solution to this situation would have been to simply report *both* analyses alongside a discussion of this problem. Or (and?) choose a more appropriate analysis method that gets around aggregation, as the previous commenter suggested.
Posted by: The Devil's Neuroscientist | 04 February 2015 at 09:31 PM
I thought I'd respond to The Devil's Neuroscientist as she prompted me on Twitter =)
// Is it p-hacking?
The story is about researchers potentially (but not actually) exploiting their researcher degrees of freedom in a manner that could artificially inflate the strength of their findings.
It certainly sounds like p-hacking in essence to me, even if that term is not strictly accurate in a technical sense* when there was no p < .05 threshold issue (unclear if there was).
// Would pre-registration have made a difference?
I think a registered reports version of pre-reg would have done. If we assume that the pre-reg document was reviewed by the same reviewers mentioned above, then they would have pointed out the missing peer reports issue before the study had even been done. So when the researchers came to write up their report, that would have been one less degree of freedom available for potential exploitation.
Also imagine how much time/resources would have been saved if the researchers hadn't even been intending to collect peer reports at all and the reviewer pointed this out early on...
// (im)perfect science
I think another important issue raised by this piece (in addition to the fact that I dropped my tea laughing at one point) is the inherent tension between the drive for aesthetically pleasing science (cf. the desire to tame unwieldy tables, or a reviewer encouraging HARKing) and the rough-looking version of science that is less easy to communicate but much more transparent. Are we getting the balance right?
* I drew my p-hacking definition from urban dictionary as I think we are talking about a colloquial term here rather than one with a precise definition: http://www.urbandictionary.com/define.php?term=p-hacking
Posted by: Tom | 05 February 2015 at 05:24 AM
Thanks for this comment. I agree, and with even more coffee in my system I can now somewhat see why you would regard this as a form of p-hacking, even if it isn't strictly about getting p<.05.
I think what threw me off at first is that the dilemma was really about the choice between typical research practice and the approach the authors would normally have taken. The reason I don't really see this as p-hacking is that it doesn't "artificially inflate the strength of their findings", because we don't really know which results are truly the more appropriate. Perhaps option b is deflating the results? It's more a question of whether the authors stick to their own theories or conform to the status quo. But yes, I can see now why you see it this way.
Again, rather than pre-registration I think the best course of action would have been to simply present both types of analysis (plus perhaps others). In situations where the right course of action is unclear I feel it's better to just provide the reader with the available evidence and let them make up their minds.
You're of course right that a peer-reviewed registered report could have determined this from the outset. I have previously argued that this is the only way that preregistration could realistically work. In the meantime, though, discussions I have witnessed between various proponents of prereg have led me to reevaluate this idea. In fact, Tywin Lannis... sorry, David Shanks recently argued quite cogently (in my mind) that registered reports are unlikely to ever catch on.
Finally, my goody two-shoes twin brother, Sam, would like to say that he agrees that this blog is very funny.
Posted by: The Devil's Neuroscientist | 05 February 2015 at 07:41 PM
thanks for your comments everyone!
Dr. R - yup, i totally agree. SEM is very useful and we should use it more. it doesn't eliminate the problem of researcher degrees of freedom, but of course no statistical technique can do that. in the end we will always need to rely on judgment and reason.
devil's neuroscientist - part of the point i wanted to make is that often, when researcher degrees of freedom are involved, both/all options are justifiable. it's not a matter of simply avoiding the 'wrong' approach. but if we systematically choose the approach that gives us bigger effects/smaller p-values, this contributes to the bias in the published literature (and we are capitalizing on chance). p-hacking is relevant anytime our decision about which analysis to use is influenced by how strong/significant the results are with each option. (reporting all results is great, and we do that in the supplemental materials, but most people won't read the supplemental materials, and i don't always want to see every possible analysis in the main text.)
re: pre-registration. i think it's great. i'm a big fan in principle. but i can't seem to do it. i just love exploring data. we are trying to be more systematic about using one dataset for exploration, and then another strictly for confirmation (this is our solution to preregistration when dealing with datasets that took us several years to collect). so i'm trying to move closer to the ideal. in my view, the ideal is some combination of exploration and confirmation, with total transparency about which is happening when. (i also think that it's almost impossible to anticipate all possible researcher degrees of freedom ahead of time, so pre-registration can help a lot, but can't completely eliminate the temptation to p-hack.)
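(for the record, the mechanics of the split are the easy part. something like the sketch below, with made-up file and variable names. the hard part is not peeking at the confirmation half until the exploratory analyses are locked down.)

```python
import pandas as pd

df = pd.read_csv("multi_year_study.csv")  # made-up file name

# split once, with a fixed seed, into an exploration half and a
# confirmation half that stays untouched until the analyses are finalized
exploration = df.sample(frac=0.5, random_state=2015)
confirmation = df.drop(exploration.index)
```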
thank you all for your interest and feedback!
Posted by: simine | 06 February 2015 at 04:40 AM
Thanks for the clarifications. I also appreciate your candor about exploring data. Sam is certainly the same. Data are exciting; why wouldn't you explore them? I agree that it would be great to have a better idea of what is exploration and what is hypothesis-driven research. However, I think that requires a real cultural change. The current culture puts undue value on hypothesis-driven research, which forces people to pretend that their exploration is hypothesis-driven. Thus I believe pre-registration is trying to cure the symptom rather than the cause of the problem.
Posted by: The Devil's Neuroscientist | 06 February 2015 at 12:32 PM