one of the themes of the replicability movement has been the
Campaign for Real Data (Kaiser, 2012). the idea is that real data, data that haven't been touched up by QRPs, are going to be imperfect, sometimes inconsistent. part of what got us into this mess is the expectation that each paper needs to tell a perfect story, and any inconsistent results need to be swept under the rug.
whenever this comes up, i worry that we are sending researchers a mixed message. on one hand, we're saying that we should expect results to be messy. on the other hand, we're saying that we're going to expect even more perfection than before. p = .04 used to be just fine; now it makes editors and reviewers raise an eyebrow and consider whether there are other signs that the result may not be reliable. so which is it: are we going to tolerate more messiness, or are we going to expect stronger results?
yes.
on the face of it, these two values (more tolerance for messiness vs. more precise/significant estimates) seem contradictory. but when we dig a little deeper, i don't think they are. and i think it's important for people to be clear about what kind of messy is good-messy and what kind of messy is bad-messy.
recently, a colleague told me about a result he got that looked pretty robust - it was strong and significant when he analyzed it six out of the seven ways possible. but the seventh analysis was not significant. this is probably good-messy. this is the kind of messy that you should absolutely report,* and nobody should really care very much about. assuming the seventh analysis is just as diagnostic as the other six (i.e., it's not a clearly better way to test the question), and the first six results give very compelling evidence (e.g., the confidence intervals are pretty tight, the p-values are mostly close to zero), we should probably chalk up that seventh result to random noise and move on.
good-messy is when you have mostly very strong evidence (precise estimates, p-values close to zero, etc.), but every now and again there's a misbehaving result. if the misbehaving results are about as frequent as you'd expect by chance when there is a true effect (i.e., if the proportion of null results is consistent with what you think your type II error rate should be), then it shouldn't be a cause for concern. if a reviewer gives you a hard time for it, or asks you to over-interpret it, send them over here.**
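to make the "about as frequent as you'd expect by chance" part concrete, here's a minimal sketch in python (the power number and the 6-out-of-7 scenario are assumptions i'm using for illustration, not anyone's real data):

```python
# illustrative sketch: 7 equally diagnostic analyses, an assumed 80% power
# per analysis, and a true effect
from scipy.stats import binom

n_analyses = 7
power = 0.80  # assumed per-analysis power, so the type II error rate is .20

# chance of at least one non-significant result despite a true effect
p_at_least_one_null = 1 - binom.pmf(n_analyses, n_analyses, power)
print(f"P(at least 1 non-significant out of {n_analyses}): {p_at_least_one_null:.2f}")  # ~.79

# chance of exactly one non-significant result (the 6-out-of-7 pattern)
p_exactly_one_null = binom.pmf(n_analyses - 1, n_analyses, power)
print(f"P(exactly 1 non-significant): {p_exactly_one_null:.2f}")  # ~.37
```

under those assumptions, seeing at least one non-significant result is actually more likely than not, and the 6-out-of-7 pattern itself is completely unremarkable.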
bad-messy looks pretty different. bad-messy is where most of the individual results are unconvincing. the prototypical case is one where there is a string of underpowered studies with p-values just below .05. by now most of us know that this is a sign that the result is unreliable, probably due to perfectly common things like HARKing, flexibility in data analysis, or a file drawer. what we've learned is that, when your studies are underpowered, it's super unlikely that you'll get only significant results. what that DOESN'T mean is that throwing in one not-quite-significant result makes things ok. sorry. i know that seems like the obvious conclusion. if too-many-underpowered-but-significant-studies = bad, then it seems perfectly reasonable that many-underpowered-but-significant-studies-plus-one-or-two-not-quite-significant-studies should be good. and, sure, it's better. but it's still not great. if none of the individual studies are high-powered and produce compelling evidence (a highly precise/significant result), a bunch of studies that kinda sorta suggest something is still not going to be convincing.
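to put rough numbers on that, here's a back-of-the-envelope sketch (the power and study counts are assumptions i picked for illustration):

```python
# illustrative sketch: 5 studies, each with an assumed 35% power
from scipy.stats import binom

n_studies = 5
power = 0.35  # assumed power of each underpowered study

# chance that ALL studies come out significant if the effect is real
p_all_significant = binom.pmf(n_studies, n_studies, power)
print(f"P(all {n_studies} significant): {p_all_significant:.3f}")  # ~.005

# what you'd actually expect: only power * n_studies significant results
print(f"expected number of significant studies: {power * n_studies:.1f}")  # ~1.8
```

a string of all-significant results from studies like these is itself improbable if the effect is real, and swapping in one or two not-quite-significant results only nudges the pattern toward what honest underpowered data would look like - the set as a whole is still not convincing.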
why not? why can't you just do a meta-analysis and show that, overall, you have clear evidence for a result? the blunt answer is because of p-hacking and QRPs. if your studies are pre-registered and you did not deviate from the pre-registered analysis plan,*** and you include all studies, then yes - meta-analyze away and the result should carry a lot of weight (though see
this paper about how optional stopping at the study level could threaten the results of your internal meta-analysis). but in the absence of pre-registration, it's entirely possible that each study in the set is a false positive. i know that makes me sound cynical. that's because i am. i'm not cynical about researchers' intentions - i truly believe that almost nobody thinks they're p-hacking and almost everybody believes their effects are real. i'm cynical about people's abilities to recognize when they're capitalizing on chance.
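here's a toy simulation of one way that can happen - optional stopping within each study - with numbers i made up purely for illustration:

```python
# toy simulation: two groups with NO true difference, but we peek at the
# data after every batch of participants and stop as soon as p < .05
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def study_with_peeking(batch=10, max_n=100, alpha=0.05):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(0, 1, batch))
        b.extend(rng.normal(0, 1, batch))
        if ttest_ind(a, b).pvalue < alpha:
            return True   # "significant" despite a null effect
    return False

n_sims = 2000
false_positive_rate = sum(study_with_peeking() for _ in range(n_sims)) / n_sims
print(f"false positive rate with peeking: {false_positive_rate:.2f}")  # well above .05
```

with enough peeking, "significant" results show up far more often than 5% of the time even when there's no effect at all - and if every study in a set did something like this, a meta-analysis of them can point confidently at an effect that isn't there.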
how do you know whether your results are good-messy or bad-messy? i think there's one easy question you can ask yourself that provides a pretty good guide: could i clear up the messiness with a new high-powered, pre-registered study? if the answer is yes, then your data are probably bad-messy. that is, if you haven't already done a high-powered study or a pre-registered study, or ideally a high-powered pre-registered study, and your results are messy, there's a chance it's because none of your studies were adequately powered or free of p-hacking. doing a high-powered pre-registered study would likely push the effect either into the 'pretty sure it's real' or the 'pretty sure it's not real' column, and there's not much reason to publish the messy result now when you could do one more study and make it a much-less-messy result.
if, on the other hand, the evidence you have is based on one or more high-powered studies, and especially if at least one of those studies was pre-registered, then another high-powered and/or pre-registered study is not likely to provide much more clarity (though of course it might). even with high-powered studies, and even without p-hacking, we're going to see fluctuations in results. those kinds of fluctuations are often just what you'd expect because of noise, type II error, etc. we'll never be able to make them go away completely, and those are the kinds of fluctuations we should tolerate.
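and if the answer to the "could one more study clear this up?" question is yes, the cost of that study is knowable in advance. here's a sketch of the kind of power calculation you'd run first (the effect size and power target are placeholder assumptions, not recommendations):

```python
# illustrative sketch: sample size for a high-powered two-group study,
# assuming a smallish true effect of d = 0.3 and a 90% power target
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.3, power=0.90, alpha=0.05)
print(f"participants needed per group: {n_per_group:.0f}")  # roughly 235
```

swap in your own effect size estimate and power target; the point is just that "high-powered" is a number you can plan for in advance, not a vibe.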
it sucks when editors and reviewers ask you to do impossible, contradictory things. more messiness! more perfection! which is it? in this case, i think it's both. whether messiness is good depends on the type of messiness and the reason for the messiness. but i fear that repeating 'we should tolerate more messiness' is going to lead to a lot of disappointment, because i don't think it means what a lot of people might hope it means.
post-script: you know what else sucks?
two years ago i said that high-powered studies were the solution to everything. now it's high-powered pre-registered studies. what's next, high-powered pre-registered blindfolded underwater studies? perhaps. the goal posts keep moving, because we keep learning more about the problems and the solutions. i'm still
obsessed with power, but also coming to the realization that pre-registration is going to be a big part of the new way we do science. it's a painful realization for someone who uses mostly existing data that i know too well to be able to pre-register anything with. i will work these feelings out in a future blogpost.
** i can take 'em. i was on my high school wrestling team.
*** i'm not saying that if you pre-register, you can't deviate from the pre-registered analysis plan. but if you do, those findings are now exploratory and the meta-analysis is no longer bias-free. and a bias-full meta-analysis is maybe completely useless. ****
**** who wants tequila?