[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.]
do you feel frustrated by all the different opinions about what good science looks like? do you wish there were some concrete guidelines to help you know when to trust your results? well don't despair!
it's true that many of the most hotly debated topics in replicability don't have neat answers. we could go around and around forever. so in these tumultuous times, i like to look for things i can hold on to - things that have mathematical answers.
here's one: what should we expect p-values for real effects to look like? everyone's heard a lot about this,* thanks to p-curve, but every time i run the numbers i find that my intuitions are off. way off. so i decided to write a blog post to try to make these probabilities really sink in.
do these p-values make me look fat?

let's assume you did two studies and got p-values between .02 and .05. should you be skeptical? should other people be skeptical?

many of us have gotten used to thinking of p = .049 as the world's thinnest evidence, or maybe even p = .04. but what i'll argue here is that those intuitions should stretch out way further. at least to p = .02, and maybe even to p = .01 or lower.

let's get one thing out of the way: if you pre-registered your design, predictions, and key analyses, and you are interpreting the p-values from those key analyses (i.e., the confirmatory results), then you're fine. you're doing real Neyman-Pearson NHST (unlike the rest of us), so you can use the p < .05 cutoff and not ask any further questions about that. go eat your cookie and come back when the rest of us are done dealing with our messy results.

now for the rest of us, who don't pre-register our studies or who like to explore beyond the pre-registered key analysis (i.e., who aren't robots**), what should we make of two results with p-values between .02 and .05?

the math is simple. using an online app (thanks Kristoffer Magnusson, Daniel Lakens, & JP de Ruiter!), i calculated the probability of one study producing a p-value between .02 and .05 when there is actually a true effect. it's somewhere between 11% and 15% (i played around with the sample size to simulate studies with power ranging from 50% to 80%). so what's the probability of getting two out of two p-values between .02 and .05? at "best" (i.e., if you're doing underpowered studies), it's 14% x 14% ≈ 2%. with decent power, it's around 1.4% (i.e., 1 in 70). (if you'd rather check these numbers yourself, there's a quick simulation sketch a few paragraphs down.)

CORRECTION (11/24/2017): i was wrong! the table that was here before (and the numbers in the paragraph above) were wrong - i'm not sure what happened. the paragraph above has been updated, and here is the corrected table. the biggest difference is that the probability of getting high p-values at 80% power is higher than i had said. the points made in the rest of the post still hold, as far as i can tell!

if you agree with this math (and there's really not much room for disagreement, is there?), this means that, if you get two out of two p-values between .02 and .05, you should be skeptical of your own results. if you're not skeptical of your own results, you make everyone else look like an asshole when they are skeptical of your results. don't make them look like assholes. be your own asshole.

yes, it's possible that you're in that 2%, but it would be unwise not to entertain the far more likely possibility that you somehow, unknowingly, capitalized on chance.*** and it's even more unreasonable to ask someone else not to think that's a likely explanation.

and that's just two p-values between .02 and .05. if you have even more than two studies with sketchy p-values (or if your p-values are even closer to .05), the odds you're asking us to believe are even smaller. you're basically asking anyone who reads your paper to believe that you won the lottery - you managed to get the thinnest evidence possible for your effect that reaches our field's threshold for significance.

of course none of this means that your effect isn't real if your p-values are sketchy. i'm not saying you should abandon the idea. just don't stop there and be satisfied with this evidence. the evidence is telling you that you really don't know if there's an effect - whether you remember it or not, you likely did a little p-hacking along the way. that's ok, we all p-hack.
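here's the quick simulation sketch. it's not the app i used - it just simulates a bunch of two-sample t-tests on a real effect, using an assumed effect size (d = 0.5) and per-group sample sizes (32 and 64) that i picked to give roughly 50% and 80% power, and counts how often the p-value lands in the .02-.05 window.

```python
# a minimal sketch, not the app from the post: simulate two-sample t-tests on a
# real effect, count how often the p-value lands between .02 and .05, and square
# that to get the chance of two out of two studies landing there. the effect
# size (d = 0.5) and per-group sample sizes (32 and 64, roughly 50% and 80%
# power) are assumptions chosen for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2017)
d = 0.5           # assumed true effect size (Cohen's d)
n_sims = 20_000   # number of simulated studies per power level

for n_per_group, label in [(32, "~50% power"), (64, "~80% power")]:
    p_values = np.empty(n_sims)
    for i in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)
        p_values[i] = stats.ttest_ind(treatment, control).pvalue

    in_window = np.mean((p_values > .02) & (p_values < .05))
    print(f"{label}: P(.02 < p < .05 | true effect) ~ {in_window:.3f}, "
          f"two out of two ~ {in_window ** 2:.3f}")
```

this should land in the same ballpark as the numbers above: roughly 14% (and 2% for two out of two) at ~50% power, and roughly 12% (and 1.4%) at ~80% power.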
don't beat yourself up, just design a new study, and another, and don't stop until the evidence is strong in one direction or the other.****

and i don't just mean the cumulative evidence. yes, you can combine the p = .021 study, the p = .025 study, and the p = .038 study to get a much smaller p-value, but that still doesn't explain how you got three high p-values out of three studies (extremely unlikely). even with the meta-analytic p-value at p < .005, a reader (including you) should still conclude that you probably exploited flexibility in data analysis and that those three results are biased upwards, making the cumulative (meta-analytic) evidence very hard to interpret. (there's a rough sketch of this point in the p.s. at the very end of the post.) so keep collecting data until you get a set of results that is either very likely if there's a true effect (i.e., mostly small p-values) or very likely under the null (i.e., a flat distribution of p-values). or, if you're brave and believe you can design a good, diagnostic study, pre-register and commit to believing the results of the confirmatory test.

if that's too expensive/time-consuming/impossible, then do stop there, write up the results as inconclusive, and be honest that there were researcher degrees of freedom, whether you can identify them or not. maybe even consider not using NHST, since you didn't stick to a pre-registered plan. make the argument that these results (and the underlying data) are important to publish because this small amount of inconclusive evidence is super valuable given how hard the data are to collect. some journals will be sympathetic, and appreciate your honesty.*****

we talk big about how we want to preserve a role for creativity - we don't want to restrict researchers to pre-registered, confirmatory tests. we need space for exploration and hypothesis generation. i wholeheartedly agree; everything i do is exploratory. but the price we have to pay for that freedom and creativity is skepticism. we can't have it both ways. we can't ask for the freedom to explore, and then ask that our p-values be interpreted as if we didn't explore, as if our p-values are pure and innocent.

* brent roberts says it's ok to keep repeating all of the things.

** this is not a dig at robots or pre-registerers. some of my favorite people walk like robots.

*** yes, that's a euphemism for p-hacking.

**** my transformation into a bayesian is going pretty well, thanks for asking. if you drink enough tequila you don't even feel any pain.

***** probably not the ones you were hoping to publish in, tbh.
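p.s. here's the rough sketch of the meta-analytic point i promised. the post doesn't say how the three p-values get combined, so this uses Fisher's method as a stand-in; the point is just that .021, .025, and .038 do combine to something below .005, even though three out of three p-values in that window is itself a very unlikely pattern if there's a true effect (using the roughly 12-14% per-study rate from the simulation earlier).

```python
# a rough sketch, using Fisher's method as a stand-in for whatever meta-analytic
# combination you prefer: the three example p-values combine to a small p, yet
# landing three out of three p-values in the .02-.05 window is itself very
# unlikely under a true effect.
from scipy import stats

p_values = [.021, .025, .038]

stat, combined_p = stats.combine_pvalues(p_values, method="fisher")
print(f"meta-analytic (Fisher) p ~ {combined_p:.4f}")   # comes out below .005

# per-study rate of landing in the .02-.05 window under a true effect,
# taken from the simulation earlier in the post (~12-14%)
for per_study_rate in (0.12, 0.14):
    print(f"P(three of three in .02-.05 | true effect, {per_study_rate:.0%} "
          f"per study) ~ {per_study_rate ** 3:.4f}")
```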