sometimes i read a paper with three studies, and the key results have p-values of .04, .03 and .045. and i feel like a jerk for not believing the results. sometimes i am skeptical even when i see a single p-value of .04.* is that fair?
mickey inzlicht asked a similar question on twitter a few weeks ago, and daniel lakens wrote an excellent blog post in response. i just sat on my ass. this is the highly-non-technical result of all that ass-sitting.
we are used to thinking about the null distribution. and for good reason. the rationale behind Null Hypothesis Significance Testing (NHST) is that we ask how probable an effect at least as extreme as the one we observed would be if it came from the distribution of effects you would expect when the true effect is zero (null). so NHST starts with the assumption that we are living in the world where the real effect is zero, and the expected distribution of results when we do studies looks something like this:**
when we do a study and we get an effect, we want to know whether that effect (e.g., a difference between means, or a correlation) is unlikely to come from this distribution, i.e., if it is far enough in the tail that an effect at least that extreme would turn up rarely under this distribution (less than 5% of the time, to be precise). so the 'goal' is to get an extreme effect.
if we think about it this way, it makes sense that it is hard, or rare, to get an effect in the tail of the distribution, and that, among the results that do make it into the tail, p-values just below .05 should be common.
but of course, we don't believe that our effect actually comes from this distribution. our aim is to reject that null hypothesis. we actually believe that our effect comes from another distribution, one around a non-zero true effect. let's call this distribution the H1 distribution.
if our observed effect really does come from the H1 distribution, we should often observe effects that are far out in the tail of the null distribution, not barely past the p < .05 threshold. if we're actually sampling from H1, we should get lots of effects more extreme than the p < .05 threshold.
this is more true to the extent that:
- the true effect size (around which H1 is centered) is far from zero
- each distribution is narrow (i.e., the study has high precision: a large sample (for between-subjects designs), low error, etc.)
so p-values close to .05 should be rare. p-values closer to 0 should be more common. that’s the basic idea behind the p-curve.
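here is a rough simulation sketch of that idea, not anything taken from the p-curve papers. it assumes two groups of normally distributed scores with a known sd of 1 (so a simple z-test will do), and the effect sizes, sample sizes and function names are all made up for illustration. for each scenario it looks only at the significant results and asks what share are below .01 versus between .04 and .05.

```python
import random
from statistics import NormalDist, fmean

Z = NormalDist()  # standard normal, used to turn z statistics into p-values


def simulate_p_values(d, n, studies=2000, seed=1):
    """Two-sided p-values from simulated two-group studies.

    Assumption: scores are normal with known sd = 1, so a z-test is used.
    d is the true difference between group means, n is the size of each group.
    """
    rng = random.Random(seed)
    se = (2 / n) ** 0.5  # standard error of the difference between two means
    p_values = []
    for _ in range(studies):
        control = [rng.gauss(0, 1) for _ in range(n)]
        treated = [rng.gauss(d, 1) for _ in range(n)]
        z = (fmean(treated) - fmean(control)) / se
        p_values.append(2 * (1 - Z.cdf(abs(z))))
    return p_values


def share_of_significant(p_values, lo, hi):
    """Among results with p < .05, the share with lo <= p < hi."""
    significant = [p for p in p_values if p < 0.05]
    if not significant:
        return float("nan")
    return sum(lo <= p < hi for p in significant) / len(significant)


# hypothetical scenarios: (true effect d, participants per group)
for d, n in [(0.0, 50), (0.3, 50), (0.5, 50), (0.5, 100)]:
    ps = simulate_p_values(d, n)
    below_01 = share_of_significant(ps, 0.0, 0.01)
    near_05 = share_of_significant(ps, 0.04, 0.05)
    print(f"d={d}, n={n}: p<.01 share={below_01:.2f}, .04-.05 share={near_05:.2f}")
```

with d = 0, the significant p-values should be spread roughly evenly, so about a fifth of them land between .04 and .05. as d or n grows, the share below .01 should take over and the share near .05 should shrink.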
how much more common? according to simonsohn, nelson, & simmons (2014), if a study has 46% power (probably pretty realistic, maybe a tad optimistic), the ratio of p-values below .01 to p-values between .04 and .05 should still be 6:1. daniel lakens’s blog post also provides some useful stats about the probabilities of various combinations of p-values. basically, p-values close to .05 should be rare.
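for anyone who wants to poke at that ratio, here is a back-of-the-envelope sketch. it is an assumption-laden approximation, not the calculation from the paper: it treats the test statistic as a z statistic shifted away from zero by just enough to give the stated power, and the function names are invented for this example.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def prob_p_below(alpha, delta):
    """P(two-sided p < alpha) when the z statistic is Normal(delta, 1)."""
    crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(crit - delta)) + Z.cdf(-crit - delta)


def delta_for_power(power, alpha=0.05):
    """Find, by bisection, the shift delta that gives the requested power."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if prob_p_below(alpha, mid) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def ratio_small_to_marginal(power):
    """Ratio of P(p < .01) to P(.04 < p < .05) at a given power."""
    delta = delta_for_power(power)
    small = prob_p_below(0.01, delta)
    marginal = prob_p_below(0.05, delta) - prob_p_below(0.04, delta)
    return small / marginal


# power of .05 is the null (no true effect); .46 is the figure quoted above
for power in (0.05, 0.46, 0.80):
    print(f"power {power:.2f}: ratio {ratio_small_to_marginal(power):.1f}")
```

under this approximation the ratio at 46% power should come out in the neighborhood of the 6:1 figure, and it should be about 1:1 when there is no true effect at all and climb steeply as power goes up.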
so why aren’t they?
maybe we’re really good at power planning! maybe we know exactly what size our effect will be! and we can run exactly enough subjects to get our p-value just below .05!
if you believe that, you haven’t been reading my blog.
why is that unlikely? first, if we knew the exact size of our effect, we wouldn’t need to keep studying it. we don’t know. second, even if you did know the exact size of the true effect, why would you power your study to be only just barely significant? maybe you like to live dangerously, but i prefer to sleep at night.
ok, so hopefully i’ve convinced you that p-values close to .05 should be rare, and that perfect power planning can’t be the reason we see them all the time. so why do we see them all the time?
it is tempting to look for more complicated answers, but i think the answer is simple: p-hacking. and file drawers.
what should we do with this information? first, it’s ok to be skeptical of a paper with a couple of studies whose p-values are all just under .05. it’s even ok to be skeptical of a single study with a p-value just under .05. that doesn’t make you a jerk.***
but more importantly, p-values close to .05 can be a signal to ourselves that we may be unintentionally p-hacking. see mickey's blog post.
ps: i know that if i only became a bayesian, none of this would matter and i would never feel lonely and my dog would never barf in my office during office hours. but by now it should be abundantly clear that my role in this whole thing is to think like someone who does not have advanced stats training. (it turns out that this is pretty easy to do if you do, in fact, lack advanced stats training.) also, the dog barfing in office hours has its pluses.
* sometimes just the letter p sets me off.
** yes, i instagrammed my graph. what's your point?
*** it doesn’t make you not a jerk. only extensive personality testing by a trained expert can tell you that. i offer discounts for friends and family.