sometimes i read a paper with three studies, and the key results have p-values of .04, .03 and .045. and i feel like a jerk for not believing the results. sometimes i am skeptical even when i see a single p-value of .04.* is that fair?
mickey inzlicht asked a similar question on twitter a few weeks ago, and daniel lakens wrote an excellent blog post in response. i just sat on my ass. this is the highly-non-technical result of all that ass-sitting.
we are used to thinking about the null distribution, and for good reason. the rationale behind Null Hypothesis Significance Testing (NHST) is that we ask how likely our observed effect would be if it actually came from the distribution of effects we would expect when the true effect is zero (the null). so NHST starts with the assumption that we are living in a world where the real effect is zero, and the expected distribution of results when we do studies looks something like this:**
when we do a study and we get an effect, we want to know whether that effect (e.g., a difference between means, or a correlation) would be unlikely under this distribution, i.e., whether it is far enough out in the tail that an effect at least that extreme would have a low probability (less than 5%, to be precise) if the null were true. so the 'goal' is to get an extreme effect.
if we think about it this way, it makes sense that it is hard, or rare, to get an effect in the tail of the distribution, and that p-values just below .05 should be common.
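here is a minimal python sketch of that null world (the two-group design, sample size, and number of simulated studies are just illustrative assumptions, not from the post): even in the best case, sneaking past the threshold at all only happens about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 20000, 30   # illustrative assumptions, not from the post

# null world: both groups come from the same population (true effect = 0)
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, n_per_group),
                    rng.normal(0, 1, n_per_group)).pvalue
    for _ in range(n_sims)
])

print("share of studies with p < .05:      ", round(np.mean(pvals < .05), 3))                     # ~ .05
print("share of studies with .04 < p < .05:", round(np.mean((pvals > .04) & (pvals < .05)), 3))   # ~ .01
```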
but of course, we don't believe that our effect actually comes from this distribution. our aim is to reject that null hypothesis. we actually believe that our effect comes from another distribution, one around a non-zero true effect. let's call this distribution the H1 distribution.
if our observed effect really does come from the H1 distribution, we should often observe effects that are far out in the tail of the null distribution, not barely past the p < .05 threshold. if we're actually sampling from H1, we should get lots of effects more extreme than the p < .05 threshold.
this is more true to the extent that:
- the true effect size (around which H1 is centered) is far from zero
- each distribution is narrow (i.e., the study has high power/precision, i.e., has a large sample (for between-subjects designs), low error, etc.)

so p-values close to .05 should be rare. p-values closer to 0 should be more common. that’s the basic idea behind the p-curve.
how much more common? according to simonsohn, nelson, & simmons (2014), if a study has 46% power (probably pretty realistic, maybe a tad optimistic), the ratio of p-values below .01 to p-values between .04 and .05 should still be 6:1. daniel lakens’s blog post also provides some useful stats about the probabilities of various combinations of p-values. basically, p-values close to .05 should be rare.
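to see this in fake data, here is a minimal python sketch (the effect size and sample size are illustrative assumptions chosen to land near the ~46% power mentioned above, not simonsohn et al.'s actual calculation). it simulates a pile of two-group studies where the effect is real and compares how often p lands below .01 versus between .04 and .05; you can check how the printed ratio compares to the 6:1 figure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group, true_d = 20000, 30, 0.5   # assumptions picked to land near ~46% power

# H1 world: the second group really is 0.5 SD higher than the first
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, n_per_group),
                    rng.normal(true_d, 1, n_per_group)).pvalue
    for _ in range(n_sims)
])

power = np.mean(pvals < .05)
tiny = np.mean(pvals < .01)
barely = np.mean((pvals > .04) & (pvals < .05))

print(f"empirical power:       {power:.2f}")
print(f"p < .01:               {tiny:.3f}")
print(f".04 < p < .05:         {barely:.3f}")
print(f"ratio (tiny : barely): {tiny / barely:.1f} : 1")
```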
so why aren’t they?
maybe we’re really good at power planning! maybe we know exactly what size our effect will be! and we can run exactly enough subjects to get our p-value just below .05!
if you believe that, you haven’t been reading my blog.
why is that unlikely? first, if we knew the exact size of our effect, we wouldn’t need to keep studying it. we don’t know. second, even if you did know the exact size of the true effect, why would you power your study to be only just barely significant? maybe you like to live dangerously, but i prefer to sleep at night.
ok, so hopefully i’ve convinced you that p-values close to .05 should be rare, and that perfect power planning can’t be the reason we see them all the time. so why do we see them all the time?
it is tempting to look for more complicated answers, but i think the answer is simple: p-hacking. and file drawers.
what should we do with this information? first, it’s ok to be skeptical of a paper with a couple of studies whose p-values are just below .05. it’s even ok to be skeptical of a single study with a p-value just below .05. that doesn’t make you a jerk.***
but more importantly, p-values close to .05 can be a signal to ourselves that we may be unintentionally p-hacking. see mickey's blog post.
ps: i know that if i only became a bayesian, none of this would matter and i would never feel lonely and my dog would never barf in my office during office hours. but by now it should be abundantly clear that my role in this whole thing is to think like someone who does not have advanced stats training. (it turns out that this is pretty easy to do if you do, in fact, lack advanced stats training.) also, the dog barfing in office hours has its pluses.
* sometimes just the letter p sets me off.
** yes, i instagrammed my graph. what's your point?
*** it doesn’t make you not a jerk. only extensive personality testing by a trained expert can tell you that. i offer discounts for friends and family.
Great post Simine!
I want to quibble with one thing. You wrote: "maybe we’re really good at power planning! maybe we know exactly what size our effect will be! and we can run exactly enough subjects to get our p-value just below .05!"
But I don't think that is true. The result of good power planning (for non-null effects) would be that the vast majority of p-values would be very small, not just below .05. That's why adequately-powered studies of real effects produce right-skewed p-curves, and that's where the 6:1 ratio you refer to earlier comes from.
In fact, even massively underpowered studies will probably produce right-skewed p-curves. I just ran a quick and dirty simulation out of curiosity, and with rho = .3, N = 20, which gives 25% power, you still get a right-skewed p-curve.
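A minimal Python sketch of that kind of quick-and-dirty simulation (not the actual code referred to above; only rho = .3 and N = 20 come from the comment, the rest is assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n, rho = 20000, 20, 0.3   # rho and N from the comment; n_sims is arbitrary
cov = [[1, rho], [rho, 1]]

# simulate correlational studies where the true correlation is .3
pvals = np.array([
    stats.pearsonr(*rng.multivariate_normal([0, 0], cov, size=n).T)[1]
    for _ in range(n_sims)
])

sig = pvals[pvals < .05]
print("empirical power:                   ", round(np.mean(pvals < .05), 2))  # ~ .25
print("share of significant p below .01:  ", round(np.mean(sig < .01), 2))
print("share of significant p in .04-.05: ", round(np.mean(sig > .04), 2))
```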
Hopefully I'm not just being a pedantic nitpicker. I've heard people make the argument elsewhere that left-skewed p-curves could result from good power planning alone, and I don't think it is ever correct.
Posted by: Hardsci | 10 June 2015 at 03:27 AM
I can't find in this piece what the author considers "rare." Scientists work on projects that previous work and intuition suggest should provide action, i.e., projects where there is a propensity for H1. I'm puzzled by this.
Posted by: Don Strong | 10 June 2015 at 05:34 AM
I should say, I'm puzzled by the worry that the frequency of small p values suggests a lack of objectivity.
Posted by: Don Strong | 10 June 2015 at 05:35 AM
Great post! I think that - without mentioning it - you also demonstrated what is wrong with all those posthoc attempts to determine the "questionableness" of published results based on the assumed power.
You can't calculate power after the fact based on the observed effect. Doing so makes the implicit assumption that the researchers did what you call 'perfect power planning' in that they predicted the to-be-observed effect size and knew to just collect n data points to get it just below p=0.05. It doesn't make any sense.
The true effect size may be smaller or it may be larger than what was observed. In the real world, most true effects across a general population are probably *smaller*, because of noise that careful, smaller laboratory studies on biased samples will exclude (I should write a post about this finally...). But there will also be situations where the true effect could be larger. The point is, the observed effect can at best give you a relatively decent estimate of where the true effect probably falls.
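To make that circularity concrete, here is a small Python sketch (an idealized two-sided z-test, not anything from the comment itself): post-hoc "observed power" computed from the observed effect is just a re-expression of the p-value, so a result at p = .05 always comes out at roughly 50% "power", regardless of the study.

```python
from scipy import stats

def observed_power(p, alpha=0.05):
    """Post-hoc 'power' for a two-sided z-test, treating the observed effect
    as if it were the true effect. It depends only on the p-value."""
    z_obs = stats.norm.isf(p / 2)        # |z| implied by the two-sided p-value
    z_crit = stats.norm.isf(alpha / 2)   # significance threshold (~1.96)
    return stats.norm.sf(z_crit - z_obs) + stats.norm.cdf(-z_crit - z_obs)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p:<6} -> 'observed power' = {observed_power(p):.2f}")
# p = .05 always maps to ~.50, p = .01 to ~.73, etc., no matter what the study was
```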
Posted by: Sam Schwarzkopf | 10 June 2015 at 08:38 AM
Three points to add. First is that the main point of the argument is quite good.
Second, the drawings are, let's say, "not to scale." That is, the "dark tails" are supposed to be less than 5% of the area under the curve, but they are drawn much larger. Chalk it up to "rhetorical flourish" with graphics.
Third, I think that the drawing might be mixing up the H1 distribution with the distribution of sample means. That is, the p-value isn't represented by region/area in the H1 distribution, but rather in the (much narrower) distribution of sample means.
Points two and three are based on the notion that, to the extent the argument is based on looking at the graphical representation, the blog post "over-argues" the point. (Which is still good, in my analysis, just without so much interocular power.)
I now await correction from the statistically-better-informed-than-me, whose membership is legion.
Posted by: Chris | 10 June 2015 at 08:41 AM
I believe, correct me if I am wrong, that the author's point here is that too many p-values around 0.05 are indicative of p-value hacking. She does not mean that there are too many small p-values; rather, she is saying there are not enough truly small p-values. See http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 for another perspective.
Posted by: JSanJ | 10 June 2015 at 08:49 AM
Nice post Simine. I would add that p-hacking, and the broader issue of Type I and II errors, only captures one specific source of false findings. The curves assume that the research is valid to begin with, that the constructs and measures are valid. That's not always going to be the case.
For example, if you measure sunshine and call it self-esteem, it doesn't matter if "self-esteem" predicts weight loss at p < .001. P-values don't matter in cases of invalidity. The field gravitates toward numerics, but validity rarely comes with numbers attached. Cultural biases, like those highlighted in the WEIRD paper, are good examples of validity issues.
It would be interesting to find out whether invalid research has a distinctive p-value pattern, maybe clustered well below .05, but that will be hard to establish since "invalid" is complicated and arguable. It would be more tractable if nailed down to specific forms of invalidity.
Posted by: Joe Duarte | 10 June 2015 at 09:41 AM
Hi Simine,
I wrote a related blog post that you might find interesting.
https://replicationindex.wordpress.com/2015/05/27/when-exact-replications-are-too-exact-the-lucky-bounce-test-for-pairs-of-exact-replication-studies/
The main point of Daniel's and my blog posts is that we do not need to know the real power of a study to answer Mickey's question about the probability of observing a p-value between .05 and .025. We can simply state the maximum probability that this event could occur, i.e., the probability at the power level that maximizes it. For the interval between .05 and .025, that optimal level is 56% power. In your figure, the H1 distribution would then be centered just to the right of the significance threshold (not as far to the right as drawn).
The probability of obtaining a p-value between .05 and .025 is small because it is a small slice of the distribution of possible p-values.
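A quick Python sketch of that calculation (an idealized z-test version, not the exact derivation in the linked post) shows where the maximum sits:

```python
import numpy as np
from scipy import stats

# probability that a two-sided z-test lands at .025 < p < .05, as a function of
# power -- an idealized z-test version of the calculation, not the exact one above
z_at_p05 = stats.norm.isf(0.05 / 2)    # |z| giving p = .05  (~1.96)
z_at_p025 = stats.norm.isf(0.025 / 2)  # |z| giving p = .025 (~2.24)

ncp = np.linspace(0, 5, 2001)          # true mean of the z statistic under H1
power = stats.norm.sf(z_at_p05 - ncp) + stats.norm.cdf(-z_at_p05 - ncp)
p_in_band = (stats.norm.cdf(z_at_p025 - ncp) - stats.norm.cdf(z_at_p05 - ncp)
             + stats.norm.cdf(-z_at_p05 - ncp) - stats.norm.cdf(-z_at_p025 - ncp))

best = np.argmax(p_in_band)
print(f"max P(.025 < p < .05) = {p_in_band[best]:.3f}, reached at power ~ {power[best]:.2f}")
```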
To use this as a test, the question is where we should draw the line. So, .049 and .039 are suspicious, but what about .06 and .04? What about .03 and .008?
Best, Dr. R
Posted by: Dr. R | 10 June 2015 at 09:56 PM