[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.]
do you feel frustrated by all the different opinions about what good science looks like? do you wish there were some concrete guidelines to help you know when to trust your results? well don't despair!
it's true that many of the most hotly debated topics in replicability don't have neat answers. we could go around and around forever. so in these tumultuous times, i like to look for things i can hold on to - things that have mathematical answers.
here's one: what should we expect p-values for real effects to look like? everyone's heard a lot about this,* thanks to p-curve, but every time i run the numbers i find that my intuitions are off. way off. so i decided to write a blog post to try to make these probabilities really sink in.
do these p-values make me look fat?

let's assume you did two studies and got p-values between .02 and .05. should you be skeptical? should other people be skeptical?

many of us have gotten used to thinking of p = .049 as the world's thinnest evidence, or maybe even p = .04. but what i'll argue here is that those intuitions should stretch out way further. at least to p = .02, and maybe even to p = .01 or lower.

let's get one thing out of the way: if you pre-registered your design, predictions, and key analyses, and you are interpreting the p-values from those key analyses (i.e., the confirmatory results) then you're fine. you're doing real Neyman-Pearson NHST (unlike the rest of us), so you can use the p < .05 cutoff and not ask any further questions about that. go eat your cookie and come back when the rest of us are done dealing with our messy results.

now for the rest of us, who don't pre-register our studies or who like to explore beyond the pre-registered key analysis (i.e., who aren't robots**), what should we make of two results with p-values between .02 and .05?

the math is simple. using an online app (thanks Kristoffer Magnusson, Daniel Lakens, & JP de Ruiter!) i calculated the probability of one study producing p-values between .02 and .05 when there is actually a true effect. it's somewhere between 11% and 15% (i played around with the sample size to simulate studies with power ranging from 50% to 80%). so what's the probability of getting two out of two p-values between .02 and .05? at "best" (i.e., if you're doing underpowered studies), it's 14% x 14% = 2%. with decent power, it's around 1.4% (i.e., 1 in 70).

CORRECTION (11/24/2017): i was wrong! the table that was here before (and the numbers in the paragraph above) were wrong - i'm not sure what happened. the paragraph above has been updated, and here is the corrected table. the biggest difference is that the probability of getting high p-values at 80% power is higher than i had said. the points made in the rest of the post still hold, as far as i can tell!
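(for anyone who wants to check these numbers without the app, here is a rough sketch in python. it approximates a single study as a two-sided z-test, backs out the noncentrality implied by a given level of power, and computes the chance of the p-value landing between .02 and .05. this is only an illustrative approximation added here, not necessarily the model the app uses, and the function name is made up for the sketch.)

```python
# rough z-test approximation of P(.02 < p < .05) when there is a true effect,
# for a study with a given level of power at alpha = .05
from scipy.stats import norm

def prob_p_between(lo, hi, power, alpha=0.05):
    """P(lo < p < hi) for a two-sided z-test whose power at alpha is `power`."""
    z_alpha = norm.ppf(1 - alpha / 2)       # 1.96 for alpha = .05
    delta = z_alpha + norm.ppf(power)       # noncentrality implied by the power
    z_for_lo = norm.ppf(1 - lo / 2)         # |z| needed to reach p = lo (larger)
    z_for_hi = norm.ppf(1 - hi / 2)         # |z| needed to reach p = hi (smaller)
    # chance that the observed |z| falls between the two cutoffs
    # (the tiny mass in the wrong-signed tail is ignored)
    return norm.cdf(delta - z_for_hi) - norm.cdf(delta - z_for_lo)

for power in (0.50, 0.80):
    one_study = prob_p_between(0.02, 0.05, power)
    print(f"power {power:.0%}: one study = {one_study:.3f}, "
          f"two out of two = {one_study**2:.3f}")
# power 50%: one study ~ 0.143, two out of two ~ 0.020
# power 80%: one study ~ 0.117, two out of two ~ 0.014
```

these numbers line up with the 11-15% single-study range and the roughly 1.4-2% two-study figures in the paragraph above.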
if you agree with this math (and there's really not much room for disagreement, is there?), this means that, if you get two out of two p-values between .02 and .05, you should be skeptical of your own results. if you're not skeptical of your own results, you make everyone else look like an asshole when they are skeptical of your results. don't make them look like assholes. be your own asshole.

yes, it's possible that you're in that 2%, but it would be unwise not to entertain the far more likely possibility that you somehow, unknowingly, capitalized on chance.*** and it's even more unreasonable to ask someone else not to think that's a likely explanation.

and that's just two p-values between .02 and .05. if you have even more than two studies with sketchy p-values (or if your p-values are even closer to .05), the odds you're asking us to believe are even smaller. you're basically asking anyone who reads your paper to believe that you won the lottery - you managed to get the thinnest evidence possible for your effect that reaches our field's threshold for significance.

of course none of this means that your effect isn't real if your p-values are sketchy. i'm not saying you should abandon the idea. just don't stop there and be satisfied with this evidence. the evidence is telling you that you really don't know if there's an effect - whether you remember it or not, you likely did a little p-hacking along the way. that's ok, we all p-hack. don't beat yourself up, just design a new study, and another, and don't stop until the evidence is strong in one direction or the other.****

and i don't just mean the cumulative evidence. yes, you can combine the p = .021 study, the p = .025 study, and the p = .038 study to get a much smaller p-value (there's a quick sketch of this after the footnotes), but that still doesn't explain how you got three high p-values out of three studies (extremely unlikely). even with the meta-analytic p-value at p < .005, a reader (including you) should still conclude that you probably exploited flexibility in data analysis and that those three results are biased upwards, making the cumulative (meta-analytic) evidence very hard to interpret. so keep collecting data until you get a set of results that is either very likely if there's a true effect (i.e., mostly small p-values) or very likely under the null (i.e., a flat distribution of p-values). or, if you're brave and believe you can design a good, diagnostic study, pre-register and commit to believing the results of the confirmatory test.

if that's too expensive/time-consuming/impossible, then do stop there, write up the results as inconclusive and be honest that there were researcher degrees of freedom, whether you can identify them or not. maybe even consider not using NHST, since you didn't stick to a pre-registered plan. make the argument that these results (and the underlying data) are important to publish because this small amount of inconclusive evidence is super valuable given how hard the data are to collect. some journals will be sympathetic, and appreciate your honesty.*****

we talk big about how we want to preserve a role for creativity - we don't want to restrict researchers to pre-registered, confirmatory tests. we need space for exploration and hypothesis generation. i wholeheartedly agree; everything i do is exploratory. but the price we have to pay for that freedom and creativity is skepticism. we can't have it both ways. we can't ask for the freedom to explore, and then ask that our p-values be interpreted as if we didn't explore, as if our p-values are pure and innocent.

* brent roberts says it's ok to keep repeating all of the things.

** this is not a dig at robots or pre-registerers. some of my favorite people walk like robots.

*** yes, that's a euphemism for p-hacking.

**** my transformation into a bayesian is going pretty well, thanks for asking. if you drink enough tequila you don't even feel any pain.

***** probably not the ones you were hoping to publish in, tbh.
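(here is the promised sketch of the meta-analytic combination. the post doesn't say how the three p-values would be combined, so Fisher's method is just one reasonable choice, and the ~12% figure is carried over from the power sketch earlier.)

```python
# combining the three hypothetical p-values from the post with Fisher's method
# (the post doesn't specify a combination method; Fisher's is one common choice)
from scipy.stats import combine_pvalues

stat, meta_p = combine_pvalues([0.021, 0.025, 0.038], method="fisher")
print(f"meta-analytic p (Fisher) = {meta_p:.4f}")  # about 0.0014, i.e. p < .005

# but a small combined p-value doesn't explain the oddity: if a single
# well-powered (80%) study lands in [.02, .05] only ~12% of the time,
# then three out of three doing so is very unlikely when the effect is real
print(f"P(all three in [.02, .05] | true effect, 80% power) ~ {0.117**3:.4f}")  # ~0.0016
```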
Another confounding point in these kinds of analyses: the p values don't come from independent sources (i.e., they come from the same set of participants with alternative analyses done on them, rather than from completely independent studies). That is, simple multiplication of the two p values ignores the dependence of the analyses on each other. Thus, without adjusting for this dependence (perhaps using corrections of degrees of freedom? allowing fractional "successes" in a normal-ish approximation of a binomial distribution?), multiplication likely overestimates both the probability of finding two significant findings due to chance alone AND the probability of finding a multiplicity of non-significant effects.
tl;dr: Assholes are correlated, and I often forget that in my attempts to understand how likely it was to see a certain number of significant p values emerge by chance alone.
Posted by: Stephen Benning | 04 May 2017 at 01:11 AM
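[A toy simulation of the dependence point above, under assumptions invented purely for illustration (two outcomes measured on the same participants, correlated at rho = 0.7, both tested under the null): the naive product of the two marginal probabilities of landing in [.02, .05] does not match the actual joint probability.]

```python
# toy model: two correlated outcomes from the SAME participants, both null,
# each analyzed with its own one-sample t-test against zero
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n, rho = 100_000, 30, 0.7

cov = [[1.0, rho], [rho, 1.0]]
data = rng.multivariate_normal([0.0, 0.0], cov, size=(n_sims, n))  # (sims, n, 2)

# vectorized one-sample t-tests: p has shape (n_sims, 2)
_, p = stats.ttest_1samp(data, 0.0, axis=1)

in_band = (p > 0.02) & (p < 0.05)
marginal = in_band.mean(axis=0)                  # each close to 0.03
joint = (in_band[:, 0] & in_band[:, 1]).mean()

print(f"naive product of marginals: {np.prod(marginal):.4f}")  # ~0.0009
print(f"actual joint probability:   {joint:.4f}")              # larger here
```

[In this toy setup the joint probability comes out larger than the naive product; either way, treating dependent analyses as if they were independent gives the wrong number.]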
You say "there's really not much room for disagreement, is there?" with the math, and I say there is. As for a Neyman-Pearson NHST--that sounds like an oxymoron. If you're serious that you're in the middle of transforming into a Bayesian, then that can be the source of some of the confusion here--the reasoning is radically different. If you're speaking in one language it is hard to translate into the other without feeling yourself a foreigner.
Posted by: Mayo | 04 May 2017 at 04:35 PM
Thank you for your comment, Prof Mayo; I really appreciate hearing your thoughts (I literally just finished reading the chapter on your work in Chalmers's book!). I'd be very curious to hear where the wiggle room is with the math.
As for the "Neyman-Pearson NHST" - I may very well be using the wrong terminology (my understanding of the different approaches to null hypothesis testing comes mostly from Zoltan Dienes's book, though any misunderstanding is surely mine). I'm not at all wedded to the terminology. But as I mentioned, I am very curious to hear how the math could be interpreted (or calculated) differently.
Posted by: Simine Vazire | 05 May 2017 at 11:49 AM
Thanks for sharing this app, and for highlighting the problem of significant p-values that are still a little too big for comfort. I warn my students about this as well - don't trust a paper whose main findings are all supported by p-values just a little bit below 0.05.
I agree with Dr. Mayo that there's room for disagreement on the math, though my reasons may be different from hers. I think the mathematical problem here is how the 15% (or 2.25%) figure is being interpreted. The common misinterpretation of the p-value is that it gives you the probability of the null being true. So a statement like "my results are unlikely to be due to chance, because my p-value is 0.03" would be formally invalid: P(results|Ho is true) isn't the same as P(Ho is true|results). In your example, you go from "there is a 2.25% chance of getting results like these, given the null is false" to "if you get two out of two p-values between .02 and .05, you should be skeptical of your own results". But P(results|Ho is false) isn't the same as P(Ho is false|results), and it seems you're implying that a small value for the former implies a small value for the latter.
The important piece of missing information is: how likely would these results be under a proposed alternate scenario? If the alternate scenario is "Ho is true and the test is pre-registered and no p-hacking at all takes place", then the probability of getting two p-values between 0.02 and 0.05 is 0.03^2 = 0.0009, and so we'd bet that Ho is false, since 0.0225 is large by comparison.
But the alternate scenario that we're rightly concerned about is that the small p-value came from exploiting flexibility in analysis (another euphemism for p-hacking). So the question is: how likely is it to p-hack your way to a p-value between 0.02 and 0.05? And that I don't have an answer for. It depends on just how much flexibility you have, and how unscrupulous you're willing to be. If you really only did "a little bit" of p-hacking, maybe two p-values between 0.02 and 0.05 isn't cause for much skepticism. After all, p-hacking isn't guaranteed to yield significance. If you have data on a small number of variables and you have a well defined hypothesis, there's only so much flexibility there to exploit. If you have 20 variables and you spent the last hour trying every variation on every test you can imagine, then getting only two p-values between 0.02 and 0.05 suggests you aren't very good at p-hacking!
Posted by: Ben Prytherch | 08 May 2017 at 10:57 AM
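[To put a rough number on the question at the end of Ben's comment, here is a toy sketch under one arbitrary definition of p-hacking - testing 10 independent outcomes under the null and reporting the best one - alongside the honest single-test baseline he computes above.]

```python
# how often does "best of k outcomes under the null" land in [.02, .05]?
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_outcomes = 200_000, 10

# under the null, each honestly computed p-value is uniform on [0, 1]
p = rng.uniform(size=(n_sims, n_outcomes))
reported = p.min(axis=1)                  # the p-hacker reports the smallest one

hacked = ((reported > 0.02) & (reported < 0.05)).mean()
print("honest single test: P(.02 < p < .05 | null) = 0.03, two of two = 0.0009")
print(f"best-of-{n_outcomes} hacking: P(.02 < p < .05 | null) = {hacked:.3f}, "
      f"two of two = {hacked**2:.3f}")    # roughly 0.22 and 0.048
```

[Under this crude model, even a modest amount of flexibility makes a p-value in [.02, .05] far more likely than under the honest null, which is roughly the comparison the skepticism in the post trades on.]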