

Comments

Stephen Benning

Another confounding point in these kinds of analyses is that the p values don't come from independent sources (i.e., they come from the same set of participants with alternative analyses run on them, rather than from completely independent studies). Simple multiplication of the two p values therefore ignores the dependence of the analyses on each other. Without adjusting for this dependence (perhaps by correcting the degrees of freedom? or by allowing fractional "successes" in a normal-ish approximation of a binomial distribution?), multiplication likely underestimates both the probability of finding two significant results by chance alone AND the probability of finding a multiplicity of non-significant effects.
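
To make the dependence concrete, here's a rough simulation sketch (all the numbers, including the 0.7 correlation between the two outcome measures, are just assumptions for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n, rho = 20_000, 50, 0.7   # rho: assumed correlation between the two outcomes

    hits_a = hits_b = hits_both = 0
    for _ in range(n_sims):
        # two outcome measures on the SAME n participants, correlated at rho, null true
        x = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
        p_a = stats.ttest_1samp(x[:, 0], 0).pvalue
        p_b = stats.ttest_1samp(x[:, 1], 0).pvalue
        hits_a += p_a < 0.05
        hits_b += p_b < 0.05
        hits_both += (p_a < 0.05) and (p_b < 0.05)

    # independence would predict P(both) = P(A) * P(B) ~= .0025 under the null;
    # with dependent analyses the joint probability comes out several times larger
    print("P(A sig) * P(B sig):", (hits_a / n_sims) * (hits_b / n_sims))
    print("P(both sig)        :", hits_both / n_sims)

The product of the marginal probabilities ends up much too small compared with the simulated joint probability, which is the underestimation I mean.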

tl;dr: Analyses are correlated, and I often forget that in my attempts to understand how likely it was to see a certain number of significant p values emerge by chance alone.

Mayo

You say "there's really not much room for disagreement, is there?" about the math, and I say there is. As for a Neyman-Pearson NHST--that sounds like an oxymoron. If you're serious that you're in the middle of transforming into a Bayesian, then that can be the source of some of the confusion here--the reasoning is radically different. If you're speaking in one language, it is hard to translate into the other without feeling like a foreigner.

Simine Vazire

Thank you for your comment, Prof Mayo - I really appreciate hearing your thoughts (I literally just finished reading the chapter on your work in Chalmers's book!). I'd be very curious to hear where the wiggle room is with the math.
As for the "Neyman-Pearson NHST" - I may very well be using the wrong terminology (my understanding of the different approaches to null hypothesis testing comes mostly from Zoltan Dienes's book, though any misunderstanding is surely mine). I'm not at all wedded to the terminology. But as I mentioned, I am very curious to hear how the math could be interpreted (or calculated) differently.

Ben Prytherch

Thanks for sharing this app, and for highlighting the problem of significant p-values that are still a little too big for comfort. I warn my students about this as well - don't trust a paper whose main findings are all supported by p-values just a little bit below 0.05.
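
As a back-of-the-envelope illustration of why (a two-sided z-test approximation with made-up power levels, nothing more):

    from scipy.stats import norm

    # z cutoffs for two-sided p = .05 and p = .02
    z_05, z_02 = norm.ppf(0.975), norm.ppf(0.99)

    for power in (0.35, 0.50, 0.80):
        mu = norm.ppf(power) + z_05   # noncentrality that gives this power at alpha = .05
        # probability the p value lands in (.02, .05), ignoring the negligible other tail
        p_window = norm.cdf(z_02 - mu) - norm.cdf(z_05 - mu)
        print(f"power {power:.2f}: P(.02 < p < .05) ~ {p_window:.3f}")

Even a real effect only lands in that narrow window roughly 12-15% of the time across these power levels, which is in line with the 15% figure being discussed here.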

I agree with Dr. Mayo that there's room for disagreement on the math, though my reasons may be different from hers. I think the mathematical problem here is how the 15% (or 2.25%) figure is being interpreted. The common misinterpretation of the p-value is that it gives you the probability of the null being true. So a statement like "my results are unlikely to be due to chance, because my p-value is 0.03" would be formally invalid: P(results|Ho is true) isn't the same as P(Ho is true|results). In your example, you go from "there is a 2.25% chance of getting results like these, given the null is false" to "if you get two out of two p-values between .02 and .05, you should be skeptical of your own results". But P(results|Ho is false) isn't the same as P(Ho is false|results), and it seems you're implying that a small value for the former implies a small value for the latter.

The important piece of missing information is: how likely would these results be under a proposed alternate scenario? If the alternate scenario is "Ho is true and the test is pre-registered and no p-hacking at all takes place", then the probability of getting two p-values between 0.02 and 0.05 is 0.03^2 = 0.0009, and so we'd bet that Ho is false, since 0.0225 is large by comparison.
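
To spell out that comparison, here's the Bayes-rule arithmetic, with an assumed 50/50 prior purely for illustration:

    # plugging the two likelihoods into Bayes' rule; the 50/50 prior is an
    # assumption, not something the p-values themselves give you
    p_results_given_h0  = 0.0009   # two p values in (.02, .05) with Ho true, no p-hacking
    p_results_given_alt = 0.0225   # the 2.25% figure, with Ho false

    prior_h0 = 0.5
    posterior_h0 = (p_results_given_h0 * prior_h0) / (
        p_results_given_h0 * prior_h0 + p_results_given_alt * (1 - prior_h0)
    )
    print(posterior_h0)   # ~0.038: P(Ho|results), a different number from P(results|Ho)

The posterior lands wherever the prior and the competing likelihoods put it - which is exactly why P(results|Ho) alone can't be read as P(Ho|results).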

But the alternate scenario that we're rightly concerned about is that the small p-value came from exploiting flexibility in analysis (another euphemism for p-hacking). So the question is: how likely is it to p-hack your way to a p-value between 0.02 and 0.05? And that I don't have an answer for. It depends on just how much flexibility you have, and how unscrupulous you're willing to be. If you really only did "a little bit" of p-hacking, maybe two p-values between 0.02 and 0.05 isn't cause for much skepticism. After all, p-hacking isn't guaranteed to yield significance. If you have data on a small number of variables and you have a well defined hypothesis, there's only so much flexibility there to exploit. If you have 20 variables and you spent the last hour trying every variation on every test you can imagine, then getting only two p-values between 0.02 and 0.05 suggests you aren't very good at p-hacking!
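
For what it's worth, here's a crude simulation of that 20-variable scenario (my own toy setup: 20 independent null outcomes, n = 30, one t-test each, keep only the best p):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, n, k = 10_000, 30, 20   # k null outcome variables per "study"

    best_sig = best_window = 0
    for _ in range(n_sims):
        data = rng.standard_normal((n, k))           # 20 outcomes, all true nulls
        pvals = stats.ttest_1samp(data, 0).pvalue    # one t-test per column
        best = pvals.min()                           # keep only the best-looking result
        best_sig += best < 0.05
        best_window += 0.02 < best < 0.05

    print("P(best p < .05)       :", best_sig / n_sims)     # ~.64: hacking usually "works"
    print("P(best p in (.02,.05)):", best_window / n_sims)  # ~.31: marginal hits are common

With that much flexibility you clear .05 about two times out of three, and when you do, the winning p-value lands below .02 about as often as it lands in the marginal window - so how suspicious two marginal p-values should make you really does depend on how much flexibility was available.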
