[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.]
i've been bitching and moaning for a long time about the low statistical power of psych studies. i've been wrong. our studies would be underpowered if we actually followed the rules of Null Hypothesis Significance Testing (but kept our sample sizes as small as they are). but the way we actually do research, our effective statistical power is very high, much higher than our small sample sizes should allow.
let's start at the beginning.
background (skip this if you know NHST)
Null Hypothesis Significance Testing (over)simplified
in this table (columns: the null is true vs. false; rows: non-significant vs. significant result), power is the probability of ending up in the bottom right cell if we are in the right column (i.e., the probability of rejecting the null hypothesis if the null is false). in Null Hypothesis Significance Testing (NHST), we don't know which column we're in; we only know which row we end up in. if we get a result with p < .05, we are in the bottom row (and we can publish!* yay!). if we get a result with p > .05, we end up in the top row (null result, hard to publish, boo).

within each column, the probabilities of ending up in the two cells (top row, bottom row) add up to 100%. so, when we are in the left column (i.e., when the null is actually true, unbeknownst to us), the probability of getting a false positive (typically assumed to be 5%, if we use p < .05 as our threshold for statistical significance) plus the probability of a correct rejection (95%) add up to 100%. and, when we are in the right column (i.e., when the null is false, also unbeknownst to - but hoped for by - us), the probability of a false negative (ideally at or below 20%) plus the probability of a hit (i.e., statistical power; ideally at or above 80%) add up to 100%.
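to make the table concrete, here's a quick simulation sketch (mine, and purely illustrative: a two-group t-test with d = 0.5 and n = 64 per group, which should give roughly 80% power at alpha = .05):

```python
# simulate the two columns of the NHST table: null true (effect = 0) vs. null false (effect = d).
# the design and numbers here are illustrative assumptions, not anything from the post.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, d, alpha = 5_000, 64, 0.5, .05

def significant(effect):
    """run one two-group study and report whether p < alpha."""
    control = rng.normal(0, 1, n)
    treatment = rng.normal(effect, 1, n)
    return stats.ttest_ind(control, treatment).pvalue < alpha

false_positive_rate = np.mean([significant(0.0) for _ in range(n_sims)])  # left column, bottom cell
power = np.mean([significant(d) for _ in range(n_sims)])                  # right column, bottom cell
print(f"false positive rate (null true): {false_positive_rate:.3f}  # should hover around .05")
print(f"power (null false, d = 0.5):     {power:.3f}  # should hover around .80")
```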
side note: even if the false positive rate actually is 5% when the null is true, it does not follow that only 5% of significant findings are false positives. 5% is the proportion of findings in the left column that are in the bottom left cell. what we really want to know is the proportion of results in the bottom row that are in the bottom left cell (i.e., the proportion of false positives among all significant results). this is the complement of the Positive Predictive Value (PPV; the PPV is the proportion of significant results that are true positives), and it would likely correspond closely to the rate of false positives in the published literature (since the published literature consists almost entirely of significant key findings). but we don't know what it is, and it could be much higher than 5%, even if the false positive rate in the left column really was 5%.**
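a quick worked version of that side note (the 25% base rate below is a number i made up for illustration): if only a quarter of the hypotheses we test are true, then even with 80% power and a real 5% false positive rate, about 16% of significant results are false positives - not 5%.

```python
# illustrative assumptions only: the base rate of true effects, power, and alpha are all made up.
prior_true_effect = 0.25                      # fraction of tested hypotheses where the null is false
power, alpha = 0.80, 0.05

true_pos = prior_true_effect * power          # bottom right cell (hits)
false_pos = (1 - prior_true_effect) * alpha   # bottom left cell (false positives)
fdr = false_pos / (true_pos + false_pos)      # false positives among all significant results
ppv = 1 - fdr                                 # PPV: true positives among all significant results
print(f"share of significant results that are false positives: {fdr:.1%}")  # ~15.8%
print(f"PPV: {ppv:.1%}")                                                    # ~84.2%
```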
back to the main point.
we have small sample sizes in social/personality psychology. small sample sizes often lead to low power, at least with the effect sizes (and between-subjects designs) we're typically dealing with in social and personality psychology. therefore, like many others, i have been beating the drum for larger sample sizes.
not background
our samples are too small, but despite our small samples, we have been operating with very high effective power. because we've been taking shortcuts.
the guidelines about power (and about false positives and false negatives) only apply when we follow the rules of NHST. we do not follow the rules of NHST. following the rules of NHST (and thus being able to interpret p-values the way we would like to interpret them, the way we teach undergrads to interpret them) would require making a prediction and pre-registering a key test of that prediction, and only interpreting the p-value associated with that key test (and treating everything else as preliminary, exploratory findings that need to be followed up on).
since we violate the rules of NHST quite often, by HARKing (Hypothesizing After Results are Known), p-hacking, and not pre-registering, we do not actually have a false positive error rate of 5% when the null is true. that's not new - that's the crux of the replicability crisis. but there's another side of that coin.
the point of p-hacking is to get into the bottom row of the NHST table - we cherry-pick analyses so that we end up with significant results (or we interpret all significant results as robust, even when we should not because we didn't predict them). in other words, we maximize our chances of ending up in the bottom row. this means that, when we're in the left column (i.e., when the null is true), we inflate our chances of getting a false positive to something quite a bit higher than 5%.
but it also means that, when we're in the right column (i.e., when the null hypothesis is false), we increase our chances of a hit well beyond what our sample size should buy us. that is, we increase our power. but it's a bald-faced power grab. we didn't earn that power.
that sounds like a good thing, and it has its perks for sure. for one thing, we end up with far fewer false negatives. indeed, it's one of the main reasons i'm not worried about false negatives. even if we start with 50% power (i.e., if we have 50% chance of a hit when the null is false, if we follow the rules of NHST), and then we bend the rules a bit (give ourselves some wiggle room to adjust our analyses based on what we see in the data), we could easily be operating with 80% effective power (i haven't done the simulations but i'm sure one of you will***).
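here's a rough sketch of what such a simulation could look like (my assumptions throughout: the only "wiggle room" is measuring two outcomes correlated at r = .5 and keeping whichever p-value is smaller, with d = 0.35 and n = 64 per group so the planned test has about 50% power; real p-hacking with more degrees of freedom would presumably inflate things further):

```python
# compare a single planned test to a "take whichever outcome worked" strategy,
# under the null (effect = 0) and under a true effect. all numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n, alpha = 5_000, 64, .05
d = 0.35                                           # the planned test alone has ~50% power at this n
cov = [[1.0, 0.5], [0.5, 1.0]]                     # two outcomes correlated at r = .5

def one_study(effect):
    """return the p-values for the two outcomes in one two-group study."""
    control = rng.multivariate_normal([0, 0], cov, n)
    treatment = rng.multivariate_normal([effect, effect], cov, n)
    return [stats.ttest_ind(control[:, j], treatment[:, j]).pvalue for j in (0, 1)]

def rates(effect):
    planned = hacked = 0
    for _ in range(n_sims):
        p1, p2 = one_study(effect)
        planned += p1 < alpha                      # only the pre-registered outcome counts
        hacked += min(p1, p2) < alpha              # report whichever outcome "worked"
    return planned / n_sims, hacked / n_sims

print("null true  (planned, hacked):", rates(0.0))  # planned ~.05; hacked is inflated
print("null false (planned, hacked):", rates(d))    # planned ~.50; hacked buys extra "effective power"
```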
what's the downside? well, all the false positives. p-hacking is safe as long as our predictions are correct (i.e., as long as the null is false, and we're in the right column). then we're just increasing our power. but if we already know that our predictions are correct, we don't need science. if we aren't putting our theories to a strong test - giving ourselves a serious chance of ending up with a true null effect - then why bother collecting data? why not just decide truth based on the strength of our theory and reasoning?
to be a science, we have to take seriously the possibility that the null is true - that we're wrong. and when we do that, pushing things that would otherwise end up in the top row into the bottom row becomes much riskier. if we can make many null effects look like significant results, our PPV (and rate of false positives in the published literature) gets all out of whack. a significant p-value no longer means much.
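to put a (made-up) number on that: take the same worked example from the side note above, but let p-hacking push the effective false positive rate from 5% to, say, 20% (holding power fixed just to keep the arithmetic simple). now something like 4 in 10 significant results are false positives.

```python
# same arithmetic as the earlier PPV sketch; the .20 "effective alpha" is an illustrative stand-in
# for p-hacking, not an estimate of anything.
prior_true_effect, power = 0.25, 0.80
for effective_alpha in (0.05, 0.20):
    true_pos = prior_true_effect * power
    false_pos = (1 - prior_true_effect) * effective_alpha
    fdr = false_pos / (true_pos + false_pos)
    print(f"effective alpha = {effective_alpha:.2f}: "
          f"{fdr:.0%} of significant results are false positives")   # ~16% vs. ~43%
```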
nevertheless, all of us who have been saying that our studies are underpowered were wrong. or at least we were imprecise. our studies would be underpowered if we were not p-hacking, if we pre-registered,**** and if we only interpreted p-values for planned analyses. but if we're allowed to do what we've always done, our power is actually quite high. and so is our false positive rate.
also
other reasons i'm not that worried about false negatives:
- they typically don't become established fact, as false positives are wont to do, because null results are hard to publish as key findings. if they aren't published, they are unlikely to deter others from pursuing the same question.
- when they are published as side results, they are less likely to become established fact because, well, they're not the key results.
- if they do make it into the literature as established fact, a contradictory (i.e., significant) result would probably be relatively easy to publish because it would be a) counter-intuitive, and b) significant (unlike results contradicting false positives, which may be seen as counter-intuitive but would still be subject to the bias against null results).
in short, while i agree with Fiedler, Kutzner, & Krueger (2012)***** that "The truncation of research on a valid hypothesis is more damaging [...] than the replication of research on a wrong hypothesis", i don't think many lines of research get irreversibly truncated by false negatives. first, because the lab that was testing the valid hypothesis is likely motivated to find a significant result, and has many tools at its disposal to get there (e.g., p-hacking), even if the original p-value is not significant. second, because even if that lab concludes there is no effect, that conclusion is unlikely to spread widely.
so, next time someone tells you your study is underpowered, be flattered. they're assuming you don't want to p-hack or take shortcuts - that you want to earn your power the hard way. no help from the russians.*******
* good luck with that.
** it's not.
*** or, you know, use your github to write up a paper on it with Rmarkdown which you'll put in your jupyter notebook before you make a shinyapp with the figshare connected to the databrary.
**** another reason to love pre-registration: if we all engaged in thorough pre-registration, we could stop all the yapping and get to the bottom of this replicability thing. rigorous pre-registration will force us to face the truth about the existence and magnitude of our effects, whatever it may be. can we reliably get our effects with our typical sample sizes if we remove the p-hacking shortcut? let's stop arguing****** and find out!
***** Fiedler et al. also discuss "theoretical false negatives", which i won't get into here. this post is concerned only with statistical false negatives. in my view, what Fiedler et al. call "theoretical false negatives" are so different from statistical false negatives that they deserve an entirely different label.
****** ok, let's not completely stop arguing - what will the psychMAP moderators do, take up knitting?
******* too soon?
psychMAP moderators, when they're not moderating
i was just made aware of an excellent blog post making a similar point (but with actual numbers) by Erika Salomon. you should go read it:
http://www.erikasalomon.com/2015/06/p-hacking-true-effects/
Posted by: Simine Vazire | 22 December 2016 at 03:55 AM
Hmm, didn't Ioannidis (2005) answer this one? See Box 1:
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. ... the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10^-4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10^-4.
Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. ... In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10^-4. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10^-4, hardly any higher than the probability we had before any of this extensive research was undertaken!
http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124
Posted by: HJ Hornbeck | 25 December 2016 at 06:08 PM
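for anyone who wants to trace the arithmetic in the Ioannidis quote above, here is a quick sketch using the PPV formulas from that paper (as i understand them; R is the pre-study odds and u is the bias term - the ten-teams figure isn't reproduced here):

```python
# plug the Box 1 numbers into the PPV formulas from Ioannidis (2005): R = 1e-4, 60% power, alpha = .05.
R, alpha, beta = 1e-4, 0.05, 0.40                  # power = 1 - beta = 60%
u = 0.10                                           # bias

pre_study = R / (R + 1)                                        # pre-study probability
ppv_no_bias = (1 - beta) * R / (R - beta * R + alpha)          # PPV with no bias
ppv_bias = ((1 - beta) * R + u * beta * R) / (
    R + alpha - beta * R + u - u * alpha + u * beta * R)       # PPV with bias u

print(f"pre-study probability: {pre_study:.1e}")   # ~1.0e-04
print(f"PPV, no bias:          {ppv_no_bias:.1e}") # ~1.2e-03, i.e. about 12x the pre-study probability
print(f"PPV, bias u = .10:     {ppv_bias:.1e}")    # ~4.4e-04
```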