[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.]
i was going to do a blog post on having thick skin but still being open to criticism, and how to balance those two things. then a paper came out, which i’m one of 72 authors on, and which drew a lot of criticism, much of it from people i respect a ton and often agree with (one of them is currently on my facebook profile picture, and one of the smartest people i know). so this is going to be a two-fer blog post. one on my thoughts about arguments against the central claim in the paper, and one on the appropriate thickness for scientists’ skin.
PART I: the substantive argument*
in our paper we argue that we should introduce a new threshold for statistical significance (and for claiming new discoveries), and make it .005.
i want to take a moment to emphasize some other things we said in the paper. we said that .005 should not be a threshold for publication. we also said that we can keep the p < .05 threshold and call results that meet this threshold, but not the .005 threshold, “suggestive” (and not enough to claim a new discovery). so, in some ways, this is not very different from what many people already implicitly (or explicitly?) do when we read papers – treat p-values as a quasi measure of the strength of the evidence, and maintain a healthy amount of skepticism until p-values get pretty low (at least in the absence of pre-registration, more on this below).
here are some arguments against the claim in our paper, and my thoughts on each of these. many of these arguments came out of a long and lively discussion with Ellen Evers and Daniël Lakens (and from Lakens’s blog post). i’m focusing on these arguments because i find them compelling or interesting to think about, so am curious to hear more points in favor of or against any of the points below. this is me playing with the ideas here. i am sure there are flaws in my thinking.
1. if we lower alpha to .005, the ratio of the type I error rate (when the null is true) to the type II error rate (when the null is false) will be way off (1:40).**
thoughts:
that would be true if the type I error rate really were equal to what we say our alpha level is, and the type II error rate really were equal to what our power analyses tell us beta is. i don’t believe either of those is the case. i believe our type I error rate is actually much higher than our field’s nominal alpha level. and the type II error rate is much lower than beta as calculated from power analyses (see my argument for this claim here – basically, p-hacking and QRPs help us make things significant, and that increases type I error, which we know, but also decreases type II error, which we rarely talk about). so, under the current state of affairs in our field, lowering the nominal alpha level to .005 may actually bring the ratio of type I to type II error rates closer to the 1:4 that Cohen and others suggested than it is right now.
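to make this "cuts both ways" point concrete, here is a minimal simulation sketch, written by me for illustration and not anything from our paper. the QRP it simulates is testing three correlated DVs and reporting whichever one looks best; the effect size, sample size, and correlation between DVs are all made-up values.

```python
# a minimal sketch (illustration values, not from the paper): one simple QRP,
# testing three correlated DVs and reporting whichever looks best, inflates the
# type I error rate above the nominal alpha when the null is true, and shrinks
# the type II error rate below the nominal beta when the null is false.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_cell, n_sims, r_dvs = 30, 5000, 0.5
cov = np.full((3, 3), r_dvs) + np.eye(3) * (1 - r_dvs)  # 3 DVs correlated at .5

def rejection_rate(true_d, alpha, hack):
    hits = 0
    for _ in range(n_sims):
        control = rng.multivariate_normal(np.zeros(3), cov, size=n_per_cell)
        treated = rng.multivariate_normal(np.full(3, true_d), cov, size=n_per_cell)
        pvals = [stats.ttest_ind(treated[:, j], control[:, j]).pvalue for j in range(3)]
        p = min(pvals) if hack else pvals[0]  # the "hacker" reports the best-looking DV
        hits += p < alpha
    return hits / n_sims

for hack in (False, True):
    print(f"hacking = {hack}: "
          f"type I error rate = {rejection_rate(0.0, 0.05, hack):.3f}, "
          f"type II error rate = {1 - rejection_rate(0.4, 0.05, hack):.3f}")
```

with these illustration values, the hacked type I error rate should come out well above the nominal .05, and the hacked type II error rate should come out well below the honest one, which is all i mean by the claim above.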
of course this would change if practices change a lot. if we consistently pre-register our studies and follow our pre-registration, and only interpret p-values for the planned analyses, then our using a .005 alpha level would indeed lead to a 1:40 ratio of type I error rates when the null is true to type II error rates when the null is false. and i agree that this is too small a ratio in many circumstances. so, i’ll revisit my opinion on what threshold we should use as a default for drawing meaningful conclusions when our practices have changed drastically. (i still use p-values when i’m not doing confirmatory research, so probably the first step is for me to quit doing that.*** one reason i haven’t is that i suspect it would make it very hard for my grad students to publish their work in outlets that would help them get jobs. life is hard. principles ain’t cheap.)
for the same reason, i would be in favor of exempting the results of key, pre-registered analyses from this lower alpha. i don't know exactly what the right alpha should be, but when the rules of Neyman-Pearson hypothesis testing have been followed (a pre-registered study with the key analysis specified in advance), i'm less worried about an out-of-control type I error rate.
2. we should abandon thresholds, or NHST, altogether.
thoughts:
on thresholds: i have a hard time separating my feelings on the pragmatism of this position (very low) from my answer to “in a perfect world, would this be the right approach?” i don’t know. i definitely think thresholds cause a lot of problems. but it’s hard for me to imagine humans abandoning them. as an editor, i find thresholds useful because otherwise editorial decisions would feel even more arbitrary to authors. i am sympathetic to authors who feel that the criteria on which their submissions will be evaluated are not transparent enough, or are impossibly vague. i know thresholds aren’t the only solution to this problem (and are far, far from sufficient), but i am not imaginative enough to come up with something that would adequately replace them.
also, if we want to encourage less black-and-white thinking, aren’t two thresholds (one for “suggestive” results and one for “statistically significant” results) better than one? isn’t this a (baby) step towards thinking in gray?
but i get it. maybe to accept that thresholds are here to stay is to give away the farm. instead of spending effort tweaking a broken system and trying to make it a little less broken, maybe i should spend more time trying to change the system (see also #4 below).
on abandoning NHST or frequentist approaches: i have to admit, much of this debate is over my head. i have a lot more to learn about the philosophy and math behind Bayesian approaches.
3. requiring results to meet a p < .005 threshold before they’re considered strong evidence is going to make it prohibitively difficult to conduct studies.
thoughts:
i have to admit, i’m a bit surprised by this argument. it seems like one of the few things we all agree on is that of course a single study, especially one with a high p-value, is far from conclusive, and of course we need mounds of evidence before drawing strong conclusions. this kind of view is often expressed, but rarely heeded in discussion sections of papers, or in press releases. in my view, introducing a new threshold is a way to try to enforce this skepticism that we all say we want more of. (particularly because we’re not saying that results shouldn’t be published until they meet this threshold, just that they should be considered nothing more than suggestive). why not make a rule that holds us to this standard we espouse? if we can still consider results below .05 suggestive, we can still claim everything we’ve been saying we should claim, but not more.
even with a much lower threshold, a single study can't be conclusive. but at least it can have a chance of providing much stronger evidence. and the costs of powering studies to this new threshold are not exorbitant, in my opinion (i know there is strong disagreement on this; it's hard to know what should count as a ridiculous sample size expectation; six years ago many, many people considered n = 20 per cell ridiculously high). for a correlation of .20, you need 191 people to have 80% power at the .05 threshold, and 324 people to have 80% power at the .005 threshold. i don't know about you, but that's a way smaller price than i expected to have to pay for dividing our alpha level by ten. moreover, if you decide to only power your study to the .05 threshold, you should still get a p-value below .005 about 50% of the time. so if you're running a series of studies and consistently missing the .005 threshold, something's wrong.****
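if you want to check those power numbers yourself without G*Power, here is a rough sketch using the Fisher-z approximation for the power of a two-sided test of a correlation (an approximation i'm choosing for simplicity, so expect small differences from the exact numbers above):

```python
# a rough check (Fisher-z approximation, so results differ slightly from
# G*Power's exact calculation) of the power to detect r = .20 at the two
# sample sizes mentioned above, for alpha = .05 and alpha = .005.
import numpy as np
from scipy import stats

def corr_power(r, n, alpha):
    z_effect = np.arctanh(r) * np.sqrt(n - 3)   # expected value of the test statistic
    z_crit = stats.norm.ppf(1 - alpha / 2)      # two-sided critical value
    return stats.norm.sf(z_crit - z_effect)     # ignores the negligible other tail

for n in (191, 324):
    print(f"n = {n}: power at .05 = {corr_power(0.20, n, 0.05):.2f}, "
          f"power at .005 = {corr_power(0.20, n, 0.005):.2f}")
```

the n = 191 row also illustrates the 50% point: a study powered at roughly 80% for the .05 threshold has close to a coin flip's chance of landing below .005 when the effect is real.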
what does make sense to me about this argument is that lowering the threshold to .005 leads to a more conservative science. yes. absolutely. where i think i disagree with the spirit of the argument is the implication that our past/current standards are reasonable or balanced (i.e., not particularly conservative or liberal), and that a drastic shift towards greater conservatism would therefore be draconian. in my view, our past (i'll stay agnostic about current) standards were incredibly liberal – i am pretty convinced that it was relatively easy to get a significant result, without realizing it, even when the effect didn't exist. i know this is controversial, but i think it's important to be clear about where the disagreement is. i want our standards to become more conservative, but not because i think we should have a very conservative type I error rate or false discovery rate (FDR). i probably want about the same type I error rate or FDR as people who oppose the reforms i'm for. the difference is that i think our current standards put us way, way above that mark, and so we need to shift to a more conservative standard just to hit the target we thought we'd been aiming for all along.
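to put a toy number on what i mean by "way, way above that mark": suppose, purely for illustration (these are my made-up values, not estimates from anywhere), that only 10% of the effects we test are real and that power is 80%. then the expected false discovery rate at each threshold works out like this:

```python
# toy false discovery rate calculation.  the prior (10% of tested effects are
# real) and the power (80%, held fixed across thresholds for simplicity) are
# made-up illustration values.
def false_discovery_rate(alpha, power, prior_true):
    false_pos = alpha * (1 - prior_true)   # true nulls that cross the threshold
    true_pos = power * prior_true          # real effects that cross the threshold
    return false_pos / (false_pos + true_pos)

for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: expected FDR = {false_discovery_rate(alpha, 0.8, 0.1):.2f}")
```

with those assumptions, roughly a third of "discoveries" at .05 would be false, versus about one in twenty at .005, and that is before any p-hacking. different priors give different numbers, but the general point is that the nominal alpha is not the false discovery rate.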
4. the real problem is that p-values, as currently used, are not meaningful, and we should fix that problem rather than hold p-values to a higher standard.
thoughts:
this is the most compelling argument i've heard so far. i have tried to push for changes that would bring us closer to being able to interpret p-values in the way that (as i understand it) the Neyman-Pearson approach to NHST describes. that is, more pre-registration, and more transparency (e.g., if you say you predicted something, show us the pre-registration where you wrote down your prediction and how you would test it; and if you didn't pre-register, refrain from claiming that you did, or from interpreting statistics as if you did).
if i could choose between introducing a new threshold for significance and getting people (including myself) to follow Neyman-Pearson rules (or not use p-values when we aren’t following those rules – also ok), i would choose the latter. and in a world where we were careful to only interpret p-values when Neyman-Pearson rules were followed, i might advocate lowering alpha, but probably not all the way to .005.
so this is me admitting that by recommending a new threshold, i am caving a bit. my faith in a future in which we achieve transparency and incentivize truly confirmatory research sometimes wavers, and i want a more immediate way to address what i perceive to be a pretty serious problem. this may be a bad compromise to make. i’m still trying to figure that out.
these criticisms, and others, have definitely made me reconsider the recommendations in our paper. i haven't yet come around to the view that they're horribly misguided – i still don't see much harm in them – but i'm considering the possibility that there are better ways to achieve the same goals, or that our efforts are better spent elsewhere. right now, i still feel that introducing a new threshold will actually help the other goals – encourage more transparency and more pre-registration. this is based mostly on my perception that it will be easier to get a result below that threshold by engaging in these practices than by p-hacking (because p-hacking down to .005 is pretty hard, from what i understand). so, if we assume that people respond to incentives, or are drawn towards the least effortful path, adding this hurdle could make suboptimal research practices less attractive, by making them less effective. if p-hacking isn't going to save you, then the costs of pre-registering are lower (i.e., tying your hands is less costly when the loopholes you're closing by pre-registering were unlikely to work anyway). but i acknowledge that, for me, the end goal is to improve research design and practices, and the change to statistical interpretation is, in large part, a means to that goal. that, and my playing around with G*Power, leads me to the conclusion that, when an effect is really there, and you've done a halfway decent job of designing your study, you should get a p-value below .005 quite often, so this threshold isn't as scary as it might seem.
PART II: the process
i talk big about embracing criticism, fostering skepticism, etc. i also agree with many people who’ve expressed the view that you have to have a healthy layer of thick skin to be in science. these two views seem at odds to me, and i’ve been thinking about how to reconcile them.
i now have a bit more direct experience on which to base my reflection. it's still going to sound platitudinous. here’s what i’ve come up with:
-look hard for kernels of logic/reasoning or empirical evidence. sometimes it’s garnished with a joke at your expense, or accompanied by a dose of snark, but often there’s at least a nugget of real argument in there. the criticism may still be wrong, but if it takes the form of a good faith effort at reasoning or empiricism, it’s probably worth entertaining.
-talk it out. find people who you know are willing to tell you things you don’t want to hear. ask them what they think about the criticisms. ask them which ones they think you should take most seriously, and if there are any you can dismiss as not in good faith. if possible, talk directly with the people who are criticizing you. (my thinking benefited tremendously from talking to Lakens and Evers).
-take your time. you’ll notice when your emotional reaction starts to subside and you can laugh a bit at your own initial defensiveness. don’t decide what to do about the criticism until you’ve reached that point. if you don’t reach that point, it’s not because you were never defensive.
-if, after all that, there is some criticism that clearly seems outside the realm of reasoning/empiricism, or is not in good faith, this is where it’s time to suit up and put on the thick skin. let it go. go play with your cat, or watch a funny video, or bake a cake.*****
some people have a harder time with the first three steps (being permeable), some people have a harder time with the last step (being impermeable). it's not easy to know when – and be able – to go back and forth. but developing those skills is pretty important in science (life?). also – i don’t think anyone is very good at doing these things alone. find people who help you be more open to criticism, and people who help you be more resilient. they might be the same people (those are pretty amazing people, keep them), or they might not (those people are ok, too).
* i speak only for myself, and not for my co-authors or anyone else.
** what should the ratio be? i have no idea, and there’s clearly not just one answer, but apparently our field likes 1:4 a lot (alpha = .05, beta = .20).
*** are p-values useful to report even when you’re doing exploratory research? i have some thoughts on this (inspired by Daniël Lakens, again), but that’s for another blog post.
**** this raises the question of whether every study in a set needs to meet the .005 threshold for the set to count as strong evidence. here the answer is clearly no – we can identify what the distribution of p-values across multiple studies should look like if your effect is real, and it's not 100% significant p-values (because of sampling error), regardless of the threshold. but if you have some ugly p-values, you can only confidently chalk up that messiness to sampling error if your readers can rule out other sources of messiness, like flexibility in data analysis or selective reporting – that is, if everything is pre-registered and all studies are reported. also a blog post for another day.
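here is the back-of-the-envelope version, assuming (purely for illustration) four studies and 80% power per study:

```python
# back-of-the-envelope illustration (made-up values: four studies, 80% power
# per study): even when the effect is real, you should not expect every study
# in a multi-study set to clear the threshold.
from scipy import stats

power, n_studies = 0.80, 4
print(f"P(all {n_studies} significant) = {power ** n_studies:.2f}")
for k in range(n_studies + 1):
    print(f"P({k} of {n_studies} significant) = {stats.binom.pmf(k, n_studies, power):.2f}")
```

even at 80% power, all four studies come out significant only about 40% of the time, so a perfectly clean set of p-values should not be the expectation.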
***** emotion suppression gets such a bad rap.
Comments
You don't need to go full Bayesian, but you can't escape from Bayes. The prior probability has a huge effect on the false positive rate. The great problem is that you never have a good numerical value for the prior probability. I think that the best way to get round this dilemma is to calculate the prior probability that would be needed in order to give you an acceptable false positive rate.
For example, if you observe P = 0.05 and you want to limit the false positive rate to 5%, you would have to assume that you were 87% sure that there was a real effect before the experiment was done. That's clearly preposterously high.
For details, see http://www.biorxiv.org/content/early/2017/07/13/144337
Posted by: David Colquhoun | 25 July 2017 at 03:34 AM
"show us the pre-registration"
Yes! Thank you so much for adding this crucial (in my opinion) aspect of pre-registration!!
If you don't let the reader see this information, it is not really pre-registered at all as far as i am concerned.
If we don't draw the line at making pre-registration information available to the reader of a paper, i fear we will be going down a slippery slope.
Please don't let "confirmatory findings" become the new QRP which replaces the old QRP "we hypothesized".
Posted by: Anonymous | 25 July 2017 at 04:18 AM