dear editors,
i love journals. i love editors. i love editors of journals. that's why i want to help. we need more quality control in our journals, and you are the ones who can do it.
in july 2012, in a comment to a blog post, chris fraley wrote 'What we might need, in other words, is a formal “consumer reports” for our leading journals.' i was so excited by this idea that i wrote to him and told him it was the best actionable idea that has come out of the replicability discussion.* fast forward 27 months, and our paper**, 'N-Pact Factor: Evaluating the quality of empirical journals with respect to sample size and statistical power', is out.
you can read the actual paper here, and you can read chris's blog post about it here.
the N-Pact Factor (NF) has many flaws. it is not a perfect measure of quality. but it is better than nothing (or better than the Impact Factor alone). if you want to improve on the NF, please do. we hope you will. i would love to see a world in which there are many indices of journal quality, all tapping into various facets of quality. in my ideal world, i could look up not only the average sample size of studies published in a journal, but also how often they publish studies with multiple methods, field studies, non-student samples, longitudinal designs, actual behavior, replications, open data, preregistration, author disclosures, etc. but if i can know only one thing about the studies published in a journal, it would be their sample size.
so this is a call to editors to please please please pay attention to sample size when evaluating a paper.
why am i so obsessed with sample size? in short, because it is shorthand for 'amount of information'. and ultimately, that's what we're in the business of producing: information/knowledge. larger samples lead to more accurate conclusions. they reduce type II error, and indirectly reduce the proportion of published findings that are likely to be type I errors (read the paper for the full explanation). put simply, large samples mean fewer mistakes.*** (also, for a great summary of the problem with small sample sizes, see my favorite spsp talk ever.)
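(if you'd rather see the 'fewer mistakes' point in action than take my word for it, here is a minimal simulation sketch in python. the true correlation of r = .20 and the bivariate-normal setup are my assumptions for illustration, not analyses from the paper.)

```python
# a minimal simulation sketch (illustration only, not from the paper):
# how much does an observed correlation bounce around at n = 36 vs n = 200,
# assuming a true correlation of r = .20 (roughly the field average)?
import numpy as np

rng = np.random.default_rng(200)  # seed picked in the spirit of this post
true_r = 0.20
cov = [[1.0, true_r], [true_r, 1.0]]  # bivariate normal, unit variances

for n in (36, 200):
    rs = np.empty(10_000)
    for i in range(rs.size):
        # draw n participants and compute the sample correlation
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        rs[i] = np.corrcoef(x, y)[0, 1]
    print(f"n = {n:3d}: sd of observed r = {rs.std():.3f}, "
          f"samples with the wrong sign = {100 * (rs < 0).mean():.1f}%")
```

at n = 36, something like one in ten samples comes out in the wrong direction; at n = 200, that almost never happens. that's the 'amount of information' difference in a nutshell.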
below i address some potential objections to this obsession with sample size, and make some controversial statements. i speak only for myself here.
how much is enough? one problem with saying 'we need bigger samples' is that it's hard to know how much is enough. i agree this is a problem. it has an easy solution: 200.
let's start with 200. it might turn out not to be enough****. but there are some good reasons to start there. first of all, a sample size of 200 gives you 80% power to detect an effect size of r = .20 (d = .40), which is about the average published effect size in social and personality psychology*****. also, this. and sanjay's 2013 arp talk. (you had to be there******).
what about power analyses? you don't need power analyses. if there is an effect, it will probably be somewhere in the ballpark of all the other effects in social and personality psych. so just go with 200 (or 250. or 300.). my view on power analyses is that we can do one for the entire field, and i just did it, and it told me we need 200 people (or 100 per condition). there, we're done with power analyses. (yes, there are exceptions, blah blah blah.)
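(here is that whole-field power analysis as a quick sketch in python, using statsmodels' power module. the two-condition between-subjects design, d = .40, alpha = .05, and the 80% power target are the assumptions baked in.)

```python
# the one power analysis for the whole field, sketched with statsmodels.
# assumes a two-condition between-subjects design, the roughly average
# published effect size of d = .40, alpha = .05, and 80% power.
from statsmodels.stats.power import TTestIndPower

ttest_power = TTestIndPower()

# how many people per condition do we need?
n_per_condition = ttest_power.solve_power(effect_size=0.40, alpha=0.05,
                                          power=0.80)
print(f"n per condition: {n_per_condition:.0f}")  # ~99, i.e. ~200 total

# for contrast: the power of a typical small study with 20 per condition
small_power = ttest_power.power(effect_size=0.40, nobs1=20, alpha=0.05)
print(f"power with 20 per condition: {small_power:.2f}")  # ~.23
```

it spits out about 99 per condition. round up to 100, run two conditions, and you're at 200. done.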
but i can't afford to run 100 participants per condition. then don't do the study.
but seriously, if you are doing a typical social/personality study, that is, you are running college student participants in a lab and they are doing some stuff on computers and maybe talking a little bit and eating some chocolate, you can run 100 people per condition. if you can't, i'm sorry. i feel bad that you can't. but that doesn't mean we should publish your paper.
if you are not doing a typical social/personality study, that is, if you have non-college-student participants*******, or you are using intensive or expensive methods, or studying something rare, etc., then we should absolutely take that into consideration and be flexible about sample size. each editor needs to weigh the value of the evidence, which includes how important it is and how hard it would be to collect more data.
my recommendation for editors: if a study has fewer than 200 people, the authors should give a convincing reason why it should be published anyway. there are many potentially good reasons, and editors need to use their judgment all the time anyway. make this part of your evaluation.
but this will take forever! yup. there will be fewer papers to read. and review. and for the media to sensationalize. what a terrible, terrible loss.
in my view, increasing sample sizes kills two birds with one stone. (actually i think it kills about eighteen birds. in the good sense.) it improves the quality of our published findings, and it addresses the looming crisis of The Dearth of Reviewers********.
in conclusion. every time i see a paper with a simple design and 36 participants, i die a little inside. the point of the NF is to encourage editors (and reviewers, and ultimately authors) to place more value on sample size. don't tolerate small samples, unless there is a good reason to. let's push ourselves and each other to do the hard work of doubling or tripling our samples. it is painful. it sucks. it's not fun. but we can't go on like this. and i think in the end we may actually come to enjoy living in a world where there are only 228 new papers to read each month. all of us can help make this happen, but you, editors, can do the most.
xoxo,
simine
* i also wrote 'it wouldn't be hard to do, right?' i crack myself up.
** not literally 'we'. thank you yuchen and ashley.
*** sample sizes do not decrease all kinds of mistakes (i.e., systematic error is not reduced), but they reduce random error and they don't increase other kinds of error.
**** i hope to one day look as foolish as simmons, nelson, and simonsohn now look for their n = 20 recommendation in 2011.
***** note that this means that half of all published effects are smaller than this. which means you should probably assume the effect you are studying is smaller than this. which means 200 people is probably not enough. i am starting to feel foolish already.
****** happily, you can attend the 2015 arp conference. i am almost positive sanjay will be there.
******* i don't mean mturk. if you are using plain old mturkers, you better not have fewer than 200 people or i'll get really mad.
******** soon to be a major motion picture in which editors go insane and start stalking people who submit 42 manuscripts a year and turn down all review requests. starring mindy kaling.
When talking to experimental psychologists, I find that they are almost universally receptive to the idea of having 5x fewer psychologists and 5x more resources per study. Perhaps they all imagine that they, personally, would all make the cut (cf. Kruger & Dunning, 1999).
Anyway, fewer articles and bigger samples means fewer PIs. Just sayin'.
Posted by: Nick Brown | 09 October 2014 at 05:52 AM
Bit worried that your N>200 suggestion ignores the experimental design and the predicted effect. That would be an absurd N for, e.g., some visual search study where everything is within-subjects and few individual differences are expected.
Posted by: Luke | 09 October 2014 at 07:29 AM
Is this an example of Simpson's Paradox? A skeptical reader might say "personality psychologists have, as a major goal, parameter estimation. As a result, to meet those goals, they must use larger Ns. Social psychologists, on the other hand, do not usually care about careful parameter estimation--they have other theoretical fish to fry, and as a result, do not seek so assiduously to generate a narrow confidence interval."
From the article: "This was most clearly illustrated in considering the three sections of JPSP. The overall NF for the Attitudes and Social Cognition section was 79, whereas the overall NF for the Individual Differences and Personality Processes section was 122. This was the case even though previous meta-analyses indicate that the typical effect sizes examined in these subdisciplines are comparable."
This is *exactly* what one would expect if parameter estimation is quite often a major goal of personality research, but not very often a major goal of social psychology research.
Posted by: Chris Crandall | 09 October 2014 at 09:17 AM
thanks for the comments!
Nick - i'm not sure i follow. why would there have to be fewer PIs? there is still just as much research being conducted (e.g., the same total number of subjects run); it just gets published in fewer, better papers. standards for hiring and promotion would have to change, but i'm not sure why replacing every set of six small studies with two big studies would mean there would be fewer PIs.
Luke - absolutely. i tried very hard to avoid all nuance (hence the 'blah blah blah'). within-subjects designs are exempt. and i'm sure there are other exceptions as well. and of course hard and fast rules are always ridiculous. the reason for my overly blunt approach is that although i think hard and fast rules are a problem, an even bigger problem is getting so wrapped up in the nuances and exceptions that you get paralyzed and decide not to do anything. in social/personality psych, a sample size of 200 is very often reasonable and/or necessary. more nuanced rules would be better in many ways, but i think that using 200 as a default and asking authors to justify smaller samples is a good start. and i hate to see the perfect be the enemy of the good.
Chris - yes, i think the point you make is likely one of the reasons that personality studies often have larger Ns. and perhaps we need larger Ns. however, putting personality psychology aside (we're still far from where we need to be anyway), the Ns in social psych are still way too small even if you don't care about parameter estimation. if you're concerned about type II error and about false positives (through file drawers and other QRPs), these sample sizes are too small even for just determining the direction or existence of an effect.
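(to put a rough number on it: here is a sketch in python of the 95% confidence interval you'd get for an observed r = .20 at the two JPSP section NFs from the paper, using the standard Fisher r-to-z approximation. the observed r = .20 is my assumption for illustration.)

```python
# a rough sketch (my illustration, not an analysis from the paper):
# the 95% CI for an observed r = .20 at n = 79 and n = 122, the two
# JPSP section NFs, via the Fisher r-to-z approximation.
import numpy as np

def ci_for_r(r, n, z_crit=1.96):
    """95% confidence interval for a correlation via Fisher's r-to-z."""
    z = np.arctanh(r)            # Fisher z of the observed r
    se = 1.0 / np.sqrt(n - 3)    # standard error of z
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

for n in (79, 122):
    lo, hi = ci_for_r(0.20, n)
    print(f"n = {n:3d}: 95% CI [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
```

at n = 79 the interval runs from about -.02 to .40 -- it includes zero, so you can't even nail down whether the effect exists, never mind how big it is.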
Posted by: simine | 09 October 2014 at 11:21 AM
Simine, I agree, if all we were doing was taking articles where the PI currently decides to run six underpowered studies and replacing those with two studies with higher N, then "statistically significant" results might also mean something, and PI employment levels are unaffected. But think of all the single-study, N=60 articles that would no longer be possible because the budget, or number of grad students who can be co-opted to conduct the experiments, doesn't extend to N=200.
Maybe three colleagues can amalgamate their resources and run a single study instead of three, and then all be "co-PIs" on a hybrid study that doesn't really represent any of their true interests, but I don't see that being very popular. In my extensive experience of working in bureaucracies of various sizes, I have found that giving up status is about the last thing that people are prepared to do, even if the alternative is giving up some of the resources that are needed to do the job properly. (Why that should be, even among very intelligent people, would probably make an interesting study.)
A further problem is that if fewer articles are being published (hooray!) for us to wade through each month looking for the power problems, it means that fewer "scientific findings" are being made (especially if we continue to not publish and/or ignore null results). I'm not sure if the whole self-sustaining industry of funding, researchers, journals, press releases, media, and institutional prestige is ready for the implications of that. I suspect that the number of studies being conducted has a near-lawful relationship with the number of people with an economic interest in those studies taking place, such that a reduction in one implies a near-proportional reduction in the other. As someone with no dog in this fight (I'm a retired non-academic), I have no problem either describing that or thinking about its implications, but I realise that that probably doesn't apply to most people who are likely to be involved in this discussion.
Posted by: Nick Brown | 10 October 2014 at 12:34 AM
Nick – molecular genetics does pretty well with forming consortia. And if you cannot find two colleagues to agree on a suggested study design, maybe the study shouldn't be done?
All that aside, almost everybody I've encountered in science so far has access to __too much__ data and too little time. So maybe running fewer, better studies would let more of the collected information leave the file-drawer.
Posted by: Rubenarslan | 10 October 2014 at 01:29 AM
Hi Simine, this is a great post, and a really cool paper from you and Chris on the N-Pact Factor. But I'm really curious--how do you foresee (realistically or ideally) the N-Pact factor paper being used? I can think of several possibilities, but certainly there are more.
- Editors implement official sample size policy changes at the journals you coded, or other journals in psychology.
- Reviewers or editors cite the paper to justify rejecting a paper with a small sample, but otherwise sound theory and methods.
- Young eager graduate students (and faculty of course) feel a bit more empowered to question sample size-related decisions when attending talks or informally discussing research at conferences.
- Hiring committees use the paper, and subsequent NF data, to help determine the quality of a job candidate's publication record (e.g., lots of pubs at high NF journals, vs. lots of pubs at low NF journals). This would perhaps be the most drastic use.
I'd love to hear your thoughts!
Posted by: Aaron Weidman | 11 October 2014 at 02:23 AM
"*** sample sizes do not decrease all kinds of mistakes (i.e., systematic error is not reduced), but they reduce random error and they don't increase other kinds of error."
well. consider p. 28 in
Ramscar, Michael, et al. "The myth of cognitive decline: Non-linear dynamics of lifelong learning." Topics in Cognitive Science 6.1 (2014): 5-42.
http://psych.stanford.edu/~michael/papers/Ramscaretal_age.pdf
where a higher sample size was associated with a clear trend in subjects' performance. i generally agree with you, but bear in mind that there *is* a difference between screening 20 subjects and 200, which shouldn't just be ignored as it can have an effect.
Posted by: Zerschmetterling | 23 March 2015 at 02:53 AM