i learned a new word the other day. bucketing. it almost made me cry.
one of the most common mistakes i see when reviewing papers is authors who take a continuous variable and, for no good reason, mutilate it by turning it into a categorical variable. our old friend the median split is one example. (whose idea was it to befriend the median split? and why won’t he stop harassing us?)
bucketing, from what i can tell, is another such technique. i had a hard time finding a definition, but i think it’s basically creating categories out of multiple response options and grouping the data that way. for example, you can turn the continuous variable ‘age’ into a categorical variable by categorizing people into age ‘buckets’ (e.g., 20-29, 30-39, etc.).
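(to make this concrete, here is a minimal sketch in python, with made-up ages; the bins and labels are just the hypothetical buckets from my example above, not anyone’s actual coding scheme.)

```python
import pandas as pd

# made-up ages; the bins are the decade 'buckets' from the example above
ages = pd.Series([23, 29, 30, 35, 41, 58])
buckets = pd.cut(ages, bins=[20, 30, 40, 50, 60],
                 labels=["20-29", "30-39", "40-49", "50-59"], right=False)
print(buckets)
# note: the 29 year old and the 30 year old land in different buckets,
# while the 30 year old and the 39 year old land in the same one
```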
why is bucketing so sad? first, turning continuous variables into categorical variables is almost always a bad idea because you lose information. unless you have good reason to believe that all variance within each bucket is error (e.g., you know that the difference between someone at the bottom end vs. the top end of the response options in a given bucket is not at all meaningful), you are losing good variance (i.e., valuable information) by lumping them all together. age is a particularly curious variable to ‘bucket’ because there is usually very little error in people’s self-reports of their age,* so there is no reason not to take their reported age at face value. relatedly, turning continuous variables into categorical variables leads to treating two data points that are actually very close together (e.g., a 29 year old and a 30 year old) as if they were quite different (essentially treating the 29 year old as a 25 year old, and the 30 year old as a 35 year old).
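if you want to see the information loss for yourself, here is a little simulation of my own (purely illustrative, nobody’s real data) comparing a continuous predictor to its bucketed and median-split versions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# simulate ages and an outcome that depends on age linearly
age = rng.uniform(20, 60, size=n)
outcome = 0.05 * age + rng.normal(0, 1, size=n)

# replace each age with the midpoint of its decade bucket (25, 35, 45, 55)
bucketed = (age // 10) * 10 + 5

print(np.corrcoef(age, outcome)[0, 1])       # correlation using the real ages
print(np.corrcoef(bucketed, outcome)[0, 1])  # attenuated after bucketing

# our old friend the median split throws away even more
split = (age > np.median(age)).astype(float)
print(np.corrcoef(split, outcome)[0, 1])     # noticeably weaker still
```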
another distressing thing about bucketing is that it provides an almost irresistible opportunity to p-hack. because there are many ways to break up a continuous variable into multiple categories, a researcher can try any number of combinations to see which one ‘works best’. before you know it you will have buckets coming out of your ears. and most of the time the reader cannot know how many ways of bucketing were tried, and what the other results looked like. (though if the researcher makes her data publicly available, the reader can find out). other methods for converting continuous variables into categorical ones have the same problem (e.g., you can try splitting your sample into two groups, three groups, four groups, etc.).
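here is a toy demonstration (mine, not from any real paper) of how shopping around for a cut point can manufacture a ‘significant’ result out of pure noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200

# pure noise: age and the outcome are unrelated by construction
age = rng.integers(20, 70, size=n)
outcome = rng.normal(size=n)

# try every cut point from 25 to 64 and keep the 'best' (smallest) p-value
best_p, best_cut = min(
    (stats.ttest_ind(outcome[age < cut], outcome[age >= cut]).pvalue, cut)
    for cut in range(25, 65)
)
print(best_p, best_cut)  # dips below .05 surprisingly often, despite no real effect
```

the honest analysis, of course, is a single test on the original continuous variable, decided on before peeking at the data.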
i can see why bucketing is appealing. it's a way to reduce complex data into neater bundles, and may be more intuitive (e.g., you can talk about the category of ‘extreme responders’ instead of talking about the degree of extremity of responses). this is similar to the appeal of personality types, another concept that will not die.** sharp cutoffs are exciting. but often they are an illusion.
there are cases where turning a continuous variable into a categorical variable makes sense. for example, it is sometimes much easier to visually display results in a categorical fashion (e.g., when plotting an interaction effect in regression). however, it is almost never justifiable not to also report the results of the analyses with the original, continuous variable. if you don’t know how to do that analysis, bribe a stats friend (mine like espresso and cocktails. and very small dogs). and if you are going to chop up your continuous variable and place it into buckets, remember that the more different ways of bucketing you try, the more likely you are to capitalize on chance. and if you run a replication, you should commit to using the same buckets when analyzing the second sample. but most of the time, there is just no good reason to bucket.
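(for what it’s worth, the ‘analyze it continuously, display it categorically’ routine looks something like this sketch; the variable names and numbers are made up for illustration.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 300

x = rng.normal(size=n)                        # continuous predictor
m = rng.normal(size=n)                        # continuous moderator
y = 0.4 * x + 0.3 * x * m + rng.normal(size=n)

# analyze with the ORIGINAL continuous variables: predictor, moderator, interaction
X = sm.add_constant(np.column_stack([x, m, x * m]))
fit = sm.OLS(y, X).fit()
print(fit.summary())

# for the PLOT only: predicted values at +/- 1 SD of predictor and moderator
for m_val in (-1.0, 1.0):
    for x_val in (-1.0, 1.0):
        pred = fit.params @ [1.0, x_val, m_val, x_val * m_val]
        print(f"m = {m_val:+.0f} SD, x = {x_val:+.0f} SD -> predicted y = {pred:.2f}")
```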
i am aware that none of this is original; we all learned this in intro stats. but i still see it so often (most recently in a PNAS article) that i think it bears repeating. and what are blogs for if not repeating things we already know?
* i am making this bold claim with no empirical evidence whatsoever. luckily my bold claims have very little error.
** have you noticed that over the course of reading this blog, you have learned little tidbits about personality? you're welcome.
I loved the bears repeating photo!!
Posted by: Etienne LeBel | 07 April 2014 at 07:58 AM
In my experience, this is really an issue of statistical training. Personality psychologists and social psychologists -- especially from top programs -- tend to learn a great deal about why one should avoid median splits, why one should use latent variables, why one should examine familywise error rates and so on. But they don't get a great deal of training on how to measure multicollinearity, how to deal with endogeneity, and so on. In economics, the inverse seems to be true. So training is idiosyncratic to the discipline.
In addition, sociology and public health seem to train people to think of age as a categorical variable. Students who learn this then become professors in the discipline, and thus this idea becomes institutionalized. Most sociology and public health journals in 2014 will accept articles in which age is treated as a categorical variable, simply because that's now an institutionalized standard and it's not even considered to be problematic. And I don't think they're entirely blameworthy because this is a case where they don't know what they don't know.
(Incidentally, I once emailed the first author of 'Looking Deathworthy' to find out why she used a median split in her analysis. I never got a response. Not sure how that median split got past Psych Science reviewers. Anyway the article has now been cited over 250 times.)
Posted by: Commenter | 07 April 2014 at 08:25 AM
A recent dietary study fell prey to this very problem: http://stuartbuck.blogspot.com/2014/03/meat-smoking.html
Posted by: Stuart Buck | 08 April 2014 at 01:16 AM
As a PhD student training under a personality psychologist, I was warned about the perils of median splits and "bucketing." However, I recently spoke with someone from a marketing firm. It turns out that they use age as a categorical variable. Hearing this, at first I was shocked, but the rationale is simple. If you are trying to predict college enrollment, you probably don't expect a linear relationship with age. Using 0-17, 18-22, 23-27 works a lot better. I am not saying that most (or even half) of the cases in which this technique is used are justified this well, but sometimes, it may be the best strategy.
Posted by: Dave | 09 April 2014 at 12:57 PM
I personally find developmentalists (psychologists or otherwise) to be the worst offenders here, arguing that such "buckets" represent "qualitatively different age groups." But that is just my experience.
Posted by: Ryne Sherman | 12 April 2014 at 06:30 AM