i learned a new word the other day. bucketing. it almost made me cry.
one of the most common mistakes i see when reviewing papers is authors who take a continuous variable and, for no good reason, mutilate it by turning it into a categorical variable. our old friend the median split is one example. (whose idea was it to befriend the median split? and why won’t he stop harassing us?)
bucketing, from what i can tell, is another such technique. i had a hard time finding a definition, but i think it’s basically creating categories out of multiple response options and grouping the data that way. for example, you can turn the continuous variable ‘age’ into a categorical variable by categorizing people into age ‘buckets’ (e.g., 20-29, 30-39, etc.).
why is bucketing so sad? first, turning continuous variables into categorical variables is almost always a bad idea because you lose information. unless you have good reason to believe that all variance within each bucket is error (e.g., you know that the difference between someone at the bottom end vs. the top end of the response options in a given bucket is not at all meaningful), you are losing good variance (i.e., valuable information) by lumping them all together. age is a particularly curious variable to ‘bucket’ because there is usually very little error in people’s self-reports of their age,* so there is no reason not to take their reported age at face value. relatedly, turning continuous variables into categorical variables leads to treating two data points that are actually very close together (e.g., a 29 year old and a 30 year old) as if they were quite different (essentially treating the 29 year old as a 25 year old, and the 30 year old as a 35 year old).
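to make the information loss concrete, here is a minimal python sketch. everything in it is made up for illustration: the simulated ages, the slope of 0.05, the noise level, and the helper names `pearson` and `bucket_midpoint` are my assumptions, not anyone's real data or analysis. it compares the correlation between age and an outcome when age is left continuous, bucketed into decades, or median split.

```python
import random
import statistics


def pearson(xs, ys):
    """plain pearson correlation, written out so nothing beyond the stdlib is needed."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def bucket_midpoint(age):
    """replace an age with the midpoint of its decade bucket (20-29 -> 25, 30-39 -> 35, ...)."""
    return 10 * (age // 10) + 5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000

# simulated data (an assumption): the outcome depends linearly on exact age, plus noise
age = [rng.uniform(20, 60) for _ in range(n)]
outcome = [0.05 * a + rng.gauss(0, 1) for a in age]

# two ways of throwing information away
bucketed = [bucket_midpoint(a) for a in age]
median_age = statistics.median(age)
median_split = [1.0 if a > median_age else 0.0 for a in age]

r_cont = pearson(age, outcome)
r_bucket = pearson(bucketed, outcome)
r_split = pearson(median_split, outcome)

print(f"continuous age:  r = {r_cont:.3f}")
print(f"decade buckets:  r = {r_bucket:.3f}")
print(f"median split:    r = {r_split:.3f}")

# the 29 vs. 30 problem: one year apart, but a full bucket apart after bucketing
print(bucket_midpoint(29), bucket_midpoint(30))  # 25 35
```

in this setup the continuous version should give the largest correlation and the median split the smallest, because each coarsening discards real within-bucket variance. the exact numbers depend on the invented parameters, so treat them as a demonstration and not an estimate of how much you would lose in real data.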
another distressing thing about bucketing is that it provides an almost irresistible opportunity to p-hack. because there are many ways to break up a continuous variable into multiple categories, a researcher can try any number of combinations to see which one ‘works best’. before you know it you will have buckets coming out of your ears. and most of the time the reader cannot know how many ways of bucketing were tried, and what the other results looked like (though if the researcher makes her data publicly available, the reader can find out). other methods for converting continuous variables into categorical ones have the same problem (e.g., you can try splitting your sample into two groups, three groups, four groups, etc.).
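here is a rough python sketch of how that plays out, under assumptions i am making up for the occasion: age has no effect on the outcome at all, the researcher considers five hypothetical cutpoints (30, 35, 40, 45, 50), and significance is judged with a normal-approximation two-group test (a stand-in for a t-test, so that only the standard library is needed). the function names `two_group_p` and `simulate` are mine.

```python
import random
from statistics import NormalDist, fmean, variance


def two_group_p(xs, ys, cut):
    """two-sided p-value (normal approximation) comparing mean y for x < cut vs. x >= cut."""
    low = [y for x, y in zip(xs, ys) if x < cut]
    high = [y for x, y in zip(xs, ys) if x >= cut]
    se = (variance(low) / len(low) + variance(high) / len(high)) ** 0.5
    z = (fmean(high) - fmean(low)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_studies=1000, n=200, seed=1):
    """return (false-positive rate with one pre-chosen cut, rate when picking the best cut)."""
    rng = random.Random(seed)
    cuts = [30, 35, 40, 45, 50]  # hypothetical candidate cutpoints
    hits_one = 0
    hits_any = 0
    for _ in range(n_studies):
        age = [rng.uniform(20, 60) for _ in range(n)]
        outcome = [rng.gauss(0, 1) for _ in range(n)]  # null world: age does nothing
        ps = [two_group_p(age, outcome, c) for c in cuts]
        hits_one += int(ps[2] < 0.05)    # committed in advance to splitting at 40
        hits_any += int(min(ps) < 0.05)  # report whichever split 'works best'
    return hits_one / n_studies, hits_any / n_studies


rate_one, rate_any = simulate()
print(f"one pre-chosen split: {rate_one:.3f} of null studies significant")
print(f"best of five splits:  {rate_any:.3f} of null studies significant")
```

the pre-chosen split should land near the nominal 5%, while the pick-the-best strategy can only do as badly or worse, since the smallest of five p-values is never larger than any one of them. how much worse depends on how many cutpoints are tried and how correlated the resulting splits are, which is exactly what the reader usually cannot see.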
i can see why bucketing is appealing. it's a way to reduce complex data into neater bundles, and may be more intuitive (e.g., you can talk about the category of ‘extreme responders’ instead of talking about the degree of extremity of responses). this is similar to the appeal of personality types, another concept that will not die.** sharp cutoffs are exciting. but often they are an illusion.
there are cases where turning a continuous variable into a categorical variable makes sense. for example, it is sometimes much easier to visually display results in a categorical fashion (e.g., when plotting an interaction effect in regression). however, it is almost never justifiable to leave out the results of the analyses with the original, continuous variable. if you don’t know how to do that analysis, bribe a stats friend (mine like espresso and cocktails. and very small dogs). and if you are going to chop up your continuous variable and place it into buckets, remember that the more ways of bucketing you try, the more likely you are to capitalize on chance. and if you run a replication, you should commit to using the same buckets when analyzing the second sample. but most of the time, there is just no good reason to bucket.
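a small python sketch of that division of labor, again with invented data and names (`ols_slope`, the slope of 0.05, and the decade buckets are all my assumptions): the inference is done on continuous age, and the buckets are fixed in advance and used only to summarize the data for display.

```python
import random
from statistics import fmean


def ols_slope(xs, ys):
    """least-squares slope and intercept for a single continuous predictor."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


rng = random.Random(2)

# simulated data (an assumption): outcome rises by 0.05 per year of age, plus noise
age = [rng.uniform(20, 60) for _ in range(2000)]
outcome = [0.05 * a + rng.gauss(0, 1) for a in age]

# the analysis you report: continuous age, no buckets
slope, intercept = ols_slope(age, outcome)
print(f"slope per year of age: {slope:.3f}")

# the buckets, chosen before looking at the results and used for display only
bucket_means = {}
for start in (20, 30, 40, 50):
    in_bucket = [y for a, y in zip(age, outcome) if start <= a < start + 10]
    bucket_means[f"{start}-{start + 9}"] = fmean(in_bucket)

for label, mean_y in bucket_means.items():
    print(f"ages {label}: mean outcome = {mean_y:.2f}")
```

for a replication, the same tuple of bucket starts would be reused unchanged on the second sample instead of being re-tuned to it.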
i am aware that none of this is original; we all learned this in intro stats. but i still see it so often (most recently in a PNAS article) that i think it bears repeating. and what are blogs for if not repeating things we already know?
* i am making this bold claim with no empirical evidence whatsoever. luckily my bold claims have very little error.
** have you noticed that over the course of reading this blog, you have learned little tidbits about personality? you're welcome.