Excuses for Data Peeking
Guest Post by Don Moore and Liz Tenney
Good research practice says you pre-specify your sample size and you wait until the data are in before you analyze them. Is it ever okay to peek early? This question stimulated one of the more interesting discussions that we (Liz Tenney and Don Moore) had in our lab group at Berkeley. Here’s the way the debate unfolded.
Don staked out what he thought was the methodological high ground and argued that we shouldn’t peek. He pointed out the perils of peeking:
If the early results look encouraging, you might be tempted to stop early and declare victory. We all know why this is a mortal sin: choosing your sample size conditional on obtaining the hypothesized result inflates the chance of a false positive. But if you don't stop and the effect weakens in the full sample, you will be haunted by the counterfactual that you might have stopped. The part of you that considered using only part of the sample won't be able to help wondering about some hidden moderator. Maybe the students were more stressed and distracted as the end of the semester approached? Maybe the waning sunlight affected participants' circadian rhythms? So peeking tempts you to sin, and it risks leaving you full of regret.
Liz agreed about these risks but pointed out that if you peek at the data very early, with n = 10 or n = 20 per cell, you could get useful information from the peek without being tempted to stop the study there with such a small sample, because, thank goodness, bigger sample sizes are becoming the norm.* If you peek and the results look good directionally, you should keep going and run the full, planned sample.
Alternatively, it is possible that the data fool you. If the results look bad when you peek, but your hypothesis is actually true, it might be vindicated in the full sample. This is the tragedy of a false negative. If you give up on the hypothesis too early, you’ll never know if it might have been supported because you didn’t give it a fighting chance.
But then Liz made an excellent point: If your study is going to fail, then it makes sense to abort it. Let’s say you have the hypothesis that Utahns eat more kale than do Californians. So you’re rooting for the Utahns as you start your big study with a planned sample size of 1000, composed of 500 Utahns and 500 Californians. If you find a difference going in the opposite direction or if you find no difference (a null result), the results aren’t that interesting (because they wouldn’t be surprising to anyone) and aren’t going to be publishable.**
If you peek at the data early and see that the Utahns are eating less kale (i.e., a result in the opposite direction from that hypothesized), wouldn’t it be smart to scrap the study? After all, how likely is it that the Utahns wind up eating more kale in the full sample if they start out with a trend in the opposite direction after collecting 10 or 20 in each cell? This question formed the crux of our debate. The answer depends, of course, on whether your hypothesis is true and how big the effect size actually is.
The figure below*** shows the probability of obtaining a nominal difference in means between conditions that goes in the OPPOSITE direction from the true population difference, given effects of various sizes (from d = .1 to 1) and samples of different sizes (from 5 to 50 per cell).
The probability of obtaining results that go in the opposite direction of the true population difference falls to zero as your sample size goes up and the effect size gets larger. If the true effect size is d = .1 and you peek after collecting 10 per cell, the probability of being fooled is around 40%. That seemed unacceptably high to Don, so he thought he had made his point: don't peek, because you probably shouldn't abort the study early even if you see a trend in the wrong direction. But Liz looked at the same set of numbers and saw that if d = .5 and you peek after collecting 20 per cell, the probability of being fooled is only around 5%. She saw that as acceptably low, and she thought she had made her point: peek, because if the results go in the opposite direction, you can abort early. Liz claimed that with an expected medium effect of d = .5 in a two-cell design, powered at 80% with alpha = .05, two-tailed, you would only need to collect about 30% of your sample (40 participants out of 128) before you had a pretty good idea of whether you should jump ship.**** Even with a smaller d = .3, under the same study parameters, you'd only need to collect 23% of your total sample (80 participants out of 352) to have less than a 10% chance of being fooled.
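For readers who want to check those numbers, here is a minimal sketch in Python (our own normal-approximation shortcut, not the height-based simulation behind the figure; the function name is just for illustration):

```python
# Probability that the observed difference in means points the "wrong" way
# when the true standardized effect is d and there are n participants per cell.
# Assumes two independent groups with unit SD, so the standardized difference
# in sample means is approximately Normal(d, sqrt(2/n)).
import numpy as np
from scipy.stats import norm

def p_wrong_direction(d, n_per_cell):
    """P(observed mean difference < 0 | true standardized effect = d)."""
    se = np.sqrt(2.0 / n_per_cell)   # standard error of the standardized difference
    return norm.cdf(-d / se)

for d in (0.1, 0.3, 0.5):
    for n in (10, 20, 40):
        print(f"d = {d}, n per cell = {n}: P(fooled) = {p_wrong_direction(d, n):.2f}")

# d = 0.1 with 10 per cell gives about 0.41 (the "around 40%" above);
# d = 0.5 with 20 per cell gives about 0.06 (the "around 5%" above);
# d = 0.3 with 40 per cell (80 participants total) gives about 0.09.
```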
But the truth, as both Liz and Don acknowledge, is actually a bit more complicated. How bad is it to waste the resources on collecting the full sample, only to have to throw it all away at the end? If the study is expensive to run, the hypothesis isn't all that exciting to begin with,***** and the opportunity costs are high,****** then aborting early and giving up wouldn't be so terrible. To pick another example, if in your clinical trial your participants are getting sicker instead of getting better, you should end the study early. If in your MTurk sample your participants are reporting slightly lower values on a Likert scale, and the cost to continue is low, you might want to see the study through.
However, there are good reasons you might want to stick with it. If there are other studies with more scientific promise that you could be investing your resources in, you would probably be working on them first anyway. And truth be told, many of us find ourselves doing research out on the thin ice at the edge of our scientific understanding where the effect sizes are small. A Cohen’s d of .3 or smaller is probably pretty common for the effects we’re looking at. If those are both true then the risk of peeking is greater.
Liz and Don both declared victory.
*Unless through your peeking you discover you made a computer coding error, or you have worded things in a confusing way, or you catch some other major issue. Catching mistakes is a benefit of early peeking.
**Only publishing results that support your hypothesis is a bigger, also important issue, but one we won’t tackle here. A premise of our debate was that in this particular case, the results would only see the light of day if they came out a certain way.
***Thanks to Ryan Goh for helping us crunch the numbers that went into this graph. The simulations were based on male and female height means and standard deviations obtained from the US Census Bureau website.
****There is a danger here that a researcher would be tempted to abandon the study only partly—by tweaking the materials and then starting data collection over. We would caution that this approach could increase the chance of a false positive. By jumping ship, we really mean giving up on the study completely.
*****Even a positive result wouldn’t be good enough to stand a chance at getting published in the Global Journal of Psychology and Counseling, no matter how high a publication fee you paid.
******You have so many good ideas. Some of the other ones might actually get published in journals that people read.
I'm going to try to beat Daniel Lakens and the Bayesians to the punch and point out that there are ways to build data peeking into your study design, and do analyses that are not biased by it.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2333729
Posted by: Hardsci | 03 June 2015 at 03:37 AM
Good points, although I was surprised that you didn't address the fact that this "optional abortion" strategy will still lead to an inflated error rate for the set of studies that you do end up seeing through to completion. Maybe this is understood?
I came up with the following intuition. Imagine that we have a set of 100 studies where we have a directional hypothesis that the mean is positive, but in fact the null is true (mean=0) for all 100 studies. 5 of these 100 studies would, if we completed them, result in erroneous rejection of the null (i.e., they will end up having big positive means). Now we peek at the data early on and optionally abort studies with observed means that are negative -- if the null is always true, that's half of the studies. The problem is that those 5 error studies are more likely to end up in the non-aborted half of studies than in the aborted half of studies.
I just did a little simulation of studies that have final N=100; each with one optional abortion point at N=5, or N=10, or ..., or N=50; with standard normal data; and using a one-sided test. The error rate is about 7% for studies where we peeked after N=5 but decided to keep going, and it increases up to about 10% for studies where we peeked after N=50. Of course I just picked these parameters out of a hat; the point is that it exceeds 5% in general (if we don't do any corrections).
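A minimal sketch of that kind of simulation in Python (variable names are illustrative, and exact rates will wobble a bit from run to run):

```python
# Under a true null, studies that survive an early "abort if the mean looks
# negative" peek reject the null at the end more than 5% of the time.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
final_n, n_sims, alpha = 100, 20000, 0.05

for peek_n in (5, 10, 25, 50):
    kept = errors = 0
    for _ in range(n_sims):
        data = rng.standard_normal(final_n)   # null is true: population mean = 0
        if data[:peek_n].mean() < 0:          # peek early; abort if trend is negative
            continue
        kept += 1
        res = ttest_1samp(data, 0.0)
        if res.statistic > 0 and res.pvalue / 2 < alpha:  # one-sided test, H1: mean > 0
            errors += 1
    print(f"peek at N={peek_n}: error rate among completed studies = {errors / kept:.3f}")

# Roughly 0.07 when peeking at N=5, creeping toward 0.10 when peeking at N=50.
```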
Of course, there is no divine dictate that the error rate must be 5%. It might be fine to accept a higher error rate for some of the good reasons that you mention in this post. But should we not at least acknowledge that it is above the nominal alpha level of our test?
Posted by: CookieSci | 03 June 2015 at 04:52 AM
You cannot "beat Daniel Lakens and the Bayesians to the punch" because they assumed it in their priors.
Posted by: Chris | 03 June 2015 at 08:21 AM
Hi,
A recent blog post on dealing with non-significant results after looking at the data seems relevant.
http://rolfzwaan.blogspot.ca/2015_05_01_archive.html
I would also like to add that most research practices are okay if you honestly report them. So, if you feel confident about your methods, just say that you peeked X times and stopped when criterion Y was reached.
Posted by: Dr. R | 03 June 2015 at 09:11 AM