
Comments

Hardsci

Great post, Simine!

I want to quibble with one thing. You wrote: "maybe we’re really good at power planning! maybe we know exactly what size our effect will be! and we can run exactly enough subjects to get our p-value just below .05!"

But I don't think that is true. The result of good power planning (for non-null effects) would be that the vast majority of p-values would be very small, not just below .05. That's why adequately-powered studies of real effects produce right-skewed p-curves, and that's where the 6:1 ratio you refer to earlier comes from.

In fact, even massively underpowered studies will probably produce right-skewed p-curves. I just ran a quick and dirty simulation out of curiosity, and with rho = .3, N = 20, which gives 25% power, you still get a right-skewed p-curve.
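
A minimal sketch of that kind of simulation (Python with NumPy/SciPy; the parameters match the description above, but the script itself is illustrative rather than the original):

```python
# Quick-and-dirty p-curve simulation: true rho = .3, N = 20 (~25% power).
# Even at this low power, significant p-values below .025 should outnumber
# those between .025 and .05, i.e. the p-curve is right-skewed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rho, n, n_sims = 0.3, 20, 20_000
cov = [[1.0, rho], [rho, 1.0]]

sig_p = []
for _ in range(n_sims):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    _, p = stats.pearsonr(x, y)        # two-tailed p-value
    if p < 0.05:
        sig_p.append(p)

sig_p = np.array(sig_p)
print(f"power            ~ {len(sig_p) / n_sims:.2f}")
print(f"share p < .025   : {np.mean(sig_p < 0.025):.2f}")
print(f"share .025 - .05 : {np.mean(sig_p >= 0.025):.2f}")
```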

Hopefully I'm not just being a pedantic nitpicker. I've heard people make the argument elsewhere that left-skewed p-curves could result from good power planning alone, and I don't think it is ever correct.

Don Strong

I can't find in this piece what the author considers "rare." Scientists work on projects that previous work and intuition suggest should pan out, that is, where there is a propensity for H1. I'm puzzled by this.

Don Strong

I should say, I'm puzzled by the worry that the frequency of small p-values suggests a lack of objectivity.

Sam Schwarzkopf

Great post! I think that, without mentioning it, you also demonstrated what is wrong with all those post hoc attempts to determine the "questionableness" of published results based on the assumed power.

You can't calculate power after the fact based on the observed effect. Doing so makes the implicit assumption that the researchers did what you call 'perfect power planning' in that they predicted the to-be-observed effect size and knew to collect just enough data points to get it just below p = 0.05. It doesn't make any sense.

The true effect size may be smaller or it may be larger than what was observed. In the real world, most true effects across a general population are probably *smaller*, because of noise that some careful, smaller laboratory studies on biased samples will exclude (I should write a post about this finally...). But there will also be situations where the true effect could be larger. The point is, the observed effect can at best give you a relatively decent estimate of where the true effect probably falls.
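
A minimal sketch of why 'observed power' is so unstable (Python; a two-sample design with an assumed true d = 0.5 and n = 30 per group, so the numbers are purely illustrative):

```python
# 'Observed power': recompute power by plugging the sample effect size back in.
# With a fixed true effect (d = 0.5, n = 30 per group, true power ~ 0.49),
# the post hoc estimate scatters wildly, because the observed d is noisy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n, n_sims = 0.5, 30, 10_000
z_crit = stats.norm.ppf(0.975)

obs_power = []
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    d_hat = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    ncp = d_hat * np.sqrt(n / 2)       # approximate noncentrality, two-sample test
    obs_power.append(1 - stats.norm.cdf(z_crit - ncp)
                     + stats.norm.cdf(-z_crit - ncp))

obs_power = np.array(obs_power)
true_ncp = true_d * np.sqrt(n / 2)
true_power = 1 - stats.norm.cdf(z_crit - true_ncp) + stats.norm.cdf(-z_crit - true_ncp)
print(f"true power (normal approximation): {true_power:.2f}")
print(f"'observed power' 5th-95th percentile: "
      f"{np.percentile(obs_power, 5):.2f} to {np.percentile(obs_power, 95):.2f}")
```

With the true power fixed near 50%, the post hoc estimate ranges from close to the alpha level to close to 1 across simulated studies, because it is just a noisy transformation of the observed effect.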

Chris

Three points to add. First, the main point of the argument is quite good.

Second, the drawings are, let's say, "not to scale." That is, the "dark tails" of the drawing are less than 5% of the area under the curve. Chalk it up to "rhetorical flourish" with graphics.

Third, I think that the drawing might be mixing up the H1 distribution with the distribution of sample means. That is, the p-value isn't represented by an area under the H1 distribution, but rather by an area under the (much narrower) distribution of sample means.

Points two and three are based on the notion that, to the extent the argument is based on looking at the graphical representation, the blog post "over-argues" the point. (Which is still good, in my analysis, just without so much interocular power.)

I now await correction from the statistically-better-informed-than-me, whose membership is legion.

JSanJ

I believe, correct me if I am wrong, that the author's point here is that too many p-values around 0.05 are indicative of p-value hacking. She does not mean that there are too many small p-values; rather, she is saying there are not enough truly small p-values. See http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106 for another perspective.

Joe Duarte

Nice post, Simine. I would add that p-hacking, and the broader issue of Type I and II errors, capture only one specific source of false findings. The curves assume that the research is valid to begin with, i.e., that the constructs and measures are valid. That's not always going to be the case.

For example, if you measure sunshine and call it self-esteem, it doesn't matter if "self-esteem" predicts weight loss at p < .001. P-values don't matter in cases of invalidity. The field gravitates toward numerics, but validity rarely comes with numbers attached. Cultural biases, like those highlighted in the WEIRD paper, are good examples of validity issues.

It would be interesting to find out whether invalid research has a distinctive p-value pattern, maybe clustered well below .05. That will be hard to study, though, since "invalid" is complicated and arguable. It would be more tractable if nailed down to specific forms of invalidity.

Dr. R


Hi Simine,
I wrote a related blog post that you might find interesting.

https://replicationindex.wordpress.com/2015/05/27/when-exact-replications-are-too-exact-the-lucky-bounce-test-for-pairs-of-exact-replication-studies/

The main point of Daniel's and my blog posts is that we do not need to know the real power of a study to answer Mickey's question about the probability of observing a p-value between .025 and .05. We can simply state the maximum probability of this event across all possible levels of power. For the interval between .025 and .05, that maximum is reached at 56% power. In your figure, the H1 distribution would then be centered just to the right of the significance criterion (not as far to the right as you drew it).

The probability of obtaining a p-value between .05 and .025 is small because it is a small slice of the distribution of possible p-values.
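
A minimal sketch of that calculation (Python, assuming a simple two-sided z-test; this is not the exact code behind either blog post):

```python
# Maximum probability of a two-tailed p-value landing between .025 and .05,
# maximised over the noncentrality parameter (equivalently, over power),
# for a z-test. The maximum is reached at roughly 56% power.
from scipy import stats
from scipy.optimize import minimize_scalar

z_lo = stats.norm.ppf(1 - 0.05 / 2)    # |z| giving p = .05  (~1.96)
z_hi = stats.norm.ppf(1 - 0.025 / 2)   # |z| giving p = .025 (~2.24)

def prob_in_slice(delta):
    """P(.025 < p < .05) for a two-sided z-test with noncentrality delta."""
    upper = stats.norm.cdf(z_hi - delta) - stats.norm.cdf(z_lo - delta)
    lower = stats.norm.cdf(-z_lo - delta) - stats.norm.cdf(-z_hi - delta)
    return upper + lower

res = minimize_scalar(lambda d: -prob_in_slice(d), bounds=(0.0, 5.0), method="bounded")
delta_opt = res.x
power_opt = 1 - stats.norm.cdf(z_lo - delta_opt) + stats.norm.cdf(-z_lo - delta_opt)
print(f"power at the maximum  : {power_opt:.2f}")                 # ~0.56
print(f"max P(.025 < p < .05) : {prob_in_slice(delta_opt):.2f}")  # ~0.11
```

Under these assumptions the maximum works out to roughly 11%; any other level of power makes the slice even less likely.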

To use this as a test, the question is where to draw the line. So .049 and .039 are suspicious, but what about .06 and .04? What about .03 and .008?

Best, Dr. R
