Okay, now it's time to actually start the blog! The goal is 2 posts a month.
Earlier this month at the Society for Personality and Social Psychology conference in lovely Austin, TX, there were (as there have been at every conference for the past few years) some nice sessions on how to improve our science. In years past, the topics have mainly focused on issues of research integrity and replicability.
This year, Uri Simonsohn gave a nice little talk about a topic I've been meaning to blog about for a while now. Power.
Basically, if you set out to test a hypothesis and your hypothesis is correct, power is how likely you are to actually obtain statistically significant results. You want to have lots of power; otherwise you're just wasting your time and your participants. And power is mostly a function of two things: your effect size and your sample size.
Here's the standard textbook guideline: We want to set power at .8. So if your hypothesis is a good one, you want a 4-in-5 chance of turning up the goods once the study is done. You guess what effect size you're expecting. You use some stats software to pick your sample size. Voila! Power!
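(For the curious, here's what that textbook recipe looks like in code. This is just my own illustration using Python's statsmodels; the guessed effect size of .5 is a placeholder, not a recommendation.)

```python
# Textbook power analysis: guess an effect size, fix power at .8,
# and solve for the per-condition sample size (two-sided t-test, alpha = .05).
from statsmodels.stats.power import TTestIndPower

guessed_d = 0.5  # the effect size you *think* you'll find (the shaky part)
n_per_condition = TTestIndPower().solve_power(
    effect_size=guessed_d, nobs1=None, alpha=0.05, power=0.80
)
print(round(n_per_condition))  # about 64 per condition for a medium effect
```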
Hold on, said Simonsohn, there's a problem... this approach is fundamentally flawed, because we don't really have any good, feasible, sensible way to figure out what effect size to expect. Run a pilot? Not gonna work. Check the literature? Nope. Go with your gut? Unless your gut is better than mine (or his), nice try.
His recommendation mirrors pretty much exactly what I've been doing for the past couple of years (damn, I got scooped again!). Instead of trying to predict your effect size a priori, why not pick what effect size you would find meaningful? Brilliant! Some effects are only meaningful if they are very large. Other effects can be meaningful even when small. Pick what effect size you think would be relevant, then go with an appropriate sample size for that effect.
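(To make that concrete, here's a rough sketch, again just my own Python illustration, of what picking the smallest effect size you'd care about looks like in practice. The candidate d values are arbitrary examples.)

```python
# Sample size needed for .8 power at the smallest effect you'd find meaningful,
# rather than at the effect you predict (two-condition, between-subjects design).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for meaningful_d in (0.20, 0.35, 0.50, 0.80):  # "I'd care if it's at least this big"
    n = analysis.solve_power(effect_size=meaningful_d, nobs1=None,
                             alpha=0.05, power=0.80)
    print(f"d >= {meaningful_d:.2f}: ~{round(n)} participants per condition")
```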
If the results are significant...yay! If the results are not significant...also yay! That second bit sounds counterintuitive, but I'll stand by it. If you said at the outset "I would care about this effect if it were at least X big" and your results are not significant, you can move on from that project because--by your own definition--the effect is probably either nonexistent or too small for you to care about. After a couple of years of this, it is lovely being able to drop ideas in a more conclusive manner. And you can only do this with an adequately powered study.
So, there's the good news. Now how about some bad news?
As a field, how are we doing with that whole "run sufficiently powered studies" idea people have been harping on since, I don't know, the middle of the past century?
Here's a plot showing power (y axis) across different sample sizes (x axis) and effect sizes (those pretty curves):
So if you think a medium-sized effect would be meaningful, you follow that curve until you hit the nice green line at .8 power, and then you decide to run an experiment with something like 60 participants per condition (let's just assume a two-condition, between-subjects experiment). So easy, surely even an entire field of science can do it?
Based on some back-of-the-envelope...totally scientific...estimates, the red curve is a typical effect size for social psych (.35) and the vertical blue line is a typical sample size for social psych (20 per condition). That combination gives you power of less than .2. In other words, even if your hypothesis is right, you have less than a 1-in-5 chance of getting significant results.
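(If you want to check those two points on the plot yourself, here's a quick sketch in Python; the numbers are approximate and assume a standard two-sided t-test.)

```python
# Power for the two scenarios above: a medium effect with ~60-65 per condition,
# and a "typical" social psych study (d = .35, 20 per condition).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
print(analysis.power(effect_size=0.50, nobs1=64, alpha=0.05))  # roughly .80
print(analysis.power(effect_size=0.35, nobs1=20, alpha=0.05))  # under .20
```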
So, if running underpowered studies is bad...and we've known this for decades...why are so many people still doing it?
I think, unfortunately, a lot of the motivation to run underpowered studies comes from the current incentive structure in this business. Want to get a job/tenure? You need lots of publications. How do you get lots of publications? Well, you can run lots of studies. As many as possible. If you have a finite amount of resources at your disposal, you can't afford to "waste" those resources running a few really big studies when you could spend them instead running lots and lots of smaller studies, then publishing whichever ones--despite the odds--turned up significant results. A strategy, if you will, of running tons of small studies might "beat" a strategy of running fewer, larger studies.
To illustrate this, let's try a visual thought experiment. You have two researchers out there. Let's call them Dr. Elephant and Dr. Mayfly. Both researchers have access to 2000 participants per year. They are also able to publish 100% of their significant findings in single-study papers. What size studies should they run? Dr. Elephant runs studies with 300 participants. Dr. Mayfly runs tons of studies with 40 participants (keep in mind that Dr. Mayfly's strategy closely resembles the modal strategy in my field). And let's build in some other assumptions. They are both equally adept at identifying plausible hypotheses (let's say they're right 50% of the time...better than me), and they study comparably sized effects (say, Cohen's d of .35). Let's pit them against each other over a 6-year assistant professorship. Who wins?
Boom! Dr. Mayfly publishes almost twice as many papers (1.98 times as many) as Dr. Elephant. By this point, Mayfly is probably swimming in grant money, has received many competing job offers, and now has moved to a cushy, early-tenured position at [Prestigious University X]. Where's the downside to all that???
Okay, let's change the rules a bit to see if we can't rescue Dr. Elephant. Now we will assume that as a new stage in peer review, some nefarious and omnipotent council decides to attempt one exact replication of every finding submitted for publication. If it replicates...accept the paper. If it doesn't...too bad. Now who wins?
You add one rule to the game...Your Effect Must Replicate...and all of a sudden the winning strategy entirely reverses. Under this system, Dr. Elephant publishes 2.56 times as many papers.
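(Here's a rough reconstruction of the arithmetic behind the thought experiment, under the assumptions above: 2000 participants a year for 6 years, a 50% hit rate on hypotheses, d = .35, and a two-sided t-test. Don't expect to reproduce 1.98 and 2.56 to the decimal; the exact ratios depend on details like one- vs. two-tailed tests, but the reversal is the point.)

```python
# Expected publications for Dr. Elephant (n = 300) vs. Dr. Mayfly (n = 40),
# with and without a "your effect must replicate once" rule.
from statsmodels.stats.power import TTestIndPower

ALPHA, D, P_TRUE = 0.05, 0.35, 0.50   # false positive rate, effect size, hypothesis hit rate
POOL = 2000 * 6                       # participants available over a 6-year clock

def expected_pubs(n_per_study, must_replicate=False):
    n_studies = POOL // n_per_study
    power = TTestIndPower().power(effect_size=D, nobs1=n_per_study // 2, alpha=ALPHA)
    p_pub_true = power ** 2 if must_replicate else power   # real effect detected (twice)
    p_pub_false = ALPHA ** 2 if must_replicate else ALPHA  # null effect gets lucky (twice)
    return n_studies * (P_TRUE * p_pub_true + (1 - P_TRUE) * p_pub_false)

for rule in (False, True):
    print(f"replication required = {rule}: "
          f"Elephant {expected_pubs(300, rule):.1f}, Mayfly {expected_pubs(40, rule):.1f}")
```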
What happened? Well, Dr. Mayfly's strategy is completely unsuited to a world in which direct replicability matters. She has lots and lots of significant findings. Many of them are legit, but her designs are so underpowered that others will have a hard time replicating them. In addition, many of her findings are false positives.
In fact, running underpowered studies has the potential to fill the published literature with an undesirably high number of false positives. If everyone is playing by the book, any given study has a 5% chance of turning up a significant result when there is no effect. But if you are chronically running underpowered studies, then a relatively larger share of your statistically significant results will be false positives. Think about it this way: your likelihood of obtaining a statistically significant result depends on 1) your likelihood of being right, 2) your power, and 3) that 5% acceptable false positive rate. If you hold 1 and 3 constant but reduce 2, then a bigger chunk of the "significant results pie" will be taken up by false positives. Here's a graph making the same point. Given the same assumption I used above (you're right half the time), here is the probability that a given significant result is, in fact, a false positive, across different effect sizes and sample sizes.
Given that, it looks like fully 20% of the statistically significant results coming out of Dr. Mayfly's lab are actually false positives. And any time you move left (smaller studies) or up (smaller effects) on this graph, you are getting, proportionally speaking, more false positives per significant result. This, to me, seems to be a somewhat under-discussed consequence of low-powered research.
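(The false-positive share is just conditional probability; here's a tiny sketch under the same "right half the time" assumption.)

```python
# Share of significant results that are false positives,
# given a 50% hit rate on hypotheses and alpha = .05.
from statsmodels.stats.power import TTestIndPower

ALPHA, P_TRUE = 0.05, 0.50

def false_positive_share(d, n_per_condition):
    power = TTestIndPower().power(effect_size=d, nobs1=n_per_condition, alpha=ALPHA)
    false_pos = (1 - P_TRUE) * ALPHA   # nulls that sneak through anyway
    true_pos = P_TRUE * power          # real effects that get detected
    return false_pos / (false_pos + true_pos)

print(false_positive_share(0.35, 20))    # Dr. Mayfly: roughly 1 in 5
print(false_positive_share(0.35, 150))   # Dr. Elephant: well under 1 in 10
```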
So, given all of this, I think the recommendation is clear. We need more power!
And, for my research these days, if that increased power comes at a direct cost to the number of studies I can run, well, I'm okay with that. Luckily for me, so is my department head (he is awesome).