Pop open a social psychology journal these days (or really any time in the past several decades), and you will notice a pattern. In study after study after study after study, researchers describe having a hypothesis that they tested. And pretty much always, the researchers then report statistically significant results in support of that hypothesis! Wow, we are amazing! We're almost always right, and we almost always have the evidence to back it up!
Now, some folks have pointed out that the picture painted by the journals is pretty much absolutely guaranteed to be too good to be true. Most of our effects aren't very big...and we tend to use samples that are too small to detect them even when we are right. So, how on earth do the journals make it look like we're always right?
One possibility is that we actually are always right...that social psychologists are in fact psychic. We can always get the right results even though our studies are severely underpowered. We are just that f$#%ing good.
A perhaps less optimistic view is that in some ways the published literature is not a faithful representation of the actual research process (spoiler alert: it isn't!). Maybe the hypotheses espoused in the papers were not made before the results were in (that is, maybe these hypotheses were "post-predicted"). Maybe researchers design overly flexible studies that allow for numerous different comparisons, and then "pick the winner," reporting only the analyses that "worked" (for example, I could have five conceptually related outcome variables and report only the one that yielded statistically significant results, or I could run a five-condition experiment and report only the two conditions that significantly differed). And, hey, if you're flexible enough in your research design, and you aren't transparent about that flexibility, you can find significant results for pretty much anything. That's not to say you're committing outright scientific fraud...it's more like you're using a variety of Performance Enhancing Designs (PEDs) to present the rosiest possible picture of your evidence (and if most of your evidence is like mine, it's not all particularly rosy).
So, what do we do about all of this? Ignoring all published evidence seems a bit daft. Trusting it all implicitly seems equally daft. Picking and choosing which pockets of the published literature reflect genuine effects sounds like a good idea. But it can be tricky, because nobody publishes all of their "failed" studies (we've all got file drawers), and until recently there wasn't much discussion of undisclosed flexibility in methods and analyses (that is, p-hacking exists, and we've all done it).
Well, Simonsohn, Nelson, and Simmons recently published a technique that might help us sort out which pockets of the literature carry strong evidential value, suggesting the underlying effects are legit, and which might be byproducts of selective reporting of studies, analyses, and methods that fortuitously "worked."
Here's how it works. If you run a gazillion studies on the same topic, but there is no underlying true effect, by chance some studies will still yield statistically significant results (p < .05). But when there's no effect out there, any given p-value is just as likely as any other; under the null, p-values are uniformly distributed. So if you chart just the significant p-values, binned from 0 up to .05, you'd expect a flat line.
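If you want to see that for yourself, here's a minimal simulation sketch (mine, not the p-curve authors'; the sample sizes are just illustrative assumptions): run a pile of two-sample t-tests where the true effect is zero, keep only the "significant" ones, and bin the p-values the way a p-curve does.

```python
# Simulate many studies with NO true effect and plot the "p-curve" bins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 50_000, 20

p_values = []
for _ in range(n_studies):
    a = rng.normal(0, 1, n_per_group)  # both groups from the same distribution: d = 0
    b = rng.normal(0, 1, n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

sig = [p for p in p_values if p < .05]
counts, _ = np.histogram(sig, bins=[0, .01, .02, .03, .04, .05])
print(counts / len(sig))  # roughly [0.2, 0.2, 0.2, 0.2, 0.2]: a flat p-curve
```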
On the other hand, if you chart the p-values that would result from a gazillion (adequately powered) studies of a real effect, you get a curve that slopes steeply down to the right: lots of p-values below .01, relatively few near .05.
And if the effect is real, but small, and you're running underpowered studies (:cough: like pretty much the entire field :cough:), it looks like this:
See? It's still sloping down to the right, but it gets flatter. Probably a real effect, but the researchers should be using lots more participants.
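Here's the same simulation sketch with a real effect baked in (I'm assuming d = 0.5 purely for illustration): with big samples the significant p-values pile up below .01, and with small samples the curve still slopes down to the right but flattens out.

```python
# Simulate studies of a REAL effect at two sample sizes and compare p-curve shapes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def p_curve_bins(d, n_per_group, n_studies=20_000):
    """Proportion of significant p-values landing in each .01-wide bin."""
    ps = []
    for _ in range(n_studies):
        a = rng.normal(d, 1, n_per_group)  # "treatment" group shifted by d
        b = rng.normal(0, 1, n_per_group)  # "control" group
        ps.append(stats.ttest_ind(a, b).pvalue)
    sig = [p for p in ps if p < .05]
    counts, _ = np.histogram(sig, bins=[0, .01, .02, .03, .04, .05])
    return counts / len(sig)

print(p_curve_bins(d=0.5, n_per_group=100))  # well powered: most of the mass is below .01
print(p_curve_bins(d=0.5, n_per_group=20))   # underpowered: still right-skewed, but flatter
```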
Okay, so a flat line means no real evidence of an underlying effect, and a line sloping down to the right means there's probably some underlying effect. The steeper the slope, the more power the researchers had to detect the effect. So...how about this little fella?
You get a p-curve like that, sloping upward to the right with the p-values bunched just under .05, when 1) there is no underlying effect, and 2) the researcher/literature in question is characterized by an extreme willingness to do whatever it takes, in terms of undisclosed research flexibility, to get statistically significant effects. A p-curve like this tells you more about the (shoddy) methods underlying the effects in question than it does about the effects themselves. According to the authors of the p-curve paper, this type of p-curve is indicative of intensive p-hacking. PEDs in action.
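To make that concrete, here's a sketch of one flavor of PED (optional stopping; this is my own illustrative simulation, not the authors'): there is no true effect, but the "researcher" peeks at the data every few participants and stops the moment p dips below .05. Among the studies that "work," the p-values bunch up just under .05.

```python
# Simulate optional stopping with NO true effect: peek every 5 participants per
# group, stop as soon as p < .05, and look at the resulting "p-curve".
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, batch, max_n = 10_000, 5, 100

sig = []
for _ in range(n_studies):
    a = np.empty(0)
    b = np.empty(0)
    while len(a) < max_n:
        a = np.concatenate([a, rng.normal(0, 1, batch)])  # d = 0: no real effect anywhere
        b = np.concatenate([b, rng.normal(0, 1, batch)])
        p = stats.ttest_ind(a, b).pvalue
        if p < .05:              # stop and "publish" the moment it's significant
            sig.append(p)
            break

counts, _ = np.histogram(sig, bins=[0, .01, .02, .03, .04, .05])
print(counts / len(sig))  # more mass near .05 than near .01: an upward-sloping (left-skewed) p-curve
```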
Bottom line looks about like this:
1) A downward-sloping p-curve indicates evidential value for the studies you've subjected to the p-curve analysis.
2) A flat p-curve says you should be skeptical of the underlying effect.
3) An upward-sloping p-curve says you should be quite skeptical of the underlying effects, and more than a bit suspicious of the methods and analyses that generated the reported results. All is probably not what it seems.
So, in practical terms, you select some set of studies (you probably want strict inclusion criteria, like "all papers on topic X" or "a randomly selected set of studies from journal Y that include the term Z"). You document what the main hypotheses are. Then you look at the p-values associated with the main reported analyses. The p-curve authors present a nifty online widget that will compile a p-curve for you (you just input the test statistics). It will then run inferential tests of three possibilities (a stripped-down sketch of the core test follows the list below):
1) Studies contain evidential value (right-skewed)
2) Studies lack evidential value (flatter than 33% power)
3) Studies lack evidential value and were intensely p-hacked (left-skewed)
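For the statistically curious, here's a stripped-down sketch of the logic behind test 1), the right-skew test. Treat it as a cartoon, not the real thing: the actual app works from the test statistics you enter and runs additional tests (including the 33%-power comparison), so the exact procedure differs. The core idea is that, under the null, a significant p-value is uniform between 0 and .05, so each one can be converted to a z-score and the z-scores pooled; a strongly negative pooled Z means right skew, i.e., evidential value.

```python
# A simplified right-skew ("evidential value") test on a set of significant p-values.
from math import sqrt
from scipy import stats

def right_skew_test(p_values):
    """Pool significant p-values Stouffer-style to test for right skew."""
    sig = [p for p in p_values if p < .05]
    # Under the null, a significant p is uniform on (0, .05), so p/.05 is
    # uniform on (0, 1); convert each to a z-score and pool them.
    zs = [stats.norm.ppf(p / .05) for p in sig]
    z = sum(zs) / sqrt(len(sig))
    return z, stats.norm.cdf(z)  # very negative z / small p => right skew

# Hypothetical example: a handful of significant p-values from some literature.
z, p = right_skew_test([.002, .011, .021, .034, .041, .004, .018])
print(f"Pooled Z = {z:.2f}, p = {p:.3f}")
```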
In the interests of exploring the p-curve a little bit more, I thought "what the hell...why don't I just submit all of my published studies to a p-curve analysis?" So I went through all of the studies I've ever published and entered them into the p-curve widget. Without wading through the technical details (this isn't meant to be a technical blog, overall), here is the p-curve of Will Gervais:
*** Caveats...
1) This was my first attempt at a proper p-curve. It is entirely possible that I messed something up.
2) Does this mean that all of my studies provide evidential value, or that I have never p-hacked? Of course not! Just that, on the whole, if you evaluate my entire body of published work, "sometimes Will Gervais has been right about something" is a better explanation than "Will Gervais usually p-hacks the hell out of everything." But you don't need a p-curve to tell you that; you could just buy me a beer and ask me!