So this semester I'm teaching a graduate seminar all about methods. I've been trying to highlight some of the current discussion going on in my field concerning whether or not the status quo in methods is up to snuff. And along the way, we've sampled some recent publications introducing new techniques for assessing research. Previously, I blogged about p-curves. Check that out for a detailed introduction, but the bottom line is this:

You can look at how slopey the distribution of p-values is in some chunk of the literature. The slopiness can be tested for a few possibilities.

  1. If the distribution slopes down to the right, this suggest that the studies on the whole contain evidential value. There is evidence of evidence. Be happy about the underlying effects.

  2. If the distribution is flat, this suggests that evidential value, if it exists, is inadequate. There is not evidence of evidence. Be skeptical of the underlying effects.

  3. If the distribution slopes up to the right, this suggest the presence of shenanigans. Be skeptical of the underlying effects, and more than a little suspicious about the methods utilized to support such effects.

This all sounds straightforward, so I wanted to try it out in my class. I came up with an assignment. Basically, every student had to pick something to p-curve. Then they'd target at least 20 studies on their chosen topic (some students ended up curving 50+ studies...impressive!). It could be a researcher, method, finding, whatever. Then, students were required to hop onto OSF to preregister their p-curve plan. How would they decide which studies to include or exclude, etc. Then, following the detailed instructions provided by the fine folks over at p-curve, students generated a p-curve analysis. And then one afternoon, we ordered some pizza and everyone presented their p-curve results. I'll outline the p-curve topics and results in a bit. But in the meantime, we all had some observations.

First, p-curves are a lot of work. A lot of articles are not written in a way that makes it easy to apply the p-curve app. Perhaps the analyses don't provide statistical output that fits the p-curve app's template. Even when the formatting fits, it's still a lot of work. And I think this is a good thing. P-curve is a potentially very powerful tool, and it needs to be handled responsibly. The hard work that goes into a p-curve will likely serve as a reminder to p-curvers to take this business seriously.

No stated hypothesis to be tested? This gives Sir Karl a headache.

No stated hypothesis to be tested? This gives Sir Karl a headache.

More surprising (and a bit disheartening), many of us faced an odd challenge. The logic of p-curve says that one should only include p-values from tests of primary hypotheses. Yet, a shockingly large number of studies out there are quite opaque about what the primary hypotheses actually were. Sure, on a quick skim, it sounds like the researchers were out to test a general idea. Then they give you a bunch of statistics. Then they interpret the statistics with a summary like “These analyses, consistent with our predictions, demonstrate that blah blah blah we were right blah.” Now, the analyses may be broadly consistent with a conceptual-level hypothesis. But it was far from universal that researchers would clearly delineate what their key statistical hypotheses were before the barrage of stats. This made it very tough to pin down which p-values to include.

Combined, these observations made me realize instances in my own publications where I could've been much more clear about why exactly I was doing which tests. Better yet, it made me think even more that preregistration is a good idea. Having tried it on one or two projects by now, it's pretty cool having hypotheses set and analysis code written before the data are even collected. That way, there's no way to sit down with a dataset and then try to remember exactly what the predictions were. No HARKing way.

Anyways, enough with the subjective impressions, how about the good stuff? As I mentioned, the project was open-ended. People could choose whatever they wanted (with the caveat that I banned any masturbatory self-p-curving). In total, the class came up with 10 p-curves. I promised my students that everything would remain anonymous in this post. I invited my students to give me a brief summary of their topics. So, without further ado, here are the curve topics, with brief rationales:

  1. The Taylor Noise Blast Aggression Paradigm. This is a classic proxy measure of behavioral aggression. The setup is straightforward. Participants think they're playing a game against a stranger in another room. The game has 25 (or 9, or however many you want) rounds. At each round, the “winner” gets to blast the “loser” with some noise over headphones. The winner gets to decide both the duration and intensity of the blast. Louder, longer blasts == more aggression. Sounds straightforward, so why p-curve it? Well, the literature is riddled with different ways to score the task, even within individual labs. Some studies average all 25 blasts delivered ("overall aggression"). Some only count the first blast (“unprovoked aggression”). Some only count the last 24 blasts or only the second blast (“retaliatory aggression”). Some count how many of 25 blasts were set at an intensity of at least 8 on a 10 point scale (“extreme aggression”). You get the point. Maybe the decision of which scoring method is always theoretically driven and precedes data collection. More cynically, maybe people report the scoring scheme that left the prettiest results. So, across all scoring schemes, does this method deliver evidence of evidence?

  2. Long Papers. Some journals, such as JPSP, tend to have papers with lots and lots of studies. Maybe lots of studies == lots of evidence. On the other hand, maybe people are using number of studies, rather than strength of individual studies, to judge the quality of research. And there are some signs of this out there. If true, then a 6+ study paper might include a few strong studies, then some padding. So, do the last two studies of 6+ study JPSP papers deliver evidence of evidence?

  3. Dr. David Hardesty. One student wanted to p-curve her adviser's recent research. So, does Dr. Hardesty deliver evidence of evidence?

  4. Dr. Sandra Bem. One student wanted to p-curve research by the late great Dr. Sandra Bem. So, does Dr. Bem deliver evidence of evidence?

  5. Stereotype Threat. There has been a ton of research on stereotype threat. One student wanted to grab high-profile (read: covered in social psych textbooks) stereotype threat studies. So, does stereotype threat deliver evidence of evidence?

  6. Ego Depletion. Self-control gets a lot of research attention. The classic finding from the ego depletion literature is that people who exert self-control at one point in time tend to do more poorly on similar tasks in the future. This is a classic paradigm, but one recent meta-analysis suggests that the underlying effect size might in fact be zero. So, does classic (i.e. highly cited) ego depletion research deliver evidence of evidence?

  7. Glucose and self-control. Some have suggested that self control might rely on a limited and depletable resource; further, this resource might be glucose. This interpretation, however, is controversial in a few ways. The role of glucose as a putative substrate of self-control remains contested and thus an amenable topic for a p-curve analysis that can lay bare whether this effect contains evidential value. Does the glucose-self control link deliver evidence of evidence?

  8. Gender Differences in Rape Myth Acceptance. Sometimes, people endorse rape myths (e.g., “If a girl goes to a room alone with a guy at a party, it is her own fault if she is raped”). A recent meta-analysis finds reliable gender differences where males endorse rape myths more than do females. Do gender differences in rape myth acceptance deliver evidence of evidence?

  9. Gender Differences in Jury Decisions. Much of the jury decision-making literature agrees that females tend to render more guilty verdicts than males overall. However, there is some evidence that these differences aren't quite as strong as people think. The goal of one p-curve was to test the idea that females really do render more guilty verdicts in the courtroom, regardless of the crime type. So, do gender differences in jury decisions deliver evidence of evidence?

  10. International Authors in Psychological Reports. Psychological Reports has a reputation as a low-impact journal. Further, one student noticed that a lot of the articles are international, and often evidently not written by native English speakers. But, once we get past writing fluency, this student was interested in the underlying evidential value of research in a not-much-sought-after journal. So, do international authors in Psychological Reports deliver evidence of evidence?


Rather than wade through all the details of each, I thought I'd plug in a table illustrating how things went. And it should be noted that the new p-curve app also spits out useful information about things like inferred power. But I'm lazy so they don't appear in my table. If you're really curious, drop a comment and I'll try to reply.

So, all in, the news is quite good. 8 out of 10 p-curves delivered evidence of evidence. None of them appear to be overly tainted by p-hacking. Only two failed to show significant evidence of evidence. That's a good thing, I'd say.

Pay attention, men.

Pay attention, men.

I already mentioned that the app spits out inferred power. One curve (gender differences in rape myth acceptance) estimated power at 100%, and consisted entirely of p-values lower than .01; good news for the p-curve, bad news for, well, real life. Most of the curves had estimated power right around 33%, which isn't all that impressive. If I recall, Psych Reports looked like it had dynamite evidence, and power around 80%...didn't see that coming. The two curves with bad news had estimated power of 6% or lower. Ouch. But this isn't entirely surprising, given the rationale behind these p-curves. One of the “bad news” projects (glucose and self control) was chosen because the underlying effect had already come under fire on several different fronts (physiological implausibility; underpowered supporting studies; very few supporting studies over a 7-year span since the initial findings were published). The other one (noise blast) was chosen because the method itself seemed to invite the sort of undisclosed analytic flexibility that has been flagged as undermining evidential value. Read into those findings whatever you'd like to.


If you're teaching a methods seminar, I highly recommend this exercise. And definitely order pizza when everyone is presenting.


All credit goes to my intrepid p-curvers: Charlie Hoogland, Dave Chester, Kellie Lynch, Ellen Stone, Maxine Najle, Ben Ng, Wenjing Li, and Casey Magyarics. Remember those names, and hire them when you get the chance.


Caveat caveat caveat: This is a blog post about pedagogical success in a graduate methods seminar. Interpret accordingly. Grant me a small victory and some excitement in my teaching, even if you don't like the results of one or more p-curve results from this graduate seminar homework assignment.


Another cool experiment from this methods seminar is inviting experts to Skype in as guest "lecturers" or drop by in person if they happen to be around. I can't recommend this highly enough. Brian, Simine, Keith, Steve, Mark, and Catherine, y'all are the best! Here's a public thanks!

AuthorWill Gervais