Suppose you are reviewing a couple of papers. They sound theoretically interesting and all that jazz. You peruse the results of two papers and here’s what you see:
Paper 1: 3 between-subjects experiments with the following test statistics
- t (56) = 2.24, p = .029
- t (92) = 2.11, p = .038
- t (144) = 2.20, p = .029
Paper 2: 2 between-subjects experiments with the following test statistics
- t (59) = 2.21, p = .031
- t (200) = 2.27, p = .024
Do these outputs look like naturally occurring test statistics that could occur in the wild? Or are they indicative of something perhaps gone awry?
In this post, I outline a forensic meta-science bias detection thingie that I’ve been playing with for a while now. It doesn’t have a clever name (YET!), but it harnesses the power of simulations to assess how probable a given set of findings is, in the absence of biased reporting of analyses or studies.
I know, I know, you’re probably thinking “I need yet another forensic meta-science tool like I need another hole in the head or another Power Posing meta-analysis[i].” And I’m with you there. The quiver is already fairly full of these tools. There are classics like funnel plots and Egger’s test, there’s the Test of Excess Significance popularized by a gajillion Francis papers, and we have new entrants like the Incredibility Index and p-curve. Isn’t enough enough already?
But I tried to gin up one more to address a niche that the others might not quite hit. Basically my goal was to create a tool that might be able to harness strengths of these various approaches, while addressing some of their weaknesses, and packaging it all with intuitively comprehensible output. Easy peasy, right?
So what are these strengths and weaknesses of existing tools (at least in my idiosyncratic and shallow take)? Without an exhaustive review, I think most of our bias sniffers come in one of two varieties:
Surplus Significance Sniffers and p-distribution Sniffers.
Surplus Significance Sniffers.
We have a class of tools that basically look at a set[ii] of studies, try to figure out the typical per-study power, then look at how many of the reported studies are statistically significant. Then compare this to the assumed power and say “what’s the deal with that?” Core logic is simple: if each study is powered at, say, .5, then the odds are pretty slim that you’d get six straight statistically significant results. With typical power in psychological science, we should be seeing lots of nonsignificant results. We don’t. This means they are being hidden in some sort of drawer where one keeps files; a ‘file drawer’ if you will. File drawering and other reporting biases can turn the literature into a Flemish painter’s hellscape[iii].
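To put a number on “pretty slim”: if the six studies are independent and each has power .5 (both are just the illustrative values from the paragraph above, not estimates from any real paper), the chance that all six come out significant is .5^6, or about 1.6%. A minimal R sketch of that arithmetic:

```r
# Chance that every one of k independent studies is significant,
# assuming each study has the same power (illustrative values only).
power <- 0.5
k     <- 6

p_all_significant <- power^k
p_all_significant

# The same quantity via the binomial distribution: k successes in k tries
dbinom(k, size = k, prob = power)
```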
Surplus Significance Sniffers include the Test of Excess Significance, the Incredibility Index, and others.
Strengths: Unassailable underlying logic. Lots of significant results with no nonsignificant results don’t happen in the wild. They can come with their own inferential tests of whether there’s significant bias.
Weaknesses: Post-hoc power isn’t uncontroversial. They rely on a firm and impregnable p = .05 cutoff. That is, a single p = .06 that the authors argue as basically significant can make a paper look more credible (look! It’s not significant!) without addressing the substantive arguments that the authors make. And really, is p = .06 categorically different from p = .04? Well, these tests assume that it is. Finally, these tests only care whether a p-value is above or below the mystical .05 barrier…and because of this they miss other forms of fishiness, such as a pile of ‘just barely significant’ results. Oh, and finally finally, since they rely on comparing the distribution of significant results to a theoretical distribution, they generally require a large set of studies to be of much use.
p-distribution Sniffers.
The second broad class includes tests like p-curve and p-uniform. The general logic here is that the shape of a distribution of p-values can be diagnostic regarding what produced the results. In the absence of an effect, the distribution of p-values is uniform: p = .02 is just as probable as p = .98. As power increases, the distribution gets ever slopier (more skewed, but I prefer to call it slopiness). With high power, one expects far more p < .0001 than p = .04…with p > .05 being near impossible. It’s kind of counterintuitive[v], but once grasped, this is a Big Important Concept.
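The slopiness is easiest to believe once you simulate it. The sketch below is an illustration only: the helper `sim_p`, the 50-per-group sample size, and the d = 0.8 effect are all made-up assumptions, and it draws t statistics directly from a (noncentral) t distribution rather than simulating raw data.

```r
set.seed(1)

# Simulate two-sided p-values from two-group t-tests with equal group sizes.
# sim_p(), n_per_group and d are hypothetical, chosen only for illustration.
sim_p <- function(nsims, n_per_group, d) {
  df    <- 2 * n_per_group - 2
  ncp   <- d * sqrt(n_per_group / 2)   # noncentrality for equal groups
  t_sim <- rt(nsims, df = df, ncp = ncp)
  2 * pt(-abs(t_sim), df = df)
}

p_null <- sim_p(20000, n_per_group = 50, d = 0)    # no effect
p_high <- sim_p(20000, n_per_group = 50, d = 0.8)  # high power

# Share of p-values in each bin: flat under the null, steep with high power
bins <- c(0, .01, .02, .03, .04, .05, 1)
round(prop.table(table(cut(p_null, bins))), 3)
round(prop.table(table(cut(p_high, bins))), 3)
```

Under the null, each .01-wide bin should hold roughly 1% of the p-values; with high power, nearly everything piles into the lowest bin.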
Strengths: Can work in the presence of only significant results; this means that they are designed to work on literatures that look like ours. They spot fishy patterns like a string of p = .03 or p = .04. They enable not only detection of bias, but can attempt to correct for bias in meta-analysis[iv]. There are nice online widgets that make p-curve go. They are increasingly common. Perhaps most importantly, they do not merely test for the presence of bias (which we already know exists), they instead test for the presence of evidence above and beyond bias.
Weaknesses: They still treat p = .05 as an impregnable barrier, though unlike Surplus Significance Sniffers they simply ignore any results above the barrier. Like Surplus Significance Sniffers, they generally require a large set for solid inference. I’ve heard some bemoan the somewhat counterintuitive underlying principle. And the output can be a lot to deal with for the uninitiated. Finally, since they ignore all nonsignificant results, they miss out on the chicanery that Surplus Significance Sniffers look for.
The two approaches have somewhat complementary strengths and weaknesses. Thus my desire to create a technique that takes the best of both worlds, looking for BOTH surplus significance AND an unnatural clustering of just-significant p-values. To top it off, I wanted to decouple the method from inferential stats and use simulations to (hopefully) provide a relatively intuitive inference.
HOW IT WORKS
Essentially, the new unnamed tool takes a set of studies at face value and asks “how probable is it that this particular set of results would crop up in the wild, without interference by biased (read: human) researchers?” At the end of the day, it’s entirely silent about why a pattern of results might appear unnatural. It just says how representative the results are based on what the set of studies’ information implies. It proceeds through five steps:
- Take reported effects at face value; assess underlying effect size and estimate uncertainty
- Simulate a set of studies with the given sample sizes and estimated effect size
- Calculate whether or not the p-values fall within any given (user-specified) range
- Repeat Steps 2 and 3 a large number of times.
- Form an inference: does that pattern of results show up often in simulations?
More concretely, the method
1. begins by performing a random effects meta-analysis on the set to determine (again, taking everything in the set at strictly face value) the mean underlying effect size and associated heterogeneity. This serves as the input for the next step.
2. The set contains actual sample size information. Neat, simulate the set from the given sample sizes, plugging in the effect size from the first step (sampling the effect size from a normal distribution with a mean of the meta-analytic estimate and a standard deviation set to match the heterogeneity [tau]).
3. You gotta give it something to work with. Did the set contain only p-values between .001 and .07? Cool, set those boundaries. Really, set whatever boundaries you like. I recommend being generous with the boundaries, setting them broadly enough that you can play nice. It’s trivially easy to make a set look improbable by setting the p-value boundary to be infinitesimally small. Don’t be a dick about it. Yes you.
4. Cool, steps 2 and 3 give you one simulated set and tell you whether it looks similar to the initial set. You want more than one simulated set. So simulate it a shitload of times. How many? How precise do you want to be and how much time do you have? I find that 10,000 simulations is plenty in most cases. For rhetorical punch you can go nuts and set it to a billion. Really, it’s your call.
5. The final output will tell you how many of the simulated sets ‘look like’ the original set. Is the proportion of similar looking sets high or low? This one’s a judgment call. But FWIW, the unlikely sets I’ve tried tend to return values like ‘we couldn’t reproduce a set like that in a billion tries, motherfucker.’
Really, that’s it. It takes the results at face value, offers some flexibility in what types of values you’re looking for, and then just returns a probability for the output. It tells you how surprising the set is in the wild, no more and no less.
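For readers who think in code, here is a rough base-R sketch of the five steps. To be loud about it: this is NOT the metaSim code on the OSF page, and its output should not be expected to match the figures reported in this post. The function name `sim_set_sketch` is hypothetical, and the sketch assumes two-group designs with equal group sizes, recovers Cohen's d from t, uses a DerSimonian-Laird estimator for the random-effects step, and simulates t statistics from a noncentral t distribution.

```r
# Hypothetical sketch of the five steps; not the authors' metaSim().
# Assumptions: equal n per group, d recovered from t, DerSimonian-Laird
# random-effects estimates, t statistics drawn from a noncentral t.
sim_set_sketch <- function(df, t, min_p, max_p, nsims = 10000) {
  k <- length(t)
  n <- (df + 2) / 2                  # per-group n under equal groups
  d <- t * sqrt(2 / n)               # Cohen's d implied by each t
  v <- 2 / n + d^2 / (4 * n)         # approximate sampling variance of d

  # Step 1: random-effects meta-analysis (DerSimonian-Laird)
  w    <- 1 / v
  d_fe <- sum(w * d) / sum(w)
  Q    <- sum(w * (d - d_fe)^2)
  C    <- sum(w) - sum(w^2) / sum(w)
  tau2 <- if (k > 1) max(0, (Q - (k - 1)) / C) else 0
  w_re <- 1 / (v + tau2)
  mu   <- sum(w_re * d) / sum(w_re)

  # Steps 2-4: simulate the set nsims times at the original sample sizes
  hits <- replicate(nsims, {
    true_d <- rnorm(k, mean = mu, sd = sqrt(tau2))  # per-study true effect
    ncp    <- true_d * sqrt(n / 2)
    # noncentral t built by hand so it vectorizes over df and ncp
    t_sim  <- rnorm(k, mean = ncp) / sqrt(rchisq(k, df = df) / df)
    p_sim  <- 2 * pt(-abs(t_sim), df = df)
    all(p_sim >= min_p & p_sim <= max_p)            # Step 3: all in range?
  })

  # Step 5: how often does a simulated set 'look like' the original?
  list(mu = mu, tau = sqrt(tau2), prop_similar = mean(hits))
}

set.seed(42)
res <- sim_set_sketch(df = c(24, 276, 104, 212),
                      t  = c(2.11, 2.24, 2.95, 2.10),
                      min_p = .001, max_p = .05, nsims = 2000)
res
```

The t statistics are generated by hand (a shifted normal over a scaled chi-square) rather than with `rt()`, because that form is vectorized over both `df` and the noncentrality parameter, so each study in the set keeps its own sample size and its own sampled effect.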
Let’s take her for a spin.
Starting with the two examples at the beginning…here’s how things look.
Those both look problematic. Given a billion simulations of each, I couldn’t get results similar to the obtained results, even while being very generous with the p-value barriers (the lower cutoff was at .001 even though neither set had a p-value under .01. Again, don’t be a dick). Given the reported effect sizes, it looks all but impossible to recreate the observed results. This suggests that the results may be a product of biased reporting (or insanely good/bad luck).
EDIT 3-16: I caught and fixed a glitch in the code. Damn vectorization failure. The code linked below is updated. Regarding these sims, they occasionally can produce a similar set. I ran a million simulations of both of the above sets this morning. Set 1 gets similar values 0.0065% of the time. Set 2 gets similar results 0.24% of the time. The point still stands: these patterns emerge with human intervention, but not in the wild.
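On the “how precise do you want to be” question: a simulated proportion carries its own Monte Carlo error, which shrinks as the number of simulations grows. The sketch below assumes the two percentages in the edit come from exactly one million simulations each, which would mean 65 and 2,400 similar sets; those counts are inferred from the percentages, not taken from the output itself.

```r
# Assumed hit counts: 0.0065% and 0.24% of exactly one million simulations.
nsims <- 1e6
hits  <- c(set1 = 65, set2 = 2400)

prop  <- hits / nsims
mc_se <- sqrt(prop * (1 - prop) / nsims)   # Monte Carlo standard error

# Exact (Clopper-Pearson) 95% intervals from base R's binom.test()
ci <- t(sapply(hits, function(h) binom.test(h, nsims)$conf.int))

# Everything expressed in percent
signif(100 * cbind(percent = prop, se = mc_se,
                   lower = ci[, 1], upper = ci[, 2]), 3)
```

With only 10,000 simulations, a proportion as small as the one for Set 1 would usually show up as zero hits, so the simulation count matters most when the set is very improbable.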
We’ve been playing with this a bit and so far it’s promising. Effects that we now know don’t replicate get flagged as impossible. Results that do replicate appear pretty probable. I don’t have any suggested cutoffs for what is deemed too improbable[vi].
How to use it? HERE’s a link to our R code on an OSF project. You just have to plug in vectors with the df and t-test values. Yeah, sorry, right now it only works for t-tests. We’re hoping to add more functionality over time[vii]. Then you run the final function (metaSim) which does the rest. Your final code will look like this, depending on where you set min_p and max_p (self explanatory), nsims (how many simulations), and plot (do you want a funnel plot?):
df <- c(24, 276, 104, 212)
t_vals <- c(2.11, 2.24, 2.95, 2.10)
metaSim(df = df, t = t_vals, min_p = .001, max_p = .05, nsims = 10000, plot = TRUE)
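Before picking min_p and max_p, it can help to see what p-values the set actually implies. This small base-R check is independent of metaSim; the variable names are just for illustration, and it assumes the reported tests are two-sided.

```r
# Two-sided p-values implied by the example t statistics and df
df     <- c(24, 276, 104, 212)
t_vals <- c(2.11, 2.24, 2.95, 2.10)

p_vals <- 2 * pt(-abs(t_vals), df = df)
round(p_vals, 4)
range(p_vals)   # the boundaries you choose should comfortably contain this
```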
Our validation projects are ongoing. We’ve fired it at a handful of sets by now, and all the outputs have been sensible. It’s yet to flag as problematic any sets that we know are unbiased (because we ran them). It’s yet to miss a set that seems otherwise problematic, due to failed replications or whatever. We’ve even artificially added bias to unbiased sets and it flags the new biased sets as problematic. And, perhaps best of all, it hasn’t yet found an especially ambiguous set. The inferences tend to be “zero in a billion simulations fail to give one p < .001” or “sure, you can get that like 30% of the time.” What’s especially badass is that it’s been able to spot problematic sets with as few as k = 2 studies. It’s even skeptical of the k = 1 original Power Pose[viii] paper.
HELP US NAME THIS THING
A quick note on interpretation. This technique is silent about what produced a given set of results. It merely tries to figure out through brute force how probable a given set of results is, if the effect size is truly as reported. So if a set looks improbable, that implies either Too Much or Too Little in the original. Either the reported effect sizes are too big (Too Much), or there are additional studies or analyses that have gone unreported (Too Little), and likely both. That’s all. The proof will be in the pudding of whether it can successfully predict replications. I’ve got one of these underway where the technique predicts two studies that won’t replicate. More on this in coming months.
Enjoy, and please hit us with any and all feedback on the code, rationale, whatever. We are writing a paper on it now and are also rolling out a Shiny app that’ll do all the simulations from user-generated input.
Oh, and FWIW, one of the sets of studies featured in this blog is from one of my own papers (Gervais & Norenzayan, 2012, Science) previously dragged by yours truly in THIS POST. I reckon if I can't use the tool on myself, I shouldn't release it for public consumption.
[i] I can’t be the only one sick of these yet…surely. My heart goes out to the teams that work to respond to each iteration.
[ii] It could be all studies in a single paper, all studies in a meta-analysis, all studies in an issue of a journal. Really. Go nuts.
[iv] Specifically, p-curve seeks to return the average effect size of the statistically significant results submitted to it. This is somewhat distinct from the underlying ‘true’ effect size that many meta-analysts seem to be seeking (perhaps in vain, but alas…).
[v] It’s weird as a whale’s vestigial leg bones at first, but gets easier the more you play with it IMO.
[vi] That’s what landed us in this mess in the first place, after all.
[vii] Like the fine folks at p-curve have been doing.
[viii] Okay, I already said I was sick of it, so that was unnecessary. But finding issues with k = 1 is nifty, hopefully I can find a different example because my Power Pose has gradually become a Resigned and Fatigued Slouch of Boredom.