Sept. 22, 2025

P-Values: Are we using a flawed statistical tool?


P-values show up in almost every scientific paper, yet they’re one of the most misunderstood ideas in statistics. In this episode, we break from our usual journal-club format to unpack what a p-value really is, why researchers have fought about it for a century, and how that famous 0.05 cutoff became enshrined in science. Along the way, we share stories from our own papers—from a Nature feature that helped reshape the debate to a statistical sleuthing project that uncovered a faulty method in sports science. The result: a behind-the-scenes look at how one statistical tool has shaped the culture of science itself.


Statistical topics

  • Bayesian statistics
  • Confidence intervals 
  • Effect size vs. statistical significance
  • Fisher’s conception of p-values
  • Frequentist perspective
  • Magnitude-Based Inference (MBI)
  • Multiple testing / multiple comparisons
  • Neyman-Pearson hypothesis testing framework
  • P-hacking
  • Posterior probabilities
  • Preregistration and registered reports
  • Prior probabilities
  • P-values
  • Researcher degrees of freedom
  • Significance thresholds (p < 0.05)
  • Simulation-based inference
  • Statistical power 
  • Statistical significance
  • Transparency in research 
  • Type I error (false positive)
  • Type II error (false negative)
  • Winner’s Curse


Methodological morals

  • “If p-values tell us the probability the null is true, then octopuses are psychic.”
  • “Statistical tools don't fool us, blind faith in them does.”


References


Kristin and Regina’s online courses: 

Demystifying Data: A Modern Approach to Statistical Understanding  

Clinical Trials: Design, Strategy, and Analysis 

Medical Statistics Certificate Program  

Writing in the Sciences 

Epidemiology and Clinical Research Graduate Certificate Program 

Programs that we teach in:

Epidemiology and Clinical Research Graduate Certificate Program 


Find us on:

Kristin -  LinkedIn & Twitter/X

Regina - LinkedIn & ReginaNuzzo.com

  • (00:00) - Intro & claim of the episode
  • (01:00) - Why p-values matter in science
  • (02:44) - What is a p-value? (ESP guessing game)
  • (06:47) - Big vs. small p-values (psychic octopus example)
  • (08:29) - Significance thresholds and the 0.05 rule
  • (09:00) - Regina’s Nature paper on p-values
  • (11:32) - Misconceptions about p-values
  • (13:18) - Fisher vs. Neyman-Pearson (history & feud)
  • (16:26) - Botox analogy and type I vs. type II errors
  • (19:41) - Dating app analogies for false positives/negatives
  • (22:02) - How the 0.05 cutoff got enshrined
  • (23:46) - Misinterpretations: statistical vs. practical significance
  • (25:22) - Effect size, sample size, and “statistically discernible”
  • (25:51) - P-hacking and researcher degrees of freedom
  • (28:52) - Transparency, preregistration, and open science
  • (29:58) - The 0.05 cutoff trap (p = 0.049 vs 0.051)
  • (30:24) - The biggest misinterpretation: what p-values actually mean
  • (32:35) - Paul the psychic octopus (worked example)
  • (35:05) - Why Bayesian statistics differ
  • (38:55) - Why aren’t we all Bayesian? (probability wars)
  • (40:11) - The ASA p-value statement (behind the scenes)
  • (42:22) - Key principles from the ASA white paper
  • (43:21) - Wrapping up Regina’s paper
  • (44:39) - Kristin’s paper on sports science (MBI)
  • (47:16) - What MBI is and how it spread
  • (49:49) - How Kristin got pulled in (Christie Aschwanden & FiveThirtyEight)
  • (53:11) - Critiques of MBI and “Bayesian monster” rebuttal
  • (55:20) - Spreadsheet autopsies (Welsh & Knight)
  • (57:11) - Cherry juice example (why MBI misleads)
  • (59:28) - Rebuttals and smoke & mirrors from MBI advocates
  • (01:02:01) - Winner’s Curse and small samples
  • (01:02:44) - Twitter fights & “establishment statistician”
  • (01:05:02) - Cult-like following & Matrix red pill analogy
  • (01:07:12) - Wrap-up



[Kristin] (0:00 - 0:08)
Yeah, when we co-teach, you bring out this example, which I love, and I think it makes some people believe in magical creatures and other people hungry.


[Regina] (0:10 - 0:14)
The magical creatures being Paul, the psychic German octopus.


[Kristin] (0:19 - 0:42)
Welcome to Normal Curves. This is a podcast for anyone who wants to learn about scientific studies and the statistics behind them. It's like a journal club, except we pick topics that are fun, relevant, and sometimes a little spicy.


We evaluate the evidence, and we also give you the tools that you need to evaluate scientific studies on your own. I'm Kristin Sainani. I'm a professor at Stanford University.


[Regina] (0:43 - 0:48)
And I'm Regina Nuzzo. I'm a professor at Gallaudet University and part-time lecturer at Stanford.


[Kristin] (0:49 - 0:54)
We are not medical doctors. We are PhDs. So nothing in this podcast should be construed as medical advice.


[Regina] (0:54 - 0:59)
Also, this podcast is separate from our day jobs at Stanford and Gallaudet University.


[Kristin] (1:00 - 1:11)
Regina, in today's episode, we're going to do something a little different. We are actually going to talk about a few of our own papers, and we are finally going to unpack that huge statistical concept of the P-value.


[Regina] (1:12 - 1:18)
P-value stands for probability value, not penis value, just in case anyone was wondering.


[Kristin] (1:19 - 1:26)
Given our track record on this podcast, it's a fair question. I think we've mentioned more penises than probabilities so far.


[Regina] (1:27 - 1:34)
Guilty. P-values can be fun, too. Maybe not, you know, penis fun level, but close.


[Kristin] (1:35 - 1:39)
I'm going to stick to probabilities in my actual class. That's more my comfort zone, Regina.


[Regina] (1:40 - 1:44)
I am proud of you for even saying the P word here, Kristin.


[Kristin] (1:44 - 2:04)
Aren't you, though? Thank you. Now, normally we start with a claim like this supplement reverses aging, but today our claim is different.


We're going to focus on a claim that's about the culture of science and the tools we use in science. So the claim for today is that P-values are a flawed statistical tool.


[Regina] (2:05 - 2:22)
Yeah, it's fascinating because P-values are kind of the good guy and the bad guy at the same time. On the one hand, they have advanced modern science, yes. But on the other hand, they have been so tricky that they have been misunderstood and misused by researchers for decades.


[Kristin] (2:23 - 2:32)
Exactly.


So in this episode, we're going to pull the curtain back on P-values and significance testing. We're also going to liven it up with some Bayesian monsters and psychic octopuses.


[Regina] (2:34 - 2:40)
Sorry, no sex this time in this episode. And I think we've already peaked on the penis mentions.


[Kristin] (2:41 - 2:43)
Psychic octopuses are almost as good, though, Regina.


[Regina] (2:43 - 2:44)
Yeah, almost.


[Kristin] (2:44 - 3:10)
Regina, let's actually set up what a P-value is. Most biomedical research papers report P-values, and they use significance testing, which is directly related to P-values. So P-values are a powerful tool for helping us separate signals from noise.


Data are noisy. There's always some random fluctuation. And without a tool like this, it's really easy to get fooled by patterns that aren't actually real.


[Regina] (3:10 - 3:25)
People love to see patterns in noise, and that's just how we humans are wired. Absolutely. That's why P-values are so useful.


So the definition of a P-value is very technical and unsatisfying.


[Kristin] (3:26 - 3:28)
That might be the understatement of the podcast, Regina.


[Regina] (3:30 - 3:47)
So let's make P-values concrete, Kristin, with that example from our alcohol episode where I asked you to guess a number between 1 and 20 that I was thinking of. No, not guess. Read my mind.


The number I was thinking of, 1 to 20.


[Kristin] (3:47 - 3:53)
Right, and I failed miserably because I am not psychic, and it took me six tries to get the number.


[Regina] (3:53 - 4:00)
Six tries. We decided no one was going to mistake you for being psychic, not very impressive psychic anyway. Right.


[Kristin] (4:00 - 4:15)
And if we actually calculate the P-value for that little experiment, which we can do, it comes out to 30%, which means my performance was totally consistent with the hypothesis that I have no psychic powers.


[Regina] (4:16 - 4:48)
Right. So let's back up and talk about how we got that number. We didn't calculate the P-value back in the alcohol episode, but here, I'll start us off.


First of all, we had to set up what's called a null hypothesis, and usually this is the hypothesis that you are trying to disprove, the straw man hypothesis that you want to knock down. And I think of it as the skeptic's world or the boring world where nothing's happening, the hypothesis of no effect. So here, Kristin, the null hypothesis is that you are not psychic, you are just guessing.


[Kristin] (4:49 - 5:11)
And then to calculate the P-value, we asked what would happen if the null hypothesis is true and we could repeat the experiment again and again. And this is the part that's tricky for people. A P-value comes out of this thought experiment.


This is called the frequentist perspective. It treats probability as how often something would happen in the long run.


[Regina] (5:11 - 5:25)
Yes. And with this psychic ESP game, you can actually picture the hypothetical long run, right? You can picture us sitting in your beautiful backyard, tossing the ball for your dog and playing a thousand rounds of this guess my number game.


[Kristin] (5:25 - 6:15)
Nibbles would absolutely love it, but it would kind of take forever. So Regina, I actually had the computer simulate a thousand virtual games under the null hypothesis where I'm not psychic. And in about 30% of those games, I guess correctly in six tries or fewer.


We can also figure out the probability mathematically. Six guesses cover 30% of the numbers. So the probability of success within six tries is 30%.


That's our P-value. And the technical definition of a P-value is: the chance of seeing a result like this, or something even more surprising, if the null hypothesis were true. So here, the P-value is the probability that I would get the correct number in six tries or fewer if I was not actually psychic.
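
For listeners who want to try this at home, here is a minimal sketch of the kind of simulation described above, assuming the game as set up in the episode: a fixed secret number from 1 to 20, six non-repeating guesses, and success meaning the secret turns up within those six. The seed and counts are just for illustration.

```python
import numpy as np

# Simulate 1,000 rounds of the game under the null hypothesis: no psychic powers,
# just six distinct random guesses at a secret number between 1 and 20.
rng = np.random.default_rng(42)
n_games = 1_000
successes = 0

for _ in range(n_games):
    secret = rng.integers(1, 21)                     # the number Regina is thinking of
    guesses = rng.permutation(np.arange(1, 21))[:6]  # six non-repeating guesses
    if secret in guesses:                            # success: found within six tries
        successes += 1

print(successes / n_games)  # hovers around 0.30
print(6 / 20)               # exact answer: 0.30, the p-value from the episode
```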


[Regina] (6:15 - 6:45)
Right. It's a measure of how surprising a result is, again, if we assume the null hypothesis to be true, if nothing's actually happening. So a big P-value, like 30% here, it's just telling us that the result is not surprising at all.


The smaller the P-value, the more surprising. Bigger means not surprising. So it's consistent with this null hypothesis being true with, Kristin, sadly, you not having any psychic powers.


Exactly.


[Kristin] (6:47 - 6:53)
Let's crank it up, though, Regina. Imagine if you asked me to guess a number between one and a million, and what if I got it right on the first try?


[Regina] (6:55 - 7:24)
That actually might give me a heart attack. I'm scared to even try it, because the chance of that happening if you are not psychic is one in a million, right? And that's a P-value of 0.000001, basically a one with five zeros in front of it. And so in this case, we would definitely conclude that we have strong evidence against the null hypothesis.


[Kristin] (7:25 - 7:56)
Meaning we're either going to conclude that I'm psychic or maybe that I cheated. These results are clearly not consistent with chance guessing. Now, Regina, those two examples that we just did, they don't actually illustrate all that well why the P-value is useful, because my intuition already tells me that in the first case, yeah, Kristin's not psychic, in the second case, something weird's going on.


But where P-values actually are the most useful, Regina, is in the in-between cases where it's not obvious whether we're looking at signal or noise.


[Regina] (7:57 - 8:29)
And Kristin, what counts as a loud enough signal, where we draw the line for what is a small enough P-value, that is a matter of debate. And as we've talked about in previous episodes, the most common threshold is 5%. But there is nothing magical about that 5% line.


And in fact, some people use different lines. And in this episode, we will talk more about that 5% and the history behind it. Exactly.


[Kristin] (8:29 - 8:54)
Regina, let's talk about your paper on P-values first. That paper was published in Nature in 2014. It wasn't a research article.


It was a science journalism feature, and it got a ton of attention. I remember that it got a really high altmetric score, which is a measure of how much attention the paper got in the media and in social media. Regina, can you share that number with us?


[Regina] (8:55 - 9:01)
I feel a little embarrassed bragging like this, Kristin, but it was a little over 5,500.


[Kristin] (9:01 - 9:20)
Regina, I am happy to brag for you. An altmetric score of 5,500 is like 99.9th percentile. Very, very few papers ever get that high.


I think my highest altmetric score might be something like 300, just to give the contrast. It also got a lot of citations, too, right?


[Regina] (9:20 - 9:23)
Yeah, something over like 2,000 citations.


[Kristin] (9:24 - 9:27)
That is amazing. Hardly any papers have that many citations.


[Regina] (9:27 - 9:44)
I just think it shows the power of plain English, because statisticians had been shouting warnings about p-values for decades. But thanks to my editors, this piece made the ideas really approachable to a wider audience without talking down to them.


[Kristin] (9:44 - 9:50)
Yeah, exactly. This was useful to a lot of people. It actually even won an award from the American Statistical Association.


[Regina] (9:51 - 10:54)
It did. I was very proud of that. I think you can use stories to make abstract concepts come alive, right?


So, for example, in the piece, I had a nice lead, right? I opened with an anecdote about a psychology grad student who had a really sexy result. It was that people who were political moderates could literally see shades of gray better than extremists on either the left or right.


Wow, so not just metaphorical gray areas, but actual visual shades of gray. Right? Isn't that great?


So, his original study had a p-value of 0.01, which is highly significant, and everyone was excited. He was probably, you know, imagining fame and New York Times coverage, TED Talks. But then he and his advisor tried to replicate this study with a slightly larger sample, and this time the p-value was 0.59. Not even close.


[Kristin] (10:55 - 11:04)
So, they could no longer claim an effect. That must have been crushing for the researchers, and you used that story, Regina, to draw the readers in.


[Regina] (11:04 - 11:23)
Right. I wanted to get them curious about what went wrong there, and then from there I could go into the deeper, wonkier issues about p-values. And the big message that I wanted them to get was that p-values are surprisingly slippery. They look precise, but you've got all these hidden assumptions behind them.


[Kristin] (11:24 - 11:32)
Right. And it's a good reminder that a p-value of 0.01 does not mean you've discovered something reliable or true.


[Regina] (11:32 - 11:44)
Right. So, the point of my article was that p-values are not magical, and they can cause big problems if you misinterpret them and treat them like they are magic.


[Kristin] (11:44 - 11:55)
One of the things I loved about your paper, Regina, is that you dug into the history of p-values, and I think a lot of people don't realize the history and the context. Could you give us a quick taste of that history?


[Regina] (11:55 - 13:18)
Right. So, the p-value was developed in the 1920s, so 100 years ago, by R.A. Fisher, who was analyzing agricultural experiments in England, trying to figure out which treatments were worth following up on. And he came up with the p-value as an informal way to judge whether results were interesting enough for just a second look. It was supposed to be part of this whole flexible process where you're blending data and background knowledge.


The whole thing was just very synergistic. It was not supposed to be a rigid rule.


[Kristin]
Which is definitely not how we use it now.


And wasn't he the one who gave us the magic 0.05 number?


[Regina]
Yeah, that was Fisher. He was the first to suggest 0.05 and to introduce the word significance. But he never meant 0.05 to be this universal cutoff. He wrote things like, it's convenient to take 5% as a standard level of significance. So convenient.


It was just a guideline for deciding whether data needed more attention, whether you could take a second look at it. And he saw significance as a continuum, so smaller p-values meant stronger evidence. It was not a binary yes-no.


[Kristin] (13:18 - 13:40)
That makes sense. And 0.05 actually fits people's gut intuition. You know, Regina, in class, we do this little experiment where we ask students: if you keep flipping a coin and it keeps coming up heads, at what point do you start to get suspicious and think that it's an unfair coin? And for most students, it's when we get four or five heads in a row.


And that happens to line up right around a p-value of 5%.
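
For anyone who wants to check that intuition against the arithmetic, the chance of a fair coin landing heads k times in a row is 0.5 to the power k; a quick sketch:

```python
# Chance that a fair coin comes up heads k times in a row
for k in range(1, 7):
    print(k, "heads in a row:", 0.5 ** k)
# 4 in a row -> 0.0625, 5 in a row -> 0.03125, so the "suspicious" point brackets 0.05
```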


[Regina] (13:40 - 14:15)
Isn't that amazing? So there is something, I think, to the 0.05. But again, just as a, you know, a general guideline, maybe not a hard and fast rule. My old boss at the American Statistical Association had a great dating analogy for p-values. He said a p-value is supposed to be like swiping right on a dating app.


It means, hmm, this person looks interesting. Let's take a closer look. Let's go out to coffee, not, hey, let's get married.


But somewhere along the way, we turned it into a whole marriage proposal.


[Kristin] (14:16 - 14:21)
That's a great analogy, Regina. And it really shows how far we've drifted from Fisher's original idea.


[Regina] (14:22 - 14:49)
Right. So Fisher was shaping this whole thing, his whole p-value thing. And he had rivals: Jerzy Neyman, who was a mathematician from Poland, and Egon Pearson, who was a statistician from the UK. And they came up with their own system.


And the feud between Neyman and Pearson and Fisher was, Kristin, it was just legendary.


[Kristin] (14:49 - 14:51)
This is the tabloid side of statistics. It's the juicy academic gossip.


[Regina] (14:52 - 15:29)
Oh, I love it. So the feud got really heated, both in person and in what they were writing. So Neyman once described parts of Fisher's work as worse than useless, which is actually kind of a sick burn mathematically, like, oh, you know, he got you.


Yeah. It was double-edged on that. Fisher shot back: he said Neyman's approach was childish mathematics and said it was horrifying for the intellectual freedom of the West.


[Kristin] (15:32 - 15:42)
I love this stuff, actually. Listeners may forget some of the math, but they're going to remember these guys squabbling like a bunch of kindergarteners. So Regina, explain why they were fighting so hard.


[Regina] (15:42 - 16:24)
Yeah, at the heart of it, they had very different philosophies. Fisher wanted things fluid. The p-value was just one piece of evidence, a little flag to tell you to look closer.


And Neyman and Pearson wanted rules. They were trying to get rid of all of Fisher's squishy fluidity. And they were trying to help you make a yes-no decision based on your data.


So Kristin, it would be, are you psychic? Yes or no. And it's going to be nice and clean.


It's a decision. But in order to do that, you need to sit down ahead of time before you collect your data and make some hard choices and tradeoffs. So I have an analogy for this.


Are you ready?


[Kristin] (16:24 - 16:25)
Yeah, absolutely.


[Regina] (16:26 - 17:10)
Get ready. All right, Botox.


Neyman-Pearson is like the Botox itself. It gives you a forehead that is smooth as a baby's bottom. It is rigid and reassuringly pristine.


There's no wrinkles, no mess. And Fisher is kind of like natural eyebrows without Botox. And your whole forehead is expressive and flexible, but a lot of wrinkles and a lot of mess in there.


So I see the Neyman-Pearson slash Fisher clash as like arguing which is better, a nice, clean Botoxed forehead that you can't move or a flexible forehead that's all wrinkled.


[Kristin] (17:12 - 17:21)
Botox works surprisingly well for stats analogies. We've used Botox as an analogy for confounding before as well, Regina.


[Regina]
Yeah, there's something about Botox, isn't there?


[Regina] (17:22 - 17:36)
So the Neyman-Pearson framework also gave us the vocabulary that a lot of people remember from their intro stats class and have nightmares about, like alpha and beta and statistical power, type one and type two errors.


[Kristin] (17:37 - 17:43)
These things are so poorly named, type one and type two error. That is completely nondescript.


[Regina] (17:44 - 17:49)
I know, right? Kristin, I think they ought to let us rename all the stats concepts.


[Kristin] (17:50 - 17:54)
The two of us. We should totally rebrand it. It would be so much more fun and easy to remember.


[Regina] (17:55 - 19:41)
It would. So they're probably not going to allow us to give dating app names for all of these, but I do have a dating app analogy for type one and type two errors.


[Kristin]
I love it.


[Regina]
Okay. Type one error is a false positive. And this is like swiping right on the dating app when you should not actually do so, right?


Like for me, the guy has great profile photos and his promising bio, so I swipe right, we go out on a date and the date turns out to be a total dud. Not good. Should not have done this.


So the science version of this is you're publishing something that turns out to be false, right? You're putting it out there. So this is the false alarm, the false positive.


I call it the overenthusiastic error.


[Kristin]
Okay. So that's the type one error.


But what about type two error?


[Regina]
Type two error. Right.


That's the false negative. So this is the opposite. This is swiping left on someone who looked like, you know, kind of unimpressive online.


I didn't like his photos, his skin looked bad, you know, he's kind of cross-eyed, but I never knew this. In real life he would have been amazing, but I never met him because I swiped left. So I never got matched with him. He's the one, the hypothetical guy who got away.


And so the science version of this is you've got a real effect in your data, but you never found it. And this is the false negative. So I call this one the missed opportunity error.


[Kristin]
That's the type two error.


[Regina]
And I actually do think about these things when I am swiping on my app. What's the cost of a lot of false positives?


A lot of bad dates. But what's the cost of a false negative? I might have missed my soulmate.


[Kristin] (19:41 - 19:44)
Oh, real life decision theory at work, Regina.


[Regina] (19:44 - 20:04)
Uh huh. Uh huh. So Neyman and Pearson said, OK, you've got these type one errors, type two errors.


You have to decide ahead of time which error matters to you more and kind of set your tolerances. What can you handle? And you set up a decision rule based on that and you follow it and everything is very binary and very clean.


[Kristin] (20:04 - 20:27)
Just to connect the two approaches, Neyman and Pearson had you set a rule before you ran your experiment. You picked a threshold and if your results cleared that bar, you got to reject the null hypothesis and say there was an effect. Otherwise you did not.


That threshold was essentially based on a p-value cutoff. There's p-values under the hood in both.


[Regina] (20:27 - 20:30)
Right. But the way that it works in practice is very different.


[Kristin] (20:31 - 20:31)
Yeah.


[Regina] (20:31 - 22:02)
So remember, with Fisher, we were able to say that the smaller the p-value, the more evidence against the null. But under the Neyman-Pearson framework, once you have designed the experiment, the actual p-value does not matter. Let's say we chose 0.05 as our threshold. A p-value of 0.049, just under that threshold, counts exactly the same against the null hypothesis as p equals 0.00001. There are no grades of significance. All you can say is, I had an effect or I did not. So Fisher thought this Neyman-Pearson approach was bean counting, not good for scientific discovery, and Neyman and Pearson thought Fisher's way was very sloppy and prone to errors.


And in the end, so these guys were all feuding, scientists were getting impatient, and they did not realize that these were actually two whole different philosophies on how to do science. Right? Because, as you pointed out, both of them kind of have, you know, the p-value hidden under the hood.


Oh, it's just the same thing. So they mashed the two systems together. They took this easy peasy p-value and crammed it into this rule machine.


And that is when p less than 0.05 got enshrined, everyone knows this, as this universal threshold of statistical significance.


[Kristin] (22:03 - 22:13)
So in this new mashed up system, we get to calculate a p-value and then the rule that we're looking for is just if p is less than 0.05, but that's very different than what Neyman and Pearson actually intended.


[Regina] (22:13 - 22:33)
Right. Because we also set up 0.01 and then you get to say it's highly statistically significant and you can add as many, you know, stars as you want in your table. And Neyman and Pearson would not have gone for that at all.


[Kristin]
And Fisher wouldn't have liked this strict cutoff.


[Regina]
Nope. Nope.


They're all turning over in their graves right now.


[Kristin] (22:34 - 23:00)
All right. So that's some great history, Regina. Now I want to talk a little bit more about what was in your Nature paper, but let's take a short break first.


Regina, I've mentioned before on this podcast, our introductory statistics course, Demystifying Data, which is on Stanford Online. I want to give our listeners a little bit more information about that course.


[Regina] (23:00 - 23:10)
It's a self-paced course where we do a lot of really fun case studies. It's for stats novices, but also people who might have had a stats course in the past, but want a deeper understanding now.


[Kristin] (23:11 - 23:46)
You can get a Stanford professional certificate as well as CME credit. You can find a link to that course on our website, NormalCurves.com, and our listeners get a discount. The discount code is normalcurves10. That's all lowercase.


Welcome back to Normal Curves. We were talking about Regina's Nature paper on p-values.


Regina, in that paper, you laid out some of the big misconceptions with p-values and with this hybrid system that we've talked about. Let's walk through a few of them.


[Regina] (23:46 - 24:08)
So, first misconception is that people confuse statistical significance with practical significance. And part of the problem is indeed Fisher and his word choice, because in everyday English, significant means important or meaningful. But Fisher really meant it as just like, oh, worthy of a second look in there.


[Kristin] (24:08 - 24:16)
Right. Because you can have tiny, meaningless effects, really, really small, that are also highly statistically significant.


[Regina] (24:16 - 24:43)
We saw this in our age gaps episode, because the data showed statistically significant effect of age on a woman's romantic appeal, which sounds super depressing, until I went in and looked at the size of the effect. And they were rating romantic appeal on a one-to-five scale. And for a woman to fall from a perfect five to a rock-bottom one, it would take her 628 years.


[Kristin] (24:43 - 25:21)
Just to be clear, that 628 years was a wildly inappropriate extrapolation, but we had done that just to illustrate how tiny that effect was, even though the p-value was statistically significant.


[Regina]
Right. But practically meaningless.


[Kristin]
And I think this surprises people, Regina, but it's because the p-value doesn't depend only on the size of the effect. It also depends on your sample size and also on the noisiness of the data. So you could have a teeny tiny effect, but if the sample size was large enough, you would still get a small p-value, like a teeny tiny difference between the groups could be statistically significant simply because the sample size was huge.


[Regina] (25:22 - 25:47)
And that is why some statisticians have started replacing the phrase statistically significant with a new one, statistically discernible, meaning statistically detectable. We can see it. In fact, the new edition of the stats textbook that I use has completely adopted this language.


If the p-value is less than 0.05, the effect is statistically discernible; no more "significant" in the book.


[Kristin] (25:47 - 25:51)
Oh, interesting. That's such a small shift in language, but could clear up a lot of confusion.


[Regina] (25:51 - 26:52)
Yeah. Next issue with p-values in this hybrid system: p-hacking, which might be the only stats term with a definition in the Urban Dictionary, which usually skews a little more toward off-color terms. In the Urban Dictionary, they call it exploiting, perhaps unconsciously, researcher degrees of freedom until p is less than 0.05.


[Kristin]
Nice definition.


[Regina]
Yeah. But to give some context on that, remember, in that Neyman-Pearson framework we just talked about, everything was supposed to be set in advance, your sample size, your tests, your hypotheses. It was pretty rigid, and that rigidity is what protected you, but that is not how modern science usually works because of this mashup that we described.


Today, researchers actually have a lot of flexibility, what we statisticians very nerdily call researcher degrees of freedom.


[Kristin] (26:53 - 27:15)
Regina, it sounds great that they have a lot of freedom, but actually it can be quite dangerous statistically. Researchers think that in order to get published or promoted, they need to get their p-value under 0.05, so they end up doing a lot of things, either consciously or unconsciously, to push it below that line, basically gaming the system to get a p-value less than 0.05.


[Regina] (27:15 - 27:26)
We talked about some of these p-hacking things in previous episodes, especially about multiple testing. We talked about that one in the bad boyfriend section of the stats review episode.


[Kristin] (27:27 - 27:55)
Right. That was the case where, say, you were comparing an intervention group and a control group, and you compared them in a whole bunch of different subgroups, or you looked at a whole bunch of different possible outcomes. Every time you run a new statistical test and calculate a new p-value, you increase the chance that, even if your intervention does absolutely nothing, just by random chance you're going to get a p-value under 0.05. If you run enough tests, eventually you're going to get a p-value under 0.05. Right.
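
To see that effect in numbers, here is a rough sketch assuming an intervention that truly does nothing, tested on 20 independent outcomes; the sample sizes and seed are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

# Simulate many studies of a do-nothing intervention, each testing 20 outcomes.
rng = np.random.default_rng(1)
n_studies, n_tests, n_per_group = 5_000, 20, 30
at_least_one_hit = 0

for _ in range(n_studies):
    pvals = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_tests)
    ]
    if min(pvals) < 0.05:      # something crossed the line purely by chance
        at_least_one_hit += 1

print(at_least_one_hit / n_studies)  # roughly 1 - 0.95**20, i.e. around 0.64
```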


[Regina] (27:55 - 28:52)
Sometimes they call this torturing the data, torturing it until it confesses. Because then now it is very tempting to cherry-pick just the results that came up significant, might have been significant just by chance, and publish just those. That's the problem.


So it's not really the p-value's fault. It's our fault.


[Kristin]
It's not really a math failure.


It's a human nature failure.


[Regina]
And Kristin, this is why you and I get so crazy excited about pre-registration and registered reports and open data.


[Kristin]
Right. Very sexy.


[Regina]
Very, very sexy. The reason we get so excited about them is because they are forcing transparency on the researchers, and that transparency is what helps keep p-hacking in check so it doesn't spiral out of control.


[Kristin] (28:53 - 29:00)
It's a lot like what Neyman and Pearson had originally envisioned, where you set everything rigidly ahead of time so that you can't cheat later.


[Regina] (29:01 - 29:30)
Another problem with the way that we use p-values is that we treat 0.05 as a hard cutoff, as we have been talking about so far. So 0.04, great, publish; 0.06, forget about it. And this comes from when we're mashing together these two systems.


And my favorite line on this is, surely God loves 0.051 just as much as he loves 0.049.


[Kristin] (29:30 - 29:57)
Right, but he doesn't in Neyman-Pearson's system. I think that's the point, right?


And this false cutoff is what makes p-hacking so tempting. I've literally gotten emails from researchers saying, hey, if I throw out this outlier, then my p-value becomes 0.049. So I'm good to go, right? And I have to say, no, that's not the goal of your analysis. The goal is to figure out what the data are trying to tell you, not to make your p-value under 0.05. But unfortunately, sometimes that's become the goal. People lose sight of what science is really about.


[Regina] (29:58 - 30:24)
Maybe the trickiest problem with p-values is how often we misinterpret what they actually mean. So many people think p-value of 0.05 means that there is a 5% chance that the null hypothesis is true, a 5% chance that there is no effect. And they think, okay, if that's true, that means that there is a 95% chance that there is an effect.


[Kristin] (30:25 - 30:58)
So like, if I get a p-value of 0.05, they think it means that there's a 5% chance that my drug does not work, and therefore a 95% chance that my drug works. But this is absolutely wrong. And Regina, I think it's easy to see if we just go back to that ESP guessing game that we talked about earlier, remember the p-value there was 30%.


So if you were misinterpreting p-values in this way, you would say there was a 30% chance that I am not psychic, and therefore a 70% chance that I am psychic, which is clearly, obviously, wrong.


[Regina] (30:58 - 31:04)
Kristin, I keep my mind open about you because you surprise me so often, but I am pretty sure that you are not psychic.


[Kristin] (31:05 - 31:37)
Definitely not. Let me explain mathematically why that's wrong, Regina. Remember, as we talked about, a p-value tells us the probability of our result if we assume the null hypothesis.


To calculate that p-value, remember, we treated the null as if it's 100% true because it was our starting assumption. So we are not calculating the probability that the null is true or false. We're calculating the probability of our result, or anything more extreme, assuming the null is true.


[Regina] (31:37 - 32:13)
Right. So just to recap, a 5% p-value does not mean that there's a 5% chance that the results are a fluke, or a 5% chance that they happened just by chance, or a 5% chance that they are a false alarm. All of these are different ways of putting the same misinterpretation, the same wrong conclusion.


Kristin, this subtlety really trips up scientists. So I'd like to bring in an example that makes it more tangible, and Kristin, you have seen me use this example in class and in lectures before.


[Kristin] (32:13 - 32:21)
Oh, yeah. When we co-teach, you bring out this example, which I love, and I think it makes some people believe in magical creatures and other people hungry.


[Regina] (32:23 - 32:27)
The magical creatures being Paul the psychic German octopus.


[Kristin] (32:28 - 32:35)
I love Paul the octopus, but not everyone may be familiar with Paul the octopus. So tell us about Paul.


[Regina] (32:35 - 33:06)
Paul was hatched in 2008 and lived in an aquarium in Germany. And before each big football match, yes, soccer, his handlers would give him two clear food containers. One had a German flag on it, and the other had the flag of the opposing team in the upcoming match.


And whichever container Paul opened first, that was his prediction for who was going to win.


[Kristin] (33:07 - 33:22)
And this was a well-designed experiment because lunch was served before the game. So this was actually a prospective study. These were real predictions.


The handlers did not already know the answer. They publicized his picks in advance even, so they couldn't go back and change them.


[Regina] (33:22 - 34:03)
Mm-hmm. No cheating. And it was well publicized.


So for the 2010 World Cup semi-final, Paul picked Spain over Germany.


[Kristin]
Uh-oh.


[Regina]
And Germans were so mad, they started publishing octopus recipes.


Oh. Spain even offered him asylum if he wanted to, I don't know, swim over there. But he turned out to be right.


[Kristin]
Spain won, so nobody can fault him.


[Regina]
That is the amazing thing. So if you look at his entire two-year career as a psychic, 2008 Euro 2010 World Cup, Paul predicted 12 out of 14 matches correctly.


[Kristin] (34:04 - 34:07)
That's pretty amazing. What's the p-value associated with this?


[Regina] (34:07 - 34:26)
The p-value is roughly 0.01 or 1 in 100.


[Kristin]
Statistically significant for sure.


[Regina]
Right. But then here's the trap we just talked about. Does that mean there's only a 1% chance that Paul's record was a fluke and a 99% chance that he's psychic? No.
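
For the arithmetic-curious: treating each of Paul's 14 predictions as an independent 50-50 call under the null, the exact binomial tail comes out a bit under 1 in 100, the same ballpark as the figure quoted; a quick sketch:

```python
from scipy import stats

# P(12 or more correct out of 14) if Paul is just picking a container at random
p_value = stats.binom.sf(11, n=14, p=0.5)  # survival function: P(X > 11), i.e. P(X >= 12)
print(p_value)                             # about 0.0065
```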


[Kristin] (34:26 - 34:39)
That clearly must be wrong. Again, the p-value is not the probability that the null is true. It is not the probability that Paul is not psychic.


And those double negatives, I do realize, get confusing, and that's why people get confused here.


[Regina] (34:39 - 35:04)
They do. They get very confusing, and it is very human to flip things around and say, OK, if random noise would generate results this surprising or more surprising 1% of the time, then that must mean there is a 1% chance that random noise generated this. And it's really subtle, but you just flipped it around, and you can't.


You can't do that.


[Kristin] (35:05 - 35:22)
Exactly. The problem is, humans want to know what's the probability the drug works and what's the probability Paul is psychic. But you cannot answer that in a frequentist framework. To answer that, you need to use an entirely different statistical framework called Bayesian statistics.


[Regina] (35:22 - 36:14)
Right. Traditional frequentist statistics talk about long-run frequencies, as you mentioned before, Kristin. Bayesian statistics are a whole different philosophy from the frequentist statistics that we usually teach.


Bayesian statistics are named for Reverend Thomas Bayes, who liked doing math in his spare time. And Bayesian statistics lets you do something that frequentist statistics does not allow you to do. Bayesian statistics allows you to make probability statements about underlying reality.


That's what we want, as you just mentioned. But in order to do that, in the Bayesian framework, you combine your data with your prior belief or prior information about how plausible the hypothesis was in the first place. That's the important part, the prior belief, prior information.


[Kristin] (36:14 - 36:44)
Right. That's the key. And Regina, my prior belief about psychic octopuses is basically zero.


Let's say, for example, before I met Paul, I believed that there was a one in a trillion chance that psychic octopuses exist. In a Bayesian world, the added data from Paul would change my belief in psychic octopuses, increase it to one in a billion, but not to a 99% chance that Paul is psychic.


[Regina] (36:45 - 37:15)
Definitely not. And in the Bayesian statistics, we can see mathematically that the lower your prior probability is, like you, Kristin, with the one in a, you know, zillion, the more extraordinary your evidence needs to be, the stronger the evidence, Kristin, to change your mind about psychic octopuses. And I like the way Carl Sagan put this.


Extraordinary claims require extraordinary evidence. It makes sense.


[Kristin] (37:17 - 37:22)
And in Paul's case, 12 out of 14 just is not extraordinary enough.


[Regina] (37:23 - 38:28)
It's not. In my Nature paper, I made a graphic to show how p-values don't really tell us about underlying reality in the way that many people think. A lot of people assume, as we've talked about, p-value of 0.05 means there's a 95% chance the effect is real, but it doesn't. It's much lower.


So what the graphic showed is that the impact of a p-value depends on how plausible the claim was to start with, that prior probability. So, for example, if your prior odds on something happening, Paul being psychic or your drug working, were 50-50, and you then get a p-value of 0.05, that is going to bump up your plausibility to about a 71% chance that the effect is real. So 50% to 71%, not a 95% chance.


Let's say it was a long shot to begin with, not one in a zillion, but say 19 to 1, against. Then that same p-value of 0.05 only lifts your chances to about 11%.
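
For readers who want to reproduce numbers like those, here is a sketch assuming the calculation rests on the minimum-Bayes-factor bound, -e * p * ln(p), a standard way to translate a p-value into the best possible case for a real effect; the function name is just for illustration.

```python
import math

def prob_effect_is_real(p_value, prior_odds):
    """prior_odds: odds the effect is real before seeing the data,
    e.g. 1.0 for a 50-50 toss-up, 1/19 for a 19-to-1 long shot."""
    # Smallest possible Bayes factor in favor of the null (the -e*p*ln(p) bound),
    # i.e. the most evidence against the null that a given p-value can represent.
    min_bf_null = -math.e * p_value * math.log(p_value)
    best_bf_effect = 1.0 / min_bf_null           # best case for the effect being real
    posterior_odds = prior_odds * best_bf_effect
    return posterior_odds / (1.0 + posterior_odds)

print(prob_effect_is_real(0.05, 1.0))     # 50-50 prior     -> about 0.71
print(prob_effect_is_real(0.05, 1 / 19))  # long-shot prior -> about 0.11
```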


[Kristin] (38:29 - 38:54)
Regina, I love this graphic. I think it really opened people's eyes, and just a little foreshadowing to when we talk about my paper, notice that misinterpreting p-values in this way tends to make your results look way more optimistic than they actually are.


[Regina]
They do.


[Kristin]
But, Regina, our listeners might be wondering, why aren't we all Bayesian? Why don't we use Bayesian statistics if they're so much easier to interpret?


[Regina] (38:55 - 40:08)
Excellent question, Kristin. Really, excellent question. I actually wrote a whole article for New Scientist magazine about probability wars between frequentists and Bayesians.


And if you thought the Fisher/Neyman-Pearson feud was exciting, this blows that one out of the water. Fascinating stuff. So, first of all, there are a lot of Bayesian statisticians.


They have entire conferences and university departments just devoted to Bayesian approaches. But it's true that, for a long time, Bayesian methods were hard to do without modern computing. And also, at the same time, some people worried about the subjectivity of using prior probabilities, like how are you even supposed to do that?


And Fisher and Neyman and Pearson did not like the squishiness of that at all. Now it's better. We do have nice free software that makes Bayesian analyses much easier.


But most scientists still learn frequentist methods first, at least. And of course, they stick with what they already know. So, Bayesian statistics, it's growing, but yeah, frequentist thinking still dominates for now.


[Kristin] (40:09 - 40:11)
So, Regina, your paper got a lot of attention, and then what happened?


[Regina] (40:11 - 40:27)
Yeah, two years after that, the American Statistical Association put together a working group to try to bring some clarity around p-values. And they cited my article as one of the prominent discussions that pushed them to do so.


[Kristin] (40:28 - 40:44)
That's amazing, Regina. You're like changing statistical practice. I think it's surprising to people outside of the statistics world just how contentious this issue is and how much debate and discussion there is around this issue.


I think people assume that statistics is kind of fixed, but it's not.


And you were part of the working group, correct?


[Regina] (40:44 - 41:50)
I was invited to be the facilitator, which means I had a front-row seat to all of that drama. I had to herd about 20 of the statisticians most prominent in thinking about p-values, and these were Bayesians and frequentists all in the same room, including your colleague Steve Goodman from Stanford.


[Kristin]
I'm sure there was tabloid-level drama at this meeting, Regina. High passions, I'm sure.


[Regina]

Very high passions.


I was shocked. I was prepared for drama. It was even worse than that.


It was like herding cats, but, you know, to the nth power. So we managed to nail down a definition. I can't tell you how many hours of meetings, emails, drafts.


Okay, here is the definition of a p-value. Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.


[Kristin] (41:52 - 41:56)
They stuck the qualifier informally at the front. Was that to make somebody happy?


[Regina] (41:57 - 42:18)
You do not want to see what the formal definition was. Like, this is the most that we could agree on. I kept trying to get them, let's do some plain language, people.


Can we? Can we? No, we cannot.


[Kristin]
That was as simple as you could get, huh?


[Regina]
This was it. This was it.


Okay, but we did have other points that we published in the white paper.


[Kristin] (42:18 - 42:20)
You got people to agree on a few points?


[Regina] (42:22 - 42:56)
Yeah, but you know what? They're not shocking, these things. Okay, so six key principles, and most of them were basically don't, do not do this.


Don't treat p-values as the probability the null is true. We covered that. Don't make important decisions based solely on whether p is less than 0.05 or less than any threshold. Don't p-hack. You need to be transparent. Don't think a p-value tells you how big or important an effect is.


And my favorite, by itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.


[Kristin] (42:57 - 42:59)
In other words, a lot of the pitfalls we've just talked through.


[Regina] (42:59 - 43:17)
Yes, like it was kind of my paper. It was nice. So there was one positive statement that we agreed on.


Okay, are you ready?


[Kristin]
Yes.


[Regina]
A p-value can tell you how incompatible the data are with a given statistical model.


Isn't that super exciting?


[Kristin] (43:17 - 43:18)
It's riveting, riveting, Regina.


[Regina] (43:21 - 43:57)
So I want to just wrap it all up, because I had a nice kicker paragraph in that Nature paper. And I closed it with three questions from a Johns Hopkins statistician that I think are really wise. He said that after a study is done, a scientist wants to know three things.


One, what is the evidence? Two, what should I believe? And three, what should I do?


And then I had a quote from Steve Goodman saying, quote, one method cannot answer all these questions. The numbers are where the scientific discussions should start, not end.


[Kristin] (43:58 - 44:07)
That is a perfect place to end, Regina. The p-value isn't magic. It's just the starting place.


All right, let's take a short break.


[Regina]
And then we get to talk about your paper.


[Regina] (44:15 - 44:24)
Kristin, we've talked about your medical statistics program. It's just a fabulous program available on Stanford Online. Maybe you can tell listeners a little bit more about it.


[Kristin] (44:24 - 44:39)
It's a three-course sequence. If you really want that deeper dive into statistics, I teach data analysis in R or SAS, probability and statistical tests, including regression. You can get a Stanford professional certificate as well as CME credit.


[Regina] (44:39 - 45:08)
You can find a link to this program on our website, normalcurves.com.


Welcome back to Normal Curves. In this special episode, we are covering two of our very own papers.


And we were about to take this wild ride with Kristin, where she talks about some statistical debunking and statistical sleuthing that she did. Kristin, start us off.


[Kristin] (45:08 - 45:44)
I got involved a few years back in helping debunk a statistical method that was used in sports science for years and in hundreds of papers. And at its core, this method was built around misinterpreting p-values in that common way we talked about earlier, as the probability that the null is true. Regina, a major consequence of misinterpreting p-values in this way is that it makes results look far more optimistic than they really are.


Here, I believe that that resulted in flooding the literature with a bunch of false positives.


[Regina] (45:44 - 45:58)
Which is a huge deal. So, Kristin, you played a really terrific role in this statistical debunking. That's what it is.


But you also ended up really sinking a lot of time into this whole enterprise.


[Kristin] (45:58 - 46:11)
Way too much time. I didn't realize what I was getting myself into when I first happened upon this. And Regina, it's too bad I didn't get paid for all that time.


I could have used that time for paid work.


[Regina] (46:13 - 46:23)
Yes, but scientific researchers, we are grateful to you for your hard work. Thank you for your service. Okay, so tell us a little bit about this method.


What was it?


[Kristin] (46:23 - 47:16)
It's called magnitude-based inference, or MBI, and it was cooked up by two sports scientists, Will Hopkins and Alan Batterham, and they pushed it hard. Supposedly, they would go around at conferences and pressure young researchers to use it. By the time I got involved, they were even selling lucrative courses to teach the method.


And as we're going to see, it had a cult-like feel to it.


[Regina]
Wow, these guys were pretty ballsy, weren't they?


[Kristin]
Oh, yeah.


And they built it as an alternative to significance testing and as a solution to some of the problems that we've talked about with p-values.


[Regina]
Oh, that's how they sold it to people.


[Kristin]
That was one of their sales pitches, yes.


The whole thing was introduced on Hopkins' personal website, sportsci.org, which he tries to pass off as a peer-reviewed journal.


[Regina] (47:17 - 47:18)
What do you mean by that?


[Kristin] (47:19 - 48:11)
So it says peer-reviewed journal at the top of the website, but the, quote, peer-review is just friends reviewing friends. So at the time I got involved, it was usually just Batterham, quote, reviewing Hopkins. So in no way was it a peer-reviewed journal.


[Regina]
Again, did I mention ballsy?


[Kristin]
And on their website, you could download Excel spreadsheets that implemented MBI. Now there was no real documentation of what the spreadsheets were doing mathematically, no version control, but you could download these things, paste your data in, hit enter, and out would come a whole bunch of results.


[Regina]
Oh, Excel is just usually never a good idea.


[Kristin]
I spend a lot of time telling people why they should never use Excel to manipulate or analyze data. It makes it really easy to introduce errors and really hard to trace those errors.


[Regina] (48:11 - 48:29)
Yeah, Excel is bad, but I can see why this whole MBI thing caught on. It came in these, ooh, easy-to-use Excel spreadsheets, no coding needed, and the results looked extra optimistic. That's what you're saying, Kristin, right?


[Kristin]
Exactly.


[Regina]
So what's the backstory? How did you get pulled into this?


[Kristin] (48:29 - 49:49)
So in 2017, I happened to be on a statistics panel at a conference, and also on that panel was the journalist Christie Aschwanden. She was writing for FiveThirtyEight.com at the time.


[Regina]
Christie! Christie is great. We love Christie.


We've both worked with her.


[Kristin]
Exactly. She was working at the time on a book on sports science, and I analyzed data in sports medicine, so we got to talking, and then she asked me to take a look at this odd statistical method that she had come across in the sports science literature, MBI.


I had never heard of it before, but she sent me some papers, and I set aside a few hours one morning to look through them, and when I started reading them, I could not believe what I was seeing. I actually ended up taking my laptop around all day between my classes at Stanford, and I was furiously running code and typing emails to her, and we ended up having a long conversation the next day in which I explained some of the problems that I was seeing with MBI. I got off that call, though, and I wasn't really sure if she was going to write anything up for a lay audience about this because it's kind of technical and niche-y, so I felt like I should write up a little critique for the academic literature.


I thought it would be a few days of my life. Little did I know what I was actually getting myself into.


[Regina] (49:49 - 49:55)
Famous last words on that one. Yeah. So, tell us what your paper was about, then.


[Kristin] (49:55 - 50:07)
So I proved mathematically that MBI makes it easier to find false positives. In statistical terms, it inflates the type 1 error rates.
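
For readers who want the flavor of that argument, here is a rough simulation sketch, not the analysis from the paper itself. It assumes one commonly described "clinical MBI" decision rule (declare an effect "possibly beneficial" when the estimated chance of benefit exceeds 25% and the estimated risk of harm is below 0.5%, with a smallest worthwhile change of 0.2 standard deviations), which is a simplification of what the spreadsheets actually do.

```python
import numpy as np
from scipy import stats

# Under a true null (no effect at all), how often does an MBI-style rule declare a "clear" effect?
rng = np.random.default_rng(7)
n_sims, n_per_group, swc = 20_000, 50, 0.2   # swc = smallest worthwhile change, in SD units
declared_clear = 0

for _ in range(n_sims):
    a = rng.normal(size=n_per_group)   # control group, true effect = 0
    b = rng.normal(size=n_per_group)   # treatment group, true effect = 0
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n_per_group + b.var(ddof=1) / n_per_group)
    # MBI-style move: read the confidence interval as if it were a posterior for the true effect
    p_benefit = 1 - stats.norm.cdf(swc, loc=diff, scale=se)   # "chance" the true effect is > +0.2
    p_harm = stats.norm.cdf(-swc, loc=diff, scale=se)         # "chance" the true effect is < -0.2
    # Assumed decision rule: "possibly beneficial" (>25% chance of benefit, <0.5% risk of harm),
    # plus the mirror-image call in the other direction
    if (p_benefit > 0.25 and p_harm < 0.005) or (p_harm > 0.25 and p_benefit < 0.005):
        declared_clear += 1

print(declared_clear / n_sims)   # rate of "clear" calls under the null; compare with 0.05
```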


[Regina] (50:07 - 50:44)
Right. And just to remind everyone, a type 1 error is a false positive. It's when you conclude there is a real effect when, in reality, there is not.


And this too-many-false-positives problem is one of the big problems in significance testing. This gets back to John Ioannidis' essay on why most published scientific findings are false: false positives. But, Kristin, you're telling us that there were two researchers here who were actually developing a way, working hard, to make this problem even worse, not better.


Exactly. Yes.


[Kristin] (50:44 - 50:59)
That was even part of their sales pitch. They wrote, MBI has provided researchers with an avenue for publishing previously unpublishable effects from small samples.


[Regina] (51:00 - 51:15)
I love that. It makes it sound like the goal is just publishing, not, you know, uncovering scientific truth, maybe. Right.


So, what they did here was lower the statistical bar to make it easier to find things, right?


[Kristin] (51:15 - 51:34)
That's exactly what they did. And they were selling it as, look, we can find more things. But when critics came back and pointed out that if you can find more things, that necessarily is going to increase false positives, they doubled down and tried to deny that.


It's classic have your cake and eat it too nonsense.


[Regina] (51:35 - 51:39)
Yeah. Absolutely. So, other people had criticized this before you then.


[Kristin] (51:40 - 52:27)
Yes, I got involved kind of late. This had already been around since about 2006 and had already faced plenty of criticism. But none of those criticisms had really stuck because the developers of MBI always countered with smoke and mirrors.


And a lot of people in sports science just didn't feel qualified to evaluate a statistical method. So, the smoke and mirrors kind of worked. Yeah.


You know, Regina, I think the developers of MBI had good intentions in creating this method. They were trying to make something easy for sports scientists to use, and they were trying to avoid some of the problems p-values cause. It does seem like they were trying to be helpful, at least.


Right. But they did the opposite of some of the science Cinderella stories that we've talked about in previous episodes.


[Regina] (52:27 - 52:42)
Right. In the red dress episode, for example, we talked about researchers who accepted the critical feedback on their methods and publicly acknowledged the shortcomings of their previous work and actually improved. Right.


[Kristin] (52:42 - 52:51)
They owned what they did wrong, they made changes, and we loved and trusted them more because of it. MBI could have had a similar story.


[Regina] (52:51 - 53:10)
It could have, yes. All researchers need a bit of humility. Scientists are humans, and humans make mistakes, and science is a self-correcting process, but only if people are willing to self-correct.


So tell us a little bit more about what people previously had criticized.


[Kristin] (53:11 - 53:31)
The first critique to be published in a peer-reviewed journal actually came out back in 2008, and some statisticians pointed out that MBI misinterprets p-values and confidence intervals as Bayesian probabilities. So they suggested that instead of MBI, people should just do a proper Bayesian analysis.


[Regina] (53:32 - 53:38)
Common sense, but also a good burn. Yeah. But how did the MBI guys respond to that?


[Kristin] (53:38 - 54:16)
They fired back with a letter titled "An Imaginary Bayesian Monster."


[Regina]
A Bayesian monster, like a cookie monster? Is it cute?


What is it?


[Kristin]
Oh, I'm picturing like a fuzzy little cute monster, yes, Regina.


And kudos to them for the creative title, and I'll also give them kudos because they managed in their letter to slip in a reference to the Hitchhiker's Guide to the Galaxy. Pretty good. Beyond that, however, it was complete nonsense.


They tossed around some statistical jargon that sports scientists might not question, but it was complete statistical gibberish.


[Regina] (54:17 - 54:30)
It was. I read it, and to me the whole thing read like one big non sequitur. Like, this sentence did not really relate to the last sentence at all.


[Kristin] (54:31 - 55:02)
Yeah. For our statistician listeners (we do have a few), their argument was basically: we ran a bootstrap version of a t-test, and because it gave us the same answer as the t-test, we have therefore done a Bayesian analysis.


[Regina]
It was exactly that, yes.


[Kristin]
Okay, let me translate for everyone else in our audience.


That is like saying, we are playing chess, and I just showed that a knight is allowed to make an L-shaped move, therefore I am now playing poker.


[Regina] (55:04 - 55:19)
Oh, that's a good one. I was thinking, it reminded me of Inigo Montoya from Princess Bride. Do you remember his famous line?


You keep using that word Bayesian. I do not think it means what you think it means.


[Kristin] (55:20 - 55:51)
Yes, love a good Hitchhiker's Guide reference and a good Princess Bride reference. Another critique was published in 2015 by Alan Welsh and Emma Knight, two Australian statisticians. They went into the MBI spreadsheets and painstakingly worked out what every Excel formula was doing, and what they showed was that the spreadsheets were just running standard hypothesis tests, spitting out p-values, and then misinterpreting those p-values.


[Regina] (55:52 - 55:57)
Wow. So, Alan and Emma are really doing God's work for us here to uncover this.


[Kristin] (55:58 - 56:33)
Yes. God bless them. I would not have had the patience to go through those spreadsheets, but I could not have written my paper without them mathematically laying out what the spreadsheets were actually doing.


Regina, let's make this concrete. So, say we ran a trial where runners drank cherry juice or a placebo drink for two months, and then we measured their improvement in 5K times. One of the null hypotheses MBI would test is that cherry juice has no or only a trivial benefit on running times.


[Regina] (56:33 - 56:42)
Right. Remember, the null hypothesis is that skeptic's world. It's the opposite of what we're actually trying to prove.


[Kristin] (56:42 - 57:11)
But here it's a bit different from what we talked about before, because in this skeptic's world, the true value isn't just zero, it's zero or something very tiny. And Regina, let's imagine that the cherry juice group improved by 15 seconds on average, while the placebo group did not improve at all. Then we take the data and compare these groups using a standard hypothesis test, a t-test, and we get a p-value of 0.24, or 24%.


[Regina] (57:11 - 57:38)
Right. Now, let's interpret what that actually means. If cherry juice really had no effect or only a trivial one, and we repeated this experiment over and over, you'd expect to see results at least this extreme in favor of cherry juice about 24% of the time, which is not rare. It's about as surprising as flipping a coin and getting two heads in a row.
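A minimal simulation sketch of that interpretation. The numbers are assumptions invented for illustration, not from a real trial: 10 runners per group and a 28-second spread in 5K improvements, chosen so that a 15-second difference lands near p = 0.24.

```python
# Minimal sketch, not real data: group size and spread are assumptions chosen
# so that a 15-second difference between groups gives a p-value near 0.24.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group = 10      # hypothetical runners per arm
sd = 28.0             # hypothetical SD of 5K improvement, in seconds
observed_diff = 15.0  # cherry-juice group improved 15 s more than placebo

# The textbook two-sample t-test on that observed difference.
se = sd * np.sqrt(2 / n_per_group)
t_obs = observed_diff / se
df = 2 * n_per_group - 2
p_value = 2 * stats.t.sf(t_obs, df)
print(f"two-sided p-value: {p_value:.2f}")  # close to the 0.24 in the example

# The correct reading, by brute force: live in the skeptic's (null) world where
# cherry juice does nothing, rerun the trial many times, and count how often
# chance alone produces a result at least as extreme as the one observed.
n_sims = 20_000
cherry = rng.normal(0, sd, (n_sims, n_per_group))   # no real effect
placebo = rng.normal(0, sd, (n_sims, n_per_group))
t_sim, _ = stats.ttest_ind(cherry, placebo, axis=1)
frac_as_extreme = np.mean(np.abs(t_sim) >= t_obs)
print(f"share of null-world trials at least this extreme: {frac_as_extreme:.2f}")
# Both land near 0.24: not rare, and neither number is the probability that
# cherry juice is useless.
```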


[Kristin] (57:38 - 58:03)
Right. Regina, I could do that live right now here on the podcast and get two heads in a row, no problem. It wouldn't be all that exciting or surprising.


But here's the kicker. MBI interprets that 24% as meaning that there is a 24% chance that cherry juice does not improve running times beyond a trivial amount, and therefore that there is a 76% chance that cherry juice improves running times.


[Regina] (58:04 - 58:32)
No, no, no, no, no. This is mind-blowing. This is exactly that wrong definition we talked about earlier.


It's misinterpreting p-values as the probability that the null hypothesis is true or the probability it's a chance finding. Remember, this cannot be the interpretation, because if it were, that would mean there is a 99% chance that Paul the octopus is psychic. If you accept one, you have to accept the other.


That's it.


[Kristin] (58:33 - 59:28)
And people may want that to be true, but I think most of us can agree that unfortunately Paul was not psychic. Because he's an octopus or was an octopus.


[Regina]
Right.


[Kristin]
Regina, if you were to move into a Bayesian world and set a realistic prior, the probabilities that cherry juice works or that Paul the octopus is psychic based on the data we've discussed, those probabilities would be, in fact, much lower. Much lower.
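A back-of-the-envelope version of that Bayesian calculation, under assumptions invented for illustration: the same 10-runner groups and 28-second spread as the sketch above, a 50-50 prior that cherry juice has any real effect at all, and a prior spread of about 20 seconds if it does.

```python
# Back-of-the-envelope Bayesian comparison with made-up priors; an illustration
# of the idea, not the analysis from any paper discussed in the episode.
import numpy as np
from scipy import stats

observed_diff = 15.0             # seconds, cherry juice minus placebo
se = 28.0 * np.sqrt(2 / 10)      # same assumed standard error as the sketch above
prior_effect_sd = 20.0           # assumed spread of plausible effects, if any exist
prior_prob_effect = 0.5          # assumed 50-50 prior that cherry juice works at all

# How likely is a 15-second observed difference under each hypothesis?
like_null = stats.norm.pdf(observed_diff, loc=0.0, scale=se)
like_effect = stats.norm.pdf(observed_diff, loc=0.0,
                             scale=np.sqrt(se**2 + prior_effect_sd**2))

posterior = (like_effect * prior_prob_effect
             / (like_effect * prior_prob_effect + like_null * (1 - prior_prob_effect)))
print(f"P(cherry juice really works | data): about {posterior:.2f}")
# Roughly 0.45-0.5 with these assumptions, and lower still with a more skeptical
# prior; nowhere near the 76% that MBI reads off the p-value.
```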


Regina, MBI also set thresholds for declaring effects. So instead of calling them statistically significant, they just changed the word and called them substantial. But people treated substantial effects just like significant effects.


The catch was that these substantial effects came with type 1 error rates far higher than the usual 5% that we set. Welsh and Knight actually showed this in their paper. Mmm.


[Regina] (59:29 - 59:37)
And how did Hopkins and Batterham respond to this? I'm guessing with more smoke and mirrors, or did they just say, you know what, you're right, we're wrong?


[Kristin] (59:38 - 1:00:32)
No, smoke and mirrors, lovely, lovely smoke and mirrors, a whole magic show. They actually came back and wrote a paper in 2016 that got published. The story behind that is kind of funny because in her investigative journalism, Christie uncovered that that paper had been rejected at one journal.


And then the reason it got accepted at the second journal is that Hopkins called potential reviewers up and kind of pushed them into accepting the paper.


[Regina]
Oh, no. Oh, that is bad.


[Kristin]
Right. And in that paper, they claimed that MBI reduced false positives, that type 1 error, while at the same time also reducing false negatives. And that is actually mathematically impossible.


So what I wrote in my paper is, at face value, this conclusion is dubious. And thus, you do not need to be a statistician to be immediately skeptical of their paper.


[Regina] (1:00:33 - 1:00:36)
Ooh, zing. Nice job, Kristin.


[Kristin] (1:00:36 - 1:00:37)
Don't mess with me.


[Regina] (1:00:38 - 1:01:03)
False positives and false negatives always trade off. Think of that dating app. If you crank up your filters to only match with six-foot supermodels with six-figure salaries, you are going to cut down on bad dates.


Sure, that's going to be fewer false positives, but you will also miss out on a lot of perfectly nice people. That's the false negatives. That's the way it works.


[Kristin] (1:01:04 - 1:01:36)
Exactly. So, in my paper, I derived mathematical equations for the error rates of MBI, and I proved mathematically that MBI trades off false negatives and false positives just as the math says it must. I also showed that it inflates type 1 error at exactly the small sample sizes that it recommends people use. Regina, in essence, MBI was exploiting Winner's Curse, which is something we talked about back in the red dress episode.
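A toy Monte Carlo in that spirit is sketched below. It is not the derivation from the paper, and the decision rule (call an effect substantial when the computed chance of benefit tops 25% and the chance of harm stays under 5%) is only a simplified reading of published descriptions of MBI, not the original spreadsheets; the 0.2-SD smallest worthwhile change and the group sizes are likewise assumptions.

```python
# Toy simulation, not the paper's derivation: a simplified stand-in for an
# MBI-style decision rule, run on pure noise to estimate false-positive rates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trivial = 0.2          # assumed "smallest worthwhile change", in SD units
n_sims = 20_000

def false_positive_rates(n_per_group):
    """Simulate null-world trials (true effect = 0) and count how often each
    rule declares an effect."""
    a = rng.normal(0, 1, (n_sims, n_per_group))
    b = rng.normal(0, 1, (n_sims, n_per_group))
    diff = b.mean(axis=1) - a.mean(axis=1)
    sp = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    se = sp * np.sqrt(2 / n_per_group)
    df = 2 * n_per_group - 2
    # MBI-style "chances", read off the t sampling distribution as if it were
    # a posterior (the very misinterpretation discussed above)
    chance_benefit = stats.t.sf((trivial - diff) / se, df)
    chance_harm = stats.t.cdf((-trivial - diff) / se, df)
    mbi_call = (chance_benefit > 0.25) & (chance_harm < 0.05)
    # ordinary two-sided t-test at the usual 0.05 bar, for comparison
    p = 2 * stats.t.sf(np.abs(diff) / se, df)
    return mbi_call.mean(), (p < 0.05).mean()

for n in (10, 25, 50):
    mbi_fpr, ttest_fpr = false_positive_rates(n)
    print(f"n = {n:>2} per group: MBI-style {mbi_fpr:.0%} vs t-test {ttest_fpr:.0%}")
# With these toy settings the MBI-style rule declares a "substantial" benefit in
# well over 5% of pure-noise trials at these modest sample sizes.
```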


[Regina] (1:01:36 - 1:01:47)
That is fascinating. Winner's Curse, a favorite of mine, where small samples can very easily give off these flashy results that turn out just to be chance flukes.


[Kristin] (1:01:47 - 1:02:01)
Yes, and what they were doing was, in these small samples in particular, they were lowering the bar for distinguishing signal from noise, which made it even easier to turn those random blips into publishable stories.


[Regina] (1:02:01 - 1:02:05)
I bet these two were not very happy with your paper, were they?


[Kristin] (1:02:05 - 1:02:44)
They didn't love me, no.


[Regina]
Yeah. What happened?


What did they say?


[Kristin]
So, Regina, this is actually the story of how I got on Twitter in the first place. I had not been a Twitter user, but within hours of my paper going online, the MBI authors were attacking me on Twitter, claiming to have already disproven my analysis, so I got on Twitter just so I could cut through their smoke and mirrors.


Not long after this, the creators of MBI emailed me, and they invited me to have a debate about my paper that they were going to moderate on their website.


[Regina] (1:02:45 - 1:02:51)
Well, that's so generous of them. Perfectly fair and objective. What nice guys.


[Kristin] (1:02:52 - 1:04:32)
I know. I was like, thanks, but no thanks. We had a little email exchange, and as we were talking on email, I suspected that they actually hadn't even read my paper, so I asked a leading question to test them, and sure enough, they had zero clue what my paper was actually about. They then went on to publish our email exchange on their website without my permission, but of course, they conveniently left out the part where I exposed that they didn't understand my paper.


Regina, they kind of picked the wrong person to mess with, though. Had they just left it alone after I published my paper and not tried to throw up smoke and mirrors, I probably would have left it alone, too. The problem is, I have no tolerance for bullshit, and I am also unusually meticulous and persistent. So I decided, well, I'm going to make a video to explain my paper, because I do a lot of lecturing and I'm a good lecturer, and I made a 50-minute YouTube video walking through my paper.


That gave the paper some traction, and around the same time, Christie Aschwanden wrote about MBI, my paper, and my video on FiveThirtyEight.com, which gave my criticism even wider exposure. Now, the MBI proponents came back and wrote a lengthy critique of both my paper and Christie's article, and they posted it on their website. Almost none of it addressed the substance of my paper, the type 1 error issue.


Instead, they tried to attack me by calling me a, quote, establishment statistician, as if that was an insult. I was so flattered I put it on my Twitter bio, actually.


[Regina] (1:04:35 - 1:04:40)
Establishment statistician is a compliment. I know. Kristin, let's make T-shirts.


[Kristin] (1:04:40 - 1:04:55)
I am still waiting to get my establishment statistician card in the mail. I want an official badge, and I would wear it with honor because if there's ever any place you want to be establishment and not like a rebel, it's in math.


[Regina] (1:04:58 - 1:05:02)
I know what I'm getting you for your birthday then, spoiler alert.


[Kristin] (1:05:02 - 1:05:09)
Regina, when we do merch for the podcast, we need establishment statistician badges that people can buy.


[Regina] (1:05:10 - 1:05:14)
Normal curve, establishment statistician, I like the ring of that.


[Kristin] (1:05:15 - 1:06:16)
Now, it did turn out that MBI had a cult-like following, and there were a few devotees who, in response to my paper, wrote blog posts supporting MBI. Some of them were really fun, so I want to read a few excerpts from one of my favorites. His first line is, "I am not a statistician," and I kind of love the transparency there. He goes on to say, "I trust the analytical foundations of MBI, and my trust is based on the following."


"Alan and Will are amongst the most highly cited researchers in exercise and sport. Their knowledge of the inference literature is clearly beyond reproach, and their logic is impeccable."


[Regina]
Wow.


[Kristin]
He goes on to say, "Realizing the importance of MBIs is much like Neo deciding to swallow the red pill in the movie The Matrix. When making the decision to learn and understand MBIs, one can never revert to conventional methods unless forced to. You simply know better."


Isn't that great?


[Regina] (1:06:17 - 1:07:12)
I love the Matrix reference, though, gotta say, and the red pill. But the ironic thing is that they were actually blue-pilled by this MBI stuff, right?


So Neo gets offered the blue pill and the red pill. Blue pill is go back to your normal, everyday life, and red pill is, no, the scales fall from your eyes. And they were taking a blue pill that was disguised as a red pill.


But Kristin, you and I are the actual ones handing out the red pills, trying to show researchers what evidence really looks like.


[Kristin]
Yeah, you're right, Regina.


[Regina]
And that's kind of what we do in this podcast, right, red-pilling the research.


Like, remember our Stanford students this year? They said, now that my eyes have been opened, I can never see research the same way again. And they were disturbed. But you and I just looked at each other like, yep, mission accomplished.


Done. My work here is done.


[Kristin] (1:07:12 - 1:07:12)
Yep.


[Regina] (1:07:13 - 1:07:16)
Yep. So what happened to MBI in the end, then?


[Kristin] (1:07:17 - 1:09:30)
So this went on for a while. Some journals did ban the method. Batterham, I think, saw the writing on the wall, and he jumped ship pretty early.


But I think Hopkins might still be out there pushing this. I haven't really followed it in the last few years. But while I was still involved, he rebranded MBI as MBD, magnitude-based decisions, because, of course, if you change the name, you escape all criticisms, right?


And for a while, he was also telling people to just say that they had done a Bayesian analysis when they had used MBI, even though they really hadn't done a Bayesian analysis.


[Regina]
That's called a lie.


[Kristin]
Bending the truth, Regina.


So I ended up having to write more papers, and one of them was actually explaining why MBI is not a Bayesian analysis. Now, to be fair, there is one special case where Bayesian probabilities and p-values line up. And this is when you assume as a prior that all possible effects are equally likely.


So for cherry juice, that would mean it's just as likely that cherry juice can improve your 5K time by infinity seconds as by zero seconds, which is obviously absurd. Although, Regina, I really, really want to improve my 5K time by infinity seconds. Maybe somebody else has a way.


But just because these two things line up in that one case, that does not mean that every p-value can be read as a Bayesian probability, obviously.
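A quick numerical check of that special case, reusing the same made-up cherry-juice numbers as the earlier sketches and a known-variance normal model for simplicity.

```python
# Illustrative check of the flat-prior special case; the numbers are the same
# assumptions used in the earlier sketches, not data from any real study.
import numpy as np
from scipy import stats

observed_diff = 15.0             # seconds, cherry juice minus placebo
se = 28.0 * np.sqrt(2 / 10)      # assumed standard error of that difference

# One-sided p-value: chance of a difference this large if the truth is zero.
p_one_sided = stats.norm.sf(observed_diff / se)

# Flat-prior Bayes: with an (improper) uniform prior on the true effect, the
# posterior is N(observed_diff, se^2), so the chance the true effect is <= 0 is:
post_prob_no_benefit = stats.norm.cdf(0, loc=observed_diff, scale=se)

print(f"{p_one_sided:.3f} vs {post_prob_no_benefit:.3f}")   # identical
# They match only because the prior treats an implausibly enormous improvement
# as just as likely as a tiny one, which is the absurdity just described.
```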


Now one fun thing that happened, Regina, is after I wrote my first paper, people reached out to me, and I got to collaborate on the rest of the papers with some fun people. I even co-authored a paper with Andrew Vickers, who is a statistician whose writing I have always admired.


I want to just share one of the lines he put in our paper. He says, "Magnitude-based inference has certain superficial similarities with Bayesian statistics, but you cannot send a baseball team to England and describe them as cricket players just because they try to hit a ball with a bat. The incorrect claim that MBI is a Bayesian analysis is similarly just not cricket and does real damage."


[Regina] (1:09:31 - 1:09:36)
Ooh, I like it. So what would you say the whole takeaway is from this story?


[Kristin] (1:09:36 - 1:09:40)
You can go really far off track if you misinterpret p-values, Regina.


[Regina] (1:09:41 - 1:10:27)
That's true. That's kind of what it boils down to. I think this is the perfect illustration of why we care so much about these pedantic, fussy little definitions: because they really do matter.


[Kristin]
Right.


[Regina]
Kristin, I think we are ready to wrap up our p-values episode and rate the strength of evidence for our claim. And our claim today is p-values are a flawed statistical tool.


And we rate the strength of evidence for this claim using our five smooch scale. One smooch means little to no evidence in favor of the claim. And five means strong, very strong evidence in favor of the claim.


So Kristin, go first. What do you think?


[Kristin] (1:10:27 - 1:11:06)
Regina, I'm going to go with two smooches on this one. I don't think it's inherently the p-value's fault. I picture the p-value kind of like a Phillips head screwdriver.


It is good for what it's supposed to be used for. So p-values are great for helping us to distinguish signal from noise. And that's it.


But that's the job it set out to do, and that's the job it does well. I think the problem is really with humans. "Humans are flawed users of statistical tools" might be the better claim here.


They try to use the Phillips screwdriver where they were supposed to have used a hammer. I think that ultimately is the issue. So I don't think the tool itself is really the problem.


How about you, Regina?


[Regina] (1:11:07 - 1:12:18)
I explain it like those stupid doors where you don't know whether you're supposed to push or pull, you know, if you want to get out. Yeah.


And I feel like the p-value is like this. So they have to put a little sign saying push or pull. And there's this cognitive psychologist who says if you have to put a label on your door, then you're designing your door poorly.


And I feel like the p-value is just like that. It looks like it's the probability of the null, and we keep interpreting it that way because humans are going to human, and it's just what we do. So, Kristin, I'm going to go with 2.5 smooches. What's a half of a smooch? Like an air kiss?


I'm kind of right in the middle.


I think the problem is with humans, but it's hard to blame humans because humans are humans and we're fallible. It's hard to blame the p-values because they were just designed this way. When you put the two together, it's the ergonomics of statistics.


That's the problem. 2.5 smooches. All right, Kristin, I think it's time for some methodological morals.


For my methodological moral, I think I'm going to go with: "If p-values tell us the probability the null is true, then octopuses are psychic."


[Kristin]
That's a great moral, Regina. I love it.


[Regina] (1:12:18 - 1:12:40)
What do you have?


[Kristin]
Yeah, my methodological moral for today, Regina, is: "Statistical tools don't fool us, blind faith in them does."


[Regina]
Oh, I love this, Kristin. Nice job.


[Kristin]
This has been a really interesting episode, Regina, a little different from our typical episode, but I'm hoping that listeners are going to come away with a much clearer understanding of what a p-value actually is.


[Regina] (1:12:40 - 1:13:01)
I think so. But, you know, in some ways, Kristin, we have only scratched the surface of p-values. We have left out so many subtleties and tiny little details.


So just giving a nod to all of those p-value aficionados out there. We realize this is just the tip of the iceberg, but this has been delightful, Kristin. Thank you so much.


[Kristin] (1:13:02 - 1:13:15)
Yeah. People get surprisingly emotional about p-values, so be careful. You've been forewarned.


And we will put more papers about p-values in our show notes if you want deeper reading. Thanks, Regina.


[Regina]
Thanks, Kristin.


And thanks, everyone, for listening.