Dec. 1, 2025

Holiday Survival Guide Part 2: The survey study edition

Does the temperature of your coffee six months ago really predict whether you feel gassy today? This week we dissect a new nutrition survey study on hot and cold beverage habits that claims to connect drink temperature with gut symptoms, anxiety, and more—despite relying on year-old memories and a blizzard of statistical tests. It’s the perfect case study for our Holiday Survival Guide Part 2, where we teach you how to talk with Uncle Joe at the dinner table about one of the most common—and most fraught—study designs in science: cross-sectional surveys. We walk through our easy checklist for making sense of results, show how recall bias and measurement error can skew the story, and reacquaint you with nonmonogamous Multiple-Testing Dude, who’s been very busy in this dataset. A friendly, practical guide to spotting when researchers are just torturing the data until it confesses.


Statistical topics

  • Confounding
  • Cross-sectional studies
  • False positives
  • Measurement error
  • Multiple testing
  • PICOT / PIVOT framework
  • Recall bias
  • Research hypotheses
  • Sample size and power
  • Signal vs. noise
  • SMART framework
  • Statistical significance
  • Subgroup analyses
  • Survey design
  • Transparency and trustworthiness


Methodological morals

  • “When your measurement starts with ‘think back to last winter’ you might as well use a random number generator.”
  • “If the effect is only significant in certain subgroups in certain seasons for certain outcomes, it might just be a bad case of gas.”




Kristin and Regina’s online courses: 

Demystifying Data: A Modern Approach to Statistical Understanding  

Clinical Trials: Design, Strategy, and Analysis 

Medical Statistics Certificate Program  

Writing in the Sciences 

Epidemiology and Clinical Research Graduate Certificate Program 

Programs that we teach in:

Epidemiology and Clinical Research Graduate Certificate Program 


Find us on:

Kristin -  LinkedIn & Twitter/X

Regina - LinkedIn & ReginaNuzzo.com


  • (00:00) - Intro
  • (04:36) - Did they have real research hypotheses?
  • (10:29) - Observational or randomized experiment?
  • (20:09) - PICOT and PIVOT
  • (26:20) - Memory problems
  • (32:03) - Five outcomes and measurement problems therein
  • (36:56) - SMART
  • (41:50) - Multiple Testing Dude is having a great time
  • (52:36) - How big is the effect?
  • (59:06) - Wrap-up and Irish Coffee rating scale

[Regina] (0:00 - 0:17)
But he is almost certainly going to catch something not pretty from these thousands of women. And similarly, let's just say this study is going to catch some false positives. And I'm picturing, of course, the false positives as being oozy sores.


[Kristin] (0:23 - 0:46)
Welcome to Normal Curves. This is a podcast for anyone who wants to learn about scientific studies and the statistics behind them. It's like a journal club, except we pick topics that are fun, relevant, and sometimes a little spicy.


We evaluate the evidence, and we also give you the tools that you need to evaluate scientific studies on your own. I'm Kristin Sainani. I'm a professor at Stanford University.


[Regina] (0:46 - 0:52)
And I'm Regina Nuzzo. I'm a professor at Gallaudet University and part-time lecturer at Stanford.


[Kristin] (0:52 - 0:57)
We are not medical doctors. We are PhDs. So nothing in this podcast should be construed as medical advice.


[Regina] (0:57 - 1:03)
Also, this podcast is separate from our day jobs at Stanford and Gallaudet University.


[Kristin] (1:03 - 1:15)
Regina, our last episode was a survival guide for talking about science around the Thanksgiving dinner table with family. And it was so popular and so much fun that we decided to do a part two.


[Regina] (1:15 - 1:18)
Right. Thanksgiving may be over, but we've still got December.


[Kristin] (1:19 - 1:28)
That's right. In that episode, we walked your hypothetical, cantankerous Uncle Joe through a randomized trial that he read about on Facebook.


[Regina] (1:29 - 1:45)
Which was really a fun study. It won an Ig Nobel Prize. Pretty well done.


What they did is randomize college students to shoot back either a stiff cocktail or a refreshing glass of ice water and then go give an impromptu monologue in a foreign language.


[Kristin] (1:45 - 2:02)
Yeah, that was a fun paper. And we actually tried to replicate the experiment on some brave science journalists. So everyone should go check out that episode if they haven't already.


This week, we are doing the same idea, but it's for an observational study, not a randomized controlled trial.


[Regina] (2:02 - 2:10)
The study was published in September of this year, and it made headlines. In fact, I found the study on Fox News.


[Kristin] (2:11 - 2:13)
Prime Uncle Joe material.


[Regina] (2:14 - 2:35)
So the paper, quite fascinating, made a slew of claims that we'll discuss in this episode. But big picture, it claimed that whether you drink more hot drinks or more cold drinks can alter your mood, your sleep and your gut, specifically abdominal fullness and gas.


[Kristin] (2:37 - 2:46)
Oh, my. That's one we haven't talked about before on Normal Curves. But Regina, that claim is a little vague.


Can you give us anything more specific?


[Regina] (2:46 - 3:13)
Well, how about this one from Food and Wine? That's an online publication. The headline and the deck below it read this.


Feeling anxious? A warm drink might be a simple fix, scientists say. San Diego State University researchers found that participants who drank more warm beverages during winter reported fewer symptoms of insomnia, gas and depression.


[Kristin] (3:14 - 3:27)
Wow, this is actually, like, good for Starbucks then. Starbucks can reduce depression, insomnia and gas. I think this is an opportunity for ad revenue, actually, Regina.


Starbucks should definitely be advertising on Normal Curves.


[Regina] (3:27 - 3:35)
And not just because their cups are good proxy measurements for male equipment size, right?


[Kristin] (3:35 - 3:47)
I'd forgotten about that. Yes, Regina, for anyone who doesn't get that joke, they need to go listen to our male equipment size episode to get the inside Normal Curves joke there.


[Regina] (3:49 - 3:55)
And Kristin, I haven't even mentioned yet how the temperature of your hand plays into all of this.


[Kristin] (3:56 - 4:17)
Oh my, Regina. Uncle Joe is going to love this one. Okay, but Regina, enough of the PG-13 references.


Back to business here. Back to the paper.


[Regina]
For now.


[Kristin]
This paper was published in the British Journal of Nutrition. And it's called Cold and Hot Consumption and Health Outcomes Among U.S. Asian and White Populations.


[Regina] (4:18 - 4:19)
Super exciting title.


[Kristin] (4:20 - 4:26)
Yeah, it's a bit of what I like to call a blobfish, Regina. Health Outcomes. That is really vague.


[Regina] (4:26 - 4:34)
Well, I think they didn't want to actually say gas in the title. That's my guess. It's specific, but maybe not so cool.


[Kristin] (4:34 - 4:36)
They might have got more readers that way.


[Regina] (4:36 - 5:00)
Okay, first thing we need to do for Uncle Joe is to ask what specific things did the study actually set out to test? Because it's not enough to say, you know, just, hey, we wanted to see if the temperature of your coffee is important. Something vague like that.


You need to be specific and detailed with measurable questions you can answer.


[Kristin] (5:00 - 5:24)
As we mentioned in our previous episode, you need to find the specific research questions or hypotheses. And these are usually hanging out at the end of the introduction section. So this is a kind of a cheat, Regina, but you can skim through the introduction and find these at the end.


And you really don't even have to read the whole introduction, which frankly sometimes reads like a middle school term paper. And I'm not joking about that.


[Regina] (5:25 - 6:01)
You are not joking at all. That is absolutely right. And sure enough, we could do that in this paper.


We can go straight to the end of the intro section. And the authors say this. The aim of this study is to examine cold and hot food and beverage consumption among Asians and Whites and their association with multiple health outcomes, including self-reported mental health, depression, and anxiety symptoms, insomnia symptoms, and gut health, that is sensations of abdominal fullness and gas, period.


[Kristin] (6:01 - 6:18)
And right away, Regina, we hit a big red flag. Their aims statement that you just read, it packs in multiple outcomes, multiple exposures, and two racial groups. And when you tally it all up, it's actually about 40 mini hypotheses hiding inside that one sentence.


[Regina] (6:19 - 6:41)
Which is a major problem right off the bat, because it seems to be a case of just, you know, the whole buffet plate against the wall and seeing what sticks, right? But also, Kristin, quick teaser here. It was not just 40 associations that they tested.


It was literally thousands.


[Kristin] (6:41 - 7:13)
As we are going to see later in this episode, Regina, this is a clear-cut case of data torture. You know, if you torture the data long enough, it will confess. Or the image we like to use on Normal Curves: Multiple Testing Dude.


And this is from our Stats Reunion episode, where we personified multiple testing as the guy who tells every girl they're his number one while he is really playing the field. And apparently, Multiple Testing Dude is going crazy in this study, sleeping with thousands of women.


[Regina] (7:13 - 7:39)
Which is super fun. For him. I'm not going to judge him.


But he is almost certainly going to catch something not pretty for his thousands of women. And similarly, let's just say this study is going to catch some false positives. And I'm picturing, of course, the false positives as being oozy sores.


Sorry, but yeah.


[Kristin] (7:41 - 7:48)
That's a good image, Regina. We might want to put it just like that for Uncle Joe. Concrete terms that he can understand.


[Regina] (7:48 - 7:52)
I'm thinking, just don't tell Aunt Ruth about that whole thing.


[Kristin] (7:53 - 8:12)
Regina, this study really should have been labeled differently. There is a type of study where you're allowed to play the field like this, throw spaghetti at the wall and see what sticks. It's called an exploratory study.


And the whole point is to sift through a ton of variables and see if anything looks interesting.


[Regina] (8:12 - 8:46)
Ethical nonmonogamy can be OK, right? You've just got to be transparent about it. Right?


But the tradeoff is important, though, because when you're doing this, this nonmonogamy, you do not get to make big definitive claims anymore. You can only say, hey, we poked around, we poked around, we poked around and found a few possible signals. But they could also be just noise.


Do you like how I kept going right through that? I just powered through my unintended joke there.


[Kristin] (8:48 - 9:10)
Good job, Regina. Yeah, exactly. Exploratory studies have to be labeled and interpreted as exploratory.


And that means you cannot draw any big, firm conclusions. But the authors of this study did not label it as exploratory and they did not interpret it as exploratory. They did draw big, firm conclusions that got picked up by Fox News and other news outlets.


[Regina] (9:11 - 9:37)
Right. Big red flag. And, Kristin, it gets worse because this study was from a bigger survey.


That survey was called the Healthy Aging Survey. And there they asked a lot of questions, and I mean a lot of questions in one survey. By the way, they said the purpose of the survey was to identify, quote, novel aging-related risk factors in the general population.


[Kristin] (9:38 - 10:03)
Novel aging-related risk factors. What does that even mean? That is very broad and, again, another blobfish.


And clearly, the survey was not specifically about hot and cold beverages. It may not even have been about food in general. So not only did they throw hot and cold beverages at the wall to see what sticks, they probably threw cake, ice cream, ketchup, who knows what else.


[Regina] (10:04 - 10:06)
Cigarettes, booze, Viagra. Those are important.


[Kristin] (10:06 - 10:28)
Could affect aging. Yeah, I'm not sure Viagra will stick to the wall, but maybe if you've already thrown spaghetti and ketchup on, then it will.


All right. Yeah, this study is like a holiday buffet in which you just have to taste every single dish, dessert, and drink. It's going to give you a stomachache and, Regina, probably gas and maybe a hangover to boot.


[Regina] (10:29 - 10:53)
Which is a great combination, of course, when you have them together. Woo, great holidays. OK, so that is the quote-unquote research question, the buffet there.


The next thing, Kristin, that we ask in our holiday survival guide is whether the study was a randomized experiment or an observational study.


[Kristin] (10:54 - 11:06)
Yeah, Regina, most studies are observational because they're easier to do. You don't need to randomly assign people to drink an alcoholic cocktail or a glass of water. You can just observe what they're already doing.


It's simpler.


[Regina] (11:07 - 11:18)
It is simpler. So to figure out if this is a randomized experiment or an observational study, because that's important, you and Uncle Joe might just need to read the abstract, and that might be enough.


[Kristin] (11:18 - 11:28)
Right. Authors will usually use the words randomly assigned or randomized in the abstract if it was a randomized trial because that detail is so critical.


[Regina] (11:28 - 11:36)
Right. So quick scan of this abstract makes it clear it was a survey study, and survey means definitely observational. Right.


[Kristin] (11:37 - 11:50)
As we've talked about a lot in this podcast, the issue with observational studies is that the variables are all tangled up. Someone who frequently chooses hot drinks may do a lot of things differently than someone who often chooses cold drinks.


[Regina] (11:50 - 12:05)
Right. Like cold drink people, maybe they're all living in the South, right? It's hot, and they're drinking sweet tea and beer all day, and eating barbecued ribs, pork rinds, and maybe that's why they have gas.


[Kristin] (12:06 - 12:20)
Yeah, exactly. We have to think about that's what we call confounding, and that's going to be a major issue in this study. Observational studies also come in many flavors, and we need to point out the flavor to Uncle Joe, and this one is what we call a cross-sectional study.


[Regina] (12:20 - 12:50)
Right. So they come in lots of flavors. I love the way you say flavor.


It makes it just, you know, tastier instead of dry. So just for the record, other flavors that you might encounter, so you can recognize them, and we'll talk about them in future episodes. They are case control, cohort, nested case control, case crossover, cauliflower casserole.


Just kidding. Case crossover is real. Cauliflower casserole is not.


[Kristin] (12:51 - 13:10)
Right. These are all different kinds of study designs, and this one is cross-sectional, which means that all the variables were measured at a cross-section in time, and this is a particularly weak study design because we don't know the timing of events. We don't know what came first, which is a basic criterion of causality.


[Regina] (13:11 - 13:24)
Maybe people who are depressed, they are choosing to drink a lot of whiskey on the rocks because it makes them feel better, but we don't know if the whiskey preceded the depression or the depression preceded the whiskey. Exactly.


[Kristin] (13:25 - 13:46)
Regina, before we get into more of the study design details, can we take a step back and ask, why are they even looking at this? Like, why would you think that the temperature of your food does anything more than, say, give you brain freeze or a scalded throat? Right.


What is the proposed biological mechanism here to link temperature to mood or sleep or gas?


[Regina] (13:46 - 14:11)
Oh, gas. Right. It's not really clear.


The authors of the paper do say that in some traditional Asian cultures, especially Chinese culture, it's considered unhealthy to consume cold drinks and cold food. Warm or hot drinks are supposed to be better for you, especially in the winter, and they say that this belief is rooted in traditional Chinese medicine.


[Kristin] (14:11 - 14:39)
All right. Well, that's a topic for a whole other episode. We're not going to go there today.


But, Regina, I do have to admit that I have some familiarity with these kinds of traditions, because my soon-to-be, not soon-enough-to-be, ex-husband is from India originally. And when our kids were young, he did bring out some traditional food beliefs that were, let's say, unexpected to me. Like, you're not supposed to drink water and eat fruit at the same time.


[Regina] (14:40 - 14:41)
Oh, wow. Interesting.


[Kristin] (14:41 - 15:03)
Yeah, I have some theories here. I'm guessing that it might have some kind of biological origin, like drinking a lot of water might dilute your stomach acid, and therefore you would be less able to fight off the microbes hanging out on unsanitary fruit, something like that. So maybe at some point in history, that sort of rule had some real biology behind it.


[Regina] (15:03 - 15:05)
Okay, I could see that, maybe.


[Kristin] (15:06 - 15:20)
Yeah, I'm just keeping an open mind that these traditions may have their origin in something biologically real. But in this study, is there any biological basis? Is there anything the authors cite that might explain the proposed biology?


[Regina] (15:20 - 15:44)
Well, they did list a couple of references in that intro section for the traditional Chinese medicine link, and I followed them. One led to a site called topchinatravel.com and another to a blog post on the Arkansas Acupuncture Center. Not exactly the most cited scientific references around.


[Kristin] (15:45 - 15:55)
Right. Those are not your typical citations for a scientific paper. And they're really just telling us that there is a cultural basis here, but it's not telling us about a biological one.


[Regina] (15:55 - 16:08)
Okay, but they also did point to a couple of studies, actual studies, that suggested drinking cold water can possibly slow down gut transit time, I believe, in mice.


[Kristin] (16:09 - 16:20)
Oh, in mice. Okay, well, that might make some sense, but it's kind of a leap from mice exposed to ice water have slower gut transit times to Asians who drink iced tea in the summer have higher anxiety.


[Regina] (16:20 - 16:34)
Bit of a leap. Bit of a leap there. Oh, and by the way, Kristin, they also cited their own study that showed a link between increased ice cream consumption and menstrual pain.


[Kristin] (16:34 - 16:35)
Ice cream consumption?


[Regina] (16:36 - 16:56)
Wow. That was an actual published study? Yep, from the same data set published in 2023.


And gotta say the results don't seem to me to be really breaking news here because I would always crave ice cream and chocolate with PMS.


[Kristin] (16:57 - 17:30)
Yeah, I think cause and effect might be reversed here, Regina. So maybe the ice cream doesn't cause the PMS. Maybe you crave ice cream when you have PMS. Just maybe. Very plausible. But Regina, another issue, this PMS and ice cream paper, it's a paper written from the very same data set, the same aging survey.


So what they're doing here is salami slicing, which we talked about in the episode on ultrarunning and vitamin D. Salami slicing is when researchers take one study and they cut it up into as many thin papers as possible to pad their publication records.


[Regina] (17:30 - 17:40)
But the problem is slicing and dicing these results, it comes with the risk of false positives. So we really trust people less when they do this.


[Kristin] (17:41 - 17:58)
So now we don't just have a huge multiple testing problem in this one paper. We're now multiple testing from the same data set across multiple papers. So it's kind of like multiple testing dude has a base in D.C. and a base in California. He's a busy guy.


[Regina] (17:59 - 18:03)
He's my role model.


[Kristin] (18:04 - 18:33)
You go, girl. But yeah, this is a recipe for false positives, Regina. So somebody's heart is going to get broken and there may also possibly be oozing sores. I'm just saying before you do this, ask for a clean bill of health.


Mathematically, when you run this many statistical tests, you will get some significant results that don't reflect real associations, results that are just noise. It's inevitable when you run this many tests. So we're going to be very wary when we get to the results.
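Kristin's point can be checked with a quick simulation (a hypothetical Python sketch, not the paper's actual analysis): run many tests on pure noise at the usual 0.05 significance level, and about 5% come back "significant" anyway.

```python
import random

random.seed(42)

n_tests = 1000   # roughly the scale of comparisons in this study
alpha_z = 1.96   # two-sided z cutoff corresponding to p < 0.05
false_positives = 0

for _ in range(n_tests):
    # Two groups of 50 drawn from the SAME normal distribution,
    # so any "significant" difference is noise by construction.
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    mean_diff = sum(a) / 50 - sum(b) / 50
    se = (1 / 50 + 1 / 50) ** 0.5  # sigma = 1 is known here, so a z-test
    if abs(mean_diff / se) > alpha_z:
        false_positives += 1

print(false_positives)  # expect somewhere around 0.05 * 1000 = 50
```

With a thousand tests and no real effects at all, you still walk away with dozens of headline-ready "findings," which is exactly why the results section needs a skeptical eye.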


[Regina] (18:33 - 18:43)
Now I think we are ready to jump to that list of questions that we brought up in the last episode that help us analyze study design and walk Uncle Joe through that.


[Kristin] (18:43 - 19:24)
That sounds good, Regina. But let's take a short break first.


Welcome back to Normal Curves.


Today, we are teaching you how to talk about science with your family around the holiday dinner table. And we were walking Uncle Joe through an observational study about hot and cold beverage consumption. Regina, we were about to talk about the study design details.


[Regina] (19:25 - 19:47)
Right. In part one of that holiday guide, we gave the standard list of questions that you can use to walk through study design. Great for Uncle Joe.


The problem is it has an awful mnemonic. But let's see if people can remember it from the last episode. And Kristin, can we put in Jeopardy music here or just imagine Jeopardy music?


[Kristin] (19:47 - 19:49)
I'll try to splice it in, but no promises, Regina.


[Regina] (19:51 - 19:59)
Okay, we're imagining it. Okay, Kristin, everyone's had enough chance. Kristin, the answer is...


[Kristin] (20:01 - 20:27)
What is PICOT?


Population, intervention, comparison, outcome, time. You know, Regina, we actually got a fun listener comment. We love listener comments, by the way, so keep them coming.


This was from Larry Wilson on X, formerly Twitter. He suggested we change the name to PIVOT because the V could be for versus rather than the C for comparator. And I kind of like that.


It sounds more cool than PICOT.


[Regina] (20:28 - 20:56)
Absolutely, PIVOT. Let's do it. Thank you, Larry Wilson.


[Kristin]
Yes, PIVOT and SMART.


[Regina]
Oh, you can PIVOT and SMART. There's so much we can do with this.


Okay, so let's start off with P, which stands for...


[Kristin]
Population, not penises. Right, right.


[Regina]
Despite common assumption, P here is not for penises. We've already done studies where the population was penises, but today it's actual people.


[Kristin] (20:56 - 21:01)
Go check those out. Again, Starbucks reference there. Male equipment size episode.


Worth listening.


[Regina] (21:02 - 21:05)
And okay, the rest of them. I is...


[Kristin] (21:05 - 21:08)
Intervention, or as we're going to see today, could also be exposure.


[Regina] (21:09 - 21:12)
C used to be C, now it's V, is for...


[Kristin] (21:12 - 21:16)
Comparator. Or we could use versus and make PIVOT, which sounds a lot more fun.


[Regina] (21:17 - 21:18)
O is for...


[Kristin] (21:18 - 21:19)
O is for outcome.


[Regina] (21:20 - 21:26)
And T is for...


[Kristin]
T is for time.


[Regina]
Right.


Okay, let's start with P for population.


[Kristin] (21:27 - 21:33)
Right. So remember in P, we are asking who was studied, how did they recruit them, and how big was the sample?


[Regina] (21:33 - 21:45)
All right. And remember this study came out of a survey that was called the Healthy Aging Survey. And there they recruited adults in the U.S. aged 18 to 65.


[Kristin] (21:46 - 21:55)
Regina, I'm wondering why it's called the Healthy Aging Survey if they didn't survey old people. Like I would have guessed we were talking at least middle-aged, but 18-year-olds, really?


[Regina] (21:55 - 22:16)
It is one of the many mysteries of this study, Kristin. Now, for the survey, they did include people from all races, but they wanted the study sample to be at least 50% Asian because they said Asians are often underrepresented in surveys and health studies, so they wanted to oversample them here.


[Kristin] (22:17 - 22:19)
So how did they specifically recruit that group?


[Regina] (22:19 - 22:31)
Yeah, this is interesting. They went and put up flyers in Asian neighborhoods and Chinese churches, Indian grocery stores, and, weirdly enough, Asian college student groups.


[Kristin] (22:31 - 22:41)
Regina, college students? Really? I mean, I thought this was the Healthy Aging Survey.


I mean, college students really, I mean, they're aging, but they're not aging the way that we are aging.


[Regina] (22:42 - 22:53)
I know. Again, I'm chalking this up to one of the many mysteries, but Kristin, in the end, they got overall a little less than half Asian, so they did pretty well.


[Kristin] (22:53 - 23:21)
Okay, so close to what their target was. Regina, this is what we call, though, a convenience sample. It's kind of like they went to the convenience store and shopped for participants.


They got whoever was willing to fill out a survey proactively, and that usually is people who are more health conscious, more motivated about their health. So the results reflect a very select group of people, and they may not be representative of everyone. Specifically, they may not be very representative of Uncle Joe.


[Regina] (23:21 - 23:22)
Uh-huh.


[Kristin] (23:22 - 23:24)
And, Regina, what about sample size here?


[Regina] (23:25 - 23:42)
Yeah, they said that 885 people completed the survey, but in the end, they only kept about 620 people in their data set because they needed to throw out responses that didn't make sense, like people who had said they weighed under 50 pounds.


[Kristin] (23:42 - 23:45)
Ah, yeah, people don't always fill out surveys well.


[Regina] (23:46 - 24:00)
Right, right. But then in this particular paper, though, they used data from only 415 people. They only kept Asians and Whites, 212 Asians and 203 Whites.


They excluded the rest because they were of other races.


[Kristin] (24:01 - 24:10)
Oh, so we really have a select sample on top of a select sample, very much a convenience sample that may not be representative of everyone.


[Regina] (24:10 - 24:24)
Yeah. Another strange thing, Kristin, the Whites were on average age 50 versus the Asians age 31 on average. So, these populations are going to be pretty different.


[Kristin] (24:24 - 25:19)
Oh, wow, very different. I guess maybe they got a lot of college students after all. Because, yes, an Irish coffee may go down differently at 50 than at 31. And I'm wondering if maybe they used different recruitment strategies for the white participants that led to these vastly different groups.


And that means, Regina, we really can't make any direct comparisons between the groups. And in fairness, I don't think they tried to. I think they did all the analyses separately.


So maybe it's OK, but it's a little strange. Another quick thing I want to point out, Regina, defining race, especially on surveys, is a total quagmire. Like, how would my kids be classified?


They are half Asian and half white. So are they just going to be excluded from the survey altogether? And if not, which group are they going to be in?


They're in both, really. And there are a lot of mixed families out there. So this kind of dichotomizing race, really problematic.


[Regina] (25:20 - 25:26)
Yeah, which is statistically a whole fascinating discussion in itself, but for another day.


[Kristin] (25:26 - 25:27)
Future episode, yeah.


[Regina] (25:27 - 25:36)
Yeah. OK, that is population. Now we are getting on to the I or the E, intervention or exposure.


Right.


[Kristin] (25:36 - 26:05)
Since this is observational and not experimental, technically there is no intervention. So instead of intervention, we think about what's the exposure. Like, I drank hot drinks, so I was exposed to hot drinks.


And technically then it's not an I, it's an E for exposure. But it doesn't really matter because we still pronounce it the same, PECOT. And the main exposure here was the temperature of the beverages that the participants drank in the summer and also in the winter.


[Regina] (26:05 - 26:20)
Right. And how did they measure that? Because in nutrition studies, they often ask people like every day or every week to report what they consumed during that period to get it really fresh.


So did they send out multiple surveys?


[Kristin] (26:20 - 27:10)
No, they didn't. I mean, that's a better way because it actually is getting it in real time. But here they just collected all their information on one giant survey administered at one time point.


And they were asking people to remember back to what had happened over the past year. And that's actually really hard. And to illustrate that, Regina, I'm going to have you try to answer one of their questions.


Kind of like a memory test. So here we go. Regina, I want you to think back to how often you had cold drinks last winter.


And here are your choices. Never or less than once per month. One to three times a month.


Once a week. Two to four times a week. Five to six times a week.


One to two times a day. Three to five times a day. Or six or more times a day.


Cold drinks last winter. Go.


[Regina] (27:12 - 27:47)
Well, clearly six whiskeys on the rocks six times a day. I mean, what else would it be? I mean, maybe sometimes it was only five times a day.


Clearly, I have no idea because what counts as a drink, right? A sip of water or I'm filling up my Stanley cup. I would have to make a wild guess here.


And I don't drink chilled water generally, so I can't add up my water cups per day. So I guess I'm just counting my cocktails.


[Kristin] (27:49 - 27:53)
Okay, fess up. How many then? How often were you having cocktails last winter?


[Regina] (27:55 - 27:57)
Random number generator here.


[Kristin] (27:57 - 28:00)
So you would just pick randomly, you think?


[Regina] (28:00 - 28:01)
I would. I would.


[Kristin] (28:01 - 28:02)
Okay, you're not going to select one for us?


[Regina] (28:04 - 28:12)
If I needed to, I would say one to two times per day because that seems average.


[Kristin] (28:12 - 28:34)
Seems normal, yeah.


Well, Regina, I do know that you had a few cold beverages last winter because when you were at my house, I was serving green juice and dirt juice and they were definitely served chilled. You call it dirt juice. It's carrot juice, actually.


But they were served chilled because they are disgusting if they are not chilled. So at least once in six months, last winter, you had cold beverages.


[Regina] (28:35 - 29:05)
Kristin, I think you're right. And they were delicious. Thank you for the dirt juice cocktail.


I think, Kristin, though, this really points to the whole ridiculousness of this question and the administration and what we are asking participants to do. Like I said, I would choose at random or choose something normal or I would report a high number because we're supposed to drink a lot of fluid and that might feel like the right answer. So I know that I had a whiskey on the rocks last night and that's about all I can remember.


[Kristin] (29:06 - 29:10)
Oh, you did? Wow. Oh, ah, with the ex-boyfriend.


That's a whole other story.


[Regina] (29:12 - 29:16)
Not for an episode, however. Okay, okay, fine.


[Kristin] (29:17 - 29:36)
I'm not going to get the details now. All right. All right.


But yes, this is just pointing to that it's really hard to do this and it's probably not going to be accurate at all. And, Regina, the researchers also asked about hot and cold foods in addition to hot and cold drinks. And that's also hard to remember.


Yeah.


[Regina] (29:36 - 29:48)
All right. Like how many times did I have sushi for dinner rather than fish and chips?


[Kristin]
Yeah, exactly.


[Regina]
Really, you cannot get anything other than random noise out of these questions. They're full of error.


[Kristin] (29:48 - 29:49)
Slightly better than random noise.


[Regina] (29:50 - 29:50)
Yeah.


[Kristin] (29:50 - 29:59)
Moving on now from I or E to C for comparison or V for versus. What did they actually compare here?


[Regina] (30:00 - 30:21)
Right. Like in the alcohol experiment that we talked about last time, we had two groups. We had the cocktail group and then we had the iced water group.


And here they did not do that. Here they compared people who had a lot of hot drinks to people who had fewer hot drinks. And then the same with cold drinks.


And then cold meals and hot meals.


[Kristin] (30:22 - 30:29)
So the exposure was the high consumption of these things. And the comparator was low consumption of these things.


[Regina] (30:29 - 30:36)
Right. High versus low. And we'll get into more details of how they defined high and low when we look at the results later.


[Kristin] (30:37 - 31:07)
Yeah. And spoiler alert, one of the tricky parts here is that there are a lot of ways to define high and low. And researchers can exploit that to increase the number of statistical tests they run in order to increase their chances of finding false positives.


Right. So you could try 10 different definitions of high and low and only report the one that gave the most positive results. And that would be an example of p-hacking.


And it's going to lead to false positives.


[Regina] (31:08 - 31:15)
This is good for Uncle Joe to know. And we'll walk through later and see how absolutely nuts this is.


[Kristin] (31:15 - 31:19)
Yeah. All right. Next is the O for outcome. What was the outcome here?


[Regina] (31:19 - 31:30)
Not just one outcome, but five different outcomes, which we've mentioned before. Anxiety, depression, insomnia, abdominal fullness, and, of course, gas.


[Kristin] (31:30 - 31:41)
And five is a lot. And those are just the outcomes they reported. They may have analyzed more outcomes, but just not reported them.


And how did they measure these five outcomes, Regina?


[Regina] (31:41 - 31:50)
Right. For depression, anxiety, and insomnia, they used standardized questionnaires that are often used in other research to get numerical scores.


[Kristin] (31:51 - 32:03)
And these standard questionnaires are widely used. I'm very familiar with them. I've analyzed those data myself.


Now, it is self-report, so they are limited. But at least those scales are well validated.


[Regina] (32:04 - 32:18)
Right. But for abdominal fullness and gas, they had a five-point scale where you had to say, how often do you have gas? Never, rarely, sometimes, often, or always.


[Kristin] (32:19 - 32:29)
Okay. That one is kind of funny, because do people really have fullness and gas always? That must be very unpleasant.


[Regina] (32:30 - 32:51)
That must be very unpleasant. That's got to be tied to mood, right? Like, I would certainly be quite depressed and anxious if I had gas always.


But, okay, it's silly maybe to ask it this way, but how else are you going to measure gas objectively, like numerically?


[Kristin] (32:51 - 33:10)
Well, I mean, there's probably ways to measure it objectively, you know, not on a self-report questionnaire, right? You could ask people around you. You could have a little machine of some sort that, you know, you wore for a day, like a blood pressure monitor.


[Regina] (33:12 - 33:17)
Now I really want to do an episode on flatulence now. It's begging us.


[Kristin] (33:18 - 33:23)
Oh, are we going to divert into like 13-year-old boy humor now, Regina?


[Regina] (33:23 - 33:32)
Does it make it any better if I went online and found a fun fact about flatulence? Does that make it more, you know, highbrow?


[Kristin] (33:33 - 33:38)
Scientific, yes. So you did. Okay, I want to hear the fun fact now, Regina.


Good job.


[Regina] (33:39 - 33:45)
Well, it's kind of scientific. It's linguistical. How about this?


Yeah, have you ever had pumpernickel bread?


[Kristin] (33:46 - 33:47)
Oh, yeah, I like pumpernickel bread.


[Regina] (33:48 - 34:18)
Why? Okay, because it literally translates into devil's fart. Pumpern is the German word for to break wind or to fart, and nickel is a short form of the name Nicholas, but also used historically to refer to a goblin or the devil Old Nick.


And yeah, I guess it was just kind of undigestible.


[Kristin] (34:19 - 34:31)
Regina, I think we could do an experiment here. This would be an easy publication. Pumpernickel bread, no pumpernickel bread.


What would be the control? Like white bread and measure of flatulence. I'm sure it would get published.


[Regina] (34:32 - 34:36)
I am so looking this up on PubMed to make sure no one's...


[Kristin] (34:36 - 34:38)
I didn't know that that was the origin of the name.


[Regina] (34:38 - 34:45)
Okay, so that was my attempt to elevate the 13-year-old boy humor up to, you know, at least college student level.


[Kristin] (34:46 - 34:48)
Right. I don't know if we succeeded, Regina, but...


[Regina] (34:48 - 34:53)
But I think this is also relevant to holidays with Uncle Joe, right?


[Kristin] (34:54 - 35:09)
Oh, I imagine that Uncle Joe probably experiences gassiness sometimes. Yes, that's my picture in my head. And as we are thinking about holiday gatherings and getting together with your in-laws and such, sometimes gas is an issue.


[Regina] (35:10 - 35:19)
Yes. I'm picturing the holidays. Beautiful picture, right? Holiday wreaths, a lot of Yankee candles, and some pine air freshener.


[Kristin] (35:21 - 35:36)
All right, bringing it back here, Regina. The point is that this particular question about abdominal fullness and gas, it's pretty arbitrary. It's not a great measure probably of gut health.


I don't think it's necessarily going to be very accurate.


[Regina] (35:37 - 35:44)
Okay, so overall, we're talking about a lot of measurement error, and that's really important to keep in mind when we are talking about results. Yep.


[Kristin] (35:44 - 35:47)
Yeah, if everything is measured with a ton of error, it's all going to be noise.


[Regina] (35:48 - 35:52)
Okay, finally, T. T is for time. What was the time span here?


[Kristin] (35:53 - 36:01)
Well, Regina, this is really just a snapshot of one point in time, although they are asking them to try to remember back as far as a year.


[Regina] (36:01 - 36:10)
Right, but some things about last week and some things about last summer and, you know, an entire year. So, all these different time scales, I know that I could not do that reliably.


[Kristin] (36:10 - 36:11)
Yeah, I don't think I could either.


[Regina] (36:12 - 36:58)
All right, so that was the standard list of questions we use for evaluating study design, PICOT or PIVOT. And, Kristin, got to say, this is not looking like a great study. In fact, I think we could probably just stop Uncle Joe right here and say, no, Uncle Joe, you cannot use this study that you found on Fox News to justify having your third Irish coffee tonight and say it's for your, quote, gut health.


No. But, Kristin, you and I are nothing if not conscientious. And just in case Uncle Joe is as stubborn as many relatives are, we are going to keep plowing on.


We're going to review now our list of questions to ask about results.


[Kristin] (36:58 - 37:01)
Right, because we have to use our SMART scale, Regina.


[Regina] (37:02 - 37:24)
Yeah, I want to point out this list was Kristin's genius brainstorm, where she made a mnemonic of questions to spell out SMART. But I keep thinking we need to come up with three more questions to make it SMART ASS. That is definitely the role I am tempted to slip into personally after too much time around a rowdy family dinner table.


[Kristin] (37:24 - 37:28)
I am going to work on that, Regina. I'm sure I can come up with three more questions.


[Regina] (37:28 - 38:16)
Okay, so before we get into our SMART list, I think this is a good time for a break.


Welcome back to Normal Curves. Today we are at the holiday dinner table, and we are helping Uncle Joe walk through an observational study.


This is one about hot and cold beverages and how they affect our mood, sleep, and digestive system. And Kristin, you're about to lead us through the SMART questions about the results.


[Kristin] (38:17 - 38:42)
Right, and just to remind everybody, SMART stands for S is for Signal, M is for Magnitude, A is for Alternative Explanations, R is for Reality Check, and T is for Trustworthiness or Transparency. And let's start with S, Regina. So, S is for Signal.


How strong was the signal compared to the noise in the data? Were there any statistically significant results here?


[Regina] (38:43 - 39:40)
The answer is yes, there were definitely statistically significant results, but this is where multiple testing dude and all of his issues really come into play. But before we get into all of that, I first want to give just the broad outlines of what they found so we have some context here. And here is how Fox News reported it.


Drinking more cold beverages during warmer months was associated with increased anxiety, more sleep disturbances, and greater feelings of abdominal fullness among Asian participants. The white participants, however, reported less depression, enhanced sleep quality, and fewer gastrointestinal problems when they drank hot beverages in winter. And Kristin, by the way, when they say fewer gastrointestinal problems, of course, they mean less gas.


And I'm guessing you can't say that on Fox News. You can't say, you know, breaking wind or cutting mustard.


[Kristin] (39:41 - 40:19)
But it's all old people who watch Fox News, Regina. So, is that not a topic of interest to them? And they could just use the word gas.


But all joking aside, these are very specific findings, right? In Asians, in the summer months, cold beverages are bad for sleep and anxiety and fullness. But in Whites, warm beverages in the winter are good for depression, sleep, and gas.


And we need to stress to Uncle Joe that when you see results that are this particular, like only in Asians, only in summer, only anxiety, not depression.


[Regina] (40:19 - 40:23)
Only under a full moon, only on Thursdays, only when you cross your fingers.


[Kristin] (40:23 - 40:30)
That extreme specificity should ring some poor research alarm bells with you. You've got to be skeptical.


[Regina] (40:31 - 40:40)
This gets back to multiple testing dude. And, you know, Kristin, this is the busiest we have seen multiple testing dude in any of our episodes, I think.


[Kristin] (40:40 - 41:20)
He is having a party in this paper, Regina. Yes. There's just so many things here.


And the fundamental problem with multiple testing dude, as we've talked about before, if beverage temperature, for example, has no effect on anything, but you run 20 statistical tests about beverage temperature, and you use that standard threshold of 0.05 for statistical significance, then you are expecting one false positive because 5% of 20 is one. And if you run 100 tests, you expect five false positives. And if you run 1000 tests, you expect 50 false positives.


That gives you a lot of random false positives to cherry pick and make a story out of.
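The back-of-envelope math Kristin just walked through can be sketched in a few lines (a minimal illustration of the expected-false-positive arithmetic, not anything from the paper's own analysis):

```python
# Under the null (no real effects), each test comes out falsely "significant"
# with probability alpha, so the expected number of false positives is
# alpha * number_of_tests.
alpha = 0.05
for n_tests in (20, 100, 1000):
    expected = alpha * n_tests
    print(f"{n_tests} tests -> {expected:g} expected false positives")
```

Running this prints 1, 5, and 50 expected false positives, matching the counts above.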


[Regina] (41:20 - 41:25)
So, Kristin, let's walk through this study and just how many tests were done.


[Kristin] (41:25 - 41:39)
Yeah, this is going to be some fun and hilarious math, Regina. And it's pretty doable math, basic arithmetic, except if you are on your third glass of wine by now trying to explain science to Uncle Joe, then this math might feel a little complicated.


[Regina] (41:40 - 41:51)
Yeah. OK, but it starts easy, at least. So, let's start with the outcomes.


They had the five that we talked about before, depression, anxiety, insomnia, feeling full, and gas.


[Kristin] (41:51 - 41:56)
Right. So, all of the analyses are going to get repeated five times, one for each outcome.


[Regina] (41:56 - 42:09)
Right. And then they had 16 different exposure variables. And let me walk you through all of them.


First, they measured how many cold drinks and how many hot drinks you were having in the winter and the summer.


[Kristin] (42:09 - 42:15)
OK. So, we can do that arithmetic, cold and hot, winter and summer. That's 2 times 2 is 4.


[Regina] (42:15 - 42:20)
Right. Then they measured how much cold food and how much hot food you were having in the winter and in the summer.


[Kristin] (42:21 - 42:23)
So, that's four additional exposures.


[Regina] (42:23 - 42:29)
Then they measured ice cream. How much were you having in the winter and the summer?


[Kristin] (42:30 - 42:44)
I was having a lot.


All right. That's two more. So, now we're up to 10 exposures.


Regina, at this point, the study is basically the QVC shopping channel for variables.


Call now, and we'll throw in ice cream plus cold hands absolutely free.


[Regina] (42:45 - 42:47)
And a toaster, a toaster, and an air fryer.


[Kristin] (42:49 - 42:59)
And, you know, Regina, I have to say, I fully support ice cream being its own category of exposure. It should get special status among cold foods. I like what the authors did there.


[Regina] (42:59 - 43:31)
Oh, absolutely. Absolutely. OK.


But this is only 10 so far, notice, and I said 16. So, this is where it gets even more fun. They started combining all of these different things into different composite variables.


They combined cold drinks and cold food for an overall cold category. They made an overall hot category. They made an overall score of total cold minus total hot.


So, now that's three more exposures in both summer and in winter. So, now six more.


[Kristin] (43:32 - 43:45)
So, now we're up to that 16 exposures that you mentioned.


So, Regina, call now, and we'll upgrade you to composite scores at no extra cost. Total cold, total hot, cold minus hot. It's the deluxe package.


[Regina] (43:46 - 43:56)
You know, Kristin, I have to say, you are shockingly good at this voice. And I'm wondering, like, do you have a side job doing this? Do you want a side job doing this?


[Kristin] (43:56 - 43:57)
I might need to get a side job doing this, Regina.


[Regina] (43:58 - 44:19)
Backup career. I'm just saying, QVC, call Kristin. OK.


So, now we are at five outcomes times 16 exposure variables. And that equals 80 statistical tests.


[Kristin]
But wait, Regina, there's more.


[Regina]
There's always more. Tell me, more free gifts. Call now.


[Kristin] (44:20 - 44:46)
They didn't just have one variable for each of these 16 exposures. They treated each exposure variable as a number, but then they also divided it into categories, into tertiles, which are the highest, middle, and lowest thirds of consumption. So, they analyzed each exposure as a number, but also as a high versus low comparison, plus a medium versus low comparison.


[Regina] (44:46 - 44:48)
So, now we've got three more.


[Kristin] (44:49 - 44:55)
And now, for just three easy payments of statistical confusion, we will slice each exposure into tertiles.


[Regina] (44:59 - 45:11)
You are good at this. OK. So, that is five outcomes times 16 exposures times three comparisons per exposure.


So, now, Kristin, we are up to 240 tests.


[Kristin] (45:12 - 45:27)
But, Regina, wait, there's more.


Operators are standing by to take your order for one more, doubling the value of your money.


Because they also split the sample into Asians and Whites, and they ran all the tests separately for each group.


[Regina] (45:27 - 45:36)
So, now, we are at 240 times two groups, which is 480 tests, which is just crazy, let me say.


[Kristin] (45:37 - 45:58)
But wait, Regina, there's more.


If you order in the next 10 minutes, we'll throw in a free set of statistical knives to chop up your data even further.


Because it didn't stop there. After running those 480 tests, they further cut the data into two subgroups. People who have very cold hands and people who have normal hands.


[Regina] (46:00 - 46:12)
So, now, they ran each of those 480 tests two more times. So, it's three times 480, which is 1,440 statistical tests.


[Kristin] (46:12 - 46:16)
Regina, at this point, the data set needs a safe word.


[Regina] (46:16 - 46:38)
It is being tortured, honestly. Uncle, it's crying, Uncle. Uncle Joe, save me.


And this last one, the cold-hands moderator, is particularly interesting, because apparently cold drinks only matter for your mental health if you have cold hands rather than normal hands. I mean, it just gets a little ridiculous.


[Kristin] (46:38 - 46:58)
It's totally ridiculous.


This is not a study anymore, Regina. It's a bundled deal. You get hot drinks, cold drinks, hot food, cold food. But just pay the small shipping fee of 480 extra tests.


Call now and we'll multiply your false positives at no additional cost.


Because, Regina, I'm not done yet.


[Regina] (46:59 - 47:05)
Of course you're not. But sign me up because Uncle Joe is a sucker for this. He loves a good bargain. Let's keep going.


[Kristin] (47:06 - 47:44)
Yeah, believe it or not, I noticed something else. They tried two different linear regression models. So, for all of these different exposure-outcome-race combinations, they tried two different models with different sets of confounders.


And they literally picked the model that gave better p-values. And they even admit this in their paper. They admit that model 2 gave worse p-values and model 1 gave better p-values, meaning smaller and more significant.


So, they chose to report only the results from model 1.


So, Regina, if you're not satisfied with model 2, don't worry. We also offer model 1. Just pay a small processing fee in statistical integrity.


[Regina] (47:46 - 48:17)
Because that's cheap. Statistical integrity, honesty, ability to sleep at night.


So, we don't know what they did. But if they ran all 1,440 tests twice, we are now up to 2,880 tests. We don't know if we have that many because they didn't report everything in their paper. They haven't made the code available.


But we can say we are at least in the thousands at this point.
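The running tally in this stretch of the episode multiplies out like this (counts as stated on air; whether both models really ran for every combination is the speculative part, as Regina says):

```python
# Tallying the statistical tests described in the episode.
outcomes = 5                 # depression, anxiety, insomnia, fullness, gas
exposures = 4 + 4 + 2 + 6    # drinks, foods, ice cream, composites = 16
comparisons = 3              # continuous, high vs. low, medium vs. low
race_groups = 2              # Asian, White
hand_analyses = 3            # full sample, cold hands, normal hands
models = 2                   # model 1 and model 2 (speculative doubling)

tests = outcomes * exposures * comparisons   # 240
tests *= race_groups                         # 480
tests *= hand_analyses                       # 1,440 counted directly
print(tests)
print(tests * models)                        # 2,880 if both models ran everywhere
```

Either way, the count lands in the thousands, which is the point.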


[Kristin] (48:17 - 48:42)
Right. I suspect they started trying model 1 versus model 2 and after a while just said, OK, we're getting better results with model 1. So, maybe they didn't run all 2,880, but yes, thousands. And, Regina, this is a total recipe for false positives.


If you use the standard p-value threshold, again, of 0.05, that means that 5% of statistical tests will come out significant when there are no real associations in your data.


[Regina] (48:42 - 48:59)
Right. Well, here the authors actually said they used a slightly more stringent threshold of 0.025. So, if food and beverage temperature really do nothing, we'd expect 2.5% of these tests to come up significant just by chance, all of them false positives.


[Kristin] (48:59 - 49:17)
Regina, I have to say, I got the biggest laugh out of that line in the paper because they were like, oh, look, we used 0.025 as our threshold because we were worried about multiple testing. And that's kind of like saying, sure, we handled multiple testing. We put this tiny little fig leaf on a massive elephant and whoops, that takes care of it.


[Regina] (49:20 - 49:25)
Kristin, did you just make an elephant penis joke?


[Kristin] (49:26 - 49:30)
I might have, yeah. I'm trying really hard here, Regina, to help you on the PG-13 angle.


[Regina] (49:31 - 49:51)
OK, Kristin, I need to look up elephant penises now. I just, I cannot resist. I need an elephant penis random factoid.


You can edit out the pause. OK?


OK, Kristin, did you know that an elephant penis is up to three feet long and weighs up to 60 pounds?


[Kristin] (49:51 - 50:03)
It's just the kind of thing that Uncle Joe may be interested in, Regina. So, good holiday dinner table conversation, you know, along with PICOT and Mendelian randomization. Lots of fun facts from Normal Curves to talk about.


[Regina] (50:03 - 50:57)
All right, but back to the numbers away from the elephant penises. I will tear myself away. So, I'm going to go conservative and just talk about the 1,440 tests that we counted out that were run on model one.


So, let's say that cold and hot have nothing to do with anything. Then if you run these 1,440 tests on this random noise with the 2.5% threshold they used, you would expect to get 36 results coming up significant just by chance. That's 36 false positives.


And, Kristin, we use this as a kind of benchmark, right? With Uncle Joe, it's like a reality check, a bring you back down to earth recalibration of expectations. You say, if there's nothing going on, we're going to see 36 things coming up just by chance.


So, that's kind of our baseline there.


[Kristin] (50:57 - 51:16)
Yeah, and Regina, I counted up all the significant p-values in the paper with p is less than 0.025, and guess how many I found.


[Regina]
Oh, don't tell me it's 36.


[Kristin]
It was 35.


Which means this is incredibly consistent with what we would expect just due to chance.
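You can see just how unremarkable that 35 is with a quick simulation (a sketch assuming pure noise, i.e., every null hypothesis true; not the paper's data):

```python
import random

random.seed(2025)
ALPHA = 0.025      # the paper's stated threshold
N_TESTS = 1440     # the tests counted in this episode

# If nothing is real, each test comes up "significant" with probability ALPHA.
def significant_count():
    return sum(random.random() < ALPHA for _ in range(N_TESTS))

counts = [significant_count() for _ in range(1000)]
print(min(counts), max(counts), sum(counts) / len(counts))
# The average hovers around 36 (= 0.025 * 1440), so finding 35
# significant p-values is exactly what pure noise would produce.
```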


[Regina] (51:17 - 51:38)
This is crazy. Now, like we talked about with multiple testing dude, although there is no law against this in dating or in science, it is not in great form. It does not inspire confidence.


And really, the only proper response is to not take any of it very seriously at all.


[Kristin] (51:38 - 51:52)
Yeah, exactly. This is not a serious study. Okay, Regina, now M.


M is for magnitude. How big was the effect? And, Regina, they had so many results that they highlighted.


Let's just walk Uncle Joe through one of those results.


[Regina] (51:52 - 51:56)
Well, you know I'm going to pick the gas one, Kristin.


[Kristin] (51:56 - 51:58)
Of course, we have to keep people listening.


[Regina] (52:00 - 52:13)
So, this is effect size we're talking about, magnitude effect size. And sometimes you need to go hunting in the paper for it. So, here we're going to go to the results section of the paper and search for gas.


[Kristin] (52:13 - 52:18)
You really never have to go hunting for gas, Regina. It usually finds you.


[Regina] (52:19 - 52:47)
Whether you want it to or not. So, here's what I found when I hunted for the gas. In Whites, for their gut health, the strongest effect was seen in the winter, comparing the highest hot drink consumption group with the lowest hot drink consumption group.


And the effect is given as a beta coefficient or slope in the regression model. And the value is negative 0.52. And that's the important number there.


[Kristin] (52:47 - 53:27)
So, let's unpack that. The negative, first of all, tells you that it's a negative relationship between warm drinks and gas. So, the most frequent warm drink consumers have less gas than the least frequent consumers.


But to say how much less, we have to figure out what the units are here. And I believe they coded that never/rarely/sometimes/often/always Likert scale as a one-to-five number. So, the units here are changes in that number.


But every one point really represents a jump in category. Just a little aside here, Regina, this is not the best way to analyze categorical data. But that is a topic for a whole other episode.


[Regina] (53:27 - 53:59)
Oh, absolutely. Okay, so getting back to negative 0.52, that was the number. That means that someone who drank the most hot drinks in the winter dropped half a gassiness category compared with someone in the lowest third of hot drink consumption.


So, I guess this is maybe instead of having gas always, you have it mostly always. Which might be an impactful effect size for your in-laws around the dinner table. Who knows?


[Kristin] (53:59 - 54:03)
Yeah, not huge, but maybe enough to make the holiday dinner slightly more pleasant.


[Regina] (54:03 - 54:20)
Yep. Right, right. Okay, magnitude, not huge.


That is our bottom line here. And that is what most of the magnitudes were that we were talking about. Okay, now we are on for SMART to A.


A is for alternative explanations. What else could explain these results?


[Kristin] (54:21 - 54:35)
You know, Regina, I think we've given lots of alternative explanations for Uncle Joe already. The biggest worry here is that all of this is just random noise. That the researchers, you know, tortured the data until it confessed to something.


And they made a story out of those false confessions.


[Regina] (54:36 - 54:59)
Right. Occam's Razor says the simplest explanation is usually the best one. And here that explanation might be that the temperature of beverages doesn't really have much of an effect.


Or at least not an effect strong enough that you can detect it by looking at how people feel today and then trying to link it to how many times they remember drinking ice water a year ago.


[Kristin] (55:00 - 55:26)
We also need to worry about other problems like selection bias and confounding. But I'm not even sure we need to get to those in this case. I mean, my suspicion is that these are all just spurious noise results.


But even if a few of these findings were not spurious, if they were real statistical associations, that does not mean they're causal, right? We can think of tons of reasons why two variables might go together, but it's not a causal association.


[Regina] (55:26 - 55:44)
Right. People with anxiety might consume more cocktails on the rocks, but then it's the anxiety causing the cold drinking behavior, not the other way around, speaking purely hypothetically, not from personal experience or my whiskey on the rocks last night.


[Kristin] (55:46 - 55:57)
With your ex-boyfriend.


All right. Let's get to R now. R is for reality check.


And this is things like, do the numbers in the paper make sense? Is there biological plausibility? Does this pass the smell test?


[Regina] (55:57 - 56:07)
This study does not pass the smell test. It has been stinking up the place. And I'm going to say I have not seen a biological rationale yet.


[Kristin] (56:08 - 56:30)
Yeah, it seems very arbitrary. Like, why only in Asians and only in the summer and only with cold hands? If something seems arbitrary, Regina, it often is.


And this is exactly what false positive tortured data results look like, right? As you said, Regina, full moon on Thursdays. It's just it's too arbitrary.


[Regina] (56:30 - 56:44)
Okay, that was R. Now we're on to T of smart. T is for trustworthiness or transparency.


And this is where we ask things like, did the authors provide the raw data? Did they pre-register the study?


[Kristin] (56:44 - 56:47)
And as far as I can tell, surprise, surprise, they did not.


[Regina] (56:47 - 57:13)
They did not. You know, I do appreciate that the researchers took time to report details of so many analyses. But unfortunately, they did not take the time to do important things like decide on all of their research questions and analyses ahead of time and pre-register somewhere or else clearly label what they're doing as exploratory and tentative.


[Kristin] (57:13 - 57:41)
Yeah, so not a whole lot of trustworthiness here, unfortunately. This is just a simple throw the holiday dinner plate at the wall and see what sticks kind of study. And Regina, I went back and read their PMS and ice cream paper.


And actually there, the journal required a data availability statement. And their statement made me laugh out loud. They write, our study is still in the pilot state and data for this paper are not available to share.


[Regina] (57:42 - 57:47)
Pilot study, but it's a completed survey. I'm not sure what that even means.


[Kristin] (57:47 - 59:04)
I know, it doesn't really seem like a pilot study. And regardless, it doesn't follow that even if it was a pilot study, that magically makes the data not available to share. That is a non sequitur.


A more honest answer here would be, we don't want to share the data with you. Thank you very much.


All right, Regina, I think we have now thoroughly talked through this study with Uncle Joe, maybe even convinced him not to always believe what's on Fox News.


And I think we're ready to wrap it up and rate the strength of evidence for our claim today. And we run into a problem right here because our claim was pretty broad and vague. So I'm just going to summarize the claims as given on Fox News, that Asians who drink more cold drinks in the summer have more anxiety, worse sleep and abdominal fullness.


And Whites who drink more warm drinks in the winter have less depression, better sleep and less gas. So, Regina, how do we rate the strength of evidence for claims in this podcast? It's usually with our one to five smooch rating scale.


One meaning little or no evidence for the claim and five meaning very strong evidence for the claim.


[Regina] (59:04 - 59:17)
For the last episode on alcohol, though, we changed the smooch rating scale to martinis. And since this one is also about beverages, maybe we should also change it to something else, another beverage.


[Kristin] (59:17 - 59:18)
Oh, what do you have in mind?


[Regina] (59:18 - 59:23)
Irish coffees, which apparently is good for intestinal gas. Or not.


[Kristin] (59:24 - 59:30)
All right. So one to five Irish coffees. I like it.


So, Regina, how are you going to rate this?


[Regina] (59:30 - 59:46)
Yeah, this is one where I'm going to have to go off the scale that we just created because I cannot give it even one Irish coffee. It's in the negative rating territory. So, I am going to make three Irish coffees and then pour them down the drain.


[Kristin] (59:47 - 59:48)
Like a negative three here.


[Regina] (59:49 - 1:00:16)
A negative three. I wish I could justify drinking hot toddies and Irish coffees and say it's for mental health and gut health, but not based on this study. I'm not saying that a nice hot drink doesn't warm my soul, right, and my tummy, right, in the moment and make me feel better depending on how much whiskey is in there, for example.


But the study could not conclude anything about this at all. And I think we've shown Uncle Joe that today.


What about you?


[Kristin] (1:00:16 - 1:00:33)
Yeah. I'm going to need cold drinks for our scale today, not hot Irish coffees, because even though this study is so bad, I'm still not going to throw Irish coffees in anyone's faces because that would burn them.


But this is like a three martini in the face study for me.


[Regina] (1:00:33 - 1:00:37)
How about we chill the Irish coffees first? Irish cold brew. I think it's a thing.


[Kristin] (1:00:37 - 1:01:15)
Okay, then that's fine. And I'm going to agree with you. This is a negative three for me, three cold brews in the face because this study is complete nonsense.


They measured a bunch of variables. They defined cold and hot consumption in eight different ways. They modeled each of those variables three different ways.


They split the data up in still more ways, by race and by hand temperature. They tried two different models and picked the one that gave the best p-values. And by the way, this isn't the first paper they've published on this survey.


And the other one was about ice cream causing PMS. So, Regina, my conclusion is this study is good for only one thing, illustrating to Uncle Joe what a bad study looks like. Sorry.


[Regina] (1:01:16 - 1:01:24)
Oh, sorry, Uncle Joe. Better luck next time. Yeah.


All right, methodological morals. You have a good one, Kristin?


[Kristin] (1:01:25 - 1:01:35)
Yeah, I'm going to pick on how they measured their variables. Here's mine. When your measurement starts with ‘think back to last winter’ you might as well use a random number generator.


[Regina] (1:01:35 - 1:01:52)
Oh, I like that one. Very nice.


[Kristin]
How about you, Regina?


[Regina]
I think I'm going to go with the subgroups one. If the effect is only significant in certain subgroups in certain seasons for certain outcomes, it might just be a bad case of gas.


[Kristin] (1:01:54 - 1:02:04)
Yeah, this paper is a bad case of gas. That's a really good description.


Regina, call now and we'll throw in a bonus bad study.


I can't wait to see what they publish next on this Healthy Aging Survey.


[Regina] (1:02:05 - 1:02:23)
Unfortunately, there are just so many bad studies out there. All right, Kristin, this has been a super fun holiday episode. And I think we've shown Uncle Joe how to walk through these things.


And we've given some great mnemonics. We've got PIVOT and SMART. And SMART ASS is coming next, but not yet.


[Kristin] (1:02:25 - 1:02:29)
All right, I'm going to work on that one, Regina. Thanks, Regina. This has been a lot of fun.


[Regina] (1:02:30 - 1:02:31)
Thanks, Kristin. Thanks, everyone, for listening.