Your Brain on AI: Is ChatGPT making us mentally lazy?

ChatGPT is melting our brainpower, killing creativity, and making us soulless — or so the headlines imply. We dig into the study behind the claims, starting with quirky bar charts and mysterious sample sizes, then winding through hairball-like brain diagrams and tens of thousands of statistical tests. Our statistical sleuthing leaves us with questions, not just about the results, but about whether this was science’s version of a first date that looked better on paper.
Statistical topics
- ANOVA
- Bar graphs
- Data visualization
- False Discovery Rate correction
- Multiple testing
- Preprints
- Statistical Sleuthing
Methodological morals
- "Treat your preprints like your blind dates. Show up showered and with teeth brushed."
- "Always check your N. Then check it again."
- "Never make a bar graph that just shows p-values. Ever."
Link to Kristin and Regina's ChatGPT Webinar for Stanford Online
Kristin and Regina’s online courses:
- Demystifying Data: A Modern Approach to Statistical Understanding
- Clinical Trials: Design, Strategy, and Analysis
- Medical Statistics Certificate Program
- Writing in the Sciences
- Epidemiology and Clinical Research Graduate Certificate Program
Programs that we teach in:
Find us on:
Kristin - LinkedIn & Twitter/X
Regina - LinkedIn & ReginaNuzzo.com
- (00:00) - Intro
- (03:46) - Media coverage of the study
- (08:35) - The experiment
- (12:09) - Sample size issues
- (13:11) - Bar chart sleuthing
- (19:15) - Blind date analogy
- (22:57) - Interview results
- (29:07) - Simple text analysis results
- (33:07) - Natural language processing results
- (40:03) - N-gram and ontology analysis results
- (44:58) - Teacher evaluation results
- (51:33) - Neuroimaging analysis
- (59:35) - Multiple testing and connectivity issues
- (01:05:13) - Brain adaptation results
- (01:08:50) - Wrap-up, rating, and methodological morals
[Regina] (0:00 - 0:24)
So, these are all red flags that really make you wonder what is going on behind the scenes with this dude. Now, he could be a quality guy, right? He could be amazing, intelligent, good in bed, just like this could be high-quality science behind this paper.
But if my blind date shows up all sloppy like that, am I likely to give him a second chance?
[Kristin] (0:29 - 0:52)
Welcome to Normal Curves. This is a podcast for anyone who wants to learn about scientific studies and the statistics behind them. It's like a journal club, except we pick topics that are fun, relevant, and sometimes a little spicy.
We evaluate the evidence, and we also give you the tools that you need to evaluate scientific studies on your own. I'm Kristin Sainani. I'm a professor at Stanford University.
[Regina] (0:52 - 0:58)
And I'm Regina Nuzzo. I'm a professor at Gallaudet University and part-time lecturer at Stanford.
[Kristin] (0:59 - 1:04)
We are not medical doctors. We are PhDs. So, nothing in this podcast should be construed as medical advice.
[Regina] (1:04 - 1:09)
Also, this podcast is separate from our day jobs at Stanford and Gallaudet University.
[Kristin] (1:10 - 1:25)
Regina, we're going to do something interesting today. We're going to look at just a single paper about ChatGPT. It's kind of a monster of a paper, and it got a lot of media coverage recently.
It came from the MIT Media Lab, released in June of this year.
[Regina] (1:25 - 1:35)
Yeah, a lot of media coverage. This thing was everywhere, and with some really alarming headlines that imply ChatGPT turned your brain to mush.
[Kristin] (1:36 - 1:41)
There was even an opinion piece about it by David Brooks in the New York Times. So it made the rounds.
[Regina] (1:41 - 1:46)
This despite it not actually being peer-reviewed or published in a journal yet.
[Kristin] (1:47 - 1:54)
That's right. It's a preprint. That means they released it online before peer review.
We'll talk a little bit about preprints today.
[Regina] (1:54 - 1:57)
Yeah, but Kristin, how about you start us off with a quick overview of the study?
[Kristin] (1:57 - 2:16)
All right. This is a study about writing essays. Participants were randomly assigned to different tools.
One group got to use ChatGPT. One group used the internet, but no AI. And the third group used their brains only.
The researchers looked at what their brains were doing while they were writing and how good the essays were.
[Regina] (2:16 - 2:27)
And the media drew a lot of conclusions from this study. Kristin, if I had to summarize it, it would be that ChatGPT makes your brain lazy and your writing soulless.
[Kristin] (2:27 - 2:49)
That's it in a nutshell, Regina. Now, some of that was media overreach, but reporters were getting the kernel of those conclusions from the paper itself. So we're going to dig into the paper today and look at whether this paper actually provides evidence for that claim that you just gave, Regina.
This is our claim for today, that ChatGPT makes your brain lazy and your writing soulless.
[Regina] (2:49 - 3:02)
Yep. Okay. There are going to be some surprises today, because in addition to hitting the statistical topics, we are going to do some statistical sleuthing.
And I want to emphasize, you do not need a stats degree to do what we did.
[Kristin] (3:02 - 3:12)
Oh, absolutely not. We're talking bar graph sleuthing, arithmetic sleuthing, number line sleuthing. This is like third grade math.
Like 10 is bigger than 5 sleuthing. 10 is bigger than 5. I'm glad we agree on that, Regina.
[Regina] (3:12 - 3:20)
This episode is a little different than our typical episode because the paper is so long and complicated.
[Kristin] (3:21 - 3:23)
It has over 90 figures plus an appendix.
[Regina] (3:24 - 3:31)
So we decided to divide and conquer. Kristin will cover the text analysis part, and I will cover the brain activity part.
[Kristin] (3:31 - 3:46)
Yes. And we're pretty much going to walk through the paper, including discussing several of the figures. You don't need to see them, but if you want the visuals, you can check out the YouTube version of this episode.
We'll also link to the paper on our website, normalcurves.com.
[Regina] (3:46 - 3:53)
Kristin, I think it would be fun to start by talking about some of the media coverage we saw of the study, because it was fun. It is fun.
[Kristin] (3:53 - 4:05)
Yeah. Here's a headline I saw on a parenting website. It reads, a new report from MIT says ChatGPT is making students lazier, unoriginal, and soulless.
[Regina] (4:05 - 4:08)
Oh, that's harsh. Yeah.
[Kristin] (4:08 - 4:14)
And the first sentence of that article is, it's official, ChatGPT is hurting students.
[Regina] (4:14 - 4:17)
It's official. Okay. Case closed.
[Kristin] (4:17 - 4:20)
Yeah. This headline is definitely sensationalized.
[Regina] (4:20 - 4:26)
Here's a headline I found from Fox News, ChatGPT could be silently rewiring your brain.
[Kristin] (4:27 - 4:30)
That sounds very ominous, silently rewiring.
[Regina] (4:30 - 4:37)
We could go on and on about the headlines. Let's talk about some of the major claims that have been swirling around this paper.
[Kristin] (4:38 - 5:13)
We picked out six major claims to examine today. The first two claims were that the participants who used ChatGPT had more trouble quoting from their own essays from memory and also felt less ownership of their essays. Another big claim was that the essays from the ChatGPT group were more homogenous and generic.
And another one was that the teachers who rated the essays, that was part of the study, that those teachers could identify the essays from the ChatGPT group and that they described them as soulless. And that's where that word came from, by the way.
[Regina] (5:14 - 5:52)
Okay. The two brain claims, the first was that using ChatGPT makes your brain lazier, that it reduces brain connectivity and brain activity. And second, that using ChatGPT might leave behind a mental hangover, a kind of slowed down thinking that sticks around even after you go back to writing on your own.
And here's how one journalist put it. Participants who used ChatGPT consistently showed reduced brain activity, even when later asked to write without it. That lingering mental sluggishness is what the study referred to as cognitive debt.
[Kristin] (5:52 - 5:55)
Lingering mental sluggishness. I like that writing, Regina.
[Regina] (5:56 - 5:56)
Okay.
[Kristin] (5:56 - 6:19)
So those are the major claims making the rounds about this paper. And you know, Regina, some of these claims feel plausible to me, like if someone is using ChatGPT in a totally passive way. So if I asked ChatGPT to write an essay for me and I never read it, I just like cut and paste it into an assignment, presumably I would not be using much of my brain, right?
And indeed, it might sound soulless.
[Regina] (6:19 - 6:30)
I agree. It might be. But Kristin, this is important to emphasize.
Just because it is a plausible conclusion does not mean that the paper actually showed that.
[Kristin] (6:31 - 6:38)
Yes. That's why we need to dig into the paper to see how much evidence is there in this paper for these claims that are being thrown about.
[Regina] (6:38 - 6:48)
As the kids say these days, you need the receipts. That's what we do in this podcast. We examine the evidence behind the claims and headlines.
We check the receipts.
[Kristin] (6:48 - 7:07)
Exactly. Now, Regina, I do want to mention that we have some potential biases. We both use ChatGPT in our work. In fact, we are giving a live webinar through Stanford Online on August 20th about the effective and ethical use of ChatGPT.
And it was while we were preparing for that webinar that we ended up taking a closer look at this MIT paper.
[Regina] (7:07 - 7:23)
That webinar is going to be a lot of fun, I have to say. Very informative, too. And I want to invite all our listeners to attend.
We'll put a link to the registration on NormalCurves.com and we'll also put a link to the recorded session after August 20th in case you miss us live.
[Kristin] (7:24 - 7:31)
Yeah. I hope a lot of our listeners will tune in to that. Getting back to the paper, Regina, we do have to keep in mind that this paper is a preprint.
[Regina] (7:31 - 7:31)
Right.
[Kristin] (7:31 - 7:31)
Let's talk about that.
[Regina] (7:32 - 7:45)
A preprint. So for listeners who are not familiar with that, a preprint is a research paper that has not yet been peer reviewed. So it's basically a draft that's been posted online for people to read and comment on.
[Kristin] (7:46 - 8:05)
So it's possible this is just a first draft and maybe some of the issues we're going to point out today reflect first draft writing. Now, the authors did say that they have already submitted the paper for peer review, but we don't know if the version that they sent for peer review is the same as this preprint. So our critiques apply specifically to this preprint.
[Regina] (8:05 - 8:16)
Kristin, I want to take a moment to give the authors a lot of credit for releasing this as a preprint because it does allow people like us to read the paper and give feedback before it's published.
[Kristin] (8:16 - 8:18)
Yeah, absolutely. We love the transparency.
[Regina] (8:19 - 8:34)
And I want to acknowledge, Kristin, just how much work went into this because they present a huge number of different analyses, and some of them are pretty sophisticated and labor-intensive. So this paper just reflects a lot of effort. I totally agree with that, Regina.
[Kristin] (8:35 - 8:40)
All right, let's start by describing the experiment in some more detail. It's pretty cool, actually.
[Regina] (8:40 - 8:50)
It is. So, Kristin, as you mentioned before, participants were randomly assigned to one of three groups. One group used ChatGPT to help them write their essays.
[Kristin] (8:50 - 8:53)
But they weren't allowed to use anything else. No other internet, no Google.
[Regina] (8:53 - 9:04)
Right. The second group got it the other way. They could use search engines like Google.
But no AI. They actually disabled AI. And the last group used neither.
Old school. They had their brains only.
[Kristin] (9:04 - 9:24)
They did get a computer, though. This was not a handwritten essay. And you know, Regina, at my kids' school, they are now so paranoid about ChatGPT that they are making them write a lot of essays in class by hand.
But that wasn't the case here. The brain-only group got to use a computer and word processing. All right, Regina, now let's talk through the flow of the study.
[Regina] (9:25 - 9:35)
It was an interesting flow. Everyone had to finish three sessions spaced out over a few months. And there was also an optional fourth session, but we'll get to that later.
[Kristin] (9:35 - 9:44)
Right. When the participants came in, the first thing they had to do is to put on this EEG cap. And it's kind of like a swim cap, but with all these electrodes and wires sticking out.
[Regina] (9:44 - 9:58)
Yeah. EEG, by the way, stands for electroencephalography. It measures brain activity by detecting these electrical activity signals of the neurons that are near the scalp.
It's really cool. We'll talk more later.
[Kristin] (9:58 - 10:11)
All right, then they gave the participants a timed essay that they had to write while wearing the EEG cap. And in each session, participants got to choose from a new set of three different essay prompts. And they had only 20 minutes to complete the essay.
[Regina] (10:11 - 10:24)
20 minutes is a lot of time pressure. Because these prompts were actually taken from an old version of the SAT test where they had 25 minutes for the essay, not 20. So the researchers made this task a little harder.
[Kristin] (10:25 - 10:28)
They did. Yeah. You know, Regina, I'm glad I didn't have to do an essay on the SATs.
[Regina] (10:29 - 10:29)
Definitely.
[Kristin] (10:30 - 10:53)
The prompts are kind of generic and philosophical. Like one asks, does loyalty mean that you have to give someone unconditional support or does loyalty require you to speak up and criticize them when you think they're doing something wrong? Another asks whether art is purely for entertainment or does it actually have the power to change people's lives?
[Regina] (10:54 - 11:05)
These are deep.
[Kristin]
They are, yeah.
[Regina]
This is not saying, talk about your favorite movie. Right, yes.
So after the participants finished their essay, a researcher interviewed them in person. It was kind of like a debrief.
[Kristin] (11:05 - 11:20)
Exactly. They asked them a scripted set of questions, and these were later transcribed and turned into data. And the participants had to do this same workflow three times in three separate sessions, writing three different essays.
And if they finished all three, they got paid $100.
[Regina] (11:21 - 11:24)
Let's talk now about who was in the study.
[Kristin] (11:24 - 11:32)
They were from MIT, Harvard, Wellesley, Tufts, and Northeastern. So New England area, all Boston area schools where I grew up.
[Regina] (11:32 - 11:40)
And also smart, all smart. Harvard. I bet they felt some pressure to do well, even in this 20-minute experimental exercise.
[Kristin] (11:41 - 11:57)
Regina, I think that is super important because you can imagine someone might just sign up for this study and not take it seriously. Maybe just write some crap or let ChatGPT write some crap. But these are high-achieving students, so I think they probably take pride in their work, and I bet they tried to do a good job.
[Regina] (11:57 - 12:08)
Yeah, I agree. So the majority were college students, but there were also some graduate students, a handful of postdocs, and junior research scientists, and they were all 18 to 39 years old.
[Kristin] (12:09 - 12:13)
Regina, let's talk sample size now, because as statisticians, this is what we really care about.
[Regina] (12:13 - 12:27)
Mm, sample size. And, okay, they started with 60 participants, started, but then they say they only used data from people who completed all three sessions, which got them down now to 55 participants.
[Kristin] (12:28 - 12:40)
Right, and this is not great because it's actually not best practice to drop participants with incomplete data. In a randomized trial, you're supposed to analyze everyone, even if they dropped out part way. You just use partial data.
[Regina] (12:41 - 12:48)
And it's even weirder because then they say they are reporting data from only 54 out of the 55 who finished the study.
[Kristin] (12:48 - 13:00)
This was totally weird. It seems like they did this because they wanted exactly 18 per group, 18 times 3, 54. But you don't need to have the exact same number per group.
That's, you just, that's unnecessary.
[Regina] (13:00 - 13:09)
Mm-mm, that's throwing out data, and we do not throw out data for this reason. And Kristin, how did they decide which one participant to kick out?
[Kristin] (13:09 - 13:11)
I know, this seems really arbitrary.
[Regina] (13:11 - 13:16)
All right, this is the fun part, Kristin. Let's look at the figures and do some sleuthing.
[Kristin] (13:16 - 13:39)
Yes, this sleuthing isn't really too tricky either, Regina. So, let's start with figure 2. It is a bar graph that's supposed to show what degrees the participants have.
So, like the first bar shows high school degrees, the second bar shows college degrees, and so on. Now, bar graphs display counts, and indeed the y-axis is labeled occurrences, which to me implies counts.
[Regina] (13:40 - 13:47)
Yeah, it does, doesn't it? But Kristin, that first bar has a height of something like 63, which is a problem.
[Kristin] (13:48 - 13:56)
Right. It doesn't take a degree in statistics to pick out the problem either, Regina. I believe, correct me if I'm wrong, but I believe that 63 is higher than 54.
[Regina] (13:57 - 14:00)
Wait a minute, let me count on my fingers. Yeah, yep, 63 higher than 54.
[Kristin] (14:01 - 14:04)
How can we have 63 high school degrees with only 54 participants?
[Regina] (14:05 - 14:13)
Yeah, same thing for the college degrees, the bar with the college degrees. The text implies that we have 19 college graduates, but the bar goes up to 48.
[Kristin] (14:13 - 14:32)
Right, how did we get so many college degrees from only 19 graduates? Yeah, it's incredibly sloppy. And the other thing, Regina, is why is this in graph form at all?
There is no reason to put this information in a graph. It's just a few numbers. You could put that in a table, and it would actually be easier to get the information from a table.
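For listeners who want to try this kind of bar-graph sleuthing themselves, here is a minimal sketch in Python of the arithmetic check just described: in a chart of participant counts, no bar should be taller than the number of people it could possibly describe. The bar heights are the approximate values read off Figure 2, as discussed above.

```python
n_participants = 54          # analysis sample, per the paper
college_grads_in_text = 19   # college graduates mentioned in the text

# Approximate bar heights read off Figure 2 (y-axis labeled "occurrences")
high_school_bar = 63
college_bar = 48

if high_school_bar > n_participants:
    print(f"Red flag: {high_school_bar} high school degrees but only {n_participants} participants")
if college_bar > college_grads_in_text:
    print(f"Red flag: {college_bar} college-degree occurrences but only {college_grads_in_text} college graduates")
```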
[Regina] (14:32 - 14:41)
Yeah, it's a waste of space, and let's just say in this paper, there are a lot of graphs that could have just been consolidated into a few tables, so yes.
[Kristin] (14:41 - 14:53)
Yes, case in point, figures 29 and 30, which are also problematic bar graphs, so let's talk about those now. They asked participants about their prior experience with ChatGPT, so again, descriptive data.
[Regina] (14:54 - 15:08)
I'm looking at figure 29, Kristin, and it is making me laugh. It is a bar graph of how participants used ChatGPT before the study. First of all, it has no y-axis label.
[Kristin] (15:08 - 15:11)
Yeah, y-axis label, kind of basic, yes.
[Regina] (15:12 - 15:18)
They do have numbers on the y-axis, yay, 0 to 100, but what do they stand for?
[Kristin] (15:18 - 15:27)
Right, when I see 0 to 100, I was assuming that they were percentages, but then you look at the first bar, which is labeled no response, and it goes up to exactly 100.
[Regina] (15:28 - 15:34)
From that, one would think that 100% of the participants had no response.
[Kristin] (15:34 - 15:43)
Right, but then there are other bars for things like homework and emails, and those bars are above zero. So if 100% were non-responding, then how do we have other responses?
[Regina] (15:44 - 15:48)
Well, Kristin, it's greater than 100% of people. That's simple. It's magic.
[Kristin] (15:48 - 15:49)
ChatGPT magic.
[Regina] (15:50 - 15:52)
Figure 30 has the same issue. It does, yeah.
[Kristin] (15:53 - 16:04)
Regina, there are a ridiculous number of bad bar graphs in this paper. They are colored pretty colors, but you don't get extra credit for pretty colors. We are not in kindergarten.
[Regina] (16:05 - 16:12)
Sadly. I'm wondering, did they even proofread these graphs, though, before they posted the manuscript, given all these errors?
[Kristin] (16:12 - 16:21)
I don't think so. I feel like they should have had ChatGPT check them for them. You know, Regina, it's making me feel like they slapped this manuscript together really quickly.
It's quite sloppy.
[Regina] (16:22 - 16:22)
Yeah, yeah.
[Kristin] (16:22 - 16:43)
Can we talk about session four now? This was the optional part of the study. The authors make a big deal out of this session, though.
In their paper, they say that session four generated, quote, some of the most striking observations in our study. But actually, only 18 participants came back for this optional session. Regina, you want to explain what they had to do?
[Regina] (16:43 - 17:09)
Yeah. So, a small number of people came back, and the participants did not know it in advance, but if they came back for this fourth session, their group assignment got switched. So like if they were in the brain-only group before, now they got to use ChatGPT.
This is important. They switched. If they used ChatGPT before, now they had to use only their brains.
And in this fourth session, they had to write yet another essay.
[Kristin] (17:10 - 17:21)
Regina, I got confused about the groups here. So was no one from the search engine group invited back then? And no one used search engine only in the fourth session.
Am I getting that right?
[Regina] (17:21 - 17:40)
Well, it's a mystery because they never tell us explicitly. I'm just going to say, based on my reading, I think session four had only two groups of people. The nine people who switched from brain to ChatGPT, and the nine who switched from ChatGPT to brain.
So search engine was not part of this.
[Kristin] (17:40 - 17:46)
Oh, right. That's what I thought. But then I got confused when I saw figure 12.
Did you see figure 12?
[Regina] (17:46 - 17:49)
I did. It is a bar graph about session four.
[Kristin] (17:49 - 18:04)
Right. And it shows five groups, not two. So there are bars for those two groups that you just mentioned.
But there are also bars for three other groups, a search engine to brain group, a search engine to ChatGPT group, and a brain to search engine group.
[Regina] (18:04 - 18:09)
Right. So how did those three extra bars get in there if those groups did not exist? Right.
[Kristin] (18:09 - 18:41)
Did the authors originally include those groups and then exclude them later? Like maybe they changed their study design after the fact? Or did they miscode their data?
Right? Maybe they hallucinated it. You know, Regina, we have talked about stats on drugs before.
[Regina]
Oh, stats on psychedelics. Good stuff.
[Kristin]
Maybe not recommended when you're analyzing your own data.
[Regina]
Oh, but more fun.
[Kristin]
It's definitely more fun. Regina, we also need to keep in mind that the data from session four are very limited because, as we said, the numbers are small.
And also, this sample is now non-random.
[Regina] (18:42 - 18:57)
Good point. It was just whoever volunteered to come back. And maybe they were different in some way.
Maybe they were just the most motivated participants, or the poorest, or whatever. So now the groups could differ in all kinds of ways that could affect the results.
[Kristin] (18:57 - 19:08)
We've lost the benefits of randomization now. And that means we have to be extra cautious in interpreting results from this fourth session. Even though the authors made a big story about it, that's what we're going to talk about.
[Regina] (19:08 - 19:15)
They did. Okay. So I think that sums up experimental methods and participants.
Are we ready for the results?
[Kristin] (19:15 - 19:29)
Regina, before we get into the results, let's pause for a second because we both had the same reaction when reading this paper. The presentation is unexpectedly sloppy. Key details are missing.
There are inconsistencies, confusing figures.
[Regina] (19:30 - 19:38)
Yeah, these are all major red flags for us, Kristin. It doesn't necessarily mean the study itself is flawed, though. Right.
[Kristin] (19:38 - 19:51)
Bad presentation, especially in a preprint, does not always mean bad science. But it does raise my hackles, right? It does make me think, if they didn't double check their bar graphs, what else were they careless on?
[Regina] (19:51 - 19:57)
Yes, exactly. Thinking the same thing. And, Kristin, I've got an analogy.
[Kristin]
Oh, good.
[Regina]
Yeah, you like my analogies.
[Kristin] (19:57 - 19:57)
Yes, I do.
[Regina] (19:57 - 20:40)
Okay. Blind date. Okay.
The guy shows up and he is clearly unwashed, unshowered, right? And he has not brushed his teeth.
[Kristin]
Ew. Uh-oh.
[Regina]
And he's got, what, bedhead hair and his clothes are all rumpled because he obviously just picked them up from the floor where they were in a ball. So these are all red flags that really make you wonder what is going on behind the scenes with this dude.
Now, he could be a quality guy, right? He could be amazing, intelligent, good in bed, just like this could be high-quality science behind this paper. But if my blind date shows up all sloppy like that, am I likely to give him a second chance?
Really?
[Kristin] (20:41 - 20:55)
Right. Like, maybe he's just having a really bad day. His alarm didn't go off and he rolled out of bed 10 minutes before the date and he was so enamored of meeting you, Regina, that he rushed off before taking care of any of the hygiene.
[Regina] (20:55 - 21:03)
Maybe he should have waited for some peer review, though. Like looking in the mirror, right? First, did he have anyone smell his armpits?
[Kristin] (21:03 - 21:10)
But Regina, you like the sweaty armpits. See pheromones episode. So maybe this would increase the chances of a second date with you.
[Regina] (21:11 - 21:14)
Yeah, but you also got to brush your teeth. Those are not pheromones.
[Kristin] (21:14 - 21:40)
Yes. I agree with you on that one, yeah. So again, there is a chance that he's actually a quality guy and he's just having a bad morning, but it's hard to ignore the bad presentation. And this preprint, pretty much the equivalent of this blind date.
So we are going to try to keep an open mind and not judge the paper purely on its bad presentation. We're really going to dig into the actual science underneath. But the paper did not make a great first impression, and that can be hard to recover from.
[Regina] (21:41 - 21:51)
Not a lot of blind dates like that get a second one with me. Yes. OK, Kristin, we are now ready to start talking about the results, but let's take a short break first.
[Kristin] (21:59 - 22:10)
Regina, I've mentioned before on this podcast, our clinical trials course on Stanford Online is called Clinical Trials, Design, Strategy, and Analysis. I want to give our listeners a little bit more information about that course. It's a self-paced course.
[Regina] (22:11 - 22:18)
We cover some really fun case studies designed for people who need to work with clinical trials, including interpreting, running, and understanding them.
[Kristin] (22:18 - 22:56)
You can get a Stanford professional certificate as well as CME credit. You can find a link to that course on our website, normalcurves.com. And our listeners get a discount.
The discount code is normalcurves10. That's all lowercase.
Welcome back to Normal Curves.
Today we are talking about the recent study from MIT on ChatGPT and your brain. And we were about to talk about the results. Regina, let's start with some of the results from those interviews that they did at the end of every session.
[Regina] (22:57 - 23:17)
Right. They asked a series of scripted questions. These are actually numbered in the method section as questions one through eight.
They also present results in the results section numbered one through eight. So you would think that these one through eights would correspond to each other, but they do not, which is weird and confusing.
[Kristin] (23:17 - 24:15)
This confused me so much, Regina. I kept trying to match the question to its results and they weren't lining up. I finally just ignored the numbers.
Regina, the data from these interviews were used to support two of the claims that I talked about earlier. One, that the participants who used ChatGPT had more trouble quoting from their own essays from memory. And two, that they felt less ownership of their essays.
So let's look at the memory part first. One of the questions they asked participants was, can you quote any sentence from your essay without looking at it?
[Regina]
Which appears to have only two answers, yes or no.
[Kristin]
Yes. Yes, exactly. And they report that in session one, 15 out of 18 participants in the ChatGPT group said no, they couldn't quote a sentence from the essay they just submitted, versus only two out of 18 in each of the other two groups.
[Regina] (24:15 - 24:47)
And they report results from statistical tests comparing the groups on these data. But Kristin, I think they ran the wrong statistical test for this type of data. They used what's called an ANOVA, an analysis of variance.
But Kristin, as you know, ANOVA is for numeric outcomes, numbers, like if we were comparing height or weight between the groups. But here, we have a binary outcome, a yes-or-no category: whether they could recall a quote or not.
[Kristin] (24:48 - 25:30)
Regina, this is really basic. Any first-year stats course teaches this. You don't use ANOVA for binary outcomes. And then it gets worse, because I can't even recreate their analysis.
The most generous assumption I can make is that they coded their yes-no outcome as ones and zeros, and then treated those ones and zeros as numbers in an ANOVA. That would be wrong, but the computer would run it. When I do that analysis, though, I get wildly different results from what they report.
I get an F-statistic. This is the output of an ANOVA. I get an F-statistic of 26, whereas they report an F-statistic of about 80, which is just way off.
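Here is a minimal sketch in Python of the kind of recreation Kristin describes, under the generous assumption that the yes/no answers were coded as 1s and 0s and fed into a one-way ANOVA. The counts (15 of 18, 2 of 18, 2 of 18) are the session-one numbers discussed above. For comparison, it also runs a chi-square test, which is a more standard choice for a binary outcome like this.

```python
import numpy as np
from scipy import stats

# Session 1 answers to "Can you quote a sentence from your essay?"
# 1 = could NOT quote, 0 = could quote (our assumption about the coding)
chatgpt = np.array([1] * 15 + [0] * 3)   # 15 of 18 could not quote
search  = np.array([1] * 2  + [0] * 16)  # 2 of 18
brain   = np.array([1] * 2  + [0] * 16)  # 2 of 18

# The questionable analysis: one-way ANOVA on 0/1 data
f_stat, p_anova = stats.f_oneway(chatgpt, search, brain)
print(f"ANOVA: F = {f_stat:.1f}, p = {p_anova:.2g}")  # F is about 26, not the ~80 the paper reports

# A more appropriate test for a binary outcome: chi-square on the 3x2 table
table = [[15, 3], [2, 16], [2, 16]]
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)
print(f"Chi-square: chi2 = {chi2:.1f}, df = {dof}, p = {p_chi2:.2g}")
```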
[Regina] (25:31 - 25:42)
So this means their analyses are not reproducible. And the fact that it's so far off, and that we can't figure out how they got there on what is really pretty simple data.
[Kristin] (25:42 - 25:43)
Very simple data. Yeah.
[Regina] (25:43 - 25:45)
This is a huge red flag.
[Kristin] (25:45 - 26:28)
Right. And, Regina, to be fair, I don't need a p-value here. I don't need a p-value to tell me that 15 out of 18 is different than 2 out of 18.
We're not saying their conclusion is wrong. It looks like in session one, more of the ChatGPT group had trouble remembering a quote from their essay. But again, it goes to, do they know what they're doing statistically?
I would say that's an open question right now. Also, Regina, even though the groups were different in session one, this finding doesn't hold up in session two. In the second session, only two people in the ChatGPT group had trouble recalling a quote.
That did go up to five in session three, but that's still the minority.
[Regina] (26:29 - 26:54)
So there may be something there, but it's not as dramatic or consistent as the authors and the media maybe made it out to be. Exactly. Regina, now let's talk about the essay ownership question.
This is about how much do you feel ownership of your essay? But this is a bit of a mess. We know they asked all three groups about ownership because they present data for all three groups.
[Kristin] (26:54 - 27:16)
Right. But they don't tell us what question they asked the brain-only group. They tell us that they asked the ChatGPT group how much of the essay was ChatGPT's and how much was yours, and they tell us that they asked the search engine group how much of the essay was taken from the internet and how much was yours, but they never tell us what specific question they asked the brain-only group.
[Regina] (27:16 - 27:25)
Right. Like, how much of the essay came from your brain versus how much came from your brain? I mean, what else is there?
It doesn't even make any sense.
[Kristin] (27:25 - 27:33)
Right. I don't get how this question even applies to the brain-only group. Who else would have ownership of your essay if it's coming out of your brain only?
[Regina] (27:33 - 27:47)
A couple of participants in the brain-only group did report only partial ownership, so I'm thinking maybe they gave partial ownership to, like, Shakespeare because they remembered something Shakespeare said about art.
[Kristin] (27:47 - 28:17)
OK. Could be. But it gets even worse, though, Regina, because there was also something wrong with the numbers they report.
In the text, they say that only 9 out of 18, half of the ChatGPT group, claimed full ownership of their essays. But those numbers don't match what's shown in Figure 8. Figure 8 is a stacked bar chart, and it shows data from only 15 people in the ChatGPT group.
It shows that 9 out of 15 claimed full ownership. That's 60 percent, not half.
[Regina] (28:17 - 28:29)
Right. So there are three people missing from that group in the figure, and it's even worse in the search engine group because that bar had only 13 total, so now we lost five more people. It's like a black hole.
[Kristin] (28:29 - 28:42)
Right. So where did those data go? Did those missing people just not answer the question?
Did the researchers lose the data? Or is the figure just wrong? And why did they give different numbers in the text than in the figure?
[Regina] (28:42 - 28:47)
All these inconsistencies, lack of explanation, they are all red flags.
[Kristin] (28:47 - 29:04)
Yes. And again, we're not saying there wasn't a real difference here. Even with the missing data, it does look like the brain-only group claimed more ownership of their essays than the other two groups.
But is this even interesting? The brain-only group doesn't have anything else to attribute ownership to.
[Regina] (29:05 - 29:05)
Yeah.
[Kristin] (29:05 - 29:06)
It's weird. Yeah.
[Regina] (29:07 - 29:11)
Yeah, doesn't make sense. Okay, let's talk about the text analysis section now.
[Kristin] (29:12 - 30:17)
This is the Natural Language Processing, or NLP, section of the paper. And Regina, they applied a bunch of computational tools to analyze the content of the essays. And I want to give the authors credit for all the work they put into this section.
They ran a ton of sophisticated analyses. It represents a lot of effort. Yeah, true.
But the section is hard to read. It's all over the map. It's not clear what questions they're trying to answer or why.
[Regina]
It's that unshowered blind date again. Yes.
[Kristin]
Remember, Regina, one of the major claims about this paper was that the ChatGPT group produced more, quote, homogenous essays.
And this was not something the media made up. The authors actually write this in their discussion section. They write that the ChatGPT group, quote, produced statistically homogenous essays within each topic, showing significantly less deviation compared to the other groups.
Now, granted, this is a confusing sentence, but they seem to be making this claim here.
[Regina] (30:17 - 30:22)
Yeah, I agree. So the paper should back this up with evidence and data, right? Mm-hmm.
[Kristin] (30:22 - 30:55)
So let's look at that evidence. The first thing they tell us is that the ChatGPT and search engine groups had significantly reduced variability in the length of the words compared to the brain-only group. So let me just illustrate what they mean by that, Regina.
So here's a sentence where the words are all similar in length. Okay. I like your smile.
You seem nice. Let's hang out. And we want to compare that to a sentence with more varied word length.
Your smile is gorgeous. You seem lovely. Let's connect.
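To make that concrete, here is a quick sketch of one way to compute that kind of word-length-variability metric (the paper does not spell out its exact calculation, so this is just an illustration), using Kristin's two example sentences: the standard deviation of word lengths is higher for the second, more varied sentence.

```python
import numpy as np

def word_length_sd(sentence):
    # Standard deviation of word lengths, punctuation stripped from word ends
    words = [w.strip(".,!?") for w in sentence.split()]
    return np.std([len(w) for w in words])

uniform = "I like your smile. You seem nice. Let's hang out."
varied = "Your smile is gorgeous. You seem lovely. Let's connect."

print(round(word_length_sd(uniform), 2))  # about 1.1 -- word lengths are similar
print(round(word_length_sd(varied), 2))   # about 1.8 -- word lengths vary more
```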
[Regina] (30:56 - 31:05)
Aw, Kristin, your smile is gorgeous, too. Oh, thank you. Okay, but this metric does not necessarily tell us about homogeneity, I mean, but maybe?
[Kristin] (31:05 - 31:31)
Yeah, maybe. It's not clear that this is the greatest metric for measuring homogenous essays. But there's an even bigger problem, Regina.
The authors actually never present any data that even backs up this statement. They never give us data that shows that the brain group had more varied word length than the other groups. In the text, they say, see figure 15.
But figure 15 is a mess. Surprise, it's another bar graph.
[Regina] (31:32 - 31:43)
But wait, the legend for figure 15 says it's about the number of words per essay, not the number of letters per word. Right. I'm not even sure this is about word length, right?
[Kristin] (31:44 - 32:06)
It seems to now be about number of words. But we can't really tell because it's a horrible figure. It is a bar graph of p-values.
Have you ever seen anything like this before, Regina? I have not, and my skin is crawling right now. Right.
So there are three bars, one for each group. And the height of each bar corresponds to a p-value, one p-value for each group.
[Regina] (32:07 - 32:13)
Kristin, it makes no sense at all to put p-values in a bar graph. Just give us the numbers. Right.
[Kristin] (32:13 - 32:48)
This is just three numbers. Why is this a bar graph? And even worse, I can't figure out what statistical tests generated those p-values.
Right. P-values have to come from somewhere. There has to be a comparison.
And I don't know what those comparisons were. So, for example, the ChatGPT group has a p-value of 0.39. Is that the p-value you get when you compare the ChatGPT group to the brain-only group or the ChatGPT group to the search engine group? Or is it from something else entirely?
It's totally unclear because there's only one p-value, and I don't know where it came from.
[Regina] (32:49 - 33:06)
Yep. And it does not even align with the text either. So the text says that the ChatGPT group was significantly different than the brain group, but that p-value for the ChatGPT group, 0.39, notice, is not statistically significant. It's not less than 0.05.
[Kristin] (33:07 - 33:53)
That's right. And so this first supposed piece of evidence for homogenous essays, I'm giving it a big thumbs down, Regina. All right, but let's move on to a more sophisticated analysis where they use some really cool natural language processing tools. They used deep learning to turn each essay into a vector of hundreds of numbers.
It's a numeric representation of the essay. It's sort of like a digital fingerprint that captures the content, tone, and structure of the essay. So kind of cool.
And then they used something called cosine distance to compare the essays, basically just a way to measure how similar two of these digital fingerprints are. So they looked at this cosine distance across all the essays within a given group as a way to judge homogeneity.
[Regina] (33:54 - 34:02)
This seems more meaningful and maybe a much better way to test for homogeneity than just looking at word length, by the way.
[Kristin] (34:02 - 34:28)
Absolutely. But what they did was to generate heat maps of the essays in each group, and we've got to work through what a heat map is here, Regina. So a heat map is basically like a patchwork quilt, where each square of the quilt tells us how similar any two essays in that group are. So let's say there are 50 essays we want to compare.
The heat map would have 50 rows and 50 columns, and that would result in 2,500 little tiny squares.
[Regina] (34:28 - 34:36)
Heat maps are fun because they are colored and the colors give us information. So tell us about the colors here.
[Kristin] (34:36 - 35:44)
Right. The squares are colored darker for pairs of essays that are more similar and lighter for pairs of essays that are less similar. So if the square for row 20, column 30 is dark, that tells you that essay 20 and essay 30 are similar to each other. And Regina, of course, you get a dark stripe along the diagonal because that's where every essay is compared to itself and is thus perfectly similar.
And figure 19 presents three heat maps, one for each group, ChatGPT, search, and brain. Do they give results of any formal statistical test? No, it's just a visual.
It's just the three heat maps. And they basically imply that if you stare at the heat maps long enough, you'll magically see that the ChatGPT essays are more similar and here's the exact quote from the paper. Quote, we can see a more rippled effect in ChatGPT written essays showing bigger similarity.
See figure 19. I mean, Regina, I'm staring at figure 19 and I don't see it. Do you see it?
[Regina] (35:45 - 35:51)
Rippled effect? Well, first of all, I would argue that is not a statistical term. I'm sorry.
That is just a vibe.
[Kristin] (35:51 - 35:55)
Regina, we need this on a T-shirt or coffee mug. Did you see the rippled effect?
[Regina] (35:56 - 36:00)
Yeah, but we're going to make it sexy. Show me your cosine ripples, baby.
[Kristin] (36:02 - 36:04)
I think there's definitely a sex connection here, Regina.
[Regina] (36:05 - 36:32)
Okay, back to the heat maps, because I have so many problems. First of all, how did they order the essays, those little squares in the patchwork quilt? Because it looks totally random and you should not do that.
If you're going to use a heat map, the order needs to mean something. So here they should have ordered the essays by similarities, so you could actually see the patterns, not just a vibe. You could see clumps of dark squares, clumps of light.
[Kristin] (36:32 - 37:05)
Exactly. They should have sorted them from the most similar pairs to the least similar pairs, so you could actually tell if the ChatGPT group had stronger clusters. As it is right now, there's no way to visually judge homogeneity.
[Regina]
It's just a vibe.
[Kristin]
Exactly. And Regina, did you notice this? The squares in the ChatGPT heat map are smaller than the squares in the other two groups.
I did notice that. So this so-called rippled appearance may actually be an artifact of the fact that the squares are tinier in the ChatGPT heat map.
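Here is a minimal sketch in Python of the fix described above: compute cosine distances between essay embeddings and reorder the rows and columns with hierarchical clustering before drawing the heat map, so similar essays sit next to each other and clusters show up as visible blocks rather than a vibe. This assumes you already have one embedding vector per essay; the array shape at the bottom is just an illustration, not the paper's actual data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, leaves_list

def ordered_similarity_heatmap(embeddings):
    """embeddings: (n_essays, n_dims) array, one vector per essay."""
    cond_dist = pdist(embeddings, metric="cosine")              # pairwise cosine distances
    order = leaves_list(linkage(cond_dist, method="average"))   # cluster-based ordering of essays
    sim = 1 - squareform(cond_dist)                              # similarity matrix
    plt.imshow(sim[np.ix_(order, order)], cmap="viridis")
    plt.colorbar(label="cosine similarity")
    plt.show()

# Illustration only: 54 essays with 384-dimensional embeddings
ordered_similarity_heatmap(np.random.rand(54, 384))
```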
[Regina] (37:05 - 37:11)
Right. Like if they were bigger squares, it would be a wave. And since they're smaller squares, it's a ripple.
[Kristin] (37:12 - 37:29)
Right. And the fact that these squares are tinier actually is a clue to something much more important. The overall size of the heat maps, the overall size of those quilts, is exactly the same. So the reason we are getting tinier squares in the ChatGPT heat map is because there are more essays.
[Regina] (37:29 - 37:38)
There are more rows and columns in that heat map. Which makes no sense, Kristin, because there are supposed to be the same number of essays in each group, right?
[Kristin] (37:38 - 37:54)
This is crazy, right? I counted the rows and columns in each heat map, and I found that the ChatGPT one displays 69 essays, the search engine group has 48 essays, and the brain-only group has 44 essays.
[Regina] (37:55 - 37:57)
It is so bizarre. I counted two, got the same numbers.
[Kristin] (37:57 - 38:29)
There is no explanation for these numbers. Regina, there should be 54 essays per group if we are just using essays from the first three sessions, and there should be 63 essays in the brain and ChatGPT groups if we include session four, those extra nine. So where the heck do we get 69, 48, and 44 from?
And this really worries me when I see crap like this, Regina. How are we off on something as simple and basic as just the number of essays? They should be able to keep track of that.
[Regina] (38:30 - 38:39)
They should. I mean, do they get mislabeled, or are they hallucinating essays now? Right.
We have hallucinated data and disappearing data.
[Kristin] (38:39 - 39:25)
That is not good.
[Regina]
It's like a giant haunted house.
[Kristin]
It's like a haunted house, yes.
I'm scared. Are you scared, Regina? I am very scared.
And the worry here is, if you have the essays in the wrong groups in one of your analyses, how do we know that you don't have the essays in the wrong groups for all of the analyses that we're about to talk about? That would mean that we were just looking at noise and artifact. Regina, let's now look at figure 20, because it's another heat map.
And in this figure, they're separating the essays by topic, and then they're comparing the similarity of the essays across the different groups. Like, how similar are all the loyalty essays in the brain group to all the loyalty essays in the ChatGPT group?
[Regina] (39:25 - 39:29)
Right. But I am looking at figure 20. I am seeing no pattern.
[Kristin] (39:29 - 40:03)
This is what I don't get. It's pretty much random. So for some topics, brain and search were closest.
For other topics, brain and ChatGPT were closest. And for others, search and ChatGPT were closest. There is no pattern.
So I'm not sure what we're supposed to get out of this. And it certainly does not support the idea that the ChatGPT essays are more homogenous. Right.
Right? Because there's nothing there that's giving us that. All right.
Let's now skip down to figure 25. This is the N-gram analysis. They talk about this a lot in their discussion.
[Regina] (40:03 - 40:06)
Ooh, interesting. I love N-grams. Explain what N-grams are.
[Kristin] (40:06 - 40:22)
Right. Sounds very fancy, but an N-gram is just a sequence of words of length N. So like perfect society, that's a 2-gram, 2 words.
Or create a perfect society, that would be a 3-gram because they ignore small words like a or the.
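For anyone curious, here is a tiny sketch in Python of this kind of N-gram counting. The stop-word list and example text are our own assumptions for illustration, not the paper's exact setup; notice how a phrase lifted straight from the prompt dominates the counts.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "to", "is", "and", "in"}

def top_ngrams(text, n=2, k=5):
    # Lowercase, keep only word characters, drop small "stop" words
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

essay = "Is a perfect society possible? A perfect society may not be desirable."
print(top_ngrams(essay, n=2))  # [('perfect society', 2), ...] -- the prompt phrase wins
```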
[Regina] (40:23 - 40:31)
Right. So basically, this is text analysis that picks out common phrases.
[Kristin] (40:31 - 41:01)
Right. But N-grams sound so much more fancy than common phrases, Regina.
So figure 25 shows the most common phrases, the phrases that popped up the most by group. And again, I don't know how this shows me that the ChatGPT group is more homogenous than the brain-only group because both groups have a fair number of common phrases.
Like weirdly, the phrase multiple choice came up a lot in the brain-only group. And then some version of perfect society came up a lot in all three groups.
[Regina] (41:02 - 41:08)
Wait a minute. Perfect society, that phrase. I remember reading that exact phrase in one of the essay prompts.
[Kristin] (41:08 - 41:15)
Yes. One of the essay prompts was, is a perfect society possible or even desirable?
[Regina] (41:16 - 41:24)
Okay. That is just silly then because of course you're going to see that phrase because you told them to talk about that phrase. Right.
It's in the prompt.
[Kristin] (41:24 - 41:40)
So I don't, like I don't think this N-gram analysis tells us anything. And again, there's no quantification. It's just this picture.
I don't know how we get anything out of it.
[Regina]
Silly.
[Kristin]
Yeah.
All right. Let's now look at figure 37 from the ontology analysis, Regina. They kept referring to figure 37.
[Regina] (41:41 - 41:43)
Ontology. And what is ontology?
[Kristin] (41:43 - 42:03)
An ontology is basically a map that shows how different concepts are related to one another. And they don't give a lot of details here, but basically they try to identify pairs of concepts that appear in the essays like art and movie. That would be a pair.
And they call those pairs edges.
[Regina] (42:03 - 42:23)
Okay. So these are linked concepts now instead of phrases. But what counts as, what did you call it, an edge?
Right. Like, did you have to talk about art and movie in the same sentence or in the same paragraph or anywhere in the essay? What?
[Kristin]
We don't know because they give absolutely no details.
[Regina]
Oh, okay. Not surprising.
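Since the paper gives no details, here is just one plausible reading of an "edge," sketched in Python: two concepts from a fixed list that co-occur anywhere in the same essay. The concept list and the counting rule are our assumptions, purely for illustration of the idea, not the paper's method.

```python
from itertools import combinations
from collections import Counter

# Hypothetical concept list; the paper does not say how concepts were chosen
CONCEPTS = ["art", "movie", "music", "literature", "architecture"]

def count_edges(essays):
    edges = Counter()
    for essay in essays:
        text = essay.lower()
        present = [c for c in CONCEPTS if c in text]          # concepts mentioned in this essay
        edges.update(combinations(sorted(present), 2))        # every co-occurring pair is an "edge"
    return edges

essays = ["Art and music can change lives.", "A movie is art for entertainment."]
print(count_edges(essays))  # Counter({('art', 'music'): 1, ('art', 'movie'): 1})
```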
[Kristin] (42:23 - 42:42)
But again, not very reproducible. Right. But they make a big deal out of figure 37.
It's again, just a visual. But in the visual, you can see that they identified more of these concept pairs from the ChatGPT and search engine essays and very few in the brain-only essays.
[Regina] (42:42 - 42:51)
Wait a minute. So the idea is that the brain-only group is more original because they don't use these frequently paired concepts?
[Kristin] (42:51 - 43:11)
Yeah. I think that's the argument like that you're avoiding cliches or something. But let me give you an example of some of the things that we're seeing in the ChatGPT group. There's a whole set of concept pairs in that group related to art.
So we get art and expression, art and music, art and literature, art and movies, art and books, and art and architecture.
[Regina] (43:11 - 43:14)
Kristin, that wasn't one of the prompts about art?
[Kristin] (43:14 - 43:36)
Yes. It was in the prompt itself. In that art prompt, they mentioned that examples of art include movies, books, and songs.
So maybe ChatGPT is just better at helping students to repeat the prompt in their essay. Right? That's something you're supposed to do in an essay, right?
Make sure that you're addressing the prompt.
[Regina] (43:37 - 43:58)
Right. Kristin, looking at figure 37, it looks like they did not find any of these art concept pairs in the brain-only group, but that's surprising. Yeah.
Because I know some of them chose the art prompt. So how are these people talking about art in a way that has nothing to do with movies or books or architecture or songs or music?
[Kristin] (43:58 - 44:57)
Right. It's actually weird that they didn't find those concept pairs in the brain-only essays. So again, we don't know the details, so it's unclear what's going on, but it does seem strange. And I don't think this analysis is a good measure of originality to begin with, right?
It's, once again, a visual with no quantification, and we're just picking out a lot of things that are in the prompts. And Regina, even beyond that, I don't even know if I can trust that they've thrown the right essays into this algorithm since they seem to have mixed up the essays in the heat map. So maybe somehow all the art prompt essays got into the wrong group, right?
[Regina]
Oh, good point.
[Kristin]
All of this is called into question then. That's the problem, right, when you can't keep track of your sample sizes, right?
So that's it. This is all the so-called evidence for the essays being more homogenous. And Regina, I just don't see any convincing evidence.
Maybe they do have data showing this. They have tons of data, but it's definitely not in the preprint.
[Regina] (44:57 - 44:58)
Yeah.
[Kristin] (44:58 - 45:35)
All right, Regina, let's look at that last claim now, the one related to the teachers. We didn't talk about this earlier, but part of the experimental design here was they did ask two teachers to rate the essays, and they also created an AI judge to rate the essays.
[Regina]
Oh, like ChatGPT rating the essays?
[Kristin]
Yeah, but it wasn't ChatGPT. It was a custom AI that the researchers built to score these essays. And Regina, remember that some media reports stated that the teachers were able to identify the ChatGPT essays and that they called the essays soulless.
[Regina] (45:36 - 45:38)
Now, I would like to quantify soulless, please.
[Kristin] (45:39 - 45:47)
Right, like, how do you measure soullessness? That's a profound question, actually. We should come up with a metric, the soullessness index.
[Regina] (45:47 - 45:58)
I have been on some dates that would score pretty high on the soullessness index. But Kristin, were these media claims based on anything from this paper?
[Kristin] (45:59 - 46:43)
Kind of. As I'm going to show you, the authors don't actually present any data backing these claims up. I think what the media was picking up on was just an anecdotal quote from the teachers that was included in the paper. Oh, no.
Yeah, let me read the quote that they gave from the teachers. The teacher said, some essays across all topics stood out because of a close-to-perfect use of language and structure while simultaneously failing to give personal insights or clear statements. These often lengthy essays included standard ideas, reoccurring typical formulations and statements, which made the use of AI in the writing process rather obvious.
We, as English teachers, perceived these essays as soulless.
[Regina] (46:44 - 46:56)
We perceived these essays as soulless. That is just an impression. Yeah.
Is there anything showing that the essays that they thought used AI were, in fact, those from the ChatGPT group?
[Kristin] (46:57 - 47:44)
No, absolutely not. Regina, the teachers did not know the design or purpose of the experiment. For all they knew, all of the essays could have been ChatGPT-assisted essays.
And the researchers present no data to show that the essays that the teachers are complaining about are, in fact, from the ChatGPT group. That's crazy. Yeah, and the thing is, they have data that they could have used to address that question.
The teachers and the AI judge rated each essay on a zero-to-five scale across multiple categories. So, uniqueness, vocabulary, grammar, organization, content, length, and ChatGPT content. Basically, how much they thought ChatGPT helped write the essay.
[Regina] (47:45 - 47:52)
Well, that means they could have compared those ratings across the three groups, right? They could have, but they didn't.
[Kristin] (47:52 - 48:15)
There are no analyses showing that the teachers rated the essays from the ChatGPT group as higher in ChatGPT content or lower on uniqueness than the essays from the other groups. So, just the fact that the teachers suspected that ChatGPT was involved in the study, that's not evidence of anything. I mean, ChatGPT is on every English teacher's mind right now.
[Regina] (48:15 - 48:18)
And it's one of the things they had to rate on the rubric.
[Kristin] (48:18 - 48:52)
That's right. Right. They were totally primed to suspect ChatGPT. And, Regina, I have been teaching writing for almost a quarter century, long before we had ChatGPT.
And I'll tell you that those perfect grammar empty content essays that the teachers were describing, that is how I would describe a lot of student writing that predates ChatGPT. And in particular, high-achieving students can often master grammar and syntax but still write what I call blobfishes, empty, meaningless prose. ChatGPT did not invent that.
[Regina] (48:53 - 49:06)
No, it did not. So if they did not compare the teachers' ratings across the three groups, then what are all these teacher-rating figures about?
Because there are a lot of them.
[Kristin] (49:07 - 49:24)
Right. Basically figures 40 through 55. But they are all comparing the AI judge's ratings of the essays to the teachers' ratings.
So I think the goal was to see how well the AI could grade. Like, could it match what a human could do?
[Regina] (49:24 - 49:25)
Yeah, but that seems like a totally different study. That's not this one.
[Kristin] (49:25 - 49:40)
Yeah, it seems like this should be a different paper. And none of this comparison between the AI and the teachers supports the claim that the teachers could pick out the ChatGPT essays or that the ChatGPT essays were, quote, soulless.
[Regina] (49:40 - 49:48)
So those claims in the media were all based on this one qualitative quote from the teachers. That's it?
[Kristin] (49:48 - 50:07)
Yes. And the researchers themselves also made a few poorly written statements in the paper that seem to imply that the human teachers could detect the ChatGPT group essays. But again, these claims are backed up by no data. Now, maybe they do have data showing this, right?
They could have tested for this. But it's definitely not in the preprint.
[Regina] (50:08 - 50:14)
I mean, after all that buildup, Kristin, I am kind of disappointed they didn't even try to quantify soullessness.
[Kristin] (50:14 - 51:33)
Right. Where was the soullessness in the ratings? Rate on zero to five. How soulless is this essay?
They did not do that. All right, Regina, I think that wraps up the text analysis section. Before we dive into the brain data, though, let's take a short break.
Regina, I've mentioned before on this podcast that my department offers a remote certificate in epidemiology and clinical research. And I wanted to say a little bit more about that for our listeners. This is a rigorous program, not self-paced, excellent for a really serious student.
Yes, you actually take Stanford classes alongside other Stanford students, except remotely. And you get Stanford credit on a Stanford transcript. So those courses can later be applied to a degree program at Stanford.
If you complete a certain number of courses, then you earn a certificate from my department.
[Regina]
And Kristin, they get to take your Stanford courses.
[Kristin]
Yeah, they'll actually be in my Stanford courses, just remotely.
I think my courses are a lot of fun. Come learn about probability and a lot of other cool stuff. You can find a link to this training program at our website, normalcurves.com.
Welcome back to Normal Curves. We were about to talk about the results of the brain analyses. Regina, walk us through what's actually in this section.
[Regina] (51:33 - 51:47)
Remember that the first major brain claim is that while writing the essays, participants who used ChatGPT had lower brain activity and lower connectivity than those who got no tools at all.
[Kristin] (51:47 - 52:01)
Right, like a lot less going on in the brain, less engagement. That is the claim, but now let's look at the evidence. Regina, they put out these pictures of brains with lots of colors.
They're very pretty. I have to admit, I'm not really sure what I'm supposed to see here. You're going to walk us through it, I hope.
[Regina] (52:01 - 52:22)
Yeah, well, I think that is their main value, actually, that they are pretty and colorful. That top row of brains in those pictures, by the way, has 992 individual lines, so that's why they look like hairballs. Wow.
You can't really interpret hairballs by eye. It's like they're more decorative.
[Kristin] (52:22 - 52:28)
So each one of those brains has 992 lines? I can't see 992 lines, Regina.
[Regina] (52:28 - 52:41)
No, you cannot see them individually, but those lines, nearly 1,000 lines, they are the key to everything in this analysis. So, Kristin, let's talk about where those 992 lines came from and what their colors mean.
[Kristin] (52:42 - 52:49)
Okay, walk us through it, Regina, just to remind everybody the participants are wearing these EEG caps. So how do the caps work? How do we get from the caps to data?
[Regina] (52:50 - 53:35)
I should preface this here by saying I am not a neuroscientist. I have written about neuroscience. I've done some stats consulting, but not a brain researcher.
All right, with that in mind, here's what I read. They had a standard EEG cap with 32 electrodes, which is a pretty common setup in brain research. And each electrode measures these tiny changes in electrical activity from big groups of neurons near the scalp.
It's interesting because the skull and the scalp blur the signal coming from the brain. So you can't pinpoint activity to precise brain areas, but you can get these regional trends. And we know, for example, that electrical activity in the frontal areas is often involved in planning and attention.
[Kristin] (53:35 - 53:39)
That's like when they talk about how teenagers don't have a well-developed frontal cortex.
[Regina] (53:39 - 53:44)
Exactly. So we pop an EEG on them. Maybe it's like a flat line.
[Kristin] (53:45 - 53:57)
So in this study, they measured brain activity or brain connectivity while the participants wrote essays for 20 minutes. That's a long time. So how often were they measuring these little blips of electricity?
[Regina] (53:57 - 54:57)
Good question. 500 times per second.
[Kristin]
Yikes. That is a lot of data.
[Regina]
Yeah, it's like a million data points per minute per person.
Raw EEG data is really messy. So brain researchers use special processing techniques to summarize it all down.
[Kristin]
Okay, great.
[Regina]
In this study, the researchers used a method that very nicely summarized everything over the entire 20-minute period.
[Kristin]
Okay, so one measurement over time.
[Regina]
Right, okay.
So that simplifies it a little bit right there. Now, their method, I'm going to explain it because it's a little weird. It gives each pair of electrodes a score.
And that score summarizes how much the activity at one electrode seems to be influencing or predicting activity at another electrode. Let's say electrode A and electrode B: we're seeing how activity at A predicts activity at B. But it is directional.
So we look both ways. We look at influence or prediction from A to B, but then also from B to A.
[Kristin] (54:58 - 55:15)
But Regina, let's back up for a minute. We're just looking at two electrodes at a time. So like electrode A with electrode B, electrode A with electrode C, electrode A with electrode D.
That is a lot of data then because we have 32 electrodes, each connecting with 31 other electrodes. So that's a lot of pairs.
[Regina] (55:16 - 55:23)
It is a lot of pairs. It is 32 times 31, which is, I'll do the math for you, 992.
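For anyone who wants to check that arithmetic, here is a minimal Python sketch of the pair count. The 32-electrode cap and the directional scoring come from the paper; everything else is just illustration.

```python
# Each ordered pair of distinct electrodes gets its own directional score:
# A -> B is counted separately from B -> A, and A -> A is excluded.
from itertools import permutations

n_electrodes = 32
directed_pairs = list(permutations(range(n_electrodes), 2))
print(len(directed_pairs))  # 32 * 31 = 992
```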
[Kristin] (55:24 - 55:27)
Oh, that's where the 992 lines on the graphs come from.
[Regina] (55:27 - 55:44)
Bingo. It's funny, though, I should point this out: in the paper they said that those brain pictures show all 1,024 lines, all 32 electrodes by 32 electrodes. But that doesn't make sense, because how do you show a line from an electrode to itself?
[Kristin] (55:44 - 56:10)
Right, that is not a line, that is a point. So they meant 992, but again, there's the sloppiness.
Okay, so we have these 992 influence scores, as you call them, representing 992 connections between these pairs of electrodes. How did they analyze those data? First of all, we have three different sessions, so how did they deal with the fact that we've got these brain measurements for the same person over three different sessions?
[Regina] (56:10 - 56:33)
Good question. So they used a statistical test called repeated measures ANOVA. Kristin, earlier you and I talked about ANOVA; this is just a different flavor of ANOVA, and the repeated measures ANOVA incorporates data from all three sessions at once.
Which is great, okay, they handled that correctly, good. But Kristin, they analyzed each connection, each pair of electrodes, separately.
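The preprint doesn't give enough detail to reproduce their exact repeated-measures ANOVA, so here is only a rough stand-in: a hypothetical Python sketch of analyzing one connection's influence scores across the three sessions with a mixed model (a random intercept per participant), which is one standard way to respect repeated measures. The data frame, group labels, and effect sizes are all made up.

```python
# Hypothetical single-connection analysis: one influence score per participant
# per session, three tool groups, repeated sessions handled with a random
# intercept for each participant. A stand-in sketch, not the paper's method.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for g_idx, group in enumerate(["chatgpt", "brain_only", "search"]):
    for s in range(6):                              # made-up participants
        subject = f"{group}_{s}"
        base = 0.15 + 0.02 * g_idx + rng.normal(0, 0.01)
        for session in (1, 2, 3):
            rows.append({"subject": subject, "group": group,
                         "session": session,
                         "score": base + rng.normal(0, 0.005)})
df = pd.DataFrame(rows)

model = smf.mixedlm("score ~ C(group) + session", df, groups=df["subject"])
print(model.fit().summary())
```

The point of the sketch is only that all three sessions go into one model per connection; the multiple testing problem comes from repeating something like this for every one of the 992 connections.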
[Kristin] (56:34 - 56:39)
So each one of those 992 pairs that's a separate outcome?
[Regina]
Yep.
[Kristin]
Oh wow.
[Regina] (56:39 - 56:59)
Yeah, they wanted to know which of those 992 connections, you know, those electrode pairs, differed in strength between the three experimental groups. So they compared, say, electrode A to electrode B between the ChatGPT and Brain-only groups, between Brain-only and Search Engine, and between ChatGPT and Search Engine.
[Kristin] (56:59 - 57:08)
Oh wow, okay, so we've got almost a thousand different outcomes, but then three different comparisons, so that's actually almost three thousand different statistical tests.
[Regina] (57:08 - 57:08)
Yep.
[Kristin] (57:08 - 57:13)
That creates a major multiple testing issue and multiple testing is something we've talked a lot about on this podcast.
[Regina] (57:14 - 57:35)
Yes, and it's bad. Kristin, I hate to tell you, it gets even worse. They also split the signals into ten so-called frequency bands, and you can just think of the frequency bands as different brain radio stations.
Oh okay. Brain rhythms, and you can measure all of those at the same time and split them out.
[Kristin] (57:35 - 57:53)
So each electrode is giving ten different signals then? So it's not 992 outcomes, it's 992 times ten outcomes, is that what you're telling me?
[Regina]
Uh huh, times three, which is...
[Kristin]
Wow, that's about 30,000 different statistical tests, so a really huge multiple testing problem, yes.
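To see where that roughly 30,000 figure comes from, here is the arithmetic spelled out, using the counts discussed above:

```python
n_pairs = 32 * 31        # 992 directed electrode-pair connections
n_bands = 10             # frequency bands, the "brain radio stations"
n_comparisons = 3        # ChatGPT vs Brain-only, ChatGPT vs Search, Brain-only vs Search

n_tests = n_pairs * n_bands * n_comparisons
print(n_tests)           # 29,760 -- "about 30,000" statistical tests
```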
[Regina] (57:53 - 57:54)
It is crazy.
[Kristin] (57:54 - 57:57)
Tell me a little bit more about these frequency bands, what's that about?
[Regina] (57:57 - 58:44)
Yeah, you can think of them as kind of like brain radio channels. Okay. It's like the brain is sending out these neural signals on ten different radio stations at once and the thing is the researchers can measure all those radio channels separately, separate them out and you might have heard of delta waves or theta waves.
[Kristin]
Oh, I've seen those in like the sleeping apps, yes?
[Regina]
Yes, exactly, because the delta frequency band, this delta, you know, brain channel, it's often associated in studies with deep sleep. Ah.
So they tell you you can enhance your delta brain waves, and theta is supposed to be associated with memory and focus, and alpha with generating creative ideas, but it's not really as specific as all of that, yeah.
[Kristin] (58:45 - 59:02)
But the important point here, Regina: let's get back to the 30,000 statistical tests. Why are we looking at all of these separately? Isn't there some way to look at the whole brain at once, like connectivity across these 32 different electrodes? Why are they not looking at it holistically?
[Regina] (59:03 - 59:28)
Excellent question, Kristin. You know, they never measured overall brain connectivity. I think they lost the forest for the trees here.
It's like they were trying to measure the health of a forest but then only looked at each tree separately. You can't assess a forest that way. The brain's like a forest.
Yeah. Yeah. And there are ways to synthesize all of this brain data and get an overall measure of interconnectivity.
[Kristin] (59:28 - 59:28)
There is.
[Regina] (59:29 - 59:29)
There is.
[Kristin] (59:29 - 59:30)
But they just didn't use that here.
[Regina] (59:30 - 59:35)
They could have but they did not. They stayed at like the individual tree, individual electrode level.
[Kristin] (59:35 - 59:40)
Okay, so did they at least then do corrections for multiple testing for these 30,000 tests?
[Regina] (59:41 - 59:53)
Well, they said they did, but they didn't give any details. They said, and I'm going to quote this because it's funny, "when multiple comparisons were involved, p-values were adjusted using the false discovery rate correction."
[Kristin] (59:54 - 1:00:11)
When multiple comparisons were involved? When were they not involved? With 30,000 tests, this entire analysis is one big multiple comparisons problem.
That "when" makes me worried, Regina, because it sounds like maybe they're cherry-picking what they corrected.
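For reference, here is a hypothetical sketch of what a blanket false discovery rate correction over all of the tests could look like, using the standard Benjamini-Hochberg procedure in Python. The p-values are simulated; the preprint doesn't say which tests were actually pooled together for correction.

```python
# Benjamini-Hochberg FDR correction applied across all ~30,000 p-values at once.
# Simulated null p-values stand in for the real results, which aren't reported.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
p_values = rng.uniform(size=29_760)

rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(rejected.sum())  # with purely null data, close to zero survive correction
```

Correcting across a hand-picked subset of tests instead of the full set would be far more forgiving, which is why the vague "when multiple comparisons were involved" wording matters.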
[Regina] (1:00:12 - 1:00:20)
We've talked about multiple comparisons in our review episode. We talked about the multiple comparisons dude and why you have to avoid him.
[Kristin] (1:00:20 - 1:00:24)
This dude has like 30,000 dates. It's very impressive.
[Regina] (1:00:25 - 1:00:45)
He's very busy. Okay, so we cannot tell from this manuscript how well they dealt with this multiple comparisons problem. Did they correct across all 30,000 tests or just some subsets?
Like, we really need to know the details to judge how strong any of these findings are. We need it.
[Kristin] (1:00:45 - 1:00:54)
So maybe they did do the right thing and corrected for all 30,000 tests at once, but I'm kind of doubtful; maybe they did something that didn't provide much correction at all.
[Regina] (1:00:54 - 1:00:54)
Yeah.
[Kristin] (1:00:54 - 1:01:08)
And, Regina, if they did not do a meaningful correction and there were actually no real differences between the groups, we'd expect 5% of those 30,000 tests to come up significant just by chance. That's 1,500 false positives?
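That expectation is just alpha times the number of tests, and a quick simulation shows how easily uncorrected noise produces results on that scale. This is purely an illustration under the assumption of no real group differences and no correction.

```python
# Expected false positives with no true effects and no correction:
import numpy as np

n_tests = 29_760
alpha = 0.05
print(n_tests * alpha)              # about 1,488, i.e. roughly 1,500

rng = np.random.default_rng(7)
null_p = rng.uniform(size=n_tests)  # p-values when nothing real is going on
print((null_p < alpha).sum())       # typically lands near 1,500 "significant" tests
```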
[Regina] (1:01:09 - 1:01:17)
That false positive count is so important, yes. We have to keep that in mind. And, Kristin, guess how many significant results they ended up with?
[Kristin] (1:01:17 - 1:01:18)
Don't tell me 1,500.
[Regina] (1:01:18 - 1:01:21)
1,700.
[Kristin]
Wow.
[Regina]
Which is not far from 1,500.
[Kristin] (1:01:22 - 1:01:30)
So it's possible these results are consistent with nothing more than random noise depending on if and how they corrected for multiple testing.
[Regina] (1:01:30 - 1:01:31)
Yep, big problem here.
[Kristin] (1:01:31 - 1:01:45)
All right, so big picture on the results, Regina. We've got these 10 frequency bands, and I see in the paper that for each one of those they labeled one of the groups as having the, quote, highest connectivity. Statistically, how did they decide that from all this data?
[Regina] (1:01:46 - 1:01:51)
Good question. It is a little weird. Okay.
So I'm going to go specific and just give you an example.
[Kristin] (1:01:52 - 1:01:53)
Concrete is good, yeah.
[Regina] (1:01:53 - 1:02:00)
Okay, so let's take the alpha wave data, right? We're comparing, let's say, ChatGPT users to the brain-only users.
[Kristin] (1:02:00 - 1:02:02)
Right, 992 outcomes then.
[Regina] (1:02:03 - 1:02:18)
They report that they found 79 connections where the brain-only group had significantly higher activity and 42 where the ChatGPT group was higher. So they simply said 79 is more than 42, therefore the brain-only group had higher connectivity.
[Kristin] (1:02:18 - 1:02:30)
All right, but that completely ignores the fact that there were 992 comparisons underneath this. Yep. And in the vast, vast majority of them, there was no significant difference is what I'm hearing.
So I'm not sure that's very convincing.
[Regina] (1:02:31 - 1:02:57)
Yes, they were just focusing on this sliver of differences. You can't do that. You can't ignore the big picture.
And Kristin, they were not even consistent. Okay. Sometimes they compared the number of significant connections, like 79 versus 42.
Other times, though, they added up those influence scores, remember those numbers, but only from the significant connections, and said, well, whichever group has the bigger sum must have higher connectivity.
[Kristin] (1:02:57 - 1:03:09)
But wait a minute. If they want to look at those influence scores, they need to add up the influence scores across all of the connections, not just the ones that happen to meet this arbitrary bar of statistical significance. That's cherry-picking.
[Regina] (1:03:09 - 1:03:29)
That is cherry-picking. It's like they're changing the rules. It's also this circular logic.
Think about it. They're saying group A scores higher on the test questions where group A scores higher. Of course they do.
No big surprise there. It tells you nothing about overall brain connectivity across all electrode pairs.
[Kristin] (1:03:29 - 1:03:34)
I'm still worried about this because we're just looking at two electrodes at a time. What is that really telling us?
[Regina] (1:03:34 - 1:03:57)
I have a bad analogy here. I know nothing about sports, so help me out. I'm picturing it like two soccer teams and you're comparing the strength and the performance of two soccer teams by running these tiny one-on-one scrimmages.
Like the two goalies face off and then the two center forwards and then the two side rear flank dudes.
[Kristin] (1:03:57 - 1:03:58)
Those are the only positions that you know.
[Regina] (1:03:59 - 1:04:01)
I'm impressed I had those.
[Kristin] (1:04:01 - 1:04:24)
Very good. And then you would tally up which team won more of these little matchups and declare that team the winner. Team A won 15, Team B won 14, so Team A won the game.
Or they picked out the matchups that had the biggest blowouts, added up the scores only from those, ignored the rest of them, and then declared a winner that way. Both of those are not great ways to declare a winner.
[Regina] (1:04:25 - 1:04:28)
No, I don't think soccer would be quite as popular if they did it this way, if that's how we decided the winner.
[Kristin] (1:04:28 - 1:04:32)
The teamwork, how the team works together, that's what really matters.
[Regina] (1:04:32 - 1:04:35)
That's what matters and that's what matters with brains too.
[Kristin] (1:04:35 - 1:05:09)
Absolutely, yes. Those connections overall are really important in brains, yeah. Okay, Regina, so now I think I understand those pretty brain pictures better.
Some of them show all 992 lines, but the stronger connections are in this redder color. And then in some of the brains they remove all the lines that weren't, quote, statistically significant, so you see fewer lines. It's a little hard to get much out of the picture, but maybe you're supposed to see darker colors in the brain-only group. To me, it looks like most of the brain is the same, actually, in the different groups.
They seem to be using the same parts of their brains.
[Regina] (1:05:09 - 1:05:12)
I think it's a bit like a Rorschach test so what do you see in this?
[Kristin] (1:05:13 - 1:05:21)
Right, all right, so that was the first claim about the brain, Regina. What about the second big claim, the lingering mental sluggishness?
[Regina] (1:05:21 - 1:05:29)
Oh, this one is fun. All right, this claim was based on session four which, remember, had only 18 people and lost the benefits of randomization.
[Kristin] (1:05:30 - 1:05:31)
It was problematic already.
[Regina] (1:05:31 - 1:06:17)
Right, right, it gets worse. In the paper, the researchers say that they present results that show how the brain evolved or adapted or improved or declined, they had all these verbs, when participants switched from one tool to another. But, Kristin, the truth is that their analyses don't show anything about how brains changed.
Yeah. I think they didn't really think it through properly, because instead of following the same people to see how their brains changed after switching, like, you used ChatGPT for the first three sessions and then switched to brain-only, they scrambled their analyses. They compared the people who started in the ChatGPT group to the people who ended in the ChatGPT group.
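For contrast, here is a minimal sketch, with made-up numbers, of what a genuine within-person comparison would look like: the same participants summarized before and after switching tools, analyzed as pairs. This is not what the paper reports; it's the kind of analysis the "brains changed" language would seem to require.

```python
# Hypothetical within-person comparison: connectivity summaries for the SAME
# nine participants in their original sessions and again after switching tools.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before_switch = rng.normal(0.20, 0.03, size=9)                 # sessions 1-3
after_switch = before_switch + rng.normal(0.00, 0.02, size=9)  # session 4

t_stat, p_value = stats.ttest_rel(before_switch, after_switch)
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")
```

Comparing the nine people who started with ChatGPT to a different nine people who ended with ChatGPT, as the paper does, cannot show change within anyone's brain.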
[Kristin] (1:06:17 - 1:06:32)
Wait a minute. So they didn't compare the people in the ChatGPT group to themselves when they later switched to brain only? What?
They compared the people to other people?
[Regina]
To other people!
[Kristin]
That doesn't tell us anything about how a brain evolved or changed.
[Regina] (1:06:32 - 1:06:41)
And they did the same with the brain only group, right? They compared the people who started in the brain only group to the people who ended in the brain only group.
[Kristin] (1:06:42 - 1:06:46)
So you cannot call it change over time because it's two different groups. Two different groups.
[Regina] (1:06:46 - 1:06:49)
Yes. They completely lost it. It is a big problem, but that's what the researchers did. So we need to remember that when we're seeing conclusions like participants who started with AI showed reduced connectivity when they had to then write on their own.
[Kristin] (1:06:49 - 1:07:11)
That "reduced" verb is wrong, and it's not a lingering mental sluggishness if it's not the same person.
There's no lingering. And not to mention, of course, that this whole analysis is based on just a select set of nine people per group.
[Regina] (1:07:12 - 1:07:31)
This is a problem. Yep. So now I am thinking that blind date, it's worse than him just being unshowered.
Like he showed up in his underwear, you know, he showed up naked, with broccoli caught between his teeth.
[Kristin] (1:07:31 - 1:07:39)
Naked. I mean, maybe he had something to show off.
Okay. See our male equipment episode.
[Regina]
Nice callback.
[Kristin]
It's good transparency, Regina.
[Regina] (1:07:41 - 1:07:43)
Full transparency. Literally transparency.
[Kristin] (1:07:44 - 1:07:45)
Just like a preprint. Full transparency.
[Regina] (1:07:46 - 1:07:52)
Totally naked. But if you're going to be this transparent, you got to have something to back it up, man.
[Kristin] (1:07:53 - 1:08:00)
Yeah, exactly. So it would have been fine had he shown up naked and had the goods to back it up. Right.
This paper doesn't seem to have the goods to back it up.
[Regina] (1:08:01 - 1:08:02)
No, it does not. I don't think this was a venti.
[Kristin] (1:08:04 - 1:08:06)
Again, see male equipment episode.
[Regina] (1:08:08 - 1:08:15)
Kristin, maybe there were analyses we have not seen, you know, analyses that they have somewhere that just didn't make it into this paper.
[Kristin] (1:08:15 - 1:08:26)
Absolutely, there could be, but they're not in the paper. There's a ton of data here. And like I said with the teachers' analysis, there are ways they could have answered that question.
So maybe they did it behind the scenes and we just never saw it.
[Regina] (1:08:26 - 1:08:49)
You know, we're talking about things that we don't like. I want to take a moment to say something that I did like.
[Kristin]
Oh, good.
[Regina]
In the beginning of the paper, this is the first time I've seen this, they had a TLDR, which stands for too long, didn't read. And they said something like, if you just want the bottom line, then go here. And they had a hyperlink within the PDF to just the summary and discussion.
[Kristin] (1:08:50 - 1:09:26)
I used those hyperlinks so many times in reading this paper, Regina, because it made it easier to navigate this kind of long and dense paper. So I did appreciate that.
All right, Regina, are we ready to rate the strength of the evidence for our claim today?
[Regina]
I think we are ready. Kristin, you want to restate the claim for us?
[Kristin]
Right.
So I think the claim we chose for today was this big claim that ChatGPT makes your brain lazy and your writing soulless. And I do want to qualify that we are here rating the strength of the evidence for that claim only from this one paper. Sometimes we look at the whole swath of literature, but in this case we're just looking at a single paper.
[Regina] (1:09:27 - 1:09:43)
And just to remind everyone how we rate each claim: it's our trademarked smooch rating scale, where one smooch means little to no evidence supporting that claim and five smooches means very strong evidence supporting the claim. And Kristin, you go first. What are you doing?
What are you doing?
[Kristin] (1:09:44 - 1:09:58)
So this is one of those instances where I don't know why one is our lowest value. Like, why don't we have a zero or a negative? I'm going to need martinis in the face for this one.
You know, you show up unshowered and naked to the blind date. It might be a martini in the face for me.
[Regina] (1:09:58 - 1:09:59)
How many?
[Kristin] (1:10:00 - 1:10:04)
I'm going with three martinis in the face on this one. Yeah, some of those bar graphs, eesh.
[Regina] (1:10:05 - 1:10:41)
Well, I'm not going to throw alcohol. I'm going to keep it because I might need a couple of drinks after getting through this paper. I am going to go with one smooch because I think it is totally possible that maybe they have evidence to back up some of these claims.
It's just not in the paper. I'm thinking maybe they need to work with a statistician, maybe a writing coach to bring out the evidence in a meaningful way. There's nothing in the paper that is really backing up this claim.
I'm going to go with one smooch but I am going to maybe give my unshowered blind date a second chance.
[Kristin] (1:10:43 - 1:10:50)
They do have all this great data. Maybe there is something in there and it's not presented well.
[Regina] (1:10:51 - 1:10:54)
We don't know what they are submitting for peer review.
[Kristin] (1:10:55 - 1:11:16)
I do have a suggestion for the authors. This is a little ironic. But I would suggest that they maybe use ChatGPT to help them write a better paper.
Because ChatGPT is actually good at picking out things like, hey, your bar graph is missing a y-axis label. Your writing doesn't make a lot of sense here. I think they should have used ChatGPT to help them check this paper.
[Regina] (1:11:16 - 1:11:44)
It's ironic because they put an LLM trap at the beginning of their paper. Did you see this?
[Kristin]
I saw that.
Can you explain it though? 'cause I didn't really understand what that was about.
[Regina]
Right? It's near the beginning, and it basically says: if you are an LLM, a large language model, read only this summary table.
Don't read the rest of the paper. Why would they do this? They wanted to prevent other people from using ChatGPT to read the paper and summarize it.
[Kristin] (1:11:44 - 1:11:49)
I think people would have been helped a lot by having ChatGPT read the paper. Because it was a really hard paper to read.
[Regina] (1:11:49 - 1:11:56)
Yeah. It was. And as you point out, an LLM might have caught some of these problems.
[Kristin] (1:11:57 - 1:11:57)
I think it would have actually. Yeah.
Alright. Regina, are we ready then for methodologic morals?
[Regina]
Oh, I think we are. I've got mine. Are you ready?
[Kristin]
Go for it.
[Regina]
Treat your preprints like your blind date. Be sure to show up showered and with teeth brushed.
[Kristin]
That's very good, Regina. Yes.
[Regina]
And you, Kristin?
[Kristin]
Regina, I think I'm gonna need more than one today. Is that okay?
[Regina]
Sure. Have as many as you want.
[Kristin] (1:12:20 - 1:12:26)
My first is: always check your N, then check it again.
[Regina]
I like it.
[Kristin]
I went for rhyming.
[Regina]
Yes. That works in so many situations though.
[Kristin]
It does though. Yeah.
[Kristin] (1:12:30 - 1:12:52)
Gotta get one more, though.
Never make a bar graph that just shows p-values, ever.
[Regina]
Ever.
Amen.
[Kristin]
Yes. All right, Regina.
Well, you know, I went into this, again, we're giving this webinar on ChatGPT on August 20th, and I had heard of this paper from the media, so I thought, hey, great opportunity for us to talk about a fascinating paper about the brain.
[Regina] (1:12:52 - 1:12:55)
It was fascinating, but not how I expected it to be fascinating.
[Kristin] (1:12:56 - 1:12:57)
Yeah. How about that? Exactly.
[Regina] (1:12:57 - 1:13:10)
But this has been a very interesting analysis, and this is the benefit, I think, of preprints, because it allows people like you and me to go through and do some checking and make suggestions.
[Kristin] (1:13:10 - 1:13:19)
And I hope they take our feedback constructively. It is meant to be constructive, and hopefully they'll fix up the paper before it's actually published. So, great opportunity.
[Regina] (1:13:19 - 1:13:21)
Yeah, because they have a lot of good data here.
[Kristin] (1:13:21 - 1:13:23)
A lot. Yeah, exactly. A lot to work with.
[Regina] (1:13:23 - 1:13:26)
Kristin, I, for one, plan to keep using ChatGPT.
[Kristin] (1:13:26 - 1:13:28)
Oh, absolutely. Yes. Yeah.
[Regina] (1:13:28 - 1:13:29)
Mental sluggishness or no.
[Kristin] (1:13:30 - 1:13:34)
Mental sluggishness be damned. It's saving me time.
[Regina] (1:13:34 - 1:13:37)
Right. It's productivity. So, I'm okay with that.
Yes.
[Kristin] (1:13:37 - 1:13:42)
I'm all about efficiency. And please, again, come to our webinar on ChatGPT on August 20th.
[Regina] (1:13:42 - 1:13:51)
We'll put a link to the registration on normalcurves.com, and we'll also put a link to the recorded session after August 20th in case you miss us live.
[Kristin] (1:13:52 - 1:13:58)
Yeah. I hope a lot of our listeners will tune in to that.
[Regina]
Kristin, this was so much fun.
Thank you.
[Kristin]
Thank you, Regina, and thanks, everyone for listening.