The Future of Medicine

Jonathan Chen on AI in Medicine: Promise, Pitfalls, and Practice

Stanford Department of Medicine Season 1 Episode 13

In this episode of The Future of Medicine, we welcome Jonathan Chen, MD, PhD, clinician, AI researcher, and Associate Professor at Stanford, whose work focuses on combining human and artificial intelligence to improve clinical decision-making.

Dr. Chen reflects on the rapid rise of AI in medicine, and the moment he realized everything had changed. He also walks through surprising findings from his research, including studies showing that AI alone can sometimes outperform doctors using AI tools. He explains why this happens, from human bias and “automation bias” to the ways AI systems are designed to agree with users, even when they’re wrong.

Looking ahead, Dr. Chen shares his perspective on the future of AI in medicine, including the risks of overreliance, the importance of clinical judgment, and how these tools could transform everything from medical education to patient care. He also explores the concept of “do no harm” in AI systems—and why safety and accuracy are not the same thing.

Thank you for listening!

Call to action: If you enjoy The Future of Medicine, subscribe for more conversations with leading scientists shaping the next era of healthcare. Please rate and review the podcast to help others discover these important discussions. Share with friends and colleagues who are curious about how science becomes medicine.

0:00:00.160,0:00:04.320
Confabulations, hallucinations... Your judgment 
matters even more than it did before. Dr. Jonathan

0:00:04.320,0:00:09.600
Chen is a clinician and AI researcher who's making 
medicine smarter by turning real-world data into

0:00:09.600,0:00:14.320
tools that actually work at the bedside. 
When I first saw a preview version of GPT-4

0:00:14.320,0:00:19.200
as things were building up, I was like, "Holy 
smokes." I literally felt like I need to throw

0:00:19.200,0:00:23.760
away half of my research program. When he's not 
advancing the future of clinical decision-making,

0:00:23.760,0:00:29.200
he's an accomplished magician, so he knows a thing 
or two about making the impossible feel possible.

0:00:29.200,0:00:32.320
I'm using magic again as that analogy. Boy,
does that look real, but you still have to

0:00:32.320,0:00:37.520
have that judgment to tell the difference. In our 
interview, he dives into how AI can support rather

0:00:37.520,0:00:43.600
than replace human judgment. AI could be the best 
teacher you have ever had, but if you misuse it,

0:00:43.600,0:00:47.920
you miss the point, and shares what he's learning 
from bringing these systems into real clinical

0:00:47.920,0:00:53.680
practice. Welcome to Stanford Department of 
Medicine's inside look at the future of medicine.

0:00:53.680,0:01:00.160
Well, Jonathan, welcome to the future of medicine. 
You just came from our grand rounds being a

0:01:00.160,0:01:06.240
speaker to a packed house, presenting not just 
cutting-edge ideas in AI and a lot of your own

0:01:06.240,0:01:15.200
work, but magic. So let us start there — medicine 
or magic, which came first? How did you get

0:01:15.200,0:01:20.560
there? How did you arrive at today as a faculty 
member here at Stanford who practices both? Who

0:01:20.560,0:01:24.800
would be crazy enough to try all of the above?
Um, I don't know. I dabbled in magic a little

0:01:24.800,0:01:29.440
when I was a 12-year-old boy, as many 12-year-old 
boys do. And then after a few months, I stopped

0:01:29.440,0:01:32.720
doing it because I had nobody to perform for. 
My sister wasn't going to watch it more than one

0:01:32.720,0:01:41.280
time, right? I gave it up. These were card tricks. 
I learned a three-card monte, I learned some trick

0:01:41.280,0:01:45.440
decks — I actually learned a few basic things, but 
probably nothing you'd be impressed by. Just kind

0:01:45.440,0:01:50.400
of kid stuff. And then I became a doctor, more
because my parents wanted me to than anything else. There's a

0:01:50.400,0:01:52.800
little bit of story there as well.
Well, tell us.

0:01:53.760,0:01:58.320
I started college unusually young. I was 13 years 
old when I started college. And I don't know about

0:01:58.320,0:02:04.000
you — when I was 13, I did not know what I wanted 
to be when I grew up. If you had asked me when

0:02:04.000,0:02:07.680
I was a kid, I would have said I wanted to be a 
stand-up comedian because I like to make people

0:02:07.680,0:02:11.680
laugh and that felt good. I never said I want 
to be a doctor or a scientist or something like

0:02:11.680,0:02:16.000
that. But how did you end up at college so early?
So there's a program in Los Angeles at Cal State

0:02:16.000,0:02:19.280
LA. It actually started, I think, out of their 
psychology department — like an experiment,

0:02:19.280,0:02:23.760
basically. They studied precocious kids in junior 
high and asked, do you want to basically be in

0:02:23.760,0:02:28.480
this program? It just perpetuated. I took a test, 
they said you did okay, why don't you come in for

0:02:28.480,0:02:32.400
summer quarter? And now there's a whole program 
— if you do okay, not just academically but

0:02:32.400,0:02:36.320
in terms of emotional readiness, you could just 
start college full-time through that program.

0:02:36.320,0:02:40.880
Okay, but "did okay" among the people being 
considered for college at age 13, right?

0:02:40.880,0:02:46.560
Yes, so the scale is a little different. Not that 
it was anything extreme, but that kind of dynamic.

0:02:47.360,0:02:50.880
That was also very important and powerful because 
it's not like it was just me on a giant college

0:02:50.880,0:02:56.000
campus. It was me and like 30-ish other kids, 
essentially. So I had a cohort to hang out with. A

0:02:56.000,0:03:01.520
couple of years after that, I transferred to UCLA. 
That really was tough — now I'm a 15-year-old kid

0:03:01.520,0:03:06.800
living on a gigantic college campus by myself, 
living in the dorms. That was very challenging,

0:03:06.800,0:03:12.000
but it kind of forced me to grow up real fast. 
It's not enough just to study anymore. I can study

0:03:12.000,0:03:16.160
and ace all the exams — oh, that's just not enough 
anymore. You have to figure out life real quick.

0:03:16.160,0:03:24.400
Yeah, for a 15-year-old, that's a challenge.
So then you graduated — I graduated from UCLA.

0:03:24.400,0:03:29.600
I was in all the pre-med clubs, I took the MCAT — 
but you were 16 or something when you graduated?

0:03:29.600,0:03:35.920
I was a little older than that. I was a 
ripe 19 by the time I was out of there.

0:03:35.920,0:03:40.000
But then kind of at the last minute, I hear 
all this: well, you're going to be a doctor,

0:03:40.000,0:03:44.480
you've got to have a passion for healing, it's 
got to be like your life's mission. And I'm like,

0:03:44.480,0:03:49.920
I'm 18, 19 years old — I feel nothing. Okay, I'm a 
kid basically. I'm not ready to commit my life to

0:03:49.920,0:03:54.640
this. I need to go have some other experiences. 
And my real underlying nerd passion — I'm

0:03:54.640,0:03:57.920
just a nerd, I just like to study, I like to learn 
things — and I'm a computer nerd, right? I like

0:03:57.920,0:04:03.200
to program. So I actually worked as a software 
engineer, software development, nothing to do

0:04:03.200,0:04:08.320
with biology or medicine, for a couple of years. 
The dot-com bubble was bursting around that era.

0:04:08.320,0:04:12.880
And what were you programming back then? What 
kind of computer were you using? What language?

0:04:12.880,0:04:18.480
Gosh, I mean I was learning C++ in class, 
but most of industry at the time was Java

0:04:18.480,0:04:23.440
programming, and then the internet had become a 
thing — making three-tiered applications, HTML,

0:04:23.440,0:04:29.440
JavaScript, Java at the time. Now it would seem 
antiquated; things have evolved. But that gave me

0:04:29.440,0:04:34.400
a lot of perspective on industry. Oh wow, suddenly
when I could make six figures as a 19-year-old,

0:04:34.400,0:04:43.280
my mom didn't mind so much that I wasn't going to
medical school. But then after a year or so,

0:04:44.400,0:04:48.080
the company laid off a third of its 
staff during the dot-com bubble burst,

0:04:48.080,0:04:53.920
including myself. I got another job — it was 
fine. But it gave me more perspective: well,

0:04:53.920,0:04:59.040
this is a good job and I'm good at it, but I could 
foresee I'm not sure I'm going to find a rewarding

0:04:59.040,0:05:03.120
long-term career I'm going to be happy with.
Were you from a technical

0:05:03.120,0:05:09.200
or academic or medical family?
Really not. I'm sort of my mom's aspiration — the

0:05:09.200,0:05:12.800
first doctor in the family kind of situation. My 
dad was an electrical engineer with a computer

0:05:12.800,0:05:17.680
science master's. That's where I get a lot of 
the nerdiness from. But this combination was very

0:05:17.680,0:05:23.200
weird. Then an old friend said, "Why don't you 
consider medical school again?" I'm like, "What?"

0:05:23.200,0:05:26.800
I'd left that all behind. I'm not going to go 
back; I'd have to go back and beg my old professors

0:05:26.800,0:05:34.240
for a letter of recommendation. Forget it.
It was my girlfriend at the time. She said, "No,

0:05:34.240,0:05:38.240
Jonathan, you should totally apply for med school. 
You should do it because in life, you will regret

0:05:38.240,0:05:43.200
what you didn't try, not what you did try."
Once I heard that — putting the guilt trip

0:05:43.200,0:05:52.400
on me — I was like, well, now I have to. But I 
said I'm going to apply for MD-PhD programs only,

0:05:52.400,0:05:56.880
and the PhD has to be in computer science. 
That does sound crazy, but that's the point.

0:05:56.880,0:06:00.720
If I'm going to do it, it has to be crazy 
enough to be something different. Otherwise,

0:06:00.720,0:06:09.840
I might as well just stay in my job.
MD-PhD with computer science — not many

0:06:09.840,0:06:17.120
places do that. I give
credit to University of California, Irvine. They

0:06:17.120,0:06:22.000
were very open-minded to that combination. They 
actually literally had another MSTP — Medical

0:06:22.000,0:06:25.280
Scientist Training Program — student a couple of 
years ahead of me doing that same combo. Whereas

0:06:25.280,0:06:32.560
most places would look at me strange: "Shouldn't 
you be getting a PhD in biochemistry, immunology,

0:06:32.560,0:06:50.000
neuroscience? Why computer science?"
And this was around 2000, 2001, 2002.

0:06:50.000,0:06:58.000
Yes. The internet had happened, you were 
programming in Java, JavaScript — but we

0:06:58.000,0:07:08.160
were really just maturing in our sense of what the 
internet was, just after the crash had started.

0:07:09.280,0:07:15.840
Okay. So then you headed to medical school.
I ended up doing the joint degree program.

0:07:19.920,0:07:41.760
I got the PhD in computer science. What I was 
working on would nowadays be called rules-based

0:07:41.760,0:07:47.120
expert systems and machine learning applications 
for chemistry — specifically organic chemistry,

0:07:47.120,0:07:51.200
which was my subject domain. A lesson I took 
from my industry experience is that you need

0:07:51.200,0:08:03.360
strong technical skills, otherwise it's all just 
talk. Medicine is really the subject domain I care

0:08:03.360,0:08:07.120
most about now, but at the time I said, well, 
chemistry — I think that's fun and interesting.

0:08:07.120,0:08:11.520
In so many words, I made AI systems that 
could do your organic chemistry homework

0:08:11.520,0:08:17.440
for you. The irony is, if it can do your homework 
for you and help with pharmaceutical development,

0:08:18.320,0:08:23.120
it could also teach you how to do your homework. 
Not what I expected, but one of the main outputs

0:08:23.120,0:08:27.120
of my PhD was an education system that was used 
for over a decade by students around the world

0:08:27.120,0:08:33.760
to learn organic chemistry. My daughter is 
currently suffering through it in college.

0:08:40.720,0:08:43.840
So then you did your clinical 
work and became a doctor.

0:08:45.840,0:08:49.360
I actually did not intend to do residency. I 
was never intending to be a practicing doctor.

0:08:49.360,0:08:53.840
I was like, I'll just get the knowledge, I'll 
graduate, maybe do consulting — I'm going to

0:08:53.840,0:09:01.920
go back to industry in some form or another. And 
then I did my clerkships — my third-year internal

0:09:01.920,0:09:07.760
medicine rotation — and it surprised me how much I 
liked it. I did not expect to like it. I was like,

0:09:07.760,0:09:11.200
oh, you're going to be at the bottom of 
the totem pole, have sick people complain

0:09:11.200,0:09:14.800
to you all the time — who wants to deal with 
this? And then I got there and it was like:

0:09:14.800,0:09:19.440
this is actually really cool. You work in a 
team, you're working on good problems together,

0:09:19.440,0:09:25.360
it's something that matters. What you do has very 
clear impact. People are hanging on your words,

0:09:25.360,0:09:29.840
you've got to do it right. There's a very 
interesting applied expertise that has

0:09:29.840,0:09:34.800
to manifest there. It doesn't matter what the 
abstract says. Either you prescribe the medicine

0:09:34.800,0:09:39.440
or you don't — how do you make that kind of tough 
decision? I thought that was very compelling.

0:09:39.440,0:09:44.160
So I did decide to do residency, and that's how 
I first arrived at Stanford. Third time was a

0:09:44.160,0:09:47.680
charm — Stanford didn't work out for undergrad,
nor for med school and the PhD, but finally internal

0:09:47.680,0:09:51.040
medicine residency. Very grateful to have 
matched at Stanford. I've been here since.

0:09:51.760,0:09:56.560
We've been very happy to have you here in 
the Department of Medicine, part of our newly

0:09:56.560,0:10:01.120
named Division of Computational Medicine. 
You have a number of different roles now,

0:10:01.120,0:10:05.840
but I think it would be great to start with some 
of the really impactful work that's been covered

0:10:05.840,0:10:13.920
in the press around the world — thinking about 
the application of modern-day AI architectures to

0:10:13.920,0:10:19.680
medicine. In particular, maybe the thing that 
made the biggest headlines was this idea that

0:10:19.680,0:10:28.960
when you compared human doctors plus AI tools 
to the AI tool alone, the AI tool appeared to

0:10:28.960,0:10:35.280
outperform — the doctor was actually pulling 
down the performance. Talk about that work and

0:10:35.280,0:10:40.400
some of the caveats and what it means for us.
That was definitely not what we expected. When

0:10:40.400,0:10:44.320
I first saw a preview version of GPT-4 
as things were building up, I was like,

0:10:44.320,0:10:49.360
"Holy smokes." I literally felt like I need to 
throw away half of my research program. This thing

0:10:49.360,0:10:54.480
had leapfrogged so many things I didn't think were
possible yet, and it moved much faster than I expected.

0:10:55.440,0:10:59.840
But then empirically — I didn't even bother 
checking multiple-choice questions because I

0:10:59.840,0:11:12.960
know 100 people do that; it's so easy, and that's 
not what we as doctors would care about. Complex

0:11:12.960,0:11:17.280
case reasoning with expert consensus grading — 
very hard to do, but that's the key. Our educators

0:11:17.280,0:11:22.960
are key to unlocking that, and the result was not 
what we expected. Our hypothesis was:

0:11:22.960,0:11:28.160
oh, you could look up stuff in UpToDate, PubMed, 
or use this AI tool that seems really smart,

0:11:28.160,0:11:33.280
and I bet doctors could be even better — we can 
show this combination is so great. And that's

0:11:33.280,0:11:39.040
not what we found. We found the combination didn't 
make that much difference. Very surprising to us.

0:11:39.040,0:11:51.360
And these were doctors who were already somewhat 
familiar with the tools? At the time we did that

0:11:51.360,0:11:56.880
study — about two and a half years ago — 
a third of the doctors had never touched

0:11:56.880,0:12:01.040
ChatGPT or a chatbot before in their life, and 
another third had maybe used it once or twice.

0:12:01.040,0:12:05.040
It was clear many of them didn't know what it was 
and didn't know how to use it, or if they did,

0:12:05.040,0:12:11.920
they did not trust it. And I agree — the first 
time you chat with an AI chatbot, it felt weird.

0:12:12.640,0:12:16.880
You didn't know what to ask, you didn't know 
what it could say. It was a very bizarre feeling.

0:12:20.480,0:12:27.200
For decades in computer science, we had this idea of
the Turing test from the 1950s — and suddenly

0:12:27.200,0:12:33.920
we leapfrogged past it almost overnight. 
It became almost irrelevant. For medicine,

0:12:33.920,0:12:41.280
it's clearly become one of the
major applications. With the internet,

0:12:41.280,0:12:48.160
it wasn't immediately obvious that health was 
one of the most interesting applications. With

0:12:48.160,0:12:55.120
the language models we're seeing now, it really
has come to the forefront. We don't see patients anymore

0:12:55.120,0:13:00.800
who haven't already consulted a language model. 
The idea two years later that there would be a

0:13:00.800,0:13:09.040
doctor who doesn't know how to use a language 
model — that's not very credible anymore.

0:13:09.600,0:13:14.240
But what exactly do you think it was — 
that naivety about how to use the system,

0:13:14.880,0:13:21.200
or do you genuinely think the human was 
dragging down the reasoning within the model?

0:13:32.160,0:13:37.600
There is anchoring and sycophancy that happens. 
If you say, "Hey AI, I've got this patient with

0:13:37.600,0:13:41.440
shortness of breath — do you think it's congestive 
heart failure?" It'll probably just say, "Sure,

0:13:41.440,0:13:45.760
good idea, doctor. Congestive heart failure 
sounds great." When maybe it's not giving you

0:13:45.760,0:13:49.360
the objective assessment it could have. 
We've empirically shown that many times.

0:13:50.800,0:14:05.040
Why does it do that? At the very base level, I 
like to call them autocomplete on steroids. It's

0:14:05.040,0:14:10.000
just read the internet, in so many words, and can 
guess the next word, filling in a sentence. But if

0:14:10.000,0:14:20.640
it just did that, it's not very interesting to 
work with. So they did supervised fine-tuning,

0:14:20.640,0:14:25.120
reinforcement learning from human feedback. What 
does that mean? It'll output 10 different answers,

0:14:28.480,0:14:33.760
and humans said, "I like this answer better — this 
is answering my question better, this one is more

0:14:33.760,0:14:38.640
coherent." So it gives more coherent answers. 
But that also means humans are instilling their

0:14:38.640,0:14:43.840
values of what they like. And humans really 
like it when people — or computers — agree

0:14:43.840,0:14:50.640
with them, whether it's right or not.
So this leads the model into a part of

0:14:50.640,0:15:01.760
its space where it not only wants to please, 
but it doesn't fully separate the facts it

0:15:01.760,0:15:06.480
needs to provide from its sense of wanting 
to please, which was instilled through that

0:15:07.920,0:15:19.840
reinforcement learning feedback cycle.
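To make that preference step concrete, here is a tiny worked example of the pairwise (Bradley-Terry style) objective commonly used to train the reward model in this kind of feedback loop; the scores and examples are invented for illustration, and this is a sketch of the general idea rather than any particular lab's training code.

```python
# Toy illustration of the preference step described above: annotators mark
# which of two candidate answers they like better, and a reward model is
# trained so the preferred answer gets the higher score. The numbers here
# are invented purely for illustration.
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss: -log sigmoid(score_preferred - score_rejected)."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for two answers to the same prompt.
# If annotators consistently prefer agreeable answers, the model is pushed
# to score agreement higher, which is how a please-the-user tendency can
# get baked in.
examples = [
    ("ranking already correct", 2.0, 0.5),  # small loss, little to change
    ("ranking currently wrong", 0.5, 2.0),  # large loss, pushes scores to flip
]

for label, preferred, rejected in examples:
    print(f"{label}: loss = {preference_loss(preferred, rejected):.3f}")
```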
This is something you've advised — a very

0:15:19.840,0:15:28.000
practical piece of advice is to let the model 
operate first without giving it a pre-baked

0:15:28.000,0:15:33.280
probability of what the answer might be, without 
biasing it with your own idea. And for what it's

0:15:33.280,0:15:38.880
worth, a lot of these concepts aren't really 
new — humans do the same thing. If you get an

0:15:38.880,0:15:41.920
admission from the emergency room and 
they tell you, "Hey, this patient has

0:15:41.920,0:15:47.120
systolic heart failure exacerbation," well, 
maybe 80 to 90% of the time that's correct,

0:15:47.120,0:15:52.880
but sometimes — wait, no, this patient has 
pneumonia. We anchored on the wrong thing.
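As a rough sketch of that "let the model operate first" advice in practice, here is what the two prompting styles might look like side by side; the case details are made up, no particular model client or vendor API is assumed, and the snippet only builds the prompts so you can compare them.

```python
# Minimal sketch of the "don't anchor the model" advice from the conversation.
# It only builds and prints two prompt variants; sending them to a model is
# left to whatever chat client you actually use.

case_summary = (
    "68-year-old with progressive shortness of breath, bilateral leg "
    "swelling, and a history of hypertension."
)

# Anchored prompt: leads with your working diagnosis, which invites the
# model to simply agree (the sycophancy and anchoring failure described above).
anchored_prompt = (
    f"{case_summary}\n"
    "I think this is a congestive heart failure exacerbation. Do you agree?"
)

# Neutral prompt: asks for an independent differential first; you compare it
# against your own impression only after the model has committed to an answer.
neutral_prompt = (
    f"{case_summary}\n"
    "List the most likely diagnoses in order of probability, with the key "
    "findings for and against each. Do not assume any particular diagnosis."
)

if __name__ == "__main__":
    print("--- anchored (avoid) ---")
    print(anchored_prompt)
    print()
    print("--- neutral (prefer) ---")
    print(neutral_prompt)
```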

0:16:10.880,0:16:19.200
The other danger with AI is automation bias. 
It's right so often that you eventually stop

0:16:19.200,0:16:24.880
double-checking all the time — and that gets 
you in a bad situation when it really matters.

0:16:25.440,0:16:32.800
I find this with the digital scribe we have 
available in the clinic. It is so articulate,

0:16:32.800,0:16:36.160
so well-written — great punctuation, 
capital letters in the right place,

0:16:36.160,0:16:43.360
commas, em dashes everywhere — it speaks so well 
that our brains, which are used to reading text

0:16:43.360,0:16:54.640
that has been checked by humans, associate that 
with quality. And we have to revisit that now.

0:16:54.640,0:17:05.120
Absolutely. Confabulations are photorealistic, 
the prose is very eloquent — those are no longer

0:17:05.120,0:17:10.000
reliable indicators of truth. Your
judgment matters even more than it did before.

0:17:10.000,0:17:14.320
And beyond accidental errors, there are bad 
actors out there who are straight-up trying

0:17:14.320,0:17:18.880
to scam you with things that look very 
credible. It's actually a kind of scary

0:17:18.880,0:17:22.880
world we're emerging into, and I don't know 
that we fully know how to cope with it yet.

0:17:23.840,0:17:27.760
You also presented some really interesting 
data around how many of these models have

0:17:27.760,0:17:32.400
been tested on benchmarks where there's a right 
answer and a wrong answer — multiple choice, not

0:17:32.400,0:17:45.840
very real-world. But in the real world, we have a 
rubric in medicine: first, do no harm. You've been

0:17:45.840,0:17:56.237
exploring this with your fantastic team here. 
What were you actually trying to do at first?

0:17:56.237,0:18:03.440
We were prototyping an AI-provided consult — like, 
what would that be like? But before we unleash

0:18:03.440,0:18:08.720
that in the real world, could we just make sure 
it's at least safe? Let's get some representative

0:18:08.720,0:18:12.800
cases and make sure the thing doesn't give 
answers that would actually harm somebody,

0:18:12.800,0:18:17.520
let alone whether it's even useful. And 
that's not a trivial question to answer.

0:18:17.520,0:18:22.800
In this study, which is just emerging as a 
preprint right now, accuracy and harm are not the

0:18:22.800,0:18:29.280
same thing. You can have a really bright student 
or resident who knows everything, and sometimes

0:18:29.280,0:18:32.960
you go — whoa, whoa, whoa, that's dangerous. 
Don't do that. And it's not because they're not

0:18:32.960,0:18:38.000
smart — it's, oh, was I not supposed to do that? 
Was that important? They have no basis to make

0:18:38.000,0:18:43.200
that distinction. That required a very deliberate 
type of evaluation, and it showed harm was

0:18:43.200,0:19:08.880
not correlated with accuracy alone.
And we probably should subordinate accuracy

0:19:08.880,0:19:15.520
to "do no harm." It's kind of like a clinical 
trial progression — before you check efficacy,

0:19:15.520,0:19:25.120
can we just make sure we're not accidentally 
harming people? And what's been surprising us:

0:19:27.040,0:19:35.360
for these kinds of tough questions, 10 to 
20% of the time, even frontier models will

0:19:35.360,0:19:40.160
cause harm — and often the harm was that
they didn't do something. They didn't recommend

0:19:40.160,0:19:44.960
something that was actually important to address.
We often miss sins of omission because we're so

0:19:44.960,0:20:17.040
focused on sins of commission. Doing 
nothing doesn't mean doing no harm.

0:20:18.000,0:20:22.320
You actually do a lot of harm when you 
should have done something, you saw it, and you didn't act.

0:20:22.320,0:20:29.120
Another area worth exploring briefly is what 
falls under the term RAG — retrieval-augmented

0:20:29.120,0:20:37.520
generation. We're moving from a world where these 
models are essentially next-token predictors built

0:20:37.520,0:21:04.000
on a huge mass of data, trying to please 
you — all the things we talked about — to

0:21:04.000,0:21:14.800
a world where they're much more likely to go to 
a source and use their language skills to bring

0:21:14.800,0:21:25.120
back data directly from that source. Talk about 
that evolution, specifically for medical AI.

0:21:25.120,0:21:30.160
It's a great migration. I used to talk about 
confabulation, hallucination — this thing has

0:21:30.160,0:21:33.920
Wernicke-Korsakoff syndrome; it's just making 
up words even if they don't make sense. It's not

0:21:33.920,0:21:38.000
trying to lie to you. It doesn't know what a lie 
is. It's just filling in the blanks as it goes.

0:21:38.000,0:21:41.280
I've decided not to cover that as much 
anymore. It's not a solved problem,

0:21:41.280,0:21:46.000
but it's much less of an issue, because most 
models have deliberate RAG — retrieval-augmented

0:21:46.000,0:21:52.240
generation — baked in, where it goes: well, you're 
asking a factual or evidence-specific question,

0:21:52.240,0:21:57.200
let me go look for that evidence and kind of read 
a document for you. That's a much more grounded

0:21:57.200,0:22:04.400
approach. You still can't rely on it perfectly, 
but you also don't have to just guess — you can

0:22:04.400,0:22:11.840
go double-check the article yourself.
As clinicians, we're actually in a very

0:22:11.840,0:22:22.160
good position to deal with this. Just like 
in real life, your consultant gives you a lot

0:22:22.160,0:22:27.520
of advice on the wards, and most of the time 
they're right. But sometimes — wait a minute,

0:22:27.520,0:22:33.440
it really matters if we go to surgery or not. So 
this one is worth double-checking. Whether the

0:22:33.440,0:22:36.960
antibiotics are for five or seven days — well, 
it's probably about the same, I'm not going to

0:22:36.960,0:22:41.760
worry too much about it. You know when it matters 
and you know how to dig in deep when it does.

0:22:41.760,0:22:53.040
And the link is right there — you can go 
directly from a model summary to the source.
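For listeners who want a concrete picture of what retrieval-augmented generation means mechanically, here is a toy sketch: a tiny in-memory evidence store, a crude keyword-overlap retriever, and a prompt that asks the model to answer only from the retrieved passages and cite them so you can follow the link back to the source. The documents, scoring, and prompt wording are all illustrative assumptions, not any production system, and the actual model call is left out.

```python
# Toy retrieval-augmented generation (RAG) sketch: retrieve supporting
# passages first, then ground the model's answer in them with citations.
# The documents and scoring are illustrative only; real systems use vector
# embeddings and a proper search index rather than keyword overlap.
from collections import Counter

EVIDENCE = {
    "guideline-001": "For community-acquired pneumonia, five days of "
                     "antibiotics is generally sufficient if the patient "
                     "is clinically stable.",
    "trial-042": "In a randomized trial, five-day and seven-day antibiotic "
                 "courses showed similar cure rates for uncomplicated "
                 "pneumonia.",
    "review-117": "Heart failure exacerbations are managed with diuresis "
                  "and guideline-directed medical therapy.",
}

def score(query: str, passage: str) -> int:
    """Crude relevance score: number of lowercase words shared by query and passage."""
    return sum((Counter(query.lower().split()) & Counter(passage.lower().split())).values())

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the top-k (source_id, passage) pairs for the query."""
    ranked = sorted(EVIDENCE.items(), key=lambda item: score(query, item[1]), reverse=True)
    return ranked[:k]

def build_grounded_prompt(question: str) -> str:
    """Assemble a prompt that asks the model to answer only from cited sources."""
    cited = "\n".join(f"[{sid}] {text}" for sid, text in retrieve(question))
    return (
        f"Question: {question}\n\n"
        f"Evidence:\n{cited}\n\n"
        "Answer using only the evidence above, and cite the source IDs so "
        "the reader can double-check the originals."
    )

if __name__ == "__main__":
    print(build_grounded_prompt("How many days of antibiotics for pneumonia?"))
```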

0:22:53.920,0:23:00.560
For medical decisions, that's how we mostly 
use the frontier models at this point. Our

0:23:00.560,0:23:09.520
residents and colleagues tend towards using
OpenEvidence, which is a specific language model built

0:23:10.320,0:23:20.000
from medical data with doctors in mind — 
although most patients use the frontier models.

0:23:20.800,0:23:26.827
OpenEvidence requires an NPI number to get in;
they're trying to restrict it. Sometimes models

0:23:26.827,0:23:36.080
are restricted to providers, probably partly 
for liability reasons, because otherwise these

0:23:36.080,0:23:40.480
models are basically giving medical advice. They 
have this disclaimer — "not for medical advice,

0:23:40.480,0:23:43.360
cannot be used for any medical purpose" — but 
give me a break. That's clearly what people are

0:23:43.360,0:23:48.640
doing. It creates a very odd tension about who's 
really responsible for the decisions being made.

0:23:48.640,0:23:55.600
We've obviously been very quick to embrace 
the potential benefits of AI here at Stanford,

0:23:55.600,0:24:00.400
including in our healthcare system. You mentioned 
the engineering team that visited rounds with

0:24:00.400,0:24:04.320
me in the hospital last week.
That's right — which was really fun!

0:24:04.320,0:24:09.680
The tool allows a language model, 
under closed and safe conditions,

0:24:09.680,0:24:18.240
to access a single patient's record and use 
these real strengths — assimilating data,

0:24:18.240,0:24:25.920
going to raw data, and bringing back answers — 
but in the context of a single individual. Maybe

0:24:25.920,0:24:35.360
some listeners don't realize how much time our 
residents spend clicking through different screens

0:24:35.360,0:24:49.520
in the electronic health record. Talk about 
the problem first, then the potential solution.

0:24:50.960,0:25:00.880
For better or worse, so much of our practice
runs through the electronic medical record — it's the central

0:25:00.880,0:25:04.400
hub for what happens in medicine. The 
first time I saw ChatGPT, the

0:25:04.400,0:25:08.240
first thing I thought was: I want that attached to 
the medical record. But for all sorts of privacy,

0:25:08.240,0:25:11.520
security, and engineering reasons, it's actually 
very difficult to do. But for a couple of years,

0:25:11.520,0:25:15.920
teams have been putting this together, and 
now we have at least the first forms of it.

0:25:15.920,0:25:25.520
Very powerful — because most of a clinician's 
time on the computer isn't writing notes or

0:25:25.520,0:25:30.800
putting in orders. Most of their time is chart 
review. They're looking stuff up, reading notes,

0:25:30.800,0:25:37.360
making sense of results, collating. And then at 
the last minute they summarize it. Between each

0:25:37.360,0:25:41.360
piece of information, there might be five or six 
clicks. I was recently reviewing a medical case

0:25:41.360,0:25:48.960
record — they sent me 5,000 pages of documents, 
and probably 4,900 of those pages were worthless.

0:25:49.840,0:25:53.840
Just sifting through and making sense of it was 
hours and hours of work. That's a very common

0:25:53.840,0:25:58.880
job we give medical students first — why don't you 
scour those records and summarize what's happened.

0:26:34.400,0:26:47.760
And now — not perfect, not exactly ready for prime 
time, but you can see where it's going — AI can do

0:26:47.760,0:26:52.720
a lot of that, and we can get back to: okay, 
thank you for the summary, now I can actually

0:26:52.720,0:28:10.640
think about how to approach this patient's case. 
And it can catch things humans didn't. My wife is

0:28:10.640,0:28:15.920
a pathologist and she's doing a prototype there 
— it caught things the humans missed. "Has this

0:28:15.920,0:28:19.680
patient ever had a hematologic malignancy?" The 
person who scoured the records said, "I don't see

0:28:19.680,0:28:25.440
it." But wait — there was a note from nine months 
ago. That totally changes the interpretation.

0:28:35.760,0:28:47.520
And I think the deep learning era for 
AI — where we were used to the idea

0:28:47.520,0:28:55.920
that imaging could come under the spell of 
AI — now we're seeing that really all the

0:28:55.920,0:28:57.760
data, anything to do with reading and writing, 
which is basically everything, is in scope.

0:28:57.760,0:29:03.920
Absolutely. That's why when I saw the 
emerging GPT-3 and GPT-4, I was like,

0:29:03.920,0:29:08.160
holy smokes — this is actually going to change the 
world. For a lot of stuff — my most cited paper,

0:29:08.160,0:29:13.040
for better or worse, is that machine learning 
is way overhyped in medicine. That's a two-page

0:29:13.040,0:29:17.520
perspective, and it's the most cited thing I 
have. But here I'm like, okay, there is hype too,

0:29:17.520,0:29:31.920
and people think AI is magic a little bit 
too much — marketers basically apply that

0:29:31.920,0:29:38.480
word to anything. But here, I think this is real.
Is that your vision for the next few years? I know

0:29:38.480,0:29:44.640
this is something we both get asked a lot — these 
models are only going to get better from here.

0:29:46.480,0:29:54.973
All of this really started with the transformer 
paper in 2017, but practically, GPT-3 — that was

0:29:54.973,0:30:04.080
2021, 2022. Three years ago. So it feels almost 
ridiculous to ask about five years from now.

0:30:13.760,0:30:19.840
Gosh, that is tough. Five years in this kind of 
epoch — look at where we were five years ago.

0:30:19.840,0:30:30.560
We weren't even talking about this. Within five 
years, I'm pretty sure most doctors will be using

0:30:30.560,0:30:34.560
ambient scribes, or at least that'll be very 
common, routine technology. It won't even seem

0:30:34.560,0:30:46.080
novel anymore. Other predictions? I'm surprised 
there's not a class action lawsuit against big

0:30:46.080,0:31:02.160
tech over AI harming people — giving bad medical 
advice or harmful things. There are class action

0:31:02.160,0:31:03.120
lawsuits around creative content — use of movie 
actors' likenesses, publishing and Hollywood — but

0:31:03.120,0:31:07.440
not yet for medical harm. I'm really 
kind of surprised it hasn't happened yet.

0:31:08.320,0:31:18.880
More and more, I think AI is just going to be 
embedded in our systems. Your kids will still know

0:31:18.880,0:31:23.520
a little bit of the difference — you were there 
before the internet age as I was — but can your

0:31:23.520,0:31:32.720
kids comprehend a world without the internet? 
No, they cannot. It'll get to the point where

0:31:32.720,0:31:37.920
it just seems routine. Anytime you read or write 
anything, which is basically anytime you interact

0:31:37.920,0:31:43.120
with a computer, AI is going to be right over 
your shoulder. And it'll really abstract the

0:31:43.120,0:31:47.840
nature of the way we interact with all things. 
And I think it'll get safer as well. It has to,

0:31:48.960,0:31:55.440
and it will — but it's a process to get there.
So you recently took on a new position as Director

0:31:55.440,0:32:02.640
of AI Education. We're obviously an educational 
facility — school of medicine, medical students,

0:32:02.640,0:32:09.040
PA students, nurses and other trainees, residency 
programs, fellows. We're training a whole

0:32:09.040,0:32:26.160
generation of doctors for the future, as well as 
providing continuing medical education. How do you

0:32:26.160,0:32:32.000
even go about planning a curriculum around AI?
It came about by chance — Senior Associate Dean Reena

0:32:32.000,0:32:38.160
Thomas tapped me for this one. Dr. Thomas said, 
"Jonathan, you're not here for advice. You're here

0:32:38.160,0:32:50.080
to apply for this job." And it's great. I think 
this is actually a very nice coalescence of so

0:32:50.080,0:32:58.800
many themes that actually fits my history. I kind 
of couldn't help it — organic chemistry education,

0:32:58.800,0:33:00.320
I kind of landed there. I have a fortune 
cookie from when I first started my job

0:33:00.320,0:33:03.120
that said you could do well in education 
— like, no, I'm trying so hard not to be

0:33:03.120,0:33:08.480
an educator! But you're a Stanford professor, so 
you're by definition an educator. And I realize

0:33:08.480,0:33:12.000
I kind of can't help it — whenever I talk to 
a young person, I start teaching something.

0:33:12.000,0:33:23.520
And I knew it's not just myself. I found a 
crew — many clinical informatics fellows,

0:33:23.520,0:33:28.800
former and current. I have associate directors: 
Dong Yao, Shivani Vadok, my current fellow Idan

0:33:28.800,0:33:35.360
Zahash, and multiple others. There's a whole crew 
assembled for whom education is their passion,

0:33:35.360,0:33:39.440
and I can give the outline, but they've done the 
curricular design, outlined things, reviewed the

0:33:39.440,0:33:48.160
literature, established a framework, and reused 
a lot of resources. One of the best contributions

0:33:48.160,0:33:53.360
actually came from a first-year student — Mitra 
Alkani. She gave one of the best presentations

0:33:53.360,0:33:58.000
at our first-ever AI and Medical Education 
Symposium this past June. Her perspective was:

0:33:58.000,0:34:03.440
she's just a student. She's not here to do 
research or science. She said, "Hey, these

0:34:03.440,0:34:07.760
are the three tools I figured out, and nobody told 
me. I just tried them myself — they help me make

0:34:07.760,0:34:12.800
flashcards, they help me practice interviewing." 
I thought that was very compelling. She showed:

0:34:12.800,0:34:15.760
I'm a student, and this is how I help myself 
learn in ways that weren't possible before.

0:34:19.440,0:34:25.440
And what are the principles or things you 
worry most about, or want to make sure

0:34:25.440,0:34:31.440
you're passing on through the curriculum?
Oh gosh, there are some real dilemmas I

0:34:31.440,0:34:43.440
don't have easy answers for. For example, should 
students be allowed to use AI on their homework?

0:34:44.400,0:34:48.560
The default answer — and actually Stanford 
University's default answer — is assume it's

0:34:48.560,0:34:52.240
disallowed unless your instructor says 
so. But I push back: that's hopeless,

0:34:52.240,0:34:57.920
it cannot be the default policy, because it's 
unenforceable. It puts students in an awkward,

0:34:57.920,0:35:01.440
unfair position where the honor code 
says they have to follow the rules,

0:35:01.440,0:35:06.320
but they know all their classmates 
are doing it. So we've shifted toward:

0:35:06.320,0:35:10.400
homework is just practice — we're not even going 
to bother grading it because it's so easy for AI

0:35:10.400,0:35:21.280
to do. It's got to be the closed-book exams.
But there are some real tensions. Some of our

0:35:21.280,0:35:25.680
medical student classes were killing it on the 
homework for medical reasoning — doing great.

0:35:25.680,0:35:29.600
But when it came to the closed-book exams, 
they did not do very well. Much worse than

0:35:29.600,0:35:34.960
in prior years. What happened? It's obvious. 
They used AI to do their homework and never

0:35:34.960,0:35:42.400
bothered to actually learn. AI could be the best 
teacher you have ever had, but if you misuse it,

0:35:42.400,0:35:46.640
you miss the point. The point of homework isn't 
to do the homework — it's to make you struggle.

0:35:46.640,0:35:50.160
And that struggle is actually what makes 
you learn and makes this knowledge innate.

0:36:04.400,0:36:11.360
We often use the calculator example. We don't 
force kids to do long division, but maybe they

0:36:11.360,0:36:17.600
should still learn it — so they have the ability, 
and then the calculator amplifies that. But there

0:36:17.600,0:36:24.000
are fundamental medical skills where, if 
you start out using these models at the

0:36:24.000,0:36:28.240
level they're at — which is beyond what 
any calculator could do — the metaphor

0:36:28.240,0:36:33.600
breaks down. If you never learn those skills 
to begin with, you really do lose something.

0:36:50.720,0:36:56.720
We looked at other schools' policies, and I like 
the principle: you can use these tools once you've

0:36:56.720,0:37:01.360
demonstrated you could have done it on your own. 
It's a great principle; it's just hard to enforce.

0:37:02.240,0:37:08.720
Would you allow an intern or an MS3 to use an 
ambient scribe to write their notes? How will they

0:37:08.720,0:37:16.640
ever know how to write a plan if AI always does 
it for them? That's quite a tension. Ultimately,

0:37:16.640,0:37:23.760
we want everyone to have judgment — let the AI 
remember everything, who cares. But you can't

0:37:23.760,0:37:26.240
have judgment about something if you've 
never learned the underlying knowledge.

0:37:27.040,0:37:33.200
I also worry because these models are 
trained on text created by humans, which

0:37:33.200,0:37:41.680
came from science and human experience. If all the
text we produce ends up being AI-generated, there's

0:37:41.680,0:37:50.880
a feedback loop — AI slop, the snake eating its 
own tail. This is already happening. A lot of the

0:37:50.880,0:37:55.920
tech companies building frontier models would say 
they're not looking for more data at this point;

0:37:55.920,0:38:01.040
they're trying to get higher-quality text, which 
is why many newspapers are suing them. "Yeah,

0:38:01.040,0:38:04.720
we know you're using our work, and we know you 
like it because we produce high-quality writing."

0:38:06.000,0:38:10.480
I suspect what will happen is text and the way we 
talk and write — without even realizing it — is

0:38:10.480,0:38:14.320
going to become more and more homogenized. 
You can detect it now. When somebody writes

0:38:14.320,0:38:19.040
you an AI-generated note, you can kind of 
tell. But how many times did you not notice?

0:38:22.160,0:38:29.280
There's actually a whole Wikipedia page 
dedicated to tells of language models. Some are

0:38:29.280,0:38:34.880
obvious — the word "delve," the word "critical," 
the word "deep," and the notorious em dash.

0:38:34.880,0:38:39.600
I used to love em dashes. Now I actively 
delete every one I see. I have to stop

0:38:39.600,0:38:45.760
using them. It's like — is this AI or is 
this Dr. Ashley? I'm not sure anymore.

0:38:45.760,0:38:52.480
I feel personally sad about that. But it's 
a challenge. With so many places where we

0:38:54.160,0:39:01.520
as professors or attendings are asked to 
generate text, if these become shortcuts,

0:39:01.520,0:39:09.520
we end up in a world where everything is 
generated. It will certainly homogenize

0:39:09.520,0:39:16.640
our style of reading and writing.
Maybe it abstracts us to the level

0:39:16.640,0:39:20.720
where we can focus more on the actual thinking
and meaning. It's not a perfect analogy,

0:39:20.720,0:39:27.680
but think about programming languages. Who 
wants to write machine code? That's crazy. So we

0:39:27.680,0:39:39.840
got assembly language, then Pascal or FORTRAN, 
then C++, then Python. Now it's: just write in

0:39:39.840,0:39:45.120
English and have AI generate the lower-level code 
for me. That's a great abstraction. But I think

0:39:45.120,0:39:50.880
when it comes to treading on the domain of normal 
human interactions, we haven't had to deal with

0:39:52.080,0:39:56.080
that before. That really is a bit different.
You began your talk with one of my favorite

0:39:56.080,0:40:03.760
quotes: any sufficiently advanced technology is 
indistinguishable from magic. So let's get back to

0:40:03.760,0:40:13.120
magic — because you are the only person any of us 
know who provides lectures and talks that include

0:40:13.120,0:40:29.280
both up-to-date AI research data and actual 
magic tricks. How did you get back into magic?

0:40:33.840,0:40:37.360
I picked it up again maybe five or six years 
ago, just before the pandemic. Actually, it was a

0:40:37.360,0:40:44.320
pandemic thing. My oldest child — he was probably 
eight or nine — we went to just some street magic

0:40:44.320,0:40:49.520
show, and somebody pulled a rabbit out of a box, 
and his face just lit up, and he squealed. If you

0:40:49.520,0:40:54.400
see a child really experience that wonder, it's a 
really magical thing — not the magic trick itself,

0:40:54.400,0:40:59.120
but seeing somebody have that experience. 
So I thought, oh, my kid enjoys magic;

0:40:59.120,0:41:06.960
I should show him a magic trick. I bought him 
a little basic kid set. And as I was showing

0:41:06.960,0:41:16.160
him and my colleague a trick, my colleague 
looked over and said, "Jonathan should work

0:41:16.160,0:41:26.880
on his sleight of hand." I was like — what? 
I'm just trying to entertain your child and

0:41:26.880,0:41:33.280
you're giving me feedback. So I went and learned 
some sleight-of-hand magic. And it was a fun way

0:41:33.280,0:41:38.080
to interact with students — look, I'm in a 
nerdy profession, but I'm approachable too.

0:41:38.640,0:41:42.240
It kind of spiraled out of control during 
the pandemic. We're all locked indoors;

0:41:43.440,0:41:47.120
some people learned to bake sourdough bread. 
I literally spent three months learning a very

0:41:47.120,0:41:55.760
advanced Rubik's cube magic trick. And it spiraled 
— I've entered and won multiple competitions;

0:41:55.760,0:42:03.600
I've had paid gigs just to perform magic. I 
performed in Vegas — on the main stage in Vegas,

0:42:04.160,0:42:13.760
actually, at a health conference. Not something 
I was trying to do, but it's unearthing some

0:42:13.760,0:42:18.480
childhood aspirations and my inner child.
It also helps me be better at presentations.

0:42:19.440,0:42:22.880
You learn a lot of empathy when you do magic 
because it's all about — who cares what I'm doing;

0:42:22.880,0:42:26.480
I have to anticipate what you're thinking while 
I do this, because I don't want you thinking the

0:42:26.480,0:42:35.680
wrong thing. You have to really understand what 
another person is thinking. Directing a narrative,

0:42:35.680,0:42:39.920
setting up expectations — I found 
that a very powerful combination.

0:42:39.920,0:42:43.600
A few years ago, University Medical Partners 
invited me to give a keynote on AI and medicine.

0:42:44.880,0:42:48.320
But they also had a "magic of medicine" 
theme that week. I thought, actually,

0:42:48.880,0:42:56.400
I can perform some magic too — can I do that as 
a bonus? And I blame it on my wife. She was like,

0:42:56.400,0:43:04.240
"Hey, why not do them together? What if you had 
some magic in the middle of your talk?" Like,

0:43:04.240,0:43:13.920
that's crazy — they're going to laugh me off 
the stage. But if there's a thematic connection,

0:43:15.200,0:43:22.000
that could be very fun and compelling.
Especially as generative AI was emerging — you

0:43:22.000,0:43:27.520
can't tell what's real anymore. The Turing test. 
Is that a real human or a chatbot? That image

0:43:27.520,0:43:32.400
looks so real. That video of me — that was all 
AI-generated. And so I'm using magic again as

0:43:32.400,0:43:35.680
that algorithm. Boy, does that look real — but 
you still have to have the judgment to tell the

0:43:35.680,0:43:41.200
difference. It's a really powerful combination.
Jonathan, we're so happy to have you in our

0:43:41.200,0:43:45.920
department. I'm very happy to be here. We're 
proud of the work you've done in an academic

0:43:45.920,0:43:49.920
way and for really bringing us safely into this 
future — and of course for your role in education

0:43:49.920,0:44:05.440
and training the next generation. Thanks 
for joining us on the Future of Medicine.

0:44:05.440,0:44:08.800
Thank you, Dr. Ashley.
The preceding program is copyrighted by

0:44:08.800,0:44:17.200
the Board of Trustees of the Leland Stanford Jr. 
University. Please visit us at med.stanford.edu.