WEBVTT

00:00.100 --> 00:04.000
[speaker_0] Pre-training scaling 100% does
not work or here is an argument

00:04.160 --> 00:07.830
RL scaling 100% does not work.
It's just intuitions that

00:07.940 --> 00:11.830
pre-training scale probably doesn't get
there, RL scaling probably doesn't

00:11.880 --> 00:12.160
get there

00:12.180 --> 00:16.020
[speaker_1] ...
essentially just by scaling RL

00:16.100 --> 00:19.920
just by scaling RL and pre-training, uh,
getting to ASI just in the

00:20.040 --> 00:23.830
current paradigm by 2030, uh,
in your definition of intelligence

00:23.920 --> 00:24.920
seems very, very unlikely.

00:24.960 --> 00:28.480
[speaker_0] But yeah, researcher,
you can right now take a pen and paper,

00:28.540 --> 00:32.500
and come up with, okay,
if I had huge amount of GPUs, here

00:32.540 --> 00:34.710
would try, and you can write them down.

00:34.730 --> 00:34.730
[speaker_1] Yeah.

00:34.730 --> 00:36.880
[speaker_0] And then we will ask, okay,
well, why haven't these...

00:37.380 --> 00:41.160
Like, did somebody at, you know, OpenAI,
Anthropic try them and they failed, or did

00:41.220 --> 00:42.400
nobody try them?

00:42.540 --> 00:44.980
[speaker_1] Uh, maybe you're right
and we're on the vertical part of the S

00:45.380 --> 00:49.130
I think if you just plot it from AlexNet,
it's mostly the same

00:49.140 --> 00:52.740
paradigm. Uh, and I would count AlexNet

00:53.460 --> 00:57.400
as one breakthrough,
Transformer as another breakthrough,

00:57.460 --> 00:57.820
breakthrough.

00:58.900 --> 01:01.990
[speaker_0] Hello, everyone. Uh,
my name is Samuel Shadrach.

01:02.700 --> 01:06.320
I graduated from IIT Delhi.
I've been following the whole

01:06.440 --> 01:09.780
AI timelines debate for s- a while now.

01:09.920 --> 01:13.400
By when will humanity get super
intelligence?

01:13.500 --> 01:16.560
Is this a good thing or a bad thing?
What can we...

01:16.620 --> 01:19.940
And now I have a fairly strong opinion
it's a bad thing. We should stop it.

01:20.000 --> 01:22.720
That's a whole separate discussion
that we can have elsewhere.

01:23.100 --> 01:27.090
Today, I have with me, uh, Raghav. Uh,
we are

01:27.140 --> 01:30.140
going to specifically just discuss, uh,
AI timelines.

01:30.380 --> 01:33.480
Uh, when do we think, you know,
super intelligence will come?

01:33.680 --> 01:37.060
Uh, you know, assuming the research,
current development continues.

01:37.220 --> 01:38.980
Raghav,
would you like to introduce yourself?

01:39.140 --> 01:42.980
[speaker_1] I've been a friend of Sam.
I've been following the AI safety debate

01:43.040 --> 01:46.880
for the last five, six years now, uh,
and very

01:46.920 --> 01:50.680
interested in the subject.
I have some strong opinions as well,

01:50.820 --> 01:54.340
as Samuel, but, uh,
I have slightly differing opinions from

01:54.400 --> 01:56.260
That's why I'm keen to talk about this.

01:56.800 --> 02:00.680
[speaker_0] Yeah. First of all is, yeah,
uh, the whole pre-training

02:00.800 --> 02:03.260
scaling, which right now everyone says
is dead.

02:03.420 --> 02:06.180
Even now I'm not yet convinced
pre-training scaling is dead.

02:06.540 --> 02:10.100
Yeah, I think there
is still a small chance

02:10.259 --> 02:14.160
and you just extrapolate the pre-training,
whatever Chinchilla,

02:14.240 --> 02:18.210
whichever scaling curve
that has been running for the past few

02:18.240 --> 02:20.980
we do that a few more years with new GPUs
coming.

02:21.000 --> 02:24.960
I still think there
is at least a little bit chance

02:25.040 --> 02:26.960
ASI, which I'm claiming. Uh-

02:27.420 --> 02:27.520
[speaker_1] Okay

02:27.980 --> 02:31.720
[speaker_0] ...
a lot more realistic to me

02:31.940 --> 02:35.420
uh, some way of scaling up RL. Now,

02:35.820 --> 02:37.300
that may not be, like,

02:38.200 --> 02:41.769
just again, you just do the current,
you know, RL scaling thing

02:41.840 --> 02:44.180
compute. Maybe one breakthrough
is required there.

02:44.800 --> 02:47.600
I have, like, more probability mass than,
okay, we need at least one breakthrough

02:47.660 --> 02:51.100
on how to do the RL thing better
or some other breakthrough.

02:51.440 --> 02:54.780
I have a little bit less probability mass
than, okay, you just blindly scale up RL

02:54.800 --> 02:57.500
and it works. But yeah,
that's the thing is, like, we don't know.

02:57.820 --> 03:01.240
I have not h- seen like, okay, here
is an argument that tells me that

03:01.580 --> 03:05.460
pre-training scaling 100% does not work,
or here is an argument that tells me

03:05.600 --> 03:08.890
RL scaling 100% does not work.
It's just intuitions that

03:09.380 --> 03:13.260
pre-training scale probably doesn't get
there, RL scaling probably doesn't

03:13.320 --> 03:17.040
get there.
We probably need another breakthrough,

03:17.080 --> 03:20.820
really knows. Uh, that's broadly the path,

03:21.180 --> 03:24.980
uh, of, you know,
what technical capabilities will be

03:25.180 --> 03:28.700
Uh, then there is, okay,
the most probably we'll need one more

03:28.780 --> 03:31.920
breakthrough. Why do I think, okay,
even if we need one more breakthrough, we

03:31.980 --> 03:34.640
will...
There's a good chance we get it in the

03:35.460 --> 03:39.320
For that, you have to extrapolate, well,
in the last 10 years, how

03:39.360 --> 03:42.950
many major breakthroughs have happened in
machine learning, deep learning, and,

03:42.960 --> 03:46.080
well, actually three or, like, three
or four major breakthroughs have happened.

03:46.100 --> 03:49.960
So based on that extrapolate, okay,
it doesn't seem surprising to me

03:50.100 --> 03:53.810
if in the next four years another,
you know, smart researcher figures out yet

03:53.820 --> 03:57.620
another breakthrough. So there is
that kind of thing.

03:58.520 --> 04:01.560
Then I have some specific heuristics
and thing.

04:02.120 --> 04:05.760
If I were an AI capability researcher,
if I had a huge number of

04:05.820 --> 04:09.420
GPUs, what are, you know,
my crazy research hypotheses I might

04:09.500 --> 04:13.340
try for how to speed up RL?
And to be clear, I think this is

04:13.360 --> 04:16.880
extremely dangerous thing to do.
I think this is bad for the world to do.

04:17.019 --> 04:20.690
But, you know, if I wanted to do this,
like, I think, okay, there are some

04:20.700 --> 04:23.960
hypotheses you could try. Uh, what else?

04:24.040 --> 04:27.640
Yeah, this is all, like,
very specific to AI research capabilities,

04:27.680 --> 04:31.660
trajectory. And then I have, like,
more very high level big picture kind of

04:31.760 --> 04:35.660
Like, you know, why
is super intelligence important?

04:36.020 --> 04:39.200
Uh, you know, why the very fact that GPT-2

04:39.280 --> 04:43.060
exists should update your entire models of
how the world works.

04:43.440 --> 04:44.080
The fact that

04:44.880 --> 04:48.560
bunch of matrix multiplication can do,
you know, like, uh, like

04:48.680 --> 04:52.560
speaking human language rather than animal
language, why this is a big deal?

04:52.640 --> 04:56.460
Like, you know, mat-
why did matrix multiplication beat the

04:56.500 --> 04:59.690
literal billions of years of evolution
that goes between, you know,

05:00.540 --> 05:01.740
uh-

05:01.760 --> 05:05.460
[speaker_1] I'm sorry, I, this,
I don't think I'm,

05:05.520 --> 05:09.300
only. I mean, like,
maybe we've not discussed this earlier.

05:09.330 --> 05:09.330
[speaker_0] Yeah.

05:09.340 --> 05:10.480
[speaker_1] But maybe we can get to it.

05:10.860 --> 05:11.159
[speaker_0] Yeah, sure.

05:11.380 --> 05:14.460
[speaker_1] The other one I know you're
just saying, but this one seems new to me.

05:14.520 --> 05:17.740
Maybe you're framing it differently
or something. Yeah. It's fine.

05:17.780 --> 05:21.040
We, we'll probably get to it in the order,
but I'm just flagging it that I probably

05:21.080 --> 05:23.320
need you to double-click on this instinct.

05:23.480 --> 05:27.430
[speaker_0] Sure. Uh, I, I just mean like,
uh, human language has features that

05:27.480 --> 05:28.940
are not present in animal language.

05:29.020 --> 05:32.720
There are some linguists
that have studied that,

05:32.800 --> 05:36.700
like new evolutionary adaptation compared
to most animals who don't really even

05:36.780 --> 05:40.180
use language.
They kind of just use sounds

05:40.300 --> 05:44.040
And all of this has taken, like,
literal billions of years in evolutionary

05:44.100 --> 05:47.420
history to build,
and on the other side you have a bunch of

05:47.500 --> 05:51.340
bunch of GPUs in 50 whatever years of AI
research history, and

05:51.380 --> 05:53.260
they have been able to crack human
language.

05:53.320 --> 05:57.260
Like, that itself tells me, "All right,
so intelligence is

05:57.280 --> 06:01.060
likely easier to build than I thought." So
that is like- Yeah.

06:01.120 --> 06:04.780
So even just looking at GPT-2 tells me
like, oh, okay, so maybe now the

06:04.820 --> 06:06.359
singularity could happen in my lifetime.

06:06.400 --> 06:09.900
Like, you know, until now I
was thinking this is some extremely far

06:09.980 --> 06:12.820
Now it looks like, okay,
this could actually happen. Uh, what else?

06:12.980 --> 06:16.810
Yeah, I mean, and I have outside level,
outside view kind of stuff like,

06:16.880 --> 06:19.750
okay, which experts
are actually correctly predicting this?

06:19.780 --> 06:21.600
Which experts are badly predicting this?

06:21.660 --> 06:25.480
I think a lot of people have been
consistently badly predicting,

06:25.560 --> 06:29.540
uh, AI trajectory. Yeah,
like the people who

06:30.020 --> 06:32.900
their predictions are coming more correct,
and the people who keep making the

06:32.919 --> 06:35.640
pessimistic predictions,
their predictions keep coming wrong.

06:36.140 --> 06:39.770
So there's that kind of stuff.
I think that could summarize my whole

06:40.240 --> 06:43.540
[speaker_1] Fair enough.
We'll start taking it one after the other.

06:44.040 --> 06:47.990
Uh, cool. So starting with, uh,
you said pre-training, scaling,

06:48.000 --> 06:50.260
Pre-training scaling is work,
and pre-training scaling is not dead.

06:50.660 --> 06:54.280
I think when people say pre-training
scaling is dead, they don't mean

06:54.340 --> 06:58.040
more parameters and adding more data
and adding more compute doesn't lead to

06:58.080 --> 07:00.420
better loss functions
that leads to more capabilities.

07:00.480 --> 07:02.180
I don't think anybody denies that.

07:02.280 --> 07:05.820
Uh, people are uncertain that, uh,
people are

07:05.859 --> 07:08.980
say-saying that it could stop,
but there is no evidence to believe

07:09.060 --> 07:11.340
stop,
and I don't think any serious researcher

07:11.620 --> 07:15.500
The argument against pre-training mostly
comes from the fact that, um,

07:16.000 --> 07:19.760
it is economically unfeasible to scale
because the, uh, amount

07:19.800 --> 07:20.100
of

07:21.100 --> 07:23.400
resources required to do the scaling gives
you log.

07:23.440 --> 07:26.180
It does not linearly increase,
it gives you log of the intelligence.

07:26.740 --> 07:30.680
Uh, so like increasing, uh,
your compute by 10x et cetera gives you

07:30.700 --> 07:34.280
the amount of intelligence
or capabilities and double the amount of

07:34.340 --> 07:34.600
goes.

07:34.610 --> 07:37.800
[speaker_0] Exactly.
Double the amount of loss,

07:37.940 --> 07:40.140
loss number leads to this, uh, capability.

07:40.500 --> 07:44.200
[speaker_1] Sure. And so
which also means that, uh, so th-then the

07:44.240 --> 07:48.140
debate essentially shifts to not
that pre-training is dead,

07:48.180 --> 07:51.900
many 10x's can we do
and how many doubling of intelligence or,

07:52.000 --> 07:54.020
slightly higher intelligence will
essentially do it.

07:54.440 --> 07:56.820
Uh,
I think you'd mentioned this in the OOTD

07:56.980 --> 08:00.800
GPT-4.5 was a big update for me,
saying that if you just go

08:00.860 --> 08:04.500
on cranking pre-training,
you might get a nicer model with not a

08:04.540 --> 08:08.160
capabilities. There
was nothing 4.5 could do that was so

08:08.340 --> 08:12.240
far ahead of 4, uh,
that essentially I think that

08:12.280 --> 08:13.200
this would essentially

08:14.300 --> 08:17.780
defeat, uh,
like th-th-there would be new

08:18.180 --> 08:21.120
Uh,
o1 was in a different update because o1

08:21.370 --> 08:23.060
And essentially, this
is in the pa-current paradigm.

08:23.120 --> 08:27.000
So my submission for the pre-training
argument is that not that pre-training is

08:27.060 --> 08:30.600
dead in the sense
that you can't technically, uh,

08:30.880 --> 08:34.820
burn all, all the GDP in the world
and like get a

08:34.900 --> 08:38.840
few doublings, etc., out of it. Uh,
whether essentially A, there

08:38.860 --> 08:42.520
is a--
it's economically feasible to do it and B,

08:42.720 --> 08:46.560
even if it was economically feasible to do
it, uh, is, uh, like

08:46.820 --> 08:50.700
essentially maybe the returns of on, on,
on it that again is probably

08:50.740 --> 08:54.500
not worth it.
Maybe spending 100x the amount of

08:54.580 --> 08:58.500
times better model, uh,
probably starts breaking a lot of other

08:58.620 --> 09:02.340
Uh, having said that, we
are also almost on the edge

09:02.500 --> 09:05.940
of how much compute we can build.
Everything is stretched out to its limit,

09:06.420 --> 09:07.660
uh, especially by 2030.

09:07.760 --> 09:11.680
The year you said, uh, fabs
are already built

09:11.740 --> 09:15.580
out, etc. We have nowhere close to, uh,
let's say

09:15.640 --> 09:19.280
four x-- four, four increasing our, uh,
pre-training.

09:19.320 --> 09:22.420
We probably can do one
or two more scale-ups in the next two

09:22.700 --> 09:26.520
Sorry, in, in the next four years. Uh,
and that's maximum that we can do.

09:26.980 --> 09:30.180
Uh, even like we do not have enough fabs,
we do not have enough electricity, we do

09:30.200 --> 09:34.180
not have enough...
Some insane amount of physical bottlenecks

09:34.220 --> 09:35.680
the amount of compute in the world.

09:36.060 --> 09:40.040
Um, currently we are doing, uh,
I think 40 or

09:40.080 --> 09:41.510
50 gigawatts of...

09:42.720 --> 09:46.690
50 gigawatts? No, 30 gigawatts of,
uh, compute capacity is

09:46.720 --> 09:48.550
what 2027 will get us.

09:48.620 --> 09:51.880
And, uh,
if everything goes according to plan

09:51.960 --> 09:55.780
like all the supply chains get stretched
exactly to the right limit,

09:56.240 --> 09:59.420
our best bet is to get to 150 to 200
gigawatts a year, which is again

09:59.440 --> 10:03.400
like six times more compute,
not 10x more compute, um, from what

10:03.440 --> 10:05.960
we have right now.
And like that's 200 gigawatts per

10:06.440 --> 10:10.220
Uh, and that's assuming all
that compute goes into training

10:10.260 --> 10:14.140
etc. So, so there
are like lots of physical limitations

10:14.160 --> 10:17.920
till 2030
that do not allow for arbitrarily amount

10:18.300 --> 10:20.780
Uh,
we've already kind of pressed to the

10:20.880 --> 10:24.440
Uh,
buying a laptop costs like three hundred,

10:24.600 --> 10:25.600
Costs three hundred, four hundred more.

10:25.760 --> 10:29.580
I don't think that the world can
essentially take four orders of magnitude

10:29.620 --> 10:32.740
compute scaling that readily now.
I think we,

10:33.140 --> 10:37.100
Coming to RL, I agree
that essentially RL scaling

10:37.160 --> 10:39.070
There are two points
that I want to make about RL.

10:39.070 --> 10:41.390
RL scaling, uh,
is also extremely expensive.

10:41.840 --> 10:44.180
Uh,
it's more expensive than pre-training

10:44.260 --> 10:48.170
Uh, the other point
that I want to make about RL is, uh,

10:48.200 --> 10:51.940
gives us general capability increases,
RL gets us very

10:52.120 --> 10:55.960
jagged increases in capability.
You only get increase in capability in

10:56.000 --> 10:59.820
domains like coding and math, uh,
which essentially kind of

10:59.860 --> 11:03.740
defeats the,
the specific model of intelligence

11:03.800 --> 11:06.720
that it will be better at everything by
2030.

11:07.080 --> 11:10.780
So if we,
if we cannot find good RL candidates

11:10.800 --> 11:14.740
loops for different kinds of event,
we might not even solve, uh, forget

11:14.820 --> 11:17.120
solving for like robotics
and like other things.

11:17.180 --> 11:21.140
Even in just like atoms world,
we might not be able to solve all of it

11:21.280 --> 11:24.600
to human level because we will not just
find enough RL data, etc.,

11:25.320 --> 11:28.979
or enough closed loops, etc., uh, uh,
to essentially do it.

11:29.080 --> 11:32.360
Uh, so, so I don't think
that RL also scales

11:32.600 --> 11:36.540
arbitrarily that you can just 1000x
the RL compute and actually get

11:36.580 --> 11:40.320
away with it. Uh, there
is a limit to how much RL compute you can

11:40.400 --> 11:43.040
RL also just gets you increases in
capability.

11:43.380 --> 11:47.220
Yes, economically viable, uh,
the economic, the killer use

11:47.260 --> 11:51.140
case for LLMs currently is coding,
and coding is economically viable, etc.

11:51.760 --> 11:53.900
There's a small, uh,
probability gap I have.

11:53.940 --> 11:57.460
This leads to some sort of recursive
self-improvement or some sort of, uh,

11:57.480 --> 11:59.840
breakthrough, etc.,
that happens because of this.

12:00.200 --> 12:04.020
Uh, but aside from that gap, uh,
which sure, we can talk about

12:04.400 --> 12:08.360
Uh, but aside from that gap, uh,
the path essentially just by scaling

12:08.500 --> 12:12.370
RL or just by scaling pre-training
or just by scaling RL and pre-training,

12:12.780 --> 12:16.320
uh,
getting to ASI just in the current

12:16.740 --> 12:19.200
uh,
in your definition of intelligence seems

12:19.240 --> 12:23.020
unlikely. Um, to me, like forget 25%,
I would give

12:23.360 --> 12:26.360
sub 1% chance in the current paradigm, uh,
in these specific

12:26.400 --> 12:30.260
circumstances. Not to say that, uh,
I have a much higher probability was

12:30.340 --> 12:33.660
might get to superintelligence by 2030,
but, uh, that's

12:33.740 --> 12:37.140
essentially assuming technology
that haven't been invented yet come into

12:37.200 --> 12:40.460
being. Uh, so that
was the second point about RL.

12:40.600 --> 12:43.940
Uh, your third point was tied into this,
is like, okay, we might need one more

12:43.960 --> 12:45.960
breakthrough. Uh, current RL might not...

12:46.000 --> 12:47.640
You think there's a chance
that might not be enough.

12:47.680 --> 12:49.360
Maybe it's enough, but we might,
maybe it's not enough.

12:49.400 --> 12:50.920
Maybe we need one more breakthrough.

12:51.040 --> 12:54.469
And, uh, uh, your submission there
was that, hey,

12:54.860 --> 12:58.590
breakthrough will happen, uh,
because look at the last 10 years,

12:58.780 --> 13:02.740
we've gotten transformers
and we've gotten pre-training scaling,

13:02.840 --> 13:06.280
RL. So looks like breakthroughs
are coming very, very quickly.

13:06.700 --> 13:10.380
Um, I think is, uh, this,
I don't think that,

13:11.000 --> 13:14.680
uh,
there is enough data points for you to

13:15.040 --> 13:19.020
Uh, even sta- in the start of GPT,
despite essentially almost

13:19.080 --> 13:22.800
all of world's intelligence
and attention going to this problem,

13:22.940 --> 13:26.520
RL scaling, which we've cracked, uh,
there's not been another scaling

13:26.580 --> 13:30.140
paradigm that we've cracked.
So apart from pre-training and RL scaling,

13:30.260 --> 13:33.380
people are ready to throw compute at
other scaling things, and that's not

13:33.420 --> 13:35.740
something that we've cracked.
I'll take your rebuttal.

13:36.040 --> 13:36.280
[speaker_0] Oh, rebuttal.

13:36.300 --> 13:40.239
[speaker_1] Like what other scaling
paradigm have,

13:40.280 --> 13:40.560
RL-

13:40.870 --> 13:40.890
[speaker_0] No, no.

13:40.900 --> 13:42.330
[speaker_1] Despite throwing insane-

13:42.330 --> 13:45.880
[speaker_0] Okay,
it has a lot of attention,

13:45.920 --> 13:47.180
You're saying this means
that other breakthroughs-

13:47.220 --> 13:49.320
[speaker_1] No, no other breakthroughs.
I'm saying other breakthroughs...

13:49.840 --> 13:53.020
I'm saying, uh, other breakthroughs
are possible in the sense they're not

13:53.160 --> 13:57.080
physically impossible. But A,
do you disagree that there are other,

13:57.200 --> 13:58.200
things that we can just throw,

13:59.220 --> 14:03.120
s- other, other scalable things
that we can throw compute money,

14:03.140 --> 14:06.980
more intelligence aside from, uh,
like just regular pre-training

14:07.200 --> 14:08.160
and RL?

14:08.560 --> 14:12.460
[speaker_0] I think there
is a huge backlog of research hypotheses

14:12.540 --> 14:13.540
at all these AI companies.

14:13.600 --> 14:14.520
[speaker_1] I doubt it.

14:14.660 --> 14:15.460
[speaker_0] Like lot of very-

14:15.540 --> 14:15.980
[speaker_1] I doubt it

14:15.990 --> 14:19.830
[speaker_0] ... obvious things to try,
but because compute is scarce,

14:19.920 --> 14:20.060
decide-

14:20.080 --> 14:20.480
[speaker_1] I don't know

14:20.500 --> 14:21.450
[speaker_0] ... okay, which ones to, uh-

14:21.480 --> 14:25.300
[speaker_1] I get it, but essentially, I,
I, I, I, I hear that argument saying

14:25.320 --> 14:28.980
that obviously we can improve our models
in X sector, but if there were such

14:29.140 --> 14:32.740
obvious scaling paradigms
that essentially could have been done

14:32.800 --> 14:36.780
than the current paradigms, uh,
I think you would see some evidence of it.

14:37.080 --> 14:39.809
You would see there's,
there's enough time essentially going into

14:39.809 --> 14:39.999
that-

14:40.260 --> 14:41.540
[speaker_0] No, I, I hear all excuses

14:41.800 --> 14:41.809
[speaker_1] ... that-

14:41.840 --> 14:44.360
[speaker_0] I'm saying like you right now,
you are not an expert AI researcher.

14:44.460 --> 14:48.140
You can right now take a pen and paper,
sit for half a day and come up with, okay,

14:48.380 --> 14:52.320
if I had huge amount of GPUs, here
are 10 hypotheses I would try, and

14:52.380 --> 14:53.420
you can write them down.

14:53.430 --> 14:53.430
[speaker_1] Yeah.

14:53.430 --> 14:57.340
[speaker_0] And then we will ask, okay,
well, why haven't these, like,

14:57.400 --> 15:01.100
know, Open-- Anthropic try them
and they failed, or did nobody try them?

15:01.240 --> 15:01.400
[speaker_1] I-

15:01.469 --> 15:01.689
[speaker_0] If nobody tried-

15:01.710 --> 15:05.380
[speaker_1] ... doubt it's that easy.
Can you name,

15:05.420 --> 15:09.370
that hasn't been tried
that you think has a higher chance of it

15:09.740 --> 15:11.860
an RL level breakthrough if tried?

15:12.200 --> 15:14.880
[speaker_0] I'm not saying any one idea
if I try that will definitely work.

15:14.940 --> 15:15.230
I'm saying-

15:16.040 --> 15:19.480
[speaker_1] No,
any example of an idea

15:19.540 --> 15:23.460
If nobody's tried it,
I didn't find any research papers on it,

15:23.520 --> 15:25.320
we'll probably get ASI.

15:25.420 --> 15:29.020
[speaker_0] Uh,
if you just want random ideas,

15:29.200 --> 15:29.840
Uh-

15:29.920 --> 15:30.060
[speaker_1] Sure.

15:31.040 --> 15:34.670
Just to understand, like,
what kind of ideas do you think

15:34.700 --> 15:38.220
these guys are so compute constrained
that if there is an RL level breakthrough

15:38.300 --> 15:42.060
just sitting on like some researcher's
notepad and they've not had the time.

15:42.160 --> 15:45.460
Because even for RL,
they didn't actually have to use

15:45.500 --> 15:48.340
the idea. Uh, the idea, the idea, the,
the-

15:48.960 --> 15:52.840
[speaker_0] Yeah. Uh, one example,
most of the training still happens in very

15:52.940 --> 15:56.380
code, like, you know, PyTorch, you know,
like four-bit, eight-bit floating point

15:56.460 --> 15:59.760
numbers. If you really wanted,
you could optimize all this way down.

15:59.800 --> 16:03.300
You could literally run training inside an
ASIC. You could optimize the code.

16:03.340 --> 16:06.540
[speaker_1] Yeah. So, so for example,
that's like a very bad idea. No, no, no.

16:06.580 --> 16:10.400
So for example, running ASIC, so, so
that essentially assumes

16:10.440 --> 16:12.220
like a lot of other things need to move in
the world.

16:12.260 --> 16:15.990
One, the amount of GPU capacity
that already exists allocated in the world

16:16.020 --> 16:19.480
needs to go away. Secondly,
there's a reason why even of co...

16:19.520 --> 16:22.510
Like doing this does not get you
that much compute because essentially

16:22.520 --> 16:25.570
compute constrained,
you're memory constrained, uh,

16:25.570 --> 16:28.080
training and you're basically memory
bandwidth constrained, not even memory

16:28.100 --> 16:31.380
constrained. And, uh, yes, sure,
like we have like silicon

16:31.440 --> 16:34.200
photonics,
and we have other breakthroughs

16:34.260 --> 16:38.159
engineering problem. Uh,
I don't think it's as easy as like, hey,

16:38.200 --> 16:41.860
on ASIC and we get like 100X speed up
and nobody's had the time

16:42.279 --> 16:45.200
Uh,
there's so much incentive for any smart

16:45.240 --> 16:48.770
If you think it's
that easy to go replace the entire compute

16:48.840 --> 16:52.430
ASIC and people have just like not tried
it because of some other

16:52.460 --> 16:54.000
crunch, I, I think you're mistaken.

16:54.030 --> 16:54.110
[speaker_0] No, no, I-

16:54.150 --> 16:57.460
[speaker_1] Like there's insane amount of
like capitalist in-incentive to do it.

16:57.660 --> 17:01.440
Like, th-which would be like,
"Hey." I'm saying if there's any other

17:01.980 --> 17:04.950
uh,
the amount of compute you need to quote

17:05.000 --> 17:08.680
small. Uh, and therefore if, if there
were these insane...

17:08.720 --> 17:12.210
o1, for example,
didn't require too much compute on forward

17:12.240 --> 17:15.940
could work. Uh, if you had such an idea,
you could show that, hey, this

17:16.020 --> 17:19.870
works and we want to use this to scale our
models and it's sitting in the labs and

17:19.900 --> 17:23.260
we just think,
but looks like there's n-none,

17:23.580 --> 17:27.360
[speaker_0] Uh, oh, okay. Yeah.
I'm saying there are a lot of research

17:27.460 --> 17:31.420
require a lot of compute to prove even as
a proof of concept, okay, this is worth

17:31.460 --> 17:31.919
exploring.

17:32.040 --> 17:33.990
[speaker_1] I, I doubt it.

17:34.070 --> 17:34.070
[speaker_0] Yeah.

17:34.120 --> 17:34.900
[speaker_1] I think they all-

17:35.300 --> 17:35.570
[speaker_0] The idea is-

17:35.570 --> 17:39.300
[speaker_1] They all show signs of life.
All of these research ideas show signs of

17:39.380 --> 17:42.580
life much before you have to scale them to
get gains out of it.

17:43.220 --> 17:47.100
There are very few research ideas
that quote unquote "only get unlocked at,"

17:47.380 --> 17:50.850
if you only train them for, like only
if you spend billion dollars of compute is

17:50.880 --> 17:54.220
the first sign of life you get from
that research idea, uh,

17:54.280 --> 17:57.230
research idea saying that, "Hey,
if I just keep throwing compute at it"-

17:57.330 --> 18:00.280
[speaker_0] In research,
almost every breakthrough

18:00.380 --> 18:04.240
Like only after you scaled it to literal
billions of dollars you saw that it was

18:04.260 --> 18:04.560
working.

18:05.060 --> 18:07.740
[speaker_1] No.
Give me one research idea that's gonna

18:08.100 --> 18:09.940
GPT-1 was like a very small model.

18:10.480 --> 18:10.550
[speaker_0] RL or-

18:10.650 --> 18:11.669
[speaker_1] RNN was a very small model.

18:11.740 --> 18:13.160
[speaker_0] Uh, GPT-

18:13.280 --> 18:14.820
[speaker_1] RL, for example,
didn't require billions of dollars.

18:14.980 --> 18:16.780
So o1, for example,
was a very cheap model.

18:17.140 --> 18:21.040
Once GPT-4 was trained,
training on chain of,

18:21.080 --> 18:24.964
of thought and doing RL was like a very,
very small experiment. So,

18:25.174 --> 18:25.174
uh-

18:25.184 --> 18:27.854
[speaker_0] Can, uh,
do you have numbers on that?

18:27.864 --> 18:29.584
[speaker_1] Yeah, yeah. So, so there's, A,
there's ...

18:29.774 --> 18:33.424
I, I'll find out where I read about this,
but basically the first version of the o1

18:33.444 --> 18:35.264
model was just get tried in a lab,
et cetera.

18:35.604 --> 18:38.564
For example, uh,
you can take Llama 3 right now,

18:39.164 --> 18:40.434
and, uh, you can take-

18:40.464 --> 18:43.323
[speaker_0] No, no, not,
not about right now.

18:43.484 --> 18:46.334
Back then when o1 was first tried,
like my guess without having-

18:46.334 --> 18:48.344
[speaker_1] Like I'm saying,
you can make it more efficient.

18:48.384 --> 18:48.504
[speaker_0] S- sorry.

18:48.544 --> 18:49.364
[speaker_1] You can make it more

18:50.304 --> 18:50.523
efficient. Yeah.

18:50.744 --> 18:54.104
[speaker_0] Yeah,
like without having read it, my guess

18:54.144 --> 18:57.524
required at least $10 million on top of
the

18:57.564 --> 19:01.324
GPT-4 training cost,
and like I have not actually checked the

19:01.384 --> 19:01.804
but yeah.

19:02.344 --> 19:05.724
[speaker_1] You needed the GPT-4 to get
trained first,

19:05.764 --> 19:08.664
and then you found a new scaling paradigm
on that scaling paradigm.

19:09.104 --> 19:09.193
[speaker_0] Yeah, yeah.

19:09.224 --> 19:09.634
[speaker_1] That I agree.

19:09.653 --> 19:10.944
[speaker_0] I'm saying first you had the
whole-

19:11.004 --> 19:11.564
[speaker_1] But go from four to-

19:11.614 --> 19:13.513
[speaker_0] ... GPT-4 training cost,
then GPT-4 trained.

19:13.524 --> 19:13.614
[speaker_1] Right.

19:13.984 --> 19:14.704
[speaker_0] Then you had-

19:15.164 --> 19:15.434
[speaker_1] But o1, when you look at-

19:15.434 --> 19:19.384
[speaker_0] ... to research hypotheses.
Each of those hypotheses took at least $10

19:19.804 --> 19:21.124
to test, and 10 million
is a random number.

19:21.164 --> 19:21.264
[speaker_1] No.

19:21.344 --> 19:22.824
[speaker_0] I think it's actually
important. Uh-

19:22.904 --> 19:24.384
[speaker_1] I don't think it takes $10
million to test.

19:24.444 --> 19:27.164
I think it takes like a few hundred
thousand dollars,

19:27.224 --> 19:30.924
dollars to test a research idea to see
if that it has any

19:31.024 --> 19:34.564
sign or any chance. Yes,
like you might see like different gains

19:34.664 --> 19:35.374
losses, et cetera.

19:35.394 --> 19:37.124
[speaker_0] That is then possibly 100
million is what I'm claiming.

19:38.004 --> 19:41.324
[speaker_1] I doubt it. I don't know
if any of all these ideas take 10 million.

19:41.384 --> 19:44.463
For example,
o1 didn't take 10 million post GPT-4

19:44.864 --> 19:47.324
[speaker_0] We can actually go
and check that maybe.

19:47.404 --> 19:48.184
Like I know this is-

19:48.224 --> 19:48.274
[speaker_1] No, no

19:48.274 --> 19:51.044
[speaker_0] ... not public information,
but we can go and see like-

19:51.244 --> 19:53.644
[speaker_1] No, no,
I think we can check because the amount of

19:53.684 --> 19:54.924
Yeah, sure, because the actual amount ...

19:54.934 --> 19:57.954
So the idea, the way research works
is get an idea,

19:58.164 --> 20:01.304
this thing. If y-
if you see any sort on it or anything

20:01.314 --> 20:04.584
"Cool. You know what?
This warrants more investigation,

20:04.744 --> 20:07.804
actually go improve something in the
form." And then you can do optimizations

20:07.844 --> 20:09.224
it, and then you can make it better,
et cetera.

20:09.524 --> 20:13.244
But the first amount of like this thing,
the sign of life saying that this has,

20:13.304 --> 20:16.394
this probably has some legs, can come
from not that much money.

20:17.024 --> 20:17.474
[speaker_0] Yeah. This is-

20:17.504 --> 20:19.144
[speaker_1] And then obviously getting
actual real-

20:19.174 --> 20:21.064
[speaker_0] ... GPT works, however, ML
is not like this.

20:21.724 --> 20:23.264
[laughs] I think that's my actual method.

20:23.304 --> 20:26.364
[speaker_1] All of ML has been like this.
Like pre-training, for example,

20:26.404 --> 20:28.664
was a scary small model. GPT-2
was a very small model.

20:29.184 --> 20:32.864
Uh, it took like, uh, probably 150 or 2,
like the, the initial

20:32.924 --> 20:35.284
GPT cost like a million dollars, I think.
Not that much.

20:35.784 --> 20:39.704
Uh, GPT-3 again,
like took like slightly larger amount of

20:40.024 --> 20:41.424
but nothing compared to right now.

20:41.824 --> 20:45.604
And the amount of, uh,
the only reason you go from GPT-2 to 3 to

20:45.724 --> 20:49.484
4 to 5 or whatever subsequent models
is because you see gains

20:49.524 --> 20:53.484
from scaling all the time. Uh, RNN,
for example, you can see that, okay,

20:53.544 --> 20:55.884
scale the RNN paradigm,
you get gains from it.

20:56.424 --> 20:59.673
Uh, and then, then you stop seeing gains,
so that's why RNN didn't scale or

20:59.684 --> 21:01.214
whatever. Or, uh, then they went to LSTM.

21:01.284 --> 21:02.694
LSTM didn't scale,
and then they went to transformer.

21:02.784 --> 21:05.754
Transformer scaled fairly much, and like,
cool, we found a scalable paradigm.

21:06.204 --> 21:09.244
The idea that you can see gains
and then you try to scale it

21:09.744 --> 21:10.064
So-

21:10.084 --> 21:10.524
[speaker_0] Yeah

21:10.534 --> 21:14.144
[speaker_1] ... uh, so it's not that, oh,
I have to like literally roll dice of $10 million

21:14.224 --> 21:16.584
each time to get one idea. It's not
that random.

21:16.944 --> 21:20.884
[speaker_0] Okay,
I'll make a tighter claim. After GPT-2

21:20.924 --> 21:24.444
had come out,
if you wanted to try out any ML research

21:24.484 --> 21:28.304
hypothesis,
you probably needed at least $10 million

21:28.324 --> 21:31.924
life or not. Like for,
for most of the research hypotheses-

21:31.984 --> 21:32.134
[speaker_1] No, you could use the, uh-

21:32.134 --> 21:34.064
[speaker_0] ...
you needed at least $10 million since GPT-

21:34.104 --> 21:38.084
[speaker_1] No, no. So the id- the,
the idea is that you train GPT-1,

21:38.104 --> 21:41.284
saw that there were gains on it.
The obvious thing to do is that, "Hey,

21:41.324 --> 21:44.224
found a scalable paradigm.
I think I can scale it further to see

21:44.324 --> 21:44.644
gains."

21:45.024 --> 21:45.194
[speaker_0] Sure.

21:45.384 --> 21:48.484
[speaker_1] And then you train GPT-3,
and there are still gains from it,

21:48.524 --> 21:51.704
res- dead research ideas
that might show signs of life, but o-

21:51.724 --> 21:53.084
it, they start breaking.

21:53.124 --> 21:56.944
[speaker_0] Yeah,
I'm saying to try any other idea,

21:56.964 --> 22:00.604
pre-training scaling.
If you have any other idea apart from

22:00.824 --> 22:04.584
GPT-2 came out 2018, right? So after 2018,
if you had any other

22:04.644 --> 22:08.404
idea besides the whole pre-training
scaling thing, you wanted to test it out,

22:08.424 --> 22:12.304
would need at least $10 million for most
ideas to even test out and see if this

22:12.324 --> 22:13.164
has any life or not.

22:13.584 --> 22:15.604
[speaker_1] And that's also trivial
amounts of money if...

22:15.704 --> 22:18.424
I think if people who know about how these
things work, it's not completely

22:18.464 --> 22:20.764
unintuitive. They have,
they have a big idea of test.

22:21.044 --> 22:24.964
I, I'm not saying is that, uh,
there might be a breakthrough that

22:25.004 --> 22:28.004
in some old research paper
or scribbled in old diaries

22:28.064 --> 22:32.024
overlooked, uh, but it
is not as low-lying a fruit as you

22:32.564 --> 22:36.384
if,
if only we had just spent more compute,

22:36.824 --> 22:40.054
thousands of thousands of scalable
paradigms." Finding scalable paradigms is

22:40.104 --> 22:43.484
really, really hard.
We've only done like two ti-

22:43.504 --> 22:46.213
last 10 years and two times in the last 50
years and two times in the last thousand

22:46.224 --> 22:50.114
years, and, uh, essentially, uh, uh,
just because we've gotten

22:50.364 --> 22:53.764
lucky,
just because we got lucky doesn't mean

22:53.804 --> 22:56.064
happening in the next two, three, four,
five years.

22:56.084 --> 22:59.524
It's just like there'll just be like
scalable paradigms after scalable

22:59.784 --> 23:02.464
that will just keep showing up because
it's shown up last two times.

23:02.824 --> 23:05.744
[speaker_0] Okay,
so we have identified some disagreement

23:05.864 --> 23:07.344
Uh, how do you think we can resolve this?

23:07.384 --> 23:11.014
Like what data points will work
or what arguments about how do you think

23:11.014 --> 23:14.884
[speaker_1] One data point
that would work, one data point

23:14.944 --> 23:18.904
move you is that if you look other, uh,
if you look at other fields, you look

23:18.944 --> 23:22.884
at biology, et cetera, uh,
just because like there's one breakthrough

23:22.924 --> 23:26.884
that essentially leads to a different
class of

23:26.924 --> 23:28.414
discoveries or drug discovery happening.

23:28.464 --> 23:30.544
For example, you get, uh, let's say

23:31.704 --> 23:35.414
discovery through, like, for example,
RNA delivery of drugs,

23:36.044 --> 23:39.044
uh, like the mRNA vaccine, et cetera,
which you can like modify RNA and you can

23:39.064 --> 23:42.964
inject in people, et cetera.
That means that, yes, a lot of, a lot

23:43.044 --> 23:43.624
of, uh...

23:44.544 --> 23:47.644
That, that was a big deal,
like get a new branch of medicine,

23:47.664 --> 23:51.384
doesn't automatically mean
that the amount of new breakthroughs will

23:51.424 --> 23:55.224
increase. If anything, what we've seen
is that there is actually a slowdown in

23:55.284 --> 23:59.074
amount of new ideas
and new researches in every mature field

23:59.104 --> 24:02.264
more,
more attention goes into it because all

24:02.584 --> 24:05.844
Uh, so finding the next breakthrough
is not a linear process.

24:05.924 --> 24:08.524
It's actually a super linear process.
Not like it's a log process.

24:08.564 --> 24:12.004
You have to like spend 10X,
100X more resources to get more ideas,

24:12.464 --> 24:16.244
and, uh,
you pluck the low-hanging fruits very,

24:16.284 --> 24:18.304
is not that hard. It's not that easy.

24:18.524 --> 24:21.244
So- Um, so other fields at least do it.

24:21.284 --> 24:23.424
You can argue that M-ML
is different for some reason.

24:23.544 --> 24:26.604
I don't know why scientific idea is, like,
inherently be different in, uh, ML

24:26.624 --> 24:27.804
because like, oh, ML is different.

24:27.844 --> 24:31.704
Because mostly what happens is
if enough eyeballs look at a problem, uh,

24:31.744 --> 24:34.044
they look at all the low-hanging fruits,
then they go to the second level of

24:34.084 --> 24:36.804
low-hanging fruit,
and they keep doing this, and they,

24:36.814 --> 24:38.184
hypothesis. And like, okay, cool.

24:38.664 --> 24:42.474
Uh, this has been already done in physics,
for example, uh, or

24:42.504 --> 24:45.904
like chemistry, et cetera.
We don't expect like crazy amount of math

24:45.924 --> 24:49.684
come out, uh, by a mathematician, uh,
or like a

24:49.724 --> 24:52.864
great, like, new,
new lines of math schools,

24:53.384 --> 24:56.424
Uh,
and I feel like with more attention on a

24:56.474 --> 24:58.504
It becomes harder to e-curve. S-curve
is not easier.

24:58.824 --> 24:59.364
[speaker_0] Okay, uh-

24:59.414 --> 25:01.724
[speaker_1] It depends on what part of the
S-curve you're on, I think.

25:02.324 --> 25:03.924
[speaker_0] Yeah, I think
that S-curve analogy is good.

25:04.004 --> 25:07.924
So yeah,
if we are comparing to other scientific

25:07.944 --> 25:09.644
field in which experiments are expensive.

25:09.704 --> 25:13.304
So, like, we should not compare this to,
like, theoretical math where you just need

25:13.384 --> 25:14.724
person sitting with pen and paper.

25:14.804 --> 25:17.884
Like, something like drug discovery
is a better analogy for this.

25:18.304 --> 25:21.844
And by the way, I do think ML
is a bit different,

25:21.944 --> 25:25.274
analogizing with other fields, uh,
in drug discovery.

25:25.324 --> 25:27.004
[speaker_1] Or like theoretical physics,
for example.

25:27.344 --> 25:30.244
[speaker_0] Sorry,
y-you mean theoretical physics is cheap

25:30.744 --> 25:34.404
[speaker_1] Is also expensive. It
was cheap at some point,

25:34.424 --> 25:35.964
like, breakthrough with pen and paper.

25:36.384 --> 25:40.004
But now if you want to, like, uh,
experimental physics, sorry,

25:40.244 --> 25:40.314
physics-

25:40.314 --> 25:40.674
[speaker_0] Mm. Yeah

25:40.674 --> 25:44.524
[speaker_1] ... uh,
you could make like a... Yeah,

25:44.544 --> 25:46.174
to do any new physics.

25:46.624 --> 25:50.224
[speaker_0] Sure. Yeah. Okay, fine.
Experimental physics would work.

25:50.504 --> 25:54.254
Uh, yeah.
Now to actually argue about experimental

25:54.284 --> 25:57.624
physics or drug discovery,
I will have to actually read more about

25:58.144 --> 25:58.924
physics or drug discovery. [chuckles]

25:59.064 --> 26:01.564
[speaker_1] Uh, but do you agree that,
that, like-

26:01.744 --> 26:02.534
[speaker_0] Also, we have to-

26:02.804 --> 26:02.814
[speaker_1] With the-

26:02.824 --> 26:03.084
[speaker_0] Yeah, no

26:03.204 --> 26:04.094
[speaker_1] ... enhanced attention and-

26:04.204 --> 26:05.604
[speaker_0] Also,
we have to pick a time period. Sorry.

26:05.864 --> 26:09.684
Uh, also, like, if we take, you know,
drug discovery or experimental physics as

26:09.724 --> 26:13.574
example,
we have to take a time period in the

26:13.604 --> 26:17.584
was known that, okay, this thing is,
you know, in like boom phase and like, you

26:17.624 --> 26:19.504
know, lots of new capabilities
are coming out.

26:19.544 --> 26:22.364
Like, like in ML right now, we know we
are in that sort of phase.

26:22.424 --> 26:24.064
Like,
there may be new things we could try.

26:24.144 --> 26:26.524
Like, it's not like, okay,
it's like a dead field-

26:26.564 --> 26:26.574
[speaker_1] Sure

26:26.574 --> 26:27.204
[speaker_0] ... mature field.

26:28.024 --> 26:28.564
We have not reached-

26:28.584 --> 26:32.424
[speaker_1] Sure, sure. Yeah, I agree.
And like example,

26:32.464 --> 26:35.184
that.
I don't know about experimental physics,

26:35.244 --> 26:39.224
also,
where a bunch of like new physics

26:39.264 --> 26:41.144
1920s. There was relativity that came out.

26:41.184 --> 26:44.364
There's like, like,
a bunch of these new...

26:44.434 --> 26:46.924
All of them essentially got started in
1920s.

26:47.144 --> 26:49.124
There was one period in the 1600s
that happened.

26:49.244 --> 26:51.254
Uh, but, uh, I agree that we might-

26:51.604 --> 26:54.604
[speaker_0] Studying that particular time
period to study, you know,

26:54.664 --> 26:57.334
that happened there?
And after a few breakthroughs-

26:57.334 --> 26:57.334
[speaker_1] Yeah

26:57.384 --> 27:00.224
[speaker_0] ... came out,
now to extrapolate, okay,

27:00.264 --> 27:02.244
breakthroughs come out,
and how expensive will it be to run-

27:02.254 --> 27:02.554
[speaker_1] Sure, sure

27:02.584 --> 27:03.884
[speaker_0] ... experiments?
I think that's the kind of-

27:03.924 --> 27:04.424
[speaker_1] And I agree

27:04.544 --> 27:05.344
[speaker_0] ... study.

27:05.384 --> 27:09.204
[speaker_1] And I agree. And, and,
and then essentially, right,

27:09.304 --> 27:12.904
part of the S-curve, that then there,
then you should

27:12.944 --> 27:14.344
expect more breakthroughs to come out.

27:14.724 --> 27:18.274
If you're on the horizontal part of the
S-curve, then you should think, then you

27:18.284 --> 27:20.044
should expect less discoveries to come
out.

27:20.104 --> 27:23.264
Which part of the S-curve you are on,
I don't think either of us know.

27:23.384 --> 27:25.464
Uh, but I think-

27:25.604 --> 27:25.984
[speaker_0] I'm claiming-

27:26.074 --> 27:26.724
[speaker_1] ... the more time-

27:26.884 --> 27:30.504
[speaker_0] That's my claim. And also,
sure, some decent probability they're not,

27:30.944 --> 27:33.384
[speaker_1] Okay. Uh,
what will be your evidence to saying

27:33.444 --> 27:37.144
Like, how are you so sure?
I have zero base to say that this thing,

27:37.184 --> 27:40.504
can, on hindsight be like, "Oh,
looks like we were on the vertical part

27:40.784 --> 27:44.674
[speaker_0] Okay. And for me,
it's just extrapolate last five to 10

27:44.724 --> 27:46.624
data points. Okay, which year did, uh...

27:46.804 --> 27:50.104
Well, actually you can go back to,
you know, which year did AlexNet come out?

27:50.154 --> 27:50.154
[speaker_1] Uh-

27:50.184 --> 27:53.144
[speaker_0] Then which year did, you know,
transformer come out?

27:53.184 --> 27:53.644
Which year did-

27:53.654 --> 27:53.654
[speaker_1] That-

27:53.654 --> 27:57.544
[speaker_0] ... GPT-2 come out?
And just put these on like a year

27:57.584 --> 27:58.254
versus, you know-

27:58.304 --> 27:58.684
[speaker_1] That is-

27:58.733 --> 28:02.544
[speaker_0] ...
new breakthrough kind of graph,

28:02.584 --> 28:04.344
this look like your S-curve is saturated?

28:04.404 --> 28:06.184
Yes or no?" And no,
it doesn't look like there's-

28:06.304 --> 28:08.004
[speaker_1] And this is like an outside
perspective.

28:08.044 --> 28:11.584
You don't have to trust it,
but Ilya Sutskever, who

28:11.624 --> 28:15.564
was responsible for GPT,
was responsible for RL, uh, comes

28:15.584 --> 28:18.764
and says essentially all the good ideas
are done and we need to spend some time

28:18.804 --> 28:21.484
doing new research. And this
is the time for...

28:21.524 --> 28:24.594
I don't know if you saw that episode
on Dwarkesh, but he's like,

28:24.684 --> 28:26.624
scaling is over. Now
is the time for new research."

28:27.524 --> 28:29.384
[speaker_0] What are Ilya Sutskever's
timelines?

28:29.464 --> 28:32.804
Are they less bullish than me
when I'm saying 25% ASI 2030?

28:32.864 --> 28:34.994
Like,
does Ilya have like less bullish timelines

28:35.914 --> 28:37.593
[speaker_1] I don't know,
but I think that's irrelevant.

28:38.024 --> 28:38.263
[speaker_0] Sorry?

28:38.984 --> 28:40.994
[speaker_1] I think that's irrelevant.
I feel-

28:41.124 --> 28:42.624
[speaker_0] No, no, you brought up Ilya-

28:42.674 --> 28:42.674
[speaker_1] Yeah. No

28:42.674 --> 28:46.454
[speaker_0] ... then I was like, okay,
like does Ilya agree with me already,

28:46.464 --> 28:46.784
thing.

28:47.464 --> 28:48.824
[speaker_1] It doesn't matter
if Ilya agrees with you.

28:48.884 --> 28:52.424
What Ilya does agree,
disagree with you on is

28:52.484 --> 28:56.164
that the low-hanging scaling fruits have
been plucked and we need to go find new

28:56.204 --> 28:59.884
scaling breakthroughs, uh,
which he's confident that he will find,

29:00.004 --> 29:03.734
economic incentive to say that. Uh,
but he's saying that, "Okay,

29:03.764 --> 29:05.704
are done,
and now we need to find something new to

29:06.164 --> 29:10.104
[speaker_0] Okay. No, but
if you strongly defer to Ilya on

29:10.184 --> 29:12.234
this question,
then Ilya's actual timelines are-

29:12.264 --> 29:13.284
[speaker_1] No, I don't. I don't,

29:14.164 --> 29:17.064
I don't,
I don't defer strongly to Ilya on this

29:17.124 --> 29:20.434
All I'm saying is there's one extra
evidence point saying that the guy who was

29:20.464 --> 29:23.164
involved with all these three
breakthroughs comes and says

29:23.184 --> 29:26.604
no low-hanging fruits anymore,
and we have to go find more, uh,

29:26.864 --> 29:30.364
should update you that we
are probably not on the vertical part of

29:30.424 --> 29:33.424
[speaker_0] No, no, no. Uh,
I took a different lesson from this.

29:33.584 --> 29:37.204
Uh, like Ilya has, again,
as bullish as me timelines,

29:37.244 --> 29:40.764
low-hanging fruit is picked.
What he means is, okay, the

29:40.804 --> 29:43.403
low-hanging fruit is not like one month
low-hanging fruit.

29:43.464 --> 29:45.304
It's like two years,
three years low-hanging fruit.

29:45.724 --> 29:48.604
[speaker_1] No, I think his time,
his timelines are significantly longer.

29:48.684 --> 29:52.564
I think he's like in the next eight to 10
years will probably be some areas

29:52.624 --> 29:56.144
of research that we need to find to scale,
and yeah.

29:56.524 --> 29:56.534
[speaker_0] Okay.

29:56.564 --> 29:58.244
[speaker_1] I, I think his
is probably longer.

29:58.304 --> 30:02.124
[speaker_0] So, so again, I'm like, yeah,
if you want to debate specifically Ilya's

30:02.204 --> 30:04.064
I think actually we have to go
and find his timeline.

30:04.124 --> 30:05.244
[speaker_1] I don't want to debate Ilya's
worldview.

30:05.324 --> 30:08.964
I'm saying that this, the,
it's not a question of whether Ilya

30:09.004 --> 30:12.704
The question of, uh,
Ilya has an evidence point

30:12.964 --> 30:15.153
may or may not update you on
which part of the S-curve we are on.

30:15.284 --> 30:18.784
[speaker_0] Yeah. So for that,
I need to first even understand does Ilya

30:18.824 --> 30:19.934
or does he have some major disagreement?

30:19.984 --> 30:23.924
[speaker_1] No, he doesn't. He doesn't.
He, his timelines are, uh,

30:24.880 --> 30:27.429
[speaker_0] So then want to know what his
timelines are.

30:27.900 --> 30:31.630
[speaker_1] I think he said the next...
I might have to check on this, but

30:31.740 --> 30:35.640
Dwarkesh episode, he says
that like six to eight years, uh,

30:35.680 --> 30:37.180
and then we'll find something
that we can scale.

30:38.100 --> 30:41.740
And, uh,
there are no good scaling candidates in

30:42.200 --> 30:45.840
[speaker_0] Yeah, no, to continue this,
like, I think, like,

30:45.960 --> 30:47.800
more... Like, are you sure about this?
You know.

30:47.840 --> 30:48.120
[speaker_1] Yeah.

30:48.180 --> 30:49.120
[speaker_0] Give me more context and-

30:49.520 --> 30:52.860
[speaker_1] No, no, I'm saying,
I'm saying, I'm saying that,

30:52.980 --> 30:55.180
S-curve we are on, uh,
there's uncertainty on it.

30:55.740 --> 30:58.180
Uh, maybe you're right and we
are on the vertical part of the S-curve.

30:58.580 --> 31:02.300
I think if you just plot it from AlexNet,
it's mostly the same

31:02.340 --> 31:05.940
paradigm. Uh, and I would count AlexNet

31:06.660 --> 31:10.600
as one breakthrough,
Transformers as another breakthrough,

31:10.640 --> 31:11.000
breakthrough.

31:11.980 --> 31:12.930
But apart from that,

31:13.760 --> 31:14.790
I don't see why-

31:15.160 --> 31:17.910
[speaker_0] Yes, once, uh, no,
but there are also like minor ones

31:18.120 --> 31:21.800
[speaker_1] ... like,
I don't know any other paradigm that

31:21.840 --> 31:25.220
Okay,
I won't even count AlexNet 'cause AlexNet

31:25.280 --> 31:29.180
works. Uh,
I would just count Transformers

31:29.240 --> 31:31.640
that, uh, show that scale is all you need.

31:32.160 --> 31:35.910
Uh, and then there's n-
there's not been a third candidate for

31:35.960 --> 31:39.180
need, or a third S-curve
that you can stack on top of these things.

31:39.560 --> 31:41.550
Pre-training was one S-curve.
We exhausted it.

31:41.600 --> 31:45.520
[speaker_0] There are multiple points.
One is like proof that Transformers

31:45.600 --> 31:49.520
all,
and then there's a second data point

31:49.640 --> 31:50.570
Or even with, like-

31:50.620 --> 31:53.010
[speaker_1] Transformers
are useful at all is...

31:53.050 --> 31:55.360
Transformers are only useful because they
can scale.

31:55.740 --> 31:58.110
[speaker_0] No. Uh, GPT-2 was-

31:58.120 --> 31:58.420
[speaker_1] Because-

31:58.640 --> 32:02.060
[speaker_0] ... useful. Like, it
was a breakthrough by itself, even

32:02.100 --> 32:04.370
anything about whether GPT-2 would scale
or not.

32:04.420 --> 32:08.140
[speaker_1] If it doesn't, then... No,
what I'm talking about is that if you

32:08.200 --> 32:11.800
assume that scaling
is what gets you smarter models, uh,

32:12.120 --> 32:15.540
subscribe to that worldview,
then you need paradigms that can scale.

32:15.790 --> 32:17.330
Pre-training is one paradigm
that can scale.

32:17.560 --> 32:21.520
[speaker_0] Right now.
At the time GPT-2 was invented,

32:21.700 --> 32:23.360
ML research community didn't believe it.

32:23.420 --> 32:26.790
[speaker_1] Sure. And I, like,
I'm saying that it's, that's, that's

32:27.320 --> 32:28.880
irrespective, like, that's immaterial.

32:28.940 --> 32:32.769
What I'm saying right now is
if you believe that intelligence comes

32:32.780 --> 32:33.570
scaling things up-

32:34.180 --> 32:34.460
[speaker_0] Sure

32:34.540 --> 32:38.220
[speaker_1] ...
then scaling a paradigm up,

32:38.280 --> 32:38.800
scaling up.

32:39.460 --> 32:39.640
[speaker_0] Yeah.

32:39.700 --> 32:43.620
[speaker_1] In doing so, uh,
we have found only two paradigms that have

32:43.700 --> 32:44.060
scaled up.

32:44.180 --> 32:47.300
[speaker_0] We have found only two
paradigms that have scaled up.

32:47.360 --> 32:51.120
No, I mean, why does, uh,
the whole scaling fully connected

32:51.180 --> 32:54.440
networks back in twenty twelve,
twenty thirteen, why does that not count?

32:54.480 --> 32:57.420
Why does scaling LSTM not count? I,
I'm not clear.

32:57.620 --> 33:00.000
[speaker_1] Because LSTM didn't scale.
CNNs didn't scale.

33:00.820 --> 33:02.920
[speaker_0] What do you mean?
CNNs do scale.

33:03.220 --> 33:06.540
[speaker_1] As in, like,
you get diminishing returns from scaling

33:07.060 --> 33:08.950
CNNs, for example, cannot-

33:09.320 --> 33:09.330
[speaker_0] Yeah.

33:09.340 --> 33:12.020
[speaker_1] People try to do language
experiments on CNN,

33:12.080 --> 33:15.490
People try to do language experiments on
LSTM, they perform to a certain level, but

33:15.520 --> 33:17.500
essentially then they start degrading.

33:18.020 --> 33:21.420
So th- these, these paradigms have lots of

33:21.580 --> 33:22.900
limitations on how much they can scale.

33:23.540 --> 33:27.520
[speaker_0] Sure. So they,
they did scale up to some amount,

33:27.560 --> 33:29.980
saturated, I guess. Same.
Like you had some estimate, I guess.

33:30.020 --> 33:33.440
[speaker_1] So then,
then it's a failed candidate. Yeah,

33:33.500 --> 33:37.440
Like, like a good candidate is that, hey,
it doesn't matter how much compute

33:37.460 --> 33:39.220
we throw at it, just keep scaling.

33:39.720 --> 33:41.880
[speaker_0] No, no, it means
that back then it succeeded, right?

33:42.140 --> 33:45.640
CNNs, we did scale up to some amount,
then we realized we're getting diminishing

33:45.680 --> 33:46.200
returns.

33:46.280 --> 33:47.300
[speaker_1] No,
but that's what I'm saying, like,

33:48.320 --> 33:50.640
that way, like... No, no, that didn't...

33:50.740 --> 33:54.180
I don't know what your definition of
success is,

33:54.220 --> 33:58.060
to superintelligence. Candidates
that can get us to superintelligence are,

33:58.140 --> 33:59.940
are candidates that you can throw.

33:59.980 --> 34:03.770
There is no limit to how much, uh,
or there's no visible limit to

34:03.880 --> 34:05.880
how much compute you can throw at it.

34:05.920 --> 34:09.580
For example, if tomorrow we find out
that pre-training has stopped scaling,

34:09.600 --> 34:11.089
call pre-training a failed candidate.

34:11.400 --> 34:15.240
Currently, we have two candidates
that can absorb insane amounts of compute

34:15.320 --> 34:17.800
et cetera,
and can keep expecting gains from it.

34:18.400 --> 34:22.010
[speaker_0] No, even the paradigm
that gets us to superintelligence might

34:22.040 --> 34:24.240
saturate somewhere. E-e-every candidate-

34:24.250 --> 34:24.560
[speaker_1] Sure

34:24.580 --> 34:25.080
[speaker_0] ... can saturate.

34:25.220 --> 34:27.620
[speaker_1] They would, can saturate,
but I'm saying...

34:27.820 --> 34:31.400
A-and at that point, essentially, uh,
you can either draw a line and say

34:31.500 --> 34:35.060
intelligence is good enough.
Assuming that, okay, so you're in a,

34:35.140 --> 34:37.680
where you do not have enough, uh...

34:37.700 --> 34:40.500
You've not achieved the level of, or like,
not achieved, that you don't want to

34:40.540 --> 34:44.400
achieve. But I'm saying like, uh,
you think that you can still throw more

34:44.439 --> 34:46.800
at this and get more intelligence out of
this.

34:47.380 --> 34:51.180
Uh, and, uh, in that paradigm, there
are only two things that can

34:51.240 --> 34:53.760
absorb seemingly infinite amounts of
compute.

34:54.540 --> 34:55.670
Uh, and there's not been a third.

34:55.670 --> 34:57.680
[speaker_0] I have a problem with your
infinite thing.

34:57.820 --> 34:58.160
It's-

34:59.500 --> 35:02.560
[speaker_1] Seemingly infinite in the
sense that, sure,

35:02.600 --> 35:04.380
how much you can do it,
or there might be like some...

35:04.560 --> 35:07.240
There are obviously practical limits to
it, but there might be a theoretical limit

35:07.300 --> 35:10.860
as well. And I'm saying
that maybe before we get there, uh,

35:11.140 --> 35:14.060
superintelligence comes and then you,
you've achieved your goal

35:14.100 --> 35:14.700
to scale further.

35:15.720 --> 35:18.240
But until now,
there's no evidence to see that there

35:18.640 --> 35:21.580
[speaker_0] Sorry,
I'm still not super clear what your claim

35:21.620 --> 35:23.500
Can you like summarize this entire
argument?

35:23.580 --> 35:26.540
[speaker_1] Maybe I'll,
maybe I'll try to explain it in a

35:27.280 --> 35:31.200
A successful paradigm in my book
is one that can

35:31.260 --> 35:35.009
absorb all the compute capacity
that you can reasonably throw at it at

35:35.120 --> 35:35.680
point in time.

35:36.180 --> 35:39.790
[speaker_0] That point in time means what?
In that year, how many GPUs had

35:39.860 --> 35:41.420
humanity manufactured?

35:41.440 --> 35:45.140
[speaker_1] So it's also... Yeah, in
that point in time,

35:45.200 --> 35:49.180
So, so in 2026, uh, there
are two successful paradigms that can

35:49.260 --> 35:51.380
absorb all the compute
and still keep on giving gain.

35:51.700 --> 35:53.420
We have not exhausted these two paradigms.

35:53.870 --> 35:53.870
[speaker_0] Right.

35:53.880 --> 35:57.500
[speaker_1] And we still have like returns
to get from there,

35:57.540 --> 36:01.340
essentially get us to smarter models.
There are just two of them. And twenty...

36:01.680 --> 36:05.600
And so I'm saying like you keep
extrapolating it, our, our, our lim-- our,

36:05.760 --> 36:09.460
thing. So in twenty twenty--
by the time we get to 2030,

36:09.500 --> 36:12.540
that these two paradigms will not get to
superintelligence because there are

36:12.580 --> 36:15.900
practical limits to scaling them,
if not theoretical limits.

36:16.020 --> 36:19.940
Uh, and, uh,
we need either a third paradigm

36:19.960 --> 36:23.860
seventh paradigm to actually keep stacking
these S-curves to get us

36:23.920 --> 36:26.639
to the world that you
are claiming we'll be in, say,

36:27.000 --> 36:29.820
[speaker_0] Uh, okay. Sure. My probability

36:29.940 --> 36:33.772
of pre-training scaling plus RL scaling
gets us

36:33.832 --> 36:37.592
to ASI by 2030 is less, let's say,
less than

36:37.632 --> 36:38.092
10%.

36:38.232 --> 36:41.372
[speaker_1] Yeah,
like probably we do need a better, uh,

36:41.852 --> 36:44.292
breakthrough. Yeah.
Probably we do need another breakthrough.

36:44.512 --> 36:45.292
I am saying that

36:46.691 --> 36:50.112
these breakthroughs
are not super easy to come by.

36:50.172 --> 36:52.992
Dep- I think we d-
just did the S curves debate. But yeah,

36:53.012 --> 36:56.672
The one crux we have is
that I am very uncertain of

36:57.092 --> 37:00.051
you have somehow. You
are somehow more certain in

37:00.092 --> 37:03.512
have,
and this is just a prior belief thing.

37:03.632 --> 37:06.212
I don't know if there's, like,
any evidence that you can show me.

37:06.532 --> 37:07.122
[speaker_0] Yeah, I think there are-

37:07.132 --> 37:07.901
[speaker_1] And maybe I don't know

37:07.901 --> 37:09.112
[speaker_0] ... S curves we are tracking.
One is like,

37:10.092 --> 37:13.972
uh, curves for individual paradigms,
and one is like some bigger curve

37:14.012 --> 37:17.012
of, you know,
like humanity's ML research as a whole.

37:17.572 --> 37:21.212
So there,
one is like the curve of pre-training

37:21.312 --> 37:25.292
started saturating.
There's a curve for

37:25.392 --> 37:26.892
scale, when did they start saturating.

37:27.492 --> 37:31.042
There's a curve for, you know,
when did RL scaling start,

37:31.092 --> 37:34.232
might saturate someday. And then there
are multiple of these curves, but then

37:34.252 --> 37:38.032
there's a bigger overall trajectory of how
fast is humanity's ML

37:38.092 --> 37:38.792
capability growing.

37:39.192 --> 37:42.852
[speaker_1] Uh,
that I don't agree with because

37:42.932 --> 37:46.112
it. For example, if you start tracking it,
there have been multiple AI winters and

37:46.152 --> 37:49.932
AI summers. There
were times where people thought

37:50.412 --> 37:54.292
in AI, and if we just do work on GOFAI
or we just do work on,

37:54.332 --> 37:57.082
like, some other paradigm,
we'll be able to get to ASI from here.

37:57.112 --> 37:59.472
[speaker_0] Why aren't we including all of
that as data points?

37:59.992 --> 38:03.752
[speaker_1] So if you keep including them,
then essentially, uh, this could

38:03.812 --> 38:07.752
either be like a, a 1990s rush in

38:07.832 --> 38:10.142
AI of, like saying that, "Hey,
we have super intelligence.

38:10.272 --> 38:14.162
We have chess playing AI, and we are,
like, a few research ideas away from

38:14.252 --> 38:18.242
having, like,
super intelligence AI because we have Deep

38:18.432 --> 38:21.312
we'll be like, "No, that,
that S curve actually flatlined

38:21.352 --> 38:24.742
super intelligence."
And then you have a larger S curve of,

38:25.072 --> 38:28.232
um, and, uh, transformers and R- RL.

38:28.692 --> 38:32.312
And that S curve,
depending on where we are on the S curve,

38:32.372 --> 38:36.172
still saturate before we get to, uh,
super intelligence.

38:36.912 --> 38:40.602
Uh, and, uh, either that could happen,
that is one world that could happen,

38:40.772 --> 38:44.732
or another S curve gets stacked on top of
it, and it keeps going till

38:44.772 --> 38:48.732
we reach super intelligence. Uh, so, uh,
or maybe, like, how close we

38:48.752 --> 38:52.152
are to super intelligence,
essentially whatever that bar is,

38:52.192 --> 38:55.662
there or we need to stack more S curves to
basically get us there

38:55.692 --> 38:59.352
faster. Uh, I do agree
that on infinite human

38:59.372 --> 39:01.432
timescale,
we'll get to super intelligence at some

39:01.932 --> 39:04.772
Uh, but yeah, 2030 is the timelines
that we're dealing with.

39:04.792 --> 39:06.832
[speaker_0] Again,
I didn't understand your argument.

39:06.972 --> 39:10.952
If I track, you know, human-
humanity's AI research progress since,

39:11.032 --> 39:14.812
1970, yes,
there have been multiple spans of few

39:14.852 --> 39:17.912
years where we did get, you know,
some one breakthrough, something happened,

39:17.952 --> 39:19.952
then we had, like, you know,
20 years of nothing happening.

39:20.032 --> 39:21.542
Yes, there have been multiple of these.

39:22.312 --> 39:24.912
Uh,
right now we could be in either one of

39:24.952 --> 39:28.852
We might be about to reach an AI winter
or we might be

39:28.912 --> 39:30.792
about... Yeah,
we might get a few more breakthroughs.

39:30.812 --> 39:33.972
Those again might not, might
or might not get to super intelligence.

39:34.252 --> 39:38.052
Sure, uh, where are we in this?
So far I'm agreeing it's now a question

39:38.132 --> 39:40.252
how do you put the numbers on these things
and...

39:40.961 --> 39:43.712
[speaker_1] The, the quest-
the disagreement comes from the fact that,

39:44.192 --> 39:48.042
It's not a disagreement in the argument,
it's a disagreement on how many

39:48.192 --> 39:51.792
S, if a new S curve is needed
or will these S curves scale to super

39:51.832 --> 39:54.452
intelligence and how easy
are these S curves to come by.

39:54.892 --> 39:58.352
[speaker_0] So what would be a data point
that would actually change

39:58.752 --> 40:02.592
your mind? Like, for me, it's, like,
fairly, yeah, almost obvious

40:02.692 --> 40:06.482
that, okay, yeah, there is a b-
huge backlog of, backlog of research

40:06.572 --> 40:09.372
ideas that need to be,
that will definitely try-

40:09.432 --> 40:09.852
[speaker_1] That I think-

40:10.332 --> 40:11.152
[speaker_0] And all of them require-

40:11.222 --> 40:11.222
[speaker_1] Yeah, but-

40:11.222 --> 40:13.392
[speaker_0] ...
a lot of people to try and, yeah.

40:13.992 --> 40:17.472
[speaker_1] Sure.
I think sure there might be some research

40:17.532 --> 40:21.472
overhang, but, uh, the probability of us

40:21.512 --> 40:24.692
finding a breakthrough in the research
ideas might be below, uh...

40:24.772 --> 40:27.632
I think the ML research community is very,
very smart.

40:28.092 --> 40:32.052
Uh,
they figure out all the best candidates

40:32.102 --> 40:35.392
a daily basis, et cetera.
And in the last five, six years-

40:35.512 --> 40:38.232
[speaker_0] Yeah, I think that's where,
yeah, like,

40:38.292 --> 40:41.702
smart. It comes down to try, hit
and trial random shit until it works.

40:42.042 --> 40:45.842
[laughs] Like, I don't think people had,
like, some, uh, brilliant insight,

40:45.862 --> 40:48.661
"Okay, this is why this thing
is definitely going to work,"

40:48.672 --> 40:50.982
tried it a ton of time.
I think people just, like-

40:51.031 --> 40:51.362
[speaker_1] No, they did

40:51.452 --> 40:53.672
[speaker_0] ...
tried the random 20 random things to try

40:53.982 --> 40:56.672
[speaker_1] If you look to-- No,
but I don't think it

40:56.732 --> 41:00.512
I think the people who try it have some
intuition of, uh, why

41:00.572 --> 41:03.852
this could work or why this wouldn't work,
and they might be wrong or right and,

41:03.892 --> 41:05.731
and the outcomes might look random,
et cetera.

41:06.032 --> 41:09.732
But the selection of
which experiments to try definitely has

41:09.752 --> 41:13.352
A good AI researcher is one
that tries more successful experiments

41:13.932 --> 41:17.732
And a bad researcher
is they keep making bad bets on research

41:17.772 --> 41:18.692
keep failing at it.

41:19.492 --> 41:23.372
And, uh, my, the, my, my,
the data point that

41:23.452 --> 41:25.902
I want to look at is that despite so many,

41:26.812 --> 41:30.692
so much money flowing into, like,
finding research ideas, I'm sure

41:30.752 --> 41:32.612
we'll be able to scale this further.

41:33.012 --> 41:36.992
But is there an RL level
or a pre-training level idea, uh,

41:37.092 --> 41:41.012
already out there
that has not been tried yet because of

41:41.052 --> 41:44.952
compute? Because people,
like researchers are literally drawing

41:45.032 --> 41:48.912
whatever, um, a hat
and implementing the idea as opposed to

41:48.952 --> 41:51.672
reading the idea,
understanding the viability of it working

41:52.212 --> 41:54.632
Uh, I think, like,
like I believe in the second one.

41:54.992 --> 41:58.692
Uh, and if that is true, then,
then where is the idea is my question.

41:59.172 --> 42:00.732
Uh, and just because you got two ideas-

42:00.762 --> 42:03.832
[speaker_0] Yeah, that question
is definitely a crux like... Like yeah,

42:03.892 --> 42:06.932
On one side you have like researchers
understand nothing about the problem and

42:06.952 --> 42:10.572
they're just brute forcing.
On the other end of the spectrum you have,

42:10.632 --> 42:14.432
researcher deeply understands the thing
and they have a hypothesis

42:14.512 --> 42:17.372
actually running the training run,
they already know with confidence this is

42:17.412 --> 42:21.212
definitely going to work. And you
are saying, okay, researchers

42:21.272 --> 42:24.162
end of understanding things.
I'm saying they're far closer to the end

42:24.212 --> 42:25.152
randomly brute forcing.

42:25.612 --> 42:28.752
[speaker_1] I don't know. If you look,
like, if you look at, like Noam,

42:28.832 --> 42:32.702
heard Noam, what's his name? Yeah,
Noam Shazeer talk about,

42:32.792 --> 42:36.402
uh,
when they were getting into transformers,

42:36.402 --> 42:39.692
transformers out, when they
were essentially scaling language models,

42:39.752 --> 42:41.252
wrote the attention paper, et cetera.

42:41.732 --> 42:43.852
Uh, it wasn't like a random idea
that they had come.

42:44.172 --> 42:48.132
Uh,
Noam Shazeer has like this history of

42:48.252 --> 42:51.992
ideas to try out. He's called like a,
a magical researcher

42:52.012 --> 42:55.532
because he can seemingly look at like 100
ideas and figure out,

42:55.572 --> 42:59.072
like these are the one, two.
He has like crazy intuition of these

42:59.132 --> 43:02.092
ideas that could work because he
understands these things much more deeply

43:02.172 --> 43:02.872
average researcher.

43:03.062 --> 43:03.212
[speaker_0] Okay. Oh.

43:03.292 --> 43:05.932
[speaker_1] And there are the superstar
researchers that can look at ideas

43:05.972 --> 43:09.944
like breakthrough ideas much more
quickly. And I don't think it's as

43:10.004 --> 43:12.684
random as that, "Hey,
let me just pick one and do it, and

43:12.744 --> 43:14.204
Otherwise,
I'll go pick another one tomorrow."

43:14.664 --> 43:18.564
[speaker_0] For me, I see it more as,
yeah, like, yes, Noam Shazeer probably did

43:18.604 --> 43:21.984
have some intuitions, but also it
was random that he

43:22.044 --> 43:24.244
How would I put it?
There have been three

43:24.304 --> 43:26.584
Noam Shazeer directly contributed to one
of them.

43:27.084 --> 43:30.644
If you take any of the other research
breakthroughs which Noam Shazeer did not

43:30.704 --> 43:34.584
make,
and you put him in one year before

43:34.624 --> 43:37.824
happened and told him, "Look, here
are all these hypotheses the different

43:37.844 --> 43:41.444
researchers are making.
Which one do you think will work?" I don't

43:41.484 --> 43:44.144
have made that good a guess and told you,
"Oh, this one will work."

43:44.724 --> 43:48.424
[speaker_1] And I saw a Noam Shazeer, uh,
and Jeff Dean

43:48.744 --> 43:49.024
talk

43:49.824 --> 43:53.544
about this exact thing,
and from the story that they

43:54.064 --> 43:56.864
their meeting perspective and understand,
et cetera.

43:57.224 --> 43:58.524
[speaker_0] Yeah, if what you
are saying is correct-

43:58.614 --> 43:58.614
[speaker_1] [laughs]

43:58.614 --> 44:02.564
[speaker_0] ... there should be, like,
the same researcher who's consistently

44:02.604 --> 44:05.414
where the field is heading multiple times
and should be multiple times-

44:05.744 --> 44:05.954
[speaker_1] That's true

44:05.964 --> 44:09.644
[speaker_0] ... not 100% co- correct, but,
like, roughly able to see, okay,

44:09.684 --> 44:12.464
are probably going to work
and then actually roughly ends up correct.

44:13.064 --> 44:13.614
Whereas I'm saying no actually-

44:13.624 --> 44:14.844
[speaker_1] That's been true

44:14.944 --> 44:15.904
[speaker_0] ... I'm saying no actually-

44:15.964 --> 44:18.744
[speaker_1] Because the amount of,
the amount of breakthroughs have come

44:18.944 --> 44:21.854
No, the amount of breakthrough
that have come from these superstar

44:21.884 --> 44:25.334
like, very, very high. Why
is Ilya Sutskever around all big

44:25.384 --> 44:27.884
'Cause he has a crazy sense of
which research ideas can work.

44:28.284 --> 44:30.304
Why is Noam Shazeer around all these big
breakthroughs?

44:30.334 --> 44:32.364
'Cause he has a crazy idea of all these
things to do.

44:32.404 --> 44:36.184
There's a random reason why a random ML
PhD you've never heard of comes up with a

44:36.304 --> 44:38.964
crazy idea.
It's mostly because you literally have to

44:39.784 --> 44:43.663
uh, what's that guy's name,
who's the GPT-2 main, Alec

44:43.724 --> 44:47.524
Radford type level researcher. Apparently,
Alec Radford has such a great

44:47.704 --> 44:51.464
sense of what could work. He's like,
he literally used to

44:51.544 --> 44:55.434
do small experiments on Jupyter Notebooks,
and

44:55.524 --> 44:59.184
he then once he got convinced
that this could work,

44:59.244 --> 45:02.184
engineering to Greg Brockman
or somebody who's like, "Yeah,

45:02.284 --> 45:05.964
Just keep scaling it up. No,
I'm sure it will work." Uh,

45:06.104 --> 45:08.914
built so much intuition about, like,
and what couldn't work

45:09.024 --> 45:12.804
this crazy ML whisperer guy who can just,
like, look at the

45:12.884 --> 45:16.644
shape of the model and figure out, like,
these are ideas worth pursuing, not worth

45:16.664 --> 45:19.813
pursuing. And if you look at, like,
Thinking Machines Lab, which is, like,

45:19.884 --> 45:23.684
lab filled, filled with all these guys,
John Schulman, uh, what's his name, Alec

45:23.724 --> 45:27.584
Radford, all the OG co-founder guys,
they have essentially had, like,

45:27.764 --> 45:31.384
free rein to do whatever.
And the best they came up with

45:31.764 --> 45:35.664
that updated me towards that, oh, okay,
there's, like, a lot of things to do here,

45:35.704 --> 45:39.634
but there is no crazy research
breakthrough paradigm that, that

45:39.764 --> 45:43.444
oh, we got, like,
a pre-training level paradigm

45:43.524 --> 45:46.364
stack on pre-training and get, like,
insane results.

45:46.404 --> 45:50.304
[speaker_0] Yeah,
I think I've identified a data point

45:50.384 --> 45:54.284
on this. If you, uh, again, from these,
you know, three or four superstar

45:54.304 --> 45:58.104
researchers,
if you're able to document a public track

45:58.264 --> 46:02.164
of, well, uh, yeah, since maybe 2018

46:02.244 --> 46:04.844
till 2026, some at least 10 years or no.

46:05.864 --> 46:09.084
Well, yeah, at least, yeah,
like more than five, at least seven,

46:09.664 --> 46:13.624
consistent track record. Like, okay,
here they made these predictions in 2018.

46:13.684 --> 46:15.644
They made these predictions in 2020.
They made these-

46:15.664 --> 46:16.604
[speaker_1] Dario is one guy

46:16.684 --> 46:18.813
[speaker_0] ... in 2022,
and they made these in 2024.

46:19.004 --> 46:22.564
And, like,
they didn't single-handedly do all the

46:22.604 --> 46:25.224
roughly able to see where the next
breakthroughs are going to come from.

46:25.544 --> 46:29.124
If you can show me, okay,
just the same guy

46:29.144 --> 46:32.264
trend,
then that would actually shift my

46:32.944 --> 46:36.304
[speaker_1] I don't know.
I think I 100% believe what you're saying

46:36.324 --> 46:40.164
true. Uh, I don't know how... Like, I'm,
I'm trying to think of what are ways to

46:40.204 --> 46:44.084
show you this is happening. Uh,
one way to do this would be that if you

46:44.124 --> 46:48.104
look at the large breakthroughs,
and you look at who's responsible

46:48.124 --> 46:51.124
or who's close to those breakthroughs,
it will seem like it's the same people.

46:51.684 --> 46:55.304
And that should update you towards that,
hey, how come, uh, Ilya

46:55.313 --> 46:57.424
Sutskever is involved with all the big
breakthroughs?

46:57.484 --> 47:01.024
How come it came from the same guy who did
AlexNet, is the same guy who did

47:01.424 --> 47:05.144
GPT-2, is the same guy who did RL? Why,
why is it the same guy who's doing all

47:05.164 --> 47:08.424
these other things? Why
is Noam Shazeer building all of these

47:08.504 --> 47:10.474
Uh,
why is Alec Radford building all these

47:10.844 --> 47:13.834
It's because they have figured out or,
or they have impeccable...

47:14.124 --> 47:18.064
So they,
they talk about impeccable research taste,

47:18.074 --> 47:21.714
taste is what is really hard.
And research taste is this intuition that

47:21.724 --> 47:25.624
researchers have
that can figure out from a pile of, like,

47:25.664 --> 47:28.374
one to worth trying from the compute we
have to get the breakthrough.

47:28.784 --> 47:32.624
[speaker_0] Yeah,
so literally what you said,

47:32.684 --> 47:36.664
here is Alec Radford's track record of
research hypotheses going back

47:36.684 --> 47:37.404
entire eight years-

47:37.464 --> 47:41.064
[speaker_1] But why won't you just buy
Ilya's track record?

47:41.204 --> 47:44.324
[speaker_0] Uh,
I literally don't know enough about this

47:44.414 --> 47:47.204
What did Ilya say in 2018?
What did he say in '19?

47:47.224 --> 47:48.134
[speaker_1] No, he doesn't say anything.

47:48.134 --> 47:48.474
[speaker_0] I'm literally not saying-

47:48.524 --> 47:52.084
[speaker_1] But basically the fact
that he doesn't,

47:52.164 --> 47:56.104
Basically,
if he doesn't have to publicly make any of

47:56.164 --> 47:59.924
there's a reason why all big breakthroughs
are around one person, it

48:00.024 --> 48:03.184
stands to reason
that this person picks better research

48:04.084 --> 48:06.294
person who's picking research ideas at a
random, at random.

48:06.864 --> 48:07.294
[speaker_0] No, no, but-

48:07.484 --> 48:07.614
[speaker_1] Like-

48:07.844 --> 48:11.764
[speaker_0] Was he personally the one who
did the breakthrough,

48:11.804 --> 48:14.024
to be at the lab where somebody else did
the breakthrough?

48:14.424 --> 48:16.384
[speaker_1] No, no. He personally
was overseeing research.

48:16.424 --> 48:19.224
He personally was green-lighting the
experiments that he thinks would work.

48:19.584 --> 48:22.984
[speaker_0] Okay. Uh, okay,
then we can take literally Ilya Sutskever

48:23.044 --> 48:26.944
example. Uh,
which breakthroughs would you say, okay,

48:26.964 --> 48:30.694
is significantly responsible for versus
which ones you think he just happened

48:30.744 --> 48:31.264
to be there?

48:31.274 --> 48:31.274
[speaker_1] Deep learning.

48:31.304 --> 48:31.484
[speaker_0] Sorry?

48:31.544 --> 48:34.664
[speaker_1] Deep learning, he was--
Deep learning, he

48:36.004 --> 48:38.124
[speaker_0] No,
when you say deep learning,

48:39.024 --> 48:42.084
Okay, I'll explain it. Cool. Uh, yeah,
I agree with you. Fine.

48:42.184 --> 48:45.824
Ilya was significantly responsible,
a-along with other people, significantly

48:45.884 --> 48:47.104
responsible for AlexNet, sure.

48:47.164 --> 48:50.534
[speaker_1] GPT, the GPT ideas he
was significantly responsible for.

48:51.944 --> 48:54.804
Just training transformers. GPT-1 also.

48:55.144 --> 48:58.724
The idea that we can essentially scale
transformers or we can, we can find a

48:58.784 --> 49:01.944
scalable transformer paradigm
and build language models from it.

49:02.704 --> 49:06.234
[speaker_0] Uh, okay. So which model
was this? Was this GPT-1?

49:06.504 --> 49:08.504
[speaker_1] This was the generative
pre-trained transformer paper.

49:08.844 --> 49:11.024
He is directly responsible there.

49:11.084 --> 49:13.894
I think if we look at the GPT-1 paper-

49:13.894 --> 49:17.064
[speaker_0] One second please.
I know this, I know it's annoying to,

49:17.124 --> 49:20.544
things middle of video, but, like,
I actually want to now go read

49:20.554 --> 49:20.664
is.

49:21.044 --> 49:23.504
[speaker_1] Sure. It's this paper.
I will send it to you.

49:23.936 --> 49:27.046
[speaker_0] Oh, okay. It's in the chat.
See, uh, when was this published?

49:27.416 --> 49:28.736
[speaker_1] Five years ago.
Twenty twenty...

49:29.965 --> 49:31.956
No, earlier than that. June 2018.

49:32.436 --> 49:33.636
[speaker_0] Is it June? You sure?

49:34.176 --> 49:37.266
[speaker_1] This is the AI summary. Yeah.
Yeah, June 2018.

49:37.616 --> 49:41.215
[speaker_0] Okay, fine. Take care.
I will buy that. Okay, fine.

49:41.276 --> 49:44.596
Ilya has been there at two major
breakthroughs. Fine.

49:44.946 --> 49:48.896
[speaker_1] Then he's also been there at,
uh, this thing, uh, for

49:48.976 --> 49:52.696
the o1 breakthrough as well. Uh,
Ilya didn't

49:52.736 --> 49:56.576
green-light the experiment. Ilya
was heading research that time and

49:56.596 --> 49:59.536
green-lighted the o1 experiment. RL,
basically RL scaling.

49:59.896 --> 50:02.446
And if you look at Dario-- Sorry. Sorry.

50:02.456 --> 50:03.216
[speaker_0] Dario, for this-

50:03.776 --> 50:07.036
[speaker_1] Uh, he was at OpenAI at
that time. He was head of research.

50:07.416 --> 50:11.066
He was personally seeing all the AI
research that was happening at OpenAI, and

50:11.176 --> 50:11.956
OpenAI came out-

50:12.066 --> 50:15.716
[speaker_0] No,
but if he's head of research,

50:15.756 --> 50:19.616
works in his lab, even
if he does not want to, you know, do that,

50:19.656 --> 50:21.676
hypothesis or, you know, prioritize it.

50:21.686 --> 50:22.946
[speaker_1] You don't have to suggest the
hypothesis.

50:23.016 --> 50:26.356
I'm saying researchers
are mostly the same

50:26.396 --> 50:28.716
experiments, the best researchers.

50:29.476 --> 50:31.686
The best researchers know which,
which can control-

50:31.696 --> 50:35.076
[speaker_0] But like OpenAI tried 100
things. Whichever of,

50:35.116 --> 50:37.836
worked,
he could take credit for it simply because

50:38.276 --> 50:42.146
[speaker_1] Sure, but if, if there
are 10,000 things that OpenAI didn't try,

50:42.236 --> 50:45.916
if there are three big paradigm,
three big breakthroughs

50:46.256 --> 50:50.036
AI,
and the same guy has been around for all

50:50.196 --> 50:53.946
the set of the things he tried. It's the,
the, the set is all the things that are

50:53.946 --> 50:57.796
out there that he didn't try,
and out of which he was able to figure out

50:57.896 --> 50:59.316
or be around all three things.

50:59.676 --> 51:03.546
[speaker_0] No,
there are ways to be around the guy who

51:03.596 --> 51:04.346
gives the correct hypothesis.

51:04.406 --> 51:08.136
[speaker_1] But he's not like Sam Altman,
who was probably around it, but he

51:08.176 --> 51:09.455
committed to the research direction.

51:09.956 --> 51:10.236
Anyway-

51:10.376 --> 51:10.455
[speaker_0] Okay

51:10.476 --> 51:13.576
[speaker_1] ... this is by the way,
like even if you don't strongly believe

51:14.076 --> 51:15.676
[speaker_0] No, no, this
is very important. Like this part

51:15.716 --> 51:19.536
Like, uh,
was he the one who personally selected the

51:19.566 --> 51:21.146
"Okay, this one is worth trying,
we should try it"?

51:21.196 --> 51:21.916
[speaker_1] Yes.

51:21.996 --> 51:22.216
[speaker_0] Or-

51:22.296 --> 51:25.816
[speaker_1] Yes, I, I did the...
I'll tell,

51:25.856 --> 51:29.516
But there is an interview of Dario Amodei

51:29.956 --> 51:33.636
when he was working with Ilya Sutskever,
who I think it's Dario

51:33.676 --> 51:37.076
Amodei, who basically, uh, I don't know
if it's that.

51:37.116 --> 51:40.006
Anyway, it was, I think,
one of these interviews

51:40.036 --> 51:43.616
about Ilya Sutskever,
and he's saying then he came,

51:43.716 --> 51:47.526
research direction, saying
that we need to do X,

51:47.596 --> 51:51.216
we'll do this, and we'll do that.
And Ilya Sutskever drew two circles,

51:51.856 --> 51:55.496
two concentric circles. Inside he--
In one he-- And this was pre o1.

51:55.856 --> 51:59.776
He wrote pre-training,
and outside he wrote RL, and he said,

52:00.056 --> 52:02.566
[speaker_0] Okay. Uh,
if you can send me this, that will help.

52:02.626 --> 52:05.516
[speaker_1] I'll find, to find that clip.
I'll try to find that clip.

52:05.536 --> 52:08.706
And this was like much before o1,
when I think Dario or whoever

52:08.816 --> 52:12.696
and, and the guy was like, "Okay, uh, it

52:12.736 --> 52:13.696
makes sense." Like,

52:14.616 --> 52:16.816
why am I complicating the research agenda
that long?

52:17.156 --> 52:19.956
[speaker_0] Okay, sure.
If you send me this, that will again,

52:19.996 --> 52:23.246
Like now you have given me three different
data points, and Ilya was involved

52:23.296 --> 52:26.696
directly, like not just like, okay,
researcher overseeing, but he

52:26.716 --> 52:29.186
picking the hypothesis
and saying this will work. Yeah.

52:29.216 --> 52:32.196
[speaker_1] Yeah. Yeah.
I think I can to find... One second.

52:32.616 --> 52:36.016
Let me just do a random Claude search to
see if they can find the things.

52:36.436 --> 52:39.986
Last time I remembered, maybe we'll find,
but I, I've seen it and,

52:40.576 --> 52:44.336
uh, provided this is true, uh,
would you agree that research is

52:44.396 --> 52:46.436
not as random as picking ideas from a hat?

52:46.736 --> 52:50.016
[speaker_0] Uh, yeah. If you show me that,
yeah, now three different

52:50.056 --> 52:53.766
breakthroughs, uh,
Ilya Sutskever personally was helping

52:53.876 --> 52:57.696
pick the hypothesis rather than just
happen to be in the same room or

52:57.736 --> 53:01.416
overseeing the same lab.
If you show this across three

53:01.556 --> 53:05.545
that would tell me that there
is something spec- some specific way

53:05.616 --> 53:08.736
Ilya Sutskever personally looks at this
problem, which almost nobody else in the

53:08.776 --> 53:10.156
world has. Yeah.

53:10.196 --> 53:13.646
[speaker_1] It's time to find out
which interview. I think it

53:14.116 --> 53:18.036
Anyway, cool. Uh, I think
there are a couple of other things

53:18.076 --> 53:21.976
that I think I disagreed with. Uh,
so one was that research direction

53:22.116 --> 53:25.596
there might not be as many low-hanging
fruits as you think there are.

53:25.716 --> 53:29.316
Uh, so 2030 might be in this thing.
That was one crux he identified.

53:29.376 --> 53:33.336
The other one was that, uh,
intelligence is easier to build than

53:33.476 --> 53:37.336
I thought. This is something
that I think I've changed my mind on since

53:37.396 --> 53:41.206
Ooty, uh, like since he last spoke, uh,
about this.

53:41.266 --> 53:45.076
It's basically the, if you've,
if you saw the Richard Sutton

53:45.516 --> 53:46.656
Dwarkesh interview,

53:47.516 --> 53:48.146
uh, TCS-

53:48.296 --> 53:49.236
[speaker_0] Yes.

53:49.576 --> 53:51.476
[speaker_1] Do you remember what they
spoke about?

53:51.556 --> 53:55.256
[speaker_0] I think Richard Sutton's
timelines were also something 25% by 2030,

53:55.316 --> 53:56.296
remember correctly.

53:56.476 --> 53:58.896
[speaker_1] No.
He thinks it's more LL by LL-

53:59.876 --> 54:03.436
[speaker_0] I am pretty confident Sutton
had something like next five to 10 years

54:03.456 --> 54:05.266
chance ASI. But yeah-

54:05.516 --> 54:05.956
[speaker_1] But he thinks-

54:05.996 --> 54:08.426
[speaker_0] I can't quite remember.
Actually, you know why-

54:08.536 --> 54:10.886
[speaker_1] But, but I also think
that this whole idea that-

54:11.536 --> 54:14.256
[speaker_0] Ilya's timelines, what
are his...

54:14.676 --> 54:15.366
Give me a minute.

54:15.936 --> 54:19.656
[speaker_1] Yeah,
and I also don't think this whole idea

54:19.716 --> 54:23.136
agrees with you on timelines,
it doesn't matter. Any-

54:23.236 --> 54:27.066
Like it does matter what shape of beliefs
he has and why he agrees to those

54:27.136 --> 54:29.596
timelines and what shape of beliefs you
have and why you agree to the timelines.

54:29.676 --> 54:31.476
Uh,
there might be a fundamental disagreement

54:31.536 --> 54:34.516
You could update from some of his beliefs,
not all of his beliefs, even though his

54:34.536 --> 54:35.436
conclusions are the same.

54:35.996 --> 54:39.796
[speaker_0] No, uh,
like I initially started from a worldview

54:39.876 --> 54:40.096
of

54:41.116 --> 54:44.936
like pick even among the genius
researchers, picking which research

54:44.976 --> 54:48.916
hypothesis works is kind of random,
and it requires just a lot of hit

54:48.936 --> 54:51.166
and trial,
and none of these people really know.

54:51.216 --> 54:55.146
You are trying to update me more towards a
worldview of, no, there are a few

54:55.156 --> 54:59.036
genius researchers here who consistently
seem to get all of the

54:59.076 --> 55:03.056
predictions right. And I'm like,
let's say I did update to your

55:03.076 --> 55:06.616
worldview that all the more means I want
to know, okay, what are these people's

55:06.656 --> 55:09.536
timelines? If now you're saying, okay,
I should defer to these people now.

55:09.576 --> 55:10.346
[speaker_1] No, I'm sure. Go, go find out-

55:10.346 --> 55:13.386
[speaker_0] I literally need to know what
does Ilya Sutskever

55:13.556 --> 55:14.156
Yeah, like-

55:14.316 --> 55:16.396
[speaker_1] Go find their timelines.
That's not the point I was making.

55:16.456 --> 55:19.316
I was trying to make a point that you
were saying that just because Ilya has

55:19.356 --> 55:23.096
timelines and you have short timelines,
it doesn't matter, uh,

55:23.176 --> 55:27.116
what Ilya's arguments on research
direction is or Ilya's time, like

55:27.156 --> 55:29.976
Ilya's stance is on why
and how we get this breakthrough.

55:30.176 --> 55:31.116
[speaker_0] Those also matter. I agree.

55:31.416 --> 55:33.826
[speaker_1] Okay. So it-- So cool.
Find Sutton's timelines.

55:33.916 --> 55:37.656
But the point Sutton made was that, uh,
evolution actually gave us

55:37.716 --> 55:41.436
language very, very late. Uh,
and most of evolution was

55:41.476 --> 55:45.306
trying to optimize for things
that we take for granted, which

55:45.916 --> 55:49.396
uh, whatever,
like being physical dexterity, et cetera.

55:49.416 --> 55:53.216
And, uh, those things essentially,
which according to you are part of

55:53.256 --> 55:57.056
intelligence,
those things actually took a, like,

55:57.136 --> 56:00.356
time. Uh, and, uh,
those would be very hard to do.

56:00.796 --> 56:04.676
Uh, the second thing that he says that,
uh, he thinks that if you can get

56:05.416 --> 56:08.596
good at doing those parts,
then everything else falls into place.

56:09.156 --> 56:13.076
Uh, but, uh, the argument is
if you can get to language, uh,

56:13.176 --> 56:15.076
then getting to the physical stuff
is easy.

56:15.236 --> 56:19.096
And he's like, "No, most of it
is getting to the physical stuff." And

56:19.156 --> 56:22.516
then this language part
and all of these part is something

56:22.696 --> 56:26.356
[speaker_0] Sorry,
I want to interrupt you because I feel

56:26.436 --> 56:29.976
points we can both go check, and
if we check,

56:30.026 --> 56:33.696
our debate more productive.
Data point number one of, you know, Ilya

56:33.736 --> 56:36.726
Sutskever being personally present at all
these three breakthroughs, I think you've

56:36.996 --> 56:40.856
proved it for AlexNet, I agree.
For the GPT-1 thing,

56:40.896 --> 56:43.276
also kind of agree. More data would help,
but I kind of agree.

56:43.336 --> 56:46.116
The o1 thing,
I'm not yet convinced Ilya

56:46.156 --> 56:49.416
If you show me data, I can be convinced.
Uh, that's one data point.

56:49.536 --> 56:49.686
[speaker_1] Yeah. I think I found it.

56:49.686 --> 56:53.426
[speaker_0] The other data point I want
is, uh, yeah, literally what

56:53.476 --> 56:57.396
What are Sutskever's timelines? Uh,
and I'm saying, like, let's

56:57.416 --> 57:00.056
first get these data points
and then let's continue the discussion.

57:00.256 --> 57:00.646
I think that-

57:00.716 --> 57:02.375
[speaker_1] Found it. I think I found it.

57:02.396 --> 57:06.156
[speaker_2] The people who
are most responsible for that

57:06.196 --> 57:10.116
and Jakub Pachocki. I think even like, uh,
like,

57:10.336 --> 57:11.746
uh, Dota was kind of-

57:11.746 --> 57:15.056
[speaker_1] We can go like a few seconds
before because he's talking about o1,

57:15.416 --> 57:16.756
verify that he's talking about o1.

57:17.116 --> 57:17.396
[speaker_0] And

57:18.216 --> 57:22.036
okay, uh,
I might agree with you by the way,

57:22.116 --> 57:25.756
note, I'm like,
why has this not been documented

57:25.896 --> 57:29.296
there for all the three breakthroughs?
That seems like a very big deal.

57:29.896 --> 57:30.976
[speaker_1] What do you mean it's not
documented?

57:31.656 --> 57:35.516
[speaker_0] Why isn't there like either a
Hacker News post or a LessWrong post

57:35.576 --> 57:37.936
saying here is the evidence Ilya
was there at all three breakthroughs?

57:38.276 --> 57:41.976
[speaker_1] But that seems,
I think it's common knowledge.

57:42.256 --> 57:45.596
I'm surprised you didn't,
and I'm surprised you didn't

57:45.996 --> 57:48.876
see this,
but basically it's been talked about in a

57:49.356 --> 57:52.965
Everyone knows that Alec Radford,
Ilya Sutskever, Noam Shazeer are these

57:53.036 --> 57:56.676
like insane superstar researchers who
whatever they touch, whatever

57:56.756 --> 57:59.206
ideas they pick turn out to be the right
candidates always.

57:59.616 --> 58:00.805
There's another one where Jeff Dean
and Noam Shazeer-

58:00.805 --> 58:03.836
[speaker_0] Okay.
It's definitely not common knowledge,

58:03.856 --> 58:05.856
to actually write this up
and update a bunch of people.

58:05.916 --> 58:09.766
Like,
I think I'm not the only one for whom this

58:09.766 --> 58:11.076
from you this. Yeah.

58:11.786 --> 58:15.626
[speaker_1] Yeah. But, uh, okay, cool. Uh,
I think I've-- The clip

58:15.636 --> 58:19.046
something Claude found out. Uh, this
is not the clip that I was talking about

58:19.056 --> 58:22.926
originally, but I think this
is even tighter evidence than what I

58:23.896 --> 58:27.196
because he's directly saying
that the reasoning breakthrough came from

58:27.216 --> 58:28.886
Pachocki and Ilya Sutskever.

58:29.176 --> 58:33.116
[speaker_0] Okay. Yeah. I mean,
I will need a few minutes to properly go

58:33.176 --> 58:34.056
okay, fine.

58:34.196 --> 58:34.676
[speaker_1] No stress, no stress.

58:34.716 --> 58:38.496
[speaker_0] Like for now I can maybe buy
it. Okay. Let's, uh, buy this.

58:38.556 --> 58:42.076
Okay, cool. Yeah.
Also then I want to know, yeah, what

58:42.136 --> 58:45.446
timelines then if you know that. Uh,
any of the people you mentioned-

58:45.446 --> 58:45.886
[speaker_1] We can just ask chat

58:45.896 --> 58:48.666
[speaker_0] ... Alec Radford
or Noam Shazeer or Ilya Sutskever. Yeah.

58:48.685 --> 58:52.546
[speaker_1] Yeah.
This is one post I found on EA forum, uh,

58:52.596 --> 58:56.166
run by Lessig, uh,
where they're talking about Sutskever's--

58:56.216 --> 58:59.366
criticizing Sutskever for not having
transparency on his timelines and

59:00.076 --> 59:00.976
not saying why.

59:02.596 --> 59:06.456
[speaker_0] I think you have successfully
updated me a bit, well,

59:07.056 --> 59:10.986
towards the idea
that Ilya Sutskever personally, like not

59:11.016 --> 59:14.376
even any of the other AI researchers in
this space, but Ilya Sutskever

59:14.516 --> 59:18.376
specifically has like great research taste
and he's consistently able

59:18.456 --> 59:20.076
to pick good research hypotheses.

59:20.116 --> 59:20.656
[speaker_1] Good.

59:20.756 --> 59:24.556
[speaker_0] Not yet shifted my timelines
by a lot. I think that's where I'm at.

59:26.736 --> 59:26.745
[laughs]

59:26.776 --> 59:26.846
[speaker_1] Yeah. I have given you-

59:26.876 --> 59:30.776
[speaker_0] Because yeah, again,
like it's not obvious to me, okay,

59:30.836 --> 59:34.696
if you told me, "Okay,
update towards Ilya

59:34.736 --> 59:38.206
than you," then sure. But
are Ilya's timelines

59:38.236 --> 59:38.965
[speaker_1] Ilya's timelines are--

59:40.676 --> 59:43.856
His minimum is still longer than your
maximum.

59:44.576 --> 59:47.976
Anyway, maybe not. I don't know.
And I think his definition of

59:47.996 --> 59:48.776
might be very different.

59:49.876 --> 59:53.726
[speaker_0] Fair. So yeah,
now we will have to go more into

59:54.356 --> 59:54.796
Uh, okay.

59:54.856 --> 59:55.876
[speaker_1] We can end the recording
there.

59:56.356 --> 59:57.356
[speaker_0] Okay. Uh.

59:57.396 --> 01:00:00.876
[outro music]