1
00:00:00,100 --> 00:00:04,000
[speaker_0] Pre-training scaling 100% does
not work or here is an argument

2
00:00:04,160 --> 00:00:07,830
RL scaling 100% does not work.
It's just intuitions that

3
00:00:07,940 --> 00:00:11,830
pre-training scale probably doesn't get
there, RL scaling probably doesn't

4
00:00:11,880 --> 00:00:12,160
get there

5
00:00:12,180 --> 00:00:16,020
[speaker_1] ...
essentially just by scaling RL

6
00:00:16,100 --> 00:00:19,920
just by scaling RL and pre-training, uh,
getting to ASI just in the

7
00:00:20,040 --> 00:00:23,830
current paradigm by 2030, uh,
in your definition of intelligence

8
00:00:23,920 --> 00:00:24,920
seems very, very unlikely.

9
00:00:24,960 --> 00:00:28,480
[speaker_0] But yeah, researcher,
you can right now take a pen and paper,

10
00:00:28,540 --> 00:00:32,500
and come up with, okay,
if I had a huge amount of GPUs, here

11
00:00:32,540 --> 00:00:34,710
would try, and you can write them down.

12
00:00:34,730 --> 00:00:34,730
[speaker_1] Yeah.

13
00:00:34,730 --> 00:00:36,880
[speaker_0] And then we will ask, okay,
well, why haven't these...

14
00:00:37,380 --> 00:00:41,160
Like, did somebody at, you know, OpenAI,
Anthropic try them and they failed, or did

15
00:00:41,220 --> 00:00:42,400
nobody try them?

16
00:00:42,540 --> 00:00:44,980
[speaker_1] Uh, maybe you're right
and we're on the vertical part of the S

17
00:00:45,380 --> 00:00:49,130
I think if you just plot it from AlexNet,
it's mostly the same

18
00:00:49,140 --> 00:00:52,740
paradigm. Uh, and I would count AlexNet

19
00:00:53,460 --> 00:00:57,400
as one breakthrough,
Transformer as another breakthrough,

20
00:00:57,460 --> 00:00:57,820
breakthrough.

21
00:00:58,900 --> 00:01:01,990
[speaker_0] Hello, everyone. Uh,
my name is Samuel Shadrach.

22
00:01:02,700 --> 00:01:06,320
I graduated from IIT Delhi.
I've been following the whole

23
00:01:06,440 --> 00:01:09,780
AI timelines debate for s- a while now.

24
00:01:09,920 --> 00:01:13,400
By when will humanity get
superintelligence?

25
00:01:13,500 --> 00:01:16,560
Is this a good thing or a bad thing?
What can we...

26
00:01:16,620 --> 00:01:19,940
And now I have a fairly strong opinion
it's a bad thing. We should stop it.

27
00:01:20,000 --> 00:01:22,720
That's a whole separate discussion
that we can have elsewhere.

28
00:01:23,100 --> 00:01:27,090
Today, I have with me, uh, Raghav. Uh,
we are

29
00:01:27,140 --> 00:01:30,140
going to specifically just discuss, uh,
AI timelines.

30
00:01:30,380 --> 00:01:33,480
Uh, when do we think, you know,
superintelligence will come?

31
00:01:33,680 --> 00:01:37,060
Uh, you know, assuming the research,
current development continues.

32
00:01:37,220 --> 00:01:38,980
Raghav,
would you like to introduce yourself?

33
00:01:39,140 --> 00:01:42,980
[speaker_1] I've been a friend of Sam.
I've been following the AI safety debate

34
00:01:43,040 --> 00:01:46,880
for the last five, six years now, uh,
and very

35
00:01:46,920 --> 00:01:50,680
interested in the subject.
I have some strong opinions as well,

36
00:01:50,820 --> 00:01:54,340
as Samuel, but, uh,
I have slightly differing opinions from

37
00:01:54,400 --> 00:01:56,260
That's why I'm keen to talk about this.

38
00:01:56,800 --> 00:02:00,680
[speaker_0] Yeah. First of all is, yeah,
uh, the whole pre-training

39
00:02:00,800 --> 00:02:03,260
scaling, which right now everyone says
is dead.

40
00:02:03,420 --> 00:02:06,180
Even now I'm not yet convinced
pre-training scaling is dead.

41
00:02:06,540 --> 00:02:10,100
Yeah, I think there
is still a small chance

42
00:02:10,259 --> 00:02:14,160
and you just extrapolate the pre-training,
whatever Chinchilla,

43
00:02:14,240 --> 00:02:18,210
whichever scaling curve
that has been running for the past few

44
00:02:18,240 --> 00:02:20,980
we do that a few more years with new GPUs
coming.

45
00:02:21,000 --> 00:02:24,960
I still think there
is at least a little bit of a chance

46
00:02:25,040 --> 00:02:26,960
ASI, which I'm claiming. Uh-

47
00:02:27,420 --> 00:02:27,520
[speaker_1] Okay

48
00:02:27,980 --> 00:02:31,720
[speaker_0] ...
a lot more realistic to me

49
00:02:31,940 --> 00:02:35,420
uh, some way of scaling up RL. Now,

50
00:02:35,820 --> 00:02:37,300
that may not be, like,

51
00:02:38,200 --> 00:02:41,769
just again, you just do the current,
you know, RL scaling thing

52
00:02:41,840 --> 00:02:44,180
compute. Maybe one breakthrough
is required there.

53
00:02:44,800 --> 00:02:47,600
I have, like, more probability mass on,
okay, we need at least one breakthrough

54
00:02:47,660 --> 00:02:51,100
on how to do the RL thing better
or some other breakthrough.

55
00:02:51,440 --> 00:02:54,780
I have a little bit less probability mass
on, okay, you just blindly scale up RL

56
00:02:54,800 --> 00:02:57,500
and it works. But yeah,
that's the thing is, like, we don't know.

57
00:02:57,820 --> 00:03:01,240
I have not h- seen like, okay, here
is an argument that tells me that

58
00:03:01,580 --> 00:03:05,460
pre-training scaling 100% does not work,
or here is an argument that tells me

59
00:03:05,600 --> 00:03:08,890
RL scaling 100% does not work.
It's just intuitions that

60
00:03:09,380 --> 00:03:13,260
pre-training scale probably doesn't get
there, RL scaling probably doesn't

61
00:03:13,320 --> 00:03:17,040
get there.
We probably need another breakthrough,

62
00:03:17,080 --> 00:03:20,820
really knows. Uh, that's broadly the path,

63
00:03:21,180 --> 00:03:24,980
uh, of, you know,
what technical capabilities will be

64
00:03:25,180 --> 00:03:28,700
Uh, then there is, okay,
the most probably we'll need one more

65
00:03:28,780 --> 00:03:31,920
breakthrough. Why do I think, okay,
even if we need one more breakthrough, we

66
00:03:31,980 --> 00:03:34,640
will...
There's a good chance we get it in the

67
00:03:35,460 --> 00:03:39,320
For that, you have to extrapolate, well,
in the last 10 years, how

68
00:03:39,360 --> 00:03:42,950
many major breakthroughs have happened in
machine learning, deep learning, and,

69
00:03:42,960 --> 00:03:46,080
well, actually three or, like, three
or four major breakthroughs have happened.

70
00:03:46,100 --> 00:03:49,960
So based on that extrapolate, okay,
it doesn't seem surprising to me

71
00:03:50,100 --> 00:03:53,810
if in the next four years another,
you know, smart researcher figures out yet

72
00:03:53,820 --> 00:03:57,620
another breakthrough. So there is
that kind of thing.

73
00:03:58,520 --> 00:04:01,560
Then I have some specific heuristics
and thing.

74
00:04:02,120 --> 00:04:05,760
If I were an AI capability researcher,
if I had a huge number of

75
00:04:05,820 --> 00:04:09,420
GPUs, what are, you know,
my crazy research hypotheses I might

76
00:04:09,500 --> 00:04:13,340
try for how to speed up RL?
And to be clear, I think this is

77
00:04:13,360 --> 00:04:16,880
an extremely dangerous thing to do.
I think this is bad for the world to do.

78
00:04:17,019 --> 00:04:20,690
But, you know, if I wanted to do this,
like, I think, okay, there are some

79
00:04:20,700 --> 00:04:23,960
hypotheses you could try. Uh, what else?

80
00:04:24,040 --> 00:04:27,640
Yeah, this is all, like,
very specific to AI research capabilities,

81
00:04:27,680 --> 00:04:31,660
trajectory. And then I have, like,
more very high level big picture kind of

82
00:04:31,760 --> 00:04:35,660
Like, you know, why
is superintelligence important?

83
00:04:36,020 --> 00:04:39,200
Uh, you know, why the very fact that GPT-2

84
00:04:39,280 --> 00:04:43,060
exists should update your entire models of
how the world works.

85
00:04:43,440 --> 00:04:44,080
The fact that

86
00:04:44,880 --> 00:04:48,560
a bunch of matrix multiplications can do,
you know, like, uh, like

87
00:04:48,680 --> 00:04:52,560
speaking human language rather than animal
language, why this is a big deal?

88
00:04:52,640 --> 00:04:56,460
Like, you know, mat-
why did matrix multiplication beat the

89
00:04:56,500 --> 00:04:59,690
literal billions of years of evolution
that goes between, you know,

90
00:05:00,540 --> 00:05:01,740
uh-

91
00:05:01,760 --> 00:05:05,460
[speaker_1] I'm sorry, I, this,
I don't think I'm,

92
00:05:05,520 --> 00:05:09,300
only. I mean, like,
maybe we've not discussed this earlier.

93
00:05:09,330 --> 00:05:09,330
[speaker_0] Yeah.

94
00:05:09,340 --> 00:05:10,480
[speaker_1] But maybe we can get to it.

95
00:05:10,860 --> 00:05:11,159
[speaker_0] Yeah, sure.

96
00:05:11,380 --> 00:05:14,460
[speaker_1] The other one I know you're
just saying, but this one seems new to me.

97
00:05:14,520 --> 00:05:17,740
Maybe you're framing it differently
or something. Yeah. It's fine.

98
00:05:17,780 --> 00:05:21,040
We, we'll probably get to it in the order,
but I'm just flagging it that I probably

99
00:05:21,080 --> 00:05:23,320
need you to double-click on this instinct.

100
00:05:23,480 --> 00:05:27,430
[speaker_0] Sure. Uh, I, I just mean like,
uh, human language has features that

101
00:05:27,480 --> 00:05:28,940
are not present in animal language.

102
00:05:29,020 --> 00:05:32,720
There are some linguists
that have studied that,

103
00:05:32,800 --> 00:05:36,700
like new evolutionary adaptation compared
to most animals who don't really even

104
00:05:36,780 --> 00:05:40,180
use language.
They kind of just use sounds

105
00:05:40,300 --> 00:05:44,040
And all of this has taken, like,
literal billions of years in evolutionary

106
00:05:44,100 --> 00:05:47,420
history to build,
and on the other side you have a bunch of

107
00:05:47,500 --> 00:05:51,340
bunch of GPUs in 50 whatever years of AI
research history, and

108
00:05:51,380 --> 00:05:53,260
they have been able to crack human
language.

109
00:05:53,320 --> 00:05:57,260
Like, that itself tells me, "All right,
so intelligence is

110
00:05:57,280 --> 00:06:01,060
likely easier to build than I thought." So
that is like- Yeah.

111
00:06:01,120 --> 00:06:04,780
So even just looking at GPT-2 tells me
like, oh, okay, so maybe now the

112
00:06:04,820 --> 00:06:06,359
singularity could happen in my lifetime.

113
00:06:06,400 --> 00:06:09,900
Like, you know, until now I
was thinking this is some extremely far

114
00:06:09,980 --> 00:06:12,820
Now it looks like, okay,
this could actually happen. Uh, what else?

115
00:06:12,980 --> 00:06:16,810
Yeah, I mean, and I have outside level,
outside view kind of stuff like,

116
00:06:16,880 --> 00:06:19,750
okay, which experts
are actually correctly predicting this?

117
00:06:19,780 --> 00:06:21,600
Which experts are badly predicting this?

118
00:06:21,660 --> 00:06:25,480
I think a lot of people have been
consistently badly predicting,

119
00:06:25,560 --> 00:06:29,540
uh, AI trajectory. Yeah,
like the people who

120
00:06:30,020 --> 00:06:32,900
their predictions are coming more correct,
and the people who keep making the

121
00:06:32,919 --> 00:06:35,640
pessimistic predictions,
their predictions keep coming wrong.

122
00:06:36,140 --> 00:06:39,770
So there's that kind of stuff.
I think that could summarize my whole

123
00:06:40,240 --> 00:06:43,540
[speaker_1] Fair enough.
We'll start taking it one after the other.

124
00:06:44,040 --> 00:06:47,990
Uh, cool. So starting with, uh,
you said pre-training, scaling,

125
00:06:48,000 --> 00:06:50,260
Pre-training scaling works,
and pre-training scaling is not dead.

126
00:06:50,660 --> 00:06:54,280
I think when people say pre-training
scaling is dead, they don't mean

127
00:06:54,340 --> 00:06:58,040
more parameters and adding more data
and adding more compute doesn't lead to

128
00:06:58,080 --> 00:07:00,420
better loss functions
that leads to more capabilities.

129
00:07:00,480 --> 00:07:02,180
I don't think anybody denies that.

130
00:07:02,280 --> 00:07:05,820
Uh, people are uncertain that, uh,
people are

131
00:07:05,859 --> 00:07:08,980
say-saying that it could stop,
but there is no evidence to believe

132
00:07:09,060 --> 00:07:11,340
stop,
and I don't think any serious researcher

133
00:07:11,620 --> 00:07:15,500
The argument against pre-training mostly
comes from the fact that, um,

134
00:07:16,000 --> 00:07:19,760
it is economically unfeasible to scale
because the, uh, amount

135
00:07:19,800 --> 00:07:20,100
of

136
00:07:21,100 --> 00:07:23,400
resources required to do the scaling gives
you log.

137
00:07:23,440 --> 00:07:26,180
It does not linearly increase,
it gives you log of the intelligence.

138
00:07:26,740 --> 00:07:30,680
Uh, so like increasing, uh,
your compute by ten x et cetera gives you

139
00:07:30,700 --> 00:07:34,280
the amount of intelligence
or capabilities and double the amount of

140
00:07:34,340 --> 00:07:34,600
goes.

141
00:07:34,610 --> 00:07:37,800
[speaker_0] Exactly.
Double the amount of loss,

142
00:07:37,940 --> 00:07:40,140
loss number leads to this, uh, capability.

143
00:07:40,500 --> 00:07:44,200
[speaker_1] Sure. And so
which also means that, uh, so th-then the

144
00:07:44,240 --> 00:07:48,140
debate essentially shifts to not
that pre-training is dead,

145
00:07:48,180 --> 00:07:51,900
many ten x's can we do
and how many doublings of intelligence or,

146
00:07:52,000 --> 00:07:54,020
slightly higher intelligence will
essentially do it.

147
00:07:54,440 --> 00:07:56,820
Uh,
I think you'd mentioned this in the OOTD

148
00:07:56,980 --> 00:08:00,800
GPT-4.5 was a big update for me,
saying that if you just go

149
00:08:00,860 --> 00:08:04,500
on cranking pre-training,
you might get a nicer model with not a

150
00:08:04,540 --> 00:08:08,160
capabilities. There
was nothing 4.5 could do that was so

151
00:08:08,340 --> 00:08:12,240
far ahead of 4, uh,
that essentially I think that

152
00:08:12,280 --> 00:08:13,200
this would essentially

153
00:08:14,300 --> 00:08:17,780
defeat, uh,
like th-th-there would be new

154
00:08:18,180 --> 00:08:21,120
Uh,
o1 was in a different update because o1

155
00:08:21,370 --> 00:08:23,060
And essentially, this
is in the pa-current paradigm.

156
00:08:23,120 --> 00:08:27,000
So my submission for the pre-training
argument is that not that pre-training is

157
00:08:27,060 --> 00:08:30,600
dead in the sense
that you can't technically, uh,

158
00:08:30,880 --> 00:08:34,820
burn all, all the GDP in the world
and like get a

159
00:08:34,900 --> 00:08:38,840
few doublings, etc., out of it. Uh,
whether essentially A, there

160
00:08:38,860 --> 00:08:42,520
is a--
it's economically feasible to do it and B,

161
00:08:42,720 --> 00:08:46,560
even if it was economically feasible to do
it, uh, is, uh, like

162
00:08:46,820 --> 00:08:50,700
essentially maybe the returns of on, on,
on it that again is probably

163
00:08:50,740 --> 00:08:54,500
not worth it.
Maybe spending hundred x the amount of

164
00:08:54,580 --> 00:08:58,500
times better model, uh,
probably starts breaking a lot of other

165
00:08:58,620 --> 00:09:02,340
Uh, having said that, we
are also almost on the edge

166
00:09:02,500 --> 00:09:05,940
of how much compute we can build.
Everything is stretched out to its limit,

167
00:09:06,420 --> 00:09:07,660
uh, especially by 2030.

168
00:09:07,760 --> 00:09:11,680
The year you said, uh, fabs
are already built

169
00:09:11,740 --> 00:09:15,580
out, etc. We have nowhere close to, uh,
let's say

170
00:09:15,640 --> 00:09:19,280
four x-- four, four increasing our, uh,
pre-training.

171
00:09:19,320 --> 00:09:22,420
We probably can do one
or two more scale-ups in the next two

172
00:09:22,700 --> 00:09:26,520
Sorry, in, in the next four years. Uh,
and that's maximum that we can do.

173
00:09:26,980 --> 00:09:30,180
Uh, even like we do not have enough fabs,
we do not have enough electricity, we do

174
00:09:30,200 --> 00:09:34,180
not have enough...
Some insane amount of physical bottlenecks

175
00:09:34,220 --> 00:09:35,680
the amount of compute in the world.

176
00:09:36,060 --> 00:09:40,040
Um, currently we are doing, uh,
I think forty or

177
00:09:40,080 --> 00:09:41,510
fifty gigawatts of...

178
00:09:42,720 --> 00:09:46,690
Fifty gigawatts? No, thirty gigawatts of,
uh, compute capacity is

179
00:09:46,720 --> 00:09:48,550
what 2027 will get us.

180
00:09:48,620 --> 00:09:51,880
And, uh,
if everything goes according to plan

181
00:09:51,960 --> 00:09:55,780
like all the supply chains get stretched
exactly to the right limit,

182
00:09:56,240 --> 00:09:59,420
our best bet is to get to one fifty to two
hundred gigawatts a year, which is again

183
00:09:59,440 --> 00:10:03,400
like six times more compute,
not ten x more compute, um, from what

184
00:10:03,440 --> 00:10:05,960
we have right now.
And like that's two hundred gigawatts per

185
00:10:06,440 --> 00:10:10,220
Uh, and that's assuming all
that compute goes into training

186
00:10:10,260 --> 00:10:14,140
etc. So, so there
are like lots of physical limitations

187
00:10:14,160 --> 00:10:17,920
till 2030
that do not allow for an arbitrary amount

188
00:10:18,300 --> 00:10:20,780
Uh,
we've already kind of pressed to the

189
00:10:20,880 --> 00:10:24,440
Uh,
buying a laptop costs like three hundred,

190
00:10:24,600 --> 00:10:25,600
Costs three hundred, four hundred more.

191
00:10:25,760 --> 00:10:29,580
I don't think that the world can
essentially take four orders of magnitude

192
00:10:29,620 --> 00:10:32,740
compute scaling that readily now.
I think we,

193
00:10:33,140 --> 00:10:37,100
Coming to RL, I agree
that essentially RL scaling

194
00:10:37,160 --> 00:10:39,070
There are two points
that I want to make about RL.

195
00:10:39,070 --> 00:10:41,390
RL scaling, uh,
is also extremely expensive.

196
00:10:41,840 --> 00:10:44,180
Uh,
it's more expensive than pre-training

197
00:10:44,260 --> 00:10:48,170
Uh, the other point
that I want to make about RL is, uh,

198
00:10:48,200 --> 00:10:51,940
gives us general capability increases,
RL gets us very

199
00:10:52,120 --> 00:10:55,960
jagged increases in capability.
You only get increase in capability in

200
00:10:56,000 --> 00:10:59,820
domains like coding and math, uh,
which essentially kind of

201
00:10:59,860 --> 00:11:03,740
defeats the,
the specific model of intelligence

202
00:11:03,800 --> 00:11:06,720
that it will be better at everything by
2030.

203
00:11:07,080 --> 00:11:10,780
So if we,
if we cannot find good RL candidates

204
00:11:10,800 --> 00:11:14,740
loops for different kinds of event,
we might not even solve, uh, forget

205
00:11:14,820 --> 00:11:17,120
solving for like robotics
and like other things.

206
00:11:17,180 --> 00:11:21,140
Even in just like atoms world,
we might not be able to solve all of it

207
00:11:21,280 --> 00:11:24,600
to human level because we will not just
find enough RL data, etc.,

208
00:11:25,320 --> 00:11:28,979
or enough closed loops, etc., uh, uh,
to essentially do it.

209
00:11:29,080 --> 00:11:32,360
Uh, so, so I don't think
that RL also scales

210
00:11:32,600 --> 00:11:36,540
arbitrarily that you can just thousand x
the RL compute and actually get

211
00:11:36,580 --> 00:11:40,320
away with it. Uh, there
is a limit to how much RL compute you can

212
00:11:40,400 --> 00:11:43,040
RL also just gets you increases in
capability.

213
00:11:43,380 --> 00:11:47,220
Yes, economically viable, uh,
the economic, the killer use

214
00:11:47,260 --> 00:11:51,140
case for LLMs currently is coding,
and coding is economically viable, etc.

215
00:11:51,760 --> 00:11:53,900
There's a small, uh,
probability gap I have.

216
00:11:53,940 --> 00:11:57,460
This leads to some sort of recursive
self-improvement or some sort of, uh,

217
00:11:57,480 --> 00:11:59,840
breakthrough, etc.,
that happens because of this.

218
00:12:00,200 --> 00:12:04,020
Uh, but aside from that gap, uh,
which sure, we can talk about

219
00:12:04,400 --> 00:12:08,360
Uh, but aside from that gap, uh,
the path essentially just by scaling

220
00:12:08,500 --> 00:12:12,370
RL or just by scaling pre-training
or just by scaling RL and pre-training,

221
00:12:12,780 --> 00:12:16,320
uh,
getting to ASI just in the current

222
00:12:16,740 --> 00:12:19,200
uh,
in your definition of intelligence seems

223
00:12:19,240 --> 00:12:23,020
unlikely. Um, to me, like forget 25%,
I would give

224
00:12:23,360 --> 00:12:26,360
sub 1% chance in the current paradigm, uh,
in these specific

225
00:12:26,400 --> 00:12:30,260
circumstances. Not to say that, uh,
I have a much higher probability was

226
00:12:30,340 --> 00:12:33,660
might get to superintelligence by 2030,
but, uh, that's

227
00:12:33,740 --> 00:12:37,140
essentially assuming technologies
that haven't been invented yet come into

228
00:12:37,200 --> 00:12:40,460
being. Uh, so that
was the second point about RL.

229
00:12:40,600 --> 00:12:43,940
Uh, your third point was tied into this,
is like, okay, we might need one more

230
00:12:43,960 --> 00:12:45,960
breakthrough. Uh, current RL might not...

231
00:12:46,000 --> 00:12:47,640
You think there's a chance
that might not be enough.

232
00:12:47,680 --> 00:12:49,360
Maybe it's enough, but we might,
maybe it's not enough.

233
00:12:49,400 --> 00:12:50,920
Maybe we need one more breakthrough.

234
00:12:51,040 --> 00:12:54,469
And, uh, uh, your submission there
was that, hey,

235
00:12:54,860 --> 00:12:58,590
breakthrough will happen, uh,
because look at the last 10 years,

236
00:12:58,780 --> 00:13:02,740
we've gotten transformers
and we've gotten pre-training scaling,

237
00:13:02,840 --> 00:13:06,280
RL. So looks like breakthroughs
are coming very, very quickly.

238
00:13:06,700 --> 00:13:10,380
Um, I think is, uh, this,
I don't think that,

239
00:13:11,000 --> 00:13:14,680
uh,
there are enough data points for you to

240
00:13:15,040 --> 00:13:19,020
Uh, even sta- in the start of GPT,
despite essentially almost

241
00:13:19,080 --> 00:13:22,800
all of world's intelligence
and attention going to this problem,

242
00:13:22,940 --> 00:13:26,520
RL scaling, which we've cracked, uh,
there's not been another scaling

243
00:13:26,580 --> 00:13:30,140
paradigm that we've cracked.
So apart from pre-training and RL scaling,

244
00:13:30,260 --> 00:13:33,380
people are ready to throw computes at
other scaling things, and that's not

245
00:13:33,420 --> 00:13:35,740
something that we've cracked.
I'll take your rebuttal.

246
00:13:36,040 --> 00:13:36,280
[speaker_0] Oh, rebuttal.

247
00:13:36,300 --> 00:13:40,239
[speaker_1] Like what other scaling
paradigm have,

248
00:13:40,280 --> 00:13:40,560
RL-

249
00:13:40,870 --> 00:13:40,890
[speaker_0] No, no.

250
00:13:40,900 --> 00:13:42,330
[speaker_1] Despite throwing insane-

251
00:13:42,330 --> 00:13:45,880
[speaker_0] Okay,
it has a lot of attention,

252
00:13:45,920 --> 00:13:47,180
You're saying this means
that other breakthroughs-

253
00:13:47,220 --> 00:13:49,320
[speaker_1] No, no other breakthroughs.
I'm saying other breakthroughs...

254
00:13:49,840 --> 00:13:53,020
I'm saying, uh, other breakthroughs
are possible in the sense they're not

255
00:13:53,160 --> 00:13:57,080
physically impossible. But A,
do you disagree that there are other,

256
00:13:57,200 --> 00:13:58,200
things that we can just throw,

257
00:13:59,220 --> 00:14:03,120
s- other, other scalable things
that we can throw compute money,

258
00:14:03,140 --> 00:14:06,980
more intelligence aside from, uh,
like just regular pre-training

259
00:14:07,200 --> 00:14:08,160
and RL?

260
00:14:08,560 --> 00:14:12,460
[speaker_0] I think there
is a huge backlog of research hypotheses

261
00:14:12,540 --> 00:14:13,540
at all these AI companies.

262
00:14:13,600 --> 00:14:14,520
[speaker_1] I doubt it.

263
00:14:14,660 --> 00:14:15,460
[speaker_0] Like a lot of very-

264
00:14:15,540 --> 00:14:15,980
[speaker_1] I doubt it

265
00:14:15,990 --> 00:14:19,830
[speaker_0] ... obvious things to try,
but because compute is scarce,

266
00:14:19,920 --> 00:14:20,060
decide-

267
00:14:20,080 --> 00:14:20,480
[speaker_1] I don't know

268
00:14:20,500 --> 00:14:21,450
[speaker_0] ... okay, which ones to, uh-

269
00:14:21,480 --> 00:14:25,300
[speaker_1] I get it, but essentially, I,
I, I, I, I hear that argument saying

270
00:14:25,320 --> 00:14:28,980
that obviously we can improve our models
in X sector, but if there were such

271
00:14:29,140 --> 00:14:32,740
obvious scaling paradigms
that essentially could have been done

272
00:14:32,800 --> 00:14:36,780
than the current paradigms, uh,
I think you would see some evidence of it.

273
00:14:37,080 --> 00:14:39,809
You would see there's,
there's enough time essentially going into

274
00:14:39,809 --> 00:14:39,999
that-

275
00:14:40,260 --> 00:14:41,540
[speaker_0] No, I, I hear all excuses

276
00:14:41,800 --> 00:14:41,809
[speaker_1] ... that-

277
00:14:41,840 --> 00:14:44,360
[speaker_0] I'm saying like you right now,
you are not an expert AI researcher.

278
00:14:44,460 --> 00:14:48,140
You can right now take a pen and paper,
sit for half a day and come up with, okay,

279
00:14:48,380 --> 00:14:52,320
if I had a huge amount of GPUs, here
are 10 hypotheses I would try, and

280
00:14:52,380 --> 00:14:53,420
you can write them down.

281
00:14:53,430 --> 00:14:53,430
[speaker_1] Yeah.

282
00:14:53,430 --> 00:14:57,340
[speaker_0] And then we will ask, okay,
well, why haven't these, like,

283
00:14:57,400 --> 00:15:01,100
know, Open-- Anthropic try them
and they failed, or did nobody try them?

284
00:15:01,240 --> 00:15:01,400
[speaker_1] I-

285
00:15:01,469 --> 00:15:01,689
[speaker_0] If nobody tried-

286
00:15:01,710 --> 00:15:05,380
[speaker_1] ... doubt it's that easy.
Can you name,

287
00:15:05,420 --> 00:15:09,370
that hasn't been tried
that you think has a higher chance of it

288
00:15:09,740 --> 00:15:11,860
an RL level breakthrough if tried?

289
00:15:12,200 --> 00:15:14,880
[speaker_0] I'm not saying any one idea
if I try that will definitely work.

290
00:15:14,940 --> 00:15:15,230
I'm saying-

291
00:15:16,040 --> 00:15:19,480
[speaker_1] No,
any example of an idea

292
00:15:19,540 --> 00:15:23,460
If nobody's tried it,
I didn't find any research papers on it,

293
00:15:23,520 --> 00:15:25,320
we'll probably get ASI.

294
00:15:25,420 --> 00:15:29,020
[speaker_0] Uh,
if you just want random ideas,

295
00:15:29,200 --> 00:15:29,840
Uh-

296
00:15:29,920 --> 00:15:30,060
[speaker_1] Sure.

297
00:15:31,040 --> 00:15:34,670
Just to understand, like,
what kind of ideas do you think

298
00:15:34,700 --> 00:15:38,220
these guys are so compute constrained
that if there is an RL level breakthrough

299
00:15:38,300 --> 00:15:42,060
just sitting on like some researcher's
notepad and they've not had the time.

300
00:15:42,160 --> 00:15:45,460
Because even for RL,
they didn't actually have to use

301
00:15:45,500 --> 00:15:48,340
the idea. Uh, the idea, the idea, the,
the-

302
00:15:48,960 --> 00:15:52,840
[speaker_0] Yeah. Uh, one example,
most of the training still happens in very

303
00:15:52,940 --> 00:15:56,380
code, like, you know, PyTorch, you know,
like four-bit, eight-bit floating point

304
00:15:56,460 --> 00:15:59,760
numbers. If you really wanted,
you could optimize this all the way down.

305
00:15:59,800 --> 00:16:03,300
You could literally run training inside an
ASIC. You could optimize the code.

306
00:16:03,340 --> 00:16:06,540
[speaker_1] Yeah. So, so for example,
that's like a very bad idea. No, no, no.

307
00:16:06,580 --> 00:16:10,400
So for example, running ASIC, so, so
that essentially assumes

308
00:16:10,440 --> 00:16:12,220
like a lot of other things need to move in
the world.

309
00:16:12,260 --> 00:16:15,990
One, the amount of GPU capacity
that already exists allocated in the world

310
00:16:16,020 --> 00:16:19,480
needs to go away. Secondly,
there's a reason why even of co...

311
00:16:19,520 --> 00:16:22,510
Like doing this does not get you
that much compute because essentially

312
00:16:22,520 --> 00:16:25,570
compute constrained,
you're memory constrained, uh,

313
00:16:25,570 --> 00:16:28,080
training and you're basically memory
bandwidth constrained, not even memory

314
00:16:28,100 --> 00:16:31,380
constrained. And, uh, yes, sure,
like we have like silicon

315
00:16:31,440 --> 00:16:34,200
photonics,
and we have other breakthroughs

316
00:16:34,260 --> 00:16:38,159
engineering problem. Uh,
I don't think it's as easy as like, hey,

317
00:16:38,200 --> 00:16:41,860
on ASIC and we get like 100X speed up
and nobody's had the time

318
00:16:42,279 --> 00:16:45,200
Uh,
there's so much incentive for any smart

319
00:16:45,240 --> 00:16:48,770
If you think it's
that easy to go replace the entire compute

320
00:16:48,840 --> 00:16:52,430
ASIC and people have just like not tried
it because of some other

321
00:16:52,460 --> 00:16:54,000
crunch, I, I think you're mistaken.

322
00:16:54,030 --> 00:16:54,110
[speaker_0] No, no, I-

323
00:16:54,150 --> 00:16:57,460
[speaker_1] Like there's insane amount of
like capitalist in-incentive to do it.

324
00:16:57,660 --> 00:17:01,440
Like, th-which would be like,
"Hey." I'm saying if there's any other

325
00:17:01,980 --> 00:17:04,950
uh,
the amount of compute you need to quote

326
00:17:05,000 --> 00:17:08,680
small. Uh, and therefore if, if there
were these insane...

327
00:17:08,720 --> 00:17:12,210
o1, for example,
didn't require too much compute on forward

328
00:17:12,240 --> 00:17:15,940
could work. Uh, if you had such an idea,
you could show that, hey, this

329
00:17:16,020 --> 00:17:19,870
works and we want to use this to scale our
models and it's sitting in the labs and

330
00:17:19,900 --> 00:17:23,260
we just think,
but looks like there's n-none,

331
00:17:23,580 --> 00:17:27,360
[speaker_0] Uh, oh, okay. Yeah.
I'm saying there are a lot of research

332
00:17:27,460 --> 00:17:31,420
require a lot of compute to prove even as
a proof of concept, okay, this is worth

333
00:17:31,460 --> 00:17:31,919
exploring.

334
00:17:32,040 --> 00:17:33,990
[speaker_1] I, I doubt it.

335
00:17:34,070 --> 00:17:34,070
[speaker_0] Yeah.

336
00:17:34,120 --> 00:17:34,900
[speaker_1] I think they all-

337
00:17:35,300 --> 00:17:35,570
[speaker_0] The idea is-

338
00:17:35,570 --> 00:17:39,300
[speaker_1] They all show signs of life.
All of these research ideas show signs of

339
00:17:39,380 --> 00:17:42,580
life much before you have to scale them to
get gains out of it.

340
00:17:43,220 --> 00:17:47,100
There are very few research ideas
that quote unquote "only get unlocked at,"

341
00:17:47,380 --> 00:17:50,850
if you only train them for, like only
if you spend billion dollars of compute is

342
00:17:50,880 --> 00:17:54,220
the first sign of life you get from
that research idea, uh,

343
00:17:54,280 --> 00:17:57,230
research idea saying that, "Hey,
if I just keep throwing compute at it"-

344
00:17:57,330 --> 00:18:00,280
[speaker_0] In research,
almost every breakthrough

345
00:18:00,380 --> 00:18:04,240
Like only after you scaled it to literal
billions of dollars you saw that it was

346
00:18:04,260 --> 00:18:04,560
working.

347
00:18:05,060 --> 00:18:07,740
[speaker_1] No.
Give me one research idea that's gonna

348
00:18:08,100 --> 00:18:09,940
GPT-1 was like a very small model.

349
00:18:10,480 --> 00:18:10,550
[speaker_0] RL or-

350
00:18:10,650 --> 00:18:11,669
[speaker_1] RNN was a very small model.

351
00:18:11,740 --> 00:18:13,160
[speaker_0] Uh, GPT-

352
00:18:13,280 --> 00:18:14,820
[speaker_1] RL, for example,
didn't require billions of dollars.

353
00:18:14,980 --> 00:18:16,780
So o1, for example,
was a very cheap model.

354
00:18:17,140 --> 00:18:21,040
Once GPT-4 was trained,
training on chain of,

355
00:18:21,080 --> 00:18:24,964
of thought and doing RL was like a very,
very small experiment. So,

356
00:18:25,174 --> 00:18:25,174
uh-

357
00:18:25,184 --> 00:18:27,854
[speaker_0] Can, uh,
do you have numbers on that?

358
00:18:27,864 --> 00:18:29,584
[speaker_1] Yeah, yeah. So, so there's, A,
there's ...

359
00:18:29,774 --> 00:18:33,424
I, I'll find out where I read about this,
but basically the first version of the o1

360
00:18:33,444 --> 00:18:35,264
model was just get tried in a lab,
et cetera.

361
00:18:35,604 --> 00:18:38,564
For example, uh,
you can take Llama 3 right now,

362
00:18:39,164 --> 00:18:40,434
and, uh, you can take-

363
00:18:40,464 --> 00:18:43,323
[speaker_0] No, no, not,
not about right now.

364
00:18:43,484 --> 00:18:46,334
Back then when o1 was first tried,
like my guess without having-

365
00:18:46,334 --> 00:18:48,344
[speaker_1] Like I'm saying,
you can make it more efficient.

366
00:18:48,384 --> 00:18:48,504
[speaker_0] S- sorry.

367
00:18:48,544 --> 00:18:49,364
[speaker_1] You can make it more

368
00:18:50,304 --> 00:18:50,523
efficient. Yeah.

369
00:18:50,744 --> 00:18:54,104
[speaker_0] Yeah,
like without having read it, my guess

370
00:18:54,144 --> 00:18:57,524
required at least $10 million on top of
the

371
00:18:57,564 --> 00:19:01,324
GPT-4 training cost,
and like I have not actually checked the

372
00:19:01,384 --> 00:19:01,804
but yeah.

373
00:19:02,344 --> 00:19:05,724
[speaker_1] You needed the GPT-4 to get
trained first,

374
00:19:05,764 --> 00:19:08,664
and then you found a new scaling paradigm
on that scaling paradigm.

375
00:19:09,104 --> 00:19:09,193
[speaker_0] Yeah, yeah.

376
00:19:09,224 --> 00:19:09,634
[speaker_1] That I agree.

377
00:19:09,653 --> 00:19:10,944
[speaker_0] I'm saying first you had the
whole-

378
00:19:11,004 --> 00:19:11,564
[speaker_1] But go from four to-

379
00:19:11,614 --> 00:19:13,513
[speaker_0] ... GPT-4 training cost,
then GPT-4 trained.

380
00:19:13,524 --> 00:19:13,614
[speaker_1] Right.

381
00:19:13,984 --> 00:19:14,704
[speaker_0] Then you had-

382
00:19:15,164 --> 00:19:15,434
[speaker_1] But o1, when you look at-

383
00:19:15,434 --> 00:19:19,384
[speaker_0] ... to research hypothesis.
Each of those hypothesis took at least $10

384
00:19:19,804 --> 00:19:21,124
to test, and 10 million
is a random number.

385
00:19:21,164 --> 00:19:21,264
[speaker_1] No.

386
00:19:21,344 --> 00:19:22,824
[speaker_0] I think it's actually
important. Uh-

387
00:19:22,904 --> 00:19:24,384
[speaker_1] I don't think it takes $10
million to test.

388
00:19:24,444 --> 00:19:27,164
I think it takes like a few hundred
thousand dollars,

389
00:19:27,224 --> 00:19:30,924
dollars to test a research idea to see
if that it has any

390
00:19:31,024 --> 00:19:34,564
sign or any chance. Yes,
like you might see like different gains

391
00:19:34,664 --> 00:19:35,374
losses, et cetera.

392
00:19:35,394 --> 00:19:37,124
[speaker_0] That is then possibly 100
million is what I'm claiming.

393
00:19:38,004 --> 00:19:41,324
[speaker_1] I doubt it. I don't know
if any of all these ideas take 10 million.

394
00:19:41,384 --> 00:19:44,463
For example,
o1 didn't take 10 million post GPT-4

395
00:19:44,864 --> 00:19:47,324
[speaker_0] We can actually go
and check that maybe.

396
00:19:47,404 --> 00:19:48,184
Like I know this is-

397
00:19:48,224 --> 00:19:48,274
[speaker_1] No, no

398
00:19:48,274 --> 00:19:51,044
[speaker_0] ... not public information,
but we can go and see like-

399
00:19:51,244 --> 00:19:53,644
[speaker_1] No, no,
I think we can check because the amount of

400
00:19:53,684 --> 00:19:54,924
Yeah, sure, because the actual amount ...

401
00:19:54,934 --> 00:19:57,954
So the idea, the way research works
is get an idea,

402
00:19:58,164 --> 00:20:01,304
this thing. If y-
if you see any sort on it or anything

403
00:20:01,314 --> 00:20:04,584
"Cool. You know what?
This warrants more investigation,

404
00:20:04,744 --> 00:20:07,804
actually go improve something in the
form." And then you can do optimizations

405
00:20:07,844 --> 00:20:09,224
it, and then you can make it better,
et cetera.

406
00:20:09,524 --> 00:20:13,244
But the first amount of like this thing,
the sign of life saying that this has,

407
00:20:13,304 --> 00:20:16,394
this is probably has some legs can come
from not that much money.

408
00:20:17,024 --> 00:20:17,474
[speaker_0] Yeah. This is-

409
00:20:17,504 --> 00:20:19,144
[speaker_1] And then obviously getting
actual real-

410
00:20:19,174 --> 00:20:21,064
[speaker_0] ... GPT works, however, ML
is not like this.

411
00:20:21,724 --> 00:20:23,264
[laughs] I think that's my actual method.

412
00:20:23,304 --> 00:20:26,364
[speaker_1] All of ML has been like this.
Like pre-training, for example,

413
00:20:26,404 --> 00:20:28,664
was a very small model. GPT-2
was a very small model.

414
00:20:29,184 --> 00:20:32,864
Uh, it took like, uh, probably 150 or 2,
like the, the initial

415
00:20:32,924 --> 00:20:35,284
GPT cost like a million dollars, I think.
Not that much.

416
00:20:35,784 --> 00:20:39,704
Uh, GPT-3 again,
like took like slightly larger amount of

417
00:20:40,024 --> 00:20:41,424
but nothing compared to right now.

418
00:20:41,824 --> 00:20:45,604
And the amount of, uh,
the only reason you go from GPT-2 to 3 to

419
00:20:45,724 --> 00:20:49,484
4 to 5 or whatever subsequent models
is because you see gains

420
00:20:49,524 --> 00:20:53,484
from scaling all the time. Uh, RNN,
for example, you can see that, okay,

421
00:20:53,544 --> 00:20:55,884
scale the RNN paradigm,
you get gains from it.

422
00:20:56,424 --> 00:20:59,673
Uh, and then, then you stop seeing gains,
so that's why RNN didn't scale or

423
00:20:59,684 --> 00:21:01,214
whatever. Or, uh, then they went to LSTM.

424
00:21:01,284 --> 00:21:02,694
LSTM didn't scale,
and then they went to transformer.

425
00:21:02,784 --> 00:21:05,754
Transformer scaled fairly much, and like,
cool, we found a scalable paradigm.

426
00:21:06,204 --> 00:21:09,244
The idea that you can see gains
and then you try to scale it

427
00:21:09,744 --> 00:21:10,064
So-

428
00:21:10,084 --> 00:21:10,524
[speaker_0] Yeah

429
00:21:10,534 --> 00:21:14,144
[speaker_1] ... uh, so it's not that, oh,
I have to like literally roll dice of $10

430
00:21:14,224 --> 00:21:16,584
each time to get one idea. It's not
that random.

431
00:21:16,944 --> 00:21:20,884
[speaker_0] Okay,
I'll make a tighter claim. After GPT-2

432
00:21:20,924 --> 00:21:24,444
had come out,
if you wanted to try out any ML research

433
00:21:24,484 --> 00:21:28,304
hypothesis,
you probably needed at least $10 million

434
00:21:28,324 --> 00:21:31,924
life or not. Like for,
for most of the research hypothesis-

435
00:21:31,984 --> 00:21:32,134
[speaker_1] No, you could use the, uh-

436
00:21:32,134 --> 00:21:34,064
[speaker_0] ...
you needed at least $10 million since GPT-

437
00:21:34,104 --> 00:21:38,084
[speaker_1] No, no. So the id- the,
the idea is that you train GPT-1,

438
00:21:38,104 --> 00:21:41,284
saw that there were gains on it.
The obvious thing to do is that, "Hey,

439
00:21:41,324 --> 00:21:44,224
found a scalable paradigm.
I think I can scale it further to see

440
00:21:44,324 --> 00:21:44,644
gains."

441
00:21:45,024 --> 00:21:45,194
[speaker_0] Sure.

442
00:21:45,384 --> 00:21:48,484
[speaker_1] And then you train GPT-3,
and there are still gains from it,

443
00:21:48,524 --> 00:21:51,704
res- dead research ideas
that might show signs of life, but o-

444
00:21:51,724 --> 00:21:53,084
it, they start breaking.

445
00:21:53,124 --> 00:21:56,944
[speaker_0] Yeah,
I'm saying to try any other idea,

446
00:21:56,964 --> 00:22:00,604
pre-training scaling.
If you have any other idea apart from

447
00:22:00,824 --> 00:22:04,584
GPT-2 came out 2018, right? So after 2018,
if you had any other

448
00:22:04,644 --> 00:22:08,404
idea besides the whole pre-training
scaling thing, you wanted to test it out,

449
00:22:08,424 --> 00:22:12,304
would need at least $10 million for most
ideas to even test out and see if this

450
00:22:12,324 --> 00:22:13,164
has any life or not.

451
00:22:13,584 --> 00:22:15,604
[speaker_1] And that's also trivial
amounts of money if...

452
00:22:15,704 --> 00:22:18,424
I think if people who know about how these
things work, it's not completely

453
00:22:18,464 --> 00:22:20,764
unintuitive. They have,
they have a big idea of test.

454
00:22:21,044 --> 00:22:24,964
I, I'm not saying is that, uh,
there might be a breakthrough that

455
00:22:25,004 --> 00:22:28,004
in some old research paper
or in scribbled in all the diaries

456
00:22:28,064 --> 00:22:32,024
overlooked, uh, but it
is not as low-lying a fruit as you

457
00:22:32,564 --> 00:22:36,384
if,
if only we had just spent more compute,

458
00:22:36,824 --> 00:22:40,054
thousands of thousands of scalable
paradigms." Finding scalable paradigms is

459
00:22:40,104 --> 00:22:43,484
really, really hard.
We've only done like two ti-

460
00:22:43,504 --> 00:22:46,213
last 10 years and two times in the last 50
years and two times in the last thousand

461
00:22:46,224 --> 00:22:50,114
years, and, uh, essentially, uh, uh,
just because we've gotten

462
00:22:50,364 --> 00:22:53,764
lucky,
just because we got lucky doesn't mean

463
00:22:53,804 --> 00:22:56,064
happening in the next two, three, four,
five years.

464
00:22:56,084 --> 00:22:59,524
It's just like there'll just be like
scalable paradigms after scalable

465
00:22:59,784 --> 00:23:02,464
that will just keep showing up because
it's shown up last two times.

466
00:23:02,824 --> 00:23:05,744
[speaker_0] Okay,
so we have identified some disagreement

467
00:23:05,864 --> 00:23:07,344
Uh, how do you think we can resolve this?

468
00:23:07,384 --> 00:23:11,014
Like what data points will work
or what arguments about how do you think

469
00:23:11,014 --> 00:23:14,884
[speaker_1] One data point
that would work, one data point

470
00:23:14,944 --> 00:23:18,904
move you is that if you look other, uh,
if you look at other fields, you look

471
00:23:18,944 --> 00:23:22,884
at biology, et cetera, uh,
just because like there's one breakthrough

472
00:23:22,924 --> 00:23:26,884
that essentially leads to a different
class of

473
00:23:26,924 --> 00:23:28,414
discoveries or drug discovery happening.

474
00:23:28,464 --> 00:23:30,544
For example, you get, uh, let's say

475
00:23:31,704 --> 00:23:35,414
discovery through, like, for example,
RNA delivery of drugs,

476
00:23:36,044 --> 00:23:39,044
uh, like the mRNA vaccine, et cetera,
which you can like modify RNA and you can

477
00:23:39,064 --> 00:23:42,964
inject in people, et cetera.
That means that, yes, a lot of, a lot

478
00:23:43,044 --> 00:23:43,624
of, uh...

479
00:23:44,544 --> 00:23:47,644
That, that was a big deal,
like get a new branch of medicine,

480
00:23:47,664 --> 00:23:51,384
doesn't automatically mean
that the amount of new breakthroughs will

481
00:23:51,424 --> 00:23:55,224
increase. If anything, what we've seen
is that there is actually a slowdown in

482
00:23:55,284 --> 00:23:59,074
amount of new ideas
and new researches in every mature field

483
00:23:59,104 --> 00:24:02,264
more,
more attention goes into it because all

484
00:24:02,584 --> 00:24:05,844
Uh, so finding the next breakthrough
is not a linear process.

485
00:24:05,924 --> 00:24:08,524
It's actually a super linear process.
Not like it's a log process.

486
00:24:08,564 --> 00:24:12,004
You have to like spend 10X,
100X more resources to get more ideas,

487
00:24:12,464 --> 00:24:16,244
and, uh,
you pluck the low-hanging fruits very,

488
00:24:16,284 --> 00:24:18,304
is not that hard. It's not that easy.

489
00:24:18,524 --> 00:24:21,244
So- Um, so other field at least do it.

490
00:24:21,284 --> 00:24:23,424
You can argue that M-ML
is different for some reason.

491
00:24:23,544 --> 00:24:26,604
I don't know why scientific idea is, like,
inherently be different in, uh, ML

492
00:24:26,624 --> 00:24:27,804
because like, oh, ML is different.

493
00:24:27,844 --> 00:24:31,704
Because mostly what happens is
if enough eyeballs look at a problem, uh,

494
00:24:31,744 --> 00:24:34,044
they look at all the low-hanging fruits,
then they go to the second level of

495
00:24:34,084 --> 00:24:36,804
low-hanging fruit,
and they keep doing this, and they,

496
00:24:36,814 --> 00:24:38,184
hypothesis. And like, okay, cool.

497
00:24:38,664 --> 00:24:42,474
Uh, this has been already done in physics,
for example, uh, or

498
00:24:42,504 --> 00:24:45,904
like chemistry, et cetera.
We don't expect like crazy amount of math

499
00:24:45,924 --> 00:24:49,684
come out, uh, by a mathematician, uh,
or like a

500
00:24:49,724 --> 00:24:52,864
great, like, new,
new lines of math schools,

501
00:24:53,384 --> 00:24:56,424
Uh,
and I feel like with more attention on a

502
00:24:56,474 --> 00:24:58,504
It becomes harder to e-curve. S-curve
is not easier.

503
00:24:58,824 --> 00:24:59,364
[speaker_0] Okay, uh-

504
00:24:59,414 --> 00:25:01,724
[speaker_1] It depends on what part of the
S-curve you're on, I think.

505
00:25:02,324 --> 00:25:03,924
[speaker_0] Yeah, I think
that S-curve analogy is good.

506
00:25:04,004 --> 00:25:07,924
So yeah,
if we are comparing to other scientific

507
00:25:07,944 --> 00:25:09,644
field in which experiments are expensive.

508
00:25:09,704 --> 00:25:13,304
So, like, we should not compare this to,
like, theoretical math where you just need

509
00:25:13,384 --> 00:25:14,724
person sitting with pen and paper.

510
00:25:14,804 --> 00:25:17,884
Like, something like drug discovery
is a better analogy for this.

511
00:25:18,304 --> 00:25:21,844
And by the way, I do think ML
is a bit different,

512
00:25:21,944 --> 00:25:25,274
analogizing with other fields, uh,
in drug discovery.

513
00:25:25,324 --> 00:25:27,004
[speaker_1] Or like theoretical physics,
for example.

514
00:25:27,344 --> 00:25:30,244
[speaker_0] Sorry,
y-you mean theoretical physics is cheap

515
00:25:30,744 --> 00:25:34,404
[speaker_1] Is also expensive. It
was cheap at some point,

516
00:25:34,424 --> 00:25:35,964
like, breakthrough with pen and paper.

517
00:25:36,384 --> 00:25:40,004
But now if you want to, like, uh,
experimental physics, sorry,

518
00:25:40,244 --> 00:25:40,314
physics-

519
00:25:40,314 --> 00:25:40,674
[speaker_0] Mm. Yeah

520
00:25:40,674 --> 00:25:44,524
[speaker_1] ... uh,
you could make like a... Yeah,

521
00:25:44,544 --> 00:25:46,174
to do any new physics.

522
00:25:46,624 --> 00:25:50,224
[speaker_0] Sure. Yeah. Okay, fine.
Experimental physics would work.

523
00:25:50,504 --> 00:25:54,254
Uh, yeah.
Now to actually argue about experimental

524
00:25:54,284 --> 00:25:57,624
physics or drug discovery,
I will have to actually read more about

525
00:25:58,144 --> 00:25:58,924
physics or drug discovery. [chuckles]

526
00:25:59,064 --> 00:26:01,564
[speaker_1] Uh, but do you agree that,
that, like-

527
00:26:01,744 --> 00:26:02,534
[speaker_0] Also, we have to-

528
00:26:02,804 --> 00:26:02,814
[speaker_1] With the-

529
00:26:02,824 --> 00:26:03,084
[speaker_0] Yeah, no

530
00:26:03,204 --> 00:26:04,094
[speaker_1] ... enhanced attention and-

531
00:26:04,204 --> 00:26:05,604
[speaker_0] Also,
we have to pick a time period. Sorry.

532
00:26:05,864 --> 00:26:09,684
Uh, also, like, if we take, you know,
drug discovery or experimental physics as

533
00:26:09,724 --> 00:26:13,574
example,
we have to take a time period in the

534
00:26:13,604 --> 00:26:17,584
was known that, okay, this thing is,
you know, in like boom phase and like, you

535
00:26:17,624 --> 00:26:19,504
know, lots of new capabilities
are coming out.

536
00:26:19,544 --> 00:26:22,364
Like, like in ML right now, we know we
are in that sort of phase.

537
00:26:22,424 --> 00:26:24,064
Like,
there may be new things we could try.

538
00:26:24,144 --> 00:26:26,524
Like, it's not like, okay,
it's like a dead field-

539
00:26:26,564 --> 00:26:26,574
[speaker_1] Sure

540
00:26:26,574 --> 00:26:27,204
[speaker_0] ... mature field.

541
00:26:28,024 --> 00:26:28,564
We have not reached-

542
00:26:28,584 --> 00:26:32,424
[speaker_1] Sure, sure. Yeah, I agree.
And like example,

543
00:26:32,464 --> 00:26:35,184
that.
I don't know about experimental physics,

544
00:26:35,244 --> 00:26:39,224
also,
where a bunch of like new physics

545
00:26:39,264 --> 00:26:41,144
1920s. There was relativity that came out.

546
00:26:41,184 --> 00:26:44,364
There's like, like,
a bunch of these new...

547
00:26:44,434 --> 00:26:46,924
All of them essentially got started in
1920s.

548
00:26:47,144 --> 00:26:49,124
There was one period in the 1600s
that happened.

549
00:26:49,244 --> 00:26:51,254
Uh, but, uh, I agree that we might-

550
00:26:51,604 --> 00:26:54,604
[speaker_0] Studying that particular time
period to study, you know,

551
00:26:54,664 --> 00:26:57,334
that happened there?
And after a few breakthroughs-

552
00:26:57,334 --> 00:26:57,334
[speaker_1] Yeah

553
00:26:57,384 --> 00:27:00,224
[speaker_0] ... came out,
now to extrapolate, okay,

554
00:27:00,264 --> 00:27:02,244
breakthroughs come out,
and how expensive will it be to run-

555
00:27:02,254 --> 00:27:02,554
[speaker_1] Sure, sure

556
00:27:02,584 --> 00:27:03,884
[speaker_0] ... experiments?
I think that's the kind of-

557
00:27:03,924 --> 00:27:04,424
[speaker_1] And I agree

558
00:27:04,544 --> 00:27:05,344
[speaker_0] ... study.

559
00:27:05,384 --> 00:27:09,204
[speaker_1] And I agree. And, and,
and then essentially, right,

560
00:27:09,304 --> 00:27:12,904
part of the S-curve, that then there,
then you should

561
00:27:12,944 --> 00:27:14,344
expect more breakthroughs to come out.

562
00:27:14,724 --> 00:27:18,274
If you're on the horizontal part of the
S-curve, then you should think, then you

563
00:27:18,284 --> 00:27:20,044
should expect less discoveries to come
out.

564
00:27:20,104 --> 00:27:23,264
Which part of the S-curve you are on,
I don't think either of us know.

565
00:27:23,384 --> 00:27:25,464
Uh, but I think-

566
00:27:25,604 --> 00:27:25,984
[speaker_0] I'm claiming-

567
00:27:26,074 --> 00:27:26,724
[speaker_1] ... the more time-

568
00:27:26,884 --> 00:27:30,504
[speaker_0] That's my claim. And also,
sure, some decent probability they're not,

569
00:27:30,944 --> 00:27:33,384
[speaker_1] Okay. Uh,
what will be your evidence to saying

570
00:27:33,444 --> 00:27:37,144
Like, how are you so sure?
I have zero base to say that this thing,

571
00:27:37,184 --> 00:27:40,504
can, on hindsight be like, "Oh,
looks like we were on the vertical part

572
00:27:40,784 --> 00:27:44,674
[speaker_0] Okay. And for me,
it's just extrapolate last five to 10

573
00:27:44,724 --> 00:27:46,624
data points. Okay, which year did, uh...

574
00:27:46,804 --> 00:27:50,104
Well, actually you can go back to ,
you know, which year did AlexNet come out?

575
00:27:50,154 --> 00:27:50,154
[speaker_1] Uh-

576
00:27:50,184 --> 00:27:53,144
[speaker_0] Then which year did, you know,
transformer come out?

577
00:27:53,184 --> 00:27:53,644
Which year did-

578
00:27:53,654 --> 00:27:53,654
[speaker_1] That-

579
00:27:53,654 --> 00:27:57,544
[speaker_0] ... GPT-2 come out?
And just put these on like a year

580
00:27:57,584 --> 00:27:58,254
versus, you know-

581
00:27:58,304 --> 00:27:58,684
[speaker_1] That is-

582
00:27:58,733 --> 00:28:02,544
[speaker_0] ...
new breakthrough kind of graph,

583
00:28:02,584 --> 00:28:04,344
this look like your S-curve is saturated?

584
00:28:04,404 --> 00:28:06,184
Yes or no?" And no,
it doesn't look like there's-

585
00:28:06,304 --> 00:28:08,004
[speaker_1] And this is like an outside
perspective.

586
00:28:08,044 --> 00:28:11,584
You don't have to trust it,
but Ilya Sutskever, who

587
00:28:11,624 --> 00:28:15,564
was responsible for GPT,
was responsible for RL, uh, comes

588
00:28:15,584 --> 00:28:18,764
and says essentially all the good ideas
are done and we need to spend some time

589
00:28:18,804 --> 00:28:21,484
doing new research. And this
is the time for...

590
00:28:21,524 --> 00:28:24,594
I don't know if you saw that episode
of Dwarkesh, but he's like,

591
00:28:24,684 --> 00:28:26,624
scaling is over. Now
is the time for new research."

592
00:28:27,524 --> 00:28:29,384
[speaker_0] What are Ilya Sutskever's
timelines?

593
00:28:29,464 --> 00:28:32,804
Are they less bullish than me
when I'm saying 25% ASI 2030?

594
00:28:32,864 --> 00:28:34,994
Like,
does Ilya have like less bullish timelines

595
00:28:35,914 --> 00:28:37,593
[speaker_1] I don't know,
but I think that's irrelevant.

596
00:28:38,024 --> 00:28:38,263
[speaker_0] Sorry?

597
00:28:38,984 --> 00:28:40,994
[speaker_1] I think that's irrelevant.
I feel-

598
00:28:41,124 --> 00:28:42,624
[speaker_0] No, no, you brought up Ilya-

599
00:28:42,674 --> 00:28:42,674
[speaker_1] Yeah. No

600
00:28:42,674 --> 00:28:46,454
[speaker_0] ... then I was like, okay,
like does Ilya agree with me already,

601
00:28:46,464 --> 00:28:46,784
thing.

602
00:28:47,464 --> 00:28:48,824
[speaker_1] It doesn't matter
if Ilya agrees with you.

603
00:28:48,884 --> 00:28:52,424
What Ilya does agree,
disagree with you on is

604
00:28:52,484 --> 00:28:56,164
that the low-hanging scaling fruits have
been plucked and we need to go find new

605
00:28:56,204 --> 00:28:59,884
scaling breakthroughs, uh,
which he's confident that he will find,

606
00:29:00,004 --> 00:29:03,734
economic incentive to say that. Uh,
but he's saying that, "Okay,

607
00:29:03,764 --> 00:29:05,704
are done,
and now we need to find something new to

608
00:29:06,164 --> 00:29:10,104
[speaker_0] Okay. No, but
if you strongly defer to Ilya on

609
00:29:10,184 --> 00:29:12,234
this question,
then Ilya's actual timelines are-

610
00:29:12,264 --> 00:29:13,284
[speaker_1] No, I don't. I don't,

611
00:29:14,164 --> 00:29:17,064
I don't,
I don't defer strongly to Ilya on this

612
00:29:17,124 --> 00:29:20,434
All I'm saying is there's one extra
evidence point saying that the guy who was

613
00:29:20,464 --> 00:29:23,164
involved with all these three
breakthroughs comes and says

614
00:29:23,184 --> 00:29:26,604
no low-hanging fruits anymore,
and we have to go find more, uh,

615
00:29:26,864 --> 00:29:30,364
should update you that we
are probably not on the vertical part of

616
00:29:30,424 --> 00:29:33,424
[speaker_0] No, no, no. Uh,
I took a different lesson from this.

617
00:29:33,584 --> 00:29:37,204
Uh, like Ilya has, again,
as bullish as me timelines,

618
00:29:37,244 --> 00:29:40,764
low-hanging fruit is picked.
What he means is, okay, the

619
00:29:40,804 --> 00:29:43,403
low-hanging fruit is not like one month
low-hanging fruit.

620
00:29:43,464 --> 00:29:45,304
It's like two years,
three years low-hanging fruit.

621
00:29:45,724 --> 00:29:48,604
[speaker_1] No, I think his time,
his timelines are significantly longer.

622
00:29:48,684 --> 00:29:52,564
I think he's like in the next eight to 10
years will probably be some areas

623
00:29:52,624 --> 00:29:56,144
of research that we need to find to scale,
and yeah.

624
00:29:56,524 --> 00:29:56,534
[speaker_0] Okay.

625
00:29:56,564 --> 00:29:58,244
[speaker_1] I, I think his
is probably longer.

626
00:29:58,304 --> 00:30:02,124
[speaker_0] So, so again, I'm like, yeah,
if you want to debate specifically Ilya's

627
00:30:02,204 --> 00:30:04,064
I think actually we have to go
and find his timeline.

628
00:30:04,124 --> 00:30:05,244
[speaker_1] I don't want to debate Ilya's
worldview.

629
00:30:05,324 --> 00:30:08,964
I'm saying that this, the,
it's not a question of whether Ilya

630
00:30:09,004 --> 00:30:12,704
The question of, uh,
Ilya has an evidence point

631
00:30:12,964 --> 00:30:15,153
may or may not update you on
which part of the S-curve we are on.

632
00:30:15,284 --> 00:30:18,784
[speaker_0] Yeah. So for that,
I need to first even understand does Ilya

633
00:30:18,824 --> 00:30:19,934
or does he have some major disagreement?

634
00:30:19,984 --> 00:30:23,924
[speaker_1] No, he doesn't. He doesn't.
He, his timelines are, uh,

635
00:30:24,880 --> 00:30:27,429
[speaker_0] So then want to know what his
timelines are.

636
00:30:27,900 --> 00:30:31,630
[speaker_1] I think he said the next...
I might have to check on this, but

637
00:30:31,740 --> 00:30:35,640
Dwarkesh episode, he says
that like six to eight years, uh,

638
00:30:35,680 --> 00:30:37,180
and then we'll find something
that we can scale.

639
00:30:38,100 --> 00:30:41,740
And, uh,
there are no good scaling candidates in

640
00:30:42,200 --> 00:30:45,840
[speaker_0] Yeah, no, to continue this,
like, I think, like,

641
00:30:45,960 --> 00:30:47,800
more... Like, are you sure about this?
You know.

642
00:30:47,840 --> 00:30:48,120
[speaker_1] Yeah.

643
00:30:48,180 --> 00:30:49,120
[speaker_0] Give me more context and-

644
00:30:49,520 --> 00:30:52,860
[speaker_1] No, no, I'm saying,
I'm saying, I'm saying that,

645
00:30:52,980 --> 00:30:55,180
S-curve we are on, uh,
there's uncertainty on it.

646
00:30:55,740 --> 00:30:58,180
Uh, maybe you're right and we
are on the vertical part of the S-curve.

647
00:30:58,580 --> 00:31:02,300
I think if you just plot it from AlexNet,
it's mostly the same

648
00:31:02,340 --> 00:31:05,940
paradigm. Uh, and I would count AlexNet

649
00:31:06,660 --> 00:31:10,600
as one breakthrough,
Transformers as another breakthrough,

650
00:31:10,640 --> 00:31:11,000
breakthrough.

651
00:31:11,980 --> 00:31:12,930
But apart from that,

652
00:31:13,760 --> 00:31:14,790
I don't see why-

653
00:31:15,160 --> 00:31:17,910
[speaker_0] Yes, once, uh, no,
but there are also like minor ones

654
00:31:18,120 --> 00:31:21,800
[speaker_1] ... like,
I don't know any other paradigm that

655
00:31:21,840 --> 00:31:25,220
Okay,
I won't even count AlexNet 'cause AlexNet

656
00:31:25,280 --> 00:31:29,180
works. Uh,
I would just count Transformers

657
00:31:29,240 --> 00:31:31,640
that, uh, show that scale is all you need.

658
00:31:32,160 --> 00:31:35,910
Uh, and then there's n-
there's not been a third candidate for

659
00:31:35,960 --> 00:31:39,180
need, or a third S-curve
that you can stack on top of these things.

660
00:31:39,560 --> 00:31:41,550
Pre-training was one S-curve.
We exhausted it.

661
00:31:41,600 --> 00:31:45,520
[speaker_0] There are multiple points.
One is like proof that Transformers

662
00:31:45,600 --> 00:31:49,520
all,
and then there's a second data point

663
00:31:49,640 --> 00:31:50,570
Or even with, like-

664
00:31:50,620 --> 00:31:53,010
[speaker_1] Transformers
are useful at all is...

665
00:31:53,050 --> 00:31:55,360
Transformers are only useful because they
can scale.

666
00:31:55,740 --> 00:31:58,110
[speaker_0] No. Uh, GPT-2 was-

667
00:31:58,120 --> 00:31:58,420
[speaker_1] Because-

668
00:31:58,640 --> 00:32:02,060
[speaker_0] ... useful. Like, it
was a breakthrough by itself, even

669
00:32:02,100 --> 00:32:04,370
anything about whether GPT-2 would scale
or not.

670
00:32:04,420 --> 00:32:08,140
[speaker_1] If it doesn't, then... No,
what I'm talking about is that if you

671
00:32:08,200 --> 00:32:11,800
assume that scaling
is what gets you smarter models, uh,

672
00:32:12,120 --> 00:32:15,540
subscribe to that worldview,
then you need paradigms that can scale.

673
00:32:15,790 --> 00:32:17,330
Pre-training is one paradigm
that can scale.

674
00:32:17,560 --> 00:32:21,520
[speaker_0] Right now.
At the time GPT-2 was invented,

675
00:32:21,700 --> 00:32:23,360
ML research community didn't believe it.

676
00:32:23,420 --> 00:32:26,790
[speaker_1] Sure. And I, like,
I'm saying that it's, that's, that's

677
00:32:27,320 --> 00:32:28,880
irrespective, like, that's immaterial.

678
00:32:28,940 --> 00:32:32,769
What I'm saying right now is
if you believe that intelligence comes

679
00:32:32,780 --> 00:32:33,570
scaling things up-

680
00:32:34,180 --> 00:32:34,460
[speaker_0] Sure

681
00:32:34,540 --> 00:32:38,220
[speaker_1] ...
then scaling a paradigm up,

682
00:32:38,280 --> 00:32:38,800
scaling up.

683
00:32:39,460 --> 00:32:39,640
[speaker_0] Yeah.

684
00:32:39,700 --> 00:32:43,620
[speaker_1] In doing so, uh,
we have found only two paradigms that have

685
00:32:43,700 --> 00:32:44,060
scaled up.

686
00:32:44,180 --> 00:32:47,300
[speaker_0] We have found only two
paradigms that have scaled up.

687
00:32:47,360 --> 00:32:51,120
No, I mean, why does, uh,
the whole scaling fully connected

688
00:32:51,180 --> 00:32:54,440
networks back in 2012,
2013, why does that not count?

689
00:32:54,480 --> 00:32:57,420
Why does scaling LSTM not count? I,
I'm not clear.

690
00:32:57,620 --> 00:33:00,000
[speaker_1] Because LSTM didn't scale.
CNNs didn't scale.

691
00:33:00,820 --> 00:33:02,920
[speaker_0] What do you mean?
CNNs do scale.

692
00:33:03,220 --> 00:33:06,540
[speaker_1] As in, like,
you get diminishing returns from scaling

693
00:33:07,060 --> 00:33:08,950
CNNs, for example, cannot-

694
00:33:09,320 --> 00:33:09,330
[speaker_0] Yeah.

695
00:33:09,340 --> 00:33:12,020
[speaker_1] People try to do language
experiments on CNN,

696
00:33:12,080 --> 00:33:15,490
People try to do language experiments on
LSTM, they perform to a certain level, but

697
00:33:15,520 --> 00:33:17,500
essentially then they start degrading.

698
00:33:18,020 --> 00:33:21,420
So th- these, these paradigms have lots of

699
00:33:21,580 --> 00:33:22,900
limitations on how much they can scale.

700
00:33:23,540 --> 00:33:27,520
[speaker_0] Sure. So they,
they did scale up to some amount,

701
00:33:27,560 --> 00:33:29,980
saturated, I guess. Same.
Like you had some estimate, I guess.

702
00:33:30,020 --> 00:33:33,440
[speaker_1] So then,
then it's a failed candidate. Yeah,

703
00:33:33,500 --> 00:33:37,440
Like, like a good candidate is that, hey,
it doesn't matter how much compute

704
00:33:37,460 --> 00:33:39,220
we throw at it, just keep scaling.

705
00:33:39,720 --> 00:33:41,880
[speaker_0] No, no, it means
that back then it succeeded, right?

706
00:33:42,140 --> 00:33:45,640
CNNs, we did scale up to some amount,
then we realized we're getting diminishing

707
00:33:45,680 --> 00:33:46,200
returns.

708
00:33:46,280 --> 00:33:47,300
[speaker_1] No,
but that's what I'm saying, like,

709
00:33:48,320 --> 00:33:50,640
that way, like... No, no, that didn't...

710
00:33:50,740 --> 00:33:54,180
I don't know what your definition of
success is,

711
00:33:54,220 --> 00:33:58,060
to superintelligence. Candidates
that can get us to superintelligence are,

712
00:33:58,140 --> 00:33:59,940
are candidates that you can throw.

713
00:33:59,980 --> 00:34:03,770
There is no limit to how much, uh,
or there's no visible limit to

714
00:34:03,880 --> 00:34:05,880
how much compute you can throw at it.

715
00:34:05,920 --> 00:34:09,580
For example, if tomorrow we find out
that pre-training has stopped scaling,

716
00:34:09,600 --> 00:34:11,089
call pre-training a failed candidate.

717
00:34:11,400 --> 00:34:15,240
Currently, we have two candidates
that can absorb insane amounts of compute

718
00:34:15,320 --> 00:34:17,800
et cetera,
and can keep expecting gains from it.

719
00:34:18,400 --> 00:34:22,010
[speaker_0] No, even the paradigm
that gets us to superintelligence might

720
00:34:22,040 --> 00:34:24,240
saturate somewhere. E-e-every candidate-

721
00:34:24,250 --> 00:34:24,560
[speaker_1] Sure

722
00:34:24,580 --> 00:34:25,080
[speaker_0] ... can saturate.

723
00:34:25,220 --> 00:34:27,620
[speaker_1] They would, can saturate,
but I'm saying...

724
00:34:27,820 --> 00:34:31,400
A-and at that point, essentially, uh,
you can either draw a line and say

725
00:34:31,500 --> 00:34:35,060
intelligence is good enough.
Assuming that, okay, so you're in a,

726
00:34:35,140 --> 00:34:37,680
where you do not have enough, uh...

727
00:34:37,700 --> 00:34:40,500
You've not achieved the level of, or like,
not achieved, that you don't want to

728
00:34:40,540 --> 00:34:44,400
achieve. But I'm saying like, uh,
you think that you can still throw more

729
00:34:44,439 --> 00:34:46,800
at this and get more intelligence out of
this.

730
00:34:47,380 --> 00:34:51,180
Uh, and, uh, in that paradigm, there
are only two things that can

731
00:34:51,240 --> 00:34:53,760
absorb seemingly infinite amounts of
compute.

732
00:34:54,540 --> 00:34:55,670
Uh, and there's not been a third.

733
00:34:55,670 --> 00:34:57,680
[speaker_0] I have a problem with your
infinite thing.

734
00:34:57,820 --> 00:34:58,160
It's-

735
00:34:59,500 --> 00:35:02,560
[speaker_1] Seemingly infinite in the
sense that, sure,

736
00:35:02,600 --> 00:35:04,380
how much you can do it,
or there might be like some...

737
00:35:04,560 --> 00:35:07,240
There are obviously practical limits to
it, but there might be a theoretical limit

738
00:35:07,300 --> 00:35:10,860
as well. And I'm saying
that maybe before we get there, uh,

739
00:35:11,140 --> 00:35:14,060
superintelligence comes and then you,
you've achieved your goal

740
00:35:14,100 --> 00:35:14,700
to scale further.

741
00:35:15,720 --> 00:35:18,240
But until now,
there's no evidence to see that there

742
00:35:18,640 --> 00:35:21,580
[speaker_0] Sorry,
I'm still not super clear what your claim

743
00:35:21,620 --> 00:35:23,500
Can you like summarize this entire
argument?

744
00:35:23,580 --> 00:35:26,540
[speaker_1] Maybe I'll,
maybe I'll try to explain it in a

745
00:35:27,280 --> 00:35:31,200
A successful paradigm in my book
is one that can

746
00:35:31,260 --> 00:35:35,009
absorb all the compute capacity
that you can reasonably throw at it at

747
00:35:35,120 --> 00:35:35,680
point in time.

748
00:35:36,180 --> 00:35:39,790
[speaker_0] That point in time means what?
In that year, how many GPUs had

749
00:35:39,860 --> 00:35:41,420
humanity manufactured?

750
00:35:41,440 --> 00:35:45,140
[speaker_1] So it's also... Yeah, in
that point in time,

751
00:35:45,200 --> 00:35:49,180
So, so in 2026, uh, there
are two successful paradigms that can

752
00:35:49,260 --> 00:35:51,380
absorb all the compute
and still keep on giving gain.

753
00:35:51,700 --> 00:35:53,420
We have not exhausted these two paradigms.

754
00:35:53,870 --> 00:35:53,870
[speaker_0] Right.

755
00:35:53,880 --> 00:35:57,500
[speaker_1] And we still have like returns
to get from there,

756
00:35:57,540 --> 00:36:01,340
essentially get us to smarter models.
There are just two of them. And twenty...

757
00:36:01,680 --> 00:36:05,600
And so I'm saying like you keep
extrapolating it, our, our, our lim-- our,

758
00:36:05,760 --> 00:36:09,460
thing. So in twenty twenty--
by the time we get to 2030,

759
00:36:09,500 --> 00:36:12,540
that these two paradigms will not get to
superintelligence because there are

760
00:36:12,580 --> 00:36:15,900
practical limits to scaling them,
if not theoretical limits.

761
00:36:16,020 --> 00:36:19,940
Uh, and, uh,
we need either a third paradigm

762
00:36:19,960 --> 00:36:23,860
seventh paradigm to actually keep stacking
these S-curves to get us

763
00:36:23,920 --> 00:36:26,639
to the world that you
are claiming we'll be in, say,

764
00:36:27,000 --> 00:36:29,820
[speaker_0] Uh, okay. Sure. My probability

765
00:36:29,940 --> 00:36:33,772
of pre-training scaling plus RL scaling
gets us

766
00:36:33,832 --> 00:36:37,592
to ASI by 2030 is less, let's say,
less than

767
00:36:37,632 --> 00:36:38,092
10%.

768
00:36:38,232 --> 00:36:41,372
[speaker_1] Yeah,
like probably we do need a better, uh,

769
00:36:41,852 --> 00:36:44,292
breakthrough. Yeah.
Probably we do need another breakthrough.

770
00:36:44,512 --> 00:36:45,292
I am saying that

771
00:36:46,691 --> 00:36:50,112
these breakthroughs
are not super easy to come by.

772
00:36:50,172 --> 00:36:52,992
Dep- I think we d-
just did the S curves debate. But yeah,

773
00:36:53,012 --> 00:36:56,672
The one crux we have is
that I am very uncertain of

774
00:36:57,092 --> 00:37:00,051
you have somehow. You
are somehow more certain in

775
00:37:00,092 --> 00:37:03,512
have,
and this is just a prior belief thing.

776
00:37:03,632 --> 00:37:06,212
I don't know if there's, like,
any evidence that you can show me.

777
00:37:06,532 --> 00:37:07,122
[speaker_0] Yeah, I think there are-

778
00:37:07,132 --> 00:37:07,901
[speaker_1] And maybe I don't know

779
00:37:07,901 --> 00:37:09,112
[speaker_0] ... S curves we are tracking.
One is like,

780
00:37:10,092 --> 00:37:13,972
uh, curves for individual paradigms,
and one is like some bigger curve

781
00:37:14,012 --> 00:37:17,012
of, you know,
like humanity's ML research as a whole.

782
00:37:17,572 --> 00:37:21,212
So there,
one is like the curve of pre-training

783
00:37:21,312 --> 00:37:25,292
started saturating.
There's a curve for

784
00:37:25,392 --> 00:37:26,892
scale, when did they start saturating.

785
00:37:27,492 --> 00:37:31,042
There's a curve for, you know,
when did RL scaling start,

786
00:37:31,092 --> 00:37:34,232
might saturate someday. And then there
are multiple of these curves, but then

787
00:37:34,252 --> 00:37:38,032
there's a bigger overall trajectory of how
fast is humanity's ML

788
00:37:38,092 --> 00:37:38,792
capability growing.

789
00:37:39,192 --> 00:37:42,852
[speaker_1] Uh,
that I don't agree with because

790
00:37:42,932 --> 00:37:46,112
it. For example, if you start tracking it,
there have been multiple AI winters and

791
00:37:46,152 --> 00:37:49,932
AI summers. There
were times where people thought

792
00:37:50,412 --> 00:37:54,292
in AI, and if we just do work on GOFAI
or we just do work on,

793
00:37:54,332 --> 00:37:57,082
like, some other paradigm,
we'll be able to get to ASI from here.

794
00:37:57,112 --> 00:37:59,472
[speaker_0] Why aren't we including all of
that as data points?

795
00:37:59,992 --> 00:38:03,752
[speaker_1] So if you keep including them,
then essentially, uh, this could

796
00:38:03,812 --> 00:38:07,752
either be like a, a 1990s rush in

797
00:38:07,832 --> 00:38:10,142
AI of, like saying that, "Hey,
we have superintelligence.

798
00:38:10,272 --> 00:38:14,162
We have chess playing AI, and we are,
like, a few research ideas away from

799
00:38:14,252 --> 00:38:18,242
having, like,
superintelligence AI because we have Deep

800
00:38:18,432 --> 00:38:21,312
we'll be like, "No, that,
that S curve actually flatlined

801
00:38:21,352 --> 00:38:24,742
superintelligence."
And then you have a larger S curve of,

802
00:38:25,072 --> 00:38:28,232
um, and, uh, transformers and R- RL.

803
00:38:28,692 --> 00:38:32,312
And that S curve,
depending on where we are on the S curve,

804
00:38:32,372 --> 00:38:36,172
still saturate before we get to, uh,
superintelligence.

805
00:38:36,912 --> 00:38:40,602
Uh, and, uh, either that could happen,
that is one world that could happen,

806
00:38:40,772 --> 00:38:44,732
or another S curve gets stacked on top of
it, and it keeps going till

807
00:38:44,772 --> 00:38:48,732
we reach superintelligence. Uh, so, uh,
or maybe, like, how close we

808
00:38:48,752 --> 00:38:52,152
are to superintelligence,
essentially whatever that bar is,

809
00:38:52,192 --> 00:38:55,662
there or we need to stack more S curves to
basically get us there

810
00:38:55,692 --> 00:38:59,352
faster. Uh, I do agree
that on infinite human

811
00:38:59,372 --> 00:39:01,432
timescale,
we'll get to superintelligence at some

812
00:39:01,932 --> 00:39:04,772
Uh, but yeah, 2030 is the timeline
that we're dealing with.

813
00:39:04,792 --> 00:39:06,832
[speaker_0] Again,
I didn't understand your argument.

814
00:39:06,972 --> 00:39:10,952
If I track, you know, human-
humanity's AI research progress since,

815
00:39:11,032 --> 00:39:14,812
1970, yes,
there have been multiple spans of few

816
00:39:14,852 --> 00:39:17,912
years where we did get, you know,
some one breakthrough, something happened,

817
00:39:17,952 --> 00:39:19,952
then we had, like, you know,
20 years of nothing happening.

818
00:39:20,032 --> 00:39:21,542
Yes, there have been multiple of these.

819
00:39:22,312 --> 00:39:24,912
Uh,
right now we could be in either one of

820
00:39:24,952 --> 00:39:28,852
We might be about to reach an AI winter
or we might be

821
00:39:28,912 --> 00:39:30,792
about... Yeah,
we might get a few more breakthroughs.

822
00:39:30,812 --> 00:39:33,972
Those again might not, might
or might not get to superintelligence.

823
00:39:34,252 --> 00:39:38,052
Sure, uh, where are we in this?
So far I'm agreeing it's now a question

824
00:39:38,132 --> 00:39:40,252
how do you put the numbers on these things
and...

825
00:39:40,961 --> 00:39:43,712
[speaker_1] The, the quest-
the disagreement comes from the fact that,

826
00:39:44,192 --> 00:39:48,042
It's not a disagreement in the argument,
it's a disagreement on how many

827
00:39:48,192 --> 00:39:51,792
S, if a new S curve is needed
or will these S curves scale to super

828
00:39:51,832 --> 00:39:54,452
intelligence and how easy
are these S curves to come by.

829
00:39:54,892 --> 00:39:58,352
[speaker_0] So what would be a data point
that would actually change

830
00:39:58,752 --> 00:40:02,592
your mind? Like, for me, it's, like,
fairly, yeah, almost obvious

831
00:40:02,692 --> 00:40:06,482
that, okay, yeah, there is a b-
huge backlog of, backlog of research

832
00:40:06,572 --> 00:40:09,372
ideas that need to be,
that will definitely try-

833
00:40:09,432 --> 00:40:09,852
[speaker_1] That I think-

834
00:40:10,332 --> 00:40:11,152
[speaker_0] And all of them require-

835
00:40:11,222 --> 00:40:11,222
[speaker_1] Yeah, but-

836
00:40:11,222 --> 00:40:13,392
[speaker_0] ...
a lot of people to try and, yeah.

837
00:40:13,992 --> 00:40:17,472
[speaker_1] Sure.
I think sure there might be some research

838
00:40:17,532 --> 00:40:21,472
overhang, but, uh, the probability of us

839
00:40:21,512 --> 00:40:24,692
finding a breakthrough in the research
ideas might be below, uh...

840
00:40:24,772 --> 00:40:27,632
I think the ML research community is very,
very smart.

841
00:40:28,092 --> 00:40:32,052
Uh,
they figure out all the best candidates

842
00:40:32,102 --> 00:40:35,392
a daily basis, et cetera.
And in the last five, six years-

843
00:40:35,512 --> 00:40:38,232
[speaker_0] Yeah, I think that's where,
yeah, like,

844
00:40:38,292 --> 00:40:41,702
smart. It comes down to try, hit
and trial random shit until it works.

845
00:40:42,042 --> 00:40:45,842
[laughs] Like, I don't think people had,
like, some, uh, brilliant insight,

846
00:40:45,862 --> 00:40:48,661
"Okay, this is why this thing
is definitely going to work,"

847
00:40:48,672 --> 00:40:50,982
tried it a ton of times.
I think people just, like-

848
00:40:51,031 --> 00:40:51,362
[speaker_1] No, they did

849
00:40:51,452 --> 00:40:53,672
[speaker_0] ...
tried the random 20 random things to try

850
00:40:53,982 --> 00:40:56,672
[speaker_1] If you look to-- No,
but I don't think it

851
00:40:56,732 --> 00:41:00,512
I think the people who try it have some
intuition of, uh, why

852
00:41:00,572 --> 00:41:03,852
this could work or why this wouldn't work,
and they might be wrong or right and,

853
00:41:03,892 --> 00:41:05,731
and the outcomes might look random,
et cetera.

854
00:41:06,032 --> 00:41:09,732
But the selection of
which experiments to try definitely has

855
00:41:09,752 --> 00:41:13,352
A good AI researcher is one
that tries more successful experiments

856
00:41:13,932 --> 00:41:17,732
And a bad researcher
is they keep making bad bets on research

857
00:41:17,772 --> 00:41:18,692
keep failing at it.

858
00:41:19,492 --> 00:41:23,372
And, uh, my, the, my, my,
the data point that

859
00:41:23,452 --> 00:41:25,902
I want to look at is that despite so many,

860
00:41:26,812 --> 00:41:30,692
so much money flowing into, like,
finding research ideas, I'm sure

861
00:41:30,752 --> 00:41:32,612
we'll be able to scale this further.

862
00:41:33,012 --> 00:41:36,992
But is there an RL level
or a pre-training level idea, uh,

863
00:41:37,092 --> 00:41:41,012
already out there
that has not been tried yet because of

864
00:41:41,052 --> 00:41:44,952
compute? Because people,
like researchers are literally drawing

865
00:41:45,032 --> 00:41:48,912
whatever, um, a hat
and implementing the idea as opposed to

866
00:41:48,952 --> 00:41:51,672
reading the idea,
understanding the viability of it working

867
00:41:52,212 --> 00:41:54,632
Uh, I think, like,
like I believe in the second one.

868
00:41:54,992 --> 00:41:58,692
Uh, and if that is true, then,
then where is the idea is my question.

869
00:41:59,172 --> 00:42:00,732
Uh, and just because you got two ideas-

870
00:42:00,762 --> 00:42:03,832
[speaker_0] Yeah, that question
is definitely a crux like... Like yeah,

871
00:42:03,892 --> 00:42:06,932
On one side you have like researchers
understand nothing about the problem and

872
00:42:06,952 --> 00:42:10,572
they're just brute forcing.
On the other end of the spectrum you have,

873
00:42:10,632 --> 00:42:14,432
researcher deeply understands the thing
and they have a hypothesis

874
00:42:14,512 --> 00:42:17,372
actually running the training run,
they already know with confidence this is

875
00:42:17,412 --> 00:42:21,212
definitely going to work. And you
are saying, okay, researchers

876
00:42:21,272 --> 00:42:24,162
end of understanding things.
I'm saying they're far closer to the end

877
00:42:24,212 --> 00:42:25,152
randomly brute forcing.

878
00:42:25,612 --> 00:42:28,752
[speaker_1] I don't know. If you look,
like, if you look at, like Noam,

879
00:42:28,832 --> 00:42:32,702
heard Noam, what's his name? Yeah,
Noam Shazeer talk about,

880
00:42:32,792 --> 00:42:36,402
uh,
when they were getting into transformers,

881
00:42:36,402 --> 00:42:39,692
transformers out, when they
were essentially scaling language models,

882
00:42:39,752 --> 00:42:41,252
wrote the attention paper, et cetera.

883
00:42:41,732 --> 00:42:43,852
Uh, it wasn't like a random idea
that they had come.

884
00:42:44,172 --> 00:42:48,132
Uh,
Noam Shazeer has like this history of

885
00:42:48,252 --> 00:42:51,992
ideas to try out. He's called like a,
a magical researcher

886
00:42:52,012 --> 00:42:55,532
because he can seemingly look at like 100
ideas and figure out,

887
00:42:55,572 --> 00:42:59,072
like these are the one, two.
He has like crazy intuition of these

888
00:42:59,132 --> 00:43:02,092
ideas that could work because he
understands these things much more deeply

889
00:43:02,172 --> 00:43:02,872
average researcher.

890
00:43:03,062 --> 00:43:03,212
[speaker_0] Okay. Oh.

891
00:43:03,292 --> 00:43:05,932
[speaker_1] And there are the superstar
researchers that can look at ideas

892
00:43:05,972 --> 00:43:09,944
like breakthrough ideas much more
quickly. And I don't think it's as

893
00:43:10,004 --> 00:43:12,684
random as that, "Hey,
let me just pick one and do it, and

894
00:43:12,744 --> 00:43:14,204
Otherwise,
I'll go pick another one tomorrow."

895
00:43:14,664 --> 00:43:18,564
[speaker_0] For me, I see it more as,
yeah, like, yes, Noam Shazeer probably did

896
00:43:18,604 --> 00:43:21,984
have some intuitions, but also it
was random that he

897
00:43:22,044 --> 00:43:24,244
How would I put it?
There have been three

898
00:43:24,304 --> 00:43:26,584
Noam Shazeer directly contributed to one
of them.

899
00:43:27,084 --> 00:43:30,644
If you take any of the other research
breakthroughs which Noam Shazeer did not

900
00:43:30,704 --> 00:43:34,584
make,
and you put him in one year before

901
00:43:34,624 --> 00:43:37,824
happened and told him, "Look, here
are all these hypotheses the different

902
00:43:37,844 --> 00:43:41,444
researchers are making.
Which one do you think will work?" I don't

903
00:43:41,484 --> 00:43:44,144
have made that good a guess and told you,
"Oh, this one will work."

904
00:43:44,724 --> 00:43:48,424
[speaker_1] And I saw a Noam Shazeer, uh,
and Jeff Dean

905
00:43:48,744 --> 00:43:49,024
talk

906
00:43:49,824 --> 00:43:53,544
about this exact thing,
and from the story that they

907
00:43:54,064 --> 00:43:56,864
their meeting perspective and understand,
et cetera.

908
00:43:57,224 --> 00:43:58,524
[speaker_0] Yeah, if what you
are saying is correct-

909
00:43:58,614 --> 00:43:58,614
[speaker_1] [laughs]

910
00:43:58,614 --> 00:44:02,564
[speaker_0] ... there should be, like,
the same researcher who's consistently

911
00:44:02,604 --> 00:44:05,414
where the field is heading multiple times
and should be multiple times-

912
00:44:05,744 --> 00:44:05,954
[speaker_1] That's true

913
00:44:05,964 --> 00:44:09,644
[speaker_0] ... not 100% co- correct, but,
like, roughly able to see, okay,

914
00:44:09,684 --> 00:44:12,464
are probably going to work
and then actually roughly ends up correct.

915
00:44:13,064 --> 00:44:13,614
Whereas I'm saying no actually-

916
00:44:13,624 --> 00:44:14,844
[speaker_1] That's been true

917
00:44:14,944 --> 00:44:15,904
[speaker_0] ... I'm saying no actually-

918
00:44:15,964 --> 00:44:18,744
[speaker_1] Because the amount of,
the amount of breakthroughs have come

919
00:44:18,944 --> 00:44:21,854
No, the number of breakthroughs
that have come from these superstar

920
00:44:21,884 --> 00:44:25,334
like, very, very high. Why
is Ilya Sutskever around all big

921
00:44:25,384 --> 00:44:27,884
'Cause he has a crazy sense of
which research ideas can work.

922
00:44:28,284 --> 00:44:30,304
Why is Noam Shazeer around all these big
breakthroughs?

923
00:44:30,334 --> 00:44:32,364
'Cause he has a crazy idea of all these
things to do.

924
00:44:32,404 --> 00:44:36,184
There's a random reason why a random ML
PhD you've never heard of comes up with a

925
00:44:36,304 --> 00:44:38,964
crazy idea.
It's mostly because you literally have to

926
00:44:39,784 --> 00:44:43,663
uh, what's that guy's name,
who's the GPT-2 main, Alec

927
00:44:43,724 --> 00:44:47,524
Radford type level researcher. Apparently,
Alec Radford has such a great

928
00:44:47,704 --> 00:44:51,464
sense of what could work. He's like,
he literally used to

929
00:44:51,544 --> 00:44:55,434
do small experiments on Jupyter Notebooks,
and

930
00:44:55,524 --> 00:44:59,184
he then once he got convinced
that this could work,

931
00:44:59,244 --> 00:45:02,184
engineering to Greg Brockman
or somebody who's like, "Yeah,

932
00:45:02,284 --> 00:45:05,964
Just keep scaling it up. No,
I'm sure it will work." Uh,

933
00:45:06,104 --> 00:45:08,914
built so much intuition about, like,
and what couldn't work

934
00:45:09,024 --> 00:45:12,804
this crazy ML whisperer guy who can just,
like, look at the

935
00:45:12,884 --> 00:45:16,644
shape of the model and figure out, like,
these are ideas worth pursuing, not worth

936
00:45:16,664 --> 00:45:19,813
pursuing. And if you look at, like,
Thinking Machines Lab, which is, like,

937
00:45:19,884 --> 00:45:23,684
lab filled, filled with all these guys,
John Schulman, uh, what's his name, Alec

938
00:45:23,724 --> 00:45:27,584
Radford, all the OG co-founder guys,
they have essentially had, like,

939
00:45:27,764 --> 00:45:31,384
free rein to do whatever.
And the best they came up with

940
00:45:31,764 --> 00:45:35,664
that updated me towards that, oh, okay,
there's, like, a lot of things to do here,

941
00:45:35,704 --> 00:45:39,634
but there is no crazy research
breakthrough paradigm that, that

942
00:45:39,764 --> 00:45:43,444
oh, we got, like,
a pre-training level paradigm

943
00:45:43,524 --> 00:45:46,364
stack on pre-training and get, like,
insane results.

944
00:45:46,404 --> 00:45:50,304
[speaker_0] Yeah,
I think I've identified a data point

945
00:45:50,384 --> 00:45:54,284
on this. If you, uh, again, from these,
you know, three or four superstar

946
00:45:54,304 --> 00:45:58,104
researchers,
if you're able to document a public track

947
00:45:58,264 --> 00:46:02,164
of, well, uh, yeah, since maybe 2018

948
00:46:02,244 --> 00:46:04,844
till 2026, some at least 10 years or no.

949
00:46:05,864 --> 00:46:09,084
Well, yeah, at least, yeah,
like more than five, at least seven,

950
00:46:09,664 --> 00:46:13,624
consistent track record. Like, okay,
here they made these predictions in 2018.

951
00:46:13,684 --> 00:46:15,644
They made these predictions in 2020.
They made these-

952
00:46:15,664 --> 00:46:16,604
[speaker_1] Dario is one guy

953
00:46:16,684 --> 00:46:18,813
[speaker_0] ... in 2022,
and they made these in 2024.

954
00:46:19,004 --> 00:46:22,564
And, like,
they didn't single-handedly do all the

955
00:46:22,604 --> 00:46:25,224
roughly able to see where the next
breakthroughs are going to come from.

956
00:46:25,544 --> 00:46:29,124
If you can show me, okay,
just the same guy

957
00:46:29,144 --> 00:46:32,264
trend,
then that would actually shift my

958
00:46:32,944 --> 00:46:36,304
[speaker_1] I don't know.
I think I 100% believe what you're saying

959
00:46:36,324 --> 00:46:40,164
true. Uh, I don't know how... Like, I'm,
I'm trying to think of what are ways to

960
00:46:40,204 --> 00:46:44,084
show you this is happening. Uh,
one way to do this would be that if you

961
00:46:44,124 --> 00:46:48,104
look at the large breakthroughs,
and you look at who's responsible

962
00:46:48,124 --> 00:46:51,124
or who's close to those breakthroughs,
it will seem like it's the same people.

963
00:46:51,684 --> 00:46:55,304
And that should update you towards that,
hey, how come, uh, Ilya

964
00:46:55,313 --> 00:46:57,424
Sutskever is involved with all the big
breakthroughs?

965
00:46:57,484 --> 00:47:01,024
How come it came from the same guy who did
AlexNet, is the same guy who did

966
00:47:01,424 --> 00:47:05,144
GPT-2, is the same guy who did RL? Why,
why is it the same guy who's doing all

967
00:47:05,164 --> 00:47:08,424
these other things? Why
is Noam Shazeer building all of these

968
00:47:08,504 --> 00:47:10,474
Uh,
why is Alec Radford building all these

969
00:47:10,844 --> 00:47:13,834
It's because they have figured out or,
or they have impeccable...

970
00:47:14,124 --> 00:47:18,064
So they,
they talk about impeccable research taste,

971
00:47:18,074 --> 00:47:21,714
taste is what is really hard.
And research taste is this intuition that

972
00:47:21,724 --> 00:47:25,624
researchers have
that can figure out from a pile of, like,

973
00:47:25,664 --> 00:47:28,374
one to worth trying from the compute we
have to get the breakthrough.

974
00:47:28,784 --> 00:47:32,624
[speaker_0] Yeah,
so literally what you said,

975
00:47:32,684 --> 00:47:36,664
here is Alec Radford's track record of
research hypotheses going back

976
00:47:36,684 --> 00:47:37,404
entire eight years-

977
00:47:37,464 --> 00:47:41,064
[speaker_1] But why won't you just buy
Ilya's track record?

978
00:47:41,204 --> 00:47:44,324
[speaker_0] Uh,
I literally don't know enough about this

979
00:47:44,414 --> 00:47:47,204
What did Ilya say in 2018?
What did he say in '19?

980
00:47:47,224 --> 00:47:48,134
[speaker_1] No, he doesn't say anything.

981
00:47:48,134 --> 00:47:48,474
[speaker_0] I'm literally not saying-

982
00:47:48,524 --> 00:47:52,084
[speaker_1] But basically the fact
that he doesn't,

983
00:47:52,164 --> 00:47:56,104
Basically,
if he doesn't have to publicly make any of

984
00:47:56,164 --> 00:47:59,924
there's a reason why all big breakthroughs
are around one person, it

985
00:48:00,024 --> 00:48:03,184
stands to reason
that this person picks better research

986
00:48:04,084 --> 00:48:06,294
person who's picking research ideas at a
random, at random.

987
00:48:06,864 --> 00:48:07,294
[speaker_0] No, no, but-

988
00:48:07,484 --> 00:48:07,614
[speaker_1] Like-

989
00:48:07,844 --> 00:48:11,764
[speaker_0] Was he personally the one who
did the breakthrough,

990
00:48:11,804 --> 00:48:14,024
to be at the lab where somebody else did
the breakthrough?

991
00:48:14,424 --> 00:48:16,384
[speaker_1] No, no. He personally
was overseeing research.

992
00:48:16,424 --> 00:48:19,224
He personally was green-lighting the
experiments that he thinks would work.

993
00:48:19,584 --> 00:48:22,984
[speaker_0] Okay. Uh, okay,
then we can take literally Ilya Sutskever

994
00:48:23,044 --> 00:48:26,944
example. Uh,
which breakthroughs would you say, okay,

995
00:48:26,964 --> 00:48:30,694
is significantly responsible for versus
which ones you think he just happened

996
00:48:30,744 --> 00:48:31,264
to be there?

997
00:48:31,274 --> 00:48:31,274
[speaker_1] Deep learning.

998
00:48:31,304 --> 00:48:31,484
[speaker_0] Sorry?

999
00:48:31,544 --> 00:48:34,664
[speaker_1] Deep learning, he was--
Deep learning, he

1000
00:48:36,004 --> 00:48:38,124
[speaker_0] No,
when you say deep learning,

1001
00:48:39,024 --> 00:48:42,084
Okay, I'll explain it. Cool. Uh, yeah,
I agree with you. Fine.

1002
00:48:42,184 --> 00:48:45,824
Ilya was significantly responsible,
a-along with other people, significantly

1003
00:48:45,884 --> 00:48:47,104
responsible for AlexNet, sure.

1004
00:48:47,164 --> 00:48:50,534
[speaker_1] GPT, the GPT ideas he
was significantly responsible for.

1005
00:48:51,944 --> 00:48:54,804
Just training transformers. GPT-1 also.

1006
00:48:55,144 --> 00:48:58,724
The idea that we can essentially scale
transformers or we can, we can find a

1007
00:48:58,784 --> 00:49:01,944
scalable transformer paradigm
and build language models from it.

1008
00:49:02,704 --> 00:49:06,234
[speaker_0] Uh, okay. So which model
was this? Was this GPT-1?

1009
00:49:06,504 --> 00:49:08,504
[speaker_1] This was the generative
pre-trained transformer paper.

1010
00:49:08,844 --> 00:49:11,024
He is directly responsible there.

1011
00:49:11,084 --> 00:49:13,894
I think if we look at the GPT-1 paper-

1012
00:49:13,894 --> 00:49:17,064
[speaker_0] One second please.
I know this, I know it's annoying to,

1013
00:49:17,124 --> 00:49:20,544
things middle of video, but, like,
I actually want to now go read

1014
00:49:20,554 --> 00:49:20,664
is.

1015
00:49:21,044 --> 00:49:23,504
[speaker_1] Sure. It's this paper.
I will send it to you.

1016
00:49:23,936 --> 00:49:27,046
[speaker_0] Oh, okay. It's in the chat.
See, uh, when was this published?

1017
00:49:27,416 --> 00:49:28,736
[speaker_1] Five years ago.
Twenty twenty...

1018
00:49:29,965 --> 00:49:31,956
No, earlier than that. June 2018.

1019
00:49:32,436 --> 00:49:33,636
[speaker_0] Is it June? You sure?

1020
00:49:34,176 --> 00:49:37,266
[speaker_1] This is the AI summary. Yeah.
Yeah, June 2018.

1021
00:49:37,616 --> 00:49:41,215
[speaker_0] Okay, fine. Take care.
I will buy that. Okay, fine.

1022
00:49:41,276 --> 00:49:44,596
Ilya has been there at two major
breakthroughs. Fine.

1023
00:49:44,946 --> 00:49:48,896
[speaker_1] Then he's also been there at,
uh, this thing, uh, for

1024
00:49:48,976 --> 00:49:52,696
the o1 breakthrough as well. Uh,
Ilya didn't

1025
00:49:52,736 --> 00:49:56,576
green-light the experiment. Ilya
was heading research that time and

1026
00:49:56,596 --> 00:49:59,536
green-lighted the o1 experiment. RL,
basically RL scaling.

1027
00:49:59,896 --> 00:50:02,446
And if you look at Dario-- Sorry. Sorry.

1028
00:50:02,456 --> 00:50:03,216
[speaker_0] Dario, for this-

1029
00:50:03,776 --> 00:50:07,036
[speaker_1] Uh, he was at OpenAI at
that time. He was head of research.

1030
00:50:07,416 --> 00:50:11,066
He was personally seeing all the AI
research that was happening at OpenAI, and

1031
00:50:11,176 --> 00:50:11,956
OpenAI came out-

1032
00:50:12,066 --> 00:50:15,716
[speaker_0] No,
but if he's head of research,

1033
00:50:15,756 --> 00:50:19,616
works in his lab, even
if he does not want to, you know, do that,

1034
00:50:19,656 --> 00:50:21,676
hypothesis or, you know, prioritize it.

1035
00:50:21,686 --> 00:50:22,946
[speaker_1] You don't have to suggest the
hypothesis.

1036
00:50:23,016 --> 00:50:26,356
I'm saying researchers
are mostly the same

1037
00:50:26,396 --> 00:50:28,716
experiments, the best researchers.

1038
00:50:29,476 --> 00:50:31,686
The best researchers know which,
which can control-

1039
00:50:31,696 --> 00:50:35,076
[speaker_0] But like OpenAI tried 100
things. Whichever of,

1040
00:50:35,116 --> 00:50:37,836
worked,
he could take credit for it simply because

1041
00:50:38,276 --> 00:50:42,146
[speaker_1] Sure, but if, if there
are 10,000 things that OpenAI didn't try,

1042
00:50:42,236 --> 00:50:45,916
if there are three big paradigm,
three big breakthroughs

1043
00:50:46,256 --> 00:50:50,036
AI,
and the same guy has been around for all

1044
00:50:50,196 --> 00:50:53,946
the set of the things he tried. It's the,
the, the set is all the things that are

1045
00:50:53,946 --> 00:50:57,796
out there that he didn't try,
and out of which he was able to freak out

1046
00:50:57,896 --> 00:50:59,316
or be around all three things.

1047
00:50:59,676 --> 00:51:03,546
[speaker_0] No,
there are ways to be around the guy who

1048
00:51:03,596 --> 00:51:04,346
gives the correct hypothesis.

1049
00:51:04,406 --> 00:51:08,136
[speaker_1] But he's not like Sam Altman,
who was probably around it, but he

1050
00:51:08,176 --> 00:51:09,455
committed to the research direction.

1051
00:51:09,956 --> 00:51:10,236
Anyway-

1052
00:51:10,376 --> 00:51:10,455
[speaker_0] Okay

1053
00:51:10,476 --> 00:51:13,576
[speaker_1] ... this is by the way,
like even if you don't strongly believe

1054
00:51:14,076 --> 00:51:15,676
[speaker_0] No, no, this
is very important. Like this part

1055
00:51:15,716 --> 00:51:19,536
Like, uh,
was he the one who personally selected the

1056
00:51:19,566 --> 00:51:21,146
"Okay, this one is worth trying,
we should try it"?

1057
00:51:21,196 --> 00:51:21,916
[speaker_1] Yes.

1058
00:51:21,996 --> 00:51:22,216
[speaker_0] Or-

1059
00:51:22,296 --> 00:51:25,816
[speaker_1] Yes, I, I did the...
I'll tell,

1060
00:51:25,856 --> 00:51:29,516
But there is an interview of Dario Amodei

1061
00:51:29,956 --> 00:51:33,636
when he was working with Ilya Sutskever,
who I think it's Dario

1062
00:51:33,676 --> 00:51:37,076
Amodei, who basically, uh, I don't know
if it's that.

1063
00:51:37,116 --> 00:51:40,006
Anyway, it was, I think,
one of these interviews

1064
00:51:40,036 --> 00:51:43,616
about Ilya Sutskever,
and he's saying then he came,

1065
00:51:43,716 --> 00:51:47,526
research direction, saying
that we need to do X,

1066
00:51:47,596 --> 00:51:51,216
we'll do this, and we'll do that.
And Ilya Sutskever drew two circles,

1067
00:51:51,856 --> 00:51:55,496
two concentric circles. Inside he--
In one he-- And this was pre o1.

1068
00:51:55,856 --> 00:51:59,776
He wrote pre-training,
and outside he wrote RL, and he said,

1069
00:52:00,056 --> 00:52:02,566
[speaker_0] Okay. Uh,
if you can send me this, that will help.

1070
00:52:02,626 --> 00:52:05,516
[speaker_1] I'll find, to find that clip.
I'll try to find that clip.

1071
00:52:05,536 --> 00:52:08,706
And this was like much before o1,
when I think Dario or whoever

1072
00:52:08,816 --> 00:52:12,696
and, and the guy was like, "Okay, uh, it

1073
00:52:12,736 --> 00:52:13,696
makes sense." Like,

1074
00:52:14,616 --> 00:52:16,816
why am I complicating the research agenda
that long?

1075
00:52:17,156 --> 00:52:19,956
[speaker_0] Okay, sure.
If you send me this, that will again,

1076
00:52:19,996 --> 00:52:23,246
Like now you have given me three different
data points, and Ilya was involved

1077
00:52:23,296 --> 00:52:26,696
directly, like not just like, okay,
researcher overseeing, but he

1078
00:52:26,716 --> 00:52:29,186
picking the hypothesis
and saying this will work. Yeah.

1079
00:52:29,216 --> 00:52:32,196
[speaker_1] Yeah. Yeah.
I think I can to find... One second.

1080
00:52:32,616 --> 00:52:36,016
Let me just do a random cloud search to
see if they can find the things.

1081
00:52:36,436 --> 00:52:39,986
Last time I remembered, maybe we'll find,
but I, I've seen it and,

1082
00:52:40,576 --> 00:52:44,336
uh, provided this is true, uh,
would you agree that research is

1083
00:52:44,396 --> 00:52:46,436
not as random as picking ideas from a hat?

1084
00:52:46,736 --> 00:52:50,016
[speaker_0] Uh, yeah. If you show me that,
yeah, now three different

1085
00:52:50,056 --> 00:52:53,766
breakthroughs, uh,
Ilya Sutskever personally was helping

1086
00:52:53,876 --> 00:52:57,696
pick the hypothesis rather than just
happen to be in the same room or

1087
00:52:57,736 --> 00:53:01,416
overseeing the same lab.
If you show this across three

1088
00:53:01,556 --> 00:53:05,545
that would tell me that there
is something spec- some specific way

1089
00:53:05,616 --> 00:53:08,736
Ilya Sutskever personally looks at this
problem, which almost nobody else in the

1090
00:53:08,776 --> 00:53:10,156
world has. Yeah.

1091
00:53:10,196 --> 00:53:13,646
[speaker_1] It's time to find out
which interview. I think it

1092
00:53:14,116 --> 00:53:18,036
Anyway, cool. Uh, I think
there are a couple of other things

1093
00:53:18,076 --> 00:53:21,976
that I think I disagreed with. Uh,
so one was that research direction

1094
00:53:22,116 --> 00:53:25,596
there might not be as many low-hanging
fruits as you think there are.

1095
00:53:25,716 --> 00:53:29,316
Uh, so 2030 might be in this thing.
That was one crux he identified.

1096
00:53:29,376 --> 00:53:33,336
The other one was that, uh,
intelligence is easier to build than

1097
00:53:33,476 --> 00:53:37,336
I thought. This is something
that I think I've changed my mind on since

1098
00:53:37,396 --> 00:53:41,206
Ooty, uh, like since we last spoke, uh,
about this.

1099
00:53:41,266 --> 00:53:45,076
It's basically the, if you've,
if you saw the Richard Sutton

1100
00:53:45,516 --> 00:53:46,656
Dwarkesh interview,

1101
00:53:47,516 --> 00:53:48,146
uh, TCS-

1102
00:53:48,296 --> 00:53:49,236
[speaker_0] Yes.

1103
00:53:49,576 --> 00:53:51,476
[speaker_1] Do you remember what they
spoke about?

1104
00:53:51,556 --> 00:53:55,256
[speaker_0] I think Richard Sutton's
timelines were also something 25% by 2030,

1105
00:53:55,316 --> 00:53:56,296
remember correctly.

1106
00:53:56,476 --> 00:53:58,896
[speaker_1] No.
He thinks it's more LL by LL-

1107
00:53:59,876 --> 00:54:03,436
[speaker_0] I am pretty confident Sutton
had something like next five to 10 years

1108
00:54:03,456 --> 00:54:05,266
chance ASI. But yeah-

1109
00:54:05,516 --> 00:54:05,956
[speaker_1] But he thinks-

1110
00:54:05,996 --> 00:54:08,426
[speaker_0] I can't quite remember.
Actually, you know why-

1111
00:54:08,536 --> 00:54:10,886
[speaker_1] But, but I also think
that this whole idea that-

1112
00:54:11,536 --> 00:54:14,256
[speaker_0] Ilya's timelines, what
are his...

1113
00:54:14,676 --> 00:54:15,366
Give me a minute.

1114
00:54:15,936 --> 00:54:19,656
[speaker_1] Yeah,
and I also don't think this whole idea

1115
00:54:19,716 --> 00:54:23,136
agrees with you on timelines,
it doesn't matter. Any-

1116
00:54:23,236 --> 00:54:27,066
Like it does matter what shape of beliefs
he has and why he agrees to those

1117
00:54:27,136 --> 00:54:29,596
timelines and what shape of beliefs you
have and why you agree to the timelines.

1118
00:54:29,676 --> 00:54:31,476
Uh,
there might be a fundamental disagreement

1119
00:54:31,536 --> 00:54:34,516
You could update from some of his beliefs,
not all of his beliefs, even though his

1120
00:54:34,536 --> 00:54:35,436
conclusions are the same.

1121
00:54:35,996 --> 00:54:39,796
[speaker_0] No, uh,
like I initially started from a worldview

1122
00:54:39,876 --> 00:54:40,096
of

1123
00:54:41,116 --> 00:54:44,936
like, even among the genius
researchers, picking which research

1124
00:54:44,976 --> 00:54:48,916
hypothesis works is kind of random,
and it requires just a lot of hit

1125
00:54:48,936 --> 00:54:51,166
and trial,
and none of these people really know.

1126
00:54:51,216 --> 00:54:55,146
You are trying to update me more towards a
worldview of, no, there are a few

1127
00:54:55,156 --> 00:54:59,036
genius researchers here who consistently
seem to get all of the

1128
00:54:59,076 --> 00:55:03,056
predictions right. And I'm like,
let's say I did update to your

1129
00:55:03,076 --> 00:55:06,616
worldview that all the more means I want
to know, okay, what are these people's

1130
00:55:06,656 --> 00:55:09,536
timelines? If now you're saying, okay,
I should defer to these people now.

1131
00:55:09,576 --> 00:55:10,346
[speaker_1] No, I'm sure. Go, go find out-

1132
00:55:10,346 --> 00:55:13,386
[speaker_0] I literally need to know what
does Ilya Sutskever

1133
00:55:13,556 --> 00:55:14,156
Yeah, like-

1134
00:55:14,316 --> 00:55:16,396
[speaker_1] Go find their timelines.
That's not the point I was making.

1135
00:55:16,456 --> 00:55:19,316
I was trying to make a point that you
were saying that just because Ilya has

1136
00:55:19,356 --> 00:55:23,096
timelines and you have short timelines,
it doesn't matter, uh,

1137
00:55:23,176 --> 00:55:27,116
what Ilya's arguments on research
direction are or Ilya's time, like

1138
00:55:27,156 --> 00:55:29,976
Ilya's stance is on why
and how we get this breakthrough.

1139
00:55:30,176 --> 00:55:31,116
[speaker_0] Those also matter. I agree.

1140
00:55:31,416 --> 00:55:33,826
[speaker_1] Okay. So it-- So cool.
Find Sutton's timelines.

1141
00:55:33,916 --> 00:55:37,656
But the point Sutton made was that, uh,
evolution actually gave us

1142
00:55:37,716 --> 00:55:41,436
language very, very late. Uh,
and most of evolution was

1143
00:55:41,476 --> 00:55:45,306
trying to optimize for things
that we take for granted, which

1144
00:55:45,916 --> 00:55:49,396
uh, whatever,
like being physical dexterity, et cetera.

1145
00:55:49,416 --> 00:55:53,216
And, uh, those things essentially,
which according to you are part of

1146
00:55:53,256 --> 00:55:57,056
intelligence,
those things actually took a, like,

1147
00:55:57,136 --> 00:56:00,356
time. Uh, and, uh,
those would be very hard to do.

1148
00:56:00,796 --> 00:56:04,676
Uh, the second thing that he says is that,
uh, he thinks that if you can get

1149
00:56:05,416 --> 00:56:08,596
good at doing those parts,
then everything else falls into place.

1150
00:56:09,156 --> 00:56:13,076
Uh, but, uh, the argument is
if you can get to language, uh,

1151
00:56:13,176 --> 00:56:15,076
then getting to the physical stuff
is easy.

1152
00:56:15,236 --> 00:56:19,096
And he's like, "No, most of it
is getting to the physical stuff." And

1153
00:56:19,156 --> 00:56:22,516
then this language part
and all of these parts is something

1154
00:56:22,696 --> 00:56:26,356
[speaker_0] Sorry,
I want to interrupt you because I feel

1155
00:56:26,436 --> 00:56:29,976
points we can both go check, and
if we check,

1156
00:56:30,026 --> 00:56:33,696
our debate more productive.
Data point number one of, you know, Ilya

1157
00:56:33,736 --> 00:56:36,726
Sutskever being personally present at all
these three breakthroughs, I think you've

1158
00:56:36,996 --> 00:56:40,856
proved it for AlexNet, I agree.
For the GPT-1 thing,

1159
00:56:40,896 --> 00:56:43,276
also kind of agree. More data would help,
but I kind of agree.

1160
00:56:43,336 --> 00:56:46,116
The o1 thing,
I'm not yet convinced Ilya

1161
00:56:46,156 --> 00:56:49,416
If you show me data, I can be convinced.
Uh, that's one data point.

1162
00:56:49,536 --> 00:56:49,686
[speaker_1] Yeah. I think I found it.

1163
00:56:49,686 --> 00:56:53,426
[speaker_0] The other data point I want
is, uh, yeah, literally what

1164
00:56:53,476 --> 00:56:57,396
What are Sutskever's timelines? Uh,
and I'm saying, like, let's

1165
00:56:57,416 --> 00:57:00,056
first get these data points
and then let's continue the discussion.

1166
00:57:00,256 --> 00:57:00,646
I think that-

1167
00:57:00,716 --> 00:57:02,375
[speaker_1] Found it. I think I found it.

1168
00:57:02,396 --> 00:57:06,156
[speaker_2] The people who
are most responsible for that

1169
00:57:06,196 --> 00:57:10,116
and Jakub Pachocki. I think even like, uh,
like,

1170
00:57:10,336 --> 00:57:11,746
uh, Dota was kind of-

1171
00:57:11,746 --> 00:57:15,056
[speaker_1] We can go like a few seconds
before because he's talking about o1,

1172
00:57:15,416 --> 00:57:16,756
verify that he's talking about o1.

1173
00:57:17,116 --> 00:57:17,396
[speaker_0] And

1174
00:57:18,216 --> 00:57:22,036
okay, uh,
I might agree with you by the way,

1175
00:57:22,116 --> 00:57:25,756
note, I'm like,
why has this not been documented

1176
00:57:25,896 --> 00:57:29,296
there for all the three breakthroughs?
That seems like a very big deal.

1177
00:57:29,896 --> 00:57:30,976
[speaker_1] What do you mean it's not
documented?

1178
00:57:31,656 --> 00:57:35,516
[speaker_0] Why isn't there like either a
Hacker News post or a LessWrong post

1179
00:57:35,576 --> 00:57:37,936
saying here is the evidence Ilya
was there at all three breakthroughs?

1180
00:57:38,276 --> 00:57:41,976
[speaker_1] But that seems,
I think it's common knowledge.

1181
00:57:42,256 --> 00:57:45,596
I'm surprised you didn't,
and I'm surprised you didn't

1182
00:57:45,996 --> 00:57:48,876
see this,
but basically it's been talked about in a

1183
00:57:49,356 --> 00:57:52,965
Everyone knows that Alec Radford,
Ilya Sutskever, Noam Shazeer are these

1184
00:57:53,036 --> 00:57:56,676
like insane superstar researchers who
whatever they touch, whatever

1185
00:57:56,756 --> 00:57:59,206
ideas they pick turn out to be the right
candidates always.

1186
00:57:59,616 --> 00:58:00,805
There's another one where Jeff Dean
and Noam Shazeer-

1187
00:58:00,805 --> 00:58:03,836
[speaker_0] Okay.
It's definitely not common knowledge,

1188
00:58:03,856 --> 00:58:05,856
to actually write this up
and update a bunch of people.

1189
00:58:05,916 --> 00:58:09,766
Like,
I think I'm not the only one for whom this

1190
00:58:09,766 --> 00:58:11,076
from you this. Yeah.

1191
00:58:11,786 --> 00:58:15,626
[speaker_1] Yeah. But, uh, okay, cool. Uh,
I think I've-- The clip

1192
00:58:15,636 --> 00:58:19,046
something Claude found out. Uh, this
is not the clip that I was talking about

1193
00:58:19,056 --> 00:58:22,926
originally, but I think this
is even tighter evidence than what I

1194
00:58:23,896 --> 00:58:27,196
because he's directly saying
that the reasoning breakthrough came from

1195
00:58:27,216 --> 00:58:28,886
Pachocki and Ilya Sutskever.

1196
00:58:29,176 --> 00:58:33,116
[speaker_0] Okay. Yeah. I mean,
I will need a few minutes to properly go

1197
00:58:33,176 --> 00:58:34,056
okay, fine.

1198
00:58:34,196 --> 00:58:34,676
[speaker_1] No stress, no stress.

1199
00:58:34,716 --> 00:58:38,496
[speaker_0] Like for now I can maybe buy
it. Okay. Let's, uh, buy this.

1200
00:58:38,556 --> 00:58:42,076
Okay, cool. Yeah.
Also then I want to know, yeah, what

1201
00:58:42,136 --> 00:58:45,446
timelines then if you know that. Uh,
any of the people you mentioned-

1202
00:58:45,446 --> 00:58:45,886
[speaker_1] We can just ask chat

1203
00:58:45,896 --> 00:58:48,666
[speaker_0] ... Alec Radford
or Noam Shazeer or Ilya Sutskever. Yeah.

1204
00:58:48,685 --> 00:58:52,546
[speaker_1] Yeah.
This is one post I found on the EA Forum, uh,

1205
00:58:52,596 --> 00:58:56,166
run by Lessig, uh,
where they're talking about Sutskever's--

1206
00:58:56,216 --> 00:58:59,366
criticizing Sutskever for not having
transparency on his timelines and

1207
00:59:00,076 --> 00:59:00,976
not saying why.

1208
00:59:02,596 --> 00:59:06,456
[speaker_0] I think you have successfully
updated me a bit, well,

1209
00:59:07,056 --> 00:59:10,986
towards the idea
that Ilya Sutskever personally, like not

1210
00:59:11,016 --> 00:59:14,376
even any of the other AI researchers in
this space, but Ilya Sutskever

1211
00:59:14,516 --> 00:59:18,376
specifically has like great research taste
and he's consistently able

1212
00:59:18,456 --> 00:59:20,076
to pick good research hypotheses.

1213
00:59:20,116 --> 00:59:20,656
[speaker_1] Good.

1214
00:59:20,756 --> 00:59:24,556
[speaker_0] Not yet shifted my timelines
by a lot. I think that's where I'm at.

1215
00:59:26,736 --> 00:59:26,745
[laughs]

1216
00:59:26,776 --> 00:59:26,846
[speaker_1] Yeah. I have given you-

1217
00:59:26,876 --> 00:59:30,776
[speaker_0] Because yeah, again,
like it's not obvious to me, okay,

1218
00:59:30,836 --> 00:59:34,696
if you told me, "Okay,
update towards Ilya

1219
00:59:34,736 --> 00:59:38,206
than you," then sure. But
are Ilya's timelines

1220
00:59:38,236 --> 00:59:38,965
[speaker_1] Ilya's timelines are--

1221
00:59:40,676 --> 00:59:43,856
His minimum is still longer than your
maximum.

1222
00:59:44,576 --> 00:59:47,976
Anyway, maybe not. I don't know.
And I think his definition of

1223
00:59:47,996 --> 00:59:48,776
might be very different.

1224
00:59:49,876 --> 00:59:53,726
[speaker_0] Fair. So yeah,
now we will have to go more into

1225
00:59:54,356 --> 00:59:54,796
Uh, okay.

1226
00:59:54,856 --> 00:59:55,876
[speaker_1] We can end the recording
there.

1227
00:59:56,356 --> 00:59:57,356
[speaker_0] Okay. Uh.

1228
00:59:57,396 --> 01:00:00,876
[outro music]