1 00:00:00,100 --> 00:00:04,000 [speaker_0] Pre-training scaling 100% does not work or here is an argument 2 00:00:04,160 --> 00:00:07,830 RL scaling 100% does not work. It's just intuitions that 3 00:00:07,940 --> 00:00:11,830 pre-training scale probably doesn't get there, RL scaling probably doesn't 4 00:00:11,880 --> 00:00:12,160 get there 5 00:00:12,180 --> 00:00:16,020 [speaker_1] ... essentially just by scaling RL 6 00:00:16,100 --> 00:00:19,920 just by scaling RL and pre-training, uh, getting to ASI just in the 7 00:00:20,040 --> 00:00:23,830 current paradigm by 2030, uh, in your definition of intelligence 8 00:00:23,920 --> 00:00:24,920 seems very, very unlikely. 9 00:00:24,960 --> 00:00:28,480 [speaker_0] But yeah, researcher, you can right now take a pen and paper, 10 00:00:28,540 --> 00:00:32,500 and come up with, okay, if I had huge amount of GPUs, here 11 00:00:32,540 --> 00:00:34,710 would try, and you can write them down. 12 00:00:34,730 --> 00:00:34,730 [speaker_1] Yeah. 13 00:00:34,730 --> 00:00:36,880 [speaker_0] And then we will ask, okay, well, why haven't these... 14 00:00:37,380 --> 00:00:41,160 Like, did somebody at, you know, OpenAI, Anthropic try them and they failed, or did 15 00:00:41,220 --> 00:00:42,400 nobody try them? 16 00:00:42,540 --> 00:00:44,980 [speaker_1] Uh, maybe you're right and we're on the vertical part of the S 17 00:00:45,380 --> 00:00:49,130 I think if you just plot it from AlexNet, it's mostly the same 18 00:00:49,140 --> 00:00:52,740 paradigm. Uh, and I would count AlexNet 19 00:00:53,460 --> 00:00:57,400 as one breakthrough, Transformer as another breakthrough, 20 00:00:57,460 --> 00:00:57,820 breakthrough. 21 00:00:58,900 --> 00:01:01,990 [speaker_0] Hello, everyone. Uh, my name is Samuel Shadrach. 22 00:01:02,700 --> 00:01:06,320 I graduated from IIT Delhi. I've been following the whole 23 00:01:06,440 --> 00:01:09,780 AI timelines debate for s- a while now. 
24 00:01:09,920 --> 00:01:13,400 By when will humanity get superintelligence? 25 00:01:13,500 --> 00:01:16,560 Is this a good thing or a bad thing? What can we... 26 00:01:16,620 --> 00:01:19,940 And now I have a fairly strong opinion it's a bad thing. We should stop it. 27 00:01:20,000 --> 00:01:22,720 That's a whole separate discussion that we can have elsewhere. 28 00:01:23,100 --> 00:01:27,090 Today, I have with me, uh, Raghav. Uh, we are 29 00:01:27,140 --> 00:01:30,140 going to specifically just discuss, uh, AI timelines. 30 00:01:30,380 --> 00:01:33,480 Uh, when do we think, you know, superintelligence will come? 31 00:01:33,680 --> 00:01:37,060 Uh, you know, assuming the research, current development continues. 32 00:01:37,220 --> 00:01:38,980 Raghav, would you like to introduce yourself? 33 00:01:39,140 --> 00:01:42,980 [speaker_1] I've been a friend of Sam. I've been following the AI safety debate 34 00:01:43,040 --> 00:01:46,880 for the last five, six years now, uh, and very 35 00:01:46,920 --> 00:01:50,680 interested in the subject. I have some strong opinions as well, 36 00:01:50,820 --> 00:01:54,340 as Samuel, but, uh, I have slightly differing opinions from 37 00:01:54,400 --> 00:01:56,260 That's why I'm keen to talk about this. 38 00:01:56,800 --> 00:02:00,680 [speaker_0] Yeah. First of all is, yeah, uh, the whole pre-training 39 00:02:00,800 --> 00:02:03,260 scaling, which right now everyone says is dead. 40 00:02:03,420 --> 00:02:06,180 Even now I'm not yet convinced pre-training scaling is dead. 41 00:02:06,540 --> 00:02:10,100 Yeah, I think there is still a small chance 42 00:02:10,259 --> 00:02:14,160 and you just extrapolate the pre-training, whatever Chinchilla, 43 00:02:14,240 --> 00:02:18,210 whichever scaling curve that has been running for the past few 44 00:02:18,240 --> 00:02:20,980 we do that a few more years with new GPUs coming.
45 00:02:21,000 --> 00:02:24,960 I still think there is at least a little bit chance 46 00:02:25,040 --> 00:02:26,960 ASI, which I'm claiming. Uh- 47 00:02:27,420 --> 00:02:27,520 [speaker_1] Okay 48 00:02:27,980 --> 00:02:31,720 [speaker_0] ... a lot more realistic to me 49 00:02:31,940 --> 00:02:35,420 uh, some way of scaling up RL. Now, 50 00:02:35,820 --> 00:02:37,300 that may not be, like, 51 00:02:38,200 --> 00:02:41,769 just again, you just do the current, you know, RL scaling thing 52 00:02:41,840 --> 00:02:44,180 compute. Maybe one breakthrough is required there. 53 00:02:44,800 --> 00:02:47,600 I have, like, more probability mass than, okay, we need at least one breakthrough 54 00:02:47,660 --> 00:02:51,100 on how to do the RL thing better or some other breakthrough. 55 00:02:51,440 --> 00:02:54,780 I have a little bit less probability mass than, okay, you just blindly scale up RL 56 00:02:54,800 --> 00:02:57,500 and it works. But yeah, that's the thing is, like, we don't know. 57 00:02:57,820 --> 00:03:01,240 I have not h- seen like, okay, here is an argument that tells me that 58 00:03:01,580 --> 00:03:05,460 pre-training scaling 100% does not work, or here is an argument that tells me 59 00:03:05,600 --> 00:03:08,890 RL scaling 100% does not work. It's just intuitions that 60 00:03:09,380 --> 00:03:13,260 pre-training scale probably doesn't get there, RL scaling probably doesn't 61 00:03:13,320 --> 00:03:17,040 get there. We probably need another breakthrough, 62 00:03:17,080 --> 00:03:20,820 really knows. Uh, that's broadly the path, 63 00:03:21,180 --> 00:03:24,980 uh, of, you know, what technical capabilities will be 64 00:03:25,180 --> 00:03:28,700 Uh, then there is, okay, the most probably we'll need one more 65 00:03:28,780 --> 00:03:31,920 breakthrough. Why do I think, okay, even if we need one more breakthrough, we 66 00:03:31,980 --> 00:03:34,640 will... 
There's a good chance we get it in the 67 00:03:35,460 --> 00:03:39,320 For that, you have to extrapolate, well, in the last 10 years, how 68 00:03:39,360 --> 00:03:42,950 many major breakthroughs have happened in machine learning, deep learning, and, 69 00:03:42,960 --> 00:03:46,080 well, actually three or, like, three or four major breakthroughs have happened. 70 00:03:46,100 --> 00:03:49,960 So based on that extrapolate, okay, it doesn't seem surprising to me 71 00:03:50,100 --> 00:03:53,810 if in the next four years another, you know, smart researcher figures out yet 72 00:03:53,820 --> 00:03:57,620 another breakthrough. So there is that kind of thing. 73 00:03:58,520 --> 00:04:01,560 Then I have some specific heuristics and thing. 74 00:04:02,120 --> 00:04:05,760 If I were an AI capability researcher, if I had a huge number of 75 00:04:05,820 --> 00:04:09,420 GPUs, what are, you know, my crazy research hypotheses I might 76 00:04:09,500 --> 00:04:13,340 try for how to speed up RL? And to be clear, I think this is 77 00:04:13,360 --> 00:04:16,880 extremely dangerous thing to do. I think this is bad for the world to do. 78 00:04:17,019 --> 00:04:20,690 But, you know, if I wanted to do this, like, I think, okay, there are some 79 00:04:20,700 --> 00:04:23,960 hypotheses you could try. Uh, what else? 80 00:04:24,040 --> 00:04:27,640 Yeah, this is all, like, very specific to AI research capabilities, 81 00:04:27,680 --> 00:04:31,660 trajectory. And then I have, like, more very high level big picture kind of 82 00:04:31,760 --> 00:04:35,660 Like, you know, why is superintelligence important? 83 00:04:36,020 --> 00:04:39,200 Uh, you know, why the very fact that GPT-2 84 00:04:39,280 --> 00:04:43,060 exists should update your entire models of how the world works.
85 00:04:43,440 --> 00:04:44,080 The fact that 86 00:04:44,880 --> 00:04:48,560 bunch of matrix multiplication can do, you know, like, uh, like 87 00:04:48,680 --> 00:04:52,560 speaking human language rather than animal language, why this is a big deal? 88 00:04:52,640 --> 00:04:56,460 Like, you know, mat- why did matrix multiplication beat the 89 00:04:56,500 --> 00:04:59,690 literal billions of years of evolution that goes between, you know, 90 00:05:00,540 --> 00:05:01,740 uh- 91 00:05:01,760 --> 00:05:05,460 [speaker_1] I'm sorry, I, this, I don't think I'm, 92 00:05:05,520 --> 00:05:09,300 only. I mean, like, maybe we've not discussed this earlier. 93 00:05:09,330 --> 00:05:09,330 [speaker_0] Yeah. 94 00:05:09,340 --> 00:05:10,480 [speaker_1] But maybe we can get to it. 95 00:05:10,860 --> 00:05:11,159 [speaker_0] Yeah, sure. 96 00:05:11,380 --> 00:05:14,460 [speaker_1] The other one I know you're just saying, but this one seems new to me. 97 00:05:14,520 --> 00:05:17,740 Maybe you're framing it differently or something. Yeah. It's fine. 98 00:05:17,780 --> 00:05:21,040 We, we'll probably get to it in the order, but I'm just flagging it that I probably 99 00:05:21,080 --> 00:05:23,320 need you to double-click on this instinct. 100 00:05:23,480 --> 00:05:27,430 [speaker_0] Sure. Uh, I, I just mean like, uh, human language has features that 101 00:05:27,480 --> 00:05:28,940 are not present in animal language. 102 00:05:29,020 --> 00:05:32,720 There are some linguists that have studied that, 103 00:05:32,800 --> 00:05:36,700 like new evolutionary adaptation compared to most animals who don't really even 104 00:05:36,780 --> 00:05:40,180 use language. 
They kind of just use sounds 105 00:05:40,300 --> 00:05:44,040 And all of this has taken, like, literal billions of years in evolutionary 106 00:05:44,100 --> 00:05:47,420 history to build, and on the other side you have a bunch of 107 00:05:47,500 --> 00:05:51,340 bunch of GPUs in 50 whatever years of AI research history, and 108 00:05:51,380 --> 00:05:53,260 they have been able to crack human language. 109 00:05:53,320 --> 00:05:57,260 Like, that itself tells me, "All right, so intelligence is 110 00:05:57,280 --> 00:06:01,060 likely easier to build than I thought." So that is like- Yeah. 111 00:06:01,120 --> 00:06:04,780 So even just looking at GPT-2 tells me like, oh, okay, so maybe now the 112 00:06:04,820 --> 00:06:06,359 singularity could happen in my lifetime. 113 00:06:06,400 --> 00:06:09,900 Like, you know, until now I was thinking this is some extremely far 114 00:06:09,980 --> 00:06:12,820 Now it looks like, okay, this could actually happen. Uh, what else? 115 00:06:12,980 --> 00:06:16,810 Yeah, I mean, and I have outside level, outside view kind of stuff like, 116 00:06:16,880 --> 00:06:19,750 okay, which experts are actually correctly predicting this? 117 00:06:19,780 --> 00:06:21,600 Which experts are badly predicting this? 118 00:06:21,660 --> 00:06:25,480 I think a lot of people have been consistently badly predicting, 119 00:06:25,560 --> 00:06:29,540 uh, AI trajectory. Yeah, like the people who 120 00:06:30,020 --> 00:06:32,900 their predictions are coming more correct, and the people who keep making the 121 00:06:32,919 --> 00:06:35,640 pessimistic predictions, their predictions keep coming wrong. 122 00:06:36,140 --> 00:06:39,770 So there's that kind of stuff. I think that could summarize my whole 123 00:06:40,240 --> 00:06:43,540 [speaker_1] Fair enough. We'll start taking it one after the other. 124 00:06:44,040 --> 00:06:47,990 Uh, cool.
So starting with, uh, you said pre-training, scaling, 125 00:06:48,000 --> 00:06:50,260 Pre-training scaling is work, and pre-training scaling is not dead. 126 00:06:50,660 --> 00:06:54,280 I think when people say pre-training scaling is dead, they don't mean 127 00:06:54,340 --> 00:06:58,040 more parameters and adding more data and adding more compute doesn't lead to 128 00:06:58,080 --> 00:07:00,420 better loss functions that leads to more capabilities. 129 00:07:00,480 --> 00:07:02,180 I don't think anybody denies that. 130 00:07:02,280 --> 00:07:05,820 Uh, people are uncertain that, uh, people are 131 00:07:05,859 --> 00:07:08,980 say-saying that it could stop, but there is no evidence to believe 132 00:07:09,060 --> 00:07:11,340 stop, and I don't think any serious researcher 133 00:07:11,620 --> 00:07:15,500 The argument against pre-training mostly comes from the fact that, um, 134 00:07:16,000 --> 00:07:19,760 it is economically unfeasible to scale because the, uh, amount 135 00:07:19,800 --> 00:07:20,100 of 136 00:07:21,100 --> 00:07:23,400 resources required to do the scaling gives you log. 137 00:07:23,440 --> 00:07:26,180 It does not linearly increase, it gives you log of the intelligence. 138 00:07:26,740 --> 00:07:30,680 Uh, so like increasing, uh, your compute by ten x et cetera gives you 139 00:07:30,700 --> 00:07:34,280 the amount of intelligence or capabilities and double the amount of 140 00:07:34,340 --> 00:07:34,600 goes. 141 00:07:34,610 --> 00:07:37,800 [speaker_0] Exactly. Double the amount of loss, 142 00:07:37,940 --> 00:07:40,140 loss number leads to this, uh, capability. 143 00:07:40,500 --> 00:07:44,200 [speaker_1] Sure. 
And so which also means that, uh, so th-then the 144 00:07:44,240 --> 00:07:48,140 debate essentially shifts to not that pre-training is dead, 145 00:07:48,180 --> 00:07:51,900 many ten x's can we do and how many doubling of intelligence or, 146 00:07:52,000 --> 00:07:54,020 slightly higher intelligence will essentially do it. 147 00:07:54,440 --> 00:07:56,820 Uh, I think you'd mentioned this in the OOTD 148 00:07:56,980 --> 00:08:00,800 GPT 4.5 was a big update for me, saying that if you just go 149 00:08:00,860 --> 00:08:04,500 on cranking pre-training, you might get a nicer model with not a 150 00:08:04,540 --> 00:08:08,160 capabilities. There was nothing 4.5 could do that was so 151 00:08:08,340 --> 00:08:12,240 far ahead of 4, uh, that essentially I think that 152 00:08:12,280 --> 00:08:13,200 this would essentially 153 00:08:14,300 --> 00:08:17,780 defeat, uh, like th-th-there would be new 154 00:08:18,180 --> 00:08:21,120 Uh, o1 was in a different update because o1 155 00:08:21,370 --> 00:08:23,060 And essentially, this is in the pa-current paradigm. 156 00:08:23,120 --> 00:08:27,000 So my submission for the pre-training argument is that not that pre-training is 157 00:08:27,060 --> 00:08:30,600 dead in the sense that you can't technically, uh, 158 00:08:30,880 --> 00:08:34,820 burn all, all the GDP in the world and like get a 159 00:08:34,900 --> 00:08:38,840 few doublings, etc., out of it. Uh, whether essentially A, there 160 00:08:38,860 --> 00:08:42,520 is a-- it's economically feasible to do it and B, 161 00:08:42,720 --> 00:08:46,560 even if it was economically feasible to do it, uh, is, uh, like 162 00:08:46,820 --> 00:08:50,700 essentially maybe the returns of on, on, on it that again is probably 163 00:08:50,740 --> 00:08:54,500 not worth it. 
Maybe spending hundred x the amount of 164 00:08:54,580 --> 00:08:58,500 times better model, uh, probably starts breaking a lot of other 165 00:08:58,620 --> 00:09:02,340 Uh, having said that, we are also almost on the edge 166 00:09:02,500 --> 00:09:05,940 of how much compute we can build. Everything is stretched out to its limit, 167 00:09:06,420 --> 00:09:07,660 uh, especially by twenty-thirty. 168 00:09:07,760 --> 00:09:11,680 The year you said, uh, fabs are already built 169 00:09:11,740 --> 00:09:15,580 out, etc. We have nowhere close to, uh, let's say 170 00:09:15,640 --> 00:09:19,280 four x-- four, four increasing our, uh, pre-training. 171 00:09:19,320 --> 00:09:22,420 We probably can do one or two more scale-ups in the next two 172 00:09:22,700 --> 00:09:26,520 Sorry, in, in the next four years. Uh, and that's maximum that we can do. 173 00:09:26,980 --> 00:09:30,180 Uh, even like we do not have enough fabs, we do not have enough electricity, we do 174 00:09:30,200 --> 00:09:34,180 not have enough... Some insane amount of physical bottlenecks 175 00:09:34,220 --> 00:09:35,680 the amount of compute in the world. 176 00:09:36,060 --> 00:09:40,040 Um, currently we are doing, uh, I think forty or 177 00:09:40,080 --> 00:09:41,510 fifty gigawatts of... 178 00:09:42,720 --> 00:09:46,690 Fifty gigawatts? No, thirty gigawatts of, uh, compute capacity is 179 00:09:46,720 --> 00:09:48,550 what twenty twenty-seven will get us. 180 00:09:48,620 --> 00:09:51,880 And, uh, if everything goes according to plan 181 00:09:51,960 --> 00:09:55,780 like all the supply chains get stretched exactly to the right limit, 182 00:09:56,240 --> 00:09:59,420 our best bet is to get to one fifty to two hundred gigawatts a year, which is again 183 00:09:59,440 --> 00:10:03,400 like six times more compute, not ten x more compute, um, from what 184 00:10:03,440 --> 00:10:05,960 we have right now. 
And like that's two hundred gigawatts per 185 00:10:06,440 --> 00:10:10,220 Uh, and that's assuming all that compute goes into training 186 00:10:10,260 --> 00:10:14,140 etc. So, so there are like lots of physical limitations 187 00:10:14,160 --> 00:10:17,920 till twenty-thirty that do not allow for arbitrarily amount 188 00:10:18,300 --> 00:10:20,780 Uh, we've already kind of pressed to the 189 00:10:20,880 --> 00:10:24,440 Uh, buying a laptop costs like three hundred, 190 00:10:24,600 --> 00:10:25,600 Costs three hundred, four hundred more. 191 00:10:25,760 --> 00:10:29,580 I don't think that the world can essentially take four orders of magnitude 192 00:10:29,620 --> 00:10:32,740 compute scaling that readily now. I think we, 193 00:10:33,140 --> 00:10:37,100 Coming to RL, I agree that essentially RL scaling 194 00:10:37,160 --> 00:10:39,070 There are two points that I want to make about RL. 195 00:10:39,070 --> 00:10:41,390 RL scaling, uh, is also extremely expensive. 196 00:10:41,840 --> 00:10:44,180 Uh, it's more expensive than pre-training 197 00:10:44,260 --> 00:10:48,170 Uh, the other point that I want to make about RL is, uh, 198 00:10:48,200 --> 00:10:51,940 gives us general capability increases, RL gets us very 199 00:10:52,120 --> 00:10:55,960 jagged increases in capability. You only get increase in capability in 200 00:10:56,000 --> 00:10:59,820 domains like coding and math, uh, which essentially kind of 201 00:10:59,860 --> 00:11:03,740 defeats the, the specific model of intelligence 202 00:11:03,800 --> 00:11:06,720 that it will be better at everything by twenty-thirty. 203 00:11:07,080 --> 00:11:10,780 So if we, if we cannot find good RL candidates 204 00:11:10,800 --> 00:11:14,740 loops for different kinds of event, we might not even solve, uh, forget 205 00:11:14,820 --> 00:11:17,120 solving for like robotics and like other things. 
206 00:11:17,180 --> 00:11:21,140 Even in just like atoms world, we might not be able to solve all of it 207 00:11:21,280 --> 00:11:24,600 to human level because we will not just find enough RL data, etc., 208 00:11:25,320 --> 00:11:28,979 or enough closed loops, etc., uh, uh, to essentially do it. 209 00:11:29,080 --> 00:11:32,360 Uh, so, so I don't think that RL also scales 210 00:11:32,600 --> 00:11:36,540 arbitrarily that you can just thousand x the RL compute and actually get 211 00:11:36,580 --> 00:11:40,320 away with it. Uh, there is a limit to how much RL compute you can 212 00:11:40,400 --> 00:11:43,040 RL also just gets you increases in capability. 213 00:11:43,380 --> 00:11:47,220 Yes, economically viable, uh, the economic, the killer use 214 00:11:47,260 --> 00:11:51,140 case for LLMs currently is coding, and coding is economically viable, etc. 215 00:11:51,760 --> 00:11:53,900 There's a small, uh, probability gap I have. 216 00:11:53,940 --> 00:11:57,460 This leads to some sort of recursive self-improvement or some sort of, uh, 217 00:11:57,480 --> 00:11:59,840 breakthrough, etc., that happens because of this. 218 00:12:00,200 --> 00:12:04,020 Uh, but aside from that gap, uh, which sure, we can talk about 219 00:12:04,400 --> 00:12:08,360 Uh, but aside from that gap, uh, the path essentially just by scaling 220 00:12:08,500 --> 00:12:12,370 RL or just by scaling pre-training or just by scaling RL and pre-training, 221 00:12:12,780 --> 00:12:16,320 uh, getting to ASI just in the current 222 00:12:16,740 --> 00:12:19,200 uh, in your definition of intelligence seems 223 00:12:19,240 --> 00:12:23,020 unlikely. Um, to me, like forget 25%, I would give 224 00:12:23,360 --> 00:12:26,360 sub 1% chance in the current paradigm, uh, in these specific 225 00:12:26,400 --> 00:12:30,260 circumstances.
Not to say that, uh, I have a much higher probability was 226 00:12:30,340 --> 00:12:33,660 might get to superintelligence by 2030, but, uh, that's 227 00:12:33,740 --> 00:12:37,140 essentially assuming technologies that haven't been invented yet come into 228 00:12:37,200 --> 00:12:40,460 being. Uh, so that was the second point about RL. 229 00:12:40,600 --> 00:12:43,940 Uh, your third point was tied into this, is like, okay, we might need one more 230 00:12:43,960 --> 00:12:45,960 breakthrough. Uh, current RL might not... 231 00:12:46,000 --> 00:12:47,640 You think there's a chance that might not be enough. 232 00:12:47,680 --> 00:12:49,360 Maybe it's enough, but we might, maybe it's not enough. 233 00:12:49,400 --> 00:12:50,920 Maybe we need one more breakthrough. 234 00:12:51,040 --> 00:12:54,469 And, uh, uh, your submission there was that, hey, 235 00:12:54,860 --> 00:12:58,590 breakthrough will happen, uh, because look at the last 10 years, 236 00:12:58,780 --> 00:13:02,740 we've gotten transformers and we've gotten pre-training scaling, 237 00:13:02,840 --> 00:13:06,280 RL. So looks like breakthroughs are coming very, very quickly. 238 00:13:06,700 --> 00:13:10,380 Um, I think is, uh, this, I don't think that, 239 00:13:11,000 --> 00:13:14,680 uh, there is enough data points for you to 240 00:13:15,040 --> 00:13:19,020 Uh, even sta- in the start of GPT, despite essentially almost 241 00:13:19,080 --> 00:13:22,800 all of world's intelligence and attention going to this problem, 242 00:13:22,940 --> 00:13:26,520 RL scaling, which we've cracked, uh, there's not been another scaling 243 00:13:26,580 --> 00:13:30,140 paradigm that we've cracked. So apart from pre-training and RL scaling, 244 00:13:30,260 --> 00:13:33,380 people are ready to throw compute at other scaling things, and that's not 245 00:13:33,420 --> 00:13:35,740 something that we've cracked. I'll take your rebuttal. 246 00:13:36,040 --> 00:13:36,280 [speaker_0] Oh, rebuttal.
247 00:13:36,300 --> 00:13:40,239 [speaker_1] Like what other scaling paradigm have, 248 00:13:40,280 --> 00:13:40,560 RL- 249 00:13:40,870 --> 00:13:40,890 [speaker_0] No, no. 250 00:13:40,900 --> 00:13:42,330 [speaker_1] Despite throwing insane- 251 00:13:42,330 --> 00:13:45,880 [speaker_0] Okay, it has a lot of attention, 252 00:13:45,920 --> 00:13:47,180 You're saying this mean that other breakthroughs- 253 00:13:47,220 --> 00:13:49,320 [speaker_1] No, no other breakthroughs. I'm saying other breakthroughs... 254 00:13:49,840 --> 00:13:53,020 I'm saying, uh, other breakthroughs are possible in the sense they're not 255 00:13:53,160 --> 00:13:57,080 physically impossible. But A, do you disagree that there are other, 256 00:13:57,200 --> 00:13:58,200 things that we can just throw, 257 00:13:59,220 --> 00:14:03,120 s- other, other scalable things that we can throw compute money, 258 00:14:03,140 --> 00:14:06,980 more intelligence aside from, uh, like just regular pre-training 259 00:14:07,200 --> 00:14:08,160 and RL? 260 00:14:08,560 --> 00:14:12,460 [speaker_0] I think there is a huge backlog of research hypothesis 261 00:14:12,540 --> 00:14:13,540 at all these AI companies. 262 00:14:13,600 --> 00:14:14,520 [speaker_1] I doubt it. 263 00:14:14,660 --> 00:14:15,460 [speaker_0] Like lot of very- 264 00:14:15,540 --> 00:14:15,980 [speaker_1] I doubt it 265 00:14:15,990 --> 00:14:19,830 [speaker_0] ... obvious things to try, but because compute is scarce, 266 00:14:19,920 --> 00:14:20,060 decide- 267 00:14:20,080 --> 00:14:20,480 [speaker_1] I don't know 268 00:14:20,500 --> 00:14:21,450 [speaker_0] ... 
okay, which ones to, uh- 269 00:14:21,480 --> 00:14:25,300 [speaker_1] I get it, but essentially, I, I, I, I, I hear that argument saying 270 00:14:25,320 --> 00:14:28,980 that obviously we can improve our models in X sector, but if there were such 271 00:14:29,140 --> 00:14:32,740 obvious scaling paradigms that essentially could have been done 272 00:14:32,800 --> 00:14:36,780 than the current paradigms, uh, I think you would see some evidence of it. 273 00:14:37,080 --> 00:14:39,809 You would see there's, there's enough time essentially going into 274 00:14:39,809 --> 00:14:39,999 that- 275 00:14:40,260 --> 00:14:41,540 [speaker_0] No, I, I hear all excuses 276 00:14:41,800 --> 00:14:41,809 [speaker_1] ... that- 277 00:14:41,840 --> 00:14:44,360 [speaker_0] I'm saying like you right now, you are not an expert AI researcher. 278 00:14:44,460 --> 00:14:48,140 You can right now take a pen and paper, sit for half a day and come up with, okay, 279 00:14:48,380 --> 00:14:52,320 if I had huge amount of GPUs, here are 10 hypotheses I would try, and 280 00:14:52,380 --> 00:14:53,420 you can write them down. 281 00:14:53,430 --> 00:14:53,430 [speaker_1] Yeah. 282 00:14:53,430 --> 00:14:57,340 [speaker_0] And then we will ask, okay, well, why haven't these, like, 283 00:14:57,400 --> 00:15:01,100 know, Open-- Anthropic try them and they failed, or did nobody try them? 284 00:15:01,240 --> 00:15:01,400 [speaker_1] I- 285 00:15:01,469 --> 00:15:01,689 [speaker_0] If nobody tried- 286 00:15:01,710 --> 00:15:05,380 [speaker_1] ... doubt it's that easy. Can you name, 287 00:15:05,420 --> 00:15:09,370 that hasn't been tried that you think has a higher chance of it 288 00:15:09,740 --> 00:15:11,860 an RL level breakthrough if tried? 289 00:15:12,200 --> 00:15:14,880 [speaker_0] I'm not saying any one idea if I try that will definitely work.
290 00:15:14,940 --> 00:15:15,230 I'm saying- 291 00:15:16,040 --> 00:15:19,480 [speaker_1] No, any example of an idea 292 00:15:19,540 --> 00:15:23,460 If nobody's tried it, I didn't find any research papers on it, 293 00:15:23,520 --> 00:15:25,320 we'll probably get ASI. 294 00:15:25,420 --> 00:15:29,020 [speaker_0] Uh, if you just want random ideas, 295 00:15:29,200 --> 00:15:29,840 Uh- 296 00:15:29,920 --> 00:15:30,060 [speaker_1] Sure. 297 00:15:31,040 --> 00:15:34,670 Just to understand, like, what kind of ideas do you think 298 00:15:34,700 --> 00:15:38,220 these guys are so compute constrained that if there is an RL level breakthrough 299 00:15:38,300 --> 00:15:42,060 just sitting on like some researcher's notepad and they've not had the time. 300 00:15:42,160 --> 00:15:45,460 Because even for RL, they didn't actually have to use 301 00:15:45,500 --> 00:15:48,340 the idea. Uh, the idea, the idea, the, the- 302 00:15:48,960 --> 00:15:52,840 [speaker_0] Yeah. Uh, one example, most of the training still happens in very 303 00:15:52,940 --> 00:15:56,380 code, like, you know, PyTorch, you know, like four-bit, eight-bit floating point 304 00:15:56,460 --> 00:15:59,760 numbers. If you really wanted, you could optimize all this way down. 305 00:15:59,800 --> 00:16:03,300 You could literally run training inside an ASIC. You could optimize the code. 306 00:16:03,340 --> 00:16:06,540 [speaker_1] Yeah. So, so for example, that's like a very bad idea. No, no, no. 307 00:16:06,580 --> 00:16:10,400 So for example, running ASIC, so, so that essentially assumes 308 00:16:10,440 --> 00:16:12,220 like a lot of other things need to move in the world. 309 00:16:12,260 --> 00:16:15,990 One, the amount of GPU capacity that already exists allocated in the world 310 00:16:16,020 --> 00:16:19,480 needs to go away. Secondly, there's a reason why even of co... 
311 00:16:19,520 --> 00:16:22,510 Like doing this does not get you that much compute because essentially 312 00:16:22,520 --> 00:16:25,570 compute constrained, you're memory constrained, uh, 313 00:16:25,570 --> 00:16:28,080 training and you're basically memory bandwidth constrained, not even memory 314 00:16:28,100 --> 00:16:31,380 constrained. And, uh, yes, sure, like we have like silicon 315 00:16:31,440 --> 00:16:34,200 photonics, and we have other breakthroughs 316 00:16:34,260 --> 00:16:38,159 engineering problem. Uh, I don't think it's as easy as like, hey, 317 00:16:38,200 --> 00:16:41,860 on ASIC and we get like 100X speed up and nobody's had the time 318 00:16:42,279 --> 00:16:45,200 Uh, there's so much incentive for any smart 319 00:16:45,240 --> 00:16:48,770 If you think it's that easy to go replace the entire compute 320 00:16:48,840 --> 00:16:52,430 ASIC and people have just like not tried it because of some other 321 00:16:52,460 --> 00:16:54,000 crunch, I, I think you're mistaken. 322 00:16:54,030 --> 00:16:54,110 [speaker_0] No, no, I- 323 00:16:54,150 --> 00:16:57,460 [speaker_1] Like there's insane amount of like capitalist in-incentive to do it. 324 00:16:57,660 --> 00:17:01,440 Like, th-which would be like, "Hey." I'm saying if there's any other 325 00:17:01,980 --> 00:17:04,950 uh, the amount of compute you need to quote 326 00:17:05,000 --> 00:17:08,680 small. Uh, and therefore if, if there were these insane... 327 00:17:08,720 --> 00:17:12,210 o1, for example, didn't require too much compute on forward 328 00:17:12,240 --> 00:17:15,940 could work. Uh, if you had such an idea, you could show that, hey, this 329 00:17:16,020 --> 00:17:19,870 works and we want to use this to scale our models and it's sitting in the labs and 330 00:17:19,900 --> 00:17:23,260 we just think, but looks like there's n-none, 331 00:17:23,580 --> 00:17:27,360 [speaker_0] Uh, oh, okay. Yeah. 
I'm saying there are a lot of research 332 00:17:27,460 --> 00:17:31,420 require a lot of compute to prove even as a proof of concept, okay, this is worth 333 00:17:31,460 --> 00:17:31,919 exploring. 334 00:17:32,040 --> 00:17:33,990 [speaker_1] I, I doubt it. 335 00:17:34,070 --> 00:17:34,070 [speaker_0] Yeah. 336 00:17:34,120 --> 00:17:34,900 [speaker_1] I think they all- 337 00:17:35,300 --> 00:17:35,570 [speaker_0] The idea is- 338 00:17:35,570 --> 00:17:39,300 [speaker_1] They all show signs of life. All of these research ideas show signs of 339 00:17:39,380 --> 00:17:42,580 life much before you have to scale them to get gains out of it. 340 00:17:43,220 --> 00:17:47,100 There are very few research ideas that quote unquote "only get unlocked at," 341 00:17:47,380 --> 00:17:50,850 if you only train them for, like only if you spend billion dollars of compute is 342 00:17:50,880 --> 00:17:54,220 the first sign of life you get from that research idea, uh, 343 00:17:54,280 --> 00:17:57,230 research idea saying that, "Hey, if I just keep throwing compute at it"- 344 00:17:57,330 --> 00:18:00,280 [speaker_0] In research, almost every breakthrough 345 00:18:00,380 --> 00:18:04,240 Like only after you scaled it to literal billions of dollars you saw that it was 346 00:18:04,260 --> 00:18:04,560 working. 347 00:18:05,060 --> 00:18:07,740 [speaker_1] No. Give me one research idea that's gonna 348 00:18:08,100 --> 00:18:09,940 GPT-1 was like a very small model. 349 00:18:10,480 --> 00:18:10,550 [speaker_0] RL or- 350 00:18:10,650 --> 00:18:11,669 [speaker_1] RNN was a very small model. 351 00:18:11,740 --> 00:18:13,160 [speaker_0] Uh, GPT- 352 00:18:13,280 --> 00:18:14,820 [speaker_1] RL, for example, didn't require billions of dollars. 353 00:18:14,980 --> 00:18:16,780 So o1, for example, was a very cheap model. 
354 00:18:17,140 --> 00:18:21,040 Once GPT-4 was trained, training on chain of, 355 00:18:21,080 --> 00:18:24,964 of thought and doing RL was like a very, very small experiment. So, 356 00:18:25,174 --> 00:18:25,174 uh- 357 00:18:25,184 --> 00:18:27,854 [speaker_0] Can, uh, do you have numbers on that? 358 00:18:27,864 --> 00:18:29,584 [speaker_1] Yeah, yeah. So, so there's, A, there's ... 359 00:18:29,774 --> 00:18:33,424 I, I'll find out where I read about this, but basically the first version of the o1 360 00:18:33,444 --> 00:18:35,264 model was just get tried in a lab, et cetera. 361 00:18:35,604 --> 00:18:38,564 For example, uh, you can take Llama 3 right now, 362 00:18:39,164 --> 00:18:40,434 and, uh, you can take- 363 00:18:40,464 --> 00:18:43,323 [speaker_0] No, no, not, not about right now. 364 00:18:43,484 --> 00:18:46,334 Back then when o1 was first tried, like my guess without having- 365 00:18:46,334 --> 00:18:48,344 [speaker_1] Like I'm saying, you can make it more efficient. 366 00:18:48,384 --> 00:18:48,504 [speaker_0] S- sorry. 367 00:18:48,544 --> 00:18:49,364 [speaker_1] You can make it more 368 00:18:50,304 --> 00:18:50,523 efficient. Yeah. 369 00:18:50,744 --> 00:18:54,104 [speaker_0] Yeah, like without having read it, my guess 370 00:18:54,144 --> 00:18:57,524 required at least $10 million on top of the 371 00:18:57,564 --> 00:19:01,324 GPT-4 training cost, and like I have not actually checked the 372 00:19:01,384 --> 00:19:01,804 but yeah. 373 00:19:02,344 --> 00:19:05,724 [speaker_1] You needed the GPT-4 to get trained first, 374 00:19:05,764 --> 00:19:08,664 and then you found a new scaling paradigm on that scaling paradigm. 375 00:19:09,104 --> 00:19:09,193 [speaker_0] Yeah, yeah. 376 00:19:09,224 --> 00:19:09,634 [speaker_1] That I agree. 377 00:19:09,653 --> 00:19:10,944 [speaker_0] I'm saying first you had the whole- 378 00:19:11,004 --> 00:19:11,564 [speaker_1] But go from four to- 379 00:19:11,614 --> 00:19:13,513 [speaker_0] ...
GPT-4 training cost, then GPT-4 trained. 380 00:19:13,524 --> 00:19:13,614 [speaker_1] Right. 381 00:19:13,984 --> 00:19:14,704 [speaker_0] Then you had- 382 00:19:15,164 --> 00:19:15,434 [speaker_1] But O1, when you look at- 383 00:19:15,434 --> 00:19:19,384 [speaker_0] ... to research hypothesis. Each of those hypothesis took at least $10 384 00:19:19,804 --> 00:19:21,124 to test, and 10 million is a random number. 385 00:19:21,164 --> 00:19:21,264 [speaker_1] No. 386 00:19:21,344 --> 00:19:22,824 [speaker_0] I think it's actually important. Uh- 387 00:19:22,904 --> 00:19:24,384 [speaker_1] I don't think it takes $10 million of test. 388 00:19:24,444 --> 00:19:27,164 I think it takes like a few hundred thousand dollars, 389 00:19:27,224 --> 00:19:30,924 dollars to test a research idea to see if that it has any 390 00:19:31,024 --> 00:19:34,564 sign or any chance. Yes, like you might see like different gains 391 00:19:34,664 --> 00:19:35,374 losses, et cetera. 392 00:19:35,394 --> 00:19:37,124 [speaker_0] That is then possibly 100 million is what I'm claiming. 393 00:19:38,004 --> 00:19:41,324 [speaker_1] I doubt it. I don't know if any of all these ideas take 10 million. 394 00:19:41,384 --> 00:19:44,463 For example, O1 didn't take 10 million post GPT-4 395 00:19:44,864 --> 00:19:47,324 [speaker_0] We can actually go and check that maybe. 396 00:19:47,404 --> 00:19:48,184 Like I know this is- 397 00:19:48,224 --> 00:19:48,274 [speaker_1] No, no 398 00:19:48,274 --> 00:19:51,044 [speaker_0] ... not public information, but we can go and see like- 399 00:19:51,244 --> 00:19:53,644 [speaker_1] No, no, I think we can check because the amount of 400 00:19:53,684 --> 00:19:54,924 Yeah, sure, because the actual amount ... 401 00:19:54,934 --> 00:19:57,954 So the idea, the way research works is get an idea, 402 00:19:58,164 --> 00:20:01,304 this thing. If y- if you see any sort on it or anything 403 00:20:01,314 --> 00:20:04,584 "Cool. You know what? 
This warrants more investigation, 404 00:20:04,744 --> 00:20:07,804 actually go improve something in the form." And then you can do optimizations 405 00:20:07,844 --> 00:20:09,224 it, and then you can make it better, et cetera. 406 00:20:09,524 --> 00:20:13,244 But the first amount of like this thing, the sign of life saying that this has, 407 00:20:13,304 --> 00:20:16,394 this probably has some legs can come from not that much money. 408 00:20:17,024 --> 00:20:17,474 [speaker_0] Yeah. This is- 409 00:20:17,504 --> 00:20:19,144 [speaker_1] And then obviously getting actual real- 410 00:20:19,174 --> 00:20:21,064 [speaker_0] ... GPT works, however, ML is not like this. 411 00:20:21,724 --> 00:20:23,264 [laughs] I think that's my actual method. 412 00:20:23,304 --> 00:20:26,364 [speaker_1] All of ML has been like this. Like pre-training, for example, 413 00:20:26,404 --> 00:20:28,664 was a very small model. GPT-2 was a very small model. 414 00:20:29,184 --> 00:20:32,864 Uh, it took like, uh, probably 150 or 2, like the, the initial 415 00:20:32,924 --> 00:20:35,284 GPT cost like a million dollars, I think. Not that much. 416 00:20:35,784 --> 00:20:39,704 Uh, GPT-3 again, like took like slightly larger amount of 417 00:20:40,024 --> 00:20:41,424 but nothing compared to right now. 418 00:20:41,824 --> 00:20:45,604 And the amount of, uh, the only reason you go from GPT-2 to 3 to 419 00:20:45,724 --> 00:20:49,484 4 to 5 or whatever subsequent models is because you see gains 420 00:20:49,524 --> 00:20:53,484 from scaling all the time. Uh, RNN, for example, you can see that, okay, 421 00:20:53,544 --> 00:20:55,884 scale the RNN paradigm, you get gains from it. 422 00:20:56,424 --> 00:20:59,673 Uh, and then, then you stop seeing gains, so that's why RNN didn't scale or 423 00:20:59,684 --> 00:21:01,214 whatever. Or, uh, then they went to LSTM. 424 00:21:01,284 --> 00:21:02,694 LSTM didn't scale, and then they went to transformer.
425 00:21:02,784 --> 00:21:05,754 Transformer scaled fairly much, and like, cool, we found a scalable paradigm. 426 00:21:06,204 --> 00:21:09,244 The idea that you can see gains and then you try to scale it 427 00:21:09,744 --> 00:21:10,064 So- 428 00:21:10,084 --> 00:21:10,524 [speaker_0] Yeah 429 00:21:10,534 --> 00:21:14,144 [speaker_1] ... uh, so it's not that, oh, I have to like literally roll dice of $10 430 00:21:14,224 --> 00:21:16,584 each time to get one idea. It's not that random. 431 00:21:16,944 --> 00:21:20,884 [speaker_0] Okay, I'll make a tighter claim. After GPT-2 432 00:21:20,924 --> 00:21:24,444 had come out, if you wanted to try out any ML research 433 00:21:24,484 --> 00:21:28,304 hypothesis, you probably needed at least $10 million 434 00:21:28,324 --> 00:21:31,924 life or not. Like for, for most of the research hypothesis- 435 00:21:31,984 --> 00:21:32,134 [speaker_1] No, you could use the, uh- 436 00:21:32,134 --> 00:21:34,064 [speaker_0] ... you needed at least $10 million since GPT- 437 00:21:34,104 --> 00:21:38,084 [speaker_1] No, no. So the id- the, the idea is that you train GPT-1, 438 00:21:38,104 --> 00:21:41,284 saw that there were gains on it. The obvious thing to do is that, "Hey, 439 00:21:41,324 --> 00:21:44,224 found a scalable paradigm. I think I can scale it further to see 440 00:21:44,324 --> 00:21:44,644 gains." 441 00:21:45,024 --> 00:21:45,194 [speaker_0] Sure. 442 00:21:45,384 --> 00:21:48,484 [speaker_1] And then you train GPT-3, and there are still gains from it, 443 00:21:48,524 --> 00:21:51,704 res- dead research ideas that might show signs of life, but o- 444 00:21:51,724 --> 00:21:53,084 it, they start breaking. 445 00:21:53,124 --> 00:21:56,944 [speaker_0] Yeah, I'm saying to try any other idea, 446 00:21:56,964 --> 00:22:00,604 pre-training scaling. If you have any other idea apart from 447 00:22:00,824 --> 00:22:04,584 GPT-2 came out 2018, right? 
So after 2018, if you had any other 448 00:22:04,644 --> 00:22:08,404 idea besides the whole pre-training scaling thing, you wanted to test it out, 449 00:22:08,424 --> 00:22:12,304 would need at least $10 million for most ideas to even test out and see if this 450 00:22:12,324 --> 00:22:13,164 has any life or not. 451 00:22:13,584 --> 00:22:15,604 [speaker_1] And that's also trivial amounts of money if... 452 00:22:15,704 --> 00:22:18,424 I think if people who know about how these things work, it's not completely 453 00:22:18,464 --> 00:22:20,764 unintuitive. They have, they have a big idea of test. 454 00:22:21,044 --> 00:22:24,964 I, I'm not saying is that, uh, there might be a breakthrough that 455 00:22:25,004 --> 00:22:28,004 in some old research paper or in scribbled in all the diaries 456 00:22:28,064 --> 00:22:32,024 overlooked, uh, but it is not as low-lying a fruit as you 457 00:22:32,564 --> 00:22:36,384 if, if only we had just spent more compute, 458 00:22:36,824 --> 00:22:40,054 thousands of thousands of scalable paradigms." Finding scalable paradigms is 459 00:22:40,104 --> 00:22:43,484 really, really hard. We've only done like two ti- 460 00:22:43,504 --> 00:22:46,213 last 10 years and two times in the last 50 years and two times in the last thousand 461 00:22:46,224 --> 00:22:50,114 years, and, uh, essentially, uh, uh, just because we've gotten 462 00:22:50,364 --> 00:22:53,764 lucky, just because we got lucky doesn't mean 463 00:22:53,804 --> 00:22:56,064 happening in the next two, three, four, five years. 464 00:22:56,084 --> 00:22:59,524 It's just like there'll just be like scalable paradigms after scalable 465 00:22:59,784 --> 00:23:02,464 that will just keep showing up because it's shown up last two times. 466 00:23:02,824 --> 00:23:05,744 [speaker_0] Okay, so we have identified some disagreement 467 00:23:05,864 --> 00:23:07,344 Uh, how do you think we can resolve this? 
468 00:23:07,384 --> 00:23:11,014 Like what data points will work or what arguments about how do you think 469 00:23:11,014 --> 00:23:14,884 [speaker_1] One data point that would work, one data point 470 00:23:14,944 --> 00:23:18,904 move you is that if you look other, uh, if you look at other fields, you look 471 00:23:18,944 --> 00:23:22,884 at biology, et cetera, uh, just because like there's one breakthrough 472 00:23:22,924 --> 00:23:26,884 that essentially leads to a different class of 473 00:23:26,924 --> 00:23:28,414 discoveries or drug discovery happening. 474 00:23:28,464 --> 00:23:30,544 For example, you get, uh, let's say 475 00:23:31,704 --> 00:23:35,414 discovery through, like, for example, RNA delivery of drugs, 476 00:23:36,044 --> 00:23:39,044 uh, like the mRNA vaccine, et cetera, which you can like modify RNA and you can 477 00:23:39,064 --> 00:23:42,964 inject in people, et cetera. That means that, yes, a lot of, a lot 478 00:23:43,044 --> 00:23:43,624 of, uh... 479 00:23:44,544 --> 00:23:47,644 That, that was a big deal, like get a new branch of medicine, 480 00:23:47,664 --> 00:23:51,384 doesn't automatically mean that the amount of new breakthroughs will 481 00:23:51,424 --> 00:23:55,224 increase. If anything, what we've seen is that there is actually a slowdown in 482 00:23:55,284 --> 00:23:59,074 amount of new ideas and new researches in every mature field 483 00:23:59,104 --> 00:24:02,264 more, more attention goes into it because all 484 00:24:02,584 --> 00:24:05,844 Uh, so finding the next breakthrough is not a linear process. 485 00:24:05,924 --> 00:24:08,524 It's actually a super linear process. Not like it's a log process. 486 00:24:08,564 --> 00:24:12,004 You have to like spend 10X, 100X more resources to get more ideas, 487 00:24:12,464 --> 00:24:16,244 and, uh, you pluck the low-hanging fruits very, 488 00:24:16,284 --> 00:24:18,304 is not that hard. It's not that easy. 
489 00:24:18,524 --> 00:24:21,244 So- Um, so other fields at least do it. 490 00:24:21,284 --> 00:24:23,424 You can argue that M-ML is different for some reason. 491 00:24:23,544 --> 00:24:26,604 I don't know why scientific idea is, like, inherently be different in, uh, ML 492 00:24:26,624 --> 00:24:27,804 because like, oh, ML is different. 493 00:24:27,844 --> 00:24:31,704 Because mostly what happens is if enough eyeballs look at a problem, uh, 494 00:24:31,744 --> 00:24:34,044 they look at all the low-hanging fruits, then they go to the second level of 495 00:24:34,084 --> 00:24:36,804 low-hanging fruit, and they keep doing this, and they, 496 00:24:36,814 --> 00:24:38,184 hypothesis. And like, okay, cool. 497 00:24:38,664 --> 00:24:42,474 Uh, this has been already done in physics, for example, uh, or 498 00:24:42,504 --> 00:24:45,904 like chemistry, et cetera. We don't expect like crazy amount of math 499 00:24:45,924 --> 00:24:49,684 come out, uh, by a mathematician, uh, or like a 500 00:24:49,724 --> 00:24:52,864 great, like, new, new lines of math schools, 501 00:24:53,384 --> 00:24:56,424 Uh, and I feel like with more attention on a 502 00:24:56,474 --> 00:24:58,504 It becomes harder to e-curve. S-curve is not easier. 503 00:24:58,824 --> 00:24:59,364 [speaker_0] Okay, uh- 504 00:24:59,414 --> 00:25:01,724 [speaker_1] It depends on what part of the S-curve you're on, I think. 505 00:25:02,324 --> 00:25:03,924 [speaker_0] Yeah, I think that S-curve analogy is good. 506 00:25:04,004 --> 00:25:07,924 So yeah, if we are comparing to other scientific 507 00:25:07,944 --> 00:25:09,644 field in which experiments are expensive. 508 00:25:09,704 --> 00:25:13,304 So, like, we should not compare this to, like, theoretical math where you just need 509 00:25:13,384 --> 00:25:14,724 person sitting with pen and paper. 510 00:25:14,804 --> 00:25:17,884 Like, something like drug discovery is a better analogy for this.
511 00:25:18,304 --> 00:25:21,844 And by the way, I do think ML is a bit different, 512 00:25:21,944 --> 00:25:25,274 analogizing with other fields, uh, in drug discovery. 513 00:25:25,324 --> 00:25:27,004 [speaker_1] Or like theoretical physics, for example. 514 00:25:27,344 --> 00:25:30,244 [speaker_0] Sorry, y-you mean theoretical physics is cheap 515 00:25:30,744 --> 00:25:34,404 [speaker_1] Is also expensive. It was cheap at some point, 516 00:25:34,424 --> 00:25:35,964 like, breakthrough with pen and paper. 517 00:25:36,384 --> 00:25:40,004 But now if you want to, like, uh, experimental physics, sorry, 518 00:25:40,244 --> 00:25:40,314 physics- 519 00:25:40,314 --> 00:25:40,674 [speaker_0] Mm. Yeah 520 00:25:40,674 --> 00:25:44,524 [speaker_1] ... uh, you could make like a... Yeah, 521 00:25:44,544 --> 00:25:46,174 to do any new physics. 522 00:25:46,624 --> 00:25:50,224 [speaker_0] Sure. Yeah. Okay, fine. Experimental physics would work. 523 00:25:50,504 --> 00:25:54,254 Uh, yeah. Now to actually argue about experimental 524 00:25:54,284 --> 00:25:57,624 physics or drug discovery, I will have to actually read more about 525 00:25:58,144 --> 00:25:58,924 physics or drug discovery. [chuckles] 526 00:25:59,064 --> 00:26:01,564 [speaker_1] Uh, but do you agree that, that, like- 527 00:26:01,744 --> 00:26:02,534 [speaker_0] Also, we have to- 528 00:26:02,804 --> 00:26:02,814 [speaker_1] With the- 529 00:26:02,824 --> 00:26:03,084 [speaker_0] Yeah, no 530 00:26:03,204 --> 00:26:04,094 [speaker_1] ... enhanced attention and- 531 00:26:04,204 --> 00:26:05,604 [speaker_0] Also, we have to pick a time period. Sorry. 
532 00:26:05,864 --> 00:26:09,684 Uh, also, like, if we take, you know, drug discovery or experimental physics as 533 00:26:09,724 --> 00:26:13,574 example, we have to take a time period in the 534 00:26:13,604 --> 00:26:17,584 was known that, okay, this thing is, you know, in like boom phase and like, you 535 00:26:17,624 --> 00:26:19,504 know, lots of new capabilities are coming out. 536 00:26:19,544 --> 00:26:22,364 Like, like in ML right now, we know we are in that sort of phase. 537 00:26:22,424 --> 00:26:24,064 Like, there may be new things we could try. 538 00:26:24,144 --> 00:26:26,524 Like, it's not like, okay, it's like a dead field- 539 00:26:26,564 --> 00:26:26,574 [speaker_1] Sure 540 00:26:26,574 --> 00:26:27,204 [speaker_0] ... mature field. 541 00:26:28,024 --> 00:26:28,564 We have not reached- 542 00:26:28,584 --> 00:26:32,424 [speaker_1] Sure, sure. Yeah, I agree. And like example, 543 00:26:32,464 --> 00:26:35,184 that. I don't know about experimental physics, 544 00:26:35,244 --> 00:26:39,224 also, where a bunch of like new physics 545 00:26:39,264 --> 00:26:41,144 1920s. There was a activity that came out. 546 00:26:41,184 --> 00:26:44,364 There's like, like, a bunch of these new... 547 00:26:44,434 --> 00:26:46,924 All of them essentially got started in 1920s. 548 00:26:47,144 --> 00:26:49,124 There was one period in the 1600s that happened. 549 00:26:49,244 --> 00:26:51,254 Uh, but, uh, I agree that we might- 550 00:26:51,604 --> 00:26:54,604 [speaker_0] Studying that particular time period to study, you know, 551 00:26:54,664 --> 00:26:57,334 that happened there? And after a few breakthroughs- 552 00:26:57,334 --> 00:26:57,334 [speaker_1] Yeah 553 00:26:57,384 --> 00:27:00,224 [speaker_0] ... came out, now to extrapolate, okay, 554 00:27:00,264 --> 00:27:02,244 breakthroughs come out, and how expensive will it be to run- 555 00:27:02,254 --> 00:27:02,554 [speaker_1] Sure, sure 556 00:27:02,584 --> 00:27:03,884 [speaker_0] ... experiments? 
I think that's the kind of- 557 00:27:03,924 --> 00:27:04,424 [speaker_1] And I agree 558 00:27:04,544 --> 00:27:05,344 [speaker_0] ... study. 559 00:27:05,384 --> 00:27:09,204 [speaker_1] And I agree. And, and, and then essentially, right, 560 00:27:09,304 --> 00:27:12,904 part of the S-curve, that then there, then you should 561 00:27:12,944 --> 00:27:14,344 expect more breakthroughs to come out. 562 00:27:14,724 --> 00:27:18,274 If you're on the horizontal part of the S-curve, then you should think, then you 563 00:27:18,284 --> 00:27:20,044 should expect less discoveries to come out. 564 00:27:20,104 --> 00:27:23,264 Which part of the S-curve you are on, I don't think either of us know. 565 00:27:23,384 --> 00:27:25,464 Uh, but I think- 566 00:27:25,604 --> 00:27:25,984 [speaker_0] I'm claiming- 567 00:27:26,074 --> 00:27:26,724 [speaker_1] ... the more time- 568 00:27:26,884 --> 00:27:30,504 [speaker_0] That's my claim. And also, sure, some decent probability they're not, 569 00:27:30,944 --> 00:27:33,384 [speaker_1] Okay. Uh, what will be your evidence to saying 570 00:27:33,444 --> 00:27:37,144 Like, how are you so sure? I have zero base to say that this thing, 571 00:27:37,184 --> 00:27:40,504 can, on hindsight be like, "Oh, looks like we were on the vertical part 572 00:27:40,784 --> 00:27:44,674 [speaker_0] Okay. And for me, it's just extrapolate last five to 10 573 00:27:44,724 --> 00:27:46,624 data points. Okay, which year did, uh... 574 00:27:46,804 --> 00:27:50,104 Well, actually you can go back to , you know, which year did AlexNet come out? 575 00:27:50,154 --> 00:27:50,154 [speaker_1] Uh- 576 00:27:50,184 --> 00:27:53,144 [speaker_0] Then which year did, you know, transformer come out? 577 00:27:53,184 --> 00:27:53,644 Which year did- 578 00:27:53,654 --> 00:27:53,654 [speaker_1] That- 579 00:27:53,654 --> 00:27:57,544 [speaker_0] ... GPT-2 come out? 
And just put these on like a year 580 00:27:57,584 --> 00:27:58,254 versus, you know- 581 00:27:58,304 --> 00:27:58,684 [speaker_1] That is- 582 00:27:58,733 --> 00:28:02,544 [speaker_0] ... new breakthrough kind of graph, 583 00:28:02,584 --> 00:28:04,344 this look like your S-curve is saturated? 584 00:28:04,404 --> 00:28:06,184 Yes or no?" And no, it doesn't look like there's- 585 00:28:06,304 --> 00:28:08,004 [speaker_1] And this is like an outside perspective. 586 00:28:08,044 --> 00:28:11,584 You don't have to trust it, but Ilya Sutskever, who 587 00:28:11,624 --> 00:28:15,564 was responsible for GPT, was responsible for RL, uh, comes 588 00:28:15,584 --> 00:28:18,764 and says essentially all the good ideas are done and we need to spend some time 589 00:28:18,804 --> 00:28:21,484 doing new research. And this is the time for... 590 00:28:21,524 --> 00:28:24,594 I don't know if you saw that episode of Dwarkesh, but he's like, 591 00:28:24,684 --> 00:28:26,624 scaling is over. Now is the time for new research." 592 00:28:27,524 --> 00:28:29,384 [speaker_0] What are Ilya Sutskever's timelines? 593 00:28:29,464 --> 00:28:32,804 Are they less bullish than me when I'm saying 25% ASI 2030? 594 00:28:32,864 --> 00:28:34,994 Like, does Ilya have like less bullish timelines 595 00:28:35,914 --> 00:28:37,593 [speaker_1] I don't know, but I think that's irrelevant. 596 00:28:38,024 --> 00:28:38,263 [speaker_0] Sorry? 597 00:28:38,984 --> 00:28:40,994 [speaker_1] I think that's irrelevant. I feel- 598 00:28:41,124 --> 00:28:42,624 [speaker_0] No, no, you brought up Ilya- 599 00:28:42,674 --> 00:28:42,674 [speaker_1] Yeah. No 600 00:28:42,674 --> 00:28:46,454 [speaker_0] ... then I was like, okay, like does Ilya agree with me already, 601 00:28:46,464 --> 00:28:46,784 thing. 602 00:28:47,464 --> 00:28:48,824 [speaker_1] It doesn't matter if Ilya agrees with you.
603 00:28:48,884 --> 00:28:52,424 What Ilya does agree, disagree with you on is 604 00:28:52,484 --> 00:28:56,164 that the low-hanging scaling fruits have been plucked and we need to go find new 605 00:28:56,204 --> 00:28:59,884 scaling breakthroughs, uh, which he's confident that he will find, 606 00:29:00,004 --> 00:29:03,734 economic incentive to say that. Uh, but he's saying that, "Okay, 607 00:29:03,764 --> 00:29:05,704 are down, and now we need to find something new to 608 00:29:06,164 --> 00:29:10,104 [speaker_0] Okay. No, but if you strongly defer to Ilya on 609 00:29:10,184 --> 00:29:12,234 this question, then Ilya's actual timelines are- 610 00:29:12,264 --> 00:29:13,284 [speaker_1] No, I don't. I don't, 611 00:29:14,164 --> 00:29:17,064 I don't, I don't defer strongly to Ilya on this 612 00:29:17,124 --> 00:29:20,434 All I'm saying is there's one extra evidence point saying that the guy who was 613 00:29:20,464 --> 00:29:23,164 involved with all these three breakthroughs comes and says 614 00:29:23,184 --> 00:29:26,604 no low-hanging fruits anymore, and we have to go find more, uh, 615 00:29:26,864 --> 00:29:30,364 should update you that we are probably not on the vertical part of 616 00:29:30,424 --> 00:29:33,424 [speaker_0] No, no, no. Uh, I took a different lesson from this. 617 00:29:33,584 --> 00:29:37,204 Uh, like Ilya has, again, as bullish as me timelines, 618 00:29:37,244 --> 00:29:40,764 low-hanging fruit is picked. What he means is, okay, the 619 00:29:40,804 --> 00:29:43,403 low-hanging fruit is not like one month low-hanging fruit. 620 00:29:43,464 --> 00:29:45,304 It's like two years, three years low-hanging fruit. 621 00:29:45,724 --> 00:29:48,604 [speaker_1] No, I think his time, his timelines are significantly longer. 622 00:29:48,684 --> 00:29:52,564 I think he's like in the next eight to 10 years will probably be some areas 623 00:29:52,624 --> 00:29:56,144 of research that we need to find to scale, and yeah. 
624 00:29:56,524 --> 00:29:56,534 [speaker_0] Okay. 625 00:29:56,564 --> 00:29:58,244 [speaker_1] I, I think his is probably longer. 626 00:29:58,304 --> 00:30:02,124 [speaker_0] So, so again, I'm like, yeah, if you want to debate specifically Ilya's 627 00:30:02,204 --> 00:30:04,064 I think actually we have to go and find his timeline. 628 00:30:04,124 --> 00:30:05,244 [speaker_1] I don't want to debate Ilya's worldview. 629 00:30:05,324 --> 00:30:08,964 I'm saying that this, the, it's not a question of whether Ilya 630 00:30:09,004 --> 00:30:12,704 The question of, uh, Ilya has an evidence point 631 00:30:12,964 --> 00:30:15,153 may or may not update you on which part of the S-curve we are on. 632 00:30:15,284 --> 00:30:18,784 [speaker_0] Yeah. So for that, I need to first even understand does Ilya 633 00:30:18,824 --> 00:30:19,934 or does he have some major disagreement? 634 00:30:19,984 --> 00:30:23,924 [speaker_1] No, he doesn't. He doesn't. He, his timelines are, uh, 635 00:30:24,880 --> 00:30:27,429 [speaker_0] So then want to know what his timelines are. 636 00:30:27,900 --> 00:30:31,630 [speaker_1] I think he said the next... I might have to check on this, but 637 00:30:31,740 --> 00:30:35,640 Dwarkesh episode, he says that like six to eight years, uh, 638 00:30:35,680 --> 00:30:37,180 and then we'll find something that we can scale. 639 00:30:38,100 --> 00:30:41,740 And, uh, there are no good scaling candidates in 640 00:30:42,200 --> 00:30:45,840 [speaker_0] Yeah, no, to continue this, like, I think, like, 641 00:30:45,960 --> 00:30:47,800 more... Like, are you sure about this? You know. 642 00:30:47,840 --> 00:30:48,120 [speaker_1] Yeah. 643 00:30:48,180 --> 00:30:49,120 [speaker_0] Give me more context and- 644 00:30:49,520 --> 00:30:52,860 [speaker_1] No, no, I'm saying, I'm saying, I'm saying that, 645 00:30:52,980 --> 00:30:55,180 S-curve we are on, uh, there's uncertainty on it.
646 00:30:55,740 --> 00:30:58,180 Uh, maybe you're right and we are on the vertical part of the S-curve. 647 00:30:58,580 --> 00:31:02,300 I think if you just plot it from AlexNet, it's mostly the same 648 00:31:02,340 --> 00:31:05,940 paradigm. Uh, and I would count AlexNet 649 00:31:06,660 --> 00:31:10,600 as one breakthrough, Transformers as another breakthrough, 650 00:31:10,640 --> 00:31:11,000 breakthrough. 651 00:31:11,980 --> 00:31:12,930 But apart from that, 652 00:31:13,760 --> 00:31:14,790 I don't see why- 653 00:31:15,160 --> 00:31:17,910 [speaker_0] Yes, once, uh, no, but there are also like minor ones 654 00:31:18,120 --> 00:31:21,800 [speaker_1] ... like, I don't know any other paradigm that 655 00:31:21,840 --> 00:31:25,220 Okay, I won't even count AlexNet 'cause AlexNet 656 00:31:25,280 --> 00:31:29,180 works. Uh, I would just count Transformers 657 00:31:29,240 --> 00:31:31,640 that, uh, show that scale is all you need. 658 00:31:32,160 --> 00:31:35,910 Uh, and then there's n- there's not been a third candidate for 659 00:31:35,960 --> 00:31:39,180 need, or a third S-curve that you can stack on top of these things. 660 00:31:39,560 --> 00:31:41,550 Pre-training was one S-curve. We exhausted it. 661 00:31:41,600 --> 00:31:45,520 [speaker_0] There are multiple points. One is like proof that Transformers 662 00:31:45,600 --> 00:31:49,520 all, and then there's a second data point 663 00:31:49,640 --> 00:31:50,570 Or even with, like- 664 00:31:50,620 --> 00:31:53,010 [speaker_1] Transformers are useful at all is... 665 00:31:53,050 --> 00:31:55,360 Transformers are only useful because they can scale. 666 00:31:55,740 --> 00:31:58,110 [speaker_0] No. Uh, GPT-2 was- 667 00:31:58,120 --> 00:31:58,420 [speaker_1] Because- 668 00:31:58,640 --> 00:32:02,060 [speaker_0] ... useful. Like, it was a breakthrough by itself, even 669 00:32:02,100 --> 00:32:04,370 anything about whether GPT-2 would scale or not. 
670 00:32:04,420 --> 00:32:08,140 [speaker_1] If it doesn't, then... No, what I'm talking about is that if you 671 00:32:08,200 --> 00:32:11,800 assume that scaling is what gets you smarter models, uh, 672 00:32:12,120 --> 00:32:15,540 subscribe to that worldview, then you need paradigms that can scale. 673 00:32:15,790 --> 00:32:17,330 Pre-training is one paradigm that can scale. 674 00:32:17,560 --> 00:32:21,520 [speaker_0] Right now. At the time GPT-2 was invented, 675 00:32:21,700 --> 00:32:23,360 ML research community didn't believe it. 676 00:32:23,420 --> 00:32:26,790 [speaker_1] Sure. And I, like, I'm saying that it's, that's, that's 677 00:32:27,320 --> 00:32:28,880 irrespective, like, that's immaterial. 678 00:32:28,940 --> 00:32:32,769 What I'm saying right now is if you believe that intelligence comes 679 00:32:32,780 --> 00:32:33,570 scaling things up- 680 00:32:34,180 --> 00:32:34,460 [speaker_0] Sure 681 00:32:34,540 --> 00:32:38,220 [speaker_1] ... then scaling a paradigm up, 682 00:32:38,280 --> 00:32:38,800 scaling up. 683 00:32:39,460 --> 00:32:39,640 [speaker_0] Yeah. 684 00:32:39,700 --> 00:32:43,620 [speaker_1] In doing so, uh, we have found only two paradigms that have 685 00:32:43,700 --> 00:32:44,060 scaled up. 686 00:32:44,180 --> 00:32:47,300 [speaker_0] We have found only two paradigms that have scaled up. 687 00:32:47,360 --> 00:32:51,120 No, I mean, why does, uh, the whole scaling fully connected 688 00:32:51,180 --> 00:32:54,440 networks back in twenty twelve, twenty thirteen, why does that not count? 689 00:32:54,480 --> 00:32:57,420 Why does scaling LSTM not count? I, I'm not clear. 690 00:32:57,620 --> 00:33:00,000 [speaker_1] Because LSTM didn't scale. CNNs didn't scale. 691 00:33:00,820 --> 00:33:02,920 [speaker_0] What do you mean? CNNs do scale. 
692 00:33:03,220 --> 00:33:06,540 [speaker_1] As in, like, you get diminishing returns from scaling 693 00:33:07,060 --> 00:33:08,950 CNNs, for example, cannot- 694 00:33:09,320 --> 00:33:09,330 [speaker_0] Yeah. 695 00:33:09,340 --> 00:33:12,020 [speaker_1] People try to do language experiments on CNN, 696 00:33:12,080 --> 00:33:15,490 People try to do language experiments on LSTM, they perform to a certain level, but 697 00:33:15,520 --> 00:33:17,500 essentially then they start degrading. 698 00:33:18,020 --> 00:33:21,420 So th- these, these paradigms have lots of 699 00:33:21,580 --> 00:33:22,900 limitations on how much they can scale. 700 00:33:23,540 --> 00:33:27,520 [speaker_0] Sure. So they, they did scale up to some amount, 701 00:33:27,560 --> 00:33:29,980 saturated, I guess. Same. Like you had some estimate, I guess. 702 00:33:30,020 --> 00:33:33,440 [speaker_1] So then, then it's a failed candidate. Yeah, 703 00:33:33,500 --> 00:33:37,440 Like, like a good candidate is that, hey, it doesn't matter how much compute 704 00:33:37,460 --> 00:33:39,220 we throw at it, just keep scaling. 705 00:33:39,720 --> 00:33:41,880 [speaker_0] No, no, it means that back then it succeeded, right? 706 00:33:42,140 --> 00:33:45,640 CNNs, we did scale up to some amount, then we realized we're getting diminishing 707 00:33:45,680 --> 00:33:46,200 returns. 708 00:33:46,280 --> 00:33:47,300 [speaker_1] No, but that's what I'm saying, like, 709 00:33:48,320 --> 00:33:50,640 that way, like... No, no, that didn't... 710 00:33:50,740 --> 00:33:54,180 I don't know what your definition of success is, 711 00:33:54,220 --> 00:33:58,060 to superintelligence. Candidates that can get us to superintelligence are, 712 00:33:58,140 --> 00:33:59,940 are candidates that you can throw. 713 00:33:59,980 --> 00:34:03,770 There is no limit to how much, uh, or there's no visible limit to 714 00:34:03,880 --> 00:34:05,880 how much compute you can throw at it. 
715 00:34:05,920 --> 00:34:09,580 For example, if tomorrow we find out that pre-training has stopped scaling, 716 00:34:09,600 --> 00:34:11,089 call pre-training a failed candidate. 717 00:34:11,400 --> 00:34:15,240 Currently, we have two candidates that can absorb insane amounts of compute 718 00:34:15,320 --> 00:34:17,800 et cetera, and can keep expecting gains from it. 719 00:34:18,400 --> 00:34:22,010 [speaker_0] No, even the paradigm that gets us to superintelligence might 720 00:34:22,040 --> 00:34:24,240 saturate somewhere. E-e-every candidate- 721 00:34:24,250 --> 00:34:24,560 [speaker_1] Sure 722 00:34:24,580 --> 00:34:25,080 [speaker_0] ... can saturate. 723 00:34:25,220 --> 00:34:27,620 [speaker_1] They would, can saturate, but I'm saying... 724 00:34:27,820 --> 00:34:31,400 A-and at that point, essentially, uh, you can either draw a line and say 725 00:34:31,500 --> 00:34:35,060 intelligence is good enough. Assuming that, okay, so you're in a, 726 00:34:35,140 --> 00:34:37,680 where you do not have enough, uh... 727 00:34:37,700 --> 00:34:40,500 You've not achieved the level of, or like, not achieved, that you don't want to 728 00:34:40,540 --> 00:34:44,400 achieve. But I'm saying like, uh, you think that you can still throw more 729 00:34:44,439 --> 00:34:46,800 at this and get more intelligence out of this. 730 00:34:47,380 --> 00:34:51,180 Uh, and, uh, in that paradigm, there are only two things that can 731 00:34:51,240 --> 00:34:53,760 absorb seemingly infinite amounts of compute. 732 00:34:54,540 --> 00:34:55,670 Uh, and there's not been a third. 733 00:34:55,670 --> 00:34:57,680 [speaker_0] I have a problem with your infinite thing. 734 00:34:57,820 --> 00:34:58,160 It's- 735 00:34:59,500 --> 00:35:02,560 [speaker_1] Seemingly infinite in the sense that, sure, 736 00:35:02,600 --> 00:35:04,380 how much you can do it, or there might be like some... 
737 00:35:04,560 --> 00:35:07,240 There are obviously practical limits to it, but there might be a theoretical limit 738 00:35:07,300 --> 00:35:10,860 as well. And I'm saying that maybe before we get there, uh, 739 00:35:11,140 --> 00:35:14,060 superintelligence comes and then you, you've achieved your goal 740 00:35:14,100 --> 00:35:14,700 to scale further. 741 00:35:15,720 --> 00:35:18,240 But until now, there's no evidence to see that there 742 00:35:18,640 --> 00:35:21,580 [speaker_0] Sorry, I'm still not super clear what your claim 743 00:35:21,620 --> 00:35:23,500 Can you like summarize this entire argument? 744 00:35:23,580 --> 00:35:26,540 [speaker_1] Maybe I'll, maybe I'll try to explain it in a 745 00:35:27,280 --> 00:35:31,200 A successful paradigm in my book is one that can 746 00:35:31,260 --> 00:35:35,009 absorb all the compute capacity that you can reasonably throw at it at 747 00:35:35,120 --> 00:35:35,680 point in time. 748 00:35:36,180 --> 00:35:39,790 [speaker_0] That point in time means what? In that year, how many GPUs had 749 00:35:39,860 --> 00:35:41,420 humanity manufactured? 750 00:35:41,440 --> 00:35:45,140 [speaker_1] So it's also... Yeah, in that point in time, 751 00:35:45,200 --> 00:35:49,180 So, so in twenty twenty-six, uh, there are two successful paradigms that can 752 00:35:49,260 --> 00:35:51,380 absorb all the compute and still keep on giving gain. 753 00:35:51,700 --> 00:35:53,420 We have not exhausted these two paradigms. 754 00:35:53,870 --> 00:35:53,870 [speaker_0] Right. 755 00:35:53,880 --> 00:35:57,500 [speaker_1] And we still have like returns to get from there, 756 00:35:57,540 --> 00:36:01,340 essentially get us to smarter models. There are just two of them. And twenty... 757 00:36:01,680 --> 00:36:05,600 And so I'm saying like you keep extrapolating it, our, our, our lim-- our, 758 00:36:05,760 --> 00:36:09,460 thing. 
So in twenty twenty-- by the time we get to twenty thirty, 759 00:36:09,500 --> 00:36:12,540 that these two paradigms will not get to superintelligence because there are 760 00:36:12,580 --> 00:36:15,900 practical limits to scaling them, if not theoretical limits. 761 00:36:16,020 --> 00:36:19,940 Uh, and, uh, we need either a third paradigm 762 00:36:19,960 --> 00:36:23,860 seventh paradigm to actually keep stacking these S-curves to get us 763 00:36:23,920 --> 00:36:26,639 to the world that you are claiming we'll be in, say, 764 00:36:27,000 --> 00:36:29,820 [speaker_0] Uh, okay. Sure. My probability 765 00:36:29,940 --> 00:36:33,772 of pre-training scaling plus RL scaling gets us 766 00:36:33,832 --> 00:36:37,592 to ASI by 2030 is less, let's say, less than 767 00:36:37,632 --> 00:36:38,092 10%. 768 00:36:38,232 --> 00:36:41,372 [speaker_1] Yeah, like probably we do need a better, uh, 769 00:36:41,852 --> 00:36:44,292 breakthrough. Yeah. Probably we do need another breakthrough. 770 00:36:44,512 --> 00:36:45,292 I am saying that 771 00:36:46,691 --> 00:36:50,112 these breakthroughs are not super easy to come by. 772 00:36:50,172 --> 00:36:52,992 Dep- I think we d- just did the S curves debate. But yeah, 773 00:36:53,012 --> 00:36:56,672 The one crux we have is that I am very uncertain of 774 00:36:57,092 --> 00:37:00,051 you have somehow. You are somehow more certain in 775 00:37:00,092 --> 00:37:03,512 have, and this is just a prior belief thing. 776 00:37:03,632 --> 00:37:06,212 I don't know if there's, like, any evidence that you can show me. 777 00:37:06,532 --> 00:37:07,122 [speaker_0] Yeah, I think there are- 778 00:37:07,132 --> 00:37:07,901 [speaker_1] And maybe I don't know 779 00:37:07,901 --> 00:37:09,112 [speaker_0] ... S curves we are tracking. One is like, 780 00:37:10,092 --> 00:37:13,972 uh, curves for individual paradigms, and one is like some bigger curve 781 00:37:14,012 --> 00:37:17,012 of, you know, like humanity's ML research as a whole.
782 00:37:17,572 --> 00:37:21,212 So there, one is like the curve of pre-training 783 00:37:21,312 --> 00:37:25,292 started saturating. There's a curve for 784 00:37:25,392 --> 00:37:26,892 scale, when did they start saturating. 785 00:37:27,492 --> 00:37:31,042 There's a curve for, you know, when did RL scaling start, 786 00:37:31,092 --> 00:37:34,232 might saturate someday. And then there are multiple of these curves, but then 787 00:37:34,252 --> 00:37:38,032 there's a bigger overall trajectory of how fast is humanity's ML 788 00:37:38,092 --> 00:37:38,792 capability growing. 789 00:37:39,192 --> 00:37:42,852 [speaker_1] Uh, that I don't agree with because 790 00:37:42,932 --> 00:37:46,112 it. For example, if you start tracking it, there have been multiple AI winters and 791 00:37:46,152 --> 00:37:49,932 AI summers. There were times where people thought 792 00:37:50,412 --> 00:37:54,292 in AI, and if we just do work on GOFAI or we just do work on, 793 00:37:54,332 --> 00:37:57,082 like, some other paradigm, we'll be able to get to ASI from here. 794 00:37:57,112 --> 00:37:59,472 [speaker_0] Why aren't we including all of that as data points? 795 00:37:59,992 --> 00:38:03,752 [speaker_1] So if you keep including them, then essentially, uh, this could 796 00:38:03,812 --> 00:38:07,752 either be like a, a 1990s rush in 797 00:38:07,832 --> 00:38:10,142 AI of, like saying that, "Hey, we have super intelligence. 798 00:38:10,272 --> 00:38:14,162 We have chess playing AI, and we are, like, a few research ideas away from 799 00:38:14,252 --> 00:38:18,242 having, like, super intelligence AI because we have Deep 800 00:38:18,432 --> 00:38:21,312 we'll be like, "No, that, that S curve actually flatlined 801 00:38:21,352 --> 00:38:24,742 super intelligence." And then you have a larger S curve of, 802 00:38:25,072 --> 00:38:28,232 um, and, uh, transformers and R- RL. 
803 00:38:28,692 --> 00:38:32,312 And that S curve, depending on where we are on the S curve, 804 00:38:32,372 --> 00:38:36,172 still saturate before we get to, uh, super intelligence. 805 00:38:36,912 --> 00:38:40,602 Uh, and, uh, either that could happen, that is one world that could happen, 806 00:38:40,772 --> 00:38:44,732 or another S curve gets stacked on top of it, and it keeps going till 807 00:38:44,772 --> 00:38:48,732 we reach super intelligence. Uh, so, uh, or maybe, like, how close we 808 00:38:48,752 --> 00:38:52,152 are to super intelligence, essentially whatever that bar is, 809 00:38:52,192 --> 00:38:55,662 there or we need to stack more S curves to basically get us there 810 00:38:55,692 --> 00:38:59,352 faster. Uh, I do agree that on infinite human 811 00:38:59,372 --> 00:39:01,432 timescale, we'll get to super intelligence at some 812 00:39:01,932 --> 00:39:04,772 Uh, but yeah, 2030 is the timelines that we're dealing with. 813 00:39:04,792 --> 00:39:06,832 [speaker_0] Again, I didn't understand your argument. 814 00:39:06,972 --> 00:39:10,952 If I track, you know, human- humanity's AI research progress since, 815 00:39:11,032 --> 00:39:14,812 1970, yes, there have been multiple spans of few 816 00:39:14,852 --> 00:39:17,912 years where we did get, you know, some one breakthrough, something happened, 817 00:39:17,952 --> 00:39:19,952 then we had, like, you know, 20 years of nothing happening. 818 00:39:20,032 --> 00:39:21,542 Yes, there have been multiple of these. 819 00:39:22,312 --> 00:39:24,912 Uh, right now we could be in either one of 820 00:39:24,952 --> 00:39:28,852 We might be about to reach an AI winter or we might be 821 00:39:28,912 --> 00:39:30,792 about... Yeah, we might get a few more breakthroughs. 822 00:39:30,812 --> 00:39:33,972 Those again might not, might or might not get to super intelligence. 823 00:39:34,252 --> 00:39:38,052 Sure, uh, where are we in this? 
So far I'm agreeing it's now a question 824 00:39:38,132 --> 00:39:40,252 how do you put the numbers on these things and... 825 00:39:40,961 --> 00:39:43,712 [speaker_1] The, the quest- the disagreement comes from the fact that, 826 00:39:44,192 --> 00:39:48,042 It's not a disagreement in the argument, it's a disagreement on how many 827 00:39:48,192 --> 00:39:51,792 S, if a new S curve is needed or will these S curves scale to super 828 00:39:51,832 --> 00:39:54,452 intelligence and how easy are these S curves to come by. 829 00:39:54,892 --> 00:39:58,352 [speaker_0] So what would be a data point that would actually change 830 00:39:58,752 --> 00:40:02,592 your mind? Like, for me, it's, like, fairly, yeah, almost obvious 831 00:40:02,692 --> 00:40:06,482 that, okay, yeah, there is a b- huge backlog of, backlog of research 832 00:40:06,572 --> 00:40:09,372 ideas that need to be, that will definitely try- 833 00:40:09,432 --> 00:40:09,852 [speaker_1] That I think- 834 00:40:10,332 --> 00:40:11,152 [speaker_0] And all of them require- 835 00:40:11,222 --> 00:40:11,222 [speaker_1] Yeah, but- 836 00:40:11,222 --> 00:40:13,392 [speaker_0] ... a lot of people to try and, yeah. 837 00:40:13,992 --> 00:40:17,472 [speaker_1] Sure. I think sure there might be some research 838 00:40:17,532 --> 00:40:21,472 overhang, but, uh, the probability of us 839 00:40:21,512 --> 00:40:24,692 finding a breakthrough in the research ideas might be below, uh... 840 00:40:24,772 --> 00:40:27,632 I think the ML research community is very, very smart. 841 00:40:28,092 --> 00:40:32,052 Uh, they figure out all the best candidates 842 00:40:32,102 --> 00:40:35,392 a daily basis, et cetera. And in the last five, six years- 843 00:40:35,512 --> 00:40:38,232 [speaker_0] Yeah, I think that's where, yeah, like, 844 00:40:38,292 --> 00:40:41,702 smart. It comes down to try, hit and trial random shit until it works. 
845 00:40:42,042 --> 00:40:45,842 [laughs] Like, I don't think people had, like, some, uh, brilliant insight, 846 00:40:45,862 --> 00:40:48,661 "Okay, this is why this thing is definitely going to work," 847 00:40:48,672 --> 00:40:50,982 tried it a ton of time. I think people just, like- 848 00:40:51,031 --> 00:40:51,362 [speaker_1] No, they did 849 00:40:51,452 --> 00:40:53,672 [speaker_0] ... tried the random 20 random things to try 850 00:40:53,982 --> 00:40:56,672 [speaker_1] If you look to-- No, but I don't think it 851 00:40:56,732 --> 00:41:00,512 I think the people who try it have some intuition of, uh, why 852 00:41:00,572 --> 00:41:03,852 this could work or why this wouldn't work, and they might be wrong or right and, 853 00:41:03,892 --> 00:41:05,731 and the outcomes might look random, et cetera. 854 00:41:06,032 --> 00:41:09,732 But the selection of which experiments to try definitely has 855 00:41:09,752 --> 00:41:13,352 A good AI researcher is one that tries more successful experiments 856 00:41:13,932 --> 00:41:17,732 And a bad researcher is they keep making bad bets on research 857 00:41:17,772 --> 00:41:18,692 keep failing at it. 858 00:41:19,492 --> 00:41:23,372 And, uh, my, the, my, my, the data point that 859 00:41:23,452 --> 00:41:25,902 I want to look at is that despite so many, 860 00:41:26,812 --> 00:41:30,692 so much money flowing into, like, finding research ideas, I'm sure 861 00:41:30,752 --> 00:41:32,612 we'll be able to scale this further. 862 00:41:33,012 --> 00:41:36,992 But is there an RL level or a pre-training level idea, uh, 863 00:41:37,092 --> 00:41:41,012 already out there that has not been tried yet because of 864 00:41:41,052 --> 00:41:44,952 compute? 
Because people, like researchers are literally drawing 865 00:41:45,032 --> 00:41:48,912 whatever, um, a hat and implementing the idea as opposed to 866 00:41:48,952 --> 00:41:51,672 reading the idea, understanding the viability of it working 867 00:41:52,212 --> 00:41:54,632 Uh, I think, like, like I believe in the second one. 868 00:41:54,992 --> 00:41:58,692 Uh, and if that is true, then, then where is the idea is my question. 869 00:41:59,172 --> 00:42:00,732 Uh, and just because you got two ideas- 870 00:42:00,762 --> 00:42:03,832 [speaker_0] Yeah, that question is definitely a crux like... Like yeah, 871 00:42:03,892 --> 00:42:06,932 On one side you have like researchers understand nothing about the problem and 872 00:42:06,952 --> 00:42:10,572 they're just brute forcing. On the other end of the spectrum you have, 873 00:42:10,632 --> 00:42:14,432 researcher deeply understands the thing and they have an hypothesis 874 00:42:14,512 --> 00:42:17,372 actually running the training run, they already know with confidence this is 875 00:42:17,412 --> 00:42:21,212 definitely going to work. And you are saying, okay, researchers 876 00:42:21,272 --> 00:42:24,162 end of understanding things. I'm saying they're far closer to the end 877 00:42:24,212 --> 00:42:25,152 randomly brute forcing. 878 00:42:25,612 --> 00:42:28,752 [speaker_1] I don't know. If you look, like, if you look at, like Noam, 879 00:42:28,832 --> 00:42:32,702 heard Noam, what's his name? Yeah, Noam Shazeer talk about, 880 00:42:32,792 --> 00:42:36,402 uh, when they were getting into transformers, 881 00:42:36,402 --> 00:42:39,692 transformers out, when they were essentially scaling language models, 882 00:42:39,752 --> 00:42:41,252 wrote the attention paper, et cetera. 883 00:42:41,732 --> 00:42:43,852 Uh, it wasn't like a random idea that they had come. 884 00:42:44,172 --> 00:42:48,132 Uh, Noam Shazeer has like this history of 885 00:42:48,252 --> 00:42:51,992 ideas to try out. 
He's called like a, a magical researcher 886 00:42:52,012 --> 00:42:55,532 because he can seemingly look at like 100 ideas and figure out, 887 00:42:55,572 --> 00:42:59,072 like these are the one, two. He has like crazy intuition of these 888 00:42:59,132 --> 00:43:02,092 ideas that could work because he understands these things much more deeply 889 00:43:02,172 --> 00:43:02,872 average researcher. 890 00:43:03,062 --> 00:43:03,212 [speaker_0] Okay. Oh. 891 00:43:03,292 --> 00:43:05,932 [speaker_1] And there are the superstar researchers that can look at ideas 892 00:43:05,972 --> 00:43:09,944 like breakthrough ideas much more quickly. And I don't think it's as 893 00:43:10,004 --> 00:43:12,684 random as that, "Hey, let me just pick one and do it, and 894 00:43:12,744 --> 00:43:14,204 Otherwise, I'll go pick another one tomorrow." 895 00:43:14,664 --> 00:43:18,564 [speaker_0] For me, I see it more as, yeah, like, yes, Noam Shazeer probably did 896 00:43:18,604 --> 00:43:21,984 have some intuitions, but also it was random that he 897 00:43:22,044 --> 00:43:24,244 How would I put it? There have been three 898 00:43:24,304 --> 00:43:26,584 Noam Shazeer directly contributed to one of them. 899 00:43:27,084 --> 00:43:30,644 If you take any of the other research breakthroughs which Noam Shazeer did not 900 00:43:30,704 --> 00:43:34,584 make, and you put him in one year before 901 00:43:34,624 --> 00:43:37,824 happened and told him, "Look, here are all these hypotheses the different 902 00:43:37,844 --> 00:43:41,444 researchers are making. Which one do you think will work?" I don't 903 00:43:41,484 --> 00:43:44,144 have made that good a guess and told you, "Oh, this one will work."
904 00:43:44,724 --> 00:43:48,424 [speaker_1] And I saw a Noam Shazeer, uh, and Jeff Dean 905 00:43:48,744 --> 00:43:49,024 talk 906 00:43:49,824 --> 00:43:53,544 about this exact thing, and from the story that they 907 00:43:54,064 --> 00:43:56,864 their meeting perspective and understand, et cetera. 908 00:43:57,224 --> 00:43:58,524 [speaker_0] Yeah, if what you are saying is correct- 909 00:43:58,614 --> 00:43:58,614 [speaker_1] [laughs] 910 00:43:58,614 --> 00:44:02,564 [speaker_0] ... there should be, like, the same researcher who's consistently 911 00:44:02,604 --> 00:44:05,414 where the field is heading multiple times and should be multiple times- 912 00:44:05,744 --> 00:44:05,954 [speaker_1] That's true 913 00:44:05,964 --> 00:44:09,644 [speaker_0] ... not 100% co- correct, but, like, roughly able to see, okay, 914 00:44:09,684 --> 00:44:12,464 are probably going to work and then actually roughly ends up correct. 915 00:44:13,064 --> 00:44:13,614 Whereas I'm saying no actually- 916 00:44:13,624 --> 00:44:14,844 [speaker_1] That's been true 917 00:44:14,944 --> 00:44:15,904 [speaker_0] ... I'm saying no actually- 918 00:44:15,964 --> 00:44:18,744 [speaker_1] Because the amount of, the amount of breakthroughs have come 919 00:44:18,944 --> 00:44:21,854 No, the amount of breakthrough that have come from these superstar 920 00:44:21,884 --> 00:44:25,334 like, very, very high. Why is Ilya Sutskever around all big 921 00:44:25,384 --> 00:44:27,884 'Cause he has a crazy sense of which research ideas can work. 922 00:44:28,284 --> 00:44:30,304 Why is Noam Shazeer around all these big breakthroughs? 923 00:44:30,334 --> 00:44:32,364 'Cause he has a crazy idea of all these things to do. 924 00:44:32,404 --> 00:44:36,184 There's a random reason why a random ML PhD you've never heard of comes up with a 925 00:44:36,304 --> 00:44:38,964 crazy idea. 
It's mostly because you literally have to 926 00:44:39,784 --> 00:44:43,663 uh, what's that guy's name, who's the GPT-2 main, Alec 927 00:44:43,724 --> 00:44:47,524 Radford type level researcher. Apparently, Alec Radford has such a great 928 00:44:47,704 --> 00:44:51,464 sense of what could work. He's like, he literally used to 929 00:44:51,544 --> 00:44:55,434 do small experiments on Jupyter Notebooks, and 930 00:44:55,524 --> 00:44:59,184 he then once he got convinced that this could work, 931 00:44:59,244 --> 00:45:02,184 engineering to Greg Brockman or somebody who's like, "Yeah, 932 00:45:02,284 --> 00:45:05,964 Just keep scaling it up. No, I'm sure it will work." Uh, 933 00:45:06,104 --> 00:45:08,914 built so much intuition about, like, and what couldn't work 934 00:45:09,024 --> 00:45:12,804 this crazy ML whisperer guy who can just, like, look at the 935 00:45:12,884 --> 00:45:16,644 shape of the model and figure out, like, these are ideas worth pursuing, not worth 936 00:45:16,664 --> 00:45:19,813 pursuing. And if you look at, like, Thinking Machines Lab, which is, like, 937 00:45:19,884 --> 00:45:23,684 lab filled, filled with all these guys, John Schulman, uh, what's his name, Alec 938 00:45:23,724 --> 00:45:27,584 Radford, all the OG co-founder guys, they have essentially had, like, 939 00:45:27,764 --> 00:45:31,384 free rein to do whatever. And the best they came up with 940 00:45:31,764 --> 00:45:35,664 that updated me towards that, oh, okay, there's, like, a lot of things to do here, 941 00:45:35,704 --> 00:45:39,634 but there is no crazy research breakthrough paradigm that, that 942 00:45:39,764 --> 00:45:43,444 oh, we got, like, a pre-training level paradigm 943 00:45:43,524 --> 00:45:46,364 stack on pre-training and get, like, insane results. 944 00:45:46,404 --> 00:45:50,304 [speaker_0] Yeah, I think I've identified a data point 945 00:45:50,384 --> 00:45:54,284 on this. 
If you, uh, again, from these, you know, three or four superstar 946 00:45:54,304 --> 00:45:58,104 researchers, if you're able to document a public track 947 00:45:58,264 --> 00:46:02,164 of, well, uh, yeah, since maybe 2018 948 00:46:02,244 --> 00:46:04,844 till 2026, some at least 10 years or no. 949 00:46:05,864 --> 00:46:09,084 Well, yeah, at least, yeah, like more than five, at least seven, 950 00:46:09,664 --> 00:46:13,624 consistent track record. Like, okay, here they made these predictions in 2018. 951 00:46:13,684 --> 00:46:15,644 They made these prediction 2020. They made these- 952 00:46:15,664 --> 00:46:16,604 [speaker_1] Dario is one guy 953 00:46:16,684 --> 00:46:18,813 [speaker_0] ... in 2022, and they made these in 2024. 954 00:46:19,004 --> 00:46:22,564 And, like, they didn't single-handedly do all the 955 00:46:22,604 --> 00:46:25,224 roughly able to see where the next breakthroughs are going to come from. 956 00:46:25,544 --> 00:46:29,124 If you can show me, okay, just the same guy 957 00:46:29,144 --> 00:46:32,264 trend, then that would actually shift my 958 00:46:32,944 --> 00:46:36,304 [speaker_1] I don't know. I think I 100% believe what you're saying 959 00:46:36,324 --> 00:46:40,164 true. Uh, I don't know how... Like, I'm, I'm trying to think of what are ways to 960 00:46:40,204 --> 00:46:44,084 show you this is happening. Uh, one way to do this would be that if you 961 00:46:44,124 --> 00:46:48,104 look at the large breakthroughs, and you look at who's responsible 962 00:46:48,124 --> 00:46:51,124 or who's close to those breakthroughs, it will seem like it's the same people. 963 00:46:51,684 --> 00:46:55,304 And that should update you towards that, hey, how come, uh, Ilya 964 00:46:55,313 --> 00:46:57,424 Sutskever is involved with all the big breakthroughs? 965 00:46:57,484 --> 00:47:01,024 How come it came from the same guy who did AlexNet, is the same guy who did 966 00:47:01,424 --> 00:47:05,144 GPT-2, is the same guy who did RL? 
Why, why is it the same guy who's doing all 967 00:47:05,164 --> 00:47:08,424 these other things? Why is Noam Shazeer building all of these 968 00:47:08,504 --> 00:47:10,474 Uh, why is Alec Radford building all these 969 00:47:10,844 --> 00:47:13,834 It's because they have figured out or, or they have impeccable... 970 00:47:14,124 --> 00:47:18,064 So they, they talk about impeccable research taste, 971 00:47:18,074 --> 00:47:21,714 taste is what is really hard. And research taste is this intuition that 972 00:47:21,724 --> 00:47:25,624 researchers have that can figure out from a pile of, like, 973 00:47:25,664 --> 00:47:28,374 one to worth trying from the compute we have to get the breakthrough. 974 00:47:28,784 --> 00:47:32,624 [speaker_0] Yeah, so literally what you said, 975 00:47:32,684 --> 00:47:36,664 here is Alec Radford's track record of research hypothesis going back 976 00:47:36,684 --> 00:47:37,404 entire eight years- 977 00:47:37,464 --> 00:47:41,064 [speaker_1] But why won't you just buy Ilya's track record? 978 00:47:41,204 --> 00:47:44,324 [speaker_0] Uh, I literally don't know enough about this 979 00:47:44,414 --> 00:47:47,204 What did Ilya say in 2018? What did he say in '19? 980 00:47:47,224 --> 00:47:48,134 [speaker_1] No, he doesn't say anything. 981 00:47:48,134 --> 00:47:48,474 [speaker_0] I'm literally not saying- 982 00:47:48,524 --> 00:47:52,084 [speaker_1] But basically the fact that he doesn't, 983 00:47:52,164 --> 00:47:56,104 Basically, if he doesn't have to publicly make any of 984 00:47:56,164 --> 00:47:59,924 there's a reason why all big breakthroughs are around one person, it 985 00:48:00,024 --> 00:48:03,184 stands to reason that this person picks better research 986 00:48:04,084 --> 00:48:06,294 person who's picking research ideas at a random, at random. 
987 00:48:06,864 --> 00:48:07,294 [speaker_0] No, no, but- 988 00:48:07,484 --> 00:48:07,614 [speaker_1] Like- 989 00:48:07,844 --> 00:48:11,764 [speaker_0] Was he personally the one who did the breakthrough, 990 00:48:11,804 --> 00:48:14,024 to be at the lab where somebody else did the breakthrough? 991 00:48:14,424 --> 00:48:16,384 [speaker_1] No, no. He personally was overseeing research. 992 00:48:16,424 --> 00:48:19,224 He personally was green-lighting the experiments that he thinks would work. 993 00:48:19,584 --> 00:48:22,984 [speaker_0] Okay. Uh, okay, then we can take literally Ilya Sutskever 994 00:48:23,044 --> 00:48:26,944 example. Uh, which breakthroughs would you say, okay, 995 00:48:26,964 --> 00:48:30,694 is significantly responsible for versus which ones you think he just happened 996 00:48:30,744 --> 00:48:31,264 to be there? 997 00:48:31,274 --> 00:48:31,274 [speaker_1] Deep learning. 998 00:48:31,304 --> 00:48:31,484 [speaker_0] Sorry? 999 00:48:31,544 --> 00:48:34,664 [speaker_1] Deep learning, he was-- Deep learning, he 1000 00:48:36,004 --> 00:48:38,124 [speaker_0] No, when you say deep learning, 1001 00:48:39,024 --> 00:48:42,084 Okay, I'll explain it. Cool. Uh, yeah, I agree with you. Fine. 1002 00:48:42,184 --> 00:48:45,824 Ilya was significantly responsible, a-along with other people, significantly 1003 00:48:45,884 --> 00:48:47,104 responsible for AlexNet, sure. 1004 00:48:47,164 --> 00:48:50,534 [speaker_1] GPT, the GPT ideas he was significantly responsible for. 1005 00:48:51,944 --> 00:48:54,804 Just training transformers. GPT-1 also. 1006 00:48:55,144 --> 00:48:58,724 The idea that we can essentially scale transformers or we can, we can find a 1007 00:48:58,784 --> 00:49:01,944 scalable transformer paradigm and build language models from it. 1008 00:49:02,704 --> 00:49:06,234 [speaker_0] Uh, okay. So which model was this? Was this GPT-1? 1009 00:49:06,504 --> 00:49:08,504 [speaker_1] This was the generative pre-transformer paper. 
1010 00:49:08,844 --> 00:49:11,024 He is directly responsible there. 1011 00:49:11,084 --> 00:49:13,894 I think if we look at the GPT-1 paper- 1012 00:49:13,894 --> 00:49:17,064 [speaker_0] One second please. I know this, I know it's annoying to, 1013 00:49:17,124 --> 00:49:20,544 things middle of video, but, like, I actually want to now go read 1014 00:49:20,554 --> 00:49:20,664 is. 1015 00:49:21,044 --> 00:49:23,504 [speaker_1] Sure. It's this paper. I will send it to you. 1016 00:49:23,936 --> 00:49:27,046 [speaker_0] Oh, okay. It's in the chat. See, uh, when was this published? 1017 00:49:27,416 --> 00:49:28,736 [speaker_1] Five years ago. Twenty twenty... 1018 00:49:29,965 --> 00:49:31,956 No, earlier than that. June 2018. 1019 00:49:32,436 --> 00:49:33,636 [speaker_0] Is it June? You sure? 1020 00:49:34,176 --> 00:49:37,266 [speaker_1] This is the AI summary. Yeah. Yeah, June 2018. 1021 00:49:37,616 --> 00:49:41,215 [speaker_0] Okay, fine. Take care. I will buy that. Okay, fine. 1022 00:49:41,276 --> 00:49:44,596 Ilya has been there at two major breakthroughs. Fine. 1023 00:49:44,946 --> 00:49:48,896 [speaker_1] Then he's also been there at, uh, this thing, uh, for 1024 00:49:48,976 --> 00:49:52,696 the o1 breakthrough as well. Uh, Ilya didn't 1025 00:49:52,736 --> 00:49:56,576 green-light the experiment. Ilya was heading research that time and 1026 00:49:56,596 --> 00:49:59,536 green-lighted the o1 experiment. RL, basically RL scaling. 1027 00:49:59,896 --> 00:50:02,446 And if you look at Dario-- Sorry. Sorry. 1028 00:50:02,456 --> 00:50:03,216 [speaker_0] Dario, for this- 1029 00:50:03,776 --> 00:50:07,036 [speaker_1] Uh, he was at OpenAI at that time. He was head of research. 
1030 00:50:07,416 --> 00:50:11,066 He was personally seeing all the AI research that was happening at OpenAI, and 1031 00:50:11,176 --> 00:50:11,956 OpenAI came out- 1032 00:50:12,066 --> 00:50:15,716 [speaker_0] No, but if he's head of research, 1033 00:50:15,756 --> 00:50:19,616 works in his lab, even if he does not want to, you know, do that, 1034 00:50:19,656 --> 00:50:21,676 hypothesis or, you know, prioritize it. 1035 00:50:21,686 --> 00:50:22,946 [speaker_1] You don't have to suggest the hypothesis. 1036 00:50:23,016 --> 00:50:26,356 I'm saying researchers are mostly the same 1037 00:50:26,396 --> 00:50:28,716 experiments, the best researchers. 1038 00:50:29,476 --> 00:50:31,686 The best researchers know which, which can control- 1039 00:50:31,696 --> 00:50:35,076 [speaker_0] But like OpenAI tried 100 things. Whichever of, 1040 00:50:35,116 --> 00:50:37,836 worked, he could take credit for it simply because 1041 00:50:38,276 --> 00:50:42,146 [speaker_1] Sure, but if, if there are 10,000 things that OpenAI didn't try, 1042 00:50:42,236 --> 00:50:45,916 if there are three big paradigm, three big breakthroughs 1043 00:50:46,256 --> 00:50:50,036 AI, and the same guy has been around for all 1044 00:50:50,196 --> 00:50:53,946 the set of the things he tried. It's the, the, the set is all the things that are 1045 00:50:53,946 --> 00:50:57,796 out there that he didn't try, and out of which he was able to figure out 1046 00:50:57,896 --> 00:50:59,316 or be around all three things. 1047 00:50:59,676 --> 00:51:03,546 [speaker_0] No, there are ways to be around the guy who 1048 00:51:03,596 --> 00:51:04,346 gives the correct hypothesis. 1049 00:51:04,406 --> 00:51:08,136 [speaker_1] But he's not like Sam Altman, who was probably around it, but he 1050 00:51:08,176 --> 00:51:09,455 committed to the research direction. 1051 00:51:09,956 --> 00:51:10,236 Anyway- 1052 00:51:10,376 --> 00:51:10,455 [speaker_0] Okay 1053 00:51:10,476 --> 00:51:13,576 [speaker_1] ...
this is by the way, like even if you don't strongly believe 1054 00:51:14,076 --> 00:51:15,676 [speaker_0] No, no, this is very important. Like this part 1055 00:51:15,716 --> 00:51:19,536 Like, uh, was he the one who personally selected the 1056 00:51:19,566 --> 00:51:21,146 "Okay, this one is worth trying, we should try it"? 1057 00:51:21,196 --> 00:51:21,916 [speaker_1] Yes. 1058 00:51:21,996 --> 00:51:22,216 [speaker_0] Or- 1059 00:51:22,296 --> 00:51:25,816 [speaker_1] Yes, I, I did the... I'll tell, 1060 00:51:25,856 --> 00:51:29,516 But there is an interview of Dario Amodei 1061 00:51:29,956 --> 00:51:33,636 when he was working with Ilya Sutskever, who I think it's Dario 1062 00:51:33,676 --> 00:51:37,076 Amodei, who basically, uh, I don't know if it's that. 1063 00:51:37,116 --> 00:51:40,006 Anyway, it was, I think, one of these interviews 1064 00:51:40,036 --> 00:51:43,616 about Ilya Sutskever, and he's saying then he came, 1065 00:51:43,716 --> 00:51:47,526 research direction, saying that we need to do X, 1066 00:51:47,596 --> 00:51:51,216 we'll do this, and we'll do that. And Ilya Sutskever drew two circles, 1067 00:51:51,856 --> 00:51:55,496 two concentric circles. Inside he-- In one he-- And this was pre o1. 1068 00:51:55,856 --> 00:51:59,776 He wrote pre-training, and outside he wrote RL, and he said, 1069 00:52:00,056 --> 00:52:02,566 [speaker_0] Okay. Uh, if you can send me this, that will help. 1070 00:52:02,626 --> 00:52:05,516 [speaker_1] I'll find, to find that clip. I'll try to find that clip. 1071 00:52:05,536 --> 00:52:08,706 And this was like much before o1, when I think Dario or whoever 1072 00:52:08,816 --> 00:52:12,696 and, and the guy was like, "Okay, uh, it 1073 00:52:12,736 --> 00:52:13,696 makes sense." Like, 1074 00:52:14,616 --> 00:52:16,816 why am I complicating the research agenda that long? 1075 00:52:17,156 --> 00:52:19,956 [speaker_0] Okay, sure. 
If you send me this, that will again, 1076 00:52:19,996 --> 00:52:23,246 Like now you have given me three different data points, and Ilya was involved 1077 00:52:23,296 --> 00:52:26,696 directly, like not just like, okay, researcher overseeing, but he 1078 00:52:26,716 --> 00:52:29,186 picking the hypothesis and saying this will work. Yeah. 1079 00:52:29,216 --> 00:52:32,196 [speaker_1] Yeah. Yeah. I think I can to find... One second. 1080 00:52:32,616 --> 00:52:36,016 Let me just do a random cloud search to see if they can find the things. 1081 00:52:36,436 --> 00:52:39,986 Last time I remembered, maybe we'll find, but I, I've seen it and, 1082 00:52:40,576 --> 00:52:44,336 uh, provided this is true, uh, would you agree that research is 1083 00:52:44,396 --> 00:52:46,436 not as random as picking ideas from a hat? 1084 00:52:46,736 --> 00:52:50,016 [speaker_0] Uh, yeah. If you show me that, yeah, now three different 1085 00:52:50,056 --> 00:52:53,766 breakthroughs, uh, Ilya Sutskever personally was helping 1086 00:52:53,876 --> 00:52:57,696 pick the hypothesis rather than just happen to be in the same room or 1087 00:52:57,736 --> 00:53:01,416 overseeing the same lab. If you show this across three 1088 00:53:01,556 --> 00:53:05,545 that would tell me that there is something spec- some specific way 1089 00:53:05,616 --> 00:53:08,736 Ilya Sutskever personally looks at this problem, which almost nobody else in the 1090 00:53:08,776 --> 00:53:10,156 world has. Yeah. 1091 00:53:10,196 --> 00:53:13,646 [speaker_1] It's time to find out which interview. I think it 1092 00:53:14,116 --> 00:53:18,036 Anyway, cool. Uh, I think are there a couple of other things 1093 00:53:18,076 --> 00:53:21,976 that I think I disagreed with. Uh, so one was that research direction 1094 00:53:22,116 --> 00:53:25,596 there might not be as many low-hanging fruits as you think there are. 1095 00:53:25,716 --> 00:53:29,316 Uh, so 2030 might be in this thing. That was one crux he identified. 
1096 00:53:29,376 --> 00:53:33,336 The other one was that, uh, intelligence is easier to build than 1097 00:53:33,476 --> 00:53:37,336 I thought. This is something that I think I've changed my mind on since 1098 00:53:37,396 --> 00:53:41,206 Ooty, uh, like since he last spoke, uh, about this. 1099 00:53:41,266 --> 00:53:45,076 It's basically the, if you've, if you saw the Richard Sutton 1100 00:53:45,516 --> 00:53:46,656 Dwarkesh interview, 1101 00:53:47,516 --> 00:53:48,146 uh, TCS- 1102 00:53:48,296 --> 00:53:49,236 [speaker_0] Yes. 1103 00:53:49,576 --> 00:53:51,476 [speaker_1] Do you remember what they spoke about? 1104 00:53:51,556 --> 00:53:55,256 [speaker_0] I think Richard Sutton's timelines were also something 25% by 2030, 1105 00:53:55,316 --> 00:53:56,296 remember correctly. 1106 00:53:56,476 --> 00:53:58,896 [speaker_1] No. He thinks it's more LL by LL- 1107 00:53:59,876 --> 00:54:03,436 [speaker_0] I am pretty confident Sutton had something like next five to 10 years 1108 00:54:03,456 --> 00:54:05,266 chance ASI. But yeah- 1109 00:54:05,516 --> 00:54:05,956 [speaker_1] But he thinks- 1110 00:54:05,996 --> 00:54:08,426 [speaker_0] I can't quite remember. Actually, you know why- 1111 00:54:08,536 --> 00:54:10,886 [speaker_1] But, but I also think that this whole idea that- 1112 00:54:11,536 --> 00:54:14,256 [speaker_0] Ilya's timelines, what are his... 1113 00:54:14,676 --> 00:54:15,366 Give me a minute. 1114 00:54:15,936 --> 00:54:19,656 [speaker_1] Yeah, and I also don't think this whole idea 1115 00:54:19,716 --> 00:54:23,136 agrees with you on timelines, it doesn't matter. Any- 1116 00:54:23,236 --> 00:54:27,066 Like it does matter what shape of beliefs he has and why he agrees to those 1117 00:54:27,136 --> 00:54:29,596 timelines and what shape of beliefs you have and why you agree to the timelines. 
1118 00:54:29,676 --> 00:54:31,476 Uh, there might be a fundamental disagreement 1119 00:54:31,536 --> 00:54:34,516 You could update from some of his beliefs, not all of his beliefs, even though his 1120 00:54:34,536 --> 00:54:35,436 conclusions are the same. 1121 00:54:35,996 --> 00:54:39,796 [speaker_0] No, uh, like I initially started from a worldview 1122 00:54:39,876 --> 00:54:40,096 of 1123 00:54:41,116 --> 00:54:44,936 like pick even among the genius researchers, picking which research 1124 00:54:44,976 --> 00:54:48,916 hypothesis works is kind of random, and it requires just a lot of hit 1125 00:54:48,936 --> 00:54:51,166 and trial, and none of these people really know. 1126 00:54:51,216 --> 00:54:55,146 You are trying to update me more towards a worldview of, no, there are a few 1127 00:54:55,156 --> 00:54:59,036 genius researchers here who consistently seem to get all of the 1128 00:54:59,076 --> 00:55:03,056 predictions right. And I'm like, let's say I did update to your 1129 00:55:03,076 --> 00:55:06,616 worldview, that all the more means I want to know, okay, what are these people's 1130 00:55:06,656 --> 00:55:09,536 timelines? If now you're saying, okay, I should defer to these people now. 1131 00:55:09,576 --> 00:55:10,346 [speaker_1] No, I'm sure. Go, go find out- 1132 00:55:10,346 --> 00:55:13,386 [speaker_0] I literally need to know what does Ilya Sutskever 1133 00:55:13,556 --> 00:55:14,156 Yeah, like- 1134 00:55:14,316 --> 00:55:16,396 [speaker_1] Go find their timelines. That's not the point I was making. 1135 00:55:16,456 --> 00:55:19,316 I was trying to make a point that you were saying that just because Ilya has 1136 00:55:19,356 --> 00:55:23,096 timelines and you have short timelines, it doesn't matter, uh, 1137 00:55:23,176 --> 00:55:27,116 what Ilya's arguments on research direction are or Ilya's time, like 1138 00:55:27,156 --> 00:55:29,976 Ilya's stance is on why and how we get this breakthrough.
1139 00:55:30,176 --> 00:55:31,116 [speaker_0] Those also matter. I agree. 1140 00:55:31,416 --> 00:55:33,826 [speaker_1] Okay. So it-- So cool. Find Sutton's timelines. 1141 00:55:33,916 --> 00:55:37,656 But the point Sutton made was that, uh, evolution actually gave us 1142 00:55:37,716 --> 00:55:41,436 language very, very late. Uh, and most of evolution was 1143 00:55:41,476 --> 00:55:45,306 trying to optimize for things that we take for granted, which 1144 00:55:45,916 --> 00:55:49,396 uh, whatever, like being physical dexterity, et cetera. 1145 00:55:49,416 --> 00:55:53,216 And, uh, those things essentially, which according to you are part of 1146 00:55:53,256 --> 00:55:57,056 intelligence, those things actually took a, like, 1147 00:55:57,136 --> 00:56:00,356 time. Uh, and, uh, those would be very hard to do. 1148 00:56:00,796 --> 00:56:04,676 Uh, the second thing that he says is that, uh, he thinks that if you can get 1149 00:56:05,416 --> 00:56:08,596 good at doing those parts, then everything else falls into place. 1150 00:56:09,156 --> 00:56:13,076 Uh, but, uh, the argument is if you can get to language, uh, 1151 00:56:13,176 --> 00:56:15,076 then getting to the physical stuff is easy. 1152 00:56:15,236 --> 00:56:19,096 And he's like, "No, most of it is getting to the physical stuff." And 1153 00:56:19,156 --> 00:56:22,516 then this language part and all of these parts is something 1154 00:56:22,696 --> 00:56:26,356 [speaker_0] Sorry, I want to interrupt you because I feel 1155 00:56:26,436 --> 00:56:29,976 points we can both go check, and if we check, 1156 00:56:30,026 --> 00:56:33,696 our debate more productive. Data point number one of, you know, Ilya 1157 00:56:33,736 --> 00:56:36,726 Sutskever being personally present at all these three breakthroughs, I think you've 1158 00:56:36,996 --> 00:56:40,856 proved it for AlexNet, I agree. For the GPT-1 thing, 1159 00:56:40,896 --> 00:56:43,276 also kind of agree. More data would help, but I kind of agree.
1160 00:56:43,336 --> 00:56:46,116 The o1 thing, I'm not yet convinced Ilya 1161 00:56:46,156 --> 00:56:49,416 If you show me data, I can be convinced. Uh, that's one data point. 1162 00:56:49,536 --> 00:56:49,686 [speaker_1] Yeah. I think I found it. 1163 00:56:49,686 --> 00:56:53,426 [speaker_0] The other data point I want is, uh, yeah, literally what 1164 00:56:53,476 --> 00:56:57,396 What are Sutskever's timelines? Uh, and I'm saying, like, let's 1165 00:56:57,416 --> 00:57:00,056 first get these data points and then let's continue the discussion. 1166 00:57:00,256 --> 00:57:00,646 I think that- 1167 00:57:00,716 --> 00:57:02,375 [speaker_1] Found it. I think I found it. 1168 00:57:02,396 --> 00:57:06,156 [speaker_2] The people who are most responsible for that 1169 00:57:06,196 --> 00:57:10,116 and Jakub Pachocki. I think even like, uh, like, 1170 00:57:10,336 --> 00:57:11,746 uh, Dota was kind of- 1171 00:57:11,746 --> 00:57:15,056 [speaker_1] We can go like a few seconds before because he's talking about o1, 1172 00:57:15,416 --> 00:57:16,756 verify that he's talking about o1. 1173 00:57:17,116 --> 00:57:17,396 [speaker_0] And 1174 00:57:18,216 --> 00:57:22,036 okay, uh, I might agree with you by the way, 1175 00:57:22,116 --> 00:57:25,756 note, I'm like, why has this not been documented 1176 00:57:25,896 --> 00:57:29,296 there for all the three breakthroughs? That seems like a very big deal. 1177 00:57:29,896 --> 00:57:30,976 [speaker_1] What do you mean it's not documented? 1178 00:57:31,656 --> 00:57:35,516 [speaker_0] Why isn't there like either a Hacker News post or a LessWrong post 1179 00:57:35,576 --> 00:57:37,936 saying here is the evidence Ilya was there at all three breakthroughs? 1180 00:57:38,276 --> 00:57:41,976 [speaker_1] But that seems, I think it's common knowledge.
1181 00:57:42,256 --> 00:57:45,596 I'm surprised you didn't, and I'm surprised you didn't 1182 00:57:45,996 --> 00:57:48,876 see this, but basically it's been talked about in a 1183 00:57:49,356 --> 00:57:52,965 Everyone knows that Alex Radford, Ilya Sutskever, Noam Shazeer are these 1184 00:57:53,036 --> 00:57:56,676 like insane superstar researchers who whatever they touch, whatever 1185 00:57:56,756 --> 00:57:59,206 ideas they pick turn out to be the right candidates always. 1186 00:57:59,616 --> 00:58:00,805 There's another one where Jeff Dean and Noam Shazeer- 1187 00:58:00,805 --> 00:58:03,836 [speaker_0] Okay. It's definitely not common knowledge, 1188 00:58:03,856 --> 00:58:05,856 to actually write this up and update a bunch of people. 1189 00:58:05,916 --> 00:58:09,766 Like, I think I'm not the only one for whom this 1190 00:58:09,766 --> 00:58:11,076 from you this. Yeah. 1191 00:58:11,786 --> 00:58:15,626 [speaker_1] Yeah. But, uh, okay, cool. Uh, I think I've-- The clip 1192 00:58:15,636 --> 00:58:19,046 something Claude found out. Uh, this is not the clip that I was talking about 1193 00:58:19,056 --> 00:58:22,926 originally, but I think this is even tighter evidence than what I 1194 00:58:23,896 --> 00:58:27,196 because he's directly saying that the reasoning breakthrough came from 1195 00:58:27,216 --> 00:58:28,886 Pachocki and Ilya Sutskever. 1196 00:58:29,176 --> 00:58:33,116 [speaker_0] Okay. Yeah. I mean, I will need a few minutes to properly go 1197 00:58:33,176 --> 00:58:34,056 okay, fine. 1198 00:58:34,196 --> 00:58:34,676 [speaker_1] No stress, no stress. 1199 00:58:34,716 --> 00:58:38,496 [speaker_0] Like for now I can maybe buy it. Okay. Let's, uh, buy this. 1200 00:58:38,556 --> 00:58:42,076 Okay, cool. Yeah. Also then I want to know, yeah, what 1201 00:58:42,136 --> 00:58:45,446 timelines then if you know that. 
Uh, any of the people you mentioned- 1202 00:58:45,446 --> 00:58:45,886 [speaker_1] We can just ask chat 1203 00:58:45,896 --> 00:58:48,666 [speaker_0] ... Alex Radford or Noam Shazeer or Ilya Sutskever. Yeah. 1204 00:58:48,685 --> 00:58:52,546 [speaker_1] Yeah. This is one post I found on EA Forum, uh, 1205 00:58:52,596 --> 00:58:56,166 run by Lessig, uh, where they're talking about Sutskever's-- 1206 00:58:56,216 --> 00:58:59,366 criticizing Sutskever for not having transparency on his timelines and 1207 00:59:00,076 --> 00:59:00,976 not saying why. 1208 00:59:02,596 --> 00:59:06,456 [speaker_0] I think you have successfully updated me a bit, well, 1209 00:59:07,056 --> 00:59:10,986 towards the idea that Ilya Sutskever personally, like not 1210 00:59:11,016 --> 00:59:14,376 even any of the other AI researchers in this space, but Ilya Sutskever 1211 00:59:14,516 --> 00:59:18,376 specifically has like great research taste and he's consistently able 1212 00:59:18,456 --> 00:59:20,076 to pick good research hypotheses. 1213 00:59:20,116 --> 00:59:20,656 [speaker_1] Good. 1214 00:59:20,756 --> 00:59:24,556 [speaker_0] Not yet shifted my timelines by a lot. I think that's where I'm at. 1215 00:59:26,736 --> 00:59:26,745 [laughs] 1216 00:59:26,776 --> 00:59:26,846 [speaker_1] Yeah. I have given you- 1217 00:59:26,876 --> 00:59:30,776 [speaker_0] Because yeah, again, like it's not obvious to me, okay, 1218 00:59:30,836 --> 00:59:34,696 if you told me, "Okay, update towards Ilya 1219 00:59:34,736 --> 00:59:38,206 than you," then sure. But are Ilya's timelines 1220 00:59:38,236 --> 00:59:38,965 [speaker_1] Ilya's timelines are-- 1221 00:59:40,676 --> 00:59:43,856 His minimum is still longer than your maximum. 1222 00:59:44,576 --> 00:59:47,976 Anyway, maybe not. I don't know. And I think his definition of 1223 00:59:47,996 --> 00:59:48,776 might be very different. 1224 00:59:49,876 --> 00:59:53,726 [speaker_0] Fair.
So yeah, now we will have to go more into 1225 00:59:54,356 --> 00:59:54,796 Uh, okay. 1226 00:59:54,856 --> 00:59:55,876 [speaker_1] We can end the recording there. 1227 00:59:56,356 --> 00:59:57,356 [speaker_0] Okay. Uh. 1228 00:59:57,396 --> 01:00:00,876 [outro music]