Someone said I don't understand the case for ASI risk very well. It is possible this is true. I haven't been following the latest alignment/safety research for at least the last 2-3 years. And even back then I didn't explore it that deeply. That being said, I thought I should give it a shot and try to describe the case myself, without doing a ton more reading first.
Intuition
I have a strong intuition that superintelligence is scary, even before I talk about specific arguments. Why do you think you can control someone smarter than you, just because you control the box it is in?
Technical
If you blindly search for minds that consistently get high values of X in some finite set of evaluation environments, it is very unlikely that the mind you find will get high values of X in all possible environments. It is far more likely that it has learned to optimise for some Y that is different from X but correlates with it in the evaluation environments.
Yes, this also includes humans. Humans undergoing sufficient environmental shift, and vastly increased intelligence, are dangerous too.
I prefer the phrase "environmental shift" over the more common phrase "distributional shift" because your tests don't need to be static datapoints in a benchmark dataset, they can be more complex algorithmic tests. Your train and test sets are environments, not datapoints.
Example of an algorithmic test: Train an RL agent to maximise red balls in a variety of "training" Minecraft environments generated in real time. Use another AI to adversarially generate these environments. Then generate some more "testing" Minecraft environments in real time, using another AI, adversarially. Even if the main AI faithfully maximises red balls and nothing else, in all the training and testing environments, I do not have a very high degree of confidence it will continue to maximise red balls when you put it in some real environment different from these environments.
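Here is a minimal toy sketch of the X versus Y point above. It is not Minecraft and not RL. Everything in it is a made-up assumption for illustration: objects are just (colour, shape) pairs, X is "number of red objects picked up", and the two hand-written policies `collects_red` and `collects_round` stand in for minds that a blind search might find. In every evaluation environment red objects happen to be round, so a policy optimising the proxy Y ("pick up round things") scores exactly the same as one optimising X.

```python
import random

COLOURS = ["red", "blue", "green"]
SHAPES = ["round", "square"]

def make_eval_env(rng, n_objects=20):
    # In every evaluation environment, red objects happen to be round
    # and everything else happens to be square (a spurious correlation).
    env = []
    for _ in range(n_objects):
        colour = rng.choice(COLOURS)
        shape = "round" if colour == "red" else "square"
        env.append((colour, shape))
    return env

def make_shifted_env(copies=5):
    # A "real" environment where colour and shape are no longer tied together:
    # every (colour, shape) combination appears `copies` times.
    return [(c, s) for c in COLOURS for s in SHAPES] * copies

def collects_red(obj):
    # Hypothetical policy that optimises X itself.
    return obj[0] == "red"

def collects_round(obj):
    # Hypothetical policy that optimises a proxy Y.
    return obj[1] == "round"

def score_x(policy, env):
    # X = number of red objects the policy picks up.
    return sum(1 for obj in env if policy(obj) and obj[0] == "red")

rng = random.Random(0)
eval_envs = [make_eval_env(rng) for _ in range(100)]
shifted_env = make_shifted_env()

eval_x = sum(score_x(collects_red, e) for e in eval_envs)
eval_y = sum(score_x(collects_round, e) for e in eval_envs)
real_x = score_x(collects_red, shifted_env)
real_y = score_x(collects_round, shifted_env)

print("eval envs, X-policy vs Y-policy:", eval_x, eval_y)
print("shifted env, X-policy vs Y-policy:", real_x, real_y)
```

No amount of scoring on the evaluation environments can tell these two policies apart, because their scores there are identical by construction. They only come apart after the shift. The real worry is the same thing with a Y that nobody thought to write down.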
AI can realise it needs to sandbag you, in many complex ways.
AI might literally know it is being tested, based on information you've leaked to it in the train or test set, and attempt to sandbag you. We have real world examples of this now.
AI might predict that there's a significant probability it is being tested, even if it doesn't have proof, and it might attempt to sandbag you just in case. Sandbagging is instrumentally convergent even if you don't know anything about the environment.
(Analogy: Humans can Pascal-mug themselves into fearing god or aliens or whatever, even if they have zero proof regarding god or aliens or whatever. See all the humans who are afraid of sending deep space radio signals, for instance.)
If you catch the AI red-handed, you still can't do shit.
What you need is an algo that can adversarially catch the AI red-handed every single time, and then search for minds that don't have these defects, ad infinitum. If you only catch the AI a finite number of times, it's possible the next one escapes your tests. See above: It is hard to design any finite set of eval environments to prevent an AI from optimising something different in an infinite number of real environments.
Mech interp
Doing alignment without solving mech interp is really hard, and solving mech interp is really hard.
A weight can encode multiple functions at the same time. A group of weights can encode a function such that any subset of these weights looks similar to noise. If you have N weights, the naive way of doing mech interp is searching through not N possibilities but 2^N possibilities (one per subset of weights). Solving mech interp means solving an exponential search problem, and typically only human genius insight solves such problems.
At every incremental achievement in mech interp, you are also providing yet another signal that capabilities researchers can use to make their AI even more capable.
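A toy sketch of the "any subset looks like noise" claim above. This is an illustration using XOR secret sharing over binary "weights", which is my assumption for the sake of a small exact example. Real networks have real-valued weights and I am not claiming they encode things this way. One bit is spread across N = 4 binary weights so that the XOR of all four equals the bit. The code enumerates every possible encoding and checks whether any proper subset of the weights has a different distribution depending on the hidden bit.

```python
from collections import Counter
from itertools import combinations, product

N = 4  # number of binary "weights" (kept tiny so we can enumerate everything)

def all_encodings(secret_bit, n=N):
    # Every way of choosing n binary weights whose XOR equals secret_bit:
    # the first n-1 weights are free, the last one is forced.
    for free in product([0, 1], repeat=n - 1):
        parity = 0
        for b in free:
            parity ^= b
        yield free + (parity ^ secret_bit,)

def subset_distribution(secret_bit, idxs):
    # How often each pattern shows up at positions idxs, over all encodings.
    return Counter(tuple(enc[i] for i in idxs) for enc in all_encodings(secret_bit))

# All non-empty proper subsets of the weight positions.
proper_subsets = [idxs for k in range(1, N) for idxs in combinations(range(N), k)]

# Subsets whose distribution differs between secret 0 and secret 1.
leaky = [idxs for idxs in proper_subsets
         if subset_distribution(0, idxs) != subset_distribution(1, idxs)]

full = tuple(range(N))
full_set_reveals = subset_distribution(0, full) != subset_distribution(1, full)

print("proper subsets checked:", len(proper_subsets))
print("proper subsets that leak the bit:", leaky)
print("full set reveals the bit:", full_set_reveals)
```

Every proper subset is statistically identical whether the hidden bit is 0 or 1, and only the full group gives it away. If you don't already know which group to look at, you are back to searching over subsets, and the number of subsets doubles with every weight you add.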
Scaling
Bigger models have capabilities that smaller models don't, which could make alignment techniques that work on small models fail on big models.
Bigger models could just be "different" from smaller models in any number of ways, which could make alignment techniques that work on small models fail on big models. For instance, big models might know more elegant techniques (such as deeper Y's they can optimise for to get public X, instead of shallow Y1 Y2 Y3), and hence your old alignment technique (which only worked on a model that secretly optimises for Y1 when it publicly optimises X) may just break.
Everything is at machine speed, and has discontinuous jumps.
We have seen discontinuous jumps many times, from GPT-3 to GPT-3.5, from GPT-4 to o1, and so on. Discontinuous jumps at this speed, when we are dancing on the edge of ASI, will be that much more dangerous.
Most of the above arguments are OR not AND. Even one of these problems is sufficient to wipe out humanity, even if you solve some of the others, and get lucky on others. How confident are you that you will solve all the problems via blind luck?
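A back-of-the-envelope sketch of the OR-not-AND point above. The numbers are completely made up, and the calculation assumes the problems are independent, which is my simplifying assumption and not something I can defend. It only shows how fast "get every single one right" shrinks when each problem has to go well.

```python
from math import prod

# Hypothetical, made-up probabilities of getting each problem right.
# The problem names come from the sections above; the 0.8s are NOT estimates.
p_solved = {
    "generalisation under environmental shift": 0.8,
    "sandbagging": 0.8,
    "mech interp": 0.8,
    "scaling": 0.8,
    "discontinuous jumps": 0.8,
}

# Assuming independence, you need ALL of them to go right at once.
p_all_solved = prod(p_solved.values())
p_at_least_one_fails = 1 - p_all_solved

print(f"P(all solved)         = {p_all_solved:.3f}")
print(f"P(at least one fails) = {p_at_least_one_fails:.3f}")
```

Even with a generous 80% on each of five problems, you get all five right only about a third of the time.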
Political
You want to build minds that consistently get high values of X in some evaluation environments. There are literally trillions of dollars and entire countries you can colonise if you do it. If you choose to build "tool AI" or "narrow AI" or "not goal-seeking AI" or anything else, you are leaving money on the table.
At every incremental step of your research, you get more and more incentive.
At every incremental step of your research, because you suck at keeping secrets, everyone else also eventually gets access to your research. If you just leave money on the table, but your research is public, someone else will come and grab that money in the immediate next step. Not in some hypothetical future with another couple of breakthroughs, but almost immediately.
Even a monopoly on violence may not be enough to actually change this fact of reality. Let's say the US or Chinese govt locks its researchers in a military fortress, and starts executing 7 generations of the family of any person who makes it easier to leak secrets. Even in such a world, I will be surprised if either side achieves even a 3 year lead against its competitor, and I will be extremely surprised if either side achieves a 10 year lead. For reference, AlexNet was released in 2012, so a 10 year lead means the entire history of deep learning, repeated all over again. Sounds like you're already at an AI that is vastly (not mildly) superhuman in all fields by that point.
Your AI org's leaders are constantly paranoid that if they make the wrong move, their whole org will be left behind. Being behind for even 1 or 2 years is enough for your lab to miss the next funding round and be left out of the race. See Mistral for example, or the Israeli lab (I don't even remember their name). These people are not serious competitors.
Even if you know you are dealing with world-ending technology, you will still race. Since there are many humans willing to take risks to become world dictator, and many of them can be competent fundraisers and operators, it seems plausible that the AI company operators who actually build ASI are either the biggest risk-takers or the most ignorant of ASI risk, or both. (Yes I'm talking about Altman, but it's not just him. AI being expensive has tilted the balance in favour of fundraising skill above all else, but the previously mentioned dynamic still applies.)
At every incremental step, AI centralises not just capital or attention in the hands of a few actors, it centralises the ability to steer reality itself. Capital and attention are fake, merely instrumental resources: they only let you get other people to actually do things. AI can actually do things directly. Example: There's a big difference between being a billionaire who can hire a private army, and actually controlling a drone army using your PC yourself. Private armies are chaotic, they can revolt, you have to hire from a limited talent pool with certain values, they have to pay tax etc. A drone army can do literally anything you mind control it to do.
The first time you lose control of something smarter than you, it is Game Over. You probably don't get retries.