Why I am pessimistic about ASI alignment being solved by 2030
Disclaimer
Quick Note
Why am I writing this?
Lots of people still seem to think that working on ASI alignment or safety is the highest-impact career path. (I think it is a high priority, but not the highest.) This post is for them.
How seriously should you take this post?
If you are an AI alignment/safety researcher, I do think you should ultimately trust your own judgment, not mine. That is required to become a good researcher.
If you can see a specific way I am wrong, or a specific way your idea is promising despite my pessimistic priors, maybe just trust your instincts.
If you think your ideas are right, and they directly contradict mine, and you also can't see where my ideas are wrong, then this should trigger a big alarm bell in your head.
You should definitely start with Yudkowsky's List of Lethalities, and the counterarguments to it. It's argued way better than anything I could write. I agree with more than half of his document.
I am not a full-time alignment/safety researcher.
I did MLSS under Dan Hendrycks and spent a few months back in 2022 skimming the research in the field.
There are probably a lot of details about the research field as of 2026 that I am unaware of. (This is intentional on my part; my priorities have shifted.)
Main
The best-case scenario of successful alignment, as I understand it, is permanent global dictatorship
My speculation
If you manage to help Altman or Amodei or Trump or whoever actually get an ASI to do what they want, then obviously the first thing they will try to do is figure out whether they can become world dictator using this capability.
If you are an alignment/safety researcher, you are directly bringing about permanent dictatorship.
I still support people working on this, because I think permanent dictatorship is better than human extinction.
Until politics and geopolitics are fixed somewhat, you can't really hope for better than this as a technical researcher, IMO. Hence finding high-leverage ways to fix politics and geopolitics seems more important to me than alignment/safety research.
Alignment asks the question: how do I get an ASI to care about the same things as I do?
Control asks the question: how do I get a misaligned, power-seeking ASI to stay trapped inside its box (though it may want to break out), and maybe even do some useful work for me while it is stuck there?
I am slightly more optimistic on control, at least up to some level of capability. It won't work for arbitrarily superhuman AI, but it might work for slightly superhuman AI.
I can probably keep Otto von Bismarck (expert politician) or Albert Einstein (expert scientist) inside a box, despite their attempts to break out.
Maybe I could even keep an army of 1 million Bismarcks running at 100x speed inside a box, although I am less clear about this, and it seems like a dangerous experiment.
I don't think chimpanzees could keep humans inside a box, assuming chimpanzees had been capable of deliberately choosing whether or not to create humans (in practice, they had no such choice).
We are building ASI by choice
My speculation
Building an ASI that can run entire governments, economies and militaries with no human oversight is an intentional choice. Building an ASI that makes new technological breakthroughs way faster than humans is an intentional choice. Building an ASI that can acquire overwhelming military or political advantage from zero initial resources is an intentional choice.
The people running the AI companies, and the governments supporting them, want an ASI that runs with no human oversight. At minimum, your reward for building this is a trillion dollars. At maximum, your reward is becoming permanent world dictator over all of humanity for hundreds (if not thousands, millions, or billions) of years.
Any plan that starts with "let's just build a dumb, less useful but safer AI" is doomed for this reason. A lot of plans to build tools, not agents, are doomed for the same reason.
Most alignment/safety research is blue-sky preparadigmatic work
Current SOTA and my speculation
If you are at all a technical person, you can just go read the abstracts and skim the papers of a lot of alignment/safety research. It is intuitively obvious that it will take more than 5 years for most of this writing to get converted into an actual PyTorch repo. Just pattern-match to other scientific fields, where people start by doing preparadigmatic research in a hundred different directions.
Most empirical alignment research is bottlenecked by interpretability
My speculation
If you don't understand what is going on inside the black box, you can only apply pressure from the outside and hope that what is inside becomes more likely to conform to what you want.
An ASI that did exactly what you specified in some formal spec, or some training reward regime, would be dangerous. You yourself don't know what you want well enough to formally specify it. (A toy sketch of this problem follows below.)
Suppose you became a superintelligence yourself (let's say I could attach more biological neurons in some structure to your head or something). That might be dangerous to you even by your own current values. You maybe won't trivially reward-hack the way an ASI might, but you too have your own failure modes as a human.
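To make the spec-gaming point concrete, here is a toy sketch of Goodhart's law. It is my own illustrative construction (the numbers and the "gaming" term are invented for the example, not drawn from any real system): a proxy objective imperfectly tracks true value, and hard optimization of the proxy mostly selects for the slack between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                            # candidate actions the optimizer can pick from

true_value = rng.normal(size=n)        # what you actually want (but cannot specify)
gaming = rng.normal(size=n)            # slack your formal spec fails to rule out
proxy = true_value + 2.0 * gaming      # what the spec / reward regime actually measures

best = np.argmax(proxy)                # do exactly what the spec says, maximally hard
print(f"proxy score of chosen action: {proxy[best]:+.2f}")
print(f"true value of chosen action : {true_value[best]:+.2f}")
print(f"best achievable true value  : {true_value.max():+.2f}")
# The chosen action maxes the spec, yet its true value is typically
# far below the best available action's true value.
```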
Mechanistic interpretability is hard
Current SOTA
The primary pieces of an ML model (plus any scaffolding on top) that are highly interpretable right now are the logits, which the unembedding can decode from the final and even the penultimate layer (the "logit lens"), and the embedding layer. (A minimal sketch follows this list.)
Some people like Neel Nanda and Chris Olah have found a few other specific structures that are interpretable.
We don't have entire classes of problems being cracked. We have one narrow subproblem here and another narrow subproblem there being cracked.
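For illustration, here is a minimal logit-lens sketch. It assumes the TransformerLens library and GPT-2 small, which are my choice of example setup rather than anything this post depends on: each layer's residual stream is decoded through the model's own final unembedding, which is one concrete sense in which the logits and embeddings are the legible ends of the network.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is located in the city of"
_, cache = model.run_with_cache(prompt)

# Decode every layer's residual stream with the final unembedding ("logit lens").
with torch.no_grad():
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer]                 # [batch, pos, d_model]
        layer_logits = model.unembed(model.ln_final(resid))
        top = layer_logits[0, -1].argmax().item()          # best next-token guess
        print(f"layer {layer:2d} guesses: {model.tokenizer.decode(top)!r}")
```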
My speculation
Suppose there is a 1B-parameter model. This model is not encoding 10^9 things; it is encoding some subset out of 2^(10^9) possibilities. It is possible for a set of weights to represent some useful human-comprehensible function, and for each individual weight in that set to be indistinguishable from noise if studied in isolation. (The toy sketch below illustrates this.)
Now, the space of possibilities is sufficiently large that no naive brute-force approach can crack it. Either you need a human researcher to crack it (more likely), or you need some intelligent way to automate cracking it (less likely).
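Here is a toy numerical sketch of that claim, my own construction loosely in the spirit of Anthropic's toy-models-of-superposition work (the sizes and feature indices are arbitrary): 400 "features" are packed into 100 dimensions along random directions, so no individual weight is meaningful, yet the set of weights encodes recoverable structure.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 100, 400         # 4x more features than dimensions

# Assign each feature a random unit-norm direction in the 100-dim space.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Studied one at a time, the weights look like pure Gaussian noise.
print(f"weight entries: mean {W.mean():+.4f}, std {W.std():.4f}")

# Encode a sparse input: 2 active features out of 400.
active = [7, 42]
hidden = W[active].sum(axis=0)         # the 100-dim "activation" vector

# Projecting back recovers which features were active, because random
# directions in high dimensions are nearly orthogonal to each other.
scores = W @ hidden
print("true features :", active)
print("top-2 decoded :", sorted(np.argsort(scores)[-2:].tolist()))
```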