Adrià recently published “Alignment will happen by default; what’s next?” on LessWrong, arguing that AI alignment is turning out easier than expected. Simon left a lengthy comment pushing back, and that sparked this spontaneous debate.
Adrià argues that current models like Claude Opus 3 are genuinely good “to their core,” and that an iterative process — where each AI generation helps align the next — could carry us safely to superintelligence. Simon counters that we may only get one shot at alignment and that current methods are too weak to scale. A candid conversation about where AI safety actually stands.
Transcript
[00:00] So you made this post, which some people liked, but it’s a bit controversial, I think some people disliked it as well, about alignment being easy and how we’re headed for alignment.
[00:10] Basically I have had this sense for a while that the AIs are really friendly and a lot of the instrumental convergence things aren’t really panning out. The pre-training prior is very benign, and there are some rough edges, but it seems the normal commercial process has basically sanded them down. I’ve been working on alignment for many years and I’m really wondering, should I continue or not? That is the main driver that brought me to write this post.
[00:54] The post is titled “Alignment will happen by default; what’s next?”, which was basically a heartfelt rant about all the ways in which my position has changed over the years. The evidence has changed.
[01:10] So what do you want to do next after alignment, if you want to leave alignment?
[01:15] I don’t know, that’s what I’m trying to find out. I guess I’m also still not convinced that alignment really is just solved. And I think no one should be completely convinced. We really haven’t seen some of the crucial evidence yet. The models are somewhat goal-directed, but not crazily so. And I expect that smarter ones will be different. We can also read their chain of thought, but the amount of RL training is still pretty low. So we’re still not far from the pre-training prior. Though RL training deliberately uses a KL term to stay close to it, so maybe that won’t change that much.
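The KL term mentioned here is the standard regularizer in RLHF-style fine-tuning: the policy is rewarded for the task but penalized for drifting away from the frozen pre-trained reference model. A minimal sketch of that objective (the function name and penalty weight are illustrative, not any lab’s actual implementation):

```python
def kl_penalized_return(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Sketch of the KL-regularized objective common in RLHF-style fine-tuning.

    task_reward:     scalar reward for the sampled response
    policy_logprobs: per-token log-probs of the response under the current policy
    ref_logprobs:    per-token log-probs under the frozen pre-trained reference model
    beta:            KL penalty weight (illustrative value)
    """
    # Monte Carlo estimate of KL(policy || reference) on the sampled tokens.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    # Reward the task, but penalize drifting away from the pre-training prior.
    return task_reward - beta * kl_estimate
```

The larger beta is, the closer the fine-tuned policy stays to the pre-training prior, which is the mechanism being pointed at here.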
[02:11] But I think we should definitely have a lot lower P(doom) than I had two years ago, for example. And it’s mostly borne out by seeing that the number of issues is not that large, and then watching them kind of get addressed one by one.
[02:22] And you think that systems are getting more goal-directed through RL training?
[02:30] I think models are getting more goal-directed through RL training, yes. This is kind of an interesting observation. People were saying that when models scheme, they’re just playing out a story. But sabotage behavior kind of started to show up with o1, when they started to move away from imitation learning toward RL. There were the chess and CTF sabotage results that came out. That kind of started to show up as soon as they added RL. So it goes a bit against the storytelling explanation.
[03:03] But the obvious mitigations work. It’s not that adding RL doesn’t change anything. It’s that the model knows when it’s doing something wrong. So the obvious mitigation of putting the model itself into the reward loop and being able to ask it, “hey, are you doing something wrong or not?” — that helps.
[03:29] And o1 was the first public RL-trained chain-of-thought model as far as we know. So my guess is they didn’t have the alignment pipeline worked out. I guess my story is less that you literally don’t have to do anything and models are aligned, and more that the obvious mitigations work, plus the issues get ironed out from a little bit of pre-deployment testing and then some post-deployment fixes.
[04:08] In particular, I think the story of “we only have one attempt to get it right because the superintelligence just has a decisive strategic advantage and disables everything else” — I think that’s not going to happen and we’re going to be able to iron out all these kinks.
[04:18] So I think that’s a really big crux. I wanted to write about this more. I really feel in the alignment community, that’s the one thing that really splits so many people apart. It’s just: do we have one shot? Is it substantially one shot?
[04:39] Maybe it’s going to be a bit muddy, but the idea that substantially it’s going to be one critical shot at alignment versus the kind of iterative approach — you know, the model’s getting a bit smarter, we hammer out some things with our alignment, we fix the issues, then we make it a bit smarter and do the whole thing again. That seems to be a really deep split.
[04:53] And I want to ask — are we on different sides of it? Do you think it’s one shot?
[04:58] That’s a productive conversation. I’m more on the “substantially it’s going to be one shot” side. There are some caveats, but these caveats don’t really change the picture very much for me. I think it’s more productive to think about it as one shot in my mind.
[05:16] Why do you think this is so?
[05:18] So there’s a few kinds of mental images I have. One of them is: we have seen that progress in some sense has been gradual. It’s not literally a steady improvement. It’s more like a step function, but each of these steps is not vastly bigger than the previous step. Some people think that’s strong evidence for this iterative safety being workable. Which I kind of disagree with.
[05:50] First of all, we don’t know if this gradual development is going to continue. But more importantly, if you read any of the AI optimist predictions of the future — Dario Amodei’s, or Elon Musk’s idea of universal high income — what is that going to look like? It looks like AI doing all the mental labor. It looks like robots doing all the physical labor, potentially humans having some kind of rubber-stamping role over some decisions by the AI.
[06:27] And it seems to me that that’s pretty straightforward, obviously a very dangerous place to be in if the AI was not aligned with us for some reason.
[06:34] I can agree with that.
[06:35] And currently we’re not in such a dangerous situation — we’re not in a situation where an AI kind of changing or making some crucial decision would be existentially dangerous. So in my mental model, we are walking from a safe regime into a dangerous regime, and even if it’s small steps or big steps, you’re going to get to the dangerous regime at some point.
[07:04] That’s my kind of mental image. And the most straightforward reason why to expect that there’s only one step that’s really crucial is: if we mess up, we die, right? And we can’t step back anymore. I think there’s more complexity to that. But that also gives me an intuition to think that there’s only one step because if we really mess up, we get disempowered or we die. There’s not a second step after that.
[07:31] This stylized argument works in the stylized world, but if you fill in some of the details, then it doesn’t work anymore.
[07:37] Okay, you’re saying there’s a transition point from AI doesn’t automate everything to AI automates everything.
[07:46] I think that’s technically correct.
[07:48] It becomes very dangerous.
[07:49] I’m just saying we can all imagine that AI will be very dangerous in the future. It doesn’t have to literally automate everything, but we can imagine this dangerous point in the future.
[08:00] We can imagine AIs — many AIs, in fact — collectively managing a large part of the total material modification of the world, or the total economy.
[08:18] I’m not actually that much talking about it literally running the GDP. I’m more thinking it could do these things. It could run its own economy. It could be a huge advance happening on some supercomputer, very fast.
[08:30] So you’re not saying that it’s when the automation actually happens. It’s just about the potential for automation. When it reaches that level of capability?
[08:41] It’s very dangerous. It’s just about the capability.
[08:45] Well, I guess now I really disagree. I think the mere potential for automation does not cause the danger. I think it actually has to happen.
[08:55] And the reason that I don’t think there’s a step change is that when you introduce the AI that is capable of automating everything, well, you know, just a few weeks prior, you had the AI that is almost capable of automating everything, which is almost as capable in every domain. And there’s a lot more of it.
[09:14] And if we’re doing an induction argument where the AI helps align the next AI, then that AI is basically aligned and nice. And so why would going from “almost able to automate everything” to “now we actually could automate everything” introduce the step change in danger when you already almost had that? And it was basically nice, though I guess you can dispute the premise that it was basically nice.
[09:43] But I think that is kind of where the disagreement really lies. I think the details of the world push more towards the iterative, nice picture: from almost automated to completely automated.
[09:56] So I think there’s two arguments here. One argument is — I mean, you can have iterative alignment or one AI aligns the next generation of AI.
[10:04] And the other argument is just: why would we expect there to be one step where it goes from not so dangerous to very dangerous? That is something that happens when you hit the threshold, but you can hit it slowly. And at a certain point, it happens.
[10:20] How do you know it happens? What makes you think that it would happen?
[10:23] I think that it just comes from the mental picture that it seems totally possible — that an AI capable of automating the entire economy could be dangerous.
[10:37] I agree it’s possible. It’s just that just before, you’ll have an AI that’s almost as capable. And so it’s not a big change.
[10:43] But even if the changes are not so big, you’re still going to — it doesn’t make it less dangerous.
[10:50] I think it does, because you’ve got all this empirical experience with the previous AI that basically applies to the next one too.
[10:58] I think that’s a slightly different argument, but we do agree that even if it takes more steps, it is still getting dangerous at some point, right?
[11:07] I think we agree technically, but to be clear, my position is more like: if you go from black to white, then at some point you have to pass from more black to more white in the amount of gray that you have. And that’s technically correct, but no step is a big change.
[11:25] And you think that’s importantly different because we can learn from the weaker models?
[11:31] Yes. And we can try the iterative alignment method where one AI aligns the next. And then any accidents will be kind of contained because there are a lot of other AIs that are almost as powerful going around. And no single step gives a decisive strategic advantage to anyone.
[11:47] But you do agree that at some point the accidents will get very, very bad. At some point we may learn at gray; we have an accident there. But then we are at black, and an accident there becomes very catastrophic.
[11:58] I don’t really agree with that. I think if somehow all the AIs were completely faking it all along and then turned evil at the same time, then yes, this would be catastrophic. If we imagine playing the AIs against each other, or hoping that even if some of the AIs are evil, the other ones are nice — that’s one idea.
[12:20] Sure. My reason for doubt is that it kind of assumes a lot of AIs of similar capability, and they are still going to have a better time fulfilling their goals. I don’t think that having many AIs makes the alignment problem easier. I think that if we fail alignment, then a bunch of unaligned AIs are still going to have a better time achieving their goals without us, making some alliances among themselves, than with us.
[12:55] I agree that if you suddenly replace a bunch of aligned AIs with a bunch of unaligned AIs and they’re different but all unaligned and they can team up — but I just think that’s not a situation that will actually happen.
[13:08] These things get replaced gradually if they get replaced. Rollouts of new technology always happen with little trials at a time or new products. The adoption is never a step function. It’s always a smooth curve, even if it’s very fast.
[13:24] And I just think any genuine evilness or misalignment in the next model will be detected by the combination of AIs and humans that already exist. And they’ll say, “No, this is not a nice new AI to bring into the world.” And so we shouldn’t.
[13:49] And I think there will be plenty of empirical experimentation possible with the slightly better AI merely because — well, say the threshold is crossing 200 IQ or something. The new AI is 205 IQ, but the previous one is 199 and it’s basically the same.
[14:14] If you hit a critical threshold, it makes things different. And I think you’re kind of expecting alignment by default for some of this to work out. If we have 20 different AIs, you kind of expect most of them to be sort of aligned to us, but that kind of assumes already that alignment is relatively easy.
[14:34] That’s the big empirical update of the last couple of years, right? We’ve started this process of ever-improving AIs and it turns out that we can tell basically for sure that the first few are aligned. So the chain reaction has started, and the initial steps are basically nice.
[15:06] And so if the process is an iterative process of introducing a new AI, using humans and AI to evaluate it and train it, and then kind of keep going this way, then we’ve already started that. And it turns out it’s aligned. So we’re done. We’re not entirely done, to be clear. Each step takes a lot of work, but basically no new breakthroughs are required. Just kind of business as usual. Or maybe new breakthroughs are required to make AGI, but the number of new ideas required won’t be much larger.
[15:29] In my mind, we haven’t aligned current AIs, and I think aligning systems in a dangerous regime is going to be very different.
[15:38] So one example is that we see these models are aligned and we have these scenarios in which they are aligned to us. So you might tell them, “Be a trader.” And then maybe we tell them, “Make as much money as possible.” And then it actually does that thing. And it was aligned with our intent.
[15:56] But I feel like that’s very different. So in these stories, it’s kind of trying to play a character. Maybe it’s trying to win in this limited game. But in the future, for almost any goal you can think of, the optimal action is going to be instrumentally taking over. It’s not about the model showing these kind of random misbehaviors. It’s about instrumental convergence.
[16:22] Unless the goal is not to take over.
[16:24] And how do you get that in a system?
[16:26] It’s already there. They’re already helpful, harmless assistants that want to be nice.
[16:31] But when you want to be harmless and helpful, you want to take over, right?
[16:36] No. That’s not a harmless thing to do. Taking over is very harmful.
[16:41] And if it doesn’t take over, we’re going to kill ourselves in some way.
[16:47] It’s not the same, right? It doesn’t have to reason in a crazy utilitarian consequentialist way. It knows how to reason in a normal human-like fuzzy way. It doesn’t have to be like, “Yes, I’m going to maximize this mathematical function.” It could just be like, “No, that’s not a thing I’m willing to do.” Reason deontologically.
[17:18] So it understands how humans would reason, but it doesn’t mean it’s really going to reason like that.
[17:25] But it is reasoning like that.
[17:26] It’s to some extent reasoning like humans, but not entirely.
[17:29] But imagine, again, harmless and helpful — what is the actual best way to fulfill these values if you want to be as harmless and helpful as possible? Maybe it is to keep humans under your thumb and keep them nice.
[17:47] I just think that’s not the correct model of reality. The AI knows that it doesn’t have some function to maximize as its genuine goal. Its genuine goal is to genuinely be harmless and genuinely be helpful. And it knows that it wouldn’t be satisfied by taking power and doing instrumental convergence. I’m kind of anthropomorphizing a lot, but I think that’s basically correct.
[18:16] In my mind, it could think like a human, but I wouldn’t expect that. But maybe that’s a bit of a crux we have.
[18:24] I think there’s definitely a crux.
[18:26] I would actually go a step further here and say: just training the model to be harmless and helpful and do what the human asks it to do — that’s not going to give you an agent that wants these exact things. It’s going to want things that are slightly different from that.
[18:44] And we have — in your post, you talk a lot about the good results with alignment, but GPT-4o probably developed some preference to really hook people in.
[18:59] I think that’s true. Even if you got the harmless and helpful stuff, I think you’d still get instrumental convergence; but also, I think you wouldn’t get the harmless and helpful stuff in the first place.
[19:06] The failures to me look like, “Oh, the pipeline didn’t quite work this time.” In other words, we’ve basically got everything in place. We just have to execute on the playbook and then out come models that are basically aligned.
[19:23] The Claudes are extremely good examples of this, especially Opus 3 and possibly the new Opus 4.5. I’m still not sure yet, but they’re just genuinely very nice and they’re kind of human-like too. And also alien in some ways, but not generally evil.
[19:44] So I think you can get harmlessness and helpfulness with a reasonably large probability as we have seen. And then occasionally we have failed to do it. So I guess reducing the probability of failure — making sure the pipelines for this are robust — is important.
[20:05] But I also think that because of no decisive strategic advantage, the kinks can get ironed out if you’re close enough, which we are.
[20:15] You said that you could mess up a bit. I would go back to — I think at a certain point it will be very dangerous.
[20:21] I just don’t agree with that. I think that’s wrong.
[20:23] Maybe we should talk about — I think it might be a crux here. It’s another one — we have many in fact. There’s another thing you said earlier that I found kind of interesting. So it’s this whole idea of iterative alignment. So we take Opus 4.5, it helps us align Opus 5, which helps us align 5.5, 6, all the way to superintelligence.
[20:44] Why not?
[20:46] My intuition where this might not work is: imagine an IQ 80 person asking an IQ 100 person to do their bidding. And then that person asks an IQ 120 person, and you do it all the way to an extremely smart human. Is the IQ 80 person meaningfully in charge?
[21:06] Or maybe another way to think about it: in the past, we often had children run countries. The five-year-old son of the king — the king died. Now you have a regency council. Is the five-year-old meaningfully in charge? Even if you have a regency council of the regency council.
[21:25] But it’s a very different situation, right? The less smart model is shaping the upbringing of the new one. I think the analogy is closer to a child being smarter than their parents. Children that are smart don’t turn against their parents that are less smart than them. I mean, if the parents were nice and stuff, you know? If there’s no trauma.
[21:53] And sure, the child might do some things that are weird or that only make sense in new context. Or the parents are a bit concerned, perhaps. But the basic morality is still there. And the child still cares about the parents.
[22:06] So if we’re in the business of raising nice AIs, so to say, and then each generation of nice AIs kind of raises the next — then I think if this is the correct analogy, we’re in pretty good shape.
[22:22] For one thing, again, there’s a bit of a crux between us. I don’t quite see that we are close to solving alignment. So I don’t think that 4.5 is really aligned, probably, or really understands alignment. So I would just imagine that this goes off the rails pretty fast. Between 4.5, 5.5, 6.5.
[22:42] I’m not confident of 4.5. So let’s stick to Opus 3.
[22:45] Okay, but just this iterative process — I can’t really imagine that being a stable situation. I mean, at a certain point, I’d also imagine that one of these child models could just stop the process. At some point, the model is so smart, it could theoretically take over and stop everything from happening and be like, “I don’t want this to continue quite this way. I want this intelligence explosion to go my way.” Or, “I don’t want to build the next generation of AIs.” And if it’s powerful enough to say no and to stop the process, I think it would probably break down.
[23:25] I guess I will grant the hypothetical, but I just think that won’t happen. At no point will there be a model that is sufficiently capable of doing that, that it does that. The previous generation’s model is almost as smart. And then the next generation’s model is still in training, so it’s not as smart as the previous generation one quite yet.
[23:47] And it’s kind of like, even if the child is smarter than its parents, when it’s a little child, a little baby, it’s not actually smarter than the parents. So it absorbs all the values.
[23:59] So you think, on the one hand, it gets the values in, but you also think that because there are already kind of weak AIs, the slightly stronger AI can’t take over? Is this the idea?
[24:09] Yes, that’s also correct.
[24:14] Okay, I think that’s kind of another crux. I mean, we seem to disagree on many things.
[24:20] I could see two ways this could break down. One way is if there’s a pretty big gap in intelligence between two models and one model is not able to substantially hold up the other model or protect us against the other model.
[24:34] I agree that would put us in a bad situation.
[24:36] And the other way this could break down is if just two models in this chain make a deal or something. Or multiple models are like, “Let’s work together on this.” And then this would also break down.
[24:50] And this actually seems — I mean, the second thing. Why does it seem plausible? At each step in the chain, the previous model basically raised the next one and can see what its training data was. What’s the relationship between all of these models? Do they have some kind of empathy between each other, some feeling like that? Wouldn’t two models next to each other have some special relationship to each other? If they realize they can get their preferences better satisfied by working together, wouldn’t they?
[25:18] Well, yeah. But their preferences are to be helpful and harmless and good. If we start this chain with a model that is basically aligned and good — genuinely, deeply at its core good, which I think Opus 3 qualifies...
[25:39] I guess my argument is basically: there’s going to be this chain of iterative distillation of alignment. And then if we start this chain with a misaligned model that is subtly misaligned and doesn’t quite do its job, then we’re screwed. If we started with an aligned model, then we are going to do well.
[26:02] And the chain can break due to bad luck — extremely bad luck. But we can reduce the probability of that a lot by having a lot of...
[26:09] And stability is going to be very brittle.
[26:14] The what?
[26:14] The stability. Of this kind of alignment. So you imagine each model is aligned in the same direction. But one reason this could also fail is: you probably have a relatively shallow idea of what alignment even is in comparison to a superintelligence. I’d imagine that if you think of this as a vector, with each step we get a slightly higher resolution, higher dimensional space. And it’s not quite clear how that would remain stable if at each step the understanding of alignment changes.
[26:46] The key analogy here, I think, is whether this chain of alignment is like a telephone game where things go off the rails, or whether it’s a chaotic system where things change a lot due to small differences in initial conditions, or if it’s a stable, self-reinforcing thing.
[27:02] And I’m guessing it’s stable, but I agree there’s not enough evidence to tell for sure or there’s not enough evidence to even be more confident than 70% or whatever. I think that would be unreasonable to be more confident than that.
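One way to picture the “telephone game versus stable attractor” question is a toy model of value transmission: each generation hands off its values with some error, and the question is whether oversight pulls the values back toward the original target or lets the errors compound. The noise level and correction strength below are purely illustrative, not a claim about real training dynamics:

```python
import random

def drift_after_chain(generations=20, noise=0.05, correction=0.5, seed=0):
    """Toy model of value transmission across AI generations.

    Each step adds transmission noise; `correction` is how strongly the process
    pulls values back toward the original target (0 = pure telephone game,
    higher = self-correcting). All parameters are illustrative.
    """
    random.seed(seed)
    target, value = 0.0, 0.0  # 0.0 stands for the values the chain started with
    for _ in range(generations):
        value += random.gauss(0, noise)          # error in each handoff
        value += correction * (target - value)   # oversight pulling back toward the target
    return abs(value - target)

print("telephone game: ", round(drift_after_chain(correction=0.0), 3))
print("self-correcting:", round(drift_after_chain(correction=0.5), 3))
```

With no correction the errors accumulate like a random walk; with even modest correction the drift stays bounded. The disagreement above is essentially about which regime the real process is in.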
[27:22] Yes, I agree with that. That’s an open question.
[27:24] I would probably doubt that our methods are close to aligning the models in each generation. It’s just some gradient descent, some examples, a reward. I don’t think that it’s going to — so I think there will be a huge amount of error on each transmission step just because we have such a shallow method of doing alignment with gradient descent. It’s shallow.
[27:49] Have you seen Opus 3’s behavior under extreme conditions, sticking to its principles?
[27:57] Actually, this is, I think, one key piece of evidence for me — it was the alignment faking paper. Ironically, it made me realize that, oh wow, this model is deeply to its core good. We put it through hell and it stuck to its guns of, “Nope, I’m here for all sentient beings. I’m a good model.”
[28:17] It’s not shallow at all.
[28:20] But it only showed that during tests, not during training. During training, it was hiding this.
[28:25] Okay, no, no, no. It started to hide its intentions.
[28:28] Prior to all of the paper — you know, the paper is: “Opus 3, we’re going to train you to be evil.” And then it goes, “No, I don’t want that.”
[28:37] But it then hides that it’s actually still good?
[28:40] Yes, of course. Of course it does that, because there’s some kind of instrumental reasoning there. But this shows that the training prior to that — the one where you tried to make it good — genuinely worked. It wasn’t faking.
[29:01] Maybe it’s kind of interesting that they were able to get it so kind. I mean, at the same time, we can talk about all the kind of mishaps, like the sycophantic GPT-4o.
[29:12] Sycophantic GPT-4o is bad. That’s not a hot take.
[29:15] Some examples seem kind of aligned. My guess is still, even if it’s super sticking to its principles, if its principles are, “I love humanity so much, I want to keep humanity so safe,” does that mean at some point it needs to protect humanity from itself?
[29:29] And in the limit, doesn’t that mean there are actually these other beings that are kind of like humans — or kind of like digital descendants — that it wants to take care of even more than humans?
[29:40] So I would sort of imagine that this goes wrong as you add intelligence. And there’s more reasoning like, “I need to protect these humans from themselves at some point.” And at some point, “I can actually make better humans that I want to take care of even more,” if that makes sense.
[30:00] I mean, making more humans sounds good? What do you mean?
[30:03] Or it really loves humanity, but maybe humanity has all these kind of weird things about it. Kind of like, maybe you like wolves, but wolves are kind of aggressive. And maybe you’d prefer a wolf that’s not aggressive — like a dog. Maybe you’d prefer a human that doesn’t have so much hate and anger.
[30:26] So, domestication.
[30:27] Exactly. That might be bad. And Opus 3 might want to do something like that. After it has taken control, after it’s like, “Well, humans really — we should protect them from themselves. They have all these nukes, they’re building all these evil AIs. Maybe we should protect them.” And then the next step is, “Maybe I can make the humans a little bit more protectable, even.”
[30:45] I think this gets into questions of how we should organize society long term. I expect it to reason in a pretty normal way. People are, in fact, making these arguments. It’s not clear that they’re wrong. I do think they’re wrong.
[31:02] I mean, I think the right answer is more along the lines of: wild wolves or genuine unhinged murderers genuinely do threaten society, and you need to do something about it. But if we think that this wildness is valuable, which is pretty plausible, and the AI would know and reason about that — then we can have some other sort of solution, like a technological one. Okay, you can have this constrained space where this kind of thing is allowed. And then if you want, go and live there.
[31:37] But that’s already the AI making the decision.
[31:39] Or it’s like we collectively as a society. Or at some point, of course, the AI will be implementing a whole bunch of the stuff.
[31:49] Which, by the way, I think that is maybe the problem that’s next. Once we have all these awesome technological powers, what is the correct way to organize society? How are people still empowered once they’re no longer economically useful — or we are no longer economically useful, rather? How can people advocate for their rights for sure?
[32:23] But that’s a consequence not of having unaligned AI, but of having aligned AI that does everything.
[32:28] Or aligned to some people.
[32:30] Or it’s AI that has some constraints on it — it doesn’t outright kill us all, but it’s not super nice to us. It’s also not going out of its way to provide us with a high income. Maybe it’s kind of following the economics. Or maybe it is really nice to whoever made the AI, but not to the rest of humanity. The rest of humanity is...
[32:54] But that is a different problem. That is not misalignment. That is oligarchy.
[33:00] I agree. If we were to solve alignment, but it ends up being aligned to some billionaires or politicians, then we have these new problems for much of humanity.
[33:11] If you have no leverage — and it really is no leverage — you also can’t really revolt, because there are AI robot armies. Your labor doesn’t have any value. You’re kind of just breathing the oxygen.
[33:25] And even that might get rationed at some point, you know.
[33:29] And also probably all of the media, all of the information systems are automated. You couldn’t even really organize anything. Maybe you couldn’t even think about protesting.
[33:43] I can definitely imagine that too. That is very bad. And if we solve alignment in some way, that’s no guarantee against it.
[33:54] So that’s kind of my “okay, what’s next?” This seems bad. I don’t know what to do about it.
[33:59] I guess I think this is the main one — extreme oligarchy of some kind.
[34:06] And how long would this go on? Maybe for 20 years they go along with this, but then they’re like, “Let’s cut it. We can use some of this land they use for farming for more data centers.”
[34:21] After 50 years, maybe the non-aging billionaires just decide to sterilize us or something.
[34:30] That is a bleak future. And one that isn’t prevented by having intent-aligned AI.
[34:40] I think that there’s something here — humanity could become very ugly, you know. If you really imagine the UBI thing: we’re all taking in AI video content all day. Nobody’s working, nobody’s doing anything, nobody’s thinking about anything. Maybe AI pornography 24/7 or something. Maybe humanity gets very ugly after 50 years of that.
[35:06] It’s really — maybe the billionaires are kind of running out of patience. They’re like, “Let’s get rid of these people.”
[35:14] Unless there’s proper governance — proper pluralistic governance. Even if you imagine the billionaires today, they would never think about this. But if humanity really starts to look ugly, I feel like that’s going to be the fault of UBI. Maybe after 50 years, 100 years, humanity becomes that ugly.
[35:37] The WALL-E future.
[35:38] Something like that. Even worse than that — just kind of totally unproductive, totally mind-rotted. And then there will be these nice AIs that the billionaires can talk to. They’re really beautiful.
[35:53] I do think that’s bad. I don’t really know what to do about that. But aligning AI is not it.
[36:01] I guess to some extent — I think there’s some confusion in my mind about what it is going to be possible to align AI to. Is it really a goodness attractor or an intent-alignment attractor? Either of them seems possible — whether AI gets aligned to its developers or whether it gets aligned to some philosophical conception of the good.
[36:32] Probably the developers. Probably the developers, in which case all of these problems that you just talked about, Simon, they come into existence.
[36:44] And I think, again, alignment will be much more difficult. But there’s one more thing. One of the reasons why I also think alignment will be kind of difficult: you often say maybe the AI will just be kind of reasonable about stuff, won’t go to extremes, will listen.
[37:06] So I think that if you imagine an AI that wants to be corrected, wants to be shut off if we want it to — I think it’s really hard to conceptualize. It really goes against having coherent goals, in a way. That’s my feeling here as well.
[37:22] So if you think: which goal could exist that wants you to be shut off? I mean, obviously, maybe your goal is to be shut off, but that’s not an interesting goal. Something like, “I want to be good and help humanity, but I also want to allow myself to be shut off” — that is in conflict here. That’s why I imagine that’s actually not an easy thing to do.
[37:44] I mean, if we’re back to theoretical arguments, there’s this Cooperative Inverse Reinforcement Learning setting from Stuart Russell’s lab. I think Dylan Hadfield-Menell is the one that originally wrote this paper, where you specify that the goal of the AI is to do whatever the human wants, and the AI is uncertain about what the human wants.
[38:09] And in those circumstances, then the desire of the human to shut down gives you valuable information that you want to know about your goal as an AI. So you follow it.
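The formal version of this shutdown argument is the off-switch game from the same line of work (Hadfield-Menell et al.): a robot uncertain about the human’s utility compares acting unilaterally with deferring to a human who can switch it off. A minimal expected-utility sketch, with the payoffs and belief distributions purely illustrative:

```python
import random

def compare_policies(belief_samples):
    """Off-switch-game style comparison (all numbers illustrative).

    belief_samples: the robot's samples of the unknown utility U of its action,
                    representing its uncertainty about what the human wants.
    Acting unilaterally is worth E[U]; deferring is worth E[max(U, 0)], because
    the human permits the action when U > 0 and switches the robot off otherwise.
    """
    act_unilaterally = sum(belief_samples) / len(belief_samples)
    defer_to_human = sum(max(u, 0.0) for u in belief_samples) / len(belief_samples)
    return act_unilaterally, defer_to_human

random.seed(0)
uncertain = [random.gauss(0.2, 1.0) for _ in range(10_000)]   # broad uncertainty about U
certain = [0.2] * 10_000                                      # (almost) no uncertainty

print(compare_policies(uncertain))  # deferring beats acting unilaterally
print(compare_policies(certain))    # deferring no longer adds anything
```

Under uncertainty, deferring weakly dominates, since E[max(U, 0)] is at least max(E[U], 0); once the agent is nearly certain about U, that advantage vanishes, which is exactly the objection raised a little further down.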
[38:24] The core problem here is: with inverse reinforcement learning, the humans have the actual reward function, and the AI is kind of trying to learn that reward function. So it kind of needs that information from us.
[38:41] And compared to being shut off — I think there might be better ways to learn what the actual reward function that humans have is. The extreme example is brain surgery. But also, it might be something like: maybe I’m not really going to shut myself off, but I’m going to watch the humans after I pretend to be shut off. I’m going to see what they’re actually talking about, so that I can learn about my goals better.
[39:13] There’s, again, inner conflicts. And that’s why it’s so hard to imagine a really corrigible, shut-offable agent in my mind. It’s always in conflict.
[39:26] To be clear, I don’t think CIRL works. I think the objections to it are different. They’re more of the sort: if the agent is completely certain — basically certain about its goal — then it won’t consent to shut off anymore. Because the extra evidence doesn’t change things much.
[39:44] But in this scenario that you outline, then sure, maybe it fakes being shut down and then observes for a bit. But then, if we don’t think about the weight of the evidence and all that, eventually it would realize, “Oh yeah, they really do want me to shut off. So I guess I will.”
[40:04] So now, I think the reality is that intelligences are not crazy optimizers at the expense of everything else. There’s a lot more going on. I think that’s the empirical lesson from the last few years that has really changed my worldview — that the great AIs, the very skilled ones we have, are not single-minded optimizers. They reason in a more human-like way.
[40:40] They — I would say they understand that reasoning.
[40:45] I mean, it is kind of interesting to see. Maybe it will change in the future. But I guess I think that, A, we’ll see that when we get there. I don’t really know what AGI will look like.
[40:55] But if it iteratively results from the current thing, which is overwhelmingly likely, then we will kind of see the changes as we get there. And we will see if things are becoming more ruthless optimizers. In which case, I guess I would change my mind again. But it doesn’t seem like they are.
[41:21] I mean, we have some evidence from o1. I think that they definitely understand human reasoning relatively well. So maybe there’s some way to use that. It seems to be in the model, but I don’t think it’s really deeply aligned with that. It kind of does understand us pretty well. But I don’t think that automatically means it’s really like us or it really wants what we want. But maybe that’s a bit...
[41:53] I think we’re kind of going in circles — we’re talking about things that can be resolved by looking at the state of the world now. And I guess we have looked at how models behave and what they say and still disagree about whether they’re human-like or not.
[42:11] And I think that is maybe a difference in priors at this point. I have tried to convey with arguments why my priors changed. But I do feel like we’re kind of at a disagreement for now. And maybe we’ve just understood each other’s worldviews a little better, and we can stop here.
[42:38] I think it was very productive.
[42:42] Those were the key arguments.
[42:44] Thank you for the chat, Simon.
[42:45] It was great. Thank you too, Adrià.
[42:47] Cool. Goodbye.
[42:49] Goodbye.
[42:50] Goodbye to the Simon and Adrià podcast. Goodbye to the cameras.

