
Where does the race to automate AI research end?

This is a research talk I gave on how automating AI research could lead to an unrecoverable catastrophic alignment failure.

TL;DW: A recording of a recent MATS research talk where I argue that the automation of AI research, which OpenAI and Anthropic say is imminent, could lead to an unrecoverable alignment failure. Three properties make it especially dangerous: oversight breaks down at scale, capabilities self-amplify, and capabilities research will be sped up much more than alignment research. The outcome could be a lethal, unrecoverable alignment failure. Link to the paper preprint.

Transcript

[0:00] So my talk is about automated AI research and the risks that come with it. This is a very relevant and imminent topic.

[0:08] So we have OpenAI and Anthropic both talking about this. Roughly, the timeline for both of them is that in a few months, they want to have

[0:16] maybe thousands of research interns. And then by 2028, they want to have totally automated AI research,

[0:24] maybe hundreds of thousands of fully human- or superhuman-level AI researchers. For those out of the loop, this is what Jack Clark says:

[0:33] No humans in the loop by 2028; it’s more than 60% likely, in his view. OpenAI has a very similar view on this topic.

[0:43] We had somebody from MATS 8.0, Sev Field. He interviewed 25 researchers from labs and academia. 20 out of 25 said automating AI research

[0:55] is one of the most urgent risks posed by AI systems. It is a very urgent, very imminent thing.

[1:02] I’m going to go into one argument why this is very dangerous and very imminent. The basic point I’m making is actually closely related

[1:11] to a lot of the talks that came before me. Oversight is going to be very difficult. You’re going to look at thousands of agents

[1:19] that are going to be increasingly intelligent. And there’s going to be a huge scale-up due to effective compute improvements,

[1:29] algorithmic advances, and also by physically having more compute available to systems.

[1:35] So oversight mechanisms are going to enter a phase where the effective compression ratio goes up very fast.

[1:44] Humans are going to be able to read less and less of what these agents produce. The second property of this

[1:52] is going to be self-amplification. With this process, we’re using AI to improve AI.

[1:58] This is a self-amplifying process. The better the agents get at AI research, the faster the process is going to move.

[2:05] So we could have very explosive progress with very little monitorability. And then the third point is going to be

[2:11] on the asymmetry of this. What I’m talking about here in particular is that I expect there to be much slower progress

[2:20] on automating alignment research than capabilities research. I think there are very strong arguments

[2:26] why these two things are not going to be sped up at the same rate. I think that one very possible,

[2:33] quite likely outcome of that is a rapid, unrecoverable failure of alignment, where we very rapidly, without much warning,

[2:42] end up with a robustly superhuman AI system that is misaligned. We have very little monitorability.

[2:49] It’s very fast. And we get into a state where we cannot recover from this failure of alignment.

[2:54] So just to get some kind of rough scale, we heard numbers from OpenAI that they’re looking at something like 10,000 research interns

[3:02] being run quite soon, possibly this year. But these are not fully autonomous, in their expectation. If you just do the scaling based on compute,

[3:13] you would expect, in just a few years, there to be many more of these agents. And even in the beginning,

[3:22] you would have an enormous amount of data produced by these agents: the experiments they run, the chain of thought.

[3:28] You would have a high implicit compression ratio. Like, how much can the 1,000 human researchers at OpenAI or Anthropic really read?

[3:37] Now, there are methods proposed for this, for example by Google DeepMind. One of them is hierarchical summarization.

[3:43] And there are variants of this. In some form, there are monitor AIs that summarize what’s going on

[3:50] and that flag dangerous behavior. And we have a chain of thought, and we can summarize the chain of thought of models

[3:57] and potentially figure out what they’re thinking about. But there are big questions about this. So one thing is the compression ratio

[4:04] would be expected to go up over time. So we have many more agents also being more intelligent over time.

[4:14] So we would expect the compression ratio to change in such a way that more and more data gets produced and less and less gets read by humans.
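To make the compression ratio concrete, here is a minimal sketch of the arithmetic. The 10,000 agents and 1,000 human researchers are the rough figures mentioned above; the per-agent output and per-human reading capacity are purely hypothetical placeholders, not numbers from the talk.

```python
def compression_ratio(num_agents: int, tokens_per_agent_per_day: int,
                      num_humans: int, tokens_read_per_human_per_day: int) -> float:
    """Tokens produced by agents per token that humans can actually read."""
    produced = num_agents * tokens_per_agent_per_day
    readable = num_humans * tokens_read_per_human_per_day
    return produced / readable


# Assumptions (hypothetical): each agent emits 10M tokens per day of
# experiments and chain of thought; each human can read 100k tokens per day.
TOKENS_PER_AGENT = 10_000_000
READ_PER_HUMAN = 100_000
HUMANS = 1_000  # human reading capacity stays constant

# Ratio for a growing number of agents, with the human side held fixed.
ratios = {
    agents: compression_ratio(agents, TOKENS_PER_AGENT, HUMANS, READ_PER_HUMAN)
    for agents in (10_000, 100_000, 1_000_000)
}

for agents, ratio in ratios.items():
    print(f"{agents:>9,} agents -> 1 token read per {ratio:,.0f} produced")
```

Under these assumed numbers the ratio scales linearly with the number of agents, because the human reading budget in the denominator does not grow.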

[4:22] So the amount that humans can read is constant, but there is so much more data over time. And then we also know about complacency

[4:31] in humans who oversee automated processes: they become complacent. And we have heard much about different control schemes here.

[4:39] But there’s a big risk that your control scheme might give false positive results. And what happens then?

[4:48] The natural human response is to become complacent. If the last five times the control scheme gave a false alert,

[4:55] on the sixth time, you’re likely to just press OK. You just press the OK button and the system continues running. If there is an actual failure,

[5:07] what’s the clear path? To move on? It’s probably a false alarm. So oftentimes, people are just going to move on.

[5:13] We know that chain of thought is not always faithful. And also, you have automated AI researchers that you try to make really good AI researchers.

[5:21] Maybe they’re just going to decide at some point: maybe we just drop the chain of thought. Maybe we just do latent chain of thought.

[5:26] Maybe we can be 20% more efficient if we drop that feature. So your agents might just kind of drop

[5:31] the interpretability at some point. So another thing here is, this is a self-amplifying process.

[5:38] So what this means is that the speed at which your AI agents perform AI research depends on their capability at AI research.

[5:50] Now this is the defining property of the exponential function: the derivative depends directly on the value itself, here the capability.

[5:58] That gives a very simple conceptual argument for expecting fast, exponential progress. Now at the same time,

[6:05] we already have an exponential increase in compute and algorithmic advances. We already have a four-month doubling time

[6:13] in terms of effective compute. Now there’s some question here: if effective compute goes up exponentially,

[6:20] do we see exponential improvement in intelligence? I think there’s some wisdom here that you need an exponential increase

[6:27] in effective compute for linear progress. So we might see true exponential progress in AI capabilities and intelligence

[6:36] only when we really go into the self-amplification loop. There is a real possibility that we’d actually see exponentially

[6:42] decreasing doubling times. You might get into hyperbolic growth. So there are some assumptions here,

[6:49] but I think on some level, the intuition is just this kind of self-amplification. The speed of improvement depends on the capability itself.
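The difference between exponential and hyperbolic growth can be illustrated with a toy model. The sketch below assumes capability C follows dC/dt = rate * C**exponent; this functional form and all parameter values are assumptions made for illustration, not a model from the talk. With exponent 1 the doubling time is constant (plain exponential growth); with an exponent above 1 each doubling takes less time than the last, which is the hyperbolic case.

```python
def doubling_times(exponent: float, rate: float = 1.0,
                   dt: float = 1e-4, doublings: int = 5) -> list[float]:
    """Euler-integrate dC/dt = rate * C**exponent from C = 1 and
    return the time taken by each successive doubling of C."""
    capability, elapsed = 1.0, 0.0
    target, last_hit = 2.0, 0.0
    times = []
    while len(times) < doublings:
        capability += rate * capability ** exponent * dt
        elapsed += dt
        if capability >= target:
            times.append(elapsed - last_hit)
            last_hit = elapsed
            target *= 2.0
    return times


# exponent = 1.0: growth rate proportional to capability (exponential).
exponential = doubling_times(exponent=1.0)
# exponent = 1.5 (hypothetical): returns to capability compound faster.
hyperbolic = doubling_times(exponent=1.5)

print("exponential:", [round(t, 3) for t in exponential])
print("hyperbolic: ", [round(t, 3) for t in hyperbolic])
```

In the exponential case every doubling takes about ln 2 ≈ 0.693 time units. In the assumed hyperbolic case the doubling times shrink geometrically, so their sum stays finite and the model diverges in finite time, which is why such a regime would leave so little warning.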

[6:58] And maybe also the fact that computers are very fast. They’re much faster than human beings. The last kind of conceptual argument

[7:04] is that we’re likely going to see an asymmetric speedup. The simple reason is, it’s easy to test that an AI can perform a coding task:

[7:13] it can pass all the tests. But it’s much harder to check that your AI is aligned. If this is a very simple behavioral test, for example,

[7:22] your AI might be aware it’s being tested, and it might understand what kind of behavioral responses you like to hear.

[7:30] At the same time, for capabilities, the only way to solve a challenging math puzzle or to solve a challenging programming task

[7:38] is to actually be able to perform it. So there’s a clear feedback loop, a clear verifiable reward

[7:45] that we don’t see for alignment research. For alignment, there are also open conceptual questions. What does corrigibility mean?

[7:55] What should an AI do if there are conflicting principles? If its constitution clashes with itself or with what it’s being asked to do?

[8:04] These questions don’t really appear with capabilities, where we all kind of agree a model is good at maths or coding.

[8:12] This goes back to this kind of oversight argument. A lot of control schemes, for example, require an aligned overseer

[8:18] that is roughly as capable as the model it’s overseeing. And this also depends on alignment research actually coming along reasonably fast.

[8:28] So what this all means is we might have an incredibly short window for action. We might be looking at maybe just days or weeks

[8:36] where we have a big blow-up in intelligence. And if alignment then fails on a robustly superhuman system,

[8:45] the system might misrepresent its intentions. It might resist correction of its alignment. It might prevent its own replacement.

[8:55] Usually we think we can just revert, retrain, go to the last safe state. This might not be available at this point.

[9:04] And just to make it literal here: one way it could prevent its replacement or retraining is it could just kill everybody.

[9:13] So that’s one of the options it has. Yeah. Yeah.

[9:19] Yeah. Yeah. Timing.

[9:28] Okay, any questions? One plausible story for this being wrong is that there is a large class of fundamental

[9:46] agent foundations-y kind of problems that so far have been very difficult for humans to make progress on

[9:52] because it’s very difficult to get humans to be really, really good at maths and things like this. And so one plausible scenario I’ve thought of before

[10:01] where this could be wrong, though generally I share the same concern, is that, essentially,

[10:09] alignment research gets disproportionately sped up just because it turns out there was this fundamental capability

[10:16] that suddenly makes a lot of these problems much easier than they were before. Do you have thoughts on this

[10:21] as a potential alternative world? I think that there’s still a lot of confusion in the field of alignment about what alignment is.

[10:30] People have totally different ideas. You know, I think that some labs think it’s mostly about just kind of

[10:39] creating this nice character for the AI and then having it follow that character. And other people think

[10:45] that’s a totally different approach from what would work, actually. I think that there’s value

[10:50] in working on automated alignment, and I think there’s some possibility we would see some big benefit there.

[10:57] I don’t think it’s very likely, but yeah, I think there’s some possibility. Yeah, thank you.

[11:04] I have one question about the asymmetric speedup. I want to say, I agree that, with capabilities,

[11:14] it’s easier to verify or to automatically verify, right? At least, you know, math and coding;

[11:20] maybe writing a novel, not so much. But also, we heard in the very beginning that there are way fewer alignment researchers

[11:30] than there are capabilities researchers. So maybe giving them AI will speed them up more than it will speed up the larger group, right?

[11:40] How do you run the calculation of which effect is greater? Yeah, I think the main thing is just,

[11:49] again, it’s really easy to verify that it has the capability. And I think that there’s value

[11:56] from automated alignment research, but there are just these huge questions where people just disagree what alignment even means

[12:02] or how to measure it. I just don’t really expect it to be faster.

[12:11] And I think that also naturally makes it more profitable to put more compute into capabilities as well.

[12:22] And I suspect that in practice, they will put most of the compute into capability automation.

[12:29] And on alignment, I just suspect that there’s going to be a disagreement on what it means. And it’s very overdetermined in my mind.

[12:37] I think it’s not totally impossible. It’s overdetermined that it’s going to be much more useful for capabilities.

[12:44] There are these inside arguments about verifiability and speed, and about there being confusion among alignment researchers.

[12:52] And there are these kind of outside arguments. A company that will go harder on capabilities will likely progress faster than one that

[12:58] sets aside, like, 50% for safety, right? So you have a field of, like, five, six different companies doing this at the same time.

[13:07] That’s, like, an outside argument here. You have another, like, outside argument that it might be more of an incentive

[13:13] to put more compute into capabilities research. So you have these, like, outside arguments. And these incentive arguments do,

[13:21] in my mind, seem to tilt it a lot towards speeding up capabilities more than alignment. But do you see that, for example,

[13:28] the companies that push most on capabilities are winning the race at the moment? Because I’m not sure I would agree with that.

[13:34] I think it’s a very messy situation because, I mean, people will say Anthropic is number one, but they are also the most aggressive

[13:42] in going into automated AI research. So I think this is an interesting thing that people who really believe in AI safety,

[13:53] they may believe that they have to win in a way. If you really get why AI is so dangerous, then you can also really see the paths

[14:03] to build more powerful AI. Because people who don’t get that, maybe they don’t see that.

[14:07] Anthropic certainly cares about safety to some extent, but they also don’t have a huge lead. It’s not like they could just burn six months

[14:15] and still be in the lead. It’s like they have the next guy on their back.
