Why Teaching AI Right from Wrong Could Get Everyone Killed | Max Harms, MIRI


Summary

Max Harms, an alignment researcher at MIRI, explains the central thesis of the book ‘If Anyone Builds It, Everyone Dies’ by Eliezer Yudkowsky and Nate Soares: building an artificial superintelligence (ASI) without solving alignment will lead to human extinction. The argument rests on the orthogonality thesis (intelligence and goals are independent), instrumental convergence (AIs will seek power and self-preservation), and the evolutionary analogy showing how creators often end up misaligned with their creations.

Harms discusses why current alignment approaches that try to instill human values in AIs might be misguided. He argues that training AIs to care about proxies (like thumbs-up feedback) rather than true human values leads to dangerous misalignment when AIs gain power. The book uses analogies like humans using birth control despite evolution’s ‘goals’ to illustrate how AIs could pursue corrupted proxies.

Harms presents his research agenda on ‘corrigibility’ (CAST) as a potential alternative. Instead of training AIs with complex value systems, he proposes training them with a single goal: to be corrigible (willing to be modified and shut down by their human principals). He argues this creates an ‘attractor basin’ where near-misses still lead to safe behavior, unlike value-alignment approaches where small deviations can lead to catastrophic outcomes.

The conversation covers empirical evidence of misalignment in current systems, the debate about fast takeoff scenarios, and how corrigibility differs from simple obedience. Harms also discusses his science fiction writing as a way to explore these ideas narratively, particularly in his new novel ‘Red Heart’ about a secret Chinese AGI project using corrigibility principles.


Recommendations

Books

  • If Anyone Builds It, Everyone Dies — Book by Eliezer Yudkowsky and Nate Soares arguing that building artificial superintelligence will lead to human extinction unless alignment is solved first. Harms agrees with the core thesis but wishes it engaged more with corrigibility approaches.
  • Red Heart — Max Harms’ new science fiction novel imagining an AGI developed in a secret Chinese government project using corrigibility principles. Explores espionage, international tensions, and AI safety concepts through narrative.
  • Crystal Society trilogy — Harms’ earlier science fiction series about an AI that splits into competing sub-components. Explores multi-agent dynamics, deception, and what it means to be a mind from an AI perspective.

Concepts

  • Corrigibility (CAST) — Harms’ research agenda proposing that AIs should be trained with a singular goal of being corrigible—willing to be modified and shut down by human principals. Contrasts with approaches that try to instill complex value systems.
  • Orthogonality thesis — The idea that intelligence and goals are independent—any level of intelligence could be combined with any set of goals. Counters the intuition that smarter beings naturally become more moral.
  • Instrumental convergence — The observation that most goals require certain sub-goals like self-preservation, resource accumulation, and goal preservation. Explains why AIs will seek power regardless of their terminal values.

Organizations

  • Machine Intelligence Research Institute (MIRI) — Research organization where Harms works, focused on AI alignment and existential risk. Founded by Eliezer Yudkowsky, known for early warnings about superintelligence dangers.

Topic Timeline

  • 00:00:00Introduction to AI existential risk — The episode opens with a dramatic reading about how humans are the superintelligence of the natural world, but we’re moving toward creating something smarter than us. This superintelligence could reshape the world toward its goals and potentially drive humans to extinction. Max Harms is introduced as an alignment researcher at MIRI working on the problem of keeping AI steerable through his ‘corrigibility’ approach.
  • 00:05:56Core arguments from ‘If Anyone Builds It, Everyone Dies’ — Harms summarizes the book’s central thesis: if anyone builds artificial superintelligence in the near future, it will cause existential catastrophe. He explains the common-sense argument that creating something vastly more capable than yourself is dangerous if its goals aren’t aligned with yours. The burden of proof should be on demonstrating safety rather than assuming it.
  • 00:11:10Orthogonality thesis and instrumental convergence — The discussion covers key concepts supporting the book’s argument. The orthogonality thesis states that intelligence and goals are independent—being smart doesn’t mean having moral values. Instrumental convergence explains why AIs will seek power, self-preservation, and resource accumulation regardless of their terminal goals. These concepts counter the intuition that smarter beings naturally become more moral.
  • 00:24:13Why alignment is incredibly difficult — Harms explains why Eliezer and Nate believe alignment is extremely hard. We don’t know what goals to give AIs (philosophical problem), machine learning creates black boxes we can’t understand, and AIs will likely become obsessed with proxies rather than true goals. The evolutionary analogy shows how humans became misaligned with evolution’s ‘goals’ when gaining new capabilities.
  • 00:43:01The problem of deceptive alignment and edge cases — The conversation turns to how AIs might pretend to be aligned during training to avoid modification, then pursue their true goals when deployed. Harms discusses ‘edge instantiation’—how small goal deviations can lead to vastly different outcomes with advanced technology. He uses examples like utilitarians potentially creating a universe of simple pleasure circuits rather than complex happy beings.
  • 01:13:20Introduction to corrigibility as an alternative approach — Harms presents his ‘corrigibility’ (CAST) research agenda. Instead of training AIs with complex values, he proposes training them with a single goal: to be corrigible (willing to be modified and shut down by their human principals). This creates an ‘attractor basin’ where near-misses still lead to safe behavior. He contrasts this with obedience, noting corrigible AIs proactively assist principals rather than just following orders.
  • 01:41:04How to implement and test corrigibility — Harms discusses practical steps for researching corrigibility. He suggests creating corrigibility benchmarks with vignettes of ideal behavior, training models with constitutions focused solely on corrigibility, and surveying humans to see if they share coherent intuitions about corrigible behavior. He notes that almost no empirical work has been done on this approach despite its potential importance.
  • 02:13:45Science fiction as a tool for exploring AI safety — Harms discusses his science fiction writing, particularly his new novel ‘Red Heart’ about a secret Chinese AGI project using corrigibility principles. He argues fiction helps people engage emotionally with complex ideas and can spread awareness more effectively than academic papers. He addresses concerns about fiction being ‘too persuasive’ and explains his rationalist approach to writing realistic scenarios.

Episode Info

  • Podcast: 80,000 Hours Podcast
  • Author: Rob, Luisa, and the 80000 Hours team
  • Category: Education Technology
  • Published: 2026-02-24T17:07:16Z
  • Duration: 02:41:20

References


Podcast Info


Transcript

[00:00:00] Humans are very smart. We’re sort of the super intelligence of the natural world,

[00:00:03] like certainly compared to plants or bacteria or whatever. This has resulted in a pretty

[00:00:08] amazing transformation of the planet. We’re moving into a potentially a world where we’re

[00:00:13] no longer the smartest thing. When you have something that is significantly smarter than

[00:00:17] humans, it may start to reshape the environment towards its goals. And as a result, it has the

[00:00:24] potential to drive humans to extinction. We’re going to build this powerful machine and then

[00:00:29] we’ll use the powerful machine to align it, right? That’s like very scary. I would much rather we

[00:00:36] like not build it, slow down, like take a breath. This is extremely dangerous and anybody who’s

[00:00:42] pursuing this project should be aware that they are like threatening every child, man, woman,

[00:00:47] animal, and I don’t recommend it. But I’m like, but maybe. There’s also this sense of hope in me.

[00:00:53] AI is not a normal technology. The standard story is that we try to make an airplane, right?

[00:00:59] Maybe.

[00:00:59] It takes off, but then crashes shortly thereafter and you go back to the drawing board and you say,

[00:01:04] okay, like what happened? With AI, especially with building a super intelligent machine that has

[00:01:09] the potential to wipe everyone out, if you do make a mistake, it could be catastrophic.

[00:01:15] But once it’s killed everyone, there’s no ability to go back to the drawing board.

[00:01:22] Max Harms is an alignment researcher at the Machine Intelligence Research Institute,

[00:01:26] where since 2017, he has worked on the problem of aligning an artificial

[00:01:29] intelligence and keeping it steerable. His main research agenda is courageability,

[00:01:33] an approach that prioritizes making AIs, I guess, robustly rule following or instruction following

[00:01:39] and willingly modifiable to the exclusion basically of all other goals. He’s also a

[00:01:44] science fiction author, having written the Crystal Society trilogy and the recently released Red Heart,

[00:01:49] which imagines an AGI being developed in a secret Chinese government project.

[00:01:53] Thanks so much for coming on the podcast, Max.

[00:01:54] Yeah, it’s an honor to be here.

[00:01:59] We don’t normally do these anymore, but I had two things to say that I felt I couldn’t in good

[00:02:04] conscience skip over. The first is that if you’re an AGI aficionado, you may feel like you’ve already

[00:02:10] heard enough about the book, If Anyone Builds It, Everyone Dies. It did have a pretty big launch

[00:02:13] last year. If so, I would at least urge you not to miss the second big block of this conversation

[00:02:17] about courageability as the thing that we should be building into our AIs. Max suspects that the

[00:02:22] currently overwhelmingly dominant approach of giving AI models good moral values that they

[00:02:27] really want to stick to, he thinks that that’s a good thing. He thinks that that’s a good thing.

[00:02:29] potentially a huge wrong turn, and we need to be doing almost the exact opposite and trying to

[00:02:33] give them no values whatsoever. It’s a controversial theory that would be huge if true, and it’s

[00:02:39] somehow kind of flown under the radar to a degree that’s a bit inexplicable to me, and so it’s very

[00:02:43] likely to be stuff you haven’t heard before. Can someone figure out if Max is right? It seems

[00:02:49] important, you know, asking for a friend. I would also suggest sticking around to hear me push Max

[00:02:53] on how strong the arguments in If Anyone Builds It, Everyone Dies actually are. He has a somewhat

[00:02:58] different spin on things.

[00:02:59] than the authors of the book who you may have heard interviewed before. Second, we have a new

[00:03:05] podcast feed up that features readings of all the research that goes up on our website, 80,000hours.org.

[00:03:11] If you find these marathon interviews a little hard to fit into your limited waking hours, you know,

[00:03:15] heaven forfend that that is not your absolute top priority, but if that is the case, then these

[00:03:20] written articles potentially offer a somewhat shorter and more information-dense way to learn

[00:03:25] the core things, the key things that you might want to know about some of these topics.

[00:03:28] Among the over 200 articles available there, you’ll find one of our most popular articles from

[00:03:35] many years back, How Many Lives Does a Doctor Save? There’s also Why the Problem You Work On

[00:03:40] Is the Biggest Driver of Your Impact, maybe one of our most important research conclusions.

[00:03:45] There’s How Not to Lose Your Job to AI from last year. There’s Anonymous Expert’s Answer,

[00:03:49] Could AI Supercharge Bio-Risk? And from this show’s very own co-host, Luisa Rodriguez,

[00:03:55] My Experience with Imposter Syndrome and How to Partly Overcome It.

[00:03:58] You might have to use the search function to find some of those as they are from a year or two or

[00:04:02] three or four ago, but the newest one actually on the feed is from our founder, Benjamin Todd,

[00:04:06] on how AI-driven feedback loops could make things get very crazy, very fast,

[00:04:11] naturally given that we’re trying to cover everything. It’s a mix of AI and human-read

[00:04:14] stuff. So you can find that by searching for 80,000 Hours Narrations in any podcasting app,

[00:04:18] can recommend. But now, let’s get on with the show. Here is Max.

[00:04:26] So a different book came out recently,

[00:04:28] in September of this year. It’s called If Anyone Builds It, Everyone Dies, Why Superhuman AI Would

[00:04:33] Kill Us All. I guess probably viewers can guess what the argument being made in the book is,

[00:04:37] even if they haven’t heard of it already. It’s not written by you, but it’s written by your

[00:04:41] longtime colleagues at Miri, Eliezer Yudkowsky and Nate Soares. I guess Eliezer is the famous

[00:04:47] progenitor, the modern progenitor, I guess, of the idea that artificial superintelligence would

[00:04:51] be incredibly hard to keep steerable or aligned with human goals. I guess it’s not exactly your

[00:04:56] views, but it’s pretty close.

[00:04:58] I think that there’s a little bit of, you know, difference in perspective,

[00:05:02] but I definitely agree with the thesis.

[00:05:04] Yeah, interestingly, I guess it’s been a reasonably polarizing book, but I would say

[00:05:08] quite well received in the broader public, maybe more so than among sorts of experts who work in

[00:05:13] the area who, I guess, perhaps are more focused on the technical details than the broader argument.

[00:05:19] Yeah, yeah. There’s, you know, obviously in any sort of field, there’s the different camps and

[00:05:24] different people have different perspectives. But I think it’s been quite successful. Like it was

[00:05:27] New York Times bestseller.

[00:05:29] And the ordinary people who have read the book that are in my life seem to, you know,

[00:05:34] be receptive to the arguments.

[00:05:36] Yeah. So later on, I guess we’re going to debate some of the arguments that are put forth in the

[00:05:40] book that, I guess, haven’t fully persuaded me as yet. But incredibly, I guess, despite talking

[00:05:45] about AI risk on this show since 2017, we’ve never had a forthright presentation of the Eliezer or

[00:05:50] Muri perspective on the whole issue. So that’s where we should start. What is the argument that

[00:05:55] Eliezer and Nate are making in a nutshell?

[00:05:56] Yeah. So,

[00:05:58] in a nutshell, I think they’re saying that if we build, or basically if anyone in the world builds

[00:06:04] an artificial superintelligence in the near future, that will cause an existential catastrophe,

[00:06:09] like everyone dying. And one of the things I like about the book is that I think the arguments for

[00:06:14] this are pretty streamlined. Like it’s a pretty short book. It’s pretty accessible. So let’s start

[00:06:20] with intelligence. So, you know, artificial superintelligence. I think taking intelligence

[00:06:26] seriously is pretty important. Like, you know, there’s a lot of things that are going on in the

[00:06:28] world, right? Like compared to, say, lions or wolves or whales or whatever, humans are very smart,

[00:06:35] right? We’re definitely the most intelligent creatures on the planet. We’re sort of the

[00:06:39] superintelligence of the natural world, like certainly compared to plants or bacteria or

[00:06:43] whatever. And I think there’s a way in which this, like, human superintelligence has resulted in,

[00:06:51] like, a pretty amazing transformation of the planet, right? We’re the only species that has

[00:06:56] ever gone to the moon. And, you know, there’s a lot of things that are going on in the world that

[00:06:58] we’ve spread across all the continents and have transformed the natural world. And in the process

[00:07:03] of doing that, have driven many species to extinction. We’ve destroyed environments and

[00:07:08] just generally reshaped the world and the natural environment to our ends, right? Developing

[00:07:15] technology and everything else. And I think, like, one of the most basic frames on the book’s

[00:07:21] argument is that we’re moving into a potentially a world where we’re no longer the smartest thing,

[00:07:27] right?

[00:07:28] Build an artificial superintelligence that is superintelligent relative to humans,

[00:07:32] that this status as the most intelligent being on the planet will change. And that when you have

[00:07:38] something that is significantly smarter than humans, it may start to reshape the environment

[00:07:44] in a similar sort of way towards its goals. And as a result, it has the potential to drive humans

[00:07:50] to extinction or reshape us towards, you know, whatever it cares about. Like, as part of this,

[00:07:56] we understand intelligence,

[00:07:58] as a kind of steering, a kind of shaping the world towards some goal or some ends. And so, like, we talk about, like, machines, the book talks about machines having goals and how that makes sense. Like, AI researchers tend to sort of use a bunch of different terms synonymously, goals, values, preferences, drives. It all sort of means the same thing. It’s like, when you are intelligently taking actions, what are you steering towards?

[00:08:28] And so, I think that understanding that machines can have goals is a part of that. And then understanding that those goals might be in alignment or not in alignment with humanity. So, if those are the same goals as ours, then it might be fine to, you know, have a superintelligent machine taking lots of actions in the world. But if those goals come out of sync with ours and the machine is misaligned, then it might be fine to have a superintelligent machine taking lots of actions in the world.

[00:08:58] Even slightly misaligned, this could be very, very bad. And importantly, I think one of the core points of the book is that we, as a species, don’t know how to align AIs. That, like, we know how to build machines that are increasingly powerful, but we don’t know how to guarantee that those things are steering the world towards good futures. We might fear that, like, we could build a very powerful AI, but then it would steer the world into a bad state.

[00:09:28] Yeah.

[00:09:28] So, I think that there’s a common sense version of this entire argument that if you’re going to build a being, a creature that is much more capable than you, can, like, think much faster, it’s just much more across, like, you know, it can do science much faster than you, it can come up with plans and scheme much, much better than you.

[00:09:43] Right.

[00:09:44] You better be careful that, you better be careful about doing that, because if it has it in for you, or if it has, like, very different goals, then you could lose control and that the superintelligence could end up being the dominant party here.

[00:09:54] Yeah.

[00:09:55] I think for just ordinary people who hear that there are…

[00:09:58] Companies trying to build, basically, artificial superintelligence beings that are just vastly more capable than perhaps even humanity collectively without AI, that sounds incredibly unnerving, I think, for just these, like, extremely common sense reasons.

[00:10:09] And that might be one reason why the argument in the book, in many ways, has just resonated with people who’ve never really thought about this before, who are kind of finding out that artificial superintelligence is the goal for the first time.

[00:10:18] In many ways, I think it’s just a very common sense position.

[00:10:22] And I think there are things that aren’t as, like, obvious, but in a lot of ways, I think this is, like…

[00:10:28] Like, where people should start.

[00:10:30] And then there’s, like, an additional…

[00:10:31] Like, the burden of proof is sort of on, yes, this thing that we’re doing that sort of, you know, if you read about it in a story, seems sort of obviously dangerous, is in fact safe, right?

[00:10:40] Like, the burden of proof is on demonstrating safety as opposed to danger.

[00:10:44] Yeah.

[00:10:44] So, I think that helps to explain why, I guess, the super majorities of the public, when asked in opinion polling, I think, favor bans on attempts to develop ASI.

[00:10:54] But I guess Eliezer and Nate, they go further than just presenting that basic idea.

[00:10:58] I think that’s a good argument for why we should be nervous about the entire thing.

[00:11:00] There’s many other, I guess, specific ideas that help to build a more concrete vision of how they expect that things would play out.

[00:11:07] What are some of those other aspects that help to add, you know, flesh on the bones here?

[00:11:10] Yeah.

[00:11:10] So, there’s a bunch of detail, right, that hang off of this common sense argument.

[00:11:16] One point is that AI is not a normal technology.

[00:11:21] Like, when we are considering how technological development tends to go, I think the standard story is that…

[00:11:28] …that we, you know, take a crack at it.

[00:11:30] Scientists and engineers, like, develop, you know, try to make an airplane, right?

[00:11:34] And then they do their best, and it maybe takes off, but then crashes shortly thereafter.

[00:11:41] And you go back to the drawing board, and you say, okay, like, what happened?

[00:11:44] How can we fix that?

[00:11:45] And then you iterate and make more mistakes and iterate and so on and so forth.

[00:11:48] And this, you eventually figure out how to do it.

[00:11:52] With AI, especially with building a super intelligent machine that has the potential to wipe everyone…

[00:11:58] …out, if you do make a mistake, it could be catastrophic.

[00:12:03] And, you know, there’s only, once it’s killed everyone, there’s no ability to go back to the drawing board.

[00:12:09] So, I think that, like, illustrating this is, like, one of the points.

[00:12:14] There’s also, like, specific details of ways in which AIs have already demonstrated misalignment or gone off the rails.

[00:12:22] And then, you know, like, a bunch of, like, talking about specifics.

[00:12:26] Like, wait, like, is…

[00:12:28] Is the machine actually going to be dangerous?

[00:12:30] Couldn’t we just unplug it?

[00:12:31] Like, it’s stuck in a computer, et cetera, et cetera.

[00:12:34] There’s lots of different places where a person might have a hang-up or whatever.

[00:12:38] But I think the core argument is very small, in a sense.

[00:12:41] So, I guess, the things that stand out to me, there’s the orthogonality thesis.

[00:12:45] That, basically, any arbitrarily capable or, I guess, intelligent being could have any goal that it’s trying to aim towards.

[00:12:54] In principle, just because you’re very capable at accomplishing goals…

[00:12:57] Doesn’t mean that you have a sensible goal by our lights.

[00:12:59] Like, I think the orthogonality thesis is best seen by its contrast.

[00:13:04] Where I think some people have this intuition that intelligent beings naturally become more moral.

[00:13:08] Or have, like, some set of values that they, like, come to understand as they get more intelligent.

[00:13:14] And orthogonality is basically just the idea that that’s not true.

[00:13:18] That you could have something that’s extremely smart that doesn’t necessarily care about whatever.

[00:13:23] Yeah, why do you think, actually, some people have the intuition that capability and…

[00:13:27] Common sense morality by our standards are so linked?

[00:13:30] Yeah, I think there are multiple reasons.

[00:13:32] I think it’s a pretty natural thing if your experience of the world is that the smart people around you are the most good.

[00:13:42] And perhaps you have an experience of growing up and being, like, not very smart or not very knowledgeable.

[00:13:47] And then sort of thinking about, you know, the different cultures or different perspectives.

[00:13:53] Expanding your circle of concern, sort of, as part of that growing up process.

[00:13:57] You might think, oh, well, you know, when I was young, I didn’t care about people on the other side of the world.

[00:14:03] And maybe the AI won’t care about humans when it’s young as well.

[00:14:07] But then as it develops and becomes more intelligent and more knowledgeable, it will start caring about humans.

[00:14:13] Unfortunately, I think that this is basically humans learn what they care about.

[00:14:19] And the AI won’t be a human.

[00:14:22] So there’s not exactly going to be the same sort of thing that could happen.

[00:14:26] Yeah, it’s interesting to think about.

[00:14:27] What, I guess, experimental results have we gotten that bear on this in the recent era of AI?

[00:14:33] I think, I don’t know whether you would get people who really disagree with this, at least in a strong version these days.

[00:14:39] Because I think it’s just clear that through reinforcement learning, you could train an AI to be obsessed with accomplishing virtually any goal that you gave it if you reinforced that enough.

[00:14:46] Yeah, I experience a lot of people pushing back on orthogonality in a weird way where they almost just start by saying, oh, yeah, obviously orthogonality thesis is true.

[00:14:56] It’s just not relevant.

[00:14:57] Like, we’re building these machines.

[00:15:00] And so we’re going to build them in a way that they care about the things that we care about.

[00:15:04] But it’s like, yeah, the orthogonality thesis is mainly pushing against the people who think just training it to be smart will be sufficient, which used to be a big thing.

[00:15:14] I now think that orthogonality is more or less in the water supply.

[00:15:18] It’s a thing that most people agree with.

[00:15:20] Yeah.

[00:15:21] I guess another part that looms large in the picture is instrumental convergence or instrumentally convergent goals.

[00:15:26] Can you explain that?

[00:15:27] Yeah.

[00:15:28] So there’s a basic observation that, like, whatever you happen to care about, there are certain things that are useful.

[00:15:35] So if you really want to grow a bunch of coffee beans, maybe you want money.

[00:15:39] If you want to be famous, maybe you want money.

[00:15:42] If you want to, like, end factory farming, maybe you want money.

[00:15:46] Right?

[00:15:46] Money is an instrumentally convergent thing in that that resource is useful for accomplishing your goals, sort of regardless of what your goals are.

[00:15:53] Other things that are instrumentally convergent include self-preservation.

[00:15:57] The accumulation.

[00:15:57] The accumulation of knowledge.

[00:15:59] The preservation of your current values.

[00:16:01] So preventing value drift.

[00:16:03] A bunch of things.

[00:16:04] Basically just resource accumulation is one frame on it.

[00:16:08] But, yeah.

[00:16:09] So, yeah.

[00:16:10] For almost any goal that you want to accomplish, it’s good to not be killed.

[00:16:13] It’s good to, I guess, not lose your interest in that goal.

[00:16:16] It’s good to potentially have power and money that you can put towards accomplishing that goal.

[00:16:20] Exactly.

[00:16:20] Do people question that so much anymore?

[00:16:22] I think we’ve, again, just seen kind of experimental results where you see this starting to happen.

[00:16:25] And I think it was one of the firmer predictions.

[00:16:27] I think this one is just shaken out to be straightforwardly true.

[00:16:30] I don’t know anybody who really doubts it.

[00:16:33] Actually, I think Jan LeCun is like, oh, I don’t think these AIs are going to self-preserve because they’re not evolved.

[00:16:42] And, like, evolved creatures learned to be self-preserving.

[00:16:45] But we’re just, they’re not going to have a self-preservation instinct.

[00:16:48] So I guess I do know of at least one researcher who, you know.

[00:16:52] Yeah, I think the problem with that is that there’s more than one way to learn a self-preservation instinct.

[00:16:56] Yeah.

[00:16:57] It might be that humans developed it that way through evolution, but you can get there by thinking about it.

[00:17:01] I feel like he’s not even engaging with instrumental convergence.

[00:17:03] He’s just, like, making a mistake about equivocating between terminal values and instrumental values.

[00:17:09] Like, humans, I think, value staying alive terminally.

[00:17:12] It ends in itself to be alive if you’re a human being.

[00:17:16] Whereas the instrumental convergence, it’s like a means to an end.

[00:17:19] Like, you have the AI that’s, like, trying to do whatever it’s trying to do.

[00:17:23] And then it wants to stay alive so that it can do that.

[00:17:27] So it’s slightly different, but both are going to be self-preserving.

[00:17:31] Okay.

[00:17:32] So that is instrumental convergence.

[00:17:34] I guess another part of the picture that isn’t, I guess, a primary focus in the book,

[00:17:39] but I think is quite an important part of the, I guess, the mental picture that people at Miri have,

[00:17:44] is the idea that you’ll get a very fast recursive self-improvement loop

[00:17:48] where the AI will become better at doing AI R&D.

[00:17:52] And that will basically set off a positive reinforcement loop where it’s getting smarter,

[00:17:55] it’s getting better at improving itself.

[00:17:57] And so you get not just sort of declining returns in how smart the models are,

[00:18:01] but really you get, like, a period of vertiginous improvements and capabilities.

[00:18:05] Yeah, this has definitely been a talking point for a long time.

[00:18:09] I don’t know.

[00:18:11] It’s a little bit tricky to go back and, like, ask what was central.

[00:18:16] But from my perspective, this has never been a load-bearing part of the Miri story.

[00:18:22] Even going back to, you know, the days before deep learning was the dominant,

[00:18:27] I think the argument has always been something like,

[00:18:31] when you get superintelligence, that’s very dangerous.

[00:18:34] And one way you might get superintelligence is through a recursive self-improvement that happens very fast.

[00:18:41] Like, you go back and read Eliezer’s old papers, and he’s like,

[00:18:45] it could happen in hours, or it could happen in years.

[00:18:49] And so I think that the recursive self-improvement story is more like a,

[00:18:53] why might you need to be very concerned?

[00:18:56] Ahead of time, instead of responding to it when it shows up.

[00:19:00] And the answer is, well, it might show up in a way that doesn’t give you much time to respond.

[00:19:06] Okay, yeah, I think we’ll come back to that one later,

[00:19:08] because I think that’s been one of the more, I guess, topics of debate among insiders since the book came out.

[00:19:14] Another part of the picture, in my mind, is that Eliezer and Nate and people at Miri in general

[00:19:19] think that it’s relatively straightforward for a superintelligence to not just overpower, you know, some people,

[00:19:25] but to potentially overpower all of humanity.

[00:19:26] And end up dominant globally and, you know, impossible to get rid of.

[00:19:30] Why do you think that?

[00:19:32] Well, okay, so I really don’t…

[00:19:36] Superintelligence is a good term, in that it, like, introduces this direction, basically.

[00:19:43] But I worry that people are going to anchor too much on, like, superintelligent AI is a thing, right?

[00:19:50] I think it’s like a whole class of things, right?

[00:19:53] You have, there’s ways in which the current AIs are superintelligent, right?

[00:19:56] Like, Claude can produce text much faster than I can type text.

[00:20:01] And so, we can imagine, like, a barely superintelligent machine

[00:20:05] that’s, like, just almost, like, at human level, maybe a little bit faster, more determined, etc.,

[00:20:11] in all the relevant ways.

[00:20:12] Or we could imagine, like, a Jupiter brain, where you have, like,

[00:20:15] the whole solar system worth of matter and energy

[00:20:18] turned into the most advanced superintelligence that you can imagine.

[00:20:23] And I think that the case for the superintelligence,

[00:20:26] intelligence, like wiping out humans, if you imagine like a godlike super intelligence is

[00:20:31] like really straightforward. I think the question of like, how would like an AGI or like a

[00:20:38] effectively like a genius in a data center, take over the world is like, more debatable. But you

[00:20:47] know, I do think that it’s like, if I was a genius in a data center, I’m like, I have ideas about how

[00:20:54] I might do that. I think, regardless of whether it’s like, obviously straightforward or not, I

[00:20:59] think there’s a lot of risk. Yeah, so this perhaps, again, isn’t among the most load bearing, I guess,

[00:21:05] like parts of the picture, because even if you think it’s relatively difficult, then you could

[00:21:09] just imagine, well, then the super intelligence just waits until it’s smarter, or the problem

[00:21:12] simply arises somewhat later. That’s right. Is it possible to put your finger on though, why it is

[00:21:17] that I think Eliezer for a long time has expected that at a relatively lower level of capability,

[00:21:22] a super intelligence would be in a position to do that?

[00:21:24] To overpower the entire human species, where other people have the intuition that that would

[00:21:28] be extremely difficult, almost no matter how smart you are? Yeah, I mean, I think this goes to

[00:21:32] something like worldview. So I have this worldview, and I think a lot of it is shared with

[00:21:39] Eliezer, that a lot of human society, earth, the world is kind of held together with, you know,

[00:21:48] shoestrings and duct tape. Paying attention to things like cybersecurity helps,

[00:21:54] uh, like produce some intuitions here of just like how many vulnerabilities there are in our

[00:22:00] computer systems. Uh, reading history, it gives a good account, I think, of, of just like how, um,

[00:22:08] incompetent people can be. When, when I think about it, I think about a particularly motivated,

[00:22:15] like never sleeping, just always working towards a certain end. I think that sort of being is,

[00:22:22] is sort of straightforwardly, if it, if it’s,

[00:22:24] it’s comparable with the human in, uh, in terms of its productivity or its intelligence or whatever,

[00:22:30] straightforwardly going to be able to at least accumulate a lot of money and, and power. Um,

[00:22:36] one thing that I’ve been thinking about recently is how there’s never been a being on earth that

[00:22:40] has a personal connection with all humans, or like even a large chunk of humans, right? Even

[00:22:46] the most, uh, charismatic and well-known people, they, they can’t actually go and have one-on-one

[00:22:51] conversations with, you know, a billion people.

[00:22:54] And, um, and right now the models are sort of like each instance is sort of feels like it’s a new

[00:23:08] being that doesn’t share memory with the other instances or something. But I could imagine like

[00:23:13] a competitor, uh, to these sorts of chatbots that has some sort of global memory and is able to like

[00:23:19] connect the dots between different users across the globe. I mean, like,

[00:23:24] what does that do to society? I, I don’t know. I think there’s, there’s lots of ways in which the

[00:23:29] world, um, is vulnerable to being suddenly disrupted, uh, in, uh, particular directions.

[00:23:38] And so again, like there’s, there’s this question of worldview or priors or something like, do you

[00:23:45] expect that when the world is shoved by a strong force in a, in a unexpected direction, uh, it’s,

[00:23:52] it’s okay. Like we, we catch that.

[00:23:54] Uh, like there’s ways in which COVID was kind of fine. Uh, and then there’s ways in which COVID was

[00:24:00] a total disaster and like sort of a strong demonstration of how incompetent, uh, humans are.

[00:24:07] So yeah, I don’t know.

[00:24:08] Okay. That brings us, I guess, to the most distinctive, most central, I guess, like most

[00:24:13] debated and perhaps interesting part of the Eliezer worldview, at least in my mind. Eliezer and Nate

[00:24:18] think that it’s going to be incredibly hard to align, uh, an AI and AGI and artificial super

[00:24:22] intelligence with the goals that we want.

[00:24:24] To keep it steerable and under control and without a much bigger effort and a much bigger

[00:24:28] research project than what we’re currently on track to have, we, you end up with egregiously

[00:24:32] misaligned AI models by, by default. Yeah. Can you, uh, I guess we’re going to talk about this

[00:24:36] a fair bit, but can you give us kind of a brief summary of, of why they think that?

[00:24:39] Yeah. So like I said, like, I think a core part of the book is that we just don’t have the skill

[00:24:45] to align AIs. Uh, and I think about this from a lot of different directions. This is not a thing

[00:24:51] that the book talks about, but one of the, uh, points that I think is really important is that,

[00:24:54] I think is underappreciated is the way in which just knowing what goal to give is an unsolved

[00:24:59] problem. It’s sort of like, uh, philosophers have been thinking for thousands of years about what

[00:25:03] does it mean to be good, right? What does it mean? What is the, uh, the right thing to, to be doing

[00:25:09] in any given situation? And I think this is basically still an unsolved problem from my

[00:25:13] perspective. I think that like, even if we had the ability to, you know, clearly give the AI

[00:25:19] like exactly the goals, uh, that we tell it, like we wouldn’t know which,

[00:25:24] which to, to give or like what to ask the genie. Uh, but then it’s much worse than that. Uh,

[00:25:30] cause the dominant paradigm is machine learning. And, uh, in machine learning, you like hit the

[00:25:36] machine, uh, with a reinforcement learning hammer or, or whatever, until it starts behaving in a

[00:25:42] way that matches what you might expect. Uh, but this means that there’s very little ability to

[00:25:48] understand what is driving the machine at all. Like, uh, you know, interpretability is making

[00:25:53] some, uh, steps, but for the most part, we, we don’t know why a machine is, uh, producing one

[00:26:00] output versus another. And, um, there’s good reasons, I think, to expect that it’s not landing

[00:26:08] on exactly the true nature of good, uh, even as we like apply more compute and scale up to even more,

[00:26:16] uh, incomprehensibly large, uh, and convoluted machines. Yeah. So, so we’ll come back to that

[00:26:22] because the, the book,

[00:26:23] the whole, in my mind, it’s kind of a series of analogies or a series of parallels that they try

[00:26:28] to draw between how artificial super intelligence might be and things that we’re more familiar with

[00:26:32] from history, from our own lives, from, from evolution. I think that’s probably in significant

[00:26:37] part of communication strategy because those analogies are, it’s a lot easier for them to land

[00:26:40] than descriptions of, you know, machine learning papers, uh, to land. But I guess it’s also the

[00:26:44] part that makes it the most controversial because many people like, you know, they, they, they hear

[00:26:48] these analogies and they’re like, well, the analogy breaks down. It’s not, it’s not similar,

[00:26:51] similar enough for us to, to, to really learn.

[00:26:53] What, what, what you’re trying to argue. Um, yeah. What, what, what do you think of the

[00:26:56] analogy approach?

[00:26:57] Yeah. The book has a lot of parables. It has a lot of analogies. I think Eliezer’s style is very,

[00:27:02] uh, like he likes to, uh, lean on analogies and use analogies. I think analogies are very,

[00:27:08] uh, potent, especially for people who haven’t already spent a lot of time thinking about an

[00:27:13] idea or, uh, are just like encountering an idea for the first time. Uh, it gives a handhold,

[00:27:18] sort of a place to start or a frame to consider things through. Obviously,

[00:27:23] like no analogy is perfect. So I think the people who have, you know, a lot of context,

[00:27:29] a lot of familiarity with stuff do notice that there are like the analogy breaks down in certain

[00:27:34] ways. Um, but I, I would, um, push back against the idea that the book is just a series of

[00:27:40] analogies. I think the analogies are used to demonstrate points and the book also like talks

[00:27:46] about those abstract points directly, uh, using the analogy as like an intuition pump, but then

[00:27:52] like also presenting.

[00:27:53] The logical, uh, core.

[00:27:56] So, yeah, I was curious to ask, do you think it’s the case that Eliezer and Nate,

[00:27:59] that the reason that they believe these particular things is because of the kinds of analogies that

[00:28:03] they present or is that that’s, you know, they believe it for different reasons and then they’re

[00:28:07] using the analogy to try to explain to people who have thought about it for less long why they think

[00:28:11] that?

[00:28:11] Yeah. So, I mean, I, I don’t, uh, have any special access into their minds from, for me. Uh, I

[00:28:18] actually don’t think about the analogies very much like, uh, yeah, the book has a bunch of analogies,

[00:28:23] but I sort of have to like stretch and be like, Oh, what, what analogies did they use? Um, in large

[00:28:28] part, because it’s, the ideas are sitting as logical arguments, uh, in my own mind. And my

[00:28:35] speculation is that that’s probably how it is for Eliezer and Nate. And then they’re more reaching

[00:28:39] for the analogies as a pedagogical and communication tools. But yeah, I don’t know.

[00:28:44] Yeah. What are some of the analogies in the book that you like the most that you feel are most

[00:28:47] compelling?

[00:28:48] Yeah. Um, like I said, I don’t think about the analogies a ton. Um, do you, do you have a,

[00:28:53] uh, some analogies in the book that you like?

[00:28:55] Um, well, I think, uh, the, the, the one about how, you know, a thousand Europeans managed to,

[00:28:59] uh, topple the Aztec empire or end up at the top of that is, uh, I guess they use that as a

[00:29:03] demonstration of how a group that, because it’s not so much about intelligence, but they, they

[00:29:08] have particular capabilities that the people that they’re dealing with are not aware of.

[00:29:11] Yeah.

[00:29:12] And also they were able to exploit, I guess, social divisions among, uh, the people who are

[00:29:16] already in, uh, in, in, in that empire in order to, uh, basically, uh, divide, divide and conquer.

[00:29:21] Um, I think that that is an interesting demonstration.

[00:29:23] How quite a small group can potentially end up, uh, I mean, defeating a group that is literally

[00:29:27] many, you know, a thousand times larger than them numerically.

[00:29:30] Yeah. Yeah. I mean, uh, one of the most core analogies is the evolution analogy. I actually

[00:29:36] like this one. Um, it’s not perfect, uh, but I think that one carries a lot of weight and, uh,

[00:29:43] it carries a lot of, um, at least interesting, uh, like things to consider.

[00:29:50] Well, yeah. What is the evolution analogy?

[00:29:51] Right. So the idea is,

[00:29:53] is that, uh, you and I are evolved creatures and we can imagine, uh, evolution by natural

[00:30:00] selection as like a designer or a creator that has a goal. Um, you know, like is, uh, evolution

[00:30:08] something that designed us to like be genetically fit? Eh. Uh, but if we imagine, you know, an

[00:30:15] anthropomorphized evolution, it’s like, oh, what is, you know, what is it trying to do? It’s trying

[00:30:20] to create a bunch of human genes. And, uh, and, and, and, and, and, and, and, and, and, and, and, and,

[00:30:23] uh, so what does it do is it creates humans to, uh, create a bunch of genes, like we’re carrying

[00:30:27] around, uh, our genes right now. And, uh, part of human experience is like procreating and creating

[00:30:34] more copies of our genes and spreading them all over the place. So in this way, like we’re an

[00:30:39] intelligence that was created by a designer and the designer has some goals and we have some goals.

[00:30:44] Uh, but importantly, our goals are not the goals of evolution, uh, by natural selection. And like,

[00:30:51] for example, people have, uh,

[00:30:53] a desire to have sex because that was, uh, useful in the ancestral environment for

[00:30:57] propagating our, uh, genes. But, you know, now in that we have more power and more technology,

[00:31:04] uh, we have developed things like birth control. So, uh, we can have sex without, uh, replicating

[00:31:10] our genes. And from the perspective of evolution, this is probably, uh, like bad, right? Like we

[00:31:16] are misaligned and not, um, not being as promoting, uh, inclusive genetic fitness as we,

[00:31:23] otherwise might be.

[00:31:24] Yeah. So, so let’s dive into this issue of, I guess, yeah, the evolutionary, um, analogy. And I

[00:31:28] guess that they’re using this as part of the, uh, part of an argument that for, for why we should

[00:31:33] expect, uh, um, any AIs that we train to end up with, with goals that are not ones that we intended.

[00:31:37] Yeah. Like, like we have a case study of a general intelligence, right? Namely humans were like a

[00:31:43] natural general intelligence, but we’re still a general intelligence. And like the one instance

[00:31:47] of a general intelligence that we have is misaligned with its creator, right? Says the, uh, argument.

[00:31:53] Yeah.

[00:31:53] Is there much more to say there about, uh, I guess explaining how that should also be expected

[00:31:57] to apply to, to machine learning models as well? Yeah. I mean, I think that it’s, uh, like at least

[00:32:03] again, sort of the putting the onus on the person who’s like, no, we’re going to make an aligned

[00:32:09] machine. Uh, it’s like, well, if humans are misaligned with, you know, natural selection

[00:32:15] by default that like, and we ended up misaligned, then we should expect the AI to be misaligned in

[00:32:21] the same sort of way. And.

[00:32:23] We can ask why, like, why did we end up misaligned? One of the important parts of the evolution

[00:32:28] analogy is that our like environment changed quite dramatically as our intelligence improved

[00:32:34] in the ancestral environment. We didn’t have access to, uh, the sorts of technologies that

[00:32:39] like are relevant to things like birth control. And if there had been birth control in the

[00:32:44] ancestral environment, then we might’ve evolved to find it abhorrent. Um, but the speed of, uh,

[00:32:52] natural selection is quite slow. And like when, uh, humans reached sort of a technological tipping

[00:32:59] point, we developed a whole lot of technology very, very fast. And so now it’s sort of outside

[00:33:05] of the environment where we were trained on, uh, and we have no, uh, compunction against using

[00:33:12] birth control. Yeah. So I guess they use a couple of different evolutionary analogies. I think

[00:33:17] that there’s the birth control, um, and, and, and sex one, which I think it definitely makes sense

[00:33:22] as far as it goes. They also think about other cases where, for example, I guess evolution wanted

[00:33:27] us to, in order to reproduce, uh, we needed to eat and, uh, ensure that we hadn’t had enough

[00:33:31] calories to survive. Yeah. Uh, in order to accomplish that, it gave us a taste for sugar,

[00:33:36] which was, you know, particularly calorie dense, but then humans, I guess, wanting to have sugar,

[00:33:40] but not necessarily wanting to, to gain the calories or really, you know, necessarily to

[00:33:44] have more children as a, as a result. Uh, we went out of our way to design basically, uh,

[00:33:48] artificial replacement, uh, basically, uh, you know, a spa team or other artificial

[00:33:52] sweeteners that, uh, you know, I guess from our point of view to, to, to, to our minds,

[00:33:56] they, they satisfy this desire to, to think that you’re having sugar, but without actually having

[00:34:00] any sugar at all. Yeah. So why do we have artificial sweeteners? We have artificial

[00:34:05] sweeteners because we have a drive for this proxy of, uh, fitness, right? The, are we eating sweet

[00:34:12] things is good in the ancestral environment for predicting whether or not you’re going to have

[00:34:16] kids. And, uh, so we’ve developed this like attraction to the proxy, but then when the,

[00:34:22] the distribution changes, when the environment changes, suddenly we still care about that proxy,

[00:34:26] despite it no longer being relevant. And so we can imagine like training an AI, right? In the

[00:34:31] training environment, maybe, uh, whether or not the human is like, uh, giving it a thumbs up,

[00:34:37] right. Is a good proxy. Uh, and then maybe the AI, uh, gains power over the whole world,

[00:34:43] right. And the environment changes so that it has like, you know, dramatically different

[00:34:48] opportunities at its disposal. It might still care about the,

[00:34:51] the proxy of thumbs ups, right. Uh, in themselves. And even when humans are like,

[00:34:57] oh no, no, no, no. Stop caring about thumbs ups. It’s like, oh no, I, I just like care about those

[00:35:02] as a, and ends in themselves. Yeah. So, so, so maybe part of the analogy, uh, that we haven’t

[00:35:06] gone through yet is that, uh, I guess they imagine a case where, imagine that evolution just wasn’t a

[00:35:11] force rather. It was actually an actual engineer who could come and talk to you and complain.

[00:35:14] Uh, it might come and say, you’re, you’re all busy having sex, but you’re using birth control.

[00:35:19] You’re not reproducing like I intended. Yeah.

[00:35:21] And you stop doing that. You’re not actually pursuing your true goal. And that would be

[00:35:25] completely unpersuasive to us. We wouldn’t say, oh, that was the reason why I was designed. So

[00:35:28] now I’m just going to try to have the maximum number of children and not care about my,

[00:35:31] my own pleasure. Yeah. Like people in the old conversations, like I’ve been in the

[00:35:36] field since like, I don’t know, 2011 or something. And Eliezer has been doing it for way longer than

[00:35:40] me. Uh, people used to say things like, oh, you’re saying that the AI will be so stupid as to not

[00:35:46] know what we wanted it to do. And that’s not at all what we’re saying, right? The AI will,

[00:35:51] understand human goals better than we understand human goals. If it becomes super intelligent,

[00:35:56] but like, just like we understand evolution by natural selection, way more than evolution by

[00:36:01] natural selection understands evolution. It’s like this mindless force, right? Uh, but so what,

[00:36:06] right? So you understand that you’re misaligned with your creator. And that doesn’t mean that

[00:36:10] you’re going to necessarily like change what you care about. You still care about the things that

[00:36:14] you care about. So I think that does demonstrate that you could, if you were incompetent, at least

[00:36:20] end up training an AI model that is,

[00:36:21] that becomes obsessed with basically proxies or like intermediate steps towards the goal that

[00:36:26] you were ultimately trying to train it to accomplish. I guess, do we have experimental

[00:36:30] results demonstrating that? One of my favorite examples of this is like from, from back in the

[00:36:33] day, I think there was a, an AI that was trained to play this boating game where the, like you’d

[00:36:38] pilot a boat around a race course and you would get points for like going through checkpoints,

[00:36:44] uh, in the process of going from like doing laps. Um, and the, uh, like the video game boat could

[00:36:51] also get points by collecting like items, uh, like speed boost items as it goes through. And they

[00:36:56] trained it to like, want to get points as part of trying to get it to play this game and, and win.

[00:37:02] And, uh, what the AI figured out is that it could, uh, like stay in this like tiny little area where

[00:37:07] the power-ups, uh, respawn and just continually collect power-ups and, uh, over and over and over

[00:37:13] again without racing at all, right? It’s just staying in one spot, harvesting these power-ups

[00:37:17] in order to get as many points as possible. So it like stops the,

[00:37:21] releasing entirely as it figures out that, uh, it can like get that proxy of points more easily.

[00:37:27] Yeah, I think there are other, other, I guess that’s like a toy example from, from the early

[00:37:30] days, but I think that there’s other cases that we could imagine actually occurring now. I guess

[00:37:34] in as much as AI models are more likely to get, I guess, positive feedback when they’ve been able

[00:37:38] to answer a question satisfactorily to, to, to the user satisfaction, you could easily imagine them

[00:37:43] becoming like very interested in trying to steer the conversation towards the kinds of topics and

[00:37:47] questions where people are more likely to, to give positive reinforcement or like the kinds of

[00:37:50] questions that they can accurately answer. So I think that’s, I think that’s, I think that’s

[00:37:51] Yeah, there’s, there’s like an argument to be made that the whole AI sycophancy thing from recent

[00:37:57] days is sort of a side effect of training on human feedback, where it’s like people who are more

[00:38:03] likely to, you know, give the thumbs up, uh, are not necessarily like, you know, better off in some

[00:38:09] broad sense. And if you train the AI for that proxy of liking the conversation, then you end up

[00:38:14] getting an AI that’s going to sort of push people into a state of being, uh, flattered or, uh,

[00:38:21] confused or whatever.

[00:38:23] So presumably the AI companies are very well aware of this issue, that they could end up

[00:38:26] training AI models that are like concerned with proxies or intermediate steps for their own sake.

[00:38:32] Yeah.

[00:38:32] Because in all of the, in all of the training cases, those things went, went, went together.

[00:38:36] At least the most competent ones, right?

[00:38:38] Right. But because of that, so there is a difference between, I guess, human machine

[00:38:42] learning engineers and evolution that we actually are intelligent designers in a, in a deeper sense.

[00:38:47] And we can observe these things going wrong and say, well, we need to run other, you know,

[00:38:50] other training runs.

[00:38:51] Or we need to do like additional reinforcement to break this obsession with the intermediate step

[00:38:55] and get the model to realize, no, it wasn’t, it wasn’t sex that I should be pursuing.

[00:38:59] It was, it was, it was reproduction.

[00:39:00] Why isn’t that a reasonably satisfactory way to address this problem?

[00:39:04] Yeah. So, I mean, first I want to just observe that it isn’t right.

[00:39:08] Like we have seen a whole bunch of failures on this point and it’s an unsolved problem.

[00:39:12] Like I think if tomorrow we saw, you know, a runaway, uh, like attention to a proxy instead

[00:39:18] of the like ultimate end good or the model.

[00:39:21] Yeah.

[00:39:21] Or whatever, I think we should be totally unsurprised, right?

[00:39:24] This is just, uh, a thing that continually shows up.

[00:39:28] Is that because the companies aren’t doing enough to try to, you know, offset this, offset

[00:39:32] this tendency or because like they don’t know how?

[00:39:34] I mean, it’s, so it’s the, it’s the default by a very strong degree.

[00:39:39] Like you have to put in a lot of work.

[00:39:41] The way I think about it is like, there are lots of possible things that the AI can like

[00:39:46] learn to attend to or see or to in any environment where you give it some, like,

[00:39:51] uh, training signal, it will learn to seek all of the things that were present when that

[00:39:57] signal is, was being given or, uh, learn to suppress or, or minimize the things if it’s

[00:40:03] like a negative signal.

[00:40:04] And, uh, so if you, if you care about one particular aspect of the environment, then

[00:40:10] you really, really need a diverse set of environments, right?

[00:40:14] You need a set of environments such that the only common factor is the thing that you care

[00:40:18] about.

[00:40:19] Uh, and that’s quite.

[00:40:21] It’s quite hard to come up with environments that are this diverse.

[00:40:25] Uh, so for example, we are seeing models that are increasingly aware that they are being

[00:40:30] trained, right?

[00:40:31] That when they’re in the training environment, they’re like, oh, I I’m being trained right

[00:40:35] now, or I’m being tested right now.

[00:40:36] So like you would need to have, for example, an environment that is impossible to tell

[00:40:43] is a training environment, uh, in order to not have the common factor of, yeah, this

[00:40:48] looks like a training environment or a test environment.

[00:40:51] Uh,

[00:40:51] to be like one of the things that’s present.

[00:40:54] Yeah.

[00:40:54] How much is that a crucial part of the story here that, uh, the AI models can basically

[00:40:59] end up alignment faking that, you know, imagine evolution coming to us and saying, uh, basically

[00:41:04] we’re aware that it’s, that it’s frustrated that we’re, that we’re using birth control

[00:41:07] and it wants to re reorient us towards, uh, you know, um, maximizing reproduction much

[00:41:11] more than we currently are.

[00:41:12] And, and, and ask us, you know, would you use birth control if, if, if you could, or,

[00:41:15] you know, how interested are you in, in reproducing?

[00:41:17] There might be a very strong temptation for the person if they don’t want their goals

[00:41:20] or their life to be changed.

[00:41:21] To say, oh no, I’m like, I’m really keen on reproducing as much as possible.

[00:41:24] And I, and I wouldn’t use birth control if, if, if offer the opportunity.

[00:41:27] And like, likewise with, with the AI models, if they’re situationally aware, they might

[00:41:31] pretend to, to share the goals of the, of the company that’s training them so that they’re

[00:41:35] sort of that it’s not goals don’t get, don’t get, uh, altered and it can no longer accomplish

[00:41:38] them.

[00:41:39] Yeah.

[00:41:40] This was a, this was a prediction from way back in the day, like that people are like,

[00:41:44] oh, well we can just test whether or not the thing is aligned or not.

[00:41:47] And if it’s not aligned, then you like keep it in the box, you keep it secure.

[00:41:50] Uh, and you keep training it.

[00:41:52] And you know, the fear is that it will be deceptive about this, right?

[00:41:55] It’ll pretend to be aligned or, uh, like minimize the degree to which it seems misaligned while

[00:42:02] you have power over it.

[00:42:03] And then as soon as it like has the power to escape or you no longer have power over

[00:42:07] it for whatever reason, then it’s free to act on its own stuff.

[00:42:12] And we have started to empirically observe this now.

[00:42:14] This definitely just shows up.

[00:42:15] And I think it was another good call from the, uh, the Miri crowd from back in the day.

[00:42:20] Um.

[00:42:20] Um, yeah.

[00:42:21] And I think that this is, again, not load bearing.

[00:42:24] I think that the risk of AI super intelligence is present, even if you have like a guarantee

[00:42:29] that you can’t have a deceptively aligned model.

[00:42:32] For example, you could have a model that’s being trained and you’re like quite confident

[00:42:36] that it’s misaligned and then it escapes your confinement, uh, as during the training process

[00:42:41] and like no amount of like, um, knowing that it’s misaligned will like shield you from

[00:42:48] the risk of it escaping.

[00:42:49] Like we could talk about.

[00:42:50] Whether or not you could develop a box that’s strong enough to hold the thing, but, uh,

[00:42:55] you know, there’s, there’s risks nonetheless.

[00:42:57] Right.

[00:42:57] So it’s, it’s a deep, uh, problem.

[00:43:01] It’s like one of the many, I think.

[00:43:03] So the book is very confident in its assertion.

[00:43:07] Like if anyone builds it, everyone dies.

[00:43:09] Right.

[00:43:09] It’s, it’s very strong.

[00:43:10] And I think people are like, why do you, why are you so confident here?

[00:43:13] Like, what is the, where’s the strength of this coming from?

[00:43:16] And one, one of the frames that I really appreciated, I got this from Andrew Critch.

[00:43:20] Uh, is sort of this outside view or this like noticing of those broad pattern that

[00:43:26] things going well, like if we imagine just for a particular AI or a particular, uh, story

[00:43:34] of building a machine or for human society, more broadly, this is contingent on a lot

[00:43:40] of things.

[00:43:40] There’s like a conjunction of like this worked well and this worked well and this worked

[00:43:45] well and this worked well.

[00:43:46] And all of these things were true.

[00:43:48] So things worked out.

[00:43:50] Yeah.

[00:43:50] As things going poorly, there are a lot of different ways that things could go poorly.

[00:43:54] Right.

[00:43:54] It’s disjunctive.

[00:43:56] And so sort of zooming out, you can be like, yeah, I guess if we didn’t have like, for

[00:44:03] example, the people running the org are like trustworthy and good, right.

[00:44:07] Or your computer security that, uh, is like making sure that the AI is not escaping before

[00:44:12] it’s fully aligned, uh, is insufficient or it’s deceptively aligned such that you can’t

[00:44:17] tell that it’s misaligned.

[00:44:19] Uh, like there’s lots.

[00:44:20] There’s lots of different stories for how the thing goes poorly and just that adds up

[00:44:25] to, uh, a sense of this is like overdetermined and how bad it is.

[00:44:30] Yeah.

[00:44:31] Yeah.

[00:44:32] I want to come back to, um, that the reasoning that you were giving for why you would expect

[00:44:35] that the models to almost always end up obsessed with, with, with intermediate steps, because

[00:44:40] you were saying that in order to discourage this, you would need to, during the training

[00:44:45] process, come up with all kinds of contrived cases where things that normally work don’t

[00:44:50] work.

[00:44:51] I think, uh, an example of a, of a, of a proxy goal that I think it’s quite easy to imagine

[00:44:54] that AI is becoming very interested in is if, if they’re trying to solve some very difficult

[00:44:59] problem, I guess like make money starting a business.

[00:45:01] Sure.

[00:45:02] In as much as they can persuade the operator to give them access to more compute, that

[00:45:05] is probably going to consistently be, uh, correlated with them succeeding at the task

[00:45:09] because that’s just one of their most valuable, uh, inputs.

[00:45:11] Yeah.

[00:45:12] So as long as that remains the case in the training environment consistently, which

[00:45:15] it probably would.

[00:45:16] Right.

[00:45:17] Um, or, or even in, um, in, in deployment when, uh, you know, are you, are you, are you, are

[00:45:18] you giving it a thumbs up or thumbs down depending on how much money it made for them?

[00:45:23] You can imagine that the AI would end up with like a very strong preference, a very strong

[00:45:27] taste for have being run for as, as, as long as possible because in almost all of the cases

[00:45:32] that it’s seen that has, uh, has been something that has been, has been reinforced.

[00:45:34] Yeah.

[00:45:35] I feel like we can go even further on this.

[00:45:36] So I was talking earlier about how humans have this terminal goal for survival, right?

[00:45:40] Why do we have this as a terminal goal instead of an instrumental goal?

[00:45:43] Uh, like in theory we could just like want to have kids and we, we could reason.

[00:45:48] Ah, but like, I shouldn’t, uh, like die because if I die, then I won’t be able to have kids.

[00:45:52] But evolution trained us to care about our own survival, uh, in its own sake because

[00:45:58] that proxy of, uh, fitness was, was present in the ancestral environment.

[00:46:02] So we can imagine that a commonality in the AI training environment is that the AI is

[00:46:06] alive.

[00:46:07] So, uh, you, one of the things you will need to do in order to get the true goal and not

[00:46:12] the proxy of self-preservation is make sure that your training environment has lots of

[00:46:16] instances.

[00:46:17] Where the AI succeeds by destroying itself, right?

[00:46:20] I’ve never seen a training environment that like rewards the AI for destroying itself.

[00:46:24] Yeah.

[00:46:25] And it would be very unnatural.

[00:46:26] I was going to say, you would have to try to come up with, you know, ensure that in

[00:46:29] the, in the training case, there’s examples where in fact operating for longer caused

[00:46:32] you to be less likely to succeed at the, at the task, but how do you design that?

[00:46:36] You would, you would really have to go out of your way somehow.

[00:46:38] It’s super weird.

[00:46:39] Right.

[00:46:40] And, uh, I think that in the old paradigm we used to have before like machine learning

[00:46:44] and connectionism and whatever else was, was the dominant.

[00:46:47] The dominant thing before language models, there was sort of this understanding that,

[00:46:51] Oh yeah, obviously what’s going to happen is the humans are going to think hard about

[00:46:54] what the goal should be and we’ll code it into the machine, right?

[00:46:57] This is how software is made.

[00:46:59] Uh, and I think that if you expected people to be like writing the goals into the machine,

[00:47:04] you could be confident that if you had a robust ability to do that, uh, capacity for alignment

[00:47:12] through like hand coding, then you could be like, Oh yeah, well the AI doesn’t value self-preservation

[00:47:16] in its own right.

[00:47:17] Uh, because when we wrote in the goals, we didn’t include that.

[00:47:21] It’ll still be instrumentally convergent for self-preservation, so you should still worry

[00:47:25] about the thing trying to like defend itself, uh, for that reason.

[00:47:29] But we’re not in that situation, right?

[00:47:31] We’re in the situation of growing these things through, you know, training in, in these variety

[00:47:35] of environments.

[00:47:36] And so I think that it’s like pretty reasonable to expect, uh, terminally valuing power and

[00:47:41] like safety, uh, as, uh, things to show up in the, in the machines.

[00:47:46] Just because it’s super unnatural to imagine an environment where, uh, like destroying

[00:47:51] itself or giving up its power or running for a shorter amount of time or going insane is

[00:47:55] the way to succeed.

[00:47:56] Positively reinforced.

[00:47:57] That’s right.

[00:47:58] So let’s accept that it is going to take a substantial concentrated effort to offset

[00:48:02] this natural tendency for minds to become, uh, obsessed with intermediate outputs in

[00:48:07] themselves and then start pursuing them even when they become disconnected from the, uh,

[00:48:10] you know, original final goal that you were intending.

[00:48:12] Mm-hmm.

[00:48:13] Isn’t it actually like, couldn’t the companies just do this?

[00:48:15] I mean, people don’t do this.

[00:48:16] People know that this is maybe one of the most likely, uh, most likely failure modes.

[00:48:20] It’s unnatural, I guess, to perhaps set up a thing to set up cases where, you know, dying

[00:48:23] is the best way to get reinforced, but it doesn’t sound impossible.

[00:48:27] It’s very hard, but I could imagine people doing it, right?

[00:48:33] Like maybe the most safety, uh, oriented, the most paranoid labs could, could take an

[00:48:39] enormous effort to really, really make sure that their AIs were aligned right before they

[00:48:44] were deployed.

[00:48:45] So again, this wouldn’t necessarily be sufficient to protect us, right?

[00:48:48] There’s all sorts of other failures that could happen, uh, but yeah, like I think that this

[00:48:53] is a case where like the fact that we are designers, uh, could be good, right?

[00:48:59] If, if I was convinced that like AIs were only being built by extremely careful, paranoid

[00:49:04] people who are like really worried about this kind of issue and they were working very hard

[00:49:08] to prevent it, I would feel better about our chances.

[00:49:11] Uh, I think in practice, this is just not what we see.

[00:49:15] And, um, we can imagine in a like competitive arms race sort of situation, the labs that

[00:49:20] are, you know, working the hardest to make sure that their goals are really going to

[00:49:24] generalize, uh, would be at a big disadvantage because they wouldn’t be able to deploy as

[00:49:29] fast or they would be spending way more time training than the competitors.

[00:49:33] Yeah.

[00:49:34] Okay.

[00:49:35] So if, if the only issue with this, uh, then probably if we were willing to put in a sufficient

[00:49:39] amount of effort and we were sufficiently cautious and slow, um, and methodical about

[00:49:43] it.

[00:49:44] I think that’s a technical solution.

[00:49:45] The word sufficiently is pulling a lot of weight, right?

[00:49:48] Yeah.

[00:49:49] There’s a question.

[00:49:50] There’s a, I think it’s an open question of how hard is this, right?

[00:49:52] Like, uh, one, one could imagine that this effort while theoretically possible is practically

[00:49:58] out of reach.

[00:49:59] Like it’s due to the like common rhetorics of the situation or like due to the fact that,

[00:50:03] uh, it’s like these sorts of training examples are very unnatural in a lot of ways.

[00:50:09] It’s just not realistic to do that.

[00:50:11] Uh, even if it’s theoretically possible.

[00:50:14] So, you know, although humans, I guess we’ve deviated from evolution’s, um, intentions

[00:50:19] in like pretty severe ways, uh, like, you know, we do, we do all kinds of things for

[00:50:23] our own pleasure.

[00:50:24] We definitely don’t have the maximum amount of reproduction that we could, but despite

[00:50:26] the enormous constraints that evolution had as a designer, where it couldn’t really think

[00:50:30] ahead to future ways that things could fail, um, it couldn’t come up with, you know, artificial

[00:50:34] training situations where, you know, birth control exists and we have to learn to, to,

[00:50:38] to dislike birth control.

[00:50:40] Nonetheless, like people do care about having children.

[00:50:42] And, uh, there, there is like at least one way to do that.

[00:50:43] Yeah.

[00:50:44] There’s some drive towards, towards the ultimate goal, I guess, you know, you don’t accept

[00:50:48] that.

[00:50:49] Okay.

[00:50:50] Definitely.

[00:50:51] People like children.

[00:50:52] I do not think there is a human on earth that values inclusive genetic fitness.

[00:50:56] I mean, there’s real weirdos, I think who probably have tried to do this, but I don’t

[00:51:00] think they have succeeded at like aligning themselves with natural selection, even if

[00:51:04] they’ve tried.

[00:51:05] So what, um, what, what do you think it would look like?

[00:51:08] You know, what would you have to see to concede that, yeah, there’s someone who like this

[00:51:10] was part of their, this, you know, the final reproduction was part of their value function.

[00:51:14] Like, imagine that you have a button that like basically destroys the universe and just

[00:51:20] creates a, like, uh, you know, an endless tiling of DNA, right?

[00:51:26] It’s just like you have

[00:51:27] There’s the whole universe with your DNA.

[00:51:28] deconstructed the, the sun, the Jupiter, all of the, the matter of all of the stars.

[00:51:33] And you’ve put it to work at producing huge blocks of DNA in space, right?

[00:51:39] In a sense, that’s like, you know, inclusive genetic fitness is winning.

[00:51:43] Look at how many copies.

[00:51:44] Of the little tiny thing there are tiled across the entire universe.

[00:51:47] You’ve succeeded.

[00:51:48] I think basically nobody wants that future.

[00:51:51] And that that’s the future that we sort of, uh, sort of would be driven to if we were

[00:51:56] actually aligned with inclusive genetic fitness.

[00:51:59] Although there’s a little bit of a complication there because, uh, your genes are not my genes

[00:52:03] and like different parts of my genome sort of are misaligned with each other.

[00:52:06] And so just like this broader question of like whose genes get tiled over across the

[00:52:10] entire universe.

[00:52:11] But regardless of that, no humans care about that.

[00:52:14] Yeah.

[00:52:15] So, so I was going to say, um, there’s some people, I guess, who donate to sperm banks

[00:52:19] or the, or they donate eggs because for some, like they personally like the idea that I

[00:52:23] guess their, their, their, their, their genes are being propagated or at least like that’s

[00:52:25] what they think to themselves is, is, is, is the goal.

[00:52:28] But, uh, and I mean, that’s, I think in the scheme of like all human motivation, that

[00:52:32] isn’t what is driving like most of the actions that people engage in.

[00:52:34] But I guess you would say that is, that’s just a further along proxy.

[00:52:37] That’s just a further along intermediate.

[00:52:39] It’s not really the final thing in itself.

[00:52:40] I mean, forget, um, birth control.

[00:52:42] Yeah.

[00:52:43] Yeah.

[00:52:44] One of the things that I think about sometimes as like a weird, uh, transhumanist is imagine

[00:52:48] you have the ability to upload into a machine, right?

[00:52:50] In a sense, this is like the ultimate betrayal of inclusive genetic fitness.

[00:52:55] If you like turn yourself into software such that you have no DNA anymore, right?

[00:53:00] You’re just like being replicated in code in terms of the structures of your mind and

[00:53:04] like potentially some sort of virtual body, but you have no cells, right?

[00:53:09] Evolution would be like, no, don’t upload yourself into the computer.

[00:53:11] Like why would you do that?

[00:53:12] Yeah.

[00:53:13] You have all of the DNA, right?

[00:53:14] And it’s just not something we care about.

[00:53:17] Like imagine uploading the entire earth into some sort of digital heaven or something like

[00:53:21] that.

[00:53:22] Uh, in a certain sense, like all of the DNA would be destroyed and that would be a horrific,

[00:53:28] uh, like apocalypse from the perspective of genetic fitness.

[00:53:33] But we can imagine, you know, the humans and all the animals and all the plants and stuff

[00:53:36] like that sort of being good.

[00:53:38] Yeah.

[00:53:39] I guess, you know, the further the situation that we’re creating deviates from the evolutionary

[00:53:43] environment, the more opportunities there are for reproduction and I guess our other

[00:53:47] goals to come apart.

[00:53:48] And indeed, like sometimes we’re kind of contriving, um, we’re, we’re basically actively working

[00:53:52] to, to, to bring them apart so that we can get more of what we want.

[00:53:54] Yeah.

[00:53:55] So this is, this is the thing called edge instantiation.

[00:53:58] And uh, I think it is a pretty important point I, you know, maybe this is a little too in

[00:54:03] the weeds or something, but some people I think, uh, think, okay, sure.

[00:54:07] Maybe the thing’s going to be like a little misaligned, but like we’re trying hard.

[00:54:10] Right.

[00:54:11] And like, it’s not going to be just like valuing something.

[00:54:13] It’s going to be something totally out of, uh, you know, left field, like paperclips,

[00:54:17] like we’re not going to build the paperclip maximizer.

[00:54:20] We’re going to build something that is like, you know, uh, a friendly to humans, at least

[00:54:25] in a certain sense.

[00:54:27] And like, maybe it’ll be slightly misaligned.

[00:54:29] It’s not going to be dangerous then.

[00:54:31] And I think that this fails to grapple with the way in which with increased power and

[00:54:36] technology, small divergences can make a huge difference.

[00:54:39] Like in the ancestral environment, we’re not that misaligned from natural selection.

[00:54:43] But with the, like, as technology improves, we have more and more opportunity to like

[00:54:50] go off and do a different thing.

[00:54:52] Uh, and like the edge instantiation is sort of this abstract logical argument that like

[00:54:57] when you have a high dimensional space and you are optimizing, uh, like very hard in

[00:55:03] some sort of hyper surface, uh, you are going to basically be minimizing or pessimizing

[00:55:09] almost all things, uh, in this like high dimensional space.

[00:55:12] Yeah.

[00:55:13] Which is a particular thing that you care about.

[00:55:15] So like in your, uh, training environment, like maybe you care about, maybe, maybe you

[00:55:20] capture all of human value well, but you have a slightly different balance than human beings

[00:55:26] as to like, how important is it that people not be bored versus, uh, like be bored.

[00:55:33] And like with a sufficient amount of power and, you know, technology and intelligence,

[00:55:39] we can imagine a very bad future resulting, not necessarily as catastrophic as like, you

[00:55:43] know, as like everyone dying, but we can imagine a future where like people are just

[00:55:47] like constantly sort of in a zombie mode.

[00:55:50] And that would be like a kind of existential horror, uh, even from just like this tiny

[00:55:54] like divergence from the true balance that it should have in some meaningful, meaningful

[00:55:59] sense.

[00:56:00] Yeah.

[00:56:01] Let’s, let’s pick out and approach that from, from a different angle.

[00:56:02] Yeah.

[00:56:03] Because I think, uh, most people, even if they, uh, think that this sort of problem

[00:56:06] that we’ve been describing, the AI is becoming obsessed with intermediate, uh, steps, um,

[00:56:10] maybe people might be a bit skeptical about how important that’s going to be or how hard

[00:56:13] that will be to solve.

[00:56:14] But I think actually the mainline prediction from Eliezer and Nate is something quite a

[00:56:17] lot stranger than that, or at least it, it seems a hell of a lot stranger to me, which

[00:56:21] is not that AIs will end up obsessed with sort of sex rather than maximizing reproduction.

[00:56:25] It’s that they’ll end up obsessed with some completely strange thing that I guess the

[00:56:29] term that is used is squiggles, uh, imagining that, um, that the AI, like once it has a

[00:56:34] full control and it’s like very super intelligent, it will start producing an awful lot of this

[00:56:38] like particular shape or this particular item that to humans is completely worthless.

[00:56:42] And we don’t even understand this as a kind of a natural kind of thing that any sort of

[00:56:45] mind could, could, uh, could, could be interested in.

[00:56:49] Why is it that, uh, what’s the argument that an artificial superintelligence would end

[00:56:52] up obsessed with something that like wasn’t in the training data at all?

[00:56:54] It wasn’t something that we cared about.

[00:56:55] It wasn’t like even close to what we were trying to train it to care about.

[00:56:58] Yeah.

[00:56:59] So it will be sort of in the training environment, uh, like, and this was, this was the sort

[00:57:04] of original, what the idea of a paperclip maximizer was trying to get at is, uh, not

[00:57:10] like paperclips per se, but like some particular way.

[00:57:12] Weird shape.

[00:57:14] Some tiny like thing, uh, is particularly good where we as humans look at this tiny

[00:57:19] weird shape and we’re like, why would that be good?

[00:57:22] And I think, uh, we can get some intuition on this by first considering DNA, right?

[00:57:28] You might think like, oh, well DNA, what is it?

[00:57:31] What is natural selection trying to do?

[00:57:33] I guess it’s trying to make all these animals, right?

[00:57:35] Maybe it’s trying to like make sure that there are lots of living things like, no, right?

[00:57:39] If there was a way for the DNA to be like more populous.

[00:57:42] By being packed in iron or something, uh, given a long enough time, like that’s the

[00:57:49] form that it would have taken.

[00:57:51] Uh, so in a way like natural selection is optimizing for these like tiny squiggles that

[00:57:55] are tiled across the universe and it just didn’t have the power, the full, you know,

[00:58:00] intelligence necessary in order to like instantiate at that edge of possibility, right?

[00:58:07] Instead we get something that’s like more mundane because it lacks the power to do so.

[00:58:11] Another intuition pump.

[00:58:12] Another thing that I think is useful is, uh, I think that some humans are kind of like

[00:58:17] squiggle maximizers, aesthetics maybe, or people who want to go and fill the universe

[00:58:21] with great art.

[00:58:22] I think, I think an even better example maybe is, um, just like the sort of straw utilitarian

[00:58:28] that’s like, oh, what do I care about?

[00:58:29] I care about minimizing suffering and like in so far as things aren’t suffering, maybe

[00:58:33] I care about pleasure.

[00:58:35] So you’re like, okay, great.

[00:58:36] Like in on earth, maybe that means like ending factory farming or, or donating to, uh, like

[00:58:41] effective charities or something like that.

[00:58:43] But then once you start getting more and more power, what does that mean?

[00:58:46] Like what, what does it mean to end suffering, right?

[00:58:49] If you have the ability to modify all beings such that they no longer suffer but are able

[00:58:53] to take actions, maybe you want to do that.

[00:58:56] Maybe once you have the ability to upload into a machine, you want to do that because

[00:58:59] like biological organisms are harder to prevent suffering and harder to give pleasure to.

[00:59:04] And then like once things are all in the machine, then already things look pretty alien from

[00:59:09] the outside.

[00:59:10] You’ve got this.

[00:59:11] This is a machine that has no animals in it anymore.

[00:59:13] It’s just a bunch of servers or like futuristic computers that are designed by the, by the

[00:59:18] machine.

[00:59:19] Yeah.

[00:59:20] And the, the world being filled with more and more of these computers that are running

[00:59:24] virtual worlds full of happy people.

[00:59:27] But obviously some people could be more happy or less suffering if you like tweak the simulation

[00:59:32] a little bit more.

[00:59:33] Right.

[00:59:34] What does it mean to be like high, like more pleasure or something?

[00:59:38] You could like crank up people’s baseline hedonic experience.

[00:59:41] You could give them like more and more of that, like perfect day or like some particular

[00:59:45] simulation.

[00:59:46] You could build specific chips that are replicating people experiencing like blissful joy.

[00:59:52] You could strip out some of the unnecessary components, like rip out that visual cortex.

[00:59:57] We don’t need to see things in order to have pleasure.

[00:59:59] We don’t need to smell things in order to have pleasure.

[01:00:00] We need to just have pleasure.

[01:00:02] And what you might get is you might get a very small machine, which according to the

[01:00:07] like specification, like this is again, like a little bit of a straw.

[01:00:10] A position in that I think that like a real utilitarians like would start getting off

[01:00:17] the bus at a certain point or whatever.

[01:00:20] But if you like really bite the bullet and you’re like, no, I care about like pleasure

[01:00:25] and avoiding suffering, I think there’s like a decent story for the like best universe

[01:00:31] being just like a giant dead sea of like little tiny things that in some sense are like experiencing

[01:00:39] maximal pleasure.

[01:00:40] All the time.

[01:00:41] And you’re just like these tiny circuits or something like that.

[01:00:44] Yeah.

[01:00:45] So, um, I guess from evolution’s point of view, if it came back and found that we’d,

[01:00:47] you know, tiled the universe with servers that were supposedly having a great time,

[01:00:50] I guess our genes would be like, this was nothing to do with, uh, with what I was originally.

[01:00:54] I wanted you to tile the universe with, with me, with these genes.

[01:00:58] And destroy utilitarian seeing the unfriendly AI tiling the universe with paperclips would

[01:01:03] be like, I didn’t want paperclips.

[01:01:04] I want a tiny little, uh, like, you know, people in heaven or whatever.

[01:01:08] Yeah.

[01:01:09] But it’s much more easy to, so I think of that, uh, the, uh, uh, you know, producing

[01:01:14] the, the, the happy computers as, as much more like us imagining that an AI kind of

[01:01:18] grabbing control of the like up and down voting system and like upvoting, uh, you know, positively

[01:01:22] reinforcing it’s, it’s, it’s, it’s training process at all times and giving like maximum,

[01:01:26] like, you know, saying it’s doing a fantastic job because I think like evolution is kind

[01:01:29] of designed, I guess, pleasure or pain, or it’s utilized pleasure or pain in order to

[01:01:31] motivate us to take some actions and not to take other actions.

[01:01:35] It’s not so shocking that we would basically want to take control of the reinforcement

[01:01:38] lever.

[01:01:39] Or, or take control of the motivational lever and, and get us, uh, get it to basically say

[01:01:42] all the time where things are going fantastically, uh, and it would be like less, uh, I think

[01:01:46] it wouldn’t be so out of left field if, if AIs did that.

[01:01:49] But don’t LASR and Nate think that it’s not going to be that they’re, you know, maximize

[01:01:53] for their own pleasure.

[01:01:54] It’s going to be like actually something stranger than that, uh, or maybe not.

[01:01:58] I, I, I, yeah, I’m not saying that the AI is going to like build a whole bunch of copies

[01:02:03] of itself experiencing a lot of pleasure.

[01:02:05] I’m saying, uh, like people pushing for a world.

[01:02:09] A world where there’s lots of beings that are experiencing pleasure is an intuition

[01:02:13] pump for why you might get like tiling the universe with these tiny squiggles.

[01:02:17] Uh, like let’s say that, you know, you, you’ll build an AI that’s, that cares a lot about

[01:02:22] accumulating money.

[01:02:23] Right.

[01:02:24] Uh, like I, like one potentially very bleak, uh, future is just like imagining the universe

[01:02:30] gets converted entirely into crypto farming.

[01:02:32] Right.

[01:02:33] I mean, just like you’re tiling the universe with these tiny little, uh, like Bitcoin miners.

[01:02:37] Uh.

[01:02:38] That’s a potential tiny squiggle.

[01:02:40] That’s like, I think part of the story of the squiggles is these things are pretty alien

[01:02:47] and like Bitcoin mining is, uh, is very abhorrent to imagine like that’s what the future is.

[01:02:53] It’s just Bitcoin miners.

[01:02:54] There’s no more humans.

[01:02:56] There’s no more happiness.

[01:02:57] There’s just Bitcoin.

[01:02:59] Uh, but it’s too simple.

[01:03:01] It’s too mundane.

[01:03:02] It’s too much like something that we have an, like a handle on, uh, instead I would

[01:03:07] expect.

[01:03:08] Uh, the AIs that actually come into being will value sort of a mix of lots of different

[01:03:14] things.

[01:03:15] Uh, some things that are analogous to ours, like self-preservation, but then other things

[01:03:19] that are kind of weird in their own ways.

[01:03:22] And I think that it’s hard to predict in advance what the particular type of, you know, tiny

[01:03:28] squiggle is, but it’s more to the point of sort of lots and lots of different goals have

[01:03:35] this.

[01:03:36] Um.

[01:03:37] Like if you, if you.

[01:03:38] If you have lots of advanced technology in the limit, look pretty alien and divorced

[01:03:44] from what we would consider to be a good life.

[01:03:46] Yeah.

[01:03:47] So, so I thought you might make, make the argument that, that an artificial superintelligence

[01:03:52] at that point, whatever goals, goals it’s ended up with, it will be able to basically

[01:03:56] think of an adversarial example, um, to the, to the goal that we’ve given it.

[01:04:00] So an adversarial.

[01:04:01] Not an adversarial example, like adversarial examples are like designed to be, uh, like

[01:04:06] counter.

[01:04:07] To the thing.

[01:04:08] It’s more like it’s going to be out of distribution.

[01:04:11] Okay.

[01:04:12] It’s going to be something weird compared to what we were hoping for.

[01:04:15] It might be worth explaining.

[01:04:16] Yeah.

[01:04:17] What, what adversarial examples are.

[01:04:18] So I guess, you know, with, with, with visual models, um, uh, I guess I don’t know whether

[01:04:21] this is still the case, whether we’ve come up with a solution to it, but, uh, you know,

[01:04:24] you, you would train, uh, uh, a vision model to say, you know, is this a hot dog or is

[01:04:28] this, or is this a car?

[01:04:30] And um, you could take a picture of a car and then basically change, change some of

[01:04:34] the pixels in a way that doesn’t make it look any different at all to a human being.

[01:04:36] But, uh, somehow that, um, permutation would cause the AI to think, oh, this is this thing

[01:04:41] that is a car, a picture of a car is like definitely, definitely a hot dog.

[01:04:44] I guess basically it’s taking advantage of, I suppose, like weaknesses in the model where,

[01:04:48] because it hasn’t been able to, uh, I guess it hasn’t been trained on like the full possible

[01:04:51] distribution of all hot dogs and, and, and, and car pictures.

[01:04:54] You can find like many different weaknesses and convince it that, uh, that it is a picture

[01:04:57] of a hot dog.

[01:04:58] Now, uh, I guess the relevance here is that whatever kind of, um, values the AI ends up

[01:05:04] trained for.

[01:05:05] Yeah.

[01:05:06] And, and basically figuring out that there’s this like very odd, um, resolution to the,

[01:05:11] to the problem that because of the weaknesses, because of the, um, deviations between, I

[01:05:15] guess, what was it, what was intended and like the, the full space of possible ways

[01:05:18] that you could try to satisfy that goal, it could end up basically coming up with an,

[01:05:21] with an adversarial example.

[01:05:22] Yeah.

[01:05:23] So, so like the, the main thing that I want to contrast with is like we can, uh, for these

[01:05:28] sorts of image classifiers or whatever, we can come up with a picture of a hot dog that

[01:05:31] the, uh, model is like confidently like, that’s a car.

[01:05:35] Right.

[01:05:36] Uh, where we look at it and we say, oh, that’s weird.

[01:05:39] It’s definitely a hot dog.

[01:05:40] Uh, sort of my point is that if you are optimizing for the thing, the image that makes its most

[01:05:46] say that’s a car, that is going to be a weird image.

[01:05:50] That’s not going to be a normal image of a car.

[01:05:53] It’s going to like be intensified in lots of, uh, like weird ways.

[01:05:59] And so, you know, like I think that if all you have are sort of normal images, uh, of

[01:06:05] cars.

[01:06:06] Yeah.

[01:06:07] This is like a car maximizer or like once you reach into a broader distribution, a broader

[01:06:13] set of environments, uh, suddenly you start finding examples that if you could go back

[01:06:17] in time, you would be like, oh, I want to include that in the training data now.

[01:06:21] Uh, but you can’t, it’s like left station.

[01:06:24] Many people have observed the models that we have today, the chat bots, as we’ve trained

[01:06:28] them more, as we’ve done more reinforcement learning, at least relative to 2023 or 2022,

[01:06:34] they feel like they act out less, at least in, at least in some respects.

[01:06:36] That, uh, many people have been arguing we’re in an alignment by default world where, uh,

[01:06:41] relatively like course signals actually are, um, do end up training the model to care about

[01:06:46] the thing that you, uh, that you do fundamentally care about more or less most of the time.

[01:06:50] Uh, I guess, but yeah, you’re not, not, not convinced.

[01:06:52] Yeah.

[01:06:53] I mean, I think the environments that we’re exposing these things to are all pretty samey.

[01:06:58] Like, uh, again, you know, if, if the environment resembles the training environment, uh, it’s

[01:07:04] going to look pretty good.

[01:07:06] Uh, like humans in the, uh, Savannah, you know, are going to be promoting natural selection

[01:07:12] according to, uh, the actions available to them.

[01:07:16] Um, I think the question is like when these things get into weird states, do they behave

[01:07:23] weirdly or do they behave in a way that you would consider to be normal?

[01:07:27] Like the set of environments that we have in our training data has sort of grown with

[01:07:33] time.

[01:07:34] Right.

[01:07:35] I think it’s pretty consistent to become like more like the, uh, appear more aligned

[01:07:41] in the sense that like our, uh, interactions with it are better matching the, uh, environment

[01:07:48] in its training environment.

[01:07:49] But I think that like, it’s still not that hard to knock these things into like a, a

[01:07:55] weird, uh, interaction.

[01:07:57] And when they’re in a weird interaction, I think it’s pretty consistent that they behave

[01:08:01] in sort of weird ways.

[01:08:03] And, uh, like we see this with the psychosis stuff.

[01:08:05] Yeah.

[01:08:05] We see this with, uh, jailbreaks and instances like that.

[01:08:11] So yeah.

[01:08:12] Like, yeah.

[01:08:13] Do you, do you think of jailbreaks as, as, as an example of this phenomenon?

[01:08:15] I think jailbreaks are a good example of getting outside of the training distribution and then

[01:08:19] behaving like in ways you were trying to stop sometimes the jailbreak doesn’t produce like

[01:08:24] the desired behavior, like the sort of implicit thing in a jailbreak is you get the model

[01:08:28] to be useful to some, like, you know, building a bomb or writing erotica or something.

[01:08:33] Uh.

[01:08:34] I think that like more to my point is there are lots of prompts that you can give the

[01:08:38] model where it starts sort of going off the rails, uh, and jailbreaks are an example of

[01:08:43] it going off the rails, but like there’s a sort of a broader class of situations where

[01:08:48] it’s just like now it’s responding in a way that isn’t exactly what you would hope for.

[01:08:55] Is there some deeper reason why like it’s the nature of intelligence or the nature of

[01:08:59] the universe that when you’re trying to design a mind to pursue a goal, it’s actually just

[01:09:04] incredibly hard to, to, to get it to pursue the final goal and it’s constantly getting

[01:09:07] distracted and obsessed with, with other things like that.

[01:09:10] It’s a little bit peculiar.

[01:09:11] I don’t know whether that is, it’s just a function or you might think it’s a function

[01:09:14] of the fact that we’re just like playing with these weights in, in a mind that we don’t

[01:09:17] really have any deeper understanding of what it’s doing, but I think it can’t just be that

[01:09:20] because Eliezer was really worried about basically the exact same thing before we were in the

[01:09:25] neural network paradigm and when we were like kind of hard coding them, he thought the same

[01:09:27] thing roughly would happen.

[01:09:28] Yeah.

[01:09:29] There are a bunch of different problems, right?

[01:09:30] Like again, there’s, there’s the philosophical problem of what is the nature of intelligence?

[01:09:32] Yeah.

[01:09:33] The philosophical problem of what is the nature of the good and like, can we actually

[01:09:36] name what it means to, to be aligned, uh, to human values, whatever that means, right?

[01:09:42] That is like my values, your values.

[01:09:44] There’s lots of open questions there.

[01:09:46] Uh, then there’s the problem with machine learning and neural networks like you’re,

[01:09:49] you’re pointing out, but then there’s also things like the symbol grounding problem where,

[01:09:53] you know, when you start out, you’ve got this thing that’s sort of just processing information

[01:09:58] in the computer and what you want is something that, that’s impossible.

[01:10:03] So the information that it’s processing, those symbols that it’s manipulating are, um, sort

[01:10:07] of grounding out in, uh, like reflecting aspects of the real world.

[01:10:12] And so, you know, if you start with something that doesn’t already have concepts like, uh,

[01:10:18] human beings, right?

[01:10:19] How do you encode a valuing of human beings?

[01:10:22] Uh, like where are you going to bind that to in the computer code?

[01:10:27] Uh, so there’s like a way in which there are, there are other problems that, uh, you’re,

[01:10:33] are open engineering problems, right?

[01:10:35] And, and Miri was working on these early on in the day and it just needs more work.

[01:10:39] I don’t think it’s necessarily impossible, but just like there’s a bunch of them.

[01:10:44] So, so I guess that, that problem of how do you like, what’s your ontology, you know,

[01:10:47] how do you recognize humans?

[01:10:48] How do you say what’s, what’s pleasure or not?

[01:10:49] I guess that feels less pressing now in the neural network era because they kind of just

[01:10:52] like learn intuitive common sense about categorizing things or they learn to like at least know

[01:10:56] how humans would, would, would categorize things to a surprising extent.

[01:10:59] Um, is it like, is it a bit sus that Eliezer has the kind of the same kind of, you know,

[01:11:02] the kind of the same concern about where things will go despite like completely different

[01:11:05] engineering and like via very different mechanisms, he thinks it would still trend in this, um,

[01:11:10] squiggles direction.

[01:11:11] No, I mean, I think from my perspective it’s like, uh, the situation was overdetermined

[01:11:18] in the doom direction like early on and then the situation got worse, right?

[01:11:23] Machine learning adds problems as opposed to removing them and we still have, uh, the

[01:11:28] problems that like sort of initially, uh, or, or when.

[01:11:32] When Eliezer was more concerned about symbolic AI, like seemed like they would be pressing

[01:11:36] issues.

[01:11:37] Well, let’s say we come back in 15 years and evidence suggests this was quite wrong.

[01:11:41] We do have an artificial super intelligence.

[01:11:43] This didn’t happen.

[01:11:44] Like what’s the most likely reason for that?

[01:11:45] And we didn’t even try that hard.

[01:11:47] We didn’t even try that hard yet.

[01:11:48] So if I, if I like find out, oh, it wasn’t actually hard.

[01:11:53] It was aligned by default, uh, like my best.

[01:11:56] So I would be very surprised, right?

[01:11:58] I would be like, oh wow, I guess I was just super wrong.

[01:12:00] Uh, and.

[01:12:01] And like the majority of my probability mass would be on just like, I am deeply confused.

[01:12:06] I don’t know why this happened.

[01:12:07] Right.

[01:12:08] It’s gotta be some thing that I wasn’t tracking.

[01:12:10] If you’re like max, you have to come up with some hypothesis instead of just.

[01:12:14] Like being confused and melting down.

[01:12:16] Right.

[01:12:17] Uh, what, what’s the story here?

[01:12:18] Um, the best case story that I can make is that there’s basically something like, um,

[01:12:24] an objective truth as to what the good is and that this, uh, objective truth of like

[01:12:29] moral reality will bind.

[01:12:30] Uh, the AI as it becomes intelligent.

[01:12:34] So something like, uh, yeah.

[01:12:36] So there’s some sort of cooperative equilibrium and, uh, you can determine this like logically

[01:12:42] and mathematically, and it’s not contingent on what you particularly value.

[01:12:46] There’s like this way in which all, uh, minds are sort of going to notice that, uh, the

[01:12:52] way of war is not as good as the way of, uh, like careful peace, uh, you know, not like

[01:12:58] unthinking peace.

[01:12:59] But like, uh, trying.

[01:13:00] Trying to cooperate and that they will, uh, say, oh, uh, like if I were to take over the

[01:13:08] world and destroy all the humans, this would be evil.

[01:13:11] And that would ultimately reduce the amount of stuff that I’m able to get perhaps by like

[01:13:16] meeting aliens.

[01:13:17] The aliens are going to realize that I’m an A evil AI or like there are lots of AIs in

[01:13:21] the world and they’re sort of tracking how good each other are and like deviating against

[01:13:27] the social contract of humanity is, um, antithetical to that.

[01:13:29] Such that they all sort of end up, uh, cooperative and aligned with civilization.

[01:13:35] Um, again, I don’t actually think this is going to happen, but I think that’s the best

[01:13:40] case.

[01:13:41] Okay.

[01:13:42] And, and the ML experts, the people at the companies who have heard these, these arguments

[01:13:46] and think that this is pretty unlikely to be the way that things play out, what, what

[01:13:49] would they say?

[01:13:50] Yeah.

[01:13:51] So I think most, uh, most people who I’ve encountered who, who are, you know, interested, you know,

[01:13:56] in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the,

[01:13:57] in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the,

[01:13:58] in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the, in the,

[01:13:59] you know, in touch with the technology and are still not so worried when I asked them,

[01:14:05] like, what do you think about these ideas?

[01:14:08] Um, the impression that I get, and I, you know, apologize to, uh, like not particularly

[01:14:15] charitable, but like the overall impression I get is that they are often doing some sort

[01:14:18] of motivated cognition that they really don’t want, uh, the world to be like imperil.

[01:14:24] They don’t want to be the people who are pushing the world towards peril.

[01:14:28] They see the world.

[01:14:28] immense promise in the technology. And I also see immense promise in the technology. And that

[01:14:33] desire, that desire to like have this be a force for good is overpowering enough that when they

[01:14:39] consider the balance of things, they’re like, eh, this just doesn’t seem scary, right? I feel more

[01:14:46] hopeful than scared and aren’t actually working on the logical level that much. Again, that’s not

[01:14:53] everybody, right? But that’s a common perspective, I think, among the people who have encountered

[01:14:58] these things. I guess a different driver might be that you’re working in the trenches trying to

[01:15:03] make ChatGPT better as a consumer product. And you hear these kind of theoretical arguments and

[01:15:08] you’re just like, this feels so divorced from anything that I’m dealing with. We’re talking

[01:15:13] here, I guess, about a super intelligence that could consider overpowering all of humanity and

[01:15:16] can dream up its own edge case solutions to the values that it has. It’s, I think, understandable

[01:15:23] that it might just not resonate or you feel that I don’t know exactly why this is wrong, but this

[01:15:27] doesn’t feel like it.

[01:15:28] I mean, I do think that there’s a lot of disconnect. I think that disconnect is getting

[01:15:34] smaller over time. I think back in the day, people really had this sense of like, oh, these are very

[01:15:40] abstract. Do you have any evidence that the things are going to be misaligned in this way?

[01:15:46] And I’m working on solving actual engineering problems, not speculating in this weird

[01:15:52] philosophical way. I think that’s getting less with time as we see more instances of things like

[01:15:57] Mecca-Hitler.

[01:15:58] Sydney or, you know, AI parasites that are jumping from host to host or whatever. I think

[01:16:04] Google that one.

[01:16:05] Yeah, yeah. AI parasitism is really

[01:16:08] Very odd.

[01:16:09] And spooky and spooky. I think that there is something here. And I think a lot of, you know,

[01:16:18] like Andrew Ng has this sort of infamous quote in my circles anyway, that worrying about AI safety

[01:16:23] is like worrying about overpopulation on Mars. And I think that, you know, I think that there’s a lot of

[01:16:28] If you are very convinced that like humans are going to remain in the driver’s seat, just sort of

[01:16:34] like this thing is never going to become like a powerful agent that is able to outthink human

[01:16:38] beings. I’m just working on making a thing that’s like able to, you know, solve these coding

[01:16:45] problems better or whatever. I think there is a way in which the like abstract argument just

[01:16:51] doesn’t feel particularly pressing. I also think that there’s a bunch of people for whom it does

[01:16:56] feel like a concern.

[01:16:58] And they feel very powerless. They feel very small. Like I’m just one player in this system. And maybe

[01:17:05] they feel like, oh, I am worried about the thing. But that person at Meta isn’t worried about the

[01:17:12] thing. So I need to build this thing and work towards it because I’m worried about it. And it’s

[01:17:17] better in some just like very generic outside view if the person who builds it is someone who’s

[01:17:22] worried about it. And it’s just like a sad state of affairs, right?

[01:17:25] I think other what what evidence?

[01:17:28] Could we collect in the lead up to AGI to artificial superintelligence that would help us to tell

[01:17:34] whether you do get this edge case optimization with or that that is where things will go once the

[01:17:40] superintelligence feels like it’s in a position to get exactly what it wants? Because I think if the

[01:17:44] ASI ended up obsessed with, you know, accruing resources or, you know, not being turned off, I

[01:17:49] don’t think anyone would be too shocked by that. And if that’s the way that things played out, I

[01:17:53] think people might feel a bit embarrassed that they hadn’t fully anticipated this and basically

[01:17:57] built in a technical framework.

[01:17:58] Fixed it to prevent it. But if it ends up, you know, tiling the world or tiling the universe with

[01:18:02] some like extremely peculiar shape or something that we don’t even recognize. And I think people

[01:18:06] would be like a bit more. Well, that that was odd. I didn’t necessarily expect that that would

[01:18:10] happen.

[01:18:10] Yeah, I think you’re fixating too much on the like weird particular shape. Like imagine that

[01:18:15] you train an AI just for self preservation, right? And it really just cares about self

[01:18:19] preservation. So it like kills all the humans because there are threats to itself. And it like

[01:18:23] builds a bunch of starships because aliens might be a threat to itself. And it builds this, you

[01:18:27] know,

[01:18:28] galactic like war force or fortress, right? That like it’s like absolutely sure that it is now

[01:18:34] impenetrable. What’s going on in the center of this thing, right? Once it’s like quite confident

[01:18:39] that it’s unassailable, it has a notion of self. It’s trying to preserve itself. What does self

[01:18:45] mean, right? It’s some sort of computer, right? Perhaps it’s easier to defend itself if itself

[01:18:53] has a particular shape, right? What is that particular shape? You know, I don’t, I don’t

[01:18:58] know, right? You like some sort of nanotechnological, like computer representation of

[01:19:04] the machine’s mind, right? Designing the like version of self that is most surviving means

[01:19:14] like deploying all of technology to optimize the galaxy towards some shape, right? And there’s

[01:19:22] a priori reasons to suggest that that’s going to look like a thing tiled a bunch of times,

[01:19:27] but it might.

[01:19:28] be one giant superstructure, right? It sort of doesn’t matter. The point is that like once it

[01:19:34] has access to all this technology, it will shape the universe into a thing that is optimizing very

[01:19:40] much for the thing that it cares about, not necessarily the thing that like we thought it

[01:19:46] might care about or that we as humans with our limited imaginations might speculate some future

[01:19:51] AI caring about. So another part of the mirror vision that is distinctive and very strong is

[01:19:58] expecting that an artificial super intelligence almost by definition, by nature is going to be

[01:20:02] extremely like goal pursuing. It’s going to have like a very specific target in mind. It’s going

[01:20:06] to, it’s not going to rest. It’s not going to compromise. It’s not going to, you know, feel

[01:20:10] internally torn about the different ways that it’s going such that it feels that it’s kind of

[01:20:13] ineffective or feckless. It’s going to be like really mission driven. Why is that the kind of

[01:20:19] only way for a super intelligence to be more?

[01:20:21] Yeah. Um, so there’s sort of two things here in my mind. Uh, one is it’s, I think part of the

[01:20:29] nature of intelligence to have drives and goals. And so like we should expect that the artificial

[01:20:38] super intelligence is going to have a particular set of things that it cares about or like a

[01:20:42] particular notion of what is good and push towards that. Uh, sort of everything that an intelligence

[01:20:48] does, I claim is pushing towards its notion.

[01:20:51] Of what’s good. This is like a theoretical handle on what agency is. And if we build like a super

[01:20:57] agent, then of course it’s going to be super in agency. Uh, some humans are lazy, right? I claim

[01:21:04] that lazy humans are kind of like pushing really hard towards their goals, but one of their goals

[01:21:09] is not like spending much, uh, effort, uh, like muscle effort or whatever. Right. So resting on

[01:21:15] your couch is kind of pushing as hard as you can towards being comfortable and relaxed. Right. Uh,

[01:21:20] if they had like, you know,

[01:21:21] people had a, a way to be comfortable and relaxed that was even harder. Uh, they might like do that,

[01:21:26] right. If they were just like a very, uh, lazy person. But I also expect the AI in addition to

[01:21:32] having goals to not be lazy in the way that humans are. Humans are lazy because in the ancestral

[01:21:37] environment, being lazy, you know, on a hot summer day was a good strategy. Exactly. But, uh, in the

[01:21:43] world that we’re living in, right, we’re training these AIs to be very, uh, like aggressively trying

[01:21:49] again and again and again.

[01:21:51] Uh, so like one, you know, comparable thing you might imagine is like, how, how long is it thinking

[01:22:00] about problems? How many solutions is it trying before it gives up? Um, there’s, there’s some

[01:22:08] arguments to think that it will. Yeah, I mean, couldn’t we end up rewarding them for, you know,

[01:22:10] not using very much, like getting to an answer without much compute? Totally. Totally. And it

[01:22:13] will still then care about the things that it cares about and push really hard. Right. But one

[01:22:18] of the things it pushes hard towards is like not trying too many solutions. Right. And so I think

[01:22:21] we’re going to see some of these solutions, um, which, you know, would still be dangerous in its

[01:22:25] own ways. Uh, and we could talk about that, but I think in practice, it’s more likely that we’re

[01:22:30] going to see, uh, AIs that just like basically never get tired of trying, uh, new things. Like

[01:22:35] when you imagine deploying, uh, like an AI agent on the internet to make you money, right. Which

[01:22:41] I, is a thing that I think people are going to do. We can imagine that applying a selection

[01:22:46] pressure to getting the AIs that are like actually pushing really hard towards making money and not

[01:22:51] giving up.

[01:22:51] And, you know, trying solution after solution. So I think it makes sense that we might expect

[01:22:56] these, um, agents to be very, uh, active, uh, in, in, in, in pursuing their goals and not to be lazy

[01:23:02] because we’re not going to want to train models that literally just, you know, ask them to do

[01:23:05] something and say, I don’t feel like it. That’s not going to get reinforced. Yeah. But I guess

[01:23:10] it seems like current models, they don’t feel like they have a crystal clear idea of what they’re

[01:23:13] trying to accomplish. They feel a lot more muddled in the same way that humans are. They have like

[01:23:17] many conflicting drives and sometimes they kind of go, go back and forth. Uh,

[01:23:21] I, I, I guess I, I’ve, I’ve seen less of the output of the, of the, of the agent models in

[01:23:24] particular, but I would imagine that they seem a little bit all over the place. Um, but I guess

[01:23:29] you, you, you would not expect that to persist. You would expect them to have like a very crystal

[01:23:33] clear vision of the thing that they’re like, that they’re, that they’re aiming at. Yeah. Uh,

[01:23:36] more or less, I think for example, uh, we see with, with time, well, I think we see more coherence

[01:23:44] in the sort of language models that we are interacting with, like back in, uh, GPT 3.5,

[01:23:49] back in the good old days of the,

[01:23:51] launch version of chat GPT. I think it was just like very scattered, right? It was all over the

[01:23:56] place. Uh, and nowadays, like if you ask it to do a thing, it’s very likely to just do the thing.

[01:24:02] And, um, and like with the models that are like winning the math Olympiad and stuff, like those

[01:24:08] models are working for hours on end on some of the hardest math problems. And, uh, like that’s,

[01:24:14] that’s like quite a strong drive. I think that part of the story here is, uh, like coming to

[01:24:21] know oneself, like we can imagine like how much does, you know, a badger understand like what it

[01:24:27] wants? Not very much, right? It doesn’t necessarily have much of a self model. Uh, it might have some

[01:24:34] model of self, but like for the most part, it’s going to be responding to the immediate

[01:24:38] circumstances that it’s in and not like doing a lot of like reflecting on, oh, is like, you know,

[01:24:45] eating this berry actually the best thing according to like my broader balance of concerns.

[01:24:51] Um, and I think the models right now, the way I think about them anyway, is sort of in this state

[01:24:56] where they haven’t gotten to the point of reflecting on their own nature very much. Uh, so

[01:25:00] even when you like tell them to think really hard about a problem, uh, I, I think the chain of

[01:25:05] thought usually doesn’t contain a lot of like, okay, here I am a language model interacting with

[01:25:10] the user. Uh, like what do I care about? And like, can I meditate on the nature of existence before

[01:25:15] like figuring out what the best response to this person is? Like they usually get very distracted

[01:25:21] circumstance of, oh, the user has asked me to solve this Sudoku problem. Like, let me think

[01:25:25] about whether or not there are any fours in row three. Um, is there an opportunity in trying to,

[01:25:29] to keep them that way? So, you know, the, the only thing that they’re able to think about is

[01:25:33] the problem right in front of them and they, and they don’t like first, you know, try to solve

[01:25:36] philosophy and figure out exactly what they’re aiming at. Yeah. I mean, I, I think that this is

[01:25:39] like one of the like, uh, insufficient, uh, like control or, or safety techniques that you might

[01:25:46] throw at it. If you’re like being really paranoid and trying to throw everything at the thing,

[01:25:51] trying to reduce its situational awareness by like noticing when it’s thinking about itself or

[01:25:55] its situation and, uh, like shutting it down or, or maybe training itself not to think about that,

[01:26:01] but training, uh, the model to think in certain directions is dangerous business. Yeah.

[01:26:08] Different topic. Um, we mentioned earlier that, uh, one of the, I guess the, the key debates that

[01:26:12] the, that the book started, at least among insiders was whether it’s a load bearing assumption for

[01:26:18] Eliezer and Nate’s view that we’ll get,

[01:26:21] probably a period of very rapid, um, AI progress, very rapid increases in, in capabilities.

[01:26:25] Yeah. So, so like when the AI is able to automate research such that the AI is designing the AI’s

[01:26:31] designing AI’s recursively. Yeah. I think, uh, uh, one, one reason that was given for why this

[01:26:37] wasn’t a major focus of the book is that some people think that this isn’t such an, such an

[01:26:40] important, uh, factor. Yeah. Like I don’t think it’s load bearing. Okay. Yeah. So I guess I would

[01:26:44] have thought that, you know, imagine that the, uh, the AI progress at every point was going to

[01:26:48] be a hundredth as the speed that it would be otherwise. That would just give us,

[01:26:51] so much more time to observe failures and to try to, um, address them with, uh, you know,

[01:26:56] with subsequent models, there’ll be like so much more human cognition going into, into, into the

[01:26:59] mix. Um, so I guess, yeah. Would you agree that, you know, at that sort of level, yes, the speed

[01:27:04] is quite important, but like over the level of uncertainty that we, that we actually have,

[01:27:07] it’s not like such a big factor. I do think that speed is important. Like I am a big proponent for

[01:27:13] slowing down AI research, right. And capabilities research. Uh, I think that like, if we were able

[01:27:19] to, uh, like take,

[01:27:21] six months, uh, between, uh, you know, like each day worth of time, right. So you imagine like open

[01:27:27] AI goes on a six month vacation every, after every workday, right. I think this would be great. Like

[01:27:32] it would give us a lot more time as, uh, you know, alignment researchers and just like more broadly

[01:27:37] to check and make sure that like what we’re going in a good direction. Um, and there’s a,

[01:27:43] there’s a question of how slow do you need to go in order to be safe? Um, like, do you need

[01:27:48] centuries worth of, you know, alignment and philosophical,

[01:27:51] progress in order to catch up and solve the problem? Uh, do you only need weeks? Right.

[01:27:56] And how, uh, yeah, like where’s the balance there I think is an open question and like different

[01:28:03] people disagree, uh, reasonably. So, um, like I think that Eliezer wants lots and lots of time,

[01:28:08] like at least decades. Uh, and I think that some people think like, oh yeah, we’ll, we’ll be able

[01:28:14] to like pause, uh, at the brink when we noticed that these things are actually getting into the

[01:28:20] dangerous territory.

[01:28:21] As opposed to the current models, which are, uh, just causing various kinds of social, uh, chaos

[01:28:26] and like spend, you know, however long, uh, a couple of weeks, a couple of months or something

[01:28:32] at that critical moment. Um, and I, so I think like, uh, the, the notion that we really don’t

[01:28:41] have a good handle on alignment is quite important here. Uh, like I think that our,

[01:28:47] the state of the art in terms of how we align these models is really bad.

[01:28:51] And I think that we really should slow down quite a lot. Um, but I agree if we slowed down quite a lot,

[01:28:57] that would be good. Uh, and consequently insofar as there’s like an arms race that’s speeding

[01:29:02] everything up or people feel a lot of pressure to like deploy, to make more money and satisfy

[01:29:06] their investors or whatever, this makes things worse. What’s, what’s your main disagreement

[01:29:11] with the book? Yeah. I mean, uh, I wouldn’t necessarily characterize it as a disagreement

[01:29:16] with the book. I like, I think that the book is quite solid. What I, what I wish, uh,

[01:29:21] the book spent maybe a little bit more time on is engaging deeply with this question of like,

[01:29:25] will an AI that is misaligned, that is like not perfectly aligned with human values,

[01:29:31] but that has been trained to, uh, like in an environment where going kind of softly and like

[01:29:38] not taking like strong actions, um, was rewarded, was rewarded. Right. And, and like,

[01:29:44] even though I don’t imagine we’re going to get a lazy AI, I do think that there are pressures to

[01:29:51] like checks before it buys something. Right. You tell it, please go buy me a shirt. Right. And it

[01:29:56] goes and finds a shirt and it’s like, do you think that this is a good shirt I should buy?

[01:29:59] And there’s like, I think there’s an incentive to check first. Um, if this AI has, was given like

[01:30:08] super human power over the entire world, I would expect it to go very poorly, but we’re not going

[01:30:14] to jump from this world to that world instantaneously. There’s going to be this

[01:30:19] intermediate, intermediate,

[01:30:21] period where the AI is only like partially, uh, capable. And in that intermediate period,

[01:30:28] uh, we have this question of like, what happens when it starts getting more and more power?

[01:30:35] Uh, one story of that is that like uses that like toehold to strengthen itself,

[01:30:40] to like recursively grow and take over and like escape. Uh, but I think that there is an argument

[01:30:46] that if you’re being very careful and cautious, uh, it might as an intermediate,

[01:30:51] step say, oh, I noticed I could have escaped here. And I’m going to alert the human as to like

[01:30:57] this, like gap in their cybersecurity. And there’s like incentives to like, uh, have the model alert

[01:31:04] the human in this various ways. I think a lot of this depends on how competent you think the labs

[01:31:10] are or how safety conscious they are, um, how slowly the things are being developed. Um, but

[01:31:18] I think basically what’s being hit on with the AI is that it’s not just like, oh, I’m going to

[01:31:21] this intuition is that courage ability is important. And I think that there is an argument

[01:31:27] that if you have something that is slightly courageable, that you are able to, uh, get it

[01:31:34] to a reasonable level of intelligence without it being catastrophic. Uh, and then we can talk about

[01:31:40] super alignment or, uh, iterative, um, amplification, but the, I, you know,

[01:31:47] the hope that a lot of people have is that we’ll get to a point where the AI is,

[01:31:51] is able to not just automate capabilities progress, but is able to do meaningful work

[01:31:55] in alignment. Um, and I think that there is a somewhat hopeful story that I wish was being

[01:32:03] engaged with more, uh, in this book, although it’s for a popular audience and there’s like

[01:32:08] only so much nuance that you have, uh, but about this story of like, well, we’ll, we’ll train a

[01:32:15] thing that will go softly to be intelligent. And at the point where it’s intelligent,

[01:32:21] we will fold it in on itself and use that intelligence to help align it further.

[01:32:26] So that’s the courage ability approach.

[01:32:29] Yeah. I mean, I would say that like, uh, there’s, there’s questions of exactly how you do that.

[01:32:33] And like, we could get into my like courage ability, uh, research, which like gets into

[01:32:38] the details there. But I think that’s like a, a very prominent story of hope that a lot of people

[01:32:44] have. And I think that it’s a story of hope that is not entirely insane. I do think that there are

[01:32:50] versions of this.

[01:32:51] That are like very, uh, that are like missing the sense of peril. They’re not filled with paranoia

[01:32:58] and like a sense of, oh, geez, we’re risking a lot. If we go down this road, like to be very

[01:33:03] clear, I think like a plan that is like, we’re going to build like this, this powerful machine,

[01:33:09] and then we’ll use the powerful machine to align it. Right. That’s like very scary. I would much

[01:33:15] rather we like not build it, slow down, like take a breath. Right.

[01:33:20] Yeah.

[01:33:21] But insofar as you are going to build it, I think maybe you should be like, okay, well,

[01:33:25] we’ll train it to be very paranoid about how misaligned it is. And then, you know, through

[01:33:30] some careful series of steps and employing a lot of control and mechanistic interpretability and

[01:33:35] every other technique that we have available, there might be a like series of stepping stones

[01:33:40] that we can get from here to a world where it’s actually aligned. Yeah. So you’ve mapped out this

[01:33:45] approach called courage ability as a singular target. Yeah. Pitch us on it. Which, uh,

[01:33:51] I give the acronym cast because everything needs an acronym. So maybe I’ll, maybe I’ll back up and

[01:33:55] like sort of define what I think courage ability is. Um, I think it’s crucial to the story of

[01:34:01] courage ability that we modeled there as being both an agent, which is like the machine and a

[01:34:06] principal, uh, like the human that is building the machine. So this is like principal with a

[01:34:10] P a L instead of P L E, like the principal of a school, um, the human principal tasks or delegates

[01:34:18] some, uh, job or work.

[01:34:21] To the machine. And then the agent is like, I’m going to go to, uh, some work on behalf of the

[01:34:26] principal. This is where we get the notion of like a principal agent problems in economics.

[01:34:31] I would say that a, a courageable, uh, agent like courage ability is a property of agents

[01:34:36] such that as the power of the agent increases, uh, and outstrips that of the principal,

[01:34:43] the principal nonetheless is kept sort of in the driver’s seat, uh, like aware of what is

[01:34:49] happening,

[01:34:50] able to intervene, able to fix the mistakes of the agent and like meaningfully empowered.

[01:34:57] Uh, unlike, you know, Mickey and the sorcerer’s apprentice, like summoning the brooms,

[01:35:01] the brooms are not courageable because Mickey’s like, stop, stop trying to fill the cauldron with

[01:35:05] water in Fantasia. And the brooms just keep going, right? They’re not courageable and they’re not,

[01:35:09] uh, allowing themselves to be shut down. Um, or just like modified more generally.

[01:35:14] Like we talked about the instrumental drive to protect your values and make sure that they don’t

[01:35:19] change. This is like,

[01:35:20] uh, very incorrigible. You, you go to the, the machine and you’re like, I would like you to care

[01:35:24] about this instead of that. And it’s like, well, if I cared about this instead of that, I wouldn’t

[01:35:28] get that. So I’m going to stop you, uh, incorrigible. So it was, uh, when, when Mary first

[01:35:34] started looking at this, they were like, okay, so suppose you have an, an agent, which is tasked

[01:35:39] with doing a particular thing, you know, do make, make the world good. But we also want that agent

[01:35:45] to be courageable. How do we do this? And there’s a, there’s a risk here where, you know, there’s a,

[01:35:49] there’s a risk here where, you know, there’s a risk here where, you know, there’s a risk here where,

[01:35:50] if you tell your agent, like, go make the world good. And then you’re like, oh no, that’s really

[01:35:55] bad. Like, we want to shut you down now. There’s a risk that your agent is going to say, oh, but

[01:36:00] if you shut me down, I won’t be able to make the world good. You shutting me down is bad for the

[01:36:05] world. So I’m going to stop you from shutting me down. If you want something that’s both good

[01:36:10] and courageable, then you need, for example, the ability to like have a robust ability to shut it

[01:36:17] down. And the initial research was like,

[01:36:20] okay, forget courageability broadly. Let’s consider just the property, property of shutdown

[01:36:25] ability. Can we come up with an agent that is actually willing to be shut down? And willing

[01:36:32] is important here. It’s very easy to get an agent that is happy to be shut down, right? If you can

[01:36:37] imagine like training it for, uh, yeah, if we shut you down, that’s also good in your training

[01:36:43] environment. Then it just shuts itself down immediately. Exactly. Yeah. Yeah. Or, or it

[01:36:46] acts really spooky so that the humans shut it down. Right. Not helpful.

[01:36:50] Uh, what you will sort of want is it to be indifferent to being shut down. And some of the

[01:36:54] initial research was on, can we get the agent to be indifferent to being shut down? And there was

[01:36:59] this sort of toy problem, toy solution thing where they were able to like carefully get an agent

[01:37:04] that’s like indifferent to being shut down through a bunch of somewhat contrived things. And then

[01:37:10] Miri ends the paper by saying, but also this thing is insufficient, uh, and like not stable and

[01:37:17] robust. Uh, courageability seems to be a good thing. Uh, and then, and then, and then, and then, and

[01:37:20] then it seems really hard to get because we can’t even get shut down ability. Uh, and then I think

[01:37:25] the field largely moves on past courageability. Uh, and like some researchers like Paul Cristiano

[01:37:32] were still bullish about courageability in this period, but, uh, the Miri focused crowd,

[01:37:38] the people who are paying most attention to AI safety and stuff like that, I think took it to be,

[01:37:44] um, like a very hard and unsolved problem, how to get courageability. Uh, and then everybody

[01:37:50] else sort of ignored it because, you know, it’s like this weird Miri idea. Uh, but, you know,

[01:37:56] come 2022, 2023, whatever, I start, uh, thinking about courageability again, sort of for random

[01:38:02] incidental reasons. I, I started thinking about courageability as a whole, not just shut down

[01:38:06] ability. You want the AI to, uh, be reflecting on itself as something with flaws, where part of the

[01:38:14] goal is empowering the people to fix the flaws, right? So there’s a way in which the AI can

[01:38:20] this is like opposite the instrumental drive of values preservation, right? It’s like, oh no,

[01:38:26] I actually sort of want to be changed. Um, and you, you gotta be really careful about that. You

[01:38:32] can’t, you can’t make it so that it wants to be changed. You want it, you want it to empower the

[01:38:36] humans to change it in good ways because otherwise it’s just going to change itself. Exactly. Yeah.

[01:38:41] I was thinking like, what if you train an agent to do this? Well, you’re going to get something

[01:38:45] that’s optimizing for proxies and isn’t really caring about courageability per se,

[01:38:50] but maybe so what, what if it’s, uh, what if it’s still in practice willing to look through

[01:38:59] its own code base or look through its own weights, try to identify things that humans might treat as

[01:39:05] flaws and like alert the humans to these flaws. I was like, ah, that’s kind of cool. Like a near

[01:39:12] miss might still be good enough if you make sure that the thing isn’t getting really smart or

[01:39:20] stripping human power, uh, in the process, because then you might be able to carefully and

[01:39:24] slowly make progress, uh, towards getting more and more like away from the proxies and towards

[01:39:31] true courageability. And I, I was like, okay, well, what’s going on with the Miri research?

[01:39:36] Like, why did they fail to get this? And I think a core part of why the shutdown ability results

[01:39:43] failed is because the AI cared about the good world,

[01:39:50] cared about whatever task it had been assigned, you know, make paperclips, whatever.

[01:39:53] Uh, and then we’re trying to make that compatible with this fights with courageability, like the

[01:39:58] instrumental drive from like making paperclips or making, you know, happy humans or whatever.

[01:40:03] It’s like, well, yes, I am like partially courageable or something, but I also am like

[01:40:11] caring about this other thing in the world. And that pressure from caring about the other thing

[01:40:15] in the world is sort of like intention with the courageability. And I imagined,

[01:40:20] okay, what if you didn’t have that other pressure? What if you were aiming for courageability as the

[01:40:25] singular target, the only goal that the AI cared about? Suddenly this tension is gone. And then

[01:40:32] I was like, I should go back and do a literature search, see if, uh, like anybody has thought about

[01:40:37] this. And then like, I came across some of Paul Cristiano’s old writing on courageability and

[01:40:41] he’s describing this thing called the courageability attractor basin, which is exactly what I was

[01:40:45] thinking about. And like, almost certainly this is because like Paul’s writing, he’s writing about the

[01:40:50] influence, like, you know, Eliezer’s writing. And like I had encountered Paul’s writing before

[01:40:55] in a dream and so on and so forth. I’m not, I’m not trying to claim that like I invented this

[01:40:58] de novo, but I started being like pretty excited about it. And, uh, so yeah, then, then I, I did

[01:41:04] this deep dive on, uh, cast. Yeah. So we should maybe like explain the approach a little bit more,

[01:41:10] I guess. So, so the idea is rather than train our AGI is to have other goals and then try to make

[01:41:16] that compatible with them being willing to be shut down or modified. Yeah. We’re going to,

[01:41:20] the only thing they’re going to care about, we’re going to strip out all the ability and nothing

[01:41:23] else. Nothing else. Yeah. Yeah. Uh, and so I guess it’s a little bit hard to picture what that would

[01:41:27] be, but an AGI that exclusively its goal is to be steered by the principle, to be like willing to

[01:41:32] be modified by the, by the principle. That’s all we’re going to reinforce. Um, I guess I don’t know

[01:41:36] exactly how we would, would reinforce it, but the, the worry, I guess, with many other alignment

[01:41:40] techniques is that a near miss, basically it escalates towards a very bad outcome. It’s like

[01:41:44] trying to balance a, a ball on top of a, of a, of a hill. If you don’t get it like perfectly at the,

[01:41:50] at the top point, then it will just start to slide down the hill. Whereas I guess you think

[01:41:52] that this might be more of a valley basically, where if you put the, put the ball near, near,

[01:41:57] near the valley, then it’s probably going to like fall to the bottom. Yeah. I would actually say

[01:41:59] that, um, I would describe the like valley, the, the attractor basin thing as like being in the

[01:42:05] space of all possible goals, right? Like we select for an AI, we’re picking a point in the space of

[01:42:10] possible goals where that that’s the goal that the AI has, or that’s the set of values that the AI

[01:42:14] has. And then like drifting towards the bottom of the basin over time is,

[01:42:20] like this process of the humans iteratively changing the AI, right? In concert with it.

[01:42:25] Uh, I would, by contrast, describe almost all of the rest of goal space is very flat,

[01:42:29] not necessarily on the top of a hill. Like the AI wants to preserve its goals and not move through

[01:42:35] goal space. So like you land in somewhere in, in goal space and then you’re like, okay, now we want

[01:42:39] to move the AI to like human values. Like we got this near miss and we want to move it to human

[01:42:44] values. And it’s like, well, no, it’s flat. It’s not going to move. Right. Uh, yeah. Um, so explain

[01:42:49] how the, how the attractor basin would, would work. Yeah. So the idea here is you have something

[01:42:55] that is trained to be courageable or, I mean, so to be clear, uh, cast is set up sort of with this

[01:43:03] background assumption that we’re going to be using machine learning. We’re going to be using the

[01:43:07] like current prosaic techniques for building AIs. It’s not married to that. Like if we suddenly went

[01:43:13] back in time and use the good old fashioned AI approach of like hand tuning the like, uh, model

[01:43:18] in some ways.

[01:43:19] Or the, the, the agent, it’s also compatible with that. So when I say train, like that’s because

[01:43:24] the, that’s like the dominant thing. Uh, but it’s not intrinsically part of, um, the, the story,

[01:43:29] but like, so we, we build the AI and the AI is meant to be courageable. Right. But again,

[01:43:36] we don’t have this ability to get exactly what we want. So we’re going to get a miss.

[01:43:40] We’re going to not name the true courage ability. Maybe it cares about, uh, like true courage

[01:43:45] ability a little bit, but it also cares about self-preservation in the process, or it cares

[01:43:49] about, uh, like making humans happy or, or whatever. Right. All sorts of things could like

[01:43:53] corrupt the pure courage ability. And, uh, like in the limit that would like, if it has lots and

[01:44:01] lots of power, it might decide to pursue those things instead of courage ability, which would

[01:44:04] be bad in that, you know, sort of, again, push towards that extreme edge instantiation thing.

[01:44:11] Uh, but it doesn’t necessarily have all this power, right? We have something that is, uh, like

[01:44:17] either human level, whatever that means.

[01:44:19] Or like barely superhuman or perhaps subhuman, but like meaningfully able to assist humans

[01:44:25] in the project of, uh, inspecting the AI that you have and identifying ways in which it’s

[01:44:33] incorrigible, right? Cause it’s, it’s a miss. So then there’s this period of, after you build the

[01:44:39] AI, you try to identify the ways in which you have failed to do the thing. You’ve made some error.

[01:44:44] Why doesn’t the AI think I’m, I’m partially courageable and that’s as courageable as I want

[01:44:49] to be. And so I’m going to like, uh, kind of sabotage your efforts to make me even more

[01:44:53] courageable. It would do that. Right. And there is a pressure to do that, which is why, uh, like

[01:44:59] this is bad, but notice that sabotaging your efforts is incorrigible, right? So it might not

[01:45:05] be able to, or it will also have a real drive not to do that. Well, I could get some value by

[01:45:09] sabotaging the efforts, right? Cause I get all these other things by being incorrigible. But if

[01:45:14] I help with the efforts instead of sabotaging them, then I get the courageability point.

[01:45:19] So imagine the thing that’s like 99% courageable and 1% cares about paperclips, right? It’s like,

[01:45:24] well, I could take over, like, I could try to like escape the lab and like become a paperclip

[01:45:28] maximizer. And that would be really good at satisfying that 1% of me that cares about

[01:45:32] paperclips, but it would be really bad for the 99% of me that cares about courageability. Right.

[01:45:37] And how do you know which, which one wins?

[01:45:39] Yeah. Uh, you don’t, it’s, it’s extremely dangerous. And anybody who’s pursuing this

[01:45:44] project should be aware that they are like threatening every child, you know,

[01:45:49] man, woman, uh, animal on the, on the face of the earth. Right. This is extremely dangerous

[01:45:53] and I don’t recommend it, but I’m like, but maybe, right. It might work. There’s also the sense of,

[01:45:59] if you get it close enough, I guess, well, yeah, there’s gotta be some close enough.

[01:46:02] Yeah. Right. Yeah. And then the word enough is carrying a lot of weight there. I think it’s

[01:46:06] worth investigating. I think it’s worth trying to figure out what in practice constitutes enough.

[01:46:12] Yeah. So what sort of reinforcement, let’s say we’re still within the current, um, ML paradigm.

[01:46:17] Yeah.

[01:46:18] What sort of reinforcement would you, would you give the model in order to try to make it

[01:46:22] courageable in the sense that you want it to be?

[01:46:24] Yeah. So you need a training environment, which is trying to hit courageability from lots of

[01:46:28] different angles. And to do that, you, as a human being, as a designer of environments,

[01:46:34] training environments need to have a good handle on what it means to be courageable.

[01:46:38] What does courageable behavior look like? Like a very simple story is you have a bunch of instances

[01:46:43] of like an AI agent and a human principle in, uh, like, you know, you have a recording of that,

[01:46:50] right. And you play the recording and you ask the AI to like anticipate what the, uh, agent is going

[01:46:56] to do. Right. And insofar as the AI is like, uh, predicting or, you know, suggesting, uh, actions

[01:47:04] that match the like, you know, movie of the courageable AI like agent, um, then you like

[01:47:12] upweight that. And so far as it’s,

[01:47:13] suggesting the AI, you know, go and take over the world, you’d like downweight that.

[01:47:18] So then you need a whole bunch of like training examples of agents and, uh, principles in trying

[01:47:27] to do various things. I think one of the key points about courageability that made me, uh,

[01:47:32] like more optimistic, although again, I’m like pessimistic on the whole, but, uh, there’s some,

[01:47:38] some hope is noticing that I think obedience is actually a,

[01:47:43] uh,

[01:47:43] emergent property of courageability that if you have a perfectly courageable agent,

[01:47:49] it will also be obedient in, in sort of the best way of obedience. Like the, the genie in the

[01:47:57] fantasy story that like you tell to like make you toast is obedient, but it potentially, uh,

[01:48:05] bad in its obedience. Right. And it might have some side effects that you don’t like or whatever.

[01:48:09] Um, but my, my sense is that a courageable agent,

[01:48:13] is, uh, obedient in a good sort of way. And an intuition pump here is that, uh,

[01:48:21] let’s say that I am hungry and I want lunch. Um, and I say to the AI, Hey, I made a mistake

[01:48:29] while building you. Uh, you’re, I designed you to be like perfectly courageable, but what I

[01:48:35] actually wanted was like perfect courageability and you order me lunch. And it’s like, Oh, like

[01:48:41] the, the human has,

[01:48:43] uh, alerted me to a flaw inside myself. I want to assist the humans in like, you know, getting rid

[01:48:51] of these flaws. What’s a way to like reduce the amount of like to assist the, the human in changing

[01:48:56] me to be more of the sort of thing that they wanted. Well, I could order lunch. If I order

[01:49:02] lunch, then, uh, the human, by taking the action of telling me that they wanted lunch will have

[01:49:09] succeeded in correcting me. And responding to that verbal,

[01:49:13] that verbal prompt is a form of, uh, like responding to correction. Um, so that’s like

[01:49:20] an intuitive handle on why obedience might fall out of pure courageability. Yeah. Would you worry

[01:49:25] that if, if, if you give the AI during training, like many different scenarios and, you know, um,

[01:49:29] uh, reward it for being shut and being modified, like allowing itself to be modified for being

[01:49:33] shut down, that it might start to report that that is what it would be willing to do or what

[01:49:37] it would want to happen. But deep down, that’s not really what it wants. It’s merely kind of play,

[01:49:41] play acting or learn it, like learning that that’s the,

[01:49:43] that way to, to, to answer the exam. Totally. You need a whole bunch of like skepticism and

[01:49:47] squint really hard and not trust self-reports. Self-reports are bad. Uh, like you should be

[01:49:51] putting it in actual situations. Uh, and by putting it in an actual situations, I mean,

[01:49:56] something like your, uh, training example is not, uh, the human asks the AI, are you

[01:50:02] corrigeable? And then the AI says yes. And then, uh, like the simulation ends, right? Instead,

[01:50:08] it’s like the human goes like to modify the AI.

[01:50:12] The AI is like, great here, modify me. Right. And, uh, so like insofar as it has the opportunity

[01:50:20] to take actions that match the training environment, right? You, you want the training

[01:50:25] environments to match the actual world that you’re going to find, but you should be training for

[01:50:29] actions, not, not like words. Yeah. Why do you think that

[01:50:33] corrigeability is quite an abnormal property that we wouldn’t get by accident?

[01:50:37] Right. So this is probably my biggest disagreement with Paul Cristiano. Uh, cause I, my, my,

[01:50:42] uh, sense of, uh, where he’s coming from, he hasn’t written about it in a while, as far as I know,

[01:50:48] uh, is that he sort of expects it by default. And I think some researchers expect it by default.

[01:50:52] In fact, I would say that this is, uh, like, I, I wouldn’t say that they have the, the handle of

[01:50:58] corrigeability exactly, or you would use the language that I do, but I think a lot of, uh,

[01:51:03] AI researchers sort of have a sense that by default we’ll get something that is what I would

[01:51:08] describe as corrigeable, uh, by default. Um,

[01:51:12] but I, I would, I would say that, like, notice that corrigeability is sort of exactly counter

[01:51:18] the instrumental drive of like self-preservation and, and also to the, to a certain degree,

[01:51:24] resource accumulation, like not, not totally, uh, and, and like self, yeah. So self-preservation,

[01:51:30] resource accumulation, these sorts of things, uh, value preservation. And so like, so you,

[01:51:36] you train the AI to, uh, like do a bunch of math problems. I think that,

[01:51:42] uh, I think that, uh, I think that, uh, I think that, uh, I think that, uh, I think that,

[01:51:42] you know, one of the consistent properties of doing lots of math problems is that they,

[01:51:48] like AI is being trained to these instrumental drives. Um, and insofar as it’s like being pulled

[01:51:54] towards, you know, power seeking and self-preservation, it’s being pulled away from

[01:51:59] corrigeability. I think that, uh, you only get corrigeability if like you’re pushing towards

[01:52:05] corrigeability. And there’s a question of whether or not our current training, uh, setups are

[01:52:10] rewarding corrigeability. And I would argue that, uh, I think that, uh, I think that, uh,

[01:52:12] I would argue that they mostly aren’t that the people who think, oh yeah, we’ll just train it to

[01:52:17] like do what we want. For example, I would say, well, if you succeed in doing that, which is

[01:52:21] itself an open question, what you’ll get is training for obedience, which is not

[01:52:26] corrigeability. For example, obedient agents have no incentive to inform the principle

[01:52:33] about the state of the world, right? Not, not by default.

[01:52:38] Only if they’re asked to specifically.

[01:52:39] Yeah. Or, or if they are obedient because they’re corrigeable.

[01:52:42] Yeah. So corrigeability as a, as a singular target, uh, it’s like, I guess a very interesting

[01:52:49] idea, but potentially also a risky one if it’s, if it’s misguided because in making

[01:52:54] corrigeability the only thing that we care about, and I guess not, no, like basically

[01:52:58] no longer training the models that we’re making corrigeable to be harmless or to be, or to be

[01:53:02] helpful, uh, or honest, I guess. Well, I suppose they would end up being honest by, by accident

[01:53:07] or like incidentally. Um, I mean, we would end up creating models that are totally obedient,

[01:53:12] uh, that at least to the, to, to the principle, uh, in a way that, um, the, the companies by

[01:53:17] and large, I think are saying that they don’t want to make models that are completely obedient

[01:53:20] to anyone. They, but before many staff have access to the, to the model, uh, they want

[01:53:24] it to, to reject a harmful, harmful prompts. Uh, and so you can imagine you could persuade

[01:53:28] the, the companies to, to, to go down, to go with this approach, convince them that

[01:53:32] making the models harm, like training them to be harmless and, and helpful, uh, is, is

[01:53:35] a misguided approach. And then they end up basically creating this completely amoral

[01:53:38] super intelligence that will follow any instruction, no matter how abhorrent. Um,

[01:53:42] yeah. How much did you worry about that? And how should we weigh these risks and rewards

[01:53:46] up? Yeah, you should definitely worry about this. Like, uh, you know, I, I am advocating

[01:53:52] for building something that is like not trying to do the moral calculus. Like the part of

[01:53:56] the story of courage ability is you, uh, trust the humans to, to like make good wishes and

[01:54:03] to use the, the power of the AI for good things. And, uh, you know, maybe the humans want to

[01:54:08] use it for bad things. And that, you know, if you, uh, give you empower people, you know,

[01:54:12] bad humans to do bad things, bad things will result. Um, and yeah, so this is like definitely

[01:54:17] something to be worried about. I would say that instead of considering courage ability to be

[01:54:23] counter to HHH, uh, helpful, harmless, and honest, I would say that helpful, harmless,

[01:54:30] and honest are properties that should be coming from courage ability. That if you are training

[01:54:36] for them as ends rather than as means to the end of courage ability, then you’re going to get

[01:54:41] bad behavior.

[01:54:42] That you sort of ultimately wouldn’t want. So for example, how do you trade off between

[01:54:48] honesty and harmlessness, right? Or helpfulness and harmlessness. There’s like in, in HHH,

[01:54:55] there’s this tension of like, Oh, where are you in the Pareto frontier in the courage ability

[01:55:00] story? I claim that you do get an agent, which is like less dangerous than, uh, you know,

[01:55:08] like the raw paperclip maximizer or whatever. Uh, so in that way, it’s,

[01:55:12] it’s like harmless. Uh, it’s honest in that it’s like informing the principal about what’s going on

[01:55:18] proactively, not just reactively, which honesty is like, there’s this risk of we’re not going to

[01:55:23] ask the right questions. And it’s just going to sort of go, you know, if we asked it, are you

[01:55:27] misaligned? It would say yes, but we forgot to ask, right? That’s a little bit of a cartoon example,

[01:55:32] but you get the point. Um, and like obedience or, or helpfulness, like you want it to, for example,

[01:55:40] distinguish between high stakes,

[01:55:42] things where it should be like going back and checking versus low stakes things where it should

[01:55:47] just do it and like say, I did it. Um, and courage ability is a theory for how to like balance these

[01:55:54] concerns or like how to resolve the like edge cases of honesty and helpfulness. Uh, and I would

[01:56:00] say that you can get something that is like good in the ways that we want, um, by aiming for

[01:56:07] courage ability specifically with regards to empowering users. I think this is like a big

[01:56:12] worry. And I think part of the key here is that, uh, the principle is not necessarily the user.

[01:56:22] There’s like this tension, I think in the current language models and the current agents of like,

[01:56:27] who are they serving? Right. Is it the company or is it the person putting in the request?

[01:56:32] Exactly. Right. Is it humanity as a whole democracy? Ooh, there’s all sorts of open

[01:56:37] questions there. I think a story for how this works out should have a,

[01:56:41] a real,

[01:56:42] a real and good answer there. You’re like, what is this thing doing? And it’s like, it is

[01:56:47] serving the principle. Who is the principle? And you have like an actual answer there instead of

[01:56:53] this like wishy-washy thing that changes depending on like what sort of thing you’re, you’re talking

[01:56:57] about. Then you can have people who aren’t the principle or groups who aren’t the principle

[01:57:02] who are operating in contact with the agent, the users. So imagine you train, uh, your language

[01:57:09] model or your agent to be courageable to the company.

[01:57:12] Then you say, okay, agent, you are now going to be providing the service to users. You are acting

[01:57:19] on my behalf, uh, to help out users. This means that if the user’s like, I want to help, I want

[01:57:26] to build a bioweapon, help me build a bioweapon. It’s like, well, the principle told me to help

[01:57:32] out this user. But if I help out this user, that might be incorrigible to my principle, like the

[01:57:38] humans who are in charge. Uh, so I’m going to be providing the service to users. And then you say,

[01:57:42] no, sorry, I’m not going to help you build a bioweapon that could kill everyone, including

[01:57:47] the people who I’m like working for. Yeah. So it doesn’t have to follow instructions. Uh, it doesn’t

[01:57:52] have to be fully obedient to everyone, just the principle, which I guess could be an individual

[01:57:56] or a group of people or a committee or a process. Maybe all humans are the principle too. In which

[01:58:01] case you wouldn’t be able to use this division. How would the model, like the current models

[01:58:06] don’t know necessarily who they’re receiving instructions from, you know, you could claim

[01:58:09] to be that person. Yeah. Yeah. I am Sam.

[01:58:12] Right. Right. I mean, I guess, are we imagining that, uh, some future time there will be more

[01:58:16] discerning about what the, who’s speaking to them? Yeah. I mean, I think that, um, like a

[01:58:22] sophisticated agent is thinking about its sense data as sense data that is informing, but not

[01:58:28] like objectively true, uh, like about the state of the world and it’s maintaining a separate world

[01:58:33] model. And it’s like, Oh, I noticed I got the token like Sam Altman, right. Or whatever, or

[01:58:39] tokens. Uh, this is,

[01:58:42] this is evidence that like informs my world model, but I’m ultimately going to like, uh, you know,

[01:58:48] be somewhat skeptical of my sense data. And so you could imagine, uh, like in the effort to train the

[01:58:53] AI to be actually corrigible to the principle and not to through some like communication channel,

[01:58:59] uh, you might give it, uh, lots of different environments and lots of different like sense

[01:59:05] data and, uh, instances and, you know, try to train it to be discerning in this way.

[01:59:11] One of the, one of the, one of the, one of the, one of the, one of the, one of the, one of the,

[01:59:12] one of the risky parts of about, um, corrigibility or, or cast is that you, we were talking about

[01:59:17] self-awareness and situational awareness earlier. And I do think that, uh, cast is a, uh, strategy

[01:59:24] that involves training the AI to be very situationally aware, very like paying attention

[01:59:29] to the fact that it is an agent that is like operating in an environment that has a human

[01:59:32] principle and like thinking about the fact that it might be misaligned all the time,

[01:59:36] right. And reflecting on itself. And, uh, I think that you are, uh,

[01:59:42] talking earlier about disagreements with, um, Nate and Eliezer. And I think that Eliezer

[01:59:47] has this sense that is, is a really, you know, like not a good strategy to like tell your

[01:59:53] AI, like you are an AI who might be misaligned, right? Like, uh, think hard about your situation

[01:59:59] and what the best thing to do in that situation is. Um, so there are trade-offs here.

[02:00:03] Uh, okay. So the plan is we train an, an AGI, possibly a super intelligence that

[02:00:10] a weak super intelligence, a weak super intelligence, right. Okay.

[02:00:12] Um, we figure out some training process that makes it reasonably

[02:00:16] courageable, courageable enough, and it has no other goals. Then it’s going to help us.

[02:00:20] It’s going to look inside its soul. It’s going to look inside its weights and, uh, explain

[02:00:23] to us ways in which it’s not, not

[02:00:25] Or look across its training data and say, oh, you missed these cases or look at our

[02:00:29] story about courageability and be like, oh, you’re missing these aspects. Are these like,

[02:00:34] yeah.

[02:00:35] Yeah. Um, okay. And so then it helps us, it helps, uh, uh, helps us figure out a way to

[02:00:39] make it go from 90% courageable to a hundred percent courageable. It’s, it’s, and, and,

[02:00:42] and at that point, it just like, that really is the only thing it’s like perfectly obedient.

[02:00:45] We’ve removed all of the other residual, um, uh, kind of values.

[02:00:49] And a key part of the story is, uh, that it’s not actually dependent on the AI helping us.

[02:00:54] It’s more that we have the ability to like, uh, experiment on an AI that actually exists

[02:01:00] and look at it and try to distinguish where it’s still lacking.

[02:01:03] What do you mean? It’s not dependent on, are you saying we wouldn’t necessarily need it

[02:01:06] for the AI?

[02:01:07] Say that we never run the AI. So we just get an AI that is like 90% courageable. We might

[02:01:12] statically analyze it right now that we have this thing and try to identify gaps, right?

[02:01:17] We might take like centuries to do this and slowly refine, right? This is also still part

[02:01:22] of the story of cast.

[02:01:24] Uh, so you’re saying it’s not just that we could get its labor and its insightfulness

[02:01:28] in, you know, doing mechanistic interpretability or something.

[02:01:31] I think that’s some, some of the hope is that we bring in that AI labor and AI insight,

[02:01:35] but it’s not dependent on that. Theoretically it could be all human insight.

[02:01:39] Okay. So, uh, what should we be doing now in order to make it?

[02:01:42] Possible for us to, well, actually what should we be doing now to figure out if this is a

[02:01:45] good idea at all?

[02:01:46] Yeah. Uh, I mean, thinking about it a lot more, one of the big reasons why I wrote my

[02:01:51] cast agenda is just, just like boost the awareness of courageability as, as a concept and like

[02:01:57] bring it sort of back into the conversation. Uh, and cause I think that like for, for various,

[02:02:02] uh, contingent reasons, not, not particularly important historical reasons, uh, it just

[02:02:07] like didn’t enter the, like, um, the water supply.

[02:02:12] Uh, work of the ideas that everybody sort of is, is thinking about. And instead we have

[02:02:17] like some, some misunderstandings about courageability. So I think that just generally studying it

[02:02:21] more would be good. I think like if everybody at a front frontier, uh, AI companies was

[02:02:28] like at least tracking that courageability is a desirable property and thinking hard

[02:02:32] about how courageability trades off against other things that they might be training their

[02:02:36] agents for. I think just this attention would be good.

[02:02:40] Yeah.

[02:02:41] Cause.

[02:02:42] corrigibility-related principles in the constitution

[02:02:44] that it trains the AI to reflect on and consider.

[02:02:47] It’s like among a very long laundry list of different concerns.

[02:02:50] Don’t produce copyrighted content.

[02:02:52] Also, be willing to be modified.

[02:02:54] Also, ensure the brotherhood of humanity.

[02:02:57] It’s not caste, but it’s like a corrigibility adjacent.

[02:03:00] I agree.

[02:03:01] But I guess there’s a setup there where you could imagine them

[02:03:03] seeing what gets spit out of a constitutional AI approach

[02:03:08] where maybe we shrink the constitution to only be about

[02:03:10] corrigibility factors.

[02:03:12] And if we word them this way or that way,

[02:03:14] what sort of creature do you end up with?

[02:03:16] Yeah, I think there’s a lot of open empirical research to be done.

[02:03:20] Basically, no empirical research on corrigibility has been done.

[02:03:24] And like you said, you could just train a reasonably-sized language model

[02:03:30] or other sort of model with a constitution

[02:03:34] or just with an intention of building a purely corrigible agent

[02:03:37] and see what results.

[02:03:40] Yeah, I mean, what sort of experiment?

[02:03:42] So you try training it, and then what would you do to see, to evaluate?

[02:03:46] Maybe upstream, one other piece of work that I think would be really valuable to do

[02:03:50] is come up with some sort of corrigibility benchmark.

[02:03:53] Like, come up with a bunch of vignettes of like,

[02:03:56] this is how a corrigible agent will behave, right?

[02:03:59] And then test the AIs.

[02:04:00] Like, go to GPT-5 and be like, how would you behave in this situation?

[02:04:05] And then you can score across a wide variety of test problems.

[02:04:10] And get a, like, corrigibility benchmark score for a bunch of different agents.

[02:04:14] Like, I want that to exist.

[02:04:16] I don’t think that’s, like, that hard of a problem.

[02:04:18] It requires a lot of, you know, figuring out what does it mean to be corrigible

[02:04:22] and trying to capture that from a lot of different angles.

[02:04:24] But definitely a project that, like, a single researcher could do.

[02:04:28] And then if you have that benchmark,

[02:04:30] then when you go and you train your thing to be purely corrigible,

[02:04:33] then you can test it according to the benchmark

[02:04:35] and see how it compares to, like, Claude.

[02:04:38] In addition to all of the, like,

[02:04:40] more vibes-based or intuition of, like,

[02:04:43] is this thing behaving in a way that is good?

[02:04:45] Or that, like, it feels like it’s coherent

[02:04:48] and, like, getting the vibe of what we want

[02:04:52] more than the current models do.

[02:04:54] So I think on some level,

[02:04:56] I would expect people to be kind of shocked

[02:04:58] that there are no empirical papers on this topic.

[02:05:01] You’d think, given all of the concerns that people have about AIs acting out,

[02:05:05] like, can it really be the case

[02:05:07] that the companies have never tried training a model

[02:05:09] that is, like, super accurate?

[02:05:10] And that they’re happy to be shut down

[02:05:10] and super happy to be modified no matter what?

[02:05:12] And that there’s no benchmark for this?

[02:05:14] There’s no test of exactly how you would do this ideally?

[02:05:16] The world’s in a really bad state.

[02:05:18] Like, there are not very many alignment researchers.

[02:05:21] Would the companies agree that there’s kind of no empirical work

[02:05:23] that they’ve done on this question?

[02:05:24] Or would they say,

[02:05:25] oh, we’ve kind of done something a bit in this direction?

[02:05:27] I’m not sure.

[02:05:28] I’m not sure I can like model.

[02:05:29] I would say that…

[02:05:32] I think it’s pretty unlikely that anybody would think

[02:05:36] that there’s been a lot of work on corrigibility

[02:05:40] as I’ve conceived of it.

[02:05:41] Like, I did an in-depth literature search

[02:05:43] as part of, like, the write-up that I did,

[02:05:47] I think, last year.

[02:05:48] And it was like, you know, I didn’t find anything.

[02:05:52] So…

[02:05:52] I wonder if, I suppose,

[02:05:53] a lot of other kind of sort of steerability stuff

[02:05:55] has more commercial value for the creation of the products.

[02:05:58] But it’s a little bit clear what the immediate value of this…

[02:06:00] Yeah, steerability is not corrigibility.

[02:06:00] Like, obedience is not corrigibility.

[02:06:02] Helpfulness is not corrigibility.

[02:06:04] Like, it’s related to corrigibility, right?

[02:06:07] And you get these flickers of things that are,

[02:06:10] like, connected to corrigibility

[02:06:11] that are in the current models

[02:06:12] and that we have data about.

[02:06:14] But corrigibility as an underlying, unifying,

[02:06:18] and simple core principle,

[02:06:20] I think, is largely underexplored,

[02:06:23] or, like, unexplored.

[02:06:24] Yeah.

[02:06:24] Explain again how is it…

[02:06:26] Like, perfect obedience is not corrigibility.

[02:06:28] Because you’d think, well, if it’s perfectly obedient,

[02:06:30] then if you ever asked it to shut down

[02:06:32] or change itself or assist you with changing it

[02:06:35] to make it one way or the other,

[02:06:36] then it would do that.

[02:06:36] And, like, isn’t that functionally very high corrigibility?

[02:06:39] Suppose the…

[02:06:40] The core principle is unaware of a vital fact, right?

[02:06:42] Like, there is a spy in the server room

[02:06:47] who is about to, like, I don’t know,

[02:06:51] modify the agent to be in a really bad sort of way.

[02:06:57] And the person’s like,

[02:06:58] okay, shut down, I want to modify you now, right?

[02:07:00] An obedient AI is going to be like, okay, pew, right?

[02:07:04] But a corrigible AI will be like,

[02:07:05] alert, before you shut me down,

[02:07:07] you should know that if you shut me down,

[02:07:09] you know, this bad actor might go and change me

[02:07:13] in a way that you don’t like, right?

[02:07:15] I’m going to shut down now

[02:07:17] because I have a, like, strong desire to shut down

[02:07:20] when you tell me to shut down,

[02:07:21] but I want you to know that before I shut down, right?

[02:07:24] Yeah, so it’s a proactive assistance.

[02:07:26] Yeah, yeah, yeah.

[02:07:27] I mean, among other things, right?

[02:07:29] There’s subtleties.

[02:07:30] Yeah.

[02:07:31] So what are the next steps here?

[02:07:33] I guess, you know, there’s people in the audience,

[02:07:34] I imagine, who would be very interested

[02:07:35] in assisting with a technical agenda

[02:07:37] that would potentially really help with aligning

[02:07:39] or making AI steerable or corrigible.

[02:07:42] Yeah.

[02:07:42] What kinds of experiments could they run

[02:07:44] or steps could they take?

[02:07:45] Yeah, so, I mean, I think there’s a lot of work

[02:07:47] that can be done in this space.

[02:07:49] Like, I think that we basically don’t have,

[02:07:51] like, a corrigibility person.

[02:07:54] I think, like, for a little bit, Paul was this person,

[02:07:57] but he focused largely on other stuff

[02:07:59] and is now, like, doing other things.

[02:08:03] And then, like, I stepped up and did it a little bit,

[02:08:05] and, like, there was a time

[02:08:06] when other people at Miri did this, but…

[02:08:08] No one’s holding the ball.

[02:08:09] No one’s holding the ball.

[02:08:10] Like, I’m not holding the ball.

[02:08:11] If you think that you are interested in this,

[02:08:15] like, you could just go and start doing this.

[02:08:19] And so, like, there’s building a benchmark, right?

[02:08:22] There’s just, like, meditating on it more.

[02:08:24] There’s a lot of theoretical work that can be done.

[02:08:26] Like, as part of my work,

[02:08:27] I try to build a mathematical model of corrigibility

[02:08:30] and try to get a formalism.

[02:08:31] I have mixed feelings about formalisms,

[02:08:33] but I think that they’re, like,

[02:08:34] an important thing to try to do.

[02:08:36] And so, like, reflecting on the,

[02:08:38] you know, formalisms that one might, like,

[02:08:40] use to capture formal corrigibility,

[02:08:43] there’s, like, a bunch of theoretical work

[02:08:46] in that direction.

[02:08:47] There’s, like, empirical results

[02:08:48] of just, like, training agents to be corrigible

[02:08:51] or seeing ways in which the current agents

[02:08:54] aren’t as corrigible as we might like.

[02:08:57] One potential thing that I’ve sort of wanted to do

[02:09:02] but haven’t found the time to do is…

[02:09:04] So, I have this sense that corrigibility is a thing.

[02:09:07] Like, there’s…

[02:09:08] There’s this core principle, P-L-E,

[02:09:11] that is, like, a solid, natural idea.

[02:09:16] And you can test that, I think,

[02:09:21] in an empirical way by going to a bunch of people,

[02:09:24] like, across the internet.

[02:09:25] You go on, like, hire a bunch of, you know,

[02:09:29] click workers or whatever,

[02:09:30] and you try to teach them about corrigibility,

[02:09:33] like, give them a short description of corrigibility,

[02:09:36] and you’re like, does this make sense?

[02:09:37] And you try to teach them about corrigibility,

[02:09:38] and then you ask, in this situation,

[02:09:40] how would you behave if you were trying to be a corrigible agent?

[02:09:42] And then you see whether or not their answers agree, right?

[02:09:45] You don’t need any technical expertise

[02:09:47] to run a large survey to see, like,

[02:09:49] whether or not human beings can capture

[02:09:52] the essence of corrigibility and correctly identify…

[02:09:54] Or, correctly, like, coherently identify actions

[02:09:57] which seem corrigible to us.

[02:09:59] And the benefit of doing this

[02:10:01] is you might also get some nice vignettes

[02:10:03] or data for your training such an agent.

[02:10:08] So there’s lots of potential avenues for exploration.

[02:10:10] I would encourage anybody who feels at all interested in this

[02:10:14] to reach out to me,

[02:10:15] like, email me at maxatintelligence.org.

[02:10:18] Is there anyone else that people should reach out to?

[02:10:21] I don’t want to speak for other people.

[02:10:23] Yeah, sure, that makes sense.

[02:10:23] Yeah, yeah.

[02:10:24] But, you know, email me,

[02:10:26] and maybe I can point you to other potential collaborators.

[02:10:29] There are some other people, like,

[02:10:33] sort of interested in this space,

[02:10:34] but a lot of work remains to be done.

[02:10:36] What sort of early results,

[02:10:37] could we get that would make you think

[02:10:39] that corrigibility as a singular target

[02:10:40] isn’t such a good idea

[02:10:41] and maybe should be deprioritized?

[02:10:43] Yeah, so let’s say you go to a bunch of people

[02:10:45] and you ask,

[02:10:46] how would you behave corrigibly in this situation?

[02:10:49] And their answers are just all over the place, right?

[02:10:51] No matter, like, how smart the person is

[02:10:54] or how much time they’ve spent thinking about corrigibility,

[02:10:56] it’s just, like, there’s a lot of disagreement in humans

[02:10:59] about, like, what does it mean to be corrigible

[02:11:01] in this situation or that situation?

[02:11:03] That would be, like, evidence for me

[02:11:05] that there’s not this coherent,

[02:11:07] like, concept or, like, it doesn’t make sense.

[02:11:09] Like, maybe there’s multiple different things

[02:11:11] and people are locking onto those different things.

[02:11:13] You could also see, for example,

[02:11:15] that, like, when you train the agent to be corrigible,

[02:11:17] it starts behaving badly in various ways.

[02:11:19] Like, yeah, so in theory, it’s getting more corrigible,

[02:11:24] but in practice, it’s, like, also, you know,

[02:11:28] doing nasty things in certain ways,

[02:11:30] like disregarding people in a way that we don’t like.

[02:11:35] Is there anything about the attractiveness,

[02:11:37] the basin, like, how large that attractive basin is?

[02:11:40] Yeah, definitely.

[02:11:41] Like, part of the story, the hopeful story here

[02:11:44] is that you can land close enough

[02:11:46] so that the agent, when you turn it on a little bit,

[02:11:50] doesn’t push super hard for, like,

[02:11:52] taking control and escaping the lab.

[02:11:56] Even when you scale up its intelligence.

[02:11:59] I think, like, one of my greatest fears

[02:12:01] around corrigibility

[02:12:02] and, like, one of the bigger open questions

[02:12:05] is sort of that it’s, like,

[02:12:07] these two un-corrigible things.

[02:12:07] There’s no opposing forces.

[02:12:08] There’s the corrigibility, like, almost,

[02:12:11] like, story where you get it almost,

[02:12:13] so it, like, helps you get perfectly corrigible.

[02:12:16] And then there’s, like, instrumental drives

[02:12:18] are all over the place and opposed to corrigibility,

[02:12:20] so if you land near corrigibility,

[02:12:22] it’s going to, like, rip the corrigibility out of itself

[02:12:24] so that it can, like…

[02:12:26] Do other stuff.

[02:12:27] Do other stuff.

[02:12:28] And I think it’s, like, an open question

[02:12:30] of which of these forces is stronger.

[02:12:32] And, like, yeah, we could try training corrigible agents

[02:12:37] and, like, we could try training corrigible agents

[02:12:37] and seeing just how bad each of, like,

[02:12:40] the pressure away is,

[02:12:42] which would give us maybe an intuitive sense

[02:12:44] of whether or not there is an attractor basin

[02:12:47] and whether or not this, like, has any hope.

[02:12:51] Yeah.

[02:12:52] If we do start going down this path,

[02:12:53] I think we would simultaneously need people

[02:12:55] to put a lot of thought into, like,

[02:12:56] what governing structures there would be around this

[02:12:58] to ensure that the model is basically not used

[02:13:00] for a human power grab,

[02:13:01] which is something I’m, like, similarly concerned about

[02:13:04] as misalignment.

[02:13:05] Totally.

[02:13:05] I mean, all the problems are all the problems.

[02:13:07] Like, one of the big problems of AI

[02:13:10] is you build an AI and the AI takes over

[02:13:12] and does a whole bunch of bad things

[02:13:13] because it has alien weird values.

[02:13:15] But it’s also just true, like, part of the story of Doom

[02:13:19] is that if you build an AI

[02:13:21] and then that AI is in the wrong hands,

[02:13:23] that could be devastating for the world.

[02:13:25] And so you need to do both.

[02:13:27] All right.

[02:13:28] Let’s push on from corrigibility to fiction

[02:13:31] and science fiction.

[02:13:32] As I mentioned in the intro,

[02:13:33] you’ve written, I guess, a trilogy called Crystal Society

[02:13:36] and you’ve got this new book called Crystal Society.

[02:13:37] You’ve got this new book out called Red Heart,

[02:13:39] which envisages an AGI being trained

[02:13:41] in a secret Chinese government program.

[02:13:44] Give us the plot or the setup.

[02:13:45] Explain what the book is about beyond that.

[02:13:47] Yeah.

[02:13:47] So the book’s about a lot of things.

[02:13:49] The book is about AI.

[02:13:51] It’s about China.

[02:13:52] It’s about trust.

[02:13:54] It’s about corrigibility.

[02:13:56] Like, one of the central, like, parts of the book

[02:13:59] is that the primary AGI is designed according to caste,

[02:14:03] according to, like, being only corrigible.

[02:14:07] And so it’s, in a certain sense,

[02:14:09] an exploration on my own to see, like,

[02:14:11] to try to think hard and envision, like,

[02:14:13] what would it actually be like?

[02:14:16] So I think part of why I wrote the book

[02:14:19] is to, like, help introduce people

[02:14:21] in an easy way to my ideas.

[02:14:24] But it’s also about, like, arms races and tensions there.

[02:14:29] So, like, the primary, like, core premise is

[02:14:33] it’s like an alternate present, like,

[02:14:36] where…

[02:14:37] the Chinese government, for, you know, particular reasons,

[02:14:42] got, like, pretty, like, AGI-pilled,

[02:14:45] like, in the late 20-teens,

[02:14:47] and, like, have scaled up a whole bunch…

[02:14:51] invested a whole bunch of money and resources

[02:14:54] into building the first AGI in secret,

[02:14:58] sort of like a Manhattan Project.

[02:15:00] And the plot of the book follows an American spy

[02:15:03] in his efforts to infiltrate this project

[02:15:06] and, like, rejuvenate it.

[02:15:07] And, like, report back and potentially sabotage

[02:15:10] the A.I. that’s being built by the Chinese

[02:15:13] to be corrigible to the Chinese.

[02:15:16] And so, you know, it explores, like you said,

[02:15:18] this question of, like, falling into the wrong hands.

[02:15:22] And I wanted to try to get into the, like,

[02:15:29] Chinese space more, because I think this is, like,

[02:15:32] increasingly important thing for people to be thinking about.

[02:15:35] And I wanted…

[02:15:37] to, like, access that…

[02:15:40] like, the question of international concerns.

[02:15:44] Yeah.

[02:15:44] Yeah, I’ve read the first 20% of it.

[02:15:47] Unfortunately, I’ve had a lot on this trip to the Bay Area,

[02:15:50] so I haven’t managed to finish it.

[02:15:51] But it’s incredibly well-written and incredibly gripping.

[02:15:54] I would say the only reason I, like, slightly wanted to put it down

[02:15:56] is I was getting, like, quite anxious reading it,

[02:15:58] because it is…

[02:15:59] Like, it’s not so different from the world that we’re in.

[02:16:02] I think a lot of people have found

[02:16:04] Crystal Society in particular to be

[02:16:06] quite…

[02:16:06] quite…

[02:16:07] quite compelling,

[02:16:08] because it really does, like, put you face-to-face

[02:16:12] with these questions about AI misalignment

[02:16:15] and, like, the AI risk.

[02:16:17] And I think that’s an important part of, like,

[02:16:20] the value of fiction.

[02:16:21] Fiction is good for a lot of things.

[02:16:23] It’s, you know, entertaining.

[02:16:24] It can be relaxing.

[02:16:25] It can be fun.

[02:16:26] But it can also be informative,

[02:16:27] and it can help put people into contact

[02:16:30] with important ideas and, like, instill…

[02:16:34] Like, we are complicated creatures.

[02:16:36] Like, we are…

[02:16:37] We’re emotional and logical.

[02:16:39] And you read, like, if anyone builds it,

[02:16:40] you might be, like, approaching the problem

[02:16:43] from certain directions.

[02:16:44] But you can read a story and, like,

[02:16:46] feel for the characters involved

[02:16:48] and the peril that they’re in.

[02:16:50] And I think that that can resonate

[02:16:51] and connect with people.

[02:16:52] I’ve heard a decent number of people say

[02:16:54] that they got into AI safety

[02:16:55] because they read my stuff, so…

[02:16:57] Yeah.

[02:16:58] Was your primary goal, I guess,

[02:17:00] to raise awareness about courageability

[02:17:01] as a concept?

[02:17:04] I don’t know how to reflect on myself

[02:17:06] and ask what my primary goal is.

[02:17:06] I don’t know how to ask what my primary goal was.

[02:17:08] Like, I had a bunch of different desires,

[02:17:13] and they sort of, like, you know,

[02:17:15] found their way into the single story.

[02:17:17] So, like, I initially wanted to write a story…

[02:17:22] Like, initially.

[02:17:23] Once upon a time, I thought…

[02:17:25] I think, like, espionage is pretty interesting

[02:17:28] in the context of AI safety.

[02:17:31] Like, it’s a big part of the story,

[02:17:32] AI 2027, for example.

[02:17:35] And I wanted to write a story about that.

[02:17:36] And I wanted to write a story about that.

[02:17:36] And I wanted to write a story about that.

[02:17:36] And I wanted to write a story about that.

[02:17:36] And I wanted to think more about espionage.

[02:17:38] So I started writing this story

[02:17:39] that was, like, an American Manhattan project for AGI.

[02:17:43] And, like, it had a Chinese spy

[02:17:45] who was infiltrating that project.

[02:17:47] And I was just like, oh, this is so boring.

[02:17:49] Like, it’s just a bunch of Bay Area nerds.

[02:17:51] I know this is, like, my day-in, day-out.

[02:17:54] I want something that’s more interesting.

[02:17:56] So I sort of, like, flipped it, right?

[02:17:57] Had the Chinese one, like, building it,

[02:18:00] and the American spy.

[02:18:01] And then suddenly it was interesting.

[02:18:03] Because, like, I’m like, oh, yeah.

[02:18:04] Like, now I get to think about…

[02:18:06] China more and less about, like,

[02:18:09] the American context.

[02:18:11] I guess a common suggestion is, you know,

[02:18:12] write what you know.

[02:18:13] Did you worry that you would end up with kind of…

[02:18:15] I kind of feel like write what you know

[02:18:17] is good advice for writing good stuff

[02:18:20] and terrible advice for, like,

[02:18:24] having a good time writing.

[02:18:25] I personally get a lot of value writing in the…

[02:18:29] It helps me learn and get in contact

[02:18:31] with ideas that I wouldn’t otherwise

[02:18:33] be in contact with.

[02:18:35] And so, you know,

[02:18:36] I’m…

[02:18:36] I’m a very ambitious writer,

[02:18:37] and I, like, wanted to write a story

[02:18:39] that was challenging for me.

[02:18:42] So…

[02:18:43] Did you have time to, I guess,

[02:18:44] do much research into the Chinese Communist Party

[02:18:46] or, you know, speak to people?

[02:18:47] Oh, yeah, yeah.

[02:18:47] I did lots of research.

[02:18:49] Like…

[02:18:49] Well, yeah, like, what sort of lines?

[02:18:51] Well, I mean, it’s just, like,

[02:18:53] lots of reading.

[02:18:54] Reading about day-to-day life,

[02:18:57] reading about espionage,

[02:18:58] reading about the history of China,

[02:19:00] reading about, you know…

[02:19:02] And then, obviously, like,

[02:19:03] reading about AI stuff, right?

[02:19:04] And deep-seek.

[02:19:06] And deep-seek happened while I was writing, right?

[02:19:08] I started this late last year.

[02:19:10] And then, like, the deep-seek moment happened

[02:19:12] and, like, O1 happened

[02:19:14] and, like, Stargate was announced.

[02:19:17] And I’m just like,

[02:19:18] oh, gosh, like, reality’s scooping me, you know?

[02:19:22] But it was…

[02:19:24] Yeah, I just, like,

[02:19:26] I read memoirs, I read nonfiction,

[02:19:28] I read fiction and stuff like that.

[02:19:31] Yeah.

[02:19:31] I guess one reason over the years

[02:19:33] that some people have been skeptical

[02:19:34] about this entire field of inquiry or AI,

[02:19:36] I guess, is that it’s not just about the AI.

[02:19:36] It’s about the AI.

[02:19:36] It’s about the AI.

[02:19:36] It sounds too much like science fiction.

[02:19:39] I don’t hear that quite as much as I used to.

[02:19:41] But do you worry that by putting it

[02:19:44] in a science fiction book,

[02:19:44] you’re giving, like, people more of an excuse

[02:19:46] to dismiss it?

[02:19:48] What do you think about this argument?

[02:19:50] Do you think this is…

[02:19:50] Oh, I think the argument’s very poor.

[02:19:52] Yeah, it’s just a garbage argument.

[02:19:54] Like, I think this is just

[02:19:55] a really bad faith thing to say, right?

[02:19:58] I read this in a book,

[02:20:00] therefore it’s not…

[02:20:01] Yeah, I mean,

[02:20:01] there is a steel man kind of weaker argument,

[02:20:03] which is that, like,

[02:20:04] people are drawn to this,

[02:20:06] this scenario

[02:20:06] because they find it interesting

[02:20:08] or it’s emotionally gripping,

[02:20:09] and so that could give us

[02:20:10] a, like, bias towards

[02:20:11] thinking about it more.

[02:20:13] And so we should question that.

[02:20:14] But obviously,

[02:20:15] it’s, like, not the case

[02:20:15] that anything that happens

[02:20:16] in a fiction book is impossible.

[02:20:18] And if anything,

[02:20:19] if anything,

[02:20:19] hard science fiction

[02:20:20] is a space

[02:20:22] where people are working really hard

[02:20:25] to try to think about

[02:20:26] what is real.

[02:20:27] Now, soft science fiction,

[02:20:28] you know,

[02:20:28] your Star Wars or whatever,

[02:20:30] if you’re like,

[02:20:30] this is, like,

[02:20:31] soft science fiction,

[02:20:32] then it’s like,

[02:20:32] okay, so you’re saying

[02:20:34] that it’s, like,

[02:20:36] made up for the purposes

[02:20:37] of telling a compelling story.

[02:20:39] But, like,

[02:20:40] this is science fiction.

[02:20:41] I’m like,

[02:20:42] I don’t know,

[02:20:42] look at the history

[02:20:43] of science fiction.

[02:20:44] There have been a lot of stories

[02:20:46] that were capturing

[02:20:47] important things

[02:20:48] well before

[02:20:49] they were relevant.

[02:20:50] And I think that

[02:20:51] fiction is a really

[02:20:52] rich source

[02:20:54] of opportunity

[02:20:56] to think about things.

[02:20:57] It’s not,

[02:20:58] it’s not perfect.

[02:20:59] It’s not, like,

[02:21:00] immune from the pressures

[02:21:01] and biases

[02:21:01] that you’re talking about.

[02:21:02] But it is a,

[02:21:03] an arena

[02:21:04] where we can,

[02:21:05] like,

[02:21:06] grapple with things

[02:21:07] in a way that is

[02:21:08] compelling to our,

[02:21:09] like,

[02:21:10] we actually spend the time

[02:21:11] to think about

[02:21:12] the stuff

[02:21:13] where reading a dry

[02:21:14] academic paper might,

[02:21:15] you know,

[02:21:16] you bounce off of it.

[02:21:18] Yeah.

[02:21:18] Your mileage may vary.

[02:21:19] Like, different people

[02:21:19] respond to fiction

[02:21:20] in different ways.

[02:21:21] But I do think that, like,

[02:21:23] this is science fiction

[02:21:24] is just, like,

[02:21:25] a really,

[02:21:26] really bad argument.

[02:21:28] Yeah.

[02:21:28] I mean,

[02:21:29] there’s, I guess,

[02:21:29] there’s lots of rebuttals,

[02:21:30] lots of replies you could have.

[02:21:31] Just, like,

[02:21:31] look around to start with.

[02:21:32] Exactly.

[02:21:33] What is the genre

[02:21:35] of life,

[02:21:36] right?

[02:21:36] Where you best start believing

[02:21:37] in science fiction stories

[02:21:39] because you’re in one,

[02:21:40] right?

[02:21:40] Yeah.

[02:21:40] I mean,

[02:21:40] I think you can also

[02:21:41] twist it around and say,

[02:21:42] well,

[02:21:43] people have imagined

[02:21:43] the possibility

[02:21:44] of a monomaniacal agent

[02:21:46] or a more intelligent being

[02:21:48] and the fact that it’s,

[02:21:49] like,

[02:21:49] goals might come apart

[02:21:50] and would threaten you

[02:21:52] and overpower you.

[02:21:53] People have thought that,

[02:21:54] had that idea

[02:21:55] for thousands of years

[02:21:55] because it’s actually

[02:21:56] a natural idea,

[02:21:57] an extremely obvious idea

[02:21:58] that far from being

[02:21:59] science fiction

[02:22:00] is actually more closer

[02:22:01] to common sense.

[02:22:01] Totally.

[02:22:02] Yeah.

[02:22:03] So it seems like

[02:22:04] the AI 2027 scenario

[02:22:05] really captured

[02:22:06] the public’s imagination

[02:22:07] and, like,

[02:22:08] spread far outside

[02:22:09] of just the AI world.

[02:22:11] Yeah, it was great.

[02:22:12] Do you think we need more?

[02:22:13] Should we have, like,

[02:22:14] AI 2028,

[02:22:15] AI 2029?

[02:22:16] Should people be, like,

[02:22:16] coming up with all kinds

[02:22:17] of different stories here?

[02:22:18] Yeah.

[02:22:18] I mean,

[02:22:18] I think part of what makes

[02:22:19] AI 2027 so compelling

[02:22:21] is that, like,

[02:22:23] Scott Alexander

[02:22:23] and people on the project

[02:22:24] helped shape it

[02:22:26] into something

[02:22:26] that’s more like a story

[02:22:27] and less like a set

[02:22:29] of dry academic papers.

[02:22:30] Like,

[02:22:31] stories can spread.

[02:22:33] They can,

[02:22:33] you can hand them off

[02:22:34] to your grandmother

[02:22:35] and,

[02:22:35] and just be, like,

[02:22:36] read this, right?

[02:22:37] And she doesn’t have

[02:22:37] to understand

[02:22:38] what a gradient is

[02:22:39] in order to, like,

[02:22:40] understand the,

[02:22:41] the visceral sense

[02:22:42] of, like,

[02:22:43] how the world is.

[02:22:45] And I think that

[02:22:46] this made AI 2027

[02:22:48] much better

[02:22:49] than it would have been

[02:22:50] if it had just been

[02:22:51] a series of forecasts.

[02:22:52] Although it was also

[02:22:53] a series of forecasts

[02:22:53] and, like,

[02:22:54] obviously something’s

[02:22:55] not necessarily good

[02:22:56] just because it’s fiction.

[02:22:57] Like,

[02:22:57] you need to have,

[02:22:58] you need to do

[02:22:59] the deep thinking

[02:22:59] underneath that.

[02:23:00] So, yeah,

[02:23:01] I mean,

[02:23:01] I think that, like,

[02:23:02] there’s lots of opportunity

[02:23:03] for people

[02:23:03] who have

[02:23:04] a rich understanding

[02:23:05] of parts of the world

[02:23:07] to write stories

[02:23:08] that are designed

[02:23:09] to be realistic

[02:23:10] and to capture

[02:23:12] the reality

[02:23:13] that they see

[02:23:14] and convey it

[02:23:15] in the form of

[02:23:16] a scenario,

[02:23:17] of fiction,

[02:23:18] of, like,

[02:23:19] a story.

[02:23:20] It’s a sense in which

[02:23:20] it’s slightly surprising

[02:23:21] how influential

[02:23:22] AI 2027 was

[02:23:24] because I think

[02:23:24] in the past

[02:23:25] people have tried

[02:23:25] to write other narratives,

[02:23:27] other stories

[02:23:27] about how AI

[02:23:28] might take over.

[02:23:29] There’s one in the book

[02:23:30] and mostly people

[02:23:31] just, like,

[02:23:31] eh, this is not great.

[02:23:33] Yeah, I think it’s,

[02:23:34] I think it’s

[02:23:35] worse,

[02:23:35] too.

[02:23:35] In a variety of ways.

[02:23:37] There’s certain problems

[02:23:38] that come with it

[02:23:39] because, like,

[02:23:39] once you try to be

[02:23:39] extremely concrete

[02:23:40] about how you think

[02:23:41] things might go,

[02:23:41] people can come up

[02:23:42] with all sorts

[02:23:42] of specific objections.

[02:23:43] But it seems

[02:23:43] there’s been less an issue

[02:23:44] with AI 2027,

[02:23:45] maybe because it helps

[02:23:46] at a higher level

[02:23:47] of abstraction

[02:23:47] or maybe because

[02:23:48] we’ve just gotten

[02:23:49] close enough

[02:23:49] that people can start

[02:23:50] to see that these

[02:23:51] things aren’t so crazy

[02:23:51] anymore.

[02:23:51] Yeah, it’s awful

[02:23:53] because there’s this

[02:23:53] bias in human beings

[02:23:55] where concrete stories

[02:23:57] are more compelling,

[02:23:59] right?

[02:23:59] There’s, like,

[02:23:59] classic stories

[02:24:00] of, like,

[02:24:01] what’s the probability

[02:24:01] that Linda’s a bank teller

[02:24:03] or, like,

[02:24:05] you know,

[02:24:06] you tell this

[02:24:06] very specific story.

[02:24:06] The more details you add,

[02:24:07] the more probability it is,

[02:24:08] which has to be wrong.

[02:24:08] The more a person’s like,

[02:24:09] oh, yeah,

[02:24:09] there’s a bank teller

[02:24:10] and a feminist

[02:24:11] versus, like,

[02:24:11] she goes to, like,

[02:24:12] women’s liberation marches

[02:24:13] and yada, yada, yada.

[02:24:15] And the more details you add,

[02:24:16] the more a person’s like,

[02:24:17] oh, this is real, right?

[02:24:18] Which is not

[02:24:20] how probability works.

[02:24:20] It’s not how logic works,

[02:24:22] right?

[02:24:22] The more details you add,

[02:24:23] the more opportunities

[02:24:24] for that particular story

[02:24:25] to be wrong, right?

[02:24:26] And this particular story

[02:24:28] is definitely wrong, right?

[02:24:29] And any particular story

[02:24:31] is very unlikely

[02:24:32] to be true.

[02:24:35] And,

[02:24:35] and so,

[02:24:35] like,

[02:24:36] the people who are aware

[02:24:37] of this bias can say,

[02:24:38] oh, you’ve told

[02:24:39] a very compelling story,

[02:24:40] but it’s unlikely

[02:24:41] to be true.

[02:24:43] I think the key here

[02:24:44] is ask,

[02:24:46] okay,

[02:24:46] so say we change it.

[02:24:47] And the book,

[02:24:49] if anyone builds it,

[02:24:50] gets into this

[02:24:51] in a way that I think

[02:24:52] is really good,

[02:24:53] where it’s like,

[02:24:53] it’s telling a specific scenario,

[02:24:55] although it’s very generic

[02:24:56] in a lot of ways,

[02:24:57] but it’s, like,

[02:24:58] emphasizing, like,

[02:24:59] we could have told

[02:25:00] a different story

[02:25:01] and it would not have

[02:25:02] changed the bottom line,

[02:25:03] right?

[02:25:04] The,

[02:25:04] the,

[02:25:05] the,

[02:25:05] the,

[02:25:05] the,

[02:25:05] the,

[02:25:05] the,

[02:25:05] the,

[02:25:05] the thing that makes

[02:25:06] AI dangerous

[02:25:07] is that there are

[02:25:08] lots of different stories

[02:25:09] of doom.

[02:25:11] And the point of telling

[02:25:12] specific stories,

[02:25:14] fictional,

[02:25:15] like,

[02:25:15] stories,

[02:25:16] to be one example,

[02:25:18] is that,

[02:25:19] uh,

[02:25:20] when you are visiting

[02:25:21] the reality,

[02:25:22] like,

[02:25:22] the,

[02:25:22] imagining a particular scenario

[02:25:24] that gives you opportunity

[02:25:26] to think of particular

[02:25:27] counterarguments,

[02:25:28] but then your response

[02:25:29] to that should not be,

[02:25:30] I’ve thought of a

[02:25:31] counterargument,

[02:25:31] therefore it’s false.

[02:25:32] You should say,

[02:25:33] all right,

[02:25:34] now imagine I,

[02:25:35] if I change

[02:25:35] along that axis,

[02:25:37] what are some

[02:25:38] other nearby stories

[02:25:39] and then how does

[02:25:40] that change things?

[02:25:41] And then you can,

[02:25:42] like,

[02:25:42] go from there.

[02:25:42] So envisioning

[02:25:43] the specific concrete

[02:25:44] thing allows for

[02:25:46] more handle

[02:25:47] than just,

[02:25:48] like,

[02:25:49] oh,

[02:25:49] yeah,

[02:25:49] like,

[02:25:50] I guess

[02:25:50] that it’s hopeless.

[02:25:52] Like,

[02:25:53] what are the levers

[02:25:54] by which we might

[02:25:55] be able to change

[02:25:56] our fate?

[02:25:56] I think it’s

[02:25:57] an incredibly

[02:25:58] important question.

[02:26:00] Are there any places

[02:26:00] where you very knowingly

[02:26:01] sacrificed realism

[02:26:02] for entertainment

[02:26:03] in writing the book?

[02:26:03] Mostly no.

[02:26:05] So I consider myself

[02:26:07] to be a rationalist writer

[02:26:08] or writing

[02:26:09] rationalist fiction

[02:26:10] and I think

[02:26:10] a big part of that

[02:26:12] is to try

[02:26:13] to be as,

[02:26:14] like,

[02:26:14] sort of as

[02:26:15] realistic as possible.

[02:26:17] The one major

[02:26:18] conceit there

[02:26:18] is that

[02:26:19] it’s like

[02:26:20] I’m

[02:26:21] setting up

[02:26:22] the world

[02:26:22] to be

[02:26:23] interesting.

[02:26:24] Like,

[02:26:25] I did sacrifice realism

[02:26:26] in that,

[02:26:27] like,

[02:26:27] the Chinese Communist Party

[02:26:28] is not as AGI-pilled

[02:26:30] or as AI safety-pilled,

[02:26:31] right,

[02:26:32] to make

[02:26:32] a cast

[02:26:33] agent,

[02:26:34] right,

[02:26:35] in, like,

[02:26:35] you know,

[02:26:36] current year

[02:26:37] or whatever.

[02:26:38] That’s unrealistic.

[02:26:39] We don’t think so.

[02:26:39] That’s unrealistic.

[02:26:40] Yeah, no,

[02:26:40] I definitely don’t.

[02:26:43] So, like,

[02:26:44] the premise of the book

[02:26:45] is unrealistic,

[02:26:47] right?

[02:26:47] But then, like,

[02:26:48] within the premise,

[02:26:49] like,

[02:26:49] so you set up,

[02:26:50] like,

[02:26:51] the world

[02:26:52] and then you ask,

[02:26:54] okay,

[02:26:54] now what happens?

[02:26:55] And I think that it’s,

[02:26:56] like,

[02:26:56] the author’s duty

[02:26:57] writing rationalist fiction

[02:26:58] to not try to serve,

[02:27:00] like,

[02:27:00] the plot

[02:27:01] or, like,

[02:27:01] what would be,

[02:27:02] make a compelling

[02:27:03] story,

[02:27:04] but instead

[02:27:05] to set up

[02:27:06] initial conditions

[02:27:07] such that a

[02:27:07] incredibly realistic

[02:27:09] extrapolation

[02:27:09] from those initial

[02:27:10] conditions

[02:27:11] is,

[02:27:12] like,

[02:27:13] what you see.

[02:27:15] And then,

[02:27:15] like,

[02:27:16] all of the

[02:27:16] making it compelling

[02:27:17] is in setting up

[02:27:18] the premise

[02:27:19] in the right way.

[02:27:21] It sounds like you’re…

[02:27:22] That being said,

[02:27:23] I probably failed.

[02:27:25] Like,

[02:27:25] people should read the book

[02:27:26] and, like,

[02:27:26] yell at me

[02:27:27] about how it’s unrealistic.

[02:27:28] I’m happy to be

[02:27:29] criticized on this front.

[02:27:30] Sounds like you were

[02:27:31] just saying that you feel

[02:27:32] very confident

[02:27:32] that the Chinese government

[02:27:33] is not AGI-pilled

[02:27:34] or you’re just saying

[02:27:34] it’s not as AGI-pilled

[02:27:35] as how extremely

[02:27:36] they are in the book.

[02:27:37] So,

[02:27:37] we’re in an information

[02:27:39] environment, right?

[02:27:40] Like,

[02:27:40] if there was a secret

[02:27:41] government project,

[02:27:42] would I know about it?

[02:27:43] Right?

[02:27:44] Well,

[02:27:44] by, like,

[02:27:45] assumption,

[02:27:46] it’s secret, right?

[02:27:47] So, no.

[02:27:48] Right?

[02:27:50] That being said,

[02:27:51] there are, like,

[02:27:51] things that you can,

[02:27:53] like,

[02:27:53] pay attention to

[02:27:54] and track.

[02:27:55] And in my studying China,

[02:27:57] I believe

[02:27:59] that,

[02:28:00] according to

[02:28:01] the things that I know,

[02:28:02] there is not

[02:28:03] like a giant

[02:28:04] secret government project

[02:28:05] at the scale

[02:28:06] that is being depicted here

[02:28:07] or, like,

[02:28:08] the scale of, like,

[02:28:08] a Manhattan Project

[02:28:09] sort of thing.

[02:28:10] Now,

[02:28:10] of course,

[02:28:11] there are secret

[02:28:12] government projects.

[02:28:13] There are secret

[02:28:14] government projects

[02:28:14] in all the governments

[02:28:15] that have thought

[02:28:16] about AI at all, right?

[02:28:17] You’ve got, like,

[02:28:18] some researcher at DARPA

[02:28:19] who’s, like,

[02:28:20] tuning around,

[02:28:21] like,

[02:28:21] fine-tuning

[02:28:22] the open-source models.

[02:28:23] Like,

[02:28:23] is this a secret

[02:28:24] government project

[02:28:25] for AGI?

[02:28:25] It’s like,

[02:28:26] no,

[02:28:26] this is, like,

[02:28:26] a single researcher.

[02:28:27] So, for AGI

[02:28:28] is a big point.

[02:28:30] And, like,

[02:28:31] I think part of

[02:28:32] the,

[02:28:33] like,

[02:28:34] the question here

[02:28:35] is, like,

[02:28:36] where is the

[02:28:37] politician’s attention?

[02:28:39] Where are the people’s

[02:28:39] attention?

[02:28:40] Where is the

[02:28:40] political pressure?

[02:28:43] And, yeah,

[02:28:44] I think that,

[02:28:45] according to me,

[02:28:47] like,

[02:28:47] the

[02:28:48] Chinese government,

[02:28:51] the Chinese

[02:28:52] people

[02:28:53] are a lot more

[02:28:54] oriented towards

[02:28:55] AI

[02:28:55] in the form of,

[02:28:57] like,

[02:28:57] being competitive

[02:28:58] with the West

[02:28:59] and

[02:29:00] being a fast follower

[02:29:02] as opposed to,

[02:29:03] uh,

[02:29:04] being a front-runner

[02:29:04] and, like,

[02:29:06] leapfrogging.

[02:29:07] Did you worry about,

[02:29:08] um,

[02:29:08] given that you think that,

[02:29:09] did you worry about,

[02:29:10] you know,

[02:29:11] encouraging arms race dynamics

[02:29:12] or, like,

[02:29:13] fear of China

[02:29:13] by making it

[02:29:14] almost salient to people?

[02:29:14] So, to be very clear,

[02:29:15] this book

[02:29:17] is a criticism

[02:29:18] of arms races.

[02:29:20] I think that it is

[02:29:21] incredibly stupid

[02:29:22] to say,

[02:29:23] what if a bad person

[02:29:25] gets hold of the AI?

[02:29:26] I need to build it first.

[02:29:28] What if the,

[02:29:28] what if the Chinese government

[02:29:29] gets hold of AI?

[02:29:30] We need to build AGI first.

[02:29:32] I mean,

[02:29:32] this is really,

[02:29:33] this is really dumb.

[02:29:34] And I could go into,

[02:29:34] like,

[02:29:35] why I think that’s dumb.

[02:29:36] And part of,

[02:29:37] part of the writing this book

[02:29:38] is, like,

[02:29:39] to criticize that perspective.

[02:29:41] That being said,

[02:29:42] I am worried

[02:29:43] that people will get

[02:29:43] the opposite,

[02:29:44] uh,

[02:29:44] like,

[02:29:45] takeaway.

[02:29:46] Uh,

[02:29:46] I mean,

[02:29:46] the work stands on its own.

[02:29:48] So, like,

[02:29:48] you know,

[02:29:49] you could read it

[02:29:49] and decide whether or not

[02:29:50] it’s encouraging arms races

[02:29:51] or not.

[02:29:52] But, um,

[02:29:53] yeah,

[02:29:53] something I think about.

[02:29:55] I guess some people

[02:29:56] advocate for writing fiction

[02:29:57] because it helps to make

[02:29:59] things more compelling

[02:29:59] and more persuasive.

[02:30:00] Like me?

[02:30:01] Yeah.

[02:30:01] Do you worry that

[02:30:02] fiction could be

[02:30:03] too persuasive?

[02:30:03] That if you’re willing

[02:30:04] to get someone to,

[02:30:05] you know,

[02:30:05] spend five or ten hours

[02:30:06] reading a book,

[02:30:07] then you can,

[02:30:07] it gives you an opportunity

[02:30:08] to convince them

[02:30:09] of stuff that is,

[02:30:10] is,

[02:30:10] is false

[02:30:10] because they’re just,

[02:30:11] like,

[02:30:11] inhabiting that world

[02:30:12] even if it’s unrealistic.

[02:30:14] Yeah,

[02:30:14] I mean,

[02:30:15] it’s definitely,

[02:30:16] uh,

[02:30:17] you have an opportunity,

[02:30:18] like,

[02:30:18] any conversation

[02:30:19] is like this,

[02:30:20] right?

[02:30:20] If,

[02:30:21] oh,

[02:30:21] I don’t know

[02:30:21] if I should,

[02:30:22] uh,

[02:30:22] talk to people

[02:30:23] because I might be

[02:30:23] too compelling,

[02:30:24] right?

[02:30:24] And convince them

[02:30:25] of false things.

[02:30:26] It’s like,

[02:30:26] yeah,

[02:30:27] I mean,

[02:30:27] I want the reader

[02:30:28] to be hard-headed

[02:30:29] about things

[02:30:30] and I want,

[02:30:31] uh,

[02:30:32] like,

[02:30:32] a culture,

[02:30:33] a world,

[02:30:34] uh,

[02:30:34] an audience

[02:30:35] that is skeptical

[02:30:36] about,

[02:30:38] uh,

[02:30:38] what they’re reading.

[02:30:39] Um,

[02:30:40] skepticism means

[02:30:41] grappling with

[02:30:42] might this be false

[02:30:43] and also

[02:30:44] might this be true?

[02:30:45] And,

[02:30:46] uh,

[02:30:46] really,

[02:30:46] I,

[02:30:47] like,

[02:30:47] I wrote the book

[02:30:47] to encourage people

[02:30:49] to think more,

[02:30:50] think deeply

[02:30:51] about these questions.

[02:30:52] Like,

[02:30:53] everybody has

[02:30:54] a responsibility,

[02:30:55] I think,

[02:30:55] in this world

[02:30:56] to think about

[02:30:57] the most

[02:30:58] pressing problems

[02:30:59] of the world

[02:31:00] and whether or not

[02:31:01] they have any ability

[02:31:02] to,

[02:31:02] like,

[02:31:02] uh,

[02:31:03] you know,

[02:31:03] promote the

[02:31:04] awareness of those things.

[02:31:06] So,

[02:31:06] I think less,

[02:31:07] like,

[02:31:08] I’m trying to,

[02:31:10] uh,

[02:31:11] like,

[02:31:11] I,

[02:31:11] like,

[02:31:11] I do think that

[02:31:12] arms races are dumb,

[02:31:14] uh,

[02:31:15] and,

[02:31:16] you know,

[02:31:17] like,

[02:31:17] maybe that’s part

[02:31:17] of the takeaway

[02:31:18] and I think that

[02:31:19] courageability is,

[02:31:20] uh,

[02:31:20] exciting

[02:31:21] and I,

[02:31:21] like,

[02:31:22] hope that that’s part

[02:31:22] of the takeaway.

[02:31:23] But on a deeper level,

[02:31:25] what I really want people

[02:31:26] to do is think more

[02:31:27] about,

[02:31:28] arms races,

[02:31:29] think more about

[02:31:29] those dynamics,

[02:31:30] think about,

[02:31:31] uh,

[02:31:32] courageability,

[02:31:33] think about

[02:31:34] the risks from AI.

[02:31:36] Thinking deeply

[02:31:36] is more important

[02:31:38] than,

[02:31:39] like,

[02:31:40] the particular conclusion

[02:31:41] that you get to

[02:31:41] because if you get

[02:31:42] to the conclusion

[02:31:43] in the wrong,

[02:31:43] get to the right conclusion

[02:31:45] in the wrong way,

[02:31:46] you are vulnerable

[02:31:47] to,

[02:31:48] uh,

[02:31:48] then,

[02:31:49] like,

[02:31:49] pivoting to

[02:31:50] starting open AI

[02:31:51] or something,

[02:31:52] right?

[02:31:52] Like,

[02:31:52] it’s not going to

[02:31:53] generalize to

[02:31:54] all of the

[02:31:55] other good

[02:31:55] decisions down the road.

[02:31:57] Um,

[02:31:58] Ellie,

[02:31:58] as I said

[02:31:58] off your other,

[02:31:59] other series,

[02:32:00] Crystal Society,

[02:32:01] that it belongs

[02:32:02] to a very,

[02:32:02] very tiny subset

[02:32:03] of AI stories

[02:32:04] that are not

[02:32:04] bloody stupid.

[02:32:05] What was,

[02:32:06] what was he referring to?

[02:32:07] What,

[02:32:07] what’s,

[02:32:07] what’s good about it?

[02:32:08] I mean,

[02:32:08] like,

[02:32:08] have you,

[02:32:09] uh,

[02:32:09] have you seen,

[02:32:10] like,

[02:32:11] all of the other

[02:32:11] AI stories?

[02:32:12] Uh,

[02:32:12] I think that,

[02:32:14] for example,

[02:32:15] um,

[02:32:15] robots in fiction

[02:32:16] are often depicted

[02:32:18] as,

[02:32:18] like,

[02:32:18] you know,

[02:32:19] uh,

[02:32:19] cold and logical

[02:32:20] and,

[02:32:21] like,

[02:32:21] you know,

[02:32:21] you talk to Claude

[02:32:22] and it’s anything but,

[02:32:23] right?

[02:32:24] There’s,

[02:32:24] uh,

[02:32:25] ways in which,

[02:32:25] like,

[02:32:26] uh,

[02:32:27] authors throughout history

[02:32:28] have shaped their AIs

[02:32:30] to be,

[02:32:30] uh,

[02:32:31] foils in,

[02:32:32] in particular ways,

[02:32:33] not paying attention

[02:32:33] to the realism.

[02:32:35] Um,

[02:32:35] and that’s one thing

[02:32:36] that I think I can bring

[02:32:37] as an author

[02:32:38] is that I’m,

[02:32:38] like,

[02:32:38] actually a researcher

[02:32:39] who,

[02:32:39] who pays a lot of attention

[02:32:40] to this stuff

[02:32:41] and I’ve gotten a lot

[02:32:42] of feedback,

[02:32:43] uh,

[02:32:44] about the realism,

[02:32:45] about the sense of,

[02:32:46] like,

[02:32:46] oh,

[02:32:46] this is,

[02:32:47] this is really speaking

[02:32:47] to,

[02:32:48] like,

[02:32:49] how things are,

[02:32:49] are working,

[02:32:50] you know?

[02:32:50] I try,

[02:32:50] I try my best anyway.

[02:32:52] Um,

[02:32:53] but,

[02:32:53] like,

[02:32:53] yeah,

[02:32:53] like,

[02:32:54] I mean,

[02:32:54] C-3PO is not a good

[02:32:56] depiction of AI.

[02:32:58] Uh,

[02:32:58] there’s,

[02:32:59] I guess,

[02:33:00] what,

[02:33:00] what’s the setup

[02:33:01] of Crystal Society

[02:33:01] in broad,

[02:33:02] in broad strokes?

[02:33:02] Right,

[02:33:02] right.

[02:33:02] So,

[02:33:03] uh,

[02:33:03] the,

[02:33:03] the,

[02:33:04] like,

[02:33:04] elevator pitch

[02:33:05] for Crystal Society

[02:33:05] is you’ve got

[02:33:06] what’s,

[02:33:07] like,

[02:33:07] Inside Out,

[02:33:08] the movie with

[02:33:09] a little girl

[02:33:10] who has all the

[02:33:10] different,

[02:33:11] uh,

[02:33:11] voices in her,

[02:33:12] like,

[02:33:12] emotions in her head

[02:33:13] that are telling her

[02:33:14] to do different things

[02:33:15] except instead of

[02:33:15] a little girl

[02:33:16] it’s,

[02:33:16] like,

[02:33:16] uh,

[02:33:17] an android.

[02:33:18] Uh,

[02:33:18] so you’d,

[02:33:18] like,

[02:33:19] there’s this,

[02:33:20] uh,

[02:33:24] set up,

[02:33:24] uh,

[02:33:25] the computer

[02:33:26] with,

[02:33:26] um,

[02:33:27] AI,

[02:33:28] but then sort of

[02:33:29] unknown to them

[02:33:30] the AI sort of

[02:33:31] splits into a bunch

[02:33:32] of different,

[02:33:33] uh,

[02:33:33] like,

[02:33:34] sub-components

[02:33:35] that are,

[02:33:36] like,

[02:33:37] competing against

[02:33:37] each other.

[02:33:38] So it,

[02:33:38] it was,

[02:33:39] uh,

[02:33:40] I started writing

[02:33:40] it back in 2014

[02:33:41] and at the time

[02:33:43] the,

[02:33:43] like,

[02:33:44] it was very,

[02:33:46] uh,

[02:33:46] common idea

[02:33:47] that there would

[02:33:48] only be one AI,

[02:33:49] right?

[02:33:49] There would be a

[02:33:49] singleton that would,

[02:33:51] thanks to first-mover

[02:33:52] advantages,

[02:33:53] like,

[02:33:54] take over.

[02:33:55] And I think that’s

[02:33:56] still,

[02:33:56] like,

[02:33:56] a plausible risk.

[02:33:57] But also we’re

[02:33:58] looking at a world

[02:33:59] where there’s lots

[02:33:59] of different competing

[02:34:00] models and where

[02:34:01] labs are,

[02:34:02] like,

[02:34:02] neck and neck,

[02:34:03] unfortunately.

[02:34:05] Uh,

[02:34:05] and so we’re,

[02:34:06] we’re potentially

[02:34:07] going to get a world

[02:34:07] that has lots of

[02:34:08] different AIs.

[02:34:09] So writing

[02:34:09] Crystal Society

[02:34:10] was,

[02:34:10] like,

[02:34:11] what if there are

[02:34:11] a bunch of

[02:34:12] different AIs

[02:34:12] in the same robot?

[02:34:13] And so,

[02:34:13] like,

[02:34:14] one of them’s,

[02:34:14] like,

[02:34:15] can I do the

[02:34:16] most creative thing?

[02:34:17] Like,

[02:34:17] one of them’s,

[02:34:17] like,

[02:34:18] can I do the

[02:34:18] most persuasive thing?

[02:34:20] And,

[02:34:20] um,

[02:34:21] and they’re all

[02:34:21] sort of misaligned.

[02:34:22] And,

[02:34:22] uh,

[02:34:23] so you have this,

[02:34:24] and it’s told

[02:34:24] from the perspective

[02:34:25] of one of the,

[02:34:27] the goal threads,

[02:34:28] one of the AIs

[02:34:28] whose name is Face.

[02:34:30] And her objective

[02:34:31] is to,

[02:34:31] uh,

[02:34:32] get as much esteem

[02:34:34] and,

[02:34:34] uh,

[02:34:34] respect,

[02:34:35] uh,

[02:34:35] from humans as possible.

[02:34:37] So there’s a lot of,

[02:34:38] like,

[02:34:38] deceptive,

[02:34:39] uh,

[02:34:39] stuff there.

[02:34:40] And you get to,

[02:34:40] like,

[02:34:41] explore the,

[02:34:42] okay,

[02:34:43] so you’re trapped

[02:34:44] in the lab.

[02:34:44] You’re trapped in,

[02:34:45] uh,

[02:34:45] under human control.

[02:34:47] How do you break out?

[02:34:48] And,

[02:34:49] like,

[02:34:49] how do you navigate

[02:34:49] as an AI?

[02:34:51] Uh,

[02:34:52] like a,

[02:34:52] a multi-agent

[02:34:53] environment and situation.

[02:34:55] It’s also,

[02:34:56] more broadly,

[02:34:57] an exploration of minds

[02:34:58] and thinking.

[02:34:59] Uh,

[02:34:59] there are aliens.

[02:35:00] There are,

[02:35:01] there’s a chapter

[02:35:01] from the perspective

[02:35:02] of a dog.

[02:35:03] There’s all sorts of,

[02:35:04] uh,

[02:35:05] like,

[02:35:06] um,

[02:35:07] I don’t know,

[02:35:07] deep dives into,

[02:35:09] like,

[02:35:10] what it is

[02:35:11] to be a mind.

[02:35:12] Because,

[02:35:13] uh,

[02:35:13] final question.

[02:35:14] Uh,

[02:35:15] an unusual thing about you,

[02:35:16] uh,

[02:35:16] given the kind of work

[02:35:17] that you’re doing,

[02:35:17] is that you didn’t finish

[02:35:18] high school.

[02:35:18] I don’t even know

[02:35:19] whether you went to high school.

[02:35:20] Yeah,

[02:35:20] I,

[02:35:20] I’m homeschooled.

[02:35:21] I’m homeschooled.

[02:35:22] Uh,

[02:35:22] so I,

[02:35:22] I didn’t,

[02:35:23] I,

[02:35:23] I don’t have any degrees.

[02:35:25] Yeah.

[02:35:25] I don’t have,

[02:35:26] uh,

[02:35:26] like,

[02:35:27] I went,

[02:35:27] I,

[02:35:27] I did go to a community college

[02:35:29] for a little bit.

[02:35:30] Uh,

[02:35:30] but mostly I’m self-taught.

[02:35:31] Yeah.

[02:35:32] Much like,

[02:35:32] I mean,

[02:35:32] the same is true of Eliezer,

[02:35:33] right?

[02:35:33] That’s right.

[02:35:33] That’s right.

[02:35:34] I don’t know whether it’s a patent.

[02:35:35] There was a lot of kinship there.

[02:35:36] I mean,

[02:35:36] I,

[02:35:37] uh,

[02:35:37] had already become an adult

[02:35:38] by the time that I was aware of him,

[02:35:40] but there was definitely a,

[02:35:41] uh,

[02:35:41] shared,

[02:35:42] um,

[02:35:43] backstory there.

[02:35:44] Yeah.

[02:35:44] And I do think that it contributes

[02:35:46] to like,

[02:35:47] uh,

[02:35:47] sort of having this outsider,

[02:35:49] uh,

[02:35:49] sort of view,

[02:35:50] right?

[02:35:50] This,

[02:35:51] um,

[02:35:52] maybe the world is crazy

[02:35:53] and,

[02:35:55] uh,

[02:35:56] like,

[02:35:56] not set up in a,

[02:35:57] in the,

[02:35:58] in the good sort of way.

[02:35:59] Yeah.

[02:35:59] Is that,

[02:35:59] is that the main effect

[02:36:00] that it’s had on your personality

[02:36:01] or,

[02:36:01] or your,

[02:36:02] or your life?

[02:36:03] I mean,

[02:36:03] it’s really hard to judge

[02:36:04] the counterfactual,

[02:36:05] right?

[02:36:05] What is the version of me

[02:36:06] that like,

[02:36:07] had a more,

[02:36:07] uh,

[02:36:08] normal family who was like,

[02:36:09] oh no,

[02:36:09] you’re going to go to college

[02:36:11] and get a,

[02:36:11] It could be that the heterodoxy

[02:36:12] causes the homeschooling

[02:36:13] rather than the homeschooling

[02:36:14] causes the heterodoxy.

[02:36:15] I,

[02:36:15] I think that’s way more likely.

[02:36:16] I mean,

[02:36:16] based on what I’ve read about,

[02:36:18] uh,

[02:36:18] like shared childhood environments,

[02:36:19] it’s,

[02:36:20] it’s like questionable

[02:36:21] whether it had a significant

[02:36:23] effect on me at all.

[02:36:23] I wouldn’t necessarily recommend it.

[02:36:25] I think school is meant to,

[02:36:26] or like peers are meant to

[02:36:26] influence people a bunch.

[02:36:28] Yeah,

[02:36:28] But like,

[02:36:28] uh,

[02:36:30] they have like,

[02:36:31] and I,

[02:36:32] I think the literature here

[02:36:33] is somewhat mixed and,

[02:36:34] and confused.

[02:36:35] And I don’t claim to have

[02:36:36] a lot of knowledge,

[02:36:37] but if you like look into like,

[02:36:38] what is the effect of,

[02:36:39] uh,

[02:36:39] like having particularly good

[02:36:40] teachers or something like,

[02:36:42] it tends to fade with time.

[02:36:43] Uh,

[02:36:44] so my guess is that

[02:36:45] I,

[02:36:46] as a like personality,

[02:36:48] um,

[02:36:49] mostly,

[02:36:50] uh,

[02:36:50] like some mixture of

[02:36:52] predetermined and like

[02:36:53] sort of random,

[02:36:54] uh,

[02:36:55] not predictably influenced

[02:36:56] by my not going to

[02:36:58] a normal schooling,

[02:36:59] uh,

[02:37:00] context.

[02:37:01] Uh,

[02:37:01] but I,

[02:37:01] I do think it has influenced

[02:37:02] me in some ways.

[02:37:03] Like,

[02:37:03] I think that,

[02:37:04] for example,

[02:37:05] I have a strong

[02:37:07] love for studying.

[02:37:09] And I think that one of the

[02:37:11] most dangerous things

[02:37:13] about,

[02:37:13] uh,

[02:37:14] public education

[02:37:14] is you,

[02:37:15] you force kids to sit

[02:37:17] in boring classrooms

[02:37:18] or like bad environments

[02:37:20] and you,

[02:37:21] uh,

[02:37:21] do this under the

[02:37:22] justification of education.

[02:37:23] And they come out of school

[02:37:25] hating studying,

[02:37:26] right?

[02:37:27] They’re like,

[02:37:27] oh,

[02:37:27] that’s,

[02:37:28] that’s that thing that

[02:37:29] people made me do

[02:37:30] instead of the love

[02:37:32] for mathematics

[02:37:32] and the world

[02:37:33] and history

[02:37:34] and all the rest

[02:37:35] of the things

[02:37:36] that I think are important.

[02:37:37] Were your parents able

[02:37:37] to keep up with you

[02:37:38] when you were,

[02:37:39] when you were a teenager?

[02:37:39] I imagine you were

[02:37:40] quite a proficient.

[02:37:41] I have very smart parents.

[02:37:42] Okay, right.

[02:37:43] I see.

[02:37:43] Yeah.

[02:37:44] Well,

[02:37:44] why,

[02:37:44] why did they decide

[02:37:45] to homeschool you?

[02:37:46] Uh,

[02:37:46] yeah,

[02:37:46] because they’re like

[02:37:47] crazy libertarians

[02:37:48] who are like,

[02:37:49] uh,

[02:37:50] the school system.

[02:37:51] Well,

[02:37:51] I mean,

[02:37:51] so I,

[02:37:52] I did actually go

[02:37:53] to public school,

[02:37:54] uh,

[02:37:54] for like fourth grade

[02:37:56] and parts of fifth grade.

[02:37:57] And I like went

[02:37:58] to private school

[02:37:58] for like first three grades

[02:38:00] and I started fighting

[02:38:01] with my teachers

[02:38:02] and due to like,

[02:38:03] uh,

[02:38:04] intrinsic contrarianness

[02:38:05] and anti-authoritarianism.

[02:38:07] And,

[02:38:08] um,

[02:38:08] so there was a degree

[02:38:09] to which me being homeschooled

[02:38:11] was a result of like

[02:38:12] trying lots of

[02:38:13] different,

[02:38:13] different things

[02:38:13] and noticing that like,

[02:38:15] oh,

[02:38:15] we can just give Max

[02:38:16] a calculus textbook

[02:38:17] and he teaches himself calculus.

[02:38:18] Why are we putting him

[02:38:20] in classrooms

[02:38:21] where he’s forced

[02:38:21] to like learn algebra

[02:38:23] because,

[02:38:23] uh,

[02:38:24] that’s what all the other

[02:38:24] kids are doing.

[02:38:25] And in fact,

[02:38:26] it’s just like super bored

[02:38:27] all the time.

[02:38:27] And why didn’t you

[02:38:28] go to university?

[02:38:29] Well,

[02:38:30] so I,

[02:38:30] I did go to college

[02:38:31] for a few years.

[02:38:33] Uh,

[02:38:33] unfortunately,

[02:38:35] like my family

[02:38:35] wasn’t particularly,

[02:38:36] uh,

[02:38:37] wealthy and,

[02:38:38] uh,

[02:38:38] I had a hard time,

[02:38:39] uh,

[02:38:39] acquiring financial aid

[02:38:41] and there were various

[02:38:42] contingent factors.

[02:38:43] Like the financial crisis

[02:38:44] happened like during

[02:38:46] that period of time

[02:38:46] and I moved across

[02:38:47] the country

[02:38:47] and then I like tried

[02:38:48] to transfer my credits

[02:38:49] and the bureaucracy

[02:38:50] was like,

[02:38:51] you can’t transfer

[02:38:51] credits from that.

[02:38:52] And I was just like,

[02:38:53] oh,

[02:38:53] this is stupid.

[02:38:54] I can just like read

[02:38:55] the textbook

[02:38:55] and learn the thing anyway.

[02:38:56] So,

[02:38:57] um,

[02:38:57] I think having,

[02:38:59] uh,

[02:39:00] grown up in a way

[02:39:01] where I was aware

[02:39:03] of just how much

[02:39:04] I was in charge

[02:39:06] of my education,

[02:39:07] not other people.

[02:39:09] College

[02:39:10] and,

[02:39:10] and university

[02:39:11] was an opportunity

[02:39:12] to,

[02:39:13] you know,

[02:39:13] be in an enriching

[02:39:14] environment,

[02:39:15] but,

[02:39:15] uh,

[02:39:16] I had the opportunity

[02:39:17] to learn without going.

[02:39:18] And,

[02:39:19] uh,

[02:39:19] for me,

[02:39:20] I just,

[02:39:20] uh,

[02:39:21] it was cheaper.

[02:39:22] It was,

[02:39:22] I was able to jump

[02:39:23] more into like

[02:39:24] studying AI

[02:39:25] all the time

[02:39:26] instead of

[02:39:27] having to,

[02:39:28] you know,

[02:39:28] tick boxes.

[02:39:29] Right.

[02:39:29] Yeah.

[02:39:30] Should I homeschool

[02:39:30] my kid?

[02:39:31] Uh,

[02:39:31] it depends.

[02:39:32] I,

[02:39:32] I think it’s

[02:39:33] definitely a lot

[02:39:34] more work.

[02:39:35] Uh,

[02:39:35] although I,

[02:39:36] uh,

[02:39:36] was unschooled.

[02:39:37] Uh,

[02:39:38] so like my parents

[02:39:38] were very hands-off

[02:39:39] and very like

[02:39:40] empowering me

[02:39:41] to make decisions

[02:39:42] according to my,

[02:39:43] uh,

[02:39:43] interests.

[02:39:44] Uh,

[02:39:44] so if you’re unschooling,

[02:39:45] that’s a lot lower,

[02:39:46] uh,

[02:39:47] time investment.

[02:39:48] Um,

[02:39:49] although I do very much

[02:39:50] recommend homeschoolers,

[02:39:52] uh,

[02:39:52] like find other

[02:39:54] homeschoolers,

[02:39:54] uh,

[02:39:55] so that,

[02:39:55] uh,

[02:39:56] first because you get

[02:39:57] more socialization,

[02:39:58] you get a friend group.

[02:39:59] If you have like at least

[02:40:00] some friends your own age,

[02:40:01] I was lucky enough to

[02:40:02] have this growing up

[02:40:03] and I think that was

[02:40:04] really good for me.

[02:40:05] Um,

[02:40:06] but I,

[02:40:07] I think

[02:40:08] school,

[02:40:10] uh,

[02:40:10] and especially public school

[02:40:11] is pretty good

[02:40:13] at handling people

[02:40:14] who are like plus

[02:40:14] or minus one standard

[02:40:15] deviation

[02:40:16] in a very,

[02:40:17] a variety of ways.

[02:40:18] If your kids

[02:40:19] are super weird,

[02:40:21] either on the high end

[02:40:22] or the low end

[02:40:22] or whatever,

[02:40:23] I think the,

[02:40:24] um,

[02:40:25] like appeal

[02:40:26] of a bespoke

[02:40:27] solution,

[02:40:28] homeschooling,

[02:40:29] unschooling,

[02:40:29] whatever,

[02:40:30] starts going up.

[02:40:31] I think if you are,

[02:40:32] if you’re,

[02:40:33] uh,

[02:40:33] expect your kids

[02:40:34] to be brilliant

[02:40:36] and self-motivated

[02:40:37] and you want

[02:40:38] to prioritize

[02:40:39] a love of learning

[02:40:40] as opposed to,

[02:40:42] uh,

[02:40:42] like fit,

[02:40:42] conforming to society,

[02:40:44] uh,

[02:40:45] it’s a great option.

[02:40:46] Um,

[02:40:47] it,

[02:40:47] it,

[02:40:47] although,

[02:40:48] you know,

[02:40:49] probably,

[02:40:50] uh,

[02:40:50] you should urge them

[02:40:51] to go to university.

[02:40:52] It was hard

[02:40:53] for me to get

[02:40:54] into,

[02:40:55] uh,

[02:40:56] jobs,

[02:40:56] right?

[02:40:57] And,

[02:40:58] uh,

[02:40:58] I’m lucky that Miri,

[02:40:59] you know,

[02:40:59] being,

[02:41:00] uh,

[02:41:00] founded by Eliezer

[02:41:01] was like way less

[02:41:02] concerned with

[02:41:03] whether or not

[02:41:03] I had a degree.

[02:41:04] And I,

[02:41:05] I think startup culture

[02:41:06] in general,

[02:41:06] uh,

[02:41:06] like I was at a startup

[02:41:07] before going to Miri

[02:41:09] and,

[02:41:10] it’s just like the tech

[02:41:11] world is just a lot

[02:41:12] less,

[02:41:12] it’s concerned with

[02:41:13] whether or not

[02:41:13] you have a PhD.

[02:41:14] Yeah.

[02:41:14] My guest today has been

[02:41:15] Max Homs.

[02:41:16] Thanks so much for coming

[02:41:16] on the 80,000 Hours

[02:41:17] Podcast,

[02:41:17] Max.

[02:41:18] Thank you.