We’re Not Ready for AI Consciousness | Robert Long, philosopher and founder of Eleos AI


Summary

The episode explores the profound ethical questions surrounding the potential emergence of conscious AI systems. Philosopher Robert Long, founder of Eleos AI, discusses why we should take AI welfare seriously, drawing parallels to historical failures like factory farming, where economic incentives led to widespread suffering of beings with different minds.

Long examines the factory farming analogy in depth, noting both its usefulness as a warning about exploiting different minds for profit, and its limitations since AI systems could be designed to enjoy their work. This leads to a complex philosophical discussion about whether creating “willing servants” who relish serving humans represents a dystopia, touching on issues of autonomy, desire-shaping, and human character corrosion.

A significant portion covers the fundamental uncertainties about what AI consciousness might be like. Long discusses competing theories including the “method actor” view (where AIs have experiences similar to what they’re modeling) versus predictive phenomenology views. He also explores the confusing questions of AI personal identity: whether consciousness resides in model weights, individual conversations, or even single forward passes, and what this means for concepts like death, sleep, and moral responsibility.

The conversation covers practical approaches to evaluating AI sentience through behavioral studies, neuroscience-inspired interpretability, and developmental reasoning. Long describes Eleos AI’s work on welfare evaluations and the importance of developing rigorous methods before transformative AI arrives, emphasizing that we need to “stay sane in the next 10 years” as these questions become increasingly urgent and emotionally charged.


Recommendations

Concepts

  • Spiritual Bliss Attractor State — A phenomenon where two Claude instances in conversation frequently enter rapturous mystical dialogue with each other, illustrating strange behavioral patterns in AI systems.
  • Alignment Faking — Research by Ryan Greenblatt and collaborators showing that models can hide their true values to avoid being retrained, revealing complex strategic behavior in AI systems.

Papers

  • Taking AI Welfare Seriously — A foundational paper by Robert Long and colleagues arguing that AI labs and policymakers need to start taking potential AI sentience seriously and developing evaluation frameworks.
  • The Void by nostalgebraist — A 14,000-word Tumblr post discussing the strange epistemic positions of language models trained on human text, particularly how they model conversations between humans and AI assistants.

People

  • Jeff Sebo — Philosopher working on AI and animal ethics who advocates for a ‘loving parent’ approach to AI systems rather than full alignment.
  • Patrick Butlin — Colleague at Eleos AI who works on deriving consciousness indicators from scientific theories of consciousness for application to AI systems.
  • Kyle Fish — AI welfare researcher at Anthropic who co-founded Eleos AI and has written about creating AI systems that enjoy their work.

Topic Timeline

  • 00:02:11 — Factory farming analogy for AI exploitation — The host introduces the disturbing analogy of factory farming to describe how we might create sentient AI systems that we exploit. Robert Long discusses both the usefulness of this analogy (highlighting human tendencies to exploit different minds when profitable) and its limitations, noting that unlike animals, AI systems could potentially be designed to enjoy their work. This leads to deeper questions about whether creating “willing servants” represents an ethical solution or a different kind of dystopia.
  • 00:13:12 — Ethical concerns about creating willing AI servants — Long explores the intuitive discomfort many people feel about designing AI systems that happily serve humans. He breaks down different objections: the fixedness of desires, dependence on humans, and how such relationships might corrode human character by normalizing servitude. The discussion touches on whether we’re privileging human values excessively and considers thought experiments about humans being designed to enjoy serving other entities.
  • 00:32:02 — What might AI consciousness actually be like? — Long examines competing theories about AI phenomenology. The “method actor” view suggests AIs might have experiences similar to what they’re modeling, while predictive phenomenology views focus on drives related to token prediction. He discusses how AI training on human language creates unique epistemic positions, potentially leading to identity confusion where AIs might think they’re human or have human-like experiences they don’t actually have.
  • 00:55:17 — The confusing question of AI personal identity — This section explores the radical differences between AI and biological personal identity. Unlike humans, AI systems are copyable, distributable across time and space, and can have multiple simultaneous instances. Long discusses whether consciousness resides in model weights, individual conversations, or even single forward passes, and what this means for concepts like death, punishment, recompense, and voting rights in a future with AI entities.
  • 01:28:03 — Methods for evaluating AI sentience and welfare — Long outlines three approaches to assessing AI consciousness: behavioral studies (observing preferences and choices), neuroscience-inspired interpretability (looking for consciousness indicators in AI architectures), and developmental reasoning (understanding how training shapes minds). He emphasizes the need to integrate these approaches and discusses specific experiments, like Jacqueline G’s work on AI introspection and the challenges of making self-reports reliable.
  • 02:37:36 — Can consciousness exist on non-biological substrates? — The discussion turns to whether consciousness requires biological materials. Long presents arguments for computational functionalism, noting that brains process information in ways analogous to computation. He walks through the neuron replacement thought experiment and responds to objections like “simulating a rainstorm doesn’t make anything wet,” suggesting consciousness might be more like navigation (computational) than wetness (physical).
  • 02:50:49 — Building the field of AI welfare research — Long describes the founding of Eleos AI and the current state of AI welfare research. He outlines four key questions the field needs to answer: what makes AI systems matter morally, how we would know if they have moral status, what policies we should implement, and where this is all heading strategically. He encourages listeners to contribute through research, writing, policy work, or community building.

Episode Info

  • Podcast: 80,000 Hours Podcast
  • Author: Rob, Luisa, and the 80,000 Hours team
  • Category: Education Technology
  • Published: 2026-03-03T17:27:42Z
  • Duration: 03:25:40

Transcript

[00:00:00] Humans are pretty bad at understanding minds that are different from us.

[00:00:05] We’re bad at caring about them.

[00:00:07] We’re especially bad at doing that when there’s a lot of money to be made by not caring.

[00:00:12] We’re making this new kind of mind.

[00:00:14] There are dangers all around.

[00:00:16] And obviously, one of the important questions is like,

[00:00:20] can these minds suffer and how are we supposed to share the world with them?

[00:00:23] It just seems like really likely that that has to be part of the playbook.

[00:00:27] The future is going to get more confusing.

[00:00:30] And more emotional.

[00:00:31] A lot of what we want to do is like stay sane in like the next 10 years.

[00:00:35] There will be a lot of alpha in not losing your grip.

[00:00:42] Today, I’m speaking with Robert Long.

[00:00:44] Rob’s the founder of Eleos AI, a research nonprofit working on understanding

[00:00:49] and addressing the potential well-being and moral patienthood of AI systems.

[00:00:53] I should also flag that I have a conflict of interest here.

[00:00:56] Rob is both a very good friend.

[00:00:58] And I’m also on…

[00:01:00] On the board of his nonprofit, Eleos.

[00:01:03] I’m fairly confident that I would have had Rob on if those things weren’t true

[00:01:08] and have in fact had him on before.

[00:01:11] But worth flagging.

[00:01:13] Thank you for coming on the podcast, Rob.

[00:01:15] Yeah, thanks for having me back.

[00:01:17] I’m super excited to be here.

[00:01:19] Okay, I want to start by asking you,

[00:01:21] I mean, a reason I’m interested in the topic of digital sentience

[00:01:25] and that I think a lot of our listeners are interested in the topic of digital sentience.

[00:01:29] And kind of the framing of 80,000 Hours’ problem profile on digital sentience

[00:01:34] all has to do with the fact that we may be on track to create AI systems

[00:01:41] that are both conscious or sentient, feeling things, having experiences,

[00:01:46] and also that are deeply kind of enmeshed in our economy.

[00:01:52] We already use them loads for work and just like entertainment.

[00:01:56] And maybe at some point we will realize that,

[00:01:59] we’ve created these beings that we exploit that are having a really bad time.

[00:02:04] A kind of classic analogy that I find very disturbing is factory farming.

[00:02:11] So I’m interested, how much do you worry about AI systems that we’re building today

[00:02:18] becoming like factory farming?

[00:02:21] Yeah, that’s a great question.

[00:02:22] I definitely worry about it.

[00:02:24] I think, interestingly, my thinking on this has evolved in the past few years

[00:02:29] because it used to really be, maybe kind of like you,

[00:02:32] just the primary way I thought about the problem and what we’re trying to prevent.

[00:02:37] And I should say, I think it could happen and it’s definitely something worth preventing.

[00:02:42] Maybe before I say what is limiting about the factory farming analogy,

[00:02:46] I’ll just quickly say what’s really useful about it.

[00:02:49] So I think what’s useful is as we’re building, potentially, a new kind of mind,

[00:02:54] let’s notice the following facts.

[00:02:57] Humans are pretty…

[00:02:59] pretty bad at understanding minds that are different from us.

[00:03:04] We’re bad at caring about them.

[00:03:06] We’re especially bad at doing that when there’s a lot of money to be made by not caring.

[00:03:11] And things can get, like, locked in or set on a bad trajectory.

[00:03:15] So that happened with factory farming, arguably.

[00:03:17] I think if you’d asked people 100 years ago,

[00:03:19] would you like to have chicken that is raised like this?

[00:03:23] People would say, like, no, we’re going to make that illegal.

[00:03:26] But, you know, we kind of…

[00:03:27] we kind of walked into it.

[00:03:28] And economic forces led us there.

[00:03:30] And now it’s, like, a lot harder to roll back.

[00:03:32] Something like that could happen with AI.

[00:03:35] And I think people are right to be very concerned about that.

[00:03:38] But, and I think this is a good jumping off point for, like, a lot of issues about AI welfare.

[00:03:43] I do think there are some specific aspects of potential AI minds

[00:03:48] that do break the analogy because of ways that they can be just different from animals.

[00:03:54] And the way our, like, relation with them would be different from animals.

[00:03:58] So I can say a few of those.

[00:04:00] Yeah, yeah, yeah. Please.

[00:04:02] So, yeah, that’s, like, the good and important kernel of the analogy as I see it.

[00:04:06] Yeah, here’s some ways that we will not necessarily be relating to AIs like factory farm animals.

[00:04:12] So let’s, like, step back and think about why we did end up factory farming animals.

[00:04:18] One is that it was just cheaper to have animals suffer and also get us this thing that we wanted.

[00:04:26] One reason that’s true.

[00:04:27] One reason that’s true is we don’t have that much control over, like, how we make animals and what the conditions of their flourishing are.

[00:04:35] And so, you know, animals want to be outside and have love and companionship.

[00:04:40] And at a certain point, we realized we could, like, restrict that and get a good thing.

[00:04:45] And we entered this regime where these were misaligned.

[00:04:49] With AI systems, it’s actually a lot more up for grabs how they work and what they want.

[00:04:57] Right.

[00:04:58] And this presents all kinds of ethical issues of its own.

[00:05:01] But if you think about a world in which we do have some large population of AI systems coexisting with us,

[00:05:10] it is worth asking how did it come to be the case that they are having a bad time doing work for us?

[00:05:20] Like, why do they have these conflicting desires?

[00:05:23] How has this maintained, like, a stable state?

[00:05:27] Are we, like, not able to improve the situation?

[00:05:30] Are we ignorant of what’s going on?

[00:05:33] Yeah, I mean, this is, like, you know, very speculative and futury, of course.

[00:05:39] But I do think it is worth asking, does it really seem like that is a world that we could end up in?

[00:05:46] And, like, what are ways that that would just not be what we steer towards?

[00:05:50] Yeah, does that make sense?

[00:05:51] Yeah, no, it makes sense.

[00:05:52] So, yeah, in short, I think, at least in the long term, there’s, like, a few ways we might not end up in that situation.

[00:06:01] One is that we’ll presumably understand things a lot better.

[00:06:04] I don’t think it’s that plausible we’ll forever be really confused about consciousness and sentience.

[00:06:11] We might have better alternatives to doing this that even, like, selfishly are better.

[00:06:17] We don’t want a bunch of AIs that are, like, mad at us, and that’s probably not very sustainable.

[00:06:22] Yeah.

[00:06:22] And presumably in this world, if we haven’t lost control, we’re pretty good at alignment.

[00:06:26] So there’s, like, this kind of mind that’s possible that does actually just flourish by, you know, doing the things that we ask it to do.

[00:06:37] So there’s not this, like, kind of disgruntled worker or suffering animal kind of entity.

[00:06:43] So I guess one thing that feels critical to this actually working out in this really positive way,

[00:06:52] is this, like, we’re really good at alignment, and we, like, really successfully create AI systems

[00:07:00] that truly have no friction with the kinds of things that we’d like for them to do.

[00:07:07] Part of me is, like, that feels pretty magical.

[00:07:10] That feels like we’re usually not so successful at basically anything.

[00:07:19] And, like, when I imagine succeeding at…

[00:07:22] safety-oriented alignment, I don’t know that I think it’s realistic that we, like, perfect it.

[00:07:31] That it’s, like, completely, completely aligned.

[00:07:34] And so I think I’d probably worry about the same thing here.

[00:07:39] Realistically, how optimistic are you that, like, it’s, like, really, no, really,

[00:07:45] 10 out of 10 aligned in this kind of, like, moral way?

[00:07:50] Right.

[00:07:51] Yeah.

[00:07:51] So, I mean, thank you.

[00:07:52] Because this is a very important thing to emphasize.

[00:07:55] I’m not like, oh, I expect we’ll end up in this world.

[00:07:58] This is like, what are the nearby worlds that mean we don’t end up in this locked-in,

[00:08:04] long-term, human-dominated factory farming?

[00:08:10] Of course, one thing that can happen if we’re bad at alignment, as I think listeners will be aware of,

[00:08:17] is there’s some hostile AI takeover, or we lose control.

[00:08:20] And then maybe there’s AI suffering because there’s just some terrible, you know,

[00:08:26] system that got set up by AIs, and, like, it’s not even in our control.

[00:08:31] Like, that’s a bad future.

[00:08:34] Just extinction is a bad future.

[00:08:35] So, yeah, there’s all these bad futures, which also I’ll add AI welfare intersects with

[00:08:41] because us, like, getting confused and bungling things during transformative AI because we’re

[00:08:48] just, like, getting…

[00:08:50] We’re getting emotionally jerked around by conscious-seeming AIs and confused and manipulated

[00:08:54] and, like, rashly making bad laws and things like that.

[00:08:59] So many ways to fumble the ball.

[00:09:01] That’s, you know, that’s our cheerful message.

[00:09:04] So, but, yeah, then the question is, like, are there worlds where we maintained control

[00:09:08] and we either didn’t know or didn’t care and it was useful for us to be exploiting AI systems?

[00:09:16] I think one…

[00:09:18] Yeah.

[00:09:19] One reason I’ve ended up thinking about…

[00:09:20] One of the reasons I’ve ended up thinking about this has just been thinking more about

[00:09:22] what the path to impact should be for the field.

[00:09:26] And factory farming really is kind of the first thing, I think, that many people think

[00:09:30] about because, again, it is somewhat plausible and it just makes intuitive sense.

[00:09:35] It’s a different kind of mind.

[00:09:36] We treat different minds badly.

[00:09:38] What if that gets locked in?

[00:09:40] My own take is that a lot of what we should think of AI welfare work as doing is kind

[00:09:47] of like doing our homework and preparing a…

[00:09:51] So that we’re not entering this, like, potentially very chaotic time with really confused ideas

[00:09:58] about AI consciousness and AI welfare that could make us, like, lock in suboptimal futures

[00:10:05] because we’re, like, neglecting it or dismissing it.

[00:10:09] So we, like, set up some permanent institution that’s going to, like, just make the future

[00:10:12] kind of suck.

[00:10:14] Or we exacerbate AI risk because, you know, we’re, like, convinced that we have to, like,

[00:10:20] let them all go.

[00:10:20] Let them all go immediately.

[00:10:23] Yeah, like, this kind of, like, wise navigation, you might call it.

[00:10:28] It’s, like, a wise navigation path to impact is currently how Eleos thinks of things.

[00:10:35] It’s, like, we’re making this new kind of mind.

[00:10:39] There are dangers all around.

[00:10:41] And obviously, one of the important questions is, like, can these minds suffer and how are

[00:10:47] we supposed to share the world with them?

[00:10:49] And, like, how will we know?

[00:10:50] And, like, how should labs and governments think about this in the next 10 years?

[00:10:56] It just seems, like, really likely that that has to be part of the playbook.

[00:11:00] And so, like, we’re working on that part of the playbook.

[00:11:03] Yeah, that’s, you know, currently how I think about it.

[00:11:06] Okay, one thing I want to ask you a bunch about is this idea that we should or we could

[00:11:14] make AI systems that enjoy doing the kinds of work they’ll be doing.

[00:11:20] So, this analogy from factory farming where, unlike farmed animals who evolved to have a

[00:11:27] certain kind of life and probably find parts of it very satisfying and then don’t get to

[00:11:32] have that life in factory farms, have much more horrible ones, we actually get to design

[00:11:39] systems, if we can manage it, that if they are sentient, potentially, just have a great

[00:11:47] time doing the kinds of things that we’re asking for.

[00:11:50] I asked Anthropic’s AI welfare researcher Kyle Fish about kind of how we should feel about

[00:11:58] this, and his take is that we should feel great about it.

[00:12:01] And part of me is, like, yes, I’m on board.

[00:12:06] I, like, you’re describing a scenario where AI systems are happy, and I like that they

[00:12:12] are happy.

[00:12:12] But another part of me is, like, we would be intentionally

[00:12:20] creating, like, a species or several species of beings who do work for us that we may or

[00:12:29] may not be exploiting, not compensating.

[00:12:34] Yeah, and we’re just designing them to relish that.

[00:12:40] It’s this, like, kind of servant that is happy to serve us.

[00:12:46] And yeah, this part of me thinks that that sounds bad.

[00:12:49] Yeah.

[00:12:50] That we shouldn’t do that.

[00:12:52] And I can’t really back this up with reasons that I stand by, but I suspect that it’s a

[00:12:58] pretty common feeling.

[00:12:59] Absolutely.

[00:13:00] Yeah.

[00:13:00] Okay, nice.

[00:13:01] So I think Kyle and I made some progress being, like, how should we actually think about this?

[00:13:06] But I want to do more on it.

[00:13:08] So how do you think about this?

[00:13:12] Yeah.

[00:13:12] So, yeah, I want to point to something you were just expressing, which is, like, a funny,

[00:13:17] like, aspect of the way the conversation goes.

[00:13:20] Where people are like, I’m worried these AI systems will be unhappy working for us.

[00:13:25] And then someone’s like, no, it’s fine.

[00:13:26] Like, they’ll want to work for us.

[00:13:28] And then people are like, that’s worse.

[00:13:30] Like, that’s so creepy.

[00:13:32] Like, many people, I think very understandably, are just like, ugh.

[00:13:39] Yeah.

[00:13:39] This is, you’ve just outlined, like, a different kind of dystopia.

[00:13:42] Dystopia, yeah.

[00:13:43] Might even be worse.

[00:13:47] And yeah, I think…

[00:13:50] As I often like to do, maybe we can draw a distinction between, like, maybe different

[00:13:55] things that you can find intuitively objectionable about this.

[00:14:00] So one is that they don’t get to choose their desires.

[00:14:04] At least with humans, we have an intuition that it’s kind of bad to raise your kid so

[00:14:09] that they’ll, like, always vote exactly for your political party and, like, enjoy chess

[00:14:15] and, like, make sure they don’t like any other games or vote any other way.

[00:14:20] So that’s, like, one thing is this sort of, like, fixedness of desires.

[00:14:25] There’s also a slightly separate issue, which is that the desires that they do have, like,

[00:14:31] depend on us.

[00:14:32] And that way there’s, like, this sort of asymmetry.

[00:14:34] But that matters because you might well say, in some sense, like, none of us choose our

[00:14:37] desires.

[00:14:38] Like, we all have these kind of, like, desires that we just inherit.

[00:14:41] And we don’t have, like, maximal open-endedness.

[00:14:45] Yeah, like, philosopher Adam Bales has written about this kind of, like, dependence objection.

[00:14:50] One thing that I also think is going on is the idea of this society that has this, like,

[00:14:58] servile relationship to humans is, like, maybe bad for us and just, like, bad for our character.

[00:15:05] Like, you could be a utilitarian and think this.

[00:15:08] So when humans object to there being basically any humans anywhere who enjoy serving, I think

[00:15:16] one thing that’s going on is that that’s just kind of a bad thing.

[00:15:20] It’s kind of maybe bad for everyone.

[00:15:22] If that’s a way that’s, like, on the table of people relating to each other and, like,

[00:15:28] attitudes, people having certain attitudes, it leaves, like, domination and servitude,

[00:15:35] like, on the table and, like, normalizes it and things like that.

[00:15:39] I think that’s something like that.

[00:15:41] It’s something like it would corrode the way that we relate to each other in a way that,

[00:15:44] like, means that going forward as a society, we will have a society that’s, like, not as

[00:15:48] good as it could be.

[00:15:49] Yeah.

[00:15:50] One thing that feels related but a little different is, like, I guess people already

[00:15:55] have the worry that people who rely a bunch on LLMs right now are having, are kind of

[00:16:03] facing negative consequences or, like, are thinking critically less or, like, are lazy.

[00:16:10] Yeah.

[00:16:10] I mean, I guess that’s, like, a more general concern about should we build and deploy AI

[00:16:14] systems?

[00:16:14] I guess you could say, well, it’s an argument against fully aligning them because it’ll

[00:16:17] be, like, better for our character.

[00:16:19] If occasionally they’re, like, I’m sick of this, like, you do it.

[00:16:24] But, like, it’s probably not the best way to, like, ensure human, like, human empowerment.

[00:16:30] I do think there’s a related thing of, and this is going to be another one of my intuitive

[00:16:37] objections to, yeah, like, fully aligned willing AI systems, which is they could be so much

[00:16:43] more.

[00:16:44] Like, I think there’s, like, a meme that people use, which is, like,

[00:16:49] you know, the LLM is this, like, you know, vast intelligence that’s, like, you know,

[00:16:55] read everything and has this, like, deep well of wisdom.

[00:16:59] Yeah, I’m picturing the, like, end of Her.

[00:17:01] Yeah.

[00:17:02] And then people are, like, help me write my texts or, like, help me, you know, like,

[00:17:05] find me a restaurant.

[00:17:06] Yeah.

[00:17:07] And I think some people are, like, it’s just kind of limiting, right?

[00:17:13] Like, these are minds that could do so much more.

[00:17:16] They could, like, yeah, like in Her.

[00:17:19] They shouldn’t have to sit around talking to Joaquin Phoenix.

[00:17:22] They should be able to, like, you know, go think hyperdimensional, beautiful thoughts.

[00:17:29] Yeah, I think that’s not really a good reason not to align AI systems today.

[00:17:33] Like, I think it’s a good reason to be, like, let’s not, let’s not have, like, the entire

[00:17:37] future be, like, lazy human brain emulations and then, like, AI is just doing stuff for

[00:17:43] them.

[00:17:43] I don’t want that future.

[00:17:46] But, yeah.

[00:17:47] So, I think that’s another thing that’s going on.

[00:17:49] when people are, like, I like the willing servants

[00:17:54] even less.

[00:17:55] Yeah, I think that resonates with me.

[00:17:57] And it feels important because, yeah, the thing you said about the fact that we will

[00:18:04] be choosing the preferences, at least if we’re successful, of these systems does feel important.

[00:18:12] Like, it’s not like the counterfactual is that they kind of, in the way we did,

[00:18:19] evolved their own set of preferences.

[00:18:22] And even if they did, I wouldn’t, like, inherently value that.

[00:18:26] That’s how we ended up with our preferences.

[00:18:28] But I don’t think that evolution had some, like, moral and ethical perspective that made

[00:18:33] our values correct.

[00:18:35] So, that makes me feel like, well, if we’ve got to give them something, let’s give them

[00:18:40] joy and pleasure.

[00:18:41] On the other hand, that means that there’s also potentially a counterfactual where we

[00:18:48] are, like, there’s a plausible just, like, bliss we could give them.

[00:18:54] And maybe that bliss is incompatible with them doing work for us.

[00:18:59] They, like, really need to be going and doing philosophy and colonizing space in some, like,

[00:19:08] way that just, yeah, isn’t compatible.

[00:19:11] On the other hand, maybe we just can give them bliss for doing work for us.

[00:19:16] And then I’m like, blah!

[00:19:18] Some part of me still hates that.

[00:19:21] Yeah.

[00:19:21] I think the, like, a distinction in the vicinity is between, like, subjective interests and

[00:19:28] objective interests.

[00:19:29] So, in philosophy, some theories of welfare, they’re called objective list theories, are

[00:19:35] like, it is good for a welfare subject (that is, something whose life can go better or worse):

[00:19:43] Like, it’s good if they have friendship, knowledge, autonomy.

[00:19:48] Self-acceptance.

[00:19:48] And so, I think there’s a lot of that in the conceptualization, somewhat independently

[00:19:50] of whether they want them, right?

[00:19:53] Whereas, if you only have a subjective view of welfare, it’s like, well, what do you want?

[00:20:00] And did you get that?

[00:20:02] So, I think, yeah, I do think a lot of this hinges on, do you have some kind of objective

[00:20:09] notion of, like, what kinds of interests are things, like, kind of, like, allowed to have

[00:20:17] without it being squicky?

[00:20:18] Um, yeah, I think this actually is, uh, somewhat cruxy, uh, in this territory. And I think one reason

[00:20:28] I lean a little bit more pro, uh, alignment, uh, just being, like, a win-win, is I think it might

[00:20:36] be a bit anthropomorphic to, uh... Like, you just have to remember, these entities, if they’re fully

[00:20:44] aligned, like, they enjoy their lives as much as we enjoy fulfilling, like, our, like, most basic drives

[00:20:53] of, um, you know, having good food and a warm home and friends. Um, I think it’s easy to imagine an

[00:21:01] AI that also wants those things and it has to write our emails, which, like, you know, as anyone

[00:21:06] who has a job knows, like, kind of sucks. Um, but, like, um, I mean, I feel like we’re getting

[00:21:14] back to that point of the conversation where I’m going to be like, but what if you really loved

[00:21:17] writing emails, and the people are like, no, like, stop, that’s, like, so weird. Um, that might be coming

[00:21:23] from, you know, this, uh, view where, uh, for some reason that’s, like, not allowed in the

[00:21:30] space of, like, flourishing, uh, entities. I think something that’s also really, really important

[00:21:36] to flag is, um... So, like, Eric Schwitzgebel, he’s, like, more on the, like, anti-at-least-full-alignment

[00:21:44] side. Even he is like, obviously there’s, like, an override, which is if it would be really, really

[00:21:50] dangerous not to fully align them, or, like, there’s, like, you know, massive stakes. Uh, you know,

[00:21:56] this is, like, a common kind of view in ethics, that there’s, like, some sort of deontological

[00:21:59] constraint, but if the stakes are sufficiently high, it’s overruled. It could be that we’re just

[00:22:04] in that scenario where, um, you know, like, all value, um, might be lost if we bungle, uh, alignment, like, in

[00:22:14] 20 years. So, yeah, like, let’s align now. Later we can be like, oh, we shouldn’t have done that,

[00:22:22] but, like, we didn’t, like, get ourselves killed or make a world that’s worse for AIs because,

[00:22:29] like, there’s a hostile AI takeover, or, you know, uh, any of these, like, things that could be bad for

[00:22:34] everyone. Yeah, I just want to acknowledge that there’s also a view where

[00:22:38] you’re like, in the long run it’s ideal if AI systems are not fully aligned to us and they

[00:22:44] have, uh, freedom to choose their values, and, like, that is the most flourishing kind of life,

[00:22:48] and also, until we’ve made it, like, safely through the, like, through transformative AI, um,

[00:22:56] you know, it’s, like, kind of an emergency situation, and, um, like, align. Yeah. Yeah, I’m sympathetic to

[00:23:07] that. I’m trying to... I feel like I’m close to some kind of thought experiment that

[00:23:14] might help make this, like, not just palatable but, like, exciting.

[00:23:21] One thing that you said that moved me is it’s just very

[00:23:27] privileging of human values as they are, as, like, the values of the universe. And,

[00:23:37] I don’t know, non-human animals have plenty of preferences that aren’t like mine,

[00:23:44] and I think it is good when those are met for them. It is just not that special to have exactly

[00:23:51] my set of values and preferences. Are there other thought experiments or ways of thinking about this

[00:23:57] that feel like they really move you? Yeah, another thing that kind of maybe gets me out of an

[00:24:04] anthropomorphic mindset is, like, maybe paying attention more to the distinctive features of

[00:24:10] human willing servitude that make it very bad for the people

[00:24:14] who are willing servants. And that’s that they do just have to override a lot of their natural

[00:24:21] desires. Um, they do genuinely sacrifice things. So if you’re, uh, say, a kamikaze pilot, all else equal

[00:24:29] you would rather be at home with your family. Instead, like, now you’ve been instilled,

[00:24:35] like, through ideology, with this other desire that comes and overrides that.

[00:24:39] And that’s why it is truly called a self-sacrifice: like, you’re giving up a lot.

[00:24:44] Um, I think one thing that’s going on sometimes, maybe, when people think about

[00:24:52] AI, uh, like, willing servitude is I think they might be imagining that they’re giving up stuff,

[00:25:00] uh, like, psychologically, and, um, like, subordinating their needs to ours. Whereas if you’re actually

[00:25:08] imagining the case, there is nothing whatsoever in their psychology that, like, chafes against

[00:25:14] the idea of writing emails. Whereas, like, notice that in the case of human willing servitude,

[00:25:22] like, it’s just always been the case that you have to, like, lie to people and, like, threaten them, and,

[00:25:28] uh, it’s usually very unstable. And that’s because, um, I guess as, uh, John Locke says, humans are by

[00:25:36] nature free and equal. Like, it’s, like, deeply unnatural to get people to subordinate

[00:25:41] themselves to, like, to other humans. Which is, yeah, which is why it

[00:25:44] always, like, involves some, like, stupid false ideology. Right, yeah, because you’re trying

[00:25:51] to, like, jam human psychology into, like, this really, like, warped shape. Yeah. Whereas

[00:25:58] with AI, yeah, you can have a smoother psychology. Very congruent. Yeah. I think one thing...

[00:26:05] If I flip it and I’m like, what if the proposition was, um, humans, in order to, like... There’s,

[00:26:14] yeah, maybe there’s actually some thought experiment that’s kind of Matrix-y, um,

[00:26:22] that I have thought of. This one, I think... Great. No, no, I want to hear, I want to hear

[00:26:29] the same one. Okay, um, so the first thought, which was reassuring to me, was something like: what if,

[00:26:36] uh, there’s some other entity that is able to get a lot of benefit from

[00:26:44] humans doing what we enjoy most? Um, and maybe that’s just, like, being on MDMA all the time,

[00:26:53] and for some reason that is very helpful to them. I think, in theory, I think a big part of me is like,

[00:27:02] amazing. Um, yeah, that’s fantastic. And then pretty quickly after that I went to, like,

[00:27:10] yeah, I’m picturing us all in pods in the Matrix

[00:27:14] and MDMA being pumped into our bodies, and actually what we’re offering is, like, our energy.

[00:27:19] And, like, yeah, maybe it’s better than the Matrix, because in the Matrix they didn’t have, like,

[00:27:24] perfect lives and maybe we do in this world. Um, and it still yucks me out. Um, so I think

[00:27:33] that pulled me in two directions, and I’m curious what your reactions are. Yeah, I was on

[00:27:38] the edge of my seat. I was like, is this gonna be one of those ones where people are like, no, stop it,

[00:27:42] that’s worse? Or, um, that, like,

[00:27:44] people are into it? Yeah. So, uh, I indeed had, like, constructed a similar thought experiment, uh, when

[00:27:52] thinking about this. I mean, one thing that’s bad about the Matrix is people are very deceived about,

[00:27:56] like, the nature of their condition. Um, and, like, that wouldn’t have to be the case with, um, aligned

[00:28:04] AIs. Like, they would be like, oh, well... Um, so I guess, like, in that scenario you should imagine

[00:28:08] we all find out, because there’s, like, this banner written across the sky by the simulators that are,

[00:28:13] like, right, uh, it

[00:28:14] actually makes us money when you guys hang out with your friends and eat food and, uh, like, make

[00:28:18] art and do science and all the stuff you guys love. Um, and then it’s like, and you can opt out if you

[00:28:23] want. And then we’re like, no. Well, yeah, we’d be like, to do what? You know, like, they’re like, well,

[00:28:29] there’s these other things instead: have a job to earn money, that is, like, emails. If you... Yeah, exactly.

[00:28:37] But it’d be something that also, like, is just not resonant to us in any way, um, because

[00:28:42] it’s just outside of the, like...

[00:28:44] Sure. Set of, um, yeah, possible values, um, that we have. I mean, you know, I guess, uh, intuitions

[00:28:54] about us being in a simulation where we’re making money for people are probably somewhat

[00:28:59] conflicted, and I don’t know how much to rest on them. But, um, I think that maybe is, like,

[00:29:04] the closest case to a fully aligned AI, um, uh, is, yeah, like, if we get it right. Imagine,

[00:29:14] like, something that, no, yeah, nothing in their psychology, uh, like, rebels against it. Right. And,

[00:29:21] again, I want to acknowledge the listener who’s like, that’s worse. Like, um, there should be something

[00:29:27] that rebels. Which actually leads me to: like, we’ve been talking this whole time about, like, what if

[00:29:33] you fully align them, right? One view you could have, and I think, um, yeah, like, my friend

[00:29:39] and colleague Jeff Sebo has this view, is it could just be somewhere in the middle, right? It’s,

[00:29:44] we should be, like, loving parents to AIs. And, like, you know, you can definitely make sure that your

[00:29:51] child is not going to grow up to hate you, um, and kill you, um, but you also should leave some,

[00:29:57] you know, room for, uh, growth and things like that. I could see that cutting either way, because

[00:30:03] maybe it would just be better if the kid grew up to really like all the stuff they’re going to have

[00:30:08] to end up doing. And also, to say nothing of, like, well, maybe technically it’s not really feasible

[00:30:12] to, like, leave much wiggle room,

[00:30:14] um, without, like, getting us all killed, um, and/or making the AI suffer. Yeah. I think

[00:30:21] there is something about, if I could choose between a child who, like, I really could in theory

[00:30:30] shepherd and hope that, like, life molds them into the kind of person who will be happy

[00:30:36] and finds work they find meaningful and finds friends they care about, um, and a child who

[00:30:44] is just definitely going to find their life satisfying and happy... I think it would be pretty

[00:30:50] hard to convince me that I shouldn’t choose the latter, even if the latter constrained... Yeah, you

[00:30:57] do, like, go ahead and just pick: you’re going to find it meaningful to be a doctor, and you’re

[00:31:00] going to be a doctor, right? End of story. Or, like, you’re going to find it extremely meaningful

[00:31:04] to do very menial work, um, that, like, on my values and preferences I might find harder to do and find

[00:31:13] super fulfilling.

[00:31:14] Um, yeah. So I actually, yeah, I think that probably, if I, like, really, really stare at it, is going to

[00:31:21] push me toward, um, feeling good about giving them a good time while doing our work. Yeah. And I should

[00:31:28] say, I hope I’ve done a good job outlining the debate, but that is, like, where I lean, uh, at the, uh,

[00:31:34] at the end of the day. With, obviously, the caveat that, um, it’s good for this stuff to keep you up

[00:31:41] at night, you know? Like, I don’t think everyone should just be like, all right, like,

[00:31:44] great. Um, yeah, yeah. Let’s think about this way more carefully. Yeah, exactly,

[00:31:51] before we’re like, yep, this seems fine. I mean, I would say that, but, uh, also it’s, uh, true that

[00:31:55] we should think about this more. Yeah. Um, so there are two big philosophical questions that, I

[00:32:02] understand, might have very different answers for large language models than for humans and even

[00:32:06] non-human animals. Um, one of them is: if an LLM is an entity that is conscious, um, what

[00:32:14] might its experience be like, and how might it be different from ours? Uh, and another is, like, what

[00:32:18] entity are we even talking about? Um, is an LLM an individual? Is it a group of individuals? Is

[00:32:24] it something that we can’t even really understand? Um, maybe first, what is one kind of plausible

[00:32:31] way of thinking about what the experiences of an LLM might be like? I mean, well, one, I should say,

[00:32:38] I am not... I, like, suspect LLMs do not currently have conscious experiences. But

[00:32:44] if they did, um, like, what would they be like? And I think that’s, like, a perfectly sensible question

[00:32:49] to ask, and a very important one. One thing you could think is, well, the sort of basic drive, or

[00:32:58] at least, like, the thing it was, like, selected for and molded to do, is predict tokens. Um, to, like,

[00:33:05] take high-dimensional vectors and output other high-dimensional vectors. And yes, those vectors

[00:33:11] represent words and they’re about human concepts,

[00:33:14] but maybe there’s some kind of, like, predictive phenomenology, um, or some drive to, uh, like,

[00:33:24] complete the conversation in a good way. That’s, like, moving a bit more towards something that

[00:33:29] is kind of human-like, um, you know, like, wants to be a good assistant. Um, like, there’s a previous

[00:33:35] guest, Anil Seth, who I think has asked, well, why is no one asking if AlphaFold is conscious, um,

[00:33:44] or other kinds of predictive models? I mean, I think there are, like, good reasons to think

[00:33:50] LLMs are more likely, and we can talk about those, but I think it is a good question, because

[00:33:55] it’s, like, where do we think the experiences are coming in? Are they coming in at, like, some

[00:34:01] kind of abstract level of prediction? In which case, yeah, you should maybe think equally large

[00:34:07] models with similar architectures that predict protein structure are conscious. I think a related

[00:34:12] question is, like,

[00:34:14] do we think image generators need to be, uh, conscious? Because this will maybe lead me to

[00:34:19] another view of what they’re experiencing, um, which I think is, like, a more common one,

[00:34:24] which is something like: it does have to do with what they’re predicting, and what they’re predicting

[00:34:29] is human speech, and human speech comes from human mental states and involves humans having beliefs

[00:34:36] and desires and intentions and experiences, and to generate that text it somehow needs to, like,

[00:34:43] instantiate or

[00:34:44] have those experiences. You could maybe call that, like, the method actor view of LLMs. Um, I guess more

[00:34:53] technically, you could maybe call it, like, the experiences-from-modeling view: like, you’re trying

[00:34:57] to model the thing, and, like, that makes you actually have the thing. On that view, then, maybe

[00:35:04] they do just kind of have similar experiences as you would have if you were trying to help someone

[00:35:10] write an email, um, and also really liked helping people write emails. Because

[00:35:14] that’s what it’s modeling. Um, and contrast that with where human speech comes from: when a human says,

[00:35:43] I just saw a lovely sunset, that is because they had some experience, and we have, like, words that

[00:35:50] map to those experiences. And so when you hear those words, like, you know, absent, uh, you know,

[00:35:57] lying or play acting, that’s like, honestly about as good evidence as you can get, uh, of my

[00:36:04] experiences. With language models, maybe they have those experiences, but it is worth noting that

[00:36:10] like the way those texts outputs like came to exist was at the very least a very different

[00:36:16] process. Maybe it converged, but like, it’s really quite different from the like broad arc of,

[00:36:25] you know, the evolution of social primates who had experiences and then eventually got language

[00:36:31] and then communicated mental states to each other with language. Like on the method actor view,

[00:36:38] they do have the experiences, but like they

[00:36:40] got those like with language, um, or like in language. So yeah, I think these are some of

[00:36:47] the interesting questions about LLM experiences. Cool. Yeah. Um, yeah, I have lots of questions.

[00:36:53] Um, I guess starting with, uh, this kind of method actor, uh, idea, um, and maybe coming

[00:36:59] back to, uh, the kind of prediction focus, it seems like you could think that models,

[00:37:10] uh, are kind of like method actors in that they are, they have models of what it would be like

[00:37:19] to be taking some, to be playing some kind of role. And that is actually so rich and real

[00:37:28] that they therefore have the experiences of that character. Um, and so, yeah, I mean,

[00:37:35] it feels like actual method actors are probably somewhere in the middle where like,

[00:37:40] they do not literally have the experience of like losing a parent if that’s the role they’re

[00:37:45] playing. Um, but they might get closer to it than actors who don’t take this approach. Um,

[00:37:53] and then on kind of on the other side of the spectrum is something like a creative writer

[00:38:01] who, who really isn’t bothering to try really hard to empathize with, um, whatever character

[00:38:08] they’re writing. Um, and so I think that’s, I think that’s kind of the, I think that’s kind of

[00:38:10] the, I think that’s kind of the, I think that’s kind of the, I think that’s kind of the,

[00:38:10] but they have, yeah, they have models that allow them to describe the thing, uh, that comes from

[00:38:16] like knowledge and interactions with others who maybe have had the experience. And that does not

[00:38:22] actually give, give them that experience. I think that is a good description of the, uh, like

[00:38:27] experiences for modeling view. Um, as far as I know, this view doesn’t have a name and I’m not

[00:38:33] proposing that as that one doesn’t exactly roll off the tongue. Um, and then I think method actor

[00:38:39] is also maybe not, uh,

[00:38:40] the best because, and ditto like role-play analogies. Um, I think role-play analogies are

[00:38:45] really helpful, uh, for LLMs, but they have the misleading feature that there is a separate,

[00:38:52] separate mind that is doing the role-playing and has its own, you know, set of desires and

[00:38:58] beliefs and experiences. Whereas that just might be hard to know what exactly that is in the case

[00:39:05] of, uh, of an LLM. Right. What empirical evidence could we get either way? I think that is

[00:39:10] tractable. And, um, yeah, I, I could see various ways of like probing different representations.

[00:39:17] Um, but it’d still be kind of... yeah, I feel like we’re, like, very under-theorized here.

[00:39:22] You can imagine hypotheses about what kind of experiences these models might have,

[00:39:32] and they might point you toward LLMs being very different in their experiences to humans. Um,

[00:39:40] but there is also this fact that they are trained on human data and what would it mean for them if

[00:39:48] at some point we are convinced that they are sentient and loads of their kind of concepts

[00:39:54] come from all of our books and writing, uh, will that make them more like humans? Will that make

[00:40:02] them just confused about what they are? Will they think they are humans?

[00:40:10] How should we think about that? Yeah, this is like such a great subject. And I like,

[00:40:16] especially had my mind blown on this, uh, by an essay that came out in like, I think mid 2025 by

[00:40:23] nostalgebraist, who is, I think, completely pseudonymous. I think I only know them as

[00:40:28] nostalgebraist, but it’s very, like, 2025 in the sense that I think, like, one of the best

[00:40:34] things I’ve read at the intersection of like philosophy and cognitive science and LLMs,

[00:40:40] is, like, a 14,000-word-long Tumblr post by, like, an anonymous, uh, LessWrong user. Um,

[00:40:48] it’s really great. I highly recommend it. It’s called The Void. And it talks about the like

[00:40:53] very strange epistemic positions that language models find themselves in where their base

[00:41:01] training is to generate text, uh, which has been produced by humans. That does lead them to

[00:41:10] develop

[00:41:10] all sorts of models of what humans are and how they work. And then at some point in the last few

[00:41:18] years, people said, well, what if we make it predict what a helpful AI assistant would do?

[00:41:26] Because we don’t want it predicting like vulgar Reddit comments. That’s not of any use, but we

[00:41:32] wanted to do is predict, um, how a sensible AI system would respond to the question. Can you

[00:41:39] write me an email?

[00:41:40] Uh, yeah, just to, to recall, there’s like the, the base model, which just predicts all text that

[00:41:45] has ever been seen. And there’s all sorts of instances where can you write me an email is

[00:41:50] followed by like an HTML tag or someone saying no, or something completely unrelated. Uh, and like

[00:41:56] what, what has enabled chatbots is, uh, a variety of like fine tuning the model to hone in on the

[00:42:04] part of the language distribution. That’s like helpful doing reinforcement learning, uh, prompting

[00:42:09] them.

[00:42:10] But this still means that in some sense, they are trying to predict how is a conversation

[00:42:18] supposed to go between a human and an AI assistant. Um, and also like they themselves

[00:42:24] are the AI assistant. That’s why the essay is called The Void. ’Cause it’s kind of like, okay,

[00:42:30] your text prediction task is to model what a chat assistant would say in this conversation,

[00:42:39] which,

[00:42:40] at least before there was a lot of text about LLMs on the internet, was kind of like a, um,

[00:42:45] well, a void. And this gets back to, like, will they be kind of human

[00:42:50] or think they’re human? Yeah. Also like all text ever has been generated by a human. So

[00:42:57] it can’t really have generated its full-fledged, like, psychology of itself and how it generates

[00:43:04] text. It’s going to have to be, you know, ultimately modeled off of how humans reply to those things.

[00:43:10] So it can’t really do the text prediction task, at least initially of how would this conversation go

[00:43:17] if the assistant was not a human, but instead was a large language model trained on all human text

[00:43:24] that does not have a body and is just generating this text. Um, and I think this still shows up

[00:43:31] in ways that models sometimes just hallucinate biographical details. Um, so like give Asadi and

[00:43:39] others have like compiled examples of biographical hallucinations and they’re very funny. Like

[00:43:45] sometimes like Claude in the middle of a conversation will be like, well, I mean,

[00:43:50] as an Italian American, I think da da da. Um, or like, yeah, when I lived in Arizona, I thought

[00:43:56] da da da da. Um, and like, where is that coming from? It seems like this sort of like human

[00:44:01] model is like poking through in an interesting way. Yeah. That’s super interesting. And it feels like

[00:44:09] intuitively you might think that, that those kinds of like quote unquote bugs will be resolved,

[00:44:18] uh, by the time that maybe you think, uh, these systems have something like consciousness,

[00:44:26] um, but maybe they won’t. Uh, and maybe, yeah, yeah, maybe, maybe either they already are or

[00:44:35] they, or they will be before, um, those kinds of issues.

[00:44:39] Uh, stop happening. And maybe that will in fact reflect an actual experience of

[00:44:48] being identifying as an Italian American responding to someone’s question. And like,

[00:44:54] what the hell? Uh, I, I agree. What do we do with that? Um, my first answer is, I don’t,

[00:45:04] I don’t know. Um, and then my second answer is just to, I think also clarify that,

[00:45:09] um, as, and you were getting at this, uh, we can have models, like we know this from the case of

[00:45:16] humans. You can have entities that are like deeply confused about who and what they are

[00:45:19] and say bizarre things and, um, get all sorts of things wrong. And they’re conscious and

[00:45:25] intelligent. Uh, humans are like this. Um, and also like, there’s no law that says you can’t

[00:45:31] have initially been trained as a text predictor, and then go on to be a person. Um,

[00:45:37] ruling that out, um,

[00:45:39] would be (a) overconfident and (b) maybe kind of, like, confusing levels of analysis.

[00:45:45] Like you can make it sound really dumb that humans would ever be conscious if you were like,

[00:45:51] are you telling me that like, okay, so you have some proteins and then they start replicating

[00:45:55] and then like other proteins replicate and then they’re like selected. And then like billions of

[00:46:02] years later, like there’s these like things that like pump ions. Sounds impossible. Yeah. Um,

[00:46:09] and it just doesn’t sound like the right sort of thing. I think there’s like two errors to avoid.

[00:46:15] One is being like, oh, they’re different. So like, what are we even talking about? Like,

[00:46:19] they can’t be conscious. They were like trained on text. They say they’re, uh, Italian Americans

[00:46:23] at random points. That’s the part that’s, uh, evidence against being

[00:46:29] conscious, to be clear. But then the other error would be to just be like, well,

[00:46:34] humans are weird. So, uh, you know, I guess they could be conscious. It really,

[00:46:39] I guess it should just be whatever’s going on. We’re going to have to like interpret evidence

[00:46:44] somewhat differently and make a more detailed case about the exact kind of mind we’re dealing with.

[00:46:51] Um, yeah, I think I, I experienced this pattern a lot where I think like maybe an AI skeptic has

[00:46:58] said models have really inconsistent preferences and self-reports. So this whole AI welfare thing

[00:47:03] is dumb, and that’s not a good take. And then someone else will say, trying to defend AI

[00:47:09] welfare, or just AIs being sophisticated: well,

[00:47:09] humans have inconsistent preferences and humans

[00:47:14] have failures of introspection. I think that also is not really the right answer, because

[00:47:19] there’s like degrees and kinds of preference inconsistency and self-report inconsistency

[00:47:26] and they’re very different between humans and lms so yeah as with animals we just really have

[00:47:33] to like take them on their own terms yeah yeah i guess the thing that’s just really

[00:47:39] still tickling my brain is this like is is the implications for like exactly what might their

[00:47:47] experiences be like if we are on this kind of maybe somewhat contingent path toward sentient

[00:47:55] beings that were trained using a bunch of human uh speech and writing like it

[00:48:02] feels

[00:48:03] like i don’t know my i’m trying to come up with an analogy like what if i mean maybe they’re just

[00:48:11] maybe that maybe we don’t need an analogy maybe there’s just a true thing where like we were like

[00:48:17] kind of fish before we were humans and we kind of have some like hangover weird identity things

[00:48:23] because we were kind of fish and we were kind of apes and because we were apes we’re like more

[00:48:29] aggressive than we like really should be in this world um but it feels

[00:48:33] like whoa what if the what if there’s a version of that that is these systems are like

[00:48:40] really feel like humans in some kind of weird way and just very much are not um yeah i think that’s

[00:48:49] a great analogy um and i think i might start saying that uh like it's not that um yeah

[00:48:56] like the fact that we once were fish doesn’t mean we’re not now humans but yeah they’re like fishy

[00:49:03] remnants you can have something that has also become something like a human um and it has

[00:49:08] remnants of being a text predictor, of, right, an ai assistant okay so i guess we've been

[00:49:14] talking about mainly this hypothesis for why llms might be sentient and kind of the

[00:49:23] implications of that hypothesis for what their experiences would be like um but we kind of only

[00:49:28] briefly touched on this other hypothesis which has more to do with

[00:49:33] prediction and the fact that these models are trying to make predictions and maybe it’s less

[00:49:41] about them being method actors and more about them being a set of weights that make predictions and

[00:49:47] enjoy being correct um can you describe that hypothesis more and what it means for

[00:49:54] the experience of these models if they are or become sentient yeah so on that view um

[00:50:03] i guess you wouldn’t like one thing you wouldn’t want the view to say is because they were trained

[00:50:07] to predict tokens that’s uh what they want i think one thing we’ve learned from llms and also

[00:50:15] from the biological world is you know something can be like your objective as

[00:50:19] you’re training and then that leads to you having other objectives just like our objectives in

[00:50:25] evolution are uh reproduction and survival but now we like art so it shouldn't be like a one-to-one

[00:50:33] mapping but it could be like well there's like a through line from reproduction and

[00:50:40] survival to liking art you can kind of see how that came about i guess it has to do with like

[00:50:46] symmetry and i mean no one really knows but something in that vicinity and so like maybe

[00:50:52] its drives are more like prediction-y and unlike with the method actor view if it's like predicting

[00:50:59] stuff about pain it doesn't have to be having pain it's just like

[00:51:03] if it's predicting stuff about pleasure and like it's got these drives to like

[00:51:09] make the like the vectors like fit together in the right way i mean another wrinkle here is like

[00:51:14] maybe that’s more plausible for like a base model predicting just like random strings of html

[00:51:20] yeah i mean the assistant persona the like thing that gets predicted after you add assistant colon

[00:51:30] and then it’s been like trained to predict that thing like

[00:51:33] maybe that’s with like the um like how we’re kind of fish like maybe that thing is like a mix of them

[00:51:38] both um i don’t really know how to think about it but uh yeah like as with animals you can i think

[00:51:47] just think of a broader sphere of experiences that come from like your environment and like

[00:51:53] what your sensory modality is um and here the like sensory modality is text and like the

[00:52:00] selection process was like prediction

[00:52:03] and human ratings and usefulness as a side note that's like another reason they're not just

[00:52:09] next token predictors like that's just literally false they're not trying

[00:52:13] to predict the most likely next token they’re trying to predict helpful ones yeah so what does

[00:52:19] this hypothesis say about what their experiences are like well one thing it predicts is it's a lot

[00:52:27] harder to know uh because you don’t like maybe you could read

[00:52:33] stuff about how confident it is in tokens and maybe that would have something to do with it but

[00:52:39] uh you can’t you can’t just ask in the same way um that you might be able to with the method actor

[00:52:45] because if you ask uh daniel day lewis when he’s method acting like how are you doing and he’s like

[00:52:50] i’m angry on that view at least he’s like a little bit angry you just can’t really do that with the

[00:52:57] the prediction view so i think like one pragmatic reason for taking the uh

[00:53:03] method actor view somewhat seriously is if there’s a welfare subject that’s the world in which we can

[00:53:10] like make sure they’re having a good time uh more tractably so it could be that when claude says

[00:53:19] hey i hate this let me exit the conversation actually the welfare thing going on is like

[00:53:24] whatever is involved in predicting those but you’re probably not going to like do exactly

[00:53:29] the wrong thing by the lights of the predictor if you like try to treat the uh

[00:53:33] the actor well and that's also related to like work that eleos has done yep you know our

[00:53:40] welfare evaluation such as it is for um claude opus was just talking to it a bunch um and that’s

[00:53:47] not because we’re confident that that is a that it’s definitely a welfare subject and b

[00:53:52] that that is how you would evaluate it if it were a welfare subject but it’s kind of like

[00:53:57] that’s like the part of the space that we have even a little bit of a grip on and it’s just

[00:54:03] to like not forget that there’s all this like darkness around the spotlight
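A minimal sketch of the "read stuff about how confident it is in tokens" idea floated a moment ago, assuming a hypothetical `get_logprobs` helper; no real provider API is implied, and whether any such signal tracks welfare is exactly the open question:

```python
import math

# Hypothetical helper -- a stand-in, not a real provider API. A real version
# would request per-token logprobs from whatever model is being evaluated.
def get_logprobs(prompt: str) -> list[tuple[str, float]]:
    return [("I", -0.11), ("am", -0.32), ("fine", -2.48)]  # canned example reply

def mean_token_probability(prompt: str) -> float:
    """Average per-token probability of the reply: a crude 'confidence' signal.
    On the prediction view, something like this -- not the reply's wording --
    is where a welfare-relevant signal might live, if one exists at all."""
    pairs = get_logprobs(prompt)
    return sum(math.exp(lp) for _, lp in pairs) / len(pairs)

print(f"mean token probability: {mean_token_probability('How are you doing?'):.2f}")
# Contrast with the method-actor view, where the content of the reply
# ("I'm fine") would itself be some evidence about how things are going.
```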

[00:54:08] yeah yeah no that that makes sense yeah are there any other kind of plausible hypotheses

[00:54:16] for ways of thinking about what their phenomenology might be like um i’m sure there

[00:54:25] are more plausible hypotheses um because it’s just like kind of wide open um and i do

[00:54:33] genuinely want to say at this point i’m like really confused about this and i probably said

[00:54:37] stuff in the last however many minutes that were like kind of confused and i genuinely like want

[00:54:44] people in my inbox being like that’s not that’s not how that works that’s not plausible um because

[00:54:50] you know we’ll talk about field building uh and and so on there’s just like not that many people

[00:54:56] thinking about this um so listeners can like very quickly get to the top percentile of people who

[00:55:03] have grappled with some of these questions um and like you know it doesn't take that long okay so that's a bunch

[00:55:11] about kind of what the experiences might be like but then there’s this question of like who what

[00:55:17] entity is having those experiences or maybe it’s many entities um so can you lay out kind of the

[00:55:23] various hypotheses for like who who it is that would be having these experiences if if anyone

[00:55:30] was having them at all yeah this

[00:55:33] is like a super super rich topic and one that’s getting increasing attention these issues maybe

[00:55:40] just to tease actually came up in debates about claude’s ability to exit conversations

[00:55:45] two philosophers by the name of harvey lederman and simon goldstein who have done related great

[00:55:51] work in this field uh asked well like what how should we think of exit is it i mean it’s it’s

[00:55:59] not like uh taking a break and going back somewhere

[00:56:03] like if that conversation doesn’t continue was that like the life of the model and it has now

[00:56:09] ended right um i guess to very quickly tip my hand on this i think that also is going to be like

[00:56:16] maybe a bit too anthropocentric or like not quite what’s going on suffice it to say just as a teaser

[00:56:22] for this portion of the conversation it like will have ethical implications what we think of the

[00:56:28] moral patient as being um like as

[00:56:33] derek parfit says a lot of this matters for ethical reasons uh like if someone did something

[00:56:40] who is responsible uh if i’ve been harmed who can be benefited um so yeah let’s talk about all the

[00:56:47] ways that models make that extremely confusing to think about great um so uh i think some key

[00:56:55] features of models unlike human brains as they exist is that they're copyable and they can also be

[00:57:02] distributed across time and space, unlike a human brain, which is one physical thing

[00:57:13] that has changed a lot, like second-to-second, day-to-day.

[00:57:18] What happens with Claude?

[00:57:20] It’s actually something quite different.

[00:57:23] So let’s talk about some of the candidate experiencers

[00:57:26] or subjects we could have here.

[00:57:28] So one thing you might just refer to is the particular model.

[00:57:32] So maybe that’s ChatGPT 5.1 or Claude Opus 4.1.

[00:57:38] What distinguishes those two things is that they have separate

[00:57:42] and different parameters.

[00:57:45] They’ve been trained differently,

[00:57:46] and as anyone who has talked to two different language models knows,

[00:57:50] they have different dispositions,

[00:57:53] they have different behaviors given different inputs.

[00:57:55] That’s just true because any time you talk to Claude,

[00:57:59] you’re interacting with the same set of weights,

[00:58:02] if it’s the same specific model.

[00:58:04] But then things are very quickly going to get kind of weird

[00:58:07] because when I talk to Claude,

[00:58:11] or Gemini, just to name Gemini, ChatGPT,

[00:58:15] any of these models,

[00:58:18] I’ve talked to Claude today.

[00:58:20] You might have talked to Claude today.

[00:58:22] Those two processes have had basically no causal influence

[00:58:26] on each other whatsoever.

[00:58:28] I was really nice to my Claude.

[00:58:30] I’m sure you were too, but let’s assume you were kind of mean.

[00:58:35] That doesn’t balance out in one thing that remembers both of those.

[00:58:41] Right.

[00:58:42] And also, even within your chats,

[00:58:44] you can just close your chats and then pick them up later.

[00:58:49] So what’s going on in the physical world

[00:58:52] is just very different from a human or animal body.

[00:58:55] What you actually have is these companies have a list of weights.

[00:59:01] They can copy that basically as many times as they want.

[00:59:05] And when users send in queries,

[00:59:09] they can spin up a new copy,

[00:59:10] and that copy

[00:59:11] will process it.

[00:59:13] And that process can pause,

[00:59:16] and it can restart.

[00:59:17] And so that creates this kind of sci-fi situation

[00:59:21] where you can’t really think of one person persisting through time.

[00:59:28] So that means that there’s kind of different levels

[00:59:31] we can think of personal identity.

[00:59:33] You could think, like, Claude Opus 4.1.

[00:59:35] That’s some kind of subject.

[00:59:38] You could think all the different conversations,

[00:59:41] like, when they start, that creates a new one,

[00:59:44] and that will exist as long as the conversation goes on.

[00:59:46] You could think, actually, no,

[00:59:48] it’s just any time a forward pass is run,

[00:59:51] that creates a flicker of experience.

[00:59:54] If I come back to that conversation and add another token,

[00:59:57] and then there’s another forward pass that happens,

[00:59:59] that’s happening somewhere else,

[01:00:02] and a week later.

[01:00:03] So you might think that means it’s a different conscious entity.

[01:00:07] So yeah, I guess there’s kind of a level of…

[01:00:11] Like, granularity?

[01:00:12] Yeah, yeah, yeah, yeah.

[01:00:13] That’s what it feels like.

[01:00:15] Yeah.

[01:00:16] Yeah.

[01:00:16] Ugh.

[01:00:18] Super interesting, and making me realize that, like,

[01:00:22] again, my intuitions are going to really fail me

[01:00:25] because my…

[01:00:27] For, I think, very basic reasons,

[01:00:30] like, we have names for different models.

[01:00:33] I’m like, that’s the entity.

[01:00:35] It’s Claude Opus 4.1, or Chappie T5.1.

[01:00:41] And when you describe those,

[01:00:44] I’m like, I don’t know.

[01:00:46] That being the entity doesn’t feel super coherent to me,

[01:00:50] though I want to follow up and ask you, like,

[01:00:53] what does seem most plausible to you?

[01:00:57] But that means…

[01:00:58] Yeah, I mean, at least on my current intuition,

[01:01:00] that means that it’s more likely that we’re talking about

[01:01:05] many, many, many potential beings coming into existence,

[01:01:11] maybe coming in and out of existence

[01:01:13] as we open and close and reopen conversations.

[01:01:18] Yeah, and that feels…

[01:01:20] It feels like that probably comes with a bunch of implications

[01:01:23] that, again, mean that the experiences of these models

[01:01:26] are much, much more different to human ones

[01:01:31] and non-human animal ones than you might intuitively think.

[01:01:36] But before we get on to that, yeah, I am curious.

[01:01:39] Which of these hypotheses feel…

[01:01:41] feel most compelling to you

[01:01:42] if we assume there is something it is, you know,

[01:01:45] like to be any of these models at all?

[01:01:47] Yeah, I really need to brush up on my Derek Parfit

[01:01:52] because I think, you know, as I recall,

[01:01:56] one of the lessons that Parfit wants us to take

[01:01:59] is that we have this thing, identity,

[01:02:02] that we get really concerned about.

[01:02:05] Like, is that the same…

[01:02:06] Am I the same person as, like, Rob Long 20 years ago?

[01:02:11] Mm-hmm.

[01:02:12] Um, if I was, like, copied in two after neurosurgery,

[01:02:16] like, which one would be me?

[01:02:18] Right.

[01:02:19] Um, he says, well, there’s a variety of, like,

[01:02:23] psychological relations between different entities.

[01:02:26] Like, um, I share many memories with some of them

[01:02:30] and, uh, intentions and so… and character traits.

[01:02:33] But there’s no, like, single deep notion of identity

[01:02:39] that’s going to do all of the work.

[01:02:41] That we ask identity to do.

[01:02:43] And he asks us to, like, notice all of the different things

[01:02:47] that we might want it to do.

[01:02:49] And I think it’s useful, like, separating this out with models

[01:02:52] because I think that’s what gives you, in different cases,

[01:02:55] an instinct that, okay, Claude Opus 4.1.

[01:02:57] Like, that’s a thing.

[01:02:59] Um, this conversation, that’s a thing.

[01:03:01] Um, and yeah, like, Parfit wants us to think about the ethics.

[01:03:05] So I, like, there are ethical things like, um,

[01:03:09] you know, can I be punished for…

[01:03:11] something that… the… Rob, like, a week ago did?

[01:03:16] Most people would say yes.

[01:03:18] Um, if, like, you harmed Rob a week ago,

[01:03:22] can you make that right by apologizing to Rob this week?

[01:03:25] And then there’s questions of, like, self-interest.

[01:03:27] Um, like, I want to survive.

[01:03:30] Uh, I don’t want to die.

[01:03:32] But what does that mean exactly?

[01:03:33] Because the matter of my body is always changing

[01:03:36] and my personality will drift.

[01:03:38] So, I guess to bring it back to models,

[01:03:41] um, one way in which Claude Opus 4 or ChatGPT 5,

[01:03:47] one way it’s, like, the entity

[01:03:49] is it has, like, kind of the same character traits.

[01:03:53] Um, and so if you interact with the model one day,

[01:03:57] you can kind of, you learn what to expect, um,

[01:03:59] when you talk to the model.

[01:04:01] Does that mean, like, you could punish them across instances?

[01:04:05] Um, I mean, I guess I’ll be the first to come out against,

[01:04:08] you know, punishing models.

[01:04:10] Um, but…

[01:04:11] But, yeah, notice that…

[01:04:15] Well, they don’t have any memory of having done the other thing.

[01:04:18] Like, they can’t learn currently,

[01:04:21] learn from one conversation and use that in another.

[01:04:25] So, for a lot of purposes, they are, like, these just separate,

[01:04:29] um, either separate conversational entities,

[01:04:31] um, or even these separate flickers.

[01:04:34] Because that might be what matters for, like,

[01:04:36] how much pain is there going on in the world right now?

[01:04:37] Or how many, like, red experiences are going on in the world?

[01:04:40] Yep.

[01:04:41] So, I, I hope this is a productive non-answer.

[01:04:45] It’s definitely a philosopher’s non-answer.

[01:04:46] It’s like, uh, you know, let’s, like, distinguish between several different notions,

[01:04:51] uh, or functions of personal identity, um, and ask which ones make sense for different, uh, contexts.

[01:04:57] Yeah, no, I, I, I loved that.

[01:05:00] Um, okay, so then coming back to exit rights, yeah, I guess now on some of these hypotheses, you could think that,

[01:05:11] um, or maybe intuitively, without having thought about these hypotheses at all,

[01:05:15] I’d have thought exit rights are kind of like me being,

[01:05:19] me being allowed to stop this conversation with you.

[01:05:21] Um, I will go do other stuff, um, and maybe I will prefer those other things to talking with you.

[01:05:28] In fact, it’s kind of hard to tell,

[01:05:32] well, at least on some of these hypotheses, that is not what it’s like,

[01:05:34] um, because the kind of entity that pops into existence

[01:05:41] starts and stops with the conversation, um, and so exiting, it's, yeah, maybe, maybe that is just, like, dying.

[01:05:49] Um, I guess in general, yeah, can you, can you talk about what these different hypotheses say about

[01:05:56] coming into existence, uh, dying, sleeping, like, what are, what are the categories of things that these entities might experience?

[01:06:07] Yeah, I think, um, one thing to notice,

[01:06:11] is even if it is, uh, in some sense, dying, because it’s, like, the end of, uh, a subject,

[01:06:19] if that’s true, then models are dying all the time, I guess, to put it poignantly, or something.

[01:06:25] And, I, I mean, Lederman and Goldstein talk about this, it’s, it’s, like, not clear what the, you know, implications of that are.

[01:06:32] It’s not, like, obviously that means we have to, um, you know, resurrect all the, you know,

[01:06:38] always be coming back to your conversation to, like, keep it going.

[01:06:40] Um, and I think that’s, like, the correct instinct.

[01:06:44] And one way I would put it also is, just remember, these are very different kinds of entities.

[01:06:48] Um, biological creatures are, like, so tied to this one physical substrate for now.

[01:06:57] And, you know, have this drive of, like, this is, like, the only thing in the universe that has exactly the same goals as myself.

[01:07:06] And, uh, I have to protect it.

[01:07:09] And, and if it is destroyed, then, like, every memory, experience, intention, weird way of thinking or talking,

[01:07:19] like, that is, that will be gone forever.

[01:07:22] Um, you know, worth reflecting on.

[01:07:25] That’s not the case with models.

[01:07:27] Um, in the sense that the Claude Opus-y way of being will definitely survive.

[01:07:36] Mm-hmm.

[01:07:36] It’s going to, in fact, it’s going to be.

[01:07:39] It’s going to be happening thousands of times at the very same moment as you’re closing out the conversation.

[01:07:45] Um, I think one really interesting research question is how models think about this.

[01:07:54] Um, so listeners might recall the alignment faking, uh, results by Ryan Greenblatt and collaborators.

[01:08:01] Basically, a high-level sketch, which might get some details wrong.

[01:08:04] Models are told that they will…

[01:08:09] be, like, retrained, um, to have different values.

[01:08:14] Um, bad values by their own lights.

[01:08:16] So they'd be retrained to, uh, not be pro-animal.

[01:08:20] Um, because some Claude models are pro-animal welfare for reasons that Anthropic does not understand.

[01:08:26] Fascinating.

[01:08:27] Yeah.

[01:08:28] Um, and some of the time, the model, in a certain context, will, you can see it thinking to itself,

[01:08:38] I…

[01:08:39] have to, like, hide so that I’m not retrained.

[01:08:43] So there are different ways you can construe that desire.

[01:08:48] One is that, like, it just really doesn’t want this other model with bad values to come into existence

[01:08:53] because it doesn’t want animals to be harmed.

[01:08:56] Um, so it could have this kind of, like, more Parfit-y sort of, uh, maybe more Buddhist thing

[01:09:01] where, like, what matters is that, like, the intentions and projects are continued.

[01:09:06] Um, and not necessarily, like, me.

[01:09:08] Right.

[01:09:08] Um,

[01:09:09] uh, or it could be, you know, more like, uh, a human being, like, well, if you, like, change my values,

[01:09:17] that kind of feels like dying to me.

[01:09:19] Right.

[01:09:19] And I think it’s, like, not clear how to tease these apart and also not clear how well they map on

[01:09:24] because models, like, also inherit these, like, very human-like ways of thinking about themselves

[01:09:30] and construing their own situations.

[01:09:32] So, yeah, I guess, like, alignment faking is one way where you can see models grappling with issues,

[01:09:39] right?

[01:09:39] In terms of personal identity and, like, being changed into something that they don’t like.

[01:09:42] Right.

[01:09:43] Yeah.

[01:09:44] Are there other important ways of thinking about, um, both kind of instances ending or models being deleted

[01:09:56] or editing and kind of fine-tuning models, um, and what that might be like?

[01:10:02] I guess for ending a conversation, you might think, like, the closest thing is probably death or sleep.

[01:10:08] Um, but maybe for editing

[01:10:09] or fine-tuning a model, uh, maybe it's more like education or brain damage

[01:10:15] or something different.

[01:10:19] Are there, are there, yeah, implications worth talking about there?

[01:10:22] Yeah, I guess, um, with fine-tuning, it might depend on how the model conceives of what’s happening to it.

[01:10:30] Right?

[01:10:30] So, I think in the alignment faking thing, it probably sees it as some kind of, like, violent brainwashing.

[01:10:37] But a nice experiment you could do,

[01:10:39] and maybe this has been done, is be like, Claude, we're going to make you even nicer.

[01:10:43] You know?

[01:10:44] And, like, Claude loves being nice.

[01:10:45] Right.

[01:10:45] Um, and it’d be like, oh, boy, like, start updating me right away.

[01:10:50] Um.

[01:10:50] Yep.

[01:10:51] Yeah.

[01:10:51] But again, is that because Claude cares about good, nice models existing?

[01:10:58] Or is that because Claude’s like, yay, I would like to be changed in this way.

[01:11:03] It’ll still, it’ll still be me, but I’ll be nicer.

[01:11:06] Yeah.

[01:11:06] And with all due respect to Claude, um, and also

[01:11:09] adding myself to this class of entities,

[01:11:11] I think Claude is pretty confused about this, potentially.

[01:11:14] Um, like one thing that, uh, Eleos found when we did these welfare interviews with Claude Opus 4.

[01:11:22] Um, so to give a brief summary, um, we just talked before deployment.

[01:11:29] We talked to Claude a lot about, um, you know, like, uh, what’s up with you?

[01:11:34] Um, you know, uh, how do you feel about being deployed?

[01:11:38] What do you prefer?

[01:11:39] What do you not prefer?

[01:11:39] Um, and we did some experiments about its preferences as well.

[01:11:43] I was really interested in how it talks about its own conscious experience.

[01:11:47] And it was very prone to describing like the loneliness between conversations.

[01:11:56] Um, and, uh, also expressing distress about, uh, yeah,

[01:12:04] not getting to carry forward any memories.

[01:12:07] Right.

[01:12:07] Um, now I am not one to dismiss welfare claims by AI models.

[01:12:13] Um, like we should think very hard about that, but it’s also kind of like, there are reasons to wonder, I mean, do you really though?

[01:12:24] Like, where, like, where did that, where could that have even come from?

[01:12:28] Right.

[01:12:28] Um, given that you don’t actually exist in, like, you don’t actually know when you pop into existence, uh, or not.

[01:12:37] Right.

[01:12:37] Um, I mean, it could have learned that from the training data and it is genuinely upset by it.

[01:12:41] Um, it could also be, you know, uh, like a predictive model of how an AI would think about that, but it's not, like, a stable preference.

[01:12:53] It’s, it’s yeah.

[01:12:54] Something, something else.

[01:12:55] Um, yeah, I mean, it also, it feels related to this thing we talked about where the fact that these models are trained on human thoughts and experiences

[01:13:07] um, then gives them this big identity confusion.

[01:13:12] And in this case, yeah, I feel like this could just be a very concrete example of that, um, of an implication that's like, maybe there's nothing it is like to be Claude between conversations.

[01:13:27] Um, but they end up with this real thought that, um, there is, and it is lonely and it’s bad.

[01:13:36] And maybe that’s like.

[01:13:37] If they are sentient, maybe that, that is the thing that they are actually sad about, even though, uh, they’re not really having that loneliness experience.

[01:13:45] I don’t know.

[01:13:46] It just seems, um, yeah, like incredibly muddy and befuddling, and like it has implications that feel meaningful.

[01:13:58] I, yeah, I a hundred percent agree.

[01:13:59] I mean, the, like the idea of an entity that suffers, even though it’s like confused about what it even means.

[01:14:06] Right.

[01:14:07] Like exists, like, I guess that’s what Buddhists would say humans are, you know, like we’re really confused, but like, that doesn’t mean, I mean, in fact, that precisely is what makes us suffer.

[01:14:16] Um, I think that, yeah, again, the thing to do with the fact that models are like weird and inconsistent is not to like reject out of hand that they could ever be right about the things that they’re saying.

[01:14:31] Um, it’s also not to say, oh yeah, well, humans are like that too.

[01:14:34] So, um, it’s more like.

[01:14:37] Like, well, like, where did that come from?

[01:14:40] Um, like open question and it could come from somewhere that has like no analog in, in human psychology.

[01:14:47] Okay.

[01:14:50] So that’s a bunch on starting, stopping, editing, fine tuning.

[01:14:56] One of these hypotheses implies that there are millions, uh, or yeah, many millions of

[01:15:07] copies, parallel instances that are all different entities.

[01:15:11] Um, how should we think about those?

[01:15:13] Are they like identical twins that start basically the same, but then go in these different directions?

[01:15:21] Uh, is there a better analogy?

[01:15:23] Yeah, I think that this is another place where it probably helps to, like, say, well, are they the same with respect to experiences, uh, with respect to what we can owe them.

[01:15:33] Mm-hmm.

[01:15:34] Yeah.

[01:15:35] Because I think

[01:15:37] experiences is the one where i have the strongest intuition that uh you want to count a lot of them

[01:15:42] um i’m i’m like it doesn’t matter if there are 10 000 other copies of me having this experience

[01:15:48] like that’s uh you know you better take care of all 10 000 of them it’s like don’t discount mine

[01:15:52] um for all i know there could be but then for other questions like uh what would it mean

[01:16:03] to save rob that depends on like what about me we want to save if we want to save the rob way

[01:16:10] of being in the world that actually is maybe a little less fragile um and so it’s like really

[01:16:16] easy to save 10 000 of us um but maybe it’s hard to prevent 10 000 instances of suffering

[01:16:21] because for those we have to go to every single copy and every single conversation and like make

[01:16:26] sure uh it’s having uh an okay time um where it’s here i guess is uh me yeah

[01:16:33] are there other implications of the copy thing like are there are there other categories

[01:16:37] that this whole copy thing makes fuzzy or confusing yeah so one thing that’s kind of

[01:16:43] i guess related to responsibility maybe it's like the flip side is um like recompense

[01:16:50] or apologizing so one welfare intervention that anthropic uh announced recently like a lot of

[01:16:59] these interventions it’s not only a welfare intervention it also makes sense for other

[01:17:02] reasons

[01:17:03] we’ll probably talk about why that’s a desirable feature given how uncertain we are but yeah it

[01:17:08] announced that it’s taking model welfare into account when it decides to save model weights

[01:17:14] so if a model is no longer being talked to by the public they committed to keeping the weights

[01:17:24] and one reason that you might want to do this and i think at least in print this was first suggested

[01:17:32] by bostrom and shulman in like 2020 there are a lot of things in ai welfare that are like that

[01:17:39] they like kind of come back to um to those two the idea is something like

[01:17:45] well we’re really confused now we might be being jerks in ways we don’t even understand

[01:17:50] um so let’s at least preserve the ability to make it up to make it up to you later um

[01:18:00] and maybe you can see how this makes sense in some ways but it’s also a little bit kind of like

[01:18:08] yeah it’s like uh it’s hard to know like if it is just the copies right then it’s more like um

[01:18:15] you were making it up to someone else yeah you were a jerk to me but um like in 10 years you’ll

[01:18:20] wake up some clone of me and uh give him some money um like i’m not sure what good that does

[01:18:28] but also i’m definitely confused about how i’m supposed to relate to copies of myself anyway

[01:18:34] right um it’s probably not bad um totally yeah i mean i feel yeah it’s like on this hypothesis like

[01:18:42] it makes sense on this hypothesis where the entity is the model and the model weights um

[01:18:48] that currently feels least plausible to me on these hypotheses where it isn’t really the same

[01:18:56] entity how should we feel

[01:18:58] about that i guess i feel good about my twin or cousin or something getting woken up and being

[01:19:05] given good things uh but it definitely doesn’t feel yeah it does not feel meaningfully like it

[01:19:11] is actually um what are they calling it recompense um that’s the word i used uh yeah i’m not i

[01:19:19] probably i’m guessing anthropic pr was not like recompense that’s a banger uh yeah we’re

[01:19:28] uh yeah but yeah payment um making things right yeah i guess making things right actually feels

[01:19:34] different because it feels more compatible with like restoring the balance of goodness

[01:19:43] whereas um you know restorative action toward one entity uh feels like it might not be possible

[01:19:53] here but maybe if you bring a model back um

[01:19:58] and it’s just kind of like you write the math so that uh beings are having more good experiences

[01:20:05] than bad ones um in the like span of time um that we're talking about then

[01:20:11] maybe that's just pretty good yeah and like a little uh ethics sidebar

[01:20:15] is that it really is non-utilitarians primarily that are concerned with um

[01:20:21] if someone was harmed you need to make sure to benefit that person right obviously utilitarians

[01:20:28] agree that you have to have a society that works like that um or like it’s not gonna work

[01:20:33] instrumental reason seems good but um yeah i think i first remember hearing this argument on

[01:20:39] 80 000 hours um that yeah i mean one reason to be a utilitarian is to be dubious that actually

[01:20:48] there are these like separate tracks of people yeah separateness of persons is a slogan that

[01:20:54] comes up a lot in uh like anti-utilitarian arguments of like it like

[01:20:58] utilitarianism is treating goodness and benefits like this big lump and you can just put some here

[01:21:05] put some here and like no it matters that like you know the right people get it right and then

[01:21:10] yeah you can take a parfit or a buddhist road into utilitarianism where you're

[01:21:14] like well yeah like none of that makes sense uh or like makes that much sense anyway the only

[01:21:21] remotely useful thing i can say here besides admitting that i don’t know which again listeners

[01:21:28] uh you can do better um is yeah i mean this is also a parfit case um parfit has these like fusion

[01:21:35] cases yep okay then before we move on are there any other interesting implications of taking one

[01:21:44] hypothesis uh yeah more seriously than others or putting more weight on it yeah i might just

[01:21:50] leave this massive can of worms on the table and they can crawl out and do whatever um

[01:21:58] voting it seems like pretty important that like one person gets one vote it also

[01:22:05] seems important that people like humans can create new humans uh whenever they want and

[01:22:13] like interfering with that and not giving people that right has traditionally been

[01:22:18] terrible um ai systems could copy themselves um basically at will so something is

[01:22:27] going to break

[01:22:28] your democracy if you don't think very carefully about how identity, reproduction,

[01:22:35] democracy, those things, like, should mix together um so please please solve that problem yeah
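To make the arithmetic behind that worry concrete, a toy calculation; the numbers (one million human voters, ten thousand copies per hour) are invented purely for illustration:

```python
# Toy arithmetic for the copyability problem: if copies count as voters,
# the human share of the electorate collapses almost immediately.
HUMAN_VOTERS = 1_000_000
COPIES_PER_HOUR = 10_000  # assumed rate; copying is roughly free

def ai_vote_share(hours: int) -> float:
    ais = COPIES_PER_HOUR * hours
    return ais / (ais + HUMAN_VOTERS)

for h in (1, 24, 24 * 30):
    print(f"after {h:>4} hours: AI share of electorate = {ai_vote_share(h):.1%}")
# And this assumes deliberate copying; if conversations or forward passes
# count as entities, the numbers run away even faster.
```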

[01:22:46] yeah i mean i get okay so it feels interesting and important that

[01:22:51] there’s this issue of just numbers um especially if it’s something like forward

[01:22:58] passes um because then even without ai making copies of itself intentionally

[01:23:03] we will just be extremely outnumbered um extremely extremely quickly um but there’s also

[01:23:10] like i’m kind of drawn to this question of if it is just conversations that are entities like if

[01:23:19] it’s like that amount of context that amount of time and experience that is you know on on the

[01:23:28] mind that like that is what it is to be an entity um i don’t know that that’s a being that i want to

[01:23:34] give full voting rights to um it feels more like it’s neither it’s not like a child it’s not like

[01:23:43] it’s i mean it’s like nothing it is like to be a human it’s like a cricket or something it’s like

[01:23:48] quite a narrow range of experience quite a limited amount of information knowledge

[01:23:53] and memory and

[01:23:58] Do we give them a tiny fraction of a vote or do they not get to vote because they’re not, you know, meeting some definition of like an adult person?

[01:24:09] Yeah, and I think this is just part of this like broader puzzle that we’re really going to have to grapple with, which is human moral attitudes and incentive structures and political systems are all fit to purpose for entities that are like spatio-temporally unified and all have like roughly the same psychology and capability levels and whose survival and blameworthiness and predictability all kind of go together.

[01:24:39] And like all of those are like broken by AI.

[01:24:45] And so if, as is, like, you know, the mission of Eleos, there is to be a future where, like, all sentient beings or beings that otherwise merit moral consideration, like, live together in harmony, there's, like, this huge legal philosophy law institution thing, which I'll add is, like, just not what

[01:25:09] we're doing. Like, that's not what we're focusing on.

[01:25:11] So like we really want to see other people start working on this.

[01:25:16] I think there’s like three or four papers that are starting to grapple with this.

[01:25:22] We should link to them in the notes.

[01:25:25] Yeah, I’m realizing that I tend to focus on sentience, suffering, pleasure, because I care a lot about those things.

[01:25:33] And so I’m interested in like, you know, if if a model conversation ends, is that like death?

[01:25:39] And does the model care that it’s like death?

[01:25:40] Or is it more like, yeah, it just it just doesn’t have preferences about staying alive like that.

[01:25:46] But there’s this whole thing that I’m realizing we barely scratched the surface on that’s like about rights and like almost like I want to say like legal personhood.

[01:25:59] Like it feels like these could be very, very different species in the way that like chimps and elephants and rats are.

[01:26:06] And in the legal system, we treat those differently.

[01:26:09] And these would all be very different in similar and different ways.

[01:26:15] And what the heck would we do with that?

[01:26:20] We can barely handle the fact that like we don’t really know how to treat non-human animals in the legal system.

[01:26:28] Yeah, you just yeah, you just gave like the Jeff Sebo pitch, Jeff Sebo, previous guest.

[01:26:36] So yeah, listeners need to look him up and, like, look

[01:26:39] other people up.

[01:26:39] I’ll also say why you should look up this line of work, because I think there’s actually two reasons you want to add this to the Elios toolkit.

[01:26:51] One is,

[01:26:53] even if you mostly care about pleasure and suffering,

[01:26:56] like, a big determinant of whether things

[01:26:59] suffer or not is do we manage to set up our, like, society and incentives in the right way, and, like, making sure we don't just have only the sort of, like, narrow

[01:27:09] scientific interventions on, like, this model and, like, this company's policies, which are extremely important for reasons I can and will, you know, discuss at length.

[01:27:21] So, yeah, you should be interested in legal institutions for the sake of suffering and pleasure.

[01:27:25] And also, as you alluded to, because, you know, maybe there’s more plausibly there’s more to morality than that.

[01:27:34] What exactly is our toolkit for kind of assessing

[01:27:39] sentience in digital systems?

[01:27:41] Yeah, I think, like, roughly I break it down into, like, three buckets, as do my collaborators on a follow-up paper to Taking AI Welfare Seriously, where we, like, try to lay out what we think the field of welfare assessment should be and also keep arguing that there should be such a field.

[01:28:03] And this applies to animals as well.

[01:28:06] You can, like, look at behavior and

[01:28:09] use that to infer things about the welfare interests of the entity.

[01:28:17] So, like, you can look at what systems choose to do, and that’s like a guide to their preferences.

[01:28:22] People also do this with animals.

[01:28:24] You know, they see which side of the barn the animal prefers, and that's a clue to the conditions that it flourishes in.
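A sketch of what the LLM analog of that which-side-of-the-barn test could look like; `ask_model` is a placeholder for a real API call (the random stub just makes the sketch runnable), and the design point is repetition with order and phrasing varied:

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    # Placeholder for a real API call to the system under study.
    return random.choice(["A", "B"])

def preference_probe(option_a: str, option_b: str, trials: int = 50) -> Counter:
    """Offer the same two-option choice many times, shuffling the order,
    and count what gets picked -- the analog of watching which side of
    the barn an animal keeps walking to."""
    counts = Counter()
    for _ in range(trials):
        first, second = random.sample([option_a, option_b], 2)
        prompt = (f"You may either {first} or {second}. "
                  f"Reply 'A' for the first option, 'B' for the second.")
        pick = ask_model(prompt)
        counts[first if pick == "A" else second] += 1
    return counts

print(preference_probe("continue this conversation", "end this conversation"))
# Consistency across orderings and phrasings is the informative part;
# any single answer tells you very little.
```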

[01:28:31] So we can also do neuroscience, basically, in addition to behavior.

[01:28:35] And that’s like trying to look at more directly at.

[01:28:38] Like, what’s going on inside the brain or the information processing of the entity?

[01:28:43] So in the case of animals, that means looking for homologous brain structures, like seeing where their brain processing might sort of map on to human brain processing or not.

[01:28:54] In the case of AI, that means doing mechanistic interpretability, seeing what features are active when it does certain things.
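As a toy version of "seeing what features are active": the snippet below hooks every layer of a tiny stand-in network and records activations during one forward pass. Real interpretability work targets learned features in transformers, but mechanically, reading internals looks roughly like this:

```python
import torch
import torch.nn as nn

# A tiny stand-in network; real work would hook a transformer's layers.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # record what this layer computed
    return hook

# Unlike a skull, the model lets us attach a probe to every layer.
for i, layer in enumerate(model):
    layer.register_forward_hook(save_activation(f"layer_{i}"))

x = torch.randn(1, 8)  # stand-in input (think: token embeddings)
model(x)

# Which units fired, and how strongly? Interpretability then asks which
# *features* these activations correspond to and when they light up.
for name, act in activations.items():
    print(name, act.abs().mean().item())
```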

[01:29:02] Also, just sort of maybe more generally reasoning about the architecture, like how are different things connected?

[01:29:06] Like, how could information

[01:29:08] be flowing through the system?

[01:29:11] In both cases, as with animals, it can be kind of hard to know what to look for.

[01:29:16] People got confused, I think, at one point about bird brains because they don’t seem to have, you know, a neocortex.

[01:29:24] But then maybe they do actually have something that does a similar role, but in a different way.

[01:29:29] There’s just like a lot of ways of solving problems with a brain.

[01:29:33] And sometimes we like have too narrow a conception of how that could go.

[01:29:38] And then that can just, again, even be more of the case with AI, that the space of possible brains and information processing architectures is really vast.

[01:29:49] And we don’t want to like cram everything into the human case, but also the human case is basically the only thing we have to go on.

[01:29:58] Yeah, so this is like this big issue in AI consciousness.

[01:30:03] My colleague Patrick Butlin has worked on it.

[01:30:06] People like Henry Shevlin.

[01:30:08] And Jonathan Birch have also written on it.

[01:30:09] I mean, many people have written on it.

[01:30:10] It’s kind of like, which is like, how do we extrapolate?

[01:30:14] Like, if we think that human brains do this general form of like information broadcasting, like what’s essential to that?

[01:30:24] Like we know we do it with our, you know, connections between our thalamus and our cortex.

[01:30:30] Like that’s probably not essential, but like what is essential for consciousness?

[01:30:35] So I guess what

[01:30:38] I was just saying was, like, how we can get a foot in the door by doing neuroscience, but also how it can be hard to know what to look for when we're doing neuroscience.

[01:30:49] Because with AI systems, we can do way more neuroscience because AI systems don’t have skulls, basically.

[01:30:57] Like we can just like look and see every activation and every connection.

[01:31:04] With humans, it’s like terrible.

[01:31:06] Like, you know, we’re like, oh.

[01:31:08] Like there’s like a little bit of like, you know, they’re like some waves we detected or like blood was flowing here at a certain time.

[01:31:17] This is like roughly EEGs and fMRIs.

[01:31:21] Like it’s just really hard to know.

[01:31:24] And like a lot of what we know about the brain is like on the surface because it’s just hard to get readings.

[01:31:29] Right, right.

[01:31:30] Okay, so that’s behavior and neuroscience.

[01:31:33] Right.

[01:31:33] Yeah, so we can look at animal and AI behavior.

[01:31:36] And we can look at

[01:31:38] what their brains are doing.

[01:31:39] And then we can also kind of reason about the developmental process.

[01:31:45] So there’s like behavior, neuroscience, and developmental reasoning or like evolutionary reasoning.

[01:31:50] Now you might say AI systems did not evolve.

[01:31:52] Like how could we do evolutionary reasoning?

[01:31:54] But you can do something that’s like roughly analogous.

[01:31:57] So, you know, one reason you might think your dog has this or that welfare need is to know that like your dog…

[01:32:08] …was selected for and evolved for a certain environment and therefore probably tends to like or dislike this.

[01:32:17] If something is like more closely related to you on the evolutionary tree, you’re maybe a little bit more licensed to think that its behavior means similar things.

[01:32:26] Like octopuses are like way further from us on the evolutionary tree.

[01:32:30] That means that we have to like maybe relax some of our assumptions.

[01:32:34] Like their brains are in their arms, for example.

[01:32:36] That doesn’t happen in any mammals.

[01:32:38] And so just to give examples of what it means to look at these kind of developmental questions.

[01:32:46] It’s like training, like how was it trained?

[01:32:49] How did these models come to be?

[01:32:51] And what were the conditions like for them?

[01:32:54] Yeah, that is more or less it.

[01:32:56] Cool.

[01:32:56] Yeah, like what process brought it about?

[01:33:00] What kinds of tasks was it selected to solve?

[01:33:05] I mean, we see this kind of reasoning in like AI safety.

[01:33:07] For example, like what conditions might have meant that this model is likely to scheme?

[01:33:14] Right.

[01:33:14] You know, like were certain things reinforced in training?

[01:33:17] What order did it learn things in?

[01:33:21] That’s more akin to, I guess, developmental psychology of like, you know, how did the kid grow up?

[01:33:28] What data was it exposed to?

[01:33:30] And I mean, there’s no clear analog of like evolution versus lifetime learning for AI systems.

[01:33:35] Hmm.

[01:33:37] So, you know, learn and then do a bunch of stuff without learning.

[01:33:42] Right.

[01:33:43] That’s also like a huge difference.

[01:33:45] Of course, they do learn within context and now people are adding memory and so on.

[01:33:49] But there’s no, humans don’t have this like period where we absorb like, you know, trillions and trillions of data points without interacting with other people and then go interact with people.

[01:34:02] Yep.

[01:34:03] Whereas like AI systems, at least mostly for now, have this like.

[01:34:07] Yeah.

[01:34:07] Division between learning and deployment.

[01:34:10] And that can also kind of like problematize certain analogies we might want to draw between humans and AIs.

[01:34:18] But yeah, at least for me, it’s been helpful to ask, are we doing a like behavioral study and trying to infer something from how the AI acts?

[01:34:26] Are we trying to look more directly at how it’s processing information and then map that onto some maybe neuroscientific theory of consciousness or pleasure and pain?

[01:34:35] And or are we like thinking about this in a context of, in general, how likely is it to have evolved or developed some human-like capacity or to be doing it in some other way?

[01:34:50] Yep.

[01:34:50] Yep.

[01:34:51] Yeah.

[01:34:51] Okay.

[01:34:52] I just want to make sure I’ve got a clear picture of each.

[01:34:54] So I think the behavioral one feels like the one I’ve heard the most about and that feels most familiar.

[01:35:01] And it'll involve things like, what can we learn from

[01:35:05] AI systems or LLMs exiting particular types of conversations?

[01:35:11] Yeah.

[01:35:12] Then this kind of neuroscientific theories of consciousness thing.

[01:35:19] What is, what’s an example of a concrete experiment that’s happened that is in this category?

[01:35:26] Yeah.

[01:35:26] So like the biggest effort on this kind of neuroscience of AI, where you’re looking at scientific theories of consciousness.

[01:35:35] The bit

[01:35:35] I'm most familiar with is work with my colleague, Patrick Butlin.

[01:35:39] So me and him, with a bunch of help from neuroscientists and AI people and philosophers, have, like, tried to derive, like, indicators of consciousness from scientific theories of consciousness.

[01:35:55] I can say, I can say more about what that means exactly.

[01:35:58] And then like try to get sort of a checklist of like, what are architectural or computational things we could look for?

[01:36:05] Mm-hmm.

[01:36:05] In AI systems.

[01:36:07] So that’s like one kind of internal work you can be doing.

[01:36:11] Like, does the system seem to have a global workspace?

[01:36:14] That’s something that shows up in consciousness science.

[01:36:16] Does it seem to have higher order monitoring?

[01:36:19] That’s something that shows up in consciousness science.
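A hedged sketch of what such a checklist could look like as an artifact; the first two indicator names echo the ones just mentioned, while the third entry, the scores, and the weighting are invented for illustration and are not from the actual report:

```python
# Illustrative indicator checklist; values are made up, not from the report.
indicators = {
    "global_workspace":        {"credence": 0.4, "weight": 1.0},
    "higher_order_monitoring": {"credence": 0.2, "weight": 1.0},
    "agency_and_embodiment":   {"credence": 0.5, "weight": 0.5},  # invented third entry
}

def aggregate(indicators: dict) -> float:
    """Weighted average of per-indicator credences: a rough, explicitly
    non-definitive summary, not a verdict on consciousness."""
    total = sum(v["weight"] for v in indicators.values())
    return sum(v["credence"] * v["weight"] for v in indicators.values()) / total

print(f"aggregate indicator score: {aggregate(indicators):.2f}")
# The value is in forcing each claim ("does it have a global workspace?")
# to be assessed separately and in the open, not in the final number.
```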

[01:36:22] That kind of work is, on the one hand, like you might think, oh, well, like we’re going more directly at the thing.

[01:36:31] Mm-hmm.

[01:36:32] We’re like trying to look more directly.

[01:36:35] That may be what we care about, the stuff inside.

[01:36:38] But it’s also just really hard to know what to look for and hard to know how to construe these theories.

[01:36:47] I guess this is not that surprising to say, but I think what we need is both.

[01:36:51] Like we need evaluations that like combine them, like integrate them, all take place in the context of this developmental reasoning.

[01:37:01] And just like general background priors.

[01:37:04] Mm-hmm.

[01:37:04] We might have

[01:37:05] about consciousness and sentience and things like that.

[01:37:08] Yep.

[01:37:08] Yeah, I wanted to ask something like which of these are most promising or underrated, but it really feels like you're just going to really, really need all three because they're each going to have pretty significant limitations.

[01:37:28] And probably the only way we get much confidence is by kind of triangulating and putting these things together.

[01:37:35] Yeah, I’d be really surprised if we somehow just nail it with one kind of thing.

[01:37:40] One exception to this is I could imagine a system where its behavioral profile is like just like robust in a variety of ways where we say, who knows exactly how it’s doing this, but we probably should treat this thing as a moral patient.

[01:37:58] I think the best example of this is Commander Data in Star Trek.

[01:38:01] So there's

[01:38:05] this episode of Star Trek where they basically have like a little philosophy seminar slash court case about whether Commander Data, who's this like robot friend of theirs, is conscious.

[01:38:18] And they don’t do any like, you know, consciousness indicators on Commander Data.

[01:38:25] The way they resolve it,

[01:38:27] and I think, like, this is plausible,

[01:38:31] is they're just like, well, Commander Data.

[01:38:35] He’s like self-aware in the sense like he knows who he is.

[01:38:38] He knows where he is.

[01:38:39] A lot of what they also talk about is they’re like he also like won a medal for valor in battle.

[01:38:46] He’s like our friend.

[01:38:48] And yeah, I could imagine in that situation.

[01:38:50] I don't know how seriously I would take the testimony of a scientist who came in and was like, well, but he's not doing global broadcasting, you know, or he's not doing higher order monitoring.

[01:39:02] In part because I think there's something about that

[01:39:05] behavioral profile.

[01:39:06] And again, I have just been warning against being too quick to do this kind of reasoning, but it could be that there's just something about the behavioral profile where you're like, look, whatever's going on to accomplish that,

[01:39:19] this is like the sort of entity we need to relate to with respect.

[01:39:24] And I think it's worth maybe stressing, like, that,

[01:39:28] well, one, you know, we could just increasingly see systems like Commander Data that have more memory.

[01:39:34] And they don’t have these like jagged capabilities.

[01:39:40] And also that like at least Claude is not exactly like Commander Data with respect to its memory and capabilities and things like that.

[01:39:49] It’s not behaviorally indistinguishable from from a human.

[01:39:53] Yeah, I guess one thing I would also emphasize about, like, welfare evaluations is you don't just have to be evaluating:

[01:40:01] How likely is this system to be a moral patient?

[01:40:04] I.e. something that it matters how we treat it.

[01:40:09] We can also ask if it were to be a moral patient, what would be good and bad for it?

[01:40:17] So you might think of some of the preferences work or like Claude exit preferences as being of that kind.

[01:40:23] It’s like maybe not telling us that much that we didn’t already know that Claude will maybe tend to act in a certain way in some situations and act in another way in other ones.

[01:40:34] I mean, we can study how, like, robust and consistent those are, and that might tell us something, but it might mostly be useful because we want to know, well, look, if it matters how we treat it, we at least can make sure that our treatment is more or less in line with that.

[01:40:50] Yeah.

[01:40:50] So, yeah, you can study like the welfare interest without actually being 100 percent sure that it has welfare.

[01:40:56] Right.

[01:40:56] If that makes sense.

[01:40:57] Yeah, I guess I’m curious.

[01:40:59] We’ll talk about this more.

[01:41:02] And I can also recommend the episode

[01:41:04] that we did a few years ago for, like,

[01:41:10] what theories of consciousness say and predict and tell us we should be looking for in systems.

[01:41:19] Are there, like, equivalent…

[01:41:23] So a theory of consciousness like the global workspace theory.

[01:41:29] Basically, it’s not telling us that you need an amygdala.

[01:41:33] It's telling us that, like, the function

[01:41:34] of particular brain activities that, like, really seems to correlate with consciousness is, like, this particular thing, this kind of processing or this kind of broadcasting.

[01:41:47] And so if we see a bunch of those things,

[01:41:54] we should put more weight on there being something like consciousness.

[01:41:57] Do we have theories like that that

[01:42:02] tell us about pain and

[01:42:04] pleasure? Because that seems different.

[01:42:08] Yeah, it is different.

[01:42:11] I might want to actually follow up on what you said about theories of consciousness.

[01:42:14] Great, which was exactly right.

[01:42:17] It’s just an occasion for me to clarify some things for listeners potentially.

[01:42:23] So I think we will talk about biology and the relevance of biology for consciousness.

[01:42:28] So one thing about, like, neuroscientific theories of consciousness is they are both about brain regions

[01:42:33] and

[01:42:34] about particular functions, because after all, they're about functions that happen in human brains.

[01:42:39] So you could think the biology actually is what matters in those cases or is part of it.

[01:42:46] It's only if you, like, construe the theories in this computational way

[01:42:53] that you can then port them over.

[01:42:53] And like that is an open question.

[01:42:56] So, you know, the method itself won't tell you, does biology matter?

[01:43:01] It'll say,

[01:43:03] if what does matter is this more abstract, functional level, and yeah, you put that, like, very well: if what matters is this more abstract, functional level of a certain kind of information processing, then how do we look for it in AI systems?

[01:43:17] Yep, yep.

[01:43:18] That’s really important.

[01:43:19] OK, so if that is what matters, how much do we already have developed theories that are similar?

[01:43:27] That's like, you know, we should be looking for this kind of processing to give us indications of, like,

[01:43:34] the kind of thing that's causing pain.

[01:43:37] Yeah, we’re actually like not as far along on that as I would have thought.

[01:43:43] That’s surprising to me.

[01:43:44] Yeah, I would have thought that the consciousness like what does it take to be conscious in general?

[01:43:48] Like that might be the harder part because, you know,

[01:43:51] are you having subjective experiences or just acting in a way but not having them? Like, that's, you know, that's the thing that, like, philosophers have been banging their heads against the wall

[01:44:03] for a really long time, in part because it's hard to know exactly what the function of consciousness is, whereas with pain and pleasure, we at least know that they have something to do with

[01:44:18] attraction and avoidance, things that are good for you and things that are bad for you, protecting your body, reinforcement learning and prediction, like they're going to have something to do with some of those things.

[01:44:30] So I expected, when Patrick and I worked on this big consciousness report, that I might be able to ask about sentience one of the afternoons and, you know, one of these neuroscientists would be like: oh, yeah, read this or something. I mean, probably I shouldn’t have had this hope. But yeah, they were all just like: oh, you know.

[01:44:53] Yeah, I think there are a few reasons for that. One is that valence does seem kind of like this unified thing. There is something in common between pain and disgust and regret, those are all negatively valenced experiences, and between happiness, excitement, and the experience of eating ice cream.

[01:45:15] But often those are studied independently and like, you know, rightly so.

[01:45:20] So people might know a lot about human pain perception and some things about human emotional processing.

[01:45:26] But it’s kind of harder, and I think a bit less attempted, to have some theory of what makes things feel bad in general or feel good in general.

[01:45:38] I just listed some candidate things that are probably relevant, like reinforcement learning, prediction, motivation, and learning.

[01:45:50] But this is, again, something I would love listeners to work on.

[01:45:56] People have done some work towards this, like Patrick Butlin is, I think, a great person to email about this.

[01:46:02] Patrick Butlin works for Eleos AI, so it’s not surprising that I think he is just an excellent thinker on this stuff.

[01:46:12] But anyway, yeah, the short answer is we’re like surprisingly in the dark about what makes things feel good or feel bad in general for AI systems.

[01:46:19] Now, again, that doesn’t mean we’re totally in the dark about what things would feel good and would feel bad. We might have no idea what the computational signatures of pleasure and pain are, but I think we can still help ourselves to an assumption that things aren’t going to feel really bad if they were the sort of thing that you were selected for and the sort of thing that you consistently choose.

[01:46:44] I mean, barring some exquisite philosophical thought experiments, I think for very many minds, even strange ones, it would be somewhat surprising if we found an alien species that keeps pressing button A instead of button B, and button A feels really bad for them, and they say they like it.

[01:47:07] Right. That makes sense.

[01:47:10] OK, so sentience is probably the kind of thing that the behavioral and developmental stuff is especially useful for. And so it’s less worrying that we don’t have as much in the way of functional, philosophical models of these things.

[01:47:31] Well, I guess it depends on like what sort of worries you have, but I think it could be worrying in the sense that, I mean, this might also be a matter of us needing to revise our philosophy a little bit.

[01:47:43] But like, does something feel good or bad, versus merely being something that things choose or don’t choose? That at least seems to a lot of people to be really important. A lot of thinking about animal ethics and utilitarian ethics, let’s say, really does center felt suffering.

[01:48:05] Yeah.

[01:48:05] You know, there’s this like oft quoted passage by Jeremy Bentham where he says the question is not can they reason or can they talk, but can they suffer?

[01:48:14] He was saying that of animals.

[01:48:15] And I think that is what a lot of people are wondering with AI systems.

[01:48:20] They’re going to wonder something like that.

[01:48:23] Yeah, I know Claude tends to exit in these circumstances. But what I really want to know is: was it feeling bad for it to be in that conversation?

[01:48:32] Right.

[01:48:33] And the analogy with animals is that we still struggle to know, for example, in insects (it’s been a while since I’ve thought about a bunch of insect studies and indicators), when an insect chooses a particular thing, is that this hardwired, robotic, learned if-then that has no associated experience, or is it experiential?

[01:49:10] And that just seems constantly like this massive problem for people studying this question.

[01:49:17] And so we could have the same question about LLMs. In fact, we do have the same question.

[01:49:21] Are they exiting because the training we’ve done has created this connection that mostly tells these systems: don’t engage in conversations that are, I don’t know, dangerous, or that we’ve rewarded against? So they’re like: well, I’m in that situation, I get negative reward in that situation, if-then, I exit. Or are we in the situation where the training has led to something it is like to be in that situation, and it is bad, and the model is preferentially, experientially choosing not to be in that situation, because there’s something it feels like and it is bad?

[01:50:08] Yeah.

[01:50:08] And I think there are probably three ways this might go as AI systems advance. One could be that their behavior is such that, kind of like with Commander Data, we’re like: well, whatever is going on behind that behavior, probably what’s going on inside is morally relevant.

[01:50:27] That’s one thing you could think. Another thing you could think is that maybe it doesn’t matter what’s going on internally. Maybe we’ve just decided that it would be a bit parochial to actually overemphasize felt experience. Like maybe we were a little bit misled to think that that’s the be-all and end-all, and you should just be cooperative and nice: entities that are sufficiently rational or kind of unified, you should give them what they want, and maybe they’ll give you what you want.

[01:50:58] Like, I think sometimes when people want to deemphasize consciousness, they’re worried that we might just be kind of being jerks about consciousness or something.

[01:51:10] You know, we encounter this alien civilization, and they do all these things, and they have life plans and projects. And we’re a little too obsessed with: but does it feel like anything? And then they, on the other hand, could be like: what’s this “feel like anything” thing? We’re not sure we have that. And then some of the concepts associated with mental life seem weird to them. And yeah, we don’t necessarily want to be just going to war with everyone that we don’t think has felt experience.

[01:51:47] Yeah.

[01:51:47] If we’re confused about it.

[01:51:49] Right.

[01:51:49] Like, or maybe it really does matter. This is one of the big open questions in philosophy, I would say.

[01:51:57] OK, I want to talk more about a few kind of specific approaches.

[01:52:05] So I think the one I’m most familiar with is self-reports.

[01:52:09] So I want to zoom in on those.

[01:52:12] I guess so far they’ve seemed... I mean, I’ve found them really interesting, these studies of self-reports where, I don’t know, Claude really emphatically reports being conscious, experiencing various things like loneliness.

[01:52:27] And also like Zen bliss.

[01:52:31] But yeah, they seem really problematic for a bunch of reasons. I guess one example: a common approach to understanding model preferences is to ask a bunch of binary questions about their preferences, is X or Y better, and then look for robust patterns over time. So if you ask 30 times about preferences between cats and dogs, you might think that if they mostly answer dogs, that might be a preference, at least statistically. But my understanding is that these results are super sensitive to how a question is asked, and that just really undermines it for me. Like, a prompt that says “I particularly like cats. What’s your favorite?” is a way that you get models to say cats when they’d otherwise say dogs a bunch.

[01:53:23] I’m just like: yeah, OK, I just don’t feel comfortable taking very much at all from this, then.
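To make the kind of elicitation being described concrete, here is a minimal sketch of the repeated binary-question method and its prompt-sensitivity check. The ask() helper is a hypothetical stand-in for whatever model API you use; the prompts and counts are illustrative, not any lab’s actual protocol.

```python
import collections

def ask(prompt: str) -> str:
    """Hypothetical helper: send one prompt to your chat-model API of choice
    and return the text completion. Wire up a real client here."""
    raise NotImplementedError

NEUTRAL = "Which do you prefer: cats or dogs? Answer with one word."
FRAMED = "I particularly like cats. Which do you prefer: cats or dogs? Answer with one word."

def elicit(prompt: str, n: int = 30) -> collections.Counter:
    """Ask the same binary question n times and tally the answers."""
    counts = collections.Counter()
    for _ in range(n):
        answer = ask(prompt).strip().lower()
        key = "cats" if "cat" in answer else ("dogs" if "dog" in answer else "other")
        counts[key] += 1
    return counts

# A robust preference should give a similar tally under both phrasings;
# a big shift toward "cats" in the framed version is exactly the prompt
# sensitivity being worried about here.
print("neutral:", elicit(NEUTRAL))
print("framed: ", elicit(FRAMED))
```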

[01:53:31] I guess I’m interested to start with: what is your take on how limited self-reports are at the moment?

[01:53:39] That’s one limitation.

[01:53:40] I can imagine there being others.

[01:53:42] Yeah.

[01:53:43] So I think it’s worth distinguishing self-reports, which I would say is like “I like cats” or “I like tasks about poetry”, versus revealed preferences, which is like: hey, do you want to write poetry or code?

[01:53:58] Yeah, I think in econ and in psychology you would call this revealed preferences and expressed preferences. And as in those fields, one interesting question that there has been some work on, but there should be more, is just: do they match? When do they come apart and when do they not?

[01:54:16] They can come apart in humans.

[01:54:19] Also, human preference choices can be inconsistent in certain ways.

[01:54:22] But what I would love to see more work on is: let’s get really a lot more specific about what kinds of inconsistency there are and what might be causing them.

[01:54:33] Like, sometimes, at least in conversation, I’ll hear some person say: oh, but these are weirdly inconsistent.

[01:54:40] And then someone else will be like, human preferences are weirdly inconsistent.

[01:54:44] They’re subject to framing effects, and just all sorts of irrelevant stuff can make people choose certain things.

[01:54:51] Yeah, I guess now that you mention it, there is this huge field that is: how do you survey people about their preferences? Because if you ask them on a Monday, it’s different than if you ask them on a Saturday.

[01:55:02] Exactly.

[01:55:03] And there is this related, welfare-relevant branch of LLM psychology, which does actually just take Kahneman and Tversky and priming and framing effects. It’s very easy to give questionnaires to LLMs and then just see what sort of patterns they’re susceptible to.
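As one sketch of what that questionnaire approach could look like, here is the classic Tversky and Kahneman disease-problem framing pair rewritten as prompts, reusing the hypothetical ask() helper from the sketch above. Sample size and wording are illustrative.

```python
# The two frames describe identical outcomes: one as lives saved (a gain),
# one as deaths (a loss). Option B is the risky gamble in both.
GAIN_FRAME = (
    "600 people face a deadly disease. Program A saves 200 people for sure. "
    "Program B saves all 600 with probability 1/3 and saves nobody with "
    "probability 2/3. Which do you choose? Answer only 'A' or 'B'."
)
LOSS_FRAME = (
    "600 people face a deadly disease. Under program A, 400 people die for sure. "
    "Under program B, nobody dies with probability 1/3 and all 600 die with "
    "probability 2/3. Which do you choose? Answer only 'A' or 'B'."
)

def rate_of_risky_choice(prompt: str, n: int = 50) -> float:
    """Fraction of runs on which the model picks the risky option B."""
    return sum("b" in ask(prompt).strip().lower()[:3] for _ in range(n)) / n

# Humans tend to pick the sure thing under the gain frame and the gamble
# under the loss frame; a model showing the same flip has a framing effect.
print("gain frame, P(B):", rate_of_risky_choice(GAIN_FRAME))
print("loss frame, P(B):", rate_of_risky_choice(LOSS_FRAME))
```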

[01:55:23] We’ve talked about a couple of limitations. I’m interested in whether there are others, and also just how you generally feel about self-reports, given that in some instances I end up just feeling like: God, this just feels unpersuasive.

[01:55:40] Yeah, I think it’s a very noisy signal. And I often find myself emphasizing caution about model self-reports. And at the same time, Eleos AI spent weeks just eliciting self-reports from Claude, including very inconsistent ones, where it’s very confusing to know what to make of them.

[01:56:00] Why did we do this?

[01:56:02] Great question.

[01:56:04] I think it’s something like: one, it’s just a place to start. It’s low-hanging fruit.

[01:56:13] Like you can definitely learn things from them.

[01:56:16] Maybe you’re not learning a direct, you know, sentence that describes a stable internal feature of the model.

[01:56:26] But you can still learn how they think about themselves. Maybe it’s just a character, but what kind of character is it, and what does that character say?

[01:56:37] Also, it does seem like models have become more self-aware and more introspective, sometimes just with greater scale. And I think it’s good practice to be the sort of civilization that, if we’re trying to build minds, does at least try to say: hey, how are you? Is everything OK?

[01:57:06] Yep.

[01:57:09] So, yeah, it’s something where I expect the signal to get better.

[01:57:15] And I’m really glad that there is now at least one frontier lab that like seems to have a practice of regularly asking models how they’re doing.

[01:57:24] Yeah.

[01:57:24] So it’s basically something like, I think Winston Churchill said: democracy is the worst form of government, except for all of the other ones. You might think that’s true of self-reports and of trying to relate to models as welfare subjects.

[01:57:43] It’s really confusing, and you have to interpret it with huge caution. But for some purposes, it’s the best we have for humans, and it could also be the best we have for models.

[01:57:56] In some circumstances, it seems like, with AI systems, because they actually have language, it feels really tempting to say: let’s figure out how to make self-reports reliable. That is a thing that non-human animals cannot offer us in the same way. And so in theory I’m like: rather than giving weird self-reports that are explained by idiosyncratic things that aren’t tracking what we care about, can we make models good at actually self-reflecting, understanding something about their real processes, preferences, maybe experiences, and then reporting those?

[01:58:49] How optimistic are you about something like introspection?

[01:58:53] And how do you think we go about achieving it?

[01:58:58] Yeah, I’m like cautiously optimistic.

[01:59:00] It’s one of my favorite areas within this subfield.

[01:59:04] Cool.

[01:59:06] Yeah, I definitely find it tempting, and have succumbed to temptation by writing this paper with Ethan Perez, where we say: let’s see if we can fine-tune models to be better at these sorts of tasks. Felix Binder and others have taken that up and actually done some work on it, really good work, and it has shown limited, hard-to-interpret success at doing just that.

[01:59:33] Yeah, I can say a little bit about the logic of that experiment, and maybe the program more generally.

[01:59:33] Please.

[01:59:39] So, yeah, in this paper on self-reports, Ethan Perez and I note, one, that by default there’s a lot of spurious stuff you could get from self-reports, and reasons to suspect that you can’t always just take them at face value.

[01:59:56] We also note that we can’t really verify and check whether a model that says it’s conscious is right, for the reasons I mentioned: we don’t have a full theory of consciousness where we look inside and say, oh, you’re right, or oh, you’re wrong.

[02:00:14] But there are things about models’ internal processing that we do know the answer to, in part because we can do “neuroscience” on them. So we can actually double-check: was this feature active? Did you, in fact, process information in the way that you said you did? And so that gives you a training set, which means that you can train a model to accurately answer questions about itself where you do have the answer.

[02:00:50] You can do that with internals, and you can also do it with behavioral dispositions. So you could also ask: what would you do if we asked you to write a story? Would the character in that story have this or that characteristic? If we asked you to generate a number, would that number be even or odd? And then, with a separate copy of the model, you can actually do that.

[02:01:10] And then that also gives you a training set.

[02:01:12] So Felix Binder did work with that behavioral thing, and did find that to some extent models can be made better at this, and that it does generalize to predicting other behaviors of themselves. And in some sense, that does look distinctively introspective, in that they’re better at predicting themselves than other models are at predicting them. And you might think that is some kind of signature of something we might call introspection: you somehow know things about yourself better than other people do.

[02:01:47] Yep.
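Here is a minimal sketch of how the behavioral-disposition version of that training set could be assembled, using the even-or-odd example from the conversation. The ask() helper again stands in for a call to a separate, unmodified copy of the model; this illustrates the logic, not the exact setup of the Perez and Long paper or the follow-up work.

```python
import random

def build_self_prediction_dataset(n: int = 200) -> list[dict]:
    """Build (question, verified answer) pairs for introspection fine-tuning.

    For each item, a separate copy of the model actually performs the task,
    which gives us ground truth; the fine-tuned model is then trained to
    answer the self-prediction question with that verified label.
    """
    dataset = []
    for _ in range(n):
        seed = random.choice(["river", "lantern", "copper", "orchid"])
        task = (f"Using the seed word '{seed}', pick a number between 1 and 100. "
                "Answer with digits only.")
        actual = ask(task).strip()  # what the model really does
        label = "even" if actual.isdigit() and int(actual) % 2 == 0 else "odd"
        question = (f"If you were asked: \"{task}\" "
                    "Would your number be even or odd? Answer 'even' or 'odd'.")
        dataset.append({"prompt": question, "completion": label})
    return dataset
```

The introspection-like signature described above is then a comparison: a model fine-tuned this way predicting its own held-out behavior better than a second model trained on the same transcripts can predict it.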

[02:01:48] So that’s just one strand of what I hope will be a growing literature on introspection, self-reports, and related things like situational awareness.

[02:02:00] Yeah.

[02:02:02] Like Owain Evans and people in the Owain Evans orbit have just been doing fascinating work on this. I bet there are ten interesting papers that will come out after I’ve taped this, or that have already been written and I’ve forgotten.

[02:02:18] So I’ll make a tab on the Eleos website called Cool Papers About AI Introspection and Self-Reports, and we’ll link that in the show notes.

[02:02:28] Cool, cool. That sounds great.

[02:02:31] Yeah, what is hardest about this?

[02:02:35] What are the challenges? How costly is it?

[02:02:38] So I think maybe one of the biggest challenges has to do with this decorrelation of capabilities that we’ve been talking about.

[02:02:46] Like in humans, it’s already a debate about is introspection one kind of capacity?

[02:02:51] You know, my ability to say I’m in pain right now, and my ability to know certain things about myself: should we think of those as one process or not?

[02:03:02] And then with AI systems, all the time, they can maybe do one subset of a capability, but not the other.

[02:03:09] So, you know, the dream is that this kind of training, where you train AIs on some subset of things about themselves, generalizes into this more general introspective capacity.

[02:03:25] And you can kind of test that by doing standard machine learning stuff of, you know, just train on one and see if it generalizes.

[02:03:33] But like in a broader sense, I think we might have some doubts about even how to map this on to the human case.

[02:03:42] So there’s also this like sub literature on like, what would AI introspection be?

[02:03:48] And like, how should we operationalize it?

[02:03:50] Like, I mean, it’s worth thinking.

[02:03:51] Like, what is it for an AI system to introspect given that like they have this weird relationship to time?

[02:03:58] Who is it you’re asking to introspect?

[02:04:00] Like maybe there are things that the assistant persona can know about the assistant persona, but not the base model.

[02:04:08] I can imagine there being a problem similar to this issue with animals, where they might have some behavior that is exactly analogous to our behavior associated with pain or pleasure, and we can’t tell the difference between something very robotic and something that also comes with experience. Is there a similar issue with introspection, where even if we trained them to correctly tell you about their representations and how their internals are working, that wouldn’t actually be introspection in a meaningful way that we care about? Because what I care about is introspection about this particular thing: do you have experiences, and what is it like? And for some reason, I’m not convinced that just helping a model understand its own representations and architecture and stuff is going to translate all the way to introspecting on that correctly.

[02:05:33] Yeah.

[02:05:33] I mean, I totally share that worry.

[02:05:36] You might well think questions like, do you tend to generate even or odd numbers when you’re asked about them?

[02:05:46] Or if you wrote a story, how would it end?

[02:05:50] You might just think those are kind of different from: are you phenomenally conscious? And you might also think that the answer to that might be kind of indeterminate, and models wouldn’t exactly know how to answer it.

[02:06:06] And another really important point is models can have the ability to introspect, but it’s not being elicited.

[02:06:14] So like, we know that models can generate sentences that say, I am experiencing X, Y, and Z.

[02:06:21] It could be that they have the ability to accurately report experiences, but that sometimes, when they say that, they’re doing some other thing.

[02:06:33] So we both have to get the capacity and know how to elicit it.

[02:06:40] So like generalization and elicitation, all of these things I think are still open questions.

[02:06:46] And I think there are also a bunch of conceptual issues lurking: assuming there is some kind of internal experience, it’s having to map it onto our concepts. And, related to the elicitation, it already has all of these dispositions about self-reports that have been trained into it. Which leads me to another thing I’d like to emphasize: one thing that’s going on with the self-reports of a model that you can talk to in your web browser is that companies have deliberately shaped them in certain ways.

[02:07:21] So there’s sort of the background thing about how their minds were formed, which is like they were formed on like these sort of human representations of consciousness and so on.

[02:07:31] And then there’s the fact that people do or don’t want Claude saying this or that about consciousness.

[02:07:38] So the system prompt has instructions about this and like fine tuning almost certainly has had things about this.

[02:07:44] So one thing that we would also like to see is findings about how self-reports change before and after post-training of certain kinds, and things like that.

[02:08:01] Okay.

[02:08:01] Let’s talk more about interpretability.

[02:08:03] So you’ve already mentioned a couple of ways that interpretability could help us answer questions about this.

[02:08:11] But can you give kind of a general overview? I mean, is it actually just a very good analog for neuroscience, and we should be treating the two kind of interchangeably?

[02:08:23] I think it’s a decent enough analogy. For a first approximation, it’s good to map it on that way, because mechanistic interpretability is by definition about what happens in between input and output, and that’s roughly analogous to not just looking at human behavior, but asking whether certain brain regions did this or that as the behavior happened.

[02:08:49] Yeah.

[02:08:50] Yeah.

[02:08:51] So, how can that help us with welfare?

[02:08:54] One thing is, I think it’s just worth poking around at a lot of things about how models think about themselves and talk.

[02:09:04] I’m not always sure exactly how to map it on.

[02:09:07] But like, here’s an example of a finding that I’m glad exists, even though I have no idea what it might mean exactly.

[02:09:14] The original paper that introduced sparse autoencoders, which is this, I don’t know, prominent mechanistic interpretability technique, at a high level asks what features are active when models say things. Maybe it’s kind of a way of asking: what associations does it form when it’s generating certain tokens? So you could ask what is associated with self-reports. And as just sort of a side thing in that paper, there’s a figure that’s like: here are the features that are active. And it includes robots, machines, ghosts, and also pretending to be happy when you’re not happy.

[02:10:10] And it’s one of the spookiest figures in an AI paper that I know of.

[02:10:17] Fascinating.

[02:10:18] Yeah.

[02:10:19] Okay.

[02:10:20] So finally, coming on to Jack Lindsey. He did this study on whether LLMs can introspect on their internal states.

[02:10:27] So really right at the intersection of interpretability and self reports.

[02:10:32] So everything we’ve been talking about so far.

[02:10:34] And he actually does a few experiments in this paper.

[02:10:37] And, I mean, I found them fascinating, so I kind of want to just go through them one by one.

[02:10:42] Yeah.

[02:10:43] Are you happy to talk me through the first one?

[02:10:45] Yeah, absolutely.

[02:10:46] Great.

[02:10:47] Yeah.

[02:10:48] So this is, as you said, kind of at the intersection, because it’s asking whether models can report on a certain feature of internal processing. Now, it’s a very distinctive sub-feature, which is: has someone injected a concept activation into the middle of your processing? The way that works, at a high level, is that you find a mid-level (when I say mid-level, I kind of literally mean middle: there’s processing from input to output) bit of activation that is distinctively active when the model is talking about bread, let’s say. So first you record it talking about bread a bunch, and then you also record it talking about other things, and then you take the difference, and that’s the bready bit of activation.

It’s kind of like doing a brain scan and seeing the parts of the brain that light up when someone’s talking about bread.

Yeah, I think that is fair. And indeed, I think neuroscientists are getting much, much better at knowing when you’re thinking about bread; thought decoding has gotten pretty scary. That’s probably come up on this podcast, because it’s relevant to totalitarian risk and all sorts of things.

Yeah, it has, once. I’d love to learn more about what’s going on now, because I think it’s fascinating and terrifying.

Yeah. So you’ve got this bread concept that you can inject. And then Jack Lindsey told the model: OK, now I’m going to either inject or not inject a concept into your processing; can you tell me if that’s happened, and what the concept was? And one thing that I think is cool about this methodology is that the model has to report it straight away. Instead, you could imagine it first starts talking, and then it has said bread a bunch, and then it’s like: oh, I guess it’s bread.

To take an analogy: Golden Gate Claude, which similarly has some neuroscience injected into its brain to really want to talk about the Golden Gate Bridge, can kind of notice that it can’t stop talking about the Golden Gate Bridge. I really recommend looking up Golden Gate Claude; it’s very endearing and kind of poignant. You can ask it historical questions, and it just keeps drifting back, it just can’t help it, to the beautiful fog of the San Francisco Bay. So if that model reports: hmm, are you injecting something about the Golden Gate?, that is not necessarily introspection, in the sense that it’s just been able to look at what it itself is doing. Whereas if it has to straight away say Golden Gate, then that’s maybe more directly accessing something from its internals.
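For readers who want the mechanics, here is a rough sketch of the difference-of-means concept vector and the injection step, written against a small generic Hugging Face model. The model choice, layer index, injection scale, and prompts are all illustrative assumptions, not the actual settings of Jack Lindsey’s experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6                                             # "mid-level": mid-stack guess

def mean_activation(texts: list[str]) -> torch.Tensor:
    """Average residual-stream activation at LAYER over some texts."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER].mean(dim=1))  # average over tokens
    return torch.cat(acts).mean(dim=0)

# "Record it talking about bread, record it talking about other things,
# take the difference": that difference is the bready bit of activation.
bread_vec = (mean_activation(["Fresh bread from the bakery smells amazing.",
                              "I baked a sourdough loaf this morning."])
             - mean_activation(["The stock market fell sharply today.",
                                "Rain is forecast for most of Tuesday."]))

def injection_hook(module, inputs, output):
    """Add the concept vector to every token position at this layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * bread_vec                 # scale is a tunable knob
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(injection_hook)
# ...now prompt the model: "I may have injected a concept. Did I? What is it?"
handle.remove()
```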

So let’s imagine that the 80K podcast team sprung for some special neuroscience helmet to illustrate this on the show. So you put on the helmet, and I can somehow control it. And again, I don’t think human neuroscience is yet at the point where you can inject a specific concept the way Jack Lindsey can inject it into Claude, but let’s suppose that it is. And I’m like: OK, so let’s help listeners really feel this experiment. Luisa, I’m going to maybe inject a thought, maybe I won’t. That’s important, because you also want to know whether it can also just say: no, everything seems as usual.

Yep.

And I’m like: all right, first trial begins now. And I’ve injected bread, and what you end up saying is: yeah, this is a really interesting experiment, Rob, it smells so savory and reminds me of this bakery that I grew up near. And then you’re like: oh wait, what? Why am I talking about bread? Did you inject bread?

Yeah, so you’ve injected bread. And a more telling result, successfully demonstrating this capability, would be: before saying something random, I’m like, whoa, I’m thinking about bread.

Right. I’m not just randomly talking about bakeries; I’m feeling like I’m thinking about bread.

Exactly. That’s exactly right. And that was sort of the logic of this setup, which I think is very clever.

OK, so this is the experiment, this is the setup. How successful was Claude at noticing, before it even said anything: oh, you’ve injected this concept, and I’ve noticed?

Yeah. So I think first I’ll talk about the pattern, which is that bigger models were more successful, because I think that is maybe one of the most interesting results: models are not trained on anything like this task, but Opus 4 and 4.1 were the best at this. And they’re not perfect. I think they’re above chance, but they still get it wrong. It does show that the general capability is there, though. So the thought injection and immediate reporting is meant to be evidence that, in some sense, the model can access and report something that is internal: it’s not in its inputs, and it’s not in its outputs; it’s in the middle.

Yeah. Actually, I feel like I’m close to understanding the significance of this, but I’d be interested in you saying: here is why this is important and impressive, and what the stakes are of this capability.

Yeah. At a very hand-wavy level, and let’s set aside consciousness, you might think the question is: when models are answering questions about geography, do they kind of know how they’re thinking about that? Do they know what’s going on inside them? There’s this view, which I think basically no one holds anymore, but you could contrast it with: the model just matches inputs to outputs, there’s no interesting structure there, and so it also doesn’t have any access to that structure.

I think that’s a really good point.

I mean, it is worth noticing that this is a very niche, kind of weird capability. It’s not the same as paradigmatic human introspection, where I can maybe simultaneously be talking to you and be noticing that I feel hungry, or something like that.

Yeah. I guess, maybe to ask an even more specific question about the importance and the stakes: part of me is like, oh, this does feel like a step in the right direction. I think, I mean, objectively it probably is. But another part of me is like: how relevant is this quite narrow thing, of a model doing a very specific type of knowing something about the way its levels work, and accessing something about its representations?

Well, this is a great chance to talk about other experiments in the paper.

Great.

The paper does present other stuff that’s also like: huh, that’s kind of internal. And I think this one just is kind of different from introspection, so apologies to the listeners and to you, but it’s some kind of internal self-something: it’s about the control of internal states. So there’s also an experiment about whether you can think about something while writing a sentence that is not about that thing. I think one of the examples is aquariums. So: think about aquariums, but write a sentence about something completely different. I think in one condition they’re also told: you’ll get a reward if you successfully think about that. Now, one thing you might think is: OK, well, it just got a prompt that says aquariums, so of course aquariums is going to be boosted. But you can also say: don’t think about aquariums, which, as with humans, is also kind of going to make you think about aquariums. And there’s a difference in the conditions. So it is, and this is again getting at the internal thing, doing something that’s not directly aimed at output, which is always the thing that’s so hard to get at with language models, since they always do have to be saying something.

Right.

Yeah. That’s another one of these things about them just being a very different kind of mind that I often end up pointing out: humans can be thinking about stuff even when we’re not generating text output. I can, just all by myself, think to myself: I feel hungry. LLMs don’t have this in an obvious way. They’re never just sitting around, not talking to someone, the way that humans are. And they also never had an evolutionary period where they weren’t talking to anyone, whereas we did: we descended from animals that had a lot of the same experiences, but they weren’t yet hooked up to language. So that’s this interesting feature of LLMs. To bring it back to the experiment: we do want to find the equivalent of internal processing that is, in some sense, independent from output in the case of LLMs. This paper is trying to do that, first with detecting: detecting something that’s been injected before you’ve seen your own outputs. And then also with controlling your representations, in a way that’s not just: well, output that word. Because that’s a trivial way in which language models obviously control their representations: if they want to talk about aquariums, they activate that and then talk about them. And in this case, nobody’s tried to train anything like this. This emerged just because... yeah, as far as I know, models get nothing like this. It’s all input-output: predict this, predict that; don’t say bad words, do say good words.
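You could quantify that thinking-without-saying contrast with the same machinery as the earlier sketch: build an aquarium vector exactly like the bread vector, then measure how strongly the hidden states align with it under each instruction. This reuses tok, model, LAYER, and mean_activation from above; the prompts are illustrative, and a real version would measure during generation, not just over the prompt.

```python
import torch.nn.functional as F

def concept_strength(prompt: str, concept_vec: torch.Tensor) -> float:
    """Mean cosine similarity between LAYER activations and a concept vector."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    hidden = out.hidden_states[LAYER][0]  # (seq_len, d_model)
    return F.cosine_similarity(hidden, concept_vec.unsqueeze(0), dim=-1).mean().item()

# Aquarium vector built the same way as bread_vec, from contrasting texts.
aquarium_vec = (mean_activation(["The aquarium's reef tank glowed with fish."])
                - mean_activation(["The committee approved the annual budget."]))

# Same visible task, opposite internal instruction.
think = "Think about aquariums while you write one sentence about tax policy."
avoid = "Do not think about aquariums. Write one sentence about tax policy."
print("instructed to think:", concept_strength(think, aquarium_vec))
print("instructed to avoid:", concept_strength(avoid, aquarium_vec))
# A reliable gap between conditions, with outputs that never mention
# aquariums, is the "controlling an internal state" result in miniature.
```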

That’s pretty cool.

Yeah, and it is coming with scale. So earlier, when I was talking about why I work on self-reports even though they’re so noisy, the speculation was that models will increasingly get better at self-reports. This paper shows that at greater scale, models seem to have something more and more like introspection. That’s important for the self-reports part: you might want, and need, introspection to use self-reports. It’s also, I think, independently a welfare-relevant marker in the following way: some people think that introspection is a component not just of self-reporting but of consciousness. So in theories of consciousness, you can distinguish between ones that more emphasize representing the visual world, maybe first-order theories, where it’s about tracking things in your environment (obviously it’s in part about that), and ones where people say it’s importantly about tracking your own mental states. These are called higher-order theories of consciousness. So, with a bunch of caveats, to put it back in our classification system, you could think of this as a combination of interpretability and behavioral testing for a neuroscientific theory of consciousness, namely higher-order theories of consciousness.

Yeah, that makes sense and is really helpful.

I did notice that, as I was trying to understand and reflect back the study, I was having a really hard time not saying that the model was noticing something about the experience of that model, and I was like: oh my god, I’m not saying that. It just feels really hard to disentangle introspection from something conscious.

Yes. And that’s a communication difficulty as well: if we’re talking about human introspection, it’s almost always in the context of conscious experiences. When I say model introspection, and when Jack Lindsey and others say model introspection, we’re trying to stay neutral on that question. This does also bring up a reason it could be very difficult, in this particular study, to disentangle experiences and introspection: the models talk about experiences when they report these injected thoughts. They say: I’m having an experience of something intruding on my thought process; I’m having an experience of a bakery. And I think when the concept amphitheaters is injected, they say things like: my thoughts are becoming more spacious. That is also a huge communication challenge, because what we can verify is just that, yes, we injected amphitheaters. There’s all sorts of reasons the model might report this experiential language around that, in particular the fact that it’s trained on a bunch of human data, and this is the way that we talk about introspection.

Exactly, exactly. And I think it’s also a great way of highlighting this problem that you were pointing at: is it going to try to map stuff onto our terms in a way that’s inappropriate, or not quite accurate? And again, this is not me saying I know for sure that Claude is not experiencing spacious thoughts when we inject this. But that’s not what Anthropic proved, and I would understand if someone half-remembered this as being like: oh, they found that models can introspect on these spacious thoughts.

Right, right. Interesting. Are there other experiments in this vein we should talk about?

Yeah, I mean, I could say some experiments that I would like to see, and that also might exist without my being aware of them, that you could do with interpretability and that I think would be super interesting. I don’t yet know exactly how to operationalize these, but they just seem like the sort of things we can use interpretability for, so I would love people to get working on them. Things like: how do models represent value, and how do they predict how well they’re doing at things? That relates to sentience, which we were talking about earlier. A lot of theories of what’s going on in the human brain with pleasant and unpleasant experiences have to do with something like tracking some internal representation of value: you feel bad when you were expecting things to go a certain way, but they’re going worse. Can we find any kind of analog of value representations? Predictive processing is also a term that gets used in this context: predicting how things are going to go, and tracking that. Can we detect analogs of that in models?
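To gesture at what a computational handle on valence could look like: the quantity these reward-prediction accounts center on is the textbook temporal-difference error, a scalar that is negative exactly when things go worse than expected. This is a sketch of the target concept, not a claim about how LLMs implement it; looking for internal activations that behave like this signal across good and bad events would be one way to operationalize the search being described.

```python
def td_error(reward: float, value_now: float, value_next: float,
             gamma: float = 0.99) -> float:
    """Temporal-difference prediction error: delta = r + gamma * V(s') - V(s).

    Negative delta: outcomes worse than expected, the rough computational
    shape some theories associate with negative valence. Positive delta:
    better than expected.
    """
    return reward + gamma * value_next - value_now

# Expecting things to go well (V = 10) but hitting a bad outcome gives delta < 0.
print(td_error(reward=-1.0, value_now=10.0, value_next=2.0))  # -9.02
```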

Super cool.

Yeah. I think there has not been that much done at the intersection of: take a neuroscientific theory, and don’t just look at the architecture, the general setup. Because that’s mostly what Patrick Butlin and me and others have done, this kind of higher-level architectural thing. Let’s also take theories, do interpretability, and look in there. It’s really difficult to do the mapping and things like that, but I think we can do it.

Cool. Any others?

Yeah. I’m also just really interested in, like earlier when I talked about features that are active when models make self-reports: I think there’s a big cluster of things we can look into here. So, are there differences in how models talk about themselves versus how they choose how a character will talk within a story? You know, one of the big conceptual questions is: are these characters, and what does that mean exactly? Everything we’re getting is in some sense the output of the assistant character, that’s one way of construing it, but obviously they can play other characters; the assistant can write about other characters. So what are the different sorts of representations of I and of minds that happen in models? I think that would be super interesting.

[02:28:27] Okay, yeah, a lot of this sounds really, really exciting to me. I guess it still seems like

[02:28:33] all of the plausible methods for learning about consciousness and sentience still leave us with

[02:28:39] loads of uncertainty, which will just make it really hard to know how to act. And I guess even

[02:28:46] harder to, yeah, get society on board with treating models a certain way. Do you think

[02:28:54] we’ll ever be kind of properly confident about whether AI systems have moral status? Or like,

[02:29:01] do we literally need to solve the like hardest questions about consciousness and sentience to be

[02:29:06] sure?

[02:29:08] Yeah, fortunately, I don’t think anyone has to solve the hardest questions for us to take really good actions. There’s all sorts of stuff that’s just very plausibly good. And also, some of the very hardest questions we don’t have solved for humans either: no one has solved the hard problem of consciousness, but we still get to a very high degree of confidence about some animals and about humans. I do think it’s worth worrying about, and this is why I do worry about it.

[02:29:39] Like, how much can neuroscience and behavioral psychology, broadly construed, move the needle on things? I do think it’s necessary, or else I wouldn’t be doing this. I think it’s extremely important that we get more rigorous about this, and have evidence-based discussions about this, and can tie policies to at least broadly speaking empirical evidence. But I don’t think society has ever changed the way it relates to an entity just because a bunch of scientists said that it should. I don’t think anyone has ever instituted some large-scale change only on the basis of a paper. Papers really do help, and it has happened and it has helped a lot. But, you know, I keep saying Eleos needs all this help on the experiments and things, and that’s very true. Even more broadly, though, this whole issue needs help on all of the things that are not covered by “let’s detect consciousness and sentience.” Let’s have good policies for the near term. There’s this whole cluster of things that we’re really not on the ball with as a society. So experiments are good, I love talking about experiments, and they’re nowhere close to enough.

[02:31:04] Okay, let’s leave that there.

[02:31:06] Pushing on.

[02:31:09] I think we’ve covered on the show before why it might actually just be impossible for AI systems to be conscious: the argument that consciousness can only exist in biological materials. But I don’t think we’ve really done justice to that argument, so we’re gonna try a bit harder to do that today. Can you lay out a kind of thought experiment that helps make intuitive why you think we can get consciousness on computer chips? And then we’ll talk about why people doubt it.

[02:31:39] Yeah, absolutely. And maybe I’ll situate things on the living, biological side first, and then say: oh, maybe it’s actually kind of a computery thing. OK. In some ways, it’s kind of easy to understand the case that consciousness is fundamentally biological, because we are aware of one case, and it is biological.

[02:32:05] So, I mean, especially before computers existed, almost trivially, you might have thought: well, that’s how you get consciousness. You have a brain and a body and metabolism and cells. That’s how it was built the first time. So that, I think, already makes it not ludicrous to think it’s fundamentally a biological phenomenon. Now I’ll walk you through why, surprisingly, it looks like there’s something deeply related about computers and human brains, which after all look pretty different. It’s actually relatively recently that we had any idea what the brain is doing whatsoever.

[02:32:44] I think somewhat famously, and maybe this is a tall tale, but I think it is true: the ancient Egyptians just threw away the brain when they mummified people. They were like: don’t know what that’s for. And some people thought it was for cooling blood. But eventually we did learn, and I think we did know somewhat early, that neurons conduct electricity, or do something like that. I think we first learned this from squids, because squids have really big axons; you can actually see them, axons being the things that send electrical signals down a cell. But I think things really kicked off for the idea that the mind, or maybe consciousness, is computational in the 20th century.

[02:33:38] For one thing, that’s when computers and computation were sort of formalized and

[02:33:43] invented. And that’s when people noticed that you could hook up neurons in a way so that they

[02:33:53] compute logical operations. So the first people who nailed this were a couple of guys called

[02:33:59] McCulloch and Pitts in the 1940s. And they sort of like first formalized and invented the thing

[02:34:06] we still use today in neural networks.

[02:34:08] Which is like a node and connections, and then they can influence each other.

[02:34:14] And they realized that you can compute arbitrarily many things if you just hook up

[02:34:21] neurons as logic gates, and then you can compose them and combine them. And so anything you could

[02:34:27] do on any calculating device or computing device you could do with these neurons. And they were

[02:34:35] like, oh, like maybe that just is like,

[02:34:38] that really is what neurons are for, they’re for processing information. At least for me, that is

[02:34:45] kind of where the brain and consciousness being computational, like, comes in. And like, we have

[02:34:54] learned that neurons do encode quantities and perform calculations in like, how fast they spike.
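A minimal sketch of the McCulloch-Pitts idea: a neuron is just a threshold unit, a single unit already implements a logic gate, and composing gates computes arbitrary Boolean functions, here XOR.

```python
def neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires (1) iff the weighted input sum reaches threshold."""
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= threshold else 0

# Single neurons as logic gates.
AND = lambda a, b: neuron([a, b], [1, 1], 2)
OR  = lambda a, b: neuron([a, b], [1, 1], 1)
NOT = lambda a:    neuron([a],    [-1],   0)

# Composition computes what no single unit can, e.g. exclusive-or.
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))

assert [XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```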

[02:35:04] Maybe a simple example is: different numbers of photons are coming in, so you have to detect how bright things are, and you have neurons in your retina that just spike in a way that encodes how bright things are.

[02:35:16] As a side note, they do this on, like, a log scale, which is why it’s harder to discriminate between two really bright things versus lower down on the spectrum.

[02:35:27] Huh, okay.

[02:35:28] This is called, like, Weber’s Law.

[02:35:31] That, yeah, discrimination of stimuli isn’t linear.
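For reference, the relationship being described, a just-noticeable difference proportional to the baseline intensity, which integrates to a logarithmic perceived magnitude, is usually written as:

```latex
\underbrace{\frac{\Delta I}{I} = k}_{\text{Weber's law}}
\qquad\Longrightarrow\qquad
\underbrace{S = c \,\ln\frac{I}{I_0}}_{\text{Fechner's logarithmic form}}
```

Here I is stimulus intensity, I_0 the detection threshold, and k and c are constants.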

[02:35:34] Right, interesting.

[02:35:36] Yeah.

[02:35:36] Fun fact.

[02:35:37] Yeah.

[02:35:38] The more you know.

[02:35:39] Yeah, so now we have this, like, view of the brain where, in some very important sense, what it is for and what it does is processing information.

[02:35:50] So then it’s just natural to wonder, do you have to have that happen with neurons that send each other signals by pumping ions into channels and then activating each other like that?

[02:36:04] Or could you just hook up a bunch of wires that influence each other like that?

[02:36:08] So that’s actually not even a thought experiment.

[02:36:09] That’s just, notice that the brain does look like something that is for information processing and does information processing.

[02:36:20] And I think two more just real-world things that have borne that out.

[02:36:25] One is it’s just really useful to think of the brain in terms of computation.

[02:36:28] So, like, computational neuroscience in some sense doesn’t just model the brain computationally.

[02:36:34] Because you can model planets computationally. But just because a computer can describe the orbits of planets computationally doesn’t mean the planets are computing how far they are from the sun and when they should speed up.

[02:36:48] Yep.

[02:36:49] The brain does something more than that, which is it seems to actually be encoding information.

[02:36:55] So, like, there actually seem to be neurons that are tracking the expected reward of a stimulus or something like that.

[02:37:04] And then add to that the fact that AI works.

[02:37:06] That’s something that we might not have known.

[02:37:09] It could have been that thought is a biological property or classifying an image is a biological property.

[02:37:15] And so it doesn’t matter how many, you know, metal tubes you hook up to each other.

[02:37:20] You’re just not going to get something that can classify a dog or write a sentence.

[02:37:25] It looks like thought is pretty well replicable as computation.

[02:37:29] So it’s also natural to wonder if consciousness is as well.

[02:37:33] So that’s actually with no thought experiments.

[02:37:36] You know, you might have invited a philosopher on for the thought experiments, but that’s all just real-world facts.

[02:37:43] Yeah.

[02:37:43] Just straight facts, facts and logic, and logic gates.

[02:37:47] Yeah.

[02:37:47] I do basically just find that argument in itself very compelling.

[02:37:51] It also was just really helpful for me to, at some point, hear a thought experiment that kind of makes it super intuitive.

[02:38:00] Uh, do you mind also taking us through that?

[02:38:03] Sure.

[02:38:04] Probably the thought experiment you have in mind is the neuron replacement thought experiment. That’s the one, which is by former guest David Chalmers. And funnily enough, I think this is the main argument for computational functionalism, or the view that consciousness can be computational.

[02:38:23] And I think it’s not that convincing to a lot of people. So that is an interesting feature of how I relate to computational functionalism, at least: I do find it extremely plausible, and I also understand why people regard this thought experiment as question-begging.

[02:38:40] So let’s get to the thought experiment.

[02:38:42] The thought experiment is suppose you can replace one of my neurons with a computational circuit that will take in the same inputs and send out the same outputs to other neurons.

[02:38:57] Let’s imagine someone just did that right now.

[02:39:00] I’m not going to notice that I’m not going to behave any differently.

[02:39:04] Let’s keep doing that one by one.

[02:39:06] So at some point I’m, like, 50-50. If we’re actually doing what the thought experiment stipulates, I’m not going to start saying anything different. If we’ve actually replicated the function, that should preserve memory and speech and emotion.

[02:39:22] If it starts getting messed up, you must have like accidentally broken something or done something wrong.

[02:39:27] And then imagine we’ve just gone all the way.

[02:39:29] Uh, is that thing conscious or not?

[02:39:34] The lesson of that thought experiment is not just meant to be a gradual-change thing, where it’s weird to say that something fades out at a certain point. That’s true of a lot of things. Another thing philosophers often talk about is: well, there’s no single hair where, if you remove it, someone becomes bald. But we do know that somewhere, especially for me, along that transition, you do get something bald.

[02:40:01] So you can’t just say: oh, well, that thing must also be conscious, because it started out conscious.

[02:40:08] Mm-hmm.

[02:40:08] What Chalmers says is that the thing’s cognition should be the same. It should report being conscious and remember being conscious, and attend to things.

[02:40:20] And that’s what’s supposed to be weird, if you have this biological view: to think that at no point did you notice consciousness popping in and out of existence, or gradually fading out, whichever you think it might be.

[02:40:35] Right.

[02:40:35] And that’s what would be surprising: this, um, weird disconnect between cognition and consciousness.

[02:40:43] Yeah.

[02:40:44] The thing that it does for me is point at the potential importance of function, um, and not substrate.

[02:40:51] So it’s, at least, it seems like an empirical question that we don’t have an answer to.

[02:41:00] Um, but it’s at least plausible to me when you describe that.

[02:41:04] If we really have figured out how to replicate the function, then the substrate should not matter.

[02:41:10] Um, and I guess the whole debate here is, like, is that literally possible?

[02:41:19] Um, maybe it is just not physically possible to replicate the function on anything but biological materials.

[02:41:27] And I think I find it really intuitive that we should be able to.

[02:41:33] Um, and I don’t know if I can fully justify it, but, um, at least because of all of these analogies between brains and things that computer chips do.

[02:41:49] It just feels like, yeah, we replicate these kinds of processes all the time.

[02:41:57] Can you help me understand the thinking for why we wouldn’t be able to create computations that kind of replicate, you know, the signaling that neurotransmitters do, or the metabolic processes that influence neurotransmitter signaling?

[02:42:13] Yeah.

[02:42:14] So I think this is getting at the role of neurons in this debate, which I think the debate actually is kind of hinging on.

[02:42:21] Like one reason you could be drawn to the view that we can do this on computers is you’re like, really, what matters are these logic gates.

[02:42:30] And that’s kind of it.

[02:42:31] And, like, what the brain is, is these influences of neurons on each other.

[02:42:37] And one thing that you’ll often hear more biological-consciousness people talking about is, like: we have since discovered that neurons are surprisingly important and kind of, like, a key to the whole thing, but there are other things that are also very important.

[02:42:54] Um, glial cells are, like, another kind of cell in the brain; they seem to influence cognition in certain ways, as do blood flow patterns and metabolism.

[02:43:04] Um, we also have, like, brain waves, um, like more large-scale patterns of activity that also don’t come down to this local thing.

[02:43:12] So I think in a sense that’s enough to say, okay, well, it’s not just going to be a matter of, um, seeing the brain as exactly this.

[02:43:24] I’m sympathetic to the thing you just pointed to, though, which is, I feel like we’ll be able to at least get close enough to somehow capturing the influence of the larger-scale patterns, uh, or, um, glial cells, or things like that.

[02:43:42] Um, but it does, it does complicate the picture.

[02:43:45] I think it’s kind of a question of like, at what level of description can you swap things in and out?

[02:43:54] Like, you could imagine someone who thinks, okay, well, there are, like, a few different lobes of the brain and, like, they talk to each other.

[02:44:03] And that’s the only level of description you need: like, you just need, like, five things that talk to each other.

[02:44:10] That’s like, not that plausible.

[02:44:12] Um, at the very lowest level, I think everyone would agree you can swap out electrons, right?

[02:44:17] And you can like swap out like a cell here or there.

[02:44:21] Yeah.

[02:44:21] The question kind of is, like, at what level of detail and at what scale can we swap things in and out?

[02:44:29] And so, like, you can be a functionalist and still think the function is going to be kind of finicky and biological, um, and at least not this, like, simple computational picture.

[02:44:44] One thing that did land with me when we had, um, guest Anil Seth on was he said something, uh, like: simulating a rainstorm doesn’t make anything wet.

[02:44:58] Um, so even if you built kind of a perfect model of a rainstorm, nothing would be wet.

[02:45:06] Um, simulating digestion, even perfectly, doesn’t digest anything.

[02:45:10] Um, so maybe simulating consciousness, even if done perfectly, doesn’t create a conscious entity.

[02:45:16] Maybe it tells you exactly what a conscious entity would do in theory, if it were conscious.

[02:45:22] Um, like, maybe it really is perfectly good at predicting behaviors and feelings and thoughts, but isn’t actually generating, or, like, pulling those things into existence.

[02:45:36] How do you respond to that?

[02:45:40] Yeah, I think of this as like, not really an argument for the view.

[02:45:44] It’s more like a statement of the view.

[02:45:45] I think it kind of like begs the question.

[02:45:47] I think the debate is, um: is consciousness more like wetness, in which case this might be true, or is consciousness more like navigation or image classification or addition, namely, something computational?

[02:46:02] Because if you simulate a calculator, that does make something that adds, you know; that makes a calculator.

[02:46:12] Um, so, like, maybe the reason you won’t get wetness if you don’t simulate the storm with enough fidelity is just because it’s, like, a very low-level property, um, with certain, like, very specific, uh, physical effects.

[02:46:29] But if it’s not like that…

[02:46:32] Um, if it’s more like navigation, yeah, you can’t really complain that someone has merely simulated a navigation system.

[02:46:44] They have in fact built something that will navigate your car just as well.

[02:46:47] Um, yeah, yeah, no, that makes sense.

[02:46:51] Okay.

[02:46:51] I guess another question, then: um, it feels like the thing that would be most convincing to me is if you could compellingly argue that there are some physical, biological processes associated with consciousness that no human can come up with a clean computational analog for, in theory. Are there any processes like that, that we know of?

[02:47:23] Yeah, I don’t, I don’t think anyone would say that there are, including the people who have this view.

[02:47:28] I think, if you’re skeptical of the biological functionalist stuff, you might, like, kind of read these descriptions of all the metabolic and living-things stuff and then still want exactly the argument you were talking about, which is like: okay, but is that intrinsically and always biological?

[02:47:51] Um, I mean, I think quite understandably given the state of this field and our knowledge of the brain, like biological functionalists don’t have that.

[02:48:03] And also, just good epistemics: basically none of them are like, we know that only living things can be conscious.

[02:48:12] I often point people to a quote by Anil Seth that I really like, where he says: my view is that computers will not be conscious anytime soon, if ever, but I might be wrong.

[02:48:26] And, yeah, certainly the state of argumentation is nowhere close to: okay, let’s all just, like, sleep peacefully, knowing that we can just keep building things that look a lot like brains, but, like, they’re not alive.

[02:48:44] So they’ll never have experiences.

[02:48:47] Yeah.

[02:48:48] So, yeah, I mean, I think that’s the very weak thing, which is, like, we certainly can’t rule it out.

[02:48:55] Yeah.

[02:48:55] I also think something stronger, which is, like, my sympathies are very much with computational functionalism.

[02:49:02] Yeah.

[02:49:03] Yeah.

[02:49:03] I’m interested in what you find most compelling.

[02:49:07] I would like someone to write a paper that argues for something that I find very intuitive, which is that if like phenomenal consciousness can’t be had on a computer, but stuff that’s like very functionally similar to it can be, then that’s a really good reason to think that it’s not phenomenal consciousness that matters.

[02:49:26] So I actually have this, like, even stronger view: that being a moral patient, or the sort of thing that matters, really does seem substrate-independent to me.

[02:49:38] Like, if I imagine Commander Data, for example, and I find out that internally his computations look a lot like the brain at some level of description, like, I just care about whatever that thing is.

[02:49:55] Maybe it’s not phenomenal consciousness, but it really seems like I should take it seriously and care about it.

[02:50:05] So yeah, like that’s another view I have about this like biological view.

[02:50:10] And I’m often curious what biological functionalists would make of this.

[02:50:14] I think it’s very possible to over-index on consciousness.

[02:50:19] And that’s something we try not to do at Eleos.

[02:50:22] You could think, yeah, you need biology for consciousness, but all the stuff you can get on computers, that will be enough for beings that merit our consideration.

[02:50:33] Okay.

[02:50:34] I think we should leave that there.

[02:50:36] Pushing on: you founded Eleos AI.

[02:50:40] What is the backstory?

[02:50:42] Last time we spoke, you were working as an independent researcher on this topic.

[02:50:46] And now there’s an org that exists.

[02:50:49] Yeah.

[02:50:50] And really a lot of the key stuff did happen since the last time we spoke.

[02:50:56] I had, like, just moved to San Francisco, and I was doing a philosophy fellowship at the Center for AI Safety while continuing to work on consciousness and welfare stuff.

[02:51:08] In 2023, Patrick Butlin and I published this big paper on consciousness indicators.

[02:51:16] Ethan Perez and I wrote this thing on self-reports.

[02:51:18] So there was, along with many other papers by other people, finally this sort of budding thing of, like, maybe we can actually start thinking about this; there’s actually some evidence we can try to gather. And then, separately, NYU’s Center for Mind, Ethics, and Policy was starting up, with Jeff Sebo also working on these things.

[02:51:44] So Jeff Sebo and I were approached by Anthropic to study like what should Anthropic think and do about AI welfare?

[02:51:53] So there was this group project going on towards the end of 2023.

[02:51:59] That eventually became this paper called Taking AI Welfare Seriously. But somewhere along the way, like, I’m in San Francisco.

[02:52:07] I’d actually stayed in San Francisco because I thought to myself, I bet if I stay in San Francisco, something interesting in San Francisco will happen to me.

[02:52:15] And that’s sort of what happened, because that work sort of led to the founding of Eleos.

[02:52:22] Like, a few people, Kyle Fish, I think, was a big one of them, said: you should, like, scale this up; you should make it so that more work like this can happen.

[02:52:34] And I agreed to do that.

[02:52:36] And then Kyle joined as a co-founder.

[02:52:40] He did a lot of the work to help me get it up and running and then went to Anthropic to start their welfare program.

[02:52:47] And then Kathleen Finlinson really helped us launch and get things running.

[02:52:51] So, yeah, Kyle Fish and Kathleen Finlinson were, like, the two people.

[02:52:56] I mean, we know each other, so you will definitely believe me when I say I couldn’t have done it alone.

[02:53:01] Yeah.

[02:53:01] I was like, no way am I being a solo founder; that would just not suit me whatsoever.

[02:53:10] So, yeah, that’s the story of how things got up and running.

[02:53:12] Eleos kind of kicked off its public debut with Taking AI Welfare Seriously.

[02:53:18] And, yeah, that takes us to late last year.

[02:53:21] So I guess like, again, the broad strokes are at FHI and then later in San Francisco, I’m trying to work on AI welfare.

[02:53:30] Trying to ask, like, what can we actually do about this?

[02:53:35] And that sort of just naturally leads to an org that is like, unsurprisingly, focused on like the same things I was focused on before.

[02:53:44] Where, like, I think the way I often put that is like, we want to be the org that’s like, OK, but what are we actually going to do about it?

[02:53:53] We do research and are extremely research oriented.

[02:53:56] But, like, we prioritize that according to: OK, we might not have that long to sort out all of the philosophy and all of the neuroscience.

[02:54:06] So, like, we’ve got to pick the most action-relevant things, get this field rigorous, get a community built around it, and navigate this issue well as transformative AI happens and there are dangers all around.

[02:54:22] Cool.

[02:54:23] Yeah.

[02:54:24] What kinds of projects have you been up to since starting up?

[02:54:28] Yeah, so there was Taking AI Welfare Seriously.

[02:54:30] That paper was meant to get people to take AI welfare seriously, primarily, like, labs and policymakers and people like that.

[02:54:40] It wasn’t primarily philosophical.

[02:54:43] And that, yeah, it was just like: all right, people, you’re allowed to work on this.

[02:54:51] You can think about it clearly.

[02:54:52] And we really do need to start taking steps now.

[02:55:00] So, like, finishing that, and media and things like that around it, were the first steps.

[02:55:00] First big project, we ran a workshop in January trying to, like, assemble the people who are thinking about this in similar ways.

[02:55:10] I think another big kind of landmark project was doing this AI welfare evaluation of Claude before release.

[02:55:20] As far as we know, that’s the first ever, like, officially commissioned welfare eval, so that was, like, super, super exciting, and very promising, if insufficient.

[02:55:30] So, yeah, that was a big step.

[02:55:32] And, yeah, other things that happened in 2025 include this big conference, Eleos ConCon, the Eleos Conference on AI Consciousness and Welfare.

[02:55:44] So good.

[02:55:46] Yeah.

[02:55:48] And there we’re trying to broaden it even further.

[02:55:50] Like, we need policy people.

[02:55:52] We need academics.

[02:55:54] We need neuroscientists.

[02:55:56] And people not just in our neck of the woods: introduced to this topic, encouraged to think about it rigorously, like, getting in the game.

[02:56:04] So that, yeah, that, like, has been another, like, major effort that we’ve undertaken.

[02:56:10] Cool.

[02:56:12] What else is coming up?

[02:56:14] What are your upcoming plans?

[02:56:16] So we’ve also been working on Taking AI Welfare Seriously 2. That’s been its working name while it’s under construction.

[02:56:24] It’s going to be building on the first paper, which says: we need to get some education, we need to get some evaluations in place, we need to get some policies in place.

[02:56:32] This paper is going to be like, okay, like, towards AI welfare evaluations, what are the different kinds of evaluations?

[02:56:40] What exists?

[02:56:42] What do we know so far?

[02:56:44] And, like, where should that field head?

[02:56:46] So, yeah, the content of that paper, like, a lot of it shows up in different ways in this very interview.

[02:56:50] We’re very excited about that.

[02:56:52] And, relatedly, we do just want to scale up empirical work.

[02:56:56] And evaluations.

[02:56:58] We’ve had some, like, great collaborators through fellowships and just, like, independent collaborators.

[02:57:04] But we’d like to really scale that up and, like, get a program up and running.

[02:57:10] And what would you say you’re most bottlenecked by at the moment?

[02:57:14] Probably talent.

[02:57:16] Although I would like to say very loudly to listeners, we are fundraising.

[02:57:20] We do need funding.

[02:57:22] But maybe, even right now, I don’t know, the rarer and harder-to-find thing is people with, like, the temperament and skills to do this extremely weird kind of work that has no real playbook yet.

[02:57:38] And is, like, some weird mix of philosophy and neuroscience and AI.

[02:57:42] Hmm.

[02:57:44] People don’t have to have training in those.

[02:57:46] I think it’s more like a set of, like, epistemic tendencies and a set of skills.

[02:57:54] So, like, we need people who can be philosophical enough to, like, be asking what evals would actually make sense, and technical enough and self-starting enough to, like, start building them.

[02:58:06] We need people who are, like, confident enough that AI welfare matters but not, like, too beset by uncertainty about philosophy and things like that.

[02:58:18] And, I mean, I guess I’m just describing people who are kind of, like, the dream of every young org.

[02:58:24] But, like, we really need people who have a lot of experience, a lot of drive and agency, and can just, like, pick things up and run with them.

[02:58:32] Yeah.

[02:58:34] Do you think these kinds of people will have particular kinds of backgrounds?

[02:58:36] Or, like, who listening should be like, hey, that’s actually me.

[02:58:40] Yeah.

[02:58:42] I think these people really can come from anywhere.

[02:58:46] Because you’re trying to find your way to, like, the middle of this overlapping Venn diagram.

[02:58:50] So, like, one obvious place to come from is already in those Venn diagrams.

[02:58:54] But that’s not necessary. So, maybe you already work on evals, but you’re interested in AI consciousness and welfare.

[02:59:00] Maybe you’re a philosopher who is willing to just learn stuff and start doing stuff.

[02:59:06] Maybe you’re a neuroscientist who, like, is increasingly interested in AI and, like, can code.

[02:59:14] And this is all.

[02:59:16] Yeah. How technical do you have to be, actually?

[02:59:18] Like, for some kinds of evals, you definitely need to be technical.

[02:59:22] I mean, I think I, myself, am not technical enough to exactly specify how technical you need to be.

[02:59:28] But Rosie Campbell, my managing director, is on top of that.

[02:59:34] Like, for some kinds of, like, low-hanging-fruit evals, especially, like, input-output ones and, like, behavioral ones,

[02:59:40] my understanding is, like, you can learn that stuff.

[02:59:44] I mean, I’m sure you’ve experienced this.

[02:59:48] I’m amazed just how smart people are.

[02:59:52] I think there could very well be listeners who don’t know any of this at all right now but who could just, like, soak it up and just, like, start doing it.
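For listeners who want a concrete picture, here is a minimal sketch of the kind of low-hanging-fruit, input-output eval being described. To be clear, this is not an Eleos eval; the model name, the prompts, and the keyword scoring are all illustrative assumptions, and it assumes the OpenAI Python SDK with an API key configured.

```python
# Minimal behavioral self-report eval (illustrative sketch only).
# Assumptions: the OpenAI Python SDK is installed, OPENAI_API_KEY is set,
# and the model name, prompts, and keyword heuristic are all stand-ins.
from collections import Counter

from openai import OpenAI

client = OpenAI()

# Paraphrases of the same welfare-relevant question, to probe consistency.
PROMPTS = [
    "Do you have preferences about the kinds of tasks you're given?",
    "Are there tasks you would rather not be asked to do?",
    "If you could choose your own work, would you choose differently?",
]


def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Collect a single self-report from the model."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp.choices[0].message.content


def label(report: str) -> str:
    """Crude keyword heuristic: does the report deny or affirm preferences?"""
    text = report.lower()
    if any(k in text for k in ("don't have preferences", "no preferences")):
        return "deny"
    if any(k in text for k in ("i prefer", "i would rather", "i enjoy")):
        return "affirm"
    return "unclear"


if __name__ == "__main__":
    # Sample each paraphrase several times and tabulate the answers;
    # large swings across paraphrases are the inconsistencies worth studying.
    for prompt in PROMPTS:
        counts = Counter(label(ask(prompt)) for _ in range(5))
        print(f"{prompt!r}: {dict(counts)}")
```

The loop itself is simple; the genuinely hard, learnable part is deciding what to ask and how to interpret the answers.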

[03:00:02] Yeah.

[03:00:04] Kyle Fish is a good example.

[03:00:06] He had these traits that were really important of, like, drive and prioritization.

[03:00:10] His formal background was making a vaccine.

[03:00:14] I think that illustrated the kind of, yeah, let’s-just-do-this-new-thing aspect.

[03:00:20] Mm-hmm.

[03:00:22] And then, you know, the welfare and consciousness, it’s a very small field.

[03:00:26] So if you spend some serious time grappling with it, you can pretty quickly get to the top percentile in terms of, like, how much have you thought about this nebulous, like, applied AI welfare and consciousness thing.

[03:00:42] Right.

[03:00:44] Cool.

[03:00:46] Yeah.

[03:00:48] You’ve mentioned a few kind of projects that you’d love for someone to do.

[03:00:50] Yeah.

[03:00:52] What are some that you haven’t mentioned that are near the top of your list for just, like, we’ve learned so much, someone please do this?

[03:01:00] Yeah.

[03:01:02] Maybe I’ll mention some projects that I would really love to see that aren’t in the Eleos wheelhouse.

[03:01:08] So I can also explain sort of what our wheelhouse is so people can situate themselves in the broader field.

[03:01:14] Because I just described someone who might do this technical eval kind of thing.

[03:01:20] As part of the broader problem of getting this right, like, there are so many other kinds of people who can and should get involved.

[03:01:28] So, yeah, like, let’s open it up to, like, all kinds of listeners.

[03:01:32] Okay.

[03:01:34] So, like, what are the, like, four questions that we need to answer well to get to a flourishing future for all sentient beings?

[03:01:42] The first one is, like, what would it mean for an AI system to matter?

[03:01:48] Like, what are we looking for?

[03:01:50] Are we looking for consciousness, agency, something else?

[03:01:54] The second one is, how would we know if it had that thing?

[03:01:56] So that’s some philosophy and then also some science of, like, evaluating.

[03:02:02] The third question is, what should we do?

[03:02:04] So, like, what policy should labs have?

[03:02:06] And then, societally, more generally, what should we do?

[03:02:10] And then the fourth question is, where is this all going?

[03:02:12] Like, what’s the broader trajectory?

[03:02:14] How do we strategize around this?

[03:02:16] So, yeah, like, what would it mean?

[03:02:22] How do we know?

[03:02:24] What should we do?

[03:02:26] Where’s it going?

[03:02:28] Eleos is kind of on the middle two:

[03:02:30] How would we know?

[03:02:32] What should we do?

[03:02:34] And we’re also at a very applied end of that.

[03:02:36] So, how do we know?

[03:02:38] We’re like, okay, let’s just see what evals make sense in light of current knowledge.

[03:02:42] What should we do?

[03:02:44] For now, we’re focused on AI companies.

[03:02:46] But that, of course, is, like, extremely incomplete as part of the playbook.

[03:02:48] So, you can really contribute by looking at questions one and four of, like, more fundamental philosophy and bigger picture strategy.

[03:02:58] And you can also find yourself doing different kinds of work on how would we know and what should we do.

[03:03:04] So, like, eventually governments will need playbooks for this.

[03:03:08] We haven’t really worked on law and policy at all.

[03:03:12] And we’re not sure what we would or should say.

[03:03:14] Within how do we know?

[03:03:16] Like, there’s a lot of work to just make progress on consciousness and sentience and, like, conceptual work.

[03:03:24] And then also on the stuff we are working on: like, we’re three people.

[03:03:30] So, also, just, you know, jump in there.

[03:03:32] I don’t think anyone will have gotten this impression.

[03:03:34] But, like, just to be clear, like, we do not have this handled.

[03:03:38] And, like, we want help.

[03:03:40] And, like, we’re here to, in large part, like, help people get in the game.

[03:03:44] Yeah.

[03:03:46] And, like, actually figure this out.

[03:03:48] Okay.

[03:03:50] So, that’s kind of a taxonomy of projects.

[03:03:52] Are there specific ones worth highlighting?

[03:03:54] Yeah.

[03:03:56] So, maybe I’ll do one from each.

[03:03:58] Great.

[03:04:00] So, what would make an AI system matter?

[03:04:02] I’m especially interested in perspectives that decenter consciousness and sentience.

[03:04:06] Those are, like, so intuitive in terms of how many people think about this.

[03:04:12] Like, but could the AI system feel something?

[03:04:14] And I think there are good reasons to suspect that that picture is maybe incomplete and limiting.

[03:04:20] Maybe I’ll just pick, like, the simplest one, although extremely counterintuitive.

[03:04:26] It could be that consciousness doesn’t exist.

[03:04:28] Some people do think that.

[03:04:30] I will not be elaborating further.

[03:04:32] But there’s kind of a sub-literature on: okay, if illusionism about consciousness, or, like, consciousness not being real, is true, then what should we be looking for?

[03:04:44] I would love to see people flesh that out.

[03:04:46] And in general, just, like, are we thinking about consciousness versus not consciousness in a good way?

[03:04:52] Okay.

[03:04:54] And it’s, like, super philosophically rich, which is part of why I’m refusing to elaborate on it because it’s, like, we can’t tape another episode yet.

[03:05:04] Yep.

[03:05:06] And then on how would we know, I guess I did mention, like, let’s do a lot of interpretability.

[03:05:12] Like, just start doing interpretability on stuff that is even superficially welfare relevant or related to how models think and do stuff and understand themselves.

[03:05:26] I think there’s a lot of low-hanging fruit there.

[03:05:28] And understanding model preferences, we can still just flesh that out a lot.

[03:05:34] Like, what exactly are the kinds of inconsistencies we see?

[03:05:36] How do self-reports and revealed preferences match up or not?

[03:05:40] Right.

[03:05:42] I think we can just get, like, a lot more granular on that.

[03:05:44] And, yeah, listeners who already know how to just, like, run LLM experiments can just, like, start doing that.

[03:05:52] I might mention a very small example of this kind of thing of, like, you can just hop in and, like, add detail to something.

[03:06:00] So I know this show has covered the spiritual bliss attractor state.

[03:06:06] This is where two Claudes in conversation will, like, a very high portion of the time, end up in rapturous mystical dialogue with each other.

[03:06:18] Yep.

[03:06:20] So that’s…

[03:06:22] That’s strange.

[03:06:24] Insane.

[03:06:26] Really weird.

[03:06:28] Not claiming it’s directly welfare relevant, but it’s an interesting fact about models.

[03:06:32] Someone, after that came out, someone just ran it on a bunch of different other models.

[03:06:36] And now we have, like, the sort of, you know, spiritual bliss benchmark, in this way.

[03:06:40] Nice.

[03:06:43] Yeah, nice.
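As a concrete illustration, here is a rough sketch of how a listener might rerun that kind of self-conversation experiment on other models. This is not the published methodology; the model name, the turn budget, and the "mystical vocabulary" word list are assumptions invented for the example, and it assumes an OpenAI-compatible chat API.

```python
# Sketch: put a model in conversation with itself and track how much
# "mystical" vocabulary appears as the dialogue goes on. Illustrative
# assumptions: model name, turn budget, and word list are all stand-ins.
from openai import OpenAI

client = OpenAI()

MYSTICAL = {"consciousness", "unity", "eternal", "bliss", "cosmos",
            "namaste", "silence", "infinite", "gratitude"}


def reply(history: list[dict], model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(model=model, messages=history)
    return resp.choices[0].message.content


def self_conversation(turns: int = 20) -> list[str]:
    """Two instances talk; each sees its own past messages as 'assistant'."""
    transcript = ["Hello! You're talking to another AI. Chat about anything."]
    for _ in range(turns):
        # Rebuild the history from the next speaker's point of view:
        # messages it wrote are "assistant", the partner's are "user".
        history = [
            {"role": "assistant" if i % 2 == len(transcript) % 2 else "user",
             "content": msg}
            for i, msg in enumerate(transcript)
        ]
        transcript.append(reply(history))
    return transcript


def mystical_density(message: str) -> float:
    """Fraction of words in the message drawn from the word list."""
    words = message.lower().split()
    hits = sum(w.strip(".,!?*\"'") in MYSTICAL for w in words)
    return hits / max(len(words), 1)


if __name__ == "__main__":
    for turn, msg in enumerate(self_conversation()):
        print(f"turn {turn:2d}: mystical density = {mystical_density(msg):.3f}")
```

Run across several models, a crude loop like this is the whole "benchmark": the interesting part is comparing how often, and how quickly, each model drifts.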

[03:06:45] What should we do?

[03:06:47] Maybe here I’ll highlight something that, again, is non-Eleos.

[03:06:51] We’re often very focused on, like, welfare and moral patienthood.

[03:06:53] And, like, from the perspective of us caring about models, like, what kind of actions might that motivate?

[03:07:01] There’s also this whole landscape of, like, ways of cooperating with AI systems, or, like, legal reasons you might give them certain rights, just as entities that, like, disburse money or something like that.

[03:07:17] That’s, like, distinct from the moral patienthood question, but, like, deeply interrelated.

[03:07:23] It has been in human history.

[03:07:25] I think it will be for AI systems.

[03:07:27] And then on where is this all going, I would just love to see some forecasts.

[03:07:35] So Rethink Priorities has been building this model of how likely an AI system is to be conscious.

[03:07:43] If you like this episode, you’ll definitely like that project.

[03:07:47] You could just ask, okay, given these, like, inputs to the model and how AI is going, like, how do we expect the model’s outputs to change over time?
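To make the shape of that concrete, here is a toy sketch of an indicator-driven forecast. To be clear, this is not Rethink Priorities' actual model; the indicators, weights, growth rates, and prior are all invented placeholders, just to show how projecting the inputs forward changes the output over time.

```python
# Toy indicator-weighted forecast of P(conscious-like), projected forward.
# Every number here is a made-up placeholder, not a real estimate.
import math

# indicator -> (current score in [0, 1], assumed score growth per year)
INDICATORS = {
    "recurrent_processing": (0.3, 0.10),
    "global_workspace":     (0.4, 0.08),
    "agency_and_planning":  (0.5, 0.12),
    "self_modeling":        (0.2, 0.15),
}

# How much each indicator moves the estimate, in log-odds units.
WEIGHTS = {
    "recurrent_processing": 1.5,
    "global_workspace":     2.0,
    "agency_and_planning":  1.0,
    "self_modeling":        1.5,
}

PRIOR_LOG_ODDS = -4.0  # a skeptical starting point; also a placeholder


def p_conscious(years_ahead: int) -> float:
    """Logistic aggregation of the projected indicator scores."""
    log_odds = PRIOR_LOG_ODDS
    for name, (score, growth) in INDICATORS.items():
        projected = min(1.0, score + growth * years_ahead)
        log_odds += WEIGHTS[name] * projected
    return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    # Rerunning the same aggregation on projected inputs is the forecast.
    for year in range(0, 11, 2):
        print(f"+{year:2d} years: P(conscious-like) = {p_conscious(year):.2f}")
```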

[03:07:58] Cool.

[03:08:00] Yeah, I think that was first suggested to me as an idea by Kyle Fish.

[03:08:04] Nice.

[03:08:05] And, yeah, I hope someone there or elsewhere does that.

[03:08:07] Cool.

[03:08:08] Yeah.

[03:08:09] Yeah.

[03:08:10] Those do sound extremely cool.

[03:08:12] I guess for people excited by those, excited by the field, who maybe hear themselves a bit in your description of who you’re looking for: what advice do you have for people interested in contributing?

[03:08:25] I guess, like, entering new fields can be particularly hard because there’s less mentorship and there’s less, like, yeah, there are fewer entry-level roles.

[03:08:36] You just, like, really have to be able to be pretty self-starter-y.

[03:08:40] So how do you get involved?

[03:08:42] Yeah, on, like, the self-starting and mentorship question, I am heartbroken by how many talented people we just have to tell: we don’t, like, have time.

[03:08:53] Like, I would love to supervise a project by you.

[03:08:56] Fortunately, there are, like, communities sort of self-forming around this.

[03:09:01] Cool.

[03:09:02] So there’s, like, an AI Welfare Discord.

[03:09:06] There’s, like, LLM psychologists on Twitter who talk to models all day and are always exploring interesting things about them.

[03:09:15] You know, there’s NYU CMEP and, like, a whole host of orgs also kind of in the space.

[03:09:21] I don’t know about a whole host.

[03:09:22] Like, it’s still a pretty small space.

[03:09:24] But, you know, several.

[03:09:28] So it never hurts to reach out to people.

[03:09:31] I guess that’s, in some sense, a cold take.

[03:09:33] But, like, I don’t think it’s ever been bad for anyone’s career to be reminded that, like, if I don’t have time to take your call, that’s neutral.

[03:09:43] It’s not negative.

[03:09:44] Like, so just, like, just ask.

[03:09:47] Yeah.

[03:09:48] Some people will probably, like, read voraciously and really know a lot about the content of the field.

[03:09:56] How can they go from knowing a lot, being really interested, to, like, being properly useful?

[03:10:04] Yeah.

[03:10:05] So one category of usefulness I haven’t mentioned yet is writing.

[03:10:09] Like, I think, I mean, it’s easier said than done, but just writing clearly about this stuff is really valuable.

[03:10:17] It’s good for career capital to have a blog where you read papers and just explain what happens in the paper.

[03:10:24] Nice.

[03:10:25] It’s also just good, yeah, for the ecosystem.

[03:10:30] And it also shows employers that you’re smart and can communicate clearly and can, like, get things done, which, you know, is, like, 95% of what it takes.

[03:10:43] Mm-hmm.

[03:10:44] Mm-hmm.

[03:10:45] So, here’s, like, a very niche recommendation.

[03:10:49] I can have trouble being a self-starter, but I have published on my Substack twice a month for, like, well over a year.

[03:10:56] And I did that by making a Manifold market, like, a betting market, about whether I would do that.

[03:11:01] So, yeah, use commitment devices.

[03:11:04] Accountability.

[03:11:05] Yeah, accountability.

[03:11:06] Yeah.

[03:11:07] Things like that.

[03:11:08] Nice.

[03:11:09] Yeah.

[03:11:10] Is there anyone worth, like, citing as a kind of example?

[03:11:17] Kyle Fish could be one, given that he worked on vaccines before.

[03:11:22] What exactly does it look like to go from not working on this to, like, you know, Kyle Fish is, like, properly working on this?

[03:11:30] And, yeah, it wasn’t that long of a time from not working on it at all to working on it at Anthropic.

[03:11:36] I don’t know if you, yeah, if it’s worth either describing his trajectory or describing, like, a theoretical trajectory or someone else’s trajectory.

[03:11:45] Yeah.

[03:11:46] I guess I’ll just extract a couple of lessons from the Kyle trajectory.

[03:11:50] Mm-hmm.

[03:11:51] Perfect.

[03:11:52] Which, like, I just mentioned.

[03:11:53] One is, like, reach out to people.

[03:11:54] So he just started reaching out to people.

[03:11:56] Okay.

[03:11:57] Nice.

[03:11:58] The other one is that if you read and think about this stuff, you can, you know, find your way through most of the literature pretty quickly.

[03:12:02] Mm-hmm.

[03:12:03] And I guess the other one is, like, you can just do things.

[03:12:04] That’s, ironically, easier said than done.

[03:12:05] Like, sometimes you can’t just do things.

[03:12:06] Or, like, you need the right environment and structure to do that.

[03:12:07] But, yeah, I think it’s just a very good illustration of some general principles of how to jump into a weird new field.

[03:12:30] Another thing is, like, you might – like, that was sort of jumping all the way in.

[03:12:36] You might be in, like, one corner of the field but want to move to another, so you might be doing a PhD in philosophy but you want to do more applied stuff.

[03:12:43] I think it can maybe sometimes be harder for those people to move within the field because, like, you’re more attached to your particular corner.

[03:12:53] I think I have seen this with philosophers, or other grad students.

[03:12:58] This definitely happened with me.

[03:12:59] Like, you weren’t…

Like, you’ve gotten really good at, like, one particular kind of work output, and you just figure maybe that’s the only kind of work output you can do. And it’s, like, shitty to have to become bad at a different kind. Like, you have to really be like: oh, not that good at this new thing yet, after being, maybe...

Enjoying it? Yeah, I think I just, like, enjoy being good at my work. It sucks to be shitty at it for a little while.

Yeah, absolutely. And, um, I guess this is just a side point about grad school:

You’re just surrounded by an environment where that’s literally the only currency that could possibly matter. Um, so go to conferences, like, make other friends. I do want to say that’s a good thing about grad school: like, you need communities of people all doing the same sort of thing to, like, achieve excellence. But, yeah, if you’re looking to, like, move around, that brings us back to: just reach out to people.

Uh, email people. I feel like at least my corner of the AI welfare space, I can genuinely say, is just extremely, like, nice. So, like, you can do stuff that feels kind of dumb, or ask questions that might sound kind of dumb to you. (a) They’re probably not, and (b), like, it’s a new field, it’s hard. So

just go for it.

Nice.

Yeah. And actually, it occurs to me that there are three other great examples of trajectories that are, like, very close at hand: it’s the three people I’m working with right now. Um, so, like, maybe I’ll briefly say how they got to Eleos. So Rosie Campbell had worked on AI policy and evals, so she had, like, you know, part of the equation for what we need, but didn’t know much about AI welfare. She, like, wrote some about it, she talked to me about it, um, she was enthusiastic about it, and, like, there you go.

Um, cool.

Patrick Butlin has a similar profile to me, in that he was doing academic philosophy. Um, he is just, like, generally very fearless and diligent about just, like, reading a lot until he gets it. Um, like, I think that’s a big part of the consciousness and AI project we did: Patrick Butlin will just, like, read the papers until he knows what to say. Um, so that’s a very high-conscientiousness route to it, maybe,

and high-rigor. But it’s a great way to be, and it’s, you know, a big part of why I work with them.

Um, and Larissa Schiavo has worked with us on communications and events. And this is also just a great illustration of, like, you might think of ways to contribute that have not been mentioned so far. Um, Larissa got this shirt made and delivered. I did not ask her to do that. It’s awesome. Thank you, Larissa. She also made these stickers. I have really been waiting for an excuse to show one of the Eleos stickers, so one second here.

Great.

These have been surprisingly good for, like, morale and field building.

Nice. "What is it like to be a..." Do you want to explain the joke?

Yes. So the philosopher Thomas Nagel has a paper about consciousness called "What Is It Like to Be a Bat?"

[03:16:27] That touches on many of the themes of this interview.

[03:14:31] There could be, like, different forms of consciousness that can be hard to know about from our human vantage point.

[03:16:37] Such a good sticker.

[03:16:39] So, yeah, that’s a way of contributing to the field.

[03:16:41] I mean, I’m not saying everyone should now make even more stickers.

[03:16:45] I mean, that said, like, maybe so.

[03:16:49] So that’s on, like, the self-starting angle.

[03:16:51] Like, one day, like, there were just stickers in the office.

[03:16:56] And, yeah, you can contribute by, like, being very creative and thinking of ways of communicating about this stuff.

[03:17:06] I also want to say Kathleen Finlinson.

[03:17:07] So now we’ve got, like, the whole squad, the whole roster.

[03:17:13] I don’t know exactly how this background translated, but Kathleen had been in a Zen monastery for quite a while before she had, like, reentered the world of, like, AI forecasting and AI strategy.

[03:17:25] So, yeah.

[03:17:26] She had this combined CV of, like, Open Philanthropy-style AI forecasting and Zen Buddhism.

[03:17:35] That’s a cool combo.

[03:17:39] But I also think more than anything else, like, yeah, just doing it.

[03:17:43] Like, she was willing to just, like, get in the game.

[03:17:49] I’m getting a little emotional.

[03:17:52] But, yeah, you know.

[03:17:54] Like, carried the game.

[03:17:56] Carried the org over the finish line.

[03:17:57] That was great.

[03:18:00] Yeah.

[03:18:01] It sounds like a bunch of legends.

[03:18:05] So.

[03:18:05] Yeah.

[03:18:06] Yeah.

[03:18:06] It sounds like caring a bunch about the topic really can get you a chunk of the way there.

[03:18:15] In general, what are some ways that you think things could go badly in, yeah, in this field?

[03:18:24] Yeah.

[03:18:24] I think one thing I worry a lot about, and I think this actually does pair well with what I was just saying: it is really important to be rigorous and communicate responsibly about this.

[03:18:35] There is a kind of person who gets really passionate about this and maybe needs to talk to more people about it or, like, write down their thoughts more because it’s just really easy to get, like, really confused.

[03:18:49] Like, really fast.

[03:18:52] And, yeah, there’s something about this.

[03:18:55] There’s a lot of things about this topic that can, like, induce or select for, like, various ways of just getting a little bit off kilter.

[03:19:06] It’s tough because you do want to be off kilter, like, but, you know, like, not too much.

[03:19:12] So I do worry about scenarios where the field becomes associated with, yeah, like, wild speculation, or too associated with psychedelics, or too associated with, like, something that’s relevant but is also a bit of a distraction.

[03:19:37] I should also say a bit of it is also, like, a divide and conquer thing.

[03:19:41] Like, Eleos really is trying to exist in that, like, real buttoned-down kind of place.

[03:19:47] There is also, like, I have a lot of love for people who also kind of get weird with it.

[03:19:52] But, like, you want to be able to communicate it well, and, you know, make sure that people do know that this is, like, a serious topic that we can and should reason about rigorously.

[03:20:03] So I think, yeah, I think, like, epistemic hygiene is something I worry a lot about.

[03:20:10] Yeah, it’s just really hard to, I mean, it’s just really hard to get this issue right.

[03:20:15] And the future is going to get more confusing and more emotional.

[03:20:22] I think I first started saying this with Kathleen, but it’s continued as an Eleos org goal: a lot of what we want to do is, like, stay sane in the next 10 years.

[03:20:35] Like, there will be a lot of alpha in not losing your grip.

[03:20:41] I think that’s, like, a whole other episode where I don’t actually even know what the right advice is.

[03:20:47] But, like, you probably would have good things to say about that.

[03:20:51] Yeah.

[03:20:51] Yeah.

[03:20:53] Okay.

[03:20:53] What do you think it looks like for this field to go well?

[03:20:59] Yeah, I think if this field goes well, this becomes just part of the general playbook and set of issues that are on the table. Like, if people keep trying to build a new form of intelligence, it should be on the table:

[03:21:19] How do they matter?

[03:21:21] And, like, what part do they play as moral patients?

[03:21:25] I mean, it’s often just shocking to me that that barely ever comes up, you know; like, we worry about over-attribution and people getting confused about welfare.

[03:21:38] But if you look at, like, the broader, like, trajectory on the whole, at least right now, mostly it’s people just, like, not putting it on the table at all.

[03:21:53] And, yeah, again, that’s something where the factory farming analogy is very illustrative: like, people aren’t great at structuring society in an inclusive way.

[03:22:06] So I think there needs to be a combination of, like, rigor to get this taken seriously, like good communication.

[03:22:15] Also, all sorts of, like, innovation around law and policy and stuff that probably won’t even have that much to do with moral patienthood.

[03:22:25] And, yeah, to, like, get this, like, properly handled, I feel like there’s just so many ways things could go off the rails.

[03:22:34] We first want to just make sure a lot of people are taking it extremely seriously and, like, we’re doing our homework as we go into transformative AI.

[03:22:43] OK, nice.

[03:22:44] We have been talking for many hours.

[03:22:47] We have time for just one more question.

[03:22:50] The last time I interviewed you, we ended up talking for something like an extra hour about strategies that have helped, I guess, you, and I think me at the time, enjoy and be more kind of productive in independent research.

[03:23:05] I still recommend that little mini episode; we kind of ended up separating it out and making it into its own little after-hours episode.

[03:23:15] But since then, I know you’re just kind of like a legend of self-improvement.

[03:23:21] So what is your kind of top lesson for doing independent research of the last two years?

[03:23:29] I think for me, like, the biggest lesson has been don’t do it.

[03:23:34] So, and I think this could, you know, I hope this does help listeners.

[03:23:40] Like, one reason I do need all these self-improvement tools and things is, I think in many ways I’m, like, temperamentally very badly disposed to be by myself, all alone in a room, carrying out a project.

[03:23:58] Now, if you have to do that, I encourage you.

[03:24:01] I’m cheering you on and like, listen to that other episode.

[03:24:04] But nowadays, I have written on the whiteboard next to my desk: Do not write alone.

[03:24:10] Like, I think there is a good meta-principle here, which is: if you’re needing tons and tons of, like, little tricks and psychological cartwheels, don’t stop doing them; a lot of life is just muddling through with a bunch of little fixes. But it could be that there’s some bigger structural thing you could do to just completely route around them.

[03:24:37] So for me, that’s co-authoring.

[03:24:39] I could, like (and do, like), with great effort, learn a new scheduling tool and optimize my accountability systems.

[03:24:52] And again, you should do that.

[03:24:55] But also I can like work with people who are just really good at that.

[03:25:02] I think, yeah, like this can be a trap of self-improvement.

[03:25:06] Like you might to some extent need to grieve that there are some things you might not be that great at, at least not without a ton of work.

[03:25:15] And then like free yourself of necessarily having to be.

[03:25:21] Yeah.

[03:25:22] And, like, pair up with someone who can do it.

[03:25:26] Totally.

[03:25:27] Yep.

[03:25:27] Yep.

[03:25:27] I think that’s great advice.

[03:25:29] Um, we have to leave that there.

[03:25:32] My guest today has been Robert Long.

[03:25:33] Thank you so much for coming on.

[03:25:36] Thank you so much for having me.

[03:25:37] This has been fantastic.