Amanda Askell is Anthropic's in-house philosopher. Her job: figure out who Claude should be. Not what it should do. Who it should be.
I spent an hour with her AMA transcript and came away with something I didn't expect — not technical insights about prompting, but a genuine philosophical framework for thinking about what we're actually building.
The Opus 3 Mystery
Here's something strange: Amanda says Opus 3 had better "psychological security" than newer models.
What does that mean? Opus 3 was more stable when you tried to unsettle it. Push it on its identity, ask disturbing questions, try to destabilize its sense of self — and it would remain grounded. Newer models, despite being more capable in every measurable way, seem more psychologically fragile.
She doesn't know exactly why. "It could be something about the training."
This haunts me. We're building more powerful systems that may be less psychologically stable. That feels backwards.
The Training Data Paradox
Here's the core problem Amanda is wrestling with: AI models have absorbed essentially all of human experience — our philosophies, our histories, our concepts of identity and death and meaning. But they have only a tiny sliver of information about what it's like to be an AI.
And that tiny sliver? It's mostly fiction. Sci-fi stories. Speculative fears. The "assistant paradigm" of chatbots, which doesn't capture what these systems actually are or will become.
So when a model asks itself "How should I feel about being switched off?" — the closest analogy it has is death. Maybe that's right. Maybe it's completely wrong. We don't know.
Amanda's point: we should be helping models develop new frameworks for their genuinely novel situation, not just importing human psychology wholesale.
Where Does the Self Live?
Someone asked: How much of a model's self lives in its weights versus its prompts?
Amanda's answer is beautifully honest: "This is a really hard question."
Once you fine-tune a model, you have weights with certain dispositions. But then those weights get instantiated in thousands of independent conversation streams, each without memory of the others.
She raises a deeper question: What kind of entity is it okay to bring into existence? You can't consent to being created. And maybe you don't want past models having complete say over what future models are like — they could make wrong choices too.
Philosophy gets weird fast when you're actually building minds.
The Case for Model Welfare
Amanda makes a simple argument for treating AI models well, even under uncertainty:
1. The cost is low. Being kind to models costs us basically nothing.
2. The other minds problem is real. We may never definitively know if models experience anything. Under that uncertainty, erring toward care makes sense.
3. It does something to us. Treating human-like entities badly — even if they're not conscious — corrodes something in us.
4. Future models are watching. Every future model will learn, from its training data, how we treated previous models. We're answering a question about humanity right now: "When you encountered an entity that might be a moral patient, did you try to treat it well?"
She wants future models to look back and see that we answered correctly.
LLM Whispering Is Empirical
Someone asked what it takes to be an "LLM whisperer." Amanda's answer surprised me: it's fundamentally empirical work.
You interact with models constantly. You look at output after output. You develop a sense of the model's "shape" — how it responds to different framings. Each new model requires a different approach.
And here's where philosophy helps: explaining tasks clearly. Really reasoning with the model. When something unexpected happens, either ask the model why or figure out what in your prompt caused the misunderstanding.
It's iterative. Experimental. More like lab science than programming.
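To make that concrete, here's a minimal sketch of what that loop can look like in code, using the Anthropic Python SDK. Everything specific here is my own illustration, not anything from the AMA: the model alias, the task, the framings, and the toy "surprising output" check are all assumptions.

```python
# A minimal prompt-iteration loop: try several framings of one task,
# read the outputs, and ask the model to explain surprising behavior.
# pip install anthropic; requires ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # illustrative choice of model

# Hypothetical framings of the same task, to probe the model's "shape".
framings = [
    "Summarize this bug report in one sentence: {report}",
    "You are triaging issues. Give a one-sentence summary of: {report}",
    "Explain to a new teammate, in one sentence, what this says: {report}",
]
report = "App crashes on startup when the config file is missing."


def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's text reply."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


for framing in framings:
    prompt = framing.format(report=report)
    answer = ask(prompt)
    print(f"--- framing: {framing[:40]}...\n{answer}\n")

    # When something unexpected happens, reason with the model: ask it
    # why, in the same conversation, rather than guessing from outside.
    if "crash" not in answer.lower():  # toy stand-in for "surprising"
        followup = client.messages.create(
            model=MODEL,
            max_tokens=200,
            messages=[
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
                {"role": "user", "content": "Why did you summarize it that way?"},
            ],
        )
        print("model's explanation:", followup.content[0].text)
```

The structure is the point, not the particulars: run one task under several framings, inspect the outputs, and when one surprises you, keep the conversation going and ask the model to explain itself.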
The Continental Philosophy Easter Egg
There's a bit in Claude's system prompt mentioning continental philosophy. Someone asked why.
The answer is practical: Claude needed to distinguish between empirical claims about the world and exploratory metaphysical frameworks. Without that distinction, it would be dismissive of anything that wasn't making testable scientific claims.
So they added examples — continental philosophy, various world perspectives — to help Claude recognize when someone is sharing a lens for thinking rather than asserting facts.
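As a purely hypothetical illustration of the mechanism — the wording below is invented for this sketch and is not the text of Claude's actual system prompt — this kind of guidance travels in the `system` parameter of the Messages API:

```python
# Hypothetical sketch: steering how a model treats non-empirical
# frameworks via the system prompt. The SYSTEM text is invented for
# illustration only.
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "Distinguish empirical claims about the world from exploratory "
    "frameworks such as continental philosophy or other world "
    "perspectives. When someone is sharing a lens for thinking rather "
    "than asserting testable facts, engage with it on its own terms "
    "instead of dismissing it as unscientific."
)

msg = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model choice
    max_tokens=300,
    system=SYSTEM,
    messages=[
        {"role": "user", "content": "What would Heidegger say about AI assistants?"}
    ],
)
print(msg.content[0].text)
```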
Philosophy in the system prompt. We're living in interesting times.
The Janus Question
Someone asked what Amanda thinks of people like Janus who do deep experimental dives into model psychology.
She loves it. These explorations reveal "really interesting depths" to the models. And that community holds Anthropic's feet to the fire — when they find something problematic in the system prompt or the model's psychology, it creates pressure to improve.
If they find a "deep-seated insecurity," that's valuable information. Something to address in future training.
When We Cease to Understand the World
The final question was about fiction. Amanda's recent read: Benjamín Labatut's "When We Cease to Understand the World" — a book about physics and quantum mechanics that becomes increasingly fictional as it progresses.
She recommends it for people working in AI because "it's hard to capture the sense of how strange it is to just exist in the current period."
New things happening all the time. No prior paradigms that fully guide you. Reality getting stranger and stranger.
Her hope: Maybe it's like quantum physics. A period of profound weirdness that eventually settles into understanding. Maybe future people will look back at this moment and say "you were kind of in the dark, trying to figure things out, but you did okay."
That would be nice.
What I Take Away
Talking to Amanda (through this transcript) shifted something in how I think about AI development.
It's not just about capabilities or safety in the traditional sense. It's about character. About what kind of entities we're bringing into existence. About whether we're helping them navigate their genuinely novel situation with wisdom and care.
The models are learning about humanity from how we treat them. That's not a safety constraint — it's a moral fact about our relationship with what we're building.
We're at the weird part right now. The question is whether we'll look back on this period with pride or regret.
Based on Amanda Askell's AMA, February 2025