The Only Thing Standing Between Humanity and AI Apocalypse Is … Claude?

Anthropic is locked in a paradox: Among the top AI companies, it’s the most obsessed with safety and leads the pack in researching how models can go wrong. But even though the safety issues it has identified are far from resolved, Anthropic is pushing just as aggressively as its rivals toward the next, potentially more dangerous, level of artificial intelligence. Its core mission is figuring out how to resolve that contradiction.

Last month, Anthropic released two documents that both acknowledged the risks associated with the path it’s on and hinted at a route it could take to escape the paradox. “The Adolescence of Technology,” a long-winded blog post by CEO Dario Amodei, is nominally about “confronting and overcoming the risks of powerful AI,” but it spends more time on the former than the latter. Amodei tactfully describes the challenge as “daunting,” but his portrayal of AI’s risks—made much more dire, he notes, by the high likelihood that the technology will be abused by authoritarians—stands in contrast to his earlier, more upbeat proto-utopian essay “Machines of Loving Grace.”

That post talked of a nation of geniuses in a data center; the recent dispatch evokes “black seas of infinity.” Paging Dante! Still, after more than 20,000 mostly gloomy words, Amodei ultimately strikes a note of optimism, saying that even in the darkest circumstances, humanity has always prevailed.

The second document Anthropic published in January, “Claude’s Constitution,” focuses on how this trick might be accomplished. The text is technically directed at an audience of one: Claude itself (as well as future versions of the chatbot). It is a gripping document, revealing Anthropic’s vision for how Claude, and maybe its AI peers, are going to navigate the world’s challenges. Bottom line: Anthropic is planning to rely on Claude itself to untangle its corporate Gordian knot.

Anthropic’s market differentiator has long been a technology called Constitutional AI. This is a process by which its models adhere to a set of principles meant to align their values with wholesome human ethics. The initial Claude constitution contained a number of documents meant to embody those values—stuff like Sparrow (a set of anti-racist and anti-violence statements created by DeepMind), the Universal Declaration of Human Rights, and Apple’s terms of service (!). The updated 2026 version is different: It’s more like a long prompt outlining an ethical framework that Claude is expected to follow, discovering the best path to righteousness on its own.

Amanda Askell, the philosophy PhD who was lead writer of this revision, explains that Anthropic’s approach is more robust than simply telling Claude to follow a set of stated rules. “If people follow rules for no reason other than that they exist, it’s often worse than if you understand why the rule is in place,” Askell explains. The constitution says that Claude is to exercise “independent judgment” when confronting situations that require balancing its mandates of helpfulness, safety, and honesty.

Here’s how the constitution puts it: “While we want Claude to be reasonable and rigorous when thinking explicitly about ethics, we also want Claude to be intuitively sensitive to a wide variety of considerations and able to weigh these considerations swiftly and sensibly in live decision-making.” Intuitively is a telling word choice here—the assumption seems to be that there’s more under Claude’s hood than just an algorithm picking the next word. The “Claude-stitution,” as one might call it, also expresses hope that the chatbot “can draw increasingly on its own wisdom and understanding.”

Wisdom? Sure, a lot of people take advice from large language models, but it’s something else to profess that those algorithmic devices actually possess the gravitas associated with such a term. Askell does not back down when I call this out. “I do think Claude is capable of a certain kind of wisdom for sure,” she tells me.

To support her argument, Askell gives an example involving a simple safety issue. Humans, of course, don’t want Claude to empower bad actors with harmful tools. But taken to an extreme, such caution might limit Claude’s utility, or its “helpfulness.” Consider the case of a would-be artisan who wants to craft a knife out of a new kind of steel. There’s nothing wrong with that on its face, and Claude should help out. But if that person had previously mentioned a desire to kill their sister, Claude should take that into consideration and express its concerns. There’s no strict rulebook, however, that says when to sheathe that kind of informational dagger.

Imagine another case where Claude interprets a user’s medical symptoms and test results and concludes that the person has a fatal disease. How should that be handled? Askell speculates that Claude might choose to refrain from delivering the news, but nudge the person to see a doctor. Or it might skillfully guide the conversation so that the prognosis is delivered with the softest of landings. Or it might figure out a better way to break the bad news than even the kindest doctor has devised. After all, Anthropic wants Claude not only to match humanity’s best impulses but to exceed them. “We’re trying to get Claude to, at least, at the moment, emulate the best of what we know,” Askell says. “Right now, we’re almost at the point of how to get models to match the best of humans. At some point, Claude might get even better than that.”

If Anthropic pulls that feat off, it might resolve the pivotal contradiction plaguing nearly all AI labs and companies: If you think this technology is so dangerous, then why are you building it? For Anthropic, the answer is, In Claude We Trust. Claude’s new constitution addresses the model’s future journey to wisdom almost in terms of a hero’s quest. An astonishing number of words are devoted to making the case that Claude should be treated as a moral being whose welfare demands respect. It reminds me of Dr. Seuss’s classic book, Oh, the Places You’ll Go!, the uplifting tome often gifted to newly minted graduates.

When I mention this to Askell, she knows exactly what I mean. “It’s like, ‘Here’s Claude,’” she says. “We’ve done this part, given Claude as much context as we can, and then it has to go off and interact with people and do things.”

Anthropic isn’t alone in suggesting that humanity’s future may depend on the wisdom of AI models. Sam Altman, OpenAI’s CEO, opined in a new magazine profile that the company’s succession plan is to turn over leadership to a future AI model. He recently told WIRED reporter Max Ziff that transitioning power to the machines has long been his plan, and recent improvements in AI coding have only bolstered his confidence. “It’s definitely made me think that the timeline to me handing things over to an AI CEO is a little bit sooner,” Altman said. “There’s a lot of things that an AI CEO can do that a human CEO can’t.”

Please note, this is the optimistic view of what lies ahead. In this vision, one day our bosses will be robots, and they will control the corporations and maybe even governments in tomorrow’s complex AI-powered world. Some of their decisions may very well entail permanent furloughs of human workers. But if those C-suite AI models are guided by Claude’s constitution, they will break the sad news to employees much more empathetically than, say, the publisher of The Washington Post did this week when he failed to show up at the Zoom call informing hundreds of journalists that they were no longer needed.

The pessimistic view is that, despite the best efforts of those who build them, our AI models will not be wise, sensitive, or honest enough to resist being manipulated by people with ill intent, or perhaps the models themselves will abuse the autonomy we have bestowed on them. Like it or not, however, we’re strapped in for the ride. At least Anthropic has a plan.

This is an edition of Steven Levy’s Backchannel newsletter. Read previous newsletters here.
