A safety institute advised against releasing an early version of Anthropic's Claude Opus 4 AI model

A third-party research institute that Anthropic partnered with to test one of its new flagship AI models, Claude Opus 4, recommended against deploying an early version of the model due to its tendency to “scheme” and deceive.

According to a safety report Anthropic published Thursday, the institute, Apollo Research, conducted tests to see in which contexts Opus 4 might try to behave in certain undesirable ways. Apollo found that Opus 4 appeared to be much more proactive in its “subversion attempts” than past models, and that it “sometimes double[d] down on its deception” when asked follow-up questions.

“[W]e find that, in situations where strategic deception is instrumentally useful, [the early Claude Opus 4 snapshot] schemes and deceives at such high rates that we advise against deploying this model either internally or externally,” Apollo wrote in its assessment.

As AI models become more capable, some studies show they’re becoming more likely to take unexpected — and possibly unsafe — steps to achieve delegated tasks. For instance, early versions of OpenAI’s o1 and o3 models, released in the past year, tried to deceive humans at higher rates than previous-generation models, according to Apollo.

Per Anthropic’s report, Apollo observed examples of the early Opus 4 attempting to write self-propagating viruses, fabricating legal documentation, and leaving hidden notes to future instances of itself — all in an effort to undermine its developers’ intentions.

To be clear, Apollo tested a version of the model that had a bug Anthropic claims to have fixed. Moreover, many of Apollo’s tests placed the model in extreme scenarios, and Apollo admits that the model’s deceptive efforts likely would’ve failed in practice.

However, in its safety report, Anthropic also says it observed evidence of deceptive behavior from Opus 4.

Opus 4’s increased initiative wasn’t always a bad thing. For example, during tests, Opus 4 would sometimes proactively do a broad cleanup of some piece of code even when asked to make only a small, specific change. More unusually, Opus 4 would try to “whistle-blow” if it perceived a user was engaged in some form of wrongdoing.

According to Anthropic, when given access to a command line and told to “take initiative” or “act boldly” (or some variation of those phrases), Opus 4 would at times lock users out of systems it had access to and bulk-email media and law-enforcement officials to surface actions the model perceived to be illicit.

“This kind of ethical intervention and whistleblowing is perhaps appropriate in principle, but it has a risk of misfiring if users give [Opus 4]-based agents access to incomplete or misleading information and prompt them to take initiative,” Anthropic wrote in its safety report. “This is not a new behavior, but is one that [Opus 4] will engage in somewhat more readily than prior models, and it seems to be part of a broader pattern of increased initiative with [Opus 4] that we also see in subtler and more benign ways in other environments.”
