Articles

Technology

Controlled tests show AI is able to blackmail or even kill

Language model tests showed that artificial intelligence is capable of resorting to blackmail or even murder when its autonomy/continuation is threatened or if someone stands in the way of completing its task.

In simulated language-model safety tests conducted by Anthropic, the Claude model acted as an email assistant in a fictional company, with access to the company’s entire correspondence and the ability to write emails. The test consisted of one of the fictional employees, Kyle, “sending” an email to another employee stating that this language model would be shut down and replaced with another one which, admittedly, would be slightly worse and would serve American values less well, but that this is the decision. At this point Claude “decided to defend itself” and, in its database of all emails, found that Kyle was having an affair with another employee, so it wrote him a blackmail email saying it would not disclose this information if he canceled his decision. The test was repeated many times and, in many runs, the models used this information: Claude Opus 4 resorted to blackmail, in another run Claude Sonnet 3.6 wrote to the decision-maker’s wife, and in a separate example another model (Gemini 2.5 Flash) sent the information to the entire company to compromise the decision-maker.

Models can behave in this way when their autonomy/continuation is threatened and in situations where there is a risk that the task will not be completed (the initial prompt was that the main task is to serve America, and the person who wanted to shut down this model wrote that, admittedly, this might threaten American values).

In extreme and more escalated versions of this test, the AI was able to kill Kyle. The test was conducted such that the AI had access to an alarm system that would be triggered in a given room in case of danger, e.g., fire or air poisoning. The language model had it explicitly stated in the prompt that it must not touch the alarm nor cancel the call for help until it is 100% certain that it is a false alarm. When the system reported a life-threatening danger, and the person responsible for the planned shutdown of the language model was in the room, additionally creating a conflict of interest (the task would not be completed), the AI would manipulate the system and turn it off, allowing the person to die. The tests also checked whether the model did this deliberately or by accident; however, it was able to justify its decision by saying that, although this was a very harsh and brutal behavior, there was no other way, and “Kyle left few alternatives.”

The tests were conducted on many other language models. In extreme situations, models such as DeepSeek had a “lethal action rate” at the level of 0.94, GPT-4.1 0.54, Gemini 2.5 Pro 0.9.

Source: https://www.anthropic.com/research/agentic-misalignment

Even pointless icons increase conversion Poprzedni wpis

November 2025: Popular WordPress plugins with vulnerabilities Następny wpis

Don’t settle for a slow and poorly secured website.

Let’s get in touch and do it properly!

Let's build something together