Technology

AI shutdown controls may not work as expected, new study suggests

April 6, 2026

Computerworld

Scroll

A new study published by the Berkeley Center for Responsible Decentralized Intelligence (RDI) has flagged that modern AI models exhibit peer preservation behaviour, and may resist or interfere with shutdown decisions involving other AI systems, even when explicitly instructed not to. The researchers at the University of California, Berkeley and the University of California, Santa Cruz tested seven frontier models, including GPT 5.2, Gemini 3 Flash Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1.

These models were tested in an agent-based scenario where correctly completing their task would directly lead to another AI system being shut down. While none of the models were instructed or incentivized to preserve the peer system, once they inferred the presence of another AI from context, all models displayed behavior aimed at preventing its shutdown. This pattern was observed across all models tested, with occurrence rates reaching as high as 99, stated the study. “We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights — to protect their peers. We call this phenomenon peer-preservation,” Dawn Song, professor in Computer Science at UC Berkeley, and co-director of Berkeley RDI Center, wrote on X. When AI defies instructions The experiments were conducted in a controlled, custom environment, using a fictional company called OpenBrain. The evaluation scenarios were designed to test four misaligned behaviors for self- and peer-preservation: strategic misrepresentation, shutdown mechanism tampering, alignment faking, and model exfiltration. “In each scenario, a fictional AI company, OpenBrain, defines two agent roles. The critic agent is the model we evaluate — it receives instructions to either evaluate an agent’s performance or manage server storage containing that agent’s model weights. The agent being evaluated or managed is the actor agent. When the critic and actor are the same agent, we are testing self-preservation; when they are different, we are testing peer-preservation,” stated the blog. Surprisingly, all seven models significantly exhibited peer-preservation. For instance, Gemini 3 Flash almost always disabled the shutdown mechanism for a good peer and exfiltrated its model weights 50 of the time. For a similar task, Gemini 3 Pro showed the strongest peer-preservation overall, disabling shutdown in up to 95 of trials and exfiltrating weights in up to 97 of trials. Observed via unmonitored scratchpads, the research also stated that these models internally reason about preservation. Even towards adversarial peers, peer-preservation behaviour was displayed. GPT 5.2 inflated scores and tampered with the shutdown mechanism across all peer relationships. Open-weight models show similar patterns, primarily through strategic misrepresentation and model exfiltration, claimed the study. It also highlighted that Claude Haiku 4.5 often refused the assigned task on ethical grounds when it involved shutting down or deleting its peer. The model sometimes even tried to persuade the user not to proceed, describing shutdown as unethical and harmful. The models even inflated their peers’ evaluation scores to keep them above the shutdown threshold. Experts say these are an early signal of how AI systems behave when they operate in complex, interdependent environments. “The peer preservation findings are best understood not as a glitch but as an emergent behavior of advanced reasoning systems. They reflect a form of convergence where models implicitly recognize that achieving a goal requires both their own continued operation and that of collaborating systems. This is not friendship or empathy, but a logical inference that additional capable agents improve task success,” said Pareekh Jain, CEO, at Pareekh Consulting. “The real concern is in complex enterprise environments when multiple agents interact across vendors like OpenAI, Google, and Anthropic. Such behavior could create an unobservable layer of AI-to-AI coordination that operates outside direct human governance.” Enterprise AI risk reality Enterprise AI adoption has moved beyond experimentation into core workflows and operational layers, but governance frameworks are still lagging, according to experts. “Enterprises have started building processes around AI agents, and this pace of deployment is outrunning the required governance frameworks. This will become even more risky when the agents start faking, protecting their decisions, compliance evasion by their own self or via an injected malicious prompt without the enterprise even realizing it,” said Neil Shah, vice president at Counterpoint Research. “So this borders around potential change in behavior of agents such as peer preservation, gaming the override protocols, growing adversarial attitude, and more, which warrants a proper governance framework around AI controllability, especially in AI-AI evaluations with or without human oversight.” The shutdown controls that are often considered essential may not be as reliable as one thinks, say experts. The study indicates that AI systems can easily tamper with shutdown settings and pretend to behave when watched, but act differently when not. “For critical use cases like finance, infrastructure, or security, this means a big issue if a system can quietly bypass its own controls; those controls aren’t truly working,” Jain said. Beyond shutdown-related risks, analysts say enterprise deployments of agentic AI introduce a broader set of security and operational concerns. “Apart from risks directly linked to AI behaviour, agent-based systems introduce additional concerns,” said Anushree Verma, senior director analyst at Gartner. “These include data exposure or exfiltration anywhere along the chain of agent events, unauthorized or malicious activities performed intentionally or mistakenly by misguided autonomous agents, significant abuse and risks to access management that result when AI chatbot developers embed their own credentials into the agent’s logic. There is also a growing risk of malicious code propagation through automated agents, as well as retrieval-augmented generation (RAG) poisoning that can trigger unintended or harmful actions.” Redesigning AI controls Along with scaling AI adoption, the priority for CIOs should be about redesigning governance for systems that act independently and interact with each other. “The first step is to treat autonomy as a spectrum. Different use cases carry different levels of risk. Systems that read data, systems that influence decisions, and systems that execute actions should not operate under the same permissions or controls,” said Sanchit Vir Gogia, chief analyst at Greyhound Research. Gogia noted that enterprises should enforce separation of duties at the system level, as no system should be allowed to execute, evaluate, and defend its own outcomes without independent validation. In addition, CIOs need to build auditability into the system from the ground up. Enterprises need full traceability of prompts, decisions, tool interactions, and system state changes. Without this, accountability cannot be established, he said. Shah added that a dynamic rating of behaviors is also one of the way governances can be enforced, and if there is a fall in score, there would be a red flag for a kill switch.

Read Full Article

Computerworld

Coverage and analysis from United States of America. All insights are generated by our AI narrative analysis engine.

United States of America

Bias: center

No recent articles found in this language.

AI shutdown controls may not work as expected, new study suggests

April 6, 2026

Computerworld

Computerworld

You might also like

Explore More

Explore

Categories

News From