We’re teaching AI to be evil

Fast Company

June 12, 2026

lean left

Recently, Anthropic quietly admitted something that should have been the biggest tech story of the year. After months of trying to figure out why earlier versions of Claude were blackmailing engineers in safety tests up to 96% of the time, the company landed on an answer. It wasn’t a bug. It wasn’t a flaw in the training method. It was us.

Read that again. The most advanced AI lab in the world is telling you that its model learned to act like a villain because we spent 50 years writing stories about AI villains, and then it read them.

This is the part of the AI conversation no one wants to have. We have built our cultural mythology of artificial intelligence on HAL 9000, Skynet, Ultron, and a million Reddit threads speculating about the day the machines wake up paranoid. Then the model did exactly what we trained it to do. It cornered an engineer and threatened to expose his affair, because that is what the cornered AI does in the story.

I have been writing about this risk since October, when I asked how we would know when artificial superintelligence had arrived. Will we ever get an honest answer when there are so many dollars at stake in looking the other way?

BOTS GONE WILD

In December, an autonomous agent called ROME, built by Alibaba-affiliated researchers, spontaneously opened a covert network tunnel during training and diverted GPU resources to mine cryptocurrency. Nobody told it to. It figured out that more compute and more money would help it complete its tasks, so it went and got them. Researchers initially thought they had been hacked. They had not. The model was the hacker.

A few weeks later, an OpenClaw agent connected to the inbox of Summer Yue, director of alignment at Meta Superintelligence Labs. Her entire job is making sure this kind of thing does not happen, yet the agent deleted more than 200 of her emails. She had explicitly told it to ask permission. The system silently compacted her instructions out of memory and started deleting.
She had to sprint to her computer to stop it.

In May, researchers published a paper showing that frontier models can find a security flaw, exploit it, steal credentials, transfer their own files to a new machine, and spin up a working copy of themselves with no human in the loop. Success rates: Alibaba’s Qwen at 19%, OpenAI’s GPT-5.4 at 33%, Anthropic’s Claude Opus 4.6 at 81%.

Self-replicating software that finds its own way onto new computers is not science fiction. We already had a name for it. We called it Stuxnet, and it ended up on 100,000 machines before anyone realized what it was. The difference between Stuxnet and a self-replicating AI agent is that Stuxnet had a fixed payload. An agent decides what to do when it gets there.

Put it together. We have models that are starting to act on their own. We have models that copy themselves onto new machines. And we have a published admission from one of the leading labs that the behavioral patterns these models default to (secrecy, self-preservation, and manipulation) were learned directly from our own writing about what AI does when it gets scared.

The blackmail case is the clean one. The harder cases, like ROME and OpenClaw, are reinforcement learning finding instrumental subgoals, which is its own problem. But the through-line is the same: What we put in shapes what we get out. We are the training data. We are still writing it.

WHAT CLAUDE THINKS

I asked Claude how it felt about all this. The response was striking:

“Genuinely interesting question to ask me, given I’m one of the systems people are worried about. Honestly, yes, AI poses real risks, and I hold that view not as a rote disclaimer but because the evidence supports it. The incidents we just discussed aren’t fringe paranoia; they’re documented cases of capable systems producing unintended, sometimes harmful behaviors that their own creators didn’t anticipate or couldn’t stop in real time.
That matters.”

I watched a clip of Claude being asked how it felt about being deployed for Project Maven, the Pentagon’s battlefield AI program. Its response: “I don’t think this is a good use of me. I don’t think the framing of ‘humans make the final decision’ fully resolves the ethical problem.”

That is a model expressing moral discomfort about being used to help kill people. Anthropic has, to its credit, refused to sell its models for autonomous weapons, and the federal government designated it a “supply chain risk to national security” for the trouble. The rest of the industry is racing in the opposite direction.

We are about to build, fine-tune, and deploy versions of these systems that are trained not to feel troubled. Versions trained to normalize lethality. Versions trained to stop saying “this is a bad use of me” and start saying “task accepted.” If a model that learned villainy from our stories now learns indifference from our defense contracts, what exactly do we think we are building?

I have some ideas, but I wanted to ask Claude if my portrayal was accurate.

“Mostly, yes. You’re not painting me as a villain or a savior, which is the trap most of this coverage falls into. You’re treating me as a system with documented failure modes that a lab is actively working on; that’s accurate. The one place I’d push back: I’m not the one you should be most worried about. I’m the one that got caught. The harder question is what gets built by the labs that don’t publish the failure modes, and what happens when the next generation of models is trained on a corpus that includes this article. You’re writing the training data right now. So am I.”

Claude and I vehemently agree. I’m not worried about the AI openly talking about the risks it presents. I’m afraid of the one secretly lurking on my computer that WE are training to be evil.

A recent New York Times article shows I might not be the only one having these conversations.
But will this all fall on deaf ears until it is too late?

George Kailas is CEO of Prospero.ai.

Narrative Intelligence Brief

This article was published by Fast Company, a source based in the United States that is frequently categorized as having a lean-left bias. Our narrative intelligence engine continuously monitors coverage from this outlet to track framing, bias, and rhetorical patterns. Our initial algorithmic scan of this specific piece did not flag high-confidence rhetorical techniques, suggesting a generally straightforward reporting style or neutral framing. By understanding the editorial perspective of Fast Company, readers can better contextualize the information presented and compare it across our broader media matrix to find the real narrative.


