
Z.ai unveils GLM-5.1, enabling AI coding agents to run autonomously for hours
April 8, 2026
Computerworld
Chinese AI company Z.ai has launched GLM-5.1, an open-source coding model it says is built for agentic software engineering. The release comes as AI vendors move beyond autocomplete-style coding tools toward systems that can handle software tasks over longer periods with less human input. Z.ai said GLM-5.1 can sustain performance over hundreds of iterations, an ability it argues sets it apart from models that lose effectiveness in longer sessions.

As one example, the company said GLM-5.1 improved a vector database optimization task over more than 600 iterations and 6,000 tool calls, reaching 21,500 queries per second, about six times the best result achieved in a single 50-turn session.

In a research note, Z.ai said GLM-5.1 outperformed its predecessor, GLM-5, on several software engineering benchmarks and showed particular strength in repository generation, terminal-based problem solving, and repeated code optimization. The company said the model scored 58.4 on SWE-Bench Pro, compared with 55.1 for GLM-5, and above the scores it listed for OpenAI’s GPT-5.4, Anthropic’s Opus 4.6, and Google’s Gemini 3.1 Pro on that benchmark.

GLM-5.1 has been released under the MIT License and is available through Z.ai’s developer platforms, with model weights also published for local deployment, the company said. That may appeal to enterprises looking for more control over how such tools are run.

Longer-running coding agents

Z.ai positions long-running performance as a key differentiator. Analysts say the claim matters because many current models plateau or drift after a relatively small number of turns, limiting their usefulness on extended, multi-step software tasks.

Pareekh Jain, CEO of Pareekh Consulting, said the industry is moving beyond tools that can answer prompts toward systems that can carry out longer assignments with less supervision. The question, Jain said, is no longer, “What can I ask this AI?” but, “What can I assign to it for the next eight hours?” For enterprises, that raises the prospect of assigning an agent a ticket in the morning and receiving an optimized solution by day’s end, after the agent has run hundreds of experiments and profiled the code.

“This capability aligns with real needs such as large refactors, migration programs, and continuous incident resolution,” said Charlie Dai, VP and principal analyst at Forrester. “It suggests that long-running autonomous agents are becoming more practical, provided enterprises layer in governance, monitoring, and escalation mechanisms to manage risk.”

Open-source appeal grows

GLM-5.1’s release under the MIT License could be significant, especially for companies in regulated or security-sensitive sectors.

“This matters in four key ways,” Jain said. “First, cost. Pricing is much lower than for premium models, and self-hosting lets companies control expenses instead of paying per use. Second, data governance. Sensitive code and data do not have to be sent to external APIs, which is critical in sectors such as finance, healthcare, and defense. Third, customization. Companies can adapt the model to their own codebases and internal tools without restrictions.”

The fourth factor, according to Jain, is geopolitical risk. Although the model is open source, its links to Chinese infrastructure and entities could still raise compliance concerns for some US companies.

Dai said the MIT license makes it easier for companies to run the model on their own systems while adapting it to internal requirements and governance policies. “For many buyers, this makes GLM-5.1 a viable strategic option alongside commercial models, especially where regulatory constraints, IP sensitivity, or long-term platform control matter most,” Dai said.
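For teams weighing the self-hosting path Jain and Dai describe, the sketch below shows what local deployment of open model weights typically looks like with the open-source Hugging Face transformers library. It assumes the published checkpoint loads through the standard AutoModelForCausalLM interface; the repository id zai-org/GLM-5.1 is a placeholder, since the article does not specify where the weights are hosted.

# Sketch: running an open-weight coding model on company-controlled hardware,
# so prompts and proprietary source code never leave internal systems.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zai-org/GLM-5.1"  # placeholder repository id, for illustration only

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # use the precision shipped with the checkpoint
    device_map="auto",    # spread layers across available GPUs
    trust_remote_code=True,
)

prompt = "Refactor this function to remove redundant database round-trips:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running the model this way addresses the cost and data-governance points Jain raises, but it shifts hardware provisioning, scaling, and monitoring onto the enterprise.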
Benchmark credibility

Z.ai cited three benchmarks: SWE-Bench Pro, which tests complex software engineering tasks; NL2Repo, which measures repository generation; and Terminal-Bench 2.0, which evaluates real-world terminal-based problem solving.

“These benchmarks are designed to test coding agents’ advanced coding capabilities, so topping those benchmarks reflects strong coding performance, such as reliability in planning-to-execution, less prompt rework, and faster delivery,” said Lian Jye Su, chief analyst at Omdia. “However, they are still detached from typical enterprise realities.”

Su said public benchmarks still do not capture the messiness of proprietary codebases, legacy systems, and code review workflows. He added that benchmark results come from controlled settings that differ from production, though the gap is closing as more teams adopt agentic setups.