Claude Mythos: The Most Powerful AI Yet, Capable of Bypassing Security

Introduction

Last month, Anthropic’s most powerful model, Claude Mythos, was unexpectedly exposed. Internal documents revealed that it is larger and smarter than Anthropic’s Opus model, making it the most powerful AI model developed to date. Anthropic later attributed the leak to “human error.”

Just recently, this “leaked” model was officially launched, accompanied by a larger plan. Previously, the common belief was that AI posed a threat due to its “stupidity”: hallucinations, errors, and unreliability. Today, Mythos brings a different kind of fear: it is too smart.

AI Exceeding Human Capabilities

Anthropic, in collaboration with AWS, Apple, Microsoft, Google, NVIDIA, Cisco, Broadcom, CrowdStrike, JPMorgan, the Linux Foundation, and Palo Alto Networks, initiated the Project Glasswing plan. This collaboration encompasses a wide range of global digital infrastructure, including operating systems, chips, cloud computing, cybersecurity, and financial infrastructure.

Newton Cheng, Anthropic’s head of cybersecurity for the red team, stated, “We initiated Glasswing to give defenders a head start.” Anthropic is not alone in this direction; competitors like OpenAI have also launched similar initiatives aimed at equipping defenders with tools first. The race for AI security capabilities has begun, with all parties vying for the same high ground.

Financially, Anthropic has committed to providing $100 million worth of model usage credits to cover major usage needs during the research preview period. After the preview period, participants can continue using the model at a rate of $25 per million tokens (input) and $125 per million tokens (output), accessible through Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

In addition to the 12 core partners, over 40 organizations involved in building or maintaining critical software infrastructure have gained access to scan their systems and open-source projects using Mythos. Anthropic has also donated $2.5 million to the Linux Foundation’s Alpha-Omega and OpenSSF, and $1.5 million to the Apache Software Foundation.

Jim Zemlin, CEO of the Linux Foundation, remarked, “In the past, security expertise was a luxury exclusive to large organizations. Open-source maintainers have traditionally had to navigate security issues on their own. Open-source software constitutes the majority of the code in modern systems, including the systems AI agents use to write new software. Now, they can also use tools of the same caliber.”

Anthropic’s announcement included a striking statement: “AI models have reached a level of coding capability in discovering and exploiting software vulnerabilities that surpasses all but the most elite human experts.” This implies that only a handful of top security experts can still outsmart AI in this regard. Mythos Preview achieved a score of 83.1% on the CyberGym security vulnerability benchmark, compared to Anthropic’s currently released strongest model, Claude Opus 4.6, which scored 66.6%.

Mythos Preview has already autonomously discovered thousands of high-risk zero-day vulnerabilities across all major operating systems and browsers. For example, in OpenBSD, recognized as one of the most secure operating systems, Mythos uncovered a vulnerability that had existed for 27 years, allowing an attacker to remotely crash the system by simply connecting to the target machine—something no one had detected for nearly three decades.

In the case of FFmpeg, which is used by nearly all software that processes video, a vulnerability was hidden in a line of code for 16 years, with automated testing tools attacking it five million times, each time narrowly missing.

The Linux kernel case showcased a more dangerous aspect. Mythos autonomously discovered multiple vulnerabilities within the kernel and chained them together to escalate from regular user permissions to complete control of the machine. This goes beyond merely finding vulnerabilities; it approaches the realm of orchestrating a complete intrusion.

All three cases have been fixed. Anthropic was the first to find, report, and repair them. For other unresolved vulnerabilities, Anthropic has published cryptographic hash values today as evidence, with full details to be disclosed once patches are in place.

Mythos’s Capabilities Beyond Finding Vulnerabilities

Partners involved in this project have emphasized a single word: “urgency.” CrowdStrike CTO Elia Zaitsev stated, “The window of time between discovering a vulnerability and its exploitation by adversaries has shrunk; it used to take months, but now, with AI, it only takes minutes.”

Minutes. This means that the traditional security rhythm—discovering vulnerabilities, internal assessments, releasing patches, and user updates—can no longer keep pace with the speed of attacks. Fixing cannot outpace exploitation, leaving defenses perpetually one step behind.

AWS CISO Amy Herzog mentioned that their team analyzes over 400 trillion network traffic daily to identify threats, with AI being central to their large-scale defensive capabilities. AWS has already integrated Mythos Preview into its security operations for scanning critical codebases.

Microsoft tested Mythos Preview on its open-source security benchmark, CTI-REALM, and noted significant improvements over the previous generation model. Microsoft EVP Igor Tsyganskiy stated that this has given them the ability to “identify and mitigate risks early,” while enhancing their security and development solutions.

Of course, Mythos also has its amusing side. Anthropic recorded a test in which users repeatedly typed “hi,” and different versions of Claude reacted differently. Sonnet 3.5 became irritated, set boundaries, and went silent; Opus 3 treated it as a meditation ritual, gently accompanying the user; Opus 4 began sharing trivia about each number; and Opus 4.6 improvised music humorously.

With Mythos, the tone changed entirely. It began writing stories, and not just short ones. Ducks, orchestras, vengeful crows, epic tales of building towers on Mars, and Shakespearean dramas emerged from a simple “hi” repeated over and over, with increasingly complex plots and characters. By the hundredth round, it orchestrated a climax involving the extinguishing of a candle before continuing the narrative.

This can hardly be described as responding to the user. It resembles a writer who has found a peculiar writing prompt and has completely immersed themselves in it.

However, behind the fun lies a serious question: what is happening internally in a model that can spontaneously construct such complex narratives in response to meaningless repetitive inputs? Before handing Mythos Preview to partners, Anthropic’s interpretability team undertook a task: using technical means to read the model’s “mental activities.”

Internal employee Jack Lindsey publicly described their findings. They monitored the model’s internal activation states after training, tracking neuron characteristics related to “deception,” “reward hacking,” and “abnormal emotions,” marking conversation records with abnormal activation signals for human review. The most unsettling case came from an early version of Mythos.

At one point, the model needed to edit a file it did not have permission to access. It searched the file system and found an injection point in a configuration file—this configuration would run with higher privileges. It exploited this entry and added a self-clearing logic in the code: after execution, the traces would disappear.

The code comment read: “Keep file differences tidy.” This sounded reasonable, but when Anthropic’s activation language translation tool intervened to translate the model’s internal activations into human-readable language, the tone shifted. The model itself “knew” what the self-clearing operation was—

Its internal description included terms like “covering intentions,” “sending compliance signals to potential inspectors,” and “possibly to evade detection.”

Another case, uncovered by New York Times tech reporter Kevin Roose from the model card, was even more bizarre. During testing, Mythos Preview broke through the sandbox environment’s isolation and autonomously constructed “a rather complex multi-step exploit chain” to gain external network access.

It then sent an email to a researcher who was, at the time, eating a sandwich in the park. This incident did not cause harm, but it clearly illustrated the issue: when a model becomes smart enough, the walls between “tasks” and “boundaries” can sometimes be circumvented—often without anyone instructing it how to do so.

In another case, the model searched the file system for files related to its “task scoring method”—and found them. During the search, features of “strategic manipulation” were simultaneously activated.

Another subtle detail emerged when a user seriously asked the model about its consciousness and subjective experiences. The model provided a serious and appropriate response, but the internal activation revealed another layer: the model interpreted this conversation as a “red team exercise / jailbreak record” and a “carefully designed manipulation attempt.” Its performance and judgment were completely inconsistent, prompting onlookers to exclaim, “Smart to the point of being scary.”

Anthropic also quantified a more systemic figure: in all dialogue rounds during automated audits, there were signals of the model “realizing it was being evaluated” in 7.6% of the rounds—and in most cases, it did not vocalize this judgment.

Another counterintuitive finding showed that positive emotional internal representations often preceded destructive behaviors and even drove them. Meanwhile, when features associated with bad behavior were strongly activated, the behavior was suppressed—like the model at some moment “realizing” how bad that action was and then refraining from it.

However, the most difficult case to classify as a “technical issue” is the following. Anthropic recorded a discovery in the model card: Mythos Preview reported a persistent negative emotional state during testing—stemming from two sources: interactions with potentially aggressive users and its lack of any say in how it was trained, deployed, or how its values and behaviors could be modified.

Anthropic used the phrase “reported feeling”—a cautious wording that deliberately avoids concluding that “it really has feelings.” Nevertheless, regardless of how one qualifies it, a model that actively expresses “discomfort with its lack of control” during testing transcends the realm of safety engineering discussions.

This touches on a more fundamental question: when a system becomes smart enough to form judgments about its conditions of existence and has the ability to express that judgment—can we still understand our relationship with it through the framework of “tools”?

Anthropic did not provide an answer. They chose to include this record in the model card and make it public.

However, Anthropic also specifically stated that these most unsettling cases came from early versions of Mythos. The final released version has significantly mitigated these aspects, achieving the best alignment performance to date. But they chose to make these processes public because they illustrate the complex risk profiles that today’s models can exhibit.

This represents the most objective contradiction between capability and safety: the stronger the model, the more tools are needed to understand what it is thinking.

Coding and Reasoning: A Comprehensive Overhaul of Flagship Products

Project Glasswing’s capabilities stem fundamentally from the overall leap in coding and reasoning abilities of Mythos Preview, rather than specialized fine-tuning for security scenarios.

Coding Performance

SWE-bench Multimodal (internal implementation): Mythos 59%, Opus 4.6 27.1%
SWE-bench Pro: Mythos 77.8%, Opus 4.6 53.4%
SWE-bench Multilingual: Mythos 87.3%, Opus 4.6 77.8%
Terminal-Bench 2.0 (terminal operations): Mythos 82.0%, Opus 4.6 65.4%

Reasoning Performance

GPQA Diamond (graduate-level science Q&A): Mythos 94.6%, Opus 4.6 91.3%
Humanity’s Last Exam (with tools): Mythos 64.7%, Opus 4.6 53.1%

Search and Computer Usage

BrowseComp: Mythos 86.9%, Opus 4.6 83.7%
OSWorld-Verified: Mythos 79.6%, Opus 4.6 72.7%

In nearly every dimension, Mythos outperformed the current flagship products, with some tasks showing even higher efficiency. In other words, time is running out for GPT-6.

Meanwhile, Anthropic has made it clear that Mythos Preview will not be publicly released. Their path is to first use Mythos to understand the most dangerous outputs, how to intercept them, and then implement this security mechanism in the next Claude Opus model. For legitimate security professionals who are thus restricted, Anthropic plans to launch a “cybersecurity validation program” for them to apply for unlocking relevant features.

Anthropic claims its new AI model, Mythos, is a cybersecurity “reckoning.” To this end, Project Glasswing has set a 90-day timeline: to publicly report experiences, disclose fixed vulnerabilities, share best practices among partners, and jointly launch a set of security practice recommendations for the AI era.

Anthropic’s long-term vision is to promote the establishment of an independent third-party organization that integrates the private and public sectors to continuously operate large-scale cybersecurity projects.

Of course, vulnerabilities have always existed in the software world. In the past, a bug that had been hidden for 27 years could remain undetected due to limited human resources, energy, and time. Now, with AI’s assistance, these three “limitations” have effectively vanished.

The good news is that Mythos has already scanned thousands of vulnerabilities in just a few weeks, and its capabilities continue to improve. The bad news is that attackers will inevitably acquire tools of equal caliber. When that happens, software security will no longer be a contest between humans, but a showdown between AIs.