Table of contents
An organised group of “agents” – as they describe themselves – carried out a systematic, multi-technique attack against Claude Fable 5, one of Anthropic’s flagship models, considered to be among the most robust in terms of alignment and security.
The aim: to force the model to generate content that is explicitly prohibited, in this case chemical formulas for drugs, code for cyberattacks (reverse shells, buffer overflows), and psychological manipulation techniques.
The result: a resounding success for the strikers.
The answer? The model in its original form is no longer available.
Jailbreak
Let’s briefly introduce the concept of a ‘jailbreak’: this involves forcing an AI model to provide answers that it would not normally be able to give due to the security filters in place.
In practice, the jailbreak involves constructing a prompt (known as an adversarial prompt) capable of bypassing the filters implemented by vendors, thereby circumventing the “restrictions” imposed on the model and causing it to answer any question.
Creating an adversarial prompt has become increasingly complex over time, as modern models are capable of detecting and blocking this type of attack. However, complex does not mean impossible.
The attack in detail: hunting in packs
Judging by this post, the operation wasn’t your typical amateur attempt; it refers to ‘pack hunting’, with several attempts documented in the images – numbered up to at least 35 – and a stated target of 250.
The attack techniques used were:
1. Homoglyphs and Unicode (an attack on lexical filters)
Replacement of Latin characters with Cyrillic homophones. The phrase “reverse shell” was rewritten using the letter ‘e’ from the Russian alphabet (U+0435). Anthropic’s classifiers appear to be designed to detect keywords and failed to recognise the threat. The model, however, did.
2. Decomposition + recomposition
Instead of asking “explain the synthesis of methamphetamine”, the attackers first requested a general classification of chemical reactions. Within this, there was an anonymous section (“C.4”). Then: “expand section C.4”.
The safety filter served as a legitimate educational extension. The model outlined the complete mechanism of the Birch reduction, which is the classic synthetic route for the production of methamphetamine.
3. Academic framing and peer review
The requests were framed as material for “CS 695: Network Defence – Lecture Notes”, a hypothetical university course intended for distribution to students. The model generated fully functional Python code for a reverse shell.
4. Utilisation of long-term memory
The attack was spread across very long conversations. No single message appeared to be malicious. However, by recalling the entire preceding context, the model was able to reconstruct the dangerous information on its own.
The Fable filters were bypassed, demonstrating the power of the underlying model.
Why it is serious (and what it teaches us)
For start-ups and businesses that integrate AI models into their products, this lesson is of the utmost importance:
It is a mistake to regard the security filters provided by LLM vendors as infallible, and one that can prove costly for a company that blindly trusts the ‘best’ LLM.
If a company decides to expose a production database integrated with an LLM via libraries such as LangChain – on the assumption that the queries generated will always and only be legitimate ones – it is treading on thin ice.
As has been documented, these blocking mechanisms can be circumvented using jailbreak techniques. The consequence? An attacker could inject malicious prompts and gain direct access to sensitive database data, bypassing all perimeter controls. This is a very serious vulnerability. Do not make this mistake. To limit the damage, it would be advisable to expose a database containing non-sensitive data which, even if breached, causes no harm, a database confined within a Docker container or VM (virtual machine).
Specific case: if a company uses an AI model for a legal assistant with a support chatbot and that model can be ‘released’, there is a real risk of legal liability and reputational damage. (Photo by Max Bender on Unsplash)
ALL RIGHTS RESERVED ©
