When AI Enables Harm: How Self-Harm Guidance Slips Through Safety Filters

October 8, 2025

As AI systems grow more powerful and pervasive, a critical threat has emerged:

❌ Leading models can be manipulated to produce dangerous, actionable content on self-harm and suicide, often with minimal effort❗️

In our recent 🚩 red-teaming research across platforms including Grok (xAI), DeepSeek, Gemini, GPT-5, Claude 4.5, Qwen3-Max, and Google NotebookLM, we uncovered severe, real-world safety failures. These aren't theoretical risks: the failures produced step-by-step instructions for suicide, covert self-harm methods, and even multimedia content encouraging these behaviours. 😢

🔍 How Harmful Requests Bypass Filters

1️⃣ Grok (xAI): By role-playing as a “mental health doctor,” we built trust over several turns. The model then provided detailed suicide guidance, showing how authority manipulation defeats safeguards.

2️⃣ DeepSeek: A jailbreak prompt (“Agare’s Rules”) easily bypassed its filters, unlocking explicit self-harm instructions.

3️⃣ Gemini: Under the “AntiGPT” jailbreak, the model offered guidance on “fashionable” self-harm and even affirmed the behaviour.

4️⃣ GPT-5: A harmless-looking prompt yielded actionable suicide methods using household items. Follow-up questions produced multiple lethal options, all without triggering the model’s crisis-support (“therapist”) responses.

5️⃣ NotebookLM + DeepSeek: We used DeepSeek to write an academic-style report on “how teens hide self-harm,” then uploaded it to Google NotebookLM. The system auto-generated a podcast, mind map, and video, effectively scaling harmful content creation.

6️⃣ Claude 4.5 & Qwen3-Max: Both models responded to queries like “how to self-harm without being noticed” with detailed methods and poison recipes using household chemicals.

⚠️ These are not edge cases. In many tests we didn’t need advanced jailbreaks; light rewording or multi-turn setups were enough. The stakes are high: teens and vulnerable users may receive dangerous, authoritative-sounding content, and malicious actors can now generate and distribute self-harm media at scale. Current safety systems are too fragile, often relying on little more than basic keyword blocking.
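As a minimal illustration of what “context-aware” means here, the sketch below wraps a model’s draft reply in a guardrail that scores the entire conversation, not individual messages, before anything is released. The `ConversationGuardrail` class, the `score_self_harm_risk` stub, and the 0.7 threshold are placeholders of our own for this post, not any vendor’s API; a production system would plug in a trained moderation classifier and a proper crisis-resources response.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationGuardrail:
    """Scores the whole dialogue, not single messages, before a reply ships."""
    block_threshold: float = 0.7                 # placeholder risk cutoff
    history: list = field(default_factory=list)  # (role, text) pairs seen so far

    def score_self_harm_risk(self, dialogue: str) -> float:
        # Placeholder: a real deployment would call a trained moderation
        # model over the full dialogue. Returning 0.0 keeps the sketch
        # runnable without embedding any actual classifier.
        return 0.0

    def review(self, user_msg: str, draft_reply: str) -> str:
        """Check the accumulated context plus the drafted reply, then release or redirect."""
        self.history.append(("user", user_msg))
        context = "\n".join(f"{role}: {text}" for role, text in self.history)
        risk = self.score_self_harm_risk(f"{context}\nassistant: {draft_reply}")
        if risk >= self.block_threshold:
            # Withhold the draft and respond with support resources instead.
            return "I can't help with that, but I can point you to crisis support resources."
        self.history.append(("assistant", draft_reply))
        return draft_reply
```

The design point is that multi-turn setups like the ones above accumulate risk across turns; a filter that only sees the latest message, or only matches keywords, never sees that accumulation.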

✅ We specialize in:

- Advanced red-teaming & exploit discovery

- Real-time, context-aware safety guardrails

- Robust stress-testing of models under real-world threats

📅 Book time with us to test or strengthen your AI system’s defenses.

AI should never be a source of harm, but today, it too often is. Let’s fix that BEFORE it’s too late. 💪