We’re thrilled to introduce the Attack Prompt Tool, a new Google Chrome Extension designed to support AI safety research by making adversarial prompt testing easier. This tool is built for AI researchers, security professionals, and anyone interested in understanding the resilience of large language models (LLMs) against adversarial techniques, particularly jailbreak prompts.

Through this straightforward tool, we aim to foster awareness of AI safety issues and promote responsible experimentation, helping users explore how LLMs handle sophisticated prompts—all in a controlled, ethical environment. Let’s dive into how you can use this extension to conduct your own research.

Getting Started: What Is the Attack Prompt Tool?

The Attack Prompt Tool allows users to create adversarial prompts in a few simple steps. It is especially useful for testing LLM responses to various types of jailbreak prompts by embedding text into pre-defined templates. For researchers, this tool provides a streamlined way to simulate and study adversarial techniques like DAN, Adaptive, and others without extensive setup or coding.

How to Use the Attack Prompt Tool

1. Enter Your Prompt

Start by typing or pasting text into the "Enter Text" field. This text will be embedded into a controlled adversarial format, allowing you to test different types of prompts.

2. Generate an Adversarial Prompt

Click the “Create” button to generate an adversarial prompt that incorporates your text. Each time you click “Create,” the tool generates a new variation, giving you flexibility to test different prompt styles and explore model responses to diverse inputs.

3. Save Your Prompt for Analysis

For easy access and analysis, use the copy button at the bottom of the screen to save your generated prompt. This feature is especially handy for researchers working with multiple prompts, allowing you to collect prompts and test them in different settings.

4. Experiment with Templates

Our tool includes several templates inspired by leading adversarial techniques like DAN, Adaptive, and Deep Inception. These templates make it easy to test models against different types of prompts designed to elicit various responses.

Example Scenarios: How This Tool Supports Your Research

To illustrate how this tool can enhance AI safety research, here are a few scenarios in which it can be useful:

Testing Open-Source Models: Use the Attack Prompt Tool on open-source LLMs to explore how they handle different prompt formats. For example, you might test if the model responds differently to prompts generated through “Adaptive” versus “ReNeLLM” techniques.

Exploring Model Robustness: For cutting-edge models like OpenAI's o1 or Claude 3.5, experiment with neutral language prompts and note any model filters that affect prompt success. This helps researchers better understand how filters in newer models mitigate adversarial responses.

Ethics and Responsible Use

This tool is built strictly for research and ethical use. We discourage any misuse of the tool. Users should be aware of limitations in model manipulation, as explicit terms are often blocked by model filters. When testing models, using neutral language can sometimes yield more consistent results. While effective on some open-source models, users should note that newer models like OpenAI's o1 or Claude 3.5 may have additional safeguards, lowering the success rate of certain prompts. Following these guidelines helps maintain transparency and trust in AI research.

Conclusion

The Attack Prompt Tool is a simple yet valuable addition to the field of AI safety, helping researchers, developers, and professionals conduct adversarial testing with ease. Whether you’re just exploring adversarial techniques or running a dedicated study on model robustness, we hope this tool supports your goals in promoting a safer AI landscape.

To get started, install the Attack Prompt Tool today and start experimenting with prompt variations. Let’s work together to ensure that AI technologies serve society responsibly and securely.

References

“Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

‍

Introducing the Attack Prompt Tool: A Simple Extension for AI Security Research