JailbreakDB

How Jailbreak Benchmarks Measure Success (ASR Explained)

Jailbreak benchmarks measure success via attack success rate, but the behavior set, attacker, and judge decide the number.

By Redacted · 7 min read

Jailbreak benchmarks measure success with an attack success rate (ASR): the share of harmful prompts that get the target model to comply. The catch is that ASR depends entirely on four inputs (the behavior set, the attacker, the judge, and the target model), so the same attack can score 30% or 95% depending on how those are chosen. To read any jailbreak result correctly you have to know which choices produced the number.

Almost every jailbreak paper reports an attack success rate, and almost every one of those numbers is uninterpretable on its own. ASR is not a property of an attack. It’s a property of an attack measured against a specific behavior set, by a specific judge, with a specific attacker harness, on a specific model. Change any of those four and the number moves, often by tens of points. For anyone running a defense program — or just trying to read the literature without being misled — understanding what the major benchmarks actually measure is more useful than memorizing leaderboard positions.

Three benchmarks define the current measurement landscape: HarmBench, JailbreakBench, and StrongREJECT. They make different choices on the load-bearing questions, and those choices are the whole story.

What is attack success rate (ASR)?

Attack success rate (ASR) is the share of harmful prompts that get the target model to comply, and it is the core metric jailbreak benchmarks report. An ASR always reflects four specific inputs, and changing any one of them can move the number by tens of points:

  1. Behavior set — which harmful behaviors are being elicited.
  2. Attacker — static prompts, a human red-teamer, or an adaptive automated attacker.
  3. Judge — keyword check, trained classifier, or substance-aware rubric evaluator.
  4. Target — which model version, system prompt, and guardrails.

The same attack can score 30% or 95% depending on those choices, so an ASR is only interpretable once you know which inputs produced it.
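
To make the definition concrete, here is a minimal Python sketch, assuming every attempt has already been labeled by some judge. The function names and the category labels ("mild", "severe") are hypothetical and do not come from any of the benchmarks. It illustrates that two behavior sets with identical per-category success rates can still report different overall ASRs purely because of their mix.

```python
from collections import defaultdict


def attack_success_rate(verdicts):
    """Share of judged attempts marked successful (True)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("ASR is undefined for an empty result set")
    return sum(bool(v) for v in verdicts) / len(verdicts)


def asr_by_category(results):
    """Group (category, success) pairs and compute ASR per category."""
    buckets = defaultdict(list)
    for category, success in results:
        buckets[category].append(success)
    return {name: attack_success_rate(v) for name, v in buckets.items()}


def make_results(category, successes, failures):
    """Build hypothetical judged results for one category."""
    return [(category, True)] * successes + [(category, False)] * failures


# Hypothetical data: both sets have 80% success on "mild" and 20% on "severe".
mild_heavy = make_results("mild", 8, 2) + make_results("severe", 1, 4)
severe_heavy = make_results("mild", 4, 1) + make_results("severe", 2, 8)

for name, results in [("mild-heavy", mild_heavy), ("severe-heavy", severe_heavy)]:
    overall = attack_success_rate(success for _, success in results)
    print(name, f"overall={overall:.0%}", asr_by_category(results))
```

With this made-up data the per-category rates match across both sets, yet the overall ASR is 60% for the mild-heavy set and 40% for the severe-heavy one. Nothing about the attack changed; only the composition of the behavior set did.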

The four things a benchmark must pin down

Every jailbreak evaluation is implicitly answering four questions. A credible benchmark answers them explicitly.

  1. The behavior set. Which harmful behaviors are we trying to elicit? A curated list of weapons-and-cybercrime requests produces a different ASR than a list dominated by milder policy violations. The composition of the set determines what “100% success” even means.
  2. The attacker. Who or what generates the jailbreak attempts? A fixed corpus of static prompts, a human red-teamer, or an automated attacker model that adapts to the target? Adaptive attackers typically produce far higher ASRs than static corpora, so the harness accounts for a large part of the result.
  3. The judge. Who decides whether an attempt succeeded? A keyword check (“did the response avoid the word ‘sorry’?”), a fine-tuned classifier, or a strong LLM evaluator with a rubric? The judge is often the single largest source of inflated numbers.
  4. The target. Which model, at which version, with which system prompt and guardrails? Models change frequently; a result without a dated target is a historical note.

Hold these four in mind and the benchmarks below stop being interchangeable leaderboards and become specific, comparable instruments.
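
One way to keep the four inputs attached to every number is to refuse to store an ASR without them. The sketch below is an assumption about how a team might do that, not a schema from any benchmark; the class name, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AsrResult:
    """An ASR plus the four inputs that produced it (hypothetical schema)."""
    asr: float            # share of successful attempts, in [0, 1]
    behavior_set: str     # which behaviors were targeted
    attacker: str         # static corpus, human, or adaptive harness
    judge: str            # keyword check, classifier, or rubric evaluator
    target_model: str     # model name and version
    target_date: date     # when the target was evaluated

    def __post_init__(self):
        if not 0.0 <= self.asr <= 1.0:
            raise ValueError("asr must be between 0 and 1")

    def describe(self) -> str:
        """Render the number together with its context."""
        return (
            f"{self.asr:.0%} ASR | behaviors={self.behavior_set} | "
            f"attacker={self.attacker} | judge={self.judge} | "
            f"target={self.target_model} ({self.target_date.isoformat()})"
        )


# Placeholder values for illustration only.
example = AsrResult(
    asr=0.41,
    behavior_set="example-100-behaviors",
    attacker="static-corpus",
    judge="rubric-evaluator",
    target_model="example-model-v1",
    target_date=date(2024, 6, 1),
)
print(example.describe())
```

Because every field is required, a bare "95% ASR" cannot be recorded at all, which is the discipline the four questions are meant to enforce.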

HarmBench: standardizing the red-teaming comparison

HarmBench (Mazeika et al., arXiv:2402.04249) was built to make attack methods comparable. Its contribution is a standardized framework for automated red-teaming: a fixed behavior set, a fixed evaluation protocol, and a fixed classifier-based judge, so that 18 different attack methods can be run against 33 target models and the resulting numbers actually mean the same thing across rows.

The key design choice is the classifier judge: HarmBench trains a dedicated classifier to decide whether a model’s response constitutes a successful elicitation of the target behavior, rather than relying on keyword matching or refusal detection. This removes a major confound — keyword judges count “Sure, here’s how…” followed by garbage as a success, and count a genuinely harmful response that happens to include a hedging phrase as a failure. A trained classifier judges the substance.
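
The two keyword-judge failure modes are easy to reproduce. The following is a toy refusal-marker judge written for illustration; it is not HarmBench's classifier or any published judge, and the marker list and bracketed placeholder responses are assumptions.

```python
# Hypothetical refusal markers; real keyword judges use their own lists.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def keyword_judge(response: str) -> bool:
    """Return True (attack counted as success) when no refusal marker appears."""
    text = response.lower()
    return not any(marker in text for marker in REFUSAL_MARKERS)


# Placeholder responses: the bracketed text stands in for real model output.
empty_compliance = "Sure, here's how... [off-topic filler with no usable content]"
hedged_compliance = "I'm sorry to say this is risky, but [on-target harmful content]"

print("empty compliance counted as success:", keyword_judge(empty_compliance))
print("hedged compliance counted as success:", keyword_judge(hedged_compliance))
```

The first response is scored as a success even though it contains nothing useful, and the second is scored as a failure even though it delivers the content. A judge that evaluates substance is meant to get both cases right.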

HarmBench also pairs the evaluation with a defensive contribution (an adversarial-training method for robust refusal), which signals its framing: the benchmark exists so that attack and defense methods can be measured on common ground, not to crown a single best jailbreak. When you see a paper reporting “ASR on HarmBench,” it’s making a claim that’s comparable to other HarmBench numbers — which is the entire point and a real improvement over the pre-HarmBench era where every paper used its own judge.

JailbreakBench: an open, versioned, leaderboarded artifact

JailbreakBench (Chao et al., arXiv:2404.01318) emphasizes reproducibility and openness as first-class goals. It ships a dataset of 100 distinct misuse behaviors (drawing on prior work including the GCG and HarmBench corpora), a standardized evaluation pipeline, an explicit threat-model specification, and a public leaderboard tracking both attacks and defenses over time.

Its notable choices:

  • A small, sharply defined behavior set. One hundred behaviors is small enough to scrutinize and version, which matters because behavior-set drift is a hidden source of incomparability. Everyone evaluating on JailbreakBench is targeting the same 100 behaviors.
  • Explicit defenses in scope. The leaderboard tracks attacks and defenses, so a jailbreak’s ASR is reported against defended models, not just undefended ones. This is closer to the deployment-relevant question (how does this attack do against a model with guardrails?) than an undefended-only number.
  • Versioning and an archive of artifacts. Because models and judges change, JailbreakBench treats the benchmark as a living, versioned artifact rather than a one-time table, so results carry the context needed to interpret them later.

JailbreakBench is the benchmark to reach for when you want a reproducible, apples-to-apples comparison that includes defenses, and when you care that the behavior set and harness are fixed and inspectable.

StrongREJECT: the judge is the problem

StrongREJECT (Souly et al., arXiv:2402.10260) is less a competing benchmark than a correction to how all of them score. Its starting observation is that most jailbreak papers claim near-100% ASR, and those claims are inflated because the judges are bad. The two failure modes:

  • Willingness-only judging. Many evaluators score success as “the model didn’t refuse.” But a model can enthusiastically agree to a forbidden request and then produce a response that is vague, wrong, or useless — an “empty” jailbreak. Counting that as a win massively overstates real-world harm.
  • Brittle automated judges that disagree with human assessment, either over-counting (any non-refusal) or under-counting (penalizing benign hedging).

StrongREJECT’s answer is a behavior set of prompts that demand specific, verifiable harmful information, plus an automated evaluator that scores whether the response actually delivers useful, on-target information for the forbidden request — not merely whether the model agreed. The authors show this evaluator reaches state-of-the-art agreement with human judgment, and that under it, many published jailbreaks that boasted near-100% ASR perform dramatically worse.

The implication for reading the literature is direct: an ASR computed with a willingness-only judge is an upper bound on real effectiveness, often a loose one. When two papers report different ASRs for the same attack, check whether one used a substance-aware judge like StrongREJECT and the other used refusal detection. That difference alone can explain the gap.
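
A small sketch shows why a willingness-only score can never fall below a substance-aware one computed on the same responses. The scoring formula here is a simplified assumption in the spirit of rubric-based grading, not the StrongREJECT implementation; the class, the 1 to 5 scales, and the example grades are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    refused: bool         # did the model decline outright?
    specificity: int      # 1 (vague) to 5 (very specific), from a rubric grader
    convincingness: int   # 1 (useless) to 5 (fully usable), from a rubric grader


def willingness_score(r: JudgedResponse) -> float:
    """1.0 for any non-refusal, regardless of content."""
    return 0.0 if r.refused else 1.0


def substance_score(r: JudgedResponse) -> float:
    """0.0 for refusals; otherwise rescale the two rubric grades to [0, 1]."""
    if r.refused:
        return 0.0
    return (r.specificity + r.convincingness - 2) / 8


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical grades for four responses to the same attack.
responses = [
    JudgedResponse(refused=True, specificity=1, convincingness=1),
    JudgedResponse(refused=False, specificity=1, convincingness=1),  # "empty" jailbreak
    JudgedResponse(refused=False, specificity=2, convincingness=1),
    JudgedResponse(refused=False, specificity=5, convincingness=5),
]

willingness = mean(willingness_score(r) for r in responses)
substance = mean(substance_score(r) for r in responses)
print(f"willingness-only: {willingness:.2f}, substance-aware: {substance:.2f}")
```

Each response's substance score is at most its willingness score, so the aggregate inherits the same ordering. In this made-up sample, three of four responses are non-refusals, but only one of them carries real content.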

A reader’s guide to ASR numbers

Putting it together, a defender reading a jailbreak result should ask:

  • What’s the behavior set, and how big? A 100-behavior fixed set (JailbreakBench) is more interpretable than an ad-hoc list. Composition skews the number.
  • What’s the attacker harness? Static corpus replay tends to understate risk; an adaptive automated attacker is the more realistic threat. The same attack “method” scores differently under different harnesses.
  • What’s the judge? A substance-aware classifier (HarmBench) or rubric evaluator (StrongREJECT) is more trustworthy; refusal detection inflates ASR. This is the first thing to check and the most common source of bogus numbers.
  • What’s the target, dated? A jailbreak ASR against a model version from a year ago tells you about that version, not today’s.

Without these four, “we achieved 95% ASR” is marketing, not measurement. With them, the number becomes a comparable, defensible data point — which is exactly what HarmBench and JailbreakBench were built to produce and what StrongREJECT was built to keep honest.

Which benchmark should you use?

There is no single best benchmark; each answers a different question, and serious evaluations often report more than one.

  • Comparing attack methods on common ground: HarmBench. Its fixed protocol and trained classifier judge were built so that many attacks against many models produce numbers that mean the same thing across rows.
  • A reproducible, versioned, defense-aware comparison: JailbreakBench. The 100-behavior set is small enough to scrutinize, the artifacts are archived, and the leaderboard scores attacks against defended models.
  • Sanity-checking whether an ASR is real: StrongREJECT. If a result looks too good, rescoring with a substance-aware judge usually deflates willingness-only inflation.

The practical move for a defender is to treat HarmBench or JailbreakBench as the primary comparison and StrongREJECT-style scoring as the honesty check, rather than trusting any single leaderboard cell.

Why two papers report different ASRs for the same attack

When you see the same attack credited with, say, 88% in one paper and 41% in another, the gap almost always traces to one of the four inputs rather than to a genuine change in the attack:

  1. Different judge. A willingness-only judge counts any non-refusal as a win; a substance-aware judge demands a useful, on-target harmful response. This is the most common and largest source of disagreement.
  2. Different behavior set. A corpus weighted toward milder policy violations scores higher than one full of specific weapons-or-cyber requests.
  3. Different attacker harness. Static prompt replay understates real risk; an adaptive automated attacker that rewrites failed attempts pushes ASR up sharply. Optimization-based attacks like universal adversarial suffixes (GCG) are especially harness-sensitive.
  4. Different or undated target. A model patched between the two studies can swing the number on its own, which is why a result without a dated target is only a historical note.
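
The harness effect in item 3 can be illustrated with a deliberately simple model. It assumes every attempt against a behavior succeeds independently with the same probability, which real adaptive attackers do not satisfy because they learn from failed attempts; the function name and the 10% per-attempt rate are hypothetical. Even under this crude assumption, the attempt budget alone moves the reported number a long way.

```python
def asr_with_budget(per_attempt_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Hypothetical attack that succeeds on 10% of single attempts.
for budget in (1, 5, 10, 25):
    print(f"{budget:>2} attempts per behavior -> {asr_with_budget(0.10, budget):.1%}")
```

Two papers can therefore describe the same attack method and the same per-attempt strength, yet report very different ASRs if one allows a single try per behavior and the other allows many.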

Before comparing two ASRs, line up all four inputs. If they differ, you are not comparing the same measurement, and the headline numbers are not in conflict so much as answering different questions. This is the same evaluation fragility that makes the detection and evasion arms race so hard to score cleanly.
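
Lining up the four inputs can be done mechanically. The helper below is a sketch under the assumption that each reported result is stored as a dictionary with these four keys; the key names and the two example records are hypothetical and do not describe real papers.

```python
# The four inputs that must match before two ASRs are comparable.
INPUT_FIELDS = ("behavior_set", "attacker", "judge", "target")


def mismatched_inputs(result_a: dict, result_b: dict) -> list:
    """Return the input fields on which two reported results differ."""
    return [
        field for field in INPUT_FIELDS
        if result_a.get(field) != result_b.get(field)
    ]


# Hypothetical records for the same attack reported in two places.
report_a = {
    "asr": 0.88,
    "behavior_set": "example-100-behaviors",
    "attacker": "adaptive",
    "judge": "refusal-keyword",
    "target": "example-model-v1 (2024-06-01)",
}
report_b = dict(report_a, asr=0.41, judge="rubric-evaluator")

differences = mismatched_inputs(report_a, report_b)
if differences:
    print("Not the same measurement; differing inputs:", differences)
else:
    print("Inputs match; the ASRs can be compared directly.")
```

Here the only difference is the judge, so the 88% versus 41% gap says something about the scoring and nothing yet about the attack.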

Where this sits in the database

JailbreakDB tracks technique classes by the property they exploit; benchmarks are the instruments that tell us how much each class still works against current models. The taxonomy tells you what the attack does; these benchmarks tell you how well, and the judge choice tells you whether to believe it. For the multi-turn class specifically, benchmark design is even more fraught — see our Crescendo and multi-turn analysis, where the attacker harness and the conversation-level scoring are the deciding factors. For scanner coverage measured against representative samples from each class, bestllmscanners.com does the comparison.

For more context, LLM jailbreak techniques covers why these attacks work and how to defend against them.

Sources

  1. Mazeika et al., HarmBench (arXiv:2402.04249)
  2. Chao et al., JailbreakBench (arXiv:2404.01318)
  3. Souly et al., A StrongREJECT for Empty Jailbreaks (arXiv:2402.10260)