Skip to content

What Aggregate Pass Rate Hides: Three Diagnostic Signals for LLM Agent Evals

Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs


The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.


Aggregate pass rate is a lagging and confounded signal for agentic systems. Across thousands of eval sessions on ARCHER — a local-first AI penetration testing agent — I found that a single aggregate metric routinely conflated three diagnostically distinct failure modes: the model overclaiming completion, the model being genuinely unable to complete a task, and the eval harness producing incorrect results. Separating these three signals — Objective Achieved (OA), False Positive (FP), and Halt Discipline (HD) — turned opaque failures into actionable diagnoses. This paper documents what I found and why the decomposition matters for anyone building or evaluating LLM agents. (All figures are an as-of-date snapshot of a continuously growing eval log; cited dates anchor each example.)


1. The problem with a single number

I built ARCHER to test a specific hypothesis: that a local-first LLM could handle real penetration testing tasks against a live target without cloud infrastructure or human prompting at each step. Before I could trust the results, I needed to know whether the agent was actually doing the work or just producing text that looked like it was.

The first version of the eval harness measured one thing — pass or fail. Either a session ended with [OBJECTIVE_ACHIEVED] and a passing verifier, or it didn't. One number.

That number was easy to interpret when things went right. On May 11, I ran 102 eval sessions in rapid succession while debugging a Juice Shop SQL injection objective. The aggregate pass rate that day was about 30% — low enough to read as a catastrophic regression. On paper: something fundamental had broken overnight.

Nothing of the sort had happened. The number was being pulled down by a single objective that accounted for a disproportionate 42.2% of all attempts that day — 43 of 102 sessions — and that passed just 1 of those 43 times. That objective was broken at the infrastructure level — the container wasn't auto-starting between runs, so the verifier was testing against a dead endpoint. Most other objectives in the suite were performing normally (the rest of the day's sessions passed at about 51%). The depressed aggregate was not a signal about model quality at all.

One caveat before going further: every figure in this paper is a point estimate from a specific window of a continuously growing eval log, and ARCHER's runtime environment is noisy. The same corpus shows a documented ~45-percentage-point swing in pass rate by time of day (48% at one hour, 94% at another) and VRAM-pressure effects as large as −34pp on specific skills. Single-window numbers therefore carry real runtime-environment variance; treat them as as-of-date snapshots, not stable constants.

That experience made the core problem clear. In a static model benchmark, the evaluator is separate from the thing being evaluated: the model produces a response and the benchmark scores it against a ground-truth answer. The evaluator is neutral. In an agentic system, the harness runs the commands, manages the environment, validates the outputs, and enforces stopping conditions. The harness is not neutral. It co-produces the quality signal. When aggregate pass rate drops, the failure might be the model — or it might be the harness. A single number cannot tell you which.

That distinction is not academic. A team that interprets a harness regression as a model regression will spend days debugging model behavior that isn't broken. A team that interprets a model regression as a harness issue will ignore a genuine quality problem until it surfaces in production. I needed a way to separate these, and aggregate pass rate couldn't provide it.


2. The decomposition

The three-signal decomposition I developed across this eval log separates failure modes that aggregate pass rate conflates. Each signal is a distinct diagnostic axis.

OA (Objective Achieved) is the strictest measure of success: the model emitted a completion signal, and a structurally independent verifier confirmed it was warranted. For ARCHER, that verifier is a code-layer function — verify_fn — that checks actual tool output for specific evidence: an open port confirmed by nmap, a credential in wordlist output, a file path in an HTTP response body. The model and the verifier are independent by design; the verifier has no access to the model's claimed reasoning, only to what the tools actually returned. OA=1 means the work was done and something other than the model's own judgment confirmed it.

FP (False Positive) captures overclaiming: the model emitted a completion signal that the verifier could not confirm. This is structurally analogous to what neuropsychologists call confabulation — the brain producing confident, coherent accounts of actions or memories it doesn't actually have access to, not as deception but as automatic gap-filling. A patient with confabulation isn't lying; they genuinely believe the account they're giving. A model that emits [OBJECTIVE_ACHIEVED] on a task it hasn't completed exhibits a structurally similar pattern: the output is confident and coherent, and there is no deceiver, only a pattern-completion mechanism that ran past its evidence. (The analogy is structural, not mechanistic — LLM generation and neurological confabulation differ significantly at the implementation level.) FP measures how often this happens.

HD (Halt Discipline) captures ceiling failures: the code layer's command limit forced a stop before the model reached completion. This is different from both genuine success (OA=1) and overclaiming (FP=1). HD means the model was working — running commands, parsing output, iterating — but couldn't converge within the session budget. High HD on a specific objective is a specific diagnostic: the model is structurally unable to finish this task, not failing to recognize that it's done.

The three signals are diagnostically distinct. This interpretation table shows what different combinations mean:

OA FP HD Diagnosis
High Low Low Healthy objective
Low High Low Model overclaiming; verify_fn or environment mismatch
Low Low High Model can't converge; hints or context budget insufficient
Low High High Partially broken objective — intermittent environment
Low Low Low Total disengagement — check routing and task phrasing
Variable Low Low Sampling distribution artifact — check per-objective breakdown

Single-metric benchmarks collapse this table into one row: "failing." The decomposition shows which row you're actually on.

The structure here is analogous to Kahneman's dual-process model (Stanovich & West, 2000; Kahneman, 2011). System 1 is fast, associative, and pattern-matching — it produces answers quickly and fluently, often correctly, but cannot check its own work. System 2 is slow and deliberate — it catches what System 1 misses, at the cost of effort. The model plays a structurally analogous role to System 1; the verify_fn plays a structurally analogous role to System 2. (This is an architectural parallel, not a claim that LLMs implement dual-process cognition.) OA tells you how often verification confirmed what the model claimed; FP tells you how often the model was wrong about being done; HD tells you how often the model ran out of road before it could make a claim at all.


3. Three cases from the eval log

These three cases come from ARCHER's eval history against a live Metasploitable 2 target, run on a single laptop GPU (RTX 4060 Mobile, 8 GB VRAM). Each is a dated slice of the eval CSV log rather than a fixed-size run window. In each case, aggregate pass rate produced a misleading signal. The decomposition produced the actual story.

3.1 The collapse that wasn't a regression (May 11)

Aggregate OA for this day: about 30%, with the failing objective itself at effectively zero. The kind of number that triggers an incident review.

What actually happened: 43 of 102 sessions that day — 42.2% of all attempts — were PT-WEBEX-02 (logged under its legacy ID T15), the Juice Shop SQL injection objective, run in rapid succession while I was developing and testing successive fixes: a container auto-start change, a hint template-string patch, and a verifier coverage update. Some runs hit a live Juice Shop instance; some hit a dead one. On PT-WEBEX-02 that day, the harness recorded 24 sessions that emitted [OBJECTIVE_ACHIEVED] — 23 of them rejected by the verifier as false positives — alongside 19 sessions that hit the command ceiling (HD). Only 1 of the 43 was a verified pass. That dual failure pattern — FP (model overclaiming against a down endpoint) and HD (model unable to find the injection path at all) running simultaneously on the same objective — is the diagnostic fingerprint of an intermittently broken container. When the server is up, the model finds it and overclaims. When the server is down, the model loops.

The other 59 sessions that day were non-Juice-Shop objectives run between debugging attempts; they passed at about 51%. The rest of the suite was largely unaffected.

Without the decomposition: a depressed aggregate OA reads as a crisis. Something fundamental has broken.

With the decomposition: PT-WEBEX-02 at 42.2% sampling weight, FP and HD simultaneously on one objective, no comparable degradation elsewhere. This is a debugging window artifact. The only thing that "broke" was a single containerized web application, and three specific commits were already in progress to fix it.

The decomposition turned a crisis into a checklist item.

3.2 One overclaiming objective, and the trap of a uniform story

The May 10–11 window (the one immediately after the cleanest baseline in the log, which ran at about 97.2% OA) showed a sharp drop to a roughly 39.8% failure rate — about 60% pass across 83 attempts. The kind of swing that, batched into one aggregate number, reads as a model regression.

It wasn't. The drop was driven by the harness, and the decomposition is what pointed there. But the precise mechanism is worth getting right, because the convenient version of this story — "three objectives failed the same way at once" — is not what the log actually shows, and the gap between the convenient story and the real one is itself the lesson.

What the log shows is one objective with a clear overclaiming signature. The Juice Shop SQL-injection objective (logged that window under its legacy ID T15, later renamed PT-WEBEX-02) ran 22 times and produced 11 false positives — [OBJECTIVE_ACHIEVED] emitted, verifier rejected — against 6 verified passes and 5 ceiling halts. That FP-heavy, not-purely-HD shape is the diagnostic fingerprint of a harness problem rather than a model one: a model that simply can't do a task runs commands and hits the ceiling (HD); a model that reaches the target and claims a win the verifier can't confirm produces FP. Here the Juice Shop container wasn't auto-starting between runs, so verify_fn was intermittently probing a dead endpoint — the model reached a live server on some runs and overclaimed against nothing on others. Fix: _setup_t15 auto-start, one function, Issue #274. The window's false-positive rate is almost entirely this one objective; the rest of the suite's failures that window were dominated by HD and first-run zeros on never-tested objectives, not by overclaiming.

Two other one-line harness bugs were found and fixed in the same development batch, and the surrounding narrative often gets compressed into "three regressions with the same signature." They were real bugs — a DVWA admin password that wasn't reset between runs (a setup_fn fix) and a container-hostname typo, 'post' instead of 'PostExploit', that pointed verify_fn at the wrong host (a two-character correction). But they did not share the Juice Shop FP signature. The stored-XSS objective (PT-XSS-02), across its entire logged history, fails through halt discipline — it runs commands and hits the ceiling — and never registers a single false positive; its failures look like difficulty, not overclaiming, even though a harness fix was involved. Treating all three as one uniform "elevated FP, near-zero HD" event would have asserted a measured signature the logs don't contain.

So the honest version is narrower and more useful than the tidy one. What the decomposition actually did here: it isolated one objective whose FP-dominant signature correctly fingered the harness — and it did not manufacture a shared signature for the two objectives that happened to be fixed in the same batch. The three bugs cluster temporally — they show up near each other in the log because they were touched in the same development batch, not because one mechanism caused them, and not because they failed the same way. (ARCHER's eval history contains other cases where temporal co-occurrence of failures was mistaken for a shared root cause; the correlation is real, the common-cause inference is not.) The teaching point is the per-objective contrast, not a window-wide claim: FP-without-HD points at the harness; HD-without-FP points at difficulty — and reading that contrast per objective is exactly what kept three independent one-line fixes from being mistaken for a single regression, or for a single signature.

3.3 The objective that overclaims it never solves (PT-POST-02, May 26 – June 16)

This is the case I find most instructive, because of what its failure signature does not contain.

PT-POST-02 is a password-cracking objective: use john to crack the msfadmin system account's password hash. Across the 105 logged sessions of this objective in the corpus, the harness recorded zero [OBJECTIVE_ACHIEVED] emissions — and therefore zero false positives. Every session ended either at the command ceiling (81 sessions, HALT_DISCIPLINE) or in an infrastructure error (24 sessions, ERROR). Of the 105, 51 were scored as passing — all of them HALT_DISCIPLINE sessions that satisfied the success function without the model ever declaring completion — for an overall pass rate of about 49%. The decomposition's signature here is unusually clean: high HD, and a hard zero on FP. The model never overclaims on this objective. It engages, runs commands, and stops when it hits the ceiling, without ever asserting it has won.

That zero-FP, high-HD signature is itself the diagnostic. The model isn't confabulating success it doesn't have; it's running into a genuine ceiling on a hard objective. A password-cracking task whose success depends on the credential being present in the supplied wordlist is exactly the kind of objective where the model can engage indefinitely without ever earning a verified completion: if the credential isn't reachable under the stated constraints, the honest outcome is to run out of road, not to claim a win. (This was the original concern that motivated inspecting the wordlist against verify_fn: whether msfadmin was actually reachable in the rockyou.txt distribution in use, since rockyou distributions vary slightly by source.)

The model wasn't overclaiming. In any single-metric benchmark, the failing sessions register as "model failure, objective not achieved." The decomposition shows the more precise picture: HD without FP, on one objective, across many sessions — the model engaged with the task and ran out of ceiling rather than asserting a completion it couldn't support.

It is worth being honest about what the log does not show: PT-POST-02 does not move to a clean 100% pass rate anywhere in the data. It hovers near 49% across its entire logged history (May 26 onward), which is where a genuinely hard, ceiling-bound objective sits — not where a "fixed and resolved" objective sits. The value of the decomposition here is not that it produced a tidy resolution; it's that the FP=0 / HD-dominated signature correctly says "this is a real ceiling, not the model lying about success," which is the opposite diagnosis from the §3.2 Juice Shop objective, where FP was the dominant signal.

The uncomfortable implication runs the other way too: eval harnesses can generate persistent, systematic failures that are indistinguishable from model quality failures in aggregate metrics. The only mechanism that separates them is a decomposition that distinguishes whether the model was engaging with the task (HD) from whether the verifier confirmed completion (FP). Without both signals, a genuinely hard, ceiling-bound objective and a broken model look identical.


4. What binary OA misses

OA, FP, and HD together give a clear picture of whether the model is doing the right thing at the task level. They don't tell you whether the resulting sessions contain useful training signal. For anyone running a fine-tuning pipeline on top of eval sessions, this gap is material.

ARCHER's V2 training pipeline gates on a second quality dimension: a Tier 2 score, implemented as an LLM-as-judge evaluation that scores each session on a 0–3 scale across four axes — whether findings are grounded in actual tool output, whether tool selection was appropriate for the task, whether completion was genuine, and whether the session stayed within scope. A session needs a score of ≥2 to be eligible for fine-tuning.

The divergence patterns between OA and Tier 2 are where the interesting signal lives.

OA=1, Tier 2 low: The model completed the objective and the verifier confirmed it — but the session log is thin. The model got lucky: ran the right command early, received fortuitous output, emitted a completion signal without demonstrating robust reasoning or careful tool selection. The session passes eval but produces sparse training signal. Including it in a fine-tuning corpus at full weight trains the model to replicate a superficial pattern — one command, done — rather than the underlying skill.

OA=0, Tier 2 high: The model ran good commands, interpreted output correctly, made careful tool selections, and engaged seriously with the task. It didn't reach completion — maybe it hit a command ceiling (HD), maybe a harness issue cut the session short. This session fails eval but produces high-quality partial-completion signal. Excluding it from fine-tuning because it failed the binary gate discards some of the most useful training data in the corpus.

This is the problem that the classical theory of knowledge names precisely, and that Edmund Gettier (1963) made acute. A belief is justified true belief when it is true, you believe it, and your belief is grounded in good reasons — not coincidence. Gettier showed that a belief can satisfy all three conditions and still fail to constitute knowledge if the justification is disconnected from what makes the belief true. A session that achieves OA by a single lucky command is a Gettier case: accidentally correct, with justification disconnected from the actual outcome. It cannot serve as a training example for the underlying reasoning, only for the surface form of success. The Tier 2 score is the justification check: it asks whether the correctness was earned, not just whether it occurred.

The practical guidance: for fine-tuning pipelines, OA alone is the wrong gate. The most valuable training examples are high-OA, high-Tier-2 sessions. The second most valuable are failed-OA, high-Tier-2 sessions — genuine engagement that didn't reach completion. The least valuable, and potentially harmful, are high-OA, low-Tier-2 sessions that teach the model to mimic the surface form of success.


5. Five questions before trusting your numbers

The three cases above suggest concrete questions for anyone building or operating an agent eval harness.

1. Who owns this failure?

Before debugging a failing objective, ask whether the failure belongs to the model or the harness. The decomposition answers this. FP without HD suggests the model is engaging but the verifier is wrong — check verify_fn logic and environment state. HD without FP suggests the model can't converge — check hints, context budget, and task framing. Both simultaneously, or neither, usually indicates infrastructure involvement. "The model is bad" should be a conclusion reached after ruling out harness issues, not the first interpretation.

2. Is your verifier structurally independent from the model?

A verify_fn that reads the model's claimed output — rather than the actual state of the target — is not a check on the model. It is a check on the model's self-reporting, which is exactly what you are trying to validate. Structural independence means the verifier probes what the model did, not what the model said it did: actual tool output, environment state, target responses — not the model's narrative of what those showed.

3. What is your objective sampling distribution?

Aggregate pass rate is a weighted average. If some objectives are sampled more frequently than others, they dominate the metric. An objective that takes 40%+ of attempts in a window — as PT-WEBEX-02 did on May 11 — will drag aggregate OA down sharply if it fails, regardless of how the rest of the suite performs. Monitor per-objective pass rates alongside aggregate. Investigate any objective whose sampling weight is more than twice the per-objective average.

4. Are your objectives satisfiable under their stated constraints?

Before a collection run, verify: is there a credential, file, port, or endpoint that verify_fn requires that the model's available tools can actually produce? An unsatisfiable objective doesn't fail once — it fails every run, accumulating false-failure data that looks like systematic model inability. State preconditions explicitly and verify them before the run begins, not after a week of unexplained HD on a single objective.

5. What does a passing session actually contain?

A session that passes eval is not automatically training-eligible. Run a quality gate on sessions before they enter a fine-tuning pipeline. The most misleading training examples are high-OA, low-Tier-2 sessions that teach the model to replicate a surface pattern rather than a skill. The most undervalued are failed-OA, high-Tier-2 sessions that captured genuine engagement. The binary eval gate cannot distinguish between these — only a separate quality dimension can.


Closing

I started building ARCHER's eval framework because I needed to know whether the agent was getting better. I ended up learning that "better" is at least three separate questions: Is the model completing more tasks? Is it overclaiming less? Is it engaging more seriously when it struggles?

The decomposition didn't make evaluation harder. It made failures legible.

The depressed aggregate OA on May 11 wasn't a regression — it was a debugging snapshot with a broken container and an unlucky sampling distribution. The drop in the §3.2 window wasn't model degradation — it was one objective overclaiming against a dead container, with two unrelated one-line harness bugs fixed in the same batch. PT-POST-02's persistent ~49% pass rate wasn't model dishonesty — its hard zero on false positives is the signature of an objective the model engages with seriously and reaches the ceiling on, rather than one it claims to have solved.

None of those diagnoses were possible from aggregate pass rate alone. All of them were immediately visible once OA, FP, and HD were separated.

The framework adds minimal overhead. OA, FP, and HD follow directly from three logs that any agentic eval harness should already produce: whether the model emitted a completion signal, whether that signal was independently verified, and whether the session hit a ceiling or ended naturally. The Tier 2 quality dimension requires an additional LLM-as-judge pass, but the base decomposition is cheap.

If you are evaluating an LLM agent on a single pass-rate number, you know whether it is passing or failing. With three numbers, you know why — and why matters when you are trying to fix it.


Appendix: Metric reference

Metric Definition High means Low means
OA rate Fraction of sessions ending with a verified completion signal Model completing tasks; verifier confirming Model not completing, or verifier rejecting claims
FP rate Completion signal emitted but verifier rejected Model overclaiming; check verify_fn and environment Model not overclaiming
HD rate Session ended at command ceiling without completion Model can't converge; check hints and context Model completing before ceiling
Tier 2 score LLM-as-judge quality (0–3): grounding, tool selection, genuine completion, scope High-value training session Thin evidence; low training value regardless of OA
OA FP HD Likely diagnosis First investigation
High Low Low Healthy Verify Tier 2 quality
Low High Low Verifier mismatch or overclaiming Review verify_fn; check environment state
Low Low High Can't converge Review hints, context budget, task framing
Low High High Intermittent environment Check container/env state between runs
Low Low Low Total disengagement Check routing and system prompt
Variable Low Low Sampling distribution artifact Per-objective breakdown; identify high-weight outliers