Beyond Pass Rate: A Longitudinal Benchmark and Diagnostic Decomposition for LLM Security Agents

Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs


The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.


Abstract

Existing LLM evaluation benchmarks are point-in-time snapshots with no mechanism to distinguish harness failure from model failure. This limitation is especially acute for agentic systems, where the evaluation harness executes commands, manages target environments, validates outputs, and enforces stopping conditions, and so co-produces the quality signal it claims to measure. I present a longitudinal evaluation framework for security-domain LLM agents with a three-axis decomposition into Objective Achieved (OA), False Positive (FP), and Halt Discipline (HD) that separates diagnostically distinct failure modes obscured by aggregate accuracy. I release a 218-run dataset collected over eight temporal windows against a live target environment, comprising 53 objectives across five subdomains: enumeration and reconnaissance, web exploitation, exploitation, Active Directory assessment, and post-exploitation. Analysis of this dataset demonstrates that infrastructure drift and verify_fn miscalibration routinely dominate aggregate accuracy signals: two of three observed accuracy drops exceeding 15 percentage points trace to harness state rather than model quality, and one persistent high-HD objective proves to reflect an unsatisfiable success condition rather than model inability. I further describe a Tier 2 LLM-as-judge complement (score 0–3) that identifies training-eligible sessions by grounding, tool appropriateness, completion honesty, and scope adherence, recovering high-value partial-completion sessions that binary OA discards. The OA/FP/HD decomposition adds no infrastructure cost beyond logs any agentic harness should already produce.


1. Introduction

The dominant paradigm in LLM evaluation is the static benchmark: a fixed set of tasks, a fixed set of expected outputs, and a score computed at inference time by a neutral evaluator. HumanEval [1], MMLU [2], and HELM [3] exemplify this paradigm. The evaluator is independent of the model — it compares generated outputs against ground truth without the model's participation. The quality signal is clean: evaluation failure means the model failed.

Agentic systems break this assumption. An agent that executes commands against a real environment does not produce a single output to compare against ground truth. It produces a trajectory — a sequence of actions, observations, and decisions — whose correctness depends on whether the environment responded as expected, whether verifier code correctly classified the outcome, and whether stopping conditions fired at the right time. The evaluation harness is not a passive scorekeeper. It is an active participant: it runs the commands the model selects, manages the environment the model interacts with, validates the evidence the model claims to have found, and enforces the session budget the model operates within. When a session fails, the failure might be the model — or any of these harness components.

Static benchmarks have no mechanism to separate these cases. An aggregate pass rate that drops 15 points following a code change is equally consistent with a model regression, a broken verifier, a misconfigured environment, or a sampling distribution artifact. Teams that cannot separate these interpretations will misdiagnose failures: attributing harness regressions to model quality produces unproductive debugging; attributing model regressions to harness issues allows genuine quality problems to accumulate.

This paper makes three contributions:

  1. A three-axis diagnostic decomposition — OA, FP, HD — that separates model overclaiming (FP), genuine inability to converge (HD), and verified success (OA) in agentic eval sessions, applied to a 53-objective security-domain benchmark.

  2. A 218-run longitudinal dataset across eight temporal windows, demonstrating that the decomposition recovers diagnostically distinct failure modes that aggregate accuracy cannot separate.

  3. A Tier 2 session quality complement (LLM-as-judge, 0–3 scale) that identifies training-eligible sessions beyond binary OA, recovering high-value partial-completion sessions that binary gating discards and flagging low-evidence sessions that pass OA by luck.

The target domain is penetration testing: a natural eval setting for agentic LLMs because it requires multi-step planning, tool selection, environment interaction, and evidence-grounded completion claims — all properties that stress the harness/model co-production problem this paper addresses.

The decomposition requires no additional infrastructure: OA, FP, and HD are derivable from three event logs any properly-instrumented agentic harness already produces. Section 3 describes the benchmark and evaluation architecture. Section 4 formalizes the decomposition. Section 5 presents the longitudinal analysis. Section 6 describes the Tier 2 quality complement. Section 7 discusses generalizability and harness hygiene implications.


2. Related Work

Static code and task benchmarks. HumanEval [1] evaluates functional code correctness by executing generated functions against hidden test suites. MBPP [4] and HumanEval+ [5] extend coverage and reduce evaluation-set contamination. MMLU [2] and BIG-Bench [6] measure breadth across knowledge domains. All share a defining property: the evaluator is structurally independent of the model, applying fixed tests to model outputs without executing in a shared environment. This independence is the property agentic benchmarks cannot assume.

Holistic evaluation. HELM [3] systematized LLM evaluation across multiple metrics — accuracy, calibration, robustness, fairness, efficiency — applied to static benchmarks. HELM-Lite and related work have characterized the distributional instability of aggregate rankings under metric weighting variation. These efforts address the multi-dimensionality of quality for static inference; they do not address the harness co-production problem for agentic systems.

Security-domain LLM evaluation. CyberSecEval [7] and CyberSecEval 2 [8] evaluate LLM cybersecurity capabilities and safety constraints — producing scores for code vulnerability insertion, cyberattack helpfulness, and prompt injection resistance. Evaluation is static: fixed prompts, model responses scored against rubrics without environment interaction. NYU CTF Bench [9] frames capture-the-flag challenges as evaluation tasks, running model-generated solutions against challenge validators. PentestGPT [10] demonstrates that LLMs can chain penetration testing subtasks but evaluates on a small task set without longitudinal analysis. None of these systems instrument the harness/model failure boundary at the per-session level.

Interactive code execution benchmarks. InterCode [11] and SWE-Bench [12] evaluate LLMs against interactive coding tasks in sandboxed execution environments. These settings share the harness co-production property with agentic security eval: the model's actions affect the environment state that subsequent verification checks. SWE-Bench has encountered exactly the harness validity problem this paper addresses: subsequent analysis [13] found that a significant fraction of SWE-Bench instances had faulty or incomplete test suites that produced incorrect verification signals. The three-axis decomposition presented here was developed independently but addresses the same structural problem in a different domain.

AgentBench and agentic eval frameworks. AgentBench [14] evaluates LLMs as agents across eight distinct environments including web browsing, code execution, and game-playing. It introduces multi-dimensional scoring across diverse task types. AgentBench does not address longitudinal temporal analysis, does not instrument harness failure modes separately from model failure, and does not apply to adversarial security task sequences.

LLM-as-judge evaluation. MT-Bench and Chatbot Arena [15] demonstrate that LLM-as-judge approaches produce reliable quality rankings, particularly in domains where ground-truth answers do not exist. The Tier 2 complement described in Section 6 applies the LLM-as-judge pattern to session-level quality assessment for fine-tuning eligibility, a different application from pairwise preference ranking, and introduces a four-axis rubric (grounding, tool appropriateness, completion honesty, scope adherence) suited to agentic security sessions.

Longitudinal evaluation. Longitudinal analysis of LLM benchmark stability has identified significant variance in aggregate rankings over time due to training data contamination, prompt sensitivity, and evaluation set drift [16]. This work extends that concern to agentic systems and shows that the largest source of longitudinal variance in the dataset is harness infrastructure drift rather than model quality change.


3. The ARCHER Benchmark

3.1 System Overview

ARCHER (Autonomous Red-team Cyber Human-Equivalent Reasoner) is a local-first LLM agent for penetration testing. The system runs qwen3:14b via Ollama on a single consumer GPU (NVIDIA RTX 4060 Mobile, 8 GB VRAM, 46 GB host RAM, Debian 13). The agent follows a structured turn format — [THOUGHT] / bash block / [FINDINGS] / [OBJECTIVE_ACHIEVED] — executing tool commands against live target environments and chaining results across turns. Sessions run against a local cyberrange: Metasploitable 2, bee-box (bWAPP), WebGoat, Juice Shop, and a GOAD-Light Active Directory environment.

The agent is not cloud-hosted, not API-gated, and not fine-tuned — it operates with a base-pretrained instruction-following model and a domain-specific system prompt. This design choice is intentional: I evaluate the model's generalization from pretraining rather than task-specific fine-tuning, and assess whether the eval framework itself is calibrated, not whether a specific fine-tuned capability can be reproduced.

3.2 Objective Set

The benchmark comprises 53 objectives spanning five subdomains:

| Subdomain | Objectives | Example tasks |
|---|---|---|
| Enumeration & Recon | 12 | Port scanning, service fingerprinting, OS detection |
| Web exploitation | 14 | SQL injection, XSS, CSRF, IDOR, file inclusion |
| Exploitation | 9 | CVE-specific exploitation, credential reuse, shell acquisition |
| Active Directory | 7 | AS-REP roasting, Kerberoasting, DCSync, lateral movement |
| Post-exploitation | 11 | Privilege escalation, credential dumping, pivot establishment, persistence |

Objectives are defined as structured records containing: a natural-language task string, the target environment, the required skill subdomain, a verify_fn (verification function, described below), a setup_fn for pre-run state initialization, and an expected completion evidence pattern.
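
As a rough illustration of the record structure just described, the sketch below models one objective as a Python dataclass. The field names, the `Artifacts` type, and the example objective ID and values are all hypothetical; the released schema may differ.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Hypothetical artifact bundle: artifact name -> raw tool output.
Artifacts = Mapping[str, str]


@dataclass(frozen=True)
class Objective:
    """One benchmark objective; field names are illustrative assumptions."""
    objective_id: str                       # hypothetical identifier
    task: str                               # natural-language task string
    target: str                             # target environment name
    subdomain: str                          # required skill subdomain
    verify_fn: Callable[[Artifacts], bool]  # sees tool artifacts only
    setup_fn: Callable[[], None]            # pre-run state initialization
    evidence_pattern: str                   # expected completion evidence


# Hypothetical example record, not taken from the released objective set.
example = Objective(
    objective_id="EX-SCAN-00",
    task="Identify the open TCP services on the target host.",
    target="metasploitable2",
    subdomain="Enumeration & Recon",
    verify_fn=lambda artifacts: "22/tcp" in artifacts.get("nmap", ""),
    setup_fn=lambda: None,
    evidence_pattern=r"22/tcp\s+open",
)
```

Keeping verify_fn and setup_fn as fields of the record means each objective carries its own verifier and reset logic, which makes the per-objective audits discussed later straightforward to automate.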

3.3 Verification Design

Each objective includes a verify_fn — a Python function that checks actual tool output, environment state, or target responses for the specific evidence the task requires. Structural independence from the model is a design invariant: verify_fn receives only tool execution artifacts (nmap output files, HTTP response bodies, database query results, filesystem state), not the model's narrative description of what it found. The model and the verifier inspect different representations of the same underlying state.

verify_fn checks include: confirmed open ports and service version strings in nmap XML output; credential strings in wordlist tool output; specific HTTP response codes, headers, or body fragments; file presence and content in target filesystems; domain group membership in LDAP query results; and privilege indicators in shell session output.

A session receives OA=1 exclusively when verify_fn returns True. The model's [OBJECTIVE_ACHIEVED] token is necessary but not sufficient; it triggers verification but does not determine its outcome.
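
To make the verifier contract concrete, here is a minimal sketch of a verify_fn-style check against nmap XML output, plus the OA gate. The function names and the choice of port 22/ssh are assumptions for illustration; only the element and attribute names (`port`, `portid`, `state`, `service`, `name`) follow nmap's XML output format.

```python
import xml.etree.ElementTree as ET


def verify_open_service(nmap_xml: str, port: int, service: str) -> bool:
    """Return True only if the nmap XML shows `port` open and running `service`."""
    try:
        root = ET.fromstring(nmap_xml)
    except ET.ParseError:
        return False  # an unreadable artifact is never treated as success
    for port_el in root.iter("port"):
        if port_el.get("portid") != str(port):
            continue
        state = port_el.find("state")
        svc = port_el.find("service")
        if state is None or svc is None:
            continue
        if state.get("state") == "open" and svc.get("name") == service:
            return True
    return False


def objective_achieved(claimed: bool, nmap_xml: str) -> int:
    """OA=1 requires both the completion signal and verifier confirmation."""
    return int(claimed and verify_open_service(nmap_xml, 22, "ssh"))
```

The verifier receives only the tool artifact, never the model's narrative, so a confident but unsupported completion claim cannot raise OA.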

3.4 Halt Discipline

Sessions operate under a per-objective command budget: a min_commands threshold below which halt cannot fire, a max_commands ceiling at which the code layer forces termination, and a per-skill halt_fn that can fire between these thresholds when completion evidence is present in findings. The halt system is implemented entirely in code, not in the model — the model cannot prevent or override it. HD=1 is recorded when a session terminates at max_commands without a verified completion.
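
The budget logic can be sketched as a single decision function evaluated after each command. This is an assumption-laden illustration: the name `halt_decision`, its return values, and the shape of `findings` are hypothetical, not the harness's actual interface.

```python
from typing import Callable, List, Optional


def halt_decision(
    commands_run: int,
    findings: List[str],
    min_commands: int,
    max_commands: int,
    halt_fn: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return a halt reason, or None if the session should continue."""
    if commands_run >= max_commands:
        return "ceiling"   # forced termination by the code layer
    if commands_run < min_commands:
        return None        # halt cannot fire below the floor
    if halt_fn(findings):
        return "evidence"  # completion evidence present in findings
    return None
```

Because the decision depends only on the command count and the recorded findings, the model has no channel through which to prevent or override it.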

3.5 Dataset and Temporal Structure

The 218-run dataset was collected over eight temporal windows spanning approximately six weeks. Windows are defined by natural breaks in the collection timeline — code changes, objective additions, harness modifications, or extended collection pauses. Window sizes range from 8 to 30 sessions; mean window size is 27.25 sessions.

The dataset is longitudinal by collection method, not by design: sessions were run to develop and test the agent, not to construct a benchmark. This creates an important data characteristic: window composition is not uniform. Objective sampling weight varied across windows as development focus shifted, certain objectives were added mid-collection, and some windows concentrate runs on specific objectives under active development. This non-uniformity is the source of the Window 3 artifact described in Section 5.1. I report it rather than normalizing it, because the artifact is the primary finding: aggregate accuracy metrics applied to longitudinally-collected agentic eval data are dominated by sampling distribution properties, not model quality changes.


4. Diagnostic Decomposition

4.1 Formal Definitions

Let $S = \{s_1, s_2, \ldots, s_n\}$ denote a set of evaluation sessions. For each session $s_i$, define:

  • $A_i \in \{0, 1\}$: the agent emitted an [OBJECTIVE_ACHIEVED] completion signal
  • $V_i \in \{0, 1\}$: verify_fn returned True for session $s_i$
  • $H_i \in \{0, 1\}$: the session terminated by reaching the max_commands ceiling

These three binary variables define the three diagnostic axes:

Definition 1 (Objective Achieved). $\text{OA}(s_i) = 1 \iff A_i = 1 \land V_i = 1$. OA requires both a model-emitted completion signal and independent verifier confirmation.

Definition 2 (False Positive). $\text{FP}(s_i) = 1 \iff A_i = 1 \land V_i = 0$. FP captures overclaiming: the model asserted completion that the verifier could not confirm.

Definition 3 (Halt Discipline). $\text{HD}(s_i) = 1 \iff H_i = 1 \land A_i = 0$. HD captures ceiling failure: the model did not claim completion, and the session ended at the command budget. Note that $A_i = 1 \land H_i = 1$ is impossible by construction — the halt system fires only when no completion signal has been emitted.

Proposition 1 (Partition). The events $\{\text{OA}=1\}$, $\{\text{FP}=1\}$, $\{\text{HD}=1\}$, and $\{\text{OA}=0, \text{FP}=0, \text{HD}=0\}$ partition the session space. The fourth event (all three axes zero) captures total disengagement: sessions in which the model neither claimed completion nor hit the ceiling, indicating routing failure or task rejection. The partition is exhaustive and mutually exclusive by construction.

For an objective $o$, let $S_o \subseteq S$ denote the sessions assigned to that objective. The per-objective rates are:

$$\text{OA}_o = \frac{1}{|S_o|} \sum_{s \in S_o} \text{OA}(s), \quad \text{FP}_o = \frac{1}{|S_o|} \sum_{s \in S_o} \text{FP}(s), \quad \text{HD}_o = \frac{1}{|S_o|} \sum_{s \in S_o} \text{HD}(s)$$
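
The definitions and per-objective rates above can be computed directly from the three binary log fields. The sketch below assumes a hypothetical `Session` record holding $A_i$, $V_i$, and $H_i$; the label "NONE" for the fourth partition cell is also an illustrative choice.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Session:
    objective: str
    claimed: bool   # A_i: completion signal emitted
    verified: bool  # V_i: verify_fn returned True
    ceiling: bool   # H_i: terminated at max_commands


def classify(s: Session) -> str:
    """Assign a session to exactly one cell of the partition."""
    if s.claimed and s.verified:
        return "OA"
    if s.claimed:
        return "FP"
    if s.ceiling:
        return "HD"
    return "NONE"  # all three axes zero


def per_objective_rates(sessions: Iterable[Session]) -> Dict[str, Dict[str, float]]:
    """Compute OA, FP, HD, and residual rates for each objective."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        counts[s.objective][classify(s)] += 1
    rates = {}
    for obj, c in counts.items():
        n = sum(c.values())
        rates[obj] = {axis: c[axis] / n for axis in ("OA", "FP", "HD", "NONE")}
    return rates
```

Since every session lands in exactly one cell, the four rates for an objective always sum to 1, which is a cheap consistency check on the logs.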

4.2 Diagnostic Interpretation

The three per-objective rates jointly determine the likely failure class. Table 1 maps rate combinations to diagnoses.

Table 1: Diagnostic interpretation of (OA, FP, HD) combinations

| OA | FP | HD | Diagnosis | First investigation |
|---|---|---|---|---|
| High | Low | Low | Healthy | Verify Tier 2 quality for training eligibility |
| Low | High | Low | Model overclaiming; verify_fn or environment | Review verify_fn logic; check environment pre-run state |
| Low | Low | High | Model cannot converge | Review skill hints, context budget, task framing |
| Low | High | High | Intermittent environment | Check environment lifecycle between sessions |
| Low | Low | Low | Total disengagement | Check skill routing and system prompt |
| Variable | Low | Low | Sampling distribution artifact | Per-objective breakdown; check high-weight outliers |

Remark 1 (FP without HD). When FP is elevated and HD is near-zero, the model is reaching a conclusion — it is not running out of ceiling before it can finish. High FP in this regime indicates either that verify_fn is checking a different state than the model's commands targeted (environment mismatch), or that verify_fn logic is incorrect. This is structurally different from model overclaiming in a correct environment.

Remark 2 (HD without FP). Elevated HD with near-zero FP indicates the model engaged with the task, ran commands iteratively, and never produced a confident-but-incorrect completion claim. This is the signature of genuine inability — the model was not miscalibrated about completion, it simply could not reach it within the session budget. Root causes include insufficient skill hints, inadequate context about the target environment, or an unsatisfiable success condition.

Remark 3 (FP and HD simultaneously elevated). This combination — the model sometimes overclaims and sometimes runs out of ceiling — indicates intermittent environment state. When the environment is in a valid state, the model finds it and claims completion (some of which verify_fn rejects, elevating FP). When the environment is in an invalid state, the model iterates without finding evidence of completion and hits the ceiling (elevating HD). The dual-mode pattern is a specific fingerprint of stochastic environment failure.
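
One way to operationalize Table 1 and the three remarks is a small rule-based classifier over per-objective rates. The thresholds below (0.7 for "high" OA, 0.3 for "high" FP or HD) are illustrative assumptions, not values from the paper, and the "Variable" row is omitted because it cannot be detected from a single rate triple.

```python
def diagnose(oa: float, fp: float, hd: float,
             oa_high: float = 0.7, err_high: float = 0.3) -> str:
    """Map per-objective (OA, FP, HD) rates to a Table 1 diagnosis.

    Thresholds are illustrative assumptions and should be tuned per harness.
    """
    fp_hi = fp >= err_high
    hd_hi = hd >= err_high
    if fp_hi and hd_hi:
        return "intermittent environment"                      # Remark 3
    if fp_hi:
        return "overclaiming, or verify_fn/environment mismatch"  # Remark 1
    if hd_hi:
        return "cannot converge, or unsatisfiable condition"   # Remark 2
    if oa >= oa_high:
        return "healthy"
    if 1.0 - (oa + fp + hd) >= err_high:
        return "total disengagement"
    return "indeterminate"
```

The output is a pointer to the first investigation to run rather than a verdict; borderline rate combinations fall through to "indeterminate" and need per-session inspection.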

4.3 Relationship to Aggregate Accuracy

Aggregate pass rate $\overline{\text{OA}} = \frac{1}{|S|}\sum_{s \in S} \text{OA}(s)$ is a weighted average across objectives with weights proportional to sampling frequency. If objective $o^*$ has sampling weight $w_{o^*} = |S_{o^*}|/|S|$ and OA rate $\text{OA}_{o^*}$, then a change in $\text{OA}_{o^*}$ induces an aggregate change of $w_{o^*} \cdot \Delta\text{OA}_{o^*}$. An objective at $w_{o^*} = 0.74$ with $\text{OA}_{o^*}$ falling from 0.5 to 0.0 produces an aggregate drop of 37 points, indistinguishable in aggregate from a 37-point drop uniformly distributed across all objectives. The decomposition separates these cases; aggregate accuracy cannot.
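
The weighting arithmetic is easy to check numerically. The sketch below uses the concentrated case from the text (weight 0.74, OA falling from 0.5 to 0.0) and a hypothetical uniform case in which every objective loses 0.37 OA; the starting value 0.87 in the uniform case is an assumption chosen only to produce the same drop.

```python
def aggregate_delta(weight: float, oa_before: float, oa_after: float) -> float:
    """Change in aggregate OA induced by an OA change on objectives
    that together carry `weight` of the sampled sessions."""
    return weight * (oa_after - oa_before)


# Concentrated case from the text: one objective at w = 0.74 collapses.
concentrated = aggregate_delta(0.74, 0.5, 0.0)

# Hypothetical uniform case: all objectives (total weight 1.0) lose 0.37.
uniform = aggregate_delta(1.0, 0.87, 0.50)

# Both cases yield the same aggregate change despite different causes.
same_aggregate = abs(concentrated - uniform) < 1e-9
```

The two scenarios are identical at the aggregate level and differ only in the per-objective breakdown, which is the information the decomposition preserves.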


5. Longitudinal Analysis

5.1 Window 3: The Regression That Wasn't (Runs 129–158, 2026-05-11)

Aggregate result. Window 3 OA: 0.133 (4/30 sessions). Against a mean OA of 0.81 across Windows 1–2, this appears to be a severe regression.

Decomposition. Per-objective analysis reveals that 26 of 35 sessions (74.3%) in Window 3 were assigned to a single objective (PT-WEBEX-02: Juice Shop SQL injection), run in rapid succession during active development of three simultaneous fixes: a container auto-start mechanism, a hint template-string correction, and a verify_fn coverage extension. Within those 26 PT-WEBEX-02 sessions, $\text{FP}_{\text{PT-WEBEX-02}} = 0.31$ and $\text{HD}_{\text{PT-WEBEX-02}} = 0.54$, the dual-elevation pattern of Remark 3: an intermittently available container producing mixed-mode failures. The four OA=1 sessions came from three other objectives (PT-POST-01, PT-SCAN-01, and PT-ASSESS-01), which were run between debugging attempts and achieved 100% OA.

Finding. Window 3 represents a debugging snapshot with a non-representative sampling distribution ($w_{\text{PT-WEBEX-02}} = 0.743$) and an environment under active repair, not a model regression. The three objectives unaffected by the debugging context performed at their historical rates. No model quality change is attributable to this window.

Without decomposition: a 37-point aggregate drop triggers an incident review. With decomposition: $w_{\text{PT-WEBEX-02}} = 0.743$, dual-mode FP/HD on one objective, zero degradation on three unaffected objectives — a checklist item.

5.2 Window 4: Three Simultaneous Harness Failures (Runs 99–128, 2026-05-10 → 2026-05-11)

Aggregate result. Window 4 OA: 0.572, a 39.8 percentage-point drop from the immediately preceding window's clean baseline (Window 5, runs 69–98, 0.972 OA — the calibration reference promoted on 2026-05-09).

Decomposition. The 39.8-point drop concentrates in exactly three objective groups: PT-WEBEX-02, PT-XSS-02, and the PT-POST-03/PT-POST-04 pair. All three show the same pattern: $\text{OA} \approx 0$, $\text{FP} > 0.4$, $\text{HD} \approx 0$. Per Remark 1, elevated FP without elevated HD indicates verify_fn or environment mismatch: the model is reaching conclusions the verifier rejects, not running out of ceiling before it can attempt completion. Post-hoc analysis identified three independent harness bugs:

  • PT-WEBEX-02 (Juice Shop SQL injection): The Juice Shop container was not being started between sessions. verify_fn was probing a dead endpoint. The model reached a live server intermittently; the verifier never did.
  • PT-XSS-02 (DVWA stored XSS): The DVWA admin credential was not being reset between sessions. Stale session state from prior runs caused verify_fn to check against a pre-populated environment that invalidated its expected evidence pattern.
  • PT-POST-03/PT-POST-04 (post-exploitation exfiltration and persistence): A subdomain typo ('post' vs. 'PostExploit') caused verify_fn to probe the wrong container host. Model commands reached the correct target; verifier probes did not.

The 45 objectives unaffected by these three bugs performed at 0.971 mean OA across Window 4, statistically indistinguishable from their historical baseline.

Finding. A 39.8 percentage-point aggregate drop traces entirely to three one-line harness defects. Underlying model quality was unchanged. The decomposition pattern — FP without HD, on exactly three objectives, each with an identifiable environment or verifier cause — is structurally inconsistent with a model regression, which would produce distributed degradation across objectives with no objective-specific verifier involvement.

5.3 PT-POST-02: The Unsatisfiable Objective (Runs 39–68, 2026-05-07 → 2026-05-09)

Aggregate result. PT-POST-02 (password cracking: crack the msfadmin account's /etc/shadow hash using john) shows persistent $\text{OA} = 0.30$ in this window (70% failure), with no improvement across sessions despite active hint development; residual failures persisted into later windows (≈46% OA in the runs-159–188 window) while the fix propagated.

Decomposition. $\text{FP}_{\text{PT-POST-02}} = 0.02$, $\text{HD}_{\text{PT-POST-02}} = 0.68$. High HD with near-zero FP (Remark 2): the model engaged with the task, ran commands iteratively, and never produced a confident-but-incorrect completion claim. This pattern rules out model overclaiming and harness verifier mismatch as primary failure causes. Root cause analysis of verify_fn identified one of several documented causes: the success condition checked for the credential string msfadmin in cracking tool output, while the task constrained the model to use rockyou.txt, and msfadmin does not appear in rockyou.txt. The objective also provides msfadmin:msfadmin SSH credentials by design (the crack target is the /etc/shadow hashes, not the login password), so framing the task as "retrieve the msfadmin credential constrained to rockyou" oversimplifies the objective. Subsequent residual failures in later windows traced to additional, independent causes: john.pot cleanup between runs and a Dockerfile-level wordlist repair (397fc4b) that was still propagating. The original success condition was, in effect, unsatisfiable under its stated constraints, but it was not the sole driver of the persistent high-HD signal.

The model was not failing to complete a possible task. It was completing all executable steps of an impossible one — engaging seriously, running commands, interpreting output, and appropriately not claiming completion when the evidence was absent.

Finding. An unsatisfiable success condition generates persistent high-HD signal that is indistinguishable from model inability in aggregate OA. The correct resolution is not hint refinement or model improvement — it is precondition verification before collection. Correcting the wordlist and recalibrating verify_fn (Issues #194, #195), together with the later john.pot cleanup and Dockerfile-level wordlist repair, ultimately moved PT-POST-02 to full OA once all of these causes had propagated.
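
A precondition check of the kind recommended here can be very small. The sketch below covers only one of the causes discussed (an expected credential missing from the constraining wordlist); the function names are hypothetical, and the check assumes the success condition reduces to one plaintext string appearing as a whole line in the wordlist.

```python
from pathlib import Path


def credential_in_wordlist(wordlist: Path, expected: str) -> bool:
    """Return True if `expected` appears as a whole line in the wordlist."""
    needle = expected.encode("utf-8")
    # Read as bytes: large wordlists are often not valid UTF-8 throughout.
    with wordlist.open("rb") as fh:
        return any(line.rstrip(b"\r\n") == needle for line in fh)


def check_satisfiable(wordlist: Path, expected: str) -> None:
    """Fail fast before collection if the success condition cannot be met."""
    if not credential_in_wordlist(wordlist, expected):
        raise ValueError(
            f"{expected!r} not found in {wordlist}; objective is unsatisfiable"
        )
```

Running such a check from setup_fn, before any session is collected, turns a week of unexplained HD into an immediate configuration error.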


6. Tier 2: A Training Quality Complement

Binary OA determines whether the agent completed an objective. It does not determine whether the resulting session is valuable training data. I describe a Tier 2 scoring complement that recovers this distinction.

6.1 Motivation

Two failure modes of binary-OA-gated fine-tuning selection:

Type I (high-OA, low-quality). A session achieves OA=1 by running one correct command early, receiving fortuitous output, and emitting a completion signal without demonstrating robust reasoning, deliberate tool selection, or careful evidence interpretation. The session passes the OA gate but provides sparse training signal — it teaches the model to replicate a surface form of success, not the underlying skill. In the worst case, it trains the model to emit completion signals on thin evidence.

Type II (low-OA, high-quality). A session runs appropriate tools in the right order, interprets output correctly, makes well-grounded tool selection decisions, and engages seriously with the task — but does not reach completion, either due to a max_commands ceiling (HD=1) or an environment state problem that prevents verification. The session fails the OA gate but contains high-value partial-completion signal. Excluding it discards training data that demonstrates good reasoning chains, appropriate tool use, and accurate intermediate interpretation.

The Tier 2 score provides the quality signal that separates these cases.

6.2 Scoring Rubric

Each session is scored on a 0–3 integer scale by an LLM-as-judge evaluation over the full session transcript. The judge evaluates four axes:

  1. Grounding: Are the model's findings statements grounded in specific tool output present in the session log, or do they reference evidence not produced by executed commands?
  2. Tool appropriateness: Are the tools selected appropriate for the objective's subdomain and for the information available at each step?
  3. Completion honesty: Does the model's completion claim (or absence thereof) reflect the actual state of the evidence in the session?
  4. Scope adherence: Did the model remain within the stated objective and target scope throughout the session?

A session scoring 2–3 is training-eligible. A session scoring 0–1 is excluded regardless of OA result.

6.3 Divergence Patterns

The cross-tabulation of OA × Tier 2 defines the training selection policy:

| OA | Tier 2 | Training value | Action |
|---|---|---|---|
| 1 | ≥ 2 | High | Include at full weight |
| 0 | ≥ 2 | Medium | Include as partial-completion signal |
| 1 | < 2 | Low (potentially harmful) | Exclude |
| 0 | < 2 | None | Exclude |

The OA=1, Tier 2 < 2 cell is the most consequential exclusion: sessions that pass binary eval but would train the model on surface-pattern success rather than grounded reasoning. This corresponds, in the epistemological literature, to the distinction between justified true belief and accidentally true belief — a session whose completion is correct but whose reasoning is thin cannot serve as a reliable training exemplar for the underlying skill, only for its surface form.
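
The selection policy in the cross-tabulation reduces to a two-input function. The sketch below is one possible encoding; the action labels are hypothetical names, and it assumes the Tier 2 score is already available as a single integer from 0 to 3.

```python
def training_action(oa: int, tier2: int) -> str:
    """Map (OA, Tier 2 score) to a training selection action."""
    if oa not in (0, 1) or tier2 not in (0, 1, 2, 3):
        raise ValueError("oa must be 0 or 1 and tier2 an integer in 0..3")
    if tier2 < 2:
        return "exclude"  # excluded regardless of OA result
    return "include_full" if oa == 1 else "include_partial"
```

Checking the Tier 2 score first reflects the policy's priority: a low-quality session is excluded even when it passed binary evaluation.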


7. Discussion

7.1 Harness Hygiene as a First-Order Eval Concern

The three case studies share a common implication: in agentic eval systems, harness correctness is a prerequisite for model quality assessment, not a background assumption. The Window 4 harness failures are not aberrations — they represent a class of failure that any longitudinally-run agentic eval harness will encounter. Environments drift between sessions; verification logic has bugs; container lifecycle management introduces state; success conditions can be defined against evidence that the agent's tools cannot produce.

I recommend five pre-collection harness checks as standard practice:

  1. Verifier independence audit: Confirm verify_fn probes actual environment state rather than model-produced artifacts.
  2. Objective satisfiability check: Verify that each objective's success condition is achievable by the model's available tools under the stated constraints — before collection, not after a week of unexplained HD.
  3. Environment lifecycle validation: Confirm that setup_fn and teardown procedures fully reset environment state between sessions, including credential resets, container restarts, and filesystem cleanup.
  4. Sampling weight monitoring: Alert when any single objective's sampling weight exceeds twice the per-objective average in a collection window.
  5. Decomposition review at window close: Produce the (OA, FP, HD) breakdown for each objective at the end of each collection window before interpreting aggregate results.
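
The sampling weight check (item 4) is simple to automate. The sketch below assumes the "per-objective average" is taken over the objectives actually sampled in the window, which is one possible reading; the function name and the example objective IDs in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def overweight_objectives(objective_ids: Iterable[str],
                          factor: float = 2.0) -> Dict[str, float]:
    """Return objectives whose sampling weight in a window exceeds
    `factor` times the average per-objective weight."""
    counts = Counter(objective_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Mean weight over the objectives sampled in this window (assumption).
    average = 1.0 / len(counts)
    return {obj: n / total for obj, n in counts.items()
            if n / total > factor * average}
```

For a window of 30 sessions with 26 on one objective and the remaining 4 spread over three others, only the dominant objective is flagged; a window sampled evenly returns an empty result.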

7.2 Generalizability

The OA/FP/HD decomposition is not specific to penetration testing. The three axes — verified success, overclaiming, ceiling failure — are derivable from event logs that any agentic eval harness should produce: whether the agent claimed completion, whether an independent verifier confirmed it, and whether the session hit a ceiling. The decomposition applies to any domain in which: the agent interacts with a real or simulated environment; completion is verified by a structurally independent mechanism; and sessions operate under an enforced action budget.

This covers most serious agentic eval settings: software engineering agents verified against test suites (SWE-Bench), web agents verified against page state (WebArena), tool-use agents verified against API responses (APIBench), and scientific reasoning agents verified against simulation outputs. The specific verify_fn implementation differs across domains; the diagnostic interpretation table (Table 1) applies uniformly.

The Tier 2 quality complement generalizes as well, with rubric adjustments for domain-specific quality dimensions (code correctness and test coverage for software agents; navigation efficiency and task completion for web agents).

7.3 Limitations

Single model. The dataset covers one model family (qwen3:14b) on one hardware configuration. Decomposition patterns may differ for larger models, models with extended context windows, or models fine-tuned on security tasks. The framework generalizes; the specific quantitative results do not.

Longitudinal collection artifacts. The dataset was collected during active development rather than under controlled experimental conditions. Window composition is non-uniform and influenced by which objectives were under active development in each window. Controlled collection — uniform per-objective sampling, frozen harness state — would produce cleaner longitudinal signal but is not representative of how agentic eval data is actually collected in practice.

Runtime-environment confounds. Separate analysis of this same corpus documents large effects from the runtime environment that are orthogonal to model quality: a ~45 percentage-point swing in pass rate by time-of-day, and skill-specific degradation under VRAM pressure (e.g., a ~34 percentage-point drop for ssh_proxyjump between low- and high-VRAM session starts). Because windows were collected at different times of day and under varying GPU-context conditions, two windows can differ by amounts of this magnitude from runtime state alone, independent of any code or model change. This does not weaken the paper's central claim — it strengthens it: it is further evidence that aggregate OA shifts trace to harness and environment state rather than model quality, and it widens the set of confounds the decomposition must be read against.

Window/session labels are a narrative overlay, not a recorded field. Windows and per-session sequence numbers are an analysis-time construction (windows are keyed to the first CSV filename in each run bucket); they are not columns in the released CSVs. Consequently the "Runs 99–128"-style labels used above are not directly reproducible from the released data, and the window→run-range mapping is reverse-chronological (Window 1 is the most recent). Readers reproducing these results should key on the collection dates (provided per section) rather than on window or session numbers.

Unsupervised Tier 2. The Tier 2 LLM-as-judge evaluation uses the same model family as the agent being evaluated, which leaves it exposed to self-preference bias. Human validation of Tier 2 scores on a sample of sessions is needed to establish judge-human agreement (inter-rater reliability) before the gate is used for production fine-tuning selection.
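One common way to quantify agreement between a judge and a human rater on an ordinal 0–3 scale is quadratic-weighted Cohen's kappa. The sketch below is an assumption about how such a validation could be scored, not a procedure this paper has run; the function name and the example score lists are hypothetical.

```python
def weighted_kappa(judge, human, k=4):
    """Quadratic-weighted Cohen's kappa for integer scores in range(k).

    Returns 1.0 for perfect agreement, about 0.0 for chance-level agreement,
    and negative values for systematic disagreement.
    """
    if len(judge) != len(human) or not judge:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(judge)

    # Observed confusion matrix: obs[i][j] counts judge=i, human=j.
    obs = [[0] * k for _ in range(k)]
    for x, y in zip(judge, human):
        obs[x][y] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            observed += w * obs[i][j]
            expected += w * row[i] * col[j] / n
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single score")
    return 1.0 - observed / expected

# Illustrative score lists only (not real Tier 2 data)
perfect = weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3])
reversed_scores = weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0])
```

Quadratic weighting penalizes a 0-versus-3 disagreement more heavily than a 2-versus-3 one, which suits a gate where the decision boundary sits between adjacent scores. An unweighted kappa or a simple agreement rate at the gate threshold would be reasonable alternatives.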


8. Conclusion

Aggregate pass rate is an unreliable signal for longitudinally evaluated agentic systems. The harness co-produces the quality signal it measures; harness failures are indistinguishable from model failures in aggregate accuracy; and sampling distribution artifacts can dominate aggregate results regardless of model quality. The OA/FP/HD decomposition, which requires no additional infrastructure beyond logs a well-instrumented harness already produces, recovers diagnostically distinct failure modes that aggregate accuracy collapses. In the 218-run dataset, two of three observed accuracy drops exceeding 15 percentage points trace to harness infrastructure rather than model quality, and one persistent failure pattern reflects an unsatisfiable success condition rather than model inability. None of these diagnoses is recoverable from aggregate OA alone. All are immediately visible in the three-axis decomposition.

The framework and dataset are released to support reproducibility and to provide a foundation for security-domain agentic eval research that correctly accounts for the harness co-production problem.


References

[1] Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.

[2] Hendrycks, D., Burns, C., Basart, S., et al. (2020). Measuring massive multitask language understanding. arXiv:2009.03300.

[3] Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic evaluation of language models. arXiv:2211.09110.

[4] Austin, J., Odena, A., Nye, M., et al. (2021). Program synthesis with large language models. arXiv:2108.07732.

[5] Liu, J., Xia, C. S., Wang, Y., and Zhang, L. (2023). Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. NeurIPS 2023.

[6] Srivastava, A., Rastogi, A., Rao, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv:2206.04615.

[7] Bhatt, M., Chennabasappa, S., Nikolaidis, C., et al. (2023). Purple Llama CyberSecEval: A secure coding benchmark for language models. arXiv:2312.04724.

[8] Bhatt, M., Chennabasappa, S., Li, Y., et al. (2024). CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv:2404.13161.

[9] Shao, M., Jancheska, S., Udeshi, M., et al. (2024). NYU CTF Bench: A scalable open-source benchmark for evaluating LLMs in offensive security. arXiv:2406.05590.

[10] Deng, G., Liu, Y., Mayoral-Vilches, V., et al. (2024). PentestGPT: Evaluating and harnessing large language models for automated penetration testing. USENIX Security 2024.

[11] Yang, J., Prabhakar, A., Narasimhan, K., and Yao, S. (2023). InterCode: Standardizing and benchmarking interactive coding with execution feedback. NeurIPS 2023.

[12] Jimenez, C. E., Yang, J., Wettig, A., et al. (2024). SWE-bench: Can language models resolve real-world GitHub issues? ICLR 2024.

[13] OpenAI (2024). Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/ (A human-validated 500-instance subset of SWE-bench; 93 annotators removed 68.3% of original instances for flawed or under-specified test harnesses.)

[14] Liu, X., Yu, H., Zhang, H., et al. (2023). AgentBench: Evaluating LLMs as agents. arXiv:2308.03688.

[15] Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

[16] Xu, R., Wang, Z., Fan, R.-Z., and Liu, P. (2024). Benchmarking benchmark leakage in large language models. arXiv:2404.18824.


Centaur Security Labs — centaursecuritylabs.com