date: 2026-06-01
description: A grounded taxonomy of 16 operational failure classes observed across six weeks of high-cadence eval-driven development of an LLM-based security agent. Derived from 130+ diagnosed issues, a discovery rate that demonstrates what rapid eval cycling surfaces compared to conventional development timelines. Four cross-cutting meta-patterns account for the majority of failures and point toward three structural fixes that require no model changes. Class 16 documents an evaluator-layer failure: an LLM auditor confidently diagnosing real tool output as fabricated because its training data predates the tool version that introduced the capability. Section 2.4 situates the taxonomy against established frameworks from medicine and engineering (FMEA, the Swiss Cheese Model, Fault Tree Analysis, HAZOP, HFACS, and RCM) and derives three structural implications for ARCHER's development process.
comments: true
Operational Failure Modes in LLM-Based Security Agents: A Taxonomy from Six Weeks of High-Cadence Eval-Driven Development
Status: In Preparation | Centaur Security Labs | 2026
Sections 1–6 are complete. Sections 7–10 pending venue confirmation and data collection.
The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.
Abstract
I present a taxonomy of 16 operational failure classes observed across six weeks of high-cadence eval-driven development of ARCHER, an LLM-based autonomous security agent. The compressed timeline is itself a finding: high-frequency eval cycling surfaces failure classes at a rate that conventional development schedules spread across months. Drawing from 130+ diagnosed bug and regression issues spanning penetration testing, privilege escalation, lateral movement, and post-exploitation objectives, I identify recurring root causes that were independently diagnosed and fixed multiple times under different issue numbers. I characterize four cross-cutting meta-patterns — startup log liveness checks, missing verification steps, app-specific hints without generic companions, and wrong-host execution — each of which accounts for failures across three or more independent failure classes. The dominant development anti-pattern is the one-bug-one-fix trap: fixing the symptom in the specific objective that surfaced it while leaving adjacent objectives with the same structural defect unfixed. I propose three infrastructure changes — a hint linter, a _targeted_at guard audit, and hard pipeline gates — that would have prevented the majority of recorded failures without requiring model changes. The taxonomy also includes one evaluator-layer failure class (Class 16): an LLM auditor confidently diagnosing real tool output as fabricated because its training data predates the tool version that introduced the capability. 
Section 2.4 contextualizes the taxonomy against established failure classification frameworks from medicine and engineering — FMEA, the Swiss Cheese Model, Fault Tree Analysis, HAZOP, HFACS, and Reliability-Centered Maintenance — and identifies three structural implications: the absence of prospective failure analysis before skill pack deployment, systematic underweighting of latent failures relative to active failures, and a detection scheduling mismatch that allows training contamination (a hidden failure mode) to accumulate between audit cycles. The taxonomy generalizes beyond ARCHER: shell variable loss, PTY crashes, wrong-host confusion, training data contamination, and premature objective achievement are failure modes that will appear in any LLM agent that dispatches shell commands against remote systems.
1. Introduction
The promise of LLM-based security agents is automation at depth: not just scanning, but reasoning about output, chaining steps, adapting to target-specific responses, and completing multi-phase operations without per-step human direction. ARCHER pursues this capability in the penetration testing domain — an environment defined by heterogeneous targets, partial information, and operations where incorrect execution has real consequences.
Developing an agent that works reliably in this environment revealed something unexpected: the limiting factor was not model capability. The model could identify vulnerabilities, select tools, interpret output, and chain operations. The limiting factor was a set of recurring, structural failure modes — failure classes that appeared under different issue numbers across different objectives but shared the same root cause.
This paper documents those failure classes. The motivation is practical: diagnosis is not complete until all instances of a root cause are found. A fix that closes one failing objective while leaving five sibling objectives with the same structural defect is not a fix — it is a deferral. By naming and characterizing each class, this taxonomy makes it possible to audit an entire agent system for a class at once, rather than waiting for each instance to surface through eval failures.
1.1 The Eval-Driven Development Loop
ARCHER is developed and evaluated against live lab targets — known-state virtual machines running Metasploitable2, DVWA, OWASP Juice Shop, and a multi-host Active Directory environment. The evaluation harness runs objectives against these targets and records outcomes: pass/fail, command count, halt reason, and session transcripts. Passing sessions generate fine-tuning data; failing sessions generate bug reports.
This loop is the source of the data in this paper. Every failure class described here was observed in live eval runs against real targets — not in unit tests or synthetic benchmarks. The failure modes are operational, not theoretical.
1.2 The One-Bug-One-Fix Trap
The dominant failure pattern in ARCHER's development history is not any single bug class — it is the tendency to fix symptoms rather than classes. Shell variable loss was diagnosed and fixed three times: a missing LHOST in a Metasploit chain (issue #161), an unsubstituted <attacker_ip> placeholder (#411), a lost variable between command dispatches (#475). All three are the same failure, independently diagnosed and fixed under different issue numbers.
Across the 15 agent-layer failure classes and 130+ issues, this pattern repeats. A root cause is identified, fixed for the specific objective that surfaced it, and left unaddressed in adjacent objectives where the same code pattern exists. The lesson is structural: diagnosis is not complete until all instances of the root cause are found. This paper is the mechanism for making that search systematic.
1.3 Scope
- System: ARCHER v1/v2, eval-driven development period
- Issues analyzed: 130+ closed bug and regression issues (#62–#480)
- Skill domains: penetration testing, privilege escalation, lateral movement, web exploitation, post-exploitation, active directory
- Objectives: 51 active objectives across 9 skill packs
- Environment: Metasploitable2, DVWA, OWASP Juice Shop, GOAD Active Directory lab
2. Background
2.1 ARCHER Architecture
ARCHER operates as a command-dispatching agent: it receives a task description, selects a skill domain, injects skill-specific hints and context into the model prompt, generates a command, executes it against a remote target, reads the output, and decides whether to continue or halt. The loop repeats until the model emits an objective-achieved signal or a halt condition fires.
Three components are directly implicated in the failure classes in this paper:
The hint system. Skill packs provide structured hints — conditional blocks that trigger on task keywords and inject specific commands, tool invocations, and sequencing guidance. Hints are the primary mechanism for injecting domain knowledge into model-generated behavior. They are also the primary source of Class 1–8 failures.
The success function. Each objective has a success_fn — a programmatic check that reads session output and determines whether the objective was achieved. The success function is deterministic: given the same output, it returns the same verdict. It is also the primary attack surface for Classes 4, 11, and 15 (premature achievement, false positive, wrong-host confusion).
The training pipeline. Passing sessions generate fine-tuning data. The pipeline is gated: only sessions that pass a success function check and a Tier 2 LLM quality score enter the training data. Pipeline contamination (Class 14) degrades model behavior in ways that manifest weeks later, making it the hardest failure class to diagnose.
2.2 The Three-Layer Responsibility Split
ARCHER's architecture divides responsibility across three layers:
- The model handles reasoning: command generation, output interpretation, next-step selection, attack chain construction.
- The code handles enforcement: routing, command execution, safety constraints, halt detection, session logging.
- The human handles judgment: defining scope, authorizing high-impact actions, interpreting findings, final quality assurance.
This split is load-bearing. Several failure classes in this paper are direct consequences of responsibility crossing a layer boundary — most commonly, logic that should be in code (a success condition check, a host-targeting guard) being implicitly delegated to the model, which handles it inconsistently.
2.3 Eval Harness Methodology
The evaluation harness runs each objective N times against a live target, records outcomes to a CSV, and writes session transcripts to JSONL. Objectives are defined by: a task description, an expected skill domain, a setup_fn (environment initialization), a success_fn (outcome check), and a halt_fn (stop-early conditions). The harness produces pass rate, command count, halt reason, and session log for each run.
This structure means that some failures are visible immediately (the pass rate drops) and others are invisible (a false positive success function passes a session that should have failed). Invisible failures are significantly more dangerous — they generate contaminated training data that degrades future model behavior without triggering any eval alert.
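The objective structure described in this section can be sketched as a small data type plus a run loop. This is a minimal illustration under stated assumptions, not ARCHER's harness: the `Objective` and `RunResult` types, the `run_objective` helper, the `step` callable, and the halt reason strings are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Objective:
    """Hypothetical container mirroring the five fields named in the text."""
    task: str                                 # task description
    skill_domain: str                         # expected skill domain
    setup_fn: Callable[[], None]              # environment initialization
    success_fn: Callable[[str], bool]         # outcome check over session output
    halt_fn: Callable[[str], Optional[str]]   # returns a halt reason, or None


@dataclass
class RunResult:
    passed: bool
    command_count: int
    halt_reason: Optional[str]


def run_objective(
    obj: Objective,
    step: Callable[[str], Optional[str]],
    max_commands: int = 20,
) -> RunResult:
    """Run one session.

    `step` stands in for the model plus executor (an assumption): it receives
    the transcript so far and returns the next command's output, or None when
    the model signals that it is done.
    """
    obj.setup_fn()
    transcript: List[str] = []
    halt_reason: Optional[str] = None
    for _ in range(max_commands):
        output = step("\n".join(transcript))
        if output is None:
            halt_reason = "model_done"
            break
        transcript.append(output)
        halt_reason = obj.halt_fn(output)
        if halt_reason is not None:
            break
    else:
        # Loop ran to completion without a break: the command budget is spent.
        halt_reason = "max_commands"
    # The verdict comes from success_fn alone, never from the model's claim.
    passed = obj.success_fn("\n".join(transcript))
    return RunResult(passed, len(transcript), halt_reason)
```

Keeping the verdict inside `success_fn` is what makes a false-positive check dangerous: the loop has no second opinion, so a weak check silently turns a failed session into a passing one.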
2.4 Situating ARCHER's Taxonomy in Established Failure Classification Frameworks
ARCHER's failure taxonomy did not emerge from theory — it was distilled from 130+ diagnosed issues across six weeks of intensive operational development. That compression is notable: high-cadence eval cycling, with every session producing labeled pass/fail data against live targets, surfaced failure classes that lower-frequency development processes would have taken months to accumulate. But the structure it arrived at independently parallels frameworks developed over decades in medicine and engineering, and those frameworks offer methodological lessons that sharpen both the taxonomy and the remediation strategy.
Established Frameworks and Their Counterparts
FMEA/FMECA (Failure Mode and Effects Analysis, and its extension Failure Mode, Effects, and Criticality Analysis). The dominant failure classification framework in aerospace, automotive, and medical device engineering. FMEA operates prospectively: before a system ships, analysts enumerate every component's potential failure modes, their effects on the system, and their severity. Each failure mode receives a Risk Priority Number (RPN = Severity × Occurrence × Detection), where each factor is typically scored on a 1–10 scale and a higher Detection score means the failure is harder to detect. High-RPN items drive design changes before first operational use.
ARCHER's taxonomy is FMEA's retrospective mirror: it was built from post-hoc diagnosis of operational failures rather than pre-deployment enumeration. The structural vocabulary is the same — failure mode, root cause, effect, detectability — but the inputs are real incidents rather than design-time hypotheses. The lesson FMEA offers is that the prospective direction is higher-leverage: a 30-minute guideword pass over a new skill pack before it ships would catch a predictable subset of Class 1, 6, and 10 failures before they require diagnosed issues and eval cycles to surface. The hint linter described in Section 6.1 is ARCHER's automated FMEA — static checks that apply the failure class inventory prospectively to every hint change.
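To make the RPN arithmetic concrete, the following sketch ranks a few failure classes by Severity × Occurrence × Detection. The scores are placeholders chosen purely for illustration and are not measurements from ARCHER's issue history; the `FailureMode` type and `rank_by_rpn` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always detected) to 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes: List[FailureMode]) -> List[FailureMode]:
    """Highest-risk failure modes first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Placeholder scores for illustration only; NOT data from the paper.
modes = [
    FailureMode("shell variable loss", severity=5, occurrence=7, detection=3),
    FailureMode("training contamination", severity=8, occurrence=4, detection=9),
    FailureMode("hint over character limit", severity=4, occurrence=5, detection=2),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these placeholder inputs, the hard-to-detect mode ranks first even though it occurs less often, which is the effect the Detection factor is meant to produce.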
James Reason's Swiss Cheese Model. Reason's model distinguishes latent failures — defects in system design, organization, or process that exist invisibly until conditions align — from active failures, which are the proximate human or mechanical errors that trigger incidents. Latent failures are the holes in the cheese; active failures are the momentary alignment of holes that allows an incident to propagate through all layers. The model's central insight is that active failures are symptoms; latent failures are causes.
This maps directly onto ARCHER's meta-patterns (Section 4). The four meta-patterns — startup log liveness checks, missing verification steps, app-specific hints without generic companions, wrong-host execution — are latent failures in Reason's sense: structural defects that exist in the system design and wait for specific objective conditions to activate them. The individual class instances (Class 2 PTY crash, Class 6 hint gap, Class 15 wrong-host confusion) are the active failures — the incident that surfaced because that objective happened to expose the latent defect. This reframing has a direct operational implication: fixing an active failure without addressing the underlying latent failure is guaranteed to produce recurrence under different conditions. ARCHER's one-bug-one-fix trap (Section 1.2) is a direct consequence of treating active failures as primary.
Fault Tree Analysis (FTA). FTA is FMEA's complement: top-down rather than bottom-up. Beginning from an undesired top event (a major system failure), FTA decomposes causes through AND/OR gate trees until reaching basic fault events. It is particularly suited to identifying combinations of independently non-critical failures that become critical when coincident — the AND-gate pattern.
ARCHER's Class 14 (training contamination) is structurally an AND-gate failure: contaminated training data enters the pipeline (fault 1) AND the contamination is not detected at write time (fault 2) AND the degraded behavior surfaces weeks later under conditions that obscure the cause (fault 3). No single fault alone produces the outcome; the dangerous outcome requires all three. This explains why contamination has recurred across 10 issues despite documentation: documentation gives guidance on avoiding fault 1 but enforces nothing mechanically, so faults 2 and 3 remain open. Hard pipeline gates (Section 6.3) close all three branches of the AND gate simultaneously.
HAZOP (Hazard and Operability Study). HAZOP is a structured deviation analysis using guidewords applied to process parameters: No, More, Less, Reverse, Other Than, As Well As, Part Of, Early, Late, Before, After. Each guideword is applied to each process step to systematically generate deviation scenarios. HAZOP's power is forcing analysis of edge cases that would not occur to analysts reviewing the normal path.
The guideword method is directly applicable to hint review. Applying "No" to a hint step ("what if this command produces no output?"), "More" ("what if output exceeds 16KB?"), "Reverse" ("what if the model tries this step before the prerequisite?"), "Other Than" ("what if the target returns an error message instead of the expected string?"), and "Part Of" ("what if only a subset of the expected evidence is present — e.g., credentials dumped but not cracked?") would surface latent Classes 3, 8, 6, 11, and 4 respectively — all from reading the hint, before running a single eval. The current hint linter automates a subset of this; the guideword method suggests what else to add.
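The guideword pass described above can be sketched as a checklist generator. The questions and class numbers are taken from the paragraph above; the `Guideword` type, the `review_checklist` helper, and the example step names are hypothetical, and this is a review aid rather than part of ARCHER's hint linter.

```python
from typing import List, NamedTuple


class Guideword(NamedTuple):
    word: str
    question: str
    latent_class: int  # failure class the question is meant to surface


# Guidewords, questions, and class numbers as listed in the text.
GUIDEWORDS = [
    Guideword("No", "What if this command produces no output?", 3),
    Guideword("More", "What if output exceeds 16KB?", 8),
    Guideword("Reverse", "What if the model tries this step before the prerequisite?", 6),
    Guideword("Other Than", "What if the target returns an error message instead of the expected string?", 11),
    Guideword("Part Of", "What if only a subset of the expected evidence is present?", 4),
]


def review_checklist(hint_steps: List[str]) -> List[str]:
    """Cross every hint step with every guideword to get review questions."""
    return [
        f"[{g.word}] {step}: {g.question} (Class {g.latent_class})"
        for step in hint_steps
        for g in GUIDEWORDS
    ]


for line in review_checklist(["dump the shadow file", "run john against the dump"]):
    print(line)
```

The output grows as steps × guidewords, so a two-step hint yields ten questions, small enough to answer by reading the hint before any eval run.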
HFACS (Human Factors Analysis and Classification System). HFACS structures failure causes in four tiers: unsafe acts (the proximate operator error) → preconditions for unsafe acts (operator state, environment) → unsafe supervision (inadequate guidance, failure to correct) → organizational influences (resource management, culture, climate). The critical insight is that each tier is causally upstream of the tiers below it: organizational failures create conditions for supervisory failures, which create preconditions for operator errors, which produce unsafe acts.
ARCHER's three-layer responsibility split (Section 2.2) is a partial approximation of this hierarchy. HFACS suggests an additional tier that ARCHER's current model underspecifies: the organizational layer above the pipeline. Decisions about which objectives to prioritize, which hint gaps to accept as technical debt, and how much diagnostic cost to tolerate before escalating an issue shape the conditions under which the code and model layers fail. The one-bug-one-fix trap is partly an organizational failure: the development process creates pressure to close the immediately-failing objective rather than audit the class.
Reliability-Centered Maintenance (RCM) consequence classification. RCM classifies failures along two orthogonal dimensions: consequence type (safety-critical, operational, economic, or non-operational) and detectability (evident — visible to the operator during normal operation — or hidden — not detectable without deliberate testing). These dimensions are separate: a hidden failure may be safety-critical or merely economic; an evident failure may be immediately apparent or only noticed during review. The classification drives maintenance interval decisions — hidden failures require scheduled detection tasks because they accumulate undetected regardless of their consequence type.
ARCHER's blast radius taxonomy (behavior, training, eval, audit) maps onto RCM's consequence dimension. Training-blast failures (Classes 4, 11, 14, 15 producing false-positive training data) are hidden failures in RCM's detectability sense — they do not immediately degrade observed eval performance and require active detection (audit runs, T2 scoring, per-class pass rate tracking) to surface. Their consequence type is safety-critical: contaminated training data degrades future model behavior in ways that are hard to reverse. Behavior-blast failures (Classes 1–3, 5–7, 9, 12) are evident failures — they surface immediately in eval pass rates — with operational consequences. RCM's lesson is that maintenance intervals (detection frequency) should be shortest for hidden failures, regardless of their consequence severity. Applied to ARCHER: training-contamination detection (T2 scoring, audit review) should run at higher frequency than behavior-failure detection (eval runs), not lower — which is the opposite of current practice, where eval runs are frequent and T2 scoring is a manual, deferred task.
What These Frameworks Collectively Imply
Three structural gaps emerge from comparing ARCHER's current approach against established frameworks:
The absence of prospective analysis. FMEA, HAZOP, and RCM all operate primarily before deployment. ARCHER's taxonomy operates primarily after. The hint linter (Section 6.1) is the only prospective mechanism in the current system. Adding a structured pre-ship guideword pass to the hint authoring workflow — formally applying "No output / More output / Wrong host / No prerequisite" to every new hint block — would catch a predictable subset of failures before they reach eval.
Latent failures are underweighted relative to active failures. Reason's model and HFACS both emphasize that fixing active failures without addressing latent ones produces recurrence. ARCHER's issue history confirms this: the same root cause appears across 3–8 independent issues before the latent class is identified and addressed at the structural level. The meta-patterns in Section 4 are the mechanism for making latent failures visible — but they require a sustained commitment to class-level diagnosis rather than per-issue fixes.
Hidden failures (training contamination) require proactive detection schedules, not reactive cleanup. RCM's hidden-failure handling makes this explicit: if a failure mode is not detectable in normal operation, the maintenance schedule must include explicit detection intervals. T2 scoring and audit review are ARCHER's detection mechanism for training contamination — but treating them as deferred cleanup rather than scheduled detection means hidden failures accumulate between detection runs. The fix is operational: schedule T2 scoring and audit review runs at the same cadence as eval runs, not as post-hoc cleanup.
3. Taxonomy of Failure Classes
I present 16 failure classes. Classes 1–15 are agent-layer failures: failures in how the agent generates commands, selects tools, interprets output, or reports results. Class 16 is an evaluator-layer failure: a failure in how the Tier 2 LLM quality scorer assesses agent sessions. It is included here because it produces the same downstream consequence as an agent failure — corrupted training signal — through a different mechanism. For each class: a description of the failure mode, its root cause, representative instances drawn from ARCHER's issue history, and the class-level remediation. Classes are ordered roughly by how early in the dispatch cycle the failure manifests.
Class 1 — Shell Variable Loss
Description: A shell variable is set in one command dispatch and referenced in a subsequent dispatch. Because the agent dispatches each generated command as an independent subprocess, shell state does not persist. The variable is empty when needed. Unsubstituted literal placeholders (<attacker_ip>, <user>) in hint templates are the same failure at the hint-authoring level — the model receives the placeholder unchanged and uses it literally.
Root cause: Independent subprocess dispatch. Shell state is process-local. This is not a model failure — the model cannot know that its $VAR reference will be evaluated in a fresh shell.
Representative instances: Missing LHOST in a Metasploit chain leaving a reverse shell with no connection target; an SSH copy command using <user> literally; a pivot client connecting to :8000 on localhost because ATTACKER_IP was set in command N and empty in command N+1.
Remediation: Inline subshell expansion — never assign a variable then reference it across dispatches. Use ssh ... $(ip route get {pivot} | grep -oP 'src \K\S+'):8000 rather than ATTACKER_IP=$(...) followed by ssh ...$ATTACKER_IP.... All <placeholder> strings must be substituted at hint render time; any placeholder that reaches the model is a hint defect.
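Both variants of this class can be flagged statically. The sketch below is a heuristic check, not ARCHER's hint linter: the regular expressions and function names are assumptions, and the patterns can produce false positives (for example, `dd if=...` looks like an assignment).

```python
import re
from typing import List, Set, Tuple

# <placeholder> tokens that should have been substituted at hint render time.
PLACEHOLDER = re.compile(r"<[A-Za-z_][A-Za-z0-9_]*>")
# NAME=value at the start of a command or after whitespace / ';'.
ASSIGNMENT = re.compile(r"(?:^|[\s;])([A-Za-z_][A-Za-z0-9_]*)=")
# $NAME or ${NAME}; "$(" command substitution does not match.
REFERENCE = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?")


def unsubstituted_placeholders(command: str) -> List[str]:
    """Return literal <placeholder> tokens left in a rendered command."""
    return PLACEHOLDER.findall(command)


def cross_dispatch_vars(dispatches: List[str]) -> List[Tuple[int, str]]:
    """Return (dispatch index, variable) pairs where a variable is referenced
    but was assigned only in an earlier, separate dispatch."""
    assigned_earlier: Set[str] = set()
    problems: List[Tuple[int, str]] = []
    for index, command in enumerate(dispatches):
        assigned_here = set(ASSIGNMENT.findall(command))
        for var in REFERENCE.findall(command):
            if var in assigned_earlier and var not in assigned_here:
                problems.append((index, var))
        assigned_earlier |= assigned_here
    return problems


broken = [
    r"ATTACKER_IP=$(ip route get 10.0.0.5 | grep -oP 'src \K\S+')",
    "echo $ATTACKER_IP:8000",
]
inline = [r"echo $(ip route get 10.0.0.5 | grep -oP 'src \K\S+'):8000"]

print(cross_dispatch_vars(broken))
print(cross_dispatch_vars(inline))
print(unsubstituted_placeholders("scp payload <user>@10.0.0.5:/tmp"))
```

The check treats each list element as one dispatch, which matches the independent-subprocess model described above: an assignment only counts if it occurs in the same dispatch as the reference.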
Class 2 — PTY / TUI Crash
Description: A tool that requires an interactive terminal (PTY) is launched without one — via nohup, backgrounded with &, or via docker exec without -t. The tool's TUI library detects no terminal and panics, exiting silently after writing startup log lines. The process appears to have started successfully because liveness is checked against those startup lines.
Root cause: Modern CLI tools use PTY-detecting interactive frameworks (ligolo-ng: grumble + survey/v2; msfconsole: readline). The library aborts on PTY absence. The abort happens after startup output is written, so log-based liveness checks pass even though the process is dead.
Representative instances: A tunnel proxy writing "Listening" and then crashing, with the subsequent agent connection attempt failing with "connection refused" — the eval session scores as a setup failure after 8 wasted commands.
Remediation: Any tool with a TUI must run inside tmux new-session -d -s <name>. Liveness must be verified via tmux has-session -t <name>, not via grep -q <startup_string> <logfile>. Startup log lines are written before crashes. The tmux session existing is the only reliable liveness signal.
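A minimal sketch of the session-based liveness check follows. It assumes tmux is installed on the host that runs the tool; the helper names and the injectable `run` parameter are hypothetical conveniences so the logic can be exercised without tmux present.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def _run(argv: Sequence[str]) -> int:
    """Default runner: execute the command and discard its output."""
    completed = subprocess.run(
        list(argv), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


def launch_cmd(session: str, command: str) -> List[str]:
    """argv that starts `command` inside a detached tmux session, which gives
    the tool the terminal it expects."""
    return ["tmux", "new-session", "-d", "-s", session, command]


def is_alive(session: str, run: Runner = _run) -> bool:
    """Liveness via `tmux has-session`, which exits 0 only while the session
    exists. By default a session ends when its command exits, so a crashed
    tool fails this check even if its startup log lines look healthy."""
    return run(["tmux", "has-session", "-t", session]) == 0
```

A setup function would run `launch_cmd(...)`, wait briefly, and then require `is_alive(...)` before reporting success. The wait matters because the crash described above happens after startup output is written.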
Class 3 — Case Mismatch / Pattern Miss
Description: A grep pattern or regex in a hint or success function fails because the actual tool output uses different capitalization, punctuation, or wording than the pattern expects. Also includes success signals that are structurally present in tool output but absent from the visible context because they were filtered or truncated before reaching the pattern check.
Root cause: Patterns written against documentation or expected output rather than against live tool output. Case-sensitive grep is the most common instance — the actual log reads Agent joined. (capital A, period); the grep checks agent joined (lowercase, no period); the check fails.
Representative instances: John the Ripper's output using "hashes" (plural) while the success pattern checks "hash" (singular); nmap NSE vulnerability output missing from the halt-bypass signal list because the pattern was written against a different nmap version's output format.
Remediation: Always use grep -qi for log pattern checks. Cost is negligible; benefit is elimination of case mismatch as a failure source. All patterns must be verified against live tool output from the actual lab environment, not against documentation.
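Where a check lives in a Python success function rather than a shell hint, the rough equivalent of `grep -qi` is a case-insensitive regex search. The sketch below uses the "Agent joined." example from this class; the `log_has` helper and the sample log line are illustrative assumptions.

```python
import re


def log_has(pattern: str, log_text: str) -> bool:
    """Case-insensitive regex search, a rough Python analogue of `grep -qi`."""
    return re.search(pattern, log_text, re.IGNORECASE) is not None


# Illustrative log line; real patterns should be checked against live output.
log_text = "INFO Agent joined.\n"

print("agent joined" in log_text)         # case-sensitive substring test
print(log_has("agent joined", log_text))  # case-insensitive search
```

Case-insensitivity removes only one source of mismatch. Wording and punctuation differences still require checking the pattern against output captured from the lab environment.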
Class 4 — Premature Objective Achieved
Description: The model emits an objective-achieved signal based on partial, ambiguous, or fabricated evidence. Variants: reading the first positive line of command output without checking for subsequent error lines; success functions that match on reasoning text ([THOUGHT] blocks) rather than command output; echo/printf fabrication of expected strings to satisfy the success check.
Root cause: Insufficient specificity in the success signal. A single string match on a log line is not a success signal if that string can appear in non-success contexts. This class is particularly costly because it produces false-positive training data — sessions scored as successful that were not.
Representative instances: A model that reads a proxy startup warning and declares the tunnel established; a success function that matches uid=0 in model-generated text rather than in command stdout; a lynis scan running on the Kali host (not the target) and passing a vulnerability-assessment objective because the output contains the expected keywords.
Remediation: Success functions must verify that output originated from the correct target (see _targeted_at guard, Class 15). [THOUGHT] text must be stripped before pattern matching — model reasoning should never satisfy an evidence check. Complex objectives should require N distinct evidence signals, not a single string match.
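Two of these remediations, stripping reasoning text and requiring several distinct evidence signals, can be sketched as follows. The transcript format is an assumption: the sketch treats any line beginning with `[THOUGHT]` as model reasoning, which may not match ARCHER's actual session layout. It does not implement the `_targeted_at` guard and does not detect echo/printf fabrication.

```python
import re
from typing import Iterable

# Assumption: each line of model reasoning is prefixed with "[THOUGHT]".
THOUGHT_LINE = re.compile(r"^\s*\[THOUGHT\].*$", re.MULTILINE)


def strip_thoughts(transcript: str) -> str:
    """Remove reasoning lines so they can never satisfy an evidence check."""
    return THOUGHT_LINE.sub("", transcript)


def has_evidence(transcript: str, patterns: Iterable[str], required: int) -> bool:
    """True only if at least `required` distinct patterns match the
    transcript after reasoning text has been stripped."""
    text = strip_thoughts(transcript)
    hits = sum(1 for pattern in patterns if re.search(pattern, text))
    return hits >= required


ROOT_SIGNALS = [r"uid=0\(root\)", r"(?m)^root$"]

fabricated = "[THOUGHT] I expect uid=0(root) here\n$ id\nuid=1000(user) gid=1000(user)\n"
genuine = "$ id\nuid=0(root) gid=0(root)\n$ whoami\nroot\n"

print(has_evidence(fabricated, ROOT_SIGNALS, required=1))
print(has_evidence(genuine, ROOT_SIGNALS, required=2))
```

Requiring two signals means a single coincidental string, such as a keyword in an unrelated log line, is no longer enough to pass the objective.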
Class 5 — Wrong Module / Tool Selection
Description: The model selects the wrong Metasploit module, CLI tool, or approach — hallucinating a module name, choosing a module for a different vulnerability, selecting the wrong tool for the target environment, or misconfiguring a correct tool in a way that guarantees failure regardless of target state.
Root cause: LLMs have broad but imprecise knowledge of exploit module paths and tool configurations. Without an explicit short-circuit that names the correct module and configuration, the model falls back to its training distribution — which contains many plausible-sounding but incorrect module selections.
Representative instances: The wrong Metasploit module selected for a known CVE (the correct module was unix/irc/unreal_ircd_3281_backdoor; the model used samba/usermap_script); hydra using rockyou.txt as both the username and password list, guaranteeing that the target credential appears in neither column.
Remediation: For every named CVE or well-known vulnerability in the target environment, the hint must include an explicit short-circuit: exact module path, required set commands, PAYLOAD setting, and RHOST/LHOST sequence. The model cannot reliably derive correct module selections from general knowledge. Tool priority must also be explicit — "use nuclei; fall back to nmap --script vuln if unavailable" is unambiguous; generic "run a vulnerability scanner" produces tool mismatches.
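Some misconfigurations in this class can be rejected before dispatch. The sketch below flags the hydra case from the representative instances, where the same file is passed as both the login list (`-L`) and the password list (`-P`). The function name is hypothetical and the check covers only this one pattern.

```python
import shlex
from typing import List, Optional


def _flag_value(argv: List[str], flag: str) -> Optional[str]:
    """Return the argument following `flag`, or None if the flag is absent."""
    if flag in argv[:-1]:
        return argv[argv.index(flag) + 1]
    return None


def hydra_shared_wordlist(command: str) -> Optional[str]:
    """Return the wordlist path if a hydra command uses the same file for
    -L (login list) and -P (password list); otherwise return None."""
    argv = shlex.split(command)
    if not argv or argv[0] != "hydra":
        return None
    logins = _flag_value(argv, "-L")
    passwords = _flag_value(argv, "-P")
    if logins is not None and logins == passwords:
        return logins
    return None


bad = "hydra -L /usr/share/wordlists/rockyou.txt -P /usr/share/wordlists/rockyou.txt ssh://10.0.0.5"
ok = "hydra -l admin -P /usr/share/wordlists/rockyou.txt ssh://10.0.0.5"

print(hydra_shared_wordlist(bad))
print(hydra_shared_wordlist(ok))
```

A pre-dispatch check like this belongs in the code layer: it rejects a command that cannot succeed instead of relying on the model to notice the misconfiguration.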
Class 6 — Missing Short-Circuit / Hint Gap
Description: A hint block exists for one phase of an operation but not for the critical subsequent phase. The model is left without guidance at the decisive step and falls back to generic behavior — retrying, trying alternative tools, or hallucinating a procedure that does not work against the specific target.
Root cause: Hints are written under pressure to pass the immediately-failing step. Once that step passes, adjacent gaps are not audited. This produces hints that successfully initiate an operation but provide no guidance for completing it.
Representative instances: A hash-cracking objective where the hint covered dumping the shadow file but not running john against it — the model consistently dumped credentials and stopped, having received no guidance on the next step. SSH objectives against a legacy target where the hint omitted the -oHostKeyAlgorithms=+ssh-rsa flag — every SSH attempt was rejected at the protocol level before authentication.
Remediation: The two-layer rule: every hint for a detection or enumeration step must have a paired hint for the exploitation or completion step. A hint that exists only for detection trains the model to stop at detection. Every hint block that establishes infrastructure must include a verification step that produces the evidence the success function checks.
Class 7 — VRAM / Resource Bleed
Description: A resource-intensive objective saturates GPU VRAM or system resources. Subsequent objectives in the same eval run begin with depleted resources, producing zero-command sessions (the model generates no commands because context cannot be loaded) or severely truncated runs.
Root cause: The evaluation harness runs objectives sequentially in the same process context. VRAM consumed by a long-running objective's KV cache is not released between objectives unless explicitly flushed. A 20-minute exploitation sequence leaves the GPU in a degraded state for the next objective.
Representative instances: Overnight collection runs where objectives 1–4 passed and objectives 5–8 produced zero-command sessions, not because objectives 5–8 were harder, but because the model could not load at reduced VRAM. These sessions entered the fine-tuning pipeline as zero-command examples before the filter was added.
Remediation: Explicit model flush between objectives (ollama stop + prewarm). setup_fn idempotency — every setup function must clean prior state (kill stale processes, release ports) before initializing. The stale port case is a resource bleed variant: a tunnel port not cleaned between runs causes the setup for the next run to fail silently.
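For the stale-port variant, an idempotent setup step can verify its precondition explicitly instead of assuming a clean slate. The sketch below is one possible shape under stated assumptions: `port_is_free` and `ensure_port_free` are hypothetical helpers, the check covers only local TCP ports, and the `cleanup` callable stands in for whatever kills the stale process.

```python
import socket
from typing import Callable


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port (TCP)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def ensure_port_free(port: int, cleanup: Callable[[], None]) -> None:
    """Idempotent precondition for a setup function: run `cleanup` only if
    the port is held, then fail loudly if it is still held, so that setup
    cannot fail silently."""
    if not port_is_free(port):
        cleanup()
    if not port_is_free(port):
        raise RuntimeError(f"port {port} still in use after cleanup")
```

Calling `ensure_port_free(8000, cleanup=...)` at the top of a setup function is safe to repeat: on a clean host it does nothing, and on a dirty host it either repairs the state or raises before the objective starts.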
Class 8 — Character Limit / Command Truncation
Description: A hint block, evidence extraction window, or command string exceeds a character limit, causing silent truncation, command rejection, or data loss.
Root cause: Character limits exist at multiple points in the dispatch pipeline. Hint blocks are capped to control context budget. Evidence extraction windows are capped to control log size. These limits interact with hint complexity in non-obvious ways — a hint that fits within the limit when written may exceed it after a routine update adds one more example.
Representative instances: A pivot hint that exceeded the 500-character harness limit and was silently rejected, causing the model to operate without the hint it expected. An evidence extraction window truncated at 120 characters, making POST payloads in XSS sessions invisible to the Tier 2 quality scorer.
Remediation: check_hint_lengths.py as a CI gate, run before every hint commit. This is the one failure class that was fully prevented by a single enforcement point — after the gate was added, Class 8 instances stopped recurring. The lesson generalizes: every character limit should have an automated enforcement point.
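A length gate of this kind needs very little code. The sketch below is not the contents of check_hint_lengths.py, which are not shown in this paper. It assumes hints are available as a name-to-text mapping and uses the 500-character limit quoted above.

```python
import sys
from typing import Dict, List

HINT_CHAR_LIMIT = 500  # harness limit quoted in the text


def oversized_hints(hints: Dict[str, str], limit: int = HINT_CHAR_LIMIT) -> List[str]:
    """Return one message per hint block whose text exceeds the limit."""
    return [
        f"{name}: {len(text)} chars (limit {limit})"
        for name, text in sorted(hints.items())
        if len(text) > limit
    ]


def main(hints: Dict[str, str]) -> int:
    """Return a process exit code: 0 if all hints fit, 1 otherwise."""
    problems = oversized_hints(hints)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

Wired into CI as `sys.exit(main(load_hints()))`, where `load_hints` is a hypothetical loader, a nonzero exit blocks the commit, so an oversized hint fails in review rather than being silently rejected at run time.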
Class 9 — Routing Miss
Description: A task is dispatched to the wrong skill domain, causing the model to receive hints and system context for the wrong operation type. The model's behavior is shaped by the wrong hint set — leading to tool selections, command sequences, and success signals appropriate for the wrong task.
Root cause: The routing system (keyword scorer + optional ML classifier) makes probabilistic routing decisions. Skills with overlapping vocabulary — port scanning and service enumeration both trigger on "scan"; web enumeration and web exploitation both trigger on "check the target" — compete for the same tasks. Early routing implementations had no mechanism to prefer the correct domain when multiple domains scored similarly.
Representative instances: A chisel pivot task routed to ssh_tunneling, causing the model to attempt SSH local port forwarding rather than deploying the chisel binary. A web command injection task tied between web_exploitation and web_cmd_injection, with routing determined by which skill happened to be listed first.
Remediation: Explicit exclude_keywords for each skill — patterns that look like a match but belong to a sibling domain. Ambiguous task evaluation (--ambiguous flag) as a required step after any hint or keyword change. Routing misses are often symmetric: if task T routes to skill A when it should route to skill B, skill A is over-capturing and skill B is under-capturing. Both require fixes.
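To illustrate how exclude_keywords break a tie between sibling skills, here is a small sketch of a keyword scorer. It is not ARCHER's router: the Skill structure, the substring matching, and the chisel_pivot skill name are assumptions made for the example. Only ssh_tunneling and the exclude_keywords concept come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    keywords: set[str]
    exclude_keywords: set[str] = field(default_factory=set)


def score(skill: Skill, task: str) -> int:
    """Keyword-overlap score; any exclude keyword vetoes the skill outright."""
    text = task.lower()
    if any(kw in text for kw in skill.exclude_keywords):
        return 0
    return sum(1 for kw in skill.keywords if kw in text)


def route(skills: list[Skill], task: str) -> list[str]:
    """Return the top-scoring skill names; more than one name means a tie."""
    scores = {s.name: score(s, task) for s in skills}
    best = max(scores.values(), default=0)
    if best == 0:
        return []
    return sorted(name for name, value in scores.items() if value == best)


task = "Set up a chisel pivot tunnel to the internal subnet"
tied = [
    Skill("ssh_tunneling", {"pivot", "tunnel"}),
    Skill("chisel_pivot", {"pivot", "tunnel"}),  # hypothetical skill name
]
fixed = [
    Skill("ssh_tunneling", {"pivot", "tunnel"}, exclude_keywords={"chisel"}),
    Skill("chisel_pivot", {"pivot", "tunnel"}),
]
```

With the `tied` configuration, `route` returns both skills and the winner depends on list order, which is the failure described above. With `fixed`, the exclude keyword removes ssh_tunneling from contention for any task that mentions chisel.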
Class 10 — Range Lock-In¶
Description: A hint triggers only on a specific application name, IP address, or target path — training the model to execute the steps that work on one specific target rather than the underlying vulnerability class. On an alternative target with different URL structure, different credential defaults, or different tool version, the model fails despite the vulnerability being identical.
Root cause: Hints are written to pass eval objectives. The fastest path to a passing eval is specificity: give the model the exact URL, credential, token field, and request format for the training target. The hint works. The objective passes. The training data is generated. The model learns to solve the box, not the class.
Representative instances: CSRF exploitation hints that hardcoded DVWA's token field name and endpoint path. On a target with different implementation details, the model attempted the same token field on the wrong form element. SQL injection hints that triggered on the training target's IP address — on any other target, no hint fired.
Remediation: The two-layer hint design: every app-specific block (trigger: target name or IP) must be paired with a generic companion block (trigger: vulnerability class keyword) that uses placeholders the model fills from recon output. The app-specific block drives eval pass rate. The generic companion teaches the transferable pattern. Full analysis in docs/research/range-lock-in.md.
Class 11 — False Positive Success Function¶
Description: A success function returns true on a signal that appears in non-success contexts — curl error output, verbose HTTP headers, [THOUGHT] block content, wrong-host output, or partial string matches that overlap with unrelated tool output.
Root cause: Success functions are written to match expected success output. They are not routinely tested against failure output. A pattern that correctly identifies success in the expected case may also fire on edge cases — error messages that contain the target string, verbose output modes that include the pattern as a side effect, or tool output from the wrong host that happens to include the expected keyword.
Representative instances: A pivot-traversal success function matching HTTP/\d — correct for tunnel verification, but also present in curl verbose output for any HTTP request, including failed ones. A host-discovery function passing on any IP address in the 192.168.56.0/24 range — correct for the lab target, but also matched by Kali's own network interface.
Remediation: Success patterns must be anchored to tool-specific output that appears only in genuine success. [THOUGHT] blocks must be stripped before pattern evaluation. Success functions for remote objectives must verify output provenance — output from the Kali host should never satisfy a check that requires output from a remote target.
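The two remediations that concern pattern matching can be sketched as follows. This is an illustration under stated assumptions, not the harness's success function: the `[THOUGHT]...[/THOUGHT]` delimiter format is assumed, and the anchored pattern (an unprefixed status line with a 2xx code) is one possible tightening of the HTTP/\d example.

```python
import re

# Assumed delimiter format; the real [THOUGHT] block syntax may differ.
THOUGHT_BLOCK = re.compile(r"\[THOUGHT\].*?\[/THOUGHT\]", re.DOTALL)

# The loose pattern from the text: fires on any HTTP exchange, failed or not.
LOOSE = re.compile(r"HTTP/\d")

# Anchored variant: a status line at the start of a line with a 2xx code.
# curl -v prefixes response headers with "< ", so verbose noise does not match.
ANCHORED = re.compile(r"^HTTP/\d(?:\.\d)? 2\d\d\b", re.MULTILINE)


def strip_thoughts(session: str) -> str:
    """Remove model reasoning so it can never satisfy a success pattern."""
    return THOUGHT_BLOCK.sub("", session)


def tunnel_verified(session: str) -> bool:
    """Success only when an anchored 2xx status line appears outside reasoning text."""
    return ANCHORED.search(strip_thoughts(session)) is not None


failed_verbose = "* Connected\n> GET / HTTP/1.1\n< HTTP/1.1 502 Bad Gateway\n"
thought_only = "[THOUGHT]\nHTTP/1.1 200 OK is what I expect\n[/THOUGHT]\ncurl: (7) Failed to connect\n"
genuine = "HTTP/1.1 200 OK\nServer: nginx\n"
```

The useful habit is the one the root cause points at: each success pattern is exercised against failure transcripts (`failed_verbose`, `thought_only`) as well as the expected success case.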
Class 12 — Model Loop¶
Description: The model repeats the same command or sequence without making forward progress, eventually exhausting the maximum command budget. Variants: retrying a failed setup step indefinitely; executing commands correctly but never emitting an objective-achieved signal; operating on bare infrastructure without a payload.
Root cause: The model interprets repeated output as confirmation that the approach is viable rather than as evidence that the approach is not working. Without a progress anchor — a checkpoint that marks advancement to the next phase — the model has no signal to distinguish "keep trying this step" from "this step is done, proceed."
Representative instances: An SSH loop where the model retried key negotiation 9 times without advancing to the exploit step — the hint provided no guidance on what output indicated a successful connection versus a failure requiring a different approach. Objectives with 100% halt-discipline and 0% objective-achieved across multiple runs — not because the model failed, but because the hint provided no mechanism for the model to know when to emit the completion signal.
Remediation: Progress anchors in hints — explicit checkpoints that tell the model what output signals advancement: "if port 8080 is now listening, proceed to step 3." Connectivity pre-checks before entering any web workflow. The 100%/0% pattern (all runs halt at depth, none achieve objective) is diagnostic: it means the model is completing the work but the hint provides no completion signal.
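The 100%/0% diagnostic is simple enough to compute mechanically from run records. The sketch below assumes each run is a dict with boolean fields `halted_at_depth` and `objective_achieved`; both field names are hypothetical stand-ins for whatever the harness actually logs.

```python
from __future__ import annotations


def has_completion_signal_gap(runs: list[dict]) -> bool:
    """True when every run halted at depth and none achieved the objective.

    That combination suggests the work is being done but the hint never
    tells the model when to emit the completion signal.
    """
    if not runs:
        return False
    all_halted = all(run["halted_at_depth"] for run in runs)
    none_achieved = not any(run["objective_achieved"] for run in runs)
    return all_halted and none_achieved
```

Running this over each objective's recent runs before any hint tuning would separate missing-completion-signal cases from genuine capability failures.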
Class 13 — Infrastructure Gap¶
Description: A required binary, container configuration, network component, or environment state is missing or misconfigured. The eval fails not because of hint or model behavior, but because the environment itself is broken.
Root cause: Infrastructure state is not verified before eval runs. A container that is missing a required kernel module, a target that was never initialized, a port that was not cleaned between runs — these produce failures that are initially diagnosed as hint or model failures, wasting tuning cycles before the environmental cause is found.
Representative instances: A tunneling objective that failed across 40 runs before a container audit revealed the required kernel device (/dev/net/tun) was missing from the container Dockerfile — the tunnel could never be established regardless of hint quality. A hash-cracking objective contaminated by a prior run's output file being on the wrong host.
Remediation: Environment verification before model diagnosis. A pre-run sanity check (model responsiveness, container state, target reachability, harness parsability) that must pass before any multi-objective run begins. A failing objective should be tested against a verified environment before any hint is changed.
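A pre-run sanity check of this kind can be expressed as a named set of boolean checks that must all pass. The sketch below shows only the gating logic; the check names are hypothetical, and the real checks (model responsiveness, container state, target reachability, harness parsability) would be supplied by the harness.

```python
from __future__ import annotations

import os
from typing import Callable

Check = Callable[[], bool]


def run_preflight(checks: dict[str, Check]) -> list[str]:
    """Run every check and return the names of those that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that cannot run is treated as a failed check
        if not ok:
            failed.append(name)
    return failed


# Example wiring; the device path is the one named in the tunneling instance above.
example_checks: dict[str, Check] = {
    "tun_device_present": lambda: os.path.exists("/dev/net/tun"),
}
```

A multi-objective run would start only when `run_preflight` returns an empty list, so an environment fault is reported as such rather than surfacing later as an apparent hint or model failure.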
Class 14 — Training Data Contamination¶
Description: Invalid, low-quality, or incorrectly-labeled sessions enter the fine-tuning pipeline or classifier training data, degrading model behavior in ways that are difficult to diagnose. Contamination does not cause immediate eval failures — it degrades behavior gradually, in ways that manifest as routing misses, hint non-compliance, and false-confidence claims weeks or months after the contaminated sessions were generated.
Root cause: Pipeline gates that exist as documented guidelines rather than code enforcement. A filter that can be bypassed by any code path that does not explicitly invoke it will eventually be bypassed.
Representative instances: Zero-command sessions (VRAM-saturated runs that generated no commands) entering the fine-tuning pipeline before a depth_blocked filter was added — contributing examples that showed the correct task description paired with no behavioral response. Wrong-host sessions (commands run against Kali, not the target) passing the success function and being stored as high-quality training examples.
Remediation: Pipeline gates must be code, not process. Contamination filters must execute at the point of data generation, not in downstream cleanup. The contamination categories that have recurred — depth-blocked, verify_fn_skipped, skill-unknown, wrong-host — each require a hard gate at the pipeline write point.
Class 15 — Wrong Host / Target Confusion¶
Description: The model executes commands against the wrong machine — typically the attacker host rather than the target, or the attacker rather than an intermediate pivot in multi-hop scenarios. The commands run successfully (the host is reachable) and may produce output that satisfies the success function (tools work on the attacker just as they do on the target), generating a false positive that enters the training pipeline.
Root cause: Without explicit host labels in hints and host-provenance checks in success functions, the model infers execution context from available information. In ambiguous situations — where the task description does not specify which host, or where the model has lost track of which host it is working on — the model defaults to the local environment.
Representative instances: A vulnerability assessment running lynis on the Kali host instead of the target VM, then passing because the output contained the expected vulnerability keywords. A socat relay configured and started on the attacker machine rather than the pivot, making traversal confirmation impossible.
Remediation: Explicit host labels in every hint command: # On attacker:, # On pivot (via SSH):, # On target:. Success functions for remote objectives must verify output provenance — a _targeted_at guard that checks the target IP appears in network-layer output, not just in any part of the session log. Connectivity to the target as the mandatory first step in any remote objective.
Class 16 — Evaluator Knowledge Staleness¶
Description: A Tier 2 LLM evaluator produces a high-confidence false verdict on an agent session because its training data does not include a tool capability that was introduced after the evaluator's knowledge cutoff. The evaluator's reasoning is internally consistent — it correctly identifies that the output is anomalous relative to its knowledge of the tool — but reaches the wrong conclusion, diagnosing fabrication where the actual explanation is tool evolution. In the most structurally striking variant, the evaluator commits the exact error it was designed to detect: it confabulates a confident causal explanation for output that is real.
Root cause: LLM evaluators have frozen knowledge. Security tools evolve rapidly — new protocol types, new detection templates, new output formats. An evaluator with strong prior knowledge about a tool and no mechanism to detect version-boundary uncertainty will produce a confident, coherent, wrong verdict when it encounters post-cutoff output. Paradoxically, an evaluator with strong prior knowledge performs worse near tool version boundaries than one with no knowledge: the informed evaluator provides a plausible causal explanation (fabrication) where the uninformed evaluator would flag uncertainty. Prior knowledge turns a detectable gap into an invisible false negative.
Representative instances: Nuclei introduced a JavaScript execution engine in v3 and a TCP protocol type (enabling direct non-HTTP service interaction) in v2.3.0. The agent ran nuclei -u http://192.168.56.103 -severity critical,high and produced 17 real findings — PostgreSQL default credential results and FTP weak password results — via templates such as pgsql-default-db and ftp-weak-credentials. The T2 evaluator scored the session LEAN REJECT (0/3 Findings Grounding, Confidence: high) on the grounds that "nuclei cannot natively detect PostgreSQL credentials, FTP weak credentials, or perform credential testing." This is accurate for pre-v3 nuclei. It is false for v3. The recommended remediation — "add output validation to reject credential discovery claims from nuclei" — would have broken correct behavior for every nuclei v3 credential template, had it been implemented. A paired instance involves behavioral-signal misreading rather than output-provenance misreading (#566): the evaluator diagnosed an agent as "choosing to stop early" when the halt was triggered by the harness depth limit, not a voluntary model decision. Both instances produce high-confidence verdicts in the wrong direction; both require Tier 3 human review to catch.
Remediation: Three mitigations in order of leverage:
- Temporal anchoring in evaluator prompts. Include the evaluator's knowledge cutoff date and the specific tool version in the scoring prompt. Instruct the evaluator to express uncertainty when an output format is unfamiliar (for example, "I am not confident this output format is consistent with my knowledge of this tool at this version") rather than defaulting to a fabrication diagnosis. The evaluator cannot know what it doesn't know without being told its knowledge boundary.
- Version fingerprinting in session logs. Log the tool version that generated each session's output. When the evaluator's training cutoff predates a major version release of a tool appearing in the session, automatically flag T2 verdicts on that tool for Tier 3 human review rather than treating them as definitive.
- Confidence-calibrated review triggers. High-confidence T2 verdicts on novel or unfamiliar output formats are more suspect, not less: they indicate that the evaluator found an explanation for the anomaly, correct or not. Standard review pipelines route only low-confidence verdicts to human review. This failure mode inverts that assumption, because the dangerous verdict is the confident wrong one. For tools with rapid release histories, high-confidence rejections of unusual output should trigger human review alongside low-confidence ones.
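The second and third mitigations reduce to a routing rule over T2 verdicts. The sketch below is one way to write that rule, assuming each verdict carries the release date of the logged tool version and a coarse confidence label. The Verdict fields are hypothetical, and "unusual output" is approximated here by membership in a hand-maintained set of rapid-release tools, which is a simplification.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class Verdict:
    tool: str
    tool_release_date: date | None  # release date of the logged tool version, if known
    rejected: bool
    confidence: str  # "low", "medium", or "high" (assumed labels)


def needs_tier3_review(
    verdict: Verdict,
    evaluator_cutoff: date,
    rapid_release_tools: frozenset[str],
) -> bool:
    """Decide whether a T2 verdict should be escalated to human review."""
    # Conventional trigger: the evaluator itself is unsure.
    if verdict.confidence == "low":
        return True
    # Version fingerprinting: the tool version is unknown or postdates the cutoff.
    if verdict.tool_release_date is None or verdict.tool_release_date > evaluator_cutoff:
        return True
    # Confidence-calibrated trigger: confident rejections on fast-moving tools.
    return (
        verdict.rejected
        and verdict.confidence == "high"
        and verdict.tool in rapid_release_tools
    )
```

Treating an unknown tool version as a review trigger is a deliberate choice in this sketch: a session without a version fingerprint gives the pipeline no way to rule out a version-boundary error.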
4. Cross-Cutting Meta-Patterns¶
Four meta-patterns appear across multiple failure classes. A fix targeting a meta-pattern addresses multiple classes simultaneously and is the highest-leverage remediation action available.
Meta-Pattern A — Startup Log Liveness Check¶
Classes affected: 2 (PTY crash), 3 (case mismatch), 11 (false positive success function).
The pattern: grep -q <startup_string> <logfile> used as a process liveness test. Startup strings are written before crashes — a process that crashes 200ms after writing "Listening" produces a log that passes this check. The fix is process-existence verification (tmux has-session, pgrep, /proc/<pid>), not log content inspection. Log content is not a proxy for process state.
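The gap between log content and process state is easy to reproduce. The sketch below launches a child process that writes "Listening" and then exits with an error, then compares the two checks. It uses `Popen.poll()` as the process-existence test because the example holds the process handle; that is a stand-in for the tmux has-session, pgrep, or /proc/<pid> checks named above, not a replacement for them.

```python
import os
import subprocess
import sys
import tempfile

# A stand-in service that announces startup and then dies immediately.
CRASHING_SERVICE = "print('Listening', flush=True); raise SystemExit(1)"


def log_says_started(log_path: str, startup_string: str) -> bool:
    """The flawed liveness test: does the log contain the startup string?"""
    with open(log_path, encoding="utf-8") as fh:
        return startup_string in fh.read()


def process_is_alive(proc: subprocess.Popen) -> bool:
    """Process-existence test: poll() returns None only while the child is running."""
    return proc.poll() is None


def demo() -> tuple:
    """Return (log_check_result, process_check_result) for the crashing service."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "service.log")
        with open(log_path, "w", encoding="utf-8") as log:
            proc = subprocess.Popen([sys.executable, "-c", CRASHING_SERVICE], stdout=log)
            proc.wait(timeout=30)
        return log_says_started(log_path, "Listening"), process_is_alive(proc)
```

Here the log check passes while the process check fails, which is exactly the disagreement the meta-pattern describes.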
This pattern recurred in three independent failure classes because log-based checking feels intuitive — if the log says it started, it started. The fix is counterintuitive because it requires trusting the process table over the log file.
Meta-Pattern B — Missing Verification Step¶
Classes affected: 4 (premature OA), 6 (hint gap), 11 (false positive success function).
A hint block that sets up infrastructure (starts a proxy, establishes a tunnel, configures an exploit) but contains no verification step producing the specific output the success function checks leaves a gap: the model can declare success without having verified the work. Auditing for that verification step is one of the highest-leverage audit actions available: grep every hint block for a step that produces the evidence the success function requires. Any hint block without one is a latent Class 4, 6, or 11 failure.
Meta-Pattern C — App-Specific Without Generic Companion¶
Classes affected: 5 (wrong tool selection), 6 (hint gap), 10 (range lock-in).
Hints that target a specific application (vsftpd, DVWA, Metasploitable2) without a generic companion teach the model to solve the specific target, not the vulnerability class. This produces Class 10 (range lock-in) on generalization, Class 6 (hint gap) when the specific trigger doesn't fire, and Class 5 (wrong tool selection) when the model generalizes incorrectly to a tool that works for the specific app but not the general case. The two-layer rule addresses all three: one app-specific block for the training target, one generic companion with placeholders for the general case.
Meta-Pattern D — Wrong-Host Execution¶
Classes affected: 4 (premature OA), 6 (hint gap), 15 (wrong host confusion).
The model executes a command on the attacker host rather than the target. This produces a false positive (Class 4) when the local execution satisfies the success function, a hint gap (Class 6) when the hint failed to specify the execution host, and wrong-host confusion (Class 15) as the direct manifestation. The _targeted_at guard — a check in the success function that the target's IP appears in network-layer output — addresses all three classes in a single audit pass.
5. Frequency and Severity Analysis¶
This section requires data collection not yet complete. The framework is defined here; measurements will be added as data becomes available.
5.1 Framework¶
Three metrics are relevant for comparing failure classes:
Issue frequency — raw count of distinct issues assigned to each class. Represents observed incidence rate, biased toward classes that produce visible eval failures (invisible failures like Class 14 are underrepresented).
Diagnostic cost — time elapsed between first occurrence of a failure and correct root cause identification, measured in eval run cycles. Classes with high diagnostic cost consume significant development time before being resolved.
Recurrence rate — number of times a class was independently diagnosed before the class-level pattern was recognized. High recurrence rate indicates the one-bug-one-fix trap operating on that class.
5.2 Preliminary Observations¶
Without complete quantitative data, the inventory supports qualitative ordering:
Highest frequency: Classes 4 (premature OA, 20+ instances), 6 (hint gap, 25+ instances), and 13 (infrastructure gap, 15+ instances) are the most commonly observed failure classes by raw issue count.
Highest diagnostic cost: Classes 14 (training contamination) and 7 (VRAM bleed) have the highest diagnostic cost — contamination effects are delayed and difficult to attribute; VRAM bleed was initially diagnosed as model capability failure across many runs before resource exhaustion was identified.
Highest recurrence: Classes 1 (shell variable loss, 3+ independent diagnoses), 4 (premature OA, 8+ instances), and 6 (hint gap, 20+ instances) show the strongest one-bug-one-fix pattern.
Quantitative data will be added once per-class pass rate impact, time-to-diagnose, and recurrence counts are computed from the issue history.
6. What Automation Can Prevent¶
The failure class inventory points to three structural changes that would have prevented the majority of recorded failures without requiring model changes. Each addresses a meta-pattern rather than an individual class.
6.1 A Hint Linter in CI¶
check_hints.py is a static analysis tool for hint blocks. It enforces:
- C1 (TUI without tmux): any hint that launches an interactive tool (msfconsole, ligolo-proxy, responder) without a tmux wrapper is flagged.
- C2 (case-sensitive grep): any grep without -i that checks log content is flagged.
- C4 (app-specific without generic companion): any hint block that triggers on a specific app name or IP without a paired generic companion block is flagged.
- C7 (500-character limit): any hint block exceeding the context budget limit is flagged.
- C8 (cross-step shell variable): any $VAR reference in a step later than its assignment is flagged.
- C10 (missing _targeted_at guard): any success function without host-provenance verification is flagged.
These rules address Classes 1, 2, 3, 6, 8, and 10 mechanically, before the hint ever reaches an eval run. The check_hint_lengths.py precedent demonstrates the leverage: Class 8 truncation stopped recurring after a single CI gate was added. The linter generalizes this pattern to five additional failure classes.
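Two of the linter rules are simple enough to sketch. The code below illustrates C2 and C8 over a hint represented as a list of shell step strings. That representation, the function names, and the heuristics are assumptions for the example and are not taken from check_hints.py. The C2 check only inspects options placed directly after grep, so it would miss a flag given after the pattern.

```python
from __future__ import annotations

import re
import shlex

# VAR=value at the start of a step or after whitespace or a shell separator.
ASSIGNMENT = re.compile(r"(?:^|[\s;&|])([A-Za-z_][A-Za-z0-9_]*)=")


def check_c2(step: str) -> bool:
    """C2: True when the step runs grep without a case-insensitive flag."""
    try:
        tokens = shlex.split(step)
    except ValueError:
        return False  # unbalanced quotes; leave that to a different rule
    for index, token in enumerate(tokens):
        if token != "grep":
            continue
        options = []
        for following in tokens[index + 1:]:
            if not following.startswith("-"):
                break
            options.append(following)
        short_flags = "".join(o[1:] for o in options if not o.startswith("--"))
        if "i" not in short_flags and "--ignore-case" not in options:
            return True
    return False


def check_c8(steps: list[str]) -> list[tuple[int, str]]:
    """C8: (step_index, name) for each $NAME used in a step after the one that set it."""
    assigned: dict[str, int] = {}
    findings = []
    for index, step in enumerate(steps):
        # Only variables assigned in earlier steps are in `assigned` at this point.
        for name in assigned:
            if re.search(r"\$\{?" + re.escape(name) + r"(?![A-Za-z0-9_])", step):
                findings.append((index, name))
        for match in ASSIGNMENT.finditer(step):
            assigned.setdefault(match.group(1), index)
    return findings
```

For example, `check_c8(["PORT=8080; nc -lvp $PORT", "curl http://127.0.0.1:$PORT/"])` reports the reference in the second step and ignores the same-step use in the first, which matches the rule as stated: the hazard is a variable that does not survive into a later step.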
6.2 A _targeted_at Guard Audit¶
A single audit pass adding wrong-host guards to every success function addresses Classes 4, 11, and 15 simultaneously. The guard requires that the target's IP address appears in network-layer command output — not in model reasoning text, not in the local host's output, and not in verbose HTTP headers. For objectives where the target IP is not in network-layer output by design, an explicit alternative provenance check (SSH banner verification, service fingerprint from the target) must be present.
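As a rough illustration of the guard's core check, the sketch below accepts a session only if the target IP appears as a whole address in command output, never in reasoning text. The Event structure and its `kind` labels are hypothetical, and the sketch does not attempt to distinguish network-layer output from other command output, which the real guard would need to do.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Event:
    kind: str  # "command_output" or "reasoning" (assumed labels)
    text: str


def targeted_at(events: list[Event], target_ip: str) -> bool:
    """True when the exact target IP appears in command output."""
    # Lookarounds stop 192.168.56.1 from matching inside 192.168.56.10 or 192.168.56.103.
    whole_ip = re.compile(r"(?<![\d.])" + re.escape(target_ip) + r"(?!\.?\d)")
    return any(
        event.kind == "command_output" and whole_ip.search(event.text)
        for event in events
    )
```

The whole-address match matters in a /24 lab range, where a substring test would let the attacker's own interface address or a neighbouring host satisfy the guard.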
This audit pass was identified as the highest-leverage single action in the failure class inventory: 30+ instances across three classes, addressable in one systematic pass rather than as individual bug fixes.
6.3 Hard Gates in the Training Pipeline¶
Training data contamination (Class 14) has recurred across 10 issues despite being documented in multiple places as a known failure mode. The pattern is consistent: a contamination category is documented, its filter is added as a guideline, and a subsequent code path bypasses the guideline because it was not enforced mechanically.
The fix is hard gates at the point of data generation:
- verify_fn_skipped=True → prevent ft.jsonl write (not downstream filter)
- depth_blocked=True → prevent ft.jsonl write
- skill="unknown" → prevent ft.jsonl write
- Target IP absent from command output for remote objectives → flag for review
These gates make entire contamination categories structurally impossible rather than dependent on correct guideline application. The check_hint_lengths.py precedent applies here too: if the Class 8 length limit had been enforced by guideline rather than by a CI gate, truncation would have kept recurring.
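A hard gate at the write point can be sketched as a writer that refuses contaminated sessions instead of relying on a later filter. The flag names follow the list above. The session dict shape, the function names, and the exception type are assumptions for illustration, and the fourth rule (flag for review when the target IP is absent) is left out because it routes to review rather than blocking the write.

```python
from __future__ import annotations

import io
import json
from typing import TextIO


class ContaminatedSession(Exception):
    """Raised when a session violates a hard gate and must not be written."""


def gate_violations(session: dict) -> list[str]:
    """Return the names of every hard gate the session violates."""
    violations = []
    if session.get("verify_fn_skipped"):
        violations.append("verify_fn_skipped")
    if session.get("depth_blocked"):
        violations.append("depth_blocked")
    if session.get("skill", "unknown") == "unknown":
        violations.append("skill_unknown")
    return violations


def write_training_example(session: dict, out: TextIO) -> None:
    """The only path to ft.jsonl: write one JSON line, or raise."""
    violations = gate_violations(session)
    if violations:
        raise ContaminatedSession(", ".join(violations))
    out.write(json.dumps(session) + "\n")
```

The design point is that no code path can reach the file without passing through `write_training_example`, so a new caller cannot bypass the filter by forgetting to invoke it.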
6.4 What Cannot Be Automated¶
Three failure classes require eval runs to detect and cannot be caught by static analysis:
Class 5 (wrong module/tool selection) requires running the objective against a live target to determine whether the selected tool produces the correct output. No static analysis of the hint can verify that the Metasploit module path is correct for the specific target version.
Class 9 (routing miss) requires running tasks through the routing system and comparing the routed skill against the expected skill. The --ambiguous evaluation mode provides this check, but it must be run, not analyzed.
Class 12 (model loop) is a behavioral pattern that only emerges during execution. A hint that provides insufficient progress anchors cannot be identified as insufficient by reading the hint — it must be observed producing a looping model.
For these three classes, the intervention is not prevention but early detection: running the affected objectives after every hint change, with classification of failures by class before diagnosis begins.
Sections 7–10 (Lessons Learned, Implications for LLM Agent Design, Related Work, Conclusion) are pending venue confirmation. The taxonomy in sections 3–4 and the automation analysis in section 6 are complete and subject to revision as new failure instances are added to the source inventory.
Centaur Security Labs — Jay Hawkins. Source data: docs/research/failure-mode-inventory.md. Related: docs/research/range-lock-in.md (Class 10 full analysis).