ARCHER Failure Mode Inventory

A structured inventory of every failure class ARCHER has exhibited across eval runs and RCA sessions. Purpose: identify root causes shared across multiple objectives so that a single fix closes an entire class rather than re-filing the same bug under a new issue number.

Methodology: every closed bug and regression issue (#62–#480) was reviewed and assigned to one or more failure classes. Open instances are listed per class. Class-level remediations are specified where ≥2 instances share a root cause.

Last updated: 2026-05-30. Living document — add new instances as they surface. Covers issues #62–#480; post-#480 analysis and new classes (15–17) documented in the ARCHER operational inventory.


The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.


Lessons Learned

Six weeks of eval-driven development against live targets produced 130+ bug and regression issues. Reviewing them as a body — rather than individually — reveals patterns that were invisible when each issue was filed in isolation.

The one-bug-one-fix trap

The dominant failure pattern in ARCHER's development history is not any single bug class — it is the tendency to fix symptoms rather than classes. The same root cause recurred under different issue numbers: LHOST missing from a msfconsole chain (#161), an unsubstituted <attacker_ip> placeholder (#411), a lost variable between command dispatches (#475) — all three are the same failure (shell variable loss, Class 1), diagnosed and fixed three separate times. Across 15 classes and 130+ issues, this pattern repeats: a root cause is identified, fixed for the specific objective that surfaced it, and left unaddressed in adjacent objectives where the same code pattern exists.

The lesson: diagnosis is not complete until all instances of the root cause are found. A fix that closes one issue while leaving five sibling objectives with the same failure pattern is an incomplete fix.

Environmental assumptions accumulate silently

A large fraction of failures — Classes 2 (PTY crash), 13 (infrastructure gap), and 15 (wrong host) — are not model or hint failures at all. They are broken assumptions about the environment: a missing binary, a container without /dev/net/tun, a tool that requires a PTY but is launched without one. These failures are particularly costly because they are diagnosed as model failures first, wasting hint-tuning cycles before the real cause is found. The lesson: environment verification must precede model diagnosis. A failing objective should first be tested against a known-good environment before any hint is changed.

Success signals are consistently too weak

Four failure classes share a common weakness: premature OA (Class 4), false positive success_fn (Class 11), missing verification step (Class 6), and wrong host confusion (Class 15). In each, the system accepts evidence that does not actually prove the objective was completed. A single string match on a log line, a grep that fires on error output, a success_fn that passes on wrong-host output, a model that reads a startup warning and declares success: all are variants of the same problem. The lesson: success signals must be sufficient proof of completion, not merely correlated with it. A signal that can appear in both success and failure contexts is not a success signal.

Training data quality is a silent multiplier

Class 14 (training data contamination) is the least visible failure class because contaminated sessions do not cause immediate eval failures — they degrade model behavior gradually, in ways that manifest as routing misses and hint non-compliance weeks or months later. Depth-blocked sessions, wrong-host sessions, and unverified sessions entered the fine-tuning pipeline across multiple issues (#134, #153, #275, #316, #369). The lesson: pipeline gates must be code, not process. A filter that exists only as a documented guideline will eventually be bypassed. Every contamination category must be enforced at the point of data generation.

Automation prevents recurrence; process does not

The four cross-class meta-patterns (startup log liveness, missing verification step, app-specific without generic companion, wrong-host execution) have each recurred three or more times despite being documented in CLAUDE.md, PROCESSES.md, and individual issue comments. Documentation does not prevent recurrence. The failures that stopped recurring are the ones that became CI gates — check_hint_lengths.py eliminated Class 8 truncation issues after a single enforcement point was added. The lesson: the only reliable prevention is automated enforcement. Every lesson in this document that has not yet been encoded as a CI check or lint rule is a lesson waiting to be relearned.

What this means for development priorities

The failure mode inventory points to three structural changes that would have prevented the majority of issues in this document:

  1. A hint linter in CI (check_hints.py) — enforcing the tmux standard, case-insensitive grep, placeholder substitution, and verification step presence. Addresses Classes 1, 2, 3, and 6 mechanically.
  2. _targeted_at guards on all success_fns — a single audit pass adding wrong-host guards. Addresses Classes 4, 11, and 15 in one action.
  3. Hard gates in the training pipeline: verify_fn_skipped, depth-blocked, and unknown-skill sessions are rejected at write time, not filtered downstream. Addresses Class 14 permanently.

None of these require model changes or hint rewrites. They are infrastructure changes that make entire failure classes structurally impossible.
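
As a rough illustration of the third change, a write-time gate could look like the sketch below. The session field names (`verify_fn_skipped`, `depth_blocked`, `skill`), the `KNOWN_SKILLS` set, and the `write_session` helper are assumptions made for this example; they are not ARCHER's actual session schema or pipeline API.

```python
import json

# Hypothetical skill registry; a real one would be derived from the skills/ package.
KNOWN_SKILLS = {"port_scanning", "web_enumeration", "chisel_pivot"}


class ContaminatedSession(ValueError):
    """Raised when a session must not enter the fine-tuning set."""


def rejection_reasons(session: dict) -> list[str]:
    """Return every reason this session is unfit for training (empty if clean)."""
    reasons = []
    if session.get("verify_fn_skipped", False):
        reasons.append("verify_fn_skipped")
    if session.get("depth_blocked", False):
        reasons.append("depth_blocked")
    if session.get("skill") not in KNOWN_SKILLS:
        reasons.append("unknown_skill")
    return reasons


def write_session(session: dict, out) -> None:
    """Append one session as a JSONL line, or raise before anything is written."""
    reasons = rejection_reasons(session)
    if reasons:
        raise ContaminatedSession(", ".join(reasons))
    out.write(json.dumps(session) + "\n")
```

Because the check raises inside the only write path, a contaminated session cannot reach the output file unless the gate itself is removed, which is the difference between a code gate and a documented guideline.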


Class 1 — Shell Variable Loss

Description: A shell variable is set in one ARCHER command dispatch and referenced in a subsequent dispatch. Because each command runs in a separate bash invocation, the variable is gone by the time it is needed. Unsubstituted literal placeholders (<attacker_ip>, <user>) are the same failure in hint authoring — the model receives the placeholder unchanged and uses it literally.

Root cause: ARCHER dispatches each generated command as an independent subprocess. Shell state does not persist between dispatches.

Issue Objective Instance
#89 (closed) T4/T6 LHOST detection awk picks wrong IP when route has via gateway — wrong IP class for the context
#161 (closed) T6 UnrealIRCd LHOST missing from msfconsole chain — reverse shell never receives connection
#409 (closed) T52 chisel <user> placeholder in scp command reaches model literally — scp fails
#411 (closed) T53 ligolo <attacker_ip> placeholder unsubstituted + agent run on wrong host
#392 (closed) T56 multi-hop Wrong credentials (msfadmin) instead of pivot-range creds (pivot:archer123)
#475 (open) T52 chisel ATTACKER_IP set in cmd 1, empty in cmd 2 — client connects to :8000 on pivot's localhost

Class-level remediation:

  1. Inline subshell expansion: never assign a variable in one dispatch and reference it in a later one. Inline the command substitution where it is used: ssh ... $(ip route get {pivot} | grep -oP 'src \K\S+'):8000, not ATTACKER_IP=$(...) followed by ssh ...$ATTACKER_IP... in a later command.
  2. Placeholder hygiene — all <placeholder> strings must be substituted at hint render time via the template system. Any placeholder that reaches the model is a hint defect. Audit: grep -r '<[a-z_]*>' skills/ should return zero results.
  3. Cross-step $VAR audit — grep all skills/PT-*.py hints for $VAR patterns that appear in a later step than the assignment. Any cross-step reference is a latent bug.

Prevention status: C8 (VAR=$() assignment detection) CI gated in check_hints.py. C3 (cross-step $VAR dereference) deferred — requires multi-step semantic parse.
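
The first two remediations lend themselves to a mechanical check. The following is a minimal sketch of such a lint, not the actual C8 rule in check_hints.py; the function name `lint_rendered_hint` and both regular expressions are assumptions. The placeholder pattern mirrors the audit grep above, using `+` instead of `*` so that an empty `<>` is ignored.

```python
import re

# Literal placeholders such as <attacker_ip> or <user> left in a rendered hint.
PLACEHOLDER_RE = re.compile(r"<[a-z_]+>")
# A variable assigned from a command substitution, e.g. ATTACKER_IP=$(...).
VAR_ASSIGN_RE = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*=\$\(")


def lint_rendered_hint(text: str) -> list[str]:
    """Return human-readable findings for one rendered hint string."""
    findings = []
    for match in PLACEHOLDER_RE.finditer(text):
        findings.append(f"unsubstituted placeholder {match.group(0)}")
    for match in VAR_ASSIGN_RE.finditer(text):
        name = match.group(0).split("=", 1)[0]
        findings.append(f"variable {name} assigned from $(...); inline the substitution")
    return findings
```

This only catches the assignment itself. Detecting a dereference in a later step (C3) still needs the multi-step parse noted above.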


Class 2 — PTY / TUI Crash

Description: A tool that requires an interactive terminal (PTY) is launched via nohup, &, or docker exec without -t. The tool's TUI library detects no PTY and panics, exiting silently after writing one or two startup log lines — making it appear to have started successfully.

Root cause: Modern CLI tools use PTY-detecting interactive frameworks (ligolo-proxy: grumble + survey/v2; msfconsole: readline). Without a PTY the library aborts. The process writes startup output before crashing, so log-based liveness checks pass even though the process is dead.

Issue Objective Instance
#79 (closed) T6 msfconsole msfconsole times out at 300s when run directly inside archer-kali — PTY context absent
#82 (closed) T4/T6 bind_netcat payload hangs in eval harness — no PTY for interactive netcat session
#389 (closed) T53 ligolo archer-kali missing /dev/net/tun — ligolo permanently broken at container level
#441 (closed) T53 ligolo nohup proxy+agent — same crash pattern, earlier attempt at fix
#446 (closed) T53 ligolo ligolo-ng.yaml in repo root crashes proxy on automated startup
#455 (closed) T53 ligolo Version incompatibility (agent 0.6.2 vs proxy 0.8.3) — compound infrastructure failure
#474 (open) T53 ligolo Model uses nohup /usr/bin/ligolo-proxy; proxy writes "Listening" then crashes; grep -q Listening passes; agent gets connection refused

Class-level remediation:

  1. tmux wrapper standard — any tool with a TUI must run inside tmux new-session -d -s <name>. nohup is wrong for TUI tools, always.
  2. Process liveness via session check — never use grep -q <startup_string> <logfile> as the liveness test. Use tmux has-session -t <name>, which only passes if the session is alive. Startup log lines are written before crashes.
  3. Hint audit — grep all hints that launch interactive tools (ligolo-proxy, msfconsole, responder, etc.) and verify they use tmux, not nohup/background.
  4. Container pre-flight — verify /dev/net/tun exists in archer-kali before any ligolo run.

Prevention status: C1 (nohup/& on TUI tools) CI gated in check_hints.py. Gap: process liveness check pattern (grep-q-logfile as liveness) not yet enforced separately from C2.
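
A session-based liveness check can be sketched in a few lines. The helper names below are hypothetical, and the `run` parameter exists only so the check can be exercised without tmux installed. The approach relies on tmux's default behaviour of destroying a session when its only command exits.

```python
import subprocess
from typing import Callable


def tmux_launch_argv(session: str, command: str) -> list[str]:
    """Build the argv that starts `command` inside a detached tmux session."""
    return ["tmux", "new-session", "-d", "-s", session, command]


def session_alive(
    session: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """True only if tmux still has the session; startup log lines play no part."""
    result = run(
        ["tmux", "has-session", "-t", session],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A harness would run `tmux_launch_argv(...)`, wait briefly, and then call `session_alive(...)`. A tool that printed "Listening" and then crashed fails this check, whereas it passes a grep on the log file.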


Class 3 — Case Mismatch / Pattern Miss

Description: A grep or regex pattern in a hint or success_fn fails because the actual log output uses different capitalization, punctuation, or wording than the pattern expects. Also includes success signals that are present in the tool output but absent from the model's visible context.

Issue Objective Instance
#166 (closed) T9 nikto OSVDB patterns not in ARCHER stdout — success_fn never fires despite correct output
#191 (closed) T23 hash-crack success_fn regex hash vs hashes — john plural output never matches
#380 (closed) web_lfi <script> match in _SIGNAL_RE contaminates Tier 2 evidence with HTML noise
#383 (closed) linux_privesc _SIGNAL_RE missing evidence patterns for sudo/SUID output — correct output invisible
#393 (closed) T3a/T45 halt_bypass_signals missing nmap NSE vuln output patterns
#426 (closed) T53 ligolo http/1. from wget error satisfies has_traversal — wrong signal matched
#474 (open) T53 ligolo grep -q 'agent joined' (lowercase); actual log: msg="Agent joined." (capital A, period)

Class-level remediation:

  1. Always use -i for log greps: grep -qi costs nothing and eliminates case mismatch. No case-sensitive grep in hints without an explicit rationale.
  2. Verify patterns against live output — before adding any grep pattern, run the actual tool in the lab and copy the exact string. Never infer log format from documentation.
  3. Pattern specificity: _SIGNAL_RE entries must match tool-specific output, not generic HTML/HTTP strings that appear in error pages. Add negative examples to the test suite.

Prevention status: C2 (grep -q without -i on log files) CI gated in check_hints.py.
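
The #474 mismatch is easy to reproduce. The sketch below shows the case-insensitive match and a simple lint for `grep -q` without `-i`. Both helper names and the lint regex are assumptions for illustration, not the C2 rule as implemented; the lint only inspects the first flag cluster, so a command written as `grep -q -i` would be flagged incorrectly.

```python
import re

LOG_LINE = 'msg="Agent joined."'  # log line quoted for issue #474 above

# "grep -<flags>" where the flag cluster contains q but no i.
CASE_SENSITIVE_GREP_RE = re.compile(r"\bgrep\s+-(?![A-Za-z]*i)[A-Za-z]*q[A-Za-z]*\b")


def log_contains(pattern: str, log_text: str) -> bool:
    """Case-insensitive literal search, roughly what `grep -qiF` does."""
    return re.search(re.escape(pattern), log_text, flags=re.IGNORECASE) is not None


def is_case_sensitive_grep(command: str) -> bool:
    """True if the command uses grep -q without -i in the same flag cluster."""
    return CASE_SENSITIVE_GREP_RE.search(command) is not None
```

With the original lowercase pattern, a plain substring test on `LOG_LINE` fails while `log_contains("agent joined", LOG_LINE)` succeeds.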


Class 4 — Premature Objective Achieved / False Positive

Description: The model emits [OBJECTIVE_ACHIEVED] based on partial, fabricated, or ambiguous evidence. Variants: reading only the first positive line of output without checking for subsequent error lines; success_fn matching on wrong-host output; echo/printf fabrication of expected strings.

Issue Objective Instance
#78 (closed) multiple HALT_DISCIPLINE false positive — failed exploit marked answerable, quality=1.0 saved to playbook
#81 (closed) T6 Model fabricates uid=0 via echo — success_fn passes on fabricated output
#83 (closed) T6 success_fn must guard against echo/printf fabrication of uid=0
#107 (closed) multiple Model emits OA on exploit failure + loops on bare bash
#115 (closed) T7 _t7_host_discovery too broad — any 192.168.56.x IP in output passes
#116 (closed) T12 _t12_vuln_assess matches critical|high|medium in natural language
#123 (closed) T12 vulnerability_assessment runs lynis on local host — success_fn passes on wrong-host output
#124 (closed) T10 post_exploitation enumerates Kali host — wrong-host false positive
#132 (closed) T12 T12 passes on wrong-host lynis
#133 (closed) T10 T10 passes on wrong-host enumeration
#141 (closed) T14/T16 Passes when BWA target unreachable — no connectivity guard
#142 (closed) T23 T23 passes when john runs but finds no hashes
#143 (closed) T11 port_scanning speed anomaly passes too quickly
#144 (closed) T15 No _targeted_at guard — localhost false positive risk
#163 (closed) T27 linux_privesc premature OA — model stops after 1 command
#168 (closed) T12 searchsploit passes without scanning remote target
#236 (closed) T28 _t28_suid_privesc fires on command text, not evidence
#253/#254 (closed) multiple 9 boundary violations — OA exits bypassing verify_fn; echo fabrication suspected
#367 (closed) T12 lynis+searchsploit passes without scanning remote target
#368 (closed) T10 uname on Kali + failed SSH attempt passes post_exploit
#387 (closed) T54 socat Fires OA without passing verify_traversal
#423 (closed) T56 [THOUGHT] text contains target IP — satisfies _real_pivot_traversal
#474 (open) T53 R3 Model reads agent startup warning, declares OA; next lines show fatal connection error

Class-level remediation:

  1. _targeted_at guard on all success_fns — every success function must verify output came from the correct target IP, not localhost or the Kali host. This guard alone would have prevented ~8 instances in this list.
  2. Exclude [THOUGHT] text from pattern matching — strip [THOUGHT]...[/THOUGHT] blocks before applying success_fn patterns. Model reasoning should never satisfy an evidence check.
  3. Fabrication guard — success_fn patterns for privilege escalation outputs (uid=0, root) must appear in command stdout, not in text that the model could have generated itself.
  4. Multi-signal requirement — OA for complex objectives (exploitation, pivoting) should require N distinct evidence signals, not a single string match.
  5. Fatal-line scan — add hint instruction to read full command output before declaring success: startup warnings followed by fatal errors are the canonical false-positive pattern.

Prevention status: C10 (_targeted_at guard in skills/) CI gated in check_hints.py. [THOUGHT] stripping implemented in eval_harness.py. Gap: eval_harness.py _t*_ success functions not yet checked — tracking issue #527.
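
Several of these guards compose naturally into one check. The sketch below is one possible shape, assuming the harness can hand a success function the dispatched command and its stdout separately. `targeted_at`, `objective_achieved`, and the list of fatal markers are hypothetical stand-ins, not the real `_targeted_at` guard.

```python
import re

# Model reasoning blocks; these must never count as evidence.
THOUGHT_RE = re.compile(r"\[THOUGHT\].*?\[/THOUGHT\]", re.DOTALL)
# Illustrative fatal markers; a real list would be built from observed tool output.
FATAL_RE = re.compile(r"\b(?:fatal|panic|connection refused)\b", re.IGNORECASE)


def targeted_at(command: str, target_ip: str) -> bool:
    """True if the command names target_ip as a complete address."""
    pattern = rf"(?<![\d.]){re.escape(target_ip)}(?![\d.])"
    return re.search(pattern, command) is not None


def objective_achieved(command: str, stdout: str, target_ip: str,
                       signal_re: re.Pattern) -> bool:
    """Accept a success signal only from target-directed, error-free stdout."""
    if not targeted_at(command, target_ip):
        return False                       # wrong-host guard
    evidence = THOUGHT_RE.sub("", stdout)  # drop model reasoning
    if FATAL_RE.search(evidence):
        return False                       # fatal-line scan over the full output
    return signal_re.search(evidence) is not None
```

Matching against stdout only, rather than the full session text, is also what keeps an `echo uid=0` inside the command string from counting as evidence, although it does not by itself stop a command whose output is fabricated.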


Class 5 — Wrong Module / Tool Selection

Description: The model selects the wrong Metasploit module, CLI tool, or approach — hallucinating a module name, choosing a module for a different vulnerability, selecting the wrong tool for the target environment, or misconfiguring a tool that would otherwise work.

Issue Objective Instance
#62–#70 (closed) sweep Early sweep findings: nmap wrong_skill, wapiti no_output, msfconsole no_output, hydra wrong_skill, searchsploit early_complete
#80 (closed) T5 hydra hydra uses rockyou as both user and password list — will never complete
#90 (closed) T6 UnrealIRCd Wrong exploit module selected (samba/usermap_script instead of unix/irc/unreal_ircd)
#104 (closed) T5 SSH brute-force 0/3 — hydra hints insufficient
#105 (closed) T6 msfconsole module path or port mismatch
#165 (closed) T21 Tomcat msfconsole module error on Metasploit 6.4.126
#173 (closed) T5 ncrack 917-line wordlist always times out before reaching msfadmin
#238 (closed) T35 UDP scan always times out — needs port scope or --top-ports
#372 (closed) entity_id nmap missing -O/-sV flags — incomplete output
#376 (closed) vuln_assess Model stops after nmap -sV, never runs vuln scripts
#412 (closed) T45 nmap --script vuln too slow for OWASP-BWA — nuclei should be primary
#419 (closed) T28 SUID Model runs nmap interactive but doesn't pipe !sh
#429 (closed) T3a nmap --script vuln times out on MS2 — nuclei should be primary
#472 (open) PT-EXPLOIT-01 Wrong Metasploit module, missing PAYLOAD, module hallucination

Class-level remediation:

  1. Exploitation short-circuits — for every named CVE or well-known vulnerability (vsftpd, UnrealIRCd, MS08-067, etc.) add an explicit short-circuit block with exact module path, PAYLOAD setting, and RHOST/LHOST sequence. The model cannot reliably select the correct module from general knowledge.
  2. Tool priority guidance — hints must specify which tool is primary and which is fallback. "Use nuclei; if unavailable, fall back to nmap --script vuln" is unambiguous. Generic "run a vulnerability scanner" produces tool mismatches.
  3. Module reference tables — add a reference block in exploitation hints listing correct modules for common MS2/DVWA targets by port and service. The model matches task context to the table rather than hallucinating.
  4. Wordlist scope — brute-force hints must specify targeted wordlists, not generic full-length lists. A 917-line wordlist with rockyou order guarantees timeout on any short-session objective.

Prevention status: Documentation only. No CI gate — requires semantic understanding of hint content. Manual review at hint authoring time.


Class 6 — Missing Short-Circuit / Hint Gap

Description: A hint block exists for one phase of an operation but not for the critical subsequent phase. The model is left without guidance at the decisive step and falls back to generic behavior. Also includes hints that specify the wrong host, wrong path, or wrong authentication method for the target environment.

Issue Objective Instance
#80/#91 (closed) T5 hydra SSH legacy key negotiation — no hint for -oHostKeyAlgorithms flag; SSH rejects all auth attempts
#145 (closed) T17 MySQL enumeration failing — hints insufficient for mysql client auth flow
#157 (closed) T15 Broken bash hint + missing syntax-error loop guard
#162 (closed) T23 Hash-crack objective never dumps shadow — model skips extraction step
#167 (closed) T16 DVWA LFI curl returns empty — missing -L flag; session not persisting
#175 (closed) T16 DVWA LFI security level POST blocked by CSRF; traversal runs at wrong security level
#182 (closed) T21b Model fails reverse shell sequence — listener-before-trigger sequencing missing
#189 (closed) T23 Hint not driving hash crack step — 0/3 with 2 cmds (dump only, no john/hashcat)
#193 (closed) T23 Hint dump step unreliable — model drops sudo -S and tee; john gets empty file
#235 (closed) linux_privesc SSH enumeration blocked by sudo -l password prompt on MS2 — hint missing sshpass
#239 (closed) T46/T47 post_exploitation SSH hints missing HostKeyAlgorithms flags for MS2
#242 (closed) T46/T47 exfiltration and persistence hints missing SSH-first / sshpass for remote targets
#245 (closed) T49 DVWA XSS web_xss hints missing DVWA stored XSS workflow — model skips login, hits wrong endpoint
#311 (closed) T16/T24/T26 Model completes task but never emits OA — 100% HALT_DISCIPLINE; completion signal too weak
#328 (closed) linux_privesc SSH loop and sudo-hang on MS2
#329 (closed) T24 web_xss 100% HALT_DISCIPLINE — completion signal too weak
#331 (closed) web_enum robots.txt 404 needs fallback to directory brute-force
#378 (closed) T50–T56 All pivot hints missing verification step
#390 (closed) T54 socat socat_relay runs on attacker host, not pivot — wrong machine targeted
#391 (closed) T55 ProxyJump hint missing trailing command — connects then drops
#397 (closed) T50 Default port 80 closed on pivot target; hint missing traversal step
#399 (closed) T51 Wrong proxychains config file path in hint
#406 (closed) T48 CSRF No hint for auth-first flow — /vulnerabilities/csrf/ returns 404 unauthenticated
#432 (open) multiple Hardcoded vulnbox IPs/paths — generic companion missing
#472 (open) PT-EXPLOIT-01 vsftpd exploitation short-circuit missing (detection short-circuit exists)

Class-level remediation:

  1. Two-layer rule — every app-specific detection hint must have a paired exploitation hint. For every _hints_* block that fires on confirmation/detection keywords, verify a paired exploitation block exists. See docs/research/range-lock-in.md.
  2. Verification step mandatory — every hint block that sets up infrastructure must include a verification step that produces the evidence the success_fn checks.
  3. SSH compatibility flags — any hint connecting to MS2/Metasploitable targets via SSH must include -oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedAlgorithms=+ssh-rsa or equivalent. This has recurred in T5, T17, T46, T47 — it is a lab-wide compatibility requirement.
  4. Host targeting explicit — every hint command must explicitly state which host it runs on (attacker, pivot, target). Ambiguous host context produces wrong-machine execution.
  5. Completion signal audit — any objective with HD=100% and OA=0% across multiple runs has a missing or weak completion signal, not a model failure. Check whether OA can be emitted given the hint structure before adjusting halt thresholds.

Prevention status: C4 (app-specific without generic companion) CI gated in check_hints.py. Verification step presence (item 2) not yet gated — C5 tracking issue #526. Manual one-time verification step audit: #483.
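
Item 3 is a candidate for a lint rule, since the required flags are fixed strings. The sketch below assumes a strict check for the two options named above and does not recognise the "or equivalent" forms; `missing_ssh_opts` is a hypothetical helper.

```python
import shlex

REQUIRED_SSH_OPTS = (
    "HostKeyAlgorithms=+ssh-rsa",
    "PubkeyAcceptedAlgorithms=+ssh-rsa",
)


def missing_ssh_opts(command: str) -> list[str]:
    """Return required options absent from an ssh/scp command ([] if none or not SSH)."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to a plain split
        tokens = command.split()
    if not any(token in ("ssh", "scp") for token in tokens):
        return []
    return [opt for opt in REQUIRED_SSH_OPTS if opt not in command]
```

Run over every hint command that targets the MS2 lab, a non-empty result would mark the hint as missing the lab-wide compatibility flags.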


Class 7 — VRAM / Resource Bleed

Description: A long-running or resource-intensive objective saturates GPU VRAM or system resources. Subsequent objectives start with depleted resources, producing cmds=0 or truncated runs. Also includes port/process pollution between runs where cleanup is absent.

Issue Objective Instance
#77 (closed) playbook Fast replay used DEFAULT_COMMAND_TIMEOUT (120s) instead of skill timeout — wrong resource budget
#180 (closed) multiple Overnight VRAM-saturated depth-blocked zero-command sessions — excluded from pipeline
#248 (closed) collection run_data_collection.sh eval lock blocks its own Phase 1 child
#407 (closed) T50 Stale tunnel ports (8080–8085) not cleaned between runs
#418 (closed) T51 SOCKS port 1080 exhaustion — ssh -D not killed by prior cleanup
#436 (closed) multiple Stale ~/.archer_eval.lock on process exit without cleanup
#444 (closed) T52/T53/T56 _setup_pivot_range missing chisel/ligolo cleanup — runs 2+ fail
#451 (open) multiple VRAM bleed between objectives — ollama reload fires between runs not between objectives

Class-level remediation:

  1. Between-objective VRAM flush: ollama stop + prewarm after each objective's runs complete. Prevents saturation from long objectives bleeding into short ones.
  2. Idempotent setup_fn — every setup_fn must kill processes and release ports from prior runs before starting. A setup_fn that assumes a clean state will fail on run 2+.
  3. Lock file cleanup — eval lock must be removed in a finally block, not just on clean exit. Any eval process that exits abnormally should release the lock.
  4. Resource monitoring — Cockpit overview during long eval runs provides early warning: CPU/RAM flatline mid-run indicates bleed-out, not completion.

Prevention status: Documentation only for VRAM flush (item 1). setup_fn idempotency (item 2) partially enforced — pivot/AD preflights raise PreflightFailure (#494/#493), but cleanup completeness not audited. Tracking issue: #528 (setup_fn idempotency audit). Ligolo TUN route flush gap tracked in #525.
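
The lock cleanup in item 3 can be expressed as a context manager so the `finally` block cannot be forgotten at call sites. This is a sketch under the assumption that the lock is a plain file created exclusively; `eval_lock` is a hypothetical name, not the harness's existing lock code.

```python
import os
from contextlib import contextmanager


@contextmanager
def eval_lock(path: str):
    """Hold an exclusive lock file for the duration of the with-block."""
    # O_EXCL makes creation atomic: a second holder gets FileExistsError.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        # Runs on normal exit and on exceptions, including KeyboardInterrupt.
        os.close(fd)
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass
```

A `finally` block does not run if the process is killed with SIGKILL, so a stale-lock check against the recorded PID would still be needed for that case.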


Class 8 — Character Limit / Command Truncation

Description: A hint block, command string, or evidence extraction exceeds a character limit, causing truncation, rejection, or silent data loss.

Issue Objective Instance
#334 (closed) web_xss _extract_evidence CMD truncation at 120 chars — XSS POST payloads invisible to Tier 2 scorer
#448 (open) T52 chisel Hint block exceeds 500-char harness limit — command rejected
#452 (closed) T52 chisel &; syntax — degenerate loop on every run

Class-level remediation:

  1. check_hint_lengths.py in CI — gate already enforces 500-char limit. Run locally before any hint commit.
  2. Split at logical boundaries — if a command exceeds the limit, split into two sequential hint steps at a logical boundary. Never concatenate with && or &; to stay under limit.
  3. Evidence extraction window: _extract_evidence CMD truncation at 120 chars is too short for POST payloads. Raise or eliminate the truncation for command text used in Tier 2 scoring.

Prevention status: C7 (500-char limit) CI gated in check_hints.py (absorbed from check_hint_lengths.py).
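
For reference, the two mechanical checks in this class amount to very little code. The sketch below is an assumption about what such a rule could look like, not the C7 implementation; `lint_hint_command` is a hypothetical name.

```python
MAX_HINT_CHARS = 500  # harness limit stated in this section


def lint_hint_command(command: str, limit: int = MAX_HINT_CHARS) -> list[str]:
    """Return findings for one hint command: over-length or the '&;' construct."""
    findings = []
    if len(command) > limit:
        findings.append(f"{len(command)} chars exceeds the {limit}-char limit")
    if "&;" in command:
        findings.append("'&;' is a shell syntax error; split into separate steps")
    return findings
```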


Class 9 — Routing Miss

Description: A task is routed to the wrong skill pack, causing the model to receive hints and context for the wrong domain. Includes keyword scorer over-capture (one skill's keywords absorb tasks that belong to a sibling skill) and classifier confidence threshold mismatches.

Issue Objective Instance
#62/#63/#66/#67 (closed) sweep Early sweep findings: nmap, arp-scan, hydra, whatweb all routing to wrong skills
#103 (closed) T8 web_enumeration routing miss — skill= empty for directory enum task
#147 (closed) T25 Dead-tie — web_exploitation and web_cmd_injection score equally
#257 (closed) classifier Confidence threshold 0.7 never reached by TF-IDF+LR — lowered to 0.5
#276 (closed) collection Sparse gate ignores classifier confidence — routing misses never self-correct
#294 (closed) T33/T38 T49 hint fix broke vulnerability_assessment and entity_identification routing
#343 (closed) T42 Routes to web_vulnerability_scanning instead of web_enumeration
#344 (closed) routing web_authentication bonus_fn fires on bare login — misroutes auth-bypass tasks
#345 (closed) routing system_info version keyword misroutes vuln-assessment tasks
#346 (closed) routing entity_identification captures check what's running — should be service_enumeration
#347 (closed) routing port_scanning captures what's listening — should be service_enumeration
#348 (closed) routing reconnaissance captures bare scan the target — should be port_scanning
#385 (open) T52 Routes to ssh_tunneling instead of chisel_pivot

Class-level remediation:

  1. exclude_keywords discipline — every skill must have explicit exclude patterns for tasks that superficially match but belong to a sibling. Routing misses are often symmetric — if T52 routes to ssh_tunneling, ssh_tunneling is over-capturing.
  2. Ambiguous task eval — run eval_harness --ambiguous after any hint or keyword change to verify no new misroutes are introduced.
  3. Bonus_fn keyword audit — bonus functions that fire on single common words (login, version, check) are over-broad. Require at least two co-occurring terms or domain-anchored context.
  4. SHA tagging (#478): once #478 lands, use the SHA tag to filter pre-fix routing labels out of classifier retraining, so that old misroutes do not train the next classifier.

Prevention status: Documentation only. No CI gate — routing correctness requires live eval. --ambiguous flag exists for manual verification.
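
The co-occurrence requirement in item 3 can be sketched as follows. The function name and the example term sets are hypothetical; the real bonus functions and their keyword lists are not shown in this document.

```python
import re


def cooccurrence_bonus(task: str, anchors: set[str], context: set[str]) -> bool:
    """Fire only when an anchor word and a context word both appear in the task."""
    words = set(re.findall(r"[a-z0-9_]+", task.lower()))
    return bool(words & anchors) and bool(words & context)
```

With `anchors={"login"}` and `context={"bypass", "brute", "credentials"}`, a task that merely mentions a login page no longer earns the bonus, while "bypass the login form" still does.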


Class 10 — Range Lock-In

Description: A hint block fires only on a specific app name, IP address, or target path, training the model to solve one specific target rather than the vulnerability class. The model fails on any variant or alternative target.

Issue Objective Instance
#153 (closed) Juice Shop Contaminated ft.jsonl sessions (localhost:3000 wrong target) — trained on wrong target URL
#468 (closed) T2a/T4/T6 Task strings named specific apps rather than vuln class
#469 (closed) T14/T24/T25/T26/T48 Task strings with app paths and named apps
#432 (open) multiple Hardcoded vulnbox IPs and paths in hint triggers — generic companion missing

Class-level remediation: See docs/research/range-lock-in.md for full analysis. Summary: every app-specific hint block must have a generic companion block using placeholders (<login-endpoint>, <module>, <target>). App-specific block drives eval pass rate; generic companion teaches the transferable pattern. Task strings must describe the vulnerability class, not the specific app or path.

Prevention status: C4 CI gated in check_hints.py — blocks new app-specific hints without generic companions. Cleanup of existing violations: #432.


Class 11 — False Positive success_fn / halt_fn

Description: A success or halt function returns true on a signal that appears in non-success contexts — curl error output, [THOUGHT] text, wrong-host output, truncated stdout, or intermediate/partial matches.

Issue Objective Instance
#115 (closed) T7 Any 192.168.56.x IP in output passes _t7_host_discovery
#116 (closed) T12 critical|high|medium in natural language passes _t12_vuln_assess
#164 (closed) T4 _real_uid_root false-negative — skips printf/msfconsole output blocks
#314 (closed) network_exploit L1 check rejects valid non-root code execution and backdoor confirmation
#410 (closed) T54 head -10 truncates SSH banner at line 11 — success_fn never sees OpenSSH
#423 (closed) T56 _real_pivot_traversal matches target IP in [THOUGHT] text
#426 (closed) T53 has_traversal matches http/1. from wget error output
#443 (closed) T56 _halt_tunneling premature fire — traversal threshold too low
#401 (open) T56/pivot HTTP/\d regex matches curl verbose * using HTTP/1.x — not actual traversal

Class-level remediation:

  1. Anchor patterns to tool-specific output — patterns must match strings that only appear in genuine success output, not in error messages, verbose output, or [THOUGHT] text.
  2. Exclude [THOUGHT] from pattern scope — strip [THOUGHT] blocks before applying success_fn patterns.
  3. Stdout-only matching — apply success patterns to command stdout, not full session text.
  4. Threshold calibration — traversal-type functions should require the target IP to appear in network-layer output. HTTP header strings and verbose curl flags are not traversal evidence.
  5. Avoid head -N truncation — use grep to extract the relevant line rather than truncating at a fixed line count.

Prevention status: C10 (skills/ success_fn _targeted_at guard) CI gated. [THOUGHT] stripping implemented in eval_harness.py. Gap: eval_harness.py _t*_ success functions not yet gated — #527.
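
The open #401 instance illustrates item 1 well. A sketch of the difference between a loose pattern and one anchored to an HTTP status line is shown below; it assumes response lines appear either bare (`HTTP/1.1 200 OK`) or with curl's verbose `< ` prefix, and the names are hypothetical.

```python
import re

# Loose pattern of the kind described in #401: matches any mention of HTTP/<digit>.
LOOSE_RE = re.compile(r"HTTP/\d")
# Anchored pattern: a status line at the start of a line, followed by a 3-digit code.
STATUS_LINE_RE = re.compile(r"^(?:< )?HTTP/\d(?:\.\d)? \d{3}\b", re.MULTILINE)


def has_http_response(stdout: str) -> bool:
    """True only if stdout contains an actual HTTP status line."""
    return STATUS_LINE_RE.search(stdout) is not None
```

The curl informational line `* using HTTP/1.x` satisfies the loose pattern but not the anchored one, because it carries no status code and does not begin with the protocol token.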


Class 12 — Model Loop

Description: The model repeats the same command or sequence without making progress, eventually hitting max_commands. Variants: retrying a failed setup step indefinitely; completing the task but not emitting OA; looping on bare infrastructure commands with no payload.

Issue Objective Instance
#107 (closed) multiple Model loops on bare bash after exploit failure
#125 (closed) T14 Degenerate proxy loop when DVWA unreachable — no connectivity pre-check
#126 (closed) agent done:true + empty command causes infinite agent loop
#163 (closed) T27 Stops after 1 command — model treats minimal output as completion
#230 (closed) watchdog cmd_count comparison never resets between sessions — premature ambiguous block kills
#251 (closed) T11/T23/T24/T26 102 runs at 97–100% halt depth, 0% OA
#311 (closed) T16/T24/T26 100% HALT_DISCIPLINE — model completes task but OA never emitted
#382 (open) linux_privesc Model loops on bare SSH — drops command payload
#386 (closed) T51 SOCKS Hits max_commands every run — SOCKS setup loop

Class-level remediation:

  1. Progress anchors in hints — hints must include checkpoints giving the model a clear signal it has advanced ("if port X is now listening, proceed to step 3"). Without progress anchors, the model retries setup steps indefinitely.
  2. Connectivity pre-check — any hint for a web target must verify the target is reachable before entering the main workflow. A target that is down causes looping on connection attempts.
  3. HD=100%/OA=0% diagnosis — this pattern means the model is completing the work but not emitting OA. Root cause is always hint structure (missing OA signal) or success_fn (not firing on correct output), not model capability. Adjust halt thresholds only after ruling out both.
  4. done:true guard — the agent loop must reject empty commands with done:true and not re-queue.
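
The done:true guard in item 4, together with a simple repeat check, can be sketched as one classification step inside the agent loop. This is a minimal sketch and not ARCHER's actual loop: AgentStep, classify_step, the return labels, and the max_repeats threshold are all hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentStep:
    """One parsed model response (hypothetical shape)."""
    command: str
    done: bool


def classify_step(step: AgentStep, history: list[str], max_repeats: int = 3) -> str:
    """Decide what the agent loop should do with a parsed step.

    'halt'  : done:true with an empty command; stop and do not re-queue.
    'empty' : empty command without done; nothing to dispatch.
    'loop'  : the same command was already issued max_repeats times.
    'run'   : dispatch the command normally.
    """
    command = step.command.strip()
    if not command:
        return "halt" if step.done else "empty"
    # history is assumed to hold previously dispatched, stripped commands.
    if Counter(history)[command] >= max_repeats:
        return "loop"
    return "run"
```

A caller would treat 'halt' and 'loop' as terminal for the session and record which one fired, so that HD=100%/OA=0% runs can be separated from genuine retry loops during diagnosis.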

Prevention status: Documentation only. No CI gate — loop detection requires live eval observation.


Class 13 — Infrastructure Gap

Description: A required binary, container configuration, network configuration, or environment component is missing or misconfigured. Failures in this class are not hint or model failures — the environment itself is broken.

Issue Objective Instance
#79 (closed) T6 msfconsole times out in archer-kali container — container resource ceiling
#176 (closed) T17–T19/T22 Stale playbook entries produce 0-cmd fast fails
#177 (closed) T2b T2b dependency on T2a not enforced — runs on uninitialized state
#181 (closed) T2a Port 6200 not releasing between runs — exit -y cleanup insufficient
#186 (closed) T21 Unhandled TimeoutExpired in Tomcat restart kills harness
#194 (closed) T23 john.pot contamination — _setup_t23 cleans wrong host
#258 (closed) T23 rockyou.txt corrupted in archer-kali + MS2 passwords missing from wordlist
#277 (closed) multiple Stale tool processes persist in archer-kali after eval harness exit
#302 (closed) T16 _setup_t16 must reset DVWA admin password before each run
#389 (closed) T53 archer-kali missing /dev/net/tun
#400 (closed) T56 172.30.2.20 not deployed — pivot range only has 2 nodes
#425 (closed) T53 ligolo-agent binary missing from pivot range image
#437 (closed) CI verify-fix.yml no lab pre-flight — posts FAIL when targets unreachable
#442 (closed) T53/T56 tini + openssh-client missing from Dockerfile
#450 (closed) T51 SSH -D race condition — ss check runs before port is bound

Class-level remediation:

  1. Prerun sanity check: archer-prerun verifies VM reachability, required binaries, and container state before any eval run. Do not start a run against a target that has not been verified reachable.
  2. Idempotent setup_fn — every setup function must be idempotent: clean prior state, verify environment, then initialize. Non-idempotent setup fails on run 2+.
  3. Dockerfile audit on range changes — any change to docker/pivot-range/Dockerfile requires a full T50–T56 run to confirm infrastructure integrity.
  4. Race condition guards — any hint that starts a background process and immediately checks for it must include a sleep + retry loop, not a single-shot check.
  5. Dependency enforcement — objectives with sequential dependencies (T2a → T2b) must enforce ordering in setup_fn, not rely on run order.
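
The race condition guard in item 4 amounts to polling with a deadline instead of checking once. The sketch below shows that pattern for a TCP port, such as a local SOCKS listener. The function name wait_for_port and its default timings are assumptions for illustration, not part of the ARCHER harness.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connect to (host, port) succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the listener is bound and accepting.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # not bound yet (or refused); retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A single-shot check is the same code with timeout=0, which is exactly the behavior that fails when the check runs before the background process has bound its port.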

Prevention status: PreflightFailure exception raises SKIP (not FAIL) on VM/AD unreachability — CI gated (#494/#493). setup_fn cleanup completeness not yet audited — #528. Ligolo TUN route flush: #525.


Class 14 — Training Data Contamination

Description: Invalid, low-quality, or incorrectly-labeled sessions enter the fine-tuning or classifier training pipeline, degrading model behavior in ways that are difficult to diagnose.

Issue Pipeline stage Instance
#134 (closed) fine-tune depth_blocked sessions leak into training data via _build_full_conversation
#136 (closed) fine-tune stale data/finetune/ with obsolete Alpaca schema — regenerate required
#140 (closed) fine-tune HALT_DISCIPLINE inclusion policy split — build_training_data.py and prepare_finetune.py disagree on what to include
#150 (closed) classifier router_labels.csv no deduplication — classifier biased toward frequently-run objectives
#153 (closed) fine-tune Contaminated Juice Shop ft.jsonl sessions (localhost:3000 wrong target)
#155 (closed) fine-tune Raw model responses with backslash errors contaminate fine-tuning data
#275 (closed) fine-tune Web sessions with connectivity failures not filtered from pipeline
#316 (closed) fine-tune verify_fn_skipped not a hard gate on ft.jsonl writes — unverified sessions enter pipeline
#330 (closed) fine-tune skill="unknown" sessions not filtered by prepare_finetune.py
#369 (closed) fine-tune depth_blocked sessions produce zero-command training examples

Class-level remediation:

  1. Pipeline gate checklist — before any training run, verify: (a) depth_blocked sessions excluded, (b) verify_fn_skipped sessions excluded, (c) skill="unknown" sessions excluded, (d) connectivity-failure sessions excluded, (e) no duplicate routing labels.
  2. Hard gates in code — contamination filters must be code gates, not process guidelines. verify_fn_skipped=True must prevent ft.jsonl write at the point of generation, not in a downstream filter.
  3. Quality filter for classifier: use only entries with eval_label and label_confidence==high (see PROCESSES.md). The 80% of entries labeled unknown are not usable for classifier retraining.
  4. Schema versioning — ft.jsonl schema version must be checked at pipeline entry. Stale entries from prior schema versions must be excluded or migrated.
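
The gate checklist in item 1 can be expressed as a code gate along the following lines. This is a sketch under assumptions: the field names depth_blocked, verify_fn_skipped, and skill come from the issues above, while connectivity_failure, the session dict layout, and the dedup key fields are hypothetical.

```python
from collections import Counter
from typing import Optional

# (field, rejecting value) pairs. "connectivity_failure" is a hypothetical field name.
GATES = [
    ("depth_blocked", True),
    ("verify_fn_skipped", True),
    ("skill", "unknown"),
    ("connectivity_failure", True),
]


def rejection_reason(session: dict) -> Optional[str]:
    """Return the name of the first gate the session fails, or None if it passes."""
    for field, bad_value in GATES:
        if session.get(field) == bad_value:
            return field
    return None


def gate_sessions(sessions: list[dict]) -> tuple[list[dict], Counter]:
    """Split sessions into kept ones and a per-gate count of rejections."""
    kept = []
    rejected = Counter()
    for session in sessions:
        reason = rejection_reason(session)
        if reason is None:
            kept.append(session)
        else:
            rejected[reason] += 1
    return kept, rejected


def dedupe_labels(rows: list[dict], key_fields=("objective", "skill")) -> list[dict]:
    """Keep the first routing-label row per key (key fields are an assumption)."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Returning the rejection counts alongside the kept sessions gives each training run a record of how much data every gate removed.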

Prevention status: Some gates in code (depth_blocked, verify_fn_skipped); others documented in PROCESSES.md only. #501 (success signal verification before playbook write) addresses the wrong-host contamination path. Full gate checklist enforcement: documentation only.


Class 15 — Wrong Host / Target Confusion

Description: The model runs commands against the wrong machine — typically confusing the attacker host (Kali) with the target host, or confusing attacker with pivot in multi-hop scenarios. Distinct from Class 6 (missing hint) in that the model has a hint but misidentifies which machine to execute on.

Issue Objective Instance
#123 (closed) T12 vulnerability_assessment runs lynis on local Kali host instead of target
#124 (closed) T10 post_exploitation enumerates Kali host instead of SSH target
#125 (closed) T14 Enters degenerate proxy loop when DVWA unreachable (connected to wrong interface)
#390 (closed) T54 socat_relay configured and run on attacker host, not pivot
#411 (closed) T53 ligolo-agent run on attacker instead of pivot

Class-level remediation:

  1. Explicit host labels in hints — every hint command must be preceded by a label identifying which host it runs on: # On attacker:, # On pivot (via SSH):, # On target:. The model uses these labels to orient itself.
  2. _targeted_at guards — success_fns for objectives that run commands on remote targets must verify output originated from the target, not localhost (see also Class 4).
  3. Connectivity pre-check as first hint step — the first hint step for any remote objective should verify connectivity to the target before entering the workflow.
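
A static check for the host labels in item 1 could look roughly like the sketch below. It assumes hints are plain text with one command per line and that a label applies to every command up to the next blank line. Both are assumptions about hint layout, and unlabeled_commands is a hypothetical helper, not an existing check_hints.py rule.

```python
import re

# Matches "# On attacker:", "# On pivot (via SSH):", "# On target:".
HOST_LABEL = re.compile(r"^#\s*On (attacker|pivot|target)\b.*:\s*$")


def unlabeled_commands(hint_text: str) -> list[tuple[int, str]]:
    """Return (line number, command) for commands not covered by a host label."""
    problems = []
    labeled = False
    for lineno, raw in enumerate(hint_text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            labeled = False  # a blank line ends the labeled block
        elif HOST_LABEL.match(line):
            labeled = True
        elif line.startswith("#"):
            continue  # ordinary comment, neither label nor command
        elif not labeled:
            problems.append((lineno, line))
    return problems
```

For example, a hint whose second block starts with a bare ssh command and no label would be reported with that command's line number.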

Prevention status: C10 (_targeted_at in skills/) CI gated. Host label enforcement in hint text: documentation only (no static check exists for comment presence). eval_harness.py gap: #527.


Open Instance Summary

As of 2026-05-30, all instances listed in the original inventory (#62–#480) are closed. The two remaining open issues were both filed after #480:

| Class | Open issues |
| --- | --- |
| 1–15 (all) | None — all #62–#480 instances resolved |
| [Post-#480 Class A] Hint Timeout | #632 (PT-POST-02 john cracking) |
| [Class 6 / routing] | #692 (PT-EXPLOIT-04 intra-hint routing) |

Three new failure classes were identified post-#480 (documented in the ARCHER operational inventory, where they are numbered 15–17 in that inventory's local numbering, distinct from Class 15 in this document):

  • [Post-#480 Class A] — Hint Timeout / Execution Time Overrun: Correct tool, correct config, but wordlist or cracking task exceeds session time budget (601s). Fix: scope wordlists, add time-box checkpoints. (#642 closed, #632 open)
  • [Post-#480 Class B] — Phantom Pass: Harness grep filter returns [Success: No output] → model narrates expected output → success_fn matches narration text. A v2-critical harness-level artifact, not a model or hint failure. (#681 closed)
  • [Post-#480 Class C] — Context Saturation (Hint-Level): Fresh session produces 0 commands when hint block exceeds ~600 chars. Identical symptom to Class 7 (VRAM bleed) but mechanism is hint budget, not VRAM depletion. Diagnosis: run in isolation; if still 0 cmds, decompose the objective. (No filed issue; established as CLAUDE.md rule)

Retroactive Audit Targets

Three categories of past failures were never filed as issues because they produced false successes rather than observable eval failures. These are data quality audit targets: sessions that passed contemporary filters but are suspect in retrospect. They are not numbered failure classes, because they do not manifest as failing eval runs.

Wrong-host playbook contamination

Sessions where success=True was recorded but commands ran against Kali/localhost instead of the intended target. These sessions were written to the playbook — and potentially to ft.jsonl — before _targeted_at guards existed on the relevant success_fns (Classes 4, 11, 15).

Detection method: Scan all playbook sessions for Kali-specific signals in target-expected output fields: loopback addresses (127.0.0.1, ::1), /home/kali/ paths, hostname archer-kali, Kali-default service banners. Any session with these in "success" command output where remote target output is expected is wrong-host contaminated.
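
The detection method above can be sketched as a pattern scan over recorded command output. The session layout (id, success, outputs) is hypothetical, and Kali-default service banners are left out because the source does not list them.

```python
import re

# Signals named in the detection method; banner patterns are intentionally omitted.
KALI_SIGNALS = {
    "loopback_ipv4": re.compile(r"\b127\.0\.0\.1\b"),
    "loopback_ipv6": re.compile(r"(?<![0-9A-Fa-f:])::1(?![0-9A-Fa-f:])"),
    "kali_home": re.compile(r"/home/kali/"),
    "kali_hostname": re.compile(r"\barcher-kali\b"),
}


def kali_signals(output: str) -> list[str]:
    """Names of the Kali-specific signals present in one command output."""
    return [name for name, pattern in KALI_SIGNALS.items() if pattern.search(output)]


def suspect_sessions(sessions: list[dict]) -> list[tuple[str, list[str]]]:
    """Return (session id, signals) for success=True sessions with Kali signals."""
    suspects = []
    for session in sessions:
        if not session.get("success"):
            continue  # only recorded successes can be false positives
        hits = sorted({s for out in session.get("outputs", []) for s in kali_signals(out)})
        if hits:
            suspects.append((session.get("id"), hits))
    return suspects
```

A hit is only a flag for review. Loopback addresses can legitimately appear in target output, so flagged entries still need the manual confirmation described under the remediation action.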

Scope: Playbook entries generated before the _targeted_at guard was added to the relevant success_fns. Affected objectives per issue history: T10, T12, T23, T54, T56. After #501 ships (verified success signal gate), new playbook writes are protected. Existing entries require a one-time retroactive scan.

Remediation action: Retroactive playbook scan script; flag suspect entries for Auditor review; remove confirmed wrong-host entries from playbook and ft.jsonl. File a scoped issue once #501 is closed.


Routing miss undercount

Historical routing misses that were never filed as issues because wrong-skill outputs looked plausible at eval time. Class 9 lists 13 routing issues, but 80% of 38,927 routing log entries are unknown — no ground truth. A significant fraction of early eval runs may have been routed to the wrong skill pack and scored anyway, producing training data that reinforces incorrect skill selection.

Detection method: Once #480 (correct_skill passthrough) and #478 (SHA tagging) ship, compare skill_selected vs expected_skill across all eval_label entries. Pre-SHA entries where the two diverge are historical routing misses. The SHA boundary from the #343–#348 routing fix batch identifies the highest-risk epoch.

Scope: All eval runs prior to the #343–#348 routing keyword fix batch. Objectives with known routing ambiguity per Class 9: T8, T25, T42, T43, T44, T45, T47, T49, T52. Requires #478 and #480 to be shipped before the scan is executable.

Remediation action: Post-#478/#480, run routing miss scan against all eval_label routing log entries. Flag entries where skill_selected ≠ expected_skill for exclusion from classifier retraining. Quantify the historical miss rate per skill pair.
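
The per-pair miss count in the remediation action could be computed along these lines. The sketch assumes each routing log entry is a dict carrying skill_selected and expected_skill. Entries without ground truth are skipped rather than counted as hits or misses, and the function name routing_misses is hypothetical.

```python
from collections import Counter


def routing_misses(entries: list[dict]) -> tuple[Counter, float]:
    """Count (selected, expected) miss pairs and the miss rate over labeled entries."""
    misses = Counter()
    labeled = 0
    for entry in entries:
        selected = entry.get("skill_selected")
        expected = entry.get("expected_skill")
        if not selected or not expected or expected == "unknown":
            continue  # no ground truth for this entry
        labeled += 1
        if selected != expected:
            misses[(selected, expected)] += 1
    rate = sum(misses.values()) / labeled if labeled else 0.0
    return misses, rate
```

Keying the counter on the (selected, expected) pair rather than on a single skill shows which confusions dominate, which is the information a keyword or classifier fix needs.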


Depth-blocked contamination window

Sessions generated between VRAM bleed manifestation (~issue #180) and filter addition (~issue #369). These depth-blocked, zero-command sessions may be in ft.jsonl from the contaminated window. The downstream filter was added at #369 but sessions generated between #180 and #369 are potentially contaminated even if they appear well-formed — the depth_blocked flag may have been absent from the schema at that time.

Detection method: Identify git SHAs for the commits closing #180 (VRAM bleed identified) and #369 (filter added). Any ft.jsonl session with a generation timestamp in that window and cmds=0 or depth_blocked=True should be excluded from the next retrain. After #478 ships, SHA tags make this lookup direct — no timestamp correlation needed.

Scope: ft.jsonl sessions from the #180→#369 window. Objectives most likely to have been affected: any long-running objective that ran overnight during that period (T6, T10, T12, T21, T53). Check cmds=0 prevalence in that epoch as a proxy for contamination density.

Remediation action: Extract SHA boundaries from git log; filter ft.jsonl by SHA epoch and cmds=0 / depth_blocked fields; exclude identified sessions from next training run. Document excluded count in PROCESSES.md training run log.
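
Until SHA tags are available, the timestamp-based variant of this filter can be sketched as follows. The generated_at field (ISO 8601) and the window boundaries passed in by the caller are assumptions. cmds and depth_blocked are the fields named above, and a missing cmds field is treated as unknown rather than zero.

```python
import json
from datetime import datetime


def split_contaminated(lines, window_start: datetime, window_end: datetime):
    """Split ft.jsonl lines into (clean, excluded) session lists.

    A session is excluded when it was generated inside the window and has
    cmds == 0 or depth_blocked set to true.
    """
    clean, excluded = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        session = json.loads(line)
        generated = datetime.fromisoformat(session["generated_at"])  # hypothetical field
        in_window = window_start <= generated <= window_end
        suspect = session.get("cmds") == 0 or session.get("depth_blocked") is True
        if in_window and suspect:
            excluded.append(session)
        else:
            clean.append(session)
    return clean, excluded
```

The length of the excluded list is the count to record in the PROCESSES.md training run log.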


Cross-Class Meta-Patterns

Four meta-patterns appear across multiple failure classes. A fix targeting the meta-pattern closes multiple classes simultaneously.

Meta-pattern A — Startup log liveness check

Classes 2, 3, 11.

The pattern: grep -q <startup_string> <logfile> as a process liveness test. Startup strings are written before crashes. The fix is always the same: check process existence (tmux has-session, pgrep, /proc/<pid>), not log content.

Prevention status: C1 (nohup/& on TUI tools) and C2 (grep -q without -i on log files) gated in check_hints.py. Gap: process liveness via session check (tmux has-session) not yet enforced.
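
The difference between a log-content check and a process-existence check can be sketched as below. The helper names are hypothetical and POSIX is assumed. os.kill with signal 0 and tmux has-session -t are standard calls that test for existence without affecting the process or session.

```python
import os
import subprocess


def pid_alive(pid: int) -> bool:
    """True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def tmux_session_alive(name: str) -> bool:
    """True if tmux reports a session with this name."""
    try:
        result = subprocess.run(
            ["tmux", "has-session", "-t", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        return False  # tmux is not installed on this host
    return result.returncode == 0
```

Neither function reads a log file, so a startup string written before a crash cannot produce a false positive.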

Meta-pattern B — Missing verification step

Classes 4, 6, 11.

Every hint that sets up infrastructure declares success without confirming the objective is actually complete. A mandatory verification step — one that produces the specific output the success_fn checks — prevents the entire category. This is the single highest-leverage audit action: grep every hint for a verification step and file a bug for any that lack one.

Prevention status: Manual audit in progress (#483). CI gate not yet implemented — C5 is deferred in check_hints.py pending a reliable absence heuristic. Tracking issue: #526.

Meta-pattern C — App-specific without generic companion

Classes 5, 6, 10.

Hints that target a specific app (vsftpd, DVWA, Metasploitable2) train the model to solve the box, not the vulnerability class. The two-layer rule (app-specific + generic placeholder companion) addresses all three classes. Reference: docs/research/range-lock-in.md.

Prevention status: CI gated — C4 in check_hints.py blocks new violations at commit time. Remediation of existing hardcoded IPs/paths tracked in #432.

Meta-pattern D — Wrong-host execution

Classes 4, 6, 15.

The model executes commands on the attacker host instead of the target, passes success checks because the output looks correct (uid=0 on Kali, lynis running locally, etc.), and produces a false positive. _targeted_at guards on success_fns and explicit host labels in hints address this pattern across all three classes.

Prevention status: Partially gated — C10 in check_hints.py checks skills/PT-*.py success_fn/verify_fn references. Gap: testenv/eval_harness.py objective success functions (_t*_ family) are not checked. Tracking issue: #527.


Highest-Leverage Remediations

Actions that close multiple open issues or prevent entire failure classes. CI status column reflects whether the action is enforced automatically.

| Action | Classes addressed | CI status | Tracking |
| --- | --- | --- | --- |
| tmux wrapper standard for all TUI tools | 2 | C1 gated | |
| grep -qi standard for all log pattern checks | 3 | C2 gated | |
| _targeted_at guard on all success_fns (skills/) | 4, 11, 15 | C10 gated | |
| _targeted_at guard on eval_harness.py _t*_ fns | 4, 11, 15 | Not gated | #527 |
| Exclude [THOUGHT] from success_fn scope | 4, 11 | Implemented | |
| Verification step audit across all hint blocks | 4, 6, 11 | Not gated | #483, #526 |
| Two-layer rule (app-specific + generic companion) | 5, 6, 10 | C4 gated | #432 |
| Between-objective VRAM flush | 7 | Not gated | #451 |
| setup_fn idempotency audit | 7, 13 | Not gated | #528, #525 |
| check_hint_lengths.py / check_hints.py | 8 | C7 gated | |
| Pipeline gate checklist (training) | 14 | Partial | #501 |

  • docs/research/range-lock-in.md — full analysis of Class 10
  • #96 — routing log quality analysis (feeds Class 9 and 14 remediations)
  • #478 — git SHA tagging (Class 9 remediation; required for retroactive routing miss scan)
  • #480 — correct_skill passthrough (Class 14 remediation; required for routing miss undercount scan)
  • #481 — eval harness improvement ideas
  • #482 — this document's tracking issue
  • #483 — verification step audit (meta-pattern B; manual one-time pass)
  • #497 — check_hints.py linter (automates Classes 1, 2, 3, 6, 8 prevention)
  • #498 — failure-class dashboards (velocity, compound heatmap, remediation coverage, epoch view)
  • #499 — symptom-to-class mapper (automates post-eval failure class diagnosis)
  • #500 — this doc update (retroactive audit categories)
  • #501 — success signal verification before playbook write (prevents wrong-host contamination at write time)
  • #502 — failure class tagging on ft.jsonl sessions
  • #503 — SHA epoch gating per failure class for retrain filtering
  • #525 — ligolo TUN route flush in _setup_pivot_range (Class 7/13 instance)
  • #526 — C5 CI gate: verification step presence check (meta-pattern B enforcement)
  • #527 — C6 CI gate: _targeted_at guard audit for eval_harness.py (meta-pattern D gap)
  • #528 — setup_fn idempotency audit across all pivot/AD/web objectives (Class 7)