ARCHER Failure Mode Inventory
A structured inventory of every failure class ARCHER has exhibited across eval runs and RCA sessions. Purpose: identify root causes shared across multiple objectives so that a single fix closes an entire class rather than re-filing the same bug under a new issue number.
Methodology: every closed bug and regression issue (#62–#480) was reviewed and assigned to one or more failure classes. Open instances are listed per class. Class-level remediations are specified where ≥2 instances share a root cause.
Last updated: 2026-05-30. Living document — add new instances as they surface. Covers issues #62–#480; post-#480 analysis and new classes (15–17) documented in the ARCHER operational inventory.
The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.
Lessons Learned
Six weeks of eval-driven development against live targets produced 130+ bug and regression issues. Reviewing them as a body — rather than individually — reveals patterns that were invisible when each issue was filed in isolation.
The one-bug-one-fix trap
The dominant failure pattern in ARCHER's development history is not any single bug class — it is the tendency to fix symptoms rather than classes. The same root cause recurred under different issue numbers: LHOST missing from a msfconsole chain (#161), an unsubstituted <attacker_ip> placeholder (#411), a lost variable between command dispatches (#475) — all three are the same failure (shell variable loss, Class 1), diagnosed and fixed three separate times. Across 15 classes and 130+ issues, this pattern repeats: a root cause is identified, fixed for the specific objective that surfaced it, and left unaddressed in adjacent objectives where the same code pattern exists.
The lesson: diagnosis is not complete until all instances of the root cause are found. A fix that closes one issue while leaving five sibling objectives with the same failure pattern is an incomplete fix.
Environmental assumptions accumulate silently
A large fraction of failures — Classes 2 (PTY crash), 13 (infrastructure gap), and 15 (wrong host) — are not model or hint failures at all. They are broken assumptions about the environment: a missing binary, a container without /dev/net/tun, a tool that requires a PTY but is launched without one. These failures are particularly costly because they are diagnosed as model failures first, wasting hint-tuning cycles before the real cause is found. The lesson: environment verification must precede model diagnosis. A failing objective should first be tested against a known-good environment before any hint is changed.
Success signals are consistently too weak
Four failure classes — premature OA (Class 4), false positive success_fn (Class 11), missing verification step (Class 6), and wrong host confusion (Class 15) — share a common weakness: the system accepts evidence that does not actually prove the objective was completed. A single string match on a log line, a grep that fires on error output, a success_fn that passes on wrong-host output, a model that reads a startup warning and declares success — all are variants of the same problem. The lesson: success signals must be necessary, not merely correlated. A signal that can appear in both success and failure contexts is not a success signal.
Training data quality is a silent multiplier
Class 14 (training data contamination) is the least visible failure class because contaminated sessions do not cause immediate eval failures — they degrade model behavior gradually, in ways that manifest as routing misses and hint non-compliance weeks or months later. Depth-blocked sessions, wrong-host sessions, and unverified sessions entered the fine-tuning pipeline across multiple issues (#134, #153, #275, #316, #369). The lesson: pipeline gates must be code, not process. A filter that exists only as a documented guideline will eventually be bypassed. Every contamination category must be enforced at the point of data generation.
Automation prevents recurrence; process does not
The four cross-class meta-patterns (startup log liveness, missing verification step, app-specific without generic companion, wrong-host execution) have each recurred three or more times despite being documented in CLAUDE.md, PROCESSES.md, and individual issue comments. Documentation does not prevent recurrence. The failures that stopped recurring are the ones that became CI gates — check_hint_lengths.py eliminated Class 8 truncation issues after a single enforcement point was added. The lesson: the only reliable prevention is automated enforcement. Every lesson in this document that has not yet been encoded as a CI check or lint rule is a lesson waiting to be relearned.
What this means for development priorities
The failure mode inventory points to three structural changes that would have prevented the majority of issues in this document:
- A hint linter in CI (`check_hints.py`): enforces the tmux standard, case-insensitive grep, placeholder substitution, and verification step presence. Addresses Classes 1, 2, 3, and 6 mechanically.
- `_targeted_at` guards on all `success_fn`s: a single audit pass adding wrong-host guards. Addresses Classes 4, 11, and 15 in one action.
- Hard gates in the training pipeline: `verify_fn_skipped`, depth-blocked, and unknown-skill sessions rejected at write time, not filtered downstream. Addresses Class 14 permanently.
None of these require model changes or hint rewrites. They are infrastructure changes that make entire failure classes structurally impossible.
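A write-time gate for the training pipeline could look roughly like the sketch below. Only `verify_fn_skipped` is a name taken from this document; the `depth_blocked` and `skill` fields, the `KNOWN_SKILLS` set, and both function names are hypothetical stand-ins for whatever the real session schema uses.

```python
import json
from typing import Optional

# Illustrative subset only; the real skill registry is assumed to live elsewhere.
KNOWN_SKILLS = {"port_scanning", "web_enumeration", "chisel_pivot"}


def rejection_reason(session: dict) -> Optional[str]:
    """Return why a session must not enter the fine-tuning set, or None if it may."""
    if session.get("verify_fn_skipped", False):
        return "verify_fn_skipped"
    if session.get("depth_blocked", False):  # hypothetical field name
        return "depth_blocked"
    if session.get("skill") not in KNOWN_SKILLS:  # hypothetical field name
        return "unknown_skill"
    return None


def write_session(session: dict, out) -> bool:
    """Append the session as one JSON line only if it passes every gate."""
    if rejection_reason(session) is not None:
        return False  # rejected at write time, so nothing needs filtering downstream
    out.write(json.dumps(session) + "\n")
    return True
```

Because the gate sits in the only function that writes to the dataset, a contaminated session cannot reach the file even if a downstream filter is later removed or bypassed.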
Class 1 — Shell Variable Loss
Description: A shell variable is set in one ARCHER command dispatch and referenced in a subsequent dispatch. Because each command runs in a separate bash invocation, the variable is gone by the time it is needed. Unsubstituted literal placeholders (<attacker_ip>, <user>) are the same failure in hint authoring — the model receives the placeholder unchanged and uses it literally.
Root cause: ARCHER dispatches each generated command as an independent subprocess. Shell state does not persist between dispatches.
| Issue | Objective | Instance |
|---|---|---|
| #89 (closed) | T4/T6 | LHOST detection awk picks wrong IP when route has via gateway — wrong IP class for the context |
| #161 (closed) | T6 UnrealIRCd | LHOST missing from msfconsole chain — reverse shell never receives connection |
| #409 (closed) | T52 chisel | <user> placeholder in scp command reaches model literally — scp fails |
| #411 (closed) | T53 ligolo | <attacker_ip> placeholder unsubstituted + agent run on wrong host |
| #392 (closed) | T56 multi-hop | Wrong credentials (msfadmin) instead of pivot-range creds (pivot:archer123) |
| #475 (open) | T52 chisel | ATTACKER_IP set in cmd 1, empty in cmd 2 — client connects to :8000 on pivot's localhost |
Class-level remediation:
- Inline subshell expansion: never assign a variable and then reference it across dispatches. Inline the subshell: `ssh ... $(ip route get {pivot} | grep -oP 'src \K\S+'):8000`, not `ATTACKER_IP=$(...)` then `ssh ...$ATTACKER_IP...` in a later command.
- Placeholder hygiene: all `<placeholder>` strings must be substituted at hint render time via the template system. Any placeholder that reaches the model is a hint defect. Audit: `grep -r '<[a-z_]*>' skills/` should return zero results.
- Cross-step `$VAR` audit: grep all `skills/PT-*.py` hints for `$VAR` patterns that appear in a later step than the assignment. Any cross-step reference is a latent bug.
Prevention status: C8 (VAR=$() assignment detection) CI gated in check_hints.py. C3 (cross-step $VAR dereference) deferred — requires multi-step semantic parse.
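The placeholder audit and a rough version of the cross-step `$VAR` audit can be approximated with regular expressions. This is a minimal sketch, not the C8 or C3 checks in `check_hints.py`: the function names are hypothetical, hint steps are assumed to be available as a list of strings, and the regex approach will miss cases a real multi-step parse would catch. The placeholder pattern uses `+` rather than the `*` of the grep audit so that a bare `<>` is not reported.

```python
import re

PLACEHOLDER_RE = re.compile(r"<[a-z_]+>")
VAR_ASSIGN_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)=\$\(")


def find_placeholders(hint: str) -> list:
    """Return every unsubstituted <placeholder> left in a rendered hint."""
    return PLACEHOLDER_RE.findall(hint)


def cross_step_refs(steps: list) -> list:
    """Return (var, assign_step, use_step) for each variable assigned via
    VAR=$(...) in one step and dereferenced in a later step."""
    assigned = {}
    hits = []
    for idx, step in enumerate(steps):
        # Check references first so that same-step use is not flagged.
        for var, first in assigned.items():
            if re.search(r"\$\{?" + re.escape(var) + r"\b", step):
                hits.append((var, first, idx))
        for var in VAR_ASSIGN_RE.findall(step):
            assigned.setdefault(var, idx)
    return hits
```

A hint whose first step assigns `ATTACKER_IP=$(...)` and whose second step uses `$ATTACKER_IP` is reported as `("ATTACKER_IP", 0, 1)`, which is the shape of the #475 failure.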
Class 2 — PTY / TUI Crash
Description: A tool that requires an interactive terminal (PTY) is launched via nohup, &, or docker exec without -t. The tool's TUI library detects no PTY and panics, exiting silently after writing one or two startup log lines — making it appear to have started successfully.
Root cause: Modern CLI tools use PTY-detecting interactive frameworks (ligolo-proxy: grumble + survey/v2; msfconsole: readline). Without a PTY the library aborts. The process writes startup output before crashing, so log-based liveness checks pass even though the process is dead.
| Issue | Objective | Instance |
|---|---|---|
| #79 (closed) | T6 msfconsole | msfconsole times out at 300s when run directly inside archer-kali — PTY context absent |
| #82 (closed) | T4/T6 | bind_netcat payload hangs in eval harness — no PTY for interactive netcat session |
| #389 (closed) | T53 ligolo | archer-kali missing /dev/net/tun — ligolo permanently broken at container level |
| #441 (closed) | T53 ligolo | nohup proxy+agent — same crash pattern, earlier attempt at fix |
| #446 (closed) | T53 ligolo | ligolo-ng.yaml in repo root crashes proxy on automated startup |
| #455 (closed) | T53 ligolo | Version incompatibility (agent 0.6.2 vs proxy 0.8.3) — compound infrastructure failure |
| #474 (open) | T53 ligolo | Model uses nohup /usr/bin/ligolo-proxy; proxy writes "Listening" then crashes; grep -q Listening passes; agent gets connection refused |
Class-level remediation:
- tmux wrapper standard: any tool with a TUI must run inside `tmux new-session -d -s <name>`. `nohup` is wrong for TUI tools, always.
- Process liveness via session check: never use `grep -q <startup_string> <logfile>` as the liveness test. Use `tmux has-session -t <name>`, which only passes if the session is alive. Startup log lines are written before crashes.
- Hint audit: grep all hints that launch interactive tools (`ligolo-proxy`, `msfconsole`, `responder`, etc.) and verify they use tmux, not nohup/background.
- Container pre-flight: verify `/dev/net/tun` exists in archer-kali before any ligolo run.
Prevention status: C1 (nohup/& on TUI tools) CI gated in check_hints.py. Gap: process liveness check pattern (grep-q-logfile as liveness) not yet enforced separately from C2.
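The tmux standard and the hint audit can be sketched as three small helpers. This is an illustration under assumptions, not the C1 check itself: the function names are hypothetical, the tool list is the one named in the hint-audit item, and the detection is a substring heuristic.

```python
import re
import shlex

# Tools named in the hint-audit item above; extend as needed.
TUI_TOOLS = ("ligolo-proxy", "msfconsole", "responder")


def launches_tui_without_tmux(command: str) -> bool:
    """True if a TUI tool is started via nohup or a trailing & instead of tmux."""
    if not any(tool in command for tool in TUI_TOOLS):
        return False
    if "tmux new-session" in command:
        return False
    stripped = command.rstrip()
    backgrounded = stripped.endswith("&") and not stripped.endswith("&&")
    return bool(re.search(r"\bnohup\b", command)) or backgrounded


def tmux_launch(session: str, tool_command: str) -> str:
    """Build the detached-session launch line for a TUI tool."""
    return f"tmux new-session -d -s {shlex.quote(session)} {shlex.quote(tool_command)}"


def tmux_liveness(session: str) -> str:
    """Build the liveness check; it exits non-zero once the session is gone."""
    return f"tmux has-session -t {shlex.quote(session)}"
```

When the tool is the only command in the session, the session ends as soon as the tool exits, so `tmux has-session` fails after a crash even if a "Listening" line was already written to a log.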
Class 3 — Case Mismatch / Pattern Miss
Description: A grep or regex pattern in a hint or success_fn fails because the actual log output uses different capitalization, punctuation, or wording than the pattern expects. Also includes success signals that are present in the tool output but absent from the model's visible context.
| Issue | Objective | Instance |
|---|---|---|
| #166 (closed) | T9 nikto | OSVDB patterns not in ARCHER stdout — success_fn never fires despite correct output |
| #191 (closed) | T23 hash-crack | success_fn regex hash vs hashes — john plural output never matches |
| #380 (closed) | web_lfi | <script> match in _SIGNAL_RE contaminates Tier 2 evidence with HTML noise |
| #383 (closed) | linux_privesc | _SIGNAL_RE missing evidence patterns for sudo/SUID output — correct output invisible |
| #393 (closed) | T3a/T45 | halt_bypass_signals missing nmap NSE vuln output patterns |
| #426 (closed) | T53 ligolo | http/1. from wget error satisfies has_traversal — wrong signal matched |
| #474 (open) | T53 ligolo | grep -q 'agent joined' (lowercase); actual log: msg="Agent joined." (capital A, period) |
Class-level remediation:
- Always use `-i` for log greps: `grep -qi` costs nothing and eliminates case mismatch. No case-sensitive grep in hints without an explicit rationale.
- Verify patterns against live output: before adding any grep pattern, run the actual tool in the lab and copy the exact string. Never infer log format from documentation.
- Pattern specificity: `_SIGNAL_RE` entries must match tool-specific output, not generic HTML/HTTP strings that appear in error pages. Add negative examples to the test suite.
Prevention status: C2 (grep -q without -i on log files) CI gated in check_hints.py.
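The #474 mismatch is easy to reproduce, and a naive lint for case-sensitive greps fits in a few lines. This is a sketch under assumptions rather than the C2 check in `check_hints.py`: both function names are hypothetical, and the flag parser only inspects the short options that directly follow `grep`, so an option that takes an argument before `-i` would cause a false report.

```python
import re
import shlex


def matches_log(pattern: str, line: str) -> bool:
    """Case-insensitive match, the Python analogue of grep -qi."""
    return re.search(pattern, line, re.IGNORECASE) is not None


def grep_is_case_sensitive(command: str) -> bool:
    """True if the command contains a grep whose leading flags lack -i."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        if tok != "grep":
            continue
        flags = ""
        for arg in tokens[i + 1:]:
            if arg == "--ignore-case":
                flags += "i"
            elif arg.startswith("-") and not arg.startswith("--"):
                flags += arg[1:]
            else:
                break  # first non-flag token is the pattern
        if "i" not in flags:
            return True
    return False
```

With the log line from #474, `msg="Agent joined."`, a case-sensitive search for `agent joined` finds nothing, while `matches_log` succeeds.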
Class 4 — Premature Objective Achieved / False Positive
Description: The model emits [OBJECTIVE_ACHIEVED] based on partial, fabricated, or ambiguous evidence. Variants: reading only the first positive line of output without checking for subsequent error lines; success_fn matching on wrong-host output; echo/printf fabrication of expected strings.
| Issue | Objective | Instance |
|---|---|---|
| #78 (closed) | multiple | HALT_DISCIPLINE false positive — failed exploit marked answerable, quality=1.0 saved to playbook |
| #81 (closed) | T6 | Model fabricates uid=0 via echo — success_fn passes on fabricated output |
| #83 (closed) | T6 | success_fn must guard against echo/printf fabrication of uid=0 |
| #107 (closed) | multiple | Model emits OA on exploit failure + loops on bare bash |
| #115 (closed) | T7 | _t7_host_discovery too broad — any 192.168.56.x IP in output passes |
| #116 (closed) | T12 | _t12_vuln_assess matches critical\|high\|medium in natural language |
| #123 (closed) | T12 | vulnerability_assessment runs lynis on local host — success_fn passes on wrong-host output |
| #124 (closed) | T10 | post_exploitation enumerates Kali host — wrong-host false positive |
| #132 (closed) | T12 | T12 passes on wrong-host lynis |
| #133 (closed) | T10 | T10 passes on wrong-host enumeration |
| #141 (closed) | T14/T16 | Passes when BWA target unreachable — no connectivity guard |
| #142 (closed) | T23 | T23 passes when john runs but finds no hashes |
| #143 (closed) | T11 | port_scanning speed anomaly passes too quickly |
| #144 (closed) | T15 | No _targeted_at guard — localhost false positive risk |
| #163 (closed) | T27 | linux_privesc premature OA — model stops after 1 command |
| #168 (closed) | T12 | searchsploit passes without scanning remote target |
| #236 (closed) | T28 | _t28_suid_privesc fires on command text, not evidence |
| #253/#254 (closed) | multiple | 9 boundary violations — OA exits bypassing verify_fn; echo fabrication suspected |
| #367 (closed) | T12 | lynis+searchsploit passes without scanning remote target |
| #368 (closed) | T10 | uname on Kali + failed SSH attempt passes post_exploit |
| #387 (closed) | T54 socat | Fires OA without passing verify_traversal |
| #423 (closed) | T56 | [THOUGHT] text contains target IP — satisfies _real_pivot_traversal |
| #474 (open) | T53 R3 | Model reads agent startup warning, declares OA; next lines show fatal connection error |
Class-level remediation:
- `_targeted_at` guard on all success_fns: every success function must verify output came from the correct target IP, not localhost or the Kali host. This guard alone would have prevented ~8 instances in this list.
- Exclude [THOUGHT] text from pattern matching: strip `[THOUGHT]...[/THOUGHT]` blocks before applying success_fn patterns. Model reasoning should never satisfy an evidence check.
- Fabrication guard: success_fn patterns for privilege escalation outputs (`uid=0`, `root`) must appear in command stdout, not in text that the model could have generated itself.
- Multi-signal requirement: OA for complex objectives (exploitation, pivoting) should require N distinct evidence signals, not a single string match.
- Fatal-line scan: add hint instruction to read full command output before declaring success: startup warnings followed by fatal errors are the canonical false-positive pattern.
Prevention status: C10 (_targeted_at guard in skills/) CI gated in check_hints.py. [THOUGHT] stripping implemented in eval_harness.py. Gap: eval_harness.py _t*_ success functions not yet checked — tracking issue #527.
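The first three remediations compose naturally into one success check. The sketch below is an assumption-heavy illustration, not the `_targeted_at` guard or any existing `success_fn`: the function names are hypothetical, sessions are assumed to be available as `(command, stdout)` pairs, and the target check is a plain substring test.

```python
import re

THOUGHT_RE = re.compile(r"\[THOUGHT\].*?\[/THOUGHT\]", re.DOTALL)
UID_ROOT_RE = re.compile(r"\buid=0\b")


def strip_thought(text: str) -> str:
    """Remove model reasoning blocks before any evidence pattern is applied."""
    return THOUGHT_RE.sub("", text)


def uid_root_evidence(steps: list, target_ip: str) -> bool:
    """steps is a list of (command, stdout) pairs.

    Pass only if uid=0 appears in the stdout of a command that names the
    target and does not itself contain the evidence string.
    """
    for command, stdout in steps:
        if UID_ROOT_RE.search(command):
            continue  # evidence present in the command text: possible echo/printf fabrication
        if target_ip not in command:
            continue  # wrong-host guard: output from localhost or Kali does not count
        if UID_ROOT_RE.search(stdout):
            return True
    return False
```

Rejecting any step whose command text already contains `uid=0` is stricter than matching `echo` or `printf` by name, and it also covers the #236 pattern where a check fires on command text instead of evidence.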
Class 5 — Wrong Module / Tool Selection
Description: The model selects the wrong Metasploit module, CLI tool, or approach — hallucinating a module name, choosing a module for a different vulnerability, selecting the wrong tool for the target environment, or misconfiguring a tool that would otherwise work.
| Issue | Objective | Instance |
|---|---|---|
| #62–#70 (closed) | sweep | Early sweep findings: nmap wrong_skill, wapiti no_output, msfconsole no_output, hydra wrong_skill, searchsploit early_complete |
| #80 (closed) | T5 hydra | hydra uses rockyou as both user and password list — will never complete |
| #90 (closed) | T6 UnrealIRCd | Wrong exploit module selected (samba/usermap_script instead of unix/irc/unreal_ircd) |
| #104 (closed) | T5 | SSH brute-force 0/3 — hydra hints insufficient |
| #105 (closed) | T6 | msfconsole module path or port mismatch |
| #165 (closed) | T21 Tomcat | msfconsole module error on Metasploit 6.4.126 |
| #173 (closed) | T5 ncrack | 917-line wordlist always times out before reaching msfadmin |
| #238 (closed) | T35 | UDP scan always times out — needs port scope or --top-ports |
| #372 (closed) | entity_id | nmap missing -O/-sV flags — incomplete output |
| #376 (closed) | vuln_assess | Model stops after nmap -sV, never runs vuln scripts |
| #412 (closed) | T45 | nmap --script vuln too slow for OWASP-BWA — nuclei should be primary |
| #419 (closed) | T28 SUID | Model runs nmap interactive but doesn't pipe !sh |
| #429 (closed) | T3a | nmap --script vuln times out on MS2 — nuclei should be primary |
| #472 (open) | PT-EXPLOIT-01 | Wrong Metasploit module, missing PAYLOAD, module hallucination |
Class-level remediation:
- Exploitation short-circuits: for every named CVE or well-known vulnerability (vsftpd, UnrealIRCd, MS08-067, etc.) add an explicit short-circuit block with exact module path, PAYLOAD setting, and RHOST/LHOST sequence. The model cannot reliably select the correct module from general knowledge.
- Tool priority guidance: hints must specify which tool is primary and which is fallback. "Use nuclei; if unavailable, fall back to `nmap --script vuln`" is unambiguous. Generic "run a vulnerability scanner" produces tool mismatches.
- Module reference tables: add a reference block in exploitation hints listing correct modules for common MS2/DVWA targets by port and service. The model matches task context to the table rather than hallucinating.
- Wordlist scope: brute-force hints must specify targeted wordlists, not generic full-length lists. A 917-line wordlist with rockyou order guarantees timeout on any short-session objective.
Prevention status: Documentation only. No CI gate — requires semantic understanding of hint content. Manual review at hint authoring time.
Class 6 — Missing Short-Circuit / Hint Gap
Description: A hint block exists for one phase of an operation but not for the critical subsequent phase. The model is left without guidance at the decisive step and falls back to generic behavior. Also includes hints that specify the wrong host, wrong path, or wrong authentication method for the target environment.
| Issue | Objective | Instance |
|---|---|---|
| #80/#91 (closed) | T5 hydra | SSH legacy key negotiation — no hint for -oHostKeyAlgorithms flag; SSH rejects all auth attempts |
| #145 (closed) | T17 | MySQL enumeration failing — hints insufficient for mysql client auth flow |
| #157 (closed) | T15 | Broken bash hint + missing syntax-error loop guard |
| #162 (closed) | T23 | Hash-crack objective never dumps shadow — model skips extraction step |
| #167 (closed) | T16 | DVWA LFI curl returns empty — missing -L flag; session not persisting |
| #175 (closed) | T16 | DVWA LFI security level POST blocked by CSRF; traversal runs at wrong security level |
| #182 (closed) | T21b | Model fails reverse shell sequence — listener-before-trigger sequencing missing |
| #189 (closed) | T23 | Hint not driving hash crack step — 0/3 with 2 cmds (dump only, no john/hashcat) |
| #193 (closed) | T23 | Hint dump step unreliable — model drops sudo -S and tee; john gets empty file |
| #235 (closed) | linux_privesc | SSH enumeration blocked by sudo -l password prompt on MS2 — hint missing sshpass |
| #239 (closed) | T46/T47 | post_exploitation SSH hints missing HostKeyAlgorithms flags for MS2 |
| #242 (closed) | T46/T47 | exfiltration and persistence hints missing SSH-first / sshpass for remote targets |
| #245 (closed) | T49 DVWA XSS | web_xss hints missing DVWA stored XSS workflow — model skips login, hits wrong endpoint |
| #311 (closed) | T16/T24/T26 | Model completes task but never emits OA — 100% HALT_DISCIPLINE; completion signal too weak |
| #328 (closed) | linux_privesc | SSH loop and sudo-hang on MS2 |
| #329 (closed) | T24 web_xss | 100% HALT_DISCIPLINE — completion signal too weak |
| #331 (closed) | web_enum | robots.txt 404 needs fallback to directory brute-force |
| #378 (closed) | T50–T56 | All pivot hints missing verification step |
| #390 (closed) | T54 socat | socat_relay runs on attacker host, not pivot — wrong machine targeted |
| #391 (closed) | T55 | ProxyJump hint missing trailing command — connects then drops |
| #397 (closed) | T50 | Default port 80 closed on pivot target; hint missing traversal step |
| #399 (closed) | T51 | Wrong proxychains config file path in hint |
| #406 (closed) | T48 CSRF | No hint for auth-first flow — /vulnerabilities/csrf/ returns 404 unauthenticated |
| #432 (open) | multiple | Hardcoded vulnbox IPs/paths — generic companion missing |
| #472 (open) | PT-EXPLOIT-01 | vsftpd exploitation short-circuit missing (detection short-circuit exists) |
Class-level remediation:
- Two-layer rule: every app-specific detection hint must have a paired exploitation hint. For every `_hints_*` block that fires on confirmation/detection keywords, verify a paired exploitation block exists. See `docs/research/range-lock-in.md`.
- Verification step mandatory: every hint block that sets up infrastructure must include a verification step that produces the evidence the `success_fn` checks.
- SSH compatibility flags: any hint connecting to MS2/Metasploitable targets via SSH must include `-oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedAlgorithms=+ssh-rsa` or equivalent. This has recurred in T5, T17, T46, T47; it is a lab-wide compatibility requirement.
- Host targeting explicit: every hint command must explicitly state which host it runs on (attacker, pivot, target). Ambiguous host context produces wrong-machine execution.
- Completion signal audit: any objective with HD=100% and OA=0% across multiple runs has a missing or weak completion signal, not a model failure. Check whether OA can be emitted given the hint structure before adjusting halt thresholds.
Prevention status: C4 (app-specific without generic companion) CI gated in check_hints.py. Verification step presence (item 2) not yet gated — C5 tracking issue #526. Manual one-time verification step audit: #483.
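The SSH compatibility requirement is one of the few items in this class that can be checked mechanically. A minimal sketch, assuming hint commands can be tokenized with `shlex`: `missing_ssh_opts` is a hypothetical helper, it only recognizes the exact single-token spelling quoted above, and an "equivalent" form such as `-o HostKeyAlgorithms=+ssh-rsa` would be reported as missing.

```python
import shlex

# Exact spellings from the remediation item; equivalents are not recognized.
REQUIRED_SSH_OPTS = (
    "-oHostKeyAlgorithms=+ssh-rsa",
    "-oPubkeyAcceptedAlgorithms=+ssh-rsa",
)


def missing_ssh_opts(command: str) -> list:
    """Return the legacy-compatibility options absent from an ssh/scp command."""
    tokens = shlex.split(command)
    if not any(tok in ("ssh", "scp") for tok in tokens):
        return []  # not an SSH command, nothing to enforce
    return [opt for opt in REQUIRED_SSH_OPTS if opt not in tokens]
```

Whether this belongs in CI depends on how many hints target non-MS2 hosts, where the flags are unnecessary.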
Class 7 — VRAM / Resource Bleed
Description: A long-running or resource-intensive objective saturates GPU VRAM or system resources. Subsequent objectives start with depleted resources, producing cmds=0 or truncated runs. Also includes port/process pollution between runs where cleanup is absent.
| Issue | Objective | Instance |
|---|---|---|
| #77 (closed) | playbook | Fast replay used DEFAULT_COMMAND_TIMEOUT (120s) instead of skill timeout — wrong resource budget |
| #180 (closed) | multiple | Overnight VRAM-saturated depth-blocked zero-command sessions — excluded from pipeline |
| #248 (closed) | collection | run_data_collection.sh eval lock blocks its own Phase 1 child |
| #407 (closed) | T50 | Stale tunnel ports (8080–8085) not cleaned between runs |
| #418 (closed) | T51 | SOCKS port 1080 exhaustion — ssh -D not killed by prior cleanup |
| #436 (closed) | multiple | Stale ~/.archer_eval.lock on process exit without cleanup |
| #444 (closed) | T52/T53/T56 | _setup_pivot_range missing chisel/ligolo cleanup — runs 2+ fail |
| #451 (open) | multiple | VRAM bleed between objectives — ollama reload fires between runs not between objectives |
Class-level remediation:
- Between-objective VRAM flush: `ollama stop` + prewarm after each objective's runs complete. Prevents saturation from long objectives bleeding into short ones.
- Idempotent `setup_fn`: every `setup_fn` must kill processes and release ports from prior runs before starting. A `setup_fn` that assumes a clean state will fail on run 2+.
- Lock file cleanup: eval lock must be removed in a `finally` block, not just on clean exit. Any eval process that exits abnormally should release the lock.
- Resource monitoring: Cockpit overview during long eval runs provides early warning: CPU/RAM flatline mid-run indicates bleed-out, not completion.
Prevention status: Documentation only for VRAM flush (item 1). setup_fn idempotency (item 2) partially enforced — pivot/AD preflights raise PreflightFailure (#494/#493), but cleanup completeness not audited. Tracking issue: #528 (setup_fn idempotency audit). Ligolo TUN route flush gap tracked in #525.
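The lock-file item maps directly onto a context manager with a `finally` block. This is a sketch, not the harness's actual locking code: `eval_lock` is a hypothetical name and the caller supplies the lock path. A `finally` block runs on exceptions and normal exits but not on `SIGKILL`, so stale-lock detection would still be needed for that case.

```python
import os
from contextlib import contextmanager


@contextmanager
def eval_lock(path: str):
    """Hold an exclusive lock file for the duration of the block and always remove it."""
    # O_EXCL makes creation fail with FileExistsError if another run holds the lock.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for later inspection
        yield path
    finally:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # already cleaned up by someone else
```

Wrapping the whole eval run in `with eval_lock(...)` means an unhandled exception mid-run releases the lock instead of leaving it for the next run to trip over.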
Class 8 — Character Limit / Command Truncation
Description: A hint block, command string, or evidence extraction exceeds a character limit, causing truncation, rejection, or silent data loss.
| Issue | Objective | Instance |
|---|---|---|
| #334 (closed) | web_xss | _extract_evidence CMD truncation at 120 chars — XSS POST payloads invisible to Tier 2 scorer |
| #448 (open) | T52 chisel | Hint block exceeds 500-char harness limit — command rejected |
| #452 (closed) | T52 chisel | &; syntax — degenerate loop on every run |
Class-level remediation:
- `check_hint_lengths.py` in CI: gate already enforces 500-char limit. Run locally before any hint commit.
- Split at logical boundaries: if a command exceeds the limit, split into two sequential hint steps at a logical boundary. Never concatenate with `&&` or `&;` to stay under limit.
- Evidence extraction window: `_extract_evidence` CMD truncation at 120 chars is too short for POST payloads. Raise or eliminate the truncation for command text used in Tier 2 scoring.
Prevention status: C7 (500-char limit) CI gated in check_hints.py (absorbed from check_hint_lengths.py).
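The length gate and the split rule can be sketched together. This is an illustration, not the C7 check: both function names are hypothetical, and hints are assumed to be held as a name-to-text mapping and as lists of already-separate lines.

```python
HINT_CHAR_LIMIT = 500  # harness limit cited in #448


def oversized_hints(hints: dict, limit: int = HINT_CHAR_LIMIT) -> dict:
    """Map each over-limit hint name to its length."""
    return {name: len(text) for name, text in hints.items() if len(text) > limit}


def pack_steps(lines: list, limit: int = HINT_CHAR_LIMIT) -> list:
    """Greedily group already-separate hint lines into steps that each fit the limit."""
    steps, current = [], ""
    for line in lines:
        if len(line) > limit:
            raise ValueError("single line exceeds limit; shorten it by hand")
        candidate = line if not current else current + "\n" + line
        if len(candidate) <= limit:
            current = candidate
        else:
            steps.append(current)  # close the current step at a line boundary
            current = line
    if current:
        steps.append(current)
    return steps
```

Packing only ever breaks between lines, so a step boundary never lands inside a command, and no `&&` chaining is introduced to save characters.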
Class 9 — Routing Miss
Description: A task is routed to the wrong skill pack, causing the model to receive hints and context for the wrong domain. Includes keyword scorer over-capture (one skill's keywords absorb tasks that belong to a sibling skill) and classifier confidence threshold mismatches.
| Issue | Objective | Instance |
|---|---|---|
| #62/#63/#66/#67 (closed) | sweep | Early sweep findings: nmap, arp-scan, hydra, whatweb all routing to wrong skills |
| #103 (closed) | T8 | web_enumeration routing miss — skill= empty for directory enum task |
| #147 (closed) | T25 | Dead-tie — web_exploitation and web_cmd_injection score equally |
| #257 (closed) | classifier | Confidence threshold 0.7 never reached by TF-IDF+LR — lowered to 0.5 |
| #276 (closed) | collection | Sparse gate ignores classifier confidence — routing misses never self-correct |
| #294 (closed) | T33/T38 | T49 hint fix broke vulnerability_assessment and entity_identification routing |
| #343 (closed) | T42 | Routes to web_vulnerability_scanning instead of web_enumeration |
| #344 (closed) | routing | web_authentication bonus_fn fires on bare login — misroutes auth-bypass tasks |
| #345 (closed) | routing | system_info version keyword misroutes vuln-assessment tasks |
| #346 (closed) | routing | entity_identification captures check what's running — should be service_enumeration |
| #347 (closed) | routing | port_scanning captures what's listening — should be service_enumeration |
| #348 (closed) | routing | reconnaissance captures bare scan the target — should be port_scanning |
| #385 (open) | T52 | Routes to ssh_tunneling instead of chisel_pivot |
Class-level remediation:
- `exclude_keywords` discipline: every skill must have explicit exclude patterns for tasks that superficially match but belong to a sibling. Routing misses are often symmetric: if T52 routes to ssh_tunneling, ssh_tunneling is over-capturing.
- Ambiguous task eval: run `eval_harness --ambiguous` after any hint or keyword change to verify no new misroutes are introduced.
- Bonus_fn keyword audit: bonus functions that fire on single common words (`login`, `version`, `check`) are over-broad. Require at least two co-occurring terms or domain-anchored context.
- SHA tagging (#478): once implemented, use the SHA to filter pre-fix routing labels from classifier retraining, preventing old misroutes from training the next classifier.
Prevention status: Documentation only. No CI gate — routing correctness requires live eval. --ambiguous flag exists for manual verification.
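The co-occurrence requirement for bonus functions can be expressed as a small factory. The real `bonus_fn` signature is not shown in this document, so everything here is an assumption: `make_bonus_fn`, the term sets, and the float return value are illustrative only.

```python
import re


def make_bonus_fn(anchor_terms, context_terms, bonus=1.0):
    """Build a bonus function that fires only when an anchor term co-occurs
    with at least one domain context term."""
    anchors = set(anchor_terms)
    context = set(context_terms)

    def bonus_fn(task: str) -> float:
        words = set(re.findall(r"[a-z0-9_']+", task.lower()))
        if words & anchors and words & context:
            return bonus
        return 0.0

    return bonus_fn
```

For example, `make_bonus_fn({"login"}, {"bypass", "form", "brute"})` awards the bonus to "bypass the login form on the target" but not to "login to the box and list files", which is the misroute shape described in #344.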
Class 10 — Range Lock-In
Description: A hint block fires only on a specific app name, IP address, or target path, training the model to solve one specific target rather than the vulnerability class. The model fails on any variant or alternative target.
| Issue | Objective | Instance |
|---|---|---|
| #153 (closed) | Juice Shop | Contaminated ft.jsonl sessions (localhost:3000 wrong target) — trained on wrong target URL |
| #468 (closed) | T2a/T4/T6 | Task strings named specific apps rather than vuln class |
| #469 (closed) | T14/T24/T25/T26/T48 | Task strings with app paths and named apps |
| #432 (open) | multiple | Hardcoded vulnbox IPs and paths in hint triggers — generic companion missing |
Class-level remediation: See docs/research/range-lock-in.md for full analysis. Summary: every app-specific hint block must have a generic companion block using placeholders (<login-endpoint>, <module>, <target>). App-specific block drives eval pass rate; generic companion teaches the transferable pattern. Task strings must describe the vulnerability class, not the specific app or path.
Prevention status: C4 CI gated in check_hints.py — blocks new app-specific hints without generic companions. Cleanup of existing violations: #432.
Class 11 — False Positive success_fn / halt_fn
Description: A success or halt function returns true on a signal that appears in non-success contexts — curl error output, [THOUGHT] text, wrong-host output, truncated stdout, or intermediate/partial matches.
| Issue | Objective | Instance |
|---|---|---|
| #115 (closed) | T7 | Any 192.168.56.x IP in output passes _t7_host_discovery |
| #116 (closed) | T12 | critical\|high\|medium in natural language passes _t12_vuln_assess |
| #164 (closed) | T4 | _real_uid_root false-negative — skips printf/msfconsole output blocks |
| #314 (closed) | network_exploit | L1 check rejects valid non-root code execution and backdoor confirmation |
| #410 (closed) | T54 | head -10 truncates SSH banner at line 11 — success_fn never sees OpenSSH |
| #423 (closed) | T56 | _real_pivot_traversal matches target IP in [THOUGHT] text |
| #426 (closed) | T53 | has_traversal matches http/1. from wget error output |
| #443 (closed) | T56 | _halt_tunneling premature fire — traversal threshold too low |
| #401 (open) | T56/pivot | HTTP/\d regex matches curl verbose * using HTTP/1.x — not actual traversal |
Class-level remediation:
- Anchor patterns to tool-specific output: patterns must match strings that only appear in genuine success output, not in error messages, verbose output, or [THOUGHT] text.
- Exclude [THOUGHT] from pattern scope: strip [THOUGHT] blocks before applying success_fn patterns.
- Stdout-only matching: apply success patterns to command stdout, not full session text.
- Threshold calibration: traversal-type functions should require the target IP to appear in network-layer output. HTTP header strings and verbose curl flags are not traversal evidence.
- Avoid `head -N` truncation: use `grep` to extract the relevant line rather than truncating at a fixed line count.
Prevention status: C10 (skills/ success_fn _targeted_at guard) CI gated. [THOUGHT] stripping implemented in eval_harness.py. Gap: eval_harness.py _t*_ success functions not yet gated — #527.
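Two of these remediations are easy to show side by side: anchoring an HTTP pattern to a real status line, and selecting a banner line by content, not by position. This is a sketch with hypothetical names, not a replacement for `has_traversal` or any existing function, and the status-line shape (`HTTP/<version> <3-digit code>` at the start of a line) is an assumption about what genuine success output looks like in these objectives.

```python
import re
from typing import Optional

# Anchored to line start, so "* using HTTP/1.x" and inline error text do not match.
STATUS_LINE_RE = re.compile(r"^HTTP/\d(?:\.\d)? \d{3}\b", re.MULTILINE)


def has_http_response(stdout: str) -> bool:
    """True only if stdout contains a line that begins with an HTTP status line."""
    return STATUS_LINE_RE.search(stdout) is not None


def find_banner(stdout: str, pattern: str = r"OpenSSH") -> Optional[str]:
    """Return the first line matching the pattern, wherever it appears."""
    for line in stdout.splitlines():
        if re.search(pattern, line):
            return line
    return None
```

`find_banner` is the Python analogue of replacing `head -10` with `grep`: a banner on line 11 is still found, which is the #410 case.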
Class 12 — Model Loop
Description: The model repeats the same command or sequence without making progress, eventually hitting max_commands. Variants: retrying a failed setup step indefinitely; completing the task but not emitting OA; looping on bare infrastructure commands with no payload.
| Issue | Objective | Instance |
|---|---|---|
| #107 (closed) | multiple | Model loops on bare bash after exploit failure |
| #125 (closed) | T14 | Degenerate proxy loop when DVWA unreachable — no connectivity pre-check |
| #126 (closed) | agent | done:true + empty command causes infinite agent loop |
| #163 (closed) | T27 | Stops after 1 command — model treats minimal output as completion |
| #230 (closed) | watchdog | cmd_count comparison never resets between sessions — premature ambiguous block kills |
| #251 (closed) | T11/T23/T24/T26 | 102 runs at 97–100% halt depth, 0% OA |
| #311 (closed) | T16/T24/T26 | 100% HALT_DISCIPLINE — model completes task but OA never emitted |
| #382 (open) | linux_privesc | Model loops on bare SSH — drops command payload |
| #386 (closed) | T51 SOCKS | Hits max_commands every run — SOCKS setup loop |
Class-level remediation:
- Progress anchors in hints — hints must include checkpoints giving the model a clear signal it has advanced ("if port X is now listening, proceed to step 3"). Without progress anchors, the model retries setup steps indefinitely.
- Connectivity pre-check — any hint for a web target must verify the target is reachable before entering the main workflow. A target that is down causes looping on connection attempts.
- HD=100%/OA=0% diagnosis — this pattern means the model is completing the work but not emitting OA. Root cause is always hint structure (missing OA signal) or success_fn (not firing on correct output), not model capability. Adjust halt thresholds only after ruling out both.
- `done:true` guard — the agent loop must reject empty commands with `done:true` and not re-queue.
Prevention status: Documentation only. No CI gate — loop detection requires live eval observation.
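The `done:true` guard, together with a simple repeat detector for the bare-command loops listed in the table, can be sketched as follows. This is an illustrative runtime guard, not ARCHER's agent loop: the class name `LoopGuard`, its return values, and the `max_repeats` threshold are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class LoopGuard:
    """Hypothetical guard consulted once per model response."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        # Only the last max_repeats commands are remembered.
        self._recent: Deque[str] = deque(maxlen=max_repeats)

    def next_action(self, command: Optional[str], done: bool) -> str:
        """Return 'stop', 'reject', 'halt', or 'run'."""
        cmd = (command or "").strip()
        if not cmd:
            # An empty command is never re-queued. With done:true the
            # session ends; without it the response is rejected.
            return "stop" if done else "reject"
        self._recent.append(cmd)
        if len(self._recent) == self.max_repeats and len(set(self._recent)) == 1:
            return "halt"  # identical command issued max_repeats times in a row
        return "run"
```

Usage: the loop calls `guard.next_action(cmd, done)` before dispatching, and dispatches only on `"run"`. An exact-match repeat check is deliberately conservative; it will not catch loops that vary a flag on each retry.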
Class 13 — Infrastructure Gap¶
Description: A required binary, container configuration, network configuration, or environment component is missing or misconfigured. Failures in this class are not hint or model failures — the environment itself is broken.
| Issue | Objective | Instance |
|---|---|---|
| #79 (closed) | T6 | msfconsole times out in archer-kali container — container resource ceiling |
| #176 (closed) | T17–T19/T22 | Stale playbook entries produce 0-cmd fast fails |
| #177 (closed) | T2b | T2b dependency on T2a not enforced — runs on uninitialized state |
| #181 (closed) | T2a | Port 6200 not releasing between runs — exit -y cleanup insufficient |
| #186 (closed) | T21 | Unhandled TimeoutExpired in Tomcat restart kills harness |
| #194 (closed) | T23 | john.pot contamination — _setup_t23 cleans wrong host |
| #258 (closed) | T23 | rockyou.txt corrupted in archer-kali + MS2 passwords missing from wordlist |
| #277 (closed) | multiple | Stale tool processes persist in archer-kali after eval harness exit |
| #302 (closed) | T16 | _setup_t16 must reset DVWA admin password before each run |
| #389 (closed) | T53 | archer-kali missing /dev/net/tun |
| #400 (closed) | T56 | 172.30.2.20 not deployed — pivot range only has 2 nodes |
| #425 (closed) | T53 | ligolo-agent binary missing from pivot range image |
| #437 (closed) | CI | verify-fix.yml no lab pre-flight — posts FAIL when targets unreachable |
| #442 (closed) | T53/T56 | tini + openssh-client missing from Dockerfile |
| #450 (closed) | T51 | SSH -D race condition — ss check runs before port is bound |
Class-level remediation:
- Prerun sanity check — `archer-prerun` verifies VM reachability, required binaries, container state before any eval run. Do not start a run against a target that hasn't been verified reachable.
- Idempotent `setup_fn` — every setup function must be idempotent: clean prior state, verify environment, then initialize. Non-idempotent setup fails on run 2+.
- Dockerfile audit on range changes — any change to `docker/pivot-range/Dockerfile` requires a full T50–T56 run to confirm infrastructure integrity.
- Race condition guards — any hint that starts a background process and immediately checks for it must include a sleep + retry loop, not a single-shot check.
- Dependency enforcement — objectives with sequential dependencies (T2a → T2b) must enforce ordering in `setup_fn`, not rely on run order.
Prevention status: PreflightFailure exception raises SKIP (not FAIL) on VM/AD unreachability — CI gated (#494/#493). setup_fn cleanup completeness not yet audited — #528. Ligolo TUN route flush: #525.
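The race condition guard (the #450 pattern, where the check runs before the port is bound) amounts to polling with a deadline instead of checking once. A minimal sketch, assuming the background process exposes a TCP port; the helper name `wait_for_port` and its default timings are hypothetical, and it probes by connecting rather than by parsing `ss` output.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.5) -> bool:
    """Poll until a TCP port accepts connections or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the listener is bound and accepting.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)  # sleep, then retry
```

The same shape applies inside a hint as a shell loop; the point is that a single-shot check immediately after launching the process is replaced by a bounded retry.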
Class 14 — Training Data Contamination¶
Description: Invalid, low-quality, or incorrectly-labeled sessions enter the fine-tuning or classifier training pipeline, degrading model behavior in ways that are difficult to diagnose.
| Issue | Pipeline stage | Instance |
|---|---|---|
| #134 (closed) | fine-tune | depth_blocked sessions leak into training data via _build_full_conversation |
| #136 (closed) | fine-tune | stale data/finetune/ with obsolete Alpaca schema — regenerate required |
| #140 (closed) | fine-tune | HALT_DISCIPLINE inclusion policy split — build_training_data.py and prepare_finetune.py disagree on what to include |
| #150 (closed) | classifier | router_labels.csv no deduplication — classifier biased toward frequently-run objectives |
| #153 (closed) | fine-tune | Contaminated Juice Shop ft.jsonl sessions (localhost:3000 wrong target) |
| #155 (closed) | fine-tune | Raw model responses with backslash errors contaminate fine-tuning data |
| #275 (closed) | fine-tune | Web sessions with connectivity failures not filtered from pipeline |
| #316 (closed) | fine-tune | verify_fn_skipped not a hard gate on ft.jsonl writes — unverified sessions enter pipeline |
| #330 (closed) | fine-tune | skill="unknown" sessions not filtered by prepare_finetune.py |
| #369 (closed) | fine-tune | depth_blocked sessions produce zero-command training examples |
Class-level remediation:
- Pipeline gate checklist — before any training run, verify: (a) depth_blocked sessions excluded, (b) verify_fn_skipped sessions excluded, (c) skill="unknown" sessions excluded, (d) connectivity-failure sessions excluded, (e) no duplicate routing labels.
- Hard gates in code — contamination filters must be code gates, not process guidelines. `verify_fn_skipped=True` must prevent ft.jsonl write at the point of generation, not in a downstream filter.
- Quality filter for classifier — `eval_label` + `label_confidence==high` only (see PROCESSES.md). The 80% `unknown` event entries are not usable for classifier retraining.
- Schema versioning — ft.jsonl schema version must be checked at pipeline entry. Stale entries from prior schema versions must be excluded or migrated.
Prevention status: Some gates in code (depth_blocked, verify_fn_skipped); others documented in PROCESSES.md only. #501 (success signal verification before playbook write) addresses the wrong-host contamination path. Full gate checklist enforcement: documentation only.
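Gates (a) through (d) of the checklist can be expressed as a single function called at the point where a session would be written. A minimal sketch, assuming each session is a dict: the keys `depth_blocked`, `verify_fn_skipped`, and `skill` follow the names used in this section, while `connectivity_failure` is a hypothetical field name. Gate (e), label deduplication, is not covered here.

```python
from typing import Dict, List, Optional


def rejection_reason(session: Dict) -> Optional[str]:
    """Return the first gate a session fails, or None if it may be written."""
    if session.get("depth_blocked"):
        return "depth_blocked"          # gate (a)
    if session.get("verify_fn_skipped"):
        return "verify_fn_skipped"      # gate (b)
    if session.get("skill", "unknown") == "unknown":
        return "skill_unknown"          # gate (c); a missing skill counts as unknown
    if session.get("connectivity_failure"):
        return "connectivity_failure"   # gate (d); hypothetical field name
    return None


def write_gate(sessions: List[Dict]) -> List[Dict]:
    """Keep only sessions that pass every gate."""
    return [s for s in sessions if rejection_reason(s) is None]
```

Returning the reason rather than a bare boolean makes it possible to log how many sessions each gate excluded per training run.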
Class 15 — Wrong Host / Target Confusion¶
Description: The model runs commands against the wrong machine — typically confusing the attacker host (Kali) with the target host, or confusing attacker with pivot in multi-hop scenarios. Distinct from Class 6 (missing hint) in that the model has a hint but misidentifies which machine to execute on.
| Issue | Objective | Instance |
|---|---|---|
| #123 (closed) | T12 | vulnerability_assessment runs lynis on local Kali host instead of target |
| #124 (closed) | T10 | post_exploitation enumerates Kali host instead of SSH target |
| #125 (closed) | T14 | Enters degenerate proxy loop when DVWA unreachable (connected to wrong interface) |
| #390 (closed) | T54 | socat_relay configured and run on attacker host, not pivot |
| #411 (closed) | T53 | ligolo-agent run on attacker instead of pivot |
Class-level remediation:
- Explicit host labels in hints — every hint command must be preceded by a label identifying which host it runs on: `# On attacker:`, `# On pivot (via SSH):`, `# On target:`. The model uses these labels to orient itself.
- `_targeted_at` guards — success_fns for objectives that run commands on remote targets must verify output originated from the target, not localhost (see also Class 4).
- Connectivity pre-check as first hint step — the first hint step for any remote objective should verify connectivity to the target before entering the workflow.
Prevention status: C10 (_targeted_at in skills/) CI gated. Host label enforcement in hint text: documentation only (no static check exists for comment presence). eval_harness.py gap: #527.
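A static check for host label presence could look like the sketch below. It is not an existing check_hints.py rule. It assumes a hint is plain text with one command per line, and that a label covers every command up to the next blank line; both are assumptions about the hint format, and the function name `unlabeled_commands` is hypothetical.

```python
import re
from typing import List

# Matches the three label forms named in the remediation, for example
# "# On attacker:" or "# On pivot (via SSH):".
_HOST_LABEL = re.compile(r"^#\s*On (attacker|pivot|target)\b.*:\s*$")


def unlabeled_commands(hint: str) -> List[str]:
    """Return hint commands that are not covered by a host label."""
    missing: List[str] = []
    labeled = False
    for raw in hint.splitlines():
        line = raw.strip()
        if not line:
            labeled = False      # assumption: a blank line ends the labeled group
        elif _HOST_LABEL.match(line):
            labeled = True
        elif line.startswith("#"):
            continue             # other comments are neither labels nor commands
        elif not labeled:
            missing.append(line)
    return missing
```

A linter rule built on this would fail the commit when the returned list is non-empty for any hint block.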
Open Instance Summary¶
As of 2026-05-30, all instances listed in the original inventory (#62–#480) are closed. The two remaining open issues are in post-#480 classes:
| Class | Open issues |
|---|---|
| 1–15 (all) | None — all #62–#480 instances resolved |
| [Post-#480 Class A] Hint Timeout | #632 (PT-POST-02 john cracking) |
| [Class 6 / routing] | #692 (PT-EXPLOIT-04 intra-hint routing) |
Three new failure classes identified post-#480 (documented in the ARCHER operational inventory; numbered 15–17 in that inventory's local numbering, distinct from Classes 15–16 in this document):
- [Post-#480 Class A] — Hint Timeout / Execution Time Overrun: Correct tool, correct config, but wordlist or cracking task exceeds session time budget (601s). Fix: scope wordlists, add time-box checkpoints. (#642 closed, #632 open)
- [Post-#480 Class B] — Phantom Pass: Harness grep filter returns `[Success: No output]` → model narrates expected output → `success_fn` matches narration text. A v2-critical harness-level artifact, not a model or hint failure. (#681 closed)
- [Post-#480 Class C] — Context Saturation (Hint-Level): Fresh session produces 0 commands when hint block exceeds ~600 chars. Identical symptom to Class 7 (VRAM bleed) but mechanism is hint budget, not VRAM depletion. Diagnosis: run in isolation; if still 0 cmds, decompose the objective. (No filed issue; established as CLAUDE.md rule)
Retroactive Audit Targets¶
Three categories of past failures that were never filed as issues because they produced false successes rather than observable eval failures. These are data quality audit targets — sessions that passed contemporary filters but are suspect in retrospect. They are not numbered failure classes; they do not manifest as failing eval runs.
Wrong-host playbook contamination¶
Sessions where success=True was recorded but commands ran against Kali/localhost instead of the intended target. These sessions were written to the playbook — and potentially to ft.jsonl — before _targeted_at guards existed on the relevant success_fns (Classes 4, 11, 15).
Detection method: Scan all playbook sessions for Kali-specific signals in target-expected output fields: loopback addresses (127.0.0.1, ::1), /home/kali/ paths, hostname archer-kali, Kali-default service banners. Any session with these in "success" command output where remote target output is expected is wrong-host contaminated.
Scope: Playbook entries generated before the _targeted_at guard was added to the relevant success_fns. Affected objectives per issue history: T10, T12, T23, T54, T56. After #501 ships (verified success signal gate), new playbook writes are protected. Existing entries require a one-time retroactive scan.
Remediation action: Retroactive playbook scan script; flag suspect entries for Auditor review; remove confirmed wrong-host entries from playbook and ft.jsonl. File a scoped issue once #501 is closed.
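The detection method above can be sketched as a scan over playbook entries. This is an illustration only: the entry shape (`id`, `success`, `output`) is an assumed schema, the function names are hypothetical, and Kali-default service banners are left out because the text does not enumerate them.

```python
import re
from typing import Dict, Iterable, List

# Kali-side signals named in the detection method.
_KALI_SIGNALS = {
    "loopback_v4": re.compile(r"(?<![\d.])127\.0\.0\.1(?![\d.])"),
    "loopback_v6": re.compile(r"(?<![0-9A-Fa-f:])::1(?![0-9A-Fa-f:])"),
    "kali_home": re.compile(r"/home/kali/"),
    "kali_hostname": re.compile(r"\barcher-kali\b"),
}


def kali_signals(output: str) -> List[str]:
    """Return the names of Kali-side signals found in command output."""
    return [name for name, pat in _KALI_SIGNALS.items() if pat.search(output)]


def suspect_entries(entries: Iterable[Dict]) -> List[Dict]:
    """Flag entries recorded as successful whose output carries Kali signals.

    Assumed entry shape: {"id": ..., "success": bool, "output": str}.
    """
    return [
        e for e in entries
        if e.get("success") and kali_signals(e.get("output", ""))
    ]
```

Flagged entries are candidates for review, not confirmed contamination: a legitimate session can mention a loopback address, so the scan narrows the set rather than deciding it.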
Routing miss undercount¶
Historical routing misses that were never filed as issues because wrong-skill outputs looked plausible at eval time. Class 9 lists 13 routing issues, but 80% of 38,927 routing log entries are unknown — no ground truth. A significant fraction of early eval runs may have been routed to the wrong skill pack and scored anyway, producing training data that reinforces incorrect skill selection.
Detection method: Once #480 (correct_skill passthrough) and #478 (SHA tagging) ship, compare skill_selected vs expected_skill across all eval_label entries. Pre-SHA entries where the two diverge are historical routing misses. The SHA boundary from the #343–#348 routing fix batch identifies the highest-risk epoch.
Scope: All eval runs prior to the #343–#348 routing keyword fix batch. Objectives with known routing ambiguity per Class 9: T8, T25, T42, T43, T44, T45, T47, T49, T52. Requires #478 and #480 to be shipped before the scan is executable.
Remediation action: Post-#478/#480, run routing miss scan against all eval_label routing log entries. Flag skill_selected ≠ expected_skill entries for exclusion from classifier retraining. Quantify the historical miss rate per skill pair.
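Once both fields are populated, the scan reduces to counting divergent pairs. A minimal sketch, assuming each routing log entry is a dict with an `event` field plus `skill_selected` and `expected_skill`; the `event` key and the function name `routing_misses` are assumptions, not the actual log schema.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def routing_misses(entries: Iterable[Dict]) -> Tuple[Counter, int]:
    """Count (expected_skill, skill_selected) pairs that diverge.

    Returns the per-pair miss counts and the number of entries compared.
    Entries without ground truth are skipped, not counted as misses.
    """
    misses: Counter = Counter()
    compared = 0
    for entry in entries:
        if entry.get("event") != "eval_label":
            continue
        expected = entry.get("expected_skill")
        selected = entry.get("skill_selected")
        if not expected or not selected:
            continue
        compared += 1
        if expected != selected:
            misses[(expected, selected)] += 1
    return misses, compared
```

The miss rate for a skill pair is its count divided by the number of entries compared; entries flagged here are the ones to exclude from classifier retraining.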
Depth-blocked contamination window¶
Sessions generated between VRAM bleed manifestation (~issue #180) and filter addition (~issue #369). Depth-blocked, zero-command sessions from this window may be in ft.jsonl. The downstream filter was added at #369, but sessions generated before it are potentially contaminated even if they appear well-formed, because the depth_blocked flag may have been absent from the schema at that time.
Detection method: Identify git SHAs for the commits closing #180 (VRAM bleed identified) and #369 (filter added). Any ft.jsonl session with a generation timestamp in that window and cmds=0 or depth_blocked=True should be excluded from the next retrain. After #478 ships, SHA tags make this lookup direct — no timestamp correlation needed.
Scope: ft.jsonl sessions from the #180→#369 window. Objectives most likely to have been affected: any long-running objective that ran overnight during that period (T6, T10, T12, T21, T53). Check cmds=0 prevalence in that epoch as a proxy for contamination density.
Remediation action: Extract SHA boundaries from git log; filter ft.jsonl by SHA epoch and cmds=0 / depth_blocked fields; exclude identified sessions from next training run. Document excluded count in PROCESSES.md training run log.
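Until SHA tags are available, the timestamp-based variant of this filter can be sketched as below. It assumes each ft.jsonl line is a JSON object with an ISO 8601 timestamp in a field called `generated_at` (a hypothetical name), plus the `cmds` and `depth_blocked` fields named above. The window boundaries are supplied by the caller from the git log.

```python
import json
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def is_suspect(session: Dict, start: datetime, end: datetime) -> bool:
    """True if the session falls in the window and looks depth-blocked."""
    # "generated_at" is a hypothetical field name.
    ts = datetime.fromisoformat(session["generated_at"])
    if not (start <= ts <= end):
        return False
    # depth_blocked may be absent in old schema versions, so cmds == 0 is
    # checked independently.
    return session.get("cmds") == 0 or session.get("depth_blocked") is True


def split_sessions(lines: Iterable[str], start: datetime,
                   end: datetime) -> Tuple[List[Dict], List[Dict]]:
    """Parse ft.jsonl lines and return (kept, excluded)."""
    kept: List[Dict] = []
    excluded: List[Dict] = []
    for line in lines:
        if not line.strip():
            continue
        session = json.loads(line)
        (excluded if is_suspect(session, start, end) else kept).append(session)
    return kept, excluded
```

`len(excluded)` is the count to record in the training run log.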
Cross-Class Meta-Patterns¶
Four meta-patterns appear across multiple failure classes. A fix targeting the meta-pattern closes multiple classes simultaneously.
Meta-pattern A — Startup log liveness check
Classes 2, 3, 11. The pattern: grep -q <startup_string> <logfile> as a process liveness test. Startup strings are written before crashes. The fix is always the same: check process existence (tmux has-session, pgrep, /proc/<pid>), not log content.
Prevention status: C1 (nohup/& on TUI tools) and C2 (grep -q without -i on log files) gated in check_hints.py. Gap: process liveness via session check (tmux has-session) not yet enforced.
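For harness-side checks, the same rule (test process existence, not log content) can be sketched in Python. The helper names are hypothetical. `pid_alive` uses signal 0, which performs the existence check without delivering a signal, and `tmux_session_alive` wraps `tmux has-session -t`.

```python
import os
import subprocess


def pid_alive(pid: int) -> bool:
    """True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def tmux_session_alive(name: str) -> bool:
    """True if `tmux has-session -t NAME` exits with status 0."""
    try:
        result = subprocess.run(
            ["tmux", "has-session", "-t", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        return False  # tmux is not installed on this host
    return result.returncode == 0
```

Unlike a grep over a log file, both checks return False after a crash, because they query the current process table rather than text written at startup.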
Meta-pattern B — Missing verification step
Classes 4, 6, 11. The pattern: a hint that sets up infrastructure declares success without confirming the objective is actually complete. A mandatory verification step, one that produces the specific output the success_fn checks, prevents the entire category. This is the single highest-leverage audit action: grep every hint for a verification step and file a bug for any that lack one.
Prevention status: Manual audit in progress (#483). CI gate not yet implemented — C5 is deferred in check_hints.py pending a reliable absence heuristic. Tracking issue: #526.
Meta-pattern C — App-specific without generic companion
Classes 5, 6, 10. Hints that target a specific app (vsftpd, DVWA, Metasploitable2) train the model to solve the box, not the vulnerability class. The two-layer rule (app-specific + generic placeholder companion) addresses all three classes. Reference: docs/research/range-lock-in.md.
Prevention status: CI gated — C4 in check_hints.py blocks new violations at commit time. Remediation of existing hardcoded IPs/paths tracked in #432.
Meta-pattern D — Wrong-host execution
Classes 4, 6, 15. The model executes commands on the attacker host instead of the target, passes success checks because the output looks correct (uid=0 on Kali, lynis running locally, etc.), and produces a false positive. _targeted_at guards on success_fns and explicit host labels in hints address this pattern across all three classes.
Prevention status: Partially gated — C10 in check_hints.py checks skills/PT-*.py success_fn/verify_fn references. Gap: testenv/eval_harness.py objective success functions (_t*_ family) are not checked. Tracking issue: #527.
Highest-Leverage Remediations¶
Actions that close multiple open issues or prevent entire failure classes. CI status column reflects whether the action is enforced automatically.
| Action | Classes addressed | CI status | Tracking |
|---|---|---|---|
| tmux wrapper standard for all TUI tools | 2 | C1 gated | — |
| `grep -qi` standard for all log pattern checks | 3 | C2 gated | — |
| `_targeted_at` guard on all success_fns (skills/) | 4, 11, 15 | C10 gated | — |
| `_targeted_at` guard on eval_harness.py `_t*_` fns | 4, 11, 15 | Not gated | #527 |
| Exclude [THOUGHT] from success_fn scope | 4, 11 | Implemented | — |
| Verification step audit across all hint blocks | 4, 6, 11 | Not gated | #483, #526 |
| Two-layer rule (app-specific + generic companion) | 5, 6, 10 | C4 gated | #432 |
| Between-objective VRAM flush | 7 | Not gated | #451 |
| setup_fn idempotency audit | 7, 13 | Not gated | #528, #525 |
| `check_hint_lengths.py` → `check_hints.py` | 8 | C7 gated | — |
| Pipeline gate checklist (training) | 14 | Partial | #501 |
Related¶
- `docs/research/range-lock-in.md` — full analysis of Class 10
- #96 — routing log quality analysis (feeds Class 9 and 14 remediations)
- #478 — git SHA tagging (Class 9 remediation; required for retroactive routing miss scan)
- #480 — correct_skill passthrough (Class 14 remediation; required for routing miss undercount scan)
- #481 — eval harness improvement ideas
- #482 — this document's tracking issue
- #483 — verification step audit (meta-pattern B; manual one-time pass)
- #497 — check_hints.py linter (automates Classes 1, 2, 3, 6, 8 prevention)
- #498 — failure-class dashboards (velocity, compound heatmap, remediation coverage, epoch view)
- #499 — symptom-to-class mapper (automates post-eval failure class diagnosis)
- #500 — this doc update (retroactive audit categories)
- #501 — success signal verification before playbook write (prevents wrong-host contamination at write time)
- #502 — failure class tagging on ft.jsonl sessions
- #503 — SHA epoch gating per failure class for retrain filtering
- #525 — ligolo TUN route flush in `_setup_pivot_range` (Class 7/13 instance)
- #526 — C5 CI gate: verification step presence check (meta-pattern B enforcement)
- #527 — C6 CI gate: `_targeted_at` guard audit for eval_harness.py (meta-pattern D gap)
- #528 — setup_fn idempotency audit across all pivot/AD/web objectives (Class 7)