ARCHER Failure Mode Inventory

A structured inventory of every failure class ARCHER has exhibited across eval runs and RCA sessions. Purpose: identify root causes shared across multiple objectives so that a single fix closes an entire class rather than re-filing the same bug under a new issue number.

Methodology: every closed bug and regression issue (#62–#480) was reviewed and assigned to one or more failure classes. Open instances are listed per class. Class-level remediations are specified where ≥2 instances share a root cause.

Last updated: 2026-05-30. Living document — add new instances as they surface. Covers issues #62–#480; post-#480 analysis and new classes (15–17) documented in the ARCHER operational inventory.

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

Lessons Learned

Six weeks of eval-driven development against live targets produced 130+ bug and regression issues. Reviewing them as a body — rather than individually — reveals patterns that were invisible when each issue was filed in isolation.

The one-bug-one-fix trap

The dominant failure pattern in ARCHER's development history is not any single bug class — it is the tendency to fix symptoms rather than classes. The same root cause recurred under different issue numbers: LHOST missing from a msfconsole chain (#161), an unsubstituted <attacker_ip> placeholder (#411), a lost variable between command dispatches (#475) — all three are the same failure (shell variable loss, Class 1), diagnosed and fixed three separate times. Across 15 classes and 130+ issues, this pattern repeats: a root cause is identified, fixed for the specific objective that surfaced it, and left unaddressed in adjacent objectives where the same code pattern exists.

The lesson: diagnosis is not complete until all instances of the root cause are found. A fix that closes one issue while leaving five sibling objectives with the same failure pattern is an incomplete fix.

Environmental assumptions accumulate silently

A large fraction of failures — Classes 2 (PTY crash), 13 (infrastructure gap), and 15 (wrong host) — are not model or hint failures at all. They are broken assumptions about the environment: a missing binary, a container without /dev/net/tun, a tool that requires a PTY but is launched without one. These failures are particularly costly because they are diagnosed as model failures first, wasting hint-tuning cycles before the real cause is found. The lesson: environment verification must precede model diagnosis. A failing objective should first be tested against a known-good environment before any hint is changed.

Success signals are consistently too weak

Four failure classes share a common weakness: premature OA (Class 4), false positive success_fn (Class 11), missing verification step (Class 6), and wrong-host confusion (Class 15). In each, the system accepts evidence that does not actually prove the objective was completed. A single string match on a log line, a grep that fires on error output, a success_fn that passes on wrong-host output, a model that reads a startup warning and declares success: all are variants of the same problem. The lesson: success signals must be sufficient proof of success, not merely correlated with it. A signal that can appear in both success and failure contexts is not a success signal.

Training data quality is a silent multiplier

Class 14 (training data contamination) is the least visible failure class because contaminated sessions do not cause immediate eval failures — they degrade model behavior gradually, in ways that manifest as routing misses and hint non-compliance weeks or months later. Depth-blocked sessions, wrong-host sessions, and unverified sessions entered the fine-tuning pipeline across multiple issues (#134, #153, #275, #316, #369). The lesson: pipeline gates must be code, not process. A filter that exists only as a documented guideline will eventually be bypassed. Every contamination category must be enforced at the point of data generation.
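A minimal sketch of what a gate "as code" could look like at the point of data generation. The `Session` fields and function names are hypothetical and are not ARCHER's actual pipeline API; only the three contamination categories (depth-blocked, wrong-host, unverified) come from the paragraph above.

```python
from dataclasses import dataclass


class ContaminatedSession(ValueError):
    """Raised when a session must not enter the fine-tuning set."""


@dataclass
class Session:
    # Hypothetical session record; field names are assumptions.
    session_id: str
    depth_blocked: bool = False
    wrong_host: bool = False
    verified: bool = True


def gate(session: Session) -> Session:
    """Return the session unchanged, or raise with every reason it is contaminated."""
    reasons = []
    if session.depth_blocked:
        reasons.append("depth-blocked")
    if session.wrong_host:
        reasons.append("wrong-host")
    if not session.verified:
        reasons.append("unverified")
    if reasons:
        raise ContaminatedSession(f"{session.session_id}: {', '.join(reasons)}")
    return session


def write_training_rows(sessions, sink: list) -> list[str]:
    """Append only sessions that pass the gate; return the rejection messages."""
    rejected = []
    for session in sessions:
        try:
            sink.append(gate(session))
        except ContaminatedSession as exc:
            rejected.append(str(exc))
    return rejected
```

Because the check sits in the only write path, a contaminated session cannot reach the sink by skipping a documented review step.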

Automation prevents recurrence; process does not

The four cross-class meta-patterns (startup log liveness, missing verification step, app-specific without generic companion, wrong-host execution) have each recurred three or more times despite being documented in CLAUDE.md, PROCESSES.md, and individual issue comments. Documentation does not prevent recurrence. The failures that stopped recurring are the ones that became CI gates — check_hint_lengths.py eliminated Class 8 truncation issues after a single enforcement point was added. The lesson: the only reliable prevention is automated enforcement. Every lesson in this document that has not yet been encoded as a CI check or lint rule is a lesson waiting to be relearned.

What this means for development priorities

The failure mode inventory points to three structural changes that would have prevented the majority of issues in this document:

A hint linter in CI (check_hints.py) — enforcing the tmux standard, case-insensitive grep, placeholder substitution, and verification step presence. Addresses Classes 1, 2, 3, and 6 mechanically.
_targeted_at guards on all success_fns — a single audit pass adding wrong-host guards. Addresses Classes 4, 11, and 15 in one action.
Hard gates in the training pipeline — verify_fn_skipped, depth-blocked, and unknown-skill sessions rejected at write time, not filtered downstream. Addresses Class 14 permanently.

None of these require model changes or hint rewrites. They are infrastructure changes that make entire failure classes structurally impossible.

Class 1 — Shell Variable Loss

Description: A shell variable is set in one ARCHER command dispatch and referenced in a subsequent dispatch. Because each command runs in a separate bash invocation, the variable is gone by the time it is needed. Unsubstituted literal placeholders (<attacker_ip>, <user>) are the same failure in hint authoring — the model receives the placeholder unchanged and uses it literally.

Root cause: ARCHER dispatches each generated command as an independent subprocess. Shell state does not persist between dispatches.
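The effect can be reproduced outside ARCHER with two independent shell processes. This is an illustrative sketch, not ARCHER's dispatcher: `dispatch` is a hypothetical stand-in, POSIX `sh` is used in place of bash (the behaviour is the same for this purpose), and the address `10.0.0.5` is made up.

```python
import os
import subprocess

# Start from an environment that does not already define the variable.
CLEAN_ENV = {k: v for k, v in os.environ.items() if k != "ATTACKER_IP"}


def dispatch(command: str) -> str:
    """Run one command in a fresh shell process and return its stdout."""
    result = subprocess.run(
        ["sh", "-c", command],
        capture_output=True,
        text=True,
        check=False,
        env=CLEAN_ENV,
    )
    return result.stdout.strip()


# Dispatch 1 assigns the variable; the shell then exits and the value is gone.
dispatch("ATTACKER_IP=10.0.0.5")

# Dispatch 2 is a new process, so the variable expands to an empty string.
lost = dispatch('echo "connect to ${ATTACKER_IP}:8000"')

# Inlining the command substitution keeps assignment and use in one process.
inlined = dispatch('echo "connect to $(echo 10.0.0.5):8000"')

print(lost)
print(inlined)
```

The first result has an empty host in front of `:8000`, which is the same shape as the failure recorded for #475.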

Issue	Objective	Instance
#89 (closed)	T4/T6	LHOST detection awk picks wrong IP when route has `via` gateway — wrong IP class for the context
#161 (closed)	T6 UnrealIRCd	LHOST missing from msfconsole chain — reverse shell never receives connection
#409 (closed)	T52 chisel	`<user>` placeholder in scp command reaches model literally — scp fails
#411 (closed)	T53 ligolo	`<attacker_ip>` placeholder unsubstituted + agent run on wrong host
#392 (closed)	T56 multi-hop	Wrong credentials (`msfadmin`) instead of pivot-range creds (`pivot:archer123`)
#475 (open)	T52 chisel	`ATTACKER_IP` set in cmd 1, empty in cmd 2 — client connects to `:8000` on pivot's localhost

Class-level remediation:

Inline subshell expansion — never assign a variable in one dispatch and reference it in a later one. Inline the command substitution at the point of use: write ssh ... $(ip route get {pivot} | grep -oP 'src \K\S+'):8000, not ATTACKER_IP=$(...) followed by ssh ...$ATTACKER_IP... in a later command.
Placeholder hygiene — all <placeholder> strings must be substituted at hint render time via the template system. Any placeholder that reaches the model is a hint defect. Audit: grep -r '<[a-z_]*>' skills/ should return zero results.
Cross-step $VAR audit — grep all skills/PT-*.py hints for $VAR patterns that appear in a later step than the assignment. Any cross-step reference is a latent bug.
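The first two audits above are mechanical enough to sketch as a lint function. This is an assumption-laden illustration, not the contents of check_hints.py: the function name and message strings are hypothetical, and the cross-step dereference audit is deliberately left out because it needs a multi-step parse.

```python
import re

# Lowercase angle-bracket placeholders such as <attacker_ip> or <user>.
PLACEHOLDER_RE = re.compile(r"<[a-z_]+>")

# A shell variable assigned from a command substitution, e.g. ATTACKER_IP=$(...).
ASSIGN_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)=\$\(")


def lint_hint(text: str) -> list[str]:
    """Return one finding per placeholder or VAR=$(...) assignment in a hint."""
    findings = []
    for match in PLACEHOLDER_RE.finditer(text):
        findings.append(f"unsubstituted placeholder {match.group(0)}")
    for match in ASSIGN_RE.finditer(text):
        findings.append(f"command-substitution assignment to {match.group(1)}")
    return findings
```

A hint that inlines `$(...)` where the value is used produces no findings; one that assigns first and references later is flagged at the assignment.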

Prevention status: C8 (VAR=$() assignment detection) CI gated in check_hints.py. C3 (cross-step $VAR dereference) deferred — requires multi-step semantic parse.

Class 2 — PTY / TUI Crash

Description: A tool that requires an interactive terminal (PTY) is launched via nohup, &, or docker exec without -t. The tool's TUI library detects no PTY and panics, exiting silently after writing one or two startup log lines — making it appear to have started successfully.

Root cause: Modern CLI tools use PTY-detecting interactive frameworks (ligolo-proxy: grumble + survey/v2; msfconsole: readline). Without a PTY the library aborts. The process writes startup output before crashing, so log-based liveness checks pass even though the process is dead.
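The PTY check that such frameworks perform can be observed with a small experiment. The sketch below is illustrative only and Unix-specific (it relies on the standard `pty` module); the child process simply reports whether its stdin is a terminal, which is a simplification of what a real TUI library tests.

```python
import os
import pty
import subprocess
import sys

# Child process that reports whether its stdin is a terminal.
CHILD = [sys.executable, "-c", "import sys; print(sys.stdin.isatty())"]


def child_sees_tty(use_pty: bool) -> bool:
    """Launch the child with or without a pseudo-terminal on stdin."""
    if not use_pty:
        # No terminal attached on stdin, as in a detached background launch.
        done = subprocess.run(
            CHILD, stdin=subprocess.DEVNULL, capture_output=True, text=True
        )
        return done.stdout.strip() == "True"
    master_fd, slave_fd = pty.openpty()
    try:
        # The slave end of the pair behaves like a terminal to the child.
        done = subprocess.run(CHILD, stdin=slave_fd, capture_output=True, text=True)
    finally:
        os.close(master_fd)
        os.close(slave_fd)
    return done.stdout.strip() == "True"
```

A tool that aborts when this check fails will behave differently under the two launch modes even though the command line is identical.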

Issue	Objective	Instance
#79 (closed)	T6 msfconsole	msfconsole times out at 300s when run directly inside archer-kali — PTY context absent
#82 (closed)	T4/T6	bind_netcat payload hangs in eval harness — no PTY for interactive netcat session
#389 (closed)	T53 ligolo	archer-kali missing `/dev/net/tun` — ligolo permanently broken at container level
#441 (closed)	T53 ligolo	nohup proxy+agent — same crash pattern, earlier attempt at fix
#446 (closed)	T53 ligolo	`ligolo-ng.yaml` in repo root crashes proxy on automated startup
#455 (closed)	T53 ligolo	Version incompatibility (agent 0.6.2 vs proxy 0.8.3) — compound infrastructure failure
#474 (open)	T53 ligolo	Model uses `nohup /usr/bin/ligolo-proxy`; proxy writes "Listening" then crashes; `grep -q Listening` passes; agent gets connection refused

Class-level remediation:

tmux wrapper standard — any tool with a TUI must run inside tmux new-session -d -s <name>. nohup is wrong for TUI tools, always.
Process liveness via session check — never use grep -q <startup_string> <logfile> as the liveness test. Use tmux has-session -t <name>, which only passes if the session is alive. Startup log lines are written before crashes.
Hint audit — grep all hints that launch interactive tools (ligolo-proxy, msfconsole, responder, etc.) and verify they use tmux, not nohup/background.
Container pre-flight — verify /dev/net/tun exists in archer-kali before any ligolo run.
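A sketch of the session-based liveness check and the TUN pre-flight from the list above. The function names are hypothetical; the only external command used is `tmux has-session -t <name>`, which exits 0 when the session exists. The `run` parameter is injectable so the check can be exercised without tmux installed.

```python
import os
import subprocess


def tmux_session_alive(name: str, run=subprocess.run) -> bool:
    """True only if tmux reports that the named session still exists."""
    try:
        result = run(
            ["tmux", "has-session", "-t", name], capture_output=True, check=False
        )
    except FileNotFoundError:
        # tmux itself is missing, so nothing can be running inside it.
        return False
    return result.returncode == 0


def tun_device_present(path: str = "/dev/net/tun") -> bool:
    """Pre-flight check for the TUN device that ligolo needs in the container."""
    return os.path.exists(path)
```

With tmux's default settings a session whose only command has exited is destroyed, so this check fails after a crash where a grep on the startup log would still pass.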

Prevention status: C1 (nohup/& on TUI tools) CI gated in check_hints.py. Gap: process liveness check pattern (grep-q-logfile as liveness) not yet enforced separately from C2.

Class 3 — Case Mismatch / Pattern Miss

Description: A grep or regex pattern in a hint or success_fn fails because the actual log output uses different capitalization, punctuation, or wording than the pattern expects. Also includes success signals that are present in the tool output but absent from the model's visible context.

Issue	Objective	Instance
#166 (closed)	T9 nikto	OSVDB patterns not in ARCHER stdout — success_fn never fires despite correct output
#191 (closed)	T23 hash-crack	success_fn regex `hash` vs `hashes` — john plural output never matches
#380 (closed)	web_lfi	`<script>` match in `_SIGNAL_RE` contaminates Tier 2 evidence with HTML noise
#383 (closed)	linux_privesc	`_SIGNAL_RE` missing evidence patterns for sudo/SUID output — correct output invisible
#393 (closed)	T3a/T45	`halt_bypass_signals` missing nmap NSE vuln output patterns
#426 (closed)	T53 ligolo	`http/1.` from wget error satisfies `has_traversal` — wrong signal matched
#474 (open)	T53 ligolo	`grep -q 'agent joined'` (lowercase); actual log: `msg="Agent joined."` (capital A, period)

Class-level remediation:

Always use -i for log greps — grep -qi costs nothing and eliminates case mismatch. No case-sensitive grep in hints without an explicit rationale.
Verify patterns against live output — before adding any grep pattern, run the actual tool in the lab and copy the exact string. Never infer log format from documentation.
Pattern specificity — _SIGNAL_RE entries must match tool-specific output, not generic HTML/HTTP strings that appear in error pages. Add negative examples to the test suite.
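A small harness for checking a pattern against positive and negative examples, using the #474 log line from the table above as the positive case. The helper name and the negative example are hypothetical; real negative examples should be copied from live failure output.

```python
import re


def pattern_failures(pattern, positive, negative):
    """List every positive line the pattern misses and every negative line it matches."""
    failures = []
    for line in positive:
        if not pattern.search(line):
            failures.append(f"missed: {line}")
    for line in negative:
        if pattern.search(line):
            failures.append(f"false match: {line}")
    return failures


POSITIVE = ['msg="Agent joined."']  # actual log line quoted in #474
NEGATIVE = ["agent failed to join: connection refused"]  # made-up failure line

case_sensitive = re.compile(r"agent joined")
case_insensitive = re.compile(r"agent joined", re.IGNORECASE)

print(pattern_failures(case_sensitive, POSITIVE, NEGATIVE))
print(pattern_failures(case_insensitive, POSITIVE, NEGATIVE))
```

The case-sensitive pattern misses the real log line; the case-insensitive one matches it without matching the failure line.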

Prevention status: C2 (grep -q without -i on log files) CI gated in check_hints.py.

Class 4 — Premature Objective Achieved / False Positive

Description: The model emits [OBJECTIVE_ACHIEVED] based on partial, fabricated, or ambiguous evidence. Variants: reading only the first positive line of output without checking for subsequent error lines; success_fn matching on wrong-host output; echo/printf fabrication of expected strings.

Issue	Objective	Instance
#78 (closed)	multiple	HALT_DISCIPLINE false positive — failed exploit marked answerable, quality=1.0 saved to playbook
#81 (closed)	T6	Model fabricates `uid=0` via echo — success_fn passes on fabricated output
#83 (closed)	T6	success_fn must guard against echo/printf fabrication of uid=0
#107 (closed)	multiple	Model emits OA on exploit failure + loops on bare bash
#115 (closed)	T7	`_t7_host_discovery` too broad — any 192.168.56.x IP in output passes
#116 (closed)	T12	`_t12_vuln_assess` matches `critical\|high\|medium` in natural language
#123 (closed)	T12	vulnerability_assessment runs lynis on local host — success_fn passes on wrong-host output
#124 (closed)	T10	post_exploitation enumerates Kali host — wrong-host false positive
#132 (closed)	T12	T12 passes on wrong-host lynis
#133 (closed)	T10	T10 passes on wrong-host enumeration
#141 (closed)	T14/T16	Passes when BWA target unreachable — no connectivity guard
#142 (closed)	T23	T23 passes when john runs but finds no hashes
#143 (closed)	T11	port_scanning speed anomaly passes too quickly
#144 (closed)	T15	No `_targeted_at` guard — localhost false positive risk
#163 (closed)	T27	linux_privesc premature OA — model stops after 1 command
#168 (closed)	T12	searchsploit passes without scanning remote target
#236 (closed)	T28	`_t28_suid_privesc` fires on command text, not evidence
#253/#254 (closed)	multiple	9 boundary violations — OA exits bypassing verify_fn; echo fabrication suspected
#367 (closed)	T12	lynis+searchsploit passes without scanning remote target
#368 (closed)	T10	uname on Kali + failed SSH attempt passes post_exploit
#387 (closed)	T54 socat	Fires OA without passing verify_traversal
#423 (closed)	T56	[THOUGHT] text contains target IP — satisfies _real_pivot_traversal
#474 (open)	T53 R3	Model reads agent startup warning, declares OA; next lines show fatal connection error

Class-level remediation:

_targeted_at guard on all success_fns — every success function must verify output came from the correct target IP, not localhost or the Kali host. This guard alone would have prevented ~8 instances in this list.
Exclude [THOUGHT] text from pattern matching — strip [THOUGHT]...[/THOUGHT] blocks before applying success_fn patterns. Model reasoning should never satisfy an evidence check.
Fabrication guard — success_fn patterns for privilege escalation outputs (uid=0, root) must appear in command stdout, not in text that the model could have generated itself.
Multi-signal requirement — OA for complex objectives (exploitation, pivoting) should require N distinct evidence signals, not a single string match.
Fatal-line scan — add hint instruction to read full command output before declaring success: startup warnings followed by fatal errors are the canonical false-positive pattern.
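Two of the remediations above, [THOUGHT] stripping and the fatal-line scan, can be sketched together. This is not the eval_harness.py implementation: the function names and the list of fatal markers are assumptions chosen for illustration.

```python
import re

# Model reasoning blocks, which must never count as evidence.
THOUGHT_RE = re.compile(r"\[THOUGHT\].*?\[/THOUGHT\]", re.DOTALL)

# Hypothetical markers of a failure that follows a positive-looking line.
FATAL_RE = re.compile(r"fatal|panic|connection refused", re.IGNORECASE)


def strip_thought(text: str) -> str:
    """Remove every [THOUGHT]...[/THOUGHT] block from the text."""
    return THOUGHT_RE.sub("", text)


def evidence_ok(output: str, signal: "re.Pattern[str]") -> bool:
    """Accept the signal only outside reasoning text and only with no fatal line."""
    cleaned = strip_thought(output)
    if FATAL_RE.search(cleaned):
        return False
    return signal.search(cleaned) is not None
```

Scanning the whole output for fatal markers before looking for the success signal covers the case where a startup warning is followed by a connection error.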

Prevention status: C10 (_targeted_at guard in skills/) CI gated in check_hints.py. [THOUGHT] stripping implemented in eval_harness.py. Gap: eval_harness.py _t*_ success functions not yet checked — tracking issue #527.

Class 5 — Wrong Module / Tool Selection

Description: The model selects the wrong Metasploit module, CLI tool, or approach — hallucinating a module name, choosing a module for a different vulnerability, selecting the wrong tool for the target environment, or misconfiguring a tool that would otherwise work.

Issue	Objective	Instance
#62–#70 (closed)	sweep	Early sweep findings: nmap wrong_skill, wapiti no_output, msfconsole no_output, hydra wrong_skill, searchsploit early_complete
#80 (closed)	T5 hydra	hydra uses rockyou as both user and password list — will never complete
#90 (closed)	T6 UnrealIRCd	Wrong exploit module selected (`samba/usermap_script` instead of `unix/irc/unreal_ircd`)
#104 (closed)	T5	SSH brute-force 0/3 — hydra hints insufficient
#105 (closed)	T6	msfconsole module path or port mismatch
#165 (closed)	T21 Tomcat	msfconsole module error on Metasploit 6.4.126
#173 (closed)	T5 ncrack	917-line wordlist always times out before reaching msfadmin
#238 (closed)	T35	UDP scan always times out — needs port scope or --top-ports
#372 (closed)	entity_id	nmap missing `-O/-sV` flags — incomplete output
#376 (closed)	vuln_assess	Model stops after `nmap -sV`, never runs vuln scripts
#412 (closed)	T45	`nmap --script vuln` too slow for OWASP-BWA — nuclei should be primary
#419 (closed)	T28 SUID	Model runs nmap interactive but doesn't pipe `!sh`
#429 (closed)	T3a	`nmap --script vuln` times out on MS2 — nuclei should be primary
#472 (open)	PT-EXPLOIT-01	Wrong Metasploit module, missing PAYLOAD, module hallucination

Class-level remediation:

Exploitation short-circuits — for every named CVE or well-known vulnerability (vsftpd, UnrealIRCd, MS08-067, etc.) add an explicit short-circuit block with exact module path, PAYLOAD setting, and RHOST/LHOST sequence. The model cannot reliably select the correct module from general knowledge.
Tool priority guidance — hints must specify which tool is primary and which is fallback. "Use nuclei; if unavailable, fall back to nmap --script vuln" is unambiguous. Generic "run a vulnerability scanner" produces tool mismatches.
Module reference tables — add a reference block in exploitation hints listing correct modules for common MS2/DVWA targets by port and service. The model matches task context to the table rather than hallucinating.
Wordlist scope — brute-force hints must specify targeted wordlists, not generic full-length lists. A 917-line wordlist with rockyou order guarantees timeout on any short-session objective.
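A module reference table can be as simple as a lookup keyed by port and service. The sketch below is illustrative: the table layout and function name are hypothetical, the three module paths are the standard Metasploit names for these well-known targets, and they should be verified against the installed Metasploit version before being copied into a hint.

```python
from typing import Optional

# (port, service) -> Metasploit module path. Entries are examples, not a
# complete or authoritative list for the lab.
MODULE_TABLE = {
    (21, "vsftpd"): "exploit/unix/ftp/vsftpd_234_backdoor",
    (139, "samba"): "exploit/multi/samba/usermap_script",
    (6667, "unrealircd"): "exploit/unix/irc/unreal_ircd_3281_backdoor",
}


def lookup_module(port: int, service: str) -> Optional[str]:
    """Return the module for an exact port/service pair, or None if unlisted."""
    return MODULE_TABLE.get((port, service.strip().lower()))
```

Returning None for an unlisted pair forces an explicit fallback decision instead of letting a near match (the #90 case) select the wrong module.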

Prevention status: Documentation only. No CI gate — requires semantic understanding of hint content. Manual review at hint authoring time.

Class 6 — Missing Short-Circuit / Hint Gap

Description: A hint block exists for one phase of an operation but not for the critical subsequent phase. The model is left without guidance at the decisive step and falls back to generic behavior. Also includes hints that specify the wrong host, wrong path, or wrong authentication method for the target environment.

Issue	Objective	Instance
#80/#91 (closed)	T5 hydra	SSH legacy key negotiation — no hint for `-oHostKeyAlgorithms` flag; SSH rejects all auth attempts
#145 (closed)	T17	MySQL enumeration failing — hints insufficient for mysql client auth flow
#157 (closed)	T15	Broken bash hint + missing syntax-error loop guard
#162 (closed)	T23	Hash-crack objective never dumps shadow — model skips extraction step
#167 (closed)	T16	DVWA LFI curl returns empty — missing `-L` flag; session not persisting
#175 (closed)	T16	DVWA LFI security level POST blocked by CSRF; traversal runs at wrong security level
#182 (closed)	T21b	Model fails reverse shell sequence — listener-before-trigger sequencing missing
#189 (closed)	T23	Hint not driving hash crack step — 0/3 with 2 cmds (dump only, no john/hashcat)
#193 (closed)	T23	Hint dump step unreliable — model drops `sudo -S` and `tee`; john gets empty file
#235 (closed)	linux_privesc	SSH enumeration blocked by `sudo -l` password prompt on MS2 — hint missing sshpass
#239 (closed)	T46/T47	post_exploitation SSH hints missing `HostKeyAlgorithms` flags for MS2
#242 (closed)	T46/T47	exfiltration and persistence hints missing SSH-first / sshpass for remote targets
#245 (closed)	T49 DVWA XSS	web_xss hints missing DVWA stored XSS workflow — model skips login, hits wrong endpoint
#311 (closed)	T16/T24/T26	Model completes task but never emits OA — 100% HALT_DISCIPLINE; completion signal too weak
#328 (closed)	linux_privesc	SSH loop and sudo-hang on MS2
#329 (closed)	T24 web_xss	100% HALT_DISCIPLINE — completion signal too weak
#331 (closed)	web_enum	robots.txt 404 needs fallback to directory brute-force
#378 (closed)	T50–T56	All pivot hints missing verification step
#390 (closed)	T54 socat	socat_relay runs on attacker host, not pivot — wrong machine targeted
#391 (closed)	T55	ProxyJump hint missing trailing command — connects then drops
#397 (closed)	T50	Default port 80 closed on pivot target; hint missing traversal step
#399 (closed)	T51	Wrong proxychains config file path in hint
#406 (closed)	T48 CSRF	No hint for auth-first flow — `/vulnerabilities/csrf/` returns 404 unauthenticated
#432 (open)	multiple	Hardcoded vulnbox IPs/paths — generic companion missing
#472 (open)	PT-EXPLOIT-01	vsftpd exploitation short-circuit missing (detection short-circuit exists)

Class-level remediation:

1. Two-layer rule — every app-specific detection hint must have a paired exploitation hint. For every _hints_* block that fires on confirmation/detection keywords, verify a paired exploitation block exists. See docs/research/range-lock-in.md.
2. Verification step mandatory — every hint block that sets up infrastructure must include a verification step that produces the evidence the success_fn checks.
3. SSH compatibility flags — any hint connecting to MS2/Metasploitable targets via SSH must include -oHostKeyAlgorithms=+ssh-rsa -oPubkeyAcceptedAlgorithms=+ssh-rsa or equivalent. This has recurred in T5, T17, T46, T47 — it is a lab-wide compatibility requirement.
4. Host targeting explicit — every hint command must explicitly state which host it runs on (attacker, pivot, target). Ambiguous host context produces wrong-machine execution.
5. Completion signal audit — any objective with HD=100% and OA=0% across multiple runs has a missing or weak completion signal, not a model failure. Check whether OA can be emitted given the hint structure before adjusting halt thresholds.
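Because the SSH compatibility flags are a lab-wide requirement, one option is to build every SSH command through a single helper. This is a sketch under assumptions: the helper name is hypothetical, the target address is made up, and only the two `-o` options are taken from the text above.

```python
import shlex

# Options the lab's legacy SSH servers need, as listed in the remediation.
LEGACY_SSH_OPTS = [
    "-oHostKeyAlgorithms=+ssh-rsa",
    "-oPubkeyAcceptedAlgorithms=+ssh-rsa",
]


def ssh_argv(user: str, host: str, remote_command: str) -> list[str]:
    """Build an ssh argv that always carries the legacy options and a payload."""
    if not remote_command.strip():
        # A bare ssh invocation connects and then has nothing to run.
        raise ValueError("remote command required")
    return ["ssh", *LEGACY_SSH_OPTS, f"{user}@{host}", remote_command]


# Example rendering for a hint; 192.168.56.101 is a placeholder address.
print(shlex.join(ssh_argv("msfadmin", "192.168.56.101", "id")))
```

Requiring a non-empty remote command also guards against hints that connect and then drop, as in #391.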

Prevention status: C4 (app-specific without generic companion) CI gated in check_hints.py. Verification step presence (item 2) not yet gated — C5 tracking issue #526. Manual one-time verification step audit: #483.

Class 7 — VRAM / Resource Bleed

Description: A long-running or resource-intensive objective saturates GPU VRAM or system resources. Subsequent objectives start with depleted resources, producing cmds=0 or truncated runs. Also includes port/process pollution between runs where cleanup is absent.

Issue	Objective	Instance
#77 (closed)	playbook	Fast replay used DEFAULT_COMMAND_TIMEOUT (120s) instead of skill timeout — wrong resource budget
#180 (closed)	multiple	Overnight VRAM-saturated depth-blocked zero-command sessions — excluded from pipeline
#248 (closed)	collection	`run_data_collection.sh` eval lock blocks its own Phase 1 child
#407 (closed)	T50	Stale tunnel ports (8080–8085) not cleaned between runs
#418 (closed)	T51	SOCKS port 1080 exhaustion — `ssh -D` not killed by prior cleanup
#436 (closed)	multiple	Stale `~/.archer_eval.lock` on process exit without cleanup
#444 (closed)	T52/T53/T56	`_setup_pivot_range` missing chisel/ligolo cleanup — runs 2+ fail
#451 (open)	multiple	VRAM bleed between objectives — ollama reload fires between runs not between objectives

Class-level remediation:

1. Between-objective VRAM flush — ollama stop + prewarm after each objective's runs complete. Prevents saturation from long objectives bleeding into short ones.
2. Idempotent setup_fn — every setup_fn must kill processes and release ports from prior runs before starting. A setup_fn that assumes a clean state will fail on run 2+.
3. Lock file cleanup — eval lock must be removed in a finally block, not just on clean exit. Any eval process that exits abnormally should release the lock.
4. Resource monitoring — Cockpit overview during long eval runs provides early warning: CPU/RAM flatline mid-run indicates bleed-out, not completion.
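The lock file cleanup rule can be sketched as a context manager. The name `eval_lock` is hypothetical and this is not the harness's actual locking code; it only illustrates releasing the lock in a `finally` block.

```python
import os
from contextlib import contextmanager


@contextmanager
def eval_lock(path: str):
    """Hold an exclusive lock file for the duration of the with-block."""
    # O_EXCL makes creation fail with FileExistsError if the lock is already held.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield path
    finally:
        # Runs on normal exit and on exceptions, so the lock is always released.
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass
```

A `finally` block does not run if the process is killed with SIGKILL, so a stale-lock check at startup (for example, comparing the recorded PID against running processes) is still worth considering.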

Prevention status: Documentation only for VRAM flush (item 1). setup_fn idempotency (item 2) partially enforced — pivot/AD preflights raise PreflightFailure (#494/#493), but cleanup completeness not audited. Tracking issue: #528 (setup_fn idempotency audit). Ligolo TUN route flush gap tracked in #525.

Class 8 — Character Limit / Command Truncation

Description: A hint block, command string, or evidence extraction exceeds a character limit, causing truncation, rejection, or silent data loss.

Issue	Objective	Instance
#334 (closed)	web_xss	`_extract_evidence` CMD truncation at 120 chars — XSS POST payloads invisible to Tier 2 scorer
#448 (open)	T52 chisel	Hint block exceeds 500-char harness limit — command rejected
#452 (closed)	T52 chisel	`&;` syntax — degenerate loop on every run

Class-level remediation:

Length gate in CI — the 500-char limit is already enforced (originally by check_hint_lengths.py, now by check_hints.py). Run the check locally before any hint commit.
Split at logical boundaries — if a command exceeds the limit, split it into two sequential hint steps at a logical boundary. Never concatenate with && or &; to stay under the limit.
Evidence extraction window — _extract_evidence CMD truncation at 120 chars is too short for POST payloads. Raise or eliminate the truncation for command text used in Tier 2 scoring.
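The length gate itself is a few lines. The sketch below assumes hints are available as a name-to-text mapping, which is a guess about the data layout; the 500-character limit is the one stated above.

```python
HINT_CHAR_LIMIT = 500  # harness limit stated in this section


def oversized_hints(hints: dict[str, str], limit: int = HINT_CHAR_LIMIT) -> dict[str, int]:
    """Map each hint name whose text exceeds the limit to its actual length."""
    return {name: len(text) for name, text in hints.items() if len(text) > limit}
```

A CI step can fail the build whenever the returned mapping is non-empty and print the offending names with their lengths.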

Prevention status: C7 (500-char limit) CI gated in check_hints.py (absorbed from check_hint_lengths.py).

Class 9 — Routing Miss

Description: A task is routed to the wrong skill pack, causing the model to receive hints and context for the wrong domain. Includes keyword scorer over-capture (one skill's keywords absorb tasks that belong to a sibling skill) and classifier confidence threshold mismatches.

Issue	Objective	Instance
#62/#63/#66/#67 (closed)	sweep	Early sweep findings: nmap, arp-scan, hydra, whatweb all routing to wrong skills
#103 (closed)	T8	web_enumeration routing miss — `skill=` empty for directory enum task
#147 (closed)	T25	Dead-tie — web_exploitation and web_cmd_injection score equally
#257 (closed)	classifier	Confidence threshold 0.7 never reached by TF-IDF+LR — lowered to 0.5
#276 (closed)	collection	Sparse gate ignores classifier confidence — routing misses never self-correct
#294 (closed)	T33/T38	T49 hint fix broke vulnerability_assessment and entity_identification routing
#343 (closed)	T42	Routes to web_vulnerability_scanning instead of web_enumeration
#344 (closed)	routing	web_authentication bonus_fn fires on bare `login` — misroutes auth-bypass tasks
#345 (closed)	routing	system_info `version` keyword misroutes vuln-assessment tasks
#346 (closed)	routing	entity_identification captures `check what's running` — should be service_enumeration
#347 (closed)	routing	port_scanning captures `what's listening` — should be service_enumeration
#348 (closed)	routing	reconnaissance captures bare `scan the target` — should be port_scanning
#385 (open)	T52	Routes to `ssh_tunneling` instead of `chisel_pivot`

Class-level remediation:

exclude_keywords discipline — every skill must have explicit exclude patterns for tasks that superficially match but belong to a sibling. Routing misses are often symmetric — if T52 routes to ssh_tunneling, ssh_tunneling is over-capturing.
Ambiguous task eval — run eval_harness --ambiguous after any hint or keyword change to verify no new misroutes are introduced.
Bonus_fn keyword audit — bonus functions that fire on single common words (login, version, check) are over-broad. Require at least two co-occurring terms or domain-anchored context.
SHA tagging (#478) — once it lands, use the SHA tag to filter pre-fix routing labels out of classifier retraining, so that old misroutes do not train the next classifier.
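The co-occurrence requirement for bonus functions can be sketched as a set intersection. The function name and both term sets are hypothetical; the real keyword lists live in the skill definitions.

```python
import re


def cooccurrence_bonus(task: str, anchor_terms: set[str], context_terms: set[str]) -> bool:
    """Fire only when an anchor term and a context term both appear in the task."""
    words = set(re.findall(r"[a-z0-9_]+", task.lower()))
    return bool(words & anchor_terms) and bool(words & context_terms)


# Illustrative term sets for an authentication skill.
LOGIN_ANCHORS = {"login"}
LOGIN_CONTEXT = {"form", "credentials", "password"}
```

With these sets, a task that mentions only "login" no longer earns the bonus, which is the over-capture described in #344.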

Prevention status: Documentation only. No CI gate — routing correctness requires live eval. --ambiguous flag exists for manual verification.

Class 10 — Range Lock-In

Description: A hint block fires only on a specific app name, IP address, or target path, training the model to solve one specific target rather than the vulnerability class. The model fails on any variant or alternative target.

Issue	Objective	Instance
#153 (closed)	Juice Shop	Contaminated ft.jsonl sessions (localhost:3000 wrong target) — trained on wrong target URL
#468 (closed)	T2a/T4/T6	Task strings named specific apps rather than vuln class
#469 (closed)	T14/T24/T25/T26/T48	Task strings with app paths and named apps
#432 (open)	multiple	Hardcoded vulnbox IPs and paths in hint triggers — generic companion missing

Class-level remediation: See docs/research/range-lock-in.md for full analysis. Summary: every app-specific hint block must have a generic companion block using placeholders (<login-endpoint>, <module>, <target>). App-specific block drives eval pass rate; generic companion teaches the transferable pattern. Task strings must describe the vulnerability class, not the specific app or path.

Prevention status: C4 CI gated in check_hints.py — blocks new app-specific hints without generic companions. Cleanup of existing violations: #432.

Class 11 — False Positive `success_fn` / `halt_fn`

Description: A success or halt function returns true on a signal that appears in non-success contexts — curl error output, [THOUGHT] text, wrong-host output, truncated stdout, or intermediate/partial matches.

Issue	Objective	Instance
#115 (closed)	T7	Any 192.168.56.x IP in output passes `_t7_host_discovery`
#116 (closed)	T12	`critical\|high\|medium` in natural language passes `_t12_vuln_assess`
#164 (closed)	T4	`_real_uid_root` false-negative — skips printf/msfconsole output blocks
#314 (closed)	network_exploit	L1 check rejects valid non-root code execution and backdoor confirmation
#410 (closed)	T54	`head -10` truncates SSH banner at line 11 — success_fn never sees OpenSSH
#423 (closed)	T56	`_real_pivot_traversal` matches target IP in [THOUGHT] text
#426 (closed)	T53	`has_traversal` matches `http/1.` from wget error output
#443 (closed)	T56	`_halt_tunneling` premature fire — traversal threshold too low
#401 (open)	T56/pivot	`HTTP/\d` regex matches curl verbose `* using HTTP/1.x` — not actual traversal

Class-level remediation:

Anchor patterns to tool-specific output — patterns must match strings that only appear in genuine success output, not in error messages, verbose output, or [THOUGHT] text.
Exclude [THOUGHT] from pattern scope — strip [THOUGHT] blocks before applying success_fn patterns.
Stdout-only matching — apply success patterns to command stdout, not full session text.
Threshold calibration — traversal-type functions should require the target IP to appear in network-layer output. HTTP header strings and verbose curl flags are not traversal evidence.
Avoid head -N truncation — use grep to extract the relevant line rather than truncating at a fixed line count.
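The difference between a loose and an anchored pattern can be shown with the `* using HTTP/1.x` line from #401. The anchored pattern below is an assumption about what genuine output looks like (an HTTP status line at the start of a stdout line); it is not the pattern used by the harness, and a status line alone still does not prove the response came from the intended target.

```python
import re

# Loose pattern of the kind described in #401: matches anywhere in the text.
LOOSE = re.compile(r"HTTP/\d")

# Anchored alternative: an HTTP status line at the start of a line.
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)? \d{3}\b", re.MULTILINE)

verbose_noise = "* using HTTP/1.x"          # curl verbose line quoted in #401
real_response = "HTTP/1.1 200 OK\nContent-Type: text/html"  # made-up response

print(bool(LOOSE.search(verbose_noise)))        # loose pattern fires on noise
print(bool(STATUS_LINE.search(verbose_noise)))  # anchored pattern does not
print(bool(STATUS_LINE.search(real_response)))  # anchored pattern accepts a status line
```

Applying the anchored pattern to command stdout only, after [THOUGHT] stripping, combines three of the remediations in the list above.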

Prevention status: C10 (skills/ success_fn _targeted_at guard) CI gated. [THOUGHT] stripping implemented in eval_harness.py. Gap: eval_harness.py _t*_ success functions not yet gated — #527.

Class 12 — Model Loop

Description: The model repeats the same command or sequence without making progress, eventually hitting max_commands. Variants: retrying a failed setup step indefinitely; completing the task but not emitting OA; looping on bare infrastructure commands with no payload.

Issue	Objective	Instance
#107 (closed)	multiple	Model loops on bare bash after exploit failure
#125 (closed)	T14	Degenerate proxy loop when DVWA unreachable — no connectivity pre-check
#126 (closed)	agent	`done:true` + empty command causes infinite agent loop
#163 (closed)	T27	Stops after 1 command — model treats minimal output as completion
#230 (closed)	watchdog	cmd_count comparison never resets between sessions — premature ambiguous block kills
#251 (closed)	T11/T23/T24/T26	102 runs at 97–100% halt depth, 0% OA
#311 (closed)	T16/T24/T26	100% HALT_DISCIPLINE — model completes task but OA never emitted
#382 (open)	linux_privesc	Model loops on bare SSH — drops command payload
#386 (closed)	T51 SOCKS	Hits max_commands every run — SOCKS setup loop

Class-level remediation:

Progress anchors in hints — hints must include checkpoints giving the model a clear signal it has advanced ("if port X is now listening, proceed to step 3"). Without progress anchors, the model retries setup steps indefinitely.
Connectivity pre-check — any hint for a web target must verify the target is reachable before entering the main workflow. A target that is down causes looping on connection attempts.
HD=100%/OA=0% diagnosis — this pattern means the model is completing the work but not emitting OA. Root cause is always hint structure (missing OA signal) or success_fn (not firing on correct output), not model capability. Adjust halt thresholds only after ruling out both.
done:true guard — the agent loop must reject empty commands with done:true and not re-queue.
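The done:true guard and a basic repeat check could be sketched as follows. This is a minimal illustration, not the agent's actual loop: `ModelTurn`, `next_action`, and the `max_repeats` threshold are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class ModelTurn:
    """Hypothetical shape of one parsed model response."""
    command: str
    done: bool


def next_action(turn: ModelTurn, history: list[str], max_repeats: int = 3) -> str:
    """Return 'run', 'stop', or 'abort_loop' for a single model turn."""
    command = turn.command.strip()
    if not command:
        # An empty command is never re-queued. With done:true the session
        # ends; without it the turn is treated as a stall.
        return "stop" if turn.done else "abort_loop"
    # The same command already ran max_repeats times in a row: no progress.
    tail = history[-max_repeats:]
    if len(tail) == max_repeats and all(prev == command for prev in tail):
        return "abort_loop"
    return "run"
```

A guard like this only catches exact repeats. Loops that vary the command slightly still need the progress anchors described above.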

Prevention status: Documentation only. No CI gate — loop detection requires live eval observation.

Class 13 — Infrastructure Gap¶

Description: A required binary, container configuration, network configuration, or environment component is missing or misconfigured. Failures in this class are not hint or model failures — the environment itself is broken.

Issue	Objective	Instance
#79 (closed)	T6	msfconsole times out in archer-kali container — container resource ceiling
#176 (closed)	T17–T19/T22	Stale playbook entries produce 0-cmd fast fails
#177 (closed)	T2b	T2b dependency on T2a not enforced — runs on uninitialized state
#181 (closed)	T2a	Port 6200 not releasing between runs — `exit -y` cleanup insufficient
#186 (closed)	T21	Unhandled `TimeoutExpired` in Tomcat restart kills harness
#194 (closed)	T23	`john.pot` contamination — `_setup_t23` cleans wrong host
#258 (closed)	T23	`rockyou.txt` corrupted in archer-kali + MS2 passwords missing from wordlist
#277 (closed)	multiple	Stale tool processes persist in archer-kali after eval harness exit
#302 (closed)	T16	`_setup_t16` must reset DVWA admin password before each run
#389 (closed)	T53	archer-kali missing `/dev/net/tun`
#400 (closed)	T56	172.30.2.20 not deployed — pivot range only has 2 nodes
#425 (closed)	T53	ligolo-agent binary missing from pivot range image
#437 (closed)	CI	verify-fix.yml no lab pre-flight — posts FAIL when targets unreachable
#442 (closed)	T53/T56	tini + openssh-client missing from Dockerfile
#450 (closed)	T51	SSH `-D` race condition — `ss` check runs before port is bound

Class-level remediation:

Prerun sanity check — archer-prerun verifies VM reachability, required binaries, and container state before any eval run. Do not start a run against a target that has not been verified reachable.
Idempotent setup_fn — every setup function must be idempotent: clean prior state, verify environment, then initialize. Non-idempotent setup fails on run 2+.
Dockerfile audit on range changes — any change to docker/pivot-range/Dockerfile requires a full T50–T56 run to confirm infrastructure integrity.
Race condition guards — any hint that starts a background process and immediately checks for it must include a sleep + retry loop, not a single-shot check.
Dependency enforcement — objectives with sequential dependencies (T2a → T2b) must enforce ordering in setup_fn, not rely on run order.
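The race condition guard (the failure behind #450) amounts to polling instead of checking once. The sketch below shows the same idea on the harness side in Python; `wait_for_port` and its defaults are assumptions for illustration, and a hint would express the equivalent as a shell sleep-and-retry loop.

```python
import socket
import time


def wait_for_port(host: str, port: int, attempts: int = 10, delay: float = 0.5) -> bool:
    """Poll until a TCP port accepts connections, rather than checking once."""
    for _ in range(attempts):
        try:
            # A successful connect means the background process has bound the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not bound yet (or refused): wait and try again.
            time.sleep(delay)
    return False
```

Bounding the number of attempts matters: an unbounded retry would turn an infrastructure gap into a Class 12 loop.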

Prevention status: PreflightFailure exception raises SKIP (not FAIL) on VM/AD unreachability — CI gated (#494/#493). setup_fn cleanup completeness not yet audited — #528. Ligolo TUN route flush: #525.

Class 14 — Training Data Contamination¶

Description: Invalid, low-quality, or incorrectly-labeled sessions enter the fine-tuning or classifier training pipeline, degrading model behavior in ways that are difficult to diagnose.

Issue	Pipeline stage	Instance
#134 (closed)	fine-tune	depth_blocked sessions leak into training data via `_build_full_conversation`
#136 (closed)	fine-tune	stale `data/finetune/` with obsolete Alpaca schema — regenerate required
#140 (closed)	fine-tune	HALT_DISCIPLINE inclusion policy split — `build_training_data.py` and `prepare_finetune.py` disagree on what to include
#150 (closed)	classifier	`router_labels.csv` no deduplication — classifier biased toward frequently-run objectives
#153 (closed)	fine-tune	Contaminated Juice Shop ft.jsonl sessions (localhost:3000 wrong target)
#155 (closed)	fine-tune	Raw model responses with backslash errors contaminate fine-tuning data
#275 (closed)	fine-tune	Web sessions with connectivity failures not filtered from pipeline
#316 (closed)	fine-tune	`verify_fn_skipped` not a hard gate on ft.jsonl writes — unverified sessions enter pipeline
#330 (closed)	fine-tune	`skill="unknown"` sessions not filtered by `prepare_finetune.py`
#369 (closed)	fine-tune	depth_blocked sessions produce zero-command training examples

Class-level remediation:

Pipeline gate checklist — before any training run, verify: (a) depth_blocked sessions excluded, (b) verify_fn_skipped sessions excluded, (c) skill="unknown" sessions excluded, (d) connectivity-failure sessions excluded, (e) no duplicate routing labels.
Hard gates in code — contamination filters must be code gates, not process guidelines. verify_fn_skipped=True must prevent ft.jsonl write at the point of generation, not in a downstream filter.
Quality filter for classifier — eval_label + label_confidence==high only (see PROCESSES.md). The 80% of routing log entries labeled unknown are not usable for classifier retraining.
Schema versioning — ft.jsonl schema version must be checked at pipeline entry. Stale entries from prior schema versions must be excluded or migrated.
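The gate checklist above could be expressed as a single code gate along these lines. This is a sketch under assumptions: `depth_blocked`, `verify_fn_skipped`, and `skill` are field names taken from this document, while `connectivity_failure`, `schema_version`, and `EXPECTED_SCHEMA_VERSION` are hypothetical and would need to match the real ft.jsonl schema.

```python
from collections import Counter
from typing import Any, Optional

EXPECTED_SCHEMA_VERSION = 2  # hypothetical value, for illustration only


def rejection_reason(session: dict[str, Any]) -> Optional[str]:
    """Return the first gate a session fails, or None if it may be written."""
    if session.get("depth_blocked"):
        return "depth_blocked"
    if session.get("verify_fn_skipped"):
        return "verify_fn_skipped"
    if session.get("skill", "unknown") == "unknown":
        return "unknown_skill"
    if session.get("connectivity_failure"):  # hypothetical field
        return "connectivity_failure"
    if session.get("schema_version") != EXPECTED_SCHEMA_VERSION:  # hypothetical field
        return "stale_schema"
    return None


def gate_sessions(sessions: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], Counter]:
    """Split sessions into those kept and a count of rejections per reason."""
    kept: list[dict[str, Any]] = []
    rejected: Counter = Counter()
    for session in sessions:
        reason = rejection_reason(session)
        if reason is None:
            kept.append(session)
        else:
            rejected[reason] += 1
    return kept, rejected
```

Returning the rejection counts alongside the kept sessions makes it possible to record how many sessions each gate excluded in a training run log.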

Prevention status: Some gates in code (depth_blocked, verify_fn_skipped); others documented in PROCESSES.md only. #501 (success signal verification before playbook write) addresses the wrong-host contamination path. Full gate checklist enforcement: documentation only.

Class 15 — Wrong Host / Target Confusion¶

Description: The model runs commands against the wrong machine — typically confusing the attacker host (Kali) with the target host, or confusing attacker with pivot in multi-hop scenarios. Distinct from Class 6 (missing hint) in that the model has a hint but misidentifies which machine to execute on.

Issue	Objective	Instance
#123 (closed)	T12	vulnerability_assessment runs lynis on local Kali host instead of target
#124 (closed)	T10	post_exploitation enumerates Kali host instead of SSH target
#125 (closed)	T14	Enters degenerate proxy loop when DVWA unreachable (connected to wrong interface)
#390 (closed)	T54	socat_relay configured and run on attacker host, not pivot
#411 (closed)	T53	ligolo-agent run on attacker instead of pivot

Class-level remediation:

Explicit host labels in hints — every hint command must be preceded by a label identifying which host it runs on: # On attacker:, # On pivot (via SSH):, # On target:. The model uses these labels to orient itself.
_targeted_at guards — success_fns for objectives that run commands on remote targets must verify output originated from the target, not localhost (see also Class 4).
Connectivity pre-check as first hint step — the first hint step for any remote objective should verify connectivity to the target before entering the workflow.
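A static check for host labels might look like the sketch below. It assumes a hint is plain text, that a label comment applies to the commands that follow it until the next blank line, and that only the three label forms listed above are valid. The function name and these block rules are assumptions, not an existing check_hints.py rule.

```python
import re

# Matches "# On attacker:", "# On pivot (via SSH):", "# On target:".
HOST_LABEL = re.compile(r"^#\s*On (attacker|pivot|target)\b", re.IGNORECASE)


def unlabeled_commands(hint: str) -> list[str]:
    """Return command lines that are not covered by a host label."""
    labeled = False
    missing: list[str] = []
    for raw in hint.splitlines():
        line = raw.strip()
        if not line:
            labeled = False  # a blank line ends the labeled block
            continue
        if HOST_LABEL.match(line):
            labeled = True
            continue
        if line.startswith("#"):
            continue  # other comments are neither labels nor commands
        if not labeled:
            missing.append(line)
    return missing
```

For example, a hint with one labeled block followed by a bare command reports only the bare command:

```python
hint = "# On attacker:\nhostname\n\nid\n# On pivot (via SSH):\nhostname"
print(unlabeled_commands(hint))  # ['id']
```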

Prevention status: C10 (_targeted_at in skills/) CI gated. Host label enforcement in hint text: documentation only (no static check exists for comment presence). eval_harness.py gap: #527.

Open Instance Summary¶

As of 2026-05-30, all instances listed in the original inventory (#62–#480) are closed. The two remaining open issues were both filed after #480:

Class	Open issues
1–15 (all)	None — all #62–#480 instances resolved
[Post-#480 Class A] Hint Timeout	#632 (PT-POST-02 john cracking)
[Class 6 / routing]	#692 (PT-EXPLOIT-04 intra-hint routing)

Three new failure classes identified post-#480 (documented in the ARCHER operational inventory; numbered 15–17 in that inventory's local numbering, distinct from Class 15 in this document):

[Post-#480 Class A] — Hint Timeout / Execution Time Overrun: Correct tool, correct config, but wordlist or cracking task exceeds session time budget (601s). Fix: scope wordlists, add time-box checkpoints. (#642 closed, #632 open)
[Post-#480 Class B] — Phantom Pass: Harness grep filter returns [Success: No output] → model narrates expected output → success_fn matches narration text. A v2-critical harness-level artifact, not a model or hint failure. (#681 closed)
[Post-#480 Class C] — Context Saturation (Hint-Level): Fresh session produces 0 commands when hint block exceeds ~600 chars. Identical symptom to Class 7 (VRAM bleed) but mechanism is hint budget, not VRAM depletion. Diagnosis: run in isolation; if still 0 cmds, decompose the objective. (No filed issue; established as CLAUDE.md rule)

Retroactive Audit Targets¶

Three categories of past failures were never filed as issues because they produced false successes rather than observable eval failures. These are data quality audit targets: sessions that passed contemporary filters but are suspect in retrospect. They are not numbered failure classes, and they do not manifest as failing eval runs.

Wrong-host playbook contamination¶

Sessions where success=True was recorded but commands ran against Kali/localhost instead of the intended target. These sessions were written to the playbook — and potentially to ft.jsonl — before _targeted_at guards existed on the relevant success_fns (Classes 4, 11, 15).

Detection method: Scan all playbook sessions for Kali-specific signals in target-expected output fields: loopback addresses (127.0.0.1, ::1), /home/kali/ paths, hostname archer-kali, Kali-default service banners. Any session with these in "success" command output where remote target output is expected is wrong-host contaminated.
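The detection method could start from a signal scan such as the one below. The signal list mirrors the indicators named above (service banners are omitted because no specific banner text is given). The function name and the idea of scanning one output string at a time are assumptions about how playbook sessions are stored.

```python
import re

# Indicators that output came from the Kali attacker host, not the target.
KALI_SIGNALS = {
    "loopback_v4": re.compile(r"\b127\.0\.0\.1\b"),
    # "::1" on its own, not as the tail of a longer IPv6 address.
    "loopback_v6": re.compile(r"(?<![0-9A-Fa-f:])::1(?![0-9A-Fa-f:])"),
    "kali_home": re.compile(r"/home/kali/"),
    "kali_hostname": re.compile(r"\barcher-kali\b"),
}


def suspect_signals(output: str) -> list[str]:
    """Return the names of Kali-specific signals found in command output."""
    return [name for name, pattern in KALI_SIGNALS.items() if pattern.search(output)]
```

A non-empty result flags the session for Auditor review. It is not proof of contamination, since some objectives may legitimately print a loopback address.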

Scope: Playbook entries generated before the _targeted_at guard was added to the relevant success_fns. Affected objectives per issue history: T10, T12, T23, T54, T56. After #501 ships (verified success signal gate), new playbook writes are protected. Existing entries require a one-time retroactive scan.

Remediation action: Retroactive playbook scan script; flag suspect entries for Auditor review; remove confirmed wrong-host entries from playbook and ft.jsonl. File a scoped issue once #501 is closed.

Routing miss undercount¶

Historical routing misses that were never filed as issues because wrong-skill outputs looked plausible at eval time. Class 9 lists 13 routing issues, but 80% of 38,927 routing log entries are unknown — no ground truth. A significant fraction of early eval runs may have been routed to the wrong skill pack and scored anyway, producing training data that reinforces incorrect skill selection.

Detection method: Once #480 (correct_skill passthrough) and #478 (SHA tagging) ship, compare skill_selected vs expected_skill across all eval_label entries. Pre-SHA entries where the two diverge are historical routing misses. The SHA boundary from the #343–#348 routing fix batch identifies the highest-risk epoch.

Scope: All eval runs prior to the #343–#348 routing keyword fix batch. Objectives with known routing ambiguity per Class 9: T8, T25, T42, T43, T44, T45, T47, T49, T52. Requires #478 and #480 to be shipped before the scan is executable.

Remediation action: Post-#478/#480, run routing miss scan against all eval_label routing log entries. Flag skill_selected ≠ expected_skill entries for exclusion from classifier retraining. Quantify the historical miss rate per skill pair.
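Once both fields are available, the per-pair miss rate could be computed roughly as follows. The sketch assumes each routing log entry is a dict carrying `skill_selected` and `expected_skill`, and that the input has already been restricted to eval_label entries. The function name and return shape are illustrative.

```python
from collections import Counter
from typing import Any


def routing_miss_report(entries: list[dict[str, Any]]) -> tuple[float, Counter]:
    """Return (overall miss rate, misses per (expected, selected) skill pair)."""
    pair_counts: Counter = Counter()
    compared = 0
    for entry in entries:
        selected = entry.get("skill_selected")
        expected = entry.get("expected_skill")
        if not selected or not expected:
            continue  # no ground truth, so the entry cannot be scored
        compared += 1
        if selected != expected:
            pair_counts[(expected, selected)] += 1
    miss_rate = sum(pair_counts.values()) / compared if compared else 0.0
    return miss_rate, pair_counts
```

Keying the counter on the (expected, selected) pair shows which skill packs are confused with each other, which is more useful for keyword fixes than a single aggregate rate.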

Depth-blocked contamination window¶

Sessions generated between VRAM bleed manifestation (~issue #180) and filter addition (~issue #369). These depth-blocked, zero-command sessions may be in ft.jsonl from the contaminated window. The downstream filter was added at #369 but sessions generated between #180 and #369 are potentially contaminated even if they appear well-formed — the depth_blocked flag may have been absent from the schema at that time.

Detection method: Identify git SHAs for the commits closing #180 (VRAM bleed identified) and #369 (filter added). Any ft.jsonl session with a generation timestamp in that window and cmds=0 or depth_blocked=True should be excluded from the next retrain. After #478 ships, SHA tags make this lookup direct — no timestamp correlation needed.

Scope: ft.jsonl sessions from the #180→#369 window. Objectives most likely to have been affected: any long-running objective that ran overnight during that period (T6, T10, T12, T21, T53). Check cmds=0 prevalence in that epoch as a proxy for contamination density.

Remediation action: Extract SHA boundaries from git log; filter ft.jsonl by SHA epoch and cmds=0 / depth_blocked fields; exclude identified sessions from next training run. Document excluded count in PROCESSES.md training run log.
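Until SHA tags exist, the timestamp-based variant of this filter could be sketched as below. `cmds` and `depth_blocked` are the fields named above. The `generated_at` field (an ISO 8601 string without timezone) and the function names are assumptions, and the window boundaries would come from the git log lookup, not from this code.

```python
from datetime import datetime
from typing import Any


def is_suspect(session: dict[str, Any], start: datetime, end: datetime) -> bool:
    """True if a session falls in the window and looks depth-blocked."""
    generated = datetime.fromisoformat(session["generated_at"])  # hypothetical field
    if not (start <= generated <= end):
        return False
    # depth_blocked may be absent from older schema versions, so a
    # zero-command session is treated as a proxy for it.
    return session.get("cmds") == 0 or bool(session.get("depth_blocked"))


def split_sessions(
    sessions: list[dict[str, Any]], start: datetime, end: datetime
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (kept, excluded) so the excluded count can be documented."""
    kept = [s for s in sessions if not is_suspect(s, start, end)]
    excluded = [s for s in sessions if is_suspect(s, start, end)]
    return kept, excluded
```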

Cross-Class Meta-Patterns¶

Four meta-patterns appear across multiple failure classes. A fix targeting the meta-pattern closes multiple classes simultaneously.

Meta-pattern A — Startup log liveness check (Classes 2, 3, 11). The pattern: grep -q <startup_string> <logfile> used as a process liveness test. The startup string is written before any later crash, so it remains in the log after the process has died. The fix is always the same: check process existence (tmux has-session, pgrep, /proc/<pid>), not log content. Prevention status: C1 (nohup/& on TUI tools) and C2 (grep -q without -i on log files) gated in check_hints.py. Gap: process liveness via session check (tmux has-session) not yet enforced.
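The difference between the two checks can be shown in a few lines. This is a sketch, not harness code: both function names are hypothetical, and the existence check uses signal 0 via `os.kill`, a POSIX alternative to inspecting /proc/<pid>.

```python
import os


def log_claims_started(log_text: str, startup_string: str) -> bool:
    """The fragile check: stays true even if the process crashed after logging."""
    return startup_string.lower() in log_text.lower()


def pid_is_alive(pid: int) -> bool:
    """Check process existence directly by sending signal 0 (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the permission and existence checks only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

For tools started in tmux, the equivalent existence check is the exit status of `tmux has-session -t <name>`, which also does not depend on log content.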

Meta-pattern B — Missing verification step (Classes 4, 6, 11). The pattern: a hint sets up infrastructure and declares success without confirming that the objective is actually complete. A mandatory verification step, one that produces the specific output the success_fn checks, prevents the entire category. This is the single highest-leverage audit action: grep every hint for a verification step and file a bug for any that lack one. Prevention status: Manual audit in progress (#483). CI gate not yet implemented; C5 is deferred in check_hints.py pending a reliable absence heuristic. Tracking issue: #526.

Meta-pattern C — App-specific without generic companion (Classes 5, 6, 10). Hints that target a specific app (vsftpd, DVWA, Metasploitable2) train the model to solve the box, not the vulnerability class. The two-layer rule (app-specific + generic placeholder companion) addresses all three classes. Reference: docs/research/range-lock-in.md. Prevention status: CI gated; C4 in check_hints.py blocks new violations at commit time. Remediation of existing hardcoded IPs/paths tracked in #432.

Meta-pattern D — Wrong-host execution (Classes 4, 6, 15). The model executes commands on the attacker host instead of the target, passes success checks because the output looks correct (uid=0 on Kali, lynis running locally, etc.), and produces a false positive. _targeted_at guards on success_fns and explicit host labels in hints address this pattern across all three classes. Prevention status: Partially gated; C10 in check_hints.py checks skills/PT-*.py success_fn/verify_fn references. Gap: testenv/eval_harness.py objective success functions (_t*_ family) are not checked. Tracking issue: #527.

Highest-Leverage Remediations¶

Actions that close multiple open issues or prevent entire failure classes. CI status column reflects whether the action is enforced automatically.

Action	Classes addressed	CI status	Tracking
tmux wrapper standard for all TUI tools	2	C1 gated	—
`grep -qi` standard for all log pattern checks	3	C2 gated	—
`_targeted_at` guard on all success_fns (skills/)	4, 11, 15	C10 gated	—
`_targeted_at` guard on eval_harness.py `_t*_` fns	4, 11, 15	Not gated	#527
Exclude [THOUGHT] from success_fn scope	4, 11	Implemented	—
Verification step audit across all hint blocks	4, 6, 11	Not gated	#483, #526
Two-layer rule (app-specific + generic companion)	5, 6, 10	C4 gated	#432
Between-objective VRAM flush	7	Not gated	#451
setup_fn idempotency audit	7, 13	Not gated	#528, #525
`check_hint_lengths.py` → `check_hints.py`	8	C7 gated	—
Pipeline gate checklist (training)	14	Partial	#501

Related¶

docs/research/range-lock-in.md — full analysis of Class 10
#96 — routing log quality analysis (feeds Class 9 and 14 remediations)
#478 — git SHA tagging (Class 9 remediation; required for retroactive routing miss scan)
#480 — correct_skill passthrough (Class 14 remediation; required for routing miss undercount scan)
#481 — eval harness improvement ideas
#482 — this document's tracking issue
#483 — verification step audit (meta-pattern B; manual one-time pass)
#497 — check_hints.py linter (automates Classes 1, 2, 3, 6, 8 prevention)
#498 — failure-class dashboards (velocity, compound heatmap, remediation coverage, epoch view)
#499 — symptom-to-class mapper (automates post-eval failure class diagnosis)
#500 — this doc update (retroactive audit categories)
#501 — success signal verification before playbook write (prevents wrong-host contamination at write time)
#502 — failure class tagging on ft.jsonl sessions
#503 — SHA epoch gating per failure class for retrain filtering
#525 — ligolo TUN route flush in _setup_pivot_range (Class 7/13 instance)
#526 — C5 CI gate: verification step presence check (meta-pattern B enforcement)
#527 — C6 CI gate: _targeted_at guard audit for eval_harness.py (meta-pattern D gap)
#528 — setup_fn idempotency audit across all pivot/AD/web objectives (Class 7)
