V1 to V2 Evolution
V1 was always meant to act as a training system for V2, but V2 is not a rewrite. It is V1 with three components swapped, in a specific order, based on what V1 data reveals. The core loop, skill pack architecture, playbook system, eval harness, and container deployment model all carry forward unchanged.
What Changes
V1 pipeline:
task → keyword_scorer → hints_fn → [prompt] → LLM (free text) → regex_parser → execute_command → halt_fn
V2 pipeline:
task → classifier → hints_fn → [prompt] → LLM (JSON schema) → tool_dispatcher → done_check
Swap 1: Structured Output (Phase 2)
V1: Model emits free text. ARCHER extracts bash commands, findings blocks, and the [OBJECTIVE_ACHIEVED] token using regex (~300 lines of parsing code).
V2: Model returns a typed JSON response every turn:
{
  "thought": "reasoning about current state",
  "command": "the command to run",
  "done": false,
  "findings": "what this output reveals"
}
When done is true, the loop stops. No token parsing. No regex. The entire parsing layer is deleted.
Gate to proceed: JSON validity rate ≥ 95% across the full objective suite.
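As a rough illustration of what replaces the regex layer, the sketch below validates one turn against the four fields shown above and computes a validity rate over a batch of raw responses. The names TurnResponse, parse_turn, and validity_rate are hypothetical and not taken from ARCHER; the strict type check on each field is an assumption about how "valid" is defined.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnResponse:
    thought: str
    command: str
    done: bool
    findings: str


# Field names and types from the schema above.
_FIELDS = {"thought": str, "command": str, "done": bool, "findings": str}


def parse_turn(raw: str) -> Optional[TurnResponse]:
    """Return a TurnResponse, or None if the text is not a valid turn."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected in _FIELDS.items():
        # A missing key yields None, which fails the isinstance check.
        if not isinstance(data.get(name), expected):
            return None
    return TurnResponse(**{name: data[name] for name in _FIELDS})


def validity_rate(raw_responses: list[str]) -> float:
    """Fraction of responses that parse into a valid TurnResponse."""
    if not raw_responses:
        return 0.0
    valid = sum(parse_turn(r) is not None for r in raw_responses)
    return valid / len(raw_responses)
```

Under these assumptions the gate reduces to checking validity_rate(responses) >= 0.95 over the responses collected from the objective suite.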
Swap 2: Trained Classifier (Phase 3)
V1: detect_skill_category() scores keywords, applies penalties and bonuses via hand-written functions. ~200 lines of scoring logic across four pack files.
V2: A TF-IDF + logistic regression classifier trained on labeled routing examples from the eval harness. Single classify_task(task) -> (category, confidence) call. The entire keyword scorer, all bonus functions, and the LLM confidence gate are replaced.
Gate to proceed: 500+ high-confidence labeled examples per skill category.
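A minimal sketch of the classifier shape described above, assuming scikit-learn as the library (the source names only the technique). The training examples and the category names "recon" and "web" are hypothetical placeholders; real labels would come from the eval harness routing log.

```python
from __future__ import annotations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled routing examples (task text, skill category).
TRAIN = [
    ("scan the host for open ports", "recon"),
    ("enumerate services on the target network", "recon"),
    ("find open ports and service versions", "recon"),
    ("test the login form for sql injection", "web"),
    ("check the web app for cross-site scripting", "web"),
    ("fuzz the http endpoints for hidden directories", "web"),
]

texts = [text for text, _ in TRAIN]
labels = [label for _, label in TRAIN]

# Word unigrams and bigrams feed a multiclass logistic regression.
clf = LogisticRegression(max_iter=1000)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
model.fit(texts, labels)


def classify_task(task: str) -> tuple[str, float]:
    """Return the most probable category and its predicted probability."""
    probs = model.predict_proba([task])[0]
    best = probs.argmax()
    return str(clf.classes_[best]), float(probs[best])
```

With only a handful of examples the confidence values are not meaningful; the sketch shows the call shape, not the expected accuracy at the 500-label gate.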
Swap 3: Function Calling (Phase 4)
V1: Model generates a bash string. ARCHER validates it against a blocklist (_is_destructive_command()), protected path guards, and single-command checks before execution.
V2: Model calls typed tools:
run_command(cmd: str, timeout: int) -> CommandResult
save_finding(title: str, evidence: str, severity: Literal["info","low","med","high","crit"]) -> None
mark_done(summary: str) -> None
Gate to proceed: Structured output (Phase 2) must be stable in production.
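One way the three tools and a dispatcher could fit together is sketched below. Only the tool signatures come from the text above; the Session class, the dispatch method, the CommandResult fields, and the choice to split the command into argv and run it without a shell are all assumptions made for illustration.

```python
import shlex
import subprocess
from dataclasses import dataclass, field
from typing import Any

SEVERITIES = ("info", "low", "med", "high", "crit")


@dataclass
class CommandResult:
    # Field names are assumptions; the source only names the type.
    exit_code: int
    stdout: str
    stderr: str


@dataclass
class Session:
    findings: list = field(default_factory=list)
    done: bool = False
    summary: str = ""

    def run_command(self, cmd: str, timeout: int) -> CommandResult:
        # The string is split into argv and executed without a shell,
        # so pipes and redirections are not interpreted (assumed design).
        proc = subprocess.run(
            shlex.split(cmd), capture_output=True, text=True, timeout=timeout
        )
        return CommandResult(proc.returncode, proc.stdout, proc.stderr)

    def save_finding(self, title: str, evidence: str, severity: str) -> None:
        # Literal[...] is not enforced at runtime, so check explicitly.
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        self.findings.append(
            {"title": title, "evidence": evidence, "severity": severity}
        )

    def mark_done(self, summary: str) -> None:
        self.done = True
        self.summary = summary

    def dispatch(self, name: str, args: dict) -> Any:
        """Route a model tool call to the matching method."""
        tools = {
            "run_command": self.run_command,
            "save_finding": self.save_finding,
            "mark_done": self.mark_done,
        }
        if name not in tools:
            raise ValueError(f"unknown tool: {name!r}")
        # Missing or unexpected argument names raise TypeError here.
        return tools[name](**args)
```

In this sketch a call such as session.dispatch("mark_done", {"summary": "..."}) sets the done flag directly, so the loop checks session.done instead of scanning output for a completion token.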
What Stays
| Component | Status |
|---|---|
| Skill pack architecture (halt/hint/bonus dispatchers) | Keep - correct abstraction |
| Playbook (task → winning_command) | Keep - auto-seeding via eval harness is in place |
| --kali exec target pattern | Keep - container routing is clean |
| Eval harness + 27 objectives | Keep - expand coverage per domain |
| WATCHLIST process | Keep - discipline for unproven complexity |
| Campaign / loot system | Keep - core operational value |
What Gets Deleted
| Component | Reason |
|---|---|
| detect_skill_category() + all bonus_fn definitions | Replaced by trained classifier |
| Regex output parser (~300 lines) | Replaced by JSON schema enforcement |
| _is_destructive_command() guards | Replaced by typed tool definitions |
| _has_objective_achieved() | Replaced by done: true field |
| SYSTEM_PROMPT_ADDENDUM char limits | JSON schema compresses prompt overhead - limits can relax |
Domain-Tuned Models (Phase 5)
The long-term V2 target: one quantized ≤7B model per skill domain, loaded on demand via Ollama. A lightweight ~1-3B router model stays resident; the domain model swaps in per session.
Gate: 100+ successful sessions per domain (command sequences become fine-tuning examples). Fine-tuning runs on RunPod A100 via Unsloth QLoRA. Adapter exported to GGUF and registered with local Ollama.
Hardware constraint: 8 GB VRAM means one model at a time. The router must be small enough to coexist with the loaded domain model - target ≤1B for the classifier.
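The memory budget can be checked with rough arithmetic. The sketch below counts weights only and assumes an average of 4.5 bits per weight for a 4-bit-class quantization; the real figure depends on the quantization scheme, and KV cache, activations, and runtime overhead come on top.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


BITS = 4.5     # assumed average bits per weight; varies by quantization scheme
VRAM_GB = 8.0  # hardware limit stated above

domain = weights_gb(7, BITS)           # 3.9375 GB for a 7B domain model
router = weights_gb(1, BITS)           # 0.5625 GB for a 1B router
headroom = VRAM_GB - domain - router   # 3.5 GB left for cache and overhead
two_domains = 2 * weights_gb(7, BITS)  # 7.875 GB, effectively the whole card

print(f"domain={domain:.2f} router={router:.2f} headroom={headroom:.2f}")
print(f"two 7B models={two_domains:.2f} of {VRAM_GB:.0f} GB")
```

Under these assumptions a 7B domain model plus a 1B router leaves a few gigabytes of headroom, while two 7B models would consume almost all 8 GB before any cache is allocated, which is consistent with loading one domain model at a time.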
Current Status
| Phase | Status |
|---|---|
| Phase 1 - Instrumentation | Complete. Routing log, command logs, session ft.jsonl, audit review all in place. |
| Phase 2 - Structured output | Gate passed. ARCHER effective JSON validity rate: 96.6% (Issue #151). ft.jsonl now stores repaired responses, not raw model output, so fine-tuning examples are clean (Issue #155). Default in all collection runs. |
| Phase 3 - Trained classifier | Blocked on data volume. 173 labels collected; 500/skill required. Router label deduplication fixed (Issue #150); classifier will train on distinct task phrasings only. |
| Phase 4 - Function calling | Deferred until Phase 2 is stable in production. |
| Phase 5 - Domain-tuned models | Deferred until Phase 3 data thresholds are met. |
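The Phase 3 data gate can be expressed as a per-category shortfall over deduplicated labels. This is a sketch only: the function name label_gate is hypothetical, and treating two tasks as the same phrasing when they match after trimming and lowercasing is an assumption about what "distinct" means here.

```python
from collections import Counter


def label_gate(examples, required=500):
    """Map each seen category to the number of labels still missing.

    examples: iterable of (task_text, category) pairs.
    A shortfall of 0 means the category meets the threshold.
    """
    # Collapse repeated phrasings so each distinct task counts once.
    distinct = {(task.strip().lower(), category) for task, category in examples}
    counts = Counter(category for _, category in distinct)
    return {cat: max(0, required - n) for cat, n in counts.items()}
```

Categories with no examples at all do not appear in the result, so a caller would need to compare the keys against the full skill list before declaring the gate passed.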
Where V1 Falls Into the Stochastic Trap
The V2 roadmap is a direct response to five places in V1 where a deterministic problem was handed to the model and code absorbed the fallout. Each one is identifiable by the compensating logic that surrounds it.
1. Output format compliance
The problem: The model is asked to reliably emit a specific text structure - [THOUGHT], bash block, [FINDINGS], [OBJECTIVE_ACHIEVED] - every turn. It mostly does. The ~300 lines of regex parsing code exist because "mostly" is not "always." Each line of that parser handles a variation the model introduced: an extra newline before the code fence, a finding nested inside the bash block, the completion token emitted without brackets. The parsing layer is the cost of using a stochastic system for a formatting contract.
The compensating logic: ~300 lines in ARCHER.py handling output variations.
Remediation: V2 Swap 1 - structured JSON output. The model returns a typed schema every turn; the parsing layer is deleted. Gate: JSON validity ≥95% across the full objective suite. Status: active under --json-output flag.
Phase 2 moves format enforcement from regex parsing to schema validation, but compliance is still instruction-dependent: the system prompt specifies the JSON schema, and the model follows it with variable reliability. The deeper fix is Phase 5: fine-tuning on successful sessions so the model acquires the schema as trained behavior rather than a prompt directive. Phase 5 includes a mandatory validation gate (Issue #427): after fine-tuning, the eval suite is run with the format specification removed from the system prompt. Pass condition: malformed-output rate ≤ baseline + 5pp (percentage points). If the test fails, the adapter requires additional training before promotion. Format compliance is not considered weight-resident until this test passes.
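The pass condition is a comparison in percentage points, not a relative change. A minimal sketch, assuming the baseline is the malformed-output rate measured on the same suite with the format specification still in the prompt (the source does not define the baseline); the function name and parameters are hypothetical.

```python
def format_gate_passes(
    baseline_malformed: int,
    baseline_total: int,
    ablated_malformed: int,
    ablated_total: int,
    tolerance_pp: float = 5.0,
) -> bool:
    """True if the ablated malformed rate is within tolerance of baseline.

    Rates are in percent; tolerance_pp is in percentage points.
    """
    baseline_pct = 100.0 * baseline_malformed / baseline_total
    ablated_pct = 100.0 * ablated_malformed / ablated_total
    return ablated_pct <= baseline_pct + tolerance_pp
```

For example, with a 3% baseline the adapter passes at a 7% malformed rate without the prompt specification and fails at 9%.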
2. Completion signaling
The problem: [OBJECTIVE_ACHIEVED] is emitted when the model decides it is done. Its judgment about "done" is unreliable, which is why halt_fn, keyword matching on findings text, and the max_commands ceiling all exist in parallel. The entire halt system is deterministic scaffolding compensating for stochastic self-termination. When the scaffolding has a gap (as in the PT-WEBEX-03 case, where 25 commands ran at max_commands=6), sessions run far past their declared budget.
The compensating logic: should_halt_objective(), per-skill halt_fn dispatchers, max_commands ceiling, keyword matching in findings text.
Remediation: Two-part. V2 Swap 1 replaces the [OBJECTIVE_ACHIEVED] token with done: true in the JSON schema - more structured, but still stochastic until the model is fine-tuned. V2 Phase 5 (domain-tuned models) encodes correct halt behavior in the model weights through training on sessions that ended at the right point. The halt system becomes a safety net rather than the primary mechanism.
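The "safety net" arrangement can be sketched as a loop in which the model's done flag is the normal exit and the deterministic checks bound it. The function run_session and its callback parameters are hypothetical stand-ins, not ARCHER's actual interfaces.

```python
from __future__ import annotations

from typing import Callable


def run_session(
    next_turn: Callable[[int], dict],
    execute: Callable[[str], str],
    halt_fn: Callable[[str], bool],
    max_commands: int,
) -> tuple[int, str]:
    """Run turns until the model, halt_fn, or the budget stops the session.

    Returns (commands_executed, stop_reason).
    """
    executed = 0
    # The ceiling is the loop condition itself, so no branch can bypass it.
    while executed < max_commands:
        turn = next_turn(executed)
        if turn["done"]:
            return executed, "model_done"
        output = execute(turn["command"])
        executed += 1
        if halt_fn(output):
            return executed, "halt_fn"
    return executed, "budget_exhausted"
```

Making the budget the loop condition, instead of one check among several inside the body, is a design choice that guarantees the executed count cannot exceed max_commands even when the model never sets done.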
3. The LLM routing gate
The problem: When keyword scoring produces a score gap of ≤2 between the top two skills, ARCHER makes a single Ollama call to resolve the ambiguity. This is a textbook stochastic trap: using the model to answer a classification question that has a correct answer. It adds 600-850ms of latency and sits on the WATCHLIST because it is measurably expensive with unproven benefit.
The compensating logic: _ROUTING_MODEL gate in detect_skill_category(); WATCHLIST entry with 5-10% pass-rate improvement as the proof condition.
Remediation: V2 Swap 2 - trained TF-IDF + logistic regression classifier replaces the keyword scorer, all bonus functions, and the LLM gate entirely. A single classify_task() call returns a category and confidence in under 5ms. Gate: 500 high-confidence labeled examples per skill. Status: blocked at 173 labels.
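For reference, the V1 condition that triggers the extra model call amounts to a gap check on the two highest keyword scores. The sketch below is illustrative only: needs_llm_gate is a hypothetical name, and the skill names in the usage note are placeholders.

```python
from __future__ import annotations


def needs_llm_gate(scores: dict[str, int], max_gap: int = 2) -> bool:
    """True when the top two keyword scores are within max_gap of each other."""
    if len(scores) < 2:
        return False
    top, runner_up = sorted(scores.values(), reverse=True)[:2]
    return top - runner_up <= max_gap
```

With scores of {"web": 7, "recon": 5} the gap is 2 and the gate fires; with {"web": 9, "recon": 5} it does not. In V2 this branch disappears because classify_task returns a confidence directly.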
4. Hint compliance
The problem: hints_fn injects CORRECT and WRONG examples directly into the prompt to steer the model toward the right command for a specific target. The model is supposed to follow these consistently. It does not - producing the "bad hint" class of eval failures where the model reads the correct example and still issues a structurally wrong command, or follows the wrong branch of a conditional hint. You are asking a stochastic system to reliably execute a specific command structure against a specific target. The hints become a reliability dependency rather than a supplement.
The compensating logic: Growing per-target hint blocks in each hints_fn; the PT-WEBEX-02 and PT-WEBEX-03 eval failures and their corresponding GitHub issues.
Remediation: V2 Phase 5 (domain-tuned models). A model fine-tuned on hundreds of successful ARCHER sessions for a specific domain already knows the correct tool invocations, flags, and target-specific paths. Hints shift from primary instruction to reinforcement. The hints_fn system stays but its failure mode becomes rare rather than routine.
5. Command safety validation
The problem: _is_destructive_command() catches dangerous commands after the model generates them. The model is implicitly relied on not to generate destructive commands in the first place - which is stochastic. The validation layer exists because that reliance is not safe. It is string matching applied after the fact to a problem that should be structural.
The compensating logic: _is_destructive_command() blocklist, protected path guards, single-command enforcement in ARCHER.py.
Remediation: V2 Swap 3 - function calling with typed tool signatures. The model calls run_command(cmd: str, timeout: int) rather than generating a raw bash string. Command injection is prevented by the type system. The destructive command validation layer is deleted because the attack surface it guards against no longer exists. Gate: Swap 1 (structured output) must be stable in production first.
Remediation Map
| Issue | V1 compensating logic | V2 fix | Phase | Gate |
|---|---|---|---|---|
| Output format compliance | ~300 lines regex parser | Structured JSON output (Phase 2) + weight-resident compliance (Phase 5) | 2 + 5 | Phase 2: JSON validity ≥95%; Phase 5: format compliance validation test passes (Issue #427) |
| Completion signaling | halt_fn + max_commands ceiling | done: true field + fine-tuned halt behavior | 2 + 5 | Phase 2 stable; 100+ sessions/domain |
| LLM routing gate | Ollama call on score gap ≤2 | Trained TF-IDF+LR classifier | 3 | 500 labels/skill |
| Hint compliance | Per-target CORRECT/WRONG blocks | Domain-tuned model encodes correct invocations | 5 | Phase 3 complete |
| Command safety | Destructive command blocklist | Typed function calling | 4 | Phase 2 stable |