Skip to content

V1 to V2 Evolution

V1 was always meant to act as a training system for V2, but V2 is not a rewrite. It is V1 with three components swapped, in a specific order, based on what V1 data reveals. The core loop, skill pack architecture, playbook system, eval harness, and container deployment model all carry forward unchanged.

What Changes

V1 pipeline:
  task → keyword_scorer → hints_fn → [prompt] → LLM (free text) → regex_parser → execute_command → halt_fn

V2 pipeline:
  task → classifier → hints_fn → [prompt] → LLM (JSON schema) → tool_dispatcher → done_check

Swap 1: Structured Output (Phase 2)

V1: Model emits free text. ARCHER extracts bash commands, findings blocks, and the [OBJECTIVE_ACHIEVED] token using regex (~300 lines of parsing code).

V2: Model returns a typed JSON response every turn:

{
  "thought": "reasoning about current state",
  "command": "the command to run",
  "done": false,
  "findings": "what this output reveals"
}
When done is true, the loop stops. No token parsing. No regex. The entire parsing layer is deleted.

Gate to proceed: JSON validity rate ≥ 95% across the full objective suite.

Swap 2: Trained Classifier (Phase 3)

V1: detect_skill_category() scores keywords, applies penalties and bonuses via hand-written functions. ~200 lines of scoring logic across four pack files.

V2: A TF-IDF + logistic regression classifier trained on labeled routing examples from the eval harness. Single classify_task(task) -> (category, confidence) call. The entire keyword scorer, all bonus functions, and the LLM confidence gate are replaced.

Gate to proceed: 500+ high-confidence labeled examples per skill category.

Swap 3: Function Calling (Phase 4)

V1: Model generates a bash string. ARCHER validates it against a blocklist (_is_destructive_command()), protected path guards, and single-command checks before execution.

V2: Model calls typed tools:

run_command(cmd: str, timeout: int) -> CommandResult
save_finding(title: str, evidence: str, severity: Literal["info","low","med","high","crit"]) -> None
mark_done(summary: str) -> None
Command injection is prevented by the tool type signature. The security validation layer is deleted.

Gate to proceed: Structured output (Phase 2) must be stable in production.

What Stays

Component Status
Skill pack architecture (halt/hint/bonus dispatchers) Keep - correct abstraction
Playbook (task → winning_command) Keep - auto-seeding via eval harness is in place
--kali exec target pattern Keep - container routing is clean
Eval harness + 27 objectives Keep - expand coverage per domain
WATCHLIST process Keep - discipline for unproven complexity
Campaign / loot system Keep - core operational value

What Gets Deleted

Component Reason
detect_skill_category() + all bonus_fn definitions Replaced by trained classifier
Regex output parser (~300 lines) Replaced by JSON schema enforcement
_is_destructive_command() guards Replaced by typed tool definitions
_has_objective_achieved() Replaced by done: true field
SYSTEM_PROMPT_ADDENDUM char limits JSON schema compresses prompt overhead - limits can relax

Domain-Tuned Models (Phase 5)

The long-term V2 target: one quantized ≤7B model per skill domain, loaded on demand via Ollama. A lightweight ~1-3B router model stays resident; the domain model swaps in per session.

task → classifier → (skill_category, model_id) → domain_model → tool_dispatcher

Gate: 100+ successful sessions per domain (command sequences become fine-tuning examples). Fine-tuning runs on RunPod A100 via Unsloth QLoRA. Adapter exported to GGUF and registered with local Ollama.

Hardware constraint: 8 GB VRAM means one model at a time. The router must be small enough to coexist with the loaded domain model - target ≤1B for the classifier.

Current Status

Phase Status
Phase 1 - Instrumentation Complete. Routing log, command logs, session ft.jsonl, audit review all in place.
Phase 2 - Structured output Gate passed. ARCHER effective JSON validity rate: 96.6% (Issue #151). ft.jsonl now stores repaired responses, not raw model output, so fine-tuning examples are clean (Issue #155). Default in all collection runs.
Phase 3 - Trained classifier Blocked on data volume. 173 labels collected; 500/skill required. Router label deduplication fixed (Issue #150); classifier will train on distinct task phrasings only.
Phase 4 - Function calling Deferred until Phase 2 is stable in production.
Phase 5 - Domain-tuned models Deferred until Phase 3 data thresholds are met.

Where V1 Falls Into the Stochastic Trap

The V2 roadmap is a direct response to five places in V1 where a deterministic problem was handed to the model and code absorbed the fallout. Each one is identifiable by the compensating logic that surrounds it.

1. Output format compliance

The problem: The model is asked to reliably emit a specific text structure - [THOUGHT], bash block, [FINDINGS], [OBJECTIVE_ACHIEVED] - every turn. It mostly does. The ~300 lines of regex parsing code exist because "mostly" is not "always." Each line of that parser handles a variation the model introduced: an extra newline before the code fence, a finding nested inside the bash block, the completion token emitted without brackets. The parsing layer is the cost of using a stochastic system for a formatting contract.

The compensating logic: ~300 lines in ARCHER.py handling output variations.

Remediation: V2 Swap 1 - structured JSON output. The model returns a typed schema every turn; the parsing layer is deleted. Gate: JSON validity ≥95% across the full objective suite. Status: active under --json-output flag.

Phase 2 moves format enforcement from regex parsing to schema validation — but compliance is still instruction-dependent. The system prompt specifies the JSON schema; the model follows it with variable reliability. The deeper fix is Phase 5: fine-tuning on successful sessions so the model acquires the schema as trained behavior rather than a prompt directive. Phase 5 includes a mandatory validation gate (Issue #427): after fine-tuning, the eval suite is run with the format specification removed from the system prompt. Pass condition: malformed-output rate ≤ baseline + 5pp. If the test fails, the adapter requires additional training before promotion. Format compliance is not considered weight-resident until this test passes.


2. Completion signaling

The problem: [OBJECTIVE_ACHIEVED] is emitted when the model decides it is done. Its judgment about "done" is unreliable - which is why halt_fn, keyword matching on findings text, command count ceilings, and max_commands all exist in parallel. The entire halt system is deterministic scaffolding compensating for stochastic self-termination. When the scaffolding has a gap (as in the PT-WEBEX-03 case where 25 commands ran at max_commands=6), sessions run indefinitely past their declared budget.

The compensating logic: should_halt_objective(), per-skill halt_fn dispatchers, max_commands ceiling, keyword matching in findings text.

Remediation: Two-part. V2 Swap 1 replaces the [OBJECTIVE_ACHIEVED] token with done: true in the JSON schema - more structured, but still stochastic until the model is fine-tuned. V2 Phase 5 (domain-tuned models) encodes correct halt behavior in the model weights through training on sessions that ended at the right point. The halt system becomes a safety net rather than the primary mechanism.


3. The LLM routing gate

The problem: When keyword scoring produces a score gap of ≤2 between the top two skills, ARCHER makes a single Ollama call to resolve the ambiguity. This is textbook stochastic trap: using the model to answer a classification question that has a correct answer. It adds 600-850ms of latency and sits on the WATCHLIST because it is measurably expensive with unproven benefit.

The compensating logic: _ROUTING_MODEL gate in detect_skill_category(); WATCHLIST entry with 5-10% pass-rate improvement as the proof condition.

Remediation: V2 Swap 2 - trained TF-IDF + logistic regression classifier replaces the keyword scorer, all bonus functions, and the LLM gate entirely. A single classify_task() call returns a category and confidence in under 5ms. Gate: 500 high-confidence labeled examples per skill. Status: blocked at 173 labels.


4. Hint compliance

The problem: hints_fn injects CORRECT and WRONG examples directly into the prompt to steer the model toward the right command for a specific target. The model is supposed to follow these consistently. It does not - producing the "bad hint" class of eval failures where the model reads the correct example and still issues a structurally wrong command, or follows the wrong branch of a conditional hint. You are asking a stochastic system to reliably execute a specific command structure against a specific target. The hints become a reliability dependency rather than a supplement.

The compensating logic: Growing per-target hint blocks in each hints_fn; the PT-WEBEX-02 and PT-WEBEX-03 eval failures and their corresponding GitHub issues.

Remediation: V2 Phase 5 (domain-tuned models). A model fine-tuned on hundreds of successful ARCHER sessions for a specific domain already knows the correct tool invocations, flags, and target-specific paths. Hints shift from primary instruction to reinforcement. The hints_fn system stays but its failure mode becomes rare rather than routine.


5. Command safety validation

The problem: _is_destructive_command() catches dangerous commands after the model generates them. The model is implicitly relied on not to generate destructive commands in the first place - which is stochastic. The validation layer exists because that reliance is not safe. It is string matching applied after the fact to a problem that should be structural.

The compensating logic: _is_destructive_command() blocklist, protected path guards, single-command enforcement in ARCHER.py.

Remediation: V2 Swap 3 - function calling with typed tool signatures. The model calls run_command(cmd: str, timeout: int) rather than generating a raw bash string. Command injection is prevented by the type system. The destructive command validation layer is deleted because the attack surface it guards against no longer exists. Gate: Swap 1 (structured output) must be stable in production first.


Remediation Map

Issue V1 compensating logic V2 fix Phase Gate
Output format compliance ~300 lines regex parser Structured JSON output (Phase 2) + weight-resident compliance (Phase 5) 2 + 5 Phase 2: JSON validity ≥95%; Phase 5: format compliance validation test passes (Issue #427)
Completion signaling halt_fn + max_commands ceiling done: true field + fine-tuned halt behavior 2 + 5 Phase 2 stable; 100+ sessions/domain
LLM routing gate Ollama call on score gap ≤2 Trained TF-IDF+LR classifier 3 500 labels/skill
Hint compliance Per-target CORRECT/WRONG blocks Domain-tuned model encodes correct invocations 5 Phase 3 complete
Command safety Destructive command blocklist Typed function calling 4 Phase 2 stable