
ARCHER in Action

Three recordings of ARCHER running real penetration testing objectives against live vulnerable targets, with more being added as clean sessions are captured. No edits, no scripted responses — each recording is a single unmodified agent session.


Network reconnaissance — ARP scan

What you're watching

Task: discover live hosts on 192.168.56.0/24 using an ARP scan

ARCHER receives the task, the router classifies it as network_reconnaissance, and the skill pack injects tool guidance into the system prompt. The model issues a single arp-scan command; the code layer executes it inside the archer-kali container, captures the output, and feeds it back. The model reads the MAC/IP pairs, extracts the findings, and signals completion with [OBJECTIVE_ACHIEVED].
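The step where MAC/IP pairs are pulled out of the scan output can be sketched as follows. This is a minimal illustration, not ARCHER's actual parser: the function name `parse_arp_scan`, the regular expression, and the sample text (including the addresses and vendor strings) are assumptions made up for this example. It relies only on arp-scan printing one tab-separated `IP  MAC  vendor` line per responding host.

```python
import re

# Made-up sample in the tab-separated layout arp-scan prints per responding host.
# Header and footer lines are abbreviated; all addresses are illustrative only.
SAMPLE_OUTPUT = "\n".join([
    "Interface: eth1, type: EN10MB",
    "192.168.56.1\t0a:00:27:00:00:01\t(Unknown: locally administered)",
    "192.168.56.103\t08:00:27:12:34:56\tPCS Systemtechnik GmbH",
    "",
    "2 packets received by filter, 0 packets dropped by kernel",
])

# A host line starts with an IPv4 address followed by a colon-separated MAC.
HOST_LINE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<mac>[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})\s*"
    r"(?P<vendor>.*)$"
)


def parse_arp_scan(output: str) -> list[dict[str, str]]:
    """Return one dict per responding host; non-host lines are ignored."""
    hosts = []
    for line in output.splitlines():
        match = HOST_LINE.match(line.strip())
        if match:
            hosts.append({
                "ip": match.group("ip"),
                "mac": match.group("mac").lower(),
                "vendor": match.group("vendor"),
            })
    return hosts


if __name__ == "__main__":
    for host in parse_arp_scan(SAMPLE_OUTPUT):
        print(host["ip"], host["mac"], host["vendor"])
```

Header and summary lines fail the pattern and are skipped, which is what makes this case deterministic: only lines that begin with an IPv4 address and a MAC address become findings.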

This is the simplest case in the agent loop: one command, deterministic output, unambiguous completion. It shows the full cycle — task in, tool run, findings out — without noise.


Web enumeration — PHP and text file discovery

What you're watching

Task: enumerate PHP and text files on the web server at 192.168.56.103 using gobuster or ffuf

The task names a specific tool. ARCHER's skill router detects the "using gobuster or ffuf" phrasing and injects a tool-enforcement directive: if gobuster or ffuf is available, the model must use one of them rather than substituting an equivalent tool. This is the code layer enforcing scope, not the model deciding on its own.
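One way such a check could look is sketched below. Everything here is hypothetical: the `KNOWN_TOOLS` list, the function names, the regular expression, and the directive wording are assumptions for illustration, and ARCHER's real router may detect the phrasing quite differently. The sketch finds a trailing "using ..." clause, keeps only recognised tool names, and emits a directive when at least one named tool is installed.

```python
import re
import shutil

# Hypothetical allow-list of tool names the router recognises.
KNOWN_TOOLS = ("gobuster", "ffuf", "nmap", "nikto")

# Capture everything after the word "using" up to the end of the task.
USING_CLAUSE = re.compile(r"\busing\s+(.+)$", re.IGNORECASE)


def named_tools(task: str) -> list[str]:
    """Return the known tools named in a trailing 'using X or Y' clause."""
    match = USING_CLAUSE.search(task)
    if not match:
        return []
    clause = match.group(1).lower()
    candidates = re.split(r"\s*(?:,|\bor\b|\band\b)\s*", clause)
    return [c.strip() for c in candidates if c.strip() in KNOWN_TOOLS]


def enforcement_directive(task: str, is_available=shutil.which):
    """Build a directive string, or None when no named tool is available."""
    tools = [tool for tool in named_tools(task) if is_available(tool)]
    if not tools:
        return None
    return (
        "Use one of the following tools and do not substitute an equivalent: "
        + ", ".join(tools)
    )


if __name__ == "__main__":
    task = ("enumerate PHP and text files on the web server at "
            "192.168.56.103 using gobuster or ffuf")
    # Pretend every tool is installed so the example runs anywhere.
    print(enforcement_directive(task, is_available=lambda tool: True))
```

Passing the availability check in as a parameter keeps the decision in code and makes it easy to test without the tools installed. A task such as the ARP scan objective, which says "using an ARP scan" without naming a recognised tool, yields no directive.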

The session shows gobuster running with a PHP/text-file extension filter, returning discovered paths, and the model extracting the relevant findings before signalling completion.
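For readers unfamiliar with gobuster, the sketch below shows roughly what such a run and its parsing could involve. The `dir` mode and the `-u`, `-w`, and `-x` flags are real gobuster options; the wordlist path, the helper names, and the sample output lines are assumptions for illustration and are not taken from the recording. The parser assumes the `/path (Status: NNN) [Size: N]` line layout used by gobuster 3.x.

```python
import re


def gobuster_command(url: str, wordlist: str, extensions: list[str]) -> list[str]:
    """Build an argv list for a gobuster directory scan with an extension filter."""
    return ["gobuster", "dir", "-u", url, "-w", wordlist, "-x", ",".join(extensions)]


# Matches result lines such as: /index.php   (Status: 200) [Size: 1024]
RESULT_LINE = re.compile(r"^(?P<path>/\S*)\s+\(Status:\s*(?P<status>\d{3})\)")


def parse_results(output: str) -> list[tuple[str, int]]:
    """Return (path, status) pairs; banner and progress lines are ignored."""
    findings = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            findings.append((match.group("path"), int(match.group("status"))))
    return findings


# Made-up sample output; paths and sizes are illustrative only.
SAMPLE_OUTPUT = "\n".join([
    "===============================================================",
    "/index.php            (Status: 200) [Size: 1024]",
    "/notes.txt            (Status: 200) [Size: 87]",
    "/admin.php            (Status: 302) [Size: 0] [--> login.php]",
    "===============================================================",
])

if __name__ == "__main__":
    # The wordlist path is an assumption; use whichever list the container ships.
    print(" ".join(gobuster_command(
        "http://192.168.56.103",
        "/usr/share/wordlists/dirb/common.txt",
        ["php", "txt"],
    )))
    for path, status in parse_results(SAMPLE_OUTPUT):
        print(status, path)
```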


Web enumeration — directory scan

What you're watching

Task: enumerate directories on the web server at 192.168.56.103

A broader directory scan with no tool constraint. The model selects a wordlist and tool, runs the scan, and works through the output to identify directories worth noting. Compare this to the PHP/text-file session above: same target, different scope, different tool choice, different findings.

This recording also shows ARCHER's session logging in action — every command and its full output is written to ~/.archer_sessions/ as a structured ft.jsonl file. Sessions like this, once they pass the two-tier audit pipeline, become training data for the V2 specialist model.
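The idea of a JSON Lines session log can be illustrated with a short sketch. The record fields (`timestamp`, `command`, `output`), the function names, and the file name below are hypothetical; the actual schema of ARCHER's session files is not described here. The point is only that each command and its full output become one self-contained JSON object per line, which is easy to append to during a run and easy to audit afterwards.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_turn(session_file: Path, command: str, output: str) -> None:
    """Append one command/output pair to the session file as a JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "output": output,  # stored in full, not truncated
    }
    session_file.parent.mkdir(parents=True, exist_ok=True)
    with session_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def read_session(session_file: Path) -> list[dict]:
    """Load every record back, one JSON object per non-empty line."""
    with session_file.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # A temporary directory stands in for the real session directory.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_session.jsonl"  # hypothetical file name
        log_turn(path, "echo first", "first\n")
        log_turn(path, "echo second", "second\n")
        for record in read_session(path):
            print(record["command"], "->", repr(record["output"]))
```

Because `json.dumps` escapes newlines inside strings, multi-line tool output still occupies exactly one line of the file.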


How the agent loop works

Each recording follows the same cycle:

  1. Task received — plain-English instruction, no structured input required
  2. Router — TF-IDF+LR classifier maps the task to the correct skill pack
  3. Hints injected — the skill pack adds tool guidance, completion indicators, and scope constraints to the system prompt
  4. Model turn — [THOUGHT] reasoning block, then a bash command block
  5. Execution — command runs inside the archer-kali Kali Linux container; output returned to the model
  6. Repeat — until the model emits [OBJECTIVE_ACHIEVED] or the code layer's command ceiling is reached
  7. Findings extracted — structured parsing by the code layer, not model summarization
  8. Session logged — full transcript written for audit and fine-tuning pipeline
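The control flow of steps 4 to 6 can be sketched as a bounded loop. This is a simplified illustration under stated assumptions, not ARCHER's implementation: the function names, the `max_commands` default, and the return shape are hypothetical; the model and the executor are stand-in callables; and each model reply is treated as a bare command rather than a [THOUGHT] block followed by a bash block. Only the [OBJECTIVE_ACHIEVED] marker and the existence of a command ceiling come from the description above.

```python
from typing import Callable

OBJECTIVE_MARKER = "[OBJECTIVE_ACHIEVED]"


def run_agent_loop(
    task: str,
    model: Callable[[list[str]], str],
    execute: Callable[[str], str],
    max_commands: int = 5,  # hypothetical ceiling enforced by the code layer
) -> dict:
    """Alternate model turns and command execution until done or capped."""
    history = [task]
    transcript = []
    for _ in range(max_commands):
        reply = model(history)
        if OBJECTIVE_MARKER in reply:
            return {"status": "achieved", "transcript": transcript}
        output = execute(reply)  # in ARCHER this runs inside the container
        transcript.append((reply, output))
        history.extend([reply, output])
    return {"status": "ceiling_reached", "transcript": transcript}


def scripted_model(replies: list[str]) -> Callable[[list[str]], str]:
    """Stand-in model that returns canned replies in order."""
    queue = iter(replies)
    return lambda history: next(queue)


if __name__ == "__main__":
    result = run_agent_loop(
        "discover live hosts on 192.168.56.0/24 using an ARP scan",
        model=scripted_model(["echo scanning", OBJECTIVE_MARKER]),
        execute=lambda command: "stub output for: " + command,
    )
    print(result["status"], len(result["transcript"]))
```

Keeping the ceiling in the loop rather than in the prompt means a model that never emits the marker is still stopped after a fixed number of commands, and the transcript records everything it ran up to that point.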

Why the failures are visible

These recordings are unedited. If the model overshoots, picks a suboptimal tool, or produces a verbose response where a terse one would do — that's in the recording.

This is deliberate. ARCHER is a research project, and the benchmark dashboard tracks exactly these failure modes: halt discipline rate, false positive rate, per-objective pass rate over time. The point isn't to hide the rough edges — it's to measure them, understand them, and close them systematically through the fine-tuning pipeline.

What you're watching is V1: the phase that validates the agent loop, builds the eval harness, and collects the operational data that will train the V2 specialist model. A session that fails cleanly and gets flagged by the audit pipeline is doing its job — it's a data point, not a defect to paper over.

Performance data and failure analysis are published through the build journal as eval runs accumulate.