ARCHER in Action

Three recordings of ARCHER running real penetration testing objectives against live vulnerable targets, with more being added as clean sessions are captured. No edits, no scripted responses — each recording is a single unmodified agent session.

Network reconnaissance: ARP scan

What you're watching

Task: discover live hosts on 192.168.56.0/24 using an ARP scan

ARCHER receives the task, the router classifies it as network_reconnaissance, and the skill pack injects tool guidance into the system prompt. The model issues a single arp-scan command; the code layer executes it inside the archer-kali container, captures the output, and feeds it back. The model reads the MAC/IP pairs, extracts the findings, and signals completion with [OBJECTIVE_ACHIEVED].

This is the simplest case in the agent loop: one command, deterministic output, unambiguous completion. It shows the full cycle — task in, tool run, findings out — without noise.
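The findings-extraction step for this session can be sketched as a small parser. This is a minimal illustration, not ARCHER's actual code: the function name `parse_arp_scan` and the result fields are hypothetical, and the sample output is made up for the example. It assumes the default arp-scan host-line layout of IP address, MAC address, and vendor separated by whitespace.

```python
import re

# One host line from arp-scan: "<IPv4> <MAC> <vendor>", whitespace separated.
HOST_LINE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<mac>[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})\s*"
    r"(?P<vendor>.*)$"
)


def parse_arp_scan(output: str) -> list[dict[str, str]]:
    """Return one record per responding host; banner and summary lines are skipped."""
    hosts = []
    for line in output.splitlines():
        match = HOST_LINE.match(line.strip())
        if match:
            hosts.append({
                "ip": match["ip"],
                "mac": match["mac"].lower(),
                "vendor": match["vendor"].strip(),
            })
    return hosts


# Illustrative sample only; the addresses and vendor strings are invented.
SAMPLE_OUTPUT = """Interface: eth0, type: EN10MB, MAC: 08:00:27:aa:bb:cc, IPv4: 192.168.56.10
Starting arp-scan with 256 hosts
192.168.56.1\t0a:00:27:00:00:00\t(Unknown: locally administered)
192.168.56.103\t08:00:27:12:34:56\tPCS Systemtechnik GmbH

2 packets received by filter, 0 packets dropped by kernel
"""

if __name__ == "__main__":
    for host in parse_arp_scan(SAMPLE_OUTPUT):
        print(host["ip"], host["mac"], host["vendor"])
```

Parsing with a fixed pattern rather than asking the model to summarize keeps the extracted MAC/IP pairs tied directly to what the tool printed.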

Web enumeration: PHP and text file discovery

What you're watching

Task: enumerate PHP and text files on the web server at 192.168.56.103 using gobuster or ffuf

The task names specific tools. ARCHER's skill router detects the phrase "using gobuster or ffuf" and injects a tool-enforcement directive: if gobuster or ffuf is available, the model must use one of them rather than substituting an equivalent tool. This is the code layer enforcing scope, not the model deciding on its own.

The session shows gobuster running with a PHP/text-file extension filter, returning discovered paths, and the model extracting the relevant findings before signalling completion.
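One way the tool-enforcement check could work is sketched below. Everything here is an assumption made for illustration: the `KNOWN_TOOLS` set, the function names, the directive wording, and the use of `shutil.which` as the availability check are hypothetical and not taken from ARCHER's router.

```python
import re
import shutil
from typing import Callable, Optional

# Hypothetical allow-list of tool names the router recognizes.
KNOWN_TOOLS = {"gobuster", "ffuf", "nmap", "arp-scan", "nikto"}

# Captures everything after the first "using" in the task text.
USING_CLAUSE = re.compile(r"\busing\s+(.+)$", re.IGNORECASE)


def named_tools(task: str) -> list[str]:
    """Return recognized tool names listed after 'using', in task order."""
    match = USING_CLAUSE.search(task)
    if not match:
        return []
    clause = match.group(1).strip().rstrip(".")
    candidates = re.split(r"\s*,\s*|\s+or\s+|\s+and\s+", clause)
    return [c.lower() for c in candidates if c.lower() in KNOWN_TOOLS]


def enforcement_directive(
    task: str,
    is_available: Callable[[str], bool] = lambda tool: shutil.which(tool) is not None,
) -> Optional[str]:
    """Build a system-prompt directive if the task names tools that are installed."""
    tools = [tool for tool in named_tools(task) if is_available(tool)]
    if not tools:
        return None
    return (
        "The task names specific tools. Use one of: "
        + ", ".join(tools)
        + ". Do not substitute an equivalent tool."
    )


if __name__ == "__main__":
    task = ("enumerate PHP and text files on the web server at "
            "192.168.56.103 using gobuster or ffuf")
    print(enforcement_directive(task, is_available=lambda tool: True))
```

A task such as "discover live hosts on 192.168.56.0/24 using an ARP scan" yields no directive in this sketch, because "an ARP scan" is not a recognized tool name.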

Web enumeration: directory scan

What you're watching

Task: enumerate directories on the web server at 192.168.56.103

A broader directory scan with no tool constraint. The model selects a wordlist and tool, runs the scan, and works through the output to identify directories worth noting. Compare this to the PHP/text-file session above: same target, different scope, different tool choice, different findings.

This recording also shows ARCHER's session logging in action: every command and its full output are written to ~/.archer_sessions/ as a structured ft.jsonl file. Sessions like this, once they pass the two-tier audit pipeline, become training data for the V2 specialist model.
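As a rough illustration of JSONL session logging, the sketch below appends one JSON object per executed command. The record fields (`timestamp`, `command`, `output`, `exit_code`) and the function names are assumptions for this example; the actual schema of ARCHER's ft.jsonl files is not described here.

```python
import json
import time
from pathlib import Path


def log_command(session_file: Path, command: str, output: str, exit_code: int) -> None:
    """Append one command record as a single JSON line (field names are hypothetical)."""
    record = {
        "timestamp": time.time(),
        "command": command,
        "output": output,  # full output, not truncated
        "exit_code": exit_code,
    }
    session_file.parent.mkdir(parents=True, exist_ok=True)
    with session_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_session(session_file: Path) -> list[dict]:
    """Load every record back, one per non-empty line."""
    with session_file.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because `json.dumps` escapes embedded newlines, multi-line tool output still occupies exactly one line of the file, which keeps each record independently parseable by a later audit step.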

How the agent loop works

Each recording follows the same cycle:

1. Task received: plain-English instruction, no structured input required
2. Router: TF-IDF+LR classifier maps the task to the correct skill pack
3. Hints injected: the skill pack adds tool guidance, completion indicators, and scope constraints to the system prompt
4. Model turn: [THOUGHT] reasoning block, then a bash command block
5. Execution: command runs inside the archer-kali Kali Linux container; output returned to the model
6. Repeat: until the model emits [OBJECTIVE_ACHIEVED] or the code layer's command ceiling is reached
7. Findings extracted: structured parsing by the code layer, not model summarization
8. Session logged: full transcript written for audit and fine-tuning pipeline
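The model-turn, execution, and repeat steps of the cycle above can be sketched as a bounded loop. This is a simplified illustration under stated assumptions: `run_agent`, the transcript shape, the status strings, and the default ceiling of 5 are hypothetical, and the sketch assumes the bash command arrives in a fenced block tagged `bash`. The model and the executor are passed in as plain callables, so a real executor that runs commands in the archer-kali container could replace the stub.

```python
import re
from typing import Callable

DONE_MARKER = "[OBJECTIVE_ACHIEVED]"
FENCE = "`" * 3  # built at runtime so this listing contains no literal fence
BASH_BLOCK = re.compile(FENCE + r"bash\n(.*?)" + FENCE, re.DOTALL)

Transcript = list[tuple[str, str]]


def run_agent(
    task: str,
    model: Callable[[Transcript], str],
    execute: Callable[[str], str],
    max_commands: int = 5,  # assumed ceiling, enforced by code rather than the model
) -> dict:
    """Alternate model turns and command execution until done or the ceiling is hit."""
    transcript: Transcript = [("task", task)]
    commands_run = 0
    while commands_run < max_commands:
        reply = model(transcript)
        transcript.append(("model", reply))
        if DONE_MARKER in reply:
            return {"status": "achieved", "transcript": transcript}
        match = BASH_BLOCK.search(reply)
        if match is None:
            # No completion marker and nothing to run: stop rather than loop forever.
            return {"status": "no_command", "transcript": transcript}
        output = execute(match.group(1).strip())
        commands_run += 1
        transcript.append(("output", output))
    return {"status": "ceiling_reached", "transcript": transcript}
```

Keeping the ceiling and the completion check in the loop itself means a model that never emits the marker still terminates, and the returned transcript is available for logging either way.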

Why the failures are visible

These recordings are unedited. If the model overshoots, picks a suboptimal tool, or produces a verbose response where a terse one would do — that's in the recording.

This is deliberate. ARCHER is a research project, and the benchmark dashboard tracks exactly these failure modes: halt discipline rate, false positive rate, per-objective pass rate over time. The point isn't to hide the rough edges — it's to measure them, understand them, and close them systematically through the fine-tuning pipeline.

What you're watching is V1: the phase that validates the agent loop, builds the eval harness, and collects the operational data that will train the V2 specialist model. A session that fails cleanly and gets flagged by the audit pipeline is doing its job — it's a data point, not a defect to paper over.

Performance data and failure analysis are published through the build journal as eval runs accumulate.