ARCHER in Action
Three recordings of ARCHER running real penetration testing objectives against live vulnerable targets, with more being added as clean sessions are captured. No edits, no scripted responses — each recording is a single unmodified agent session.
Network reconnaissance: ARP scan
What you're watching
Task: discover live hosts on 192.168.56.0/24 using an ARP scan
ARCHER receives the task, the router classifies it as network_reconnaissance,
and the skill pack injects tool guidance into the system prompt. The model issues
a single arp-scan command; the code layer executes it inside the archer-kali
container, captures the output, and feeds it back. The model reads the MAC/IP
pairs, extracts the findings, and signals completion with [OBJECTIVE_ACHIEVED].
This is the simplest case in the agent loop: one command, deterministic output, unambiguous completion. It shows the full cycle — task in, tool run, findings out — without noise.
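The findings-extraction step for this task can be sketched in Python. This is a minimal illustration, not ARCHER's actual parser: `parse_arp_scan`, `HOST_LINE`, and the sample output are hypothetical, and the sketch assumes the usual arp-scan line layout of IP address, MAC address, and vendor separated by whitespace.

```python
import re

# One responding host per line: IP, MAC, then an optional vendor string.
HOST_LINE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<mac>[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})\s*(?P<vendor>.*)$"
)


def parse_arp_scan(output: str) -> list[dict[str, str]]:
    """Return one dict per responding host; banner and summary lines are skipped."""
    hosts = []
    for line in output.splitlines():
        match = HOST_LINE.match(line.strip())
        if match:
            hosts.append({
                "ip": match["ip"],
                "mac": match["mac"].lower(),
                "vendor": match["vendor"].strip(),
            })
    return hosts


# Illustrative sample only; it is not taken from the recording.
SAMPLE = """Interface: eth1, type: EN10MB, MAC: 08:00:27:aa:bb:cc, IPv4: 192.168.56.10
Starting arp-scan with 256 hosts
192.168.56.1\t0a:00:27:00:00:0b\t(Unknown: locally administered)
192.168.56.103\t08:00:27:12:34:56\tPCS Systemtechnik GmbH

2 packets received by filter, 0 packets dropped by kernel
"""

hosts = parse_arp_scan(SAMPLE)
for host in hosts:
    print(host["ip"], host["mac"], host["vendor"])
```

Matching on the line shape rather than asking the model to summarize keeps the result deterministic: the same scan output yields the same list of hosts.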
Web enumeration: PHP and text file discovery
What you're watching
Task: enumerate PHP and text files on the web server at 192.168.56.103 using gobuster or ffuf
The task names specific tools. ARCHER's skill router detects the "using gobuster or ffuf"
phrasing and injects a tool-enforcement directive: if gobuster or ffuf is available, the
model must use one of them rather than substituting an equivalent tool. This is the code
layer enforcing scope, not the model deciding on its own.
The session shows gobuster running with a PHP/text-file extension filter, returning discovered paths, and the model extracting the relevant findings before signalling completion.
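The tool-enforcement check described above could look roughly like the following. This is a sketch under assumptions, not ARCHER's router code: `KNOWN_TOOLS`, `named_tools`, `enforcement_directive`, and the directive wording are all hypothetical, and availability is checked with `shutil.which` on the local PATH purely for illustration.

```python
import re
import shutil
from typing import Callable, Optional

# Hypothetical allow-list; the real skill packs may recognize a different set.
KNOWN_TOOLS = {"gobuster", "ffuf", "nmap", "arp-scan", "nikto"}


def named_tools(task: str) -> list[str]:
    """Return known tools named after the word 'using', in order of appearance."""
    match = re.search(r"\busing\s+(.+)$", task, flags=re.IGNORECASE)
    if not match:
        return []
    parts = re.split(r"\s*(?:,|\bor\b|\band\b)\s*", match.group(1).lower())
    return [p.strip() for p in parts if p.strip() in KNOWN_TOOLS]


def enforcement_directive(
    task: str,
    is_available: Callable[[str], bool] = lambda tool: shutil.which(tool) is not None,
) -> Optional[str]:
    """Build a system-prompt directive if the task names an available tool."""
    tools = [t for t in named_tools(task) if is_available(t)]
    if not tools:
        return None
    return (
        "You must use one of: " + ", ".join(tools)
        + ". Do not substitute an equivalent tool."
    )


task = ("enumerate PHP and text files on the web server at 192.168.56.103 "
        "using gobuster or ffuf")
# Availability is stubbed to True here so the example does not depend on the host.
print(enforcement_directive(task, is_available=lambda tool: True))
```

A task such as "discover live hosts on 192.168.56.0/24 using an ARP scan" names a technique rather than a tool from the list, so this sketch returns no directive for it.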
Web enumeration: directory scan
What you're watching
Task: enumerate directories on the web server at 192.168.56.103
A broader directory scan with no tool constraint. The model selects a wordlist and tool, runs the scan, and works through the output to identify directories worth noting. Compare this to the PHP/text-file session above: same target, different scope, different tool choice, different findings.
This recording also shows ARCHER's session logging in action: every command and
its full output are written to ~/.archer_sessions/ as a structured ft.jsonl file.
Sessions like this, once they pass the two-tier audit pipeline, become training
data for the V2 specialist model.
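As an illustration of what line-delimited session logging involves, here is a minimal sketch. The record fields (`turn`, `command`, `output`), the helper names, and the demo file name are assumptions; the actual ft.jsonl schema is not described on this page. The demo writes to a temporary directory rather than ~/.archer_sessions/.

```python
import json
import tempfile
from pathlib import Path


def append_turn(session_file: Path, turn: int, command: str, output: str) -> None:
    """Append one command/output pair as a single JSON line."""
    session_file.parent.mkdir(parents=True, exist_ok=True)
    record = {"turn": turn, "command": command, "output": output}
    with session_file.open("a", encoding="utf-8") as fh:
        # json.dumps escapes newlines, so multi-line output stays on one line.
        fh.write(json.dumps(record) + "\n")


def read_session(session_file: Path) -> list[dict]:
    """Load every record from a JSONL session file."""
    with session_file.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical file name and illustrative tool output.
demo_file = Path(tempfile.mkdtemp()) / "demo.ft.jsonl"
append_turn(
    demo_file,
    turn=1,
    command="gobuster dir -u http://192.168.56.103 -w common.txt",
    output="/admin (Status: 301)\n/uploads (Status: 301)",
)
records = read_session(demo_file)
print(records[0]["command"])
```

One record per line means an audit step can stream a session and reject it at the first malformed entry without loading the whole file.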
How the agent loop works
Each recording follows the same cycle:
- Task received: plain-English instruction, no structured input required
- Router: TF-IDF+LR classifier maps the task to the correct skill pack
- Hints injected: the skill pack adds tool guidance, completion indicators, and scope constraints to the system prompt
- Model turn: [THOUGHT] reasoning block, then a bash command block
- Execution: command runs inside the archer-kali Kali Linux container; output returned to the model
- Repeat: until the model emits [OBJECTIVE_ACHIEVED] or the code layer's command ceiling is reached
- Findings extracted: structured parsing by the code layer, not model summarization
- Session logged: full transcript written for audit and fine-tuning pipeline
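The model-turn, execution, and repeat steps can be sketched as a small loop. This is a simplified illustration, not ARCHER's implementation: `run_loop`, `extract_command`, the `max_commands` default, and the stubbed model and executor are hypothetical, and the sketch assumes the bash command block arrives as a fenced block tagged `bash`.

```python
import re
from typing import Callable, Optional

DONE = "[OBJECTIVE_ACHIEVED]"
FENCE = "`" * 3  # built indirectly so this example does not close its own fence
BASH_BLOCK = re.compile(FENCE + r"bash\n(.*?)" + FENCE, re.DOTALL)


def extract_command(reply: str) -> Optional[str]:
    """Pull the first bash block out of a model reply, if there is one."""
    match = BASH_BLOCK.search(reply)
    return match.group(1).strip() if match else None


def run_loop(
    task: str,
    model: Callable[[list], str],
    execute: Callable[[str], str],
    max_commands: int = 10,  # assumed value for the command ceiling
) -> tuple[list, bool]:
    """Alternate model turns and command execution until done or the ceiling."""
    transcript = [("task", task)]
    commands_run = 0
    while commands_run < max_commands:
        reply = model(transcript)
        transcript.append(("model", reply))
        if DONE in reply:
            return transcript, True
        command = extract_command(reply)
        if command is None:
            break  # no command and no completion marker: stop rather than spin
        transcript.append(("output", execute(command)))
        commands_run += 1
    return transcript, False


# Stubs standing in for the model and the archer-kali container.
replies = iter([
    "[THOUGHT] Scan the subnet.\n" + FENCE + "bash\narp-scan 192.168.56.0/24\n" + FENCE,
    "[THOUGHT] Hosts identified.\n" + DONE,
])
executed = []


def fake_execute(command: str) -> str:
    executed.append(command)
    return "192.168.56.103\t08:00:27:12:34:56\tPCS Systemtechnik GmbH"


transcript, achieved = run_loop(
    "discover live hosts on 192.168.56.0/24 using an ARP scan",
    model=lambda t: next(replies),
    execute=fake_execute,
)
print(achieved, executed)
```

Keeping the ceiling in the loop rather than in the prompt means a model that never emits the completion marker is still stopped after a fixed number of commands.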
Why the failures are visible
These recordings are unedited. If the model overshoots, picks a suboptimal tool, or produces a verbose response where a terse one would do — that's in the recording.
This is deliberate. ARCHER is a research project, and the benchmark dashboard tracks exactly these failure modes: halt discipline rate, false positive rate, per-objective pass rate over time. The point isn't to hide the rough edges — it's to measure them, understand them, and close them systematically through the fine-tuning pipeline.
What you're watching is V1: the phase that validates the agent loop, builds the eval harness, and collects the operational data that will train the V2 specialist model. A session that fails cleanly and gets flagged by the audit pipeline is doing its job — it's a data point, not a defect to paper over.
Performance data and failure analysis are published through the build journal as eval runs accumulate.