ARCHER System Map
A top-down chart of every subsystem and how they connect. Grounded in the post-refactor code as of #972 (executor), #973 (router/playbook/domain split), #980 (agent extraction), #981 (archer/pentest/), #987 (plays/).
For the runtime pack contract see play-packs.md; for blast radius and dependency tables see ../../ARCHITECTURE.md; for the file inventory see ../../STRUCTURE.md.
1. The three-layer model (why the code is shaped this way)
Every line of ARCHER belongs to exactly one of three layers. The boundary is architectural, not advisory — compensating logic that crosses it is treated as a design defect.
flowchart TB
subgraph HUMAN["🧑 HUMAN LAYER — judgment, scope, authorization"]
H1["Defines scope & acceptable risk · authorizes irreversible actions · final QA/QC"]
end
subgraph MODEL["🤖 MODEL LAYER — probabilistic (qwen3:14b @ 8K)"]
M1["command generation · output interpretation · next-step chaining · attack-chain narrative"]
end
subgraph CODE["⚙️ CODE LAYER — deterministic (must have a correct answer)"]
C1["routing · execution · safety enforcement · halt detection · ground-truth verification · logging"]
end
HUMAN -->|"scope, authorization"| CODE
CODE -->|"task + hints + constraints"| MODEL
MODEL -->|"proposed command / findings"| CODE
CODE -->|"verified findings, residual"| HUMAN
| Layer | Owns | Lives in |
|---|---|---|
| Model | generation, interpretation, reasoning | the LLM (via archer_providers.py / Ollama) |
| Code | routing, execution, safety, halt, verify, logging | everything else in this map |
| Human | scope, risk, authorization, final QA | the operator (Auditor does initial QA) |
2. Subsystem inventory
| # | Subsystem | Primary file(s) | Responsibility |
|---|---|---|---|
| 1 | Entry / CLI | ARCHER.py (1.2k) | Parse args, load one domain, dispatch to the agent loop |
| 2 | Agent core | archer_platform/agent.py (4.5k) | The session loop, prompt assembly, halt/OA gates, PlayRegistry |
| 3 | Execution + safety | archer_executor.py (546) | Run commands (local or docker exec), enforce safety/egress |
| 4 | Providers | archer_providers.py (558) | Chat-session abstraction (Ollama local, Claude/Gemini cloud) |
| 5 | Routing | archer_router.py (391) | Map task → skill category (classifier → keyword → bonus) |
| 6 | Domain interface | archer_platform/domain.py (120) | DomainConfig platform↔domain contract |
| 7 | Pentest domain | archer/pentest/domain.py (52) | PENTEST_DOMAIN instance + eval objectives/success_fns |
| 8 | Play packs | plays/PT-*.py (×10) | Per-skill hints/halt/bonus dispatchers + SKILL_CATEGORIES |
| 9 | Playbook | archer_playbook.py (951) | Learned command replay, variable substitution (domain-scoped) |
| 10 | Engagement / findings | archer_engagement.py, archer_findings.py | Multi-phase sequencing + cross-phase findings accumulation |
| 11 | Eval harness | testenv/eval_harness.py (3.6k) + archer/pentest/eval/ | Run objectives, score against ground truth, emit CSV |
| 12 | Training pipeline | testenv/audit_review.py, scripts/prepare_finetune.py, scripts/finetune.py | Tier 1/2 gates → JSONL → LoRA adapter |
| 13 | Observability | scripts/archer_live.py (7k), scripts/archer_analytics.py | Read-only dashboard + visitor analytics |
3. Module dependency graph (runtime)
Arrows mean "imports / calls into." Play packs are loaded dynamically and never import back — that one-way edge is a load-bearing invariant.
flowchart TD
CLI["ARCHER.py<br/>(entry / CLI / main)"]
AGENT["archer_platform/agent.py<br/>(loop · PlayRegistry · halt/OA)"]
EX["archer_executor.py<br/>(execute + safety)"]
PROV["archer_providers.py<br/>(LLM sessions)"]
ROUTE["archer_router.py<br/>(skill routing)"]
PIFACE["archer_platform/domain.py<br/>(DomainConfig)"]
PDOM["archer/pentest/domain.py<br/>(PENTEST_DOMAIN)"]
PLAYS["plays/PT-*.py<br/>(dispatchers)"]
PLAYBK["archer_playbook.py"]
ENG["archer_engagement.py"]
FIND["archer_findings.py"]
CLI --> AGENT
CLI --> PDOM
CLI --> EX
CLI --> ROUTE
CLI --> PLAYBK
CLI --> FIND
AGENT --> EX
AGENT --> PROV
AGENT --> ROUTE
AGENT --> PLAYBK
AGENT --> FIND
AGENT --> PIFACE
AGENT --> PDOM
AGENT -. "dynamic import<br/>PlayRegistry.load_domain" .-> PLAYS
ROUTE --> EX
ROUTE -. "late-bind via sys.modules<br/>(PLAYS.loaded_domain)" .-> AGENT
PDOM --> PIFACE
PDOM --> EX
ENG --> FIND
PLAYS -. "NO back-import<br/>(isolation invariant)" .-x AGENT
Key edges:
- agent.py is the hub — it imports the executor, router, playbook, findings, providers, and both domain layers.
- archer_router → archer_executor only for default timeout constants; it reaches the loaded domain's SKILL_CATEGORIES via a deferred sys.modules lookup to avoid a circular import.
- Play packs depend on nothing in ARCHER — everything they need arrives in the config dict at call time.
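The deferred lookup that lets the router read the loaded domain without importing the agent can be sketched as follows. This is a minimal illustration under stated assumptions, not the router's actual code: the helper name `loaded_skill_categories` and the idea that `SKILL_CATEGORIES` is a module-level attribute of the agent are hypothetical; only the module path `archer_platform.agent` and the name `SKILL_CATEGORIES` come from this map.

```python
import sys
import types

AGENT_MODULE = "archer_platform.agent"  # module path taken from this map


def loaded_skill_categories() -> dict:
    """Return the agent's SKILL_CATEGORIES without importing the agent.

    Hypothetical helper. A top-level import of the agent inside the router
    would close a cycle (the agent already imports the router), so the
    lookup is deferred to call time and goes through sys.modules.
    """
    agent = sys.modules.get(AGENT_MODULE)
    if agent is None:
        return {}  # agent not imported yet, so no domain can be loaded
    return getattr(agent, "SKILL_CATEGORIES", {})


# Demo with a stand-in module so the sketch runs outside ARCHER.
before = loaded_skill_categories()
fake_agent = types.ModuleType(AGENT_MODULE)
fake_agent.SKILL_CATEGORIES = {"recon": {"keywords": ["scan"]}}
sys.modules[AGENT_MODULE] = fake_agent
after = loaded_skill_categories()
```

Because the lookup happens on every call rather than at import time, the router always sees whatever the registry merged most recently, and importing the router never forces the agent to load.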
4. Agent session — control flow
One session = task in → terminal exit (OBJECTIVE_ACHIEVED / HALT_DISCIPLINE / error / timeout). Each turn is one model inference + one command execution.
flowchart TD
START(["task in"]) --> ROUTE["route: detect_skill_category(task)<br/>→ skill category + min/max cmds + halt mode"]
ROUTE --> PB{"playbook hit?<br/>(domain-scoped)"}
PB -->|yes| SEED["seed validated command"]
PB -->|no| BUILD
SEED --> BUILD["assemble system prompt:<br/>env + hints_fn + domain addendum + JSON schema"]
BUILD --> CALL["session.stream() → model turn"]
CALL --> EXTRACT["extract bash command<br/>(JSON field or regex)"]
EXTRACT --> SAFE{"validate_command_safety<br/>+ validate_command_egress"}
SAFE -->|"escape (AR-9)"| HALT_ESC(["hard halt"])
SAFE -->|"dangerous"| APPROVE["human approval"]
SAFE -->|ok| RUN["execute_command / execute_with_sudo<br/>(local bash OR docker exec)"]
APPROVE --> RUN
RUN --> POST["post_process_output + scan analysis<br/>wrap in [EXTERNAL] (AR-1)"]
POST --> OA{"[OBJECTIVE_ACHIEVED]?<br/>strict parser"}
OA -->|"yes & step ≥ min_commands"| VERIFY["tier1 signal → tier1 probe → verify_fn"]
OA -->|"yes but step < min"| DEPTH["suppress (depth guard #4)"]
OA -->|no| HALTQ
DEPTH --> HALTQ
VERIFY -->|confirmed| DONE(["OBJECTIVE_ACHIEVED"])
VERIFY -->|false positive| HALTQ
HALTQ{"should_halt_objective()<br/>halt_fn + min/max gates"} -->|halt| HD(["HALT_DISCIPLINE"])
HALTQ -->|"max steps (25)"| HD
HALTQ -->|continue| BUILD
DONE --> LOG["write session log + ft.jsonl + residual.json<br/>+ routing/command/failure logs"]
HD --> LOG
The two exit gates encode the model/code split: the model claims completion
([OBJECTIVE_ACHIEVED]), but the code verifies it (verify_fn/success_fn
against real target state) before it counts.
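The claim-then-verify gate can be sketched in a few lines. This is an illustration only: the function name `objective_gate`, its return labels, and the choice to treat "strict" as a case-sensitive literal match are assumptions; the tier 1 signal and probe stages that precede `verify_fn` in the flow are collapsed into the single `verify_fn` callable.

```python
from typing import Callable

OA_TOKEN = "[OBJECTIVE_ACHIEVED]"  # exact token; paraphrases do not count


def objective_gate(model_output: str, step: int, min_commands: int,
                   verify_fn: Callable[[], bool]) -> str:
    """Decide what a model turn means for session completion.

    Hypothetical sketch: the model may only *claim* completion; the code
    decides whether the claim counts.
    """
    if OA_TOKEN not in model_output:
        return "continue"        # no claim was made
    if step < min_commands:
        return "suppressed"      # depth guard: claim came too early
    if verify_fn():
        return "achieved"        # claim confirmed against target state
    return "false_positive"      # claim rejected by verification


# A claim on step 1 of a category that requires 2 commands is ignored,
# even though verification would have passed.
early = objective_gate("[OBJECTIVE_ACHIEVED]", 1, 2, lambda: True)
confirmed = objective_gate("root shell [OBJECTIVE_ACHIEVED]", 2, 2, lambda: True)
rejected = objective_gate("[OBJECTIVE_ACHIEVED]", 5, 2, lambda: False)
```

In the flow above, both `suppressed` and `false_positive` fall through to the halt check rather than ending the session, so a wrong claim costs a turn but never produces a false success.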
5. Routing chain (task → skill category)
A routing miss corrupts the whole session (wrong hints, wrong halt criteria, wrong tools), so this is the most upstream decision in the system.
flowchart LR
T["task string"] --> C1{"Tier 1: Classifier<br/>TF-IDF + LR · conf ≥ 0.5<br/>+ skill in domain<br/>+ no exclude-kw veto"}
C1 -->|pass| OUT["selected skill category"]
C1 -->|"miss / --no-classifier"| C2{"Tier 2: Keyword scorer<br/>+2 keyword, −1 exclude,<br/>± bonus_fn"}
C2 -->|"score > 0"| OUT
C2 -->|"all ≤ 0"| UNK["'unknown'<br/>(excluded from training)"]
OUT --> LOG["~/.archer_routing_log.jsonl<br/>(scores, score_gap, confidence,<br/>classifier_version, git_sha)"]
C3["Tier 3: LLM gate<br/>(removed #169 — code dormant)"]:::dead
classDef dead stroke-dasharray: 5 5,opacity:0.5;
The routing log is the training signal: eval runs write eval_label entries with
the known-correct skill (label_confidence: high), which feed the next classifier.
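The Tier 2 keyword scorer can be sketched as below. The weights (+2 per keyword, −1 per exclude keyword, plus whatever `bonus_fn` returns) and the `'unknown'` fallback come from the diagram; the function name `keyword_route`, the category field names `keywords` and `exclude`, and the use of plain substring matching are assumptions made for illustration.

```python
def keyword_route(task: str, categories: dict, has_ctx: bool = False) -> str:
    """Pick the highest-scoring skill category, or 'unknown' if none is positive.

    Hypothetical sketch of the Tier 2 fallback, not the router's real code.
    """
    text = task.lower()
    scores = {}
    for name, cfg in categories.items():
        score = 2 * sum(kw in text for kw in cfg.get("keywords", []))
        score -= sum(kw in text for kw in cfg.get("exclude", []))
        bonus_fn = cfg.get("bonus_fn")
        if bonus_fn is not None:
            # Signature from the pack contract: bonus_fn(task, has_ctx, config)
            score += bonus_fn(task, has_ctx, cfg)
        scores[name] = score
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] <= 0:
        return "unknown"  # such sessions are excluded from training
    return best


categories = {
    "recon": {"keywords": ["scan", "enumerate"], "exclude": ["exploit"]},
    "web": {"keywords": ["http", "sql injection"], "exclude": []},
}
recon_pick = keyword_route("Scan the host and enumerate services", categories)
web_pick = keyword_route("Exploit the HTTP login form", categories)
no_pick = keyword_route("Write the final report", categories)
```

In this sketch a tie goes to whichever category was registered first, because `max` returns the first maximal key; whether the real router breaks ties that way is not stated in this map.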
6. Domain & play-pack plug-in model
--do <domain> --sd <subdomain> selects exactly one play pack. PlayRegistry.load_domain()
imports it, validates its dispatchers, and merges its registries into the runtime
globals — then refuses a second call (single-domain enforcement).
flowchart TD
CLI["--do pentest --sd recon"] --> REG["PlayRegistry.load_domain()"]
REG -->|"single-domain guard:<br/>RuntimeError on 2nd call"| X(("✗"))
REG --> IMP["importlib → plays/PT-Recon.py"]
IMP --> VAL["validate dispatchers<br/>(synthetic-call halt/hints/bonus)"]
VAL --> MERGE["merge into globals"]
MERGE --> SC["SKILL_CATEGORIES<br/>(core wins on conflict)"]
MERGE --> TS["TARGET_SIGNATURES / NOUNS"]
MERGE --> AD["SYSTEM_PROMPT_ADDENDUM (≤300 chars)"]
SC --> ROUTER["archer_router reads SKILL_CATEGORIES"]
SC --> LOOP["agent loop reads halt_fn / hints_fn / bonus_fn<br/>per category, via config dict"]
subgraph PACK["play pack contract (stdlib-only, no ARCHER import)"]
direction LR
P1["SKILL_CATEGORIES{}"]
P2["halt_fn(count, findings, config)→bool"]
P3["hints_fn(task, config, tools, sigs)→list"]
P4["bonus_fn(task, has_ctx, config)→int"]
P5["SYSTEM_PROMPT_ADDENDUM"]
P6["_register_pack_handlers()"]
end
DOMIFACE["archer_platform/domain.py<br/>DomainConfig contract"] --> PDOMI["archer/pentest/domain.py<br/>PENTEST_DOMAIN<br/>(error_patterns · post_process · report_prompt)"]
The 10 packs: PT-Recon, PT-Vulnerability, PT-Web, PT-Exploitation,
PT-PostExploit, PT-Pivoting, PT-Privesc, PT-ActiveDirectory,
PT-ThreatEmulation, plus the legacy monolithic penetration.py.
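A pack that satisfies the contract in the diagram, together with a loader-side check in the spirit of the synthetic-call validation, might look like the sketch below. The three dispatcher signatures, `SKILL_CATEGORIES`, and the 300-character addendum limit come from this map; the category contents, the `max_commands` config key, and the `validate_pack` helper are hypothetical, and `_register_pack_handlers()` is left out.

```python
from types import SimpleNamespace

# --- pack side: stdlib only, imports nothing from ARCHER ---
SKILL_CATEGORIES = {
    "demo_recon": {"keywords": ["scan", "enumerate"], "max_commands": 8},
}
SYSTEM_PROMPT_ADDENDUM = "Prefer read-only enumeration before active probing."


def halt_fn(count, findings, config):
    """Halt once the command budget carried in the config dict is spent."""
    return count >= config.get("max_commands", 8)


def hints_fn(task, config, tools, sigs):
    """Return prompt hints; mention only tools that are actually available."""
    return [f"{tool} is available" for tool in tools]


def bonus_fn(task, has_ctx, config):
    """Nudge the keyword score when prior context exists."""
    return 1 if has_ctx else 0


# --- loader side: hypothetical stand-in for the validation step ---
def validate_pack(pack):
    """Call each dispatcher with synthetic arguments and check return types."""
    problems = []
    if not isinstance(pack.SKILL_CATEGORIES, dict):
        problems.append("SKILL_CATEGORIES must be a dict")
    if len(pack.SYSTEM_PROMPT_ADDENDUM) > 300:
        problems.append("SYSTEM_PROMPT_ADDENDUM exceeds 300 chars")
    if not isinstance(pack.halt_fn(0, [], {}), bool):
        problems.append("halt_fn must return bool")
    if not isinstance(pack.hints_fn("task", {}, [], []), list):
        problems.append("hints_fn must return list")
    if not isinstance(pack.bonus_fn("task", False, {}), int):
        problems.append("bonus_fn must return int")
    return problems


pack = SimpleNamespace(
    SKILL_CATEGORIES=SKILL_CATEGORIES,
    SYSTEM_PROMPT_ADDENDUM=SYSTEM_PROMPT_ADDENDUM,
    halt_fn=halt_fn, hints_fn=hints_fn, bonus_fn=bonus_fn,
)
problems = validate_pack(pack)
```

Every dispatcher receives what it needs as arguments, which is what makes the pack testable on its own: the check above runs without ARCHER installed.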
7. Eval → training pipeline (the learning loop)
This is a separate process from the agent: the harness spawns ARCHER as a subprocess, scores the result against ground truth, and the surviving sessions become fine-tuning data. Gates are hard — a failure at any stage drops the session.
flowchart TD
OBJ["archer/pentest/eval/objectives.py<br/>(task · expected_play · success_fn · setup_fn · verify_fn)"]
HARNESS["testenv/eval_harness.py"]
OBJ --> HARNESS
HARNESS -->|"setup_fn → spawn ARCHER → success_fn"| RUN["session run"]
RUN --> CSV["testenv/eval_results/*.csv<br/>(skill_selected, success, halt_reason, 35+ cols)"]
RUN --> FT["~/.archer_sessions/*.ft.jsonl<br/>(+ .residual.json)"]
FT --> T1["audit_review.py — Tier 1<br/>deterministic structural checks<br/>(_SIGNAL_RE evidence filter)"]
T1 -->|"lab_suspect → drop"| DROP1(("✗"))
T1 --> T2["Tier 2 — LLM judge (Haiku)<br/>4 criteria × 0–3 → .tier2.json"]
T2 --> PREP["prepare_finetune.py<br/>~11 hard gates"]
PREP -->|"skill=unknown · tier2<2 · BV · false-OA · SHA-epoch"| DROP2(("✗"))
PREP --> JSONL["data/finetune/<skill>.jsonl<br/>(diversity gate: ≤30%/target IP)"]
JSONL --> FTUNE["finetune.py — QLoRA (r=16, α=32)"]
FTUNE --> LORA["lora_weights/<skill>/ → Ollama Modelfile"]
LORA -.->|"--lora at inference"| HARNESS
CSV -.->|"routing labels"| CLASS["classifier retrain"]
CLASS -.-> RUN
The gates, in order, that a session must clear to become training data:
OBJECTIVE_ACHIEVED/HALT_DISCIPLINE exit → success_fn true → Tier 1 clean →
Tier 2 ≥ 2 → known skill (≠ unknown) → no boundary violation → not pre-epoch →
under per-skill cap → enough turns → under per-target diversity ceiling.
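The per-session part of that gate chain can be sketched as an ordered list of predicates, where the first failure drops the session. The gate order follows the list above; the record field names (`exit`, `success`, `lab_suspect`, `tier2`, `skill`, `boundary_violation`) and the helper `first_failed_gate` are hypothetical. The epoch, turn-count, per-skill cap, and diversity gates are left out because they need information this map does not specify.

```python
# Ordered (name, predicate) pairs; order mirrors the gate list above.
GATES = [
    ("exit", lambda s: s["exit"] in {"OBJECTIVE_ACHIEVED", "HALT_DISCIPLINE"}),
    ("success_fn", lambda s: bool(s["success"])),
    ("tier1", lambda s: not s["lab_suspect"]),
    ("tier2", lambda s: s["tier2"] >= 2),
    ("known_skill", lambda s: s["skill"] != "unknown"),
    ("boundary", lambda s: not s["boundary_violation"]),
]


def first_failed_gate(session: dict):
    """Return the name of the first gate the session fails, or None if it passes."""
    for name, check in GATES:
        if not check(session):
            return name
    return None


clean = {
    "exit": "OBJECTIVE_ACHIEVED", "success": True, "lab_suspect": False,
    "tier2": 3, "skill": "recon", "boundary_violation": False,
}
low_score = dict(clean, tier2=1)
misrouted = dict(clean, skill="unknown")

kept = [s for s in (clean, low_score, misrouted) if first_failed_gate(s) is None]
```

Returning the gate name rather than a bare boolean makes drop reasons easy to count, which helps when a skill's dataset comes out smaller than expected.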
8. Data artifacts map
| Path | Written by | Read by | Contents |
|---|---|---|---|
| ~/.archer_sessions/*.jsonl | agent | dashboard | session events (start/command/halt/end) |
| ~/.archer_sessions/*.ft.jsonl | agent (--ft-log) | audit, prepare, dashboard | model turns → training candidates |
| ~/.archer_sessions/*.residual.json | agent (#202) | Auditor | unverified model claims (findings, severity) |
| {session}.ft.jsonl.tier2.json | audit_review | prepare, playbook seed | Tier 2 score sidecar |
| ~/.archer_routing_log.jsonl | router | classifier training | routing decisions + eval labels |
| ~/.archer_command_log.jsonl | agent | review | per-command attempts |
| ~/.archer_failure_log.jsonl | agent | RCA | failure taxonomy codes |
| ~/.archer_playbook.db | playbook | agent | learned commands + session metrics (domain-scoped) |
| ~/.archer_engagements/{id}/state.json | engagement | --resume | phase findings checkpoint |
| ~/.archer_campaigns.db | agent (#960) | agent | cross-session hosts/creds |
| testenv/eval_results/*.csv | harness | dashboard, CI gate | per-objective results |
| data/finetune/<skill>.jsonl | prepare_finetune | finetune | ChatML training examples |
| lora_weights/<skill>/ | finetune | inference (--lora) | LoRA adapter |
| ~/.archer_classifier/router_classifier.pkl | train_classifier | router | TF-IDF+LR routing model |
9. Load-bearing invariants (do not break without agreement)
- Single domain per session: PlayRegistry.load_domain() raises on the 2nd call. Prevents contradictory cross-domain prompts diluting the 8K context.
- Strict [OBJECTIVE_ACHIEVED] parser: only the exact token counts; the model claims, the code verifies (verify_fn/success_fn against real state).
- Depth guard: OA below min_commands is suppressed; blocks zero-/one-command false completions.
- Pack isolation: plays/*.py import nothing from ARCHER; all context flows through the config dict. Keeps packs independently testable and stackable.
- Safety is deterministic and early: container-escape (AR-9) and egress (AR-8) checks run before execution, with no model override.
- Container-root sudo bypass: when EXEC_TARGET is set, execute_with_sudo skips all password machinery (the container is already root).
- Ground truth is model-independent: success_fn is regex-only over real output; never asks the model whether it succeeded.
- Gates are hard: Tier 2 < 2, skill=unknown, boundary violations, and false-positive OA each exclude a session from training, unconditionally.
- Playbook is domain-scoped: domain NOT NULL; entries never cross domains; no domain loaded ⇒ no playbook ops.
- Observability is read-only on source: archer_live.py/archer_analytics.py only read eval/session/analytics artifacts.
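The single-domain invariant is small enough to sketch in full. The class below is a hypothetical stand-in for PlayRegistry, not its real implementation: the class name, the `loaded_domain` attribute shape, and the error message are assumptions, and the import, validation, and merge steps are reduced to a comment.

```python
class PlayRegistrySketch:
    """Illustrative registry that accepts exactly one domain per process."""

    def __init__(self):
        self.loaded_domain = None

    def load_domain(self, domain: str, subdomain: str):
        if self.loaded_domain is not None:
            # Refuse rather than merge: two packs would put contradictory
            # hints into the same limited prompt context.
            raise RuntimeError(
                f"domain already loaded: {self.loaded_domain!r}; "
                "one domain per session"
            )
        # The real loader would import, validate, and merge the pack here.
        self.loaded_domain = (domain, subdomain)
        return self.loaded_domain


registry = PlayRegistrySketch()
registry.load_domain("pentest", "recon")
try:
    registry.load_domain("pentest", "web")
except RuntimeError as exc:
    second_call_error = str(exc)
else:
    second_call_error = None
```

Raising on the second call, instead of silently replacing the first pack, turns a misconfiguration into an immediate failure rather than a session that runs with mixed hints.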
Generated 2026-06-15 from a verified read of the post-refactor codebase. If a detail here drifts from the code, the code wins — file an issue to re-sync.