ARCHER System Map

A top-down chart of every subsystem and how they connect. Grounded in the post-refactor code as of #972 (executor), #973 (router/playbook/domain split), #980 (agent extraction), #981 (archer/pentest/), and #987 (plays/).

For the runtime pack contract see play-packs.md; for blast radius and dependency tables see ../../ARCHITECTURE.md; for the file inventory see ../../STRUCTURE.md.

1. The three-layer model (why the code is shaped this way)

Every line of ARCHER belongs to exactly one of three layers. The boundary is architectural, not advisory — compensating logic that crosses it is treated as a design defect.

flowchart TB
    subgraph HUMAN["🧑 HUMAN LAYER — judgment, scope, authorization"]
        H1["Defines scope & acceptable risk · authorizes irreversible actions · final QA/QC"]
    end
    subgraph MODEL["🤖 MODEL LAYER — probabilistic (qwen3:14b @ 8K)"]
        M1["command generation · output interpretation · next-step chaining · attack-chain narrative"]
    end
    subgraph CODE["⚙️ CODE LAYER — deterministic (must have a correct answer)"]
        C1["routing · execution · safety enforcement · halt detection · ground-truth verification · logging"]
    end
    HUMAN -->|"scope, authorization"| CODE
    CODE -->|"task + hints + constraints"| MODEL
    MODEL -->|"proposed command / findings"| CODE
    CODE -->|"verified findings, residual"| HUMAN

| Layer | Owns | Lives in |
|---|---|---|
| Model | generation, interpretation, reasoning | the LLM (via `archer_providers.py` / Ollama) |
| Code | routing, execution, safety, halt, verify, logging | everything else in this map |
| Human | scope, risk, authorization, final QA | the operator (Auditor does initial QA) |

2. Subsystem inventory

| # | Subsystem | Primary file(s) | Responsibility |
|---|---|---|---|
| 1 | Entry / CLI | `ARCHER.py` (1.2k) | Parse args, load one domain, dispatch to the agent loop |
| 2 | Agent core | `archer_platform/agent.py` (4.5k) | The session loop, prompt assembly, halt/OA gates, `PlayRegistry` |
| 3 | Execution + safety | `archer_executor.py` (546) | Run commands (local or `docker exec`), enforce safety/egress |
| 4 | Providers | `archer_providers.py` (558) | Chat-session abstraction (Ollama local, Claude/Gemini cloud) |
| 5 | Routing | `archer_router.py` (391) | Map task → skill category (classifier → keyword → bonus) |
| 6 | Domain interface | `archer_platform/domain.py` (120) | `DomainConfig` platform↔domain contract |
| 7 | Pentest domain | `archer/pentest/domain.py` (52) | `PENTEST_DOMAIN` instance + eval objectives/success_fns |
| 8 | Play packs | `plays/PT-*.py` (×10) | Per-skill hints/halt/bonus dispatchers + `SKILL_CATEGORIES` |
| 9 | Playbook | `archer_playbook.py` (951) | Learned command replay, variable substitution (domain-scoped) |
| 10 | Engagement / findings | `archer_engagement.py`, `archer_findings.py` | Multi-phase sequencing + cross-phase findings accumulation |
| 11 | Eval harness | `testenv/eval_harness.py` (3.6k) + `archer/pentest/eval/` | Run objectives, score against ground truth, emit CSV |
| 12 | Training pipeline | `testenv/audit_review.py`, `scripts/prepare_finetune.py`, `scripts/finetune.py` | Tier 1/2 gates → JSONL → LoRA adapter |
| 13 | Observability | `scripts/archer_live.py` (7k), `scripts/archer_analytics.py` | Read-only dashboard + visitor analytics |

3. Module dependency graph (runtime)

Arrows mean "imports / calls into." Play packs are loaded dynamically and never import back — that one-way edge is a load-bearing invariant.

flowchart TD
    CLI["ARCHER.py<br/>(entry / CLI / main)"]
    AGENT["archer_platform/agent.py<br/>(loop · PlayRegistry · halt/OA)"]
    EX  ["archer_executor.py<br/>(execute + safety)"]
    PROV["archer_providers.py<br/>(LLM sessions)"]
    ROUTE["archer_router.py<br/>(skill routing)"]
    PIFACE["archer_platform/domain.py<br/>(DomainConfig)"]
    PDOM["archer/pentest/domain.py<br/>(PENTEST_DOMAIN)"]
    PLAYS["plays/PT-*.py<br/>(dispatchers)"]
    PLAYBK["archer_playbook.py"]
    ENG["archer_engagement.py"]
    FIND["archer_findings.py"]

    CLI --> AGENT
    CLI --> PDOM
    CLI --> EX
    CLI --> ROUTE
    CLI --> PLAYBK
    CLI --> FIND
    AGENT --> EX
    AGENT --> PROV
    AGENT --> ROUTE
    AGENT --> PLAYBK
    AGENT --> FIND
    AGENT --> PIFACE
    AGENT --> PDOM
    AGENT -. "dynamic import<br/>PlayRegistry.load_domain" .-> PLAYS
    ROUTE --> EX
    ROUTE -. "late-bind via sys.modules<br/>(PLAYS.loaded_domain)" .-> AGENT
    PDOM --> PIFACE
    PDOM --> EX
    ENG --> FIND
    PLAYS -. "NO back-import<br/>(isolation invariant)" .-x AGENT

Key edges:

- agent.py is the hub: it imports the executor, router, playbook, findings, providers, and both domain layers.
- archer_router imports archer_executor only for default timeout constants; it reaches the loaded domain's SKILL_CATEGORIES via a deferred sys.modules lookup to avoid a circular import.
- Play packs depend on nothing in ARCHER: everything they need arrives in the config dict at call time.
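The deferred `sys.modules` lookup can be illustrated with a minimal sketch. This is not the router's actual code: the helper name `loaded_skill_categories`, the assumption that `SKILL_CATEGORIES` is a module-level attribute of the agent module, and the empty-dict fallback are all hypothetical. Only the module path and the `SKILL_CATEGORIES` name come from this map.

```python
import sys
import types


def loaded_skill_categories(module_name: str = "archer_platform.agent") -> dict:
    """Return SKILL_CATEGORIES from an already-imported module, without importing it.

    The lookup happens at call time through sys.modules instead of a top-level
    import, so calling this never triggers an import of the agent module and
    the agent -> router -> agent cycle cannot form.
    """
    module = sys.modules.get(module_name)
    if module is None:
        # Assumption: treat "agent not imported yet" as "no domain loaded".
        return {}
    return getattr(module, "SKILL_CATEGORIES", {})


# Demonstration with a stand-in module (hypothetical name, not part of ARCHER).
_fake = types.ModuleType("fake_agent")
_fake.SKILL_CATEGORIES = {"recon": {"keywords": ["nmap"]}}
sys.modules["fake_agent"] = _fake

print(loaded_skill_categories("fake_agent"))
print(loaded_skill_categories("not_imported_anywhere"))
```

The trade-off of this pattern is that the dependency becomes invisible to static import analysis, which is presumably why the diagram draws it as a dashed edge.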

4. Agent session: control flow

One session = task in → terminal exit (OBJECTIVE_ACHIEVED / HALT_DISCIPLINE / error / timeout). Each turn is one model inference + one command execution.

flowchart TD
    START(["task in"]) --> ROUTE["route: detect_skill_category(task)<br/>→ skill category + min/max cmds + halt mode"]
    ROUTE --> PB{"playbook hit?<br/>(domain-scoped)"}
    PB -->|yes| SEED["seed validated command"]
    PB -->|no| BUILD
    SEED --> BUILD["assemble system prompt:<br/>env + hints_fn + domain addendum + JSON schema"]
    BUILD --> CALL["session.stream() → model turn"]
    CALL --> EXTRACT["extract bash command<br/>(JSON field or regex)"]
    EXTRACT --> SAFE{"validate_command_safety<br/>+ validate_command_egress"}
    SAFE -->|"escape (AR-9)"| HALT_ESC(["hard halt"])
    SAFE -->|"dangerous"| APPROVE["human approval"]
    SAFE -->|ok| RUN["execute_command / execute_with_sudo<br/>(local bash OR docker exec)"]
    APPROVE --> RUN
    RUN --> POST["post_process_output + scan analysis<br/>wrap in [EXTERNAL] (AR-1)"]
    POST --> OA{"[OBJECTIVE_ACHIEVED]?<br/>strict parser"}
    OA -->|"yes & step ≥ min_commands"| VERIFY["tier1 signal → tier1 probe → verify_fn"]
    OA -->|"yes but step < min"| DEPTH["suppress (depth guard #4)"]
    OA -->|no| HALTQ
    DEPTH --> HALTQ
    VERIFY -->|confirmed| DONE(["OBJECTIVE_ACHIEVED"])
    VERIFY -->|false positive| HALTQ
    HALTQ{"should_halt_objective()<br/>halt_fn + min/max gates"} -->|halt| HD(["HALT_DISCIPLINE"])
    HALTQ -->|"max steps (25)"| HD
    HALTQ -->|continue| BUILD
    DONE --> LOG["write session log + ft.jsonl + residual.json<br/>+ routing/command/failure logs"]
    HD --> LOG

The two exit gates encode the model/code split: the model claims completion ([OBJECTIVE_ACHIEVED]), but the code verifies it (verify_fn/success_fn against real target state) before it counts.
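The claim-then-verify split can be sketched as a single decision function. This is a simplified illustration under stated assumptions: `completion_gate` and its return labels are hypothetical names, the substring test stands in for the real strict parser, and `verify_fn` is reduced to a zero-argument callable.

```python
from typing import Callable

OA_TOKEN = "[OBJECTIVE_ACHIEVED]"


def completion_gate(
    model_output: str,
    step: int,
    min_commands: int,
    verify_fn: Callable[[], bool],
) -> str:
    """Classify a turn's completion claim.

    Returns "no_claim", "suppressed", "false_positive", or "confirmed".
    Only "confirmed" would end the session as OBJECTIVE_ACHIEVED; the other
    three fall through to the halt check.
    """
    if OA_TOKEN not in model_output:
        return "no_claim"
    if step < min_commands:
        # Depth guard: a claim made before min_commands is ignored.
        return "suppressed"
    if not verify_fn():
        # The code checked real target state and disagreed with the model.
        return "false_positive"
    return "confirmed"


print(completion_gate("found it [OBJECTIVE_ACHIEVED]", 1, 2, lambda: True))
print(completion_gate("found it [OBJECTIVE_ACHIEVED]", 3, 2, lambda: True))
```

Note that `verify_fn` is only called once the depth guard has passed, so an early claim never costs a probe against the target.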

5. Routing chain (task → skill category)

A routing miss corrupts the whole session (wrong hints, wrong halt criteria, wrong tools), so this is the most upstream decision in the system.

flowchart LR
    T["task string"] --> C1{"Tier 1: Classifier<br/>TF-IDF + LR · conf ≥ 0.5<br/>+ skill in domain<br/>+ no exclude-kw veto"}
    C1 -->|pass| OUT["selected skill category"]
    C1 -->|"miss / --no-classifier"| C2{"Tier 2: Keyword scorer<br/>+2 keyword, −1 exclude,<br/>± bonus_fn"}
    C2 -->|"score > 0"| OUT
    C2 -->|"all ≤ 0"| UNK["'unknown'<br/>(excluded from training)"]
    OUT --> LOG["~/.archer_routing_log.jsonl<br/>(scores, score_gap, confidence,<br/>classifier_version, git_sha)"]
    C3["Tier 3: LLM gate<br/>(removed #169 — code dormant)"]:::dead
    classDef dead stroke-dasharray: 5 5,opacity:0.5;

The routing log is the training signal: eval runs write eval_label entries with the known-correct skill (label_confidence: high), which feed the next classifier.
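The two live tiers can be sketched as follows. This is an illustration of the chain in the diagram, not the code in `archer_router.py`: the `route` and `keyword_score` names, the `keywords` / `exclude` / `bonus_fn` dictionary keys, substring matching, and the classifier's `(skill, confidence)` return shape are all assumptions.

```python
from typing import Callable, Dict, Optional, Tuple

CONF_THRESHOLD = 0.5  # classifier confidence floor, taken from the diagram


def keyword_score(task: str, category: dict) -> int:
    """Tier 2 score: +2 per keyword hit, -1 per exclude hit, plus the pack bonus."""
    text = task.lower()
    score = sum(2 for kw in category.get("keywords", []) if kw in text)
    score -= sum(1 for kw in category.get("exclude", []) if kw in text)
    bonus_fn = category.get("bonus_fn")
    if bonus_fn is not None:
        # Contract shape from section 6: bonus_fn(task, has_ctx, config) -> int
        score += bonus_fn(task, False, {})
    return score


def route(
    task: str,
    categories: Dict[str, dict],
    classifier: Optional[Callable[[str], Tuple[str, float]]] = None,
) -> str:
    text = task.lower()
    # Tier 1: accept the classifier only if it is confident, the skill exists
    # in the loaded domain, and no exclude keyword vetoes it.
    if classifier is not None:
        skill, confidence = classifier(task)
        category = categories.get(skill)
        if category is not None and confidence >= CONF_THRESHOLD:
            if not any(kw in text for kw in category.get("exclude", [])):
                return skill
    # Tier 2: keyword scorer; every score at or below zero means "unknown".
    scores = {name: keyword_score(task, cat) for name, cat in categories.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] <= 0:
        return "unknown"
    return best


cats = {
    "recon": {"keywords": ["scan", "nmap"], "exclude": ["exploit"]},
    "web": {"keywords": ["http", "sqli"]},
}
print(route("nmap scan of the lab host", cats))
print(route("say hello", cats))
```

Passing `classifier=None` corresponds to the `--no-classifier` path, where routing starts directly at Tier 2.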

6. Domain & play-pack plug-in model

--do <domain> --sd <subdomain> selects exactly one play pack. PlayRegistry.load_domain() imports it, validates its dispatchers, and merges its registries into the runtime globals — then refuses a second call (single-domain enforcement).

flowchart TD
    CLI["--do pentest --sd recon"] --> REG["PlayRegistry.load_domain()"]
    REG -->|"single-domain guard:<br/>RuntimeError on 2nd call"| X(("✗"))
    REG --> IMP["importlib → plays/PT-Recon.py"]
    IMP --> VAL["validate dispatchers<br/>(synthetic-call halt/hints/bonus)"]
    VAL --> MERGE["merge into globals"]
    MERGE --> SC["SKILL_CATEGORIES<br/>(core wins on conflict)"]
    MERGE --> TS["TARGET_SIGNATURES / NOUNS"]
    MERGE --> AD["SYSTEM_PROMPT_ADDENDUM (≤300 chars)"]
    SC --> ROUTER["archer_router reads SKILL_CATEGORIES"]
    SC --> LOOP["agent loop reads halt_fn / hints_fn / bonus_fn<br/>per category, via config dict"]

    subgraph PACK["play pack contract (stdlib-only, no ARCHER import)"]
        direction LR
        P1["SKILL_CATEGORIES{}"]
        P2["halt_fn(count, findings, config)→bool"]
        P3["hints_fn(task, config, tools, sigs)→list"]
        P4["bonus_fn(task, has_ctx, config)→int"]
        P5["SYSTEM_PROMPT_ADDENDUM"]
        P6["_register_pack_handlers()"]
    end
    DOMIFACE["archer_platform/domain.py<br/>DomainConfig contract"] --> PDOMI["archer/pentest/domain.py<br/>PENTEST_DOMAIN<br/>(error_patterns · post_process · report_prompt)"]

The 10 packs: PT-Recon, PT-Vulnerability, PT-Web, PT-Exploitation, PT-PostExploit, PT-Pivoting, PT-Privesc, PT-ActiveDirectory, PT-ThreatEmulation, plus the legacy monolithic penetration.py.
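The load sequence (single-domain guard, synthetic-call validation, core-wins merge) can be sketched with a stand-in registry. `SingleDomainRegistry` is a hypothetical class, not ARCHER's `PlayRegistry`; the per-category dictionary layout and the empty synthetic arguments are assumptions. The dispatcher signatures follow the pack contract shown in the diagram.

```python
from types import SimpleNamespace


class SingleDomainRegistry:
    """Illustrative stand-in for a registry that loads exactly one play pack."""

    def __init__(self, core_categories: dict):
        self.categories = dict(core_categories)
        self.loaded = None

    def load_domain(self, name: str, pack) -> None:
        # Single-domain guard: a second load is an error, not a merge.
        if self.loaded is not None:
            raise RuntimeError(f"{self.loaded!r} already loaded; refusing {name!r}")
        # Validate each dispatcher with a synthetic call and check its return type.
        for cat_name, cat in pack.SKILL_CATEGORIES.items():
            if not isinstance(cat["halt_fn"](0, [], {}), bool):
                raise TypeError(f"{cat_name}: halt_fn must return bool")
            if not isinstance(cat["hints_fn"]("", {}, [], {}), list):
                raise TypeError(f"{cat_name}: hints_fn must return list")
            if not isinstance(cat["bonus_fn"]("", False, {}), int):
                raise TypeError(f"{cat_name}: bonus_fn must return int")
        # Core wins on conflict: core entries are applied last.
        self.categories = {**pack.SKILL_CATEGORIES, **self.categories}
        self.loaded = name


# A tiny pack that satisfies the contract (contents are made up).
demo_pack = SimpleNamespace(
    SKILL_CATEGORIES={
        "recon": {
            "halt_fn": lambda count, findings, config: count >= 5,
            "hints_fn": lambda task, config, tools, sigs: ["enumerate services first"],
            "bonus_fn": lambda task, has_ctx, config: 1,
        }
    }
)

registry = SingleDomainRegistry(core_categories={})
registry.load_domain("pentest/recon", demo_pack)
try:
    registry.load_domain("pentest/web", demo_pack)
except RuntimeError as exc:
    print("second load refused:", exc)
```

Validating at load time means a malformed pack fails before the first model turn rather than partway through a session.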

7. Eval → training pipeline (the learning loop)

This is a separate process from the agent: the harness spawns ARCHER as a subprocess, scores the result against ground truth, and the surviving sessions become fine-tuning data. Gates are hard — a failure at any stage drops the session.

flowchart TD
    OBJ["archer/pentest/eval/objectives.py<br/>(task · expected_play · success_fn · setup_fn · verify_fn)"]
    HARNESS["testenv/eval_harness.py"]
    OBJ --> HARNESS
    HARNESS -->|"setup_fn → spawn ARCHER → success_fn"| RUN["session run"]
    RUN --> CSV["testenv/eval_results/*.csv<br/>(skill_selected, success, halt_reason, 35+ cols)"]
    RUN --> FT["~/.archer_sessions/*.ft.jsonl<br/>(+ .residual.json)"]
    FT --> T1["audit_review.py — Tier 1<br/>deterministic structural checks<br/>(_SIGNAL_RE evidence filter)"]
    T1 -->|"lab_suspect → drop"| DROP1(("✗"))
    T1 --> T2["Tier 2 — LLM judge (Haiku)<br/>4 criteria × 0–3 → .tier2.json"]
    T2 --> PREP["prepare_finetune.py<br/>~11 hard gates"]
    PREP -->|"skill=unknown · tier2<2 · BV · false-OA · SHA-epoch"| DROP2(("✗"))
    PREP --> JSONL["data/finetune/&lt;skill&gt;.jsonl<br/>(diversity gate: ≤30%/target IP)"]
    JSONL --> FTUNE["finetune.py — QLoRA (r=16, α=32)"]
    FTUNE --> LORA["lora_weights/&lt;skill&gt;/ → Ollama Modelfile"]
    LORA -.->|"--lora at inference"| HARNESS
    CSV -.->|"routing labels"| CLASS["classifier retrain"]
    CLASS -.-> RUN

The gates, in order, that a session must clear to become training data: OBJECTIVE_ACHIEVED/HALT_DISCIPLINE exit → success_fn true → Tier 1 clean → Tier 2 ≥ 2 → known skill (≠ unknown) → no boundary violation → not pre-epoch → under per-skill cap → enough turns → under per-target diversity ceiling.
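An ordered chain of hard gates can be sketched as a list of named predicates where the first failure drops the session. This covers only the first six gates from the sentence above; the later ones (epoch, per-skill cap, turn count, per-target diversity) are left out to keep the sketch short. All field names in the session record (`exit`, `success`, `lab_suspect`, `tier2`, `skill`, `boundary_violation`) are hypothetical and do not reflect the real `.ft.jsonl` schema.

```python
from typing import Optional

# (gate name, predicate) pairs, evaluated in order. Field names are assumptions.
GATES = [
    ("exit", lambda s: s["exit"] in {"OBJECTIVE_ACHIEVED", "HALT_DISCIPLINE"}),
    ("success_fn", lambda s: s["success"]),
    ("tier1", lambda s: not s["lab_suspect"]),
    ("tier2", lambda s: s["tier2"] >= 2),
    ("known_skill", lambda s: s["skill"] != "unknown"),
    ("boundary", lambda s: not s["boundary_violation"]),
]


def first_failed_gate(session: dict) -> Optional[str]:
    """Return the name of the first gate the session fails, or None if it clears all."""
    for name, check in GATES:
        if not check(session):
            return name
    return None


good = {
    "exit": "OBJECTIVE_ACHIEVED",
    "success": True,
    "lab_suspect": False,
    "tier2": 3,
    "skill": "recon",
    "boundary_violation": False,
}
print(first_failed_gate(good))
print(first_failed_gate({**good, "tier2": 1, "skill": "unknown"}))
```

Returning the name of the failing gate, rather than a bare boolean, makes it easy to count how many sessions each gate removes.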

8. Data artifacts map

| Path | Written by | Read by | Contents |
|---|---|---|---|
| `~/.archer_sessions/*.jsonl` | agent | dashboard | session events (start/command/halt/end) |
| `~/.archer_sessions/*.ft.jsonl` | agent (`--ft-log`) | audit, prepare, dashboard | model turns → training candidates |
| `~/.archer_sessions/*.residual.json` | agent (#202) | Auditor | unverified model claims (findings, severity) |
| `{session}.ft.jsonl.tier2.json` | audit_review | prepare, playbook seed | Tier 2 score sidecar |
| `~/.archer_routing_log.jsonl` | router | classifier training | routing decisions + eval labels |
| `~/.archer_command_log.jsonl` | agent | review | per-command attempts |
| `~/.archer_failure_log.jsonl` | agent | RCA | failure taxonomy codes |
| `~/.archer_playbook.db` | playbook | agent | learned commands + session metrics (domain-scoped) |
| `~/.archer_engagements/{id}/state.json` | engagement | `--resume` | phase findings checkpoint |
| `~/.archer_campaigns.db` | agent (#960) | agent | cross-session hosts/creds |
| `testenv/eval_results/*.csv` | harness | dashboard, CI gate | per-objective results |
| `data/finetune/<skill>.jsonl` | prepare_finetune | finetune | ChatML training examples |
| `lora_weights/<skill>/` | finetune | inference (`--lora`) | LoRA adapter |
| `~/.archer_classifier/router_classifier.pkl` | train_classifier | router | TF-IDF+LR routing model |

9. Load-bearing invariants (do not break without agreement)

- Single domain per session: PlayRegistry.load_domain() raises on the second call. This prevents contradictory cross-domain prompts from diluting the 8K context.
- Strict [OBJECTIVE_ACHIEVED] parser: only the exact token counts; the model claims, the code verifies (verify_fn/success_fn against real state).
- Depth guard: an OA claim below min_commands is suppressed, which blocks zero- and one-command false completions.
- Pack isolation: plays/*.py import nothing from ARCHER; all context flows through the config dict. This keeps packs independently testable and stackable.
- Safety is deterministic and early: container-escape (AR-9) and egress (AR-8) checks run before execution, with no model override.
- Container-root sudo bypass: when EXEC_TARGET is set, execute_with_sudo skips all password machinery (the container is already root).
- Ground truth is model-independent: success_fn is regex-only over real output and never asks the model whether it succeeded.
- Gates are hard: Tier 2 < 2, skill=unknown, boundary violations, and false-positive OA each exclude a session from training, unconditionally.
- Playbook is domain-scoped: domain is NOT NULL; entries never cross domains; no domain loaded ⇒ no playbook ops.
- Observability is read-only on source: archer_live.py / archer_analytics.py only read eval/session/analytics artifacts.
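An invariant like pack isolation can be checked mechanically. The sketch below uses the standard-library `ast` module to list imports in a pack's source that point back at ARCHER. It is one possible check, not something this map says exists in the repository: the function name `forbidden_imports` and the `archer` / `ARCHER` prefix list are assumptions, and dynamic imports (for example via `importlib`) and relative imports are not detected.

```python
import ast

# Assumption: any module whose dotted name starts with one of these is "ARCHER".
FORBIDDEN_PREFIXES = ("archer", "ARCHER")


def forbidden_imports(source: str) -> list:
    """Return the imported module names in `source` that start with a forbidden prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; such imports are skipped.
            names = [node.module or ""]
        else:
            continue
        found.extend(name for name in names if name.startswith(FORBIDDEN_PREFIXES))
    return found


clean_pack = "import re\nimport json\n\nSKILL_CATEGORIES = {}\n"
leaky_pack = "import re\nfrom archer_platform.agent import PlayRegistry\n"

print(forbidden_imports(clean_pack))
print(forbidden_imports(leaky_pack))
```

Run over every file under `plays/`, a check like this would turn the "no back-import" edge in the dependency graph into a failing test rather than a convention.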

Generated 2026-06-15 from a verified read of the post-refactor codebase. If a detail here drifts from the code, the code wins — file an issue to re-sync.