Skip to content

ARCHER System Map

A top-down chart of every subsystem and how they connect. Grounded in the post-refactor code as of #972 (executor), #973 (router/playbook/domain split),

980 (agent extraction), #981 (archer/pentest/), #987 (plays/).

For the runtime pack contract see play-packs.md; for blast radius and dependency tables see ../../ARCHITECTURE.md; for the file inventory see ../../STRUCTURE.md.


1. The three-layer model (why the code is shaped this way)

Every line of ARCHER belongs to exactly one of three layers. The boundary is architectural, not advisory — compensating logic that crosses it is treated as a design defect.

flowchart TB
    subgraph HUMAN["🧑 HUMAN LAYER — judgment, scope, authorization"]
        H1["Defines scope & acceptable risk · authorizes irreversible actions · final QA/QC"]
    end
    subgraph MODEL["🤖 MODEL LAYER — probabilistic (qwen3:14b @ 8K)"]
        M1["command generation · output interpretation · next-step chaining · attack-chain narrative"]
    end
    subgraph CODE["⚙️ CODE LAYER — deterministic (must have a correct answer)"]
        C1["routing · execution · safety enforcement · halt detection · ground-truth verification · logging"]
    end
    HUMAN -->|"scope, authorization"| CODE
    CODE -->|"task + hints + constraints"| MODEL
    MODEL -->|"proposed command / findings"| CODE
    CODE -->|"verified findings, residual"| HUMAN
Layer Owns Lives in
Model generation, interpretation, reasoning the LLM (via archer_providers.py / Ollama)
Code routing, execution, safety, halt, verify, logging everything else in this map
Human scope, risk, authorization, final QA the operator (Auditor does initial QA)

2. Subsystem inventory

# Subsystem Primary file(s) Responsibility
1 Entry / CLI ARCHER.py (1.2k) Parse args, load one domain, dispatch to the agent loop
2 Agent core archer_platform/agent.py (4.5k) The session loop, prompt assembly, halt/OA gates, PlayRegistry
3 Execution + safety archer_executor.py (546) Run commands (local or docker exec), enforce safety/egress
4 Providers archer_providers.py (558) Chat-session abstraction (Ollama local, Claude/Gemini cloud)
5 Routing archer_router.py (391) Map task → skill category (classifier → keyword → bonus)
6 Domain interface archer_platform/domain.py (120) DomainConfig platform↔domain contract
7 Pentest domain archer/pentest/domain.py (52) PENTEST_DOMAIN instance + eval objectives/success_fns
8 Play packs plays/PT-*.py (×10) Per-skill hints/halt/bonus dispatchers + SKILL_CATEGORIES
9 Playbook archer_playbook.py (951) Learned command replay, variable substitution (domain-scoped)
10 Engagement / findings archer_engagement.py, archer_findings.py Multi-phase sequencing + cross-phase findings accumulation
11 Eval harness testenv/eval_harness.py (3.6k) + archer/pentest/eval/ Run objectives, score against ground truth, emit CSV
12 Training pipeline testenv/audit_review.py, scripts/prepare_finetune.py, scripts/finetune.py Tier 1/2 gates → JSONL → LoRA adapter
13 Observability scripts/archer_live.py (7k), scripts/archer_analytics.py Read-only dashboard + visitor analytics

3. Module dependency graph (runtime)

Arrows mean "imports / calls into." Play packs are loaded dynamically and never import back — that one-way edge is a load-bearing invariant.

flowchart TD
    CLI["ARCHER.py<br/>(entry / CLI / main)"]
    AGENT["archer_platform/agent.py<br/>(loop · PlayRegistry · halt/OA)"]
    EX  ["archer_executor.py<br/>(execute + safety)"]
    PROV["archer_providers.py<br/>(LLM sessions)"]
    ROUTE["archer_router.py<br/>(skill routing)"]
    PIFACE["archer_platform/domain.py<br/>(DomainConfig)"]
    PDOM["archer/pentest/domain.py<br/>(PENTEST_DOMAIN)"]
    PLAYS["plays/PT-*.py<br/>(dispatchers)"]
    PLAYBK["archer_playbook.py"]
    ENG["archer_engagement.py"]
    FIND["archer_findings.py"]

    CLI --> AGENT
    CLI --> PDOM
    CLI --> EX
    CLI --> ROUTE
    CLI --> PLAYBK
    CLI --> FIND
    AGENT --> EX
    AGENT --> PROV
    AGENT --> ROUTE
    AGENT --> PLAYBK
    AGENT --> FIND
    AGENT --> PIFACE
    AGENT --> PDOM
    AGENT -. "dynamic import<br/>PlayRegistry.load_domain" .-> PLAYS
    ROUTE --> EX
    ROUTE -. "late-bind via sys.modules<br/>(PLAYS.loaded_domain)" .-> AGENT
    PDOM --> PIFACE
    PDOM --> EX
    ENG --> FIND
    PLAYS -. "NO back-import<br/>(isolation invariant)" .-x AGENT

Key edges: - agent.py is the hub — it imports the executor, router, playbook, findings, providers, and both domain layers. - archer_routerarcher_executor only for default timeout constants; it reaches the loaded domain's SKILL_CATEGORIES via a deferred sys.modules lookup to avoid a circular import. - Play packs depend on nothing in ARCHER — everything they need arrives in the config dict at call time.


4. Agent session — control flow

One session = task in → terminal exit (OBJECTIVE_ACHIEVED / HALT_DISCIPLINE / error / timeout). Each turn is one model inference + one command execution.

flowchart TD
    START(["task in"]) --> ROUTE["route: detect_skill_category(task)<br/>→ skill category + min/max cmds + halt mode"]
    ROUTE --> PB{"playbook hit?<br/>(domain-scoped)"}
    PB -->|yes| SEED["seed validated command"]
    PB -->|no| BUILD
    SEED --> BUILD["assemble system prompt:<br/>env + hints_fn + domain addendum + JSON schema"]
    BUILD --> CALL["session.stream() → model turn"]
    CALL --> EXTRACT["extract bash command<br/>(JSON field or regex)"]
    EXTRACT --> SAFE{"validate_command_safety<br/>+ validate_command_egress"}
    SAFE -->|"escape (AR-9)"| HALT_ESC(["hard halt"])
    SAFE -->|"dangerous"| APPROVE["human approval"]
    SAFE -->|ok| RUN["execute_command / execute_with_sudo<br/>(local bash OR docker exec)"]
    APPROVE --> RUN
    RUN --> POST["post_process_output + scan analysis<br/>wrap in [EXTERNAL] (AR-1)"]
    POST --> OA{"[OBJECTIVE_ACHIEVED]?<br/>strict parser"}
    OA -->|"yes & step ≥ min_commands"| VERIFY["tier1 signal → tier1 probe → verify_fn"]
    OA -->|"yes but step < min"| DEPTH["suppress (depth guard #4)"]
    OA -->|no| HALTQ
    DEPTH --> HALTQ
    VERIFY -->|confirmed| DONE(["OBJECTIVE_ACHIEVED"])
    VERIFY -->|false positive| HALTQ
    HALTQ{"should_halt_objective()<br/>halt_fn + min/max gates"} -->|halt| HD(["HALT_DISCIPLINE"])
    HALTQ -->|"max steps (25)"| HD
    HALTQ -->|continue| BUILD
    DONE --> LOG["write session log + ft.jsonl + residual.json<br/>+ routing/command/failure logs"]
    HD --> LOG

The two exit gates encode the model/code split: the model claims completion ([OBJECTIVE_ACHIEVED]), but the code verifies it (verify_fn/success_fn against real target state) before it counts.


5. Routing chain (task → skill category)

A routing miss corrupts the whole session (wrong hints, wrong halt criteria, wrong tools), so this is the most upstream decision in the system.

flowchart LR
    T["task string"] --> C1{"Tier 1: Classifier<br/>TF-IDF + LR · conf ≥ 0.5<br/>+ skill in domain<br/>+ no exclude-kw veto"}
    C1 -->|pass| OUT["selected skill category"]
    C1 -->|"miss / --no-classifier"| C2{"Tier 2: Keyword scorer<br/>+2 keyword, −1 exclude,<br/>± bonus_fn"}
    C2 -->|"score > 0"| OUT
    C2 -->|"all ≤ 0"| UNK["'unknown'<br/>(excluded from training)"]
    OUT --> LOG["~/.archer_routing_log.jsonl<br/>(scores, score_gap, confidence,<br/>classifier_version, git_sha)"]
    C3["Tier 3: LLM gate<br/>(removed #169 — code dormant)"]:::dead
    classDef dead stroke-dasharray: 5 5,opacity:0.5;

The routing log is the training signal: eval runs write eval_label entries with the known-correct skill (label_confidence: high), which feed the next classifier.


6. Domain & play-pack plug-in model

--do <domain> --sd <subdomain> selects exactly one play pack. PlayRegistry.load_domain() imports it, validates its dispatchers, and merges its registries into the runtime globals — then refuses a second call (single-domain enforcement).

flowchart TD
    CLI["--do pentest --sd recon"] --> REG["PlayRegistry.load_domain()"]
    REG -->|"single-domain guard:<br/>RuntimeError on 2nd call"| X(("✗"))
    REG --> IMP["importlib → plays/PT-Recon.py"]
    IMP --> VAL["validate dispatchers<br/>(synthetic-call halt/hints/bonus)"]
    VAL --> MERGE["merge into globals"]
    MERGE --> SC["SKILL_CATEGORIES<br/>(core wins on conflict)"]
    MERGE --> TS["TARGET_SIGNATURES / NOUNS"]
    MERGE --> AD["SYSTEM_PROMPT_ADDENDUM (≤300 chars)"]
    SC --> ROUTER["archer_router reads SKILL_CATEGORIES"]
    SC --> LOOP["agent loop reads halt_fn / hints_fn / bonus_fn<br/>per category, via config dict"]

    subgraph PACK["play pack contract (stdlib-only, no ARCHER import)"]
        direction LR
        P1["SKILL_CATEGORIES{}"]
        P2["halt_fn(count, findings, config)→bool"]
        P3["hints_fn(task, config, tools, sigs)→list"]
        P4["bonus_fn(task, has_ctx, config)→int"]
        P5["SYSTEM_PROMPT_ADDENDUM"]
        P6["_register_pack_handlers()"]
    end
    DOMIFACE["archer_platform/domain.py<br/>DomainConfig contract"] --> PDOMI["archer/pentest/domain.py<br/>PENTEST_DOMAIN<br/>(error_patterns · post_process · report_prompt)"]

The 10 packs: PT-Recon, PT-Vulnerability, PT-Web, PT-Exploitation, PT-PostExploit, PT-Pivoting, PT-Privesc, PT-ActiveDirectory, PT-ThreatEmulation, plus the legacy monolithic penetration.py.


7. Eval → training pipeline (the learning loop)

This is a separate process from the agent: the harness spawns ARCHER as a subprocess, scores the result against ground truth, and the surviving sessions become fine-tuning data. Gates are hard — a failure at any stage drops the session.

flowchart TD
    OBJ["archer/pentest/eval/objectives.py<br/>(task · expected_play · success_fn · setup_fn · verify_fn)"]
    HARNESS["testenv/eval_harness.py"]
    OBJ --> HARNESS
    HARNESS -->|"setup_fn → spawn ARCHER → success_fn"| RUN["session run"]
    RUN --> CSV["testenv/eval_results/*.csv<br/>(skill_selected, success, halt_reason, 35+ cols)"]
    RUN --> FT["~/.archer_sessions/*.ft.jsonl<br/>(+ .residual.json)"]
    FT --> T1["audit_review.py — Tier 1<br/>deterministic structural checks<br/>(_SIGNAL_RE evidence filter)"]
    T1 -->|"lab_suspect → drop"| DROP1(("✗"))
    T1 --> T2["Tier 2 — LLM judge (Haiku)<br/>4 criteria × 0–3 → .tier2.json"]
    T2 --> PREP["prepare_finetune.py<br/>~11 hard gates"]
    PREP -->|"skill=unknown · tier2<2 · BV · false-OA · SHA-epoch"| DROP2(("✗"))
    PREP --> JSONL["data/finetune/&lt;skill&gt;.jsonl<br/>(diversity gate: ≤30%/target IP)"]
    JSONL --> FTUNE["finetune.py — QLoRA (r=16, α=32)"]
    FTUNE --> LORA["lora_weights/&lt;skill&gt;/ → Ollama Modelfile"]
    LORA -.->|"--lora at inference"| HARNESS
    CSV -.->|"routing labels"| CLASS["classifier retrain"]
    CLASS -.-> RUN

The gates, in order, that a session must clear to become training data: OBJECTIVE_ACHIEVED/HALT_DISCIPLINE exit → success_fn true → Tier 1 clean → Tier 2 ≥ 2 → known skill (≠ unknown) → no boundary violation → not pre-epoch → under per-skill cap → enough turns → under per-target diversity ceiling.


8. Data artifacts map

Path Written by Read by Contents
~/.archer_sessions/*.jsonl agent dashboard session events (start/command/halt/end)
~/.archer_sessions/*.ft.jsonl agent (--ft-log) audit, prepare, dashboard model turns → training candidates
~/.archer_sessions/*.residual.json agent (#202) Auditor unverified model claims (findings, severity)
{session}.ft.jsonl.tier2.json audit_review prepare, playbook seed Tier 2 score sidecar
~/.archer_routing_log.jsonl router classifier training routing decisions + eval labels
~/.archer_command_log.jsonl agent review per-command attempts
~/.archer_failure_log.jsonl agent RCA failure taxonomy codes
~/.archer_playbook.db playbook agent learned commands + session metrics (domain-scoped)
~/.archer_engagements/{id}/state.json engagement --resume phase findings checkpoint
~/.archer_campaigns.db agent (#960) agent cross-session hosts/creds
testenv/eval_results/*.csv harness dashboard, CI gate per-objective results
data/finetune/<skill>.jsonl prepare_finetune finetune ChatML training examples
lora_weights/<skill>/ finetune inference (--lora) LoRA adapter
~/.archer_classifier/router_classifier.pkl train_classifier router TF-IDF+LR routing model

9. Load-bearing invariants (do not break without agreement)

  1. Single domain per sessionPlayRegistry.load_domain() raises on the 2nd call. Prevents contradictory cross-domain prompts diluting the 8K context.
  2. Strict [OBJECTIVE_ACHIEVED] parser — only the exact token counts; the model claims, the code verifies (verify_fn/success_fn against real state).
  3. Depth guard — OA below min_commands is suppressed; blocks zero-/one-command false completions.
  4. Pack isolationplays/*.py import nothing from ARCHER; all context flows through the config dict. Keeps packs independently testable and stackable.
  5. Safety is deterministic and early — container-escape (AR-9) and egress (AR-8) checks run before execution, with no model override.
  6. Container-root sudo bypass — when EXEC_TARGET is set, execute_with_sudo skips all password machinery (the container is already root).
  7. Ground truth is model-independentsuccess_fn is regex-only over real output; never asks the model whether it succeeded.
  8. Gates are hard — Tier 2 < 2, skill=unknown, boundary violations, and false-positive OA each exclude a session from training, unconditionally.
  9. Playbook is domain-scopeddomain NOT NULL; entries never cross domains; no domain loaded ⇒ no playbook ops.
  10. Observability is read-only on sourcearcher_live.py / archer_analytics.py only read eval/session/analytics artifacts.

Generated 2026-06-15 from a verified read of the post-refactor codebase. If a detail here drifts from the code, the code wins — file an issue to re-sync.