Sagittarius Threat Hunting — System Design

Status: Pre-implementation planning. No code written. Design surface for Sagittarius, a standalone distributed threat-hunting product built on the ARCHER engine.

Naming: Sagittarius is the product/deployment identity of this threat-hunting capability — a distinct distributed defensive platform. ARCHER is the underlying agent engine; the threat-hunting domain runs on ARCHER (agent loop, play packs, eval harness, --do hunting), but its product home, packaging, and deployment story are Sagittarius. Where this doc says "the hunting domain" or "the Executor/Explorer," that is the ARCHER engine; where it says "the product," that is Sagittarius.

What It Is

Sagittarius is a scheduled, hypothesis-driven AI hunting product, built on the ARCHER engine. It runs hunt plays against the telemetry available in its deployment environment — Zeek/Suricata/Elasticsearch-class network and endpoint sources — using the three-outcome verdict model below.

Sagittarius does not depend on any single SIEM or sensor platform. Security Onion / Elastic telemetry is a possible future integration — one supported data source Sagittarius could ingest later — not the deployment target or distribution channel. See "Data-Source Integration Scope" below.

It extends ARCHER's existing architecture (agent loop, play packs, eval harness) into the defensive domain via a new --do hunting domain flag, following the same single-domain enforcement pattern used for penetration testing.

Fundamental Differences: Pentesting vs Threat Hunting

The pentest domain and the hunting domain share ARCHER's core architecture but differ across five axes:

Axis	Pentesting	Threat Hunting
Environment	Adversarial — the target resists, may evict you	Cooperative — full read access to all telemetry
Data model	Discovered through action — picture built from nothing	Pre-existing, queryable — telemetry is already there
Action model	Hard, stateful — sends packets, spawns shells, changes state	Read-only queries — same hunt returns same answer every time
Success criteria	Binary: got the shell / flag / credential	Three-outcome: CONFIRMED / NOT OBSERVED / COVERAGE GAP
Time orientation	Present-tense against a live system	Past-tense against historical records

What this means for the model's job: In pentesting the model navigates an unknown system based on what it reveals under probing. In threat hunting the model reasons over a structured dataset that already exists. The challenge shifts from access to signal extraction.

The Three-Outcome Verdict Model

Pentest success/halt is binary. Hunting requires three outcomes — this is load-bearing for the eval harness:

Verdict	Meaning
CONFIRMED	Pattern found — hypothesis supported by evidence
NOT OBSERVED	Pattern absent — hypothesis denied, but data was present and queryable
COVERAGE GAP	Telemetry source missing, or the source returned zero records for the query window (no data to evaluate); not a clean bill of health

NOT OBSERVED is a valid finding. A hunt that finds nothing is not a failed hunt. The eval harness, the play's triage_fn, and the verdict store must distinguish all three. COVERAGE GAP in particular must never be silently treated as NOT OBSERVED — a missing data source is a detection blind spot, not evidence of absence.
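
The distinction above can be sketched as a small classifier. This is an illustration only, not ARCHER code: the names `Verdict`, `classify`, `source_present`, `records_scanned`, and `pattern_matched` are assumptions introduced here, and "zero records" is read as "the source held no data for the window" rather than "the filter matched nothing".

```python
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "CONFIRMED"
    NOT_OBSERVED = "NOT_OBSERVED"
    COVERAGE_GAP = "COVERAGE_GAP"


def classify(source_present: bool, records_scanned: int, pattern_matched: bool) -> Verdict:
    """Map one play run onto the three-outcome model (hypothetical helper)."""
    # A missing or empty source is a blind spot, never a clean result,
    # so this check runs before any pattern logic.
    if not source_present or records_scanned == 0:
        return Verdict.COVERAGE_GAP
    if pattern_matched:
        return Verdict.CONFIRMED
    # Data was present and queryable, but the pattern was absent.
    return Verdict.NOT_OBSERVED
```

Checking for the gap first means no code path can downgrade a missing source to NOT OBSERVED.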

Play Architecture (vs Pentest Skill Pack)

"Skills" → "Plays" is the correct terminology shift. "Skill" implies agent capability. "Play" implies a defined procedure from a playbook — run to test a hypothesis. Industry standard term (hunt playbooks, threat playbooks).

Pentest skill pack shape (current):

setup_fn:       verify target reachable
hints_fn:       specific tool invocations (nmap, sqlmap, gobuster)
success_fn:     check for specific output (shell, flag, credentials)
halt_fn:        stop conditions (max commands, danger detection, objective achieved)
bonus_fn:       routing weight delta

Hunting play shape (planned):

hypothesis:     "Host X is beaconing to a C2"
data_sources:   [zeek.conn, zeek.dns, suricata.alert]  ← declared upfront (step 2 pivots to DNS); COVERAGE_GAP if absent
query_chain:
  step 1:       aggregate Zeek conn by dest IP
                group_by: [src_ip, dest_ip]
                compute: [count, avg_bytes, interval_variance]
                filter: interval_variance < 30 AND count > 10
  step 2:       IF matches → pivot to DNS queries for that dest IP
  step 3:       IF DGA-pattern domain → threat intel enrichment
triage_fn:      evaluates result against hypothesis + baseline registry
pass_fn:        periodic + small payload + DGA domain = CONFIRMED
deny_fn:        no periodic pattern in 24h window = NOT OBSERVED
inconclusive_fn: source absent or 0 records = COVERAGE GAP

New dispatcher additions vs pentest:

triage_fn(result, hypothesis, baseline_registry) → verdict — replaces success_fn/halt_fn
data_sources declaration — engine returns COVERAGE GAP if required source absent
Three-verdict model throughout eval harness
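
A minimal sketch of the data_sources gate, assuming the engine knows which sources the environment exposes. The helper names `missing_sources` and `run_play`, and the idea of passing the chain as a callable, are assumptions for illustration and not the planned dispatcher API.

```python
from typing import Callable, Iterable


def missing_sources(required: Iterable[str], available: Iterable[str]) -> list[str]:
    """Return the required sources the environment does not expose."""
    have = set(available)
    return sorted(s for s in required if s not in have)


def run_play(required: Iterable[str], available: Iterable[str],
             execute_chain: Callable[[], str]) -> str:
    """Gate the query chain on the play's declared data sources."""
    if missing_sources(required, available):
        # The chain never runs, so a blind spot cannot be reported
        # as NOT_OBSERVED by accident.
        return "COVERAGE_GAP"
    return execute_chain()
```

Because the gate runs before the chain, triage_fn only ever sees results from sources that actually exist.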

Why the model never writes queries

The critical design choice: the model interprets results, not query syntax. Human play authors write the Elasticsearch DSL, Zeek field filters, and aggregation logic. The model decides when to execute step 2 and what the aggregate result means for the hypothesis.

This eliminates the hallucinated field name problem that plagues dynamic query generation approaches. A model asked to write raw ES DSL will invent field names. A model asked to evaluate a pre-aggregated result set will not.

Tradeoff: plays must be written for novel hypotheses before they can run. The Explorer subsystem (below) addresses this.
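
One way to keep the model out of query syntax is to treat each query as a human-authored template and have code fill in environment parameters. The sketch below assumes `{name}`-style placeholders; `render_query` is a hypothetical helper, not an existing ARCHER function.

```python
from string import Formatter


def render_query(template: str, params: dict) -> str:
    """Fill a human-authored query template from local config values.

    The model supplies nothing here: placeholders come from the play
    author and values come from the environment's config.
    """
    needed = {name for _, name, _, _ in Formatter().parse(template) if name}
    missing = needed - set(params)
    if missing:
        # Fail loudly instead of sending a half-rendered query.
        raise KeyError(f"unbound play params: {sorted(missing)}")
    return template.format(**params)
```

For example, `render_query("zeek.conn | where ts > now()-{lookback}", {"lookback": "24h"})` yields `zeek.conn | where ts > now()-24h`. Field names stay exactly as the play author wrote them.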

Play Authoring Format (YAML) — Practitioner-Facing

The play shape above is pseudo-code; this is the concrete artifact a practitioner authors. Design goal: a hunt methodology is encoded as low-complexity declarative YAML that reads like a runbook, not code — the same authoring bar as a Sigma rule. No Python required to define a play.

Worked example (TH-BEACON-01, C2 beaconing):

id: TH-BEACON-01
technique: T1071.001                 # ATT&CK
author: jhawkins
hypothesis: >
  A host on the user subnet is beaconing to external C2 over HTTP(S) at a
  regular interval with low jitter and consistent small request sizes.

requires:                            # data-source gate — COVERAGE GAP if absent (Sigma logsource analogue)
  - zeek.conn
  - zeek.http

params:                              # env-specific, injected at runtime — keeps the LOGIC portable
  internal_cidr:   "{INTERNAL_CIDR}"
  lookback:        "24h"
  min_connections: 50
  max_jitter_pct:  15

steps:
  - id: candidates
    intent: "Internal->external pairs with many connections in the window."
    query: >
      zeek.conn | where ts > now()-{lookback} and src in {internal_cidr}
      and dst not in {internal_cidr} | summarize n=count() by src,dst | where n >= {min_connections}
    expect: rows                      # zero rows -> NOT OBSERVED, stop (deterministic gate)
    hints: ["Exclude known-good dst (CDN, OS update, telemetry) before flagging."]

  - id: periodicity
    intent: "Test inter-arrival times for low-jitter regularity."
    for_each: candidates
    branch:
      - when: { numeric: { field: jitter_pct, lt: "{max_jitter_pct}" } }
        goto: enrich
      - else: { verdict: not_observed }
    hints:
      - "Regular small requests at near-constant interval = beaconing."
      - "Human browsing is bursty/irregular - high jitter, varied sizes."

  - id: enrich
    intent: "dst reputation + JA3/TLS + URI patterns to confirm or clear."
    hints: ["Rare JA3, empty User-Agent, beacon-style URIs raise confidence."]

verdict:                             # three-outcome - evaluated by the #945 atom grammar (reused, not new)
  confirmed:
    all:
      - step_passed: periodicity
      - any: [ { pattern: "rare_ja3|empty user-agent|known_c2" },
               { reputation: { dst: "{dst}", min_score: 70 } } ]
  coverage_gap: { source_absent: any }
  not_observed: default
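
The play above branches on a jitter_pct field without defining how it is computed. One plausible definition, assumed here purely for illustration, is the coefficient of variation of the inter-arrival gaps expressed as a percentage:

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def jitter_pct(timestamps: Sequence[float]) -> Optional[float]:
    """Spread of inter-arrival gaps as a percentage of the mean gap.

    Assumed definition: 100 * population stddev(gaps) / mean(gaps).
    Returns None when there are too few connections to judge.
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg == 0:
        return None
    return 100.0 * pstdev(gaps) / avg
```

Under this definition a perfectly regular 60-second beacon scores 0, while bursty human browsing scores far above the max_jitter_pct of 15 used in the example.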

Reuse, don't reinvent — the evaluation engine already exists. The verdict block is built from the #945 declarative atom evaluator already running in the eval harness (testenv/objective_registry.py + the atom grammar: pattern, tool_used, numeric, any/all). testenv/objectives/PT-DISTCC-01.yaml proves objectives can be authored natively in YAML rather than ported from Python. The hunting verdict/triage block compiles to the same atoms — no new evaluation engine is built; the format is an extension of a proven one.
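
To make the atom idea concrete, here is a toy evaluator for a subset of the grammar (pattern, numeric, any/all). It is not the #945 evaluator in testenv/objective_registry.py; the function name `eval_atom`, the `ctx` layout, and the `gt` comparison are assumptions added for this sketch.

```python
import re
from typing import Any, Mapping


def eval_atom(atom: Mapping[str, Any], ctx: Mapping[str, Any]) -> bool:
    """Recursively evaluate a declarative atom against a run context."""
    if "all" in atom:
        return all(eval_atom(a, ctx) for a in atom["all"])
    if "any" in atom:
        return any(eval_atom(a, ctx) for a in atom["any"])
    if "pattern" in atom:
        # Case-insensitive regex match against collected evidence text.
        evidence = ctx.get("evidence", "")
        return re.search(atom["pattern"], evidence, re.IGNORECASE) is not None
    if "numeric" in atom:
        spec = atom["numeric"]
        value = ctx.get(spec["field"])
        if value is None:
            return False  # a missing field cannot satisfy a comparison
        if "lt" in spec:
            return value < spec["lt"]
        if "gt" in spec:
            return value > spec["gt"]
    raise ValueError(f"unsupported atom: {atom!r}")
```

The verdict stays deterministic because the rule is data and the evaluator is code; the model only contributes the evidence being tested.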

Compile target (post-#973): a loader turns the YAML into a DomainConfig (archer_platform/domain.py):

YAML field	DomainConfig surface
`requires`	`data_sources` declaration -> COVERAGE GAP gate
`steps[].hints`	`hints_fn` (per-step prompt guidance)
`steps[].branch`	deterministic query-chain control flow
`verdict`	`triage_fn` -> CONFIRMED / NOT OBSERVED / COVERAGE GAP
`params`	runtime-injected environment context

Refinement to the promotion pipeline: the play promotion pipeline (see "Play promotion pipeline" below) promotes Explorer drafts to plays/HT/*.py (Python). The #945 YAML-native precedent means validated plays can stay declarative YAML (loaded -> DomainConfig) instead of being ported to Python, which lowers the authoring bar and keeps plays community-contributable as data, not code. Open decision: YAML-native plays vs. Python play packs. Recommendation: YAML-native for author-facing plays; reserve Python only for plays needing custom logic the atom grammar cannot express.

One artifact, three roles

The same play file serves three purposes — this is what makes "train the AI to employ the technique consistently" tractable:

Runtime program — loaded -> DomainConfig; the Executor runs the chain.
Training-data generator — a verdict-validated run becomes a session trace -> ft.jsonl -> QLoRA (ARCHER's existing V1->V2 pipeline). Over many runs the model internalizes the methodology.
Its own test — the verdict atoms are the eval pass/fail criteria. Run the play against the labeled defensive range (sensitivity + specificity). This is the measurability differentiator.

Determinism lives where ARCHER's three-layer split puts it: code enforces the chain/branches/verdict (repeatable), the model does bounded per-node interpretation, the human sets scope and baselines. The methodology is deterministic; the LLM is a bounded interpreter inside it.

Dual-Subsystem Architecture

Two subsystems sharing one Ollama endpoint. 8GB VRAM is not a barrier — Executor and Explorer never run concurrently. They are sequential, scheduled workloads.

Timeline example:
00:00  Executor sweep — 20-play queue, ~45 min compute, --think=false
00:45  Executor done, verdicts logged, GPU idle
01:00  Explorer wakes — reads MITRE gaps + recent verdicts + schema, --think=true
02:30  Explorer done — draft YAML written to candidate_plays/
       GPU idle until next sweep window
06:00  Executor runs again

Executor

Runs validated plays from the playbook on defined schedule
Uses qwen3:14b --think=false — fast, deterministic, reliable
Produces CONFIRMED / NOT OBSERVED / COVERAGE GAP per play per run
Logs every verdict to persistent SQLite store

Explorer

Runs in idle windows between Executor sweeps
Uses qwen3:14b --think=true — slower, deeper reasoning justified by exploration goal
Input: MITRE ATT&CK technique coverage gaps + recent verdict history + data source schema (static, injected — no hallucinated fields)
Output: candidate play YAML written to candidate_plays/ — structured draft, not executable Python
Human reviews candidate YAML weekly; promotes sound candidates to validated play packs
Never runs against live data autonomously — it produces draft hypotheses only

Play promotion pipeline

MITRE gap + recent verdicts + known schema
    ↓ Explorer (daily, off-peak)
candidate_plays/{MITRE-ID}_{date}.yaml    ← human-readable draft
    ↓ Human review gate
plays/HT/{HT-technique-id}.py             ← validated play pack, enters Executor queue

The Explorer grows the playbook. The human validates. Autonomous promotion never happens.

Persistent Daemon Mode

ARCHER currently runs as interactive, human-triggered sessions. Hunting's natural operational model is a headless scheduled service — hypothesis queue runs, verdicts log, no analyst required per sweep.

Scheduling model (hybrid)

Time-based sweeps: full hypothesis queue every N hours (configurable)
Event-triggered escalation: when Suricata fires on a host, immediately run the relevant plays for that technique class, ahead of the scheduled queue
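
The hybrid model can be sketched as a two-level priority queue: escalated plays always pop before scheduled ones, and order within a level is first-in, first-out. The class name `HuntQueue` and its methods are hypothetical, not part of ARCHER.

```python
import heapq
import itertools


class HuntQueue:
    """Two-level priority queue for scheduled and escalated plays."""

    ESCALATED, SCHEDULED = 0, 1  # lower number pops first

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per level

    def push(self, play_id: str, escalated: bool = False) -> None:
        priority = self.ESCALATED if escalated else self.SCHEDULED
        heapq.heappush(self._heap, (priority, next(self._seq), play_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

Since inference is sequential, an escalated play does not preempt the chain already running; it simply runs next.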

The baseline that builds itself

The most important property of persistent operation: the verdict history IS the baseline.

Static baseline systems require someone to define "normal" upfront. A persistent daemon builds it through operation:
- HT-BEACON-01 returns NOT OBSERVED for 90 consecutive days
- Day 91: CONFIRMED
- 90 data points establish absence before the confirmation; no manual baselining required

The verdict store schema:

play_runs(
  play_id       TEXT,
  run_at        INTEGER,   -- Unix timestamp
  verdict       TEXT,      -- CONFIRMED | NOT_OBSERVED | COVERAGE_GAP
  evidence      TEXT,      -- snippet of matching result
  data_source   TEXT,      -- which source was queried
  record_count  INTEGER    -- records returned by query
)

Drift detection emerges naturally: "this play was NOT OBSERVED for N days and just CONFIRMED" is a first-class alert condition derived from the store without additional configuration.
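
A minimal sketch of that alert condition against the schema above, using Python's built-in sqlite3 module. It counts consecutive runs, not calendar days, and the function name `drifted` and parameter `min_quiet_runs` are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS play_runs(
  play_id TEXT, run_at INTEGER, verdict TEXT,
  evidence TEXT, data_source TEXT, record_count INTEGER
)"""


def drifted(conn: sqlite3.Connection, play_id: str, min_quiet_runs: int) -> bool:
    """True when the latest run is CONFIRMED after a clean quiet streak."""
    rows = conn.execute(
        "SELECT verdict FROM play_runs WHERE play_id = ? "
        "ORDER BY run_at DESC LIMIT ?",
        (play_id, min_quiet_runs + 1),
    ).fetchall()
    verdicts = [row[0] for row in rows]
    return (
        len(verdicts) == min_quiet_runs + 1
        and verdicts[0] == "CONFIRMED"
        # A COVERAGE_GAP inside the streak breaks it: a blind spot is
        # not evidence of absence.
        and all(v == "NOT_OBSERVED" for v in verdicts[1:])
    )
```

With 90 NOT_OBSERVED rows followed by one CONFIRMED row for a play, `drifted(conn, play_id, 90)` returns True with no separate baseline configuration.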

Resource management on 8GB VRAM

One hypothesis chain at a time — sequential inference, never concurrent
Configurable throttle: max N inferences per hour
Sweep duration scales with playbook size; 20 plays ≈ 45 min compute
Executor and Explorer never overlap — time-scheduled separation

Defensive Detection Range vs Offensive Cyber Range

The eval infrastructure differs fundamentally from the pentest domain:

Dimension	Offensive (GOAD)	Defensive (telemetry dataset)
Physical target	Running VMs	A file — frozen PCAP + logs
VMs required during eval	Yes	No
Ground truth	Known vulnerability in known software	Known indicator planted at known timestamp
Eval metric	Binary (got shell?)	Three-outcome + false positive rate
Environmental drift	Yes — VMs update, network state changes	No — frozen dataset is reproducible
Portability	Requires hypervisor	Ship as tarball
Specificity	Not a metric (range IS the target)	First-class metric

Specificity is the metric that doesn't exist in offensive eval at all. A play that fires on everything is useless. The defensive eval harness measures both:
- Sensitivity: does the play CONFIRM when the indicator is present?
- Specificity: does the play NOT CONFIRM on clean background traffic?

Building the defensive range from GOAD telemetry

GOAD already generates the attack patterns needed. Run the attacks, let Zeek + Suricata process the traffic, label the timestamp windows as ground truth:

GOAD scenario	Primary Zeek signal	Suricata signal	Ground truth label
BloodHound LDAP sweep	`dns`, `conn` (LDAP query volume)	ET LDAP rules	HT-RECON-01
Kerberoasting	`kerberos` (SPN requests + RC4 downgrade)	SID 2024792	HT-CRED-01
Pass-the-Hash	`smb`, `ntlm` (unusual auth pairs)	ET CREDS rules	HT-LATERAL-01
DCSync	`dce_rpc` (DRSUAPI from non-DC host)	ET POLICY rules	HT-EXFIL-01
C2 beaconing	`conn` (periodic, low-variance intervals)	custom threshold	HT-C2-01

All of these are detectable with a Zeek + Suricata + Elasticsearch telemetry stack — the baseline source set Sagittarius targets, and the set a future Security Onion community integration would supply.

Eval protocol

Run GOAD attack scenario at known timestamp T
Label log window [T−5m, T+15m] as ground truth for play P
Run play P against full dataset (not just labeled window)
CONFIRMED within labeled window = sensitivity pass
CONFIRMED outside labeled window / background traffic volume = false positive rate
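
The scoring steps above can be sketched as follows. The unit of "background traffic volume" is not fixed by this document, so `background_units` is an assumed stand-in (for example, hours of clean traffic); `score_play` and its argument names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Window = Tuple[float, float]  # (start, end) of a labeled ground-truth window


def score_play(confirmed_at: Sequence[float], windows: Sequence[Window],
               background_units: float) -> dict:
    """Score CONFIRMED timestamps against labeled windows."""
    def in_any_window(t: float) -> bool:
        return any(start <= t <= end for start, end in windows)

    # Sensitivity: share of labeled windows with at least one CONFIRMED.
    hits = sum(1 for start, end in windows
               if any(start <= t <= end for t in confirmed_at))
    # False positives: CONFIRMED verdicts outside every labeled window.
    false_pos = sum(1 for t in confirmed_at if not in_any_window(t))

    sensitivity: Optional[float] = hits / len(windows) if windows else None
    fp_rate: Optional[float] = (false_pos / background_units
                                if background_units else None)
    return {"sensitivity": sensitivity, "false_positives": false_pos,
            "fp_rate": fp_rate}
```

Running the play over the full dataset, not just the labeled window, is what makes the false positive count meaningful.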

The defensive range is a reusable artifact. Freeze the logs after a GOAD run, ship as a tarball, replay evals indefinitely against the same dataset. Reproducibility is a property that GOAD live-target evals don't have.

What to Borrow from Reference Frameworks

Based on analysis of the Threat Hunter Playbook (Open Threat Research), policy-guided frameworks, and 2026 local inference stack patterns:

Adopt

Statistical pre-filter before LLM sees data. A lightweight per-host baseline (rolling mean + stddev on connection count, bytes, interval) flags deviations before the play query runs. The play queries against flagged entries, not raw logs. Solves the 8K context / massive log problem more cleanly than aggregation alone. No ML required — statistics cover most beaconing detection.
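
A minimal sketch of such a pre-filter, using a z-score over each host's history. The threshold of 3 standard deviations, the choice to flag hosts with no baseline yet, and the name `flag_deviations` are all assumptions for illustration.

```python
from statistics import mean, stdev


def flag_deviations(history: dict[str, list[float]],
                    current: dict[str, float],
                    z_threshold: float = 3.0) -> list[str]:
    """Return hosts whose current metric deviates from their own baseline."""
    flagged = []
    for host, value in current.items():
        past = history.get(host, [])
        if len(past) < 2:
            # No usable baseline yet: surface the host rather than hide it.
            flagged.append(host)
            continue
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            if value != mu:
                flagged.append(host)
            continue
        if abs(value - mu) / sigma > z_threshold:
            flagged.append(host)
    return sorted(flagged)
```

Only the flagged hosts would then be passed to the play's query step, which keeps the result set small enough for a short context window.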

Structural separation of the query step from triage validation. hints_fn drives the query step; triage_fn drives result interpretation. A model that drove the query step is biased toward confirming its result. Explicit phase separation prevents this: the triage step receives the raw result, the original hypothesis, and the deny/inconclusive criteria, not the query that was run.

Baseline behavior registry. Local YAML/JSON per environment: {host → [known outbound IPs, known processes, expected cron windows]}. Triage step subtracts registry matches before scoring. Play author defines what constitutes a match for their hypothesis class.
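
As a sketch of the subtraction step, assume the registry is loaded into a dict keyed by host and that candidate rows carry `src` and `dst` fields. The key name `known_outbound_ips` and the helper `subtract_baseline` are hypothetical.

```python
def subtract_baseline(rows: list[dict], registry: dict[str, dict]) -> list[dict]:
    """Drop candidate rows already explained by the environment's baseline."""
    kept = []
    for row in rows:
        known = registry.get(row["src"], {}).get("known_outbound_ips", [])
        if row["dst"] not in known:
            kept.append(row)  # unexplained traffic goes on to triage
    return kept
```

A host with no registry entry has nothing subtracted, so an incomplete registry produces extra candidates, never silent suppression.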

Multi-candidate query variants per step. 2–3 parameterized variants per step (human-authored, not dynamically generated). Code runs all variants, unions results, model evaluates the union. Hedges against query quality issues without introducing hallucination risk.

Continuous verdict logging as training signal. Log every verdict — hypothesis ID, timestamp, result, result count, data source queried. Feeds both play refinement and eventual model fine-tuning.

Headless daemon mode (--daemon flag). Scheduled hypothesis sweeps, no interactive session required. This is the right operational model for a detection system.

Explicitly reject

Dynamic SIEM query generation (LLM writes ES DSL / SPL / KQL). The hallucinated field name problem is real and unsolved for 8K-context models. Our play-defined query chains eliminate this class of failure entirely. The model interprets results; it never writes queries.

Semantic search against schema catalogs. A Zeek/Suricata/Elasticsearch deployment has a fixed, known schema. Dynamic schema discovery is overengineering for a fixed-schema environment. Declare data_sources statically; return COVERAGE GAP if absent.

128k+ context assumptions. Reference framework patterns that rely on fitting large log excerpts in context don't transfer to an 8K context window on 8GB VRAM. Chunked queries plus the statistical pre-filter are the answer.

Self-Play / Adversarial Generation — Assessment and What to Borrow

A recurring external suggestion (assessed 2026-06-13) is to discover novel hunt methodologies via MARL self-play — Red vs. Blue RL agents in a simulated enterprise, training over millions of episodes (AlphaGo-style), with the Blue agent rewarded for flagging behavioral chains that map to no existing MITRE signature. The full blueprint is not adopted; the tractable kernel is.

Why the full self-play blueprint is rejected

Sim-to-real gap (decisive). Novelty discovered in an abstract simulation (CyGym / Cyberwheel) is "novel" and "malicious" only relative to the simulation's own ground-truth labeling. A reward like "flag a path structurally proven malicious in the simulation" is circular — it proves malice in the sim's logic, not in real telemetry. A hunt that's novel against a mathematical attack graph may detect nothing real. ARCHER's defensive range avoids this by labeling real GOAD-generated Zeek/Suricata telemetry, not abstract graphs.
Reward hacking. Punishing the Red agent for using known MITRE paths to force novelty reliably produces degenerate exploits the simulator permits but that don't exist on real systems — compute spent beating the simulator, not Windows.
Missing human gate. The blueprint auto-synthesizes and implies deploying executable KQL/Sigma. That puts the model in a code role (a discovered "detection" may be a sim artifact; an auto-written query may hallucinate fields) — the exact anti-pattern in v1-to-v2.md. ARCHER requires a human-validation gate before a discovered play is trusted.
Compute / philosophy mismatch. The blueprint scales to 8×H100 for a 671B reasoning model and millions of RL episodes — a different universe from the 8GB-VRAM / qwen3:14b / local-first / centaur constraints this project is built on. MARL is also a research-grade convergence problem, not a feature extension.

What is worth taking

Adversarial generation belongs at the curriculum layer, not the hunting mechanism. Build a Red generator that mutates known MITRE techniques (execution timing, file paths, parent-child process lineage) into novel attack variants, runs them against the real GOAD-telemetry range, auto-labels the windows, and feeds the variants into the defensive range. This expands coverage and stress-tests plays — on existing hardware, no MARL, no sim-to-real gap.
The Explorer is already the Blue-side novelty engine. MITRE-gap-driven hypothesis generation against real frozen telemetry, human-reviewed, is a more direct and far cheaper route to novel hunt methods than RL self-play — and it already emits the copy-pasteable artifact (a candidate play YAML) the blueprint's "LRM layer" is reaching for.
Keep the invariants: human promotion gate, and the model interprets results, it never writes runtime queries. The LRM/Explorer may draft candidate plays; a human promotes them; nothing auto-deploys.

Domain note

Red-side novelty (inventing attacks that evade detection) is offensive novelty — closer to the pentest domain than to hunting. For hunting (defensive), the grounded path to novelty is the Explorer against real telemetry, with the technique-mutation generator expanding the range it hunts over. Convergent validation: the blueprint's "high-fidelity emulation" route (freeze → label → replay real logs) and its "LRM synthesizes a hypothesis from a successful detection" step are, independently, the defensive-range design and the Explorer already specified above.

Data-Source Integration Scope

Sagittarius is source-agnostic by design. Every play declares data_sources upfront, the same model as Sigma's logsource. A play returns COVERAGE GAP if a required source is absent. Environment-specific parameters (asset IP ranges, baseline thresholds) are injected at runtime from local config. Hunt logic is portable; environment context is not.

The baseline source set is a Zeek + Suricata + Elasticsearch telemetry stack. Sagittarius does not assume any particular SIEM packaging of those sources.

Possible future integration — Security Onion / Elastic. Security Onion community edition exposes Zeek, Suricata, and Elasticsearch (and does not expose osquery, Elastic Agent, Strelka, MISP, Onion AI, or the MCP server, which are Pro-only). Because Sagittarius already targets the Zeek/Suricata/Elasticsearch source set, an SO ingestion adapter is a natural "not now, not never" integration — one supported data source among others, added when there is demand. It is not the deployment target and not the distribution strategy. (Doug Burks remains a personal reference for the defensive/SO domain.)

Sigma analogy: Sigma rules declare logsource, compile per-platform. Sagittarius plays declare data_sources, adapt per-environment. Same portability model — which is exactly why adding a new source (such as SO/Elastic) later is an adapter, not a redesign.

Build Sequence (canonical — 3 stages, cross-repo gate)

Decided 2026-06-20 (ARCHER DECISIONS.md). Resolves the chicken-and-egg: the hunting domain can't be built/eval'd before the telemetry pipeline exists, and that pipeline is Sagittarius's ingestion core — so Sagittarius ingestion comes first, but only the ingestion slice, and it must produce labeled ground truth (live/unlabeled data can't score sensitivity/specificity).

Stage 1 — Sagittarius ingestion + ground-truth gate (prereq). Minimal cut of Sagittarius Phase 1: sensor stack → ingest --file → queryable store, populated with labeled telemetry. Tracked: jayhawkins108/Sagittarius#25 (consolidates #12/#13/#14/#15/#17). Rest of Sagittarius Phase 1 deferred to Stage 3.
Stage 2 — ARCHER hunting domain v1 (differentiator). archer/threat_hunting/ (#994) + YAML loader (#983) + TH metrics (#984); five play packs (the GOAD scenario map above) eval'd against the Stage-1 labeled corpus; findings via the existing --json-output. Gate: five plays validated with sensitivity/specificity numbers.
Stage 3 — Sagittarius product hardening. Live ingest, Executor daemon, Explorer, distributed Core/Edge, deferred Phase 1 items (#18–#24).

The "MVP Milestones" below are the Stage-2/Stage-3 build slice in two-week increments; the three stages above are the unifying sequence (the per-project "Phase N" labels overload — see the DECISIONS naming note).

Sagittarius MVP Milestones (~Two Months)

The first build slice for Sagittarius, in roughly two-week increments:

Phase	Deliverable
1	Elasticsearch query adapter — authenticated, rate-limited, chunked result handling (against a Zeek/Suricata/Elasticsearch source set)
2	Five validated play packs (HT-C2-01, HT-RECON-01, HT-LATERAL-01, HT-CRED-01, HT-EXFIL-01) tested against GOAD-generated ground truth dataset
3	Executor daemon — scheduled sweep runner, SQLite verdict store, three-outcome logging, drift detection
4	Explorer subsystem — MITRE gap reader → candidate YAML generator; human review workflow; eval harness with sensitivity + specificity metrics

Built on the ARCHER engine as a standalone domain (--do hunting), packaged and deployed as Sagittarius. The Elasticsearch adapter is source-side only — no modification to any upstream SIEM codebase, which keeps a later Security Onion integration clean (no ELv2 concern).

The Eval Problem Nobody Else Has Solved

The reference frameworks have no principled way to answer: "is it actually hunting well?" They measure whether the system ran, not whether it found real things and ignored false ones.

ARCHER's eval harness adapts directly to this problem:
- Plant known indicators in a frozen telemetry dataset at known timestamps
- Run plays against the full dataset
- Measure both sensitivity (found the real thing) and false positive rate (didn't fire on noise)
- Three-outcome verdict model records what the system claimed and why

This is the differentiator worth leading with for Sagittarius — not the architecture, not the hardware accessibility, but the measurability claim: we can tell you, with a number, whether this hunt is working.

Open Questions — Security Onion Integration (Deferred)

These are scoped to the possible future Security Onion / Elastic integration, not the Sagittarius MVP. They determine SO-adapter build time when that integration is prioritized; they are not on the critical path now. Doug Burks is the reference to ask on these:

SO community ES access: Does community edition expose a read-only Elasticsearch API endpoint, or does querying SO data require SO's own API layer? What's the auth model?
Zeek index naming: Are index names (logs-zeek.conn-*, etc.) standardized across community installs or does the SO deployment config vary them?
Community field schema docs: Does SO publish a field reference for Zeek/Suricata indices in their documentation, or is the Elastic Common Schema the source of truth?

The plays are schema-independent once an adapter normalizes the query surface — so adding SO later is adapter work, not a play rewrite.

Relationship to ARCHER Pentest Domain

ARCHER Threat Hunting is a new domain under the same ARCHER framework. The shared architecture:

Core agent loop (think → act → observe → chain) — identical
Domain loading via --do hunting (same SkillRegistry.load_domain() enforcement)
Play pack file structure (Python module, registered handlers)
Eval harness concept (objective-level pass/fail)
SYSTEM_PROMPT_ADDENDUM (domain framing injected at session start)
bonus_fn routing (routing weight for play selection)
Three-layer responsibility split: model reasons, code executes and enforces, human sets scope and baselines

What changes: triage_fn replaces success_fn/halt_fn; data_sources declaration is new; verdict model expands to three outcomes; daemon mode is new; Explorer is new.

The "plays" rename: Renaming skills/ → plays/, SKILL_CATEGORIES → PLAY_CATEGORIES, and all references across ARCHER.py, eval_harness.py, and docs is semantically correct and worth doing before the hunting domain ships. Blast radius is the entire codebase — separate issue to file before Coder begins.

Last updated: 2026-06-17. Design session notes; no implementation started.
2026-06-13: added "Play Authoring Format (YAML)": concrete practitioner-facing grammar, #945 atom-evaluator reuse, DomainConfig compile target, YAML-native-vs-Python open decision, one-artifact-three-roles framing.
2026-06-13: added "Self-Play / Adversarial Generation — Assessment and What to Borrow" (reject full MARL self-play; adopt technique-mutation Red generator feeding the real GOAD range; Explorer stays the Blue novelty engine).
2026-06-17: strategic pivot. Product home is now Sagittarius (standalone distributed threat-hunting product on the ARCHER engine); Security Onion reframed from deployment target/distribution strategy to a possible future data-source integration ("not now, not never"); SkillBridge-deliverable framing replaced with Sagittarius MVP milestones; Doug Burks retained as a personal reference. Technical design (play packs, three-outcome verdict, Executor/Explorer, defensive range) unchanged.