Sagittarius Threat Hunting — System Design¶
Status: Pre-implementation planning. No code written. Design surface for Sagittarius, a standalone distributed threat-hunting product built on the ARCHER engine.
Naming: Sagittarius is the product/deployment identity of this threat-hunting capability, a distinct distributed defensive platform. ARCHER is the underlying agent engine; the threat-hunting domain runs on ARCHER (agent loop, play packs, eval harness, --do hunting), but its product home, packaging, and deployment story are Sagittarius. Where this doc says "the hunting domain" or "the Executor/Explorer," that is the ARCHER engine; where it says "the product," that is Sagittarius.
What It Is¶
Sagittarius is a scheduled, hypothesis-driven AI hunting product, built on the ARCHER engine. It runs hunt plays against the telemetry available in its deployment environment — Zeek/Suricata/Elasticsearch-class network and endpoint sources — using the three-outcome verdict model below.
Sagittarius does not depend on any single SIEM or sensor platform. Security Onion / Elastic telemetry is a possible future integration — one supported data source Sagittarius could ingest later — not the deployment target or distribution channel. See "Data-Source Integration Scope" below.
It extends ARCHER's existing architecture (agent loop, play packs, eval harness) into the defensive domain via a new --do hunting domain flag, following the same single-domain enforcement pattern used for penetration testing.
Fundamental Differences: Pentesting vs Threat Hunting¶
The pentest domain and the hunting domain share ARCHER's core architecture but differ across five axes:
| Axis | Pentesting | Threat Hunting |
|---|---|---|
| Environment | Adversarial — the target resists, may evict you | Cooperative — full read access to all telemetry |
| Data model | Discovered through action — picture built from nothing | Pre-existing, queryable — telemetry is already there |
| Action model | Hard, stateful — sends packets, spawns shells, changes state | Read-only queries — same hunt returns same answer every time |
| Success criteria | Binary: got the shell / flag / credential | Three-outcome: CONFIRMED / NOT OBSERVED / COVERAGE GAP |
| Time orientation | Present-tense against a live system | Past-tense against historical records |
What this means for the model's job: In pentesting the model navigates an unknown system based on what it reveals under probing. In threat hunting the model reasons over a structured dataset that already exists. The challenge shifts from access to signal extraction.
The Three-Outcome Verdict Model¶
Pentest success/halt is binary. Hunting requires three outcomes — this is load-bearing for the eval harness:
| Verdict | Meaning |
|---|---|
| CONFIRMED | Pattern found — hypothesis supported by evidence |
| NOT OBSERVED | Pattern absent — hypothesis denied, but data was present and queryable |
| COVERAGE GAP | Telemetry source missing or query returned zero records — not a clean bill of health |
NOT OBSERVED is a valid finding. A hunt that finds nothing is not a failed hunt. The eval harness, the play's triage_fn, and the verdict store must distinguish all three. COVERAGE GAP in particular must never be silently treated as NOT OBSERVED — a missing data source is a detection blind spot, not evidence of absence.
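The precedence rule above can be sketched in a few lines. This is a minimal illustration, not ARCHER's implementation: the `Verdict` enum, the `resolve_verdict` function, and its three inputs are hypothetical names, and `record_count` is assumed to mean the number of records the source returned for the hunt window (not the number of rows matching the pattern).

```python
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "CONFIRMED"
    NOT_OBSERVED = "NOT_OBSERVED"
    COVERAGE_GAP = "COVERAGE_GAP"


def resolve_verdict(source_present: bool, record_count: int, pattern_found: bool) -> Verdict:
    """Map raw hunt facts to one of the three verdicts.

    Coverage is checked first, so a missing or empty source can never
    fall through to NOT_OBSERVED.
    """
    if not source_present or record_count == 0:
        return Verdict.COVERAGE_GAP
    if pattern_found:
        return Verdict.CONFIRMED
    return Verdict.NOT_OBSERVED
```

Checking coverage before anything else is the design point: NOT_OBSERVED is only reachable when data was present and queryable.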
Play Architecture (vs Pentest Skill Pack)¶
"Skills" → "Plays" is the correct terminology shift. "Skill" implies agent capability. "Play" implies a defined procedure from a playbook — run to test a hypothesis. Industry standard term (hunt playbooks, threat playbooks).
Pentest skill pack shape (current):¶
setup_fn: verify target reachable
hints_fn: specific tool invocations (nmap, sqlmap, gobuster)
success_fn: check for specific output (shell, flag, credentials)
halt_fn: stop conditions (max commands, danger detection, objective achieved)
bonus_fn: routing weight delta
Hunting play shape (planned):¶
hypothesis: "Host X is beaconing to a C2"
data_sources: [zeek.conn, suricata.alert]   ← declared upfront; COVERAGE_GAP if absent
query_chain:
  step 1: aggregate Zeek conn by dest IP
    group_by: [src_ip, dest_ip]
    compute: [count, avg_bytes, interval_variance]
    filter: interval_variance < 30 AND count > 10
  step 2: IF matches → pivot to DNS queries for that dest IP
  step 3: IF DGA-pattern domain → threat intel enrichment
triage_fn: evaluates result against hypothesis + baseline registry
  pass_fn: periodic + small payload + DGA domain = CONFIRMED
  deny_fn: no periodic pattern in 24h window = NOT OBSERVED
  inconclusive_fn: source absent or 0 records = COVERAGE GAP
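Step 1 of the query chain above (group, compute, filter) could look roughly like the following when run in plain Python over already-fetched connection records. This is a sketch under assumptions: the record keys (`ts`, `src_ip`, `dest_ip`, `bytes`), the function name, and the use of population variance of inter-arrival seconds for `interval_variance` are illustrative choices; in the planned design this aggregation lives in the human-authored query, not in model output.

```python
from collections import defaultdict
from statistics import mean, pvariance


def beacon_candidates(conns, max_interval_variance=30.0, min_count=10):
    """Group connections by (src_ip, dest_ip) and keep periodic, frequent pairs.

    conns: iterable of dicts with keys ts (seconds), src_ip, dest_ip, bytes.
    """
    groups = defaultdict(list)
    for c in conns:
        groups[(c["src_ip"], c["dest_ip"])].append(c)

    results = []
    for (src, dst), rows in groups.items():
        rows.sort(key=lambda r: r["ts"])
        # Inter-arrival times between consecutive connections of the pair.
        intervals = [b["ts"] - a["ts"] for a, b in zip(rows, rows[1:])]
        if not intervals:
            continue
        summary = {
            "src_ip": src,
            "dest_ip": dst,
            "count": len(rows),
            "avg_bytes": mean(r["bytes"] for r in rows),
            "interval_variance": pvariance(intervals),
        }
        # Same filter as the play shape: interval_variance < 30 AND count > 10.
        if summary["interval_variance"] < max_interval_variance and summary["count"] > min_count:
            results.append(summary)
    return results
```

The model would only ever see the small `results` list, which is what keeps the prompt within a small context window.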
New dispatcher additions vs pentest:¶
- triage_fn(result, hypothesis, baseline_registry) → verdict: replaces success_fn/halt_fn
- data_sources declaration: engine returns COVERAGE GAP if required source absent
- Three-verdict model throughout eval harness
Why the model never writes queries¶
The critical design choice: the model interprets results, not query syntax. Human play authors write the Elasticsearch DSL, Zeek field filters, and aggregation logic. The model decides when to execute step 2 and what the aggregate result means for the hypothesis.
This eliminates the hallucinated field name problem that plagues dynamic query generation approaches. A model asked to write raw ES DSL will invent field names. A model asked to evaluate a pre-aggregated result set will not.
Tradeoff: plays must be written for novel hypotheses before they can run. The Explorer subsystem (below) addresses this.
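One way to enforce "the model never writes queries" in code is to make the only variable parts of a query come from local configuration, never from model output. The sketch below assumes a hypothetical `render_query` helper and Python `str.format`-style placeholders; it does not sanitize parameter values, which a real adapter would also need to do.

```python
import string


def render_query(template: str, params: dict) -> str:
    """Fill a human-authored query template with runtime parameters.

    Raises KeyError if the template references a parameter that was not
    supplied, and ValueError if a supplied parameter is never used, so
    drift between a play and its config is caught before execution.
    """
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    unused = set(params) - fields
    if unused:
        raise ValueError(f"unused parameters: {sorted(unused)}")
    return template.format(**params)


query = render_query(
    "zeek.conn | where ts > now()-{lookback} | where n >= {min_connections}",
    {"lookback": "24h", "min_connections": 50},
)
```

Because field names sit in the fixed template text, there is no path by which a hallucinated field name can reach the data store.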
Play Authoring Format (YAML) — Practitioner-Facing¶
The play shape above is pseudo-code; this is the concrete artifact a practitioner authors. Design goal: a hunt methodology is encoded as low-complexity declarative YAML that reads like a runbook, not code — the same authoring bar as a Sigma rule. No Python required to define a play.
Worked example (TH-BEACON-01, C2 beaconing):
id: TH-BEACON-01
technique: T1071.001            # ATT&CK
author: jhawkins
hypothesis: >
  A host on the user subnet is beaconing to external C2 over HTTP(S) at a
  regular interval with low jitter and consistent small request sizes.
requires:                       # data-source gate; COVERAGE GAP if absent (Sigma logsource analogue)
  - zeek.conn
  - zeek.http
params:                         # env-specific, injected at runtime; keeps the LOGIC portable
  internal_cidr: "{INTERNAL_CIDR}"
  lookback: "24h"
  min_connections: 50
  max_jitter_pct: 15
steps:
  - id: candidates
    intent: "Internal->external pairs with many connections in the window."
    query: >
      zeek.conn | where ts > now()-{lookback} and src in {internal_cidr}
      and dst not in {internal_cidr} | summarize n=count() by src,dst | where n >= {min_connections}
    expect: rows                # zero rows -> NOT OBSERVED, stop (deterministic gate)
    hints: ["Exclude known-good dst (CDN, OS update, telemetry) before flagging."]
  - id: periodicity
    intent: "Test inter-arrival times for low-jitter regularity."
    for_each: candidates
    branch:
      - when: { numeric: { field: jitter_pct, lt: "{max_jitter_pct}" } }
        goto: enrich
      - else: { verdict: not_observed }
    hints:
      - "Regular small requests at near-constant interval = beaconing."
      - "Human browsing is bursty/irregular - high jitter, varied sizes."
  - id: enrich
    intent: "dst reputation + JA3/TLS + URI patterns to confirm or clear."
    hints: ["Rare JA3, empty User-Agent, beacon-style URIs raise confidence."]
verdict:                        # three-outcome - evaluated by the #945 atom grammar (reused, not new)
  confirmed:
    all:
      - step_passed: periodicity
      - any: [ { pattern: "rare_ja3|empty user-agent|known_c2" },
               { reputation: { dst: "{dst}", min_score: 70 } } ]
  coverage_gap: { source_absent: any }
  not_observed: default
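The worked example branches on `jitter_pct` but this document does not define how it is computed. One plausible definition, assumed here purely for illustration, is the standard deviation of inter-arrival times expressed as a percentage of their mean; the function name and the "fewer than two intervals returns None" rule are likewise assumptions.

```python
from statistics import mean, pstdev


def jitter_pct(timestamps):
    """Return inter-arrival jitter as a percentage of the mean interval.

    Returns None when fewer than two intervals exist or the mean interval
    is zero, since no meaningful spread can be computed in those cases.
    """
    ts = sorted(timestamps)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    if len(intervals) < 2:
        return None
    avg = mean(intervals)
    if avg == 0:
        return None
    return 100.0 * pstdev(intervals) / avg


# A steady 60 s beacon has zero jitter; bursty traffic scores far above 15.
steady = jitter_pct([0, 60, 120, 180])
bursty = jitter_pct([0, 1, 2, 300, 301, 900])
```

Under this definition a value below `max_jitter_pct: 15` would take the `goto: enrich` branch, and anything else would fall to `not_observed`.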
Reuse, don't reinvent — the evaluation engine already exists. The verdict block is built from the #945 declarative atom evaluator already running in the eval harness (testenv/objective_registry.py + the atom grammar: pattern, tool_used, numeric, any/all). testenv/objectives/PT-DISTCC-01.yaml proves objectives can be authored natively in YAML rather than ported from Python. The hunting verdict/triage block compiles to the same atoms — no new evaluation engine is built; the format is an extension of a proven one.
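To make the atom grammar concrete, here is a minimal recursive evaluator for the atoms named above (`pattern`, `numeric`, `any`/`all`) plus the `step_passed` atom from the worked example. It is a sketch, not the #945 evaluator: the `eval_atom` name, the shape of the `ctx` dict, and the `gt` comparison are assumptions, and the `reputation` and `tool_used` atoms are omitted.

```python
import re


def eval_atom(atom: dict, ctx: dict) -> bool:
    """Evaluate one declarative atom against a run context.

    ctx keys assumed here: "text" (evidence string), "fields" (numeric
    values by name), "steps_passed" (set of step ids that passed).
    """
    if "all" in atom:
        return all(eval_atom(a, ctx) for a in atom["all"])
    if "any" in atom:
        return any(eval_atom(a, ctx) for a in atom["any"])
    if "pattern" in atom:
        return re.search(atom["pattern"], ctx.get("text", ""), re.IGNORECASE) is not None
    if "step_passed" in atom:
        return atom["step_passed"] in ctx.get("steps_passed", set())
    if "numeric" in atom:
        spec = atom["numeric"]
        value = ctx.get("fields", {}).get(spec["field"])
        if value is None:
            return False
        if "lt" in spec and not value < spec["lt"]:
            return False
        if "gt" in spec and not value > spec["gt"]:
            return False
        return True
    # Unknown atoms fail loudly instead of silently evaluating to False.
    raise ValueError(f"unknown atom: {sorted(atom)}")
```

Raising on an unrecognized atom is a deliberate choice in this sketch: a typo in a play should stop the run rather than quietly produce NOT OBSERVED.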
Compile target (post-#973): a loader turns the YAML into a DomainConfig (archer_platform/domain.py):
| YAML field | DomainConfig surface |
|---|---|
| requires | data_sources declaration -> COVERAGE GAP gate |
| steps[].hints | hints_fn (per-step prompt guidance) |
| steps[].branch | deterministic query-chain control flow |
| verdict | triage_fn -> CONFIRMED / NOT OBSERVED / COVERAGE GAP |
| params | runtime-injected environment context |
Refinement to the promotion pipeline: the promotion pipeline (see "Play promotion pipeline" below) promotes Explorer drafts to plays/HT/*.py (Python). The #945 YAML-native precedent means validated plays can stay declarative YAML (loaded -> DomainConfig) instead of being ported to Python, which lowers the authoring bar and keeps plays community-contributable as data, not code. Open decision: YAML-native plays vs. Python play packs. Recommendation: YAML-native for author-facing plays; reserve Python only for plays needing custom logic the atom grammar cannot express.
One artifact, three roles¶
The same play file serves three purposes — this is what makes "train the AI to employ the technique consistently" tractable:
- Runtime program: loaded -> DomainConfig; the Executor runs the chain.
- Training-data generator: a verdict-validated run becomes a session trace -> ft.jsonl -> QLoRA (ARCHER's existing V1->V2 pipeline). Over many runs the model internalizes the methodology.
- Its own test: the verdict atoms are the eval pass/fail criteria. Run the play against the labeled defensive range (sensitivity + specificity). This is the measurability differentiator.
Determinism lives where ARCHER's three-layer split puts it: code enforces the chain/branches/verdict (repeatable), the model does bounded per-node interpretation, the human sets scope and baselines. The methodology is deterministic; the LLM is a bounded interpreter inside it.
Dual-Subsystem Architecture¶
Two subsystems sharing one Ollama endpoint. 8GB VRAM is not a barrier — Executor and Explorer never run concurrently. They are sequential, scheduled workloads.
Timeline example:
00:00 Executor sweep — 20-play queue, ~45 min compute, --think=false
00:45 Executor done, verdicts logged, GPU idle
01:00 Explorer wakes — reads MITRE gaps + recent verdicts + schema, --think=true
02:30 Explorer done — draft YAML written to candidate_plays/
GPU idle until next sweep window
06:00 Executor runs again
Executor¶
- Runs validated plays from the playbook on defined schedule
- Uses qwen3:14b --think=false: fast, deterministic, reliable
- Produces CONFIRMED / NOT OBSERVED / COVERAGE GAP per play per run
- Logs every verdict to persistent SQLite store
Explorer¶
- Runs in idle windows between Executor sweeps
- Uses qwen3:14b --think=true: slower, deeper reasoning justified by exploration goal
- Input: MITRE ATT&CK technique coverage gaps + recent verdict history + data source schema (static, injected; no hallucinated fields)
- Output: candidate play YAML written to candidate_plays/ (structured draft, not executable Python)
- Human reviews candidate YAML weekly; promotes sound candidates to validated play packs
- Never runs against live data autonomously; it produces draft hypotheses only
Play promotion pipeline¶
MITRE gap + recent verdicts + known schema
↓ Explorer (daily, off-peak)
candidate_plays/{MITRE-ID}_{date}.yaml ← human-readable draft
↓ Human review gate
plays/HT/{HT-technique-id}.py ← validated play pack, enters Executor queue
The Explorer grows the playbook. The human validates. Autonomous promotion never happens.
Persistent Daemon Mode¶
ARCHER currently runs as interactive, human-triggered sessions. Hunting's natural operational model is a headless scheduled service — hypothesis queue runs, verdicts log, no analyst required per sweep.
Scheduling model (hybrid)¶
- Time-based sweeps: full hypothesis queue every N hours (configurable)
- Event-triggered escalation: Suricata fires on a host → immediately run relevant plays for that technique class, skip the queue position
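The hybrid model above amounts to a priority queue in which escalated plays outrank routine sweep entries. A minimal sketch using the standard library follows; the `HuntQueue` class and its method names are hypothetical, and the mapping from a Suricata alert to the relevant play IDs is assumed to happen elsewhere.

```python
import heapq
import itertools

ROUTINE, ESCALATED = 1, 0  # lower number is popped first


class HuntQueue:
    """Sweep queue where event-triggered plays skip ahead of routine ones."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps FIFO order among entries of equal priority.
        self._seq = itertools.count()

    def add_sweep(self, play_ids):
        for pid in play_ids:
            heapq.heappush(self._heap, (ROUTINE, next(self._seq), pid))

    def escalate(self, play_id):
        heapq.heappush(self._heap, (ESCALATED, next(self._seq), play_id))

    def next_play(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


q = HuntQueue()
q.add_sweep(["HT-RECON-01", "HT-CRED-01"])
q.escalate("HT-C2-01")  # e.g. triggered by an alert on a host
order = [q.next_play(), q.next_play(), q.next_play(), q.next_play()]
```

Since only one hypothesis chain runs at a time, an escalated play waits for the chain in progress to finish and then runs next.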
The baseline that builds itself¶
The most important property of persistent operation: the verdict history IS the baseline.
Static baseline systems require someone to define "normal" upfront. A persistent daemon builds it through operation:
- HT-BEACON-01 returns NOT OBSERVED for 90 consecutive days
- Day 91: CONFIRMED
- 90 data points establish absence before the confirmation; no manual baselining required
The verdict store schema:
play_runs(
play_id TEXT,
run_at INTEGER, -- Unix timestamp
verdict TEXT, -- CONFIRMED | NOT_OBSERVED | COVERAGE_GAP
evidence TEXT, -- snippet of matching result
data_source TEXT, -- which source was queried
record_count INTEGER -- records returned by query
)
Drift detection emerges naturally: "this play was NOT OBSERVED for N days and just CONFIRMED" is a first-class alert condition derived from the store without additional configuration.
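The drift condition can be expressed as a single query over the verdict store. The sketch below uses Python's built-in `sqlite3` with the column set shown above; the `drifted` function, the `CREATE TABLE` wrapper, and the choice to count consecutive runs rather than calendar days are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS play_runs(
    play_id TEXT, run_at INTEGER, verdict TEXT,
    evidence TEXT, data_source TEXT, record_count INTEGER
)"""


def drifted(conn, play_id, min_quiet_runs):
    """True if the latest run is CONFIRMED and the min_quiet_runs runs
    immediately before it were all NOT_OBSERVED."""
    rows = conn.execute(
        "SELECT verdict FROM play_runs WHERE play_id = ? "
        "ORDER BY run_at DESC LIMIT ?",
        (play_id, min_quiet_runs + 1),
    ).fetchall()
    verdicts = [r[0] for r in rows]
    if len(verdicts) < min_quiet_runs + 1 or verdicts[0] != "CONFIRMED":
        return False
    # A COVERAGE_GAP in the quiet period breaks the streak on purpose:
    # a blind spot is not evidence of absence.
    return all(v == "NOT_OBSERVED" for v in verdicts[1:])
```

Treating COVERAGE_GAP as a streak breaker keeps the self-built baseline honest about days on which the source was unavailable.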
Resource management on 8GB VRAM¶
- One hypothesis chain at a time — sequential inference, never concurrent
- Configurable throttle: max N inferences per hour
- Sweep duration scales with playbook size; 20 plays ≈ 45 min compute
- Executor and Explorer never overlap — time-scheduled separation
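The "max N inferences per hour" throttle could be implemented as a sliding-window limiter. This is a sketch only: the `InferenceThrottle` class is a hypothetical name, and the injectable `clock` parameter exists so the behaviour can be tested without waiting an hour.

```python
import time
from collections import deque


class InferenceThrottle:
    """Allow at most max_per_hour inferences in any 3600-second window."""

    def __init__(self, max_per_hour, clock=time.monotonic):
        self.max_per_hour = max_per_hour
        self._clock = clock
        self._stamps = deque()  # start times of recent inferences, oldest first

    def try_acquire(self):
        now = self._clock()
        # Forget inferences that have aged out of the one-hour window.
        while self._stamps and now - self._stamps[0] >= 3600:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_per_hour:
            return False
        self._stamps.append(now)
        return True
```

A daemon loop would call `try_acquire()` before each model call and sleep or defer the play when it returns False.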
Defensive Detection Range vs Offensive Cyber Range¶
The eval infrastructure differs fundamentally from the pentest domain:
| Dimension | Offensive (GOAD) | Defensive (telemetry dataset) |
|---|---|---|
| Physical target | Running VMs | A file — frozen PCAP + logs |
| VMs required during eval | Yes | No |
| Ground truth | Known vulnerability in known software | Known indicator planted at known timestamp |
| Eval metric | Binary (got shell?) | Three-outcome + false positive rate |
| Environmental drift | Yes — VMs update, network state changes | No — frozen dataset is reproducible |
| Portability | Requires hypervisor | Ship as tarball |
| Specificity | Not a metric (range IS the target) | First-class metric |
Specificity is the metric that doesn't exist in offensive eval at all. A play that fires on everything is useless. The defensive eval harness measures both:
- Sensitivity: does the play CONFIRM when the indicator is present?
- Specificity: does the play NOT CONFIRM on clean background traffic?
Building the defensive range from GOAD telemetry¶
GOAD already generates the attack patterns needed. Run the attacks, let Zeek + Suricata process the traffic, label the timestamp windows as ground truth:
| GOAD scenario | Primary Zeek signal | Suricata signal | Ground truth label |
|---|---|---|---|
| BloodHound LDAP sweep | dns, conn (LDAP query volume) | ET LDAP rules | HT-RECON-01 |
| Kerberoasting | kerberos (SPN requests + RC4 downgrade) | SID 2024792 | HT-CRED-01 |
| Pass-the-Hash | smb, ntlm (unusual auth pairs) | ET CREDS rules | HT-LATERAL-01 |
| DCSync | dce_rpc (DRSUAPI from non-DC host) | ET POLICY rules | HT-EXFIL-01 |
| C2 beaconing | conn (periodic, low-variance intervals) | custom threshold | HT-C2-01 |
All of these are detectable with a Zeek + Suricata + Elasticsearch telemetry stack — the baseline source set Sagittarius targets, and the set a future Security Onion community integration would supply.
Eval protocol¶
- Run GOAD attack scenario at known timestamp T
- Label log window [T−5m, T+15m] as ground truth for play P
- Run play P against full dataset (not just labeled window)
- CONFIRMED within labeled window = sensitivity pass
- CONFIRMED outside labeled window / background traffic volume = false positive rate
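The protocol above reduces to a small scoring function. The sketch below assumes timestamps in Unix seconds and a caller-supplied `background_units` denominator (for example, the number of clean evaluation windows); the document does not fix the unit of "background traffic volume", so that parameter and the function name are assumptions.

```python
def score_play(confirm_times, attack_times, background_units,
               before=300, after=900):
    """Score one play's CONFIRMED timestamps against labeled attack times.

    A confirmation inside [T - before, T + after] for some attack time T
    counts toward sensitivity; any other confirmation is a false positive.
    """
    windows = [(t - before, t + after) for t in attack_times]
    detected = sum(
        any(lo <= c <= hi for c in confirm_times) for lo, hi in windows
    )
    false_pos = sum(
        not any(lo <= c <= hi for lo, hi in windows) for c in confirm_times
    )
    return {
        "sensitivity": detected / len(windows) if windows else None,
        "false_positives": false_pos,
        "false_positive_rate": false_pos / background_units if background_units else None,
    }


# Two planted attacks; the play confirms one of them and fires once on noise.
result = score_play(confirm_times=[1100, 9000], attack_times=[1000, 5000],
                    background_units=100)
```

The default `before` and `after` values mirror the [T−5m, T+15m] labeling window.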
The defensive range is a reusable artifact. Freeze the logs after a GOAD run, ship as a tarball, replay evals indefinitely against the same dataset. Reproducibility is a property that GOAD live-target evals don't have.
What to Borrow from Reference Frameworks¶
Based on analysis of the Threat Hunter Playbook (Open Threat Research), policy-guided frameworks, and 2026 local inference stack patterns:
Adopt¶
Statistical pre-filter before LLM sees data. A lightweight per-host baseline (rolling mean + stddev on connection count, bytes, interval) flags deviations before the play query runs. The play queries against flagged entries, not raw logs. Solves the 8K context / massive log problem more cleanly than aggregation alone. No ML required — statistics cover most beaconing detection.
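A per-host pre-filter of this kind needs nothing beyond the standard library. The sketch below flags hosts whose latest value sits more than a chosen number of standard deviations from their own history; the function name, the z-score threshold of 3, and the decision to flag hosts that have no baseline yet are assumptions, not settled design.

```python
from statistics import mean, pstdev


def flag_deviations(history, current, z_threshold=3.0):
    """Return hosts whose current value deviates from their own baseline.

    history: {host: [past values]}, current: {host: latest value}.
    """
    flagged = []
    for host, value in current.items():
        past = history.get(host, [])
        if len(past) < 2:
            flagged.append(host)  # no baseline yet: surface rather than hide
            continue
        mu, sigma = mean(past), pstdev(past)
        if sigma == 0:
            # Perfectly flat history: any change at all is a deviation.
            if value != mu:
                flagged.append(host)
            continue
        if abs(value - mu) / sigma > z_threshold:
            flagged.append(host)
    return flagged


history = {"10.0.0.5": [100, 110, 90, 105, 95], "10.0.0.6": [50, 52, 48, 51, 49]}
current = {"10.0.0.5": 102, "10.0.0.6": 400, "10.0.0.7": 10}
flagged = flag_deviations(history, current)
```

Only the flagged hosts would then be passed to the play query, which is what shrinks the result set before the model sees it.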
Structural separation of query generation from triage validation. hints_fn drives the query step; triage_fn drives result interpretation. A model that generated the query is biased toward confirming it. Explicit phase separation prevents this — the triage step receives the raw result + the original hypothesis + deny/inconclusive criteria, not the query that was run.
Baseline behavior registry. Local YAML/JSON per environment: {host → [known outbound IPs, known processes, expected cron windows]}. Triage step subtracts registry matches before scoring. Play author defines what constitutes a match for their hypothesis class.
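The subtraction step can be sketched as a simple filter over result rows. The registry layout below, with a `known_outbound_ips` list per host, and the `src`/`dst` row keys are assumed shapes for illustration; the document leaves the match definition to the play author.

```python
def subtract_baseline(rows, registry):
    """Drop result rows whose (src, dst) pair is registered as known-good.

    registry: {host: {"known_outbound_ips": [...]}}
    rows: list of dicts with "src" and "dst" keys.
    """
    kept = []
    for row in rows:
        known = registry.get(row["src"], {}).get("known_outbound_ips", [])
        if row["dst"] not in known:
            kept.append(row)
    return kept


registry = {"10.0.0.5": {"known_outbound_ips": ["192.0.2.10"]}}
rows = [
    {"src": "10.0.0.5", "dst": "192.0.2.10"},   # registered, e.g. an update server
    {"src": "10.0.0.5", "dst": "203.0.113.9"},  # not registered
    {"src": "10.0.0.8", "dst": "192.0.2.10"},   # host has no registry entry
]
remaining = subtract_baseline(rows, registry)
```

Note that a destination registered for one host is not treated as known-good for another, since the registry is keyed per host.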
Multi-candidate query variants per step. 2–3 parameterized variants per step (human-authored, not dynamically generated). Code runs all variants, unions results, model evaluates the union. Hedges against query quality issues without introducing hallucination risk.
Continuous verdict logging as training signal. Log every verdict — hypothesis ID, timestamp, result, result count, data source queried. Feeds both play refinement and eventual model fine-tuning.
Headless daemon mode (--daemon flag). Scheduled hypothesis sweeps, no interactive session required. This is the right operational model for a detection system.
Explicitly reject¶
Dynamic SIEM query generation (LLM writes ES DSL / SPL / KQL). The hallucinated field name problem is real and unsolved for 8K models. Our play-defined query chains eliminate this class of failure entirely. Model interprets results; it never writes queries.
Semantic search against schema catalogs. A Zeek/Suricata/Elasticsearch deployment has a fixed, known schema. Dynamic schema discovery is overengineering for a fixed-schema environment. Declare data_sources statically; return COVERAGE GAP if absent.
128k+ context assumptions. Reference framework patterns that rely on fitting large log excerpts in context don't transfer to 8K/8GB VRAM. Chunked query + statistical pre-filter is the answer.
Self-Play / Adversarial Generation — Assessment and What to Borrow¶
A recurring external suggestion (assessed 2026-06-13) is to discover novel hunt methodologies via MARL self-play — Red vs. Blue RL agents in a simulated enterprise, training over millions of episodes (AlphaGo-style), with the Blue agent rewarded for flagging behavioral chains that map to no existing MITRE signature. The full blueprint is not adopted; the tractable kernel is.
Why the full self-play blueprint is rejected¶
- Sim-to-real gap (decisive). Novelty discovered in an abstract simulation (CyGym / Cyberwheel) is "novel" and "malicious" only relative to the simulation's own ground-truth labeling. A reward like "flag a path structurally proven malicious in the simulation" is circular — it proves malice in the sim's logic, not in real telemetry. A hunt that's novel against a mathematical attack graph may detect nothing real. ARCHER's defensive range avoids this by labeling real GOAD-generated Zeek/Suricata telemetry, not abstract graphs.
- Reward hacking. Punishing the Red agent for using known MITRE paths to force novelty reliably produces degenerate exploits the simulator permits but that don't exist on real systems — compute spent beating the simulator, not Windows.
- Missing human gate. The blueprint auto-synthesizes and implies deploying executable KQL/Sigma. That puts the model in a code role (a discovered "detection" may be a sim artifact; an auto-written query may hallucinate fields), the exact anti-pattern in v1-to-v2.md. ARCHER requires a human-validation gate before a discovered play is trusted.
- Compute / philosophy mismatch. The blueprint scales to 8×H100 for a 671B reasoning model and millions of RL episodes, a different universe from the 8GB-VRAM / qwen3:14b / local-first / centaur constraints this project is built on. MARL is also a research-grade convergence problem, not a feature extension.
What is worth taking¶
- Adversarial generation belongs at the curriculum layer, not the hunting mechanism. Build a Red generator that mutates known MITRE techniques (execution timing, file paths, parent-child process lineage) into novel attack variants, runs them against the real GOAD-telemetry range, auto-labels the windows, and feeds the variants into the defensive range. This expands coverage and stress-tests plays — on existing hardware, no MARL, no sim-to-real gap.
- The Explorer is already the Blue-side novelty engine. MITRE-gap-driven hypothesis generation against real frozen telemetry, human-reviewed, is a more direct and far cheaper route to novel hunt methods than RL self-play — and it already emits the copy-pasteable artifact (a candidate play YAML) the blueprint's "LRM layer" is reaching for.
- Keep the invariants: human promotion gate, and the model interprets results, it never writes runtime queries. The LRM/Explorer may draft candidate plays; a human promotes them; nothing auto-deploys.
Domain note¶
Red-side novelty (inventing attacks that evade detection) is offensive novelty — closer to the pentest domain than to hunting. For hunting (defensive), the grounded path to novelty is the Explorer against real telemetry, with the technique-mutation generator expanding the range it hunts over. Convergent validation: the blueprint's "high-fidelity emulation" route (freeze → label → replay real logs) and its "LRM synthesizes a hypothesis from a successful detection" step are, independently, the defensive-range design and the Explorer already specified above.
Data-Source Integration Scope¶
Sagittarius is source-agnostic by design. Every play declares data_sources upfront, the same model as Sigma's logsource. A play returns COVERAGE GAP if a required source is absent. Environment-specific parameters (asset IP ranges, baseline thresholds) are injected at runtime from local config. Hunt logic is portable; environment context is not.
The baseline source set is a Zeek + Suricata + Elasticsearch telemetry stack. Sagittarius does not assume any particular SIEM packaging of those sources.
Possible future integration — Security Onion / Elastic. Security Onion community edition exposes Zeek, Suricata, and Elasticsearch (and does not expose osquery, Elastic Agent, Strelka, MISP, Onion AI, or the MCP server, which are Pro-only). Because Sagittarius already targets the Zeek/Suricata/Elasticsearch source set, an SO ingestion adapter is a natural "not now, not never" integration — one supported data source among others, added when there is demand. It is not the deployment target and not the distribution strategy. (Doug Burks remains a personal reference for the defensive/SO domain.)
Sigma analogy: Sigma rules declare logsource, compile per-platform. Sagittarius plays declare data_sources, adapt per-environment. Same portability model — which is exactly why adding a new source (such as SO/Elastic) later is an adapter, not a redesign.
Build Sequence (canonical — 3 stages, cross-repo gate)¶
Decided 2026-06-20 (ARCHER DECISIONS.md). Resolves the chicken-and-egg: the hunting domain
can't be built/eval'd before the telemetry pipeline exists, and that pipeline is
Sagittarius's ingestion core — so Sagittarius ingestion comes first, but only the ingestion
slice, and it must produce labeled ground truth (live/unlabeled data can't score
sensitivity/specificity).
- Stage 1: Sagittarius ingestion + ground-truth gate (prereq). Minimal cut of Sagittarius Phase 1: sensor stack → ingest --file → queryable store, populated with labeled telemetry. Tracked: jayhawkins108/Sagittarius#25 (consolidates #12/#13/#14/#15/#17). Rest of Sagittarius Phase 1 deferred to Stage 3.
- Stage 2: ARCHER hunting domain v1 (differentiator). archer/threat_hunting/ (#994) + YAML loader (#983) + TH metrics (#984); five play packs (the GOAD scenario map above) eval'd against the Stage-1 labeled corpus; findings via the existing --json-output. Gate: five plays validated with sensitivity/specificity numbers.
- Stage 3: Sagittarius product hardening. Live ingest, Executor daemon, Explorer, distributed Core/Edge, deferred Phase 1 items (#18–#24).
The "MVP Milestones" below are the Stage-2/Stage-3 build slice in two-week increments; the three stages above are the unifying sequence (the per-project "Phase N" labels overload — see the DECISIONS naming note).
Sagittarius MVP Milestones (~Two Months)¶
The first build slice for Sagittarius, in roughly two-week increments:
| Phase | Deliverable |
|---|---|
| 1 | Elasticsearch query adapter — authenticated, rate-limited, chunked result handling (against a Zeek/Suricata/Elasticsearch source set) |
| 2 | Five validated play packs (HT-C2-01, HT-RECON-01, HT-LATERAL-01, HT-CRED-01, HT-EXFIL-01) tested against GOAD-generated ground truth dataset |
| 3 | Executor daemon — scheduled sweep runner, SQLite verdict store, three-outcome logging, drift detection |
| 4 | Explorer subsystem — MITRE gap reader → candidate YAML generator; human review workflow; eval harness with sensitivity + specificity metrics |
Built on the ARCHER engine as a standalone domain (--do hunting), packaged and deployed as Sagittarius. The Elasticsearch adapter is source-side only — no modification to any upstream SIEM codebase, which keeps a later Security Onion integration clean (no ELv2 concern).
The Eval Problem Nobody Else Has Solved¶
The reference frameworks have no principled way to answer: "is it actually hunting well?" They measure whether the system ran, not whether it found real things and ignored false ones.
ARCHER's eval harness adapts directly to this problem:
- Plant known indicators in a frozen telemetry dataset at known timestamps
- Run plays against the full dataset
- Measure both sensitivity (found the real thing) and false positive rate (didn't fire on noise)
- Three-outcome verdict model records what the system claimed and why
This is the differentiator worth leading with for Sagittarius — not the architecture, not the hardware accessibility, but the measurability claim: we can tell you, with a number, whether this hunt is working.
Open Questions — Security Onion Integration (Deferred)¶
These are scoped to the possible future Security Onion / Elastic integration, not the Sagittarius MVP. They determine SO-adapter build time when that integration is prioritized; they are not on the critical path now. Doug Burks is the reference to ask on these:
- SO community ES access: Does community edition expose a read-only Elasticsearch API endpoint, or does querying SO data require SO's own API layer? What's the auth model?
- Zeek index naming: Are index names (logs-zeek.conn-*, etc.) standardized across community installs or does the SO deployment config vary them?
- Community field schema docs: Does SO publish a field reference for Zeek/Suricata indices in their documentation, or is the Elastic Common Schema the source of truth?
The plays are schema-independent once an adapter normalizes the query surface — so adding SO later is adapter work, not a play rewrite.
Relationship to ARCHER Pentest Domain¶
ARCHER Threat Hunting is a new domain under the same ARCHER framework. The shared architecture:
- Core agent loop (think → act → observe → chain): identical
- Domain loading via --do hunting (same SkillRegistry.load_domain() enforcement)
- Play pack file structure (Python module, registered handlers)
- Eval harness concept (objective-level pass/fail)
- SYSTEM_PROMPT_ADDENDUM (domain framing injected at session start)
- bonus_fn routing (routing weight for play selection)
- Two-layer responsibility split: model reasons, code executes and enforces
What changes: triage_fn replaces success_fn/halt_fn; data_sources declaration is new; verdict model expands to three outcomes; daemon mode is new; Explorer is new.
The "plays" rename: Renaming skills/ → plays/, SKILL_CATEGORIES → PLAY_CATEGORIES, and all references across ARCHER.py, eval_harness.py, and docs is semantically correct and worth doing before the hunting domain ships. Blast radius is the entire codebase — separate issue to file before Coder begins.
Last updated: 2026-06-17. Design session notes; no implementation started. 2026-06-13: added "Play Authoring Format (YAML)" — concrete practitioner-facing grammar, #945 atom-evaluator reuse, DomainConfig compile target, YAML-native-vs-Python open decision, one-artifact-three-roles framing. 2026-06-13: added "Self-Play / Adversarial Generation — Assessment and What to Borrow" (reject full MARL self-play; adopt technique-mutation Red generator feeding the real GOAD range; Explorer stays the Blue novelty engine). 2026-06-17: strategic pivot — product home is now Sagittarius (standalone distributed threat-hunting product on the ARCHER engine); Security Onion reframed from deployment target/distribution strategy to a possible future data-source integration ("not now, not never"); SkillBridge-deliverable framing replaced with Sagittarius MVP milestones; Doug Burks retained as a personal reference. Technical design (play packs, three-outcome verdict, Executor/Explorer, defensive range) unchanged.