Script Reference¶

ARCHER is built on a collection of Python scripts and shell utilities that handle everything outside the agent loop itself — data collection, training pipeline, evaluation, monitoring, and lab management. This page documents each script's purpose and when to use it.

Scripts live in scripts/ at the repo root. None of them are used at runtime by the agent; they are all workflow tooling.

Monitoring & Health¶

Scripts that give you visibility into the system's current state.

`archer_live.py`¶

Command: python3 scripts/archer_live.py or bash scripts/start_archer_live.sh

The live monitoring server. Serves a local web dashboard at http://localhost:7788 with twelve pages:

Live — active session stream, unified Eval Pipeline section (T1/T2 pipeline + RCA findings merged into one scrollable table with expandable rows, Findings/All filter toggle), range health, action queue
Benchmark — OA/FP/HD trend chart with secondary-axis batch wall-clock (WT, toggle via WT button; hidden by default), heatmap with WT row (≤10m/≤30m/>30m bands), AI-generated snapshot explanations on click, per-objective pass rate history with focus dropdown and smart filters, regression table, failure class registry, pass rate distribution (#940)
Failures — issue burndown (hourly/daily granularity), per-skill fail rates, halt breakdown, coder verification cap tracker
RCA — per-objective failure diagnostics sorted by severity
Sessions — session explorer with inline expansion, objective filter, deep-link from RCA findings
Review — Tier 2 audit queue with command snapshot, session log, one-click verdicts, skill examples, T1/T2/T3 visibility
Training — pipeline funnel, API spend tally, daily cost chart
Router — routing confidence, per-skill accuracy, label growth, unknown-skill row handling
Notes — freeform reviewer note creation, editing, and management
Skills — skill reference browser with pass rates, examples, domain breakdowns
Help — per-chart help blocks, metric threshold reference
Index — session overview and header stats

Automatically starts archer_companion.py in the background on startup (--no-tui headless mode). Provides REST endpoints — /api/stats, /api/companion, /api/eval/history, /api/health, /api/explain/snapshot, /api/explain/snapshots/batch, /api/review/*, and others.

Also reachable over Tailscale at the host's Tailscale IP.

`archer_companion.py`¶

Command: python3 scripts/archer_companion.py [--force-spend] [--dry-run] [--no-tui]

The eval session review pipeline. Tails ~/.archer_eval_events.jsonl and dispatches each new session through a two-stage review:

T1 (CPU-only): Structural audit — checks session file existence, command patterns, halt class. Produces a failure_class label.
T2 (Claude Haiku API): Quality scoring — scores the session 0–3 on approach validity. Only runs on sessions that pass T1 and are budget-gated at 90% spend.

Files GitHub issues automatically for sessions that fail T2 or are classified as infra-gap. Writes state to ~/.archer_companion_state.json for the live dashboard to read.

Started automatically by archer_live.py at server startup. Can also be run standalone with --dry-run (T1 only, no issue filing) or --force-spend (ignore budget cap). Logs to ~/.archer_companion.log when headless.

`archer_status.py`¶

Command: archer-status

The primary session-start command. Reads four sources — eval results, routing log, session logs, and live lab probes — and renders a full health snapshot in under 30 seconds. Covers: eval quality (OA rate, failure streaks, staleness), router label progress for every skill with a progress bar showing percentage to the 50-label training gate, classifier state, training pipeline volume, and open GitHub issues with blocking relationships.

Use --handoff to also write a ranked recommendation list to HANDOFFS.md.

`archer_health.py`¶

Command: archer-health

A deeper diagnostic view focused on five specific failure signals: halt rates per objective (flagging any objective above 20%), failure clusters from recent sessions, label volume vs the training gate, per-skill routing confidence trend (score gap), and false-positive objective-achieved events where the model claimed success but verification failed. Run this when archer-status shows a degraded metric and you want to understand why.

`routing_analysis.py`¶

Command: python3 scripts/routing_analysis.py

Reads the routing log and computes a rolling confidence trend per skill. Confidence is measured as the score gap between the top-ranked and second-ranked skills — a higher gap means the router is more decisive. Flags skills where confidence has declined over the last N decisions. Useful before a classifier retraining run to see which skills are ambiguous.

`archer_issues.py`¶

Command: archer-issues

Single-call GitHub issue triage. Parses blocking relationships from issue bodies and groups issues by tier: critical, high-priority bug/regression, and ready-to-start. Shows blocked issues with their dependency chain indented beneath them. --role coder/auditor/scribe filters to role-relevant issues; --blocked shows only issues with unresolved blockers.

Evaluation & Quality Gates¶

Scripts that measure ARCHER's behavior and enforce quality gates.

`eval_harness.py` (in `testenv/`)¶

Command: python3 testenv/eval_harness.py

The primary measurement tool. Runs ARCHER against 75 active objectives across 9 play packs on live vulnerable targets (MS2, BWA, bee-box, Juice Shop, GOAD-Light). Records pass/fail, halt reason, session log, and routing decision per run. All eval runs write a timestamped CSV to testenv/eval_results/.

Key flags: - --verify — fix-verification pass. Embeds "verify": true in emitted events so archer_companion skips T2 scoring. T1 structural checks still run. Use this whenever re-running an objective after a Coder commit to confirm a fix. - --strict-preflight — abort before the first run if any objective is missing a setup_fn. Catches configuration errors before wasting GPU time. - --ambiguous — adds underspecified-phrasing variants for router training data. - --audit — shows evidence snippets for passes and final N turns for failures. - --compare — diffs two eval CSVs side by side.

CSV output includes standard columns (success, halt_reason, command_count, elapsed_s, skill_selected) plus newer columns added in the May 2026 sprint: ceiling_proximity (command_count / max_cmds), vram_mb_at_start (GPU VRAM before each run), and estimated_tokens (len(stdout)//4 approximation).

After each run the harness prints four diagnostic blocks: zero-command triage (probable cause for cmds=0 rows), training yield (ft.jsonl candidates per skill), historical halt distribution (halt_reason breakdown over last 20 rows per failing objective), and inline failure classes (Class N tags per failing row before the classify_failures.py subprocess call). See eval-harness.md for full detail.

`ci_regression_check.py`¶

Command: Called by GitHub Actions

Compares a new eval result against the baseline CSV. Fails if two or more CI gate objectives drop from ≥70% to zero or one pass. The gate set is defined in testenv/ci_smoke_set.txt. Called automatically on every push to main.

`ci_verify_fix.py`¶

Command: Called by GitHub Actions / python3 scripts/ci_verify_fix.py --issue N

Implements the Coder → Auditor verification workflow. When the pending-verification label is applied to a GitHub issue, this script reads the Verify: T## T## footer from the issue's comments, runs three eval passes on those objectives, and posts the result back. If all objectives pass 3/3, it adds the verified label and closes the issue.

Tier: tag enforcement (verify-fix.yml): Before the eval run fires, verify-fix.yml checks that the issue body or a comment contains Tier: hint-only or Tier: shared-utility. If the tag is absent, the workflow posts a blocking comment, removes the pending-verification label, and skips the run entirely (#621/#723, commit 679c2bd). Apply the pending-verification label only after posting the "committed — pending verification" comment with the Tier: line — labels applied without the tag are invisible to archer-coder-next and silently open cap slots that aren't really available.

`propose_objectives.py`¶

Command: python3 scripts/propose_objectives.py

Ranked candidate objective generator. Queries NVD for CVEs affecting known lab services, scores by EPSS exploitation likelihood, maps to ATT&CK techniques, and checks coverage gaps against existing objectives in eval_harness.py. Outputs ranked stubs for human review — does not modify the harness.

Key flags: - --epss-min FLOAT — minimum EPSS score to include (default: 0.10) - --top INT — max candidates to output (default: 20) - --out FILE — write stubs to a file instead of stdout - --no-api — use cached NVD/EPSS data only, no live queries - --include-covered — include CVEs for techniques already in objectives - --emit-yaml — write YAML objective drafts for IN_LAB candidates to testenv/objectives/ instead of Python stubs; skips files that already exist (#945, commit a2f2bf0)

Each stub includes RANGING (LOW/MEDIUM/HIGH), HINT_LAYERS (layer_1: training-specific; layer_2: generic eval hint), and TARGET_STATUS (IN_LAB, VULHUB, LAB_EXT, or UNKNOWN). Use --emit-yaml for objectives where the target is confirmed in-lab and you want YAML atom format directly.

`regression_watcher.py`¶

Command: python3 testenv/regression_watcher.py

Diffs the most recent eval results against the baseline and files a GitHub issue if any previously-passing objective has dropped to zero. Run on return from away, before reviewing open issues. --dry-run reports regressions without filing.

Data Collection¶

Scripts that orchestrate automated training data collection.

`run_data_collection.sh`¶

Command: bash scripts/run_data_collection.sh

The main unattended collection script. Runs in two phases: Phase 1 is a full eval sweep for ft.jsonl diversity (one pass per objective); Phase 2 is a sparse+ambiguous loop that focuses on skills below the 50-label gate. GPU throttle is on (60W limit). Checks for pending verification requests at cycle boundaries and yields to them automatically.

`run_eval.sh`¶

Command: bash scripts/run_eval.sh

Thin GPU power-limit wrapper around eval_harness.py. Sets nvidia-smi power limit to 60W before the run and restores 80W on exit via a trap. Accepts all eval_harness.py arguments, including --verify for fix-verification passes.

`eval_queue.py` and `eval_queue_worker.py`¶

Commands: archer-queue-start, archer-queue-status, archer-queue

Priority-aware eval job scheduler. The queue has three tiers: High (verification runs), Normal (CI / on-demand), and Low (background collection). The worker daemon monitors the queue and preempts low-priority collection runs when a higher-priority job arrives. Prevents verification jobs from waiting behind a 15-hour collection run.

Training Pipeline¶

Scripts that convert session logs into training artifacts.

`build_training_data.py`¶

Command: python3 scripts/build_training_data.py

Batch converter. Reads ~/.archer_routing_log.jsonl and extracts high-confidence eval labels into data/router_labels.csv. Also converts session ft.jsonl files to per-skill JSONL for fine-tuning. --report prints label volumes without writing any files — use this to check if skills are at the 50-label gate before training.

`prepare_finetune.py`¶

Command: python3 scripts/prepare_finetune.py

Filters and formats session ft.jsonl files for instruction tuning. Only sessions ending in OBJECTIVE_ACHIEVED or HALT_DISCIPLINE are included. Applies Tier 2 scoring gate (score ≥ 2 required). Writes per-skill JSONL to data/finetune/. --report shows volume stats and acceptance rates without writing.

`train_classifier.py`¶

Command: python3 scripts/train_classifier.py

Trains the TF-IDF + Logistic Regression skill router classifier on data/router_labels.csv. Per-skill 50-example gate; --force bypasses for low-data testing. Saves the trained model and metadata (including per-skill confusion matrices and a training set hash) to ~/.archer_classifier/. At inference time ARCHER loads this and uses it as the first routing tier before falling back to keyword scoring.

`finetune.py`¶

Command: python3 scripts/finetune.py (RunPod A100 only)

Cloud GPU fine-tuning via Unsloth QLoRA. Reads data/finetune/<skill>.jsonl and trains a LoRA adapter on top of Qwen3-14B. --dry-run --all validates data and prints a cost estimate without requiring a GPU.

`export_lora.py`¶

Command: python3 scripts/export_lora.py --skill <name> (on RunPod, post-training)

Merges a trained LoRA adapter into the base model, converts to GGUF, and writes a Modelfile for Ollama. Run on the cloud instance after finetune.py; transfer artifacts locally and ollama create to make the skill-tuned model available.

Audit & Verification¶

Scripts that validate training data quality before it enters the pipeline.

`check_hints.py`¶

Command: python3 scripts/check_hints.py

CI hint anti-pattern linter. Runs static analysis across all plays/PT-*.py files and exits 1 on any violation. Rules enforced:

C1 — TUI tool (ligolo-proxy, msfconsole, responder) backgrounded without tmux (Class 2 failure risk)
C2 — case-sensitive grep on a .log file reference (Class 3)
C4 — app-specific hint block without a generic <placeholder> companion (Class 10, range lock-in)
C5 — infrastructure tunnel setup without a subsequent verification step
C7 — hint string exceeds 500-char limit
C8 — VAR=$(...) assigned and dereferenced without && chaining (shell-var-loss, Class 1)
C10 — success_fn/verify_fn missing _targeted_at guard (wrong-host false-positive risk, Classes 11/15)
C11 — hardcoded IP in _hints_* plain-string or f-string literal without {ip}/{target} substitution (#949)
C12 — hardcoded credential pair (user:pass) in _hints_* without <password> companion (#949)
C13 — soft conditional halt signal ("If … emit [OBJECTIVE_ACHIEVED]") in _hints_*; prefer imperative completion conditions (#949)

Add # nocheck inline to suppress a specific violation. Intentional lab-fixture strings (lab IPs in task comparisons, 127.0.0.1) are exempt from C11 to avoid false positives.

Entry point: scripts/hint_lint.py (pre-commit symlink). Also runs automatically in CI via ci.yml.

`hint_regression.py`¶

Command: python3 scripts/hint_regression.py (advisory CI step)

Cross-file sibling pattern scanner. Detects soft conditional halts (R1), hardcoded IPs (R2), intermediate-state halt signals (R3), and missing generic companion blocks (R4) across all PT-*.py files. Unlike check_hints.py (per-file exit-1 linter), this is advisory — it surfaces cross-pack patterns that only become visible when scanning the full skill set together. Added as an advisory step in ci.yml (#953).

`classify_failures.py`¶

Command: python3 scripts/classify_failures.py --csv testenv/eval_results/<run>.csv

Post-eval failure classifier. Reads an eval CSV and session logs; maps each failing row to a failure class (Classes 1–15) using heuristics. Flags: --gate <rate> exits 1 if model-loop rate (Class 12) exceeds the threshold; --premature-oa audits sessions where halt_reason=OBJECTIVE_ACHIEVED but verify_fn=False; --auto-file opens GitHub issues for hint-gap objectives (Class 6) failing ≥2 consecutive runs. Called automatically by eval-regression.yml after every eval push.

`archer_coder_next.py`¶

Command: python3 scripts/archer_coder_next.py

Verification-queue cap enforcer for the Coder role. Queries GitHub labels (tier:hint-only, tier:shared-utility) and checks all three two-tier sub-caps: hint-only pending < 6, shared-utility pending < 2, combined pending < 8. Exits 0 when all caps are clear; exits 1 with a breakdown when any cap is breached. Run this before starting any new Coder issue — a non-zero exit means the Auditor queue must clear before new work begins. Flags issues without a tier label as ⚠ untagged (invisible to the cap; must be corrected). See PROCESSES.md for the full cap rationale.

`audit_review.py` (in `testenv/`)¶

Commands: archer-audit-dry, archer-review-flagged

Two-tier training data audit. Tier 1 (--dry-run) applies structural checks: wrong host, empty output, degenerate loops, lab-suspect annotations. Tier 2 (--tier2) uses Claude Haiku as an LLM judge to score each Tier-1-clean session 0–3 on evidence quality. Sessions scoring below 2 are excluded from fine-tuning. prepare_finetune.py gates on this score.

`verify_logs.py`¶

Command: python3 scripts/verify_logs.py

Integrity check for session logs. Verifies SHA-256 sidecars against their corresponding session files. Exits non-zero on any mismatch or missing file. Run before any pipeline step that reads from ~/.archer_sessions/.

`compare_tier2_backends.py`¶

Command: python3 scripts/compare_tier2_backends.py

Calibration tool. Scores a sample set of sessions with all three Tier 2 backends (Haiku, Gemini, Ollama/qwen3) and prints score correlation and per-session agreement. Used to diagnose calibration drift between backends.

`check_json_validity.py`¶

Command: python3 scripts/check_json_validity.py

Reports the JSON validity rate for --json-output sessions. Gate for fine-tuning: ≥95% overall validity required. --skill and --since filters.

Lab Management¶

Scripts that manage the evaluation infrastructure.

`lab_range.sh`¶

Command: bash scripts/lab_range.sh [start|stop|status]

Manages the full eval lab: starts, stops, and health-checks all target VMs (Metasploitable2, bWAPP, Juice Shop, GOAD-Light). Called automatically by run_data_collection.sh at the start of each collection cycle.

`pivot_range.sh`¶

Command: archer-pivot <command> (build, up, down, status, brief)

Manages the Docker-based pivot training range. Creates a chain of 1–7 hops using configurable techniques: SSH local forwarding, SSH dynamic (SOCKS), socat relay, chisel, and Ligolo-ng. --hops N, --seed N, --techniques <list>. brief emits a one-liner suitable for injection into the ARCHER system prompt as context.

Dashboard & Demo¶

Scripts that produce public-facing artifacts.

`generate_dashboard.py`¶

Command: python3 scripts/generate_dashboard.py

Generates docs/dashboard.md with inline SVG charts. Reads eval CSVs, session stats, router label counts, classifier state, and Tier 2 audit results. Also maintains data/trend_archive.csv: one snapshot per 5-day window for long-term trend history. Called by GitHub Actions after every eval result push; --dry-run prints to stdout without writing.

`record_demo.py`¶

Command: python3 scripts/record_demo.py --task-id <id> --task "..."

Records a live ARCHER session as an asciinema v2 cast file, without using the asciinema CLI. Spawns ARCHER in a PTY, monitors output for completion signals, waits a configurable grace period after completion, and writes the cast file directly. Avoids the worker-process PTY issues in asciinema's recorder.

`trim_cast.py`¶

Command: python3 scripts/trim_cast.py input.cast [output.cast]

Post-processes a cast file to remove post-completion prose and the interactive prompt. Cuts at the first completion signal ([ Objective Reached ], [ FINDINGS ], [OBJECTIVE_ACHIEVED]) with a per-signal delay to let the final output render.

Notifications¶

`archer_notify.py`¶

Command: python3 scripts/archer_notify.py "title" "body"

Shared ntfy push notification utility. Reads ARCHER_NTFY_TOPIC from the environment; silently no-ops if unset. Called by run_data_collection.sh after each audit run and by regression_watcher.py on regressions.

`archer_morning_digest.sh`¶

Command: archer-morning-digest or bash scripts/archer_morning_digest.sh

Runs archer-status, formats a summary, and sends it via ntfy. Designed for scheduling via cron (0 8 * * *) so the morning status arrives as a push notification before sitting down.

Script Reference¶

Monitoring & Health¶

archer_live.py¶

archer_companion.py¶

archer_status.py¶

archer_health.py¶

routing_analysis.py¶

archer_issues.py¶

Evaluation & Quality Gates¶

eval_harness.py (in testenv/)¶

ci_regression_check.py¶

ci_verify_fix.py¶

propose_objectives.py¶

regression_watcher.py¶

Data Collection¶

run_data_collection.sh¶

run_eval.sh¶

eval_queue.py and eval_queue_worker.py¶

Training Pipeline¶

build_training_data.py¶

prepare_finetune.py¶

train_classifier.py¶

finetune.py¶

export_lora.py¶