Skip to content

Script Reference

ARCHER is built on a collection of Python scripts and shell utilities that handle everything outside the agent loop itself — data collection, training pipeline, evaluation, monitoring, and lab management. This page documents each script's purpose and when to use it.

Scripts live in scripts/ at the repo root. None of them are used at runtime by the agent; they are all workflow tooling.


Monitoring & Health

Scripts that give you visibility into the system's current state.

archer_live.py

Command: python3 scripts/archer_live.py or bash scripts/start_archer_live.sh

The live monitoring server. Serves a local web dashboard at http://localhost:7788 with twelve pages:

  • Live — active session stream, unified Eval Pipeline section (T1/T2 pipeline + RCA findings merged into one scrollable table with expandable rows, Findings/All filter toggle), range health, action queue
  • Benchmark — OA/FP/HD trend chart (AI-generated snapshot explanations on click), per-objective pass rate history with focus dropdown and smart filters, regression table, failure class registry, pass rate distribution
  • Failures — issue burndown (hourly/daily granularity), per-skill fail rates, halt breakdown, coder verification cap tracker
  • RCA — per-objective failure diagnostics sorted by severity
  • Sessions — session explorer with inline expansion, objective filter, deep-link from RCA findings
  • Review — Tier 2 audit queue with command snapshot, session log, one-click verdicts, skill examples, T1/T2/T3 visibility
  • Training — pipeline funnel, API spend tally, daily cost chart
  • Router — routing confidence, per-skill accuracy, label growth, unknown-skill row handling
  • Notes — freeform reviewer note creation, editing, and management
  • Skills — skill reference browser with pass rates, examples, domain breakdowns
  • Help — per-chart help blocks, metric threshold reference
  • Index — session overview and header stats

Automatically starts archer_companion.py in the background on startup (--no-tui headless mode). Provides REST endpoints — /api/stats, /api/companion, /api/eval/history, /api/health, /api/explain/snapshot, /api/explain/snapshots/batch, /api/review/*, and others.

Also reachable over Tailscale at the host's Tailscale IP.

archer_companion.py

Command: python3 scripts/archer_companion.py [--force-spend] [--dry-run] [--no-tui]

The eval session review pipeline. Tails ~/.archer_eval_events.jsonl and dispatches each new session through a two-stage review:

  • T1 (CPU-only): Structural audit — checks session file existence, command patterns, halt class. Produces a failure_class label.
  • T2 (Claude Haiku API): Quality scoring — scores the session 0–3 on approach validity. Only runs on sessions that pass T1 and are budget-gated at 90% spend.

Files GitHub issues automatically for sessions that fail T2 or are classified as infra-gap. Writes state to ~/.archer_companion_state.json for the live dashboard to read.

Started automatically by archer_live.py at server startup. Can also be run standalone with --dry-run (T1 only, no issue filing) or --force-spend (ignore budget cap). Logs to ~/.archer_companion.log when headless.

archer_status.py

Command: archer-status

The primary session-start command. Reads four sources — eval results, routing log, session logs, and live lab probes — and renders a full health snapshot in under 30 seconds. Covers: eval quality (OA rate, failure streaks, staleness), router label progress for every skill with a progress bar showing percentage to the 50-label training gate, classifier state, training pipeline volume, and open GitHub issues with blocking relationships.

Use --handoff to also write a ranked recommendation list to HANDOFFS.md.

archer_health.py

Command: archer-health

A deeper diagnostic view focused on five specific failure signals: halt rates per objective (flagging any objective above 20%), failure clusters from recent sessions, label volume vs the training gate, per-skill routing confidence trend (score gap), and false-positive objective-achieved events where the model claimed success but verification failed. Run this when archer-status shows a degraded metric and you want to understand why.

routing_analysis.py

Command: python3 scripts/routing_analysis.py

Reads the routing log and computes a rolling confidence trend per skill. Confidence is measured as the score gap between the top-ranked and second-ranked skills — a higher gap means the router is more decisive. Flags skills where confidence has declined over the last N decisions. Useful before a classifier retraining run to see which skills are ambiguous.

archer_issues.py

Command: archer-issues

Single-call GitHub issue triage. Parses blocking relationships from issue bodies and groups issues by tier: critical, high-priority bug/regression, and ready-to-start. Shows blocked issues with their dependency chain indented beneath them. --role coder/auditor/scribe filters to role-relevant issues; --blocked shows only issues with unresolved blockers.


Evaluation & Quality Gates

Scripts that measure ARCHER's behavior and enforce quality gates.

eval_harness.py (in testenv/)

Command: python3 testenv/eval_harness.py

The primary measurement tool. Runs ARCHER against 67 active objectives across 9 skill packs on live vulnerable targets (MS2, BWA, bee-box, Juice Shop, GOAD-Light). Records pass/fail, halt reason, session log, and routing decision per run. All eval runs write a timestamped CSV to testenv/eval_results/.

Key flags: - --verify — fix-verification pass. Embeds "verify": true in emitted events so archer_companion skips T2 scoring. T1 structural checks still run. Use this whenever re-running an objective after a Coder commit to confirm a fix. - --strict-preflight — abort before the first run if any objective is missing a setup_fn. Catches configuration errors before wasting GPU time. - --ambiguous — adds underspecified-phrasing variants for router training data. - --audit — shows evidence snippets for passes and final N turns for failures. - --compare — diffs two eval CSVs side by side.

CSV output includes standard columns (success, halt_reason, command_count, elapsed_s, skill_selected) plus newer columns added in the May 2026 sprint: ceiling_proximity (command_count / max_cmds), vram_mb_at_start (GPU VRAM before each run), and estimated_tokens (len(stdout)//4 approximation).

After each run the harness prints four diagnostic blocks: zero-command triage (probable cause for cmds=0 rows), training yield (ft.jsonl candidates per skill), historical halt distribution (halt_reason breakdown over last 20 rows per failing objective), and inline failure classes (Class N tags per failing row before the classify_failures.py subprocess call). See eval-harness.md for full detail.

ci_regression_check.py

Command: Called by GitHub Actions

Compares a new eval result against the baseline CSV. Fails if two or more CI gate objectives drop from ≥70% to zero or one pass. The gate set is defined in testenv/ci_smoke_set.txt. Called automatically on every push to main.

ci_verify_fix.py

Command: Called by GitHub Actions / python3 scripts/ci_verify_fix.py --issue N

Implements the Coder → Auditor verification workflow. When the pending-verification label is applied to a GitHub issue, this script reads the Verify: T## T## footer from the issue's comments, runs three eval passes on those objectives, and posts the result back. If all objectives pass 3/3, it adds the verified label and closes the issue.

regression_watcher.py

Command: python3 testenv/regression_watcher.py

Diffs the most recent eval results against the baseline and files a GitHub issue if any previously-passing objective has dropped to zero. Run on return from away, before reviewing open issues. --dry-run reports regressions without filing.


Data Collection

Scripts that orchestrate automated training data collection.

run_data_collection.sh

Command: bash scripts/run_data_collection.sh

The main unattended collection script. Runs in two phases: Phase 1 is a full eval sweep for ft.jsonl diversity (one pass per objective); Phase 2 is a sparse+ambiguous loop that focuses on skills below the 50-label gate. GPU throttle is on (60W limit). Checks for pending verification requests at cycle boundaries and yields to them automatically.

run_eval.sh

Command: bash scripts/run_eval.sh

Thin GPU power-limit wrapper around eval_harness.py. Sets nvidia-smi power limit to 60W before the run and restores 80W on exit via a trap. Accepts all eval_harness.py arguments, including --verify for fix-verification passes.

eval_queue.py and eval_queue_worker.py

Commands: archer-queue-start, archer-queue-status, archer-queue

Priority-aware eval job scheduler. The queue has three tiers: High (verification runs), Normal (CI / on-demand), and Low (background collection). The worker daemon monitors the queue and preempts low-priority collection runs when a higher-priority job arrives. Prevents verification jobs from waiting behind a 15-hour collection run.


Training Pipeline

Scripts that convert session logs into training artifacts.

build_training_data.py

Command: python3 scripts/build_training_data.py

Batch converter. Reads ~/.archer_routing_log.jsonl and extracts high-confidence eval labels into data/router_labels.csv. Also converts session ft.jsonl files to per-skill JSONL for fine-tuning. --report prints label volumes without writing any files — use this to check if skills are at the 50-label gate before training.

prepare_finetune.py

Command: python3 scripts/prepare_finetune.py

Filters and formats session ft.jsonl files for instruction tuning. Only sessions ending in OBJECTIVE_ACHIEVED or HALT_DISCIPLINE are included. Applies Tier 2 scoring gate (score ≥ 2 required). Writes per-skill JSONL to data/finetune/. --report shows volume stats and acceptance rates without writing.

train_classifier.py

Command: python3 scripts/train_classifier.py

Trains the TF-IDF + Logistic Regression skill router classifier on data/router_labels.csv. Per-skill 50-example gate; --force bypasses for low-data testing. Saves the trained model and metadata (including per-skill confusion matrices and a training set hash) to ~/.archer_classifier/. At inference time ARCHER loads this and uses it as the first routing tier before falling back to keyword scoring.

finetune.py

Command: python3 scripts/finetune.py (RunPod A100 only)

Cloud GPU fine-tuning via Unsloth QLoRA. Reads data/finetune/<skill>.jsonl and trains a LoRA adapter on top of Qwen3-14B. --dry-run --all validates data and prints a cost estimate without requiring a GPU.

export_lora.py

Command: python3 scripts/export_lora.py --skill <name> (on RunPod, post-training)

Merges a trained LoRA adapter into the base model, converts to GGUF, and writes a Modelfile for Ollama. Run on the cloud instance after finetune.py; transfer artifacts locally and ollama create to make the skill-tuned model available.


Audit & Verification

Scripts that validate training data quality before it enters the pipeline.

check_hints.py

Command: python3 scripts/check_hints.py

CI hint anti-pattern linter. Runs static analysis across all skills/PT-*.py files and exits 1 on any violation. Rules enforced: C1 (TUI tool without tmux — Class 2 failure risk), C2 (case-sensitive grep — Class 3), C4 (app-specific hint block without a generic companion — Class 10, range lock-in), C7 (hint block exceeds 500-char limit), C8 (cross-step shell variable — Class 1), C10 (success_fn missing _targeted_at guard — Classes 11/15). Add # nocheck inline to suppress a specific violation. Also runs automatically via the check_hint_lengths pre-commit hook.

classify_failures.py

Command: python3 scripts/classify_failures.py --csv testenv/eval_results/<run>.csv

Post-eval failure classifier. Reads an eval CSV and session logs; maps each failing row to a failure class (Classes 1–15) using heuristics. Flags: --gate <rate> exits 1 if model-loop rate (Class 12) exceeds the threshold; --premature-oa audits sessions where halt_reason=OBJECTIVE_ACHIEVED but verify_fn=False; --auto-file opens GitHub issues for hint-gap objectives (Class 6) failing ≥2 consecutive runs. Called automatically by eval-regression.yml after every eval push.

archer_coder_next.py

Command: python3 scripts/archer_coder_next.py

Verification-queue cap enforcer for the Coder role. Queries GitHub labels (tier:hint-only, tier:shared-utility) and checks all three two-tier sub-caps: hint-only pending < 6, shared-utility pending < 2, combined pending < 8. Exits 0 when all caps are clear; exits 1 with a breakdown when any cap is breached. Run this before starting any new Coder issue — a non-zero exit means the Auditor queue must clear before new work begins. Flags issues without a tier label as ⚠ untagged (invisible to the cap; must be corrected). See PROCESSES.md for the full cap rationale.

audit_review.py (in testenv/)

Commands: archer-audit-dry, archer-review-flagged

Two-tier training data audit. Tier 1 (--dry-run) applies structural checks: wrong host, empty output, degenerate loops, lab-suspect annotations. Tier 2 (--tier2) uses Claude Haiku as an LLM judge to score each Tier-1-clean session 0–3 on evidence quality. Sessions scoring below 2 are excluded from fine-tuning. prepare_finetune.py gates on this score.

verify_logs.py

Command: python3 scripts/verify_logs.py

Integrity check for session logs. Verifies SHA-256 sidecars against their corresponding session files. Exits non-zero on any mismatch or missing file. Run before any pipeline step that reads from ~/.archer_sessions/.

compare_tier2_backends.py

Command: python3 scripts/compare_tier2_backends.py

Calibration tool. Scores a sample set of sessions with all three Tier 2 backends (Haiku, Gemini, Ollama/qwen3) and prints score correlation and per-session agreement. Used to diagnose calibration drift between backends.

check_json_validity.py

Command: python3 scripts/check_json_validity.py

Reports the JSON validity rate for --json-output sessions. Gate for fine-tuning: ≥95% overall validity required. --skill and --since filters.


Lab Management

Scripts that manage the evaluation infrastructure.

lab_range.sh

Command: bash scripts/lab_range.sh [start|stop|status]

Manages the full eval lab: starts, stops, and health-checks all target VMs (Metasploitable2, bWAPP, Juice Shop, GOAD-Light). Called automatically by run_data_collection.sh at the start of each collection cycle.

pivot_range.sh

Command: archer-pivot <command> (build, up, down, status, brief)

Manages the Docker-based pivot training range. Creates a chain of 1–7 hops using configurable techniques: SSH local forwarding, SSH dynamic (SOCKS), socat relay, chisel, and Ligolo-ng. --hops N, --seed N, --techniques <list>. brief emits a one-liner suitable for injection into the ARCHER system prompt as context.


Dashboard & Demo

Scripts that produce public-facing artifacts.

generate_dashboard.py

Command: python3 scripts/generate_dashboard.py

Generates docs/dashboard.md with inline SVG charts. Reads eval CSVs, session stats, router label counts, classifier state, and Tier 2 audit results. Also maintains data/trend_archive.csv: one snapshot per 5-day window for long-term trend history. Called by GitHub Actions after every eval result push; --dry-run prints to stdout without writing.

record_demo.py

Command: python3 scripts/record_demo.py --task-id <id> --task "..."

Records a live ARCHER session as an asciinema v2 cast file, without using the asciinema CLI. Spawns ARCHER in a PTY, monitors output for completion signals, waits a configurable grace period after completion, and writes the cast file directly. Avoids the worker-process PTY issues in asciinema's recorder.

trim_cast.py

Command: python3 scripts/trim_cast.py input.cast [output.cast]

Post-processes a cast file to remove post-completion prose and the interactive prompt. Cuts at the first completion signal ([ Objective Reached ], [ FINDINGS ], [OBJECTIVE_ACHIEVED]) with a per-signal delay to let the final output render.


Notifications

archer_notify.py

Command: python3 scripts/archer_notify.py "title" "body"

Shared ntfy push notification utility. Reads ARCHER_NTFY_TOPIC from the environment; silently no-ops if unset. Called by run_data_collection.sh after each audit run and by regression_watcher.py on regressions.

archer_morning_digest.sh

Command: archer-morning-digest or bash scripts/archer_morning_digest.sh

Runs archer-status, formats a summary, and sends it via ntfy. Designed for scheduling via cron (0 8 * * *) so the morning status arrives as a push notification before sitting down.