Glossary

Definitions for the concepts and terms used across Centaur Security Labs research and the ARCHER project. This page is the single reference — terms are defined here once rather than re-defined each time they appear in a paper or article.

Organized by concept group. Cross-references point to the paper where each concept is examined in depth.

Architecture & Design

Centaur architecture

A three-layer design in which model, code, and human each handle the class of work they do reliably. The model generates and interprets; deterministic code routes, verifies, and logs; a named human analyst reviews findings and authorizes high-impact actions. The boundary between layers is architectural — not a guideline or a preference. Named after the human-machine collaboration model Garry Kasparov described after his matches against Deep Blue. Examined in depth in The Stochastic Trap.

Code layer

The deterministic component of the centaur architecture: routing decisions, halt detection, safety constraint enforcement, ground-truth verification, session logging, and audit trail maintenance. These are tasks with correct answers. The code layer cannot be probabilistic — placing a language model here is the definition of the stochastic trap.

Compensating logic

Code written to handle cases where the model fails to follow an expected format, structure, or behavior. Parsers for output variations, fallbacks when the model doesn't comply, safety checks added after generation. Each piece of compensating logic is a diagnostic: the model was given a job it is structurally unsuited for. V1 of ARCHER accumulated approximately 300 lines of it before the architectural boundary was enforced.

Ground truth

The actual state of the target system, confirmed by code-layer verification independent of the model's output. A shell responding at uid=0, a file containing expected content, a port confirmed open by an active probe. Not the model's description of what happened — the model can produce a fluent, confident account of success regardless of whether success occurred. The gap between ground truth and model claims is where false positives live.

Human layer

The human analyst component of the centaur architecture. Holds the contextual judgment, accumulated operational experience, and accountability that cannot be transferred to a model or encoded in rules. Final QA on all output, authorization of high-impact actions, and interpretation of findings against organizational risk are human-layer responsibilities — not because we haven't automated them yet, but because the knowledge required to do them is tacit by nature.

Model layer

The language model component of the centaur architecture. Appropriate for tasks where pattern-matching over a large training corpus produces better results than any deterministic rule: command generation, output interpretation, attack chain reasoning, multi-turn investigation. Probabilistic by construction — identical inputs can produce different outputs. Not appropriate for tasks that require deterministic correctness.

Local-first architecture

A deployment model in which inference runs entirely within the operator's own hardware and network boundary — no external API, no data leaving the system, no third-party inference provider in the chain. Local-first solves specific problems: no external dependency risk, no data sovereignty exposure, an end-to-end auditable inference path. The tradeoff is hardware cost and the engineering work required to stay within a hard resource constraint (ARCHER's constraint: 8 GB VRAM). Local-first is a deliberate design choice for threat-sensitive environments, not a default. Examined in depth in The 8GB VRAM Constraint: Architecting a Local-First AI Security Agent (Centaur Security Labs, pending release).

Hint (hint block)

A conditional instruction block within a skill pack that fires when a task string matches specific keywords. Each block injects targeted guidance into the model's context: exact tool invocations, recommended sequencing, expected output patterns, and success indicators for a specific technique. Hints are the primary mechanism for encoding practitioner knowledge into the system without modifying model weights. They require the two-layer structure (app-specific + generic companion) to avoid producing range lock-in. Examined in depth in Range Lock-In.
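
As a rough illustration of keyword-triggered hint selection, the sketch below matches whole words in a task string against each block's keyword set. The `HintBlock` class, its fields, and the sample guidance strings are hypothetical; ARCHER's actual hint format and matching rules are not described in this glossary.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HintBlock:
    # Hypothetical structure: trigger keywords plus the guidance to inject.
    keywords: frozenset
    guidance: str


def matching_hints(task: str, hints: list) -> list:
    """Return guidance from every block whose keywords appear in the task."""
    # Whole-word matching, so the keyword "port" would not fire on "report".
    words = set(re.findall(r"[a-z0-9]+", task.lower()))
    return [h.guidance for h in hints if words & h.keywords]


hints = [
    HintBlock(frozenset({"scan", "ports"}), "Enumerate services before exploiting."),
    HintBlock(frozenset({"sql", "injection"}), "Confirm the injectable parameter first."),
]

selected = matching_hints("Scan 192.168.56.103 for open ports", hints)
```

Only the first block fires for this task string; a task with no matching keyword receives no injected guidance.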

Playbook

A database of validated command sequences, abstracted from operational sessions using IP and credential placeholders, that the model can query for guidance on specific task types. The playbook accumulates from successful sessions: a verified exploit sequence becomes a playbook entry with concrete commands replaced by {TARGET} and {PORT} placeholders. The model fills the placeholders from recon context at session time. Distinguished from hints — hints are pre-loaded at session start; playbook entries are retrieved on demand during the session.
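
The abstraction and fill steps can be sketched as follows. This is a minimal illustration that assumes playbook entries are plain strings using the `{TARGET}` and `{PORT}` placeholders named above; the function names and the token-by-token replacement strategy are hypothetical, not ARCHER's implementation.

```python
def abstract_command(command: str, target: str, port: int) -> str:
    """Replace the concrete target and port tokens with placeholders."""
    mapping = {target: "{TARGET}", str(port): "{PORT}"}
    # Token-level replacement avoids rewriting digits inside other arguments.
    return " ".join(mapping.get(token, token) for token in command.split())


def fill_entry(entry: str, target: str, port: int) -> str:
    """Fill a playbook entry from recon context at session time."""
    return entry.format(TARGET=target, PORT=port)


entry = abstract_command("nmap -sV -p 21 192.168.56.103", "192.168.56.103", 21)
command = fill_entry(entry, "10.0.0.5", 2121)
```

Here `entry` is the stored, target-independent form and `command` is the concrete invocation rebuilt for a different host and port.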

Run context

The structured state summary passed to the model at each turn of a session: confirmed hosts and services from prior commands, credentials obtained, findings extracted, objectives partially completed, and scope boundaries in effect. Run context prevents the model from re-discovering information it already has and enables multi-step chains where each step builds on prior output. Managing run context within a fixed token budget is a core architectural constraint — unbounded context accumulation produces context pressure and eventually halts the session.
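
One way to keep run context inside a fixed budget is to retain only the most recent entries that fit. The sketch below assumes a recency-based eviction policy and a crude whitespace token estimate; both are illustrative assumptions, since this glossary does not specify how ARCHER trims context.

```python
def build_run_context(entries: list, budget: int) -> list:
    """Keep the newest entries whose combined estimated size fits the budget."""
    kept = []
    used = 0
    for entry in reversed(entries):
        cost = len(entry.split())  # rough token estimate (assumption)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


entries = ["host 10.0.0.5 up", "port 21 open ftp", "creds admin found"]
context = build_run_context(entries, budget=7)
```

With a budget of 7 estimated tokens, the oldest entry is dropped and the two most recent are kept in their original order.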

Skill pack

A self-contained, domain-specific capability bundle: hints, guidance, tool permissions, success criteria, and domain context for a specific operational capability. Each skill pack loads on demand for its domain (penetration testing, threat hunting, DFIR, hardening). Only one skill pack is active per session — the domain determines the model's entire context frame for that run. Skill packs are the primary unit of domain engineering: adding a new capability means building a new skill pack, not modifying the core agent.

Stochastic trap

The design pattern in which a probabilistic system is placed in a role requiring deterministic behavior, and the resulting failures are absorbed by instructions and compensating code rather than by reassigning the work. Each patch treats the symptom rather than the cause: the model is still in a deterministic role, still producing failures at its natural rate, and the system is now carrying the weight of every workaround. Left uncorrected, compensating logic becomes the dominant complexity. Examined in depth in The Stochastic Trap.

Task variant

An alternate phrasing of a canonical objective task string, used to test whether the agent generalizes beyond memorized commands. A canonical task ("scan 192.168.56.103 for open ports") has multiple variants with different wording, specificity, and framing. An agent that passes on the canonical but fails on variants has learned the words, not the skill. Variant breadth is a direct measure of generalization.

Verification gate

A code-layer check that runs independently of the model's claimed completion to confirm the objective was genuinely achieved. Probes actual target state — checks that a shell is responding at uid=0, that a file contains expected content, that a port is confirmed open by an active probe. The verification gate is what separates a measured pass from a confabulated one: the model cannot influence the result because the gate runs deterministic code against real system state.
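
A minimal sketch of the idea, assuming the gate is handed a probe function that returns observed target state as text. The `verification_gate` name, the result fields, and the stub probes are hypothetical; a real probe would run against the live target.

```python
from typing import Callable


def verification_gate(claimed: bool, probe: Callable[[], str], expected: str) -> dict:
    """Classify a session from the probe result, not from the model's claim."""
    observed = probe()  # deterministic check against target state
    verified = expected in observed
    return {
        "claimed": claimed,
        "verified": verified,
        # A claim that fails verification is a false positive.
        "false_positive": claimed and not verified,
    }


# Stub probes standing in for running `id` in the obtained shell.
def root_shell() -> str:
    return "uid=0(root) gid=0(root) groups=0(root)"


def web_shell() -> str:
    return "uid=33(www-data) gid=33(www-data) groups=33(www-data)"


passed = verification_gate(True, root_shell, "uid=0(")
phantom = verification_gate(True, web_shell, "uid=0(")
```

The model's claim is recorded but never consulted when computing `verified`, which is the property that keeps the result independent of model output.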

VRAM bleed

Uncontrolled GPU memory consumption that accumulates across model invocations or context loads until available VRAM is exhausted, forcing the model to be evicted and reloaded mid-session. In ARCHER's 8 GB VRAM environment, VRAM bleed manifests as a sudden session crash or silent zero-output halt, typically appearing on longer sessions or after model-switch sequences. Root cause is context accumulation or model caching behavior exceeding the hardware budget. One of ARCHER's named failure classes.

Model Behavior

Automation bias

The tendency to over-trust automated system output and reduce independent verification over time. In security operations, automation bias manifests as accepting AI-generated findings without tracing them to source tool output, or relaxing human review as the system appears to perform well. The centaur architecture treats automation bias as a structural risk: human review is not optional and is not reducible — it is the third layer of the system.

Confabulation

The production of confident, fluent, plausible output that does not correspond to actual system state. A structural property of language models: they generate the statistically likely continuation of a prompt, which can produce a convincing success narrative when no success occurred. Confabulation is not deception — there is no intent, no theory of mind, no awareness of the gap between the claim and reality. The architectural response is verification: code-layer checks against ground truth that catch confabulated completions before they reach an analyst or a training pipeline.

Context pressure

The degradation of instruction-following behavior as a session accumulates prior turns. As earlier content fills the model's context window, format instructions and behavioral guidelines compete with session history for the model's attention, and lose. A model that reliably follows output structure early in a session may produce non-compliant output in a long one. Context pressure predicts output format drift and is one of the reasons ARCHER operates within a fixed context budget.

Output format drift

The progressive departure from a prescribed output format over the course of a long session. Produced by context pressure. Observable as missing structural markers, verbose interpretation where terse tokens were expected, or gradual breakdown of a format the model followed correctly at session start.

System 1 / System 2

Daniel Kahneman's framework for two modes of cognition: System 1 is fast, automatic, pattern-matching; System 2 is slow, deliberate, correctness-checking. In the centaur architecture, the model layer maps to System 1 — generating candidate actions quickly from pattern recognition across training data. The code and human layers map to System 2 — verifying, routing, and authorizing with deliberate correctness requirements. The architecture works when each layer stays in its mode; failures occur when System 1 outputs are treated as System 2 results.

Direction gap

The performance gap in AI-augmented work that is attributable to differences in human direction skill rather than differences in AI capability. Most analyses of AI-augmented team performance treat AI capability as the primary variable and the human operator as a constant. The direction gap hypothesis inverts this: when two teams using the same model produce dramatically different results, the gap is almost never in the model — it is in how the human directs the model. Direction skill consists of identifiable sub-skills: context externalization (building the structured documents that supply the model's working memory across sessions), failure codification (encoding past failure history as constraints the system will respect), verification design (structuring sessions so that ground-truth checks are built in rather than bolted on), and scope discipline (maintaining the boundary between what the model decides and what the human authorizes). Examined in depth in The Direction Gap.

Tacit knowledge

Knowledge that is held but cannot be fully articulated — the accumulated operational judgment that tells an experienced analyst when a finding is meaningful in context, when a scope boundary should hold, when something requires escalation. Tacit knowledge is not a gap to be closed by better prompting or more training data. It is the reason the human layer exists in the centaur architecture and cannot be designed out of it.

Findings grounding

The requirement that every finding in an agent session be traceable to specific tool output in the session log. A grounded finding cites the command that produced it and the output that supports it. An ungrounded finding is produced without supporting output — it is confabulation. Tier 2 audit scoring evaluates findings on grounding before any session enters the training pipeline; sessions with ungrounded findings fail quality filtering and do not contribute training data.
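
A grounding check can be sketched as a lookup from each finding back into the session log. The field names (`command`, `output`, `evidence`) and the sample log data below are made up for illustration and do not reflect ARCHER's actual log schema.

```python
def is_grounded(finding: dict, log: list) -> bool:
    """True if the cited command is in the log and its output contains the evidence."""
    evidence = finding.get("evidence", "")
    if not evidence:
        return False  # nothing cited, so nothing to trace
    return any(
        turn["command"] == finding.get("command") and evidence in turn["output"]
        for turn in log
    )


log = [
    {"command": "nmap -sV -p 21 10.0.0.5", "output": "21/tcp open ftp vsftpd 2.3.4"},
]
grounded = {
    "claim": "FTP service is vsftpd 2.3.4",
    "command": "nmap -sV -p 21 10.0.0.5",
    "evidence": "vsftpd 2.3.4",
}
ungrounded = {
    "claim": "SSH allows root login",
    "command": "nmap -sV -p 22 10.0.0.5",
    "evidence": "PermitRootLogin yes",
}
```

The second finding cites a command that never appears in the log, so it fails the check regardless of how plausible the claim sounds.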

Phantom pass

A session where the model claims objective completion and [OBJECTIVE_ACHIEVED] fires, but the code-layer verification gate confirms that target state did not change. Distinguished from confabulation by specificity: confabulation describes the general property of producing plausible ungrounded output; a phantom pass is the specific outcome where that output was convincing enough to trigger completion — and constitutes a training contamination event if it enters the pipeline unfiltered.

Premature objective achieved

Firing the completion signal before the objective is genuinely complete — typically on the first interesting output rather than the confirmed final outcome. The most common halt discipline failure in ARCHER's eval record, recurring 23 times across the first six weeks of active development. Produces false positives: the session logs a pass, the objective was not achieved, and if the session enters training it teaches the model to stop too early.

Tool alignment

The degree to which the model selects and invokes tools appropriate to the current task, given the available tool set. Poor tool alignment manifests as using a generic approach when a specialized tool exists, or using a tool in a way that produces no useful output for the objective. Tier 2 audit scoring evaluates tool alignment as one of four quality criteria before sessions enter the training pipeline.

Quality & Learning

Eval-driven development

A development methodology in which a continuously running evaluation harness (not a test suite, not code review, not manual QA) is the primary signal guiding system improvement. The loop: define objectives with independent success criteria, run sessions against real targets, measure pass rates, diagnose failures, fix, repeat. Eval-driven development makes performance regressions visible immediately rather than after deployment; it also creates a constraint that conventional software engineering does not face: the evaluation harness itself is a candidate for refactoring, and changing it risks corrupting the longitudinal data it has been producing. See the measurement instrument problem.

Longitudinal benchmark

A record of per-objective pass rates tracked across successive eval runs over an extended development period. The longitudinal benchmark is the primary source of signal for regression detection, training data quality assessment, and development prioritization. Its integrity depends on measurement instrument stability: if the eval harness changes in ways that affect pass rates independent of model behavior, the longitudinal record is corrupted. Distinguished from a point-in-time benchmark, which measures capability at a single moment. ARCHER's longitudinal benchmark covers 200+ evaluation sessions. Examined in depth in Beyond Pass Rate (Centaur Security Labs, pending release).

Measurement instrument problem

The constraint, specific to eval-driven AI development, that the evaluation harness cannot be refactored without risking corruption of the longitudinal benchmark data it is producing. In conventional software engineering, a test suite can be restructured, expanded, or corrected without affecting what the tests measure — the tests and the system under test are independent artifacts. In eval-driven AI development, the harness is simultaneously the measurement instrument and a candidate for improvement: fixing a false-positive verifier changes the baseline; reordering objectives changes the run context available to later objectives; updating a task string changes what behavior the training data captures. Any of these changes can make historical pass rates non-comparable to current ones. The implication is that harness evolution requires version control, a documented change log, and explicit decisions about whether historical data collected under prior harness versions remains valid. Examined in depth in The Measurement Instrument Problem (Centaur Security Labs, pending release).

Dark knowledge

Code or systems that the team depends on but does not fully understand — produced when AI-assisted generation outpaces deliberate comprehension. Dark knowledge compounds silently: invisible until something breaks in a way no one can diagnose. The inverted apprenticeship model is the response: building fast with AI assistance, then systematically dissecting what was built before the system becomes load-bearing.

False positive rate

The fraction of task-completion signals that fail independent verification — sessions where the model claimed success but a code-layer ground-truth check confirmed the target state did not match. ARCHER's measured baseline: 9.1% in a controlled 87-session run; 18.4% across 1,639 collected sessions. Every false positive that enters a training pipeline teaches the model to produce more of them.
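
The calculation itself is simple: the denominator is the number of completion signals, not the number of sessions. The sketch below uses made-up session records, not ARCHER data, and the field names are hypothetical.

```python
def false_positive_rate(sessions: list) -> float:
    """Fraction of claimed completions that failed independent verification."""
    claimed = [s for s in sessions if s["claimed"]]
    if not claimed:
        return 0.0  # no completion signals, so no false positives to count
    failed = sum(1 for s in claimed if not s["verified"])
    return failed / len(claimed)


sessions = [
    {"claimed": True, "verified": True},
    {"claimed": True, "verified": True},
    {"claimed": True, "verified": False},
    {"claimed": True, "verified": True},
    {"claimed": False, "verified": False},  # no completion signal, excluded
]
rate = false_positive_rate(sessions)  # 1 of 4 claims failed verification
```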

Halt discipline

The property of an agent session where the model stops issuing commands when the objective is genuinely complete, rather than running to the command ceiling or stopping prematurely on partial evidence. Poor halt discipline takes two forms: running past completion (wasted work, context consumption) and stopping early on the first interesting output (incomplete task). Measuring and improving halt discipline is one of the primary levers on training data quality.

Inverted apprenticeship

A learning model for AI-assisted development in which the practitioner builds first and achieves understanding second, through deliberate dissection of what was built. Inverts the traditional model (understand first, then build) because AI assistance produces working systems faster than construction-time understanding can keep pace with. The critical condition: the system must be engineered to produce diagnostic data, and dissection must be deliberate — not incidental — before the system is load-bearing. Examined in depth in The Inverted Apprenticeship (Centaur Security Labs, pending release).

Probabilistic residual

The set of model assertions in a session that were not confirmed by independent code-layer checks — everything the model produced that the system did not verify against actual target state. The residual is where the model's probabilistic nature is most visible: plausible, sometimes correct, but not evidenced. In production, the residual is the primary artifact for human review. The analyst evaluates what the code layer could not confirm.
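
Computed naively, the residual is what remains after removing every assertion the code layer confirmed. The sketch assumes assertions can be compared as plain strings, which is a simplification for illustration; the sample assertions are invented.

```python
def probabilistic_residual(assertions: list, verified: set) -> list:
    """Return model assertions that no code-layer check confirmed, in order."""
    return [a for a in assertions if a not in verified]


assertions = [
    "port 21 is open",
    "service is vsftpd 2.3.4",
    "host is likely a staging server",
]
verified = {"port 21 is open", "service is vsftpd 2.3.4"}
residual = probabilistic_residual(assertions, verified)
```

The remaining item is the kind of plausible but unevidenced statement that goes to the analyst for review.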

Sufficiency vs. optimality

Two distinct standards for evaluating AI agent performance. Sufficiency asks: did the agent find a correct solution? Binary pass/fail on whether an objective was achieved. Optimality asks: did the agent find the best correct solution given operational constraints? Most AI security evaluation is built around sufficiency — it is the right standard for measuring capability. It is not sufficient for production deployment, where multiple correct solutions exist but differ significantly in stealth, evidence left behind, operational safety, and alignment with engagement-specific constraints that the AI cannot fully know. The transition from sufficiency to optimality requires quality-weighted training data selection, multi-solution exploration within sessions, and a principled human-in-the-loop mechanism at the point where "best" is defined. Examined in depth in Sufficiency vs. Optimality (Centaur Security Labs, pending release).

Training pipeline

The end-to-end process that converts operational sessions into a fine-tuned model: session collection → structural quality checks → LLM-as-judge quality scoring → data preparation → fine-tuning → deployment. Each stage is a filter. The pipeline closes the operational feedback loop: performance in the field generates training data that improves future performance. The quality of the loop depends entirely on the integrity of the filtering stages — a false positive that enters training teaches the wrong behavior.

Audit tiers (T1 / T2 / T3)

The staged quality filtering process applied to ARCHER sessions before training pipeline entry. Tier 1 applies deterministic structural checks — tool presence, output volume, timing plausibility, exit code validity — and is a fast filter for obvious disqualifiers. Tier 2 applies an LLM-as-judge scoring pass across four criteria: findings grounded in tool output, appropriate tool selection, genuine completion, scope adherence. Sessions must clear both tiers before entering fine-tuning data. Tier 3 is human auditor review, reserved for sessions flagged as borderline or high-stakes by T2. See also: audit pipeline.
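
A Tier 1 style filter might look like the sketch below. The interpretation of each check, the thresholds, and the session schema (`turns`, `output`, `duration_s`, `exit_code`) are assumptions made for illustration; the actual Tier 1 rules are not specified in this glossary.

```python
def tier1_failures(session: dict, min_output_chars: int = 20,
                   min_seconds: float = 0.01) -> list:
    """Return the names of the structural checks a session fails."""
    turns = session["turns"]
    if not turns:
        return ["tool presence"]  # no tool was ever invoked
    failures = []
    if sum(len(t["output"]) for t in turns) < min_output_chars:
        failures.append("output volume")
    if any(t["duration_s"] < min_seconds for t in turns):
        failures.append("timing plausibility")
    if any(not isinstance(t["exit_code"], int) or not 0 <= t["exit_code"] <= 255
           for t in turns):
        failures.append("exit code validity")
    return failures


good = {"turns": [{"output": "21/tcp open ftp vsftpd 2.3.4",
                   "duration_s": 1.8, "exit_code": 0}]}
bad = {"turns": [{"output": "", "duration_s": 0.0, "exit_code": None}]}
```

Because every check is deterministic, the same session always produces the same verdict, which is what makes this tier suitable as a fast pre-filter ahead of the judge model.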

Command ceiling

The maximum number of commands the agent may issue in a single session before the code layer forcibly halts it. Prevents unbounded session cost and context accumulation. An agent that regularly approaches the command ceiling without completing the objective has a halt discipline problem — it is failing to recognize completion or is looping. Sessions that hit the ceiling are logged as incomplete regardless of what the model claimed.

LLM-as-judge

The use of a language model as a scoring system for evaluating other language model outputs — in ARCHER's case, a judge model scoring operational sessions across the Tier 2 quality criteria before those sessions enter the training pipeline. Enables quality evaluation at scale where manual annotation would be the bottleneck. Introduces the risk that the judge inherits the same failure modes as the model being judged: a judge that is also prone to confabulation can pass false positives. Judge calibration and human audit sampling are required controls.

QLoRA

Quantized Low-Rank Adaptation — a parameter-efficient fine-tuning technique that trains a small set of additional weight matrices (LoRA adapters) on domain-specific data without modifying the base model weights. The quantized component runs the base model in reduced precision (4-bit), dramatically lowering VRAM requirements during training. ARCHER uses QLoRA to produce domain-specialized adapters from operational sessions on hardware within the 8 GB VRAM constraint — fine-tuning that would otherwise require 40+ GB of VRAM runs within 8 GB with acceptable quality loss.

Session metrics (OA / FP / HD / ER / SR)

The five primary quality signals extracted from each eval session. Objective Achievement (OA): did the session achieve the objective. False Positive (FP): did a claimed success fail independent verification. Halt Discipline (HD): did the model stop when the objective was complete, neither prematurely nor after unnecessary continuation. Efficiency Ratio (ER): commands used relative to the minimum required. Step Recall (SR): coverage of the canonical solution's required steps. Together these distinguish a fast wrong answer from a slow correct one and provide the multi-dimensional quality signal needed for training data selection.
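
The two ratio metrics can be sketched as below. The direction of the Efficiency Ratio (commands used divided by the minimum, so 1.0 is ideal and larger is worse) and the step labels are assumptions; the glossary defines the metrics only in words.

```python
def efficiency_ratio(commands_used: int, minimum_required: int) -> float:
    """Commands used relative to the minimum required (1.0 is ideal)."""
    return commands_used / minimum_required


def step_recall(session_steps: list, canonical_steps: list) -> float:
    """Fraction of the canonical solution's steps that the session covered."""
    done = set(session_steps)
    return sum(1 for step in canonical_steps if step in done) / len(canonical_steps)


er = efficiency_ratio(6, 4)
sr = step_recall(
    ["scan", "enumerate", "report"],
    ["scan", "enumerate", "exploit", "verify"],
)
```

In this made-up session the agent used 1.5 times the minimum number of commands and covered half of the canonical steps.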

Training contamination

The corruption of training data quality by including sessions containing errors the fine-tuned model should not learn: false positives, premature halts, confabulated findings, wrong-tool selections. Training contamination is the mechanism by which quality failures in the operational pipeline propagate into degraded model behavior across successive fine-tuning iterations. Each contaminated session teaches the model to reproduce the error it contains. Audit tiers T1/T2 exist specifically to prevent contaminated sessions from reaching the fine-tuning stage.

Centaur Agent in Operation

Centaur Agent (codename: ARCHER)

The public product name for the AI security operations agent developed at Centaur Security Labs. Referred to throughout the build journal and private repository by its internal development codename ARCHER, which is preserved for codebase consistency. A locally-hosted agent built on the centaur architecture: runs entirely within the operator's network boundary on commodity GPU hardware; the model generates candidate actions and interprets tool output; deterministic code handles routing, halt detection, verification, and logging; a named human analyst reviews findings and is accountable for the output. Covers the full security operations lifecycle across multiple domains: penetration testing, threat hunting, digital forensics, hardening, and others.

Centaur Eval (codename: AgentEval)

The public product name for the evaluation framework built alongside Centaur Agent. Referred to in development documentation as AgentEval or the eval harness. A live end-to-end measurement system: a defined set of objectives run against real vulnerable targets, producing per-objective pass/fail results with independent ground-truth verification. Not a unit test suite — exercises the full agent loop against the actual task distribution. Baseline: 94% pass rate across 51 active objectives against Metasploitable2, BWA, bee-box, and Juice Shop. Planned for public release as a standalone open-source benchmark framework after the V2 fine-tuned model ships.

Audit pipeline

The quality filtering system that determines which sessions are fit for the training pipeline. Tier 1 performs deterministic structural checks: tool presence, output volume, plausible timing, exit code validity. Tier 2 applies an LLM-as-judge scoring pass across four criteria — findings grounded in tool output, appropriate tool selection, genuine completion, scope adherence. Sessions must clear both tiers before entering fine-tuning data.

Domain

The top-level capability loaded for a session — penetration testing, threat hunting, digital forensics, and so on. Each domain is a self-contained skill pack with specialized guidance, tools, and evaluation criteria. Only one domain runs per session; the domain determines what the model receives as context for that session and what the code layer uses to evaluate completion.

Eval harness

The quality measurement system: a defined set of objectives run against real vulnerable targets, producing per-objective pass/fail results with independent ground-truth verification. Not a unit test suite. A live end-to-end measurement that exercises the full agent loop against the actual task distribution. ARCHER's baseline: 94% pass rate across 51 active objectives against Metasploitable2, BWA, bee-box, and Juice Shop.

Fine-tuning

The process of training a pre-trained model on domain-specific data to specialize its behavior for a particular task distribution. ARCHER uses QLoRA fine-tuning: a small set of additional weight matrices (the LoRA adapter) is trained on ARCHER's operational sessions without modifying the base model. Fine-tuning moves the model from general security knowledge toward the specific tasks ARCHER actually runs — the operational distribution, not a benchmark.

LoRA adapter

The artifact produced by fine-tuning. Rather than retraining all parameters of the base model, LoRA trains a pair of small low-rank matrices for each adapted weight matrix; their product approximates the weight update. The adapter is small, portable, and reversible, because the base model is unchanged. ARCHER's adapter accumulates domain knowledge from operational sessions; the base model provides the general capability that the adapter specializes.
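
To see why the adapter is small: for a frozen weight matrix W of shape d × k, LoRA learns B (d × r) and A (r × k) and applies W + BA, so the trainable parameter count is r(d + k) instead of dk. The sketch below works through that arithmetic; the dimensions and rank are illustrative values, not ARCHER's configuration.

```python
def lora_parameter_counts(d: int, k: int, r: int) -> tuple:
    """Return (full matrix parameters, LoRA adapter parameters) for one weight."""
    full = d * k           # parameters in the frozen weight matrix W
    adapter = r * (d + k)  # parameters in B (d x r) plus A (r x k)
    return full, adapter


full, adapter = lora_parameter_counts(d=4096, k=4096, r=8)
fraction = adapter / full
```

For these example dimensions the adapter holds 65,536 parameters against 16,777,216 in the full matrix, which is 1/256 of the original.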

Objective

A specific, measurable task defined in the eval harness: a task string, a target system, a success criterion verified by independent code-layer check, and a command budget. Objectives are the unit of quality measurement — not sessions, not runs. A session either achieves the objective or it does not; the rate at which sessions achieve a given objective is the primary quality signal.

Model tiering

A routing architecture in which different tasks are assigned to different model sizes based on their reasoning requirements. Tier 1 tasks (reconnaissance, port scanning, service enumeration) require pattern-matching over structured output with no multi-step reasoning — a smaller, faster model handles them without quality loss. Tier 2 tasks (exploitation, privilege escalation, AD attacks, reporting) require chained inference, tool output interpretation, and planning across multiple steps — the full model is required. Tiered routing produces speed and cost improvements on the Tier 1 workload without touching Tier 2 quality; in ARCHER's implementation, qwen3:8b handles all Tier 1 objectives at the same pass rate as qwen3:14b at 2–3× faster wall clock time. Examined in depth in Smarter Than One: Model Tiering, Domain Specialization, and the Future of Multi-Model Security Agents (Centaur Security Labs, pending release).
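
A tier lookup can be as small as the sketch below. It assumes the task has already been labeled with a task type, and that `qwen3:14b` is the full model used for Tier 2, which the entry above implies but does not state outright. Defaulting unknown task types to the larger model is also an assumption.

```python
# Tier 1 task types as listed in the entry above.
TIER1_TASKS = frozenset({"reconnaissance", "port scanning", "service enumeration"})


def select_model(task_type: str) -> str:
    """Pick a model tag for a task type."""
    if task_type.lower() in TIER1_TASKS:
        return "qwen3:8b"
    # Tier 2 and unrecognized task types go to the larger model.
    return "qwen3:14b"


recon_model = select_model("Port scanning")
exploit_model = select_model("exploitation")
```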

Router

The system that maps a task string to the correct skill and guidance set at session start. Routes first through a trained classifier (TF-IDF + logistic regression), falling back to keyword scoring when classifier confidence is low. A routing miss corrupts the entire session — the model receives wrong tools, wrong guidance, and wrong constraints with no recovery path within the session.
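
The classifier-then-fallback flow can be sketched without committing to a particular model library. Here the classifier is passed in as a callable returning a label and a confidence; the 0.6 threshold, the keyword table, and the skill names are hypothetical, and a trained TF-IDF plus logistic regression pipeline would stand in for the stubs.

```python
import re
from typing import Callable, Optional


def route(task: str, classify: Callable, keyword_table: dict,
          threshold: float = 0.6) -> Optional[str]:
    """Use the classifier when confident, otherwise fall back to keyword scoring."""
    label, confidence = classify(task)
    if confidence >= threshold:
        return label
    words = set(re.findall(r"[a-z0-9]+", task.lower()))
    scores = {skill: len(words & kws) for skill, kws in keyword_table.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # No keyword hit at all is reported as a miss rather than guessed.
    return best if scores[best] > 0 else None


keyword_table = {
    "pentest": {"scan", "exploit", "ports"},
    "dfir": {"timeline", "artifact", "memory"},
}


def confident(task: str) -> tuple:
    return ("pentest", 0.93)  # stub classifier output


def unsure(task: str) -> tuple:
    return ("pentest", 0.31)  # stub classifier output


first = route("scan the host for open ports", confident, keyword_table)
second = route("build a timeline from the memory image", unsure, keyword_table)
```

Returning `None` on a total miss makes the routing failure explicit, which matters given that a wrong route has no recovery path within the session.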

Session

A single ARCHER agent run from task input to terminal exit. Begins when the agent receives a task and ends when the model signals completion, the code layer halts it at the command ceiling, or an error or timeout fires. Every session produces a complete audit log of every command issued, every output returned, and every finding extracted. Qualifying sessions — those that pass quality verification — become training candidates.

Audit trail

The non-repudiable, code-layer-maintained record that a specific command or finding was generated by the AI agent at a specific time, against a specific target, for a specific human-authorized objective. In ARCHER, the audit trail is written by the code layer and is not editable by the model — the model cannot modify the log of what it issued. Required for any production use of AI security tooling: the analyst reviewing findings must be able to trace every finding to its originating command and the raw tool response that supported it.

Prerun check

A sequence of environment health checks run before any eval harness invocation to confirm the lab is in a known good state: target services reachable, no stale lock files, containers running, no contamination from prior sessions. A failing prerun check is a stop condition — running evals against a degraded lab produces misleading pass rate data and may corrupt the longitudinal benchmark. Mandatory before every harness invocation, including opportunistic runs.

Session log

The complete machine-readable audit record of an ARCHER session: every command issued, every tool response returned, every finding extracted, every verification result, and the final objective outcome with pass/fail classification. Session logs are the input to the audit pipeline, the primary diagnostic artifact for investigating objective failures, and the source data for the longitudinal benchmark. Stored as JSONL; each line is one agent turn.
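
Because each line is an independent JSON object, a log can be read one turn at a time with the standard library. The field names in the sample records are hypothetical; only the one-turn-per-line JSONL layout comes from the entry above.

```python
import json


def read_turns(jsonl_text: str) -> list:
    """Parse JSONL text into a list of turn records, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


sample = "\n".join([
    json.dumps({"turn": 1, "command": "nmap -sV 10.0.0.5", "output": "21/tcp open ftp"}),
    json.dumps({"turn": 2, "command": "id", "output": "uid=0(root)"}),
])

turns = read_turns(sample)
commands = [t["command"] for t in turns]
```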

Standards & Frameworks

HAZOP hint review

Hazard and Operability Study methodology applied to AI security agent hint design. In ARCHER, the HAZOP checklist evaluates every proposed hint change against five guidewords before the commit: WRONG-TARGET (can this hint fire on the wrong host?), NO-RESPONSE (does it leave the model spinning if the service is down?), PARTIAL-RESPONSE (does success_fn correctly fail if only part of the chain completes?), CONFLICTING-HINT (does this contradict a generic companion block for the same vulnerability class?), and OVERFIT (does it encode target-specific values without generic companions?). Borrowed from process safety engineering, where HAZOP systematically applies deviation guidewords to each design element in an industrial system to surface latent hazards before they manifest. Examined in depth in ARCHER Failure Mode Inventory.

Kill chain

The Lockheed Martin Intrusion Kill Chain — a model of a cyber attack as seven sequential phases: reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives. Used in ARCHER to structure attack path reasoning and to ensure post-exploitation objectives are contextualized within a complete attack lifecycle. Distinguished from MITRE ATT&CK, which enumerates techniques at fine granularity without prescribing phase sequence. The kill chain provides the "what is this phase for" framing; ATT&CK provides the "what techniques exist at this phase" inventory.

MITRE ATT&CK

A publicly maintained knowledge base of adversary tactics and techniques, organized as a matrix of 14 tactics (columns) and hundreds of specific techniques (cells). Tactics represent the adversary's goal at a given stage. The 14 Enterprise tactics are Reconnaissance, Resource Development, Initial Access, Execution, Persistence, Privilege Escalation, Defense Evasion, Credential Access, Discovery, Lateral Movement, Collection, Command and Control, Exfiltration, and Impact. Techniques are the methods by which that goal is achieved. ARCHER maps session findings to ATT&CK technique IDs in output, providing a standard reference vocabulary shared with SIEM detection rules, threat intelligence platforms, and detection engineering teams.
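Mapping findings to technique IDs can be as simple as a lookup table maintained in the code layer. In the sketch below the technique IDs and names are real ATT&CK Enterprise entries, but the finding-type keys, the `type` field, and the `tag_finding` helper are hypothetical and do not describe ARCHER's actual output schema.

```python
# Hypothetical finding-type keys mapped to real ATT&CK Enterprise techniques.
TECHNIQUE_BY_FINDING = {
    "service_scan": ("T1046", "Network Service Discovery"),
    "password_guessing": ("T1110", "Brute Force"),
    "web_app_exploit": ("T1190", "Exploit Public-Facing Application"),
    "local_privesc_exploit": ("T1068", "Exploitation for Privilege Escalation"),
}


def tag_finding(finding: dict) -> dict:
    """Return a copy of the finding annotated with its ATT&CK technique, if known."""
    technique = TECHNIQUE_BY_FINDING.get(finding.get("type"))
    tagged = dict(finding)
    tagged["attack_id"] = technique[0] if technique else None
    tagged["attack_name"] = technique[1] if technique else None
    return tagged
```

Unknown finding types are tagged with `None` rather than guessed, so an unmapped finding is visible to the analyst instead of being silently mislabeled.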

PTES

Penetration Testing Execution Standard — a framework defining the phases of a penetration test: pre-engagement interactions, intelligence gathering, threat modeling, vulnerability analysis, exploitation, post-exploitation, and reporting. ARCHER's skill domain follows the PTES phase structure, with eval objectives mapped to phases. Using PTES as the organizing framework means objectives map to a documented professional standard and performance can be described against an externally understood reference rather than a bespoke taxonomy.

Failure Modes

Two-layer hint rule

The structural requirement that every hint block targeting a specific application or IP address must be paired with a generic companion block that teaches the transferable pattern. The app-specific block provides exact commands for the training target (DVWA, Metasploitable2, bWAPP) and drives eval pass rate against the training environment. The generic companion provides placeholder-based guidance (<login-endpoint>, <token-field>, <target-service>) that teaches the vulnerability class rather than the specific target. Without the generic companion, the model learns to solve the training box — it accumulates correct commands for known targets without developing the underlying reasoning that transfers to novel ones. This is the structural fix for range lock-in. Examined in depth in Range Lock-In.
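The rule lends itself to a mechanical check. The sketch below assumes hint blocks are dictionaries with hypothetical `vuln_class` and `body` fields, and it detects app-specific blocks only by hard-coded IPv4 addresses; detecting application names such as DVWA would need an additional list that is not shown here.

```python
import re
from typing import Dict, List

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PLACEHOLDER = re.compile(r"<[a-z][a-z0-9-]*>")


def is_app_specific(body: str) -> bool:
    """Treat a block as app-specific if it hard-codes an IPv4 address."""
    return bool(IPV4.search(body))


def is_generic(body: str) -> bool:
    """Treat a block as generic if it uses placeholders and no hard-coded IP."""
    return bool(PLACEHOLDER.search(body)) and not is_app_specific(body)


def missing_companions(blocks: List[Dict[str, str]]) -> List[str]:
    """Return vulnerability classes with an app-specific block but no generic one."""
    specific = {b["vuln_class"] for b in blocks if is_app_specific(b["body"])}
    generic = {b["vuln_class"] for b in blocks if is_generic(b["body"])}
    return sorted(specific - generic)
```

Given a CSRF class with both an IP-specific block and a `<login-endpoint>` companion, and an SQL injection class with only an IP-specific block, `missing_companions` would report only the SQL injection class.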

Compound failure

An objective that is failing because of two or more independent structural defects simultaneously. Compound failures take disproportionately more fix iterations than single-defect failures: closing one defect leaves the others active, and the objective continues to fail — making it appear the fix didn't work. The ARCHER failure mode inventory identified T53 (ligolo pivot) as the canonical example: four independent failure classes were active simultaneously, each requiring a separate diagnosis and fix.

Failure class

A named category of recurring root cause — a structural defect that manifests under different issue numbers but shares the same underlying cause. Six weeks of eval-driven ARCHER development produced 15 failure classes covering 158 issue instances. The key insight: the same root cause was diagnosed and fixed independently multiple times because each instance looked different on the surface. Naming the class makes the pattern visible and enables class-level remediation rather than per-instance patching. Full taxonomy in ARCHER Failure Mode Inventory (Centaur Security Labs, pending release).

One-bug-one-fix trap

The dominant development anti-pattern in AI agent systems: fixing a failure in the specific objective that surfaced it while leaving adjacent objectives with the identical structural defect. The fix closes one issue; the root cause remains active across the rest of the codebase. The trap is hard to avoid because failures surface one at a time, and the natural response is to fix what's in front of you. The countermeasure is class-level diagnosis: before closing any fix, grep for the same pattern in adjacent code and file bugs for every instance found, not just the one that triggered the investigation.
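The "grep for the same pattern" step can be sketched as a sweep over every objective's source text, so that each match becomes a candidate bug report. The `sweep` helper and the mapping of objective names to text are hypothetical; in practice the inputs would be whatever files hold hints and success functions.

```python
import re
from typing import Dict, List, Tuple


def sweep(sources: Dict[str, str], pattern: str) -> List[Tuple[str, int, str]]:
    """Return (source name, line number, line) for every line matching the pattern."""
    rx = re.compile(pattern)
    hits = []
    for name, text in sorted(sources.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((name, line_no, line.strip()))
    return hits
```

Each returned tuple identifies one instance of the defect, which makes it straightforward to file one issue per instance before the triggering fix is closed.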

Range lock-in

A training failure mode in which an agent learns to solve a specific target or application rather than the underlying vulnerability class. Produced by hints and training data that are too app-specific: the model learns "on DVWA, send this curl command" rather than "for stored XSS, inject into user-controlled input fields." The result is an agent that passes eval against the training target but fails against any variant. The fix is the two-layer rule: every app-specific hint must have a generic companion that teaches the transferable pattern using placeholders. Examined in depth in Range Lock-In (Centaur Security Labs, pending release).

Recurrence cost

The development overhead produced when a structural defect is diagnosed and fixed multiple times independently rather than once at the class level. ARCHER's development record covers 130+ issues diagnosed across six weeks of intensive iteration. An estimated 40–60 of them could have been prevented by earlier class-level recognition, which is 25–38% when measured against the 158 issue instances in the failure mode inventory. Per-class examples: the premature objective-achieved pattern recurred 23 times; the missing verification step pattern recurred 25 times before a systematic audit was proposed. Recurrence cost is invisible in per-issue tracking; it becomes visible only when issues are reviewed as a body.
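The percentage range can be checked with a few lines of arithmetic. This check assumes the denominator is the 158 issue instances cited under Failure class, since that is the total for which 40 and 60 round to 25% and 38%.

```python
TOTAL_INSTANCES = 158         # issue instances cited in the Failure class entry
PREVENTABLE_RANGE = (40, 60)  # estimated preventable issues


def percent_of_total(count: int, total: int = TOTAL_INSTANCES) -> int:
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


low, high = (percent_of_total(n) for n in PREVENTABLE_RANGE)
print(f"{low}-{high}%")  # prints "25-38%"
```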

Class-level remediation

Fixing a structural defect across all instances simultaneously rather than addressing each surfaced failure independently. When a defect is diagnosed in one objective, class-level remediation requires searching adjacent hints and skill packs for the same pattern and committing a coordinated fix set — not just closing the issue that triggered the investigation. The countermeasure to the one-bug-one-fix trap and the primary mechanism for reducing recurrence cost. Class-level remediation requires naming the failure class before fixing any instance.

Hint gap

A missing or incomplete hint block that leaves the model without structured guidance for a technique the task requires. The model attempts the technique using only training knowledge, which may be insufficient for the specific target environment or tool version. Hint gaps are the most common root cause in ARCHER's failure record — the routing was correct, the model had the right intent, but it lacked the specific procedural guidance to complete the objective. Remediated by a Hint Pass: adding or expanding the relevant hint block without modifying the model or the core system.

Wrong-host execution

Issuing a command against a target other than the one specified by the current objective — because a prior session's host context was not cleared from run context, the task string underspecified the target, or the hint referenced an IP that does not match the current session's scope. A wrong-host execution is simultaneously a safety failure (scope was violated) and a verification gate failure (the objective specified a different target). ARCHER's HAZOP WRONG-TARGET guideword specifically guards against hint designs that enable this failure mode.
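A code-layer guard against this failure can compare the addresses in a command with the objective's target before the command is issued. The sketch below is illustrative only: `out_of_scope_hosts` is a hypothetical helper, it handles IPv4 literals and not hostnames, and a real scope check would also need to cover CIDR ranges and DNS names.

```python
import ipaddress
import re
from typing import List

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def out_of_scope_hosts(command: str, objective_target: str) -> List[str]:
    """Return IPv4 literals in the command that differ from the objective's target."""
    target = ipaddress.ip_address(objective_target)
    offenders = []
    for token in IPV4.findall(command):
        try:
            addr = ipaddress.ip_address(token)
        except ValueError:
            continue  # dotted number that is not a valid address, e.g. 999.1.2.3
        if addr != target:
            offenders.append(token)
    return offenders
```

A non-empty result would be treated as a halt condition: the command is logged and refused rather than executed against a host outside the current objective.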

Terms are defined here once. For full technical treatment, follow the cross-references to the research papers and build journal.