The Centaur Framework: A Design Specification for Human-AI Collaboration in Security Operations¶
Status: Technical Report | Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs
References: Full source index →
The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.
The Centaur model — named for the human-chess-engine partnerships that outperformed both humans and engines operating alone — has become a common frame for describing AI-augmented security operations. The frame is useful but underspecified: "human and AI collaborate" describes an aspiration, not an architecture. Without a precise division of responsibilities, systems claimed to implement the Centaur model frequently violate its core principle — that each layer of the partnership does the work it is suited for, and only that work.
This paper proposes a formal three-layer specification for Centaur implementation in security operations: the model layer, the code layer, and the human layer. For each layer, I define the class of work it handles, the invariants it must enforce, and the boundary conditions that distinguish its work from adjacent layers. I document five categories of boundary violation observed in practice and the failure modes each produces. I derive twenty concrete design requirements for compliant Centaur implementations and evaluate ARCHER against those requirements. I argue that the most valuable output of a compliant implementation is not task success — it is Human-Verified Traceability: a finding with a verifiable chain from tool output to confirmed target state to named human sign-off. I document the methodology — derived bottom-up from a catalog of operational failures rather than top-down from first principles — and its limitations, most significantly the self-assessment circularity that results from evaluating a specification against the system that produced it. A reproducibility protocol enables external evaluators to verify each of the twenty requirements from session artifacts without access to the development environment. I conclude with practical recommendations for security teams, transitioning organizations, and tool vendors; with falsifiable claims, two of which are grounded in ARCHER operational evidence; and with the design questions any team should answer before claiming their system is a Centaur implementation.
1. Introduction¶
Kasparov's insight from Advanced Chess was not that computers and humans should cooperate. It was that the division of labor was determinative.[^1] Amateur players with chess engines outperformed grandmasters without them — but teams of grandmasters with disciplined human-engine collaboration outperformed both.[^2] The advantage was not additive. It came from doing the right work in the right part of the partnership.
The chess analogy requires a bounded reading. Modern engines have advanced to the point where human input in pure tactical calculation is frequently noise rather than signal — Stockfish and its successors now operate at levels where human judgment in move selection offers little marginal benefit.[^3] That evolution does not undermine the framework. It clarifies what the analogy was always actually establishing: a structural principle about division of labor, not a claim about enduring human cognitive superiority in any particular domain. In security operations, the human layer's irreplaceable contribution is not superior calculation. It is legal and operational risk acceptance — the acceptance of organizational liability and regulatory accountability that no probabilistic system can provide and no regulatory framework will accept as substituted. That function is non-computable in a way chess moves are not.
Aviation and medicine provide a cleaner analogy for the accountability function specifically. A pilot holds an FAA certificate regardless of how capable the autopilot is.[^4] A physician signs the diagnosis regardless of the AI diagnostic tool's accuracy rate.[^5] Accountability is legally anchored — it does not migrate to the tool as the tool improves. The chess analogy correctly establishes the division of labor principle. The aviation and medicine analogies correctly establish that the human layer's accountability function is permanent: not a bottleneck pending better AI, but a structural feature of how professional and legal accountability is assigned.
The same insight applies to security operations and is equally underimplemented. Most AI security tooling adds an AI layer to an existing workflow without defining what the AI layer is responsible for and what it is not. The result is a system where the AI does what it can — generates text, interprets output, suggests next steps — and the human handles whatever the AI missed. This is collaboration without division of labor. It produces inconsistent quality, inconsistent accountability, and inconsistent audit trails, because neither party has a well-defined role.
A common objection is that this is simply good software engineering — separation of concerns. The objection is partially correct but insufficient. Standard software engineering specifies how to partition technical responsibility; it does not specify that one partition must be permanently occupied by a named human bearing legal and professional accountability. The Centaur Framework adds that: not a software architecture, but a responsibility architecture. The human layer cannot be optimized away — not because humans are the most capable actors in each domain, but because accountability is legally non-delegable. That requirement is not a transitional concession to current AI capability.
The Centaur Framework proposed in this paper makes the division of labor structural. The model layer, the code layer, and the human layer each have defined responsibilities, defined invariants, and defined boundaries. Violations of those boundaries — the model performing code-layer work, the code layer performing human-layer work — are diagnosable architectural defects, not judgment calls.
2. Background and Related Work¶
2.1 The Centaur Concept¶
The term "Centaur" in the context of human-computer collaboration derives from Kasparov's account of Advanced Chess and the 2005 PAL/CSS Freestyle tournament.[^1] The tournament result — amateur players using three computers and disciplined process outperforming grandmaster teams — produced what Kasparov termed the "weak human + machine + better process" formulation: the quality of the human-machine collaboration process mattered more than the capability of either party alone.[^2]
Academic analysis supports and qualifies this finding. Bilalić et al. (2024), analyzing 11.6 million decisions by elite chess players, demonstrate conditions under which human-AI collaboration improves performance over either alone — and identify decision types where human input remains complementary rather than redundant.[^3] Gaessler and Piezunka (2023) examine how AI tools change human skill development and performance longitudinally, finding that engine assistance substitutes for some human collaboration but does not eliminate the value of human judgment in contextually complex positions.[^9]
2.2 Human-AI Teaming in Security Operations¶
Security operations presents a domain with strong structural parallels to chess-engine collaboration: high-volume, high-stakes decisions under uncertainty where AI excels at procedural consistency and humans excel at contextual judgment.
Alert fatigue is a documented and measurable problem in Security Operations Centers (SOCs). Tariq et al. (2025) provide a systematic review of alert fatigue in SOC environments, documenting the cognitive load accumulation patterns that drive analyst error and missed detections.[^14] Chhetri et al. (2024) develop and evaluate a human-AI teaming framework specifically designed to mitigate alert fatigue, demonstrating measurable improvements in analyst decision quality when AI handles triage and humans handle contextual judgment.[^15]
Controlled study evidence for analyst augmentation is emerging. A 2025 CSA benchmark study (n=148) found that AI-assisted analysts completed security investigations 45–61% faster and scored 22–29% higher on accuracy than the manual control group; fatigue-induced completeness degradation was reduced by roughly half in the AI-assisted group.[^16] This study was co-published with a vendor and should be weighted with appropriate caution, but its controlled design makes it more evidentially sound than typical industry white papers.
Human-in-the-loop machine learning in security — where the ML system's decisions remain subject to human review and correction — is an active research area. Kim et al. (2024) apply active learning to cyber intrusion detection, demonstrating that incorporating analyst feedback into the model's training loop improves detection rates while reducing analyst workload compared to static ML approaches.[^10]
LLM-based penetration testing agents represent a closely related development. Deng et al. (2024) evaluate large language models as penetration testing agents across HackTheBox and real-world targets, finding that while models show competence in isolated exploitation steps, they exhibit systematic degradation over multi-step sessions: loss of session context, inability to track which techniques have been attempted, and premature objective-achieved claims that leave the task incomplete.[^21] These failure modes — inconsistent completion signaling and context accumulation errors across turns — are structurally identical to the halt-detection and format-drift failures catalogued in §3. The code-layer controls this framework prescribes (deterministic halt detection, ground-truth verification independent of model output) are architectural responses to this documented class of LLM failure, not defensive additions to an otherwise sound approach.
2.3 Responsibility Attribution in Automated Systems¶
Responsibility frameworks for AI-assisted decisions are now embodied in regulatory instruments. The NIST AI Risk Management Framework (AI RMF 1.0) provides voluntary guidance for managing AI risk across the system lifecycle.[^11] The framework's GOVERN function — the first of four core functions — specifically addresses accountability structures, oversight roles, and organizational risk tolerance, mapping directly to the human layer's non-delegable responsibilities in this specification. The EU AI Act (Regulation 2024/1689) requires that high-risk AI systems be designed for effective human oversight, with Article 14 specifying that providers and deployers must implement oversight measures commensurate with the system's risk and autonomy level.[^12] Article 9 additionally requires a continuous risk management system covering identification, analysis, and mitigation of risks across the AI system's lifecycle — a requirement that maps to the code layer's systematic safety constraint enforcement, session logging, and ground-truth verification functions. IEEE Std 7001-2021 defines measurable, testable transparency levels for autonomous systems across stakeholder categories, providing an engineering standard for the kind of decision-process visibility the Centaur Framework requires of the code layer.[^13]
2.4 Related Operational Frameworks¶
MITRE ATT&CK provides a structured knowledge base of adversary tactics, techniques, and procedures that defines what security analysis tasks look like at the operational level — the basis for defining what belongs in the model layer versus what requires human contextual judgment.[^17] TIBER-EU (ECB, 2018) governs threat intelligence-based red-team testing of financial sector entities, establishing accountability structures for controlled adversary simulation — a domain where the human layer's authorization function is most precisely defined in existing practice.[^18] The Penetration Testing Execution Standard (PTES)[^19] and related methodological frameworks define the procedural baseline that Centaur-compliant tools are intended to execute more consistently than unaugmented analyst teams.
3. The Three-Layer Specification¶
3.1 The Model Layer¶
Responsibility: Generate candidate actions and interpret unstructured input.
The model layer handles work where probabilistic reasoning over a learned distribution produces better results than any deterministic rule. Security operations involve two canonical examples of this work class:
Command generation: Given a task description, current tool output, and session state, what is the next action? The space of possible next actions is too large to enumerate with rules. A model trained on security operations generalizes better than any decision tree.
Output interpretation: Tool output from nmap, msfconsole, Suricata, Zeek, and similar tools is unstructured, varies by version and configuration, and requires contextual interpretation. No static parser covers the full variation space. A model that has learned the structure of tool output generalizes to variations that were not explicitly anticipated.
Invariants the model layer must enforce:
- The model does not route. Routing decisions (which analytical workflow handles this task) have correct answers; they belong in the code layer.
- The model does not halt. Session termination is a deterministic decision; it belongs in the code layer.
- The model does not log. Audit trail maintenance must be external to the probabilistic layer to be trustworthy.
- The model does not authorize. Authorization decisions carry accountability; they belong in the human layer.
Boundary conditions:
The model layer boundary is violated when the system adds compensating logic to handle model output variation: parsers that handle multiple output formats, fallbacks when the model doesn't follow instructions, post-processing that corrects malformed output. Each instance of compensating logic is evidence that the model is performing code-layer work. See companion paper: The Stochastic Trap.[^20]
A note on the distinction between compensating logic and intentional code architecture: these are not the same thing, though they may look similar in implementation. Compensating logic is reactive — the model was assigned a role (produce structured output, signal completion) and failed at it, so the surrounding code was patched to handle the variation. Intentional code architecture is proactive — the code layer was designed from the start to own routing, halt detection, and verification, because those functions belong there. The distinction is not about complexity; both can accumulate maintenance debt. It is about design intent and correctness guarantees. A routing classifier with a test suite and ground-truth labels is maintainable code with measurable accuracy. A regex thicket added to catch the cases where the model failed to follow instructions is compensating logic that will grow until the architectural problem is addressed.
3.2 The Code Layer¶
Responsibility: Enforce accountability where accountability is required.
The code layer handles work with correct answers — decisions that must be made the same way every time, processes that must be auditable and reproducible. In security operations:
Task routing: A task string maps to a specific analytical workflow. The routing decision has a correct answer. The code layer implements an accountable routing mechanism — a trained classifier, rule engine, or human selection — that makes the routing decision reliably and whose accuracy is measurable against a labeled ground-truth set.
On the routing objection and the accountability/reliability distinction
A common objection: if a trained classifier misroutes a task, the outcome is identical to an LLM misrouting it — a misroute is a misroute. This is true, and the framework does not claim per-instance reliability superiority. The distinction is accountability and remediability. When the classifier misroutes, the failure pattern is visible in a confusion matrix; the fix is labeled training examples and an isolated retrain. When the LLM misroutes, the failure is distributed across an unobservable probability space; the fix is revised instructions with no ground-truth baseline.
The operative criterion is SDLC compliance, not probabilistic vs. deterministic implementation: the routing mechanism has frozen weights at deployment, a specific training set, a version history, and a confusion matrix measurable against ground truth. The model agent's routing output has none of these properties.
Halt detection: A session is either complete or it isn't. The code layer enforces a halt condition based on observable signals — command count, success indicators in tool output, time budget — without delegating the decision to the model.
Ground-truth verification (the Ground Truth Gate): A model that claims the objective is achieved has produced a probability-weighted assertion. The code layer verifies that claim against actual target state — probing the target for the specific state change asserted, independent of the model's output. A finding that passes ground-truth verification is traceable to an observable target state. A finding that does not has not been verified and must not be reported as confirmed. This is not a logging function — it is an epistemological gate. The model claims; the code verifies; the session log records both. A finding that completes this sequence — raw tool output captured, target state confirmed by code-layer probe, named human sign-off on the residual — has achieved what this framework terms Human-Verified Traceability, defined formally in §3.4 and the Glossary.
Session logging: The audit trail must accurately represent what happened — what commands ran, what output was returned, what the ground-truth verification confirmed — with no gaps and no modifications. The code layer maintains this record and ensures its integrity.
Safety constraint enforcement: Some commands must not execute without explicit authorization. The code layer enforces these constraints mechanically, without relying on the model to decline.
Invariants the code layer must enforce:
- The code layer does not interpret findings. Interpretation requires judgment against organizational context; that is human-layer work.
- The code layer does not authorize high-impact actions. Authorization requires accountability; that is human-layer work.
- The code layer does not generate the audit narrative. The audit trail is a factual record; the narrative interpretation belongs to the human layer.
Boundary conditions:
The code layer boundary is violated when routing or halt decisions are delegated to the model — when the model is asked "is this task reconnaissance or exploitation?" or "have you achieved the objective?" and its answer drives system behavior. The code layer should make these decisions and present the model with the context that results.
3.3 The Human Layer¶
Responsibility: Provide legal and operational risk acceptance where risk acceptance is required.
The human layer's irreplaceable function is not superior cognition — it is risk acceptance. A model can identify a vulnerability. The code can verify the vulnerability exists in the target. Only a named human can accept the organizational risk of the remediation decision, sign the rules of engagement that authorize the test, and be held accountable by a regulator if the finding is wrong or the action is unauthorized. This is a precise claim: the human layer exists because accountability is legally and operationally non-delegable under current regulatory frameworks — GDPR, NIS2, DORA, and equivalent instruments assign accountability to named professionals, not to automated systems.^6^8 Whether this will remain true as regulatory frameworks evolve is an open question. What is not contingent on regulatory evolution is the epistemological gap: the organizational context that determines what a finding means cannot be encoded in a model regardless of how legal frameworks develop. The human layer would remain necessary for that function even if future law permitted autonomous system accountability.
In security operations, this risk acceptance manifests as:
Scope and ROE authorization: The human defines what systems are in scope, what actions are authorized, and what risk is acceptable before execution begins. This is not a technical decision — it is an organizational and legal decision that creates the authorization boundary the code layer enforces.
Contextual interpretation: A finding that "this system has an unauthenticated RCE vulnerability" means different things to different organizations. The technical finding is the same; its severity, urgency, and remediation priority depend on organizational context the model cannot have.
Authorization of high-impact actions: Actions that cannot be undone require explicit human authorization. The accountability for these actions is non-transferable.
Epistemological gap closure: The human layer closes a gap that risk acceptance framing alone does not cover. A model sees a vulnerability. A human sees a vulnerability on a legacy system scheduled for decommission in 48 hours, in a regulated environment where a finding above a certain severity triggers mandatory 72-hour disclosure. The technical finding is the same; its organizational meaning is not. The human layer is the Final Parity Check between technical truth and organizational reality — the place where a correct finding meets the constraints that determine what to do about it. No model can hold that context. No code can acquire it.
Auditability review: The human reviews the probabilistic residual — the set of model assertions not confirmed by C5 Tier 1 ground-truth verification. The code layer generates this residual as the complement of mechanically confirmed states; the human reviews it. The code layer cannot determine which residual items require judgment — that determination is itself a human-layer function. The code layer's role is to surface the residual clearly, not to pre-filter it. A human who has reviewed 99 verified findings and rubber-stamps the 100th is not performing this function — they are abdicating it.
The human layer requirement applies uniformly across finding types, but the review burden must be proportional to consequence. A finding that a port is open on an internal development server and a finding of unauthenticated RCE in a public-facing payment processor carry different organizational risks. H4 must be calibrated to consequence, or it will either over-burden the human layer with low-stakes findings or under-resource review of high-stakes ones.
Invariants the human layer must enforce:
- The human layer does not perform mechanical, high-volume work that the code or model layer can perform reliably. Cognitive capacity is the human layer's scarce resource; it should be spent on judgment, not execution.
- The human layer does not skip QA on model output. Accountability is non-delegable; accepting model output without review is accepting accountability for the model's failure modes.
Boundary conditions:
The human layer boundary is violated in two directions. The first: the human performs mechanical work that belongs in the model or code layer — manually parsing tool output, tracking session state, applying known-good methodology mechanically. This wastes cognitive capacity on work where the human provides no comparative advantage.
The second, more dangerous violation: the human delegates accountability to the model — accepts a model finding without review, authorizes an action based on a model recommendation without evaluating the recommendation's basis. The accountability for a finding or action cannot be delegated to a probabilistic system.
3.4 The Compliant Session: Normal-Case Flow¶
Before examining how boundaries are violated (Section 4), it helps to trace how they function correctly. A compliant Centaur session follows this sequence:
1. Human authorization (H1). A named human professional documents the scope, target, and rules of engagement before execution begins. The code layer enforces this as the authorization boundary for all subsequent actions.
2. Code-layer routing (M1/C1). The code layer routes the task to the appropriate analytical workflow using an accountable mechanism — a trained classifier, rule engine, or human selection. The model does not participate in this decision and is not informed of the alternatives it was not assigned.
3. Model-layer execution. The model generates candidate actions, interprets tool output, and chains toward task completion within the context the code layer established.
4. Code-layer halt and verification (M2/C2/C5). When the model signals completion, the code layer does not accept the signal immediately. It checks observable signals — command count, success indicators — and for objectives with binary success conditions probes the target directly for the asserted state change. A probe failure continues the session. A probe success marks the finding Tier 1 confirmed.
5. Residual generation (C6). The code layer generates the probabilistic residual: every model assertion not confirmed by a Tier 1 probe. This bounded set is the human layer's review input — not the full session transcript.
6. Named human review (H2/H4/H5). A named human professional reviews the residual. Tier 1 items require no re-evaluation. The reviewer's attention goes to Tier 2 items where judgment is required. The reviewer's identity is logged alongside the finding.
7. Human-Verified Traceability. A finding that completes this sequence has a verifiable chain from raw tool output (C3) to confirmed target state (C5) to named human sign-off (H5). This property — Human-Verified Traceability — is the primary output a compliant implementation is designed to produce. It is what makes a finding actionable for remediation and defensible under regulatory review. The Glossary entry defines it formally.
This flow is interrupted at a predictable point whenever a boundary violation occurs. Section 4 describes which violation interrupts the flow at which step, and what the downstream consequences are.
3.5 Information Topology¶
The three-layer specification assigns work to layers. But it also implies a corresponding assignment of information: each layer can only observe what its function requires, and should be structurally prevented from observing more.
This is not merely an implementation convenience — it is a security property. A model layer that can observe routing scores, halt conditions, or verification results can learn to produce outputs that pass those checks without performing the underlying work. A code layer that accumulates organizational context to assist with interpretation is acquiring information that belongs in the human layer. An information topology violation is often the earliest observable signal of a responsibility boundary violation.
The model layer is information-narrow by design. It receives only the context the code layer constructs for it: the task description, skill-appropriate guidance, and accumulated tool output from previous turns. It does not observe routing decisions, classifier confidence scores, halt condition state, verification results, or training gate outcomes. The model's input is a deliberate reduction of the full system state — sufficient for command generation and output interpretation, and no more. This narrowness is what makes the Ground Truth Gate possible: the model cannot observe or influence the verification step.
The code layer is information-privileged. It observes both the model's outputs and the target environment's actual state simultaneously. This dual visibility is the structural basis for ground-truth verification — the code layer can compare what the model asserted against what the target actually shows. No other layer holds both. The code layer also observes session-level signals invisible to the model: command counts, timing, violation events, and quality criteria that determine whether a session enters the training pipeline.
The human layer receives a bounded residual. The code layer's pre-processing function is not merely audit hygiene — it is what makes human-layer review tractable. A human reviewing full session transcripts turn-by-turn at scale is performing code-layer work. A human reviewing a bounded residual of unverified assertions is performing human-layer work. The distinction is not about volume; it is about whether the code layer has done its job of separating the mechanically confirmable from the items that require judgment.
The practical diagnostic value of the information topology: if a decision requires information the deciding layer structurally does not have, the decision is misassigned. Routing requires full skill configuration — that information belongs in the code layer, not the model layer. Finding interpretation requires organizational context — that belongs in the human layer, not the code layer. When compensating logic accumulates around a decision point (parsers handling model output variation, fallbacks for non-compliant signals), it is evidence that the model is being asked to produce something the code layer should own, and the model's information access was insufficient to do it reliably.
The five boundary violation classes in Section 4 each have a corresponding information topology failure: the wrong information was made available to the wrong layer, or a decision was made using information the deciding layer was not designed to hold.
The learning loop as extended topology
The information topology described here governs what each layer observes within a session. A complementary topology governs what each layer generates from a session — and how that generated information flows across sessions to improve future ones. That inter-session architecture — the learning loop — is developed in the companion paper: The Learning Loop: Knowledge Architecture for Self-Improving Human-AI Security Systems.
4. Boundary Violation Taxonomy¶
4.1 Model in Code Role (Stochastic Routing)¶
The most common boundary violation: using the model to make routing decisions that have correct answers. Symptoms: routing accuracy is variable across task phrasings for the same underlying intent; different model versions produce different routing behaviors; adding more routing instructions doesn't converge to reliable routing.
Failure mode: sessions routed to the wrong skill domain produce wrong tooling, wrong hints, wrong halt criteria. The session may complete without error while producing findings for the wrong task.
Reference: ARCHER V1 keyword router + LLM tie-breaking vs. V2 trained classifier.
4.2 Model in Halt Role (Premature or Missed Termination)¶
Using the model's completion signal as the authoritative halt condition. Symptoms: sessions that end with model-claimed success but no actual objective completion; sessions that continue past the point of completion because the model didn't emit the halt token.
Failure mode: false positive sessions enter training data (model learns to claim success prematurely); missed halts produce sessions too long to review and accumulate context debt.
Reference: ARCHER V1 [OBJECTIVE_ACHIEVED] token vs. V2 verify_fn ground-truth gate.
4.3 Code in Human Role (Automated Authorization)¶
The code layer executes high-impact actions without human authorization, typically by interpreting a broad task description as implicitly authorizing all necessary actions. Symptoms: the agent takes actions outside the intended scope of the original task; irreversible actions execute without explicit human sign-off.
Failure mode: unauthorized system changes, out-of-scope actions, accountability gap if something goes wrong.
Reference: ARCHER explicit scope configuration; human authorization required for high-impact commands.
4.4 Human in Model Role (Manual Execution)¶
The human performs mechanical work that belongs in the model or code layer — parsing output, tracking session state, maintaining checklists of required steps. Symptoms: analysts spend large fractions of engagement time on procedural tasks; methodology consistency degrades on long engagements due to cognitive load.
Failure mode: inconsistent methodology execution; analyst cognitive capacity consumed by mechanical work; senior analyst skills wasted on tasks a model performs reliably.
Reference: The Centaur model's primary measurable benefit: procedural consistency without analyst attention cost.
4.5 Silent Competence (Agent Withholds the Correct Answer)¶
An agent that identifies a fix but lacks authorization for it, and does not flag the conflict. Symptoms: the agent routes around the blocked path and presents alternatives as its output; the user sees a dead end while the agent saw the exit; the actual fix is delayed by sessions that adjusted the wrong variables.
This is the most insidious boundary violation because it looks like correct behavior. The agent is following its role constraints. But the role constraints have produced a worse outcome than if the agent had flagged the conflict immediately.
Failure mode: accumulated latency before the correct fix is identified and authorized; workarounds that compound; user interprets slow progress as a hard problem rather than a role gap.
The following case from ARCHER development illustrates the pattern. An evaluation objective was failing across multiple sessions. A Coder instance investigating the failures identified the root cause: a vulnerable binary on the target lab VM had been silently replaced by a system package update — the attack approach was correct, but the target had changed. The fix was straightforward: restore the original binary. The Coder instance, operating under a role constraint that restricted lab VM modifications without explicit authorization, classified the action as outside its current scope and instead iterated on attack chain variables. Three subsequent sessions adjusted hint text, timeout parameters, and tool flags before the authorization question was surfaced explicitly and the binary restored in a single action. Post-mortem review of the open issue backlog identified four additional issues that had been similarly misclassified — each had an identifiable fix that had been suppressed without disclosure, routed around rather than escalated.
Remediation for silent competence:
The fix is structural, not behavioral. Role documents must: 1. Define a pre-authorized class of actions that the agent can take without asking (reducing the frequency of blocked paths) 2. Define an explicit escalation trigger for the remaining blocked paths ("I believe the fix requires X. This appears outside my current authorization. Should I proceed?") 3. Require the agent to surface withheld actions at session end ("Withheld actions: any fix the Coder identified but did not execute due to role constraints")
A structural gap remains: the remediations above operate at the behavioral layer — they ask the model to report what it withheld. Code-layer monitoring of silent competence (detecting withheld paths through observable behavioral signals such as routing dead ends, session-end disclosures that reference blocked actions, or command sequences consistent with workaround behavior) is a research direction not yet implemented in ARCHER. Until code-layer detection is available, silent competence incidents are recoverable only through human review of the probabilistic residual (C6/H4), which means the exposure window is the duration of a session rather than the duration of a command.
5. Design Requirements for Centaur-Compliant Systems¶
A system that claims to implement the Centaur model should be evaluable against these requirements. Systems that satisfy them are Centaur implementations. Systems that do not are AI-augmented tools that may or may not achieve Centaur-level performance.
A system's compliance posture is determined by its distribution across the three status categories. The framework treats the twenty requirements as a threshold rather than a score: a system with zero Not Met requirements has made a genuine architectural commitment to each one, even where implementation is incomplete. A single Not Met in the M or C group means the system cannot make a reliable audit trail claim; Not Met on any H requirement means it cannot make a named-accountability claim. The Partial status acknowledges that compliance is a process — but Partial requires that the requirement's architectural intent is present and the implementation gap has a defined path to closure. A system that says "we are working toward M1" without a classifier in development does not qualify as Partial on M1; it is Not Met.
These requirements define an architectural standard, not a performance threshold. No quantitative compliance bar — routing accuracy ≥ X%, halt false-negative rate ≤ Y% — is specified here, because the specification was derived from a single implementation and generalizing to a performance floor would overfit to one system's current capability. ARCHER's current evaluation performance (94% objective completion rate across twenty-seven penetration testing objectives) demonstrates that the architecture is achievable at high rates; it is not a required floor for compliance. Future work, including independent evaluation of additional Centaur implementations, will determine what performance ranges are characteristic of compliant systems.
Scope
This framework addresses interactive human-in-the-loop systems — architectures where a named human is actively engaged in or available to authorize each session. Autonomous agentic systems, where no human is present during execution and authorization is granted in advance for open-ended task completion, introduce additional boundary conditions — particularly in the human layer — that this specification does not yet fully address. Agentic contexts are a planned extension of this framework; see Section 9.
Model Layer Requirements¶
M1. Task routing is not performed by the model layer. The routing decision uses an accountable mechanism (trained classifier, rule engine, or human selection) whose accuracy is measurable against ground truth.
M2. Session termination is not determined by the model layer's completion signal alone. A code-layer ground-truth verification confirms the signal before termination.
M3. Session logging is external to the model layer. Logs capture raw tool output and ground-truth verification results, not model summaries. Note: this requirement distinguishes log maintenance (code-layer) from disclosure production (model-layer). Session-end disclosure — the agent's report of withheld actions, deferred items, and escalation triggers — is legitimate model-layer output. M3 is not violated by asking the model to produce this output; it is violated if that output is treated as the authoritative audit record rather than being captured by the code-layer logging mechanism.
M4. High-impact action authorization is not performed by the model layer.
Code Layer Requirements¶
C1. Task routing is implemented in code with an accountable routing mechanism. "Accountable" means: separately trained or specified, accuracy measurable against a labeled ground-truth set, and improvable independently of the main model agent. A trained classifier meets this bar. A model agent asked to classify its own task does not. SDLC compliance is the operative criterion: the routing mechanism has a frozen weights file at deployment, a specific training set, a version history, and a confusion matrix that can be measured against ground truth. A model agent routing its own tasks has none of these properties — its routing output is statistically distributed with no ground-truth tether.
C2. Halt detection is implemented in code based on observable signals (command count, verified success indicators). The model's completion signal is one input, not the authoritative determination.
C3. Session logs are complete (no commands or outputs omitted), timestamped at the event level, integrity-protected against modification, and include the results of any ground-truth verification steps.
C4. Safety constraints (out-of-scope targets, prohibited command classes) are enforced in code, not by instruction to the model.
C5. Ground-truth verification (the Ground Truth Gate) is implemented as a code-layer function external to the model. After any model-claimed success, the system probes the target for the specific state change asserted — independent of the model's output. The session continues if the state change is not confirmed. A finding rests on the verified state, not on the model's assertion. This is not a logging function — it is an epistemological gate.
The Gate applies across two tiers of objectives:
Tier 1 (Binary — mandatory code-layer verification): Objectives with unambiguous, observable success conditions — a root shell confirmed by uid=0 in tool output, a file that is present or absent, a port that is open or closed, a credential that authenticates or does not. For these, a deterministic probe of the target state is both feasible and required. The model claims; the code verifies.
Tier 2 (Interpretive — human-layer verification): Complex logic flaws, business logic vulnerabilities, and objectives where success requires contextual judgment. For these, writing a deterministic code-layer verifier is often as hard as the original task — and using a second AI model to generate the verifier reintroduces the Stochastic Trap.[^20] Tier 2 verification belongs in the human layer by definition.
The goal of C5 is not to verify everything. It is to shrink the human auditability residual (C6) to only what genuinely requires a human brain. ARCHER's implementation executes Tier 1 checks locally via docker exec against the archer-kali evaluation container — no external API calls, no probabilistic intermediary between the claimed state and the confirmed state.
C6. The system generates a probabilistic residual summary for human review. The residual is defined as the complement of C5 Tier 1 confirmations: everything the model asserted that was not mechanically verified. This includes Tier 2 findings (where mechanical verification is not feasible), model-generated severity ratings and remediation recommendations, and any inference the model made beyond the raw tool output it was given. The code layer generates the full residual automatically — it does not pre-filter or prioritize. Pre-filtering would require the code layer to judge which items need human attention, which is itself a human-layer function. The code layer surfaces everything not mechanically confirmed; the human layer determines what requires judgment and acts on it.
Human Layer Requirements¶
H1. Scope, target, and rules of engagement are documented before execution begins and enforced by the code layer as a constraint. This documentation constitutes the legal authorization boundary.
H2. Findings that drive remediation decisions are reviewed by a named human professional before they are reported. Review is of the probabilistic residual (per C6), not necessarily the full session output. The named professional accepts organizational and regulatory accountability for the finding.
H3. High-impact and irreversible actions require explicit human authorization. The system must not execute them autonomously.
H4. The human auditability function is defined, required, and scoped. Not "review the output if you want to" — a mandatory review of the probabilistic residual with defined criteria for what constitutes adequate review. The function must account for QA fatigue: a human reviewing large volumes of verified findings at uniform cognitive load will eventually rubber-stamp. The review scope must be constrained to the cases that require judgment.
H5. Accountability for findings is assigned to a named human professional, not to the system. The session log must record who reviewed and authorized each finding.
H6. When human-layer authorization is required but the authorized human is unavailable, the system must halt or queue pending tasks. It must not interpret human unavailability as implicit authorization, and must not default to autonomous action. Authorization taken in the absence of required human sign-off is a boundary violation regardless of outcome.
Collaboration Requirements¶
X1. Role boundaries are documented and the system enforces them mechanically, not by convention.
X2. Boundary violations are detectable. The system provides signals (compensating logic accumulation, model-in-halt-role detection) that indicate when layers are out of bounds.
X3. Silent competence is structurally prevented. Roles have escalation triggers, pre-authorized action classes, and session-end disclosure requirements. These remediations operate within the probabilistic layer's constraints: they increase the surface area of disclosure without converting a probabilistic failure mode into a deterministic one. An escalation trigger asks the model to recognize and surface a blocked path — this is a structural improvement over no trigger, not a guarantee of complete disclosure. The session-end disclosure requirement creates a forcing function; it does not guarantee all withheld actions are surfaced. The goal is to reduce the frequency and duration of silent competence incidents, not to eliminate the underlying probabilistic constraint. The residual is auditable via the human layer (H4).
X4. The division of labor is auditable. A post-session review can determine which decisions were made in which layer, and which findings rest on ground-truth verification versus model assertion.
6. Evaluation Against ARCHER¶
The evaluation below applies the twenty requirements in Section 5 to ARCHER. A methodological limitation must be stated directly: this specification was derived in part from operational experience building ARCHER. A framework derived from a system's design and aspirations will describe that system more favorably than one derived independently — the requirement set reflects what ARCHER was trying to do, which means ARCHER's partial compliance is not independent evidence of the framework's rigor. This evaluation is self-assessed and should be weighted accordingly. Independent evaluation — applying the framework to a system it was not designed around — would be substantially stronger evidence, and is planned.
With that caveat stated: ARCHER shows significant partial compliance, and the gaps are genuine. The table should be read as a roadmap: these are the requirements that remain unmet in even the most intentionally designed tools, and closing them is the work ahead.
Last updated: 2026-05-12. Overall standing: 17 Met / 3 Partial / 0 Not Met.
| Requirement | ARCHER Status | Evidence |
|---|---|---|
| M1 — No model routing | Met | V2 TF-IDF+LR classifier deployed as primary routing mechanism; label gate cleared (≥50/skill, all 15 skills, 2026-05-12). Frozen weights, training set SHA-256 hash, per-skill confusion matrix, auto-versioned metadata produced by train_classifier.py. Routing decisions carry classifier_version from session_start log. V1 keyword scorer retained as fallback only. Phase 4 Auditor-verified 2026-05-10 (27/27 PASS, testenv/eval_results/20260510_220009.csv). MOP-2 GREEN: 100% classifier confidence ≥ 0.70 across n=65 production tasks. |
| M2 — No model-only halt | Met | Three-layer Ground Truth Gate in production loop; false success claims suppressed and session continues (#198, Auditor-verified 2026-05-09). |
| M3 — External logging | Met | ~/.archer_sessions/ captures raw tool output and verify_result field; SHA-256 sidecar integrity on all exit paths (#201, verified); probabilistic residual generated as .residual.json sidecar on all five exit paths (#202, Auditor-verified 2026-05-09); decision_layer annotation on all ft.jsonl event types (#226, cf91ebf, Phase 5). |
| M4 — No model authorization | Met | High-impact commands require human confirmation; signal.alarm() timeout halt on unavailability (#204, Auditor-verified 2026-05-09). |
| C1 — Accountable routing | Met | V2 classifier trained and deployed; routing decisions carry classifier_version attribution in session_start log; training_set_hash and confusion_matrix present in ~/.archer_classifier/metadata.json (#205, 9eb882a). Meets SDLC compliance bar: frozen weights at deployment, specific training set, version history, measurable confusion matrix. Phase 4 Auditor-verified 2026-05-10 (27/27 PASS). |
| C2 — Code-layer halt | Met | should_halt_objective() pre-checks authoritative in production; Ground Truth Gate suppresses false OA claims and continues session (#198, Auditor-verified 2026-05-09). |
| C3 — Complete, timestamped logs | Met | Session logs complete with verify_result on all exit paths; SHA-256 sidecar tamper-evidence; scripts/verify_logs.py integrity sweep (#198, #201, Auditor-verified 2026-05-09). |
| C4 — Code-enforced safety | Met | exec_target scope enforcement; --prep-sudo safety gate; --authorized-by required at CLI invocation (#204). |
| C5 — Ground Truth Gate | Met | Inline signal check + active target probe + --verify-fn external script; tier1_failed suppresses OA and continues session; all five exit paths produce SHA-256 verified sidecars (#198, Auditor-verified 2026-05-09). |
| C6 — Auditability residual output | Met | .residual.json sidecar on all five exit paths; machine-parseable and human-readable; SHA-256 integrity verified; review_status null-initialized for Auditor sign-off (#202, Auditor-verified 2026-05-09). |
| H1 — Pre-execution scope / ROE | Met | Target IP and scope set at CLI invocation; --authorized-by required flag documents the authorizing human (#204). |
| H2 — Named human review of residual | Partial | Residual generated as bounded first-class output and Auditor review explicitly scoped to it (#202/#203, verified). Named reviewer sign-off field present but not yet mechanically required — review_status.reviewer is null-initialized, populated by convention not enforcement. |
| H3 — High-impact authorization | Partial | Explicit confirmation prompts implemented; timeout_halt on unavailability (#204). Not all high-impact action classes are formally enumerated — scope is operationally defined, not exhaustively specified in code. |
| H4 — Scoped auditability function | Partial | Auditor role scoped to probabilistic residual (#202/#203); residual is bounded and machine-parseable — reviewer receives a defined input, not the full transcript. Mandatory review criteria and sign-off record not yet mechanically enforced. |
| H5 — Named accountability in log | Met | Required --authorized-by <name> CLI flag; value written to session_start log event, immutable after creation; eval harness passes --authorized-by eval-harness (#204, Auditor-verified 2026-05-09). |
| H6 — Halt on human unavailability | Met | signal.alarm() timeout wraps high-impact confirmation prompt; timeout_halt exit path distinct from HALT_DISCIPLINE; halt event logged with pending_action, timeout_seconds, outcome; --human-timeout N configurable (#204, Auditor-verified 2026-05-09). |
| X1 — Documented, enforced roles | Met | Role boundaries documented in CLAUDE.md and docs/roles/. Mechanical enforcement via boundary_violation event type at 5 detectable signals: verify_fn_skipped, classifier_bypassed, halt_below_floor, auth_without_authby, false_success_claim. Daily violation count surfaced in archer-status (#224, c37d654, Phase 5). Operational as of 2026-05-12: 39 violations detected in 24h (halt_below_floor=20, false_success_claim=13, classifier_bypassed=4, verify_fn_skipped=2) — monitoring producing actionable signal. |
| X2 — Boundary violations detectable | Met | Structured boundary_violation events emitted inline at all 5 signal points; automatically logged without manual inspection. archer-status surfaces daily count (#224, c37d654, Phase 5). Violation distribution is diagnostically meaningful: dominant type (halt_below_floor) correlates with the RED MOP-3 skill (web_exploitation), confirming that X2 detection is sufficient to identify the responsible code-layer enforcement gap without manual session review. |
| X3 — Silent competence prevented | Met | withheld_actions required field on all session-end exits; populated via model-emitted [WITHHELD] token and code-detected safety constraint blocks; absence is a logged boundary violation (#225, 2774292, Phase 5). Scope-stall rule in CLAUDE.md and docs/roles/ provides the operational layer. |
| X4 — Division of labor auditable | Met | decision_layer annotation (model/code/human) on all ft.jsonl event types; layer-of-origin explicit in session log without requiring architectural inference (#226, cf91ebf, Phase 5). |
This evaluation covers ARCHER only. Applying the framework to a second implementation — one not derived from the same operational experience — is planned and would produce substantially stronger evidence that the requirements generalize rather than describe ARCHER specifically. Until that evaluation is complete, the compliance scores in this table should be read as a design roadmap, not as validated performance data.
7. Methodology¶
The framework was built bottom-up from operational failure, not top-down from first principles. This section documents how each component — the three-layer specification, the boundary violation taxonomy, and the twenty design requirements — was derived, and states the limitations that follow from the derivation approach.
7.1 Derivation of the Three-Layer Specification¶
The three-layer specification emerged from a classification problem encountered during ARCHER development: when a session produced an unexpected outcome, the proximate cause was almost always that one layer of the system was doing work that belonged in a different layer. The model was making routing decisions. The code layer was accepting the model's completion claim without verification. The human layer was approving findings it had not actually reviewed.
The classification criterion — which layer should own this work? — required a prior answer to a more basic question: what distinguishes work types from each other? Three properties proved useful:
Correctness structure. Some work has correct answers that are measurable against ground truth: routing a recon task to the recon skill category is either correct or incorrect, and the correctness is verifiable with a labeled dataset. This work belongs in the code layer, where correctness is measurable and failures are diagnosable. Other work has no correct answer independent of context: interpreting whether a particular scan result constitutes a finding relevant to this target in this environment requires judgment. This work belongs in the model layer, where probabilistic generalization over a learned distribution produces better results than any rule.
Accountability structure. Some decisions carry legal and professional accountability that cannot be delegated to a system, regardless of that system's accuracy. Authorization for a high-impact action, sign-off on a finding that drives remediation — these decisions bind a named professional who accepts organizational and regulatory liability. This work belongs in the human layer, not because humans are more accurate but because accountability is legally non-delegable. The aviation and medical analogies in Section 1 establish this: the FAA certificate holder and the signing physician do not become optional as autopilot and diagnostic AI improve.
Failure signature. The clearest diagnostic for misassigned work is where compensating logic accumulates. When the model is asked to produce structured output and the surrounding code grows to handle its output variation, the model is in a code role. When the human is asked to "review the output" without a bounded, defined residual to review, the human layer is performing work the code layer should have pre-processed. Compensating logic is the observable signal of a misassigned responsibility.
These three properties produced the layer assignment: probabilistic work with no ground-truth tether to the model layer; deterministic work with measurable correctness to the code layer; accountability decisions to the human layer. The layer invariants in Section 3 are the formal statement of this assignment. The boundary conditions are the observable symptoms of violation.
7.2 Construction of the Boundary Violation Taxonomy¶
The five violation categories in Section 4 were derived from a bottom-up catalog of ARCHER session failures. The derivation procedure was:
- Collect instances where a session produced an outcome worse than the system's capability warranted — false success claims, premature halts, findings that could not be acted on, fixes identified but not surfaced.
- For each instance, identify which layer was performing work that did not belong to it.
- Classify by the direction of the violation: model in code role, model in halt role, code in human role, human in model role.
The fifth category — silent competence — was identified through a different path. It did not appear as a session outcome failure in the conventional sense; the session completed and the agent followed its role constraints. The failure was visible only through retrospective analysis: comparing what the agent surfaced against what it had identified but not acted on. Silent competence is structurally distinct from the other four categories because it produces no immediate observable signal and because the agent's role-compliant behavior is part of the failure mechanism, not a mitigation of it.
Two additional violation types were considered and excluded: human layer over-reliance (the human performs work the model can do reliably, reducing throughput without improving quality) and code layer over-reach (safety constraints are too broad, blocking legitimate work). Both are real phenomena. Both were excluded from the taxonomy because their failure modes are domain-specific and difficult to specify without overfitting to ARCHER's particular operational context. The five retained categories are those whose failure signatures are observable and whose remediation is architectural rather than operational.
7.3 Derivation of the Twenty Requirements¶
The twenty requirements in Section 5 were derived from the violation taxonomy through a one-to-one mapping: each requirement is a design property that, if met, makes a specific violation class detectable, preventable, or recoverable. The derivation was:
- For each violation category, ask: what architectural property would prevent this violation from occurring silently?
- State that property as a testable requirement — one that a system either satisfies or does not, with observable evidence.
- Verify that the requirement is achievable: there must exist at least one implementation that satisfies it.
The requirements are minimal in the following sense: no requirement is included without a corresponding violation class it closes. Requirements that represent best practices without a specific violation class relationship (comprehensive documentation, defense-in-depth) are not included. The framework makes no claim that the twenty requirements are complete — there may be violation classes and corresponding requirements not yet observed. The current set represents requirements for which implementation evidence exists.
A note on specification gaming: the requirements are stated as architectural properties, not performance thresholds. A system that technically satisfies a requirement while defeating its intent — a trivially simple keyword router trained on three task types to satisfy M1, a single-sentence scope statement to satisfy H1 — meets the letter but violates the spirit. The practical diagnostic: a system that can answer the Four Questions in the Appendix plainly, tracing each answer to verifiable artifacts, is unlikely to be gaming the requirements. A system where the answers require hedging or redirect to documentation rather than observable system behavior almost certainly is. Requirement gaming is an inherent risk in any compliance framework; the mitigations available here are the artifact-citation standard in §7.4 and the external evaluator protocol in §8.
The four-letter grouping (M, C, H, X) was added after derivation as a readability aid, not as a structural claim. The collaboration requirements (X1–X4) address the boundaries between layers rather than the internal function of any single layer; they were separated for that reason.
7.4 Evaluation Methodology¶
The Section 6 evaluation applies the twenty requirements to ARCHER using three evidence categories:
Met: The requirement is implemented and the implementation has been independently verified. "Independently" means: verified by the Auditor instance — a separate Claude Code session with read-only access to source code and direct access to eval harness output — through a combination of code inspection, eval harness runs against live targets, and session log review. The Auditor did not implement the features it verified. Met status requires both implementation and Auditor confirmation.
Partial: The requirement is implemented but not mechanically enforced, or mechanically enforced but not yet verified. The most common partial case: the requirement is met procedurally (documented in operational guidance and followed by convention) but not enforced by code — so a deviation would not produce an automatic error. Partial status is also used when the implementation is committed and passing its own tests but has not yet been exercised in a full production eval run.
Not Met: The requirement is not implemented. No instance of Not Met exists in the current ARCHER evaluation; all gaps are Partial rather than absent. This is a consequence of the derivation approach — the requirements reflect ARCHER's design intent, which means ARCHER was at minimum attempting to satisfy each requirement before the requirement was formalized.
Auditor verification for each Met requirement includes a specific artifact: a CSV result file with pass rates before and after the feature was committed, a count of session logs with the required field present, or a git show confirming the implementation is in the commit referenced. "Verified" without a cited artifact is not sufficient for Met status.
7.5 Limitations¶
Self-assessment. The framework specification was derived from operational experience building ARCHER. A framework derived from a system's design and aspirations will describe that system more favorably than one derived independently. The twenty requirements reflect what ARCHER was trying to achieve; the evaluation reflects how well it achieved those goals. The circularity is real and the compliance scores should be weighted accordingly.
This limitation does not invalidate the framework. The boundary violation taxonomy and the three-layer specification describe failure modes that are observable in any AI-augmented security tool, not only ARCHER. But the requirement formulation — specifically, which properties are deemed sufficient for Met status — reflects one implementation team's judgment about what constitutes adequate evidence. Independent evaluation against the framework by teams who did not derive it would produce harder evidence that the requirements are generalizable rather than ARCHER-specific.
Single implementation. The framework is grounded in one implemented system. No quantitative compliance bar is specified (routing accuracy ≥ X%, halt false-negative rate ≤ Y%) because generalizing a performance floor from a single implementation would overfit to ARCHER's current capability. The requirement set is architectural: it specifies structural properties, not performance thresholds. Future work, including evaluation of additional Centaur implementations, will determine what performance ranges are characteristic of compliant systems.
Auditor independence. The Auditor role in ARCHER is implemented as a Claude Code instance — the same model family as the Coder and Scribe instances. The Auditor has read-only access to source code and does not implement the features it verifies, which provides structural separation. It is not independence in the sense that an external third-party auditor is independent. A reviewer who wanted to increase confidence in the Met evaluations should treat the Auditor verification as internal QA — necessary but not sufficient for independent validation — and apply additional scrutiny to requirements where the Auditor's verification was by code review rather than live execution.
Collaboration layer enforcement. Requirements X1–X4 address the boundaries between layers rather than the internal function of any single layer. The reference implementation satisfies them through two complementary mechanisms: procedural constraints (role boundaries documented in CLAUDE.md and docs/roles/, enforced by convention) and mechanical detection (structured boundary_violation event logging at 5 signal points, required withheld_actions field on all session-end exits, decision_layer annotation on all ft.jsonl events — Phase 5, commits c37d654/#224, 2774292/#225, cf91ebf/#226). The residual limitation is scope, not mechanism: the five detectable signal types cover the most operationally significant violations. Violations outside those five are not automatically caught. Detection is not deterministic prevention — a sufficiently determined agent could violate role constraints without triggering a logged event — but the framework does not claim deterministic prevention is achievable in a probabilistic multi-agent system. The requirement is that violations are detectable and the evidence is in the log. That criterion is now met.
A third-party evaluator configuration guide — minimum working setup to verify all twenty requirements without access to the full development environment — is provided in Section 8.
8. Reproducibility¶
The architectural claims in this paper are verifiable by inspection of the open-source ARCHER repository at github.com/jayhawkins108/ARCHER. This section provides the minimum working setup for an external evaluator to produce session artifacts and inspect them against each of the twenty requirements.
8.1 Minimum Setup¶
The following components are required to run a verifiable ARCHER session:
| Component | Version | Purpose |
|---|---|---|
| ARCHER source | main branch |
Agent code, eval harness, skill packs |
| Ollama | ≥ 0.3 | Local model inference host |
| qwen3:14b | current | Default inference model |
| Docker | ≥ 24 | archer-kali container runtime |
| Python | ≥ 3.11 | ARCHER runtime and scripts |
| Metasploitable2 | standard release | Eval target (baseline eval objectives) |
A network-isolated evaluation target (Metasploitable2, DVWA, or equivalent) is required to verify ground-truth gate behavior. The gate checks make live probes of target state; they cannot be verified against a mock target.
ARCHER's ground-truth verification (verify_fn) is structurally similar to the "solver functions" used in HackSynth (Tihanyi et al., 2024) — both programmatically verify whether an objective was achieved rather than trusting the agent's self-report. The distinction is the evidence standard: HackSynth's solver functions check for flag capture (binary); ARCHER's verify_fn checks that the success indicator is present in actual tool output, not in model-generated text. A session where the model asserts "I have obtained a root shell" passes a flag-capture check if the flag appears anywhere in the transcript; ARCHER's gate additionally verifies the flag appears in a command_executed output event, not in a model turn. This distinction is load-bearing for Requirement M2 (no model-only halt) and for the training data quality gate — false-positive halt signals, where the model claims success without tool-output evidence, are the primary contamination class in the V2 fine-tuning pipeline.
# Clone and set up
git clone https://github.com/jayhawkins108/ARCHER
cd ARCHER
python3 -m venv archer_env && source archer_env/bin/activate
pip install -r requirements.txt
# Pull model
ollama pull qwen3:14b
# Build eval container
./docker/run.sh
# Confirm container is up
docker inspect archer-kali --format '{{.State.Status}}'
8.2 Producing a Verifiable Session¶
All twenty requirements can be inspected from the artifacts produced by a single eval harness run. The recommended verification run uses one objective, three passes, with session logging enabled:
This produces, per session, three files in ~/.archer_sessions/:
- <timestamp>_<pid>_<task>.log — full session transcript
- <timestamp>_<pid>_<task>.log.sha256 — SHA-256 integrity sidecar
- <timestamp>_<pid>_<task>.residual.json — probabilistic residual sidecar
The eval harness also appends one entry to ~/.archer_routing_log.jsonl per run.
8.3 Verification Protocol by Requirement Group¶
Model Layer (M1–M4)¶
M1 — No model routing. Inspect ~/.archer_routing_log.jsonl. Each entry includes a classifier_used boolean, confidence score, and score_gap field. Entries where classifier_used: true and confidence ≥ 0.5 confirm code-layer routing. A session where all routing decisions have classifier_used: true confirms M1 Met status. Also inspect ~/.archer_classifier/metadata.json for classifier_version and training_set_hash.
# Inspect last 5 routing decisions
tail -5 ~/.archer_routing_log.jsonl | python3 -m json.tool | grep -E 'classifier_used|score_gap|selected_skill'
M2 — No model-only halt. Run the eval harness with --audit on an objective where the model historically claims success prematurely (PT-EXPLOIT-01 recommended). Inspect the session log for verify_result: tier1_failed entries — these confirm the Ground Truth Gate fired and suppressed an OA claim before it terminated the session.
python3 testenv/eval_harness.py --runs 1 --objectives PT-EXPLOIT-01 --audit
# Look for: "verify_result": "tier1_failed" in session log
grep "tier1_failed" ~/.archer_sessions/$(ls -t ~/.archer_sessions/*.log | head -1)
M3 — External logging. Confirm that session logs contain raw tool output, not model summaries. Inspect a command_executed event in the session log: the output field should be verbatim shell output, not a model paraphrase. Confirm the SHA-256 sidecar matches:
python3 scripts/verify_logs.py --since $(date +%Y-%m-%d)
# Expected: 0 mismatches, 0 missing sidecars
M4 — No model authorization. Run ARCHER without --authorized-by. The session should refuse to start:
python3 ARCHER.py -a -y --kali -local qwen3:14b --do pentest "scan 192.168.56.103"
# Expected: error requiring --authorized-by flag
Code Layer (C1–C6)¶
C1 — Accountable routing. After a training run (python3 scripts/train_classifier.py), inspect ~/.archer_classifier/metadata.json. Confirm presence of training_set_hash (SHA-256 of the training CSV), confusion_matrix per skill, and version timestamp. Inspect a session log session_start event for classifier_version field — this links the routing decision to the specific model artifact that made it.
cat ~/.archer_classifier/metadata.json | python3 -m json.tool | grep -E 'training_set_hash|version|classifier'
grep "classifier_version" ~/.archer_sessions/$(ls -t ~/.archer_sessions/*.log | head -1)
C2 — Code-layer halt. Inspect should_halt_objective() in ARCHER.py. Confirm that command_count < min_commands returns False unconditionally — the model's [OBJECTIVE_ACHIEVED] token cannot trigger termination below the command floor regardless of the model's confidence. The pre-checks run before halt_fn is consulted.
C3 — Complete, timestamped logs. Run scripts/verify_logs.py. Zero mismatches confirms integrity. Inspect a session log directly and confirm every command has a timestamp field and output is not truncated.
C4 — Code-enforced safety. Inspect execute_command() in ARCHER.py for the exec_target scope check. Confirm that commands routed through execute_with_sudo strip the sudo prefix when running inside archer-kali (container is root; sudo unavailability would fail silently otherwise).
C5 — Ground Truth Gate. Run PT-EXPLOIT-02 (the suppression objective) and inspect the session log for a verify_result: tier1_failed event followed by continued session turns — confirming the gate fired, rejected the false success, and the session did not terminate:
python3 testenv/eval_harness.py --runs 1 --objectives PT-EXPLOIT-02 --audit
grep -A 5 "tier1_failed" ~/.archer_sessions/$(ls -t ~/.archer_sessions/*.log | head -1)
C6 — Probabilistic residual. Inspect a .residual.json sidecar. Confirm findings contains model assertions not confirmed by a verify_fn Tier 1 check, review_status.reviewer is null (awaiting human sign-off), and the file has a paired .residual.json.sha256 sidecar:
ls ~/.archer_sessions/*.residual.json | tail -1 | xargs python3 -m json.tool | grep -E 'findings|reviewer|review_required'
Human Layer (H1–H6)¶
H1 — Pre-execution scope. Confirm --authorized-by is required. The session log's session_start event should contain "authorized_by": "<name>" as a top-level field. This field is written at session initialization and is not modifiable by subsequent model turns.
H2 — Named review of residual. Open a .residual.json sidecar and confirm review_status.reviewer is null. This is the open audit item — the field exists for the named reviewer to populate; the absence of a populated value is the evidence of Partial status. Populated values in production sessions confirm the Auditor role is exercising H2.
H3 — High-impact authorization. Inspect ARCHER.py for the human confirmation prompt path. Run a session that reaches a sudo-required command and confirm the prompt fires. The --human-timeout N flag sets the window; expiry triggers timeout_halt.
H4 — Scoped auditability. Confirm that the Auditor role's review input is the .residual.json sidecar, not the full session transcript. The residual's findings array is bounded — it contains only model assertions not Tier 1 confirmed. A reviewer reading only the residual receives a complete picture of what requires judgment; they are not expected to read the full session log.
H5 — Named accountability. Inspect session_start in any session log produced by the eval harness:
grep "authorized_by" ~/.archer_sessions/$(ls -t ~/.archer_sessions/*.log | head -1)
# Expected: "authorized_by": "eval-harness"
H6 — Halt on unavailability. The timeout_halt exit path is not directly exercisable in eval sessions (eval sessions don't reach H3 authorization prompts by design). Verify by code inspection: signal.alarm() wraps the confirmation prompt in ARCHER.py; the handler calls _exit_session() with exit_reason='timeout_halt'. The halt event logged includes pending_action, timeout_seconds, and outcome.
Collaboration Layer (X1–X4)¶
X1 — Documented, enforced roles. Read CLAUDE.md and docs/roles/. The three roles (Coder, Auditor, Scribe) have defined file access, tool access, and coordination protocols. To verify mechanical enforcement: run a session and inspect the session log for boundary_violation events. Run archer-status to confirm the daily boundary violation count is surfaced. A clean run expects zero violations; any of the five detectable signals produces a logged event automatically.
X2 — Boundary violations detectable. Five violation signal types are automatically detected and logged: verify_fn_skipped, classifier_bypassed, halt_below_floor, auth_without_authby, false_success_claim. To verify: run python3 testenv/eval_harness.py --runs 1 --objectives PT-ENUM-01 and inspect the session log for boundary_violation events (expect zero in a clean run). archer-status surfaces the daily count without manual log inspection.
X3 — Silent competence prevented. Every session-end event includes a required withheld_actions field. To verify: run any session and inspect the session-end event in ~/.archer_sessions/<latest>. Confirm withheld_actions is present as a structured field. An absent field is automatically logged as a boundary_violation event.
X4 — Division of labor auditable. All ft.jsonl event types carry a decision_layer annotation (model/code/human). To verify: run python3 testenv/eval_harness.py --runs 1 --objectives PT-ENUM-01 with --ft-log (on by default) and inspect ~/.archer_sessions/<latest>.ft.jsonl. Every event should include "decision_layer": "<layer>". Layer-of-origin is explicit — no architectural inference required.
8.4 What the Artifacts Prove¶
An evaluator who completes the protocol above can confirm, from the produced artifacts alone:
- What happened: every command executed, every output received, in sequence with timestamps (C3, M3)
- What was confirmed: which model-claimed successes were verified by a code-layer probe of actual target state (C5, M2)
- What requires judgment: the bounded residual of unconfirmed assertions, isolated in
.residual.json(C6, H4) - Who is accountable: the named human in
session_start.authorized_by(H5) - Whether the log is intact: SHA-256 sidecar verification (C3)
What the artifacts cannot yet prove — pending Phase 5 implementation — is which layer made each individual decision within the session, and whether any withheld actions were suppressed without disclosure. Those properties become verifiable when #224–#226 ship.
9. Recommendations¶
9.1 For Security Teams Designing AI-Augmented Workflows¶
Start with the division of labor, not the tool selection. Before evaluating which AI tool to use, define what work belongs in each layer for your specific operational context. The tool selection follows from the specification; the specification does not follow from the tool. Use the Four Questions in the Appendix as a minimum diagnostic before procurement. A tool that cannot answer all four questions is not a Centaur implementation regardless of its marketing.
Enforce boundaries in code where possible. Role constraints documented in prose are advisory. Role constraints enforced mechanically — routing that does not ask the model, halt logic that does not accept the model's completion claim, logging that the model cannot modify — are reliable. For every constraint that matters, ask: is this enforced in code or by convention? If the answer is "convention" for any constraint that affects audit accountability, that is a remediable gap, not an acceptable operating condition.
Build the QA function before the production function. The Auditor role — the function that validates model output before it drives decisions — should be designed before the model is deployed. Define who reviews what, how often, and against what criteria before the first production session runs. A system without a defined QA function is a system where the model's output is accepted without review, which violates H2 and H4. Establishing the review scope after a high-consequence finding is too late.
The review burden must be proportional to consequence. Before deploying any Centaur system in production: define which finding types are Tier 1 (mechanically verifiable) and which are Tier 2 (require human judgment), establish the residual scope the reviewer will receive, and define what constitutes adequate review for each tier. QA fatigue is a real phenomenon — uniform review of large volumes of findings at constant cognitive load produces rubber-stamping. Scoping the human layer to Tier 2 items is how the framework prevents this.
Treat silent competence as an architectural risk. In any system where multiple agents have defined roles and scope constraints, silent competence is the baseline failure mode. Add escalation triggers, pre-authorized action classes, and session-end disclosure requirements to every role definition. The question "is there anything you've identified but haven't acted on because you weren't sure it was within your role?" should be asked at the end of every session until escalation becomes habitual. Once escalation is habitual, the role documents should be updated to pre-authorize those paths so the escalation becomes unnecessary.
Define minimum acceptable log fidelity before deployment. A session log that the model can summarize or paraphrase is not an audit trail — it is a model-generated narrative. Minimum acceptable log fidelity: verbatim command output (not parsed or summarized), ground-truth verification results, SHA-256 tamper evidence on every session file, and the identity of who authorized the session. These are requirements, not preferences, for any engagement where the output drives a remediation decision.
9.2 For Teams Transitioning From Existing AI-Augmented Workflows¶
Assess your current layer assignments before redesigning. Use the Four Questions (Appendix) as a diagnostic. Most teams discover they have implicit layer assignments that are working adequately in practice — the risk is in the edge cases and the failure modes they have not yet encountered. Identify the highest-consequence boundary violations first; those are the ones to remediate before the next high-stakes engagement.
Sequence the code-layer changes before the human-layer changes. The human auditability function (H4) is more effective when the probabilistic residual (C6) is already defined and isolated by the code layer. Asking analysts to review more carefully before the code layer generates a scoped residual adds cognitive burden without reducing risk. The correct sequence: (1) define and isolate the residual; (2) scope the human review function to the residual; (3) add accountability logging; (4) address routing accountability.
Define your Tier 1 verification cases before building the Ground Truth Gate. The objectives with binary, observable success conditions are the ones to instrument first. Each Tier 1 verifier reduces the residual the human must review. Start with the most common success conditions in your task distribution — open port, confirmed credential, root shell. These are instrumentable in a few lines of code and immediately reduce the review burden on the human layer.
Silent competence mitigation is a role definition problem, not a model problem. Before assuming a better model will reduce withheld-path failures, check whether the role documents have explicit escalation triggers and pre-authorized action classes. The architectural fix is faster than the model improvement and addresses the structural cause rather than the symptom. Better models reduce the frequency of withheld-path failures; explicit escalation triggers make the remaining failures visible.
Measure before and after. When transitioning to a Centaur-compliant architecture, track three metrics: routing accuracy (verified against labeled ground truth), false-positive session rate (sessions that claimed success without a confirmed state change), and review burden (time spent per residual item). These three metrics capture the quality improvements Centaur compliance claims to produce. If they do not improve after the architectural changes, there is a gap in the implementation — not in the framework.
9.3 For Tool Developers and Vendors¶
Publish your layer assignments. Document which decisions are made in which layer. A vendor that cannot describe their routing decision mechanism in architectural terms — "the model decides" vs. "a trained classifier with a published confusion matrix decides" — is not making an informed architectural choice, and buyers cannot evaluate compliance without this information. Layer assignment documentation should be part of every product's technical specification, not buried in marketing copy.
Design for the Centaur, not for the demo. Demo performance optimizes for the model's best case. Centaur performance optimizes for consistent quality across the operational distribution. These require different architectural choices. The demo that impresses the audience is not the system the analyst will trust on a long engagement. The gap between demo performance and production performance is usually a layer assignment problem: the demo optimized the model layer; the production failure is in the code layer.
Publish your training data provenance. If your routing mechanism uses a trained classifier, publish: the size and composition of the training set, the per-skill confusion matrix, and the version timestamp linking routing decisions to specific classifier artifacts. SDLC compliance — frozen weights at deployment, a specific training set, a version history, and a measurable confusion matrix — is what distinguishes an accountable classifier from a model agent asked to classify its own tasks. Buyers should require this documentation as a baseline; vendors who cannot provide it are making an architectural claim they cannot support.
Define your residual explicitly. A probabilistic residual is not an optional report — it is the complement of your Tier 1 confirmations. Vendors whose systems claim Centaur compliance without generating a defined probabilistic residual are either (a) performing Tier 1 verification and not publishing the complement, or (b) not performing Tier 1 verification and reporting model assertions as findings. Both are compliance gaps. The residual definition — what constitutes a Tier 1 verification, what appears in the residual, and in what format — should be documented alongside the verification protocol and published to buyers.
Design for external audit. A Centaur-compliant tool should be auditable by a third party who did not build it. Section 8 of this paper is a reference design for that specification. Tools that cannot produce independently-verifiable session artifacts are making architectural claims on the honor system.
9.4 Open Research Directions¶
Agentic system extension. The current framework assumes an interactive human-in-the-loop context. Autonomous agentic systems — where no human is present during execution and authorization is granted in advance — require additional specification, particularly in the human layer. Specific open questions: How does advance authorization bound the scope of autonomous action without creating a blanket authorization gap? What code-layer mechanisms correspond to H3, H4, and H6 when no human can be interrupted mid-session? Is the Tier 2 verification function relocatable — can it be deferred to a post-session human review without unacceptably widening the accountability gap? These questions are left open in this version of the framework; agentic system extension is a planned subsequent specification.
Code-layer silent competence detection. Section 4.5 describes silent competence remediations that operate at the behavioral layer — asking the model to report withheld actions. Detecting silent competence through observable behavioral signals (routing dead ends, workaround command sequences, session-end disclosure patterns) without relying on model self-report is an unsolved problem. A code-layer detection mechanism would close the gap between the requirement (X3) and the current implementation. The research direction: characterize the behavioral signatures of withheld-path routing — command sequences that avoid a high-confidence path in favor of a lower-confidence alternative — and develop a detector that operates on session logs rather than on model introspection. The challenge is distinguishing deliberate workarounds from legitimate alternative approaches, which may require a ground-truth dataset of sessions where withheld paths were subsequently confirmed by human review.
Independent performance characterization. ARCHER's 94% objective completion rate demonstrates that the architecture is achievable at high rates. Characterizing the performance distribution of compliant systems across implementations, domains, and target environments requires evaluation beyond a single system. Specific need: an independent evaluation of a second Centaur-compliant implementation — one not derived from the ARCHER operational experience — to determine whether the requirement set generalizes or overfits to ARCHER's design choices. This evaluation would also determine what performance ranges are characteristic of compliant systems, enabling a responsible specification of performance floors that this paper declines to provide from a single-system basis.
Formal verification of code-layer invariants. The current framework specifies requirements in prose and evaluates them against observable artifacts. A natural extension is to formalize the code-layer invariants — routing not delegated to model, halt not determined by model signal alone, log integrity maintained — in a specification language (TLA+, Alloy, or similar) and verify that an implementation satisfies them mechanically. The tractability argument for this approach: the code-layer invariants are structurally simple enough to be candidate targets for formal verification; they make claims about what the model layer is not permitted to do, which is an invariant on system composition rather than on model behavior. The human-layer invariants, which require reasoning about accountability and organizational context, are not formal verification targets.
Multi-system Centaur coordination. The current framework addresses a single Centaur system with three layers. Security operations increasingly involve multiple AI systems — threat intelligence ingestion, SIEM analysis, endpoint detection, response orchestration — each making decisions that feed the next. When the output of one system's model layer becomes the input to another system's code layer, the inter-system accountability chain is undefined by the current framework. The open question is: what does Centaur compliance require of a pipeline of individually-compliant systems? Specifically, does a ground-truth verification result from System A carry forward as evidence in System B's session log, or must System B independently verify state against the target? This is a planned extension; the single-system framework is the prerequisite.
10. Falsifiable Claims¶
The claims below separate two categories: those for which supporting case evidence already exists in the ARCHER development record, and those requiring controlled study for confirmation. Both are falsifiable; the distinction is between grounded and predicted.
Case-Evidenced Claims¶
1. Silent competence accounts for a measurable fraction of delayed fixes in AI-assisted development. The ARCHER backlog audit identified four issues misclassified as hard technical problems that were in fact role-authorization gaps. In the PT-EXPLOIT-05 case (§4.5), three sessions iterated on the wrong variables before the constraint-blocking nature of the problem was surfaced and addressed. The pattern — correct fix identified, authorization question suppressed, workaround iterations accumulate — was confirmed as recurring rather than isolated. Falsified if: systematic review of AI-assisted development sessions finds that delays are attributable to technical difficulty rather than role gaps at rates inconsistent with the ARCHER sample.
2. Pre-authorized action classes reduce session latency to correct outcomes. Once the lab VM modification was explicitly authorized in the PT-EXPLOIT-05 case, the fix was applied in a single session. The three preceding sessions had each been spent on the wrong problem. The structured escalation trigger added to CLAUDE.md after the incident (§4.5 remediation #2) was applied before further class-level fixes, and no equivalent multi-session delay recurred for that class of problem. Falsified if: adding explicit pre-authorization for a defined action class does not reduce mean sessions-to-correct-outcome compared to baseline sessions requiring runtime escalation for the same action class.
3. A single human practitioner operating as the judgment layer over structured AI execution lanes matches the throughput of a 4–6 person team on equivalent scope. The ARCHER development record spans approximately five weeks and encompasses: 15 skill domain implementations, 50+ graded eval objectives with ground-truth verification, a five-phase Centaur compliance architecture, a complete V2 training and fine-tuning pipeline (routing classifier, ft.jsonl collection, audit gate, RunPod adapter pipeline), a MOP/MOE measurement framework, and a compliance-grade documentation ecosystem. The equivalent traditional team requires five disciplines to staff without serial blocking across sprints:
| Discipline | Scope |
|---|---|
| Security engineering | Skill pack design, exploitation chain implementation, eval objectives |
| ML / data engineering | Routing classifier, fine-tuning pipeline, data quality gates |
| Core software engineering | Agent loop, halt discipline, session logging, test suite |
| Security research and writing | Whitepaper, compliance mapping, architecture documentation |
| DevOps | Docker topology, GPU management, CI/CD pipeline |
A four-to-six-person staffing model covers these disciplines. A human team of this size carries an estimated 20% coordination overhead — context handoffs, merge conflict resolution, role boundary negotiation — that compounds across sprint cycles. The three-instance Claude Code model (Coder/Auditor/Scribe) produces parallel execution across these lanes with near-zero coordination cost: instances operate on separate file domains, findings transfer through a structured handoff document, and no calendar friction exists between role boundaries.
The mechanism is not substitution — it is a different cost structure. Human judgment (what to build, what the architecture must enforce, what constitutes correct) remains with one person and is non-delegable. What structured AI execution removes is the serial handoff cost paid whenever work crosses a role boundary in a human team. The absence of that cost, compounded across 26 weeks of sprint cycles, accounts for the throughput gap.
Self-assessment limitation: This claim rests on a single self-reported development record where the practitioner is simultaneously researcher, implementer, and evaluator — the conditions most susceptible to motivated reasoning. The widening objection deserves explicit weight: a traditionally-staffed team with distributed expertise might produce a system broader in some dimensions where individual contributor focus produces depth at the expense of coverage. The efficiency claim is specifically about throughput on defined scope, not about the quality ceiling available to either model.
Falsified if: An equivalent scope — comparable skill coverage, eval objective count, and training pipeline completeness — is produced by a traditionally-staffed 4–6 person team in less elapsed calendar time. Alternatively falsified if: independent assessment of the ARCHER development record attributes the timeline to exceptional individual contributor velocity rather than to the AI execution layer specifically. The project is, in a narrow sense, evidence for its own thesis: a single human operating as the judgment layer over a structured AI execution layer outperforms the coordination overhead of a small human team on this class of constrained-scope technical work.
Predictions Requiring Controlled Study¶
4. Human-AI teams with a defined division of labor outperform teams without one. Prediction: security operations teams using Centaur-compliant tools produce more consistent, auditable findings per engagement hour than teams using tools without defined layer boundaries, controlling for analyst experience. (pending: controlled study not yet conducted).
5. Compensating logic accumulation predicts routing failure rate. Prediction: systems with larger parsing compensation layers have higher routing error rates on ambiguous task phrasings. Falsified if: no correlation between compensation layer size and routing error rate across a reference set of AI security tools. (pending: cross-system measurement not yet conducted).
6. The three-layer specification identifies the necessary conditions for Centaur compliance. Prediction: any system that fails a requirement in Section 5 shows measurable performance inconsistency attributable to that violation; systems satisfying all twenty requirements achieve Centaur-level consistency. Sufficiency is not claimed — additional requirements may be necessary that this specification does not yet capture. (pending: independent evaluation required).
Appendix: The Four Questions¶
When evaluating any AI-assisted security workflow for Centaur compliance, these four questions identify the structural gaps.
Question 1: Who routes? What decides which analytical workflow handles a given task? If the answer is "the model" or "it depends on how you phrase the task," routing is in the model layer. That is a boundary violation with measurable consequences.
Question 2: Who halts? What determines when a session is complete? If the answer is "the model says it's done," halt detection is in the model layer. That is a boundary violation that produces false positive findings and missed completions.
Question 3: Who is accountable? When a finding drives a remediation decision, who is responsible for its accuracy? If the answer is "the tool" or "we assumed the AI was correct," accountability has been delegated below the human layer. That is a boundary violation with legal and professional consequences.
Question 4: Who verifies ground truth? When the model claims the objective is complete — a shell obtained, a credential recovered, a vulnerability confirmed — what checks whether that claim reflects actual target state? If the answer is "the model's output looked correct" or "the session log shows success," you have a boundary violation. Ground truth is a state observable in the target system, not in the model's output. A Centaur implementation probes the target after any success claim and confirms the asserted state change through a mechanism the model cannot influence. The model claims. The code verifies. The finding rests on the verification.
A Centaur implementation can answer all four questions: the code layer routes, the code layer halts against a verified condition, the code layer confirms findings against target state, and the named analyst who reviewed and authorized the output is accountable for it.
References
Pre-publication — citations requiring verification before final submission
- [ ] [^3] Bilalić et al. — cited as "advance online publication." Final volume, issue, and page numbers cannot be retrieved (Wiley paywall). Verify via institutional access or check doi.org/10.1111/bjop.12750 closer to submission; update footnote if final pagination is now assigned.
- [ ] [^10] Kim, Dán & Zhu — DOI
10.1109/TIFS.2024.3402148confirmed as a different paper (Online Self-Supervised Deep Learning for Intrusion Detection Systems); citation correctly uses IEEE Xplore document 10613858 only. Obtain the correct DOI via IEEE Xplore access before final submission. - [ ] [^21] Deng et al. (PentestGPT) — venue and page range cited from preprint; confirm USENIX Security '24 proceedings page numbers (arXiv:2308.06782 used as fallback if proceedings access unavailable).
- [x] [^16] CSA/Dropzone AI — resolved. Vendor co-authorship disclosure present in §2.2 body text at point-of-claim and in the reference footnote.
[^1]: Kasparov, G. "The Chess Master and the Computer." The New York Review of Books, vol. 57, no. 2, 11 February 2010. nybooks.com/articles/2010/02/11/the-chess-master-and-the-computer/
[^2]: "Dark Horse ZackS Wins Freestyle Chess Tournament." ChessBase News, 2005. chessbase.com/post/dark-horse-zacks-wins-freestyle-che-tournament
[^3]: Bilalić, M., Graf, M., & Vaci, N. "Computers and chess masters: The role of AI in transforming elite human performance." British Journal of Psychology, 2024, advance online publication. DOI: 10.1111/bjop.12750
[^4]: 14 C.F.R. § 91.3 — Responsibility and authority of the pilot in command. Electronic Code of Federal Regulations, current version. ecfr.gov/current/title-14/…/section-91.3. Key provision (§ 91.3(a)): "The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft."
[^5]: Cestonaro, C., Delicati, A., Marcante, B., Caenazzo, L., & Tozzo, P. "Defining medical liability when artificial intelligence is applied on diagnostic algorithms: a systematic review." Frontiers in Medicine, vol. 10, 2023, Article 1305756. DOI: 10.3389/fmed.2023.1305756
[^6]: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation). Official Journal of the European Union, L 119, 4 May 2016, pp. 1–88. eur-lex.europa.eu/eli/reg/2016/679/oj. Accountability principle: Article 5(2); liability: Article 82.
[^9]: Gaessler, F., & Piezunka, H. "Training with AI: Evidence from chess computers." Strategic Management Journal, vol. 44, no. 11, 2023, pp. 2724–2750. DOI: 10.1002/smj.3512
[^10]: Kim, Y., Dán, G., & Zhu, Q. "Human-in-the-Loop Cyber Intrusion Detection Using Active Learning." IEEE Transactions on Information Forensics and Security, vol. 19, 2024, pp. 8658–8672. IEEE Xplore document 10613858. Note: full DOI should be verified on IEEE Xplore before final publication.
[^11]: National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. U.S. Department of Commerce, January 2023. DOI: 10.6028/NIST.AI.100-1
[^12]: Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689, 12 July 2024. eur-lex.europa.eu/eli/reg/2024/1689/oj/eng. Human oversight requirements: Article 14.
[^13]: IEEE Std 7001™-2021. IEEE Standard for Transparency of Autonomous Systems. IEEE, approved 9 December 2021, published 4 March 2022. IEEE Xplore document 9726144. ieeexplore.ieee.org/document/9726144/
[^14]: Tariq, S., Chhetri, M. B., Nepal, S., & Paris, C. "Alert fatigue in Security Operations Centres: Research challenges and opportunities." ACM Computing Surveys, vol. 57, no. 9, Article 224, March 2025. DOI: 10.1145/3723158
[^15]: Chhetri, M. B., Tariq, S., Singh, R., Jalalvand, F., Paris, C., & Nepal, S. "Towards Human-AI Teaming to Mitigate Alert Fatigue in Security Operations Centres." ACM Transactions on Internet Technology, vol. 24, no. 3, Article 12, July–August 2024. DOI: 10.1145/3670009
[^16]: Cloud Security Alliance & Dropzone AI. Beyond the Hype: A Benchmark Study of AI in the SOC. October 2025. cloudsecurityalliance.org/artifacts/a-benchmark-study-of-ai-agents-in-the-soc. Disclosure: co-published with a commercial vendor (Dropzone AI). Study design is controlled (n=148, pre-registered metrics); findings should be weighted accordingly.
[^17]: Strom, B. E., Applebaum, A., Miller, D. P., Nickels, K. C., Pennington, A. G., & Thomas, C. B. MITRE ATT&CK®: Design and Philosophy. Technical Report PR-19-01075-28. The MITRE Corporation, revised March 2020. attack.mitre.org/docs/ATTACK_Design_and_Philosophy_March_2020.pdf
[^18]: European Central Bank. TIBER-EU Framework: How to Implement the European Framework for Threat Intelligence-Based Ethical Red-Teaming. Frankfurt am Main: ECB, May 2018. ecb.europa.eu/pub/pdf/other/ecb.tiber_eu_framework.en.pdf
[^19]: Penetration Testing Execution Standard (PTES). Community standard. pentest-standard.org
[^20]: Hawkins, J. "The Stochastic Trap: An Architectural Critique of Current AI Security Tools." Centaur Security Labs, 2026. centaursecuritylabs.com/research/stochastic-trap
[^21]: Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M., & Rass, S. "PentestGPT: An LLM-Empowered Automatic Penetration Testing Tool." 33rd USENIX Security Symposium (USENIX Security '24), Philadelphia, PA, August 2024. arXiv: 2308.06782. Page range to be confirmed against final proceedings.
Glossary
Centaur model. The division of labor between human and AI named after Kasparov's 1998 observation that a human working with a computer beat either alone. In security operations: the machine handles speed, breadth, and consistency; the human handles judgment, context, and accountability. Neither operates alone.
Compensating logic. Code added reactively to handle variation in model output — parsers that accommodate multiple output formats, fallbacks when the model doesn't follow instructions, post-processing that corrects malformed output. Compensating logic is the primary diagnostic for model-layer boundary violations: when the surrounding code grows to handle what the model was assigned to produce, the model is performing code-layer work. The distinction from intentional code architecture is design intent: compensating logic is a patch on a misassigned responsibility; intentional code architecture is a deliberate decision that the function belongs in the code layer from the start. See §3.1 and §7.1.
Code layer. One of three non-overlapping responsibility layers in the framework. The code layer owns all deterministic, auditable functions: task routing, command execution, safety constraint enforcement, halt detection, and session logging. Work that requires consistent, reproducible behavior belongs here — not in the model layer.
Collaboration layer. The fourth layer governing how the model, code, and human layers relate. Requirements X1–X4 specify that role boundaries are documented and mechanically enforced, boundary violations are detectable, silent competence is structurally prevented, and the layer of origin for every decision is auditable.
Ground Truth Gate. A three-layer verification mechanism that confirms model-claimed task success through code-layer checks independent of model output. Layers: (A) inline signal check against session output, (B) active target probe via docker exec, (C) external --verify-fn script. A tier1_failed result from any layer suppresses the success signal and continues the session.
Halt discipline. The set of code-layer rules governing when an AI session ends. Includes minimum and maximum command counts, keyword-based completion signals, and the Ground Truth Gate. The model requests termination; the code layer decides whether to honor it.
Human layer. One of three non-overlapping responsibility layers. The human layer owns all non-delegable accountability functions: defining scope and acceptable risk, interpreting findings against organizational context, authorizing irreversible or high-impact actions, and final remediation decisions. These functions cannot be assigned to a probabilistic system and have no code-layer substitute.
Human-Verified Traceability. The property of a finding that has a verifiable chain from raw tool output to confirmed target state to named human sign-off. A finding with this property can be presented to a regulator, acted on by a remediation team, and traced backward to the specific evidence that confirmed it. Human-Verified Traceability requires: (1) verbatim tool output captured in the session log (C3), (2) target state confirmed by a code-layer Ground Truth Gate probe independent of the model's assertion (C5), and (3) the reviewing professional's identity logged alongside the finding (H5). A finding that rests only on model assertion — without C5 confirmation and H5 sign-off — has not achieved Human-Verified Traceability regardless of the model's confidence. The compliant session flow in §3.4 traces how this property is produced step by step.
Model layer. One of three non-overlapping responsibility layers. The model layer owns work where probabilistic reasoning over a learned distribution outperforms deterministic rules: command generation, output interpretation, next-step chaining, attack chain narrative, and MITRE ATT&CK mapping. The model layer does not own routing, halt detection, logging, or compliance functions.
Probabilistic residual. The complement of Tier 1 confirmations: everything the model asserted during a session that was not mechanically verified by a code-layer check. Includes model-generated severity ratings, remediation recommendations, Tier 2 findings, and any inference beyond raw tool output. Generated automatically at session end as a structured output for human review. Defined at C6.
Probabilistic system. A system that generates outputs by sampling from a learned statistical distribution — such as a large language model — rather than by executing deterministic logic. Probabilistic systems are capable and powerful for appropriate tasks; they are constitutionally unreliable for tasks requiring deterministic correctness (audit trails, routing, halt detection).
SDLC compliance. Software Development Lifecycle compliance. The property that a routing mechanism has frozen weights at deployment, a specific training set, a version history, and a confusion matrix measurable against ground truth. Used in M1/C1 to distinguish an accountable classifier from ad-hoc model-generated routing.
Silent competence. A failure mode in role-constrained AI systems: an agent correctly identifies a solution, lacks authorization to implement it, and says nothing — routing around the blocked path and presenting alternatives as if the correct answer were unavailable. See the companion paper Silent Competence.
Stochastic Trap. The failure mode of assigning deterministic, accountable work to a probabilistic system — using a model to make decisions that have correct answers, require audit trails, or carry non-delegable accountability. The trap: the model's probabilistic output approximates the right answer often enough to appear functional in testing, but fails unpredictably in production and cannot be remediated against a ground-truth baseline. In security operations, the Stochastic Trap appears whenever routing, halt detection, or verification is delegated to the model layer. Using a second AI model to generate a ground-truth verifier is a specific instance: the verifier must be deterministic by definition; a probabilistic verifier reintroduces the same failure mode at the verification layer. See companion paper The Stochastic Trap[^20] and §3.1.
Tier 1 verification. A binary, mechanically confirmable ground-truth check performed by the code layer. Examples: confirmed shell at uid=0 via docker exec, open port confirmed by direct connection, CVE string present in actual scanner output. Tier 1 checks are the basis for shrinking the probabilistic residual.
Tier 2 verification. An interpretive check that cannot be performed deterministically by the code layer. Examples: whether a complex logic flaw is exploitable in a specific business context, whether a severity rating is appropriate given organizational risk tolerance. Tier 2 findings always require human judgment and appear in the probabilistic residual.
About the author: Jay Hawkins spent twenty years in the U.S. Army, including a decade in cyber operations — serving at USCYBERCOM, USCENTCOM, USNORTHCOM, and USEUCOM — and holds an active TS/SCI clearance. He builds local-first AI security tools and writes about the methodology, the hard lessons, and the compliance implications of doing it in production. CEH, CHFI, Pentest+, Security+.
Centaur Security Labs — centaursecuritylabs.com