
The Centaur Model

Centaur Security Labs — Jay Hawkins


The question that shaped this work wasn't whether AI belongs in security operations; it clearly does. The question was how to build a reliable, trustworthy system using AI when the model is probabilistic by nature: prone to hallucinations, context loss, and going off script in ways that matter a great deal in a security context. How can something so stochastic perform such consequential work? What I found: the model is only part of the equation. We also need deterministic code wherever there is a known right or wrong answer, because code is faster and performs exactly as programmed, and we need proper data collection to establish the ground truth from which to operate. The next question, then, was which decisions belong to the model, which belong to the code, and which a human should never delegate.

Getting that division right turns out to be the primary determinant of whether an AI security tool is reliable in production. This paper describes the architecture that emerged from exploring that question — a principled division of labor between human judgment, deterministic code, and probabilistic AI — and the measurable consequences it produces for reliability, auditability, and compliance.


The Origin

In 1997, IBM's Deep Blue defeated Garry Kasparov in a six-game match, the first time a machine had beaten a reigning world chess champion in a match under standard tournament conditions. The technology press declared chess solved and human intuition obsolete.

Kasparov disagreed. He spent the next decade proving why.

Advanced Chess — a format he helped develop — allowed human players to consult chess engines during play. The results were unambiguous. The strongest performances didn't come from grandmasters playing alone, or from the most powerful engines running without human input. They came from human-computer teams with a disciplined division of labor: humans supplying strategic judgment and positional intuition, machines supplying tactical calculation and combinatorial depth no human could match.

Kasparov called this the Centaur. A hybrid that outperformed both of its constituent parts — not because the combination was additive, but because the division of work was principled.

The chess analogy has limits that must be stated. Modern engines have advanced to the point where human input in pure tactical calculation is frequently noise rather than signal. That evolution does not undermine the framework — it clarifies it. The structural principle chess established is this: the division of labor is determinative. What humans contribute in security operations is not superior calculation. It is alignment with organizational risk, legal authority, and institutional context — variables that are genuinely non-computable in a way that chess moves are not. An engine can find the optimal move. No system can decide whether exploiting a vulnerability violates the rules of engagement for this client, on this engagement, on this date. That decision belongs to a human, not because humans calculate better, but because humans are the ones who can be held accountable for the outcome.

Aviation and medicine make the accountability point more cleanly than chess. A pilot holds an FAA certificate regardless of how capable the autopilot is. A physician signs the diagnosis regardless of the AI diagnostic tool's accuracy rate. The accountability is legally anchored — it does not migrate to the tool as the tool improves. This is the correct frame for the human layer in security operations: not a temporary bottleneck pending better AI, but a permanent feature of how legal and professional accountability is structured. As AI security tools improve, they perform more of the work. The named human professional still signs the finding.

The same dynamic applies to security operations, with the same caveat: the division of labor is determinative. Get it wrong, and the combination is worse than either part.


The Division of Labor

The Centaur Model divides work across three layers. The boundaries between them are not arbitrary — they follow from the nature of the work itself.

The model layer handles work where probabilistic reasoning over a learned distribution produces better results than any deterministic rule. In security operations: generating the next investigative command given current tool output and session state, interpreting unstructured output from tools like nmap, Suricata, and msfconsole, building coherent attack chain narratives across turns, and generating candidate next steps from a large and combinatorial solution space. These tasks live in a space of ambiguity: there is no single correct, fixed answer, and a model that has learned from a large corpus of operational data generalizes better than any rule set.

The code layer handles work with correct answers — decisions that must be made the same way every time and processes that must be auditable and reproducible. Task routing (mapping a task string to the correct analytical workflow), halt detection (determining when a question is answered and a session is complete), ground-truth verification (confirming that a claimed finding reflects actual target state, not model assertion), safety constraint enforcement (preventing unauthorized actions from executing), and session logging (maintaining an accurate, unmodifiable audit trail) all belong here. These roles have correct answers. Assigning these tasks to the model layer and compensating for unreliability with instructions produces a failure mode that we call the Stochastic Trap: a system that is right most of the time, with no mechanism for catching the cases when it isn't.
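As a minimal sketch of what safety constraint enforcement can look like in the code layer, assume the rules of engagement are expressed as a list of authorized CIDR ranges. The function name `in_scope`, the `SCOPE` list, and the example addresses are hypothetical illustrations, not ARCHER's actual implementation.

```python
import ipaddress


def in_scope(target: str, allowed_cidrs: list[str]) -> bool:
    """Return True only if target is an IP address inside an authorized range."""
    try:
        addr = ipaddress.ip_address(target)
    except ValueError:
        # Hostnames or malformed input are rejected rather than guessed at.
        return False
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical engagement scope (a documentation-reserved range).
SCOPE = ["192.0.2.0/24"]

print(in_scope("192.0.2.15", SCOPE))    # True
print(in_scope("198.51.100.7", SCOPE))  # False
```

The check has one correct answer for any input, which is why it sits in code: the model can propose a command against any address, but only an in-scope target gets executed.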

A note on terminology: calling the code layer "deterministic" invites a misreading. The routing classifier is itself probabilistic: trained on labeled data, evaluated against ground truth, with a measurable accuracy rate. What makes it a code-layer function is not that it uses a simple rule, but that it is accountable: separately trained, separately evaluated, improvable against a labeled test set, and auditable. By contrast, when the model agent makes routing "decisions" inline, they vary from run to run and are not tied to any ground truth you can measure. That is the distinction that matters. A more precise frame: the code layer is code because it follows the software development lifecycle (SDLC) and produces predictable output. The routing classifier has a frozen weights file at deployment, a specific training set, a version history, and a confusion matrix measurable against ground truth. It can be tested and improved in isolation. The model agent's routing output cannot be pinned to a confusion matrix, because it can change on every re-run. SDLC accountability, not implementation type, is the boundary.
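To make "measurable against ground truth" concrete, here is a small sketch that builds a confusion matrix from labeled routing decisions. The skill names and the four labeled examples are invented for illustration; the text does not publish ARCHER's labels or test set.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Count (true_label, predicted_label) pairs from labeled routing decisions."""
    return Counter(pairs)


def accuracy(matrix):
    """Fraction of decisions where the prediction matched the label."""
    total = sum(matrix.values())
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / total if total else 0.0


# Hypothetical labeled test set: (true skill, predicted skill).
labeled = [
    ("recon", "recon"),
    ("recon", "recon"),
    ("web", "web"),
    ("web", "recon"),  # a routing error, visible and countable
]

matrix = confusion_matrix(labeled)
print(matrix[("web", "recon")])  # 1
print(accuracy(matrix))          # 0.75
```

Because the classifier is a frozen artifact, the same test set can be re-scored after every retraining, and a regression shows up as a changed cell in the matrix.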

The human layer handles work that cannot be delegated because it involves legal and operational risk acceptance. This is a precise claim, not a vague appeal to human superiority. A model can find a vulnerability. The code can verify the vulnerability exists. Only a named human can decide whether exploiting it violates the rules of engagement for this engagement, accept the organizational risk that decision carries, and be accountable to the regulator if something goes wrong. That function — risk acceptance — is non-delegable by definition. No regulatory framework accepts "the AI decided" as a defense. No contract transfers liability to a language model. The human layer exists not because humans are smarter than models, but because humans are the only layer that can be held responsible.

There is also an epistemological gap that risk acceptance framing alone does not capture. A model sees a vulnerability. A human sees a vulnerability on a legacy system scheduled for decommission in 48 hours, in a regulated environment where a finding above a certain severity triggers mandatory 72-hour disclosure. The technical finding is the same. Its organizational meaning is not. The human layer is the Final Parity Check between technical truth and organizational reality — the place where a correct finding meets the constraints that determine what to do about it. No model can hold that context. No code can acquire it. It belongs in the human layer by definition.


When the Division Goes Wrong

The most common pattern in AI-augmented security tooling places the language model as the primary decision-making layer. The model receives a task, reasons over it, and produces output that the system executes with minimal mediation.

Language models are probabilistic systems. They generate outputs by sampling from learned distributions, not by executing logic. Applied to roles that require deterministic correctness — routing decisions, halt conditions, audit trails — they produce outputs that are plausible in form but unreliable in content. The natural response is prompt engineering: adding instructions, formatting requirements, and output constraints in the expectation that a capable enough model will behave consistently.

What I found in practice: every parser written to handle output variation, every fallback added when the model doesn't follow a format, every safety check inserted after generation — each is evidence that the model has been given work that belongs in the code layer. That accumulation of compensating logic is the signal that the division of labor has broken down.

The Centaur Model resolves this by matching the nature of the work to the nature of the layer. The model earns its place in exactly the operations where probabilistic reasoning produces better results than any rule. Everything else is code or human judgment.


What Correct Division Produces

The measurable advantage of the Centaur Model is not speed, though AI-assisted analysis is faster than manual. It is consistency and provenance.

A human analyst on a long engagement will, under cognitive load, skip methodology steps. Not from incompetence — from the limits of working memory. A model executing the methodology procedurally, with every command logged and every output captured, doesn't require the analyst to maintain the checklist in their head at all. The analyst's cognitive capacity is preserved for the work that requires it: interpreting ambiguous findings, making scope decisions, communicating risk to stakeholders, authorizing the high-impact moves.

Provenance matters independently. A security finding that cannot be traced to specific tool output — a finding that exists only as a model-generated summary — is not auditable and does not constitute compliance evidence under NIS2 or DORA. Every ARCHER session produces a timestamped log that captures each command issued, the raw output returned, and the finding annotations linked to specific tool output. The finding is traceable. The audit trail is complete. The human who reviewed and authorized the session is the named professional of record.
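One common way to make a session log tamper-evident is to chain entries by hash, so that editing any earlier entry invalidates everything after it. The sketch below assumes that approach; the text does not say how ARCHER protects its logs, and the field names, commands, and timestamps here are hypothetical.

```python
import hashlib
import json


def _digest(body):
    """Stable SHA-256 over the entry content."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, command, raw_output, timestamp):
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {
        "timestamp": timestamp,
        "command": command,
        "raw_output": raw_output,
        "prev_hash": prev_hash,
    }
    log.append({**body, "hash": _digest(body)})


def chain_is_intact(log):
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "nmap -sV 192.0.2.15", "22/tcp open ssh", "2026-01-01T00:00:00Z")
append_entry(log, "id", "uid=0(root)", "2026-01-01T00:01:00Z")
print(chain_is_intact(log))  # True

log[0]["raw_output"] = "edited after the fact"
print(chain_is_intact(log))  # False
```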

Ground-truth verification is the mechanism that makes this provenance real rather than claimed. ARCHER's verify_fn layer executes after any model-claimed success and probes the target for the specific state change the model asserted — a confirmed root shell, a verified CVE string in actual tool output, a file that demonstrably exists or doesn't. If the state change isn't present, the session continues regardless of the model's confidence. The model claims; the code verifies; the session log records both. A finding that passes verify_fn is traceable to a specific observable state in the target system, not to the model's probability distribution over plausible outputs.

Ground-truth verification applies to objectives with unambiguous, binary-observable success conditions: a root shell confirmed by uid=0 in tool output, a file that either exists or does not, a port that is open or closed, a credential that authenticates or does not. It does not apply to complex logic flaws, business logic vulnerabilities, or objectives where success requires human interpretation. The verification claim is bounded by what is confirmable without invoking another probabilistic layer. ARCHER's implementation runs verification locally via docker exec against the archer-kali evaluation container — no external API calls, no probabilistic intermediary between the claimed state and the confirmed state.
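The claim-then-verify pattern described above can be sketched as follows. This is an illustration under stated assumptions, not ARCHER's verify_fn: the function names `docker_exec`, `verify_root_shell`, and `verify_claim` are hypothetical, and the actual probe commands ARCHER runs are not given in the text. Only the `docker exec` invocation and the `archer-kali` container name come from the description.

```python
import re
import subprocess


def docker_exec(container, argv):
    """Run argv inside a container and return its stdout (assumed probe path)."""
    result = subprocess.run(
        ["docker", "exec", container, *argv],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


def verify_root_shell(probe_output):
    """Binary check: does the probe output show uid=0?"""
    return re.search(r"\buid=0\b", probe_output) is not None


def verify_claim(model_claimed_success, probe, check):
    """Only a claim that an independent probe confirms counts as verified."""
    if not model_claimed_success:
        return False
    return check(probe())


# Stubbed probes stand in for real target output so the sketch runs anywhere.
# In a live session, probe would wrap docker_exec("archer-kali", some_command).
confirmed = verify_claim(
    True, lambda: "uid=0(root) gid=0(root) groups=0(root)", verify_root_shell
)
rejected = verify_claim(
    True, lambda: "uid=33(www-data) gid=33(www-data)", verify_root_shell
)
print(confirmed, rejected)  # True False
```

The model's confidence never enters `verify_claim`; the only inputs are the fact that a claim was made and what the probe observed.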

The most valuable output of a Centaur implementation is not task success — it is Human-Verified Traceability: a finding with a verifiable chain from tool output to confirmed target state to named human sign-off. This chain is what regulators audit. It is what a professional can defend. It is what distinguishes an AI-assisted finding from an AI-generated assertion.

ARCHER, built on this architecture, reaches a 94% pass rate across 67 active objectives — 87 evaluation sessions against real targets (2026-05-09 baseline run, 29-objective set × 3 runs), not synthetic scenarios. The 6% that fail are documented failures, not uncaught false positives. That distinction matters: a system designed for provenance catches its own errors.


How the System Learns

The three-layer architecture isn't static. Each layer generates information that feeds the others — and if that flow is designed deliberately, each iteration of the system makes the next one better. What looks like a division of labor is also a learning loop.

What Each Layer Knows

The model layer knows what it was trained on. In a Centaur implementation, that training data isn't sourced from generic benchmarks — it's sourced from the system's own operational sessions. Every completed run produces structured examples of how the model responded to real tasks against real targets, what worked, and what the outcome was. That data is audited for quality, curated by skill domain, and fed back into the fine-tuning pipeline. The model that runs next month has learned from the sessions that ran this month. Over time, it stops generalizing from internet-scale training data and starts specializing on the exact task distribution it actually sees.

The code layer knows what it has measured. The routing classifier learns from labeled routing decisions — every session produces a record of which task phrasing mapped to which skill, and whether that mapping was correct. Hint logic — the per-skill instructions that guide the model's approach — is updated when the audit trail reveals systematic failures at specific objectives. The playbook accumulates winning command sequences across sessions, making successful approaches reusable rather than rediscovered from scratch each time. The code layer's knowledge is operationalized in frozen artifacts — a trained classifier with a version history, a database of validated commands, a set of hints refined by failure analysis — rather than regenerated probabilistically on each run.

The human layer knows what neither of the other layers can know: why a finding matters in its specific organizational context, whether a failure pattern reflects a systemic problem or an edge case, which objectives to prioritize, and when the data being generated is trustworthy enough to train on. The human layer also holds the accumulated judgment about where the system is breaking down — reading failure reports, reviewing flagged sessions, diagnosing root causes, and deciding which fixes are worth making. That judgment is what shapes both layers below it.

How Information Moves Between Layers

The session log is the primary synchronization artifact. Every run produces a structured record: the task, the skill domain, every command issued, every output returned, the outcome. That log flows upward — to the audit system for quality checks, to the human layer for failure review, and eventually to the fine-tuning pipeline for model improvement. It also flows sideways — to the routing classifier as a labeled training example, to the playbook as a candidate winning command if the session succeeded.

Failure analysis moves downward. When the human layer identifies a root cause — a hint that's sending the model down the wrong path, a halt condition that's firing too early, a routing error on a specific task phrasing — that diagnosis becomes a targeted change to code-layer logic. The fix is committed, verified against the objective that was failing, and the improved behavior appears in subsequent sessions.

The routing log closes a specific loop: every routing decision records the classifier's confidence and whether its prediction was used. That log is what revealed that ARCHER's original 0.7 confidence threshold was discarding correct predictions — the threshold was lowered to 0.5 based on that measurement, not intuition. Without the log, the threshold would have stayed wrong indefinitely.
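The kind of measurement that motivates a threshold change can be sketched as a sweep over the routing log. The log entries below are invented to illustrate the calculation and are not ARCHER's data; the field names `confidence` and `correct` are assumptions about what such a log would record.

```python
def discarded_correct(routing_log, threshold):
    """Count correct predictions that fall below the threshold and go unused."""
    return sum(
        1 for entry in routing_log
        if entry["correct"] and entry["confidence"] < threshold
    )


def used_incorrect(routing_log, threshold):
    """Count wrong predictions that clear the threshold and get used."""
    return sum(
        1 for entry in routing_log
        if not entry["correct"] and entry["confidence"] >= threshold
    )


# Hypothetical routing log entries.
routing_log = [
    {"confidence": 0.92, "correct": True},
    {"confidence": 0.64, "correct": True},
    {"confidence": 0.58, "correct": True},
    {"confidence": 0.55, "correct": True},
    {"confidence": 0.41, "correct": False},
]

for threshold in (0.7, 0.5):
    print(
        threshold,
        discarded_correct(routing_log, threshold),
        used_incorrect(routing_log, threshold),
    )
# 0.7 3 0
# 0.5 0 0
```

Reading both counts together matters: lowering a threshold is only justified if the correct predictions it recovers are not offset by wrong predictions it lets through.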

The Improvement Spiral

These flows compound. Better model performance produces higher-quality sessions. Higher-quality sessions produce better training data. Better training data improves the model. Improved routing sends tasks to the right skill domain, which means sessions complete correctly more often, which means more high-quality examples enter the training pipeline. Hint fixes reduce the failure rate on specific objectives, which means those objectives generate valid training examples instead of contaminated ones.

The spiral requires one condition to function: the data quality gate. Unaudited sessions fed directly into the training pipeline don't improve the model — they teach it to replicate whatever behavior it's been exhibiting, including failures. The audit layer (automated structural checks plus human review of flagged sessions) is not overhead on the improvement process. It is the mechanism that makes the spiral go up rather than sideways.
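A data quality gate of this shape can be sketched in a few lines. The session fields (`task`, `skill`, `commands`, `outcome`, `flagged`, `human_approved`) are hypothetical names chosen for the example, and the structural check is deliberately simplistic compared with whatever a production audit would enforce.

```python
REQUIRED_FIELDS = ("task", "skill", "commands", "outcome")


def passes_structural_audit(session):
    """Automated check: required fields present and at least one command logged."""
    return (
        all(field in session for field in REQUIRED_FIELDS)
        and len(session["commands"]) > 0
    )


def eligible_for_training(session):
    """Admit a session only if it passes audit and, when flagged, a human approved it."""
    if not passes_structural_audit(session):
        return False
    if session.get("flagged", False):
        return session.get("human_approved", False)
    return True


# Hypothetical sessions.
sessions = [
    {"task": "scan host", "skill": "recon", "commands": ["nmap -sV {TARGET_IP}"],
     "outcome": "pass"},
    {"task": "scan host", "skill": "recon", "commands": [], "outcome": "pass"},
    {"task": "test login", "skill": "web", "commands": ["curl -I http://{TARGET_IP}/"],
     "outcome": "pass", "flagged": True},
    {"task": "test login", "skill": "web", "commands": ["curl -I http://{TARGET_IP}/"],
     "outcome": "pass", "flagged": True, "human_approved": True},
]

training_set = [s for s in sessions if eligible_for_training(s)]
print(len(training_set))  # 2
```

The default is exclusion: a flagged session stays out of the training set until a human explicitly approves it.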

Where the System Stands Now

ARCHER currently produces structured session logs from every eval run, with a Tier 1 structural audit running automatically and a Tier 2 LLM-as-judge scoring system for sessions that pass Tier 1. The routing classifier is trained and deployed across all 15 active skill categories. The playbook accumulates winning commands across sessions, with IP abstraction so commands generalize across targets. Failure analysis is supported by automated taxonomy reports and halt quality metrics.
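IP abstraction for playbook commands can be illustrated with a small sketch: literal IPv4 addresses are replaced by a placeholder when a winning command is stored, and a new target is bound when the command is reused. The placeholder `{TARGET_IP}` and both function names are assumptions for this example, not ARCHER's actual format.

```python
import ipaddress
import re

# Matches dotted-quad candidates; each one is validated before replacement.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def abstract_ips(command, placeholder="{TARGET_IP}"):
    """Replace literal IPv4 addresses with a placeholder so the command is reusable."""
    def replace(match):
        try:
            ipaddress.IPv4Address(match.group(0))
        except ValueError:
            return match.group(0)  # not a valid address, e.g. 999.1.2.3
        return placeholder
    return IPV4_CANDIDATE.sub(replace, command)


def bind_target(template, target_ip, placeholder="{TARGET_IP}"):
    """Fill the placeholder with a concrete target for a new session."""
    return template.replace(placeholder, target_ip)


template = abstract_ips("nmap -sV -p 22,80 192.0.2.15")
print(template)                               # nmap -sV -p 22,80 {TARGET_IP}
print(bind_target(template, "198.51.100.7"))  # nmap -sV -p 22,80 198.51.100.7
```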

The fine-tuning pipeline is built and validated on a test run. The constraint is data volume — fine-tuning requires sufficient high-quality examples per skill domain before the investment in a training run is warranted.

What We're Building Toward

The next iteration of the learning loop closes the gap between evaluation and training. Rather than periodic fine-tuning runs when data volume crosses a threshold, the goal is a continuous feedback path: sessions complete, are audited, and contribute to a rolling fine-tuned adapter that reflects current operational data rather than a snapshot from the last training run.

Longer term, structured residual artifacts — per-session records that capture not just what happened but why specific decisions were made — create a richer training signal. A session log tells you the outcome; a residual artifact tells you the reasoning chain that produced it. That distinction matters for training models that don't just replicate successful behavior but understand why it succeeded.

The human layer's knowledge is the hardest to formalize and the most valuable to preserve. The tacit judgment about which failures matter, which fixes are durable versus symptomatic, and which objectives are worth adding to the evaluation suite — that knowledge currently lives in GitHub issues, commit messages, and session-close notes. Making it more structured and queryable is an open problem worth working on.


Accountability Is Not Transferable

A human must remain responsible for every decision that carries legal or ethical weight. This is not a safety constraint applied after the fact. It is the design principle the entire research stack is built on.

CEOs delegate authority. They cannot delegate responsibility. The same rule applies to AI-assisted security operations. You can hand over the execution; a named professional must own the outcome. This is not a position the Centaur Model takes reluctantly; it is the position that makes the model work. The human layer's value is not in performing mechanical tasks. It is in providing the judgment and accountability that no probabilistic system can provide, and for which no regulatory framework will accept a substitute.


Four Questions Worth Asking

These four questions clarify where an AI security tool's architecture actually places each responsibility — and whether the answers hold up under scrutiny.

Who routes? What decides which analytical workflow handles a given task? When routing lives in the model layer, the same task phrased differently can land in the wrong skill domain — producing the wrong tooling, the wrong halt criteria, and findings for the wrong objective. A trained classifier with a measurable confusion matrix handles this more reliably.

Who halts? What determines when a session is complete? When halt detection is delegated to the model, two problems follow: sessions can terminate on plausible-but-false success signals, or run indefinitely because the model never emits a clean completion token. Both are harder to catch than a session that fails explicitly.

Who is accountable? When a finding drives a remediation decision, who is responsible for its accuracy? In regulated environments, accountability has to rest with a named human professional — not because humans are more capable than models, but because no regulatory framework accepts "the AI decided" as a defensible answer.

Who verifies ground truth? When the model claims the objective is complete — a shell obtained, a vulnerability confirmed, a credential recovered — what checks whether that claim reflects actual target state? A Centaur implementation probes the target after any success claim and confirms the asserted state change through a mechanism the model cannot influence. The model claims. The code verifies. The finding rests on the verification, not the claim.

A Centaur implementation answers all four: the code layer routes, the code layer halts against a verified condition, the code layer confirms findings against target state, and a named analyst who reviewed and authorized the output is accountable for it.
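The halting answer can be sketched as a session loop in which the model only proposes steps and the code decides when to stop. Everything named here (`run_session`, the callables, the turn budget of 10, the stub commands) is a hypothetical illustration of the pattern rather than ARCHER's control loop.

```python
def run_session(propose_step, execute, verified, max_turns=10):
    """Loop until the code-layer check confirms the objective or the budget runs out.

    propose_step: model layer; returns the next command given the history.
    execute:      runs a command and returns its raw output.
    verified:     code-layer check against target state; returns bool.
    """
    history = []
    for turn in range(1, max_turns + 1):
        command = propose_step(history)
        output = execute(command)
        history.append((command, output))
        if verified():
            return {"status": "verified", "turns": turn, "history": history}
    # An explicit failure is easier to catch than a false success.
    return {"status": "budget_exhausted", "turns": max_turns, "history": history}


# Stubs standing in for the model layer and the target.
state = {"root": False}
script = iter(["enumerate", "exploit-attempt", "escalate"])


def fake_execute(command):
    if command == "escalate":
        state["root"] = True
    return f"ran {command}"


result = run_session(lambda history: next(script), fake_execute, lambda: state["root"])
print(result["status"], result["turns"])  # verified 3
```

Neither exit path depends on the model announcing completion: the session ends on a verified state or on an exhausted budget, and both outcomes are recorded. Human sign-off on the resulting finding happens outside this loop.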


Further Reading

The research that develops these ideas in depth:

  • The Stochastic Trap — Why placing a probabilistic system in a role that requires deterministic behavior produces compounding failures, and why prompt engineering doesn't fix it.
  • Beyond Pass Rate: A Benchmark and Diagnostic Decomposition for LLM Security Agents — What aggregate pass rate conceals, and how to measure agent quality in a way that actually guides improvement. (pending release)
  • What Aggregate Pass Rate Hides — The decomposition framework in detail: OA rate, false positive rate, and halt discipline read together. (pending release)
  • Range Lock-In — How app-specific training produces agents that solve the lab target but fail on anything else. (pending release)
  • ARCHER Failure Mode Inventory — The empirical case for why the human oversight layer is non-negotiable: a taxonomy of 16 failure classes from five weeks of high-cadence eval-driven development, and what automation can and cannot prevent. (pending release)

Additional papers on the Centaur Framework formal specification, training data integrity, and adversarial robustness are in development.


Jay Hawkins spent twenty years in the U.S. Army, including a decade in cyber operations — serving at USCYBERCOM, USCENTCOM, USNORTHCOM, and USEUCOM — and holds an active TS/SCI clearance. He builds local-first AI security tools and writes about the methodology, the hard lessons, and the compliance implications of doing it in production.
