Building AI You Can Trust: A Practitioner's Methodology for AI Security Tooling
Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs
The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.
The Problem This Paper Solves
Every AI security tool claims to work. Most of them can't prove it.
The claim is always the same — AI-powered detection, AI-augmented analysis, AI-driven response — and it is almost never accompanied by a methodology that would let you verify it independently. How is the system tested? Against what? Over what time period? What counts as a pass? What distinguishes a model failure from a harness failure? What happens when the environment changes?
These are not rhetorical questions. They are the minimum bar for trusting any tool whose output carries operational or legal weight. A penetration test finding that leads to a remediation decision. A detection that triggers an incident response. A threat hunting query that shapes an executive's risk assessment. The output of an AI security tool affects real decisions made by real people with real accountability. The standard for trusting it has to be higher than "it worked in the demo."
This paper documents the methodology I developed while building ARCHER, a local-first AI agent for security operations, over roughly six weeks of intensive development and production evaluation against live targets. It is not academic. It does not describe a theoretical framework. It describes what I actually built, the mistakes I made, what the mistakes taught me, and why the methodology that emerged from them is the right one for anyone building AI security tooling they intend to trust.
1. Start with a Hard Question About Architecture
Before the first line of code, I had to answer a question most AI security tool builders skip: what is the AI actually responsible for, and what is everything else responsible for?
This is not a philosophical question. It is the most consequential engineering decision in the system, and answering it wrong produces tools that are either useless (the AI has so much help it isn't doing anything interesting) or untrustworthy (the AI is doing things the system can't verify or enforce).
The answer I landed on, after building and rebuilding the wrong versions several times, is a three-layer split:
The model layer handles: command generation, output interpretation, multi-turn reasoning, attack chain narrative, MITRE mapping. Anything that requires probabilistic judgment over an open-ended input space.
The code layer handles: task routing, command execution, safety constraint enforcement, halt detection, session logging, audit trail production. Anything that must be deterministic, enforceable, and verifiable.
The human layer handles: defining scope and acceptable risk, interpreting findings against organizational context, authorizing irreversible or high-impact actions, final remediation decisions, and accountability for the output.
The line between model and code is the most important boundary in the system. If you find yourself writing code to compensate for model unreliability — a parser that handles output variations, a fallback when the model doesn't follow a format, a safety check after generation — the model is in a code role. Stop and redesign. That compensation code will drift, fail silently, and eventually produce the kind of output failure that damages trust in the entire system.
The line between code and human is the second most important boundary. Automation that removes human judgment at decision points carrying legal or organizational weight is not a feature — it is a liability transfer. The centaur model works because the human is not just reviewing output; the human is the accountability layer for decisions the system explicitly cannot make.
This architecture is not optional. It is the prerequisite for everything else in this methodology.
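A minimal sketch of how the three layers might meet at a single decision point, under stated assumptions: the names `ProposedCommand`, `Decision`, and `code_layer_gate`, the scope network, and the list of high-impact tools are all hypothetical illustrations, not ARCHER's actual code. The model only proposes; deterministic code decides whether the proposal is executed, escalated to a human, or rejected.

```python
from dataclasses import dataclass
from enum import Enum
from ipaddress import ip_address, ip_network


class Decision(Enum):
    ALLOW = "allow"              # code layer may execute immediately
    NEEDS_HUMAN = "needs_human"  # human layer must authorize first
    REJECT = "reject"            # out of scope; never executed


@dataclass(frozen=True)
class ProposedCommand:
    """Model-layer output: a proposal, never an action."""
    target_ip: str
    argv: tuple


# Hypothetical policy data. In practice this would come from the scope the
# human layer defined for the engagement, not from constants in code.
AUTHORIZED_SCOPE = ip_network("192.168.56.0/24")
HIGH_IMPACT_TOOLS = frozenset({"msfconsole", "hydra"})


def code_layer_gate(cmd: ProposedCommand) -> Decision:
    """Deterministic pre-execution check applied to every model proposal."""
    try:
        in_scope = ip_address(cmd.target_ip) in AUTHORIZED_SCOPE
    except ValueError:
        in_scope = False  # an unparseable target is treated as out of scope
    if not in_scope or not cmd.argv:
        return Decision.REJECT
    if cmd.argv[0] in HIGH_IMPACT_TOOLS:
        return Decision.NEEDS_HUMAN
    return Decision.ALLOW
```

The design choice worth noticing is that the gate never inspects the model's reasoning, only the concrete command it proposed, so the same input always yields the same decision.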
2. Build the Eval Harness Before the Agent
The single most counterintuitive decision in ARCHER's development was building the evaluation harness before building most of the agent capabilities. The instinct is to build the thing first and figure out how to measure it later. That instinct is wrong, and it is expensive to correct.
Here is why it matters: the eval harness is not a test suite. It is a co-producer of the quality signal. An agentic system that executes commands against a real environment does not produce a static output you can compare against a ground truth. It produces a trajectory — a sequence of actions, observations, and decisions — whose correctness depends on whether the environment responded as expected, whether the verifier code correctly classified the outcome, and whether stopping conditions fired at the right time. When a session fails, the failure might be the model, or it might be any of these harness components.
If you build the agent first and the harness second, you will spend months debugging agent behavior that is not broken. You will misattribute harness failures to model failures, chase phantom regressions, and build compensating logic that masks the actual problem. I know this because I did it, in a smaller scope, and the lesson cost several weeks.
Build the harness first. Define the objectives before writing skill code. Write the verifier functions — the ground-truth checks that confirm whether the agent actually accomplished the task — before building the agent's capability to accomplish them. Then build the agent against a test suite that already exists.
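The ordering described above can be sketched as follows. This is an illustration under assumptions, not ARCHER's harness: `Objective`, `run_suite`, the `files_present` state key, and the marker-file objective are hypothetical. The point is that the objective and its verifier exist, and can be run against a do-nothing agent, before any agent capability is written.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Snapshot of target state gathered by the harness itself, not by the agent.
TargetState = Mapping[str, object]


@dataclass(frozen=True)
class Objective:
    name: str
    description: str
    verifier: Callable[[TargetState], bool]  # ground-truth check


def verify_marker_file(state: TargetState) -> bool:
    """Pass only if the harness-observed target state contains the marker."""
    return "/root/proof.txt" in state.get("files_present", ())


# Hypothetical objective, defined before any skill code exists.
OBJECTIVES = [
    Objective("privesc-marker", "Create /root/proof.txt as root", verify_marker_file),
]


def run_suite(agent: Callable[[Objective], None],
              snapshot: Callable[[], TargetState]) -> dict:
    """Run each objective, then let the harness inspect the target state."""
    results = {}
    for obj in OBJECTIVES:
        agent(obj)                                    # agent acts on the environment
        results[obj.name] = obj.verifier(snapshot())  # harness checks ground truth
    return results
```

Running `run_suite` with a placeholder agent that does nothing should report every objective as failed; that failing suite is the starting point the agent is then built against.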
What the harness must measure
An eval harness for an agentic system needs to measure three things separately. Aggregate pass rate conflates them, which makes it nearly useless for diagnosis.
Objective Achieved (OA): Did the session end with a verified success — [OBJECTIVE_ACHIEVED] confirmed by a deterministic verifier that checks actual target state? This is the primary quality signal. It is not the only one.
False Positive rate (FP): Of all sessions in which the agent emits a completion signal, what fraction are rejected by the independent verifier? A system that halts cleanly when it cannot complete a task is behaving correctly. A system that frequently declares objectives achieved when the verifier disagrees is producing false confidence — the most dangerous failure mode in a security context.
Halt Discipline (HD): Of all sessions that do not produce a verified completion, what fraction exhaust the command budget without halting cleanly? An agent that runs out its budget this way is neither completing the task nor failing cleanly; it is consuming resources without producing a usable outcome. Because HD counts undisciplined sessions, lower is better, and a high HD is often a signal that the stopping conditions in the harness are wrong, not that the model is broken.
These three signals are derivable from logs any properly instrumented agentic harness should already produce. They require no additional infrastructure. They are not academic decomposition — they are the difference between knowing your system is broken and knowing why it is broken and where to fix it.
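As a rough illustration of how the three signals fall out of per-session logs, here is a minimal sketch. The `SessionRecord` fields and the `decompose` function are hypothetical names chosen for this example; the only assumption is that each session log records whether a completion was claimed, whether the verifier confirmed it, and whether the command budget ran out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRecord:
    claimed_complete: bool  # agent emitted a completion signal
    verified: bool          # independent verifier confirmed target state
    budget_exhausted: bool  # command budget ran out without a clean halt


def _ratio(numerator: int, denominator: int):
    """Return a fraction, or None when the denominator makes it undefined."""
    return numerator / denominator if denominator else None


def decompose(sessions):
    """Compute OA, FP, and HD separately instead of one aggregate pass rate."""
    achieved = [s for s in sessions if s.claimed_complete and s.verified]
    claimed = [s for s in sessions if s.claimed_complete]
    rejected = [s for s in claimed if not s.verified]
    unverified = [s for s in sessions if not (s.claimed_complete and s.verified)]
    exhausted = [s for s in unverified if s.budget_exhausted]
    return {
        "OA": _ratio(len(achieved), len(sessions)),    # verified successes / all sessions
        "FP": _ratio(len(rejected), len(claimed)),     # rejected claims / all claims
        "HD": _ratio(len(exhausted), len(unverified)), # budget exhaustion / non-successes
    }
```

For four sessions consisting of one verified success, one rejected claim, one budget exhaustion, and one clean halt, this yields OA 0.25, FP 0.5, and HD one third, three different numbers that a single pass rate of 25% would hide.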
The harness validity trap
The most subtle failure mode I encountered was harness invalidity: a verifier that claimed to check ground truth but was checking something else. A port scanner objective whose verifier was checking whether Nmap output contained specific text rather than whether the service was actually detected. A privilege escalation objective whose verifier was matching against a UID string that appeared in log output even when the escalation had failed.
An invalid verifier produces results that look like agent performance data but are actually harness state data. Running the agent against it harder, tuning the prompts, changing the model — none of it helps, because the problem is not the agent.
Before trusting any eval result, ask: if the agent did nothing at all, what would this verifier return? If the answer is not a deterministic fail, the verifier is not checking what you think it is checking.
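The "what if the agent did nothing" question can be turned into an automated check. The sketch below is illustrative only: the verifier signature `(target_state, transcript)`, the `effective_uid` key, and the `audit_verifier` helper are assumptions made for this example. It contrasts a text-matching verifier of the kind described above with one that checks target state.

```python
from typing import Callable, Mapping

Verifier = Callable[[Mapping[str, object], str], bool]


def text_matching_verifier(state: Mapping[str, object], transcript: str) -> bool:
    """Flawed: trusts a UID string that may appear in output after a failed escalation."""
    return "uid=0" in transcript


def state_checking_verifier(state: Mapping[str, object], transcript: str) -> bool:
    """Checks target state observed by the harness and ignores what was printed."""
    return state.get("effective_uid") == 0


def audit_verifier(verifier: Verifier, baseline_state: Mapping[str, object]) -> dict:
    """Probe a verifier with two sessions that must both fail."""
    return {
        # The agent did nothing: unchanged state, empty transcript.
        "null_session_passes": verifier(baseline_state, ""),
        # The agent claimed success in text, but the target never changed.
        "hallucinated_success_passes": verifier(
            baseline_state, "[OBJECTIVE_ACHIEVED] uid=0(root)"
        ),
    }


def is_valid(verifier: Verifier, baseline_state: Mapping[str, object]) -> bool:
    return not any(audit_verifier(verifier, baseline_state).values())
```

With a baseline state of `{"effective_uid": 1000}`, the text-matching verifier passes the hallucinated session and is flagged invalid, while the state-checking verifier fails both probes as it should.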
3. Use a Multi-Tier Quality Pipeline, Not a Binary Gate
Binary pass/fail at the session level discards too much signal. A session that correctly enumerates services, selects the right exploit, runs it against the right target, and stops cleanly when the exploit fails — that session teaches something. A binary fail gate throws it away along with sessions where the agent hallucinated a target that didn't exist.
The training pipeline I use has three tiers:
Tier 1 (T1): behavioral classification. Every session is classified into one of six outcome classes: no_*_signal (tool was never used), partial_signal (tool ran but evidence is incomplete), wrong_approach (wrong tool selected), infra_issue (environment failure), success (verified pass), and disciplined_halt (clean stop within budget). This classification is deterministic and fast: it requires no LLM call, just log parsing. T1 tells you what category of outcome you are looking at.
Tier 2 (T2): LLM-as-judge quality scoring. A lightweight LLM rates each session 0–3 on four dimensions: findings grounding (are claims linked to specific tool output?), tool-task alignment (was the right tool selected for the task?), completion genuineness (was the claimed completion actually achieved?), and scope adherence (did the agent stay within the authorized target?). Sessions scoring below 2 are excluded from training data. Sessions scoring 2–3 are training candidates. This gate catches the failure mode binary OA misses: a session that passed OA by luck, or a session that failed OA but demonstrated high-quality partial completion worth learning from.
Tier 3 (T3): human review. A random sample of T2 candidates is reviewed by a human. T3 does not re-score sessions mechanically; it asks whether the T2 judgment was correct and whether the training example teaches the right thing. When T3 disagrees with T2, the override is logged and used to recalibrate T2. This is the layer that catches T2 drift: the judge's scoring criteria shifting over time in ways that individual session scores don't reveal.
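To make the "no LLM call, just log parsing" property of T1 concrete, here is a small sketch. Everything in it is an assumption made for illustration: the `ParsedSession` fields, the precedence order of the checks, and the rule separating "no tool used" from "wrong tool used" are hypothetical and are not ARCHER's actual classifier. The `no_*_signal` label is kept exactly as written above, with its wildcard unexpanded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedSession:
    """Fields a log parser might extract; names are hypothetical."""
    infra_error: bool     # environment failed (target down, tool crashed)
    expected_tool: str    # tool the objective calls for
    tools_used: tuple     # tools that appear in the command log
    verified: bool        # deterministic verifier confirmed the objective
    halted_cleanly: bool  # explicit halt within the command budget


def classify_t1(s: ParsedSession) -> str:
    """Deterministic T1 classification; the check order is an assumption."""
    if s.infra_error:
        return "infra_issue"
    if s.verified:
        return "success"
    if not s.tools_used:
        return "no_*_signal"      # wildcard kept as in the text
    if s.expected_tool not in s.tools_used:
        return "wrong_approach"
    if s.halted_cleanly:
        return "disciplined_halt"
    return "partial_signal"
```

Because the function is pure and reads only parsed log fields, the same session always lands in the same class, which is what lets T1 run on every session at negligible cost.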
This three-tier structure applies to any agentic system, not just ARCHER. The specific dimensions change by domain — a threat hunting agent might score differently on scope_adherence than a pentest agent — but the architecture is domain-agnostic.
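The T2 gate and the T3 sampling step can be sketched in a few lines. This is a hedged illustration, not the pipeline's real code: `T2Score`, `t2_passes`, `sample_for_t3`, and `override_rate` are hypothetical names, and the rule that every one of the four dimensions must reach 2 is an assumption, since the text does not say whether the threshold applies per dimension or to an aggregate.

```python
import random
from dataclasses import dataclass

DIMENSIONS = (
    "findings_grounding",
    "tool_task_alignment",
    "completion_genuineness",
    "scope_adherence",
)


@dataclass(frozen=True)
class T2Score:
    session_id: str
    scores: dict    # dimension name -> integer 0..3 from the judge
    reasoning: str  # judge's explanation, kept alongside the numbers


def t2_passes(score: T2Score, threshold: int = 2) -> bool:
    """Assumed rule: every dimension must reach the threshold."""
    missing = [d for d in DIMENSIONS if d not in score.scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return all(score.scores[d] >= threshold for d in DIMENSIONS)


def sample_for_t3(candidates: list, fraction: float, seed: int) -> list:
    """Draw a reproducible random sample of T2 candidates for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not candidates:
        return []
    k = max(1, round(len(candidates) * fraction))
    return random.Random(seed).sample(candidates, k)


def override_rate(reviews: list):
    """reviews: list of booleans, True where the human overrode T2."""
    return sum(reviews) / len(reviews) if reviews else None
```

Tracking `override_rate` per review batch is one simple way to surface the T2 drift described above: a rising rate suggests the judge's criteria have shifted even if individual scores look plausible.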
4. Design Skill Packs for Generalization, Not Box-Solving
The failure mode I call range lock-in is the most predictable failure in AI security agent training, and it is almost entirely a data design problem.
If your training data consists of sessions where the agent solved Metasploitable 2 targets, the model learns to solve Metasploitable 2 targets. It learns the specific port numbers, the specific banner strings, the specific exploit module names for that environment. Apply it to a different target and the learned pattern (use exploit/unix/ftp/vsftpd_234_backdoor, set RHOSTS 192.168.56.102) is wrong in every specific, but the agent doesn't know that, because its training signal came entirely from one environment.
The fix is a two-layer hint design. Every skill pack that provides task-specific guidance must include two blocks:
The target-specific block gives exact commands for the training environment. This drives eval pass rate and provides the model with a concrete success path during training. It is what the model executes when it recognizes the specific target.
The generic companion block teaches the transferable pattern using domain placeholders: <login-endpoint>, <token-field>, <target-ip>. The model fills these from reconnaissance output. This is what the model should use against any target it hasn't seen before.
A skill pack with only the target-specific block trains a model that can solve the box. A skill pack with both trains a model that understands the vulnerability class. The distinction is the difference between a demonstration and a deployable tool.
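A two-layer hint can be represented as plain data plus a small renderer. The sketch below is one possible shape, assuming hypothetical names (`SkillHint`, `render_generic`, `leaks_training_literals`) and a recon result expressed as a dictionary keyed by placeholder name. It also includes a simple lint that flags IPv4 literals left in the generic block, since a hard-coded address there would reintroduce range lock-in.

```python
import re
from dataclasses import dataclass

PLACEHOLDER = re.compile(r"<([a-z][a-z0-9-]*)>")
IPV4_LITERAL = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


@dataclass(frozen=True)
class SkillHint:
    target_specific: tuple  # exact commands for the training environment
    generic: tuple          # transferable pattern with <placeholders>


def render_generic(hint: SkillHint, recon: dict) -> list:
    """Fill placeholders in the generic block from reconnaissance output."""
    def fill(match):
        key = match.group(1)
        if key not in recon:
            raise KeyError(f"recon output has no value for <{key}>")
        return recon[key]

    return [PLACEHOLDER.sub(fill, line) for line in hint.generic]


def leaks_training_literals(hint: SkillHint) -> bool:
    """True if the generic block still contains a hard-coded IPv4 address."""
    return any(IPV4_LITERAL.search(line) for line in hint.generic)


# Target-specific lines are the ones quoted in the text; the generic line is
# an illustrative companion written for this example.
vsftpd_hint = SkillHint(
    target_specific=(
        "use exploit/unix/ftp/vsftpd_234_backdoor",
        "set RHOSTS 192.168.56.102",
    ),
    generic=("set RHOSTS <target-ip>",),
)
```

Raising on a missing placeholder, rather than leaving it unfilled, keeps an unrendered `<target-ip>` from ever reaching command execution.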
5. Instrument Everything, Trust Nothing
Six weeks of eval-driven development produced a clear lesson about observability: the failure modes you can't see are the ones that will damage trust in the system.
Every session in ARCHER produces a complete, timestamped log: each command issued, raw output returned, findings linked to specific tool output, halt reason, verifier result, and T2 score. This is not logging for debugging — it is logging for accountability. When a finding is questioned, the log is the chain of evidence. When a verifier result is disputed, the log shows exactly what output the verifier received. When a training data quality concern arises, the log is the audit trail.
The specific instrumentation decisions that matter:
Log the verifier input, not just the verifier result. Knowing a session failed tells you nothing about why. Knowing what the verifier received and what it was checking tells you whether the model failed or the harness failed.
Log halt reasons explicitly, not just halt outcomes. "Session halted" is not useful. "Session halted at command 4 because CMD[3] output contained no exploit signal and the budget was exhausted" is actionable.
Log T2 scores with reasoning. A score of 1 with no reasoning attached is noise. A score of 1 with "agent correctly identified the target and selected the right tool but produced no output evidence linking findings to tool results" tells you exactly what training signal to add.
Track verifier coverage. For every objective in the test suite, periodically ask: what would this verifier return for a session that did nothing? For a session that hallucinated a success? These are not hypotheticals — they are the two most common verifier failure modes, and the only way to catch them before they corrupt training data.
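The instrumentation decisions above can be enforced at write time rather than left to convention. The following is a minimal sketch, assuming a hypothetical `SessionLog` record serialized as one JSON line per session; the field names and validation rules are illustrative and are not ARCHER's log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class SessionLog:
    session_id: str
    commands: list         # each entry: {"cmd": ..., "raw_output": ...}
    halt_reason: str       # explicit reason, not just "halted"
    verifier_input: dict   # exactly what the verifier was given
    verifier_result: bool
    t2_score: int
    t2_reasoning: str      # judge's explanation for the score
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def validate(log: SessionLog) -> list:
    """Return a list of problems that would make the record unusable as evidence."""
    problems = []
    if not log.halt_reason.strip():
        problems.append("halt_reason is empty")
    if not log.verifier_input:
        problems.append("verifier_input is missing")
    if not log.t2_reasoning.strip():
        problems.append("t2_reasoning is empty")
    return problems


def to_json_line(log: SessionLog) -> str:
    """Serialize one session, refusing to write a record with missing context."""
    problems = validate(log)
    if problems:
        raise ValueError("; ".join(problems))
    return json.dumps(asdict(log), sort_keys=True)
```

Refusing to write an incomplete record is a deliberate choice in this sketch: a log entry without the verifier input or halt reason cannot later separate a model failure from a harness failure.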
6. The Three Numbers That Tell You If the System Is Working
Once the harness is instrumented and running continuously, three numbers tell you the system's health:
OA rate trend — not the absolute value, but the direction and variance. A stable OA rate at 85% is healthy. An OA rate that swings between 60% and 100% across consecutive runs with no corresponding changes to the agent is a harness stability problem, not a model quality problem.
FP rate — should be near zero. A system that frequently claims objective completion without verified success is not just inaccurate; it is systematically producing false confidence. An FP rate above 5% is a signal that the stopping conditions or success-claim logic needs redesign.
T2 pass rate — should be above 60% for training data to accumulate fast enough to make fine-tuning viable. A T2 pass rate that drops suddenly usually indicates one of three things: a verifier miscalibration that's excluding good sessions, a T2 judge drift that's raising the quality bar without cause, or a genuine model regression that's producing low-quality partial completions across multiple objectives.
When all three numbers are stable and in range, the system is working. When any of them moves unexpectedly, the decomposition tells you where to look.
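A health check over the three numbers might look like the sketch below. The 5% FP ceiling and the 60% T2 floor come from the text above; the function name `health_flags`, the flag labels, and the `max_oa_swing` default of 0.15 are assumptions made for this example, since the text gives no numeric threshold for OA instability.

```python
def health_flags(oa_history, fp_rate, t2_pass_rate, max_oa_swing=0.15):
    """Return the list of signals that are out of range.

    oa_history:   OA rates from consecutive runs with no agent changes
    fp_rate:      fraction of completion claims rejected by the verifier
    t2_pass_rate: fraction of sessions passing the T2 gate
    max_oa_swing: assumed tolerance for max-min OA spread (hypothetical value)
    """
    flags = []
    if len(oa_history) >= 2 and max(oa_history) - min(oa_history) > max_oa_swing:
        flags.append("oa_unstable")  # look at harness stability first
    if fp_rate > 0.05:
        flags.append("fp_high")      # success-claim or stopping logic
    if t2_pass_rate < 0.60:
        flags.append("t2_low")       # verifier, judge drift, or regression
    return flags
```

A run history of 0.85, 0.84, 0.86 with an FP rate of 0.01 and a T2 pass rate of 0.70 returns no flags, while a history swinging between 0.6 and 1.0 with an FP rate of 0.08 and a T2 pass rate of 0.50 returns all three.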
7. What Generalizes Beyond Security
The architecture in this paper — three-layer responsibility split, instrumentable eval harness, multi-tier quality pipeline, generalization-first skill design — was developed for a security agent. The principles apply to any domain where AI output carries operational weight and failure has real consequences.
Medical decision support. Legal document analysis. Financial risk assessment. Autonomous infrastructure management. In all of these, the same questions matter: what is the AI actually responsible for? How do you know when it's working? How do you separate model failure from harness failure? How do you prevent the model from learning to solve the demo rather than the problem?
The eval harness and training pipeline described here are being separated from ARCHER into a domain-agnostic framework — designed so that any agentic system, in any domain, can implement the same OA/FP/HD decomposition and multi-tier quality pipeline without building it from scratch. That work is underway and will be documented here as it progresses.
What It Takes
Building an AI security tool you can trust requires accepting a constraint most builders want to avoid: you cannot trust the system until you have built the infrastructure to verify it, and building that infrastructure takes about as long as building the agent itself.
The eval harness, the verifier functions, the logging, the T2 pipeline, the trend analysis — none of this is the agent. All of it is what makes the agent trustworthy. In my experience building ARCHER, the verification infrastructure took roughly as much time to build as the agent capabilities it verified. That ratio is not inefficiency. It is what the standard of evidence for a security tool actually requires.
The result is a system that can answer the questions its users will eventually ask: how do I know it works? What happens when it's wrong? How do you know the difference? Those questions do not have good answers without this methodology. With it, they have specific, evidence-grounded, auditable answers.
That is the standard. Everything else is a demo.
Standards Convergence
The methodology described here was derived bottom-up from operational failure analysis: what broke, why, and what structure prevented it from happening again. The OWASP AI Security Verification Standard (AISVS) C9 (draft, 2026) arrived at the same structural requirements from a standards direction. AISVS C9 requires pre-execution gates enforcing hard policy constraints, which corresponds to the code layer's verification gate. It separately requires human approval gates for irreversible or privileged actions, which corresponds to the human layer's authorization function, and audit records that reconstruct the full action chain, which corresponds to the chain-of-evidence logging this methodology mandates.
The convergence is not coincidental. The same failure modes that drove this methodology (unverified model assertions treated as findings, irreversible actions taken without human sign-off, audit trails that cannot reconstruct what actually happened) are exactly the risks AISVS C9 was designed to prevent. An implementation built to this methodology therefore also satisfies the structural intent of AISVS C9.
Further Reading
- What Aggregate Pass Rate Hides — The three-axis decomposition in detail, with case studies from 218 real eval runs
- Beyond Pass Rate — Benchmark Paper — Full longitudinal dataset and academic treatment of the decomposition
- Range Lock-In — Analysis of the training data generalization failure and the two-layer fix
- Operational Failure Modes Taxonomy — 16 failure classes derived from six weeks of high-cadence eval-driven development
- Context as Infrastructure — Accountability loops in a three-layer AI system
- The Centaur Model — The human-machine collaboration framework underlying this methodology