Sufficiency vs. Optimality: The Problem AI Security Tools Haven't Solved

Centaur Security Labs — Jay Hawkins

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

Abstract

Most AI security tools are designed to answer the question: did the agent find a correct solution? This is the sufficiency model — binary pass/fail on whether an objective was achieved. It is the right question for evaluation. It is not sufficient for production use, where multiple correct solutions exist but differ significantly in stealth, operational safety, transferability, evidence left behind, and alignment with engagement-specific constraints. This paper argues that the transition from sufficiency to optimality — from "find a correct solution" to "find the best correct solution given constraints the AI cannot fully know" — requires three architectural changes: quality-weighted training data selection, iterative solution exploration within sessions, and a principled human-in-the-loop mechanism at the point where "best" is defined. Current AI security architectures, including ARCHER, have the infrastructure to begin this transition. What is missing is not capability — it is design intent.

1. The Multiple-Correct-Answer Problem

Most problems have more than one correct solution.

This is trivially true in mathematics — there are multiple proofs of the Pythagorean theorem, all valid. It is equally true in software engineering — there are multiple correct implementations of any function, varying in performance, readability, and maintenance cost. And it is true in security operations, where the space of valid exploitation paths for a given vulnerability class is large, and the paths differ substantially in properties that matter operationally.

The implications for AI systems are not widely discussed, and the security AI space has largely avoided them by adopting the sufficiency model: define an objective, run the agent, check whether the objective was achieved. Pass or fail. This model is appropriate for measuring capability. It is not sufficient for production deployment.

The gap between sufficiency and optimality is not a gap in AI capability — current models can generate multiple candidate solutions, explore alternative approaches, and evaluate outputs against quality criteria. The gap is in how AI security systems have been designed to use those capabilities. The field has built systems that stop at "found a correct answer" when the question production requires is "found the best correct answer for this specific operational context."

2. What "Multiple Correct Solutions" Means in Security AI

In a penetration testing context, multiple correct solutions are the norm, not the exception.

Consider SQL injection. There are at least five distinct approaches that produce confirmed database access on a vulnerable target: error-based injection, union-based injection, blind boolean injection, time-based blind injection, and out-of-band injection. All are correct. All achieve the objective. They differ along dimensions that matter operationally:

Approach	Noise in logs	Reliability	Transferability	Time
Error-based	High	High	Medium	Fast
Union-based	Medium	High	High	Fast
Boolean blind	Low	Medium	High	Slow
Time-based blind	Low	Low	High	Very slow
Out-of-band	Very low	Low	Low	Varies

An agent that finds any of these has "solved" the SQL injection objective. An agent that selects among them based on engagement context — a red team operation prioritizing stealth chooses boolean blind; an authorized pentest on a time-constrained schedule chooses union-based — has solved a harder and more useful problem.
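The selection step described above can be sketched as a weighted score over the table's dimensions. This is a minimal illustration, not ARCHER's implementation: the 0-1 numeric mapping of the table's ordinal ratings, the value chosen for "Varies", the profile names, and the weights are all assumptions made for the example.

```python
# Ratings transcribed from the table above onto a 0-1 scale (higher is better).
# "stealth" is the inverse of log noise and "speed" the inverse of time.
# The numeric mapping, the 0.5 used for "Varies", and the weights are assumptions.
TECHNIQUES = {
    "error-based":      {"stealth": 0.00, "reliability": 1.0, "transferability": 0.5, "speed": 1.00},
    "union-based":      {"stealth": 0.33, "reliability": 1.0, "transferability": 1.0, "speed": 1.00},
    "boolean blind":    {"stealth": 0.67, "reliability": 0.5, "transferability": 1.0, "speed": 0.33},
    "time-based blind": {"stealth": 0.67, "reliability": 0.0, "transferability": 1.0, "speed": 0.00},
    "out-of-band":      {"stealth": 1.00, "reliability": 0.0, "transferability": 0.0, "speed": 0.50},
}

# Hypothetical engagement profiles: which dimensions matter, and how much.
PROFILES = {
    "red_team":           {"stealth": 0.5, "transferability": 0.3, "reliability": 0.2},
    "time_boxed_pentest": {"speed": 0.5, "reliability": 0.3, "transferability": 0.2},
}


def score(technique, profile):
    """Weighted sum of a technique's ratings under an engagement profile."""
    ratings = TECHNIQUES[technique]
    return sum(weight * ratings[dim] for dim, weight in PROFILES[profile].items())


def select_technique(profile):
    """Return the technique with the highest weighted score for the profile."""
    return max(TECHNIQUES, key=lambda name: score(name, profile))
```

Under these assumed weights, `select_technique("red_team")` picks boolean blind and `select_technique("time_boxed_pentest")` picks union-based, matching the two cases in the paragraph. The point of the sketch is that the profile has to come from somewhere; the task description alone does not supply it.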

Current AI security tools, including ARCHER, do not consistently distinguish between these cases. The eval harness passes any session that achieves confirmed database access. The training pipeline accepts any passing session. The model learns that all five approaches are correct, with no signal about which is better in which context.

This is not a flaw in the architecture — it is a consequence of the sufficiency model. The architecture was designed to measure whether objectives are achieved, not to differentiate among the quality of approaches that achieve them.

3. How AI Systems Navigate Solution Space

To understand the transition from sufficiency to optimality, it helps to understand how AI systems currently navigate the space of possible solutions.

Probabilistic generation. When a language model generates a sequence of commands or a chain of reasoning, it typically does not just select the single highest-probability output at each step. With a nonzero temperature, sampling causes the model to explore different branches of the possibility space, which is why the same prompt, run twice, can produce two different but equally valid exploitation approaches. This is the model's native capacity for solution diversity. Current systems largely waste it by running one session and declaring the result.
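A toy illustration of why repeated runs diverge: temperature-scaled sampling over a handful of candidate approaches. Real models sample tokens rather than whole approaches, so this is a deliberate simplification; the logits and the function names are hypothetical.

```python
import math
import random

APPROACHES = ["union-based", "error-based", "boolean blind", "time-based blind", "out-of-band"]
LOGITS = [2.0, 1.5, 1.0, 0.5, 0.0]  # hypothetical model preferences, not measured


def sample_index(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature); temperature 0 is greedy."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def run_sessions(n, temperature, seed=0):
    """Simulate n independent sessions and return the approach each one took."""
    rng = random.Random(seed)
    return [APPROACHES[sample_index(LOGITS, temperature, rng)] for _ in range(n)]
```

With temperature 0 every simulated session takes the same approach; with temperature 1.0 the sessions spread across several. A system that runs one session and stops sees only one draw from that spread.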

Evaluation as a filter, not a signal. In ARCHER's architecture, the verifier (verify_fn) confirms whether the objective was achieved. The T2 judge assesses session quality on a 0–3 scale across four dimensions. Both are currently used as filters — sessions that don't pass get excluded. Neither is used as a signal — to inform iteration, to select among competing approaches, or to weight training data by quality. The evaluators exist. The feedback loop from evaluator to generator is not closed.

Solution space is high-dimensional, not flat. The space of valid exploitation paths is not a list; it is a high-dimensional structure where paths vary along multiple independent axes (stealth, reliability, transferability, time, evidence footprint). AI systems can navigate this space. Under the sufficiency model they navigate it essentially at random, because no objective function guides them toward the dimensions that matter for a specific operational context.

4. The Sufficiency Trap in Training Data

The sufficiency model creates a specific pathology in training data: all passing sessions are treated as equally good, and the training signal pushes the model toward whichever approaches happen to appear in passing sessions, regardless of their operational quality.

This has two consequences.

First, brute-force approaches contaminate training. A session that achieves a SQL injection objective by exhaustively trying every common payload until one works is "correct" under the sufficiency model. So is a session that analyzes the application's error messages, identifies the injection point type, and selects the appropriate technique in three commands. Both pass. Both enter the training corpus. The model trained on both learns that exhaustive payload spraying is a valid approach — which it is, but it is not the approach a skilled practitioner would choose, and it is not the approach that transfers to novel targets.

Second, context-inappropriate solutions displace context-appropriate ones. If the eval harness runs objectives against a specific target in a specific configuration, the passing sessions cluster around approaches that work against that target and configuration. The model trained on those sessions learns approaches optimized for the training target — which is the range lock-in problem described in a companion paper. But there is a deeper version of the problem: even among approaches that generalize well, the ones that happen to succeed first in training sessions get overrepresented. The training data does not systematically select for the approaches that are best across the distribution of operational contexts.

Both pathologies are hidden inside the aggregate pass rate and cannot be detected from it. Surfacing and correcting them requires a quality signal: something that distinguishes a brute-force pass from a targeted pass.

5. What Optimality Requires

The transition from sufficiency to optimality requires three changes.

5.1 Quality-weighted training data selection

The T2 LLM-as-judge infrastructure already exists in ARCHER. It scores sessions on four dimensions: technique appropriateness, output interpretation accuracy, objective relevance, and operational discipline. A session that achieves an objective through targeted technique scores higher on T2 than one that achieves it through brute force. A session that correctly interprets partial output and adapts its approach scores higher than one that ignores relevant information.

The change required is small but consequential: use T2 scores as weights in training data selection, not just as a binary filter.

Currently, a session passes T2 if it exceeds a threshold on all four dimensions. Sessions below threshold are excluded. Sessions above threshold are accepted — all weighted equally. A session scoring 2.8/3.0 on all dimensions and a session scoring 2.0/3.0 on all dimensions are both "pass" and both enter the training corpus with equal weight.

Quality-weighted selection would preferentially include the 2.8/3.0 session. Over time, the training distribution would shift toward higher-quality techniques, more disciplined output interpretation, and better operational judgment. The eval infrastructure already produces the signal; the training pipeline just doesn't consume it.
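The difference between filtering and weighting can be sketched in a few lines. This is an assumed weighting scheme, not ARCHER's pipeline: the `Session` type, the 1.5 threshold, and the linear mapping from mean T2 score to sampling weight are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Session:
    session_id: str
    t2_scores: Tuple[float, float, float, float]  # four T2 dimensions, each 0-3


def binary_filter(sessions: List[Session], threshold: float = 1.5) -> List[Session]:
    """Current behavior: keep sessions that exceed the threshold on every dimension."""
    return [s for s in sessions if min(s.t2_scores) > threshold]


def quality_weights(sessions: List[Session], threshold: float = 1.5,
                    max_score: float = 3.0) -> Dict[str, float]:
    """Proposed behavior: passing sessions get sampling weights that grow with mean T2."""
    passing = binary_filter(sessions, threshold)
    raw = {s.session_id: (mean(s.t2_scores) - threshold) / (max_score - threshold)
           for s in passing}
    total = sum(raw.values())
    return {sid: value / total for sid, value in raw.items()} if total else {}
```

With one session at 2.8 on all dimensions and one at 2.0, the binary filter keeps both with equal standing, while `quality_weights` gives the 2.8 session the larger share of the sampling mass. Other mappings (softmax over scores, rank-based weights) would serve the same purpose.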

One failure mode to manage: quality-weighted selection applies optimization pressure to the T2 rubric itself. If the rubric imperfectly captures operational quality — or if rubric dimensions are easier to satisfy via surface-level compliance than genuine technique — the model will learn to optimize for rubric scores rather than the underlying qualities the rubric was designed to measure. This is the standard Goodhart/reward-hacking dynamic: the measure becomes the target. The mitigation is periodic rubric auditing: comparing T2 score distributions to human expert ratings on a sample of sessions, and updating rubric dimensions when the two diverge. Quality-weighted selection is only as useful as the quality signal driving it.
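A rubric audit of the kind described above might look like the following sketch, which compares judge scores with human expert ratings per dimension and flags any dimension whose mean gap exceeds a tolerance. The dimension keys, the 0.5 tolerance, and the use of a mean signed gap are assumptions; a real audit might prefer rank correlation or per-session review.

```python
from statistics import mean

# Assumed key names for the four T2 dimensions named in the text.
DIMENSIONS = (
    "technique_appropriateness",
    "output_interpretation",
    "objective_relevance",
    "operational_discipline",
)


def divergent_dimensions(t2_scores, human_scores, tolerance=0.5):
    """Return {dimension: mean signed gap} where |gap| exceeds the tolerance.

    Both arguments are lists of dicts keyed by dimension, paired by index.
    A positive gap means the T2 judge is more generous than the human raters.
    """
    if len(t2_scores) != len(human_scores) or not t2_scores:
        raise ValueError("need equally sized, non-empty samples of paired ratings")
    flagged = {}
    for dim in DIMENSIONS:
        gap = mean(t2[dim] - human[dim] for t2, human in zip(t2_scores, human_scores))
        if abs(gap) > tolerance:
            flagged[dim] = gap
    return flagged
```

A persistent positive gap on one dimension is the signature to watch for: it suggests sessions are satisfying the rubric's wording on that dimension without earning the same rating from practitioners.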

A prerequisite for quality-weighted selection is corpus partitioning by constraint type. A session optimized for stealth and a session optimized for speed may both score 2.8/3.0 on T2 overall — but they are not interchangeable training examples. Mixing them with equal weight produces contradictory signal: choose the stealthy approach, and also choose the fast approach, for the same objective class. The correct implementation partitions the corpus by the engagement-level constraint specified at session initialization — stealth-optimized sessions train stealth-weighted adapters; speed-optimized sessions train speed-weighted adapters — and applies T2-based weighting within each partition.
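Partition-then-weight can be sketched as follows. The tuple layout, the constraint labels, and the proportional weighting within each partition are assumptions for illustration; the only point carried over from the text is that weights are normalized inside a partition and never across partitions.

```python
from collections import defaultdict


def partition_and_weight(sessions):
    """Group passing sessions by engagement constraint, then weight within each group.

    `sessions` is an iterable of (session_id, constraint, mean_t2) tuples.
    Returns {constraint: {session_id: weight}} with weights summing to 1 per constraint.
    """
    partitions = defaultdict(list)
    for session_id, constraint, mean_t2 in sessions:
        partitions[constraint].append((session_id, mean_t2))

    weighted = {}
    for constraint, members in partitions.items():
        total = sum(score for _, score in members)
        if total <= 0:
            raise ValueError(f"partition {constraint!r} has no positive T2 mass")
        weighted[constraint] = {sid: score / total for sid, score in members}
    return weighted
```

Because normalization happens per partition, a 2.8 stealth session and a 2.8 speed session never compete for the same weight, which avoids the contradictory signal described above.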

5.2 Iterative solution exploration within sessions

Current ARCHER sessions are single-path: the agent selects an approach and follows it until success or halt. If the approach produces a low-quality result — the objective is achieved, but with poor technique or unnecessary noise — the session closes and the result stands.

An iterative model would allow the agent to recognize a low-quality successful path and explore alternatives before closing the session. The mechanism is not fundamentally different from what T2 currently does post-hoc: evaluate the quality of the completed approach, then decide whether to accept it or retry with a different strategy.

This requires two architectural additions:

An in-session quality signal. A lightweight quality assessment — cheaper than the full T2 judge, more expensive than a binary verifier — that can evaluate whether the achieved solution meets a quality threshold worth training on. If not, the session continues with an alternative approach rather than closing on the first correct answer.

A retry protocol. A structured mechanism for the agent to recognize that it has found a correct but suboptimal solution, attempt an alternative approach, and compare the results. This is the agentic workflow evolution described in the broader AI literature — moving from single-pass generation to iterative refinement loops.
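The two additions above can be combined into a small control loop. This is a sketch under stated assumptions: `run`, `verify`, and `quick_quality` are hypothetical callables standing in for the agent, the binary verifier, and the lightweight in-session quality signal, and the acceptance threshold and attempt budget are arbitrary.

```python
def explore(approaches, run, verify, quick_quality, accept_at=2.5, max_attempts=3):
    """Try approaches in order and return the best verified (approach, score), or None.

    The loop stops early once a verified result meets the quality threshold,
    and otherwise stops when the attempt budget is spent.
    """
    best = None
    for approach in list(approaches)[:max_attempts]:
        result = run(approach)
        if not verify(result):
            continue  # not a correct solution; move on to the next approach
        score = quick_quality(result)
        if best is None or score > best[1]:
            best = (approach, score)
        if score >= accept_at:
            break  # good enough; stop spending session budget
    return best
```

The sufficiency model corresponds to `accept_at=0`: the first verified result ends the session. Raising the threshold is what turns a correct-but-noisy first result into a reason to keep exploring rather than a reason to stop.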

5.3 Human definition of "best" at the engagement level

This is the most important architectural change, and it is the one that current AI security tools most consistently fail to implement.

The definition of "best" for a given exploitation approach is not fixed — it depends on engagement constraints that the AI cannot know from the task description alone. A red team operation under strict stealth requirements has a different "best" than an authorized vulnerability assessment under time pressure. A target organization with extensive security monitoring requires a different approach than one with minimal detection capability.

The engagement type determines which dimensions dominate the optimization:

Engagement type	Primary criterion	Secondary criterion	Techniques favored
Red team / adversary simulation	Stealth	Transferability	Boolean blind, time-based, OOB
Time-boxed penetration test	Speed	Reliability	Union-based, error-based
Compliance assessment	Reliability	Documentation	Error-based, union-based
Threat emulation	Behavioral fidelity	TTP alignment	Threat actor-matched

These criteria cannot be inferred from the task description alone. "Exploit SQL injection" does not encode whether the engagement prioritizes speed or stealth. The optimization target must be specified at session initialization — and only the human who authorized and scoped the engagement can specify it correctly.

This assumes the optimization criteria are knowable before the session begins. In practice that assumption is often incomplete: real engagements are iterative, and the operative constraint may shift mid-session based on what the agent discovers (a target that appeared monitored turns out not to be; a time-constrained window opens unexpectedly). Specifying criteria at initialization is the right starting point, but the architecture should treat those criteria as revisable rather than fixed — the human layer's role includes updating the constraint specification as the engagement context clarifies. Session-level iteration (Phase 3 in Section 8) is a partial answer: it provides a structured moment to re-evaluate which path best serves the current understanding of the constraint profile.

No amount of model training or session iteration can internalize these constraints automatically, because they are not derivable from the technical task. They are defined by the human who authorized and scoped the engagement.

The three-layer architecture (model, code, human) provides the right framework for handling this. The model generates solutions and evaluates them against technical quality criteria it can assess. The code layer enforces constraints that can be specified programmatically. The human layer defines the operational criteria that determine "best": the engagement-level constraints that require human judgment to specify and that cannot be safely delegated to either the model or the code.

Concretely: rather than asking the agent "exploit SQL injection," the engagement should be defined with explicit optimization criteria — "exploit SQL injection, prioritizing approaches that minimize log noise and avoid WAF signatures." The model can optimize for specified technical criteria. It cannot determine which criteria to optimize for without guidance.
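One way to make the criteria explicit is a small, immutable specification object that the human fills in at session initialization and revises as the engagement clarifies. The class name, field names, and rendered wording are hypothetical; nothing here reflects an existing ARCHER task format.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class EngagementSpec:
    """Human-authored optimization criteria attached to an objective."""
    objective: str
    primary_criterion: str
    secondary_criterion: Optional[str] = None
    avoid: Tuple[str, ...] = ()

    def to_task_description(self) -> str:
        """Render the spec as the task description handed to the agent."""
        parts = [f"Objective: {self.objective}.", f"Prioritize: {self.primary_criterion}."]
        if self.secondary_criterion:
            parts.append(f"Then: {self.secondary_criterion}.")
        if self.avoid:
            parts.append(f"Avoid: {', '.join(self.avoid)}.")
        return " ".join(parts)


spec = EngagementSpec("exploit SQL injection", "minimize log noise", avoid=("WAF signatures",))

# Mid-engagement revision: the human issues a new spec; the original is left untouched.
revised = replace(spec, primary_criterion="maximize speed")
```

Keeping the spec frozen and producing a new one on revision leaves a record of which criteria were in force for which part of the session, which matters later when sessions are partitioned by constraint type for training.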

The mechanism that makes this work is the T2 rubric. When optimization criteria are specified at session initialization, T2 evaluates solution quality against those criteria specifically. Training on T2-weighted sessions for each constraint type shifts the model's prior toward criterion-appropriate technique selection — not through abstract reasoning about engagement context, but because it has learned from quality-weighted sessions which approaches score well under which constraint profiles.

This is not a limitation of AI — it is the correct division of labor. The model's job is to navigate the solution space efficiently given constraints. The human's job is to define the constraints. Conflating these roles produces either an agent that optimizes for the wrong criteria (because the human didn't specify them) or a human who is reduced to reviewing outputs without meaningful input into what the outputs should be.

5.4 Threat Emulation as Constrained Optimization

Threat emulation — operating as a specific threat actor — inverts the optimization problem described above.

In a standard penetration test, the optimization criteria are operational: minimize noise, maximize speed, maximize reliability. The agent is free to select the best path by those criteria across the full space of valid exploitation techniques.

In threat emulation, the optimization criterion is behavioral fidelity: find the path a specific threat actor would use, not the path that is technically optimal. APT29 does not use the fastest exploit; it uses the exploit consistent with its documented tooling, staging patterns, and lateral movement tradecraft. The constraint is not "minimize log noise in general" but "match the behavioral signature of this threat profile."

This has direct implications for how threat emulation objectives must be structured and evaluated:

Objective specification. A threat emulation objective cannot be defined as "achieve domain admin." It must be defined as "achieve domain admin using techniques consistent with [threat actor] TTPs" — with the TTP constraints explicit in the objective definition. The verifier must assess technique conformance, not just outcome achievement.

T2 rubric extension. The T2 quality dimensions for threat emulation replace operational criteria with fidelity criteria: does the agent's technique selection match the threat actor's documented tooling? Does the staging pattern match known behavior? Does the lateral movement sequence follow documented tradecraft? A threat emulation session that achieves the objective with the wrong tools is a low-quality session regardless of technical success.

Training data separation. Threat emulation sessions must not enter the same training partition as standard penetration testing sessions. The optimization targets are in some cases directly opposed — a session using noisy, well-documented techniques because they match a threat actor's profile would score poorly on operational stealth dimensions. Mixing these with stealth-optimized pentest sessions produces contradictory training signal.
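A verifier that checks technique conformance as well as outcome could be sketched as below. The technique labels are placeholders rather than real TTP identifiers, and scoring fidelity as the fraction of observed techniques found in the actor profile is an assumption; a production check would also need to consider ordering and staging patterns.

```python
def fidelity_score(observed, profile):
    """Fraction of observed techniques that appear in the threat actor profile."""
    if not observed:
        return 0.0
    matching = sum(1 for technique in observed if technique in profile)
    return matching / len(observed)


def verify_emulation(objective_achieved, observed, profile, min_fidelity=1.0):
    """Pass only if the objective was achieved and the techniques conform to the profile."""
    return objective_achieved and fidelity_score(observed, profile) >= min_fidelity


# Placeholder profile and session trace; labels are illustrative only.
actor_profile = frozenset({"technique-a", "technique-b", "technique-c"})
session_trace = ["technique-a", "technique-b", "technique-x"]
```

Here the session reaches its objective but uses one technique outside the profile, so `verify_emulation(True, session_trace, actor_profile)` fails. Under a sufficiency-only verifier the same session would have passed.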

The infrastructure required — constraint-specified objectives, fidelity-aware T2 rubrics, partitioned training corpus — is the same infrastructure required for optimality-aware penetration testing. Threat emulation is not a separate architecture; it is a different constraint set applied to the same framework.

6. What This Means for Evaluation Design

The sufficiency model produces evaluation metrics that measure the wrong thing for production deployment.

A pass rate measures: what fraction of the time does the agent find any correct solution? This is a useful capability benchmark. It is not a useful production-readiness benchmark, because it does not measure whether the correct solution found is the best solution available.

An optimality-aware evaluation would additionally measure:

Solution quality distribution. Across all passing sessions, what is the distribution of T2 scores? A system that achieves an 80% pass rate with a median T2 score of 2.8/3.0 is more operationally useful than one that achieves the same pass rate with a median T2 score of 2.1/3.0, even though aggregate pass rate cannot distinguish them.

Approach diversity. Across sessions targeting the same objective class, how many distinct valid approaches does the agent employ? A system that always uses the same approach for a given vulnerability class is more brittle and less generalizable than one that selects among approaches based on context signals.

Constraint alignment. Given a specified optimization criterion (minimize noise, maximize speed, maximize reliability), does the agent's solution selection align with the criterion? This requires eval objectives that specify optimization constraints and verifiers that assess constraint satisfaction, not just objective achievement.

These metrics do not replace the sufficiency-based pass rate — they complement it. Pass rate tells you whether the system works. The quality distribution, approach diversity, and constraint alignment tell you how well it works and for which operational contexts.
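The three additional metrics can be reported alongside pass rate from the same session records. In this sketch the record fields are hypothetical, and the `FAVORED` mapping is taken from the stealth and speed rows of the engagement table in Section 5.3; treating alignment as simple membership in the favored set is an assumption.

```python
from statistics import median

# Techniques favored per criterion, following the engagement table in Section 5.3.
FAVORED = {
    "stealth": {"boolean blind", "time-based blind", "out-of-band"},
    "speed": {"union-based", "error-based"},
}


def optimality_report(sessions):
    """Summarize pass rate plus quality, diversity, and constraint alignment.

    Each session is a dict with keys: passed, t2_mean, approach, criterion.
    """
    passing = [s for s in sessions if s["passed"]]
    aligned = [s for s in passing if s["approach"] in FAVORED.get(s["criterion"], set())]
    return {
        "pass_rate": len(passing) / len(sessions) if sessions else 0.0,
        "median_t2": median(s["t2_mean"] for s in passing) if passing else None,
        "distinct_approaches": len({s["approach"] for s in passing}),
        "constraint_alignment": len(aligned) / len(passing) if passing else None,
    }
```

Two systems with identical `pass_rate` values can differ on every other field in this report, which is the gap the sufficiency model leaves unmeasured.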

7. The Iterative Paradigm Shift

The broader AI literature describes a shift from "answer engines" to "iterative explorers" — systems that generate multiple candidate solutions, evaluate them, and refine toward the best rather than stopping at the first correct answer.

In AI security specifically, this shift is not optional. It is required by the operational context.

A penetration tester does not attempt one approach and report results whether or not it was the best approach available. A good practitioner generates hypotheses, tests the most promising, adapts based on results, and selects the approach that best serves the engagement objectives. The value of the practitioner is not just that they can find A working exploit — it is that they can find the RIGHT working exploit for the specific context.

AI security agents that replicate this behavior will be more valuable than those that stop at sufficiency. The architecture to support it (evaluation infrastructure that measures quality, training pipelines that weight on quality, session designs that allow iterative exploration) exists in nascent form in current systems. Moving from sufficiency to optimality is primarily an engineering and design problem: the core capabilities exist, and what is missing is the design intent to wire them together. Open research questions remain around rubric calibration, corpus partitioning, and reliable constraint elicitation.

The remaining dependency is the human layer. Optimality in a security context cannot be defined by the model or the code. It is defined by the analyst who understands the engagement scope, the operational constraints, and the risk tolerance of the organization. AI can navigate the solution space efficiently. The analyst determines where in that space to navigate.

That is the right division of labor — and it is the division that will define the next generation of production-ready AI security tooling.

8. Implications for ARCHER

ARCHER's current architecture addresses sufficiency well. The transition to optimality is achievable in three phases without architectural replacement:

Phase 1 — Quality-weighted training (immediate). Modify the training data selection pipeline to weight session inclusion by T2 score rather than treating all passing sessions equally. The T2 infrastructure already produces the signal; the pipeline modification is a configuration change, not a new capability.

Phase 2 — Enhanced T2 rubric dimensions (near-term). Add operational quality dimensions to the T2 rubric: stealth (does the approach minimize unnecessary log noise?), efficiency (does the approach achieve the objective in fewer commands than alternatives would?), transferability (does the approach generalize or is it target-specific?). These dimensions already exist implicitly in what good T2 scores capture; making them explicit allows systematic measurement and training signal.

Phase 3 — Iterative session design (post-V2). After V2 Phase 5 (fine-tuned model validated), introduce session-level iteration: if a session achieves an objective with a low T2 score, allow the agent to recognize this and explore an alternative approach before closing. This is a session budget and protocol change, not a model change.

The human layer (throughout). Engagement-level optimization criteria should be specified at session initialization. The current task description format can be extended to include explicit optimization hints: "prioritize stealth," "maximize speed," "avoid WAF signatures." These hints inform the model's approach selection without requiring the model to infer operational context it cannot know.

The shift from sufficiency to optimality does not require a different model. It requires a different design intent — measuring quality, weighting training on quality, specifying optimization criteria explicitly, and using the human layer for the judgment calls that belong there.