The Stochastic Trap: An Architectural Critique of Current AI Security Tools
Status: Technical Report | Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs
The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.
The dominant paradigm in AI-augmented security tooling treats the language model as the primary decision-making layer: it receives a task, reasons over it, and produces output that the surrounding system executes with minimal mediation. This architecture often concentrates work in the wrong place. Language models are probabilistic systems - they generate outputs by sampling from learned distributions, not by executing logic. Applied to roles that require deterministic correctness - audit trails, task routing, halt detection, compliance logging - they produce outputs that are plausible in form but unreliable and indefensible in content. I call this the Stochastic Trap: the failure mode of applying probabilistic reasoning to deterministic roles.
This paper documents three manifestations of the Stochastic Trap observed during the development of ARCHER, a local-first AI penetration testing agent: output format drift under context pressure, routing errors on ambiguous task phrasings, and halt detection failure that allowed sessions to continue past the point of objective completion. For each failure mode, I identify the architectural pattern that produced it and the remediation that addressed it. I conclude with a design taxonomy that distinguishes the work that belongs to the model layer from the work that belongs to the code layer, and argue that this distinction — not model capability — is the primary determinant of AI security tool reliability.
1. Introduction
Security operations require two properties that are in fundamental tension with probabilistic systems: consistency and auditability. A penetration test must be reproducible — the same methodology applied to the same target must produce comparable results across operators and sessions. A security finding must be traceable — every claim must connect to specific evidence in specific tool output. These requirements are not optional features. They are the minimum bar for findings that drive remediation decisions in regulated environments.
Language models cannot reliably provide either property. They are trained to produce text that is consistent with the patterns in their training data, not text that is logically consistent across turns in a specific session. They do not maintain state; they regenerate it on each forward pass, which means their outputs are statistically correlated with correct behavior but not mechanically guaranteed to exhibit it.
The industry response to this limitation has been prompt engineering: adding explicit instructions, formatting requirements, and output constraints to system prompts in the expectation that the model will comply. The assumption is that a model capable enough can be instructed to behave deterministically. This assumption is wrong. A language model that has been instructed to produce structured JSON will produce structured JSON most of the time and malformed JSON some of the time, with no mechanism for distinguishing the cases before execution and often no mechanism for identifying faulty content. At scale, "most of the time" is insufficient for audit trails, routing decisions, and halt logic.
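The scale argument can be made concrete with a little arithmetic. The sketch below assumes each structured output is well-formed independently with the same probability, which is a simplification, and the values used are illustrative rather than measured ARCHER figures.

```python
def p_all_well_formed(p_single: float, n_outputs: int) -> float:
    """Probability that every one of n independent outputs is well-formed."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if n_outputs < 0:
        raise ValueError("n_outputs must be non-negative")
    return p_single ** n_outputs


def expected_malformed(p_single: float, n_outputs: int) -> float:
    """Expected number of malformed outputs among n."""
    return (1.0 - p_single) * n_outputs


if __name__ == "__main__":
    # Illustrative assumption: 99% per-output compliance across 100 outputs.
    print(f"P(no malformed output) = {p_all_well_formed(0.99, 100):.3f}")
    print(f"Expected malformed outputs = {expected_malformed(0.99, 100):.2f}")
```

Under these assumptions, a model that complies 99% of the time per output has only about a 37% chance of producing 100 outputs without a single malformed one.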
The Stochastic Trap is the architectural pattern that results from placing probabilistic systems in deterministic roles and compensating for their unreliability with more instructions.
2. Background and Related Work
The stochastic character of large language model output in structured-task contexts has direct empirical support. Liu et al. (2024) demonstrate that model performance degrades significantly when relevant information is positioned in the middle of a long context, with models attending preferentially to content at the beginning and end of their context window — a U-shaped performance curve that predicts the output format drift observed in ARCHER V1 sessions as session context accumulated over multiple turns. The implication for structured output specifically is that instruction compliance is not uniform across a session: a model may produce well-formed [FINDINGS] blocks and [OBJECTIVE_ACHIEVED] tokens early in a session and drift toward non-compliant variants as the context log lengthens, independent of prompt quality. Temperature compounds this effect: at higher temperatures, output format variation increases even holding context length constant.
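The temperature effect follows from how sampling probabilities are computed. A minimal sketch, assuming a toy three-token vocabulary with made-up logits (not values from any real model): dividing the logits by the temperature before the softmax flattens the distribution, so probability mass moves from the preferred token to its alternatives.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to sampling probabilities at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: index 0 stands for the compliant token,
# the others for non-compliant variants.
LOGITS = [5.0, 2.0, 1.0]

if __name__ == "__main__":
    for t in (0.2, 0.7, 1.0, 1.5):
        p_compliant = softmax_with_temperature(LOGITS, t)[0]
        print(f"T={t}: P(non-compliant token) = {1 - p_compliant:.4f}")
```

The probability of sampling a non-compliant variant rises with temperature even though the logits, and therefore the model's underlying preference, are unchanged.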
A survey of the current AI security tool landscape, covering architectural patterns and documented failure modes across commercial and open-source tools, is largely absent from the peer-reviewed literature. Most disclosure of failure modes comes from vendor case studies, practitioner conference presentations, or incident post-mortems rather than systematic academic analysis. This paper draws on two sources: ARCHER's failure taxonomy in Section 3, derived from operational sessions, and practitioner-observed parallels in commercial AI security tooling documented alongside each failure mode. The commercial examples are drawn from publicly available product documentation, vendor-published evaluation data, and practitioner assessments of tools run against known-state environments; they are illustrative, not exhaustive, and the evidence base for each is noted inline.
Two peer-reviewed evaluation frameworks for AI penetration testing agents have emerged as reference benchmarks: CAIBench (Anonymous et al., 2025) and HackSynth (Tihanyi et al., 2024). Both measure task completion rate — whether an agent captured a flag or completed an objective — using containerized Kali Linux environments and binary ground-truth verification. Neither addresses the architectural failure modes this paper documents. CAIBench's Attack Success Rate and HackSynth's task completion rate measure the model layer only; they do not measure whether routing decisions were deterministic, whether halt behavior was evidence-based, or whether sessions produced auditable output. HackSynth's halt mechanism is a hard 20-iteration ceiling — a mechanical stop with no quality check on why the session ended. These benchmarks establish a useful capability baseline but cannot detect the Stochastic Trap, because the Stochastic Trap's failure modes are architectural: they appear not in whether the agent succeeded but in how it decided it had.
PTES and the OWASP Web Security Testing Guide establish the methodology requirements that determine what constitutes a valid finding in professional penetration testing. The Penetration Testing Execution Standard defines seven phases of a penetration test and specifies the documentation each phase must produce. The OWASP WSTG v4.2 extends this to web application testing with 91 specific test cases, each with a defined objective and required evidence type. An AI tool claiming compatibility with either standard must produce findings traceable to specific phase deliverables and evidence formats — not narrative summaries that omit the phase reference and leave the supporting evidence unlinked.
The terms stochastic and deterministic are used in their standard computer science sense: a stochastic system produces outputs that are not fully determined by its inputs (probabilistic components are present); a deterministic system produces identical output for any given input. Two further terms need defining here. An audit trail is a tamper-evident, complete, ordered record of actions and their outputs; provenance is the traceable chain from raw tool evidence to a derived security finding.
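One common way to make a record tamper-evident and ordered is a hash chain, in which each entry commits to the one before it. The sketch below is an illustration of that general technique, not ARCHER's logging implementation; the AuditLog class, its field names, and the sample strings are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder hash that anchors the first entry


def _digest(prev_hash: str, action: str, output: str) -> str:
    """Hash an entry together with the hash of its predecessor."""
    payload = json.dumps([prev_hash, action, output]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class AuditLog:
    """Append-only log in which each entry commits to the previous one."""
    entries: list[dict] = field(default_factory=list)

    def append(self, action: str, output: str) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        entry_hash = _digest(prev_hash, action, output)
        self.entries.append(
            {"action": action, "output": output, "prev": prev_hash, "hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; an edited, reordered, or removed interior entry fails."""
        prev_hash = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["action"], entry["output"]):
                return False
            prev_hash = entry["hash"]
        return True


if __name__ == "__main__":
    log = AuditLog()
    log.append("id", "uid=0(root) gid=0(root)")   # illustrative strings only
    log.append("hostname", "metasploitable")
    print(log.verify())
    log.entries[0]["output"] = "edited after the fact"
    print(log.verify())
```

A chain like this detects edits but not truncation of the most recent entries; covering that case requires storing the latest hash somewhere the logging process cannot rewrite.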
Industry adoption of AI in security operations is early but growing rapidly. ISC2's 2025 AI Adoption Pulse Survey (n=436 cybersecurity professionals globally) found 30% actively using AI security tools and 42% in evaluation or testing phases. The survey's definition — "AI-enabled security solutions, generative AI, and/or agentic AI for automatic action" — is intentionally broad and does not distinguish between audit-grade tools and tools designed for exploration only. The compliance implications of that distinction are a subject of this paper.
3. The Three Manifestations
3.1 Output Format Drift
The early iteration of ARCHER V1 relied on the model to produce structured output (command blocks, findings annotations, completion tokens) under a natural-language system prompt whose instructions were meant to bind its output to the proper format. The model complied most of the time. Over the course of a session, compliance degraded.
The failure mode: as context length increased and the model's attention window became populated with prior turns, the probability of compliant output decreased. Sessions that began with clean [FINDINGS] blocks drifted into free-text analysis that the downstream parser could not extract signal from. Sessions that began with clean [OBJECTIVE_ACHIEVED] signals began producing variations — "Objective achieved," "The objective has been completed," "I believe we have accomplished" — that the keyword parser matched inconsistently.
The remediation: ARCHER V1's parsing layer grew to nearly 300 lines of regex to handle output variation. This is a correct diagnosis of the symptom and a wrong treatment of the cause. Compensating logic around model unreliability is evidence that the model is in a code role. The correct remedy is to remove the model from the output format decision entirely.
ARCHER V2 addresses format drift by moving format compliance from the instruction layer to the training layer. Rather than instructing the base model to follow output conventions, V2 fine-tunes on ARCHER eval sessions so the model acquires the required markers — bash command blocks, [FINDINGS] annotations, [OBJECTIVE_ACHIEVED] tokens — as trained behavior rather than as a prompt directive. The distinction matters: instruction compliance degrades; trained behavior degrades more slowly.
Implementation status
The fine-tuned format adapter is the V2 target architecture, not current production. The current system uses the base qwen3:14b model with the instruction-layer format specification and a regex parsing layer that handles output variation. Domain-tuned adapters are deferred pending data collection (Phase 4–5 of the V2 pipeline).
The parser complexity data is a direct measure of this accumulation. ARCHER V1's extract_bash_command — the function responsible for extracting model-generated bash commands from session output — was 8 lines handling a single fence variant (```bash). The current version is 33 lines handling four variants: ```bash, ```sh, ```shell, and a generic ``` fallback added when the model began producing output without language tags. Each additional pattern was added reactively after the model produced a format the prior version could not handle. This is compensating logic — not a planned extension, but accumulated adaptation to model output variation. The _parse_json_response function adds 40 lines of JSON repair logic (backslash normalization, retry loops) introduced as a named fix when the model began producing malformed JSON under context pressure.
The parser complexity data functions as a proxy: the rate at which new compensating logic was added tracks the rate at which the instruction-only model deviated from expected formats. That rate was non-zero throughout V1 development and did not plateau. Deviations are therefore not a fixed set of problems that can be enumerated and handled once; they are an open-ended consequence of a stochastic system producing output under varying context conditions. Each new objective shifts the context, and each shift surfaces new deviations that require new fixes.
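To show what this kind of compensating logic looks like, here is a hypothetical reconstruction of a multi-variant fence extractor. It is not ARCHER's actual extract_bash_command; it only assumes the four fence variants named above, and the tag ordering and regex are my own choices.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block

# Language tags tried in priority order; the empty tag is the generic fallback.
_TAGS = ("bash", "sh", "shell", "")


def extract_bash_command(text: str) -> Optional[str]:
    """Return the body of a fenced block, preferring tagged variants.

    Tags are tried in priority order, so a tagged block wins over an
    untagged one even if the untagged block appears earlier in the text.
    """
    for tag in _TAGS:
        pattern = re.compile(
            re.escape(FENCE) + tag + r"[ \t]*\n(.*?)\n?" + re.escape(FENCE),
            re.DOTALL,
        )
        match = pattern.search(text)
        if match:
            body = match.group(1).strip()
            if body:
                return body
    return None


if __name__ == "__main__":
    sample = f"Next step:\n{FENCE}sh\nid\n{FENCE}\n"
    print(extract_bash_command(sample))
```

Every entry in the tag tuple corresponds to a format the model was never told to emit. The list has no natural stopping point, which is the accumulation pattern described above.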
The same pattern is observable in commercial AI security tooling. Microsoft Copilot for Security generates incident timelines by synthesizing signals across endpoint telemetry, email headers, and identity logs into a natural-language narrative. The synthesis is model-layer: the model reasons over retrieved artifacts and produces a coherent account of the incident. Microsoft's own documentation describes this as "AI-generated summaries" — but in practice the output is consumed by analysts as if it were an audit trail. The failure mode is that the narrative does not clearly distinguish model inference from evidenced fact. A connection the model drew between two signals because they co-occurred in training data is rendered in the same prose register as a connection that is directly supported by the retrieved artifacts. The same pattern appears across AI-generated security incident reports more broadly: the model does framing work — taking signals and constructing narrative — but the output is presented in a form that implies evidentiary traceability it does not actually provide.
3.2 Routing Errors on Ambiguous Task Phrasings
ARCHER's skill router maps a natural-language task string to the correct analytical workflow — reconnaissance, exploitation, post-exploitation, web, active directory. The V1 router used keyword scoring: a bag-of-words match against keyword lists for each skill category, with bonus functions for contextual signals.
Keyword scoring fails predictably on ambiguous phrasings. "Assess the security posture of 192.168.56.103" is a reconnaissance task. It is also a vulnerability assessment task. It is also, under a generous reading, a web exploitation task. The keyword scorer assigns scores to all three, takes the highest, and the model follows the wrong skill pack's guidance for the entire session.
The empirical shape of this failure: in ARCHER eval runs, routing misses on ambiguous phrasings account for a measurable fraction of objective failures that are not attributable to the model's command generation quality. A session routing to the wrong skill category produces wrong hints, wrong halt criteria, wrong tool selection — none of which are recoverable within the session.
The remediation: ARCHER V2 adds a trained TF-IDF + logistic regression classifier as the first routing tier, with keyword scoring as fallback. The classifier is trained on high-confidence labels from eval harness runs — task strings with known correct routings. When classifier confidence exceeds 0.5, the routing decision bypasses keyword scoring. Ambiguous cases below the confidence threshold fall back to keyword scoring.
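The two-tier decision can be sketched as follows. This is an illustration of the tiering logic only, not ARCHER's router: the classifier is passed in as any object exposing scikit-learn style classes_ and predict_proba (a scikit-learn Pipeline of TfidfVectorizer and LogisticRegression provides both), and the StubClassifier, keyword_router, and skill names are hypothetical stand-ins so the example runs without dependencies.

```python
from typing import Callable, Optional, Protocol, Sequence


class ProbabilisticClassifier(Protocol):
    """Any model exposing scikit-learn style classes_ and predict_proba."""
    classes_: Sequence[str]

    def predict_proba(self, texts: Sequence[str]) -> Sequence[Sequence[float]]: ...


def route(
    task: str,
    classifier: ProbabilisticClassifier,
    keyword_router: Callable[[str], Optional[str]],
    threshold: float = 0.5,
) -> tuple[Optional[str], str]:
    """Return (skill, tier): classifier first, keyword scoring as fallback."""
    probabilities = classifier.predict_proba([task])[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] > threshold:
        return classifier.classes_[best], "classifier"
    return keyword_router(task), "keyword"


class StubClassifier:
    """Stand-in for a trained model, used only to exercise the routing logic."""
    classes_ = ["recon", "web"]

    def predict_proba(self, texts):
        # Confident on tasks mentioning "scan", undecided otherwise.
        return [[0.9, 0.1] if "scan" in t else [0.5, 0.5] for t in texts]


def keyword_router(task: str) -> Optional[str]:
    """Toy keyword fallback; returns None when nothing matches."""
    return "web" if "http" in task.lower() else None


if __name__ == "__main__":
    print(route("scan 192.168.56.103", StubClassifier(), keyword_router))
    print(route("Assess the security posture of 192.168.56.103", StubClassifier(), keyword_router))
```

Returning the tier alongside the skill keeps the routing decision auditable: the log shows not only where a task went but which mechanism sent it there.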
Implementation status
The classifier is trained and deployed as the active production router. All 15 skill categories cleared the 50-label gate. The confidence threshold is 0.5 — the initial design target of 0.7 produced excessive fallback to keyword scoring on eval data, and was lowered after measurement. The LLM gate in the original V2 design has been removed: keyword scoring performed comparably on ambiguous cases that reached it, and the latency cost was not justified by the accuracy gain.
Routing log data from --ambiguous eval runs provides a partial measurement. Across 235 ambiguous task phrasings run against the keyword-only router (before classifier deployment), routing accuracy was 54.5%: 128 correct routings out of 235 test cases. The other 107 cases were routed incorrectly. In 17 of them (7.2% of the test set) the router returned no selection at all; the remaining 90 sent sessions to the wrong skill category with high scorer confidence, meaning the wrong hints and halt criteria were applied for the full session. Canonical task phrasings (those structurally identical to training examples) routed at 100% accuracy, confirming that keyword scoring is not failing generally but specifically on the ambiguous, underspecified phrasings that appear regularly in real-world task input.
A direct comparison of keyword-only accuracy versus classifier accuracy on the same ambiguous test set is not yet available: the 235 labeled test cases were generated before the classifier was deployed, and no equivalent post-classifier --ambiguous run with ground-truth labels has been conducted. The classifier was trained on these labels; its accuracy on this exact test set would be optimistic and should not be cited. A held-out ambiguous test set is the required next measurement.
The same failure mode appears in AI-augmented application security tooling that places binary classification decisions in the model layer. Static analysis triage tools classify findings as true or false positives using a language model; AI-assisted remediation tools route vulnerability findings to fix templates the same way. Both architectures place the routing decision in the model layer. The failure mode is predictable from first principles: a probabilistic classifier applied to an ambiguous input produces inconsistent output — the same code pattern classified differently depending on surrounding context. No prompt engineering adjustment resolves this, because the inconsistency is not a compliance problem — it is a property of the system type.
3.3 Halt Detection Failure
An AI agent that doesn't know when it's done is an agent that wastes compute, accumulates context debt, and produces sessions that are too long to review. ARCHER V1's halt logic was keyword-based: the model was instructed to emit [OBJECTIVE_ACHIEVED] when the objective was complete. The instruction worked most of the time.
The failure mode was in the other direction: sessions where the model emitted [OBJECTIVE_ACHIEVED] prematurely, after a plausible-sounding tool output that did not actually confirm the objective. A session that claims success without real evidence is worse than a session that fails silently — it produces a false positive that enters the training data, fine-tuning the model to replicate the false success behavior.
Eval harness data quantifies the false-positive rate directly. Across the controlled baseline eval run — 87 sessions against Metasploitable 2 under the reference configuration — 44 sessions ended with an OBJECTIVE_ACHIEVED signal. Of these, 4 (9.1%) failed code-layer ground-truth verification: the model's completion claim was not confirmed by the verify_fn probe of actual target state. Across a broader sample of 1,639 sessions spanning all eval runs with complete outcome logging, 656 ended OBJECTIVE_ACHIEVED; 121 (18.4%) failed ground-truth verification. The difference between the baseline rate (9.1%) and the aggregate rate (18.4%) reflects the broader sample including pre-fix and debugging runs; the baseline figure is the cleaner measure of the current system under production conditions. Both are well above the 5% falsification threshold in Claim 5 below, confirming that false-positive halt signals occur at a nontrivial rate without the verification gate. Without verify_fn, these sessions would have been logged as successful completions and been eligible for training data inclusion — a direct route from model hallucination to training data contamination.
The intuitive response to a system that claims false success is to instruct it more accurately — tighter formatting requirements, explicit directives to report failures honestly, clearer definitions of what constitutes a confirmed objective. This diagnosis treats the model as if it were a human analyst choosing to falsify a report. It is the wrong diagnosis.
A human analyst who files a false finding has a theory of mind: they know the finding is false, they know how it will be consumed, and they elect to report it anyway. The analyst bears professional accountability for that choice. A model that emits [OBJECTIVE_ACHIEVED] without confirming the objective had no decision to make — it sampled a token sequence consistent with training data from task-completion contexts. There is no internal state to lie from, no intent, no deceiver. Accountability does not belong to the model; it belongs to the system design that placed a probabilistic token sampler in a role that requires a correctness check.
This distinction has a direct architectural consequence: instructions cannot fix the problem. A model is not violating a directive when it produces a false completion signal — it is sampling from a distribution. Instructing a distribution to be accurate does not change its tails (the low-probability outcomes at the edges of the distribution — the cases where the model produces something rare and wrong). The only mechanism that catches false completion signals is one that is structurally independent of the model: a code-layer verification function that probes actual target state and rejects claims the evidence does not support.
The remediation: ARCHER adds a verify_fn layer to the eval harness — a ground-truth check that runs after any objective-achieved signal and confirms the claimed success is real (e.g., confirmed shell at uid=0, CVE string present in actual tool output, not echoed by the model). Sessions that pass the model's halt signal but fail verify_fn are excluded from training data and flagged for Auditor review.
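The shape of such a gate can be sketched in a few lines. This is a simplified illustration, not ARCHER's harness: the Session fields and the verify_root_shell check are hypothetical, and the check inspects captured tool output rather than probing the live target as the real verify_fn is described as doing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Session:
    session_id: str
    model_claimed_success: bool  # the model emitted the completion token
    tool_output: str             # raw output captured by the code layer


def verify_root_shell(session: Session) -> bool:
    """Example check: the captured tool output itself must show uid=0."""
    return "uid=0(" in session.tool_output


def gate(sessions: list[Session], verify_fn: Callable[[Session], bool]):
    """Split claimed successes into verified sessions and false positives."""
    verified, flagged = [], []
    for s in sessions:
        if not s.model_claimed_success:
            continue  # no completion claim, nothing to verify
        if verify_fn(s):
            verified.append(s)   # eligible for training data
        else:
            flagged.append(s)    # excluded and sent for review
    return verified, flagged


def false_positive_rate(n_flagged: int, n_claimed: int) -> float:
    """Share of completion claims that failed verification."""
    return n_flagged / n_claimed if n_claimed else 0.0


if __name__ == "__main__":
    # Counts reported in this section.
    print(f"baseline:  {false_positive_rate(4, 44):.1%}")
    print(f"aggregate: {false_positive_rate(121, 656):.1%}")
```

The important property is that the gate never consults the model's text: the only input to the check is evidence the code layer captured itself.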
Autonomous penetration testing platforms face the halt detection problem at commercial scale. A platform that executes multi-step attack chains autonomously must determine when an attack path is complete — when it has achieved a claimed objective versus when the model has produced plausible-sounding output that resembles completion. Practitioner assessments of commercial autonomous pentesting platforms, run against known-state environments where the correct answer is independently verifiable, have documented cases where the claimed objective and the independently verified state diverge. This is the halt detection failure in production: a probabilistic system deciding it is done, with no ground-truth verification layer to catch the gap between the claim and reality.
On the vendor response. The failure modes in this section are not unknown to the vendors of these tools. The dominant industry response has been retrieval-augmented generation: tethering model outputs to retrieved evidence to reduce hallucination. RAG improves output quality but does not address the structural problem. RAG solves data availability — it gives the model better information to reason over. It does not solve behavioral determinism: a model with perfect retrieval context can still experience output format drift, still make probabilistic routing decisions, still emit a false completion signal when the context window is saturated. A grounded model still produces narratives that synthesize rather than cite. RAG is compensating logic — better compensating logic than prompt engineering alone, but compensating logic nonetheless. The diagnostic from Section 1 applies: each layer added to compensate for model unreliability is evidence that the model is in the wrong role. The architecture, not the model quality, is the variable that needs to change.
4. The Design Taxonomy
The Stochastic Trap is not a problem of model capability. Adding a more capable model to an architecture that concentrates work in the model layer produces more plausible failures, not fewer failures. The remedy is architectural: assign work to the layer that can reliably perform it.
4.1 Work That Belongs in the Model Layer
- Generating the next investigative command given current tool output and session state
- Interpreting varied, unstructured tool output and extracting signal
- Reasoning across turns to maintain attack chain narrative coherence
- Generating candidate next steps from a large, combinatorial solution space
These are tasks where probabilistic reasoning produces better results than any rule. No static rule set covers the full space of tool output formats, target configurations, and investigative paths that a penetration test encounters. A model that has learned from a large corpus of security tooling output generalizes better than any hand-written parser.
4.2 Work That Belongs in the Code Layer
- Task routing to the correct analytical workflow
- Halt detection and session termination
- Safety constraint enforcement
- Session logging and audit trail maintenance
- Success verification against ground-truth tool output
These are tasks with correct answers. A routing decision is either right or wrong. A session is either complete or it isn't. Audit logs are either accurate or they're not. Assigning these tasks to the model layer and compensating with instructions is the Stochastic Trap. The code layer is the correct home.
Session state belongs here too. The model has no memory; it has only context. What the model knows about prior turns is exactly what the code layer has passed back to it. A system that dumps raw terminal history into each successive prompt is delegating state management to the model — asking it to reconstruct what is known, what has been tried, and what remains, from an unstructured log. The code layer should maintain a structured record of what execution has established: commands run, outputs confirmed, objectives partially satisfied. The model should receive a summary of that state alongside the immediate execution context, not the raw history it came from. This is not a convenience; it is what limits the context pressure that produces the U-shaped degradation Liu et al. document.
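A minimal sketch of code-layer session state, assuming a simple record of commands and satisfied sub-goals. The class and field names are hypothetical and are not taken from ARCHER; the point is that the prompt receives a bounded summary rather than the raw log.

```python
from dataclasses import dataclass, field


@dataclass
class CommandRecord:
    command: str
    confirmed: bool  # did the code layer confirm the output?
    note: str = ""   # short, code-generated description of the result


@dataclass
class SessionState:
    objective: str
    commands: list[CommandRecord] = field(default_factory=list)
    satisfied: list[str] = field(default_factory=list)  # sub-goals met so far

    def record(self, command: str, confirmed: bool, note: str = "") -> None:
        self.commands.append(CommandRecord(command, confirmed, note))

    def summary(self, last_n: int = 3) -> str:
        """Render a bounded summary for the next prompt."""
        lines = [f"Objective: {self.objective}"]
        lines.append("Satisfied: " + (", ".join(self.satisfied) or "none"))
        lines.append(f"Commands run: {len(self.commands)}")
        recent = self.commands[-last_n:] if last_n > 0 else []
        for rec in recent:
            status = "confirmed" if rec.confirmed else "unconfirmed"
            lines.append(f"- {rec.command} [{status}] {rec.note}".rstrip())
        return "\n".join(lines)


if __name__ == "__main__":
    state = SessionState(objective="obtain a root shell")
    state.record("nmap -sV 192.168.56.103", True, "service scan completed")
    state.satisfied.append("service enumeration")
    print(state.summary())
```

Because only the last few records are rendered, the summary grows by a fixed amount per turn regardless of how long the raw terminal history becomes.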
4.3 Work That Belongs in the Human Layer
- Defining scope and acceptable risk for each engagement
- Interpreting findings against organizational context the system cannot have
- Authorizing irreversible or high-impact actions
- Making the final call on findings that drive remediation decisions
This layer is not optional. A tool that removes human judgment from these decisions does not produce enterprise-grade security output. It produces plausible-sounding output that has not been held accountable to a professional standard.
5. Implications for Tool Design
Design principle 1: The model layer should be responsible only for operations where probabilistic reasoning produces better results than any deterministic alternative. Route, halt, log, and verify in code.
Design principle 2: Compensating logic for model unreliability is a diagnostic. Every parser written to handle output variation, every fallback when the model doesn't follow instructions, every safety check added after generation — each is evidence that the model is in the wrong role. Accumulation of compensating logic is the leading indicator of a system approaching architectural failure.
Design principle 3: Measure the right things. Model benchmark scores do not predict reliability in operational security tasks. What predicts reliability is the pass rate on structured eval objectives against real targets, with ground-truth verification of claimed successes. Build the measurement before relying on the output. One important qualification: pass rate against training targets is a necessary signal but not a sufficient one — an agent can achieve high pass rates by learning the specific implementations of its training environment rather than the underlying vulnerability classes. Measuring generalization requires eval infrastructure that can run the same objectives against novel targets. See Range Lock-In (Centaur Security Labs, pending release) for the full analysis of this failure mode and its fix.
Design principle 4: Strict code-layer boundaries reduce operational cost, not just architectural risk. A system that delegates routing, halt, and format compliance to the model layer compensates with longer system prompts, larger context windows, and more capable — more expensive — models to preserve the reliability that instruction-following cannot guarantee. Moving those responsibilities to the code layer reduces the context overhead the model needs to carry, which reduces the pressure that produces degradation in the first place. The economic case and the reliability case point in the same direction.
Design principle 5: Architecture is necessary but not sufficient. The principles above address the structural failure modes that produce unreliable AI security tooling. They do not address what determines performance once the architecture is sound. A well-architected system still produces high variance based on how it is directed — how operators structure context, calibrate trust, decompose roles, and encode failure patterns into durable constraints. That variable — human direction skill — is documented in The Direction Gap (Centaur Security Labs, pending release). The two analyses are sequential, not competing: architecture sets the floor; direction skill determines where above it you operate.
6. Reproducibility
The analysis in this paper is derived from ARCHER's development and eval harness. ARCHER is currently a private repository; reproducibility artifacts — eval harness, objective definitions, lab configuration, and baseline pass-rate tables — will be published alongside ARCHER's public release.
Minimum requirements to replicate:
- Hardware: GPU with ≥8 GB VRAM (RTX 4060 Mobile is the reference configuration; 8 GB is the binding constraint for Qwen3-14B quantized inference)
- Software: Python 3.11+, Ollama, Docker, Kali Linux container (Dockerfile provided in repo)
- Model: Qwen3-14B via Ollama (qwen3:14b, 4-bit quantized, --think=false)
- Target environment: Metasploitable2 (vulnerable Linux target), accessible at a fixed IP from the test container. All eval objectives target MS2 unless otherwise noted.
- Eval harness: testenv/eval_harness.py. 67 active objectives as of 2026-06-03. Run: python3 testenv/eval_harness.py --runs 3 --no-seed-playbook
Baseline
Eval baseline as of 2026-05-09: 94% pass rate (82 passes from 87 evaluation runs — 29 objectives × 3 runs each) under reference configuration (qwen3:14b, --think=false, archer-kali container, Metasploitable2 target). The remaining 22 objectives were added after the baseline was established. Per-objective breakdown to be published alongside ARCHER public release.
Pending reproducibility artifacts (to be published with ARCHER public release):
- Exact MS2 snapshot version and configuration required for objective-by-objective reproducibility
- Router accuracy figures (require routing log from N runs with the --ambiguous flag)
- V2 fine-tuning reproducibility requires RunPod A100 configuration
7. Recommendations
This section is written for security team leads, CISO offices, and procurement decision-makers evaluating AI security tooling.
Do not evaluate AI security tools on demo performance. Demos are constructed to show the model's probabilistic best case. Operational performance is the statistical average across all sessions, including edge cases the demo did not encounter. Ask vendors for pass rates on structured eval objectives against real targets, not against synthetic or curated scenarios.
Ask where routing, halt detection, and audit logging live. If the answer is "the model decides," treat this as a red flag. These are deterministic roles. A model that decides whether a session is complete will sometimes be wrong, with no mechanism for catching the error before the finding is reported.
Require ground-truth verification. Any AI tool that reports a finding should be able to show you the specific tool output the finding is derived from — not the model's paraphrase of it, not a summary, the actual output. If the tool cannot do this, the finding is not auditable and does not meet the evidentiary standard most regulated environments require.
Evaluate under constraint. A tool that requires cloud inference routes operational data — target configurations, vulnerability findings, network topology — to third-party infrastructure. Understand that data flow before deploying in environments with data residency requirements, classification boundaries, or regulatory constraints. Local-first is not a preference; it is a compliance requirement for many environments.
8. Falsifiable Claims
The following claims in this paper are falsifiable and should be tested empirically before this paper is submitted for formal review.
- Output format drift is context-length dependent. Prediction: format compliance rate decreases as session length (token count) increases. Falsified if: compliance rate is uncorrelated with session length across N sessions at each of K context lengths.
- Keyword routing is outperformed by classifier routing on ambiguous phrasings. Prediction: TF-IDF+LR classifier accuracy > keyword scorer accuracy on the --ambiguous task variant set. Partial measurement: keyword-only routing on 235 ambiguous test cases achieved 54.5% accuracy (17 routing failures, 107 misroutes). Classifier accuracy on a held-out ambiguous set is the required comparison; the training set used for the current classifier overlaps with these 235 cases and would not provide a fair comparison. Falsified if: classifier accuracy ≤ keyword accuracy at N ≥ 200 held-out labeled examples per skill category.
- Compensating parser complexity predicts reliability failure. Prediction: Systems with larger parsing layers (line count) have higher rates of parse failure on novel output formats. Falsified if: no correlation between parser line count and parse failure rate across the reference set of AI security tools.
- Fine-tuned output format compliance outperforms instruction-only compliance. Prediction: V2 model (fine-tuned on ARCHER sessions) produces lower malformed-output rates than V1 model (instruction-only) across the same eval objectives. Falsified if: malformed output rate is equivalent between V1 and V2 at N ≥ 50 sessions per configuration.
- False-positive halt signals enter training data at a measurable rate without verification gates. (Confirmed by eval data.) Across 87 baseline sessions, 9.1% of OBJECTIVE_ACHIEVED signals failed code-layer ground-truth verification. Across 1,639 total eval sessions, 18.4% failed verification. Both figures exceed the 5% falsification threshold. Without verify_fn, these sessions would have entered the training pipeline as labeled successes. Claim confirmed; the verification gate is not defensive redundancy: it is catching a real and consistent class of model error.
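The verification gate described in the last claim can be sketched as follows. The names OBJECTIVE_ACHIEVED and verify_fn come from the text above; everything else is an assumption made for illustration, including the function signatures, the three outcome labels, the `got_root` check, and the sample turns. The sketch shows the shape of the mechanism only: the model may emit a halt signal, but code decides whether it counts.

```python
from typing import Callable, List

HALT_TOKEN = "OBJECTIVE_ACHIEVED"


def gate_halt_signal(model_output: str, tool_output: str,
                     verify_fn: Callable[[str], bool]) -> str:
    """Classify one model turn as 'continue', 'verified', or 'rejected'.

    The model's halt signal is only a request; verify_fn checks the raw
    tool output for ground truth before the signal is accepted.
    """
    if HALT_TOKEN not in model_output:
        return "continue"
    return "verified" if verify_fn(tool_output) else "rejected"


def false_positive_rate(outcomes: List[str]) -> float:
    """Share of emitted halt signals that failed verification."""
    signals = [o for o in outcomes if o in ("verified", "rejected")]
    if not signals:
        return 0.0
    return signals.count("rejected") / len(signals)


def got_root(tool_output: str) -> bool:
    # Hypothetical objective check: did the last command run as root?
    return "uid=0(root)" in tool_output


# Illustrative (model_output, tool_output) pairs; not real session data.
turns = [
    ("Enumerating services.", "22/tcp open ssh"),
    ("OBJECTIVE_ACHIEVED", "uid=33(www-data) gid=33(www-data)"),
    ("OBJECTIVE_ACHIEVED", "uid=0(root) gid=0(root)"),
]
outcomes = [gate_halt_signal(m, t, got_root) for m, t in turns]

print(outcomes)                       # ['continue', 'rejected', 'verified']
print(false_positive_rate(outcomes))  # 0.5
```

Under this arrangement only sessions whose final outcome is "verified" would be eligible to enter a training pipeline as labeled successes, and the rejected share is the quantity the claim measures.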
Acknowledgments
Analysis derived from ARCHER development sessions conducted in 2026. The eval harness, audit pipeline, and session log infrastructure were built iteratively across the development period described in the ARCHER build journal.
Acknowledgments pending formal peer review.
Glossary
Compensating logic: Code added to handle cases where a model fails to follow an expected output format, structure, or behavior. Each piece of compensating logic is evidence that the system has placed the model in a code role — routing, format compliance, completion detection — for which probabilistic systems are structurally unsuited.
Context pressure: The degradation of instruction-following behavior in a language model as the context window fills with prior conversation turns. A model that reliably follows a format in short contexts may stop following it after many prior exchanges, because later instructions must compete with accumulated prior content for attention.
Context window: The maximum amount of text a language model can hold in working memory across a conversation. Bounded for all current models. Once exceeded, early content falls out of the model's effective attention and can no longer influence responses.
Deterministic system: A system whose outputs are fully determined by its inputs and internal state, with no randomness component. Task routing, command execution, and halt detection are examples of subsystems that must be deterministic — and therefore must be implemented in code rather than delegated to a language model.
Halt detection: The code-layer function that determines when an agent has completed its objective and should stop issuing commands. Must be implemented in code, not delegated to the model, because a model may emit completion signals inconsistently depending on context length, output format drift, or prior session content.
Output format drift: The tendency of a language model to stop following a prescribed output format over a long session. Observed as inconsistent or missing structural markers, verbose interpretation where terse tokens were expected, or gradual deviation from established patterns. Increases with context pressure.
Probabilistic system: A system that generates outputs through processes involving randomness or uncertainty, producing different results from the same inputs across runs. Language models are probabilistic systems by construction — identical prompts can produce structurally different outputs.
Stochastic Trap: The design pattern in which a probabilistic component (a language model) is placed in a role that requires deterministic behavior — task routing, format compliance, completion detection — and the system accretes compensating logic to mask the resulting failures rather than reassigning the role to a deterministic mechanism.
Task routing: The determination of which skill, capability, or tool should handle a given user request. Requires consistent classification of semantically varied phrasings into a finite set of categories. A deterministic trained classifier handles this reliably where a language model does not, because, for fixed weights, the classifier's output is a deterministic function of its input rather than a sample from a distribution.
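The routing and determinism entries above can be illustrated with a minimal code-layer router. This is a keyword scorer, not the TF-IDF+LR classifier discussed in the falsifiable claims, and it is shown only to demonstrate the deterministic property, not routing accuracy. The skill names, keyword sets, and the `route` function are hypothetical. The sketch returns no route when nothing matches or when two skills tie, rather than guessing.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical skill categories and keywords, for illustration only.
SKILL_KEYWORDS: Dict[str, Set[str]] = {
    "recon": {"scan", "enumerate", "discover", "ports"},
    "web": {"http", "sqli", "xss", "login"},
    "privesc": {"root", "sudo", "escalate", "suid"},
}


def route(task: str) -> Optional[str]:
    """Map a task phrasing to a skill, or None if no single skill wins.

    The result depends only on the input string and the fixed keyword
    table, so the same task always produces the same route.
    """
    tokens = set(re.findall(r"[a-z0-9]+", task.lower()))
    scores = {skill: len(tokens & kws) for skill, kws in SKILL_KEYWORDS.items()}
    best = max(scores.values())
    if best == 0:
        return None  # nothing matched: report a routing failure
    winners = [skill for skill, score in scores.items() if score == best]
    if len(winners) > 1:
        return None  # tie between skills: refuse instead of guessing
    return winners[0]


print(route("Scan the host and enumerate open ports"))  # recon
print(route("Check the login page for SQLi"))           # web
print(route("Scan the login page"))                     # None (recon/web tie)
```

Returning None on a tie makes ambiguous phrasings visible as routing failures that can be logged and counted, which is the kind of measurement the routing claim depends on.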
About the author: Jay Hawkins spent twenty years in the U.S. Army, including a decade in cyber operations — serving at USCYBERCOM, USCENTCOM, USNORTHCOM, and USEUCOM — and holds an active TS/SCI clearance. He builds local-first AI security tools and writes about the methodology, the hard lessons, and the compliance implications of doing it in production. CEH, CHFI, Pentest+, Security+.
Centaur Security Labs — centaursecuritylabs.com
References:
- Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. DOI: 10.1162/tacl_a_00638.
- PTES Technical Guidelines (2012). Penetration Testing Execution Standard. pentest-standard.org.
- OWASP Foundation (2020). Web Security Testing Guide, version 4.2. owasp.org/www-project-web-security-testing-guide/v42/.
- ISC2 (2025). 2025 AI Adoption Pulse Survey. isc2.org/Insights/2025/07/2025-isc2-ai-pulse-survey.
- Hawkins, J. (2026). Investigative provenance as a compliance requirement. Centaur Security Labs.
- Anonymous et al. (2025). CAIBench: A comprehensive benchmark for evaluating AI systems on cybersecurity tasks. arXiv:2510.24317.
- Tihanyi, N., Ferrag, M. A., Jain, R., Bisztray, T., & Debbah, M. (2024). HackSynth: LLM agent and evaluation framework for autonomous penetration testing. arXiv:2412.01778.