Context as Infrastructure: Accountability Loops in a Three-Layer AI System

Centaur Security Labs — Jay Hawkins

Companion to The Centaur Model


The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.


The Premise

The centaur model describes how to divide labor: deterministic work to the code layer, probabilistic reasoning to the model layer, judgment under uncertainty to the human layer. The division is necessary but not sufficient. A system where each layer executes its role in isolation does not produce a centaur — it produces three systems that happen to share a data format.

What makes the layers a system is information flow. Each layer must be able to see enough of what the other two produced to check it, correct it, and learn from it. When that visibility is absent or distorted, the layers don't just fail to improve each other — they actively mislead each other with confident wrong signals.

This article examines that dynamic: how information moves between ARCHER's three layers, what each layer requires to fulfill its accountability function, and what breaks when it doesn't receive it.


Stochastic and Deterministic Layers

Before examining the flows, consider the distinction that makes accountability structurally possible.

The model layer is stochastic. Given the same prompt and session state, it can produce different outputs across runs. The variation is not random noise, since it reflects genuine probabilistic inference, but it means no single run can be treated as ground truth. A model that passes an eval three times in a row may fail a fourth time with no code or environment change. A session that scores 2.8/3 on T2 might have scored 2.2/3 had it been scored an hour later with a slightly different sampling temperature.
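
One way to make the "no single run is ground truth" point concrete is to put an interval around a pass rate estimated from repeated runs. The sketch below uses the Wilson score interval; the function name and the 95% z-value are illustrative choices, not something taken from ARCHER.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate estimated from repeated runs.

    z = 1.96 corresponds to roughly 95% confidence (an illustrative default).
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)

# Three passes in three runs still leaves a wide range of plausible pass rates.
low, high = wilson_interval(3, 3)
print(f"3/3 passes: plausible pass rate between {low:.2f} and {high:.2f}")
```

With three passes out of three, the lower bound comes out near 0.44, so a fourth-run failure is well within what the data allows.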

The code layer is deterministic. Given the same session file, the halt checker reaches the same verdict, the router assigns the same skill, and the success function returns the same result. Reproducibility is not a property the code layer happens to have; it is the property that makes the code layer useful as a check on the model layer. If routing decisions were probabilistic, you could not distinguish a misrouted session from a routing system under load.
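
The reproducibility property can be stated as a test. The following is a minimal sketch, assuming a session is a plain dict; `halt_checker`, its fields, and `noisy_checker` are hypothetical stand-ins rather than ARCHER's actual components.

```python
import random
from typing import Callable

def is_reproducible(check: Callable[[dict], str], session: dict, runs: int = 5) -> bool:
    """Return True if `check` gives the same verdict on every run over the same session."""
    verdicts = {check(session) for _ in range(runs)}
    return len(verdicts) == 1

def halt_checker(session: dict) -> str:
    # Hypothetical deterministic rule: halt once the command budget is spent.
    return "halt" if session["commands_used"] >= session["command_budget"] else "continue"

def noisy_checker(session: dict) -> str:
    # Counter-example: a verdict that depends on a random draw, not on the session.
    return random.choice(["halt", "continue"])

session = {"commands_used": 12, "command_budget": 10}
print(is_reproducible(halt_checker, session))
print(is_reproducible(noisy_checker, session, runs=64))
```

A check that fails this test cannot serve as a fixed reference point for the model layer, because a changed verdict no longer implies a changed session.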

The human layer is pseudo-deterministic in process, stochastic in attention. A trained reviewer with full context makes consistent decisions. The same reviewer with thin context fills gaps with intuition, which introduces variance the audit trail cannot distinguish from the model's stochastic output. The human layer's reliability is proportional to the quality of the context it receives — not to the reviewer's competence.

This asymmetry is load-bearing. The code layer's determinism is what makes the model's stochastic outputs auditable over time. Statistical patterns emerge only because the code layer records each stochastic event faithfully and consistently. The human layer then applies contextual judgment to the deterministic record of stochastic events — but only if it can see that record clearly.


How Each Layer Generates Context for the Others

The model layer → code and human

The model's primary output is the session log: every command issued, every tool result received, every reasoning step, every final claim. This is the model layer's contribution to the other layers' context.

For the code layer, the session log feeds halt detection, success function evaluation, T2 scoring, and training data extraction. The code layer cannot check model quality without this log — it has no other window into what the model did.

For the human layer, the session log is the forensic record. When T2 flags a session as low-quality or a success function rejects a claimed achievement, the human reviewer's ability to override depends entirely on being able to read what the model actually did. In one session during ARCHER's development, a T2 scorer called a complete nmap vulnerability scan "truncated" because the large output exceeded its visible context, even though the scan's completion banner read "Nmap done: 1 IP address scanned in 111.62 seconds." The human override was correct and confident, but only because the raw log was accessible. Without it, the reviewer would have deferred to T2's judgment and excluded a valid session from the training corpus.

The code layer → model and human

The code layer's outputs fall into two categories: real-time control signals and post-hoc audit artifacts.

Real-time: routing decisions that determine which skill pack's hints the model receives, halt signals that tell the model when to stop, safety constraints that prevent out-of-scope actions. These directly shape model behavior before any output is produced. A model dispatched to the wrong skill receives the wrong hints and operates in the wrong frame — it is not failing on its own terms, it is succeeding at the wrong task. The code layer is not just infrastructure here; it is the context in which the model reasons.

Post-hoc: eval CSVs, T2 scores, T1 audit flags, session metadata. These are the raw material the human layer needs to distinguish a model that is improving from one that is drifting. Without trend data, the human cannot see direction — only current state. Without sub-dimension breakdowns, a T2 pass rate tells you that sessions are clearing the threshold but not which dimension is holding them back. Without halt reason decomposition, a halt discipline rate is nearly uninterpretable: it could mean the model is appropriately recognizing dead ends, or that it is running out of context mid-task on objectives it was close to completing.

The human layer → model and code

The human layer's outputs are the least automated but the most consequential.

For the model: T3 accept/reject verdicts are the final filter on what enters the fine-tuning corpus. Every verdict is a training signal — an accept says "this session represents the behavior I want to replicate." A contaminated accept teaches the model to replicate a mistake. A false reject removes a correct session from the training distribution. At scale, T3 accuracy determines the model's developmental trajectory. The human is not a passive observer of model quality; they are actively writing the model's next version.

For the code: hint design, eval objective construction, success function logic, and routing label authorship are all human outputs that flow into the code layer. When a hint block is too narrowly scoped to a specific target — teaching the model to solve a particular box rather than a vulnerability class — the code layer faithfully delivers that narrow hint at inference time, and the model duly overfits to it. The problem is not in the code; the code executed correctly. The problem is in the human input the code was given. The code layer surfaces this by showing router label balance and training data distribution, but it cannot fix the underlying authorship problem.


How Each Layer Fails Without Sufficient Context

The model without adequate code context

A model operating without hints is working from general training data. It may still produce correct outputs, but inconsistently — passing some runs and failing others without any systematic relationship to difficulty. Hints compress domain knowledge into the model's working context and raise the floor.

More insidiously: a model operating with wrong code context — misrouted to an adjacent skill, given hints for the wrong target — will produce fluent, confident, wrong outputs. This failure is harder to detect than no-context failure because the session looks plausible. The success function may pass it. T2 may score it well. A human reviewer without domain knowledge may accept it. The error propagates into the training corpus and degrades future performance in ways that are difficult to trace.

The code without accurate human input

Code-layer logic is only as good as its specification. A halt checker that fires too early on legitimate progress is not a code bug — it is a specification error. A success function that accepts a false positive because the expected string appears in error output is not a verification failure — it is a boundary condition the human who wrote it did not anticipate.
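
The false-positive case can be sketched as follows. This is a hypothetical success function, not ARCHER's: the `CommandResult` fields, both checker names, and the sample error message are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class CommandResult:
    """Hypothetical record of one command in a session log."""
    stdout: str
    stderr: str
    exit_code: int

def naive_success(results: list[CommandResult], expected: str) -> bool:
    # Matches the expected string anywhere, including error output.
    return any(expected in r.stdout + r.stderr for r in results)

def stricter_success(results: list[CommandResult], expected: str) -> bool:
    # Counts the string only on stdout of a command that exited cleanly.
    return any(r.exit_code == 0 and expected in r.stdout for r in results)

# A failed read whose error message happens to contain the expected string.
failed = CommandResult(stdout="", stderr="cat: /root/flag.txt: Permission denied", exit_code=1)
print(naive_success([failed], "flag.txt"))     # True: the false positive
print(stricter_success([failed], "flag.txt"))  # False
```

Both functions execute exactly as written. Which one is "correct" is a specification decision, and the stricter version has boundary conditions of its own that only a reviewer comparing verdicts to ground truth would find.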

The code layer has no mechanism for knowing it is wrong. It executes its logic faithfully and records the result. Only the human layer, examining mismatches between code verdicts and ground truth, can identify these boundary errors and correct them. Without an accessible audit trail and clear presentation of code-layer decisions, the human cannot perform this function. The code layer accumulates systematic errors silently.

The human without dashboard context

This is the failure mode that appears most clearly in ARCHER's development history, because the dashboard has been built iteratively and the before/after comparison is visible.

The headline number trap. A top-line objective-achievement (OA) rate can look like a regression when it is really an artifact of what is being compared. Consider an illustrative but representative case from ARCHER's development: an eval run reports roughly 50% OA, a drop of more than 40 percentage points against the standing baseline, and read on its own it looks like a serious regression worth dropping everything for. The full picture is the opposite. The "regression" run was a phrasing-variance sweep: fifty rephrasings of a single hard objective (a multi-step SSH pivot, achieved in 25 of 50 variants). The baseline it was compared against was a disjoint set of objectives (a broad multi-objective baseline at ~94% OA, with most objectives at 100%). The two runs share no objectives, and the swept objective routes almost entirely to the most VRAM-sensitive skill in the pack set (ssh_proxyjump; see ARCHER's eval-data-analytics deep dive §2, where that skill loses ~34pp under VRAM pressure). The aggregate drop is not a regression in model quality. It is a hard objective's phrasing variance measured against an unrelated baseline, dressed up as a single scary number. A human making triage decisions from the headline alone would prioritize the wrong work.
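
A guard against this trap is to refuse to report a delta unless the two runs share objectives. The sketch below assumes each run is available as a mapping from objective ID to an achieved flag; the function name, the IDs, and the return shape are hypothetical.

```python
def compare_oa(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare objective achievement (OA) only over objectives both runs share.

    Each dict maps a hypothetical objective ID to whether it was achieved.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        # Disjoint objective sets: any headline delta would be meaningless.
        return {"comparable": False, "shared": 0, "delta_pp": None}
    base_rate = sum(baseline[o] for o in shared) / len(shared)
    cand_rate = sum(candidate[o] for o in shared) / len(shared)
    return {
        "comparable": True,
        "shared": len(shared),
        "delta_pp": round((cand_rate - base_rate) * 100, 1),
    }

# A phrasing sweep of one objective versus a broad baseline: no overlap.
baseline = {"obj_a": True, "obj_b": True, "obj_c": False}
sweep = {f"ssh_pivot_variant_{i}": i % 2 == 0 for i in range(50)}
print(compare_oa(baseline, sweep))
```

Returning `comparable: False` instead of a number forces the dashboard to show that the comparison is undefined rather than showing a large negative delta.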

The same caveat applies to every such headline. ARCHER's own eval data shows a documented ~45-percentage-point swing in pass rate by time of day (eval-data-analytics §1), on top of the skill-specific VRAM effects above. Any single OA figure is point-in-time and confounded by collection conditions; the number is a prompt to decompose, not a verdict to act on.

The opaque metric. A halt discipline rate without decomposition is nearly useless. Before the halt reason breakdown chart existed, a high HD rate told the reviewer that halts were frequent. It did not tell them whether those halts were the model correctly recognizing dead ends (good), the model running out of context on achievable objectives (bad), or infrastructure failures masquerading as model behavior (irrelevant to training). These have completely different remediation paths — hint coverage, context budget, Docker stability — and the single number cannot distinguish between them.
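
The decomposition itself is a small aggregation. In the sketch below, the three reason labels and the shape of the input are assumptions; the remediation strings are the three paths named above.

```python
from collections import Counter

# Hypothetical halt-reason labels mapped to the remediation paths named in the text.
REMEDIATION = {
    "dead_end": "hint coverage",
    "context_exhausted": "context budget",
    "infra_failure": "Docker stability",
}

def halt_breakdown(halt_reasons: list[str]) -> list[tuple[str, int, float, str]]:
    """Return (reason, count, share of all halts, remediation path), most frequent first."""
    counts = Counter(halt_reasons)
    total = sum(counts.values())
    return [
        (reason, n, n / total, REMEDIATION.get(reason, "unclassified"))
        for reason, n in counts.most_common()
    ]

reasons = ["dead_end"] * 2 + ["context_exhausted"] * 5 + ["infra_failure"] * 3
for reason, n, share, fix in halt_breakdown(reasons):
    print(f"{reason:<18} {n:>3} {share:>5.0%}  -> {fix}")
```

The same ten halts that read as one rate now read as three separate work items, each pointing at a different owner.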

The labeling blind spot. The router label balance table showed unknown as a skill with labeled examples. Reaching the 50-label gate threshold appeared to mean router training coverage. In fact, unknown entries in the label CSV represent sessions a reviewer could not map to any skill at all — labeling dead-ends that cannot train the router to dispatch anywhere. The count being tracked as a gate metric was actively misleading: a rising unknown count signals a problem (ambiguous task phrasing or a skill gap in the pack set), not a training success. Without understanding what the data represented, the metric was worse than absent — it provided false confidence.
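
A gate check that treats unknown as a dead-end rather than a skill might look like the sketch below. It assumes the gate applies per skill and that labels arrive as a flat list of skill names; the function name and that reading of the 50-label gate are assumptions, not a description of ARCHER's code.

```python
from collections import Counter

GATE_THRESHOLD = 50  # the 50-label gate mentioned in the text

def gate_status(labels: list[str], threshold: int = GATE_THRESHOLD) -> dict:
    """Report which real skills meet the gate, with 'unknown' tracked as a warning signal."""
    counts = Counter(labels)
    unknown = counts.pop("unknown", 0)  # excluded from coverage on purpose
    total = unknown + sum(counts.values())
    return {
        "skills_at_gate": sorted(s for s, n in counts.items() if n >= threshold),
        "unknown_count": unknown,
        "unknown_share": unknown / total if total else 0.0,
    }

labels = ["ssh_proxyjump"] * 20 + ["unknown"] * 60
print(gate_status(labels))
```

Here no skill has reached the gate even though one row in the balance table holds 60 labels, and the high unknown share is surfaced as something to investigate.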

The T2 truncation false negative. T2 scoring is stochastic and context-limited: a large session output that exceeds T2's visible context window may be scored on partial information. A complete nmap vulnerability scan was scored 2/3, with Completion Validity flagged at 1/3 because the scorer called the output "truncated." The human reviewing the raw session could see the scan completion banner and override correctly. Without access to the session log, or without understanding the failure mode (T2 has a finite context window and large outputs fall off the back), the human would have accepted T2's verdict and excluded a valid session from the training corpus.
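
A deterministic cross-check can surface this disagreement for review without overriding T2. The sketch below is an assumption about how such a check might be written: it looks for nmap's completion banner in the raw log, and the function name and the boolean T2 flag are hypothetical.

```python
import re

# Matches nmap's completion banner, with or without the parenthetical host
# count that nmap normally prints, e.g. "(1 host up)".
NMAP_DONE = re.compile(
    r"^Nmap done: \d+ IP address(?:es)? (?:\(.*?\) )?scanned in [\d.]+ seconds",
    re.MULTILINE,
)

def needs_human_review(raw_log: str, t2_flagged_truncated: bool) -> bool:
    """Flag sessions where T2 says 'truncated' but the raw log shows a finished scan."""
    return t2_flagged_truncated and NMAP_DONE.search(raw_log) is not None

log = (
    "Starting Nmap scan\n"
    "PORT   STATE SERVICE\n"
    "22/tcp open  ssh\n"
    "Nmap done: 1 IP address (1 host up) scanned in 111.62 seconds\n"
)
print(needs_human_review(log, t2_flagged_truncated=True))
```

The check only routes the session to a reviewer. The verdict stays with the human layer, which is the point of the override described above.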


The Upleveling Loops

The system improves when each layer's context gets richer over time — and the improvement is compounding.

Better model outputs reduce the cognitive load on human reviewers. Instead of making fine-grained decisions about borderline sessions, reviewers deal primarily with genuinely ambiguous edge cases. Attention concentrates where judgment is actually required. The freed capacity feeds back into hint quality, eval design, and routing label authorship — all of which improve the code context the model receives.

Better code context (richer hints, tighter routing, accurate halt detection) raises the model's floor on a per-skill basis. Fewer turns are wasted on approach uncertainty. More sessions complete within the command budget. Quality gate pass rates rise. The training pipeline delivers more usable data per eval run, accelerating the improvement cycle.

Better human context (richer dashboards, cross-linked diagnostics, decomposed metrics) converts human attention from interpretation to action. The halt reason breakdown doesn't just tell the reviewer what happened — it points to a specific code-layer remediation. High halt rate with low clean objective achievement → hint coverage problem → Coder files the issue → hint improves → model performance rises → next eval run shows the change. The human layer becomes an effective quality gate rather than a bottleneck that passes everything or blocks everything based on inadequate information.

The accountability flows in both directions on every edge. The model improves the training signal the human evaluates; the human improves the training data the model trains on. The code improves the context the model reasons in; the model's session logs reveal the routing errors and hint gaps the code needs corrected. The human fixes the code specification errors the code layer cannot see; the code layer surfaces the systematic patterns the human cannot detect without aggregation.


The Wrong-Layer Antipattern

The failure mode that most reliably disrupts these loops is assigning work to the wrong layer. When code compensates for model unreliability — a parser that handles output format variations, a fallback when the model doesn't follow a structured response format — the model's unreliability is hidden from the human layer. The session succeeds. T2 scores it well. T3 accepts it. The training corpus includes an example where the model produced malformed output and the code quietly corrected it. The model learns that malformed output is acceptable. The problem compounds.

This pattern has appeared concretely in ARCHER's development. During one eval cycle, a code-layer parser was added to normalize inconsistent tool output formatting before the success function evaluated it. Sessions that produced malformed output therefore cleared the verifier and entered the T3 review queue without the malformation being visible in T2 scoring — which saw only the normalized result. Several of those sessions were accepted into the fine-tuning corpus. The output-format problem was later caught in a code audit, not through the training pipeline's own quality gates. This is the mechanical description of how training data contamination accumulates in systems that optimize for short-term pass rates rather than long-term quality. The code layer should execute, log, and enforce — not compensate. The model's output quality must be visible to the human layer, not absorbed by the code layer before it can be evaluated.
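
The difference between compensating and logging can be shown in a parser that still tolerates malformed output but records that it did so. This is a sketch under stated assumptions: it treats tool output as JSON, and the `ParsedOutput` type, its fields, and the repair strategy are all hypothetical.

```python
import json
from dataclasses import dataclass, field

@dataclass
class ParsedOutput:
    value: object          # parsed payload, or None if nothing could be recovered
    malformed: bool        # True whenever the raw output needed repair
    notes: list[str] = field(default_factory=list)

def parse_tool_output(raw: str) -> ParsedOutput:
    """Parse model-emitted JSON, recording any repair instead of hiding it."""
    try:
        return ParsedOutput(value=json.loads(raw), malformed=False)
    except json.JSONDecodeError:
        pass
    # Repair attempt: pull a JSON object out of surrounding prose.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            value = json.loads(raw[start:end + 1])
            return ParsedOutput(value, True, ["extracted JSON object from surrounding text"])
        except json.JSONDecodeError:
            pass
    return ParsedOutput(None, True, ["unparseable output"])

result = parse_tool_output('Sure, here it is: {"port": 22} Hope that helps.')
print(result.malformed, result.value, result.notes)
```

If the `malformed` flag travels with the session into scoring and review, the repaired session can still be evaluated while its formatting failure stays visible to the layers that decide what enters the training corpus.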

The same principle applies in reverse. When human reviewers make decisions based on automated scores alone — without reading session logs for borderline cases — the human layer has effectively delegated its judgment to a stochastic scorer. The accountability check the human layer exists to provide is not being performed. An automated score is a filter, not a verdict. The human layer's function is to be the check that stochastic variation cannot be.


Context Is Infrastructure

The operator dashboard is not a cosmetic layer on top of ARCHER's data. It is the primary mechanism by which the human layer receives the context it needs to perform its accountability function.

This has a concrete implication for how UI development should be prioritized: the most valuable features are not the ones that add new data, but the ones that make existing data interpretable in context. The halt reason breakdown chart added no new data — the outcome counts were already in the eval CSVs. What it added was decomposition: the ability to see not just that the halt rate was high, but what kind of halts were accumulating and in which direction.

Cross-page links from failing objectives to session logs add no new data either. The sessions were already there, searchable. What the links add is reduced friction: the path from "this objective is failing" to "this is what the model did on its last three runs" goes from a manual search to one click. Human attention is finite. Every decision that requires three manual steps instead of one leaves the human with less cognitive capacity for the judgment that no automation can supply.

The system is a loop, not a hierarchy. The model, code, and human layers improve only in proportion to how clearly each can see what the other two produced. Context is not a convenience feature. It is the substrate on which accountability is built.


Related: The Centaur Model · ARCHER Failure Mode Inventory · Range Lock-In