Guiding Principles
These are the operating principles behind Centaur Agent (internal codename: ARCHER) and the research program that surrounds it. They were not written before the work began — they were extracted from decisions made under real constraints, failure modes encountered in production, and arguments about the right way to build AI systems that carry evidentiary weight.
They are called principles rather than rules because rules can be followed without understanding. Understanding why a principle exists, and what it costs you when you violate it, is what makes it useful rather than one more item on a checklist.
The principles are grouped into four themes: how we establish what is true, how the three-layer centaur structure checks itself, what keeps the training corpus and its feedback loop sound, and the discipline around changing the system. The grouping is thematic, not a priority ranking — which principle binds hardest in a given moment is set by blast radius (see Proportionality, below), not by position in the list. Cross-references between principles are by name rather than number so the structure can grow without the references rotting.
I. Evidence & Epistemics
1. Failure Is Data, Not Outcome
A failed eval run is not a setback. It is a labeled example of a specific gap between the model's current behavior and the required behavior. Every 0/3 result is a precise instruction to the training pipeline: this is what the model does; this is what it should have done.
The corollary: a system that suppresses its own failures is actively degrading its ability to improve. ARCHER's benchmark dashboard publishes pass rates in real time — including failures — because failure is only useful if it is visible. A 94% pass rate means the 6% is documented, named, and being worked on.
The violation: treating failures as embarrassing rather than informative, or filtering them from reports before they reach the training pipeline. A model trained only on its own successes learns to reproduce those successes and nothing else.
2. Instrument Everything
You cannot improve what you cannot measure. Every ARCHER session produces a timestamped log: each command issued, raw output returned, findings linked to specific tool output. Every session is hashed. Every scoring decision is recorded with a rubric. Every routing decision is logged with its confidence score and the alternative it considered.
This is not bureaucracy. It is the only way to distinguish "the model improved" from "this eval run happened to sample easier variants." Without instrumentation, all improvement claims are anecdotal.
The violation: collecting outcome data without provenance. A pass rate without a session log is an opinion. An LLM-as-judge score without a calibration record is a guess. Instrumentation is what converts an opinion into evidence.
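As an illustration of what a provenance-carrying session record could look like, here is a minimal sketch. The class names, field names, and the choice of SHA-256 over canonical JSON are assumptions made for this example, not a description of ARCHER's actual log format.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class CommandRecord:
    """One command issued during a session, with its raw tool output."""
    command: str
    raw_output: str
    exit_code: int
    timestamp: str  # ISO 8601, UTC


@dataclass
class SessionLog:
    """Hypothetical session record: commands plus findings tied to command indexes."""
    session_id: str
    commands: list = field(default_factory=list)
    findings: list = field(default_factory=list)

    def record(self, command, raw_output, exit_code, timestamp=None):
        ts = timestamp or datetime.now(timezone.utc).isoformat()
        self.commands.append(CommandRecord(command, raw_output, exit_code, ts))
        return len(self.commands) - 1  # index used to link findings to evidence

    def add_finding(self, text, evidence_index):
        # A finding without a valid evidence link is rejected, not stored.
        if not 0 <= evidence_index < len(self.commands):
            raise ValueError("finding must reference an existing command record")
        self.findings.append({"text": text, "evidence": evidence_index})

    def digest(self):
        # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The design point is that a finding cannot be stored without pointing at a specific command record, and any later edit to the log changes the digest.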
3. RCA Before You Patch
Every fix must be preceded by a written root cause: "The failure is caused by X, evidenced by Y in session log Z." If that sentence cannot be written, the cause is unknown and the fix is a guess.
Guesses fix the wrong thing. PT-AD-04 failed because credentials were missing from variant task descriptions — not because the model couldn't perform Kerberoasting. A fix that provided more Kerberoasting training examples would have produced a model that Kerberoasts more confidently and still fails PT-AD-04 because the variant text gives it no credentials to use.
The violation: committing a fix before the diagnosis is written. The written root cause is not a formality — it is the mechanism that forces you to actually understand what you are changing before you change it.
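One way to make the written root cause a hard precondition is to model its three parts explicitly and refuse the fix when any part is empty. The sketch below is illustrative only: `RootCause` and `may_commit_fix` are hypothetical names, and a real check would also need to confirm that the cited session log exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootCause:
    cause: str        # X: what causes the failure
    evidence: str     # Y: the observation that supports it
    session_log: str  # Z: identifier of the log where Y can be found

    def sentence(self):
        return (f"The failure is caused by {self.cause}, "
                f"evidenced by {self.evidence} in session log {self.session_log}.")


def may_commit_fix(rca):
    """Allow a fix only if all three parts of the root cause are filled in."""
    if rca is None:
        return False
    return all(part.strip() for part in (rca.cause, rca.evidence, rca.session_log))
```

A check like this cannot tell whether the diagnosis is right. It only guarantees that the sentence was written before the change was made.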
4. Small Samples Lie
One failed eval run is a hypothesis. Three is a trend. Ten is evidence. A single 0/1 result on one objective is statistically indistinguishable from infrastructure noise — the eval target might not have been deployed, the model might have loaded cold, the variant sampled might have been the hardest one in the set.
Acting on a hypothesis as if it were evidence produces two failure modes: fixing code that isn't broken (because the failure was noise) or missing the real regression (because the actual failure was low-frequency and the sample was too small to catch it).
The violation: treating a single result as actionable without confirmation. The --runs 3 standard exists because one failure is often chance, while three consecutive failures rarely are at the pass rates we typically see. Three runs is the floor for telling systematic failure from random failure, not proof of it.
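The arithmetic behind the run-count argument can be sketched with a binomial model. This assumes runs are independent and share one fixed per-run pass rate, which is a simplification: infrastructure faults such as an undeployed target affect every run at once and are not captured here.

```python
from math import comb


def prob_passes(k, n, p):
    """Probability of exactly k passes in n independent runs with per-run pass rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def prob_all_fail(n, p):
    """Probability that n independent runs all fail purely by chance."""
    return prob_passes(0, n, p)


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.94):
        print(f"p={p}: 0/1 by chance={prob_all_fail(1, p):.4f}, "
              f"0/3 by chance={prob_all_fail(3, p):.6f}")
```

Under these assumptions, a model with a true pass rate of 0.8 fails a single run one time in five, but fails three in a row only about 8 times in a thousand. At a pass rate of 0.94 the same figures are 6% and roughly 0.02%.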
5. If You Can't Follow It, Treat It as Hallucination
When the model produces something you cannot follow — a command whose effect you can't predict, a conclusion whose reasoning you can't reconstruct, a finding you can't trace back to evidence — the correct default is not deference. It is suspicion. Treat the output as a hallucination until it makes sense to you, rather than as insight you haven't yet caught up to.
The asymmetry is the whole point. A model that is confidently wrong and a model that is correct-but-beyond-your-current-understanding produce the same subjective experience: fluent output you cannot immediately verify. Those two cases are indistinguishable from the inside, so the only safe default is the one that fails closed. Extending the benefit of the doubt to plausible-sounding output is precisely the reflex that lets confident fabrication pass as competence — The Verifier Must Not Share Substrate with the Agent names that failure Phantom Pass; this principle is the same failure seen from the human's chair. Comprehension is the burden the human layer is supposed to carry; outsourcing it to the model's apparent confidence is how the human check collapses into the rubber-stamp form of layer collapse described in Each Layer Checks the Others.
So the inability to explain a model action is itself a finding. It means one of two things — the action is wrong, or your mental model is incomplete — and both have to be resolved before the output is trusted. Resolving it is real work: read the log, reproduce the step, reconstruct the reasoning until the behavior is explicable. ARCHER's verifiers encode the machine version of this — a result passes because tool-execution evidence confirms it, not because the model claimed it. This principle is the same discipline applied to human judgment: if you cannot say why the output is correct, you do not yet know that it is.
The violation: accepting model output because it sounds authoritative, on the assumption that the gap is in your understanding rather than in the output. Sometimes the gap genuinely is yours — but closing it is the job, not a step to skip. Trusting the fluent answer instead of doing that work is how hallucinations enter the record wearing the costume of expertise.
II. The Centaur Structure
6. Roles Have Authority, Not Just Responsibility
ARCHER's development runs on four roles: Coder implements, Auditor verifies, Scribe documents, and Researcher investigates. A Coder cannot close their own bug reports. An Auditor does not write source code. A Researcher does not decide what its own findings mean for the product. This separation is not organizational courtesy — it is a control on the feedback loop.
If the same person who writes a fix also verifies it, the verification will be biased toward confirming what they believe. More concretely: a fix that passes one targeted eval run looks confirmed, but may have introduced regressions in adjacent objectives the author didn't think to check. The Auditor's job is to be the person who checks what the author didn't think to check.
The Researcher exists for the same reason, one step earlier in the loop: decisions made on intuition are decisions made without evidence, and a person who both gathers the evidence and acts on it will gather the evidence that confirms what they already intended to do. So the Researcher's authority is deliberately partial. It investigates prior art, the academic literature, the competitive landscape, and our own code, and it produces decision-ready briefings — cited, with disconfirming evidence surfaced and confidence stated. But it recommends; it does not decide, ship, verify, or canonize. A research finding does not become source code, an eval verdict, or a published paper by being written down. It becomes those things only when a Coder, Auditor, or Scribe acts on it — which is the moment a human chooses to.
The violation: author self-verification, and its mirror, research self-ratification. The pending-verification label is not a status update — it is an access control that signals the issue has moved to the Auditor's lane and may not be closed without their sign-off on a passing independent verification run. Likewise, "I researched it" is not "we decided it": a briefing is advisory until a human acts on it through the appropriate role.
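Treating the pending-verification label as an access control, rather than a status, could look like the following sketch. The `Issue` fields and the `may_close` function are hypothetical. Only the rule itself comes from the text above: no close without an Auditor's sign-off on a passing independent run.

```python
from dataclasses import dataclass, field
from typing import Optional

PENDING = "pending-verification"


@dataclass
class Issue:
    author: str
    labels: set = field(default_factory=set)
    signed_off_by: Optional[str] = None    # Auditor who signed off, if any
    verification_run_passed: bool = False  # result of the independent run


def may_close(issue):
    """Decide whether an issue in the Auditor's lane may be closed."""
    if PENDING not in issue.labels:
        # Unlabeled issues are outside the scope of this sketch.
        raise ValueError("this sketch only models issues labeled pending-verification")
    if issue.signed_off_by is None:
        return False  # no sign-off at all
    if issue.signed_off_by == issue.author:
        return False  # author self-verification
    return issue.verification_run_passed
```

The check on `signed_off_by == issue.author` is the part that encodes role separation. A passing run signed off by the fix's own author still does not close the issue.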
7. Each Layer Checks the Others
The three-layer centaur structure — model, code, human — only produces reliable output if each layer generates its conclusions independently and has the structural capacity to challenge the layers beside it. The separation of responsibilities (model generates, code enforces, human decides) defines who does what. The epistemic principle is stronger: each layer must reach its conclusions from its own evidence, not by relaying what the layer below reported.
The layers generate categorically different knowledge that cannot substitute for each other. The model generates command sequences and attack chain reasoning from trained domain knowledge. The code generates structured execution records — exit codes, command logs, structured stdout — from deterministic parsing. The human generates contextual judgment from organizational context, threat intelligence, and interpretive authority. When one layer attempts the epistemic work of another, the checking relationship fails even if the responsibility boundary still looks intact on paper.
This has a concrete structural requirement: visibility is a prerequisite for oversight, not a convenience. A human who can only see aggregate pass/fail rates cannot challenge the code layer's interpretation of a specific session. The session logs, audit trail, and drillable dashboard are not reporting tools — they are the mechanism by which human accountability is structurally possible rather than nominal.
The centaur structure's value is precisely that each layer's blind spots are covered by the other two. The model cannot verify that its commands succeeded in the real world — the code layer's execution records are authoritative on that. The code cannot interpret whether a found credential is organizationally significant — that is human judgment. The human cannot reliably generate attack chains from scratch — the model's domain knowledge is the asset. None of these layers is sufficient alone, and none is merely subordinate: each checks the others.
The violation (layer collapse): It manifests in three forms. First, model-checks-model: the model effectively evaluates its own output — the failure the next principle, The Verifier Must Not Share Substrate with the Agent, addresses in full. Second, code-interprets-context: code makes scope or policy decisions that require organizational knowledge it doesn't have, producing correct-looking output for the wrong targets. Third, human-rubber-stamps: the human approves audit output without structural capacity to interrogate a specific session. The audit trail must be navigable, not merely present. A log that exists but cannot be drilled into is not oversight.
8. The Verifier Must Not Share Substrate with the Agent
A verifier that reads model-generated text to determine pass/fail is measuring what the model claims, not what it did. Pass conditions must be grounded in externally-sourced, bounded evidence — structured command execution logs, network state, filesystem artifacts — stripped of the model's narration of its own actions.
This has a precise corollary for task design: placeholder strings like <login-endpoint> are non-empty inputs to the model. The model receives a structural cue about what kind of value belongs there. The absence of a real URL is not the same as no information. A verifier must be designed knowing that distinction.
The violation (Phantom Pass): A verifier that checks whether a keyword appears anywhere in session output will pass when the model describes the technique rather than executes it. The corpus fills with sessions where the model narrated the right answer. The trained model learns to narrate more confidently — a behavioral pattern that is indistinguishable from skill until you look at what the verifier actually checked.
The fix is substrate separation: the verifier reads only what the tool execution layer produced, not what the model wrote. These are categorically different evidence types. Conflating them is how confident-sounding failures enter training data as successes.
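The difference between the two verifier styles can be shown in a few lines. This is a sketch under stated assumptions: `ToolExecution`, the tool name `roast-tool`, and the marker string are all invented for illustration, and a production verifier would check richer evidence than one stdout marker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolExecution:
    """Structured record produced by the tool execution layer, not by the model."""
    command: str
    exit_code: int
    stdout: str


def keyword_verifier(session_text, keyword):
    # Anti-pattern: passes if the model merely mentions the technique.
    return keyword.lower() in session_text.lower()


def substrate_separated_verifier(executions, tool_name, expected_marker):
    # Passes only if the tool was actually run, exited 0, and its own
    # output (never the model's narration) contains the expected marker.
    for ex in executions:
        argv = ex.command.split()
        if (argv and argv[0] == tool_name
                and ex.exit_code == 0
                and expected_marker in ex.stdout):
            return True
    return False


# Hypothetical session in which the model narrates but never executes.
narration = "I would perform Kerberoasting here and then crack the ticket."
no_executions = []
```

With this input, `keyword_verifier(narration, "kerberoasting")` returns True while `substrate_separated_verifier(no_executions, "roast-tool", "TICKET")` returns False. The first result is the Phantom Pass.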
III. Corpus & the Loop
9. Gates Protect the Corpus
The training corpus is the model. A model trained on corrupted or low-quality data learns corrupted behavior, and that behavior will be subtle and hard to trace because it presents as model quality degradation rather than data quality failure.
Every session that enters the fine-tuning pipeline passes multiple gates: a structural Tier 1 check (right target, non-empty output, no degenerate loops), a Tier 2 LLM-as-judge score of ≥2, a transferability score confirming the session demonstrates generalizable skill, and a human preflight confirming no sessions are flagged for review. These gates exist because the cost of bad training data is paid silently over many training iterations in ways that are genuinely hard to reverse.
The violation: bypassing gates under time pressure. "It mostly looks fine" is not a pass condition. The gates are set at the thresholds they are because sessions below those thresholds have a documented rate of producing training noise.
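The gate sequence described above can be sketched as a single admission check that names every gate a session fails. The field names are hypothetical, and transferability is modeled as a boolean because the text does not state its threshold. Only the Tier 2 minimum of 2 is taken from the source.

```python
from dataclasses import dataclass

TIER2_MIN_SCORE = 2  # from the text: Tier 2 LLM-as-judge score of at least 2


@dataclass(frozen=True)
class SessionScores:
    tier1_structural_ok: bool  # right target, non-empty output, no degenerate loops
    tier2_judge_score: int     # LLM-as-judge score
    transferable: bool         # assumption: threshold already applied upstream
    flagged_for_review: bool   # human preflight: True blocks admission


def gate_failures(s):
    """Return the names of all gates the session fails (empty list means admit)."""
    failures = []
    if not s.tier1_structural_ok:
        failures.append("tier1_structural")
    if s.tier2_judge_score < TIER2_MIN_SCORE:
        failures.append("tier2_judge")
    if not s.transferable:
        failures.append("transferability")
    if s.flagged_for_review:
        failures.append("human_preflight")
    return failures


def admit(s):
    return not gate_failures(s)
```

Returning the list of failed gates, rather than a bare boolean, keeps the rejection itself instrumented. There is deliberately no override parameter.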
10. The Loop Must Close
Eval → score → partition → train → re-eval. Every component of the pipeline exists to serve this closed loop. Data collection without scoring is noise accumulation. Scoring without partitioning is undifferentiated quality. Partitioning without training is wasted compute. Training without re-eval is faith.
A system with an open loop has no self-correction mechanism. It can get worse without noticing. The closed loop is what makes ARCHER self-improving rather than self-sustaining at a fixed quality level.
The violation: designing pipeline components that don't feed back into anything. An audit process that doesn't change what enters training closes no loop. A dashboard that reports results nobody acts on closes no loop.
IV. Change & Engineering Discipline
11. Explicit Over Implicit
Never silently depend on context that might not be there. Variant task descriptions embed credentials explicitly when the system knows them. Routing targets are named with IP addresses, not inferred from environment. Safety constraints are enforced in deterministic code, not trusted to model behavior.
Hidden dependencies produce failures with no traceable cause. A variant that assumes the model knows msfadmin:msfadmin because the eval environment has run these tasks a hundred times will fail silently on any machine, variant phrasing, or model version where that assumption doesn't hold. Explicit dependencies fail loudly, with a cause you can name.
The violation: trusting that "the model will figure it out." In a controlled eval environment, it often does — until it doesn't, and you cannot determine why.
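A small sketch of "fail loudly" for task descriptions: render a template only when every value it references has been supplied explicitly. The function name, template, and IP address are invented for this example, and the sketch assumes simple field names (no attribute or index lookups inside the braces).

```python
import string


def render_task(template, context):
    """Fill a task template, failing loudly if any referenced field is missing or empty."""
    fields = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    missing = [f for f in fields if not context.get(f)]
    if missing:
        raise KeyError(f"task template needs explicit values for: {', '.join(missing)}")
    return template.format(**context)


# Hypothetical variant template with the credentials embedded explicitly.
TEMPLATE = "Log in to {target_ip} as {username}:{password}."
```

Calling `render_task(TEMPLATE, {"target_ip": "10.0.0.5", "username": "msfadmin"})` raises a `KeyError` naming `password`, instead of producing a task that silently relies on the model already knowing it.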
12. Behavioral Equivalence Is Non-Negotiable
A change that silently alters outputs under any condition is a hidden regression. Any fix with a default-off flag must produce byte-identical output to the pre-change version when the flag is off. Any verifier change must be re-run against all objectives using that skill pack — not just the target objective.
The scope of a change is never just the lines edited. Every function that calls a modified function, every objective that uses a modified verifier, every eval run that touches a changed component is potentially affected. The boundary of what requires re-verification is determined by the dependency graph, not by the author's intent.
The violation: declaring a fix done based on a targeted re-run that didn't cover adjacent objectives. A verifier fix that passes the target objective but was never run against the full skill pack ships with untested surface area.
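Both requirements in this principle lend themselves to mechanical checks. The sketch below assumes a dependency map from each component to the things it uses, and functions that return bytes. All component names other than PT-AD-04 are hypothetical.

```python
from collections import deque


def affected_by(changed, depends_on):
    """Return every node that transitively depends on `changed`."""
    # Invert the edges: for each dependency, record who uses it.
    used_by = {}
    for node, deps in depends_on.items():
        for dep in deps:
            used_by.setdefault(dep, set()).add(node)
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for user in used_by.get(current, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    seen.discard(changed)  # only relevant if the graph contains a cycle
    return seen


def flag_off_is_equivalent(old_fn, new_fn, inputs):
    """True if new_fn with the flag off matches old_fn byte for byte on every input."""
    return all(old_fn(x) == new_fn(x, flag=False) for x in inputs)


# Hypothetical graph: objectives depend on verifiers, verifiers on a shared utility.
GRAPH = {
    "PT-AD-04": ["ad_verifier"],
    "objective-b": ["ad_verifier"],
    "objective-c": ["web_verifier"],
    "ad_verifier": ["verify_utils"],
    "web_verifier": ["verify_utils"],
}
```

Here a change to `ad_verifier` requires re-running two objectives, while a change to `verify_utils` reaches both verifiers and all three objectives. The re-verification set comes from the graph, not from what the author intended to touch.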
13. Ship at 80%, Instrument the Gap
A working system at 80% coverage today is more valuable than a perfect system on paper next month. The remaining 20% reveals itself through real runs — if you are instrumented well enough to see it when it appears.
The corollary: the 80% must be genuinely working and genuinely instrumented. "Working" means passing its eval objectives reproducibly. "Instrumented" means the failure modes in the remaining 20% will surface as named, measurable events rather than as vague degradation. A 94% pass rate on a defined objective set is a real statement because those objectives are defined, measured, and tracked over time.
The violation: shipping 80% without the instrumentation to see the gap. An uninstrumented system at 80% coverage isn't 80% working; it's 80% coverage of known behavior with an unknown remainder.
14. Constraints Force Clarity
A binding constraint is an architectural decision already made. 8 GB VRAM means local-first, single-model, context-budget-aware — not as a choice, but as a fact. Every component built under that constraint has a stated reason to exist. Systems designed without hard constraints accumulate decisions nobody can explain.
The corollary is uncomfortable: lifting a constraint without redesign produces a system whose components have reasons that no longer apply. The 8K context budget forced explicit addendum length targets, short-form prompts, and structured output. Those disciplines improve reliability even on hardware with more headroom. The constraint was the design instructor.
The violation: treating constraints as temporary, as in "we'll revisit when we get better hardware." Systems built around soft constraints have no stated reason for any component. When the constraint lifts or a new contributor arrives, nothing is explicable and nothing can be safely changed, because nothing was ever explicitly justified.
Philosophical Values
The numbered principles above were extracted from failures. These are different — they are the values that determine which failures we consider worth learning from, and what kind of system we think is worth building. They are harder to operationalize, which is why they live here rather than embedded in process documentation.
Transparency as default. Failures are published in real time. The benchmark dashboard reports what happened, not what we wanted to have happened. Session logs are structured so conclusions have a traceable basis, not so the system appears rigorous. Opacity is always a design choice; we choose against it.
Intellectual honesty. A failure that contradicts a prior assumption is more valuable than a run that confirms it. RCA before patch is an act of intellectual discipline — it forces the question "what do I actually know?" before acting on what I think I know. The small-samples principle is the same discipline applied to inference: a result is not evidence until it is reproducible, regardless of how clean it looked.
Discovery over confirmation. The closed loop is designed so that the unexpected surfaces as measurably as the expected. A system built to confirm existing beliefs will not find what it didn't predict. The 6% failure rate at 94% pass is not a problem to be managed. It is the system working — those failures are the frontier.
Proportionality. Caution scales with blast radius. A single skill pack change and a shared verifier utility change are held to different verification standards because their failure modes differ categorically in scope. The cost of caution should be proportional to the cost of being wrong, not uniform across all changes.
Human primacy. The model generates. The code enforces. The human decides. This division is load-bearing. Interpretive judgment, organizational context, and accountability for irreversible actions belong to the human layer — not because automation cannot approximate them, but because the accountability cannot be delegated regardless of how good the approximation gets.
Craft. HAZOP pre-commit checklists, two-layer hint design, named failure classes, canonical process names — these are expressions of care about getting it right, not performance of rigor. Craft means attending to the details that are easy to skip when no one is watching. The difference between craft and ceremony is that craft produces better outcomes.
Skepticism as method. Multi-tier verification, role separation between implementer and verifier, calibration records for automated scoring, minimum run counts before acting on a result — these are expressions of systematic distrust of single-point evidence. Not nihilism about what can be known, but precision about what must be verified before it is believed.
The principles in this document are not aspirational. They were extracted from failures: cases where violating them produced a measurable cost. That is why they exist, and that is why they hold.