The Learning Loop: Knowledge Architecture for Self-Improving Human-AI Security Systems

Status: Technical Report | Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs
Companion papers: The Centaur Framework · Training Data Integrity · The Stochastic Trap

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

The Centaur Framework's architecture specifies how three layers divide work in a single session. This paper addresses how the system gets better between sessions: it formalizes the knowledge that each layer generates, accumulates, and disseminates across the other layers, and the conditions under which those flows compound into an improvement spiral.

Abstract

The three-layer Centaur architecture (model layer, code layer, human layer) specifies which layer performs which work in a given session. It does not specify how the system learns from that work. An AI system that runs sessions without a deliberate learning architecture will plateau: the model's behavior reflects its pre-deployment training, and the code layer's routing accuracy is fixed at classifier training time. A system that has such an architecture can evolve and improve much as humans do. The human layer improves as well when the proper information is collected and presented to it, which allows it to further develop and guide the system.

This paper formalizes the knowledge management architecture that makes operationally-sourced learning possible. Each layer of a Centaur implementation generates artifacts during operation that can inform the other layers. I define what each layer generates, what it accumulates, how artifacts should flow between layers, and the quality gate mechanism that determines whether those flows compound into a genuine improvement spiral or degrade into contaminated training signal. I propose fourteen design requirements for learning-loop compliance in Centaur implementations and evaluate ARCHER's current state against them. The central empirical claim is bounded: ARCHER has run one complete learning loop iteration in production, in which the routing log caught a miscalibrated confidence threshold and corrected it from measurement, not intuition. The fine-tuning loop is built and validated; it awaits sufficient data for a first production run.

1. Introduction

The Centaur Framework[^1] specifies a session-level architecture: what the model layer does, what the code layer enforces, what the human layer decides, and where the boundaries between them lie. A system that implements the framework correctly will run more consistent sessions, produce more auditable findings, and resist the failure modes the framework documents.

It will not, by itself, get better.

A fully compliant Centaur session produces artifacts: a structured session log, routing decisions, a probabilistic residual (the set of model assertions not confirmed by code-layer ground-truth checks), a fine-tuning candidate if the session succeeded. Those artifacts contain information that could improve the next session — a successful command sequence worth reusing, a routing failure worth correcting, a hint that sent the model down the wrong path and should be revised. Without a deliberate architecture for collecting, gating (filtering artifacts through quality checks before they enter the training pipeline), and applying that information, it accumulates in log files and is never used. The system runs the same sessions next month that it ran this month, with the same failure rate, because no mechanism exists to convert operational experience into improved behavior.

This paper proposes a formal answer to the question of how a Centaur system converts operational experience into improved behavior. The answer is grounded in what the three-layer architecture implies about knowledge generation and information flow, and it is evaluated against ARCHER's operational implementation.

2. Background

2.1 Self-Improving Systems and Their Ground-Truth Requirements

Approaches to self-improving AI systems share a common structural requirement: a ground-truth signal independent of the system being improved. To operate within reality, a system must understand that reality, which means it must have access to its environment. Reinforcement learning from human feedback grounds model improvement on human preference ratings collected independently of the model being trained.[^2] Self-play reinforcement learning (where a model trains by competing against prior versions of itself) uses the game itself as ground truth: win or loss is determined by an external judge, not by the model's own assessment.[^3] Active learning (where the model identifies the examples it is most uncertain about and requests human labels for those specific cases, rather than training on a fixed dataset) incorporates analyst feedback into the training loop as a labeled correction signal that originates outside the model.[^4]

In each case, the improvement signal is externally anchored. The model does not evaluate its own outputs and use that evaluation as training data; a separate mechanism provides the evaluation. This separation is what prevents the system from learning to replicate its own failure modes.

Security operations makes this separation structurally difficult. A session log contains the model's commands and outputs, but the model's self-assessment of whether the session succeeded is not ground truth. The ground truth — whether root was obtained, whether the credential is valid, whether the vulnerability exists — comes from probing the target, not from the model's output. A learning loop that uses model-claimed success as the gate for training candidate inclusion will teach the model to claim success more convincingly, not to succeed more reliably.

This structural risk is the subject of the companion paper on Training Data Integrity,[^5] which catalogs seventeen bug classes produced by inadequate quality gates in ARCHER's early training pipeline. This paper takes the quality gate as necessary and focuses on the broader learning architecture it must be embedded in.

2.2 Tacit Knowledge and the Formalization Problem

The human layer in a Centaur implementation accumulates judgment that is not directly encodable in the model or code layers: which failure patterns reflect systemic architectural problems versus edge cases, which objectives are most valuable to prioritize, when a fix addresses a root cause versus a symptom. Polanyi's account of tacit knowledge[^6] — knowledge that cannot be fully specified in explicit rules, embodied in practice rather than documentation — captures the challenge precisely. The human layer's accumulated judgment is operationally real and consequential, but it does not naturally flow to the model or code layers through any mechanism the Centaur architecture specifies. This paper maps where the formalization problem appears and documents partial progress in ARCHER's current implementation; a complete solution remains an open research direction.

2.3 Relationship to the Session Architecture

The Centaur Framework's information topology describes what each layer observes during session execution. The learning loop extends that topology across sessions: the question shifts from what each layer can see during a session to what each layer produces after one — and where that production should go. The two topologies are complementary. Session-level topology constrains visibility during execution; learning-loop topology governs what gets generated and how it flows back into the system.

3. The Knowledge Architecture

Each layer of a Centaur implementation generates two classes of information during operation: session artifacts (produced immediately, available for the next session) and accumulated knowledge (built up over multiple sessions into persistent artifacts that shape system behavior). The distinction matters for learning loop design: session artifacts are raw material; accumulated knowledge is what the system actually uses.

3.1 The Model Layer

Session artifacts: The model layer produces commands, output interpretations, and session-end disclosures. Its most significant learning artifact is the fine-tuning candidate: a structured record of the turn sequence from task input to objective completion, formatted for supervised fine-tuning.

Accumulated knowledge: The model layer's accumulated knowledge is its fine-tuned adapter — the low-rank weight matrices trained on operational sessions that shape how the base model behaves. Unlike the code layer's knowledge, which exists in inspectable frozen artifacts, the model layer's accumulated knowledge is opaque: it is distributed across billions of parameters in a form that cannot be directly read or audited. This opacity has implications for the quality gate. A code-layer artifact (the routing classifier, the playbook) can be inspected for errors; a model-layer artifact (the fine-tuned weights) cannot be directly inspected — only evaluated by running it.

The specialization trajectory: A model trained on internet-scale data generalizes broadly. A model fine-tuned on operational sessions from a specific task distribution narrows. Over time, fine-tuning on ARCHER sessions moves the model from generalizing over security-adjacent text to specializing on the exact task distribution it actually runs against — specific tools, specific target environments, specific output formats. This narrowing is valuable precisely because it is narrow: a model fine-tuned on penetration testing sessions against Linux targets is more reliable for that task than a generalist model. It also creates a maintenance obligation: distribution shift — new domains, new target types, new tool versions — requires updated training data or the fine-tuned model will underperform a generalist on new tasks.

3.2 The Code Layer

Session artifacts: The code layer produces routing decisions (with classifier confidence scores), halt events (with halt reasons), ground-truth verification results, and structured session logs. Each has learning value: routing decisions are labeled examples for the classifier, halt events surface patterns in when sessions succeed or fail, verification results close the ground-truth loop.

Accumulated knowledge: The code layer's knowledge is operationalized in frozen, versioned, auditable artifacts. The routing classifier has a weights file, a training set hash, a version history, and a confusion matrix. The playbook is a database of validated command sequences, abstracted for target generalization, with session provenance. The hint library is a set of per-skill guidance strings, each traceable to the failure analyses that motivated their revisions. These are code-layer knowledge in the strict sense: they follow the software development lifecycle, they can be tested independently, and their accuracy is measurable against ground truth.

This is the critical distinction between code-layer and model-layer knowledge accumulation. The routing classifier's accuracy on held-out examples is directly measurable; the fine-tuned model's accuracy is measured indirectly through evaluation runs. A code-layer artifact that is performing poorly can be diagnosed from its confusion matrix; a model that is performing poorly requires session output analysis to diagnose. The code layer's knowledge is accountable in a way the model layer's knowledge is not.

The playbook as lateral knowledge transfer: A winning command sequence from one session against one target should be reusable against a different target for the same task. The playbook implements this through IP abstraction: {ip_address} placeholders replace concrete addresses in stored commands, so that a winning exploitation sequence generalizes to any target where the vulnerability exists. This is the code layer accumulating knowledge across sessions in a directly useful form — not through weight updates, but through versioned structured storage.
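The abstraction step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ARCHER's implementation: the function names `abstract_command` and `concretize_command` are hypothetical, only IPv4 addresses are handled, and the regular expression will also match dotted version strings that happen to look like addresses.

```python
import re

# One IPv4 octet, 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Four octets, not embedded in a longer dotted or numeric token.
IPV4_RE = re.compile(rf"(?<![\d.]){_OCTET}(?:\.{_OCTET}){{3}}(?![\d.])")

PLACEHOLDER = "{ip_address}"  # placeholder syntax taken from the text above


def abstract_command(command: str) -> str:
    """Replace every concrete IPv4 address with the playbook placeholder."""
    return IPV4_RE.sub(PLACEHOLDER, command)


def concretize_command(template: str, ip_address: str) -> str:
    """Fill the placeholder back in for a new target.

    str.replace is used instead of str.format because shell commands
    often contain other braces (for example awk programs).
    """
    return template.replace(PLACEHOLDER, ip_address)


# Hypothetical example: store a command from one target, replay on another.
stored = abstract_command("nmap -sV -p 22,80 10.10.14.7")
replayed = concretize_command(stored, "192.168.56.101")
print(stored)
print(replayed)
```

Storing the template rather than the concrete command is what lets a sequence discovered against one host be retrieved for another; ports, banners, and credentials would need their own placeholders.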

3.3 The Human Layer

Session artifacts: The human layer's primary session artifact is the failure diagnosis: a written root-cause assessment of why a session failed, which hypothesis it rules out, and what specific change should address it. The session-end disclosure produced by all agents — the withheld_actions field — is also a human-layer artifact, though it originates in the model layer. It is the model's declaration of what it identified but did not act on, surfaced for human review.

Accumulated knowledge: The human layer's accumulated knowledge is the hardest to formalize and the most consequential. It includes: judgment about which failure patterns are systemic versus incidental, knowledge of which objectives are most sensitive to hint quality, awareness of which targets are in a known-bad state, and the broader organizational context that determines whether a technical finding has operational significance. This knowledge currently lives in GitHub issues, commit messages, session-close notes, and annotated failure taxonomy reports. It is operationally real — decisions based on it are traceable in the git log — but it is not queryable in a structured form that future sessions can access without reading through prior conversation history.

Making human-layer accumulated knowledge more structured and recoverable is an open problem. The first step — writing root-cause diagnoses before fixes, not after — is an enforced practice in ARCHER's development protocol.[^7] The second step — structuring those diagnoses in a form that feeds back into hint updates and training data curation — is partially implemented. A complete solution does not yet exist.

4. Information Flows

Knowledge architecture describes what each layer produces. Information flows describe how that knowledge reaches the layers that need it. Three flow directions matter.

4.1 Upward Flows: Session → Training Pipeline

The primary upward flow carries session artifacts toward the fine-tuning pipeline. The path has a mandatory gate.

A completed session produces a session log and, if the session exited via OBJECTIVE_ACHIEVED or HALT_DISCIPLINE, a fine-tuning candidate in ft.jsonl format. Before either artifact enters the training pipeline, it passes through three audit stages. The Tier 1 Pass performs structural checks: command blocks present, output blocks non-empty, no echo fabrication, no wrong-host indicators, no degenerate loops. The Tier 2 Pass performs quality scoring: an LLM-as-judge evaluation[^8] — structurally independent of the model being trained — scores the session 0–3 on task completion quality; sessions scoring below 2 are excluded from training. Sessions flagged by the Tier 1 Pass that require human judgment go through Manual Review before the Tier 2 Pass proceeds.
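The shape of this gate can be sketched as follows. The sketch makes several assumptions: the `Turn` and `Candidate` types, the failure-reason strings, and the `MAX_REPEATS` cutoff are hypothetical; only three of the structural checks are shown (echo fabrication and wrong-host detection are omitted); and Manual Review is left out. The judge is passed in as a callable so that it stays separate from the model being trained.

```python
from dataclasses import dataclass
from typing import Callable, List

QUALIFYING_EXITS = {"OBJECTIVE_ACHIEVED", "HALT_DISCIPLINE"}
MIN_JUDGE_SCORE = 2   # sessions scoring below 2 on the 0-3 scale are excluded
MAX_REPEATS = 3       # assumed cutoff for calling a repeated command a loop


@dataclass
class Turn:
    command: str
    output: str


@dataclass
class Candidate:
    exit_reason: str
    turns: List[Turn]


def tier1_failures(c: Candidate) -> List[str]:
    """Deterministic structural checks; returns the reasons a candidate fails."""
    reasons = []
    if c.exit_reason not in QUALIFYING_EXITS:
        reasons.append("non_qualifying_exit")
    if not c.turns or any(not t.command.strip() for t in c.turns):
        reasons.append("missing_command_block")
    if any(not t.output.strip() for t in c.turns):
        reasons.append("empty_output_block")
    # Degenerate loop: the same command issued MAX_REPEATS times in a row.
    run = 1
    for prev, cur in zip(c.turns, c.turns[1:]):
        run = run + 1 if cur.command == prev.command else 1
        if run >= MAX_REPEATS:
            reasons.append("degenerate_loop")
            break
    return reasons


def gate(c: Candidate, judge: Callable[[Candidate], int]) -> bool:
    """Admit a candidate only if it passes Tier 1 and the judge scores it >= 2."""
    if tier1_failures(c):
        return False  # the judge is never consulted for structural failures
    return judge(c) >= MIN_JUDGE_SCORE
```

Nothing in `gate` reads the model's own claim of success; the only inputs are observable session properties and an external score.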

Human review closes the gate for contested cases: sessions that pass structural checks but exhibit subtler quality problems that automated scoring cannot reliably detect. The Training Data Integrity companion paper[^5] catalogs the contamination classes that motivated the two-tier architecture. The present paper notes only that the gate is architecturally necessary. A learning loop without a quality gate is not an improvement mechanism — it is a failure amplification mechanism. Sessions where the model claimed success without target-state confirmation will teach future model versions to make that same false claim more fluently.

The upward flow also carries routing labels. Every session produces a routing log entry: the task phrasing that was presented, the classifier's prediction and confidence, whether the prediction was correct. These entries are the training data for the next classifier version — a direct, automatically produced feedback signal that requires no additional labeling effort.

4.2 Downward Flows: Failure Analysis → Code Changes

The primary downward flow carries human-layer diagnoses into code-layer fixes. When the human layer identifies a root cause — a hint sending the model toward an unproductive attack chain, a halt condition firing too early on a common but non-terminal output pattern, a routing classification failure on a specific task phrasing — that diagnosis becomes a targeted code-layer change. The change is committed, verified against the failing objective, and the improved behavior appears in subsequent sessions.

The routing log is the most precise instrument in this flow. Every routing decision records the classifier's prediction, its confidence score, whether the keyword fallback overrode it, and what the correct routing decision was (labeled post-session by the eval harness). This log is not an audit artifact in the Centaur Framework sense — it does not contribute to session provenance. It is a measurement instrument. Reading it reveals calibration failures that cannot be detected from individual session outputs.

ARCHER's routing threshold calibration illustrates this directly. The original threshold — 0.7 softmax confidence — was chosen as a reasonable prior. The routing log revealed that correct predictions were being discarded: the classifier was identifying the right skill domain with confidence in the 0.5–0.7 range, but the fallback keyword scorer was overriding correct predictions with its own noisier output. Without the routing log, the threshold would have stayed wrong indefinitely. The threshold was lowered to 0.5 based on the measurement, not intuition. This is a complete downward flow iteration: session artifacts generated, artifact read by measurement tool, human-layer diagnosis made, code-layer change committed, improved behavior in subsequent sessions confirmed.
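A threshold sweep over the routing log is one way to make this kind of measurement. The sketch below assumes a simplified log entry (`RoutingEntry`, a hypothetical type) and assumes the keyword fallback is used whenever classifier confidence falls below the threshold. The four sample entries and their skill domain names are invented for illustration; they are not ARCHER data.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class RoutingEntry:
    classifier_pred: str   # skill domain predicted by the classifier
    confidence: float      # softmax confidence of that prediction
    fallback_pred: str     # skill domain chosen by the keyword scorer
    correct_label: str     # annotated post-session


def routed_accuracy(entries: Sequence[RoutingEntry], threshold: float) -> float:
    """Accuracy of the final routing decision if `threshold` were in force."""
    if not entries:
        raise ValueError("routing log is empty")
    hits = 0
    for e in entries:
        # Below the threshold, the fallback overrides the classifier.
        final = e.classifier_pred if e.confidence >= threshold else e.fallback_pred
        hits += int(final == e.correct_label)
    return hits / len(entries)


def sweep(entries: Sequence[RoutingEntry],
          thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Routing accuracy at each candidate threshold."""
    return [(t, routed_accuracy(entries, t)) for t in thresholds]


# Invented sample log: two correct classifier predictions sit between 0.5 and 0.7.
entries = [
    RoutingEntry("web", 0.62, "network", "web"),
    RoutingEntry("privesc", 0.55, "web", "privesc"),
    RoutingEntry("network", 0.91, "network", "network"),
    RoutingEntry("web", 0.41, "privesc", "privesc"),
]
for threshold, accuracy in sweep(entries, [0.5, 0.6, 0.7]):
    print(f"threshold={threshold:.1f} accuracy={accuracy:.2f}")
```

The comparison only works because each entry carries the post-session label; without it, a discarded correct prediction is indistinguishable from a discarded wrong one.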

4.3 Lateral Flows: Session → Reusable Artifacts

Lateral flows move information within the same layer — specifically, successful session artifacts into the code layer's reusable knowledge stores.

When a session exits with OBJECTIVE_ACHIEVED and passes quality audit, the winning command sequence is a candidate for playbook storage. The playbook entry stores the command abstracted for target generalization, the task pattern that produced it, the skill domain, and the session provenance. A subsequent session presenting the same task can retrieve the winning command rather than rediscovering it from scratch.
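As a rough sketch of the storage and retrieval path, assuming an in-memory store and exact matching on task pattern (both simplifications; the class and field names here are hypothetical):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PlaybookEntry:
    command_template: str  # command abstracted for target generalization
    task_pattern: str      # task pattern that produced it
    skill_domain: str
    session_id: str        # session provenance


class Playbook:
    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], List[PlaybookEntry]] = {}

    def store(self, entry: PlaybookEntry, exit_reason: str, passed_audit: bool) -> bool:
        """Accept only sequences from audited OBJECTIVE_ACHIEVED sessions."""
        if exit_reason != "OBJECTIVE_ACHIEVED" or not passed_audit:
            return False
        key = (entry.skill_domain, entry.task_pattern)
        self._entries.setdefault(key, []).append(entry)
        return True

    def retrieve(self, skill_domain: str, task_pattern: str) -> Optional[PlaybookEntry]:
        """Return the most recently stored entry for this task, if any."""
        matches = self._entries.get((skill_domain, task_pattern), [])
        return matches[-1] if matches else None
```

Keeping the session identifier on every entry means a playbook command that later proves unreliable can be traced back to the session that produced it.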

The playbook's value is speed, not intelligence: command sequences that took multiple turns to discover execute in the first turn of subsequent sessions. The sessions that result are shorter, more consistent, and produce cleaner training data — reinforcing the upward flow.

5. The Improvement Spiral

The three flow directions — upward to training, downward from failure analysis, lateral into reusable artifacts — are individually useful. They compound into an improvement spiral when they function simultaneously. Better model performance produces higher-quality sessions; those sessions produce better training data; that data improves the model. Improved routing, faster playbook retrieval, and targeted hint fixes each accelerate the same cycle.

The spiral is not automatic. It requires two conditions that must be treated as architectural requirements, not operational preferences.

Condition 1: The quality gate must be independent of the system being improved. A quality gate that uses the model being trained to evaluate training candidates will not catch the failure modes the model has already learned. The structural audit (Tier 1) is independent by design — it applies deterministic checks to observable session properties. The Tier 2 LLM-as-judge evaluation uses a separate model rather than the training target, preserving the independence property. The human review tier is independent by definition. None of the three tiers should use the model being trained as the evaluator of its own training data.

Condition 2: The spiral direction must be observable. A spiral that is degrading rather than improving must be detectable before significant training investment is made on contaminated data. Observable signals include: increasing halt-discipline rates (the model is completing sessions at the command limit rather than finding the objective), declining ground-truth verification rates (the model is claiming success more often without target state confirming it), and classifier confidence drift (routing confidence is declining on tasks that were previously routed with high confidence). Tracking these signals across training cycles is what distinguishes a system that knows it is improving from a system that assumes it is.
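The three signals can be computed per training cycle and compared between cycles. The following is a minimal sketch under assumptions: `SessionSummary` is a hypothetical per-session record, the `tolerance` value is arbitrary, and a real monitor would look at trends over more than two cycles.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence


@dataclass
class SessionSummary:
    exit_reason: str
    claimed_success: bool      # the model asserted the objective was met
    verified_success: bool     # target state confirmed it
    routing_confidence: float


def _rate(flags: Iterable[bool], default: float) -> float:
    """Fraction of True values, or `default` when there is nothing to count."""
    values = list(flags)
    return sum(values) / len(values) if values else default


def cycle_signals(sessions: Sequence[SessionSummary]) -> Dict[str, float]:
    """Summarize one training cycle's sessions into the three spiral signals."""
    if not sessions:
        raise ValueError("no sessions in this cycle")
    claimed = [s for s in sessions if s.claimed_success]
    return {
        "halt_discipline_rate": _rate(
            (s.exit_reason == "HALT_DISCIPLINE" for s in sessions), 0.0),
        # With no success claims there is nothing unverified, so default to 1.0.
        "verification_rate": _rate((s.verified_success for s in claimed), 1.0),
        "mean_routing_confidence":
            sum(s.routing_confidence for s in sessions) / len(sessions),
    }


def degradation_flags(prev: Dict[str, float], cur: Dict[str, float],
                      tolerance: float = 0.02) -> List[str]:
    """Name each signal that moved in the degrading direction beyond tolerance."""
    flags = []
    if cur["halt_discipline_rate"] > prev["halt_discipline_rate"] + tolerance:
        flags.append("halt_discipline_rate_increasing")
    if cur["verification_rate"] < prev["verification_rate"] - tolerance:
        flags.append("verification_rate_declining")
    if cur["mean_routing_confidence"] < prev["mean_routing_confidence"] - tolerance:
        flags.append("routing_confidence_declining")
    return flags
```

An empty result from `degradation_flags` is not proof of improvement; it only says none of the three tracked signals worsened between the two cycles compared.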

Spiral degradation — the failure mode where the loop encodes failure rather than success — is the primary risk in operationally-sourced training. It is most likely when the quality gate is insufficient, when training data volume per skill domain is too low (a single bad session has disproportionate weight), or when human-layer failure analysis is not driving code-layer changes (the model keeps encountering the same failure modes because the underlying problems persist). The Training Data Integrity companion paper documents the specific contamination classes that, if not caught, would produce spiral degradation in ARCHER's training pipeline.

6. Design Requirements for Learning Loop Compliance

A Centaur implementation that claims to improve from operational experience should satisfy these requirements. They are organized by function — generation, flow, quality gate, and spiral integrity — rather than by layer, because the learning loop cuts across all three layers by definition.

Requirements labeled architectural are claims about what the system must structurally produce. Requirements labeled operational are claims about process — practices that must be in place for the architecture to function.

Generation Requirements

GEN-1. Every session produces a structured log capturing: task string, skill domain, all commands issued, all outputs returned, session outcome, halt reason, and routing attribution. The log is the primary input to all downstream learning flows; gaps in the log produce gaps in the learning architecture. (Architectural)

GEN-2. Every routing decision produces a labeled example: the task phrasing, the predicted skill domain, the classifier confidence, and the correct label (annotated post-session by the eval harness). These examples are the training data for the next classifier version. (Architectural)

GEN-3. Sessions exiting via OBJECTIVE_ACHIEVED or HALT_DISCIPLINE produce fine-tuning candidates in a defined format, capturing the full turn sequence. Sessions exiting via other paths (error, timeout, scope violation) do not produce fine-tuning candidates — those exit paths signal conditions where the model's behavior should not be taught. (Architectural)

GEN-4. Failed sessions produce analyzable failure artifacts: structured indicators of what failed, at which command, against which target state. These artifacts are the input to failure analysis; unstructured failure logs produce unstructured diagnoses, which produce unfalsifiable fixes. (Architectural)
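One way to picture GEN-1 and GEN-3 together is a log record whose fields mirror the GEN-1 list, plus a check on the exit path. The sketch below is illustrative only: the `SessionLog` type, its field names, and the JSON-lines serialization are assumptions, not ARCHER's log schema.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import List, Optional

# Exit paths that yield fine-tuning candidates, per GEN-3.
FT_EXITS = {"OBJECTIVE_ACHIEVED", "HALT_DISCIPLINE"}


@dataclass
class SessionLog:
    task: str
    skill_domain: str
    commands: List[str]
    outputs: List[str]
    outcome: str                  # exit path of the session
    halt_reason: Optional[str]
    routing_attribution: str      # e.g. which router made the decision

    def to_jsonl(self) -> str:
        """Serialize the record as one JSON line."""
        return json.dumps(asdict(self), sort_keys=True)


REQUIRED_FIELDS = [f.name for f in fields(SessionLog)]


def missing_fields(record: dict) -> List[str]:
    """List GEN-1 fields absent from a parsed log record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def produces_ft_candidate(log: SessionLog) -> bool:
    """True only for the exit paths that GEN-3 allows to be taught."""
    return log.outcome in FT_EXITS
```

Checking `missing_fields` at write time is a cheap way to surface log gaps before they become gaps in the downstream learning flows.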

Flow Requirements

FLO-1. Routing log entries flow to the classifier training pipeline with session provenance. The routing log is not an audit artifact — it is a measurement instrument. A routing log that is never read produces a classifier that never improves from operational data. (Architectural)

FLO-2. Failure root-cause diagnoses are documented before code-layer changes are made. A change that addresses a symptom without a written diagnosis cannot be verified as correct and cannot be distinguished from a change that addresses the actual cause. The written diagnosis is the mechanism that makes downward flows traceable rather than intuitive. (Operational)

FLO-3. Winning command sequences are accumulated in a reusable form with target abstraction. A command that worked against one target should be usable against a different target for the same task. Without abstraction — IP address removal, credential parameterization — playbook entries are target-specific and do not generalize. (Architectural)

Quality Gate Requirements

QG-1. Fine-tuning candidates pass a structural audit before entering the training pipeline. The structural audit applies deterministic checks to observable session properties and is independent of the model being trained. Sessions that fail structural audit are excluded from training regardless of the model's completion signal. (Architectural)

QG-2. The structural audit's failure classes cover the contamination modes most likely to produce spiral degradation: false positive success signals, wrong-host execution, degenerate loop patterns, and echo fabrication. The audit is targeted at the specific ways operationally-sourced training data goes wrong, not a generic validity check. (Architectural)

QG-3. The quality gate does not use the model being trained as the primary evaluator of training candidates. Independence is the property that prevents the gate from learning to pass the failure modes it should catch. (Architectural)

QG-4. Human review closes the gate for contested cases that automated checks cannot determine mechanically. The quality gate is a multi-tier structure because no single tier catches all relevant contamination classes. (Operational)

Spiral Integrity Requirements

SPI-1. Training data volume is tracked per skill domain against quantitative deployment thresholds. A fine-tuning run with insufficient examples per domain does not improve the model on underrepresented domains — it may degrade it. The threshold is not a soft guideline; it is the floor below which a training run is not warranted. (Operational)

SPI-2. The routing log enables measurable calibration of routing parameters independently of model retraining. At minimum, the classifier confidence threshold should be calibratable from routing log analysis. A threshold that cannot be adjusted from measurement is a threshold that will stay wrong until something breaks visibly. (Architectural)

SPI-3. Observable signals indicate spiral direction. Increasing halt-discipline rates, declining verification rates, and classifier confidence drift are individually measurable. A system that cannot distinguish an improving spiral from a degrading one cannot make principled decisions about when to train, when to wait for more data, and when to audit the quality gate. (Architectural)
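The per-domain floor in SPI-1 reduces to a simple count check. In this sketch the function names are hypothetical and the floor is a parameter, since this paper does not state a specific value; the numbers in the usage example are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def domains_below_floor(candidate_domains: Iterable[str], floor: int,
                        expected_domains: Sequence[str]) -> Dict[str, int]:
    """Map each expected skill domain with fewer than `floor` examples to its count."""
    counts = Counter(candidate_domains)
    # Counter returns 0 for domains with no examples at all.
    return {d: counts[d] for d in expected_domains if counts[d] < floor}


def training_run_warranted(candidate_domains: Iterable[str], floor: int,
                           expected_domains: Sequence[str]) -> bool:
    """A run is warranted only when every expected domain clears the floor."""
    return not domains_below_floor(candidate_domains, floor, expected_domains)


# Placeholder corpus: skill domain of each quality-gated candidate.
corpus = ["web"] * 5 + ["privesc"] * 2
print(domains_below_floor(corpus, floor=3, expected_domains=["web", "privesc", "recon"]))
```

Passing the list of expected domains explicitly matters: a domain with zero examples would otherwise never appear in the counts and could not be reported as short.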

7. Evaluation Against ARCHER

7.1 Current State

ARCHER's learning loop is partially implemented, with two distinct maturity levels across the fourteen requirements.

Fully operational: GEN-1, GEN-2, GEN-3, GEN-4, QG-1, QG-3.

Every session produces structured logs, routing labels, and fine-tuning candidates on qualifying exit paths. Failed sessions produce structured failure artifacts read by the Failure Taxonomy process. The Tier 1 Pass operates automatically and independently of the training target model. The three-stage audit pipeline (Tier 1 Pass → Tier 2 Pass → Manual Review) maintains the independence property required by QG-3.

Partially operational: FLO-1, FLO-2, FLO-3, QG-2, QG-4, SPI-1, SPI-2, SPI-3.

FLO-1 (routing labels flow to classifier pipeline): routing log entries are generated per session and read by the classifier training scripts, but the pipeline from log entry to labeled training example requires a manual Label Build (build_training_data.py) rather than flowing automatically after each session.

FLO-2 (diagnoses before changes): enforced as a development protocol in CLAUDE.md — "add one sentence to the issue stating the failure is caused by X, evidenced by Y in session log Z" before writing code. This is an operational requirement enforced by convention, not by the system itself preventing non-compliant changes.

FLO-3 (playbook abstraction): implemented for IPv4 addresses. Other concretizations — ports, service banners, credential patterns — are not yet abstracted, which limits playbook generalization across diverse target configurations.

QG-2 (audit covers relevant contamination classes): Tier 1 covers the highest-frequency failure classes from the Training Data Integrity taxonomy. It does not cover all seventeen classes. Some contamination classes — cross-objective contamination, state residuals from prior sessions — require behavioral context that deterministic checks cannot encode.

QG-4 (human review closes the gate): the Manual Review process (archer-review-flagged) exists and is operational. Its use is not yet systematically enforced between collection runs and training pipeline invocations.

SPI-1, SPI-2, SPI-3 (spiral integrity signals): volume tracking is implemented (Data Preparation --report), routing threshold calibration has been demonstrated (the 0.7→0.5 adjustment described in §4.2), and halt-discipline rates are tracked via Halt Analysis. These signals exist but require running three separate processes; there is no unified spiral health view.

7.2 The Demonstrated Iteration

One complete learning loop iteration has been demonstrated in production. The routing log, generated automatically from session outputs, revealed that the classifier's 0.7 confidence threshold was discarding correct predictions. The failure was visible as a gap in routing quality that persisted across sessions — not detectable from any individual session, only detectable from the aggregate measurement the routing log enabled. The threshold was lowered to 0.5 based on the log analysis; subsequent sessions confirmed improved routing accuracy.

This iteration is modest in scope — a single threshold adjustment is not evidence that the full spiral is functioning. But it is structurally complete: session artifact generated, artifact read by measurement tool, human-layer diagnosis reached, code-layer change made, improved behavior confirmed. The question for the next phase is whether it scales across the more complex flows: fine-tuning on operational data, continuous hint refinement, spiral health monitoring across training cycles.

7.3 Open Gap: The Fine-Tuning Loop

The fine-tuning pipeline — the flow from quality-gated sessions to fine-tuned model weights — is built and validated on a test run but has not yet executed a production training cycle. The constraint is data volume: fine-tuning with insufficient examples per skill domain risks degrading rather than improving performance on underrepresented domains. ARCHER's current corpus is approaching but has not yet cleared the per-domain volume threshold that warrants a production run.

8. Limitations

Self-assessment circularity. The requirements in Section 6 were derived from operational experience building ARCHER. A framework derived from a system's own design will reflect that system's capabilities more favorably than one derived independently. External evaluation — applying these requirements to a system they were not designed around — would be substantially more informative than this self-assessment.

Tacit knowledge formalization remains open. The human layer's accumulated judgment is the most consequential and least tractable knowledge in the learning loop. The partial implementations described in §3.3 — written diagnoses, annotated failure reports — are meaningful starting points. A complete solution does not yet exist in ARCHER or, to my knowledge, in any published Centaur implementation.

Distribution shift is not addressed. The fine-tuning loop assumes a relatively stable task distribution: the same skill domains, the same target environments, the same tool versions. As ARCHER expands to new domains and as existing tools update, the fine-tuned adapter will develop coverage gaps. Explicit mechanisms for detecting and addressing distribution shift are not specified in this paper and are a necessary extension.

Volume thresholds are empirically set, not theoretically derived. The per-domain floor below which a fine-tuning run is not warranted was set from practical experience, not from a principled analysis of what data volume reliably produces improvement. The right threshold likely varies by skill domain complexity, base model capability, and evaluation harness sensitivity. It should be calibrated empirically over multiple training cycles rather than treated as a fixed constant.

9. Conclusion

A Centaur implementation that divides work correctly but does not learn from its work is leaving its most valuable resource unrealized. Every session against a real target, every routing decision, every failure, every successful command sequence is information that could improve the next session — if the architecture is designed to collect, gate, and apply it.

The improvement spiral is not a feature that can be added after the architecture is built. It is implied by the three-layer specification: if each layer generates artifacts during operation, and if those artifacts contain information the other layers need, then whether the system improves is a question of whether those flows are designed and whether the quality gate functions.

The most important design choice for any team implementing this architecture is the quality gate. A learning loop without a gate does not improve the system — it teaches the system to replicate whatever it has been doing, including its failure modes. Building the gate correctly — independent of the model being trained, targeted at the specific contamination classes most likely in your operational environment, closed by human review for the cases automated checks miss — is the prerequisite for everything else. Build the gate first. The spiral follows.

References:

[^1]: Hawkins, J. (2026). The Centaur Framework: A design specification for human-AI collaboration in security operations. Centaur Security Labs. centaursecuritylabs.com/research/centaur-framework

[^2]: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744. arXiv:2203.02155

[^3]: Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489. DOI: 10.1038/nature16961

[^4]: Kim, Y., Dán, G., & Zhu, Q. (2024). Human-in-the-Loop Cyber Intrusion Detection Using Active Learning. IEEE Transactions on Information Forensics and Security, 19, 8658–8672. IEEE Xplore document 10613858.

[^5]: Hawkins, J. (2026). Training data integrity in AI security systems: A taxonomy of failure modes. Centaur Security Labs. centaursecuritylabs.com/research/training-data-integrity

[^6]: Polanyi, M. (1966). The Tacit Dimension. Doubleday. (Reprinted 2009, University of Chicago Press.)

[^7]: Hawkins, J. (2026). ARCHER CLAUDE.md: Written diagnosis before fix. Internal development protocol. github.com/jayhawkins108/ARCHER

[^8]: Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36. arXiv:2306.05685

Centaur Security Labs — centaursecuritylabs.com