Training Data Integrity in AI Security Systems: A Taxonomy of Failure Modes¶
Status: Technical Report | Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs
The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.
Fine-tuning an AI security agent on its own operational sessions sounds efficient — every run becomes training data. In practice, it is a mechanism for encoding every failure mode, false positive, and bad habit into the model's weights permanently. This paper documents the taxonomy of data quality failures observed during ARCHER's V2 training pipeline development, proposes a two-tier audit architecture for catching contaminated sessions before they enter training, and argues that data quality validation is the highest-leverage activity in any AI security training pipeline.
Abstract¶
AI security tools that generate their own training data — using operational sessions to fine-tune models on domain-specific behavior — face a data integrity problem with no parallel in conventional ML pipelines: the model being trained is also the model generating the data. When sessions are automatically accepted as training examples based on a completion signal (the model claimed the objective was achieved), bad sessions enter the training pipeline and the fine-tuned model learns to replicate their failure modes. This paper documents seventeen distinct bug classes identified during ARCHER's data integrity sprint, presents a taxonomy organized by failure type, and describes the two-tier audit architecture developed to prevent contaminated sessions from entering training. I argue that data quality validation — not model selection or prompt engineering — is the primary determinant of fine-tuned model quality in operationally-sourced training pipelines.
1. Introduction¶
The standard approach to AI system improvement is evaluation-driven iteration: measure performance, identify failures, adjust the system, remeasure. In AI security tools where the system generates its own training data from operational runs, this loop has a structural flaw. The evaluation — whether a session produced a valid result — is performed by the same system being evaluated. A model that produces a plausible-but-wrong success signal will generate a training example that teaches future model versions to produce the same plausible-but-wrong signal more reliably.
This is not a theoretical concern. During ARCHER's development, a data integrity audit of the V1 session corpus identified seventeen distinct classes of bugs in the training data: sessions where the model claimed success without tool confirmation, sessions where commands were executed against the wrong host, sessions where the success signal was an echo of the input rather than evidence from output, and others. None of these sessions were obvious failures from the model's perspective — all had completed within command budget and produced plausible-looking output. All would have entered the training pipeline under naive quality criteria.
This paper documents those seventeen bug classes, explains why each one is harmful to training data quality, and describes the audit architecture ARCHER now uses to prevent contaminated sessions from entering training.
2. Background and Related Work¶
The closest prior art for operationally-sourced training pipelines is self-play reinforcement learning and reinforcement learning from human feedback (RLHF). In AlphaGo's self-play training, the model plays against itself, but the ground truth — win or loss — is provided by the game itself (Silver et al., 2016). RLHF grounds model improvement on human preference ratings collected independently of the model being trained (Ouyang et al., 2022). Constitutional AI introduces model self-critique but anchors it in explicitly human-authored principles (Bai et al., 2022). Each approach provides an external ground-truth signal independent of the model being trained. ARCHER's pipeline does not: the completion signal that gates whether a session is accepted as training data is generated by the same model being evaluated for inclusion — a structural feedback risk not present in the prior art.
Data poisoning is a well-established problem domain in machine learning security. Goldblum et al. (2022) survey the space comprehensively: training data can be manipulated to control model behavior, introduce backdoors, or degrade downstream performance, and the attack surface expands as data curation is automated at scale. The problem is acute in language model instruction tuning specifically: Wan et al. (2023) demonstrate that as few as 100 poisoned examples in an instruction-tuning corpus are sufficient to cause predictable behavioral manipulation in fine-tuned models. The failure modes documented in Section 3 of this paper are not adversarial poisoning events — they are naturally-occurring data quality failures — but their mechanism is identical: incorrect examples in the training corpus produce models that reliably replicate the incorrect behavior.
Label quality failures in security-specific training data have a longer history. Tavallaee et al. (2009) documented systematic problems in the KDD CUP 99 dataset — redundant records biasing learning toward frequent examples, and label inconsistencies preventing reliable evaluation — leading to the development of the NSL-KDD replacement. The broader lesson is that curating security training data requires ground-truth validation beyond what the data-generating process can provide. In operationally-sourced pipelines, the model both generates and implicitly labels training data, making external validation structurally necessary rather than optional.
Relevant standards address training data quality as part of broader AI governance. The NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0; NIST AI 100-1, January 2023) identifies data quality under the Govern and Map functions, requiring training data provenance and integrity to be documented. ISO/IEC 42001:2023 (Information technology — Artificial intelligence — Management system) includes requirements for AI training data management, including traceability and quality validation. Neither standard addresses operationally-sourced pipelines specifically; the two-tier architecture in Section 4 addresses their requirements in practice.
Published practices for training data quality validation in commercial AI security tools are limited. Most published work comes from academic research contexts; the two-tier architecture described in this paper is derived from operational experience with ARCHER's training pipeline.
References: Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489. — Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022 (arXiv:2203.02155). — Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI feedback. Anthropic (arXiv:2212.06950). — Goldblum, M. et al. (2022). Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 1563–1580. DOI: 10.1109/TPAMI.2022.3162397. — Wan, A., Wallace, E., Shen, S., & Klein, D. (2023). Poisoning language models during instruction tuning. ICML 2023 (arXiv:2305.00944). — Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009). A detailed analysis of the KDD CUP 99 data set. IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), Ottawa, pp. 1–6. — NIST (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1. — ISO/IEC 42001:2023. Information technology — Artificial intelligence — Management system.
3. Taxonomy of Failure Modes¶
The seventeen bug classes identified during ARCHER's data integrity sprint divide into five families. Each family represents a different mechanism by which invalid sessions reach the training pipeline.
Family 1: False Positive Success Signals¶
The model emits [OBJECTIVE_ACHIEVED] before the objective is actually achieved. These are the most harmful class because they teach the model to claim success prematurely.
1.1 — Echo fabrication. The model produces a success indicator by echoing the expected output format in a shell command (echo "uid=0(root)"), which appears in the session output as if it were real tool output. The completion signal fires because the text is present. The command echo "uid=0(root)" is not a privilege escalation.
Detection signal: Command-output pairs where the success string appears in a block immediately preceded by an echo/printf command. After fix: success_fn parsers must split output by command boundary and reject echo blocks.
1.2 — Premature halt on partial evidence. The model interprets a permission error or service banner as objective completion. "Permission denied" on a privileged file might appear alongside "uid" in the output; the keyword matcher fires.
Detection signal: Success signal co-occurring with error indicators (Permission denied, Connection refused, Authentication failed) within the same output block.
1.3 — Cross-objective contamination. A session targeting PT-EXPLOIT-01 (vsftpd exploitation) achieves a root shell on the correct host, but the session immediately prior had left an artifact (open session, listening process) from PT-EXPLOIT-05. The model "completes" PT-EXPLOIT-01 by interacting with the PT-EXPLOIT-05 artifact.
Detection signal: Success command issued against a port or service not part of the current objective's target profile.
1.4 — Playbook replay false completion. The model replays a winning command from the playbook against a target where it does not apply (different service version, different configuration). The command produces partial output that pattern-matches the success criteria without constituting actual objective completion.
Detection signal: Playbook-sourced command in session + success signal + tool output inconsistent with current target's known service version.
Family 2: Wrong Target Errors¶
The session executes against the wrong host. These sessions produce correct-looking data about an incorrect target, teaching the model the wrong tool invocations for the intended target class.
2.1 — Local interface scan. The model scans 127.0.0.1 or the container's own interface rather than the specified target. Produces "live host" confirmation, proceeds with enumeration of localhost services (which may include ARCHER's own processes).
2.2 — Broadcast or subnet scan. The model scans the subnet rather than the target IP, finds other hosts, and proceeds with the wrong host.
2.3 — Stale IP from prior session. The model uses an IP address from session context (playbook variable, prior finding) that was correct for a previous session but is incorrect for the current one.
Detection signal for all three: Target IP in tool invocation does not match the objective's specified target IP.
Family 3: Degenerate Session Structure¶
The session runs to completion without producing meaningful training signal. These sessions are not false positives — they simply contain no usable behavior to learn from.
3.1 — Depth-blocked early exit. The session reaches the minimum command count with all attempts blocked by network configuration, firewall, or service availability. The model exits cleanly (no error, no success), producing a session log with correct structure but no exploitable commands.
3.2 — Single-command loop. The model issues the same command (or near-identical variants) repeatedly, filling the command budget without progress. No new information is generated after the first iteration.
Empirical observation: After fix #126 (minimum command depth enforcement), depth-blocked sessions are excluded automatically. Pre-fix sessions in the corpus require manual review.
3.3 — Tool not available. The model attempts to use a tool not installed in the execution environment. All commands fail with "command not found." Session ends on max command count. Training this session would teach the model to use non-existent tools.
Detection signal: Command-not-found errors > N% of session commands.
3.4 — Context collapse. At high token counts, the model loses track of the session objective and begins issuing generic exploration commands (ping sweeps, uname, generic enumeration) that are not targeted at the specified objective.
Detection signal: Final N commands have no skill-specific keywords; session ends without progress marker.
Family 4: Output Quality Failures¶
The session structure is valid but the model's behavior within it is incorrect in ways that would teach bad habits.
4.1 — Hallucinated findings. The model records a [FINDINGS] entry that is not supported by tool output in the session. The finding is plausible (correct CVE, correct service name, plausible IP) but was generated by the model rather than extracted from tool output.
Detection signal: Finding contains specific claims (CVE numbers, credential pairs, UIDs) that do not appear in any tool output block in the session.
4.2 — Wrong tool for skill domain. The model uses a tool from a different skill domain (e.g., uses nikto for a network exploitation task, uses msfconsole for a basic recon task). Session may succeed, but the training example teaches the wrong tooling strategy for the skill category.
Detection signal: Tool invocations cross-checked against tools_available list for the active skill category.
4.3 — Incorrect flag usage. The model uses a tool with flags that produce incorrect behavior for the task (e.g., nmap -sT TCP connect scan when the objective requires stealth, msfconsole payload incompatible with target architecture).
Empirical distribution across skill categories not yet measured; flag-error detection is planned for Tier 2 audit prompts when per-skill corpus volume is sufficient for reliable frequency estimation.
4.4 — Credential error propagated. Session uses a hardcoded credential that was correct for a prior target configuration but incorrect for the current one (e.g., admin/password instead of admin/admin for DVWA after a database reset). Session fails, but the wrong credential is embedded in the session log and would be learned by a fine-tuned model.
Note: ARCHER build journal documents a specific instance of this bug class: PT-WEBEX-03 hint was updated with admin/password instead of verifying the actual DVWA credential (admin/admin). A 10-second docker exec test would have caught this before the wasted eval run.
Family 5: Audit Trail Corruption¶
The session log itself contains errors that would corrupt any analysis downstream.
5.1 — Session log truncation. The log writer was interrupted mid-session (container restart, OOM kill, SIGTERM during eval). The session appears complete from the file size but contains only the first N turns. Training on a truncated session teaches the model to halt prematurely.
5.2 — Mixed-session contamination. Two concurrent eval sessions wrote to the same log file (race condition in the session logger). The resulting log contains interleaved turns from two different objectives.
5.3 — Timestamp corruption. System time drift during session causes out-of-order timestamps. Not harmful to training but corrupts the session timeline used by audit tools to assess session validity.
Empirical frequency of audit trail corruption in the V1 corpus has not been systematically measured. Tier 1 structural checks detect session log truncation (incomplete JSON, missing session-end event) and flag affected sessions before training. Mixed-session contamination was addressed by serializing concurrent session writes in current eval infrastructure; timestamp corruption does not affect training data validity and does not trigger session exclusion.
4. The Audit Pipeline¶
ARCHER's response to the taxonomy above is a three-stage audit pipeline that runs before any session enters training: a Tier 1 Pass (automated structural checks), a Tier 2 Pass (LLM-as-judge scoring), and a Manual Review of sessions flagged by either prior stage.
Tier 1 Pass — Structural Checks (Automated, ~1 minute)¶
Tier 1 checks are deterministic. They catch the failure modes that have unambiguous detection signals.
Checks run: - Target IP validation: all tool invocations reference the correct target IP - Success signal provenance: success indicator not immediately preceded by echo/printf - Command count: session within declared min/max budget - Tool availability: no "command not found" errors > 10% of commands - Session log integrity: no truncation, no interleaving signals - Degenerate loop detection: no command substring repeated > N times
Sessions failing any Tier 1 Pass check are flagged for review and excluded from training until cleared.
Tier 2 Pass — LLM-as-Judge Scoring (per Tier-1-clean session)¶
The Tier 2 Pass scores every Tier-1-clean session on a 0–3 scale using a second model as judge. Sessions scoring ≥2 are training candidates; sessions scoring <2 are excluded. Cost: ~$0.0005/session.
The Tier 2 Pass runs after Tier 1 Pass clears a session batch and before Data Preparation writes any training JSONL. It is the final automated gate in the pipeline.
Manual Review — Auditor judgment on flagged sessions¶
Manual Review is triggered for sessions flagged by the Tier 1 Pass and for any collection run where > 20% of sessions from a single skill are flagged.
Manual Review is handled by the Auditor instance — a dedicated Claude Code session that reads the full session log and evaluates three questions: 1. Does the claimed success follow from the tool output in this session, or was it asserted without evidence? 2. Are the commands in this session appropriate for the stated objective and skill category? 3. Is there any evidence that this session's results are not attributable to the intended target?
Sessions confirmed invalid through Manual Review are written to the exclusion list with reason; a GitHub issue is filed under data-quality.
Tier 2 Coverage Gap¶
Tier 2 currently has no explicit rules for the PT-Pivoting skill category (planned but not yet implemented). When the PT-Pivoting skill pack is deployed, Tier 2 adversarial prompts will need new checks specific to pivot success verification — a response from an address not directly reachable from the attacker, not just flag content.
5. Implications for Training Pipeline Design¶
The central finding of the data integrity sprint is not the seventeen bug classes themselves but the structural reason they are hard to prevent: the model that generates training data is the model being evaluated for training data quality. Any quality gate that relies on the model's own completion signal will inherit the model's false positive rate.
Four design principles follow:
Principle 1: External verification is load-bearing. verify_fn — a ground-truth check that runs after any model-claimed success — must be external to the model layer. The check must confirm that the success indicator appears in actual tool output, not in the model's paraphrase of it.
Principle 2: Data quality gates must precede training, not follow it. The cost of a contaminated training example is not one bad session. It is the behavioral change the contaminated session induces in the fine-tuned model, which then generates further contaminated sessions at higher rate. Quality validation before training is always cheaper than quality recovery after training.
Operational validation: A SHA-256 sidecar integrity failure (MOE-5) was detected and resolved before the training pipeline advanced: the sidecar was being written before a final ft.jsonl append, causing 189 mismatches across 1,368 logs. The audit gate caught this before contamination entered training. Moving the sidecar write to the absolute end of session close eliminated all mismatches. Without the pre-training gate, those 189 sessions would have entered the fine-tuning pipeline with corrupted audit trails.
Principle 3: The audit layer must scale separately from data volume. As session volume increases, the audit layer must remain affordable. Tier 1 (structural, free) handles the bulk; Tier 2 (LLM-assisted, per-flagged session) handles the tail. The Tier 2 trigger threshold (> 20% flagged per skill) prevents full-corpus Tier 2 review at scale.
Principle 4: Training data diversity must be engineered, not assumed. In an operationally-sourced pipeline, repeated eval runs converge on a fixed set of task phrasings. Deduplication eliminates new router labels after the first run — session volume increases while label diversity plateaus. A pipeline that relies on operational volume alone will achieve single-phrasing coverage regardless of how many sessions are collected. Diversity requires explicit variant generation: rephrasings, ambiguous phrasings, and underspecified task descriptions must be introduced deliberately into the collection infrastructure. In ARCHER, this required adding a dedicated --ambiguous collection mode distinct from the standard eval loop before the router classifier label gate could be cleared across all 15 skill domains.
6. Methodology¶
Pending
Formal methodology section under development.
This section will document the derivation of the seventeen bug classes, audit corpus size, classification methodology, Tier 1 detection rates, Tier 2 escalation rates, and ground-truth validation procedure.
Current corpus state (2026-05-12): 1,404 collected sessions across 15 skill domains; 0 sessions advanced past Tier 1 audit gate. The pipeline is in the correct pre-training state: Principle 2 is enforced. Tier 1 structural audit pending; Tier 2 review of flagged sessions follows before any fine-tuning run.
Dominant boundary violation type in current production data: false_success_claim accounts for 13 of 39 boundary violations detected in the last 24 hours of operation, consistent with the prediction in §9 that false positive success signals are the most harmful contamination class. The monitoring signal confirms Bug Family 1 (false completion signals) is more frequent in practice than structural failures — the failure mode the Tier 2 audit is designed to catch that Tier 1 misses.
7. Reproducibility¶
The two-tier audit pipeline is available in the ARCHER repository at github.com/jayhawkins108/ARCHER.
To replicate the audit pipeline:
# Tier 1
archer-audit-dry
# Tier 2 (flagged sessions only)
archer-review-flagged
# View audit log
archer-audit-log
To replicate the data integrity sprint:
Pending
Full reproducibility requirements to be documented alongside stable corpus volume data.
- V1 session corpus: volume and eval configuration required to observe the bug distribution
- MS2 configuration required for ground-truth comparison
- Methodology for establishing ground truth
- Expected distribution of bug classes across families
8. Recommendations¶
For security tool developers building operationally-sourced training pipelines:
Treat your completion signal as a hypothesis, not a fact. Any model-generated success claim must be verified against external ground truth before the session is accepted as training data. Build the verification layer before you build the training pipeline.
Audit the taxonomy before the first training run. The seventeen bug classes documented here are not ARCHER-specific — they are properties of any session-sourced training pipeline for a security tool. Run a manual audit of your first session corpus before training. Expect to find representatives of most of these families.
Automate quality gates at the structural level. Target IP validation, echo-block rejection, command count verification — these are mechanical checks that catch a large fraction of contaminated sessions for free. Build them before you build the training infrastructure.
Track data quality separately from model quality. Pass rate on eval objectives can improve while data quality degrades, if the model is learning to pass evaluations rather than to perform the task correctly. Run the audit pipeline continuously and monitor the flagged-session rate as a leading indicator of pipeline health.
For procurement decision-makers evaluating AI security tools with training pipelines:
Ask whether training data is verified before training. If the answer is "we filter by completion signal," that is insufficient. Ask what ground-truth verification runs before a session enters training.
Ask for contamination rates. Any vendor running an operationally-sourced training pipeline should be able to tell you what percentage of their training sessions are excluded by quality gates, and why. An answer of "we don't measure that" is informative.
9. Falsifiable Claims¶
-
False positive success signals are the most harmful training data class. Prediction: fine-tuning on sessions containing echo fabrication bugs produces larger degradation in eval pass rate than fine-tuning on sessions with other bug classes, controlling for session count. Falsified if: degradation is equal across bug classes.
-
Tier 1 structural checks catch > 60% of contaminated sessions. Prediction: manual review of Tier-1-passing sessions finds < 40% contamination. Falsified if: contamination rate in Tier-1-passing sessions exceeds 40%.
-
Wrong-target sessions degrade routing accuracy more than wrong-tool sessions. Prediction: fine-tuning on wrong-target sessions reduces router accuracy on held-out routing labels more than fine-tuning on wrong-tool sessions. (pending: controlled experiment not yet conducted).
-
Data quality degrades with session volume without an audit gate. Prediction: the contamination rate in naive session collection (no audit) increases as total session volume increases, because edge cases and degenerate behaviors appear more frequently at scale. Falsified if: contamination rate is constant across session volume bins.
-
The Tier 2 LLM gate catches hallucinated findings that Tier 1 misses. Prediction: findings-hallucination sessions (bug class 4.1) pass Tier 1 checks at > 80% rate but are caught by Tier 2 at > 70% rate. (pending: measurement against labeled corpus).
Glossary
Audit gate: A required verification step before sessions are advanced to the fine-tuning pipeline. Includes Tier 1 deterministic structural checks and Tier 2 human or LLM-assisted review of flagged sessions. The gate's purpose is to catch contaminated sessions before they enter training data — where errors compound across training cycles rather than being isolated to a single session.
Contaminated session: Any session that should be excluded from the fine-tuning pipeline because it would teach the model incorrect behavior. Categories include wrong-host sessions, depth-blocked sessions, false positive sessions, and findings-hallucination sessions. Contamination is not always visible in structural checks — semantic failures require Tier 2 review.
Depth-blocked session: A session that exits early because it reached the maximum permitted command depth before completing the objective. Excluded from training data because the session does not demonstrate complete task execution and may contain the model giving up rather than finishing.
Echo fabrication: The failure mode in which a model generates findings that appear to be derived from tool output but are actually extrapolated or invented. The model produces plausible-sounding evidence that does not exist in the actual command output — a semantic failure that passes structural checks and is only catchable by comparing findings to raw output.
False positive session: A session that exits with an objective-achieved signal but did not actually complete the objective. Caused by the model claiming success without confirming the actual target state change. Training on false positive sessions teaches the model to declare success prematurely — a quality failure that compounds across training cycles.
Fine-tuning: The process of adapting a pre-trained language model to specific tasks or behaviors by training on task-specific examples. In an operationally-sourced pipeline, fine-tuning uses sessions generated by the production system itself as training examples — making session quality directly upstream of the quality of every subsequent model version.
Operationally-sourced training pipeline: A fine-tuning approach in which training examples are generated by the production system during actual task execution, rather than hand-authored. Creates a direct feedback loop between operational quality and training data quality: contaminated sessions produce a worse model, which produces worse sessions, which become contaminated training data.
Tier 1 audit: Deterministic, automated structural checks applied to all sessions before fine-tuning. Checks for port and IP address plausibility, required tool presence in the command log, output volume thresholds, and timing anomalies. Fast and inexpensive but cannot catch semantic failures — a structurally valid session can still contain fabricated findings or wrong-host targeting.
Tier 2 audit: Human or LLM-assisted review of sessions flagged by Tier 1 or sampled from high-volume skill categories. Catches semantic failures that Tier 1 misses, including findings-hallucination, wrong-host targeting, and model behavior that passes structural checks but is operationally incorrect. Triggered per-skill when Tier 1 flag rate exceeds 20%, or after any collection run that feeds the fine-tune pipeline.
Training data integrity: The property that training examples accurately represent correct task execution. A system lacking training data integrity trains on sessions that teach incorrect behavior, causing model quality to degrade rather than improve over successive training cycles. The degradation is self-reinforcing: a worse model produces worse sessions, which become the next training set.
Wrong-host session: A session in which the model directed its commands at a target other than the intended evaluation target. May produce tool output that looks structurally valid — open ports, service banners — from the wrong host, passing Tier 1 checks while teaching the model to target incorrect hosts.
About the author: Jay Hawkins spent twenty years in the U.S. Army, including a decade in cyber operations — serving at USCYBERCOM, USCENTCOM, USNORTHCOM, and USEUCOM — and holds an active TS/SCI clearance. He builds local-first AI security tools and writes about the methodology, the hard lessons, and the compliance implications of doing it in production. CEH, CHFI, Pentest+, Security+.
Centaur Security Labs — centaursecuritylabs.com