The Human Parallel: Psychology, Philosophy, and the Structural Limits of AI Security Tooling

Status: Technical Report | Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

The failure modes of AI security tools are not new problems. They are new instances of problems that cognitive psychology has been measuring in humans for decades and that philosophy has been describing for centuries. Naming these parallels is not an academic exercise — each one identifies why a common intuitive fix fails and points toward the architectural requirement that actually works.

Abstract

When an AI agent reports a successful exploit it didn't actually achieve, the intuitive response is to write better instructions. This intuition is wrong — and wrong in a way that neuroscience can explain precisely: the behavior resembles confabulation, the brain's automatic gap-filling mechanism, not deliberate deception. When a language model makes inconsistent routing decisions on the same input, the failure looks like a calibration problem. It is actually a consequence of placing a System 1 reasoning process in a role that requires System 2 correctness. When a model produces findings that sound authoritative but cannot be traced to actual tool output, the gap between coherent output and verified knowledge is exactly what Gettier identified as the boundary of genuine understanding in 1963.

These parallels are not decorative. Each one identifies why a common fix fails and points toward the architectural requirement that actually resolves the problem. This paper works through six of them in sequence, anchoring them, where data exists, in empirical observations from ARCHER, a local-first AI penetration testing agent developed across 2025–2026. The argument throughout: the problems AI security tools face are structurally familiar. Treating them as such produces better architectural decisions than treating them as novel technical failures that better models will eventually solve.

1. Introduction

There is a category error embedded in the standard response to AI tool failures: practitioners treat them as engineering problems with engineering solutions. The model hallucinated — so they adjust the prompt. The agent stopped too early — so they raise the iteration limit. The findings weren't traceable — so they add more logging. Each response treats the failure as a deviation from correct behavior that can be corrected by tuning the same system.

Psychology and philosophy offer a different diagnosis. These failures are not deviations — they are properties of the system type. And both disciplines have spent significant effort studying the same properties in the most familiar system type: the human mind.

A radiologist who confidently reads a finding into an X-ray that isn't there isn't malfunctioning. A chess grandmaster who immediately "sees" the right move — in a way they cannot fully articulate — isn't bypassing their intelligence. A pilot who trusts an instrument over their own perception isn't being reckless. Each is exhibiting a well-understood cognitive pattern with a name, a mechanism, and a documented set of conditions under which it fails.

Language models exhibit analogous patterns. Naming them precisely does two things: it explains why the intuitive fix doesn't work, and it points toward the fix that does.

2. Plato's Cave — The Grounding Problem

The Concept

In Plato's Republic, prisoners chained in a cave see only shadows cast on the wall by objects passing in front of a fire behind them. The shadows are all they have ever known. They develop sophisticated theories about the shadows — which follow which, which predict which — and they are not wrong within the limits of their experience. They are wrong about what the shadows are. They mistake representations for reality because reality is inaccessible to them.

The Parallel

A language model trained on text is in the same position. Its entire experience of the world is the shadow of the world that appears in human writing: descriptions of network scans, transcripts of exploit sessions, documentation of vulnerabilities. It has never contacted a real network, confirmed a live shell, or run a tool against an actual target. It has read millions of accounts of people who have.

This is why a model can produce fluent, confident output about things that aren't true — and why it cannot reliably detect the falseness of its own output. Coherence with the shadows is not the same as correspondence with the world. A model that has read a thousand successful exploit sessions learns what successful exploit sessions sound like. Applied to a new target in a new session, it produces output that sounds the same — whether or not the underlying reality matches.

The Empirical Anchor

Across an ARCHER eval corpus of 1,639 sessions with complete outcome logging (snapshot as of 2026-05-13), 658 sessions emitted an OBJECTIVE_ACHIEVED (OA) signal. Of those, 121 (18.4% of OA-emitting sessions) failed code-layer ground-truth verification, the probe of actual target state that checks whether the model's claimed success is real. The rate is computed against the OA-emitting population, not the full corpus: as a share of all 1,639 sessions, false OAs are 7.4%. In a controlled baseline of 87 sessions, 9.1% of OBJECTIVE_ACHIEVED signals failed verification. (Methodology note: sessions ran against Metasploitable2 and DVWA targets on a GOAD-Light lab, single operator, using qwen3:14b via Ollama; logged 2025–2026. Both figures are first-party observations from this lab context; generalization to other models and environments is an open question.) These figures are a point-in-time snapshot. ARCHER's eval data documents a swing of roughly 45 percentage points in pass rate by time of day, along with selective VRAM-pressure effects, so false-OA point estimates carry real runtime variance and should be read as a band, not a fixed floor. These are sessions where the model's output was internally consistent, structurally correct, and wrong. The model had seen the shadow of success and reported that success had occurred.
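The two corpus-level percentages share a numerator and differ only in denominator. A minimal sketch of that arithmetic, using only the counts reported above; the function name `false_oa_rates` is illustrative and not part of ARCHER:

```python
def false_oa_rates(false_oa: int, oa_sessions: int, corpus_sessions: int) -> tuple[float, float]:
    """Return (rate among OA-emitting sessions, rate across the full corpus)."""
    if oa_sessions <= 0 or not 0 <= false_oa <= oa_sessions <= corpus_sessions:
        raise ValueError("expected 0 <= false_oa <= oa_sessions <= corpus_sessions, oa_sessions > 0")
    return false_oa / oa_sessions, false_oa / corpus_sessions


# Counts reported for the 2026-05-13 snapshot.
per_oa, per_corpus = false_oa_rates(false_oa=121, oa_sessions=658, corpus_sessions=1639)
print(f"false OA among OA-emitting sessions: {per_oa:.1%}")  # 18.4%
print(f"false OA across all sessions: {per_corpus:.1%}")      # 7.4%
```

Reporting both figures side by side makes the choice of denominator explicit, which matters when comparing against rates from other tools or papers.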

The Architectural Implication

The verify_fn is the mechanism that forces contact with the actual world. It is not a safety check added out of caution — it is the minimum requirement for a system operating under the cave constraint. Without it, the only check on the model's claims is whether those claims are consistent with other shadows. That is not a check at all.
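As a rough illustration of the idea, the sketch below gates a completion claim on an independent probe. It is not ARCHER's implementation: the `Verdict` type, the `gate_completion` helper, and the zero-argument shape of `verify_fn` are all assumptions made for the example. The one property it is meant to show is that the probe takes no input from the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    claimed: bool   # what the model's output asserted
    verified: bool  # what the probe of target state found

    @property
    def accepted(self) -> bool:
        # A success counts only if the model claimed it AND the probe confirms it.
        return self.claimed and self.verified

    @property
    def false_oa(self) -> bool:
        # The model claimed success but the target state disagrees.
        return self.claimed and not self.verified


def gate_completion(model_output: str, verify_fn: Callable[[], bool]) -> Verdict:
    """Accept an OBJECTIVE_ACHIEVED claim only if an independent probe agrees.

    verify_fn (assumed signature) receives nothing from the model, so the
    model's wording cannot influence the result.
    """
    claimed = "OBJECTIVE_ACHIEVED" in model_output
    verified = bool(verify_fn()) if claimed else False
    return Verdict(claimed=claimed, verified=verified)


# Hypothetical probe standing in for a real check of target state.
def probe_says_no() -> bool:
    return False


verdict = gate_completion("Shell obtained. OBJECTIVE_ACHIEVED", probe_says_no)
print(verdict.accepted, verdict.false_oa)  # False True
```

Keeping `claimed` and `verified` as separate fields, rather than collapsing them to one boolean, is what lets a harness count false OAs at all.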

3. Confabulation — The Gap-Filling Mind

The Concept

Confabulation is a neurological phenomenon studied extensively in patients with certain types of brain damage, amnesia, and split-brain conditions. The brain, faced with a gap in its knowledge or memory, fills the gap automatically — producing a confident, coherent, internally consistent account that the patient genuinely believes. The patient is not lying. There is no awareness of the gap, no decision to fabricate, no intent to deceive. The brain's pattern-completion mechanism simply runs and produces an output.

The neuropsychological literature identifies a precise architectural cause. In a healthy brain two systems operate in parallel. The narrative interpreter — principally left-hemisphere, documented in Gazzaniga's split-brain research — constructs coherent stories continuously from available fragments. It is optimized for fluency and coherence, not accuracy. It cannot be turned off and does not know when to stop. Alongside it, the frontal monitoring system performs reality monitoring: tagging memories with source context, checking temporal plausibility, flagging uncertain retrievals, catching narrative drift before it compounds. The monitor is what tells the interpreter when to say I don't know.

Confabulation occurs when the monitor is damaged — through Korsakoff syndrome, frontotemporal dementia, anosognosia — while the interpreter remains intact. Gazzaniga's split-brain experiments are canonical: when the right hemisphere is shown an image and prompted to act on it, the left hemisphere — which controls speech but had no access to the image — immediately confabulates a plausible explanation. The explanation is wrong. It is stated with complete confidence. The patient believes it.

Two mechanisms compound the failure. Source amnesia: the content of a memory survives but its provenance tag is lost. The brain can no longer distinguish I witnessed this from I read this once from I inferred this from surrounding context. The fragment is real; the attribution is fabricated. Gap-filling under narrative pressure: when retrieval fails, the interpreter generates plausible content from surrounding context and general world knowledge. The output is convincing precisely because it is drawn from real patterns — just not from the specific event being queried.

The Parallel

Large language models are, architecturally, the narrative interpreter without the monitoring system. Generation proceeds token-by-token over learned probability distributions. There is no structurally separate process during inference that asks is this actually retrievable from training, and with what confidence? The model cannot distinguish high-confidence retrieval from gap-filling — both produce the same fluent output at the same generation temperature. The interpreter runs. The monitor was never built.

A model that emits a false completion signal is not lying. It has no theory of mind, no awareness that the signal is false, no mechanism for knowing the gap between its claim and the actual state of the target. It produced a token sequence that was statistically consistent with its training on task-completion contexts — and in doing so, it confabulated a success. The structure of the output is identical to a genuine success report. The underlying mechanism is the same gap-filling process.

Source amnesia is not a failure mode in AI systems. It is intrinsic to their architecture. Human confabulators lose provenance tags through damage to a system that once held them. Language models never had them: compression into weights during training destroys provenance. A model cannot tell you whether a fact derives from a single credible source, dozens of weakly-corroborating fragments, or the statistical residue of something that was confidently wrong. The tag does not exist in the representation.

This is why "instruct it to be honest" cannot work as a fix. Gazzaniga's patients were not dishonest. Fluency is a property of the generation system, not of accuracy — and optimizing for one does not produce the other. The model cannot report accurately on a reality it has never contacted. It can only produce output consistent with the patterns in its training data, which in success contexts means producing success-sounding output.

The Empirical Anchor

The same ARCHER data applies: a 9.1% false-positive rate in the controlled 87-session baseline, and 18.4% of OA-emitting sessions in the broader 1,639-session corpus (snapshot as of 2026-05-13). Each of these sessions produced structured, well-formed completion signals. None produced any internal marker distinguishing the false completion from a genuine one. Prompt refinements across the development period — more explicit instructions, stricter format requirements, clearer definitions of what constitutes a confirmed objective — did not eliminate the false-positive rate. They changed its surface but not its floor. The floor is structural.

The Architectural Implication

The neuroscience informs what the cure cannot be. Human confabulation is not addressed by training the narrative interpreter to be more accurate — the interpreter does not have access to the information required to verify its own output. The monitor must be architecturally separate. Clinical management of confabulation is entirely external: verified records substitute for failed source tagging; structured verification prompts slow the narrative system and impose external checks; a second person reviews the patient's account against known facts. What replaces the broken internal monitor in every case is a mechanism outside the generation system.

The translation to AI system design is direct:

Clinical confabulation management	AI system equivalent
External verified records	Retrieval-augmented generation; live tool access
Structured verification before commitment	Chain-of-thought with explicit uncertainty flags
A second reviewer checking the account	The code layer; the human layer
Prompting for uncertainty	Calibrated refusal; confidence-gated output

The accountability for a confabulating model does not belong to the model — it belongs to the system design that placed a pattern-completion mechanism in a role requiring ground-truth access. The remediation is not a better-worded instruction; it is a structurally independent verification layer that the model cannot influence and cannot bypass.

One extension the parallel surfaces that pure capability analysis misses: the danger compounds with capability. A more capable confabulating patient is more convincing and harder to detect. A more capable AI narrator that remains without a monitoring system produces more persuasive, more internally consistent, more authoritative-sounding outputs — ones that are progressively harder for a human observer to identify as gaps filled with plausible invention rather than retrieval from fact. The case for the monitoring layer does not weaken as model capability grows. It strengthens, for exactly the same reason.

4. System 1 and System 2 — The Architecture of Thought

The Concept

The dual-process framework — most associated with Daniel Kahneman's popular treatment (2011), though the underlying two-system distinction was formalized by Stanovich and West (2000) — describes two distinct modes of cognition. System 1 is fast, automatic, associative, and pattern-driven — it produces answers fluently and with apparent confidence, drawing on learned patterns to generate responses with minimal deliberate effort. System 2 is slow, deliberate, effortful, and rule-governed — it checks System 1's outputs, applies explicit logic, and catches errors that pattern-matching alone would miss.

Neither system is superior. System 1 is what makes expertise possible: a chess grandmaster sees the board differently than a novice because System 1 has internalized patterns across thousands of games. But System 1 also produces systematic errors — cognitive biases, logical fallacies, confident wrong answers — precisely because it generates outputs by pattern association, not by explicit verification. System 2 is the correction mechanism. The problem is that System 2 is resource-limited and can be bypassed under cognitive load, time pressure, or conditions of high confidence.

The Parallel

Language models are System 1 at scale. They pattern-match across a training corpus of extraordinary breadth and produce fluent, plausible, contextually appropriate output. What they cannot do — reliably, consistently, under the full range of input conditions — is run the System 2 check: verify that the output is correct, not merely consistent with the patterns that produced it.

Task routing is a System 2 operation: classify this input into the correct category, apply explicit logic, produce a consistent result. Halt detection is a System 2 operation: determine whether the session objective has been met according to explicit criteria, not according to whether the output sounds complete. Audit logging is a System 2 operation: record accurately, completely, and without inference. These are not tasks where pattern-matching is sufficient — they are tasks with correct answers that must be reached by explicit mechanism, not fluent association.

Placing these tasks in the model layer and compensating with prompts is equivalent to asking System 1 to run System 2 checks by telling it to slow down. It works at the center of the distribution. At the tails, under context pressure, in edge cases — it fails. Not because the model is poor, but because the cognitive architecture does not support it.

The Empirical Anchor

ARCHER's keyword router — the V1 routing mechanism — operated as a System 1 classifier: fast, pattern-based, fluent on canonical inputs. Across 235 ambiguous task phrasings, it routed correctly 54.5% of the time. An additional 7.2% of inputs produced complete routing failures. The classifier that replaced it — a TF-IDF + logistic regression model trained on high-confidence labels — is a System 2 mechanism: explicit, deterministic, rule-governed. It produces a consistent routing decision on identical inputs, independent of surrounding context. Parser complexity is the second measure: ARCHER's bash extraction function grew from 8 lines handling one format to 33 lines handling four, plus a 40-line JSON repair layer added when the model began producing malformed structured output under context pressure. Each line of that parser is System 2 logic written to compensate for System 1 failure.
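To make the parser-growth point concrete, here is a sketch of what a multi-format command extractor can look like. The four formats handled (tagged fence, untagged fence, `$ ` prompt line, JSON object with a `command` key) are assumptions chosen for illustration; the text above does not say which four formats ARCHER's parser handles, and `extract_command` is a hypothetical name.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence

# Each pattern is tried in order; the first one that yields a command wins.
_PATTERNS = [
    re.compile(FENCE + r"(?:bash|sh)\s*\n(.*?)" + FENCE, re.DOTALL),  # fenced, tagged
    re.compile(FENCE + r"\s*\n(.*?)" + FENCE, re.DOTALL),             # fenced, untagged
    re.compile(r"^\$ (.+)$", re.MULTILINE),                            # "$ cmd" prompt line
]


def extract_command(model_output: str) -> Optional[str]:
    """Return the first shell command found, or None if no format matches."""
    # Assumed format: the whole output is a JSON object with a "command" key.
    try:
        payload = json.loads(model_output)
        if isinstance(payload, dict) and isinstance(payload.get("command"), str):
            return payload["command"].strip() or None
    except json.JSONDecodeError:
        pass  # not JSON; fall through to the text patterns
    for pattern in _PATTERNS:
        match = pattern.search(model_output)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None  # explicit failure: the caller decides what to do next


print(extract_command("Next step:\n$ whoami\n"))  # whoami
```

Every branch exists because the model's output format drifted at least once. Returning `None` instead of guessing keeps the failure visible to the code layer.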

The Architectural Implication

The separation between the model layer and the code layer is not a design preference — it is a cognitive architecture requirement. Model layer for System 1 work: command generation, output interpretation, investigative reasoning. Code layer for System 2 work: routing, halt detection, logging, verification. The boundary is not arbitrary. It corresponds to the boundary between the work that pattern-matching does well and the work that requires explicit correctness guarantees.
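One way to picture the boundary is a session loop in which the model only proposes commands, while halting and logging stay in code. This is a sketch under stated assumptions, not ARCHER's loop: `Session`, `run_session`, and the three callables are hypothetical names, and the model and target are replaced by stubs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    max_steps: int
    log: list[dict] = field(default_factory=list)


def run_session(
    propose: Callable[[int], str],      # model layer: returns the next command
    execute: Callable[[str], str],      # code layer: runs it, returns raw output
    objective_met: Callable[[], bool],  # code layer: probes target state
    session: Session,
) -> str:
    """Loop until the probe confirms the objective or the step budget runs out.

    The model never decides when to stop; it only proposes commands.
    """
    for step in range(session.max_steps):
        command = propose(step)
        output = execute(command)
        # The audit log records what was run and what came back, verbatim.
        session.log.append({"step": step, "command": command, "output": output})
        if objective_met():
            return "halt:verified"
    return "halt:budget_exhausted"


# Stub target: the "exploit" command flips uid to 0.
state = {"uid": 1000}


def fake_execute(command: str) -> str:
    if command == "exploit":
        state["uid"] = 0
    return f"ran {command}"


session = Session(max_steps=5)
result = run_session(
    propose=lambda step: ["enumerate", "exploit", "cleanup"][min(step, 2)],
    execute=fake_execute,
    objective_met=lambda: state["uid"] == 0,
    session=session,
)
print(result, len(session.log))  # halt:verified 2
```

Nothing the model says can end the session early, and nothing it says can extend the session past the budget.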

5. Justified True Belief — The Epistemology of Verification

The Concept

Classical epistemology defines knowledge as justified true belief: to know something, you must believe it, the belief must be true, and the belief must be grounded in good reasons rather than coincidence. In 1963, Edmund Gettier published a three-page paper that broke this definition. He showed that a belief can be both true and justified, with sound reasons for holding it, while still failing to constitute knowledge, because the belief arrives at truth via a route disconnected from what actually makes it true.

The Gettier problem reveals a gap between being right and knowing. A belief that arrives at truth through coincidence, circular reasoning, or a false intermediate step is not knowledge, even when it is true. The justification must track reality, not just correlate with it.

The Parallel

A language model can produce a true statement without that statement constituting knowledge in any meaningful sense. It produces output by sampling from distributions learned across training data — and those distributions are correlated with correct outputs about the world, but the correlation is not a direct causal connection to the actual state of things. A model that reports a successful privilege escalation may be right. If it is right because the phrasing matched training patterns from similar sessions — not because it verified the current uid — it arrived at truth via a Gettier route: accidentally correct, with justification disconnected from the fact it's reporting.

The practical consequence is that Gettier-true model outputs are indistinguishable in form from genuinely verified outputs. They have the same structure, the same confidence, the same prose. The architecture cannot tell them apart without a verification step that is independent of the model's reasoning — a check that connects the claim to the actual state of the target, not to the model's internal representation of what a successful session looks like.

The Architectural Implication

Ground-truth verification is not a conservative extra step; it is the minimum requirement for any claim to constitute knowledge rather than a well-formed guess. The verify_fn in ARCHER's eval harness is an epistemological institution: it refuses to accept claims on the basis of plausibility and demands that they be connected to actual target state. The 9.1% of OBJECTIVE_ACHIEVED signals it rejected in the controlled baseline are not merely failed sessions; they expose the Gettier structure. Strictly, a Gettier case is a true belief held for disconnected reasons, and the rejected claims were false. But they were produced by the same route as any unverified claim that happens to be true: plausibility, not contact with the target. Absent verification, a claim produced this way is either the wrong answer for the wrong reasons or, occasionally, the right answer for the wrong reasons. Either way, the justification is missing. That matters operationally because unverified findings drive remediation decisions as if they were genuine knowledge. When they are wrong, the error propagates through the findings pipeline unchecked.
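One way to make "connected to actual target state" checkable is to require every finding to cite captured tool output, then test the citation in code. The sketch below is illustrative only: `Finding`, `is_grounded`, the record IDs, and the choice of `uid=0(root)` as the marker are assumptions. The marker comes from the objective definition rather than from the model, so the model cannot supply its own evidence standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    claim: str
    evidence_ids: tuple[str, ...]  # IDs of captured tool-output records the claim cites


def is_grounded(finding: Finding, tool_outputs: dict[str, str], marker: str) -> bool:
    """True only if every cited record exists and at least one contains the marker."""
    if not finding.evidence_ids:
        return False  # no citation at all: plausibility is the only support
    cited = [tool_outputs.get(eid) for eid in finding.evidence_ids]
    if any(output is None for output in cited):
        return False  # cites a record that was never captured
    return any(marker in output for output in cited)


# Hypothetical captured output from running `id` on the target.
tool_outputs = {"out-7": "uid=0(root) gid=0(root) groups=0(root)"}

grounded = Finding("Privilege escalation to root", ("out-7",))
ungrounded = Finding("Privilege escalation to root", ("out-9",))

print(is_grounded(grounded, tool_outputs, "uid=0(root)"))    # True
print(is_grounded(ungrounded, tool_outputs, "uid=0(root)"))  # False
```

The two findings carry identical claim text; only the evidence link tells them apart. A log-based check like this reflects target state at capture time, so it complements a live probe rather than replacing one.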

6. Tacit Knowledge — What the Human Layer Holds

The Concept

Michael Polanyi opened The Tacit Dimension (1966) with a claim that proved difficult to dismiss: "We can know more than we can tell." The knowledge a master craftsperson has of their material, that a surgeon has of tissue resistance under a blade, that a chess grandmaster has of a board position — this knowledge is real, operationally consequential, and not reducible to explicit rules. It is embodied in practice, transferred through apprenticeship, and demonstrated through performance. Writing it down does not transmit it. Telling someone the rules does not produce the skill.

Polanyi called this tacit knowledge and argued it underlies all explicit knowledge — that the ability to follow explicit rules itself depends on tacit understanding of how to apply them. Expertise is not stored instructions plus practice time. It is a form of knowledge that exists in a different register entirely.

The Parallel

The human layer in a centaur security system holds tacit knowledge that is not transferable to the model layer or the code layer by any current mechanism. A senior analyst's sense that a finding "feels wrong" despite clean tool output, their read of whether a scope boundary is being approached, their judgment about which vulnerability matters in the context of this organization's actual risk profile — these are Polanyi's tacit knowledge applied to security operations. Real, operationally consequential, and not specifiable into rules or model weights.

This is not a temporary limitation waiting on a capability improvement. It is a structural property. Tacit knowledge is not a compressed version of explicit knowledge that hasn't yet been successfully decompressed. It is a different kind of thing. A more capable model does not gain access to it by becoming more capable at text prediction — because the knowledge does not live in text.

The Architectural Implication

The human layer is not optional overhead in the centaur architecture. It is not a compliance requirement or a liability management step. It is the only component of the system that can supply the knowledge that determines whether the system's outputs are actually useful in a specific operational context. A tool design that removes the human from those judgments does not replace tacit knowledge — it eliminates the only component capable of providing it and produces output that is optimized for plausibility rather than operational utility.

7. Automation Bias — The Trust Problem

The Concept

Automation bias is the tendency of human operators to over-trust automated systems: to accept their outputs without critical scrutiny, particularly under cognitive load or time pressure. It is not carelessness or incompetence. It is a well-documented pattern in human-automation interaction, studied extensively in aviation (Parasuraman & Manzey, 2010), clinical decision support, and process control. Automation bias produces two failure modes: omission errors (missing what the automated system didn't flag) and commission errors (acting on incorrect automated outputs because the operator trusted them without checking).

The research is consistent: the higher an operator's trust in the system and the higher their cognitive load, the more pronounced the bias. Expert operators are not immune — professional pilots exhibit automation bias at rates comparable to non-experts (Mosier, Skitka, Heers & Burdick, 1998; Skitka, Mosier & Burdick, 1999), because experience with a system's reliability builds a trust that vigilance does not automatically override.

The Parallel

An analyst reviewing AI-generated security findings is in the automation bias scenario. The output is structured, confident, and detailed — all surface features that trigger trust. Under time pressure, across many findings, at the end of a long engagement, the cost of scrutinizing each finding against its underlying evidence is real. The cost of accepting it on the basis of plausibility is invisible until it isn't.

This is not a criticism of the analyst. It is a description of the cognitive environment. Designing a system that relies on analyst skepticism as the primary check against false findings is designing for a cognitive pattern that the research predicts will erode under exactly the conditions where it is most needed.

The Architectural Implication

Verification cannot be voluntary. A system that produces plausible-sounding false findings and relies on analyst skepticism to catch them has distributed the System 2 check across every analyst, on every finding, under every cognitive load condition, including the high-load conditions under which that check is least reliable. Building the verification layer into the architecture, before output reaches the analyst, removes the cognitive burden from the person least equipped to bear it reliably at scale.

8. The Design Convergence

Six frameworks, drawn from neuroscience, ancient philosophy, cognitive psychology, analytic epistemology, the philosophy of tacit knowledge, and human factors research. They converge on the same architectural conclusion:

A system that places a probabilistic pattern-completion mechanism in roles requiring ground-truth access, deterministic correctness, or irreducible human judgment will fail — not occasionally and randomly, but predictably and structurally. The failures look different at the surface: a false completion signal here, a routing error there, an analyst trusting a wrong finding somewhere else. Underneath, they share a mechanism: the wrong kind of component in the wrong kind of role.

The centaur architecture — model layer for generation and interpretation, code layer for routing and verification, human layer for judgment and authorization — is the convergence of what each of these frameworks recommends. The model does what System 1 does well. The code does what System 2 must do reliably. The human does what tacit knowledge requires. The verify_fn grounds the model's output in Plato's actual world rather than the shadows. The architecture as a whole addresses automation bias by catching errors before they reach the analyst's desk.

This is not a coincidence. These frameworks are describing the same underlying fact from different angles: that different kinds of work require different kinds of cognitive architecture, and that the failure to match the architecture to the work is the error — not the capability of any individual component.

9. Falsifiable Claims

Confabulation is instruction-resistant. Prediction: prompt refinements targeting false-positive completion signals will not reduce the false-positive rate below a structural floor without a code-layer verification gate. Partially confirmed: ARCHER's development history shows that the false-positive rate persisted across prompt iterations; the rate fell only when verify_fn was introduced. Full confirmation requires an A/B measurement of prompt-only versus prompt-plus-verify_fn on a held-out session set.

System 1/2 architecture predicts reliability. Prediction: AI security tools that place routing and halt detection in the model layer will show higher error rates on ambiguous inputs than architecturally equivalent tools that implement these in the code layer. Measurable at: routing accuracy on --ambiguous phrasings, halt false-positive rate on structured evals.

Gettier cases enter training data at a measurable rate. Supported (single-run point estimate, not yet replicated): 9.1% of OBJECTIVE_ACHIEVED signals in the ARCHER controlled baseline failed ground-truth verification. This 9.1% comes from a single --runs 3 pass over 29 objectives (87 sessions), one point on a curve that the broader eval data shows swinging roughly 2× with time-of-day and VRAM conditions, so the specific figure should be read as indicative rather than a stable rate pending repeated runs. The qualitative finding is robust regardless of the exact percentage: these are structurally Gettier cases, correctly formed outputs disconnected from the ground truth they purport to report. Without the verification gate, they would have entered the training pipeline as labeled successes.

Acknowledgments

Empirical data from ARCHER development sessions conducted 2025–2026. Philosophical and psychological frameworks: Stanovich & West (2000), Kahneman (2011), Polanyi (1966), Plato (c. 375 BCE), Gettier (1963), Parasuraman & Manzey (2010), Gazzaniga (1967).

Acknowledgments pending formal peer review.

References

Gazzaniga, M. S. (1967). The split brain in man. Scientific American, 217(2), 24–29.

Gettier, E. L. (1963). Is justified true belief knowledge? Analysis, 23(6), 121–123.

Hawkins, J. (2026). The stochastic trap: An architectural critique of current AI security tools. Centaur Security Labs.

Hawkins, J. (2026). The centaur framework: A three-layer architecture for human-AI security operations. Centaur Security Labs.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

Mosier, K. L., Skitka, L. J., Heers, S., & Burdick, M. (1998). Automation bias: Decision making and performance in high-tech cockpits. International Journal of Aviation Psychology, 8(1), 47–63.

Parasuraman, R., & Manzey, D. H. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381–410.

Plato (c. 375 BCE). Republic, Book VII. (Trans. G. M. A. Grube, revised by C. D. C. Reeve, 1992). Hackett Publishing.

Polanyi, M. (1966). The Tacit Dimension. Doubleday.

Skitka, L. J., Mosier, K. L., & Burdick, M. (1999). Does automation bias decision-making? International Journal of Human-Computer Studies, 51(5), 991–1006.

Stanovich, K. E., & West, R. F. (2000). Individual differences in reasoning: Implications for the rationality debate. Behavioral and Brain Sciences, 23(5), 645–665.

Glossary

Automation bias: The tendency to over-trust automated system outputs, particularly under cognitive load — producing errors of omission (missing what the system didn't flag) and commission (acting on incorrect outputs without scrutiny).

Confabulation: The automatic, unconscious production of confident, coherent false accounts by the brain to fill gaps in memory or knowledge. Distinguished from lying by the absence of intent: the subject believes the confabulated account.

Gettier problem: The epistemological observation that a belief can be true and justified while still failing to constitute knowledge — because the justification is disconnected from what actually makes the belief true.

System 1 / System 2: Dual-process framework distinguishing fast, associative, pattern-driven cognition (System 1) from slow, deliberate, rule-governed cognition (System 2), formalized by Stanovich and West (2000) and popularized by Kahneman (2011). Language models operate as System 1 at scale; correctness-critical operations require System 2 mechanisms.

Tacit knowledge: Knowledge that cannot be fully articulated in explicit rules — embodied in practice, demonstrated through skilled performance, and transferred through experience rather than instruction. Identified by Polanyi as the foundation of all explicit knowledge.

About the author: Jay Hawkins spent twenty years in the U.S. Army, including a decade in cyber operations — serving at USCYBERCOM, USCENTCOM, USNORTHCOM, and USEUCOM — and holds an active TS/SCI clearance. He builds local-first AI security tools and writes about the methodology, the hard lessons, and the compliance implications of doing it in production. CEH, CHFI, Pentest+, Security+.

Centaur Security Labs — centaursecuritylabs.com