Investigative Provenance as a Compliance Requirement: Designing AI Security Tools for NIS2 and DORA¶
Status: Technical Report | Centaur Security Labs | 2026
Author: Jay Hawkins, Centaur Security Labs
The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.
NIS2 and DORA do not regulate AI tools by name. They regulate the outputs those tools produce — findings that drive remediation decisions, reports that constitute audit evidence, logs that must be reconstructable for incident investigation. A security tool that uses AI to generate findings is not exempt from evidentiary requirements because it uses AI. This paper argues that investigative provenance — the traceable chain from raw tool output to final finding — is a compliance requirement, not a design preference, and that most current AI security tools cannot satisfy it.
Abstract¶
The EU's NIS2 Directive and the Digital Operational Resilience Act (DORA) impose substantive requirements on the security practices of essential and important entities, including the ability to demonstrate that security activities were performed, findings were derived from evidence, and audit trails are complete and reconstructable. These requirements apply to any security activity performed under those frameworks — including security activities where AI tools generated the findings.
This paper argues that the concept of investigative provenance — the complete, verifiable chain from raw tool output to final security finding — is the design criterion that distinguishes AI security tools capable of satisfying NIS2/DORA from those that cannot. I examine the provenance requirements implied by NIS2 Article 21 (risk management obligations), DORA Article 25 (ICT risk testing), and the EBA/EIOPA/ESMA joint guidance on ICT risk frameworks, identify the architectural characteristics that determine whether a tool can generate provenance-complete findings, and present ARCHER's architecture as a reference implementation of the provenance-complete design.
1. Introduction¶
Compliance frameworks rarely address AI tools directly. NIS2 does not mention language models. DORA does not regulate inference endpoints. What they regulate is the operational outcome: the organization must demonstrate that its security practices meet the required standard, and that demonstration requires evidence.
When an AI tool generates a security finding — "this target is vulnerable to CVE-2023-XXXX," "this account has elevated privileges," "this traffic pattern matches known C2 behavior" — that finding carries the same evidentiary burden as a finding generated by a human analyst. The question a regulator will ask is not whether a human or a machine produced the finding. The question is whether the finding is traceable to specific evidence, whether the evidence is preserved and reconstructable, and whether a qualified professional reviewed and authorized the finding before it drove a remediation decision.
Most current AI security tools cannot satisfy these requirements. They generate findings that are summaries of the model's probabilistic outputs — plausible, often accurate, but not traceable to specific raw evidence in an auditable way. This is not a gap in model capability. It is an architectural choice: systems designed for user experience and capability demonstration rather than evidentiary completeness.
This paper argues that the architecture determines compliance eligibility, and documents what a provenance-complete architecture requires.
2. Regulatory Background¶
Legal Review Pending
The regulatory analysis below is a practitioner's reading of the relevant frameworks, not legal advice. This section requires review by a qualified EU law practitioner before formal publication or use in compliance documentation.
2.1 NIS2 Directive — Article 21¶
Directive (EU) 2022/2555 of the European Parliament and of the Council of 14 December 2022 on measures for a high common level of cybersecurity across the Union (NIS2 Directive), Article 21.^1
Article 21 requires essential and important entities to implement risk management measures including, inter alia: - Security testing and auditing (Article 21(2)(d)) - Incident handling and logging (Article 21(2)(b)) - Business continuity with reliable, traceable security operations (Article 21(2)(c))
The relevant question for AI security tools: when a security assessment is performed using AI tooling, does the resulting report satisfy the Article 21 requirements for demonstrating that risk management measures were implemented? My reading: it does only if the report is traceable to specific technical evidence. A summary generated by a language model, without a citation to the raw tool output, does not constitute evidence of a security measure performed — it constitutes a record that the tool ran.
ENISA's Technical Implementation Guidance on cybersecurity risk management measures (version 1.0, 26 June 2025) provides implementation detail for Commission Implementing Regulation (EU) 2024/2690 across thirteen thematic areas, including risk management documentation and incident handling.[^3] The guidance does not address AI-generated security outputs as a discrete category; the evidentiary requirements above are derived from its general documentation and evidence standards for demonstrating that risk management measures were implemented.
2.2 DORA — ICT Risk Testing Framework¶
Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on digital operational resilience for the financial sector (DORA), Article 25 and Articles 26–27 (Threat Led Penetration Testing).^2
DORA Article 25 mandates ICT risk testing as a component of the digital operational resilience framework. Article 26-27 establishes Threat Led Penetration Testing (TLPT) as the highest tier — a structured, scoped penetration test performed by qualified testers with methodology validation by an independent assessment team.
TLPT results must be documented in a format that: - Demonstrates scope compliance (testing covered the mandated systems) - Demonstrates methodology compliance (testing followed a recognized standard — PTES, TIBER-EU) - Documents specific findings with their technical evidence - Supports the supervisory review process (regulators can review findings, not just a summary)
The European Central Bank's TIBER-EU Framework (May 2018) — the basis for DORA TLPT methodology — requires a structured Closure Report documenting findings, evidence, and remediation recommendations.[^4] The Closure Report must trace each finding to the specific test actions and evidence that produced it; a summary without a supporting evidence trail does not satisfy this requirement.
An AI tool that produces TLPT findings must meet these documentation requirements. A session log that shows what commands were run and what output was returned is a component of that documentation. A language model summary that describes "a critical vulnerability was identified" without citing the specific tool output that supports the claim is not.
2.3 DORA ICT Risk Management Framework¶
Commission Delegated Regulation (EU) 2024/1774 supplementing DORA requires financial entities to maintain documentation of ICT risk management activities sufficient to demonstrate compliance to competent authorities.[^5] For ICT security testing, this includes records of scope, methodology, findings, and remediation actions. Where AI tools generate security findings that feed into DORA-required risk management documentation, the provenance requirements in Section 3 apply directly: the documentation must be traceable to specific technical evidence. The regulatory technical standards in CDR 2024/1774 were developed by the ESA Joint Committee; their Final Report (JC 2023 86, January 2024) is the primary reference for practitioners interpreting how those requirements should be operationalized.[^6]
2.4 Data Residency and Inference Routing¶
An additional compliance dimension: NIS2 and DORA both impose requirements on the handling of information about vulnerabilities and security weaknesses. When an AI security tool routes inference through cloud infrastructure, the operational data — target descriptions, vulnerability findings, network topology — transits third-party systems.
Legal Analysis Pending
Whether cloud inference for security tooling constitutes a reportable data transfer under NIS2 Article 23, GDPR Article 46, or DORA Article 28 is not a settled question and requires qualified legal analysis before this section can be finalized.
The practitioner's answer: until this question is settled, the conservative position for entities under NIS2/DORA is to use local-first inference for any security tooling that processes information about vulnerabilities, target configurations, or network topology. The alternative — routing operational security data through a third-party inference provider — creates a third-party ICT risk exposure that must itself be managed under DORA Article 28.
3. Investigative Provenance: A Definition¶
Investigative provenance is the complete, verifiable chain from raw tool output to final security finding. A provenance-complete finding satisfies three properties:
Traceability: Every claim in the finding is traceable to specific raw output from a specific tool invocation at a specific time. "This target is vulnerable to CVE-2023-XXXX" traces to the exact output of the tool that identified the CVE — not a model's paraphrase of that output.
Completeness: The provenance chain does not contain gaps. A finding cannot cite a tool output that is not preserved in the session log. A session log cannot omit commands or outputs that occurred during the session.
Integrity: The provenance chain cannot have been altered after the fact. Session logs must be timestamped and protected against modification; the connection between finding and evidence must be established at the time of generation, not reconstructed later.
These three properties — traceability, completeness, integrity — map directly to the evidentiary requirements of NIS2/DORA compliance. A tool that provides them is compliance-eligible. A tool that does not provide them generates findings that may be accurate but cannot be used as compliance evidence.
4. The Provenance Gap in Current AI Security Tools¶
4.1 The Summary Problem¶
Most AI security tools present findings as natural-language summaries generated by a language model. The summary is useful for human consumption — it is readable, context-aware, and often accurate. It is not useful as compliance evidence because it is the model's output, not the tool's output.
Consider the evidentiary chain: 1. Nmap runs and produces raw output 2. The AI model reads the output and generates a summary 3. The tool reports the summary as a finding
What the compliance auditor sees is step 3. What the compliance auditor needs to see is step 1. If step 1 is not preserved and cited, the finding cannot be verified and cannot constitute compliance evidence.
This is not a claim that AI-generated summaries are wrong. It is a claim that they are not sufficient, and that the architectural choice to present summaries rather than evidence-linked findings is a compliance disqualifier for findings that must satisfy NIS2/DORA.
4.2 The Audit Trail Problem¶
Provenance-complete security findings require a session log that is: - Timestamped at the command level (each command invocation and its output) - Integrity-protected (not modifiable after the fact) - Complete (no commands or outputs omitted) - Retained for the duration required by the applicable framework
Commission Delegated Regulation (EU) 2024/1774, implementing DORA Articles 15 and 16(3), requires ICT risk management documentation to be maintained and available for supervisory review; DORA Article 17 requires retention of records relating to ICT incidents for a period enabling post-incident analysis and regulatory inspection. NIS2 Article 23 specifies incident notification timelines but does not specify a documentation retention period — testing documentation retention under NIS2 is determined by member-state implementing measures and competent-authority guidance, which vary by jurisdiction. Qualified legal analysis is required before specifying precise retention periods for any given entity.
Many current AI security tools do not provide session-level audit logs at all. Tools that route inference through cloud APIs may not retain the raw exchange. Tools that operate as web applications may not expose the underlying session structure. Without a session log, there is no provenance chain to inspect.
4.3 The Authorization Gap¶
Investigative provenance includes not only the technical evidence chain but the authorization chain: who authorized the security activity, under what scope, and with what approval. A penetration test performed without documented authorization is not a compliance-satisfying security activity even if its technical findings are accurate.
AI tools that operate autonomously — accepting a task description and executing without explicit scope authorization at each significant decision — create an authorization gap. The tool may perform actions that were not explicitly in scope, against targets that were not explicitly authorized, because the model interpreted the task description broadly.
This is not a theoretical concern. ARCHER's architecture includes explicit scope and target configuration that must be set before the agent executes. The model does not autonomously determine scope; scope is a code-layer constraint that the human operator sets before execution begins.
5. ARCHER's Architecture as a Reference Implementation¶
ARCHER's design was shaped by the provenance requirements described above, though not initially framed in regulatory terms. The design choices that produce provenance-complete output were driven by operational requirements — accurate, verifiable findings — that are coextensive with compliance requirements.
5.1 Session Logging¶
Every ARCHER session produces a timestamped log at ~/.archer_sessions/ containing:
- Session start time, target configuration, scope parameters
- Each command issued to the execution environment, with the exact command string
- The raw output of each command, with timestamp
- The model's [FINDINGS] annotations, linked to the commands that produced them
- The session termination reason (OBJECTIVE_ACHIEVED, HALT_DISCIPLINE, max command count)
The log is written incrementally — each entry is appended as it occurs, not reconstructed after session end. This ensures the log represents the actual session, not a summary generated after the fact.
5.2 Finding Verification¶
ARCHER's verify_fn layer, deployed in the eval harness, provides a ground-truth check on model-claimed success. The check confirms that success indicators appear in actual tool output — not in model-generated text.
The verify_fn pattern separates command blocks by command boundary and excludes echo/printf blocks from the success check. This enforces provenance at the verification layer: a finding must be traceable to tool output, not to the model's assertion of what tool output should look like.
5.3 Local-First Inference¶
ARCHER runs inference locally via Ollama. No operational data — target configuration, vulnerability findings, tool output — transits third-party infrastructure. This design eliminates the third-party ICT risk exposure that cloud inference creates and avoids the data residency question that NIS2/DORA pose for cloud-routed security tooling.
5.4 Human Authorization Layer¶
ARCHER implements the three-layer responsibility split: the model generates commands, the code executes and logs them, the human authorizes the session's scope and reviews findings before they drive remediation. High-impact and irreversible actions require explicit human authorization; the model cannot authorize them autonomously. Mechanical enforcement of role boundaries — boundary violation detection, decision-layer attribution, and withheld-action disclosure — is implemented as Phase 5 of the compliance roadmap.
This is not a feature added for compliance. It is the design principle that defines the Centaur model: the human handles accountability for decisions that carry legal or ethical weight. That principle is also the design principle that satisfies NIS2/DORA's implicit requirement that security activities are performed under human oversight.
6. Methodology¶
Regulatory analysis approach. The evidentiary requirements described in Section 2 are derived through a practitioner's reading of the primary sources: the NIS2 Directive text, the DORA text, the TIBER-EU Framework, CDR 2024/1774, the ESA Joint Committee Final Report (JC 2023 86), and the ENISA Technical Implementation Guidance. The analysis does not identify explicit AI-specific provisions — none exist — but instead applies the general evidentiary and documentation standards in each instrument to the AI-generated finding context. The operative question throughout: when a compliance-required security activity produces AI-generated findings, do those findings satisfy the documentation requirements? The answer is derived from what the documentation requirements specify, not from AI-specific exemptions or inclusions.
ARCHER architecture evaluation. ARCHER's architecture was evaluated against the provenance requirements derived in Section 3 through direct inspection of: (a) session log content and format at ~/.archer_sessions/, verified against the traceability, completeness, and integrity properties; (b) the verify_fn implementation in testenv/eval_harness.py, specifically the split_by_command logic and echo-block exclusion; (c) scope handling in ARCHER.py — the target configuration, --prep-sudo flow, and [CLARIFY] token support for scope questions before execution.
Limitations. Four limitations bound the analysis:
-
Practitioner reading vs. legal advice. The regulatory interpretations in Section 2 are derived from primary source reading, not legal analysis. They require confirmation by a qualified EU law practitioner before use in compliance documentation. The §2.4 and §4.2 legal-review warnings are the specific boundaries; the rest of Section 2 is interpretive but lower-risk.
-
Self-assessment circularity. ARCHER's architecture is evaluated by the same individual who designed it. The verification points in Section 7 allow external evaluators to replicate the assessment; they do not eliminate the circularity of the initial evaluation.
-
Single-jurisdiction focus. The analysis addresses EU frameworks (NIS2, DORA) and does not cover equivalent requirements in other jurisdictions — UK NIST, US NIST CSF, APRA CPS 234, or equivalent national implementations of NIS2 by member states. The provenance principles in Section 3 generalize across frameworks; the specific regulatory citations do not.
-
Static snapshot. Both the regulatory framework (delegated regulations, member-state implementations, supervisory guidance) and the ARCHER architecture are evolving. The analysis reflects the state of both as of the paper's publication date.
7. Reproducibility¶
Pending
Full reproducibility documentation under development.
This paper's claims about ARCHER's architecture are verifiable by inspection of the open-source codebase at github.com/jayhawkins108/ARCHER. Specific verification points:
- Session log format and content:
~/.archer_sessions/*.jsonl— examine after any eval run verify_fnimplementation:testenv/eval_harness.py— search forverify_fnandsuccess_fn; examine thesplit_by_commandlogic that enforces echo-block exclusion- Local inference configuration:
ARCHER.py—--localflag and Ollama endpoint configuration - Human authorization layer:
ARCHER.py—--prep-sudoand scope configuration handling; explicit[CLARIFY]token support for scope questions
The regulatory claims in this paper require external verification: - NIS2/DORA compliance eligibility can only be formally verified by the competent authority of the relevant member state; nothing in this paper constitutes a compliance certification - The legal analysis of cloud inference data residency obligations requires qualified legal review
8. Recommendations¶
For security practitioners in NIS2/DORA-regulated environments:
Before selecting an AI security tool, ask the provenance question. Can you show me the raw tool output that supports this finding? If the tool cannot answer that question, the finding is not compliance evidence.
Treat AI-generated summaries as analysis aids, not findings. A model summary that says "this system is vulnerable to authentication bypass" is a starting point for investigation, not a reportable finding. The finding is the specific CVE, in the specific service version, confirmed by specific tool output, in a log that is timestamped and preserved.
Build the authorization documentation before the tool runs. Scope, target, duration, and authorized actions must be documented before a security activity begins. An AI agent that infers scope from a task description does not provide this documentation automatically.
Evaluate local-first tools for environments with data residency requirements. Cloud inference creates a data flow that must be managed under DORA Article 28. Local-first inference eliminates that exposure entirely. For regulated entities, the compliance cost of managing third-party ICT risk for cloud inference may exceed the capability benefit.
For tool developers targeting regulated environments:
Provenance is a first-class design requirement, not a logging feature. Session logs that satisfy provenance requirements must be designed in from the start, not added as an export feature after the architecture is set. The log must capture raw tool output, not model summaries of tool output.
The human layer is not optional. Compliance frameworks require human authorization for security activities. A tool designed to operate autonomously without explicit scope authorization and human review of findings before reporting cannot satisfy this requirement. Build the authorization and review layers into the product architecture, not into the documentation.
9. Falsifiable Claims¶
-
NIS2/DORA-regulated entities cannot use AI security findings as compliance evidence without a provenance chain. Prediction: a qualified legal analysis of NIS2 Article 21 and DORA Article 25 will confirm that findings must be traceable to specific technical evidence to constitute demonstration of a required security measure. Falsified if: regulatory guidance or legal analysis establishes that AI-generated summaries are sufficient evidence of compliance.
-
Cloud inference for security tooling constitutes a reportable third-party ICT risk under DORA Article 28. Prediction: a qualified legal analysis will confirm that routing vulnerability data through a cloud inference provider requires a third-party ICT risk assessment under DORA Article 28. Falsified if: legal analysis establishes that cloud inference does not constitute a material ICT third-party arrangement under DORA.
-
ARCHER's session logs satisfy the traceability and completeness properties as defined in Section 3. Prediction: inspection of ARCHER session logs will confirm (a) every finding annotation cites a command that appears in the log, and (b) the log is complete — no commands or outputs are omitted. Falsified if: session log inspection reveals finding annotations that cite no specific command output, or commands that executed but do not appear in the log.
-
Local-first inference eliminates the DORA Article 28 exposure that cloud inference creates. Prediction: the data flow analysis of ARCHER's local inference configuration will show that no operational data transits third-party infrastructure. Falsified if: ARCHER's local inference configuration routes data through any external endpoint.
References
[^3]: European Union Agency for Cybersecurity (ENISA). Technical implementation guidance on cybersecurity risk management measures, version 1.0, 26 June 2025. Implementing guidance for Commission Implementing Regulation (EU) 2024/2690. enisa.europa.eu/publications/nis2-technical-implementation-guidance
[^4]: European Central Bank. TIBER-EU — How to implement the European framework for Threat Intelligence-based Ethical Red Teaming. Frankfurt am Main: ECB, May 2018. ecb.europa.eu/pub/pdf/other/ecb.tiber_eu_framework.en.pdf
[^5]: Commission Delegated Regulation (EU) 2024/1774 of 13 March 2024 supplementing Regulation (EU) 2022/2554 with regard to regulatory technical standards specifying ICT risk management tools, methods, processes and policies, and the simplified ICT risk management framework. Official Journal of the European Union, L 2024/1774, 25 June 2024. eur-lex.europa.eu/eli/reg_del/2024/1774/oj/eng
[^6]: European Banking Authority, European Securities and Markets Authority, and European Insurance and Occupational Pensions Authority (Joint Committee). Final Report on draft Regulatory Technical Standards on ICT Risk Management Tools, Methods, Processes and Policies, the Simplified ICT Risk Management Framework and determining significant cyber threats, significant disruptions and significant risks under Regulation (EU) 2022/2554. JC 2023 86. January 2024. eba.europa.eu/publications/eba-bs-2024-023-final-report-rts-ict-risk-management-tools-methods-processes-and-policies
Glossary
Audit trail: A complete, ordered record of actions, decisions, and findings sufficient to reconstruct what happened and why. In security operations, an audit trail must be complete (no gaps in the command record), accurate (findings link to the specific evidence that supports them), and tamper-evident (integrity verifiable after the fact).
DORA (Digital Operational Resilience Act): EU regulation (Regulation 2022/2554) requiring financial entities to manage ICT risk, test operational resilience, and report major ICT incidents. Article 25 mandates threat-led penetration testing for significant institutions; Article 28 governs third-party ICT risk arrangements, including cloud inference providers.
Essential entity: Under NIS2, a category of organization subject to stricter obligations than important entities. Includes operators of critical infrastructure, large digital service providers, and entities designated by member states. Subject to proactive supervision and mandatory incident notification within 24 hours of a significant incident.
ICT risk: Risk arising from information and communication technology — including software, hardware, network infrastructure, and data. DORA defines ICT risk broadly to include third-party exposure from cloud providers and managed services, not only internal systems.
Important entity: Under NIS2, a category of organization subject to NIS2 obligations but with lighter supervisory requirements than essential entities. Subject to reactive supervision following incidents rather than proactive oversight.
Investigative provenance: The property of a security finding that makes it possible to trace the finding back to the specific evidence that supports it — the command executed, the output received, and the analyst judgment applied. A finding with complete investigative provenance can be reconstructed, challenged, and verified by a third party without access to the original analyst.
NIS2 Directive: EU directive (Directive 2022/2555) on security of network and information systems. Requires risk management measures, incident reporting, and supply chain security for essential and important entities across critical sectors. Replaces NIS1; took effect October 2024.
Provenance-complete finding: A security finding that includes a direct citation to the specific command output that supports it. A finding without this citation is provenance-incomplete: it cannot be independently verified, reconstructed for incident investigation, or defended under regulatory scrutiny.
Third-party ICT risk: Risk arising from dependence on external providers of ICT services, including cloud inference providers. DORA Article 28 requires financial entities to assess and manage this risk for critical third-party ICT providers. Routing vulnerability data through a cloud inference API creates a third-party ICT risk exposure that local inference eliminates.
Traceability: The property that allows a specific output — a finding, a decision, a log entry — to be traced back to its input evidence. In investigative provenance, traceability means a finding's supporting command and raw output are present and accessible in the session record, not reconstructed from memory or summarized by the model.
About the author: Jay Hawkins spent twenty years in the U.S. Army, including a decade in cyber operations — serving at USCYBERCOM, USCENTCOM, USNORTHCOM, and USEUCOM — and holds an active TS/SCI clearance. He builds local-first AI security tools and writes about the methodology, the hard lessons, and the compliance implications of doing it in production. CEH, CHFI, Pentest+, Security+.
Centaur Security Labs — centaursecuritylabs.com