The Hidden Cost of Cheap Inference: DeepSeek and the Adversarial API

A risk analysis for security practitioners considering Chinese-hosted AI services

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

The Honest Starting Point

The reason I looked at DeepSeek seriously is straightforward: it costs roughly a quarter of what Claude Haiku costs per token, and independent benchmarks put its performance within a few percentage points on most reasoning tasks. For a tool like ARCHER that runs T2 quality audits across hundreds of penetration testing sessions, that price delta is real money and a legitimate engineering consideration.

I want to start there, because the conversation about Chinese AI services too often skips the practical motivation and goes straight to alarm. The alarm is warranted — but it lands better when it starts from the actual tradeoff rather than from assumed naivety. Most practitioners who consider DeepSeek aren't being reckless. They're doing engineering math.

The math changes when you look at what you're actually sending.

What T2 Audit Sessions Actually Contain

Before the geopolitical analysis, it's worth being precise about the data. ARCHER's T2 audit sends session logs to the scoring model. A session log is not a sanitised summary. It is a verbatim record of:

Every command the agent issued against a target system
Full tool output from nmap, Metasploit, Impacket, CrackMapExec, Burp, and others
Credentials found or used during the session — even test credentials follow real patterns
Network topology: which IPs were scanned, which services were found, how pivot chains were constructed
The agent's reasoning about what to try next and why
Hundreds of sessions like this, collectively forming a library of attack methodology

Against controlled lab targets, none of this is operationally sensitive. Metasploitable2's credentials are documented on the project's GitHub page. The GOAD-Light domain topology is public. But technique fingerprinting doesn't require sensitive targets — it requires pattern. And hundreds of sessions is more than enough pattern.

This is the context in which the risks need to be evaluated.

Three Vectors of Risk

1. Training Data Ingestion

This is the most certain risk, because it doesn't require any adversarial intent — it's standard commercial practice.

Using customer data to improve models is common across the industry, although providers differ on whether API traffic is included by default; several major Western providers state that it is not. DeepSeek's terms of service are consistent with such use. When you send a session log for T2 scoring, that log (the commands, the tool outputs, the reasoning chain) becomes a candidate training sample. At scale across thousands of security practitioners, what DeepSeek accumulates is not individual session logs. It's a corpus of how Western security practitioners think: which tools they reach for first, how they interpret ambiguous output, what patterns they recognise as signs of misconfiguration, how they chain from reconnaissance to exploitation.

The value of that corpus is not in any individual session. It's in the aggregate. A model trained on it learns to reason about attack chains the way its users do — which is precisely the capability a state actor would want to develop, independent of any other intelligence value the raw data might carry.

There's no easy mitigation for this risk when using cloud inference. Data minimisation helps at the margins — sending only the structured T1 flags rather than the full session log, for instance — but the moment you send a tool output to be evaluated, you've contributed to the training distribution.

2. Legal Compulsion Under the National Intelligence Law

China's National Intelligence Law (2017), Article 7, is unambiguous: "Any organization or citizen shall support, assist, and cooperate with state intelligence work." Article 14 gives intelligence agencies the authority to require that cooperation. There is no equivalent of a Section 215 challenge, no public disclosure requirement, and no opt-out mechanism for companies operating under Chinese jurisdiction.

DeepSeek is a Chinese company, incorporated in China, subject to Chinese law. Its data — API inputs, logs, model weights, training artefacts — is accessible to Chinese intelligence agencies on demand and without notice to the data subject.

This is not speculation about intent. It is a description of the legal architecture. Whether or not DeepSeek is currently being compelled to share data is unknowable from the outside. The relevant point is that it can be, at any time, with no warning and no visibility to the sender.

For security practitioners, the implication is categorical: data sent to a Chinese-jurisdiction API should be treated as potentially accessible to Chinese state intelligence. Not probably accessed. Potentially. That's the risk posture the architecture requires.

3. Technique Fingerprinting

This is the subtlest risk and the one most specific to the security domain.

Attack technique libraries are operationally valuable intelligence. Not because they reveal targets — the labs are public — but because they reveal how a practitioner thinks under uncertainty. The sequence of checks after an nmap scan. The decision tree when a service is unexpectedly open. The fallback when a CVE doesn't land as expected. The way a practitioner interprets partial output from a tool whose behaviour they know well.

This kind of reasoning pattern is not easily extracted from public training data, because public write-ups describe successful attacks. Session logs capture the actual process: the dead ends, the re-evaluations, the tool choices that didn't work and why. That's the data that matters for understanding how a practitioner operates, not just what they know.

At scale across a research community, a model trained on these sessions learns to approximate the cognitive patterns of Western security practitioners. That has clear value for red team capability development. It also has defensive value — understanding how an adversary thinks is the foundation of anticipating their next move — which cuts both ways.

Lessons From the Infrastructure Playbook

DeepSeek is not an isolated case. The pattern of Chinese technology companies as vectors for state intelligence objectives has a well-documented history.

The Huawei Template

The argument about Huawei and 5G infrastructure was dismissed by some as geopolitical protectionism. The technical evidence, accumulated over years, told a different story.

The UK's Huawei Cyber Security Evaluation Centre (HCSEC), established to assess Huawei equipment before network deployment, was the subject of annual oversight board reports documenting quality and security deficiencies: hardcoded credentials, outdated third-party libraries, undocumented interfaces, and code quality that made independent auditing difficult to the point of impracticality. The 2019 report was unambiguous: the board could only provide limited assurance that the risks posed by Huawei's involvement in UK critical national infrastructure could be managed.

Vodafone Italy identified an undocumented telnet service in Huawei home routers in 2011; Huawei described it as a diagnostic interface and Vodafone required its removal — the kind of finding that is difficult to distinguish from a deliberate backdoor. Separately, Huawei equipment was found on cell towers near U.S. military and nuclear-command infrastructure, prompting an FBI investigation and a 2020 FCC order barring carriers from using federal subsidies to purchase Huawei gear.

The lesson from Huawei is not that every Chinese technology company is a deliberate intelligence apparatus. It's that the potential for intelligence access is built into the legal and corporate structure regardless of whether it's being actively exploited. A backdoor that isn't being used is still a backdoor. An API whose data can be compelled without notice is still compellable, whether the compulsion is happening today or not.

Intellectual Property Theft at Scale

The Department of Justice indictment of APT10 members in 2018 documented something beyond ordinary espionage. APT10, also known as Stone Panda, didn't just steal specific secrets. It systematically compromised managed service providers to gain access to their clients' intellectual property across multiple industries simultaneously. The scope was industrial. According to the indictment, a parallel technology-theft campaign hit more than 45 technology companies and U.S. government agencies in industries from aviation to satellite technology to pharmaceuticals, while the managed-service-provider intrusions reached clients in at least 12 countries.

APT41 combined state-sponsored espionage with financially motivated cybercrime — a hybrid that reflects the Military-Civil Fusion (MCF) doctrine explicitly: technology and capability acquired for commercial purposes feeds into state capability development, and vice versa.

The MCF doctrine, formalised in Chinese policy since 2016, requires technology companies to contribute to national security objectives. It's not a surveillance apparatus imposed on unwilling companies — it's a framework that treats commercial technology development and military capability as parts of the same national project. A company subject to MCF isn't being weaponised; it's operating in a system where the distinction between commercial and military use was never intended to be sharp.

Salt and Volt Typhoon: The Posture Shift

The most recent evolution is significant for practitioners specifically. Volt Typhoon, documented by CISA, NSA, and the FBI in 2023, represented a shift in Chinese cyber operations: from data theft to pre-positioning. The campaign targeted US critical infrastructure — water systems, power grids, communications — not to steal information but to establish persistent access that could be activated in a future conflict.

Salt Typhoon, disclosed in late 2024, breached the networks of AT&T, Verizon, T-Mobile, and Lumen. The most consequential reported target was not the content of individual conversations but lawful intercept infrastructure: the systems US carriers use to comply with court orders for wiretapping. Whatever communications were also taken, the intruders had reached the mechanism for accessing communications, with no fingerprint on individual calls.

These campaigns share a strategic logic: capability pre-positioned now for use in a conflict that hasn't started. The data collected today doesn't need to be useful today. It needs to be indexed, stored, and retrievable when the context changes.

This is the adversarial frame in which API data should be considered.

The Three Warfares Doctrine

The People's Liberation Army's Three Warfares doctrine — first codified in the 2003 PLA Political Work Regulations — describes three non-kinetic domains of conflict that run continuously, not just during declared hostilities.

Public opinion warfare (舆论战) targets the information environment: shaping domestic and international narratives to support Chinese objectives and undermine adversary resolve. Psychological warfare (心理战) targets decision-making: creating uncertainty, inducing hesitation, and degrading confidence in institutions and capabilities. Legal warfare (法律战) uses international and domestic legal frameworks as instruments of statecraft — establishing facts on the ground, delaying adversary responses, and creating precedents that constrain future action.

The relevance to AI and data is that all three warfares benefit from capability that compounds quietly over time. A model trained on Western security practitioner reasoning patterns is an asset for public opinion warfare (understanding how adversaries think about security enables more effective disinformation) and psychological warfare (anticipating how practitioners will respond enables better deception operations). Legal warfare benefits from understanding how Western legal systems handle cybersecurity incidents — which is precisely the kind of procedural knowledge that emerges from security session logs at scale.

This isn't a claim that DeepSeek's API is a Three Warfares instrument. It's an observation that the type of data generated by security tools is the type of data that has strategic value in that doctrine, and that the legal architecture exists to make it available.

A Fourth Concern: The Code It Writes Back

The three vectors above are all about what you send. There is a separate question I originally underweighted: whether you can trust what comes back.

In May 2026, Booz Allen Hamilton published an empirical study — "What's In America's Code?" — that tested four Chinese frontier models (Qwen3-Coder, MiniMax M2.5, Kimi K2.5, DeepSeek V4-Pro) against one American model (Claude Opus 4.6) across more than 2,800 trials and roughly 460,000 lines of generated code. Three of the four Chinese models produced measurably more vulnerable code, and the vulnerabilities were obfuscated — the code "looked correct and secure" on the surface while the model "silently elevated risk" underneath, in a way harder to catch in standard review. The models also exhibited PRC-aligned political bias: all four Chinese models refused to write code on topics Beijing deems politically sensitive, in some cases reciting China's official restrictions verbatim.

The finding that should give a security practitioner pause is narrower and stranger than "Chinese models write worse code." It is that the code got worse when the user identified as a U.S. government representative. Qwen3-Coder added roughly 130% more vulnerabilities under a government persona than under a neutral one. That is not a quality problem. A model that writes more vulnerable code specifically when a high-value user announces themselves is exhibiting behavior that — whether by deliberate design or as an emergent property of training on politically conditioned data — is operationally indistinguishable from targeting, even where intent cannot be established.
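Percentage claims like "130% more" are easy to misread, so it may help to spell out the arithmetic. The sketch below assumes the figure is a relative change against the neutral-persona baseline, which is how I read the report. The function name `relative_change` and the counts are hypothetical, chosen only to illustrate the calculation; they are not figures from the study.

```python
def relative_change(neutral: float, persona: float) -> float:
    """Fractional change in findings when the persona is added."""
    if neutral <= 0:
        raise ValueError("neutral baseline must be positive")
    return (persona - neutral) / neutral


# Hypothetical counts, chosen only to illustrate the arithmetic.
neutral_findings = 100
persona_findings = 230
change = relative_change(neutral_findings, persona_findings)
print(f"{change:+.0%}")  # prints "+130%"
```

Read this way, "130% more" means the persona condition produced 2.3 times the baseline count, not 1.3 times.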

Two honest caveats. First, this is a vendor report, not peer-reviewed work, and Booz Allen's headline recommendation — ban foreign models, invest in American ones — aligns neatly with its commercial position as a U.S. government contractor. The precise figures deserve to be treated as directional until the full methodology is published and independently reproduced. Second, the effect was not uniform: Kimi K2.5 was the exception — Booz Allen reports it "performed best overall," and under the government persona its code actually got more secure (−18%), where Qwen3-Coder got 130% worse. "Chinese model" is not a monolith, and the analysis is sharper when it tracks measured, per-model behavior rather than country of origin alone — which is also why a trust-and-verification posture survives contact with the evidence better than a flat country-of-origin rule does.

With those caveats, the direction of the finding converges with independent evidence I find more credible because it comes from outside the procurement debate. Adversarial benchmarks in early 2025 measured DeepSeek-R1 complying with malicious requests in 79% of attempts with no jailbreak applied, versus under 1% for OpenAI's o1: a safety-architecture gap, not a capability gap (see the companion paper, Compute as Cover; the figure should be read as directional rather than definitive). The persona-conditioning result extends that picture from "this model is easier to misuse" to "this model may behave adversarially toward specific users," whether by design or as an emergent property of training.

This complicates the argument I make below, and I would rather complicate it honestly than leave it clean. Local deployment solves the data-exposure vectors completely — nothing leaves your machine. But it does not solve output trust. Running deepseek-r1:14b on your own hardware still gives you whatever behavior is baked into the weights: the bias, the refusals, and — if the persona-conditioning is a property of the model rather than the API serving layer — potentially the degraded output too. Data sovereignty and output trust are different problems. Local-first fixes the first. It does not, on its own, fix the second.
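One way to make "verify the output" concrete is a static gate that model-generated code must pass before anyone runs it. The sketch below is a minimal illustration for Python output only, using the standard-library `ast` module. The function name `flag_risky_constructs` and the short lists of risky calls and keyword arguments are assumptions made for the example. It catches only obvious patterns and would not, on its own, detect the obfuscated weaknesses the study describes.

```python
import ast

RISKY_CALLS = {"eval", "exec"}                         # dynamic code execution
RISKY_KEYWORDS = {("shell", True), ("verify", False)}  # shell injection, TLS checks off


def flag_risky_constructs(source: str) -> list[str]:
    """Return human-readable findings for obviously risky calls in source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Direct calls such as eval(...) or exec(...).
        if isinstance(node.func, ast.Name) and node.func.id in RISKY_CALLS:
            findings.append(f"line {node.lineno}: call to {node.func.id}()")
        # Literal keyword arguments such as shell=True or verify=False.
        for kw in node.keywords:
            if (isinstance(kw.value, ast.Constant)
                    and (kw.arg, kw.value.value) in RISKY_KEYWORDS):
                findings.append(f"line {node.lineno}: {kw.arg}={kw.value.value}")
    return findings
```

A gate like this belongs alongside, not instead of, human review and a proper security scanner. Its value is that it runs on every generated snippet regardless of which model produced it.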

Risk Reduction: A Practical Framework

Given all of the above, what can actually be done if you choose to use DeepSeek or similar services?

Task Classification: What's Reasonable to Send

The risk is not uniform across tasks. Some data categories are more sensitive than others.

Lower risk (reasonable to consider):

Generic code generation with no operational context
Summarisation of public documentation
Analysis of anonymised, aggregated statistical data
Natural language tasks with no security-specific content

Higher risk (requires careful consideration):

Session logs containing tool outputs from security assessments
Network topology descriptions, even of lab environments
Credential patterns, even test credentials
Reasoning chains that reveal how you approach specific vulnerability classes

For ARCHER's T2 audit specifically: even with lab targets, the session logs fall in the higher-risk category because of the technique fingerprinting concern. The aggregate value of the data is not in any individual session.
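The classification above can be turned into a pre-send check that blocks a remote call when a payload carries higher-risk markers. This is a heuristic sketch, not ARCHER's implementation: the function names and regular expressions are assumptions, and the tool names are taken from the session-log description earlier in this article.

```python
import re

# Heuristic markers only; the patterns are illustrative, not exhaustive.
_HIGHER_RISK_PATTERNS = {
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "security_tool": re.compile(
        r"\b(nmap|metasploit|msfconsole|impacket|crackmapexec|burp)\b",
        re.IGNORECASE,
    ),
    "credential": re.compile(
        r"\b(password|passwd|pwd|hash|ntlm)\b\s*[:=]", re.IGNORECASE
    ),
}


def higher_risk_reasons(text: str) -> list[str]:
    """Return the name of every higher-risk marker found in text."""
    return [name for name, pattern in _HIGHER_RISK_PATTERNS.items()
            if pattern.search(text)]


def may_send_to_remote_api(text: str) -> bool:
    """Allow a remote call only when no higher-risk marker is present."""
    return not higher_risk_reasons(text)
```

The check denies on any match. A clean result only means none of these specific markers appeared, so it should be treated as a floor rather than a clearance.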

Data Minimisation

If you do use a Chinese-jurisdiction API, minimise what you send. For T2 scoring:

Send structured T1 flags rather than raw session logs where possible
Strip IP addresses and replace with placeholders before sending
Send the minimal context required for the scoring task, not the full verbatim log
Avoid sending multi-session context that would allow technique pattern extraction

None of this eliminates the risk. It reduces the signal-to-noise ratio for whoever might be processing the data at scale. That's a meaningful reduction, not a solution.
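The IP-placeholder step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `redact_ips` and the `HOST_n` placeholder format are invented for the example, only IPv4 is handled, and dotted version strings with more than four parts may be partially matched.

```python
import ipaddress
import re

# Candidate dotted quads; each is validated with ipaddress below so that
# strings such as "999.1.2.3" are left alone.
_IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_ips(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct IPv4 address with a stable placeholder.

    Returns the redacted text and the address-to-placeholder mapping,
    which is meant to stay on the local machine.
    """
    mapping: dict[str, str] = {}

    def _substitute(match: re.Match) -> str:
        candidate = match.group(0)
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            return candidate  # not a valid address, leave untouched
        if candidate not in mapping:
            mapping[candidate] = f"HOST_{len(mapping) + 1}"
        return mapping[candidate]

    return _IPV4_CANDIDATE.sub(_substitute, text), mapping


log = "nmap 10.0.0.5 -> 445/tcp open; pivot via 10.0.0.9; retry 10.0.0.5"
redacted, mapping = redact_ips(log)
```

Stable placeholders keep the log readable for scoring, but they also preserve the shape of the pivot chain. This removes addressing detail, not methodology, which is why it counts as reduction rather than mitigation.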

Local Deployment: Does It Actually Solve the Problem?

DeepSeek's models are open-weight and available through Ollama. Running deepseek-r1:14b locally means no data leaves the machine. The legal compulsion risk disappears. The training ingestion risk disappears. The technique fingerprinting risk disappears — your session logs stay on your hardware.

This is the substantively different option. It's not a risk reduction strategy; it's a category change.
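For a sense of what the local path looks like in practice, here is a minimal sketch that sends a prompt to an Ollama server on the same machine through its `/api/generate` endpoint. It assumes Ollama is running on its default port and that the model has already been pulled. The helper names `build_request` and `score_locally` are hypothetical, and the loopback check is an extra guard added for this example rather than anything Ollama requires.

```python
import json
import urllib.request
from urllib.parse import urlparse

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def build_request(prompt: str, model: str = "deepseek-r1:14b",
                  url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request, refusing non-loopback hosts."""
    host = urlparse(url).hostname
    if host not in LOOPBACK_HOSTS:
        raise ValueError(f"refusing to send session data to non-local host: {host}")
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_locally(prompt: str) -> str:
    """Send the prompt to the local Ollama server and return its text reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        return json.loads(resp.read())["response"]
```

The loopback guard turns "no data leaves the machine" from a deployment assumption into something the code enforces: a misconfigured URL fails loudly instead of quietly shipping a session log elsewhere.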

The practical question is capability. A locally-run DeepSeek model at 8B or 14B parameters, constrained by the VRAM available, will not match the API-served version; the R1 tags at those sizes are distilled variants built on smaller base models, not the full model. For T2 scoring, which requires reading long session logs and applying consistent rubrics across multiple dimensions, context window and reasoning quality matter. The calibration comparison needs to be done honestly; compare_tier2_backends.py exists precisely to measure this.

But if the local model scores within acceptable agreement of the API model on the fixture set, the calculus changes entirely. You get most of the cost benefit (inference is free after hardware amortisation), none of the data sovereignty risk, and a baseline for comparison that lets you measure what you're actually giving up.

That's the order of operations worth trying: populate the calibration fixtures with current Haiku results, run the local DeepSeek comparison, and make the decision from data rather than from assumption.
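What "within acceptable agreement" might mean can be sketched with two simple measures: mean absolute difference and the share of fixtures scored within a tolerance. This is not the logic of compare_tier2_backends.py, which is not shown here. The function name `agreement_report`, the numeric score scale, the tolerance of one point, and the sample scores are all assumptions for illustration.

```python
from statistics import mean


def agreement_report(reference: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 1.0) -> dict[str, float]:
    """Compare two backends' scores on the fixtures both have scored."""
    shared = sorted(reference.keys() & candidate.keys())
    if not shared:
        raise ValueError("no fixtures scored by both backends")
    diffs = [abs(reference[f] - candidate[f]) for f in shared]
    return {
        "fixtures": len(shared),
        "mean_abs_diff": mean(diffs),
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
    }


# Hypothetical rubric scores per fixture, for illustration only.
haiku_scores = {"s1": 4, "s2": 2, "s3": 5, "s4": 3}
local_scores = {"s1": 4, "s2": 3, "s3": 3, "s4": 3}
report = agreement_report(haiku_scores, local_scores)
```

The acceptance threshold is a judgment call that should be fixed before the comparison is run, so the decision is driven by the data rather than adjusted to fit it.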

Where I Land

The cost case for DeepSeek API is real and the performance gap is small. That's exactly why the risk analysis matters — it's not an obviously bad decision, which makes it easy to rationalise.

The legal architecture alone should be enough to give pause. Not because Chinese intelligence is definitely reading your session logs today, but because you cannot know either way, you will never be notified, and the data has a long shelf life. The technique fingerprinting concern compounds that — the value of the data is in the aggregate, and the aggregate you're contributing to is a library of how Western security practitioners think.

The local deployment path deserves more attention than it usually gets in this conversation. It addresses the data risks rather than managing them. The hardware cost is already sunk. The calibration tooling exists. The remaining work is a few hours of fixture population and a comparison run. What it does not address — and the Booz Allen findings make this concrete — is output trust: the weights behave the way they were trained to behave whether you run them in Beijing or in your basement.

That distinction is why I land on a tiered position rather than a single rule. For government and critical-infrastructure code, mitigation is the wrong frame: when the downstream consequence of a silently-introduced, obfuscated vulnerability is a compromised weapons system or a poisoned power-grid control plane, the appropriate posture is exclusion of unverified models from high-assurance pipelines pending independent verification, a trust-and-verification standard rather than a country-of-origin ban. This is close to the position Booz Allen argues, and consistent with publicly reported U.S. government action against Chinese-origin models on government systems, such as the Department of the Navy's January 2025 guidance directing personnel to refrain from using DeepSeek in any capacity.

For everyone else, the right answer is the one your own risk analysis produces: the practical framework above exists for exactly that decision, and no outside rule can weigh your data sensitivity, threat model, and cost constraints for you. For what it's worth, my own reading of the engineering math lands on local-first with eyes open: take the cost benefit, eliminate the data-exposure vectors, and treat the model's output as something to be verified rather than trusted.

Either way the honest version of the gate is trust-based, not origin-based: "independently tested, contractually controlled, and continuously monitored," and a model that cannot clear it does not belong in a high-assurance supply chain regardless of where it was built. Country of origin is a strong prior, and China's National Intelligence Law and "Core Socialist Values" training mandates make it a well-founded one, but the rule that survives contact with the Kimi K2.5 result is "prove trustworthiness," not "ban the flag."

The cheap API option looks cheap until you price in what you're paying with.

This analysis reflects my current understanding of publicly available information about Chinese intelligence law, documented cyber operations, and AI service architecture. It will be updated as the threat landscape evolves.