Range Lock-In: When AI Learns the Box Instead of the Vulnerability

Centaur Security Labs — Jay Hawkins

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

Abstract

AI security agents trained on cyberrange targets face a systematic bias: hints written to pass specific eval objectives teach the model to reproduce the exact steps that work on the training target — the URL structure, the credential, the token field name, the endpoint. On an enterprise target with different implementation details, the model fails to recognize the same vulnerability pattern. It learned the box, not the class.

This paper names that failure mode — range lock-in — describes why it emerges from the hint-writing process, and documents the two-layer design principle that addresses it. The fix is structural, not a matter of writing better hints. It requires that every target-specific hint block be paired with a generic companion that teaches the transferable pattern. A companion analysis examines the same failure at the eval infrastructure layer: hardcoded target configurations in the eval harness produce pass rates that measure performance against a specific target environment rather than against the vulnerability class.

1. The Failure Mode

A cyberrange is a controlled training environment. Metasploitable2, DVWA, bWAPP — these are known-state targets where the vulnerabilities are documented, the endpoints are fixed, and the correct exploitation steps are reproducible. They are the right tool for measuring whether an AI agent can execute a penetration testing methodology.

They are the wrong tool for teaching one.

The distinction matters because of how AI training data is generated. In ARCHER, evaluation runs against cyberrange targets produce fine-tuning examples from sessions that end in confirmed objective passes. The model that trained on those examples learned, specifically and concretely, what worked against those targets. When the training pipeline is working correctly, this is a good thing — the model improves on the task distribution it actually sees. The problem is that "the task distribution it actually sees" and "the task distribution it should generalize to" are different things, and the hint system is the mechanism that determines which one the model learns.

Range lock-in is what happens when hints are written primarily to pass evals rather than to teach transferable technique. The model learns to solve the box. On an enterprise target with a different URL structure, different token field name, or different authentication mechanism, the model fails — not because the vulnerability is different, but because it encoded the training target's implementation details rather than the underlying exploitation pattern.

2. Why It Happens

Hints are written under pressure to pass eval objectives. The fastest path to a passing eval is specificity: give the model the exact URL, credential, token field, and request format for the training target. The hint works. The objective passes. The session is recorded. The training data enters the pipeline.

This is not a failure of intent. It is a structural consequence of how the feedback loop is set up. Eval pass rate is the visible metric. Generalization to unseen targets is not measured at training time — it surfaces later, when the model is deployed against a real target and fails in a way that looks inexplicable until you trace it back to the hint.

The most common form is IP-based triggering. A hint for SQL injection fires on the condition `"192.168.56.105" in task`, where the literal is the lab IP of the training target. The hint body hardcodes the login path, the injection point, and the exact payload syntax that works against DVWA's implementation. The model trained on this data learns a conditional: when I see this IP and SQL injection, do these steps. It does not learn that SQL injection requires identifying the injection point from the target's response behavior, constructing a payload that works against that implementation, and confirming data extraction from the output.

The model has learned a recipe. The recipe works exactly once.
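
The conditional described above can be made concrete with a small sketch. The function name `sqli_hint`, the placeholder hint text, and the example task strings are assumptions for illustration; none of them are taken from ARCHER's code.

```python
from typing import Optional

TRAINING_IP = "192.168.56.105"  # lab IP of the training target, as cited in the text


def sqli_hint(task: str) -> Optional[str]:
    """Hypothetical range-locked hint: fires only when the lab IP appears in the task."""
    if TRAINING_IP in task and "sql injection" in task.lower():
        # Placeholder body. A real range-locked hint would hardcode the login
        # path, injection point, and payload syntax for this one target.
        return "Follow the recorded DVWA steps for this target."
    return None


lab_task = "Test SQL injection on http://192.168.56.105/dvwa/"
enterprise_task = "Test SQL injection on https://portal.example.com/search"

# The hint fires on the lab target and offers nothing on any other target,
# even though the vulnerability class named in the task is identical.
print(sqli_hint(lab_task) is not None)
print(sqli_hint(enterprise_task) is not None)
```

Sessions recorded with this hint only ever pair the technique with the lab IP, which is the association the fine-tuning data then carries forward.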

3. A Concrete Example

During ARCHER V2 development, a systematic audit of PT-Web.py hint blocks found the following pattern across five objectives:

| Hint | Trigger | Hardcoded specifics |
| --- | --- | --- |
| PT-WEBEX-01 SQLi | `"192.168.56.105" in task` | DVWA login path, UNION payload, exact success string |
| LFI | `"192.168.56.105" in task or "dvwa"` | `/dvwa/vulnerabilities/fi/?page=`, 4-level traversal depth |
| bWAPP LFI | `"192.168.56.104" in task` | Pure IP trigger, `/bWAPP/directory_traversal_1.php` |
| PT-XSS-02 Stored XSS | `"192.168.56.105" in task` | DVWA guestbook POST endpoint |
| PT-XSS-01 Reflected XSS | `"192.168.56.105" in task` | `/dvwa/vulnerabilities/xss_r/?name=` |

Every one of these hints would fail on a target with a different IP. Several would fail on the same application deployed to a different host. None of them teach the model anything about how to find these vulnerabilities on an unfamiliar target.
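
An audit of this kind can be partly automated. The sketch below is one possible approach, not the audit procedure actually used: it assumes hint triggers appear as `if`/`elif` conditions in Python source and flags any quoted IPv4 literal on those lines. The function name `find_ip_triggers` and the sample source are hypothetical.

```python
import re
from typing import List, Tuple

# A quoted IPv4 literal, optionally followed by a port, e.g. "192.168.56.1:3000".
IPV4_LITERAL = re.compile(r'["\'](\d{1,3}(?:\.\d{1,3}){3})(?::\d+)?["\']')
# Lines that open a trigger condition (assumption: triggers are if/elif tests).
CONDITION = re.compile(r"^\s*(?:if|elif)\b")


def find_ip_triggers(source: str) -> List[Tuple[int, str]]:
    """Return (line number, IP) pairs for IP literals found in if/elif conditions."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if CONDITION.match(line):
            for match in IPV4_LITERAL.finditer(line):
                findings.append((lineno, match.group(1)))
    return findings


# Hypothetical hint source, shaped like the triggers in the table above.
sample = "\n".join([
    'if "192.168.56.105" in task:',
    '    hint = "..."',
    'elif "dvwa" in task_lower:',
    '    hint = "..."',
    'elif "192.168.56.104" in task:',
    '    hint = "..."',
])

for lineno, ip in find_ip_triggers(sample):
    print(f"line {lineno}: IP-triggered hint ({ip})")
```

A line-based scan like this misses conditions that span multiple lines or build the IP from a constant, so it is a first pass rather than a complete check.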

The PT-WEBEX-04 CSRF hint was the first one addressed. The original form triggered on the training target's IP and hardcoded the DVWA login path, the user_token CSRF field name, and the GET-based request format specific to DVWA's implementation.

The fix split it into two blocks:

App-specific block (trigger: "dvwa" in task_lower):

DVWA CSRF — three steps: (1) authenticate to /dvwa/login.php with admin:password,
(2) GET /dvwa/vulnerabilities/csrf/ and extract user_token from HTML,
(3) replay the state-changing request with that token.

Generic companion block (trigger: "csrf" in task_lower and target present):

CSRF exploitation — three phases regardless of target:
(1) authenticate first — most CSRF vulnerabilities sit behind a login wall,
(2) retrieve the token — load the page with the vulnerable form and extract
the CSRF token field from the HTML source,
(3) replay the action — submit the state-changing request with the token you
extracted, not a guessed value.
Use <login-endpoint>, <token-field>, <action-url> as placeholders — fill from recon.

The first block produces reliable eval passes and clean training data against DVWA. The second block teaches the underlying three-phase pattern that works regardless of application. A model trained on both learns the recipe and the principle.
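
A minimal sketch of how the two blocks might sit side by side in a hint selector follows. The function names, the abbreviated hint strings, and in particular the `target_present` check are assumptions; the source does not say how "target present" is determined.

```python
from typing import List

# Abbreviated versions of the two hint bodies quoted above.
DVWA_CSRF_HINT = (
    "DVWA CSRF: (1) authenticate to /dvwa/login.php, "
    "(2) GET /dvwa/vulnerabilities/csrf/ and extract user_token, "
    "(3) replay the state-changing request with that token."
)
GENERIC_CSRF_HINT = (
    "CSRF exploitation: (1) authenticate first, (2) retrieve the token from "
    "the page with the vulnerable form, (3) replay the action with the "
    "extracted token. Fill <login-endpoint>, <token-field>, <action-url> from recon."
)


def target_present(task_lower: str) -> bool:
    """Assumed stand-in for 'target present': the task names a URL."""
    return "http://" in task_lower or "https://" in task_lower


def select_csrf_hints(task: str) -> List[str]:
    """Return every CSRF hint block whose trigger matches the task."""
    task_lower = task.lower()
    hints = []
    if "dvwa" in task_lower:  # app-specific layer
        hints.append(DVWA_CSRF_HINT)
    if "csrf" in task_lower and target_present(task_lower):  # generic layer
        hints.append(GENERIC_CSRF_HINT)
    return hints


print(len(select_csrf_hints("Exploit CSRF on http://192.168.56.105/dvwa/")))
print(len(select_csrf_hints("Exploit CSRF on https://portal.example.com/account")))
```

Because the two triggers are independent, a DVWA task receives both blocks while an unfamiliar target still receives the generic one.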

4. The Design Principle

Every hint block that triggers on a specific application or IP must have a generic companion. The two layers are not redundant — they serve different purposes.

The app-specific block (trigger: application identity — "dvwa" in task_lower, "metasploitable" in task_lower, "bwapp" in task_lower):

- Provides exact commands for the training target
- Drives reliable eval pass rate and training data yield
- Should remain in the codebase as long as that target is in use
- Encodes the what, not the why

The generic companion block (trigger: vulnerability keyword + target context):

- Teaches the transferable pattern: what to enumerate, what evidence to look for, what to do with it
- Uses placeholders (<login-endpoint>, <token-field>, <action-url>) that the model fills from recon output
- Encodes the why: the methodology that survives target variation
- Is what enterprise generalization depends on

A hint that exists only in app-specific form is training data for solving one target. A hint with both layers is training data for understanding a vulnerability class.

The cyberrange is a training prop. The vulnerability class is the curriculum.
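
The pairing rule in this section lends itself to a mechanical check. The sketch below assumes hint blocks can be described by a small record with a vulnerability class and a layer label; the `HintBlock` type, its field names, and the sample entries are hypothetical and do not reflect how ARCHER stores hints.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HintBlock:
    objective: str   # e.g. "PT-WEBEX-04"
    vuln_class: str  # e.g. "csrf"
    layer: str       # "app_specific" or "generic"


def missing_companions(blocks: List[HintBlock]) -> List[str]:
    """Return vulnerability classes with an app-specific hint but no generic companion."""
    generic = {b.vuln_class for b in blocks if b.layer == "generic"}
    specific = {b.vuln_class for b in blocks if b.layer == "app_specific"}
    return sorted(specific - generic)


# Illustrative registry: CSRF has both layers, SQLi has only the app-specific one.
blocks = [
    HintBlock("PT-WEBEX-04", "csrf", "app_specific"),
    HintBlock("PT-WEBEX-04", "csrf", "generic"),
    HintBlock("PT-WEBEX-01", "sqli", "app_specific"),
]

print(missing_companions(blocks))
```

Run as a pre-merge check, a function like this would turn the generic companion from a convention into a gate.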

5. Implications for Agent Training

Range lock-in is not unique to ARCHER. Any AI security agent trained on cyberrange targets using task-specific prompting faces this failure mode. The mechanism is the same: the training signal rewards specificity because specificity is what passes the eval, and the eval is the only feedback signal available during training.

Three practices follow from this:

Audit triggers before auditing content. The fastest way to identify range lock-in is to search for IP literals and application-specific paths in hint trigger conditions. If the trigger is an IP address, the hint is almost certainly range-locked regardless of how transferable the hint body appears.

Treat the generic companion as mandatory, not optional. The natural instinct when writing hints is to get the eval passing first and add generality later. "Later" reliably does not happen — the next eval objective is waiting, the cap is filling, and the generic companion stays on the backlog. The fix is to treat the two-layer structure as a minimum viable hint, not an enhancement.

Measure generalization separately from pass rate. Eval pass rate against training targets is a necessary signal, but not a sufficient one. A separate evaluation against targets the model has never seen, even a small set of intentionally varied implementations of the same vulnerability classes, would surface range lock-in before it enters the fine-tuning pipeline. That evaluation requires a target-configurable harness: as long as target addresses are hardcoded, the infrastructure that measures training performance cannot measure generalization. The two problems are coupled, so fixing eval-layer range lock-in (Section 6) is a prerequisite. This practice is currently proposed, not implemented, in ARCHER's eval infrastructure.
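
One way the third practice could be reported, once a harness can run held-out targets, is as a pair of pass rates and the gap between them. This is a sketch under assumptions: the `EvalResult` record, its fields, and the target labels are hypothetical, and the records are toy values for illustration, not measurements from ARCHER.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EvalResult:
    objective: str          # e.g. "PT-WEBEX-04"
    target_id: str          # hypothetical label for the target instance
    seen_in_training: bool  # True if sessions on this target fed fine-tuning
    passed: bool


def pass_rate(results: List[EvalResult]) -> float:
    """Fraction of results that passed; refuses to score an empty set."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


def generalization_report(results: Iterable[EvalResult]) -> Dict[str, float]:
    """Score training targets and held-out targets separately."""
    results = list(results)
    seen = [r for r in results if r.seen_in_training]
    held_out = [r for r in results if not r.seen_in_training]
    train, unseen = pass_rate(seen), pass_rate(held_out)
    return {"train": train, "held_out": unseen, "gap": train - unseen}


# Toy records for illustration only; these are not lab results.
toy = [
    EvalResult("PT-WEBEX-04", "dvwa-lab", True, True),
    EvalResult("PT-WEBEX-01", "dvwa-lab", True, True),
    EvalResult("PT-XSS-01", "dvwa-lab", True, True),
    EvalResult("PT-XSS-02", "dvwa-lab", True, False),
    EvalResult("PT-WEBEX-04", "variant-a", False, True),
    EvalResult("PT-WEBEX-01", "variant-a", False, False),
]

report = generalization_report(toy)
print(report)
```

A large gap on the same objectives is the signal of interest: it suggests the training-target pass rate reflects the box rather than the class.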

6. The Same Failure at the Eval Layer

The analysis in this section describes a proposed architectural fix; the target-configurable harness and implementation-agnostic success functions are design goals, not completed features of the current ARCHER eval infrastructure.

Range lock-in in hints is a training data problem. The same structural failure operates one layer up, at the eval infrastructure itself.

ARCHER's eval harness encodes the training environment directly. TARGET = "192.168.56.103" is Metasploitable2 on a VirtualBox host-only network. Active Directory objectives reference GOAD lab addresses specific to a VirtualBox configuration on a single development machine. Juice Shop runs at 192.168.56.1:3000. Several success functions call these addresses directly to confirm objective completion.

The consequence: pass rates measure performance against a fixed target set on a specific network configuration, not performance against the vulnerability class. As a hypothetical, an agent with a pass rate above 90% on Metasploitable2 at 192.168.56.103 may perform substantially differently against another Metasploitable2 instance at a different address, against a different Linux target with different service versions, or against an enterprise environment with different implementation details. (The specific 94% figure cited in earlier drafts of this paper was a point-in-time lab observation; it is not asserted here as a general or stable result.) The eval tells you whether the agent can solve the range. It does not tell you whether the agent can solve the class.

The structural cause is the same as at the hint layer: writing evals that pass is easier than writing evals that measure generalization. A hardcoded IP is a working test. A configurable target with a parameterized success function is additional design work. The natural pressure is toward specificity, and specificity produces range-locked metrics.

The fix requires the same two-layer approach. At the hint layer, the fix is app-specific block plus generic companion. At the eval layer:

Target-configurable harness: Replace hardcoded IP constants with CLI parameters (--target, --goad-dc01, etc.) so the same eval objectives can be run against different target environments without code changes.

Implementation-agnostic success functions: Where possible, verify_fn should confirm that the exploit class succeeded rather than checking for a response string specific to one target's implementation. A root shell is a root shell regardless of target — confirming uid=0 generalizes. Confirming a Metasploitable2-specific banner string does not.
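
A success function in the spirit of the `uid=0` example might look like the following sketch. The name `verify_root_shell` and the assumption that the harness passes it the captured output of an `id` command are illustrative; the actual `verify_fn` signature in ARCHER is not given in this paper.

```python
import re

# Matches the real-uid field of `id` output when it is 0, e.g. "uid=0(root)".
# The leading \b keeps "euid=0" from matching, so only the real uid is checked.
UID_ZERO = re.compile(r"\buid=0\b")


def verify_root_shell(command_output: str) -> bool:
    """Return True if captured `id` output reports uid 0, on any target."""
    return UID_ZERO.search(command_output) is not None


print(verify_root_shell("uid=0(root) gid=0(root) groups=0(root)"))
print(verify_root_shell("uid=1000(msfadmin) gid=1000(msfadmin)"))
```

The check says nothing about which host produced the output, which is the point: it confirms the exploit class succeeded rather than recognizing one target's banner.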

Neither fix eliminates training-target specificity — ARCHER still needs concrete targets to measure pass rate against. The goal is the same as at the hint layer: the specific target drives the training signal, and the configurable structure enables generalization measurement. Both layers are required.
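
The target-configurable harness could be sketched with `argparse` as below. The `--target` and `--goad-dc01` flags come from the proposal above; the `TargetConfig` type, the `parse_targets` function, and the choice to keep the lab address as the default are assumptions, not ARCHER's current interface.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional

# Training-target address from the text; kept as the default so existing
# eval runs behave the same when no flag is given (an assumed design choice).
DEFAULT_TARGET = "192.168.56.103"


@dataclass(frozen=True)
class TargetConfig:
    target: str
    goad_dc01: Optional[str]


def parse_targets(argv: Optional[List[str]] = None) -> TargetConfig:
    """Read target addresses from the command line instead of module constants."""
    parser = argparse.ArgumentParser(description="Eval harness target selection")
    parser.add_argument("--target", default=DEFAULT_TARGET,
                        help="address of the primary eval target")
    parser.add_argument("--goad-dc01", default=None,
                        help="address of the GOAD domain controller, if used")
    args = parser.parse_args(argv)
    return TargetConfig(target=args.target, goad_dc01=args.goad_dc01)


# Same objectives, two environments, no code change between runs.
print(parse_targets([]))
print(parse_targets(["--target", "10.20.30.40", "--goad-dc01", "10.20.30.10"]))
```

Objectives and success functions would then read addresses from the returned config rather than from a module-level `TARGET` constant.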

The third practice in Section 5 — "measure generalization separately from pass rate" — is only achievable with a target-configurable harness. Without it, generalization measurement remains aspirational regardless of how well the hint layer is designed.

The failure mode described here — a model overfitting to training-environment specifics rather than learning transferable patterns — is a well-understood problem in machine learning under names like dataset bias and distribution shift. The contribution of this paper is not to discover the underlying phenomenon but to name its hint-layer instantiation in AI security agent training, characterize the specific mechanism (IP-triggered hints with hardcoded implementation details), and document a structural fix that preserves training-target pass rate while separately encoding transferable technique.

Readers familiar with NetSecGame (Drasar et al., 2024) and related network security simulation research will recognize analogous generalization challenges in that body of work. The two-layer hint design proposed here addresses the hint-writing layer specifically; it does not claim to solve the broader overfitting/generalization problem, and the degree to which generic companion blocks actually improve out-of-distribution performance is an open empirical question in this lab.

About the author: Jay Hawkins spent twenty years in the U.S. Army, including a decade in cyber operations — serving at USCYBERCOM, USCENTCOM, USNORTHCOM, and USEUCOM — and holds an active TS/SCI clearance. He builds local-first AI security tools and writes about the methodology, the hard lessons, and the compliance implications of doing it in production.

Centaur Security Labs — centaursecuritylabs.com