The Measurement Instrument Problem in Eval-Driven AI Development

Centaur Security Labs — Jay Hawkins

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

Abstract

Eval-driven AI development uses a continuous evaluation loop — run sessions against objectives, measure results, improve the system, repeat — to guide model training and system refinement. The evaluation harness that implements this loop serves two roles simultaneously: it is the measurement instrument producing training data quality assessments, and it is a software artifact subject to the same improvement pressures as the rest of the system. These roles create a constraint that has no parallel in conventional software engineering: the harness cannot be refactored without risking corruption of the longitudinal benchmark data it has been producing. This paper names and analyzes that constraint — the measurement instrument problem — and derives practical implications for how eval-driven AI development projects should structure the relationship between harness evolution and training pipeline stability.

1. Introduction

Refactoring is a normal part of software development. Systems accumulate technical debt; abstractions that were appropriate early in a project become liabilities as the system grows; duplicate code drifts into inconsistency; tightly coupled components resist independent modification. The standard response to these conditions is refactoring — restructuring the code without changing observable behavior — which is generally understood to be safe when the behavioral tests pass before and after.

Eval-driven AI development introduces a complication. The evaluation harness — the software that runs sessions against objectives, measures outcomes, and produces the data that enters the training pipeline — is not just a component of the system under development. It is the measurement instrument for the system under development.

When the measurement instrument changes, the measurements change. This is true even when the changes are purely structural: moving code into different modules, resolving duplicate data structures, changing how objectives are loaded from configuration files. A behavioral test that checks whether the harness produces the same output for the same input can verify that individual sessions are scored consistently. It cannot verify that the statistical distribution of scores across the objective space is unchanged, because that distribution depends not just on how individual sessions are scored but also on the order in which they run, the configuration of the target environment, and the accumulated behavior of the model being evaluated.

Any accumulated trend data in a longitudinal benchmark represents something that cannot be reconstructed: the historical behavior of a specific model version against a specific objective set on a specific target configuration. Any change that introduces even small systematic differences in how sessions are scored potentially makes that historical data non-comparable with new data — which effectively truncates the longitudinal record at the point of the change.

This paper develops the implications of that constraint.

2. The Constraint Defined

The measurement instrument problem in eval-driven AI development can be stated precisely:

The evaluation harness simultaneously serves as (a) the measurement instrument producing training data quality assessments and (b) a software artifact subject to improvement pressure. These roles conflict when harness improvements change the statistical properties of the measurements the harness produces.

The constraint has three components:

Epistemic primacy. The harness has a special status in the development process: it is the ground truth for whether the system is improving. A model's behavior is only visible through what the harness measures. If the harness changes, what the harness reports about the model's behavior changes — independent of any actual change in the model. This means harness changes can produce apparent improvements or regressions that are measurement artifacts rather than model behavior changes.

Longitudinal dependency. Eval-driven development derives value from trends, not just point measurements. The question "is the model improving?" requires comparing today's measurements to yesterday's, last month's, last year's. If the harness changes in a way that shifts measurements systematically, comparisons across the change boundary are invalid. The longitudinal record is a single time series; any point of measurement discontinuity splits it into two incomparable series.

Refactoring non-equivalence. In conventional software, "refactoring" means changing code without changing observable behavior, verified by passing tests. In eval-driven AI development, a harness refactoring can be behaviorally equivalent at the session level — each individual session scores identically before and after — while changing the statistical distribution of scores across the objective space due to changes in execution ordering, resource management, or subtle configuration differences. Session-level tests cannot detect this class of change.
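The refactoring non-equivalence component can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real harness: the objective names, the `TOLERANCE` table, and the idea that each session leaves residue on the target are all assumptions made for illustration. The scorer is a pure function, so any session-level test passes before and after, yet a change in load order alone shifts the per-objective pass rates.

```python
from collections import defaultdict

# Hypothetical tolerance of each objective to leftover state on the target.
TOLERANCE = {"obj_a": 0, "obj_b": 1, "obj_c": 2}

def score_session(objective: str, leftover_state: int) -> bool:
    """Pure scorer: identical inputs always yield the identical verdict."""
    return leftover_state <= TOLERANCE[objective]

def pass_rates(order: list[str], sessions_per_reset: int = 3) -> dict[str, float]:
    """Run sessions in the given order; the target resets every few sessions."""
    passed, total = defaultdict(int), defaultdict(int)
    leftover_state = 0
    for i, objective in enumerate(order):
        if i % sessions_per_reset == 0:
            leftover_state = 0  # target environment is rebuilt
        passed[objective] += score_session(objective, leftover_state)
        total[objective] += 1
        leftover_state += 1  # each session leaves residue behind
    return {obj: passed[obj] / total[obj] for obj in sorted(total)}

# Same scorer, same objectives, same session count; only the order differs.
before = pass_rates(["obj_a", "obj_b", "obj_c"] * 10)  # original load order
after = pass_rates(["obj_c", "obj_b", "obj_a"] * 10)   # order after a refactor

print(before)
print(after)
```

In this toy setup `obj_a` passes every time under the original order and never under the new one, even though `score_session` is untouched. A test suite that only replays individual `(objective, state)` pairs through the scorer would report no difference.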

3. Why This Doesn't Appear in the Software Engineering Literature

The measurement instrument problem is not systematically named as a first-class constraint in the conventional software engineering literature, in part because conventional software systems don't have measurement instruments in the relevant sense.

Production software systems produce outputs for users. Those outputs can be tested. When a refactoring leaves the outputs unchanged, the refactoring is verified. The "measurement" of system behavior is the test suite, which is separate from the system under test and can be updated independently.

In eval-driven AI development, the evaluation harness is not separate from the system under development in the same way. The model being developed is trained on data produced by the harness, then evaluated by the harness, then re-trained on the new data, recursively. The harness and the model co-evolve. The harness shapes what the model learns; the model's behavior determines what the harness measures.

This co-evolutionary relationship means that the harness occupies a position in the development process that has no conventional analog — it is simultaneously the test suite (measuring behavior) and a design artifact (shaping behavior through the data it produces). In conventional systems, these roles are cleanly separated. In eval-driven AI development, they are not.

A closer analog comes from the measurement literature: instrument drift, or systematic measurement error, in which the measuring instrument itself changes over time in ways that shift readings independently of any change in the underlying system. A miscalibrated thermometer does not change the temperature, but it distorts every temperature record produced until the miscalibration is caught. In eval-driven AI development, harness changes play the same role: they shift measurements independently of any change in model behavior, and because the harness is both the instrument and a component of the development loop, the drift propagates into the training data the harness was supposed to be faithfully recording.

4. Practical Implications

4.1 Separation of stable and unstable components

The harness contains components with different change rates. Objective definitions, verifier logic, and T1/T2 scoring are fundamental to what the harness measures; they should be stable. Session logging, dashboard generation, and reporting infrastructure do not affect measurements directly; they can be refactored more freely.

A well-structured eval harness separates these components so that infrastructure changes — which carry low measurement risk — can proceed independently of core measurement logic changes, which carry high measurement risk.

In ARCHER, this separation is the primary motivation for the AgentEval extraction plan: moving the domain-agnostic measurement infrastructure (harness runner, OA/FP/HD measurement, T1/T2 scoring) into a separate layer from ARCHER-specific content (objectives, verifiers, rubrics). The separation makes the stable measurement core visible as a distinct artifact whose stability can be independently enforced.
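One way to make the stable measurement core enforceable is to fingerprint the inputs that determine scoring and stamp every session record with that fingerprint. The sketch below is an assumption about how this could be done, not a description of ARCHER or AgentEval: the function name, the shape of the objective records, and the `pass_threshold` setting are hypothetical.

```python
import hashlib
import json

def measurement_fingerprint(objectives: list[dict], scoring_config: dict) -> str:
    """Hash only the inputs assumed to determine how sessions are scored.

    Objective order is preserved on purpose: a reordering changes the hash.
    """
    payload = json.dumps(
        {"objectives": objectives, "scoring": scoring_config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Hypothetical measurement core inputs.
objectives = [
    {"id": "obj_a", "verifier": "check_a"},
    {"id": "obj_b", "verifier": "check_b"},
]
scoring = {"pass_threshold": 0.7}

core_before = measurement_fingerprint(objectives, scoring)
# A dashboard-only change never touches these inputs, so the hash is stable.
core_after_dashboard_change = measurement_fingerprint(objectives, scoring)
# A scoring change or an objective reordering produces a new fingerprint.
core_after_scoring_change = measurement_fingerprint(objectives, {"pass_threshold": 0.8})
core_after_reorder = measurement_fingerprint(list(reversed(objectives)), scoring)
```

A fingerprint like this only covers declared inputs. It would not catch a change in verifier code or resource management, so it complements distributional checks rather than replacing them.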

4.2 The stable baseline requirement

Before any harness change that could affect measurement properties, a stable baseline must exist: a set of objectives run a sufficient number of times with the current harness to establish reliable expected outcome distributions.

The baseline serves as the reference point for detecting measurement discontinuity. After the change, if the objective outcome distributions match the baseline, the change was measurement-neutral. If they don't match, the change introduced measurement differences that need to be characterized before the longitudinal record is extended.

A stable baseline is not just a record of recent pass rates. It is a distributional snapshot — the expected range of outcomes across the objective set, with sufficient samples that the range is meaningfully stable. Point measurements are not sufficient.
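A distributional snapshot can be sketched as a per-objective pass rate with an uncertainty interval, so that the width of the interval shows whether enough samples have been collected. The example below uses the Wilson score interval as one reasonable choice; the source does not prescribe a method, and the function names and data layout are assumptions.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def baseline_snapshot(outcomes: dict[str, list[bool]]) -> dict[str, dict]:
    """Summarize pass/fail outcomes per objective with an interval."""
    snapshot = {}
    for objective, results in outcomes.items():
        low, high = wilson_interval(sum(results), len(results))
        snapshot[objective] = {
            "runs": len(results),
            "pass_rate": sum(results) / len(results),
            "low": low,
            "high": high,
        }
    return snapshot

# Hypothetical outcomes: the same 50% pass rate at two sample sizes.
small = baseline_snapshot({"obj_a": [True] * 5 + [False] * 5})
large = baseline_snapshot({"obj_a": [True] * 50 + [False] * 50})
small_width = small["obj_a"]["high"] - small["obj_a"]["low"]
large_width = large["obj_a"]["high"] - large["obj_a"]["low"]
```

With ten runs the interval spans roughly half of the possible range, which is why a single recent pass rate says little about whether a later shift is real.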

4.3 Change sequencing relative to training milestones

The measurement instrument problem is most acute during active training pipeline operation. When the harness is actively producing sessions that enter the training corpus, any change to measurement properties affects the quality of the training data being generated in real time.

The appropriate sequencing is: complete a training milestone (producing a validated model checkpoint), then address accumulated harness technical debt, then establish a new stable baseline, then resume data collection. This sequencing means that each segment of the longitudinal record was produced by a stable measurement configuration, with documented change points between segments.

Ad hoc harness changes during active training should be avoided unless they address critical measurement errors — and in that case, the change should be documented as introducing a measurement discontinuity, with the pre- and post-change data treated as separate records.
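The idea of treating pre- and post-change data as separate records can be sketched as a data structure in which trends are only computable within one segment. The class and field names here are hypothetical, and a simple last-minus-first difference stands in for whatever trend statistic a project actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    harness_version: str
    change_note: str  # why this segment was opened (the documented change point)
    pass_rates: list[float] = field(default_factory=list)

class LongitudinalRecord:
    """A record split into segments, each produced by one stable harness."""

    def __init__(self) -> None:
        self.segments: list[Segment] = []

    def open_segment(self, harness_version: str, change_note: str) -> None:
        self.segments.append(Segment(harness_version, change_note))

    def append(self, pass_rate: float) -> None:
        if not self.segments:
            raise RuntimeError("open a segment before recording measurements")
        self.segments[-1].pass_rates.append(pass_rate)

    def trend(self, segment_index: int) -> float:
        """Change from first to last measurement, within one segment only."""
        rates = self.segments[segment_index].pass_rates
        if len(rates) < 2:
            raise ValueError("need at least two measurements for a trend")
        return rates[-1] - rates[0]

record = LongitudinalRecord()
record.open_segment("harness-1", "initial configuration")
for rate in (0.50, 0.55, 0.60):
    record.append(rate)
record.open_segment("harness-2", "documented change point after milestone")
for rate in (0.70, 0.72):
    record.append(rate)
```

Because `trend` takes a single segment index, the jump from 0.60 to 0.70 across the change point is never reported as model improvement. Comparing across segments would require an explicit, separately characterized measurement delta.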

4.4 Phase 0 as a prerequisite

Any substantial harness refactoring should be preceded by a Phase 0 that resolves internal measurement inconsistencies before introducing structural changes.

In ARCHER, Phase 0 of the AgentEval extraction addresses the duplicate objective list problem: generate_dashboard.py maintains _OBJ_LABELS and _OBJ_DOMAIN dictionaries that manually duplicate information in eval_harness.py's OBJECTIVES list. These have already drifted. Until they are reconciled to a single source of truth, changes to either file risk producing a harness where the measurement logic and the reporting logic disagree about what is being measured.

Resolving this inconsistency before structural refactoring begins means the refactoring starts from a coherent measurement baseline. Attempting structural refactoring before Phase 0 means refactoring a measurement system whose internal state is already inconsistent.
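A Phase 0 reconciliation could begin with a check that reports where the duplicated structures disagree. The sketch below assumes a shape for the data that the source does not confirm: `OBJECTIVES` is taken to be a list of dicts with `id`, `label`, and `domain` keys, and `_OBJ_LABELS` and `_OBJ_DOMAIN` are taken to map objective ids to strings. The function name `find_drift` is hypothetical.

```python
def find_drift(objectives: list[dict], obj_labels: dict, obj_domain: dict) -> list[str]:
    """List disagreements between the harness list and the dashboard dicts."""
    problems = []
    harness_ids = {obj["id"] for obj in objectives}
    for name, mapping in (("_OBJ_LABELS", obj_labels), ("_OBJ_DOMAIN", obj_domain)):
        dashboard_ids = set(mapping)
        for missing in sorted(harness_ids - dashboard_ids):
            problems.append(f"{missing}: missing from {name}")
        for extra in sorted(dashboard_ids - harness_ids):
            problems.append(f"{extra}: in {name} but not in OBJECTIVES")
    for obj in objectives:
        oid = obj["id"]
        if oid in obj_labels and obj_labels[oid] != obj["label"]:
            problems.append(f"{oid}: label differs")
        if oid in obj_domain and obj_domain[oid] != obj["domain"]:
            problems.append(f"{oid}: domain differs")
    return problems

# Hypothetical data showing three kinds of drift.
OBJECTIVES = [
    {"id": "obj_a", "label": "A", "domain": "x"},
    {"id": "obj_b", "label": "B", "domain": "y"},
]
_OBJ_LABELS = {"obj_a": "A", "obj_c": "C"}
_OBJ_DOMAIN = {"obj_a": "x", "obj_b": "z"}

problems = find_drift(OBJECTIVES, _OBJ_LABELS, _OBJ_DOMAIN)
```

Run as a pre-commit or CI check, a function like this makes drift visible while the duplication still exists. The longer-term fix the text describes is a single source of truth, at which point the check has nothing left to compare.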

5. The Trigger Condition

Given the constraints above, when is it safe to refactor the evaluation harness?

The trigger condition is: training milestone complete, stable baseline established, and at least one external use case for the separated harness exists.

The training milestone requirement ensures the refactoring doesn't disrupt active training data collection. The stable baseline requirement ensures there is a reference point for detecting measurement changes introduced by the refactoring. The external use case requirement prevents speculative extraction — the complexity and risk of refactoring the measurement instrument are only justified when someone needs the result.
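The three-part trigger condition is simple enough to encode as an explicit gate, which keeps the decision auditable. This is a sketch with hypothetical names; how each field gets populated is a project-specific judgment that the code does not capture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessState:
    training_milestone_complete: bool
    stable_baseline_established: bool
    external_use_cases: int  # number of known consumers of the separated harness

def unmet_trigger_conditions(state: HarnessState) -> list[str]:
    """Return the conditions still blocking a harness refactoring."""
    unmet = []
    if not state.training_milestone_complete:
        unmet.append("training milestone not complete")
    if not state.stable_baseline_established:
        unmet.append("no stable baseline")
    if state.external_use_cases < 1:
        unmet.append("no external use case")
    return unmet

mid_training = HarnessState(False, True, 1)
ready = HarnessState(True, True, 1)
```

Returning the list of unmet conditions, rather than a single boolean, records why a refactoring was deferred.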

In ARCHER, refactoring the harness before V2 Phase 5 (QLoRA fine-tune validated) ships would mean touching the measurement instrument while it is actively measuring V2 training data quality. The risk is not theoretical. A systematic change in how sessions are scored would affect which sessions enter the fine-tuning corpus, altering the model that emerges from training without any visible symptom in the harness output until the model's changed behavior surfaces in post-training eval runs.
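The mechanism by which a scoring shift changes the training corpus can be shown with a small example. It assumes sessions are admitted to the corpus by a score threshold, which the source implies but does not specify; the threshold value, the session ids, and the size of the offset are all hypothetical.

```python
def select_for_corpus(scores: dict[str, float], threshold: float = 0.7) -> set[str]:
    """Admit sessions whose score meets the (assumed) corpus threshold."""
    return {session_id for session_id, score in scores.items() if score >= threshold}

# Hypothetical session scores under the original harness.
original_scores = {"s1": 0.62, "s2": 0.68, "s3": 0.74, "s4": 0.90}

# The same sessions after a refactor that adds a small systematic offset.
shifted_scores = {sid: score + 0.05 for sid, score in original_scores.items()}

corpus_before = select_for_corpus(original_scores)
corpus_after = select_for_corpus(shifted_scores)
newly_admitted = corpus_after - corpus_before
```

Here a uniform offset of 0.05 admits one extra session without any session looking anomalous on its own. Nothing in the harness output flags the change, which matches the delayed-symptom risk described above.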

6. Generalizing Beyond ARCHER

The measurement instrument problem is not specific to ARCHER or to AI security tooling. It applies to any eval-driven AI development project with these properties:

A continuous evaluation loop producing training data
A longitudinal record that matters for understanding model improvement
A harness that has accumulated technical debt requiring refactoring

As AI development increasingly relies on continuous evaluation loops — self-play, RLHF pipelines, agent eval frameworks — the measurement instrument problem will appear in more development contexts. The field does not yet have established conventions for managing harness stability, change sequencing, or longitudinal continuity under harness evolution.

This paper names the problem as a first step toward developing those conventions.

7. Falsifiable Claims

Eval harness refactorings that are session-level behavior-equivalent can nonetheless change the statistical distribution of outcomes across the objective set. This is testable: run 100 sessions before and after a structural refactoring with the model held fixed, compute the per-objective pass-rate distributions, and measure the KL divergence between the pre-refactor and post-refactor outcome distributions across all objectives. Because finite samples produce some divergence even when nothing has changed, the result should be compared against the divergence between two batches run on the same harness version. A KL divergence clearly above that noise floor, in the absence of any model change, is evidence of measurement discontinuity.
Longitudinal benchmark data produced before and after a harness refactoring is not directly comparable without characterizing the measurement delta introduced by the change. This follows from the first claim.
The risk of measurement discontinuity is greater for changes to objective loading, verifier logic, and T1/T2 scoring than for changes to logging, reporting, and dashboard infrastructure. This is a structural claim about which components affect measurement properties; it can be verified by examining which code paths influence session scoring.
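The test proposed in the first claim can be sketched as follows. This is one possible reading, not a prescribed procedure: each objective's outcomes are treated as a pass/fail distribution, additive smoothing (an assumption) avoids division by zero, and the per-objective KL divergences are averaged. The pass counts are made-up illustration data, and 100 runs per objective is an assumed interpretation of the session count.

```python
import math

def bernoulli(passes: int, runs: int, smoothing: float = 0.5) -> tuple[float, float]:
    """Smoothed (pass, fail) probabilities, so neither is ever exactly zero."""
    p = (passes + smoothing) / (runs + 2 * smoothing)
    return (p, 1.0 - p)

def kl_divergence(p: tuple[float, float], q: tuple[float, float]) -> float:
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(pre: dict[str, int], post: dict[str, int], runs: int) -> float:
    """Average per-objective KL(pre || post) over objectives present in both."""
    shared = sorted(set(pre) & set(post))
    if not shared:
        raise ValueError("no objectives in common")
    total = sum(
        kl_divergence(bernoulli(pre[obj], runs), bernoulli(post[obj], runs))
        for obj in shared
    )
    return total / len(shared)

RUNS = 100  # assumed runs per objective
# Hypothetical pass counts: two batches on the old harness, one on the new.
pre_batch_1 = {"obj_a": 80, "obj_b": 40}
pre_batch_2 = {"obj_a": 78, "obj_b": 43}
post_batch = {"obj_a": 55, "obj_b": 41}

noise_floor = mean_kl(pre_batch_1, pre_batch_2, RUNS)  # same harness, resampled
refactor_shift = mean_kl(pre_batch_1, post_batch, RUNS)
```

Comparing `refactor_shift` against `noise_floor` is what gives "non-trivial" a concrete meaning: a divergence well above what two same-harness batches produce points to a measurement change rather than sampling noise. A permutation test or bootstrap would give a more defensible threshold than a single pair of baseline batches.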