Smarter Than One: Model Tiering, Domain Specialization, and the Future of Multi-Model Security Agents

Status: In Preparation | Centaur Security Labs | 2026

The views expressed in this publication are those of the author and do not reflect the official policy or position of NORAD, USNORTHCOM, USCYBERCOM, the Department of the Army, the Department of War, or the United States Government.

Abstract

Security automation exposes an inconvenient truth about large language models: a single model that can reason about Active Directory lateral movement is expensive overkill for running nmap -sn 192.168.56.0/24. Yet until recently, ARCHER used the same 14-billion-parameter model for every task — from host discovery to credential attack chains. This article documents what I learned by routing different task classes to differently-sized models, why I expect domain-fine-tuned variants to outperform generalists even at the same parameter count, and what architectures become possible when the constraint shifts from "which model should I use?" to "how do I compose the right model for each moment?"

1. The Observation That Started This

This came out of two distinct evals, run on consecutive days, and it is worth keeping them separate because each one establishes a different fact. The first was a comparison eval (2026-06-01): qwen3:8b run on every skill and compared against the qwen3:14b baseline on ARCHER's full 64-objective benchmark. That run is where the asymmetry showed up: the smaller model was not uniformly worse.

On reconnaissance, port scanning, service enumeration, web enumeration, and entity identification (the execute-and-report tasks, where inference is primarily selecting the right tool and flag combination), the comparison eval recorded qwen3:8b as 40–93% faster than qwen3:14b, and the 8b model passed all Tier 1 objectives in that run (51/51, the comparison eval's Tier 1 objective count). On interpret-and-reason tasks (vulnerability identification, complex exploitation chains, Active Directory operations), the 8b model showed 69–125% latency increases, and the overall pass rate fell to 91.7% (177/193), below the 95.4% threshold required for a baseline update. (One caveat on the speed numbers, expanded in §2: the 40–93% figure is a cross-run comparison, not a controlled one.)

The discriminating axis wasn't parameter count. It was task structure. I call this the execute-and-report / interpret-and-reason split:

Execute-and-report tasks have a clear right answer derivable from a small decision tree. Given a live host range, the correct action is nmap -sn <range> and a brief summary. The output is deterministic. The model's job is to select from a known tool inventory and emit a clean result. A smaller model that was pretrained on nmap documentation can do this as well as — and often faster than — a larger model that is simultaneously holding context about NTLM authentication flows, CVSS scoring, and exploit databases.

Interpret-and-reason tasks require multi-step inference over ambiguous output. Determining that a 403 response on /admin combined with an X-Powered-By: PHP/7.4.3 header and a CSRF token in a hidden field implies a specific exploitation path isn't a tool-selection problem — it's a reasoning problem. Larger models with broader pretraining have a genuine, measurable advantage here.

The Tier 1 speed advantage is not a small effect. On the recon and scanning phase of a full engagement, the 8b model compresses wall time meaningfully, with no accuracy cost observed. The second eval, the tiered rebaseline (2026-06-02), ran the production routing map rather than 8b-on-everything, and it confirmed this: 60 of 60 Tier 1 objective-runs passed with qwen3:8b. Every failure in that run was in a Tier 2 objective running qwen3:14b. That 60 is 20 Tier 1 objectives each run 3 times, not 60 distinct objectives; a perfect pass rate on n this small carries a wide confidence interval, so the claim is "no regression observed," not "provably zero error rate."
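To put a number on "wide confidence interval": when all n trials pass, the one-sided upper confidence bound on the true failure rate p solves (1 - p)^n = 1 - confidence. The sketch below is a standard exact-binomial calculation, not part of ARCHER's harness. The function name is illustrative, and it assumes independent trials, which three repeats of the same 20 objectives only approximate.

```python
def zero_failure_upper_bound(n_runs: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the failure rate after n_runs passes and 0 failures.

    Solves (1 - p) ** n_runs = 1 - confidence for p (exact binomial bound).
    """
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)


# All 60 objective-runs treated as independent trials.
bound_60 = zero_failure_upper_bound(60)
# Conservative reading: only the 20 distinct objectives are independent.
bound_20 = zero_failure_upper_bound(20)

print(f"n=60: failure rate below {bound_60:.1%} at 95% confidence")
print(f"n=20: failure rate below {bound_20:.1%} at 95% confidence")
```

Under these assumptions a clean 60/60 is still compatible with a true failure rate of roughly 4.9%, and roughly 13.9% if only the 20 distinct objectives count as independent evidence.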

2. Tiered Routing: What I Built and What I Gained

The implementation is deliberately simple. ARCHER's eval harness defines two constants (_TIER1_MODEL, _DEFAULT_MODEL) and a skill-to-model map (_PLAY_MODEL_MAP):

_TIER1_MODEL = "qwen3:8b"
_DEFAULT_MODEL = "qwen3:14b"

_PLAY_MODEL_MAP = {
    # Tier 1 — tool-bound, deterministic (5 skills)
    "reconnaissance":        _TIER1_MODEL,
    "port_scanning":         _TIER1_MODEL,
    "service_enumeration":   _TIER1_MODEL,
    "web_enumeration":       _TIER1_MODEL,
    "entity_identification": _TIER1_MODEL,
    # Tier 2 — reasoning-heavy: full model
    "web_exploitation":      _DEFAULT_MODEL,
    "network_exploitation":  _DEFAULT_MODEL,
    "linux_privesc":         _DEFAULT_MODEL,
    "ad_lateral_movement":   _DEFAULT_MODEL,
    "reporting":             _DEFAULT_MODEL,
    # ... 16 additional Tier 2 skills (21 Tier 2 entries total)
}

The skill map is not a permanent classification — it's a hypothesis register. Each entry represents a decision about where reasoning depth matters more than inference speed, subject to revision as eval data accumulates.
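A lookup over this map can be a one-liner. The sketch below assumes that unmapped skills fall back to _DEFAULT_MODEL, which errs toward the larger model. The helper name model_for_skill is illustrative rather than taken from the harness, and the map is abbreviated to four entries.

```python
_TIER1_MODEL = "qwen3:8b"
_DEFAULT_MODEL = "qwen3:14b"

# Abbreviated copy of the routing map, for illustration only.
_PLAY_MODEL_MAP = {
    "reconnaissance": _TIER1_MODEL,
    "port_scanning": _TIER1_MODEL,
    "web_exploitation": _DEFAULT_MODEL,
    "ad_lateral_movement": _DEFAULT_MODEL,
}


def model_for_skill(skill: str) -> str:
    """Return the model for a skill; unknown skills get the full model (assumed policy)."""
    return _PLAY_MODEL_MAP.get(skill, _DEFAULT_MODEL)


print(model_for_skill("port_scanning"))
print(model_for_skill("some_unmapped_skill"))
```

Defaulting to the larger model means a newly added skill is slow but not under-provisioned until someone makes an explicit Tier 1 decision for it.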

What this gains:

Speed. On the rebaseline benchmark, Tier 1 objectives averaged 20–78 seconds each depending on skill, down from 35–60 seconds with qwen3:14b. Even with a known prewarm bug (#775) that loads the 14b model between Tier 1 runs before evicting it — erasing roughly half the speed gain — the improvement is measurable. Once fixed, per-run savings of ~15 seconds per Tier 1 objective are expected, compressing a full 3-run eval by approximately 15 minutes.
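The 15-minute estimate follows from the figures already given, taking the rebaseline's 20 Tier 1 objectives and 3 runs per objective. The constant names below are illustrative.

```python
TIER1_OBJECTIVES = 20    # Tier 1 objectives in the rebaseline eval
RUNS_PER_EVAL = 3        # each objective is run 3 times
SAVING_PER_RUN_S = 15    # expected saving per Tier 1 objective-run, in seconds

total_saving_s = TIER1_OBJECTIVES * RUNS_PER_EVAL * SAVING_PER_RUN_S
print(f"{total_saving_s} s = {total_saving_s / 60:.0f} min")  # 900 s = 15 min
```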

A caveat on the headline speed figure: the 40–93% Tier 1 speedup quoted in §1 comes from the 2026-06-01 comparison eval, which is a cross-run measurement — the 8b and 14b numbers were collected in separate runs, not interleaved. That run also happened to execute in the 00:00–05:00 UTC window, which ARCHER's eval-data analytics flags as the worst time-of-day band (a ~45-percentage-point pass-rate swing across the day, attributed to VRAM contention and thermal effects that also move latency). So the speed comparison is directional, not controlled for time-of-day or thermal state; treat the magnitude as an estimate rather than a benchmarked constant.

VRAM headroom. On the RTX 4060 Laptop that ARCHER's lab runs on (8.2 GB total), qwen3:14b occupies 7.5 GB (91%). qwen3:8b occupies approximately 6.2 GB (75%) at the default Ollama quantization level (Q4_K_M). The two cannot be resident at once: together they would need roughly 13.7 GB. (Provenance caveat: the quantization is the default Ollama tag, which is mutable and whose pull date is unrecorded; the VRAM figures are as-asserted in ARCHER's model-routing record, not from a separately captured primary measurement.) Every model switch therefore incurs a full eviction and reload, approximately 15 seconds, which means the routing map has a switching cost that penalizes fine-grained per-objective swaps. The current boundary (5 Tier 1 skills vs 21 Tier 2 skills) was chosen partly to minimize switches while preserving the speed gains: the eval runs through the Tier 1 block early, then transitions to Tier 2 and stays there. With more VRAM (or more aggressively quantized models that can coexist), finer routing becomes tractable.
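The switching cost can be made concrete by counting model changes along an objective sequence. This is a sketch under stated assumptions: the 15-second figure is the approximate reload time quoted above, the two sequences are made up for illustration, and count_switches is not a function from ARCHER's harness.

```python
SWITCH_COST_S = 15  # approximate eviction-plus-reload time quoted in the text


def count_switches(models: list[str]) -> int:
    """Count how many times consecutive objectives need a different model."""
    return sum(1 for prev, cur in zip(models, models[1:]) if prev != cur)


# Hypothetical six-objective sequences: same objectives, different ordering.
blocked = ["qwen3:8b"] * 3 + ["qwen3:14b"] * 3   # Tier 1 block first
interleaved = ["qwen3:8b", "qwen3:14b"] * 3      # alternating tiers

for name, seq in [("blocked", blocked), ("interleaved", interleaved)]:
    switches = count_switches(seq)
    print(f"{name}: {switches} switches, ~{switches * SWITCH_COST_S} s overhead")
```

The blocked ordering pays for one switch (about 15 s) while the interleaved ordering pays for five (about 75 s), which is why the eval runs the Tier 1 block first and then stays on Tier 2.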

Accuracy preservation. The 2026-06-02 rebaseline eval saw 60/60 Tier 1 objective-runs pass (20 objectives × 3 runs). Within this sample, the accuracy cost of tiering in the Tier 1 region is zero — with the small-n caveat from §1 that a clean sweep at this scale bounds the error rate loosely rather than proving it is zero.

3. The Case for Domain Fine-Tuned Models

Tiered routing solves the size mismatch. Fine-tuning solves a different problem: the distribution mismatch.

A pretrained model like qwen3:14b has encountered fragments of nmap documentation, Metasploit module descriptions, CVE summaries, and Kali Linux forum posts scattered across its training corpus — but it has never been trained to complete a penetration test objective from ARCHER's task format. Its knowledge of gobuster is encyclopedic and diffuse. ARCHER's hints are narrow and precise: call gobuster dir with these exact flags against this specific host, look for this class of response, emit this evidence format.

The gap between encyclopedic knowledge and task-appropriate knowledge is where fine-tuning lives.

ARCHER already collects the training signal. Every eval run with --ft-log writes a .ft.jsonl alongside the session log: the task prompt, each command issued, the tool output, whether the objective passed, and the T2 verification score where available. As of this writing, that dataset spans six weeks of intensive eval-driven development: thousands of sessions across 75 objectives, including successful chains, failed chains, recovery sequences, and HALT conditions. The compressed timeline reflects high-cadence eval cycling: every session produces labeled data, so six weeks of daily runs accumulate a volume that conventional development schedules would spread across months.
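Selecting training examples from such a log is a line-by-line filter. The sketch below is illustrative only: the field names ("prompt", "commands", "passed") are hypothetical stand-ins, since the actual .ft.jsonl schema is not shown here, and the two records are made up.

```python
import json
from typing import Iterable, Iterator


def passing_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed records whose (hypothetical) 'passed' field is true."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("passed") is True:
            yield record


# Two made-up records standing in for lines of a .ft.jsonl file.
sample = [
    '{"prompt": "discover live hosts", "commands": ["nmap -sn 192.168.56.0/24"], "passed": true}',
    '{"prompt": "enumerate web dirs", "commands": [], "passed": false}',
]
kept = list(passing_records(sample))
print(len(kept), "of", len(sample), "records kept")
```

The same generator shape works for other selections, for example keeping only failed sessions when building a set of counter-examples.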

What a fine-tuned model would improve over a generalist:

Tool flag precision. Generalist models sometimes emit plausible but suboptimal flag combinations — technically valid nmap invocations that don't match the lab environment's constraints. A model fine-tuned on ARCHER's passing sessions learns the specific flag patterns that work against GOAD-Light's host configuration, WinRM setup, and SMB version.

Evidence capture conventions. ARCHER's T2 scoring checks whether specific artifacts appear in session output: hash formats, service banners, user lists, privilege indicators. A generalist model will often complete the underlying operation but fail to capture the artifact in the expected form. A fine-tuned model learns what T2 is looking for because it was trained on sessions that passed T2.

Failure avoidance. The failure mode taxonomy documents 16 classes of recurring errors, among them startup log misreads, wrong-host confusion, and premature objective achievement. These are not model capability failures; they are distribution failures. The model generates plausible output that happens to match a failure pattern. A model fine-tuned on ARCHER's hint_change=True correction sessions (sessions where the hint was specifically updated to prevent a failure class) has been shown examples of what to avoid and trained toward not doing it.

Shorter chains. Generalist models often take exploratory paths: try one approach, observe, redirect. A domain-fine-tuned model has seen enough ARCHER sessions to know that on a GOAD-Light target with WinRM open, the credential attack sequence follows a predictable branch. It gets there in two commands instead of five. This compounds across a full engagement.

Why LoRA adapters rather than full fine-tunes:

Full parameter fine-tuning of a 14b model requires hardware unavailable in the local lab. LoRA adapters (low-rank weight updates added on top of the frozen base weights) can be trained on consumer hardware and hot-swapped without reloading the base model. The current --adapter flag in ARCHER already supports this path. The expected architecture: a shared base model with per-domain adapters for network, web, Active Directory, and post-exploitation phases. Switching phases swaps an adapter, not a model, which avoids the 15-second full-model eviction.

To be specific: this is not four separate 14b model instances, each requiring 7.5 GB of VRAM — that would require ~30 GB, roughly four times the available hardware budget. It is one shared 14b base model held resident in VRAM, with per-domain adapter sets (~200–500 MB each) that load and unload as the active domain changes. The adapter swap touches only a small fraction of the total parameter space; the base model weights never move. The eviction penalty drops from 15 seconds to the time required to load a few hundred megabytes — sub-second rather than the full model-reload penalty, though exact swap latency depends on hardware and implementation.
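The small adapter footprint follows from how LoRA is parameterized. For a weight matrix of shape d × k, LoRA trains two factors of shapes d × r and r × k with r much smaller than d and k, so the adapter holds r(d + k) parameters instead of dk. The dimensions and rank below are illustrative values, not qwen3:14b's actual layer shapes or ARCHER's adapter configuration.

```python
def lora_param_count(d: int, k: int, rank: int) -> int:
    """Trainable parameters in one LoRA factor pair for a d x k weight matrix."""
    return rank * (d + k)


d, k, rank = 4096, 4096, 16   # illustrative shapes only
full = d * k                  # parameters in the frozen base matrix
adapter = lora_param_count(d, k, rank)
print(f"base: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

At these example shapes the adapter is under 1% of the matrix it modifies, which is the reason a swap moves megabytes rather than gigabytes.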

Domain-adaptive pre-training. LoRA fine-tuning on task data is the second step in the training sequence. The first — domain-adaptive pre-training (DAPT) — conditions the base model on domain-specific corpora before any labeled task data is seen. DAPT is continued training on unlabeled text: tool documentation, CVE summaries, protocol specifications, exploit databases. A model that has seen precise nmap flag semantics and Metasploit module syntax during pre-training requires fewer task examples to reach the same fine-tuned accuracy. Running DAPT locally is not feasible at 14b parameters and 8.2 GB VRAM; cloud GPU-as-a-service removes that ceiling. Training ARCHER's base model on curated tool documentation corpora via a cloud training job — before the LoRA fine-tuning pass — is planned. Until that infrastructure is in place, retrieval-augmented generation serves as the bridge: inject relevant tool documentation sections into the session context at inference time. Same information, no training cost, deployable today.
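The retrieval bridge can be sketched with nothing more than word overlap. Everything here is an assumption made for illustration: the function names, the two documentation snippets, and the scoring rule. A production version would more plausibly use an embedding index over curated tool documentation.

```python
def retrieve_docs(task: str, docs: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank doc sections by how many task words they share; return the best top_k."""
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(text.lower().split())), name)
        for name, text in docs.items()
    ]
    scored.sort(reverse=True)
    return [docs[name] for score, name in scored[:top_k] if score > 0]


def build_prompt(task: str, docs: dict[str, str]) -> str:
    """Prepend retrieved documentation to the task prompt, if any was found."""
    context = "\n".join(retrieve_docs(task, docs))
    return f"Reference:\n{context}\n\nTask:\n{task}" if context else f"Task:\n{task}"


# Made-up documentation snippets; a real corpus would be curated tool docs.
DOCS = {
    "nmap": "nmap -sn performs a ping scan for host discovery without a port scan",
    "gobuster": "gobuster dir enumerates web directories using a wordlist",
}
print(build_prompt("run host discovery with nmap on the lab range", DOCS))
```

The point of the sketch is the shape of the bridge: the model's weights are untouched, and the documentation arrives through the prompt at inference time.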

4. Future Architectures

The tiered routing implementation is a two-model system with a static map. It demonstrates that model composition is tractable and measurable. Several more ambitious architectures become worth investigating as ARCHER's eval infrastructure matures.

Speculative pre-loading. An eval run's objective sequence is determined at start time. In principle, a scheduler could look ahead: while the Tier 1 model is executing PT-SCAN-04, begin pre-warming the Tier 2 model so it's resident by the time PT-EXPLOIT-01 starts. The overlap eliminates the 15-second penalty for the first Tier 2 objective. On hardware with marginal VRAM this requires careful memory accounting, but the scheduler code is straightforward and the win is real.
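The look-ahead part of such a scheduler reduces to finding the positions where the next objective needs a different model. The sketch below only identifies those positions; it does not load anything, and the function name and example sequence are hypothetical. Actually overlapping the load with execution would require both models to fit in memory at once, which the current hardware does not allow.

```python
def prewarm_points(models: list[str]) -> list[tuple[int, str]]:
    """Return (index, next_model) pairs: while objective `index` runs,
    a scheduler could begin loading next_model for objective index + 1."""
    return [
        (i, nxt)
        for i, (cur, nxt) in enumerate(zip(models, models[1:]))
        if cur != nxt
    ]


# Hypothetical run: two Tier 1 objectives followed by two Tier 2 objectives.
sequence = ["qwen3:8b", "qwen3:8b", "qwen3:14b", "qwen3:14b"]
print(prewarm_points(sequence))  # [(1, 'qwen3:14b')]
```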

A lightweight task router. The current skill map is hand-authored and static. A 1–3b model trained specifically to classify ARCHER task descriptions into skill-model pairings could route dynamically, handling tasks that don't map cleanly to the existing skill taxonomy. This matters for real-world use: targets in production engagements generate tasks that don't fit neatly into the training distribution. A router that generalizes better than a lookup table extends the tiering benefit to novel tasks. This direction aligns with the broader routing literature, notably RouteLLM (Ong et al., 2024), which demonstrates that a learned router can match the quality of a strong model at a fraction of the inference cost by selectively escalating only the queries that require it.

Specialized verifiers. T2 scoring in ARCHER currently uses claude-haiku-4-5 — a capable general-purpose model used for the narrow job of checking whether specific evidence artifacts appear in session output. This is another distribution mismatch: T2 is not a general reasoning task, it's a structured extraction and comparison task. A small model fine-tuned specifically on ARCHER's T2 verification data — thousands of (session output, objective criteria, pass/fail) tuples — could match Haiku's accuracy at significantly lower cost and latency, and could be run locally. Given that T2 scoring currently sits 2,444 sessions behind collection, a faster local verifier has an obvious operational benefit.

Multi-model consensus for high-stakes steps. Certain decisions in an engagement carry asymmetric risk: sending a payload that crashes a service, issuing a command that writes to disk on a production host, generating an executive summary that will be read by a client. For these, running two models and surfacing disagreement as a confidence signal is a plausible safeguard. This doesn't require agreement — the disagreement itself is meaningful. If qwen3:14b and a fine-tuned domain model reach different conclusions about whether a given SUID binary is exploitable, that's a flag for human review, not an automatic choice for one answer.
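A minimal version of the disagreement signal compares two normalized verdicts and escalates when they differ. The callables below are stubs standing in for real inference calls, and the function name is hypothetical. Real model output would need more careful normalization than lowercasing.

```python
from typing import Callable


def needs_human_review(
    question: str,
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
) -> bool:
    """Return True when the two models' normalized verdicts differ."""
    verdict_a = model_a(question).strip().lower()
    verdict_b = model_b(question).strip().lower()
    return verdict_a != verdict_b


# Stub callables standing in for real inference calls.
def generalist(question: str) -> str:
    return "exploitable"


def domain_model(question: str) -> str:
    return "Not exploitable"


print(needs_human_review("Is this SUID binary exploitable?", generalist, domain_model))
```

Note that the function never picks a winner; its only output is whether a human should look.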

Calibration models. One class of ARCHER failures — over-confidence on ambiguous output — is not addressable by making the model smarter at the task. It requires the model to know that it doesn't know. A calibration layer trained on ARCHER's HALT and tier2_manual sessions could learn to predict when the primary model is about to generate a confidently wrong answer, and trigger a fallback or escalation. This is distinct from the task model and should probably be a separate inference call rather than a prompting strategy.

Per-engagement fine-tuning. ARCHER already parses target-specific context (open ports, detected services, extracted credentials) into a run context file that's passed to each session. A longer-horizon version of this idea: after the reconnaissance and scanning phases complete, generate a minimal adapter update — a few gradient steps — that conditions the exploitation model on this specific target's observed behavior. The adapter would be ephemeral (discarded after the engagement) but would concentrate the model's weights on what this target actually looks like, rather than what targets generally look like.

5. What I Don't Know Yet

These architectures are extensions of a design that's working, not proven solutions. Several open questions shape what's worth pursuing first.

Switching cost vs. granularity tradeoff. The 15-second model switch penalty means fine-grained routing has a floor below which it costs more than it saves. LoRA adapters would eliminate this for the base model, but I haven't measured adapter swap time in ARCHER's context. The tradeoff may look different with quantized models that co-fit in VRAM.

Fine-tuning data quality vs. quantity. ARCHER's .ft.jsonl dataset is large by session count but concentrated in passing sessions; the failure sessions that would teach a model what not to do are underrepresented. Whether the current dataset produces meaningful adapter improvements, or whether I need more targeted data collection from the correction sessions flagged with hint_change=True, is an empirical question.

Eval validity under model composition. The current eval methodology assumes a single consistent model across a run. Tiered routing breaks that assumption — the baseline now reflects a mixed configuration. This is operationally correct (the baseline should reflect production config), but it complicates regression analysis: a pass rate change could reflect a hint change, a model change, or a routing boundary change. The eval CSV does not currently record the inference model per row — there is no model column in the schema — so the model used for a given objective has to be reconstructed indirectly from skill_selected and the routing map (_PLAY_MODEL_MAP), with the hint_change annotation available to separate hint effects. That reconstruction is workable but lossy, and the analysis is no longer a single-variable comparison. Adding an explicit per-row model column to the CSV is the clean fix and is an open instrumentation item.
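The indirect reconstruction described above amounts to a join between skill_selected and the routing map. In this sketch the map is abbreviated, the CSV content is made up, and the "objective" and "passed" column names are hypothetical; only skill_selected is a column named in the text.

```python
import csv
import io

_TIER1_MODEL = "qwen3:8b"
_DEFAULT_MODEL = "qwen3:14b"
# Abbreviated routing map, for illustration only.
_PLAY_MODEL_MAP = {"port_scanning": _TIER1_MODEL, "web_exploitation": _DEFAULT_MODEL}


def add_model_column(rows):
    """Yield each row with an inferred 'model' field derived from skill_selected."""
    for row in rows:
        inferred = dict(row)
        inferred["model"] = _PLAY_MODEL_MAP.get(row["skill_selected"], _DEFAULT_MODEL)
        yield inferred


# Made-up CSV content standing in for eval output.
raw = (
    "objective,skill_selected,passed\n"
    "PT-SCAN-04,port_scanning,1\n"
    "PT-EXPLOIT-01,web_exploitation,0\n"
)
rows = list(add_model_column(csv.DictReader(io.StringIO(raw))))
for row in rows:
    print(row["objective"], row["model"])
```

The lossy part is visible in the code: the join uses whatever the map says today, so rows recorded before a routing change would be labeled with the wrong model.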

Generalization vs. overfitting in domain models. A model fine-tuned heavily on ARCHER's GOAD-Light eval targets may become brittle on novel targets. The goal of the fine-tuning program is better execution of the right operation, not memorization of GOAD-Light's specific configuration. This requires deliberate data curation and probably holdout evaluation against targets the fine-tuned model has never seen — a capability I don't currently have in the eval infrastructure.

6. Small Swarms: Parallelism as a Model Composition Strategy

Tiered routing distributes work across models sequentially — one objective at a time, one model active at a time. The next logical extension is distributing work concurrently: small swarms of agents running in parallel, each handling a different part of the engagement simultaneously.

This is not a hypothetical. There are specific cases where parallel model instances produce something a single sequential model cannot: faster throughput on independent work, mutual verification on high-stakes decisions, and adversarial pressure that improves output quality.

Objective-level parallelism. Many of ARCHER's objectives have no dependency relationship. PT-RECON-01 (host discovery on 192.168.56.0/24) and PT-WEBENUM-01 (directory enumeration of the web server) can run simultaneously, since neither needs the other's output. The current eval runs them sequentially because a single 8.2 GB GPU can't hold two model instances at once. With LoRA adapters that co-fit in VRAM, or cloud inference that removes the memory ceiling entirely, running five Tier 1 objectives in parallel is straightforward. With ideal five-way scaling, a reconnaissance phase that currently takes 15 minutes would take about 3.

Embarrassingly parallel scoring. The T2 scoring backlog — 2,444 sessions unscored as of mid-2026 — is the clearest immediate case. Each T2 call is independent. Ten concurrent Haiku instances could clear the backlog in the time it takes one to process 250 sessions. The bottleneck isn't compute; it's that the scoring pipeline is single-threaded by design. A swarm of small, cheap scoring models is the right tool.
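Because each scoring call is independent, a thread pool is enough to fan the work out. The sketch below uses Python's standard concurrent.futures. The score_session stub and its pass rule are placeholders for a real verifier call, and the worker count of ten simply mirrors the example above.

```python
from concurrent.futures import ThreadPoolExecutor


def score_session(session_id: int) -> tuple[int, bool]:
    """Stub for one independent T2 scoring call; a real version would call the verifier."""
    return session_id, session_id % 2 == 0


def score_backlog(session_ids: list[int], workers: int = 10) -> dict[int, bool]:
    """Score sessions concurrently; threads suit I/O-bound API calls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score_session, session_ids))


results = score_backlog(list(range(20)))
print(len(results), "sessions scored")
```

For a hosted verifier the practical ceiling is more likely the provider's rate limit than local compute, so the worker count would need tuning against that.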

Speculative parallel execution. When a recon step surfaces two plausible exploitation paths, a sequential agent picks one, tries it, and pivots if it fails. A parallel swarm tries both simultaneously and takes whichever returns a passing T1 signal first. The losing branch is killed. This trades compute for latency — the right tradeoff in time-sensitive engagements where a 30-second parallelism overhead beats a 10-minute sequential retry cycle.
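The race itself can be sketched with concurrent.futures.as_completed, which yields attempts in the order they finish. The attempts here are stubs that sleep and return a fixed result, and all names are hypothetical. One caveat the sketch makes explicit: Python threads cannot be killed, so Future.cancel() only prevents work that has not started. Terminating a losing branch mid-run would need separate processes or cooperative cancellation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional


def try_path(name: str, delay_s: float, passes: bool) -> tuple[str, bool]:
    """Stub exploitation attempt: sleeps, then reports whether it passed."""
    time.sleep(delay_s)
    return name, passes


def first_passing(paths: list[tuple[str, float, bool]]) -> Optional[str]:
    """Run all candidate paths concurrently; return the first one that passes."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        futures = [pool.submit(try_path, *p) for p in paths]
        for future in as_completed(futures):
            name, passed = future.result()
            if passed:
                for other in futures:
                    other.cancel()  # only stops work that has not started yet
                return name
    return None


winner = first_passing([("path_a", 0.4, True), ("path_b", 0.05, False), ("path_c", 0.2, True)])
print(winner)
```

In this example path_b finishes first but fails, so the race continues until path_c passes; a fast failure does not end the race, only a pass does.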

Variant testing for hint development. Finding the right hint phrasing for a failing objective currently requires running 3 sequential eval cycles per variant. With parallel execution, three hint variants can be tested simultaneously against the same objective — the best phrasing surfaces in one cycle instead of three. This directly accelerates the development loop.

Multi-model debate on high-stakes steps. This is the concurrent form of the consensus check described in §4. For operations with asymmetric risk (a command that deletes files on a target, a payload that crashes a service, a summary that will be read by a client), two model instances run side by side, one generating and one critiquing, and disagreement is surfaced as a confidence signal. The debate doesn't need consensus. The disagreement itself is the output: if qwen3:14b and a fine-tuned domain model reach different conclusions about whether a SUID binary is safely exploitable, that gap is a flag for human review, not a coin flip.

Red-team / blue-team within a single engagement. A small model generates attack variants; a second, independently instantiated model evaluates each variant for likely detectability, noise footprint, and operational safety. Both run concurrently. Neither blocks the other. The output isn't just "can we do this?" but "should we, given these detection risks?" This is a capability that no single sequential model can provide — it requires genuine independence between the generating and evaluating instances.

The Constraint That Changes Everything

Every pattern above that needs a second local model instance is currently blocked by a single physical constraint: 8.2 GB of VRAM, of which qwen3:14b occupies 91%. A second model instance cannot coexist.

The constraint dissolves under two conditions. First, LoRA adapters on a shared base model — both "instances" share weights, only the adapter layers differ, total VRAM is close to the base model alone. Second, cloud inference — the VRAM ceiling disappears entirely, replaced by cost and latency considerations that are more tractable.

Neither requires new research. The eval infrastructure already supports the orchestration patterns (run objectives, collect results, dispatch follow-up work). Parallelism is a scheduling change, not an architectural one.

The data sovereignty boundary. Not all work belongs in the cloud regardless of VRAM headroom. ARCHER sessions contain target IPs, extracted credentials, service banners, and vulnerability findings scoped to specific engagements. That data does not transit a cloud provider — not for inference, not for training. Local-first is the production default for the same reason air-gapped systems exist: the security boundary is the architecture, not a policy document. Cloud inference is appropriate for two specific categories where no operational data is present: development evaluation against ARCHER's private lab environment (GOAD-Light targets under controlled conditions), and offline workloads that don't touch live session data — T2 scoring backlogs, adapter training on sanitized eval sessions, variant testing during hint development. Everything that touches a real engagement target stays local. This constraint is not a preference; it is the condition that makes this architecture deployable in real operational contexts.

The implication for the current roadmap: the .ft.jsonl pipeline and LoRA adapter work aren't just about improving a single model's accuracy. They're the enabling layer for concurrent multi-model operation. Get the adapters to swap at near-zero cost, and every parallelism pattern described above becomes available at the same hardware budget.

7. The Underlying Principle

A security agent dispatches hundreds of operations across a single engagement. Those operations vary enormously in the cognitive demand they place on a model: selecting the right scan flags is not the same operation as reconstructing an attack chain from fragmented service responses. Treating them as equivalent — giving both to the same model with the same resource allocation — is a design choice that looks like a neutral default but isn't.

Tiered routing is a concrete application of one principle: match the model to the task, not to the problem domain. The problem domain is security. The task varies by step. The current implementation captures the coarsest version of that match — two tiers, five Tier 1 skills, a static map. The fine-tuning program captures a different dimension — same model size, better task-specific weight distribution. The future architectures described above capture additional dimensions: temporal (speculative loading), risk-adaptive (consensus on high-stakes steps), target-adaptive (per-engagement conditioning), and parallel (concurrent instances on independent work).

None of these require new research. They require ARCHER's existing eval infrastructure to serve as the signal source it was built to be: run objectives, measure outcomes, use the data to make the next decision.

Development details, eval data, and source code referenced in this article are from the ARCHER project, maintained at Centaur Security Labs.