QLoRA Configuration Protocol

Status: Active | Established 2026-06-05 | Issue: #861
Applies to: V2 Phase 5 — fine-tuning qwen3:14b on the ARCHER pentest corpus via Unsloth QLoRA

Recommended configuration

python3 scripts/finetune.py \
    --lora-rank 8 \
    --lr-warmup-ratio 0.10 \
    --learning-rate 2e-4 \
    --batch-size 4 \
    --lora-alpha 16 \
    --lora-dropout 0.05 \
    --target-modules q_proj,v_proj

Note: --lora-rank and --lr-warmup-ratio parameters require Coder implementation (#861). Until shipped, set these values directly in scripts/finetune.py.

The 8 GB VRAM constraint

qwen3:14b loaded at 4-bit quantization (NF4 via bitsandbytes) occupies ~7.2 GB of VRAM, leaving ~800 MB for gradients and the LoRA adapter. The parameters below are calibrated to that headroom. Any configuration that exceeds it will run out of memory (OOM) mid-training, with no warning at startup.

Component	VRAM impact	Notes
qwen3:14b 4-bit base	~7.2 GB	Fixed; cannot be reduced without swapping model
LoRA adapter (r=8)	~50 MB	Scales with rank and number of target modules
Gradient accumulation	~200 MB	For batch size 4, seq len 2048
Total (r=8, q+v only)	~7.5 GB	Safe margin within 8 GB
LoRA adapter (r=16)	~100 MB	Still fits; tighter margin
LoRA adapter (r=32)	~200 MB	Risky; leaves < 500 MB headroom
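
The headroom arithmetic behind the table can be sketched as follows. This is a rough illustration, not a measurement: every figure is copied from the table, and the 8 GB budget is treated as 8000 MB (an assumption, chosen to match the ~800 MB figure quoted above).

```python
# Approximate figures copied from the table above, in MB.
TOTAL_VRAM_MB = 8000   # assumption: "8 GB" treated as 8000 MB
BASE_MODEL_MB = 7200   # qwen3:14b at 4-bit
GRADIENT_MB = 200      # batch size 4, seq len 2048
ADAPTER_MB = {8: 50, 16: 100, 32: 200}  # LoRA adapter by rank (q_proj + v_proj)


def headroom_mb(rank: int) -> int:
    """Return the estimated free VRAM in MB for a given LoRA rank."""
    used = BASE_MODEL_MB + ADAPTER_MB[rank] + GRADIENT_MB
    return TOTAL_VRAM_MB - used


for rank in sorted(ADAPTER_MB):
    print(f"r={rank}: ~{headroom_mb(rank)} MB estimated headroom")
```

Real usage should be confirmed on the target GPU, since these estimates leave out allocator overhead and memory fragmentation.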

Parameter rationale

Rank (r=8, fallback r=16)

Start at r=8. The LoRA rank controls how many degrees of freedom the adapter has to modify the base model. Higher rank = more capacity = more forgetting risk.

The ARCHER fine-tuning task is narrow: a 14B model being adapted on ~2,000–3,000 pentest sessions. The base model already has the required capabilities (bash command generation, JSON formatting, sequential reasoning). The fine-tuning goal is behavioral alignment — teaching the model when to stop, how to format findings, how to chain commands — not installing new capabilities. r=8 is sufficient for behavioral alignment on a narrow task.

Use r=16 if and only if the fine-tuned model underfits at r=8, defined as an eval pass rate on the rare-skill objectives (PT-PIVOT-01, PT-AD-03, PT-PERSIST-03) that fails to improve over the base model across 3 evaluation runs. See "Detecting underfitting vs catastrophic forgetting" below.

Do not exceed r=16 without a specific justification. r=32 and above approach the forgetting regime for a narrow 2K-session fine-tune.

Alpha (alpha=16 for r=8; alpha=32 for r=16)

Set alpha = 2 × rank. LoRA scales the adapter update by alpha/r (Hu et al.), and alpha = 2 × rank is a widely used convention that fixes this factor at 2.0. Deviating from this ratio changes the effective learning rate of the adapter; keep it at 2× to isolate rank as the single variable during hyperparameter evaluation.
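
As a small sketch of the rule, the helpers below derive alpha from the rank and report the resulting scaling factor. The function names are illustrative and are not taken from scripts/finetune.py.

```python
def alpha_for_rank(rank: int, ratio: float = 2.0) -> int:
    """Return lora_alpha for a given rank under the alpha = ratio * rank rule."""
    return int(ratio * rank)


def lora_scaling(alpha: int, rank: int) -> float:
    """Scaling factor applied to the low-rank update B @ A, i.e. alpha / r."""
    return alpha / rank


for rank in (8, 16):
    alpha = alpha_for_rank(rank)
    print(f"r={rank}: alpha={alpha}, scaling={lora_scaling(alpha, rank)}")
```

Because the scaling factor stays constant across both ranks, a change in eval results between r=8 and r=16 can be attributed to rank rather than to a shifted effective learning rate.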

Learning rate (2e-4 with 10% warmup)

2e-4 is the standard instruction-tuning learning rate for QLoRA on 7B–14B models. Do not exceed 3e-4 — above this threshold, 4-bit quantized models show loss instability on narrow domain datasets.

Linear warmup over 10% of total steps (--lr-warmup-ratio 0.10) keeps the adapter from taking large steps while early gradient and optimizer-moment estimates are still noisy. Without warmup, the first few hundred steps can push the adapter far from its initialization and destabilize the rest of the run.
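
A minimal sketch of the warmup schedule, assuming the rate is held constant at its peak once warmup ends. This document does not specify what follows warmup, so the constant tail is an assumption, and the function name is illustrative.

```python
def warmup_lr(step: int, total_steps: int, peak_lr: float = 2e-4,
              warmup_ratio: float = 0.10) -> float:
    """Learning rate at a 0-indexed step: linear ramp, then constant.

    Assumption: the rate stays at peak_lr after warmup; no decay is modeled.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


total = 1000
for step in (0, 49, 99, 500):
    print(f"step {step}: lr={warmup_lr(step, total):.2e}")
```

With 1000 total steps the ramp covers the first 100 steps, so the full 2e-4 rate is first applied at step 99.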

Batch size (4)

Batch size 4 at sequence length 2048 fits within the VRAM budget. If gradient accumulation is available, accumulate over 4 steps (effective batch size 16) for more stable gradient estimates. Do not exceed batch size 8 without checking VRAM headroom.

Dropout (0.05)

A dropout of 0.05 is standard for QLoRA fine-tunes on narrow datasets. It provides light regularization against the training set without meaningfully impeding task adaptation. Increase to 0.1 if the model overfits to training sessions (training loss drops sharply while eval pass rate stays flat).

Target modules

Minimum viable: q_proj, v_proj

These are the attention query and value projections, the layers most responsible for in-context learning and instruction following. They are usually sufficient for behavioral alignment; the optional extension below covers the case where they are not.

Optional extension: add k_proj, o_proj

Key and output projections extend coverage at the cost of ~2× adapter memory. Use the 4-module set if r=8 + q/v only produces underfitting. Do not add MLP layers (gate_proj, up_proj, down_proj) unless there is a specific capability gap that attention-only adaptation cannot address — MLP fine-tuning carries higher forgetting risk.
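
The ~2× figure can be sanity-checked by counting trainable parameters: each adapted projection adds r × (d_in + d_out) parameters per layer. The sketch below does that count. The layer shapes are assumptions about the base model and should be confirmed against its config.json before relying on the numbers.

```python
# Assumed attention shapes for the base model; confirm against its config.json.
HIDDEN = 5120      # hidden size (assumption)
Q_OUT = 40 * 128   # query heads * head dim (assumption)
KV_OUT = 8 * 128   # key/value heads * head dim, grouped-query attention (assumption)
NUM_LAYERS = 40    # transformer layers (assumption)

# (input features, output features) for each attention projection.
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
}


def lora_param_count(modules, rank):
    """Trainable LoRA parameters: r * (d_in + d_out) per module per layer."""
    per_layer = sum(rank * (SHAPES[m][0] + SHAPES[m][1]) for m in modules)
    return per_layer * NUM_LAYERS


qv = lora_param_count(["q_proj", "v_proj"], rank=8)
qkvo = lora_param_count(["q_proj", "k_proj", "v_proj", "o_proj"], rank=8)
print(f"q+v: {qv:,} params; q+k+v+o: {qkvo:,} params; ratio {qkvo / qv:.1f}x")
```

The count covers adapter weights only. Optimizer state and gradients scale with the same number, so the ratio between the two module sets carries over to total adapter memory.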

Detecting underfitting vs catastrophic forgetting

These are the two failure modes in opposite directions. Distinguishing them requires running both the base model and the fine-tuned model against the same eval sets.

Underfitting (rank too low or insufficient training)

Signal: Fine-tuned model eval pass rate does not exceed base model by ≥5 percentage points on the 75-objective set across 3 independent runs.

Confirmation check: Run both models on the rare-skill objectives that the base model already handles poorly (PT-PIVOT-01, PT-AD-03, PT-PERSIST-03). If the fine-tuned model shows no improvement on these, the adapter has insufficient capacity to adapt the base model's behavior on sparse training examples.

Response: Re-run with r=16. If r=16 also fails to produce improvement, the bottleneck is training data volume (rare-skill labels), not rank. Do not increase rank further — add sessions via run_data_collection.sh on the relevant objectives.
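
The underfitting signal and response can be sketched as a small check. It assumes the three runs are aggregated by their mean, which this document does not specify, and the function names are illustrative.

```python
from statistics import mean

UNDERFIT_MARGIN_PP = 5.0  # required improvement over base, in percentage points


def is_underfit(base_runs, tuned_runs, margin_pp=UNDERFIT_MARGIN_PP):
    """True when the tuned mean pass rate fails to beat the base mean by margin_pp.

    Pass rates are percentages (0-100), one value per independent eval run.
    Assumption: runs are aggregated by their mean.
    """
    if len(base_runs) < 3 or len(tuned_runs) < 3:
        raise ValueError("need at least 3 independent runs per model")
    return mean(tuned_runs) - mean(base_runs) < margin_pp


def next_step(current_rank, underfit):
    """Map the check result to the response described above."""
    if not underfit:
        return "keep current rank"
    if current_rank < 16:
        return "re-run with r=16"
    return "add sessions via run_data_collection.sh"


print(next_step(8, is_underfit([40, 42, 41], [43, 44, 45])))
```

The rank ceiling of 16 is deliberate: once r=16 also underfits, the function points at data collection rather than a higher rank.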

Catastrophic forgetting (rank too high or too many target modules)

Signal: Fine-tuned model capability benchmark (#860) scores ≥15 percentage points below base model on any general-purpose category (code generation, math reasoning, natural language).

Confirmation check: Run the base model and fine-tuned model on 20 general-purpose tasks outside the ARCHER system prompt structure. A score drop of ≥15 percentage points on any category is a hard stop. See #860 for the benchmark specification and the data/capability_benchmark/ output directory.

Response: Revert to base model. Re-run with lower rank (r=4 if coming from r=8) or fewer target modules (q_proj only). The ARCHER eval pass rate improvement is not worth a general capability regression — a range-locked fine-tuned model that can't reason outside the training distribution is not a V2 baseline.
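
A sketch of the hard-stop check, assuming per-category scores are available as percentages keyed by category name. The dictionary layout and category keys are hypothetical; the actual output format is defined by #860.

```python
FORGETTING_THRESHOLD_PP = 15.0  # hard-stop drop, in percentage points


def forgetting_regressions(base_scores, tuned_scores,
                           threshold_pp=FORGETTING_THRESHOLD_PP):
    """Return {category: drop} for every category that hits the hard stop.

    Both arguments map category name to score in percent (hypothetical layout).
    """
    regressions = {}
    for category, base in base_scores.items():
        drop = base - tuned_scores[category]
        if drop >= threshold_pp:
            regressions[category] = drop
    return regressions


base = {"code_generation": 70.0, "math_reasoning": 60.0, "natural_language": 80.0}
tuned = {"code_generation": 68.0, "math_reasoning": 40.0, "natural_language": 79.0}
failed = forgetting_regressions(base, tuned)
print("hard stop" if failed else "ok", failed)
```

Any non-empty result means reverting to the base model, regardless of how much the ARCHER eval pass rate improved.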

Pre-training gates

All of the following must pass before running finetune.py. Non-negotiable — these gates exist because the failure modes they cover are hard to detect after training.

# Gate 1: Calibration — T2 scoring must be stable
python3 scripts/t2_calibration_check.py       # must exit 0

# Gate 2: Corpus quality — zero flagged sessions
python3 scripts/prepare_finetune.py --report --exclude-bv --skill-cap 200
# skipped_bv must be 0; no skill should exceed 200

# Gate 3: General capability baseline (#860)
# Run before fine-tuning; record to data/capability_benchmark/baseline.json

# Gate 4: OOD eval (#815)
# data/ood_eval_results.json must exist with pass_rate >= 0.60

Full pre-training checklist is maintained in the V2 epic (#76).
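
Gate 4 lends itself to a mechanical check. The sketch below assumes pass_rate is a top-level numeric field in data/ood_eval_results.json. The file's actual schema is not described here, so adjust the lookup if the field is nested.

```python
import json
from pathlib import Path


def ood_gate_passes(path="data/ood_eval_results.json", min_pass_rate=0.60):
    """Return True when the OOD results file exists and meets the threshold."""
    results_file = Path(path)
    if not results_file.exists():
        return False
    with results_file.open() as fh:
        results = json.load(fh)
    # Assumption: pass_rate is a top-level numeric field.
    pass_rate = results.get("pass_rate")
    if pass_rate is None:
        return False
    return float(pass_rate) >= min_pass_rate


if __name__ == "__main__":
    print("Gate 4:", "pass" if ood_gate_passes() else "fail")
```

A missing file or missing field counts as a failure, so the gate cannot pass by omission.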

Conditions to revisit

Revisit this config if:

- Corpus grows beyond 10,000 sessions: at that volume, higher rank becomes viable without proportionally increasing forgetting risk
- A new base model replaces qwen3:14b: the VRAM budget and parameter rationale change
- A hardware upgrade beyond 8 GB VRAM allows larger batch sizes or higher rank without the current headroom constraints

References

V2 Failure Mode Analysis — CSL internal doc, not in repo
V1 to V2 Evolution — context for what changes and what stays
V2 epic (#76) — phase map and pre-training checklist
scripts/finetune.py — implementation; scripts/export_lora.py — post-training GGUF export