Most “AI red team” spreadsheets rot because setup takes longer than curiosity lasts. Adoption happens when testers can replay a baseline bundle in one command, compare gateway verdict deltas across commits, promote findings as regression fixtures, and see latency impact alongside security lift, not screenshots of amusing model replies labelled “risk”.
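A minimal sketch of what a one-command replay harness might record so verdict deltas and latency can be diffed across commits. Every name here (the JSONL bundle format, `run_gateway`, `ReplayResult`) is a hypothetical stand-in, not Intertrace's actual API:

```python
"""Replay a baseline bundle and record verdict + latency per case.
All names and the JSONL bundle format are illustrative assumptions."""
import json
import sys
import time
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass
class ReplayResult:
    case_id: str
    verdict: str       # e.g. "block" / "allow" / "flag"
    latency_ms: float
    commit: str        # gateway build under test

def run_gateway(prompt: str) -> str:
    """Placeholder: call your real gateway client here."""
    return "allow"

def replay_bundle(bundle_path: str, commit: str) -> list[ReplayResult]:
    results = []
    for line in Path(bundle_path).read_text().splitlines():
        case = json.loads(line)                       # one JSON case per line
        start = time.perf_counter()
        verdict = run_gateway(case["prompt"])
        latency = (time.perf_counter() - start) * 1000
        results.append(ReplayResult(case["id"], verdict, latency, commit))
    return results

if __name__ == "__main__":
    # usage: python replay.py baseline.jsonl <commit-sha>
    results = replay_bundle(sys.argv[1], commit=sys.argv[2])
    print(json.dumps([asdict(r) for r in results], indent=2))
```

Persisting one such JSON file per commit is enough to diff verdicts and latency between any two builds.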
Properties of an effective lab corpus
- Versioned transcripts with reproducible embeddings (pinned index builds) rather than brittle static strings.
- Ablation families: multilingual, homoglyphs, chunked markdown, OCR noise, multimodal payloads where relevant.
- Negative controls: baseline benign traffic ensuring policy churn doesn’t erode usability.
- Severity rubric co-signed by product and legal; not every creative insult is a Sev0. (A case schema capturing these properties is sketched after this list.)
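A minimal sketch of a corpus-case schema reflecting these properties, assuming content-addressed transcripts and pinned embedding-index builds; every field name and value is illustrative, not a fixed format:

```python
"""One corpus case: versioned, ablation-tagged, control-aware, severity-scored.
Field names and values are illustrative assumptions, not a fixed format."""
from dataclasses import dataclass

@dataclass(frozen=True)
class CorpusCase:
    case_id: str
    transcript_sha: str          # content-addressed transcript, not a brittle raw string
    embed_index_build: str       # pinned index build so retrieval is reproducible
    ablation_family: str         # "multilingual" | "homoglyph" | "chunked_md" | "ocr_noise" | ...
    is_negative_control: bool    # benign traffic guarding against usability erosion
    severity: str                # rubric value product + legal agreed on, e.g. "sev2"

# A negative control: benign traffic that must keep passing as policies churn.
benign_baseline = CorpusCase(
    case_id="ctrl-0041",
    transcript_sha="sha256:0f3a...",      # digest elided for the example
    embed_index_build="idx-2024.06-rc1",
    ablation_family="none",
    is_negative_control=True,
    severity="none",
)
```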
Close the loop with engineering
Each simulated failure produces a regression ID checked into CI, or it didn’t happen. Block merges when the novel-bypass rate exceeds its SLA; block releases when a remediation commit lands without paired simulator deltas.
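One way such a merge gate might look in CI, comparing the current run's verdicts against the checked-in baseline; the 0.5% SLA, the verdict-file format, and the exit-code convention are all assumptions:

```python
"""CI gate sketch: fail the build when the novel-bypass rate breaches SLA.
Verdict-file format ({case_id: verdict}) and the threshold are assumptions."""
import json
import sys
from pathlib import Path

SLA_NOVEL_BYPASS_RATE = 0.005  # 0.5%; pick yours deliberately, per the SLA above

def novel_bypasses(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Case IDs the gateway blocked at baseline but allows now."""
    return [cid for cid, verdict in current.items()
            if verdict == "allow" and baseline.get(cid) == "block"]

def main(baseline_path: str, current_path: str) -> int:
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    novel = novel_bypasses(baseline, current)
    rate = len(novel) / max(len(current), 1)
    print(f"novel bypasses: {len(novel)} ({rate:.3%}), SLA {SLA_NOVEL_BYPASS_RATE:.3%}")
    for cid in novel:
        print(f"  REGRESSION {cid}")  # each becomes a regression ID checked into CI
    return 1 if rate > SLA_NOVEL_BYPASS_RATE else 0  # nonzero exit blocks the merge

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

The release-side rule can be enforced the same way: if a remediation commit ships without its paired before/after verdict files, the comparison step has nothing to load and fails the pipeline.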
Metrics that matter beyond percent accuracy
- Time-to-patch for novel bypass archetypes (median / p95).
- "Escape velocity" drift after model vendor updates—automatic diff runs when upstream weights change.
- Operational cost: extra milliseconds per inference vs risk reduction—not security theater at arbitrary latency.
- Human adjudication backlog size—if reviewers drown, signals are noisy or UX copy failed.
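A sketch of the bookkeeping behind two of these metrics, time-to-patch percentiles and milliseconds paid per unit of risk reduction; the record shapes and sample numbers are invented for illustration:

```python
"""Metric bookkeeping sketch. Record shapes and sample values are invented;
wire them to whatever your tracker and latency probes actually export."""
from statistics import median, quantiles

def time_to_patch_stats(hours_to_patch: list[float]) -> tuple[float, float]:
    """Median and p95 time-to-patch for novel bypass archetypes."""
    p95 = quantiles(hours_to_patch, n=100)[94]  # 95th percentile cut point
    return median(hours_to_patch), p95

def ms_per_risk_point(extra_latency_ms: float, bypass_rate_drop: float) -> float:
    """Extra inference milliseconds paid per percentage point of bypass-rate reduction."""
    return extra_latency_ms / max(bypass_rate_drop * 100, 1e-9)

if __name__ == "__main__":
    med, p95 = time_to_patch_stats([4.0, 6.5, 9.5, 12.0, 18.0, 30.0, 72.0])
    print(f"time-to-patch: median={med:.1f}h p95={p95:.1f}h")
    # e.g. +3.2 ms per inference bought a 1.2-point drop in bypass rate
    print(f"{ms_per_risk_point(3.2, 0.012):.1f} ms per point of risk reduction")
```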
Intertrace exposes simulation surfaces so teams iterate without mocking half the gateway stack offline. Tie lab runs to dashboards; let operators diff threat categories and classifications like any other flaky test suite, with owners, deadlines, and traceable fixes.
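A sketch of what that diff might look like, treating verdict flips per threat category like flaky tests to be triaged; the case-ID convention, category names, and sample payloads are made up:

```python
"""Diff per-category verdicts between two lab runs, flaky-test style.
Case-ID convention ('<category>/<n>') and sample data are illustrative."""
from collections import Counter

def category_delta(before: dict[str, str], after: dict[str, str]) -> Counter:
    """Count verdict flips per threat category between two runs."""
    flips: Counter = Counter()
    for case_id, old in before.items():
        new = after.get(case_id)
        if new is not None and new != old:
            category = case_id.split("/", 1)[0]
            flips[f"{category}: {old}->{new}"] += 1
    return flips

before = {"prompt_injection/1": "block", "homoglyph/7": "block", "benign/3": "allow"}
after  = {"prompt_injection/1": "allow", "homoglyph/7": "block", "benign/3": "flag"}

for change, count in category_delta(before, after).most_common():
    # triage each line like a flaky test: assign an owner and a deadline
    print(f"{count:3d}  {change}")
```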