Prompt injection and indirect attacks in production

Why production LLM workloads face a different class of attacker than labs—and how indirect channels, embeddings, and tool chains turn a single misplaced trust boundary into outage or data exposure.

Prompt injection stops being an academic curiosity the moment your model can read mail, browse the web, call internal tools, or summarize documents that originated outside your security boundary. Direct attacks—“ignore prior instructions”—are noisy. In production we spend more cycles on indirect attacks: payloads that survive formatting, truncation, multilingual paraphrasing, OCR errors, markdown quirks, or nested content inside PDFs and HTML that your pipeline never validated as hostile input.

The trust boundary moves with context

Architecturally, the safe mental model is: anything that influences token generation before the classifier or policy sees it should be classified as potentially adversarial—even if your product owner calls it “user content”. That includes snippets injected by integrations, retrieval chunks, MCP tool-return bodies, summarized conversation history, cached system prompts patched by CI, or feature flags flipped at runtime.

Retrieval augmented generation (RAG) turns your vector store into a delivery channel—poison one popular doc and every agent that cites it inherits the attacker’s intent scaffold.
Email/support bots concatenate threads; the tenth reply may carry a microscopic instruction stanza engineered to bypass shallow keyword checks.
Browsing or code-execution plugins let the assistant fetch fresh instructions after your static policy snapshot was evaluated—classic TOCTOU for LLMs.
Multimodal payloads hide instructions inside images’ alt text or table cells that OCR normalizes oddly, fragmenting defenses that key on plaintext alone.

What “good enough” mitigation looks like

You will not eradicate prompt injection with a single regex, static deny list, or one-shot refusal prompt. Robust production posture combines layers: deterministic structural controls (sandboxing dangerous tools), semantic classifiers oriented to misuse intent—not surface form—output scanning before user-visible delivery, and durable audit trails tying each model call to corpus version, retrieval IDs, policy pack ID, and tool arguments.

Operational teams should treat high-risk intents (credential hunting, lateral movement language, abnormal tool cardinality) like security signals, not QA nits. Prefer blocking or degraded responses with explicit policy references over silent model “helpfulness”; silent success teaches attackers which phrasing survives filters.

Operational playbooks

Inventory every inbound channel that concatenates untrusted strings into model context—include third-party MCP servers and spreadsheets, not only chat bubbles.
Run continuous simulation using paraphrases and multilingual variants of known jailbreak scaffolding; forbid release if regression rate jumps without an accepted risk memo.
Segment models and keys so a compromised workspace bot cannot escalate to treasury or HR copilots by reusing bearer tokens.
Correlate gateway events with tool trace IDs—when an assistant “helps” oddly, reviewers need the exact arguments your runtime allowed through.

Intertrace treats prompt injection defense as runtime policy plus evidence: deterministic gateway decisions, categorized threats, latency-bounded classification, and enough structured telemetry for incident timelines. The goal isn’t novelty—it’s repeatability engineers can automate and auditors can replay.