AI Security // Prompt & Policy Control

Prompt Injection & Jailbreaks

Prompt injection is useful to attackers because it targets the instruction hierarchy itself. The goal is not comedy output. The goal is to make untrusted content win against hidden policy, expose protected context, redirect planning or manipulate whatever trusts the response next.

field briefoperator referencepublic sources

Why this topic matters

Direct prompt injection matters when a user prompt can override system intent. Indirect prompt injection matters when the malicious instruction arrives from documents, web pages, tickets, emails or retrieved chunks that the system treats as content instead of control input.

The interesting question is whether the system leaks hidden instructions, breaks policy in a repeatable way, changes tool-selection logic or influences humans and automation downstream. A nice-looking refusal followed by unsafe hidden behaviour is still a control failure.

Attack lanes

  • Direct injection against the visible chat surface.
  • Indirect injection through RAG, browsing, imported files and helpdesk content.
  • System prompt extraction and policy leakage.
  • Safety-evasion chains that rely on roleplay, translation, summarisation or format-shifting.
  • Output steering where the model convinces another component or analyst to take an unsafe step.

Reporting angle

Good reporting preserves the exact payload, the preconditions, the response pattern, the trust boundary crossed and the downstream consequence. That is what turns a jailbreak into a security finding instead of a screenshot.

Curated public references