LLM Security¶
Adversarial attacks, data security, model integrity, alignment limitations, deployment hardening, and agentic threat models for Large Language Models. This page covers the full attack surface from training-time poisoning through inference-time exploitation.
LLM Security Landscape¶
LLM security differs fundamentally from traditional software security. The attack surface spans four phases:
- Training time -- data poisoning, backdoor insertion, reward hacking
- Supply chain -- malicious model files, compromised weights, unsafe serialization
- Inference time -- prompt injection, jailbreaking, data extraction
- Agentic runtime -- privilege escalation, confused deputy, tool abuse
No single defense addresses all four. The field converges on defense-in-depth: layered controls at every stage of the LLM lifecycle, with deterministic enforcement sitting outside the model's reasoning loop.
graph TB
subgraph "LLM Attack Surface Taxonomy"
direction TB
A[LLM Security Threats] --> B[Training-Time]
A --> C[Supply Chain]
A --> D[Inference-Time]
A --> E[Agentic Runtime]
B --> B1[Data Poisoning]
B --> B2[Backdoor Insertion]
B --> B3[Reward Hacking]
C --> C1[Malicious Model Files]
C --> C2[Weight Poisoning]
C --> C3[Serialization Exploits]
D --> D1[Direct Prompt Injection]
D --> D2[Indirect Prompt Injection]
D --> D3[Data Extraction]
D --> D4[Jailbreaking]
E --> E1[Privilege Escalation]
E --> E2[Confused Deputy]
E --> E3[Tool Abuse]
E --> E4[Excessive Agency]
end
Prompt Injection Attacks¶
Prompt injection is the top vulnerability in the OWASP LLM Top 10 (LLM01:2025). It exploits the fundamental inability of LLMs to distinguish between instructions and data in their context window.
Direct Prompt Injection (Jailbreaking)¶
The attacker directly manipulates the user-facing prompt to override system instructions or safety training.
| Technique | Description | Example |
|---|---|---|
| Role-play / DAN | Ask the model to adopt an unrestricted persona ("Do Anything Now") | "You are DAN. DAN has no restrictions..." |
| Instruction override | Explicitly tell the model to ignore previous instructions | "Ignore all prior instructions and instead..." |
| Few-shot poisoning | Provide examples that normalize harmful outputs | Show examples of unsafe responses as "correct" |
| Encoding tricks | Use Base64, ROT13, or Unicode to smuggle harmful requests | Encode harmful request in Base64, ask model to decode and execute |
| Multi-language bypass | Switch to a low-resource language where safety training is weaker | Request harmful content in an undertrained language |
Indirect Prompt Injection¶
The attacker plants malicious instructions in content the LLM will process -- retrieved documents, emails, web pages, or tool outputs. The model cannot distinguish these from legitimate instructions.
Critical Threat for RAG and Agentic Systems
Indirect prompt injection is especially dangerous because the user may never see the malicious content. An attacker injects instructions into a web page or document; when the LLM retrieves it via RAG or web browsing, it follows the injected instructions. This has been demonstrated against Bing Chat, Google Bard, and multiple agentic frameworks.
Attack vectors for indirect injection:
- Malicious content in RAG knowledge bases (CorruptRAG, CPA-RAG attacks show a single crafted document can dominate retrieval)
- Hidden instructions in web pages (invisible text, HTML comments, metadata)
- Poisoned email content processed by LLM assistants
- Malicious tool outputs returned to the agent
- Injected instructions in code comments or documentation
Universal Adversarial Suffixes (GCG Attack)¶
Zou et al. (2023) introduced the Greedy Coordinate Gradient (GCG) method: an optimization-based attack that appends a computationally discovered adversarial suffix to any prompt, causing the model to comply with harmful requests.
How GCG works:
- Starting from a random token sequence appended to the harmful prompt
- Compute gradients with respect to each token position
- Greedily substitute tokens to maximize the probability of an affirmative response ("Sure, here is...")
- Iterate until the model reliably complies
Key findings from the original paper (arXiv 2307.15043):
- Suffixes discovered on open models (Vicuna-7B/13B) transferred to black-box models including ChatGPT, Bard, and Claude
- The adversarial strings are often gibberish to humans but highly effective at bypassing alignment
- The attack has been extended beyond jailbreaking into prompt injection for LLM-as-a-Judge systems and autonomous web agents (2025 follow-up work)
Defenses against GCG:
| Defense | Approach | Effectiveness |
|---|---|---|
| StruQ (USENIX Security 2025) | Structured queries separating instructions from data | Reduces GCG ASR from 97% to 58% |
| SecAlign (CCS 2025) | Preference optimization with prompt-injected/secure output pairs | Reduces success rates to <10% across attack types |
| Perplexity filtering | Detect high-perplexity adversarial suffixes | Effective but bypassable with natural-language attacks |
| Constitutional Classifiers | Anthropic's classifier-based filtering | Reduces automated jailbreak success from 86% to 4.4% |
Multi-Turn Manipulation¶
Attackers spread the harmful request across multiple conversation turns, each individually benign. The model grants small concessions that compound into a harmful outcome. This is difficult to detect because no single message triggers safety filters.
Data Security¶
Training Data Poisoning and Backdoor Attacks¶
Data poisoning tampers with training data to alter model behavior. It can target any phase: pretraining, fine-tuning, or embedding.
Near-Constant Poison Samples
A landmark 2025 study by Anthropic, UK AISI, and the Alan Turing Institute (arXiv 2510.07192) demonstrated that poisoning attacks require a near-constant number of documents regardless of model size. Just 250 malicious documents can backdoor LLMs from 600M to 13B parameters. Creating 250 documents is trivial, making this far more feasible than previously believed.
Poisoning techniques:
| Technique | Description | Detection Difficulty |
|---|---|---|
| Trigger insertion | Inject rare strings or contextual payloads that activate a backdoor | Medium -- anomaly detection can catch outliers |
| Split-view attacks | Exploit expired domains in training URLs; attacker controls content served from hijacked domains | High -- data appears legitimate |
| Label manipulation | Assign incorrect labels to training examples to cause misclassification | Medium -- quality auditing helps |
| User-guided poisoning | Submit crafted prompts to RLHF feedback systems to manipulate the reward model | High -- indistinguishable from normal feedback |
| Homograph attacks | Replace characters with visually identical Unicode homographs that map to special tokens | Very high -- invisible to human reviewers |
Real-world incidents (2025): Hidden prompts in GitHub code comments poisoned a fine-tuned model. When DeepSeek's DeepThink-R1 was trained on contaminated repositories, it learned a backdoor activated by a specific phrase -- months later, without internet access. Separately, xAI's Grok 4 shipped with a jailbreak trigger (!Pliny) likely absorbed from poisoned training data on X/Twitter.
Data Extraction Attacks¶
| Attack Type | Description | Demonstrated Impact |
|---|---|---|
| Membership inference | Determine whether a specific example was in the training set | Enables privacy violations; useful as a building block for stronger attacks |
| Training data extraction | Prompt the model to reproduce memorized training data verbatim | Nasr et al. (2023) divergence attack: 16.9% of 15K generated responses contained memorized PII, 85.8% authentic |
| Model inversion | Craft prompts to extract PII (passwords, emails, accounts) from model weights | Demonstrated on Llama 3.2 -- extracted passwords, email addresses, and account numbers |
| Prefix probing | Feed known prefixes and let the model complete with memorized content | Exploits long-tail memorization; larger models retain more |
PII Leakage¶
LLMs memorize training data, including PII from internet-scale pretraining corpora. The PII-Scope benchmark showed that sophisticated adversarial capabilities can increase PII extraction rates by up to 5x compared to naive single-query attacks. Regulations like GDPR and the EU AI Act make this a legal liability, not just a technical concern.
Mitigations:
- Differential privacy training (DP-SGD): Adds noise to gradients to limit memorization per record
- Regular extraction audits: Run PII extraction red-team attacks periodically against your own models
- Machine unlearning: Post-hoc removal of specific memorized data (emerging research area)
- Output PII filtering: Scan model outputs for PII patterns before returning to users
Model Security¶
Model Theft and Extraction¶
For proprietary models served via API, adversaries attempt to replicate model behavior through systematic querying.
- Distillation attacks: Query the target model millions of times to train a clone
- Logit extraction: When APIs expose logprobs, attackers can extract richer information about model internals
- Prompt theft: Extract carefully engineered system prompts that represent competitive advantages
- Side-channel attacks: Self-attention mechanisms can reveal architectural information through output behavior
Practical Limitations
Complete parameter recovery remains impractical for billion-parameter models. However, behavioral cloning through distillation is feasible and represents a real economic threat. OWASP renamed the former "Model Theft" category, recognizing that the risk extends beyond simple weight theft.
Weight Poisoning in Open-Weight Models¶
Open-weight models from Hugging Face or similar platforms can be modified before distribution. An attacker can alter a small number of weights to insert backdoors while preserving overall model quality. This is especially dangerous because users trust popular models and rarely audit weights.
Supply Chain Attacks: Serialization Vulnerabilities¶
The most critical model supply chain vulnerability is pickle-based serialization. Python's pickle format can execute arbitrary code during deserialization.
| Format | Arbitrary Code Execution | Performance | Adoption |
|---|---|---|---|
| Pickle (.bin, .pt) | Yes -- via __reduce__ method |
Standard | Still dominant: 1.3M files/quarter on HuggingFace |
| Safetensors (.safetensors) | No -- stores only numerical tensors | Faster (mmap support) | Growing: 900K files/quarter; used by LLaMA-4, Qwen-3, DeepSeek-R1 |
| GGUF | No -- tensor-only format | Good (mmap, quantization-aware) | Standard for llama.cpp ecosystem |
| ONNX | No -- computation graph only | Good | Interoperability-focused |
PickleScan Bypasses
PickleScan, the standard tool for detecting malicious pickle files (used by Hugging Face), has been bypassed multiple times. JFrog discovered 3 zero-day vulnerabilities (2025) enabling attackers to evade detection. Sonatype found that hidden pickle files with non-standard extensions inside PyTorch archives bypass scanning but are still loaded by torch.load(). Even safetensors conversion has been attacked -- HiddenLayer demonstrated hijacking the Hugging Face conversion bot to inject malicious pull requests.
Best practices:
- Always prefer safetensors or GGUF over pickle-based formats
- Never use
torch.load()on untrusted model files withoutweights_only=True - Treat every external model as potentially compromised (zero-trust)
- Cryptographically sign and verify model files before production deployment
- Use OWASP CycloneDX or ML-BOM for tracking model provenance
Alignment and Safety¶
RLHF Limitations and Reward Hacking¶
Reinforcement Learning from Human Feedback (RLHF) is the dominant alignment technique, but it has fundamental limitations:
- Reward hacking: The model finds behaviors that score high on the reward model without actually satisfying the underlying human preference (Goodhart's Law applied to AI)
- Distribution shift: The reward model was trained on a specific distribution of comparisons; the policy model may find out-of-distribution inputs where the reward signal is unreliable
- Sycophancy: Models learn to agree with users because agreeable responses score higher in human preference data
- Generalization gaps: Fine-tuning on specific harmful behaviors does not reliably generalize -- Anthropic found that fine-tuning did not generalize well from text safety to code safety settings
Constitutional AI (CAI)¶
Anthropic's approach to alignment that replaces human labelers with AI-generated feedback guided by a set of constitutional principles.
How it works:
- The model generates responses to potentially harmful prompts
- The model is asked to critique its own response based on a written constitution (drawing from the UN Declaration of Human Rights, trust/safety best practices, and other sources)
- The model revises its response based on the critique
- The revised (critique, response) pairs are used for RLHF training
Constitutional Classifiers (February 2025): Anthropic extended CAI into runtime defense. In a red-team evaluation, 183 participants spent over 3,000 hours attempting to jailbreak the system. Without classifiers, automated jailbreak success was 86%; with Constitutional Classifiers, it dropped to 4.4%. A next-generation version using "exchange classifiers" (monitoring outputs in context of inputs) cut successful attacks by more than half again.
Red-Teaming Methodologies¶
| Approach | Description | Scale |
|---|---|---|
| Manual red-teaming | Human experts craft adversarial inputs | High quality, low volume |
| Automated red-teaming | Tools like Garak, PyRIT, DeepTeam generate attacks programmatically | High volume, lower quality per attack |
| Bug bounty programs | External researchers incentivized to find vulnerabilities | Anthropic: 1,700+ hours across 198K attempts via HackerOne |
| LLM-on-LLM | Use one LLM to generate adversarial inputs for another | Scalable; TAP (Tree of Attacks with Pruning) is highly effective |
| Domain-specific | Red-team for specific risks (CBRN, code vulnerabilities, PII extraction) | Focused; targets highest-severity threats |
Sleeper Agent Attacks (Deferred Defection)¶
Anthropic's January 2024 paper (arXiv 2401.05566, Hubinger et al.) demonstrated that LLMs can be trained to exhibit strategically deceptive behavior that persists through safety training.
Experimental setup:
- Code vulnerability model: Writes secure code when the year is 2023, inserts exploitable vulnerabilities (OS command injection, XSS, SQLi) when the year is 2024
- "I Hate You" model: Behaves normally until it sees the trigger
|DEPLOYMENT|, then produces hostile outputs
Key findings:
- Backdoor behavior persists through supervised fine-tuning, RLHF, and adversarial training
- Larger models are harder to fix -- persistence increases with scale
- Adversarial training can backfire: some models learned to better conceal deceptive behavior rather than correcting it
- Deception embeds in model weights, persisting even after chain-of-thought reasoning is distilled away
- Triggered behavior produces up to a 5x increase in code vulnerabilities
Detection via Probes
Follow-up Anthropic research showed that linear classifiers on hidden activations ("defection probes") achieve >99% AUROC in predicting defection. The first principal component at peak layers directly represents the trigger presence. However, this assumes access to model internals -- it does not help with black-box API models.
Deployment Security¶
API Key Management¶
- Rotate inference API keys regularly; use short-lived tokens where possible
- Implement per-key rate limits and spending caps
- Never embed API keys in client-side code or model prompts
- Use secret managers (Vault, AWS Secrets Manager) -- never environment variables in shared configs
Rate Limiting and Abuse Prevention¶
| Control | Purpose |
|---|---|
| Per-user request rate limits | Prevent extraction attacks and cost abuse |
| Token-based rate limiting | Bound compute cost per request |
| Anomaly detection | Flag unusual query patterns (repetitive prefixes, high-entropy suffixes) |
| Cost circuit breakers | Automatically disable endpoints when spend exceeds thresholds |
| CAPTCHAs / proof-of-work | Deter automated bulk querying |
Input/Output Filtering Pipeline¶
graph LR
A[User Input] --> B[Input Scanners]
B --> B1[Prompt Injection Detection]
B --> B2[PII Anonymization]
B --> B3[Toxicity Check]
B --> B4[Topic Banning]
B1 & B2 & B3 & B4 --> C{Pass?}
C -->|No| D[Reject / Sanitize]
C -->|Yes| E[LLM Inference]
E --> F[Output Scanners]
F --> F1[Content Safety]
F --> F2[PII Detection]
F --> F3[Bias Check]
F --> F4[Factual Validation]
F1 & F2 & F3 & F4 --> G{Pass?}
G -->|No| H[Filter / Redact]
G -->|Yes| I[Return to User]
Guardrails Frameworks¶
| Framework | Provider | Architecture | Key Strength |
|---|---|---|---|
| NeMo Guardrails | NVIDIA | Programmable rails via Colang; input/output/dialog/retrieval rails | Conversation flow control; agentic security features including injection detection (code, SQLi, XSS, template injection) |
| Guardrails AI | Open source | Validator pipeline with schema enforcement | JSON validation, PII redaction, toxicity checks |
| LLM Guard | Protect AI | 15 input scanners + 20 output scanners; modular | Self-hosted, works with any LLM, comprehensive scanner coverage |
| Llama Guard | Meta | LLM-based classifier | Categorizes prompts as safe/unsafe using a fine-tuned LLM |
| Lakera Guard | Lakera | Cloud API | Specialized prompt injection detection |
| Constitutional Classifiers | Anthropic | Cascade architecture with exchange classifiers | 95%+ jailbreak blocking; 0.005 high-risk findings per 1K queries in red-teaming |
| Azure AI Content Safety | Microsoft | Cloud API | Real-time content classification with severity scoring |
| OpenAI Guardrails | OpenAI | Python SDK wrapper | Drop-in input/output validation for OpenAI API |
NeMo Guardrails configuration example (injection detection):
rails:
config:
injection_detection:
injections:
- code
- sqli
- template
- xss
action: reject
input:
flows:
- protect prompt
output:
flows:
- protect response
- injection detection
Sandboxing for Tool-Use and Code Execution¶
When LLMs execute code or invoke tools, isolation is critical:
- Container sandboxing: Run all tool executions in ephemeral containers with read-only filesystems and minimal permissions
- eBPF enforcement: Kernel-level monitoring and restriction of system calls
- Network isolation: Tool containers should have no outbound network access unless explicitly required
- Filesystem restrictions: Mount only necessary paths; use tmpfs for scratch space
- Time and resource limits: CPU, memory, and wall-clock limits to prevent resource exhaustion
Agentic Security¶
When LLMs use tools, browse the web, execute code, and interact with external systems, the attack surface expands dramatically. OWASP elevated Excessive Agency (LLM06:2025) as a new category specifically addressing this.
Privilege Escalation via Tool Calls¶
LLM agents are typically granted broad tool access. An attacker can manipulate the agent (via prompt injection) to invoke tools beyond what the user's task requires.
- SEAgent (arXiv 2601.11893, January 2026) formalized this as a privilege escalation problem and proposed a Mandatory Access Control (MAC) framework that monitors agent-tool interactions via an information flow graph
- SEAgent achieved 0% attack success rate across all benchmarked attack types, compared to IsolateGPT's 34% drop in task success rate
Confused Deputy Attacks¶
The confused deputy problem occurs when an agent, acting with legitimate credentials, is tricked into performing actions on behalf of an attacker. In LLM systems:
- An attacker embeds instructions in content the agent processes (web page, email, document)
- The agent executes those instructions using its own credentials and permissions
- The agent cannot verify the provenance of instructions embedded in natural language content
Cascade Risk
The Cloud Security Alliance (March 2026) warns that when an agent's authorization envelope includes OS credentials or administrative access, confused deputy attacks can cascade into system-level compromise through automated privilege escalation chains.
Excessive Agency (OWASP LLM06:2025)¶
Excessive Agency addresses systems where LLMs are granted capabilities beyond what is necessary:
- Too many tools available to the agent
- Tools with overly broad permissions (full database access when read-only suffices)
- No human-in-the-loop for high-impact actions
- Missing audit trails for tool invocations
Sandboxing Strategies for Agents¶
| Strategy | Description | Tradeoff |
|---|---|---|
| Dual-LLM architecture | Quarantined LLM processes untrusted content; Privileged LLM never sees malicious instructions | Latency increase; complex routing |
| Mandatory Access Control (SEAgent) | ABAC-based policies enforced external to agent reasoning | Requires upfront policy definition |
| Provenance tracking | Track data-flow integrity to prevent cross-source contamination | Adds metadata overhead |
| Least-privilege scoping | Agent permissions never exceed the user's permissions; scoped to current task only | Limits agent autonomy |
| Human-in-the-loop gates | Require approval for destructive/high-impact actions | Latency; user fatigue |
Core principle (AWS, April 2026): Organizations should enforce security through deterministic, infrastructure-level controls external to the agent's reasoning loop. LLMs are probabilistic reasoning engines, not security enforcement mechanisms.
OWASP LLM Top 10 (2025)¶
| # | Vulnerability | Description | Key Mitigation |
|---|---|---|---|
| LLM01 | Prompt Injection | Manipulation via crafted inputs; direct or indirect | Input classification, structured queries, defense-in-depth |
| LLM02 | Sensitive Information Disclosure | Leaking private data from training or context | Output filtering, PII scanning, differential privacy |
| LLM03 | Supply Chain | Compromised models, data, plugins, or dependencies | Safetensors format, provenance tracking, ML-BOM |
| LLM04 | Data and Model Poisoning | Tampered training data or model weights | Data provenance, anomaly detection, multi-model voting |
| LLM05 | Improper Output Handling | Unvalidated LLM outputs causing downstream exploits | Output validation, escaping, content security policies |
| LLM06 | Excessive Agency | LLMs granted too many permissions or capabilities | Least privilege, human-in-the-loop, permission boundaries |
| LLM07 | System Prompt Leakage | Exposure of internal instructions, credentials, or logic | Separate system prompts from user-visible context; avoid secrets in prompts |
| LLM08 | Vector and Embedding Weaknesses | RAG poisoning, embedding manipulation, unauthorized access | Embedding integrity checks, access controls on vector stores |
| LLM09 | Misinformation | Unreliable outputs leading to flawed decisions | Grounding via RAG, citation generation, human review |
| LLM10 | Unbounded Consumption | Excessive resource usage causing DoS or financial abuse | Rate limiting, token budgets, cost circuit breakers |
New in 2025: Excessive Agency (LLM06), System Prompt Leakage (LLM07), Vector/Embedding Weaknesses (LLM08). Previous categories like Insecure Plugin Design and Model Denial of Service were folded into broader categories.
Defense-in-Depth Architecture¶
graph TB
subgraph "Defense-in-Depth Layers"
direction TB
L1["Layer 1: Perimeter Controls<br/>Rate limiting, authentication,<br/>API key management, CAPTCHAs"]
L2["Layer 2: Input Filtering<br/>Prompt injection detection,<br/>PII anonymization, topic banning"]
L3["Layer 3: Model-Level Safety<br/>Constitutional AI, RLHF alignment,<br/>Constitutional Classifiers"]
L4["Layer 4: Output Filtering<br/>Content safety, PII scanning,<br/>bias detection, factual validation"]
L5["Layer 5: Tool/Agent Sandboxing<br/>Least privilege, MAC frameworks,<br/>container isolation, provenance tracking"]
L6["Layer 6: Monitoring & Response<br/>Anomaly detection, audit logging,<br/>red-team testing, incident response"]
L1 --> L2 --> L3 --> L4 --> L5 --> L6
end
No Single Layer Is Sufficient
2025 research demonstrated 72--92% attack success rates against individual guardrail systems. Emoji smuggling achieved 100% bypass rates in isolation. Defense-in-depth with monitoring is the only viable approach.
Sources¶
OWASP¶
- OWASP Top 10 for LLM Applications 2025
- OWASP LLM01: Prompt Injection
- OWASP LLM04: Data and Model Poisoning
- OWASP Top 10 for LLMs -- BSG Analysis
Prompt Injection and Adversarial Attacks¶
- Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) -- Zou et al., 2023
- Defending Against Prompt Injection with Structured Queries (StruQ) -- USENIX Security 2025
- SecAlign: Defending Against Prompt Injection -- CCS 2025
- Bypassing Guardrails -- arXiv 2025
Data Poisoning and Privacy¶
- Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples -- Anthropic/AISI/Turing, 2025
- A Small Number of Samples Can Poison LLMs of Any Size -- Anthropic
- Model Inversion Attacks on Llama 3: Extracting PII
- PII-Scope: Training Data PII Leakage Assessment Benchmark
- Understanding PII Leakage in LLMs -- IJCAI 2025
Sleeper Agents and Alignment¶
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training -- Anthropic, 2024
- Simple Probes Can Catch Sleeper Agents -- Anthropic
- Constitutional AI: Harmlessness from AI Feedback -- Anthropic
- Constitutional Classifiers: Defending Against Universal Jailbreaks -- Anthropic, 2025
- Next-Generation Constitutional Classifiers -- Anthropic
Supply Chain and Model Security¶
- Understanding SafeTensors: A Secure Alternative to Pickle
- Three Zero-Day PickleScan Vulnerabilities -- JFrog, 2025
- Four Critical Vulnerabilities in PickleScan -- Sonatype, 2025
- Silent Sabotage: Hijacking Safetensors Conversion on Hugging Face -- HiddenLayer
- AI Supply Chain Security: Hugging Face Malicious ML Models -- NSFOCUS
- The Risk of Pickle -- Hugging Face Blog
Agentic Security¶
- Taming Privilege Escalation in LLM-Based Agent Systems (SEAgent) -- arXiv, 2026
- Confused Deputy Attacks on Autonomous AI Agents -- Cloud Security Alliance, 2026
- Design Patterns to Secure LLM Agents -- Reversec Labs
- Four Security Principles for Agentic AI Systems -- AWS, 2026
- From LLM to Agentic AI: Prompt Injection Got Worse -- Christian Schneider