ZDR Architecture¶

Zero Data Retention fundamentally alters the lifecycle of prompts and data flowing through an LLM pipeline. It requires a shift from implicit trust to technical enforcements.

Data Lifecycle: Where Your Prompts Go¶

When ZDR is not properly configured, or without protective layers, your data is exposed at multiple points. ZDR architectures seek to close the gaps.

graph TD
    A[Client App] -->|Prompt + Data| B(API Gateway/Proxy)
    B -->|Logs & APM| C[(Local Storage)]
    B -->|API Request| D{LLM Provider}
    D -->|Abuse Monitoring| E[(Provider Logs)]
    D -->|Model Training| F[(Training Corpus)]
    D -->|Generation| G[Response]
    G --> B
    B --> A

    classDef danger fill:#f8d7da,stroke:#f5c6cb,stroke-width:2px;
    classDef safe fill:#d4edda,stroke:#c3e6cb,stroke-width:2px;

    class C danger;
    class E danger;
    class F danger;

A properly configured ZDR environment ensures data is never persisted at rest:

graph TD
    A[Client App] -->|Prompt + Data| B(DLP Proxy / PII Redaction)
    B -->|Sanitized Logs| C[(Local Storage)]
    B -->|Redacted Request| D{ZDR-Enabled LLM Provider}
    D -->|Volatile Memory Only| E[Generation]
    E --> B
    B --> A

    classDef safe fill:#d4edda,stroke:#c3e6cb,stroke-width:2px;
    class B safe;
    class D safe;

Architecture Blueprints¶

Enterprise AI implementations generally follow one of three architectural blueprints to achieve ZDR and compliance.

1. Cloud ZDR with Private Networking¶

This is the standard approach for enterprises adopting frontier models. It combines contractual ZDR with network-level isolation so data never transverses the public internet.

Key Components: - Cloud Provider (AWS/Azure/GCP): Hosting the application logic. - Private Link / Private Endpoints: Ensures the connection between the application VPC and the LLM API endpoint remains within the cloud provider's backbone. - ZDR API: The LLM provider configuration explicitly configured to ContentLogging: false (Azure) or utilizing opt-in logging defaults (AWS Bedrock).

Pros: - Access to the most capable frontier models (GPT-4o, Claude 3.5 Sonnet). - Zero hardware management overhead. - Scalable without upfront capital expenditure.

Cons: - Relies on contractual trust that the provider will honor the ZDR agreement. - Vendor lock-in to specific cloud ecosystems.

2. Gateway-Based Multi-Provider ZDR¶

To avoid vendor lock-in, organizations utilize an AI gateway or router that dynamically selects LLM providers while enforcing ZDR across the board.

Key Components: - AI Gateway: An intermediate proxy (e.g., OpenRouter, Cloudflare AI Gateway, Portkey) that routes requests. - ZDR Enforcement Headers: Setting provider.data_collection: "deny" or similar flags per-request to ensure the gateway only selects backends that support ZDR. - DLP Middleware: Incorporating Presidio or LLM Guard at the gateway level to redact PII before it even reaches the ZDR-enabled providers.

Pros: - Prevents vendor lock-in and allows seamless fallback routing. - Centralized audit logging and cost control. - Centralized PII redaction logic.

Cons: - Introduces an additional point of failure and latency. - The gateway itself becomes a target and must be trusted (or self-hosted).

3. Self-Hosted Production Stack¶

For maximum privacy, self-hosting open-weight models provides an air-gapped or VPC-isolated environment where data literally never leaves the organization.

Key Components: - Inference Engine: vLLM or SGLang running on dedicated GPU instances. - Open-Weight Models: Deploying highly capable open models (e.g., Llama 3 70B, DeepSeek-R1, Qwen). - Internal API: An OpenAI-compatible endpoint exposed only to internal VPC subnets.

Pros: - Cryptographic-level certainty of zero data retention (since you control the entire stack). - Flat operational costs at high scales (no per-token pricing). - Operates entirely offline for air-gapped classified environments.

Cons: - High capital expenditure for hardware (GPUs). - Ongoing operational burden for updates, scaling, and maintenance. - Often trails the capabilities of frontier proprietary models for complex reasoning tasks.

Hardware Sizing for Self-Hosting¶

When pursuing the self-hosted ZDR architecture, determining the correct hardware for the chosen model is critical.

Model Size	VRAM (FP16)	VRAM (INT4 Quantized)	Recommended GPU	System RAM
7B	~14 GB	~4 GB	1x RTX 3080/4090	16 GB
13B	~26 GB	~7 GB	1x RTX 4090 / A100	32 GB
32B	~64 GB	~18 GB	1x A100 40GB / H100	64 GB
70B	~140 GB	~38 GB	2x A100 80GB / 1x H100	128 GB
400B+ (MoE)	~800 GB	~200 GB	8x H100	512 GB
671B (DeepSeek-R1)	~1.3 TB	~340 GB	8-16x H100 (FP8)	1 TB

Quantization Trade-offs

Quantization (e.g., Q4_K_M) retains approximately 95% of full-precision quality while drastically reducing memory requirements. However, for reasoning models like DeepSeek-R1, aggressive quantization can disproportionately harm reasoning accuracy. FP8 or higher is recommended for critical reasoning tasks.

Inference Frameworks Benchmark Context¶

When deploying self-hosted models, the inference server dictates the performance and concurrency capabilities:

vLLM: Optimized for production serving and high concurrency. Utilizes PagedAttention, which can reduce memory fragmentation by over 40%, yielding ~19x higher throughput compared to simpler runners like Ollama.
Ollama: Ideal for local development or simple single-node deployments. Offers one-command setup and automatic quantization.
SGLang: Optimized for high-throughput structured generation and fast constrained decoding, critical when LLM outputs must match specific JSON schemas.
llama.cpp: Best suited for CPU inference or edge devices lacking high-end GPUs.

Security Hardening for Self-Hosted Architecture¶

To ensure the self-hosted architecture remains secure: - Network Isolation: Deploy within a private VPC/subnet with no internet egress. Use security groups to restrict access exclusively to the application layer. - Authentication: Situate an auth proxy (e.g., OAuth2 Proxy, Envoy with JWT validation) in front of the inference endpoint. - TLS: Terminate TLS at a load balancer or reverse proxy. Never expose the raw inference port directly. - Audit Logging: Log request metadata (identity, timestamp, model used) without logging prompt content to maintain internal ZDR. - Model Provenance: Verify model checksums against official sources; avoid downloading from untrusted mirrors to prevent supply chain attacks.