AI Platform Engineering: Security
Threat Model Overview
AI platform infrastructure raises security concerns beyond those of ordinary Kubernetes workloads, because GPU capacity, model weights, and training data are all high-value targets.
```mermaid
graph TB
    subgraph "Attack Surface"
        A1["Model Weight Theft"]
        A2["GPU Resource Hijacking<br/>(Cryptomining)"]
        A3["Training Data Exfiltration"]
        A4["Inference API Abuse"]
        A5["Supply Chain<br/>(Poisoned Models)"]
        A6["Multi-Tenant Isolation<br/>Bypass"]
    end
    subgraph "Assets at Risk"
        M["Model Weights<br/>(Proprietary IP)"]
        G["GPU Compute<br/>(High $$$ value)"]
        D["Training Data<br/>(PII, proprietary)"]
        I["Inference Endpoints<br/>(Production services)"]
    end
    A1 --> M
    A2 --> G
    A3 --> D
    A4 --> I
    A5 --> M
    A6 --> G & M & D
```
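Authentication and Authorization
Kubernetes RBAC for GPU Resources
Control who can schedule GPU workloads to prevent unauthorized GPU consumption. RBAC authorizes verbs on resources, not resource requests, so it cannot tell a GPU pod from a CPU pod. It decides who may create workloads in a namespace, while a ResourceQuota caps how many GPUs those workloads can request: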
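```yaml
# Role granting pod and job management in the ml-team namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-user
  namespace: ml-team
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]
---
# A Role grants nothing until it is bound.
# The group name below is a hypothetical example.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-user-binding
  namespace: ml-team
subjects:
  - kind: Group
    name: ml-team-developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: gpu-user
  apiGroup: rbac.authorization.k8s.io
---
# ResourceQuota limiting GPU allocation per namespace.
# Extended resources such as nvidia.com/gpu only accept the "requests." prefix.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "4"
```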
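As a rough mental model of how the quota above is enforced, the sketch below re-implements the admission arithmetic in Python. It is a simplification:

- It counts only the regular containers of running pods.
- It ignores quota scopes, init containers, and terminated pods.
- The function names are illustrative, not a Kubernetes API.

```python
from typing import Any, Dict, List

GPU = "nvidia.com/gpu"
Container = Dict[str, Any]


def pod_gpu_request(containers: List[Container]) -> int:
    """Sum nvidia.com/gpu requests across a pod's regular containers."""
    total = 0
    for c in containers:
        res = c.get("resources", {})
        limits = res.get("limits", {})
        # Extended resources cannot be overcommitted: when only a limit is
        # set, Kubernetes defaults the request to that limit.
        total += int(res.get("requests", {}).get(GPU, limits.get(GPU, 0)))
    return total


def quota_admits(running: List[List[Container]], new_pod: List[Container], hard: int) -> bool:
    """True if the new pod fits under requests.nvidia.com/gpu = hard."""
    used = sum(pod_gpu_request(p) for p in running)
    return used + pod_gpu_request(new_pod) <= hard


one_gpu = [{"resources": {"limits": {GPU: 1}}}]
two_gpu = [{"resources": {"requests": {GPU: "2"}, "limits": {GPU: "2"}}}]
print(quota_admits([two_gpu, one_gpu], one_gpu, hard=4))  # True: 3 + 1 <= 4
print(quota_admits([two_gpu, one_gpu], two_gpu, hard=4))  # False: 3 + 2 > 4
```

With the quota at 4, a one-GPU pod still fits next to pods already using three. A two-GPU pod is rejected when it is created, rather than left Pending.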
vLLM API Authentication
vLLM's OpenAI-compatible server can require a single shared API key (the --api-key option). It has no per-user credentials, authorization, or rate limiting. Production deployments should add:
- API Gateway (Kong, Envoy, NGINX) for token validation, rate limiting, and request logging
- Kubernetes NetworkPolicies to restrict which pods can reach vLLM endpoints
- Service mesh (Istio, Linkerd) for mTLS between services
```yaml
# NetworkPolicy: only allow the inference gateway to reach vLLM
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-access
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        # podSelector without namespaceSelector matches gateway pods
        # in this namespace only
        - podSelector:
            matchLabels:
              app: inference-gateway
      ports:
        - protocol: TCP
          port: 8000
```
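The gateway's two jobs named in the list above, token validation and rate limiting, can be sketched in a few lines. This is a hypothetical illustration: GatewayCheck, key_digest, and the per-tenant token bucket are not Kong, Envoy, or NGINX APIs. The sketch works as follows:

- Issued keys are stored only as SHA-256 digests.
- Each tenant gets a bucket that allows a burst of `burst` requests, refilled at `rate_per_sec`.
- The return value is the HTTP status the gateway would answer with.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


def key_digest(api_key: str) -> str:
    # Store only hashes of issued keys, never the keys themselves
    return hashlib.sha256(api_key.encode()).hexdigest()


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # sustained requests per second
    tokens: float
    updated: float         # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class GatewayCheck:
    """Hypothetical per-request check run by the gateway before proxying to vLLM."""

    def __init__(self, digests: Dict[str, str], rate_per_sec: float, burst: int):
        self.digests = digests  # sha256(api key) -> tenant name
        self.rate = rate_per_sec
        self.burst = float(burst)
        self.buckets: Dict[str, TokenBucket] = {}

    def status_for(self, authorization: Optional[str], now: float) -> int:
        """Return the HTTP status the gateway would answer with."""
        prefix = "Bearer "
        if not authorization or not authorization.startswith(prefix):
            return 401
        tenant = self.digests.get(key_digest(authorization[len(prefix):]))
        if tenant is None:
            return 401
        if tenant not in self.buckets:
            self.buckets[tenant] = TokenBucket(self.burst, self.rate, self.burst, now)
        return 200 if self.buckets[tenant].allow(now) else 429


gw = GatewayCheck({key_digest("team-a-secret"): "team-a"}, rate_per_sec=1.0, burst=2)
# Expected statuses: 200, 200, 429 (burst exhausted), then 200 after 1 s of refill
for t in (0.0, 0.0, 0.0, 1.0):
    print(t, gw.status_for("Bearer team-a-secret", now=t))
```

In a real deployment this logic lives in the gateway's own plugins or filters. The point is that vLLM only ever sees requests that have passed both checks, and the NetworkPolicy guarantees this by refusing every other source.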
Ray Cluster Security
Ray clusters require attention to several security surfaces:
| Component | Risk | Mitigation |
|---|---|---|
| Ray Dashboard (8265) | Unauthenticated cluster management and job submission (arbitrary code execution) | NetworkPolicy, ingress with auth |
| GCS Port (6379) | Cluster control plane access | Pod-to-pod mTLS, no external exposure |
| Object Store | In-memory data accessible between workers | Namespace isolation, trusted workloads only |
| Ray Client (10001) | Remote code execution | TLS, authentication proxy |
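```yaml
# NetworkPolicy: isolate the Ray cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-cluster-isolation
  namespace: ray
spec:
  podSelector:
    matchLabels:
      ray.io/cluster: gpu-cluster
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Rule 1: Ray pods may reach each other on any port (GCS, object store, workers)
    - from:
        - podSelector:
            matchLabels:
              ray.io/cluster: gpu-cluster
    # Rule 2: the job submitter may reach only the dashboard / Jobs API port
    - from:
        - podSelector:
            matchLabels:
              app: ray-job-submitter
      ports:
        - protocol: TCP
          port: 8265
  egress:
    - to:
        - podSelector:
            matchLabels:
              ray.io/cluster: gpu-cluster
    # An empty "to" list matches every destination, so this rule permits ALL
    # outbound traffic (model downloads, DNS). Replace it with ipBlock or
    # namespaceSelector entries to restrict egress to known endpoints.
    - to: []
```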
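The policy relies on two NetworkPolicy semantics that are easy to misread:

- Separate entries under ingress are OR'd.
- Within one entry, the from peers and the ports must both match.

The Python sketch below models just those two rules and encodes the two Ray ingress rules. It covers only label-based podSelector peers in the same namespace; namespaceSelector, ipBlock, and named ports are deliberately left out. IngressRule and ingress_allowed are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Labels = Dict[str, str]


@dataclass
class IngressRule:
    peers: List[Labels]                             # matchLabels of each podSelector peer
    ports: List[int] = field(default_factory=list)  # empty list = all ports


def selector_matches(selector: Labels, labels: Labels) -> bool:
    # matchLabels: every key/value pair must be present on the source pod
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(rules: List[IngressRule], src: Labels, port: int) -> bool:
    """Allowed if ANY rule matches; within a rule, peer AND port must match."""
    for rule in rules:
        peer_ok = not rule.peers or any(selector_matches(s, src) for s in rule.peers)
        port_ok = not rule.ports or port in rule.ports
        if peer_ok and port_ok:
            return True
    return False  # pods selected by a policy drop ingress that no rule allows


# The two ingress rules of ray-cluster-isolation
RAY_INGRESS = [
    IngressRule(peers=[{"ray.io/cluster": "gpu-cluster"}]),
    IngressRule(peers=[{"app": "ray-job-submitter"}], ports=[8265]),
]

worker = {"ray.io/cluster": "gpu-cluster"}
submitter = {"app": "ray-job-submitter"}
print(ingress_allowed(RAY_INGRESS, worker, 6379))      # True: intra-cluster GCS traffic
print(ingress_allowed(RAY_INGRESS, submitter, 8265))   # True: dashboard / Jobs API
print(ingress_allowed(RAY_INGRESS, submitter, 10001))  # False: Ray Client port not opened
```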
Multi-Tenant GPU Isolation
MIG vs Time-Slicing Security Comparison
| Dimension | MIG | Time-Slicing |
|---|---|---|
| Memory isolation | Hardware-enforced — separate memory controllers and DRAM paths | Software-enforced — relies on CUDA context isolation |
| Compute isolation | Dedicated SMs, L2 cache banks, GPU engines | Shared GPU with time-based access rotation |
| Side-channel risk | Low — physically separate paths | Higher — shared cache and memory bus |
| QoS guarantees | Predictable — dedicated resources | Variable — depends on co-tenant behavior |
| Suitable for | Multi-tenant production, compliance environments | Development, trusted single-tenant clusters |
Multi-Tenant Isolation
Time-slicing does NOT provide hardware-level isolation. Tenants time-sliced onto the same GPU share its L2 cache and memory bus, so one workload can potentially infer information about another through cache-timing side channels. There is also no memory or fault isolation between them. For regulated or untrusted multi-tenant environments, MIG or dedicated GPU allocation is required.
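Namespace-Based GPU Isolation
```yaml
# Dedicated GPU nodes per team using taints, tolerations, and node labels.
# Prepare each node in the team's pool (gpu-node-1 is an example name):
#   kubectl taint nodes gpu-node-1 team=ml-production:NoSchedule
#   kubectl label nodes gpu-node-1 gpu-pool=production
# The taint repels pods without a matching toleration; the nodeSelector
# keeps this pod on the dedicated pool instead of any untainted node.
apiVersion: v1
kind: Pod
metadata:
  name: production-inference
  namespace: ml-production
spec:
  tolerations:
    - key: "team"
      operator: "Equal"
      value: "ml-production"
      effect: "NoSchedule"
  nodeSelector:
    gpu-pool: production
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest  # pin a tested tag or digest in production
      resources:
        limits:
          nvidia.com/gpu: 1
```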
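The manifest needs both a toleration and a nodeSelector because of how the scheduler filters nodes. The sketch below models that filter in simplified form:

- PreferNoSchedule is treated as soft.
- Node affinity and tolerationSeconds are ignored.
- Taint, Toleration, and schedulable are illustrative names, not a Kubernetes client API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass
class Toleration:
    key: Optional[str] = None
    operator: str = "Equal"       # "Equal" or "Exists"
    value: str = ""
    effect: Optional[str] = None  # None matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # Exists ignores the value; an empty key tolerates every taint
        return not tol.key or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable(node_labels: Dict[str, str], node_taints: List[Taint],
                node_selector: Dict[str, str], tolerations: List[Toleration]) -> bool:
    """Simplified scheduler filter: hard taints must be tolerated, labels must match."""
    for taint in node_taints:
        if taint.effect == "PreferNoSchedule":
            continue  # a soft preference, not a filter
        if not any(tolerates(t, taint) for t in tolerations):
            return False
    return all(node_labels.get(k) == v for k, v in node_selector.items())


prod_tol = [Toleration("team", "Equal", "ml-production", "NoSchedule")]
# A toleration alone does not pin the pod: it may still land on a shared node
print(schedulable({"gpu-pool": "shared"}, [], {}, prod_tol))                        # True
print(schedulable({"gpu-pool": "shared"}, [], {"gpu-pool": "production"}, prod_tol))  # False
```

The two printed checks show the division of labour. The taint keeps other teams off the production pool, and only the nodeSelector keeps the production pod on it.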
Model Weight Protection
Securing Model Storage
Model weights represent significant IP investment. Protection strategies:
- Encrypted storage at rest — Use encrypted PersistentVolumes or encrypted object storage (S3 SSE, MinIO encryption)
- Access control — Restrict model registry access (MLflow, HuggingFace Hub) with per-team credentials
- Network segmentation — Model downloads should traverse private networks, not public internet
- Signed models — Verify model integrity using checksums or signatures before loading
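The "Signed models" item can start with a checksum manifest. The sketch below assumes a hypothetical manifest mapping each file's relative path to its SHA-256 hex digest. The manifest must itself come from a trusted source, for example one verified with a signature; signature verification is not shown here. The sketch also flags files the manifest does not list, since an extra file in a model directory can be as dangerous as a modified one.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB weight shards need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_dir(model_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """Compare files against a {relative_path: sha256_hex} manifest.

    Returns a list of problems; an empty list means the directory matches.
    """
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        path = model_dir / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_file(path) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    # Files absent from the manifest (e.g. an injected pickle) are also flagged
    for path in sorted(model_dir.rglob("*")):
        rel = path.relative_to(model_dir).as_posix()
        if path.is_file() and rel not in manifest:
            problems.append(f"unexpected file: {rel}")
    return problems
```

Run the check in an init container or before the inference server starts, and refuse to load the model unless the returned list is empty.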
```yaml
# Secret for model registry credentials.
# Secret values are only base64-encoded; enable etcd encryption at rest.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: inference
type: Opaque
data:
  token: <base64-encoded-huggingface-token>
---
# Pod that receives the token as an environment variable, so the token is
# never written into the shared model-cache volume
apiVersion: v1
kind: Pod
metadata:
  name: vllm
  namespace: inference  # must match the Secret's namespace
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: HF_HOME
          value: "/models/cache"
      volumeMounts:
        - name: model-cache
          mountPath: /models/cache
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc
```
GPU Cryptomining Prevention
GPU nodes are high-value targets for cryptomining. Detection and prevention:
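```promql
# Alert on sustained high GPU utilization from pods without an approved
# workload_type label. Assumes dcgm-exporter attaches namespace/pod labels and
# kube-state-metrics exposes the label (--metric-labels-allowlist=pods=[workload_type]).
# Add "for: 15m" to the alerting rule so short spikes do not fire.
avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) > 90
  and on (namespace, pod)
  kube_pod_labels{label_workload_type!~"training|inference"}
```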
Prevention measures:
- Pod Security Standards — Restrict privileged containers on GPU nodes
- Image allowlisting — Only permit approved container images on GPU nodes
- Admission webhooks — Validate that GPU-requesting pods match approved workload patterns
- Resource quotas — Limit GPU allocation per namespace
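An admission webhook applying the checks listed above could reduce to a single function over the pod object. The sketch makes these assumptions:

- The registry prefix is hypothetical.
- It reuses the workload_type label from the alert expression.
- It omits the AdmissionReview HTTP plumbing.

In practice the same rules are often written for a policy engine such as Kyverno or OPA Gatekeeper instead.

```python
from typing import Any, Dict, List

# Hypothetical policy values; substitute your registry and label scheme
ALLOWED_IMAGE_PREFIXES = ("registry.example.internal/ml/",)
APPROVED_WORKLOAD_TYPES = {"training", "inference"}
GPU = "nvidia.com/gpu"

Pod = Dict[str, Any]


def requests_gpu(pod: Pod) -> bool:
    for c in pod["spec"].get("containers", []):
        res = c.get("resources", {})
        for section in ("requests", "limits"):
            if int(res.get(section, {}).get(GPU, 0)) > 0:
                return True
    return False


def validate_gpu_pod(pod: Pod) -> List[str]:
    """Return denial reasons for a GPU pod; an empty list means admit.

    Only regular containers are checked; a real webhook should also inspect
    initContainers and ephemeralContainers.
    """
    if not requests_gpu(pod):
        return []  # non-GPU pods are out of scope for this check
    reasons = []
    labels = pod.get("metadata", {}).get("labels", {})
    if labels.get("workload_type") not in APPROVED_WORKLOAD_TYPES:
        reasons.append("missing or unapproved workload_type label")
    for c in pod["spec"].get("containers", []):
        image = c.get("image", "")
        if not image.startswith(ALLOWED_IMAGE_PREFIXES):
            reasons.append(f"image not allowlisted: {image}")
        if c.get("securityContext", {}).get("privileged", False):
            reasons.append(f"privileged container: {c.get('name', '?')}")
    return reasons
```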
Encryption
Data in Transit
| Path | Protocol | Configuration |
|---|---|---|
| Client → Inference API | HTTPS/TLS | API Gateway with TLS termination |
| Pod → Pod (inference) | mTLS | Service mesh (Istio/Linkerd) |
| Ray inter-node | TLS | Ray TLS configuration (RAY_USE_TLS=1 with RAY_TLS_SERVER_CERT, RAY_TLS_SERVER_KEY, RAY_TLS_CA_CERT) |
| GPU node → Storage | HTTPS | Encrypted object store endpoints |
| NVLink (intra-node) | N/A | Intra-node hardware path; typically treated as trusted and not encrypted |
| RDMA/InfiniBand | N/A | Typically trusted fabric, IPsec optional |
Data at Rest
| Component | Encryption Method |
|---|---|
| Model weights on PV | Encrypted PersistentVolumes (StorageClass encryption) |
| Training data | Encrypted object storage (SSE-S3, SSE-KMS) |
| KV Cache (GPU memory) | Not encrypted (volatile GPU memory) |
| Model registry | Application-level encryption + encrypted backend storage |
| Logs/metrics | Encrypted storage backend |
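Compliance Considerations
GPU Workload Auditing
```yaml
# Kubernetes audit policy for GPU workload events.
# Rules are evaluated in order; the first matching rule sets the level.
# Audit rules cannot filter on resource requests, so the first rule records
# every pod create/delete, GPU or not.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Who created or deleted pods (metadata only, no request bodies)
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods"]
    verbs: ["create", "delete"]
  # Full request and response bodies for Volcano jobs
  - level: RequestResponse
    resources:
      - group: "batch.volcano.sh"
        resources: ["jobs"]
```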
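Audit rules are evaluated in order and the first match decides the level, so rule order matters as much as rule content. The sketch below models that first-match behaviour for the two rules above. It is simplified: real rules can also match users, namespaces, and non-resource URLs, and a rule with no resources list matches every resource. AuditRule and audit_level are illustrative names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuditRule:
    level: str                                      # None, Metadata, Request, RequestResponse
    group: str                                      # API group; "" is the core group
    resources: List[str]
    verbs: List[str] = field(default_factory=list)  # empty list = every verb


def audit_level(rules: List[AuditRule], group: str, resource: str, verb: str) -> str:
    """Return the level of the first matching rule."""
    for rule in rules:
        if (rule.group == group and resource in rule.resources
                and (not rule.verbs or verb in rule.verbs)):
            return rule.level
    return "None"  # requests matching no rule are not audited


# The two rules of the policy above, in order
GPU_AUDIT_POLICY = [
    AuditRule("Metadata", "", ["pods"], ["create", "delete"]),
    AuditRule("RequestResponse", "batch.volcano.sh", ["jobs"]),
]

print(audit_level(GPU_AUDIT_POLICY, "", "pods", "get"))  # None
```

Under this policy, a get on a pod falls through every rule and is not recorded. Add explicit rules for any other verbs or resources your compliance regime requires.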
Checklist for Regulated Environments
- MIG or dedicated GPU allocation (no time-slicing for cross-tenant workloads)
- NetworkPolicies isolating inference endpoints
- mTLS between all AI platform services
- Model weight encryption at rest and access logging
- GPU utilization monitoring with anomaly detection
- RBAC restricting GPU resource creation to authorized teams
- Container image scanning and allowlisting on GPU nodes
- Audit logging for all GPU resource allocation events
- Training data access controls and lineage tracking
- Inference API rate limiting and authentication