AI Platform Engineering: Security

Threat Model Overview

AI platform infrastructure introduces security concerns beyond traditional Kubernetes workloads due to the high value of GPU resources, model weights, and training data.

graph TB
    subgraph "Attack Surface"
        A1["Model Weight Theft"]
        A2["GPU Resource Hijacking<br/>(Cryptomining)"]
        A3["Training Data Exfiltration"]
        A4["Inference API Abuse"]
        A5["Supply Chain<br/>(Poisoned Models)"]
        A6["Multi-Tenant Isolation<br/>Bypass"]
    end

    subgraph "Assets at Risk"
        M["Model Weights<br/>(Proprietary IP)"]
        G["GPU Compute<br/>(High $$$ value)"]
        D["Training Data<br/>(PII, proprietary)"]
        I["Inference Endpoints<br/>(Production services)"]
    end

    A1 --> M
    A2 --> G
    A3 --> D
    A4 --> I
    A5 --> M
    A6 --> G & M & D

Authentication and Authorization

Kubernetes RBAC for GPU Resources

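Control who can schedule GPU workloads to prevent unauthorized GPU consumption. RBAC cannot match on resource requests, so the Role governs who may create pods and Jobs in the team namespace, and the ResourceQuota caps how many GPUs that namespace can claim:

# Role allowing pod and Job management in the GPU-enabled namespace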
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-user
  namespace: ml-team
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]
---
# ResourceQuota limiting GPU allocation per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
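spec:
  hard:
    # Quotas on extended resources accept only the "requests." prefix.
    # GPU requests and limits must be equal, so this caps both.
    requests.nvidia.com/gpu: "4"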
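
A Role grants nothing until it is bound to a subject. As a minimal sketch, the RoleBinding below attaches `gpu-user` to a group. The binding name and the group name `ml-team-engineers` are hypothetical placeholders for whatever your identity provider supplies.

```yaml
# Bind the gpu-user Role to a group of users in the ml-team namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-user-binding          # hypothetical name
  namespace: ml-team
subjects:
  - kind: Group
    name: ml-team-engineers       # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: gpu-user
  apiGroup: rbac.authorization.k8s.io
```

One way to check the result is `kubectl auth can-i create pods --namespace ml-team --as <user> --as-group ml-team-engineers`, which should answer `yes` for members of the group and `no` for everyone else.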

vLLM API Authentication

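vLLM's OpenAI-compatible server has no built-in user authentication beyond an optional single shared API key (`--api-key` or the `VLLM_API_KEY` environment variable). It offers no per-user identity, rate limiting, or audit logging, so production deployments should place it behind:

- API Gateway (Kong, Envoy, NGINX) for token validation, rate limiting, and request logging
- Kubernetes NetworkPolicies to restrict which pods can reach vLLM endpoints
- Service mesh (Istio, Linkerd) for mTLS between services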
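
As defense in depth behind those controls, the server can also be started with a single shared API key. The sketch below passes it through the `VLLM_API_KEY` environment variable from a Secret; the Secret name `vllm-api-key`, its key `api-key`, and the model placeholder are assumptions for illustration. A shared key identifies no individual caller, so it complements the gateway rather than replacing it.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: vllm-api-key              # hypothetical name
  namespace: inference
type: Opaque
stringData:
  api-key: <shared-api-key>       # placeholder, supply your own value
---
apiVersion: v1
kind: Pod
metadata:
  name: vllm
  namespace: inference
  labels:
    app: vllm
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args: ["--model", "<model-id>"]   # placeholder model identifier
      env:
        # Requests must then send "Authorization: Bearer <shared-api-key>"
        - name: VLLM_API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key
              key: api-key
      ports:
        - containerPort: 8000
```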

# NetworkPolicy: only allow inference gateway to reach vLLM
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-access
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: inference-gateway
      ports:
        - port: 8000

Ray Cluster Security

Ray clusters require attention to several security surfaces:

Component	Risk	Mitigation
Ray Dashboard (8265)	Unauthenticated access to cluster management	NetworkPolicy, ingress with auth
GCS Port (6379)	Cluster control plane access	Pod-to-pod mTLS, no external exposure
Object Store	In-memory data accessible between workers	Namespace isolation, trusted workloads only
Ray Client (10001)	Remote code execution	TLS, authentication proxy

# NetworkPolicy: isolate Ray cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-cluster-isolation
  namespace: ray
spec:
  podSelector:
    matchLabels:
      ray.io/cluster: gpu-cluster
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ray.io/cluster: gpu-cluster
    - from:
        - podSelector:
            matchLabels:
              app: ray-job-submitter
      ports:
        - port: 8265
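  egress:
    - to:
        - podSelector:
            matchLabels:
              ray.io/cluster: gpu-cluster
    # A rule with an empty "to" matches every destination on every port.
    # Limit outbound traffic to DNS and HTTPS (model downloads) instead.
    - ports:
        - {protocol: UDP, port: 53}
        - {protocol: TCP, port: 53}
        - {protocol: TCP, port: 443}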

Multi-Tenant GPU Isolation

MIG vs Time-Slicing Security Comparison

Dimension	MIG	Time-Slicing
Memory isolation	Hardware-enforced — separate memory controllers and DRAM paths	Software-enforced — relies on CUDA context isolation
Compute isolation	Dedicated SMs, L2 cache banks, GPU engines	Shared GPU with time-based access rotation
Side-channel risk	Low — physically separate paths	Higher — shared cache and memory bus
QoS guarantees	Predictable — dedicated resources	Variable — depends on co-tenant behavior
Suitable for	Multi-tenant production, compliance environments	Development, trusted single-tenant clusters

Multi-Tenant Isolation

Time-slicing does NOT provide hardware-level isolation. Workloads from different tenants sharing a GPU via time-slicing can potentially observe side-channel information through shared L2 cache timing. For regulated or untrusted multi-tenant environments, MIG or dedicated GPU allocation is required.
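
To make the MIG option concrete: when the NVIDIA device plugin runs with the `mixed` MIG strategy, each MIG profile is advertised as its own extended resource, for example `nvidia.com/mig-1g.5gb` on an A100 40GB. The sketch below requests one such slice; the pod name, namespace, and image are hypothetical, and the profile name depends on the GPU model and how the node is partitioned.

```yaml
# Pod requesting a single hardware-isolated MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a-inference                 # hypothetical
  namespace: tenant-a                      # hypothetical
spec:
  containers:
    - name: inference
      image: registry.example.com/tenant-a/inference:1.0   # hypothetical image
      resources:
        limits:
          # One 1g.5gb MIG instance (profile names vary by GPU model)
          nvidia.com/mig-1g.5gb: 1
```

With the `single` strategy the slices are exposed as plain `nvidia.com/gpu` instead, so the resource name in the pod spec stays the same as for a full GPU.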

Namespace-Based GPU Isolation

# Dedicated GPU nodes per team using taints and tolerations
# Taint GPU nodes for specific teams
# kubectl taint nodes gpu-node-1 team=ml-production:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: production-inference
  namespace: ml-production
spec:
  tolerations:
    - key: "team"
      operator: "Equal"
      value: "ml-production"
      effect: "NoSchedule"
  nodeSelector:
    gpu-pool: production
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest  # pin a specific tag or digest in production
      resources:
        limits:
          nvidia.com/gpu: 1

Model Weight Protection

Securing Model Storage

Model weights represent significant IP investment. Protection strategies:

- Encrypted storage at rest: use encrypted PersistentVolumes or encrypted object storage (S3 SSE, MinIO encryption)
- Access control: restrict model registry access (MLflow, Hugging Face Hub) with per-team credentials
- Network segmentation: model downloads should traverse private networks, not the public internet
- Signed models: verify model integrity using checksums or signatures before loading
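
The integrity check can be sketched as an init container that refuses to let the server start when a checksum does not match. This is one possible arrangement, not a prescribed one: the ConfigMap `model-checksums`, its `SHA256SUMS` file (in `sha256sum` format, with paths relative to the cache directory), and the pod name are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-verified             # hypothetical
  namespace: inference
spec:
  initContainers:
    - name: verify-weights
      image: busybox:1.36
      # sha256sum -c exits non-zero on any mismatch, so the init container
      # fails and the vllm container is never started
      command: ["sh", "-c", "cd /models/cache && sha256sum -c /manifest/SHA256SUMS"]
      volumeMounts:
        - name: model-cache
          mountPath: /models/cache
          readOnly: true
        - name: manifest
          mountPath: /manifest
          readOnly: true
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      volumeMounts:
        - name: model-cache
          mountPath: /models/cache
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc
    - name: manifest
      configMap:
        name: model-checksums     # hypothetical ConfigMap holding SHA256SUMS
```

A checksum manifest only helps if it comes from a trusted source that is stored and access-controlled separately from the weights themselves.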

# Secret for model registry credentials
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: inference
type: Opaque
data:
  token: <base64-encoded-huggingface-token>
---
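# Pod with model credentials injected as an env var
# (avoid mounting as files in shared volumes)
apiVersion: v1
kind: Pod
metadata:
  name: vllm
  namespace: inference  # must match the Secret's namespace
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      env:
        # HF_TOKEN is the current variable name; this one is the legacy alias
        - name: HUGGING_FACE_HUB_TOKEN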
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: HF_HOME
          value: "/models/cache"
      volumeMounts:
        - name: model-cache
          mountPath: /models/cache
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc

GPU Cryptomining Prevention

GPU nodes are high-value targets for cryptomining. Detection and prevention:

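# Monitor for unexpected GPU utilization
# Alert on sustained high GPU utilization from pods without an approved workload type.
# Assumes dcgm-exporter attaches namespace/pod labels and kube-state-metrics
# exposes the pod label (--metric-labels-allowlist=pods=[workload-type]).
# PromQL alert expression (pair it with a "for:" duration in the alert rule):
DCGM_FI_DEV_GPU_UTIL > 90
  and on (namespace, pod)
  kube_pod_labels{label_workload_type!~"training|inference"}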
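
For clusters that run the Prometheus Operator, the condition can be packaged as a `PrometheusRule`. The sketch below is one way to do it: the rule name, namespace, alert name, severity, and the 15-minute `for:` window are all assumptions to tune, and the join uses both `namespace` and `pod` so that identically named pods in different namespaces are not confused.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-unapproved-utilization   # hypothetical
  namespace: monitoring              # hypothetical
spec:
  groups:
    - name: gpu-security
      rules:
        - alert: UnapprovedGpuUtilization
          # High GPU utilization on a pod whose workload-type label is
          # missing or is neither "training" nor "inference"
          expr: |
            DCGM_FI_DEV_GPU_UTIL > 90
              and on (namespace, pod)
              kube_pod_labels{label_workload_type!~"training|inference"}
          for: 15m                   # only fire on sustained utilization
          labels:
            severity: warning
          annotations:
            summary: "Sustained GPU utilization from a pod without an approved workload type"
```

If Prometheus scrapes dcgm-exporter without `honorLabels`, the pod labels may arrive as `exported_pod` and `exported_namespace`, in which case the join needs adjusting.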

Prevention measures:

- Pod Security Standards: restrict privileged containers on GPU nodes
- Image allowlisting: only permit approved container images on GPU nodes
- Admission webhooks: validate that GPU-requesting pods match approved workload patterns
- Resource quotas: limit GPU allocation per namespace
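
The admission check can also be expressed without running a webhook server, using the built-in `ValidatingAdmissionPolicy` (GA in Kubernetes 1.30). The sketch below rejects pods that request `nvidia.com/gpu` unless they carry a `workload-type` label of `training` or `inference`. The policy name and the label convention are assumptions, and only regular containers are inspected, not init containers.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: gpu-pods-require-workload-type   # hypothetical
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
  validations:
    # Pass if no container requests a GPU, or if the pod has an approved label
    - expression: >-
        !object.spec.containers.exists(c,
          has(c.resources) && has(c.resources.limits)
          && "nvidia.com/gpu" in c.resources.limits)
        || (has(object.metadata.labels)
          && "workload-type" in object.metadata.labels
          && object.metadata.labels["workload-type"] in ["training", "inference"])
      message: "Pods requesting nvidia.com/gpu must carry an approved workload-type label."
---
# The policy has no effect until a binding activates it
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: gpu-pods-require-workload-type
spec:
  policyName: gpu-pods-require-workload-type
  validationActions: ["Deny"]
```

A label check alone is easy to satisfy, so in practice it is usually combined with image allowlisting and RBAC on who may create pods in GPU namespaces.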

Encryption

Data in Transit

Path	Protocol	Configuration
Client → Inference API	HTTPS/TLS	API Gateway with TLS termination
Pod → Pod (inference)	mTLS	Service mesh (Istio/Linkerd)
Ray inter-node	TLS	Ray TLS configuration
GPU node → Storage	HTTPS	Encrypted object store endpoints
NVLink (intra-node)	N/A	Physical hardware path, no encryption needed
RDMA/InfiniBand	N/A	Typically trusted fabric, IPsec optional
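
For the Ray inter-node row, Ray enables TLS on its gRPC channels through environment variables that must be set on every node and client. The fragment below is a sketch of a container spec, not a complete manifest: the mount path and the Secret name `ray-tls` are hypothetical, and how certificates are issued is out of scope. Certificates generally need to be valid for each pod's address, so a per-pod certificate may be required rather than one shared Secret.

```yaml
# Fragment of a Ray head/worker pod spec (not a complete manifest)
containers:
  - name: ray-node
    env:
      - name: RAY_USE_TLS
        value: "1"                      # turn on TLS for Ray gRPC traffic
      - name: RAY_TLS_SERVER_CERT
        value: /etc/ray/tls/tls.crt     # hypothetical path
      - name: RAY_TLS_SERVER_KEY
        value: /etc/ray/tls/tls.key     # hypothetical path
      - name: RAY_TLS_CA_CERT
        value: /etc/ray/tls/ca.crt      # hypothetical path
    volumeMounts:
      - name: ray-tls
        mountPath: /etc/ray/tls
        readOnly: true
volumes:
  - name: ray-tls
    secret:
      secretName: ray-tls               # hypothetical Secret with cert, key, and CA
```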

Data at Rest

Component	Encryption Method
Model weights on PV	Encrypted PersistentVolumes (StorageClass encryption)
Training data	Encrypted object storage (SSE-S3, SSE-KMS)
KV Cache (GPU memory)	Not encrypted (volatile GPU memory)
Model registry	Application-level encryption + encrypted backend storage
Logs/metrics	Encrypted storage backend
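
StorageClass encryption is provider-specific. As one example, assuming the AWS EBS CSI driver, the sketch below creates encrypted volumes for every PersistentVolumeClaim that uses the class. The class name is hypothetical, and other CSI drivers use different parameter names.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3             # hypothetical
provisioner: ebs.csi.aws.com      # assumes the AWS EBS CSI driver
parameters:
  type: gp3
  encrypted: "true"               # encrypt every volume provisioned from this class
  # kmsKeyId: <kms-key-arn>       # optional: customer-managed key instead of the default
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
```

A PersistentVolumeClaim such as `model-cache-pvc` would then select the class through `storageClassName: encrypted-gp3`.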

Compliance Considerations

GPU Workload Auditing

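# Kubernetes audit policy for GPU workload events.
# Audit rules cannot match on resource requests, so the pod rule covers all pods.
# Raise its level to Request if the pod spec (including nvidia.com/gpu) must be logged.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who creates or deletes pods (metadata only, no request body)
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods"]
    verbs: ["create", "delete"]
  # Full request and response audit for Volcano jobs
  - level: RequestResponse
    resources:
      - group: "batch.volcano.sh"
        resources: ["jobs"]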