
AI Platform Engineering — Security

Threat Model Overview

AI platform infrastructure raises security concerns beyond those of traditional Kubernetes workloads because GPU resources, model weights, and training data are all high-value targets.

graph TB
    subgraph "Attack Surface"
        A1["Model Weight Theft"]
        A2["GPU Resource Hijacking<br/>(Cryptomining)"]
        A3["Training Data Exfiltration"]
        A4["Inference API Abuse"]
        A5["Supply Chain<br/>(Poisoned Models)"]
        A6["Multi-Tenant Isolation<br/>Bypass"]
    end

    subgraph "Assets at Risk"
        M["Model Weights<br/>(Proprietary IP)"]
        G["GPU Compute<br/>(High $$$ value)"]
        D["Training Data<br/>(PII, proprietary)"]
        I["Inference Endpoints<br/>(Production services)"]
    end

    A1 --> M
    A2 --> G
    A3 --> D
    A4 --> I
    A5 --> M
    A6 --> G & M & D

Authentication and Authorization

Kubernetes RBAC for GPU Resources

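Control who can schedule GPU workloads to prevent unauthorized GPU consumption. RBAC cannot inspect a pod's resource requests, so the Role governs who may create pods and jobs in a GPU-enabled namespace, and the ResourceQuota caps how many GPUs that namespace can claim:

# Role allowing pod and job creation in the GPU-enabled namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-user
  namespace: ml-team
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]
---
# ResourceQuota limiting GPU allocation per namespace.
# Quotas on extended resources such as nvidia.com/gpu accept only the
# "requests." prefix; a "limits." entry is not supported for them.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "4"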
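
A Role has no effect until it is bound to a subject. The following is a minimal sketch of a RoleBinding for the `gpu-user` Role above; the group name `ml-engineers` is a hypothetical placeholder for whatever group your identity provider supplies.

```yaml
# Bind the gpu-user Role to a group within the ml-team namespace.
# "ml-engineers" is an illustrative group name, not part of the original setup.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-user-binding
  namespace: ml-team
subjects:
  - kind: Group
    name: ml-engineers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: gpu-user
  apiGroup: rbac.authorization.k8s.io
```

Binding to a group rather than to individual users keeps membership changes in the identity provider instead of in cluster manifests.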

vLLM API Authentication

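vLLM's OpenAI-compatible server offers only minimal built-in authentication: the --api-key option (or the VLLM_API_KEY environment variable) enforces a single shared bearer token, with no per-user identity, rate limiting, or request auditing. Production deployments should therefore add: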

  • API Gateway (Kong, Envoy, NGINX) for token validation, rate limiting, and request logging
  • Kubernetes NetworkPolicies to restrict which pods can reach vLLM endpoints
  • Service mesh (Istio, Linkerd) for mTLS between services

# NetworkPolicy: only allow inference gateway to reach vLLM
# (a bare podSelector matches pods in this namespace only, so the gateway must run in "inference")
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-access
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: inference-gateway
      ports:
        - port: 8000
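
If the gateway runs in a different namespace, the peer needs a `namespaceSelector` combined with the `podSelector`. The sketch below assumes a hypothetical namespace named `gateway` and relies on the `kubernetes.io/metadata.name` label that Kubernetes sets on every namespace.

```yaml
# Variant: allow only inference-gateway pods from the "gateway" namespace.
# The namespace name "gateway" is an assumption for illustration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-access-cross-namespace
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector and podSelector in the same list item are ANDed
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gateway
          podSelector:
            matchLabels:
              app: inference-gateway
      ports:
        - port: 8000
          protocol: TCP
```

Placing the two selectors in separate list items would instead OR them, admitting every pod in the `gateway` namespace plus any `inference-gateway` pod in the local namespace.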

Ray Cluster Security

Ray clusters require attention to several security surfaces:

| Component | Risk | Mitigation |
| --- | --- | --- |
| Ray Dashboard (8265) | Unauthenticated access to cluster management | NetworkPolicy, ingress with auth |
| GCS Port (6379) | Cluster control plane access | Pod-to-pod mTLS, no external exposure |
| Object Store | In-memory data accessible between workers | Namespace isolation, trusted workloads only |
| Ray Client (10001) | Remote code execution | TLS, authentication proxy |

# NetworkPolicy: isolate Ray cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-cluster-isolation
  namespace: ray
spec:
  podSelector:
    matchLabels:
      ray.io/cluster: gpu-cluster
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ray.io/cluster: gpu-cluster
    - from:
        - podSelector:
            matchLabels:
              app: ray-job-submitter
      ports:
        - port: 8265
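  egress:
    - to:
        - podSelector:
            matchLabels:
              ray.io/cluster: gpu-cluster
    # An empty "to" matches every destination, so restrict by port instead:
    # DNS lookups plus outbound HTTPS for model downloads
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
        - port: 443
          protocol: TCP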
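
The TLS mitigations in the table above can be sketched with Ray's TLS environment variables, which enable TLS on Ray's gRPC channels. This is a pod-spec fragment rather than a complete manifest; the Secret name `ray-tls`, the container name, and the mount path are hypothetical, and the certificate is assumed to be valid for the addresses Ray nodes use to reach each other.

```yaml
# Fragment of a Ray head/worker pod spec (not a complete manifest).
# Secret "ray-tls" and the /etc/ray/tls path are illustrative assumptions.
containers:
  - name: ray-node
    env:
      - name: RAY_USE_TLS
        value: "1"
      - name: RAY_TLS_SERVER_CERT
        value: /etc/ray/tls/tls.crt
      - name: RAY_TLS_SERVER_KEY
        value: /etc/ray/tls/tls.key
      - name: RAY_TLS_CA_CERT
        value: /etc/ray/tls/ca.crt
    volumeMounts:
      - name: ray-tls
        mountPath: /etc/ray/tls
        readOnly: true
volumes:
  - name: ray-tls
    secret:
      secretName: ray-tls
```

The same variables need to be set on every node in the cluster and on any client that connects to it. The dashboard on port 8265 is not covered by these settings, so it still needs the NetworkPolicy or an authenticating ingress.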

Multi-Tenant GPU Isolation

MIG vs Time-Slicing Security Comparison

| Dimension | MIG | Time-Slicing |
| --- | --- | --- |
| Memory isolation | Hardware-enforced: separate memory controllers and DRAM paths | Software-enforced: relies on CUDA context isolation |
| Compute isolation | Dedicated SMs, L2 cache banks, GPU engines | Shared GPU with time-based access rotation |
| Side-channel risk | Low: physically separate paths | Higher: shared cache and memory bus |
| QoS guarantees | Predictable: dedicated resources | Variable: depends on co-tenant behavior |
| Suitable for | Multi-tenant production, compliance environments | Development, trusted single-tenant clusters |

Multi-Tenant Isolation

Time-slicing does NOT provide hardware-level isolation. Workloads from different tenants sharing a GPU via time-slicing can potentially observe side-channel information through shared L2 cache timing. For regulated or untrusted multi-tenant environments, MIG or dedicated GPU allocation is required.
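
As an illustration of the MIG option, a tenant pod can request a MIG slice instead of a whole GPU. This sketch assumes the NVIDIA device plugin runs with the "mixed" MIG strategy on an A100 40GB, which exposes slices under names such as `nvidia.com/mig-3g.20gb`; the pod name and namespace are hypothetical.

```yaml
# Pod requesting one hardware-isolated MIG slice rather than a full GPU.
# Resource name depends on the GPU model and the MIG profile configured on the node.
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a-inference   # hypothetical name
  namespace: tenant-a        # hypothetical namespace
spec:
  containers:
    - name: inference
      image: vllm/vllm-openai:latest  # pin a specific version or digest in production
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1
```

With the "single" MIG strategy, slices are advertised as plain `nvidia.com/gpu` instead, so the resource name in the request would differ.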

Namespace-Based GPU Isolation

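# Dedicated GPU nodes per team using taints, tolerations, and node labels.
# The taint keeps other teams' pods off the node; the label lets the
# nodeSelector pin this team's pods onto it.
#   kubectl taint nodes gpu-node-1 team=ml-production:NoSchedule
#   kubectl label nodes gpu-node-1 gpu-pool=production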

apiVersion: v1
kind: Pod
metadata:
  name: production-inference
  namespace: ml-production
spec:
  tolerations:
    - key: "team"
      operator: "Equal"
      value: "ml-production"
      effect: "NoSchedule"
  nodeSelector:
    gpu-pool: production
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest  # pin a specific version or digest in production
      resources:
        limits:
          nvidia.com/gpu: 1

Model Weight Protection

Securing Model Storage

Model weights represent significant IP investment. Protection strategies:

  1. Encrypted storage at rest: use encrypted PersistentVolumes or encrypted object storage (S3 SSE, MinIO encryption)
  2. Access control: restrict model registry access (MLflow, Hugging Face Hub) with per-team credentials
  3. Network segmentation: route model downloads over private networks, not the public internet
  4. Signed models: verify model integrity using checksums or signatures before loading

# Secret for model registry credentials
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: inference
type: Opaque
data:
  token: <base64-encoded-huggingface-token>
---
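# Pod with model credentials injected as an env var
# (avoid mounting as files in shared volumes)
apiVersion: v1
kind: Pod
metadata:
  name: vllm
  namespace: inference  # must match the Secret's namespace
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest  # pin a specific version or digest in production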
      env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: HF_HOME
          value: "/models/cache"
      volumeMounts:
        - name: model-cache
          mountPath: /models/cache
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc
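
The integrity check from the strategy list can be sketched as an init container that refuses to start the pod when a checksum does not match. This assumes the weights are already present on the PVC (not downloaded at startup) and that a ConfigMap named `model-checksums`, a hypothetical name, holds a `SHA256SUMS` file with paths relative to `/models/cache`.

```yaml
# Pod whose init container verifies model files before vLLM starts.
# ConfigMap "model-checksums" and the pod name are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-verified
  namespace: inference
spec:
  initContainers:
    - name: verify-model
      image: busybox:1.36
      workingDir: /models/cache
      # Exits non-zero on any mismatch, which keeps the main container from starting
      command: ["sha256sum", "-c", "/checksums/SHA256SUMS"]
      volumeMounts:
        - name: model-cache
          mountPath: /models/cache
          readOnly: true
        - name: checksums
          mountPath: /checksums
          readOnly: true
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest  # pin a specific version or digest in production
      volumeMounts:
        - name: model-cache
          mountPath: /models/cache
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc
    - name: checksums
      configMap:
        name: model-checksums
```

A checksum only detects tampering if the checksum file itself comes from a trusted source; cryptographic signatures give a stronger guarantee when the registry supports them.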

GPU Cryptomining Prevention

GPU nodes are high-value targets for cryptomining. Detection and prevention:

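# Monitor for unexpected GPU utilization.
# Alert on high GPU utilization from pods not labeled as approved workloads.
# Assumes the DCGM exporter attaches namespace/pod labels and kube-state-metrics
# is configured to expose the workload_type pod label.
# PromQL expression (pair with a "for:" duration in the alert rule so only sustained load fires):
DCGM_FI_DEV_GPU_UTIL > 90
  and on (namespace, pod)
  kube_pod_labels{label_workload_type!~"training|inference"}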
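
To make the "sustained" condition concrete, the expression can be wrapped in an alert rule with a `for:` duration. The sketch below assumes the Prometheus Operator is installed (it provides the `PrometheusRule` resource) and joins on both namespace and pod; the rule name, namespace, 15-minute window, and severity are illustrative choices.

```yaml
# Alert when an unlabeled pod keeps a GPU above 90% utilization for 15 minutes.
# Requires the Prometheus Operator CRDs; names and thresholds are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-unapproved-usage
  namespace: monitoring
spec:
  groups:
    - name: gpu-security
      rules:
        - alert: UnapprovedGPUWorkload
          expr: |
            DCGM_FI_DEV_GPU_UTIL > 90
              and on (namespace, pod)
              kube_pod_labels{label_workload_type!~"training|inference"}
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Sustained GPU load from pod {{ $labels.pod }} without an approved workload_type label"
```

Because pod labels are set by whoever creates the pod, this rule catches careless or unexpected workloads more than determined attackers; it works best alongside admission controls that restrict who may set the label.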

Prevention measures:

  • Pod Security Standards — Restrict privileged containers on GPU nodes
  • Image allowlisting — Only permit approved container images on GPU nodes
  • Admission webhooks — Validate that GPU-requesting pods match approved workload patterns
  • Resource quotas — Limit GPU allocation per namespace
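
The Pod Security Standards measure can be applied per namespace through Pod Security Admission labels. This is a minimal sketch for the `ml-team` namespace used earlier; the choice of `baseline` for enforcement and `restricted` for warnings is an assumption, not a recommendation from the original text.

```yaml
# Enforce the "baseline" Pod Security Standard, which rejects privileged
# containers, host namespaces, and hostPath volumes; warn and audit on "restricted".
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Running `warn` and `audit` at the stricter level shows which workloads would break before enforcement is tightened.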

Encryption

Data in Transit

| Path | Protocol | Configuration |
| --- | --- | --- |
| Client → Inference API | HTTPS/TLS | API Gateway with TLS termination |
| Pod → Pod (inference) | mTLS | Service mesh (Istio/Linkerd) |
| Ray inter-node | TLS | Ray TLS configuration |
| GPU node → Storage | HTTPS | Encrypted object store endpoints |
| NVLink (intra-node) | N/A | Physical hardware path, no encryption needed |
| RDMA/InfiniBand | N/A | Typically trusted fabric, IPsec optional |
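
For the pod-to-pod row, a service mesh can reject plaintext traffic outright. The sketch below assumes Istio is the mesh in use and that inference pods in the `inference` namespace have sidecars injected.

```yaml
# Require mTLS for all workloads in the inference namespace (Istio only).
# A PeerAuthentication named "default" applies namespace-wide.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: inference
spec:
  mtls:
    mode: STRICT
```

In STRICT mode, clients without a sidecar can no longer connect, so it is common to start with PERMISSIVE and switch once all callers are in the mesh.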

Data at Rest

| Component | Encryption Method |
| --- | --- |
| Model weights on PV | Encrypted PersistentVolumes (StorageClass encryption) |
| Training data | Encrypted object storage (SSE-S3, SSE-KMS) |
| KV Cache (GPU memory) | Not encrypted (volatile GPU memory) |
| Model registry | Application-level encryption + encrypted backend storage |
| Logs/metrics | Encrypted storage backend |
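
StorageClass-level encryption for model weights depends on the storage provider. As one example, the sketch below assumes AWS with the EBS CSI driver; the class name and requested size are hypothetical, and other providers use different provisioners and parameters.

```yaml
# StorageClass that provisions encrypted EBS volumes (AWS EBS CSI driver assumed).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3          # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  # kmsKeyId: <your-kms-key-arn>   # optional; omit to use the account's default key
volumeBindingMode: WaitForFirstConsumer
---
# PVC for the model cache backed by the encrypted class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: inference
spec:
  storageClassName: encrypted-gp3
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi           # illustrative size
```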

Compliance Considerations

GPU Workload Auditing

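# Kubernetes audit policy covering GPU workload lifecycle events.
# Audit rules match on resource type and verb, not on resource requests,
# so the first rule records every pod create/delete, GPU or not.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log who creates/deletes pods (metadata only, no request bodies)
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods"]
    verbs: ["create", "delete"]
  # Full request/response audit for Volcano jobs (all verbs)
  - level: RequestResponse
    resources:
      - group: "batch.volcano.sh"
        resources: ["jobs"]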

Checklist for Regulated Environments

  • MIG or dedicated GPU allocation (no time-slicing for cross-tenant workloads)
  • NetworkPolicies isolating inference endpoints
  • mTLS between all AI platform services
  • Model weight encryption at rest and access logging
  • GPU utilization monitoring with anomaly detection
  • RBAC restricting GPU resource creation to authorized teams
  • Container image scanning and allowlisting on GPU nodes
  • Audit logging for all GPU resource allocation events
  • Training data access controls and lineage tracking
  • Inference API rate limiting and authentication