On-Premises AI Security: Protecting Self-Hosted LLMs and GPU Infrastructure
Running AI on your own infrastructure gives you control over your data. It also means you own the security. Here is how to secure Ollama, vLLM, and other self-hosted AI deployments properly.
Why On-Premises AI Is Having a Moment
The conversation around enterprise AI has shifted. Eighteen months ago, the default answer to "how do we use AI?" was almost universally "use the OpenAI API." Now organizations are asking harder questions and choosing self-hosted AI more frequently, for good reasons:
- Data sensitivity: Healthcare providers, law firms, defense contractors, and financial institutions often cannot send data to third-party AI providers, whether because of regulatory requirements, contractual obligations, or well-founded risk aversion.
- Cost at scale: At high inference volumes, running your own models can be significantly cheaper than API fees. Those per-token charges add up fast when processing millions of documents.
- Customization: Fine-tuned models running on your infrastructure, trained on your specific data, with no concerns about data leaving your environment.
- Control: You define the model version, the update cadence, the availability guarantees, and what data the model can see.
But here is what rarely gets discussed when teams make this decision: on-premises AI comes with security responsibilities that cloud providers would otherwise absorb. Those responsibilities are real, and most security teams are not prepared for them.
The On-Premises AI Threat Model
When you run AI on your own infrastructure, you take on security responsibilities that did not exist before. Let us map out what you are now accountable for.
New attack surface:
- GPU infrastructure security (specialized hardware, often unfamiliar)
- Inference server software (Ollama, vLLM, LocalAI each have their own vulnerabilities)
- Model supply chain (you are downloading model weights from the internet)
- Model weight storage (large files representing significant intellectual property)
- Network security (same principles, new assets you did not have before)
- Access control (who can reach the inference endpoint, who can manage models)
- Secrets management (inference APIs need auth tokens, managed identity, or API keys)
- Monitoring (what does "normal" look like for GPU inference workloads?)
- Patch management (AI software moves extremely fast, vulnerabilities emerge regularly)
Securing Common Self-Hosted Inference Servers
Ollama
Ollama is the most popular self-hosted LLM runtime for a reason—it works well and has excellent model support. But its defaults are not production-ready, and many tutorials make this worse.
The core problem: Ollama's REST API has no authentication by default. If the process listens on anything other than localhost, anyone who can reach it on the network can send requests, list models, pull new models, and in some configurations delete them. Many tutorials instruct you to set OLLAMA_HOST=0.0.0.0 for remote access without adding any authentication.
The right approach: Keep Ollama bound to localhost, and put a reverse proxy in front of it that handles TLS and authentication:
# /etc/nginx/sites-available/ollama-secure
server {
    listen 443 ssl;
    server_name ai.internal.yourcompany.com;

    ssl_certificate     /etc/ssl/certs/internal-ca.crt;
    ssl_certificate_key /etc/ssl/private/internal-ca.key;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;

    location / {
        auth_basic           "AI Inference Server";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:11434;
        proxy_set_header     Host $host;
        proxy_read_timeout   300s;
    }

    # Restrict model management endpoints; note these locations still need
    # proxy_pass so the allowed management host actually reaches Ollama
    location /api/pull {
        allow 10.0.1.5;   # Only the model management host
        deny  all;
        proxy_pass http://127.0.0.1:11434;
    }

    location /api/delete {
        allow 10.0.1.5;
        deny  all;
        proxy_pass http://127.0.0.1:11434;
    }

    location /api/push {
        deny all;
    }
}
For enterprise environments, replace basic auth with oauth2-proxy integrated with your existing SSO/identity provider.
vLLM
vLLM delivers high-performance production inference and is increasingly common in enterprise AI deployments. It exposes an OpenAI-compatible API, which simplifies integration, but it also means you need to apply the same security rigor you would to any network-facing API.
# Secure vLLM startup configuration
# - Bind to localhost only; a TLS-terminating proxy sits in front
# - Require authentication via an API key
# - Limit context length (prevents memory exhaustion) and concurrent requests
# - Do not log raw request content
python -m vllm.entrypoints.openai.api_server \
    --model /models/llama-3.1-8b-instruct \
    --host 127.0.0.1 \
    --port 8000 \
    --api-key "${INFERENCE_API_KEY}" \
    --max-model-len 4096 \
    --max-num-seqs 32 \
    --disable-log-requests
Always run vLLM behind a reverse proxy that handles TLS termination. The vLLM process itself should never be directly exposed to the network.
LocalAI and Llama.cpp Server
The same principles apply across all inference servers. Any process binding to 0.0.0.0 without authentication is a security problem, regardless of whether it is on an "internal" network. Internal networks get breached. The inference server should be treated as internal infrastructure, not as a trusted zone.
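A quick way to audit any inference host for this problem is to check which addresses its ports are bound to. A minimal check, with example port numbers (Ollama 11434, vLLM 8000, llama.cpp/LocalAI commonly 8080) that you should adjust to your own deployment:
# List listening TCP sockets for common inference ports
ss -tlnp | grep -E ':(11434|8000|8080)\b'
# Anything bound to 0.0.0.0 or [::] is reachable from the network and needs
# authentication in front of it; 127.0.0.1 bindings are local-only.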
GPU Infrastructure Hardening
GPU servers for AI inference are specialized infrastructure with their own hardening requirements. Most server hardening guides do not cover them specifically.
Operating System Baseline
Start with a minimal, purpose-built OS installation:
# GPU inference server hardening - essential steps
# 1. Disable unnecessary services
systemctl disable avahi-daemon cups bluetooth
systemctl mask rpcbind nfs-server
# 2. Set correct permissions for GPU device files
# Inference service user should be in 'render' and 'video' groups, not root
usermod -aG render,video ai_inference
# 3. Configure firewall with explicit rules
ufw default deny incoming
ufw default deny outgoing
ufw allow from 10.0.1.0/24 to any port 8000 # Only from app servers
ufw allow out to any port 443 # HTTPS for updates only
ufw allow out to any port 53 # DNS
ufw enable
# 4. Audit model file access
auditctl -w /models/ -p rwxa -k model_access
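The audit rule only helps if someone looks at what it records. Assuming auditd is running and the model_access key from the rule above, a quick review looks like this:
# Review model file access events captured by the rule above
ausearch -k model_access --start today
# Summarized view of which files were touched
aureport -f -i --summary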
Run Inference as a Dedicated Service Account
Never run AI inference as root. Create a non-privileged service account:
# Create dedicated service user
useradd --system --shell /bin/false --no-create-home ai_inference
# Restrict model directory access
chown -R ai_inference:ai_inference /models/
chmod -R 750 /models/
# Systemd service hardening
[Service]
User=ai_inference
Group=ai_inference
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/models /var/log/ai-inference
GPU Resource Controls
# Restrict which GPUs a process can access
CUDA_VISIBLE_DEVICES=0,1 python vllm_server.py
# Prevent memory exhaustion - leave headroom
# In vLLM:
--gpu-memory-utilization 0.85
# Monitor GPU utilization - baseline and alert on deviations
nvidia-smi dmon -s u -d 30
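If GPU metrics are not yet in your monitoring stack, even a simple scheduled sample gives you a baseline to alert against. A minimal sketch; the log path is an example:
# Append one utilization sample per GPU to a local baseline log (example path)
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits \
    | awk -v ts="$(date -Is)" '{ gsub(",", ""); print ts " gpu=" $1 " util=" $2 "% mem=" $3 "MiB" }' \
    >> /var/log/gpu-baseline.log
Once a baseline exists, alert on sustained deviations rather than single spikes.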
Network Isolation
The goal is to make inference infrastructure accessible only to authorized applications—not to the broader network or to end users directly.
Dedicated Network Segment
Put AI infrastructure in its own segment with explicit firewall rules; a host-firewall sketch matching this topology follows below:
Recommended network topology:
[Users] → [Application Servers (App Subnet)] → [AI Inference Subnet] → [Model Storage]
AI Inference Subnet rules:
✓ Inbound: Only from Application Subnet, on inference port
✗ Inbound: Block everything else, including direct user traffic
✓ Outbound: Model storage subnet, approved package repositories
✗ Outbound: Deny all other outbound
Management access:
✓ SSH from dedicated jump host only
✓ Separate management VLAN
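A host-level sketch that enforces this topology with ufw, extending the earlier baseline. The subnet addresses and jump host IP are examples, and network-level ACLs should enforce the same boundaries independently:
# Example addressing - replace with your own
APP_SUBNET=10.0.1.0/24        # Application servers
STORAGE_SUBNET=10.0.2.0/24    # Model storage
JUMP_HOST=10.0.9.10           # Management jump host

ufw default deny incoming
ufw default deny outgoing
ufw allow from "${APP_SUBNET}" to any port 8000 proto tcp    # Inference traffic only
ufw allow from "${JUMP_HOST}" to any port 22 proto tcp       # SSH via jump host only
ufw allow out to "${STORAGE_SUBNET}"                         # Reach model storage
ufw enable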
Air-Gapped Deployments
For the most sensitive environments—defense, classified data, highly regulated healthcare—consider true air-gapping. The inference server has no internet connectivity. Models are transferred through an approved, audited process. All software updates go through formal change management.
This is operationally demanding but provides the strongest possible isolation. Worth the overhead for environments where data exfiltration is a catastrophic risk.
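Even in an air-gapped environment, the transfer step itself deserves a technical control, not just a procedure. A minimal sketch of an import check, assuming a hash manifest (approved-models.sha256) is produced and reviewed on the connected side before media crosses the gap; the transfer path is an example:
# Verify transferred model files against the reviewed manifest before import
cd /media/transfer/models || exit 1
sha256sum -c approved-models.sha256 || { echo "Hash mismatch - rejecting transfer" >&2; exit 1; }
# Only after verification, install into the model directory owned by the service account
install -o ai_inference -g ai_inference -m 0640 ./*.safetensors /models/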
Model Integrity and Supply Chain
Every model you download from the internet is a supply chain risk. This is not theoretical—malicious models have been identified on Hugging Face and similar platforms. Model weights can contain embedded malware in pickle-serialized formats.
Verify Sources and Hashes
Only download models from verified publishers: Meta for Llama models, Mistral AI for Mistral, official quantization authors with established track records.
# Always verify model hash against the official source
# Download from official source first
wget https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/model.safetensors
# Verify hash - get expected value from official release notes, NOT from the download page
sha256sum model.safetensors
# For Ollama - check model manifests
cat ~/.ollama/models/manifests/registry.ollama.ai/library/llama3.2/latest
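To make verification enforceable rather than a manual eyeball check, wrap it in a comparison that fails closed. A sketch; the expected hash placeholder must be filled in from the publisher's official release notes:
# Fail closed if the computed hash does not match the published one
EXPECTED_SHA256="<hash from the official release notes>"
ACTUAL_SHA256="$(sha256sum model.safetensors | awk '{print $1}')"
if [ "${ACTUAL_SHA256}" != "${EXPECTED_SHA256}" ]; then
    echo "Hash mismatch - do not load this model" >&2
    exit 1
fi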
Prefer SafeTensors Format
SafeTensors is a serialization format specifically designed to prevent arbitrary code execution vulnerabilities inherent in pickle-based formats (PyTorch's .pt and .bin files). When a .bin file is loaded, arbitrary Python code in the file can execute. SafeTensors cannot.
# When available, always prefer the .safetensors version
# model.safetensors <-- safe
# model.bin <-- can execute code on load
# model.pt <-- can execute code on load
# Scan pickle-format files before loading if safetensors unavailable
# pip install picklescan
picklescan --path /models/
Internal Model Registry
Do not let teams freely download models to production inference servers. Establish a controlled process:
- Submit request to add model (source, purpose, expected hash)
- Security review of model source and format
- Test in isolated environment
- Approve and add to internal registry
- Deploy to production from registry only
This adds friction that feels bureaucratic until the day it stops a malicious model from reaching production.
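The final step is easiest to enforce with a small deployment gate. A sketch, assuming the registry is a simple file of approved hashes; the path and format are illustrative:
# deploy-model.sh <model-file> - refuse anything not in the internal registry
REGISTRY=/opt/ai-registry/approved-models.sha256
MODEL_FILE="$1"
MODEL_HASH="$(sha256sum "${MODEL_FILE}" | awk '{print $1}')"
if grep -qw "${MODEL_HASH}" "${REGISTRY}"; then
    echo "Approved: ${MODEL_FILE}"
else
    echo "Not in registry - refusing to deploy ${MODEL_FILE}" >&2
    exit 1
fi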
Access Control
Define Access Tiers Explicitly
AI Access Control Tiers:
AI Users:
- Can send inference requests to approved models
- Cannot manage models, view configurations, or access logs
AI Operators:
<ul class="list-disc pl-6 mb-4 space-y-2">
<li class="text-gray-600">Can start and stop inference services</li>
<li class="text-gray-600">Can view inference logs (without raw content)</li>
<li class="text-gray-600">Cannot pull new models or modify access controls</li>
</ul>
AI Administrators:
- Can add and remove models
- Can modify access controls and configurations
- Requires MFA, all actions audited
Integrate with Your Identity Provider
For single-user or small team deployments, basic auth or API tokens may be acceptable. For enterprise, integrate with your existing identity provider through oauth2-proxy or similar:
# oauth2-proxy in front of inference server
oauth2-proxy --provider=oidc --oidc-issuer-url=https://login.yourcompany.com/oauth2 --client-id=ai-inference-server --upstream=http://localhost:11434 --email-domain=yourcompany.com --cookie-secret="${COOKIE_SECRET}" --pass-authorization-header=true
Monitoring and Anomaly Detection
On-premises AI requires you to build your own monitoring stack. There is no built-in dashboard from your AI provider.
What to Monitor
Operational baseline—alert on significant deviation:
<ul class="list-disc pl-6 mb-4 space-y-2">
<li class="text-gray-600">Requests per second (per model, per authenticated user)</li>
<li class="text-gray-600">Tokens generated per hour</li>
<li class="text-gray-600">GPU utilization and memory usage</li>
<li class="text-gray-600">Model load and unload events</li>
</ul>
Security events—alert immediately:
- Authentication failures (especially repeated failures from same source)
- Requests to model management endpoints from unauthorized sources
- New IP addresses or user agents not seen before
- Unusually long input contexts (potential prompt-injection payloads or resource exhaustion)
- Inference requests during unusual hours (for business-hours workloads)
Structured Logging for Security Visibility
# Nginx JSON access log for AI inference proxy
log_format ai_access escape=json
'{"timestamp":"$time_iso8601",'
'"client_ip":"$remote_addr",'
'"authenticated_user":"$remote_user",'
'"method":"$request_method",'
'"endpoint":"$request_uri",'
'"http_status":"$status",'
'"response_bytes":"$bytes_sent",'
'"request_duration_ms":"$request_time"}';
access_log /var/log/nginx/ai-access.log ai_access;Ship these logs to your SIEM or log management platform. Build alerts on authentication failures, management endpoint access, and usage volume spikes.
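If a full SIEM is not in place yet, even a scheduled query over the proxy log catches the basics. A sketch using jq against the log format above:
# Authentication failures (401/403) per client, most frequent first
jq -r 'select(.http_status == "401" or .http_status == "403") | .client_ip' /var/log/nginx/ai-access.log | sort | uniq -c | sort -rn
# Any access to model management endpoints
jq -r 'select(.endpoint | test("^/api/(pull|push|delete)")) | "\(.timestamp) \(.client_ip) \(.endpoint)"' /var/log/nginx/ai-access.log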
Production Readiness Checklist
Infrastructure:
✓ Inference server bound to localhost, proxied through nginx/caddy
✓ TLS with valid certificate on the proxy
✓ Authentication required for all endpoints
✓ Management endpoints restricted to authorized hosts
✓ Inference service runs as non-root service account
✓ Model files owned by service account, restricted permissions
✓ GPU server on isolated network segment with explicit firewall rules
Supply Chain:
✓ Model hashes verified against official release notes
✓ SafeTensors format used where available
✓ Model approval process documented and enforced
✓ No automatic model downloads in production environments
Monitoring:
✓ Structured access logging enabled
✓ Authentication events captured
✓ GPU utilization monitored
✓ Usage baseline established, anomaly alerts configured
Operations:
✓ Patch management process for inference software
✓ Incident response playbook for AI infrastructure incidents
✓ Access reviews scheduled quarterly
On-premises AI gives you control over your data and your infrastructure. That control is worth having. It also means security is entirely your responsibility. The organizations that do this well treat their AI infrastructure with the same discipline as their production databases, because that is exactly what it is.