On-Premises AI Security: Self-Hosted LLM Security Guide 2026

Why On-Premises AI Is Having a Moment

The conversation around enterprise AI has shifted. Eighteen months ago, the default answer to "how do we use AI?" was almost universally "use the OpenAI API." Now organizations are asking harder questions and choosing self-hosted AI more frequently—for good reasons. Data sensitivity: Healthcare providers, law firms, defense contractors, and financial institutions often cannot send data to third-party AI providers. Regulatory requirements, contractual obligations, or well-founded risk aversion. Cost at scale: At high inference volumes, running your own models can be significantly cheaper than API fees. Those per-token charges add up fast when processing millions of documents. Customization: Fine-tuned models running on your infrastructure, trained on your specific data, with no concerns about data leaving your environment. Control: You define the model version, the update cadence, the availability guarantees, and what data the model can see.

But here is what rarely gets discussed when teams make this decision: on-premises AI comes with security responsibilities that cloud providers would otherwise absorb. Those responsibilities are real, and most security teams are not prepared for them.

The On-Premises AI Threat Model

When you run AI on your own infrastructure, you take on security responsibilities that did not exist before. Let us map out what you are now accountable for: New attack surface:

GPU infrastructure security (specialized hardware, often unfamiliar)
Inference server software (Ollama, vLLM, LocalAI each have their own vulnerabilities)
Model supply chain (you are downloading model weights from the internet)
Model weight storage (large files representing significant intellectual property)

Familiar problems in new contexts:

Network security (same principles, new assets you did not have before)
Access control (who can reach the inference endpoint, who can manage models)
Secrets management (inference APIs need auth tokens, managed identity, or API keys)
Monitoring (what does "normal" look like for GPU inference workloads?)
Patch management (AI software moves extremely fast, vulnerabilities emerge regularly)

Securing Common Self-Hosted Inference Servers

Ollama

Ollama is the most popular self-hosted LLM runtime for a reason—it works well and has excellent model support. But its defaults are not production-ready, and many tutorials make this worse. The core problem: Ollama's REST API has no authentication by default. If the process listens on anything other than localhost, anyone who can reach it on the network can send requests, list models, pull new models, and in some configurations delete them. Many tutorials instruct you to set OLLAMA_HOST=0.0.0.0 for remote access without adding any authentication. The right approach: Keep Ollama bound to localhost, and put a reverse proxy in front of it that handles TLS and authentication:

# /etc/nginx/sites-available/ollama-secure
server {
    listen 443 ssl;
    server_name ai.internal.yourcompany.com;

ssl_certificate /etc/ssl/certs/internal-ca.crt; ssl_certificate_key /etc/ssl/private/internal-ca.key; ssl_protocols TLSv1.2 TLSv1.3; ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;

location / {
        auth_basic "AI Inference Server";
        auth_basic_user_file /etc/nginx/.htpasswd;

proxy_pass http://127.0.0.1:11434; proxy_set_header Host $host; proxy_read_timeout 300s; }

# Block model management endpoints from regular users
    location /api/pull {
        allow 10.0.1.5;   # Only the model management host
        deny all;
    }
    location /api/delete {
        allow 10.0.1.5;
        deny all;
    }
    location /api/push {
        deny all;
    }
}

For enterprise environments, replace basic auth with oauth2-proxy integrated with your existing SSO/identity provider.

vLLM

vLLM delivers high-performance production inference and is increasingly common in enterprise AI deployments. It exposes an OpenAI-compatible API, which is great for compatibility—and means you need to apply the same security rigor.

# Secure vLLM startup configuration
python -m vllm.entrypoints.openai.api_server         --model /models/llama-3.1-8b-instruct         --host 127.0.0.1                # Bind to localhost only—always
    --port 8000         --api-key "${INFERENCE_API_KEY}"   # Require authentication
    --max-model-len 4096               # Limit context (prevents memory exhaustion)
    --max-num-seqs 32                  # Limit concurrent requests
    --disable-log-requests              # Do not log raw request content

Always run vLLM behind a reverse proxy that handles TLS termination. The vLLM process itself should never be directly exposed to the network.

LocalAI and Llama.cpp Server

The same principles apply across all inference servers. Any process binding to 0.0.0.0 without authentication is a security problem, regardless of whether it is on an "internal" network. Internal networks get breached. The inference server should be treated as internal infrastructure, not as a trusted zone.

GPU Infrastructure Hardening

GPU servers for AI inference are specialized infrastructure with their own hardening requirements. Most server hardening guides do not cover them specifically.

Operating System Baseline

Start with a minimal, purpose-built OS installation:

# GPU inference server hardening - essential steps

# 1. Disable unnecessary services systemctl disable avahi-daemon cups bluetooth systemctl mask rpcbind nfs-server

# 2. Set correct permissions for GPU device files
# Inference service user should be in 'render' and 'video' groups, not root
usermod -aG render,video ai_inference

# 3. Configure firewall with explicit rules ufw default deny incoming ufw default deny outgoing ufw allow from 10.0.1.0/24 to any port 8000 # Only from app servers ufw allow out to any port 443 # HTTPS for updates only ufw allow out to any port 53 # DNS ufw enable

# 4. Audit model file access
auditctl -w /models/ -p rwxa -k model_access

Run Inference as a Dedicated Service Account

Never run AI inference as root. Create a non-privileged service account:

# Create dedicated service user
useradd --system --shell /bin/false --no-create-home ai_inference

# Restrict model directory access chown -R ai_inference:ai_inference /models/ chmod -R 750 /models/

# Systemd service hardening
[Service]
User=ai_inference
Group=ai_inference
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/models /var/log/ai-inference

GPU Resource Controls

# Restrict which GPUs a process can access
CUDA_VISIBLE_DEVICES=0,1 python vllm_server.py

# Prevent memory exhaustion - leave headroom # In vLLM: --gpu-memory-utilization 0.85

# Monitor GPU utilization - baseline and alert on deviations
nvidia-smi dmon -s u -d 30

Network Isolation

The goal is to make inference infrastructure accessible only to authorized applications—not to the broader network or to end users directly.

Dedicated Network Segment

Put AI infrastructure in its own segment with explicit firewall rules:

Recommended network topology:

[Users] → [Application Servers (App Subnet)] → [AI Inference Subnet] → [Model Storage]

AI Inference Subnet rules:
✓ Inbound: Only from Application Subnet, on inference port
✗ Inbound: Block everything else, including direct user traffic
✓ Outbound: Model storage subnet, approved package repositories
✗ Outbound: Deny all other outbound

Management access: ✓ SSH from dedicated jump host only ✓ Separate management VLAN

Air-Gapped Deployments

For the most sensitive environments—defense, classified data, highly regulated healthcare—consider true air-gapping. The inference server has no internet connectivity. Models are transferred through an approved, audited process. All software updates go through formal change management.

This is operationally demanding but provides the strongest possible isolation. Worth the overhead for environments where data exfiltration is a catastrophic risk.

Model Integrity and Supply Chain

Every model you download from the internet is a supply chain risk. This is not theoretical—malicious models have been identified on Hugging Face and similar platforms. Model weights can contain embedded malware in pickle-serialized formats.

Verify Sources and Hashes

Only download models from verified publishers: Meta for Llama models, Mistral AI for Mistral, official quantization authors with established track records.

# Always verify model hash against the official source
# Download from official source first
wget https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/model.safetensors

# Verify hash - get expected value from official release notes, NOT from the download page sha256sum model.safetensors

# For Ollama - check model manifests
cat ~/.ollama/models/manifests/registry.ollama.ai/library/llama3.2/latest

Prefer SafeTensors Format

SafeTensors is a serialization format specifically designed to prevent arbitrary code execution vulnerabilities inherent in pickle-based formats (PyTorch's .pt and .bin files). When a .bin file is loaded, arbitrary Python code in the file can execute. SafeTensors cannot.

# When available, always prefer the .safetensors version
# model.safetensors    <-- safe
# model.bin            <-- can execute code on load
# model.pt             <-- can execute code on load

# Scan pickle-format files before loading if safetensors unavailable # pip install picklescan picklescan --path /models/

Internal Model Registry

Do not let teams freely download models to production inference servers. Establish a controlled process:

Submit request to add model (source, purpose, expected hash)
Security review of model source and format
Test in isolated environment
Approve and add to internal registry
Deploy to production from registry only

This adds friction that feels bureaucratic until the day it stops a malicious model from reaching production.

Access Control

Define Access Tiers Explicitly

AI Access Control Tiers:

AI Users:

Can send inference requests to approved models
Cannot manage models, view configurations, or access logs

AI Operators:
<ul class="list-disc pl-6 mb-4 space-y-2">
<li class="text-gray-600">Can start and stop inference services</li>
<li class="text-gray-600">Can view inference logs (without raw content)</li>
<li class="text-gray-600">Cannot pull new models or modify access controls</li>
</ul>

AI Administrators:

Can add and remove models
Can modify access controls and configurations
Requires MFA, all actions audited

Integrate with Your Identity Provider

For single-user or small team deployments, basic auth or API tokens may be acceptable. For enterprise, integrate with your existing identity provider through oauth2-proxy or similar:

# oauth2-proxy in front of inference server
oauth2-proxy         --provider=oidc         --oidc-issuer-url=https://login.yourcompany.com/oauth2         --client-id=ai-inference-server         --upstream=http://localhost:11434         --email-domain=yourcompany.com         --cookie-secret="${COOKIE_SECRET}"         --pass-authorization-header=true

Monitoring and Anomaly Detection

On-premises AI requires you to build your own monitoring stack. There is no built-in dashboard from your AI provider.

What to Monitor

Operational baseline—alert on significant deviation:
<ul class="list-disc pl-6 mb-4 space-y-2">
<li class="text-gray-600">Requests per second (per model, per authenticated user)</li>
<li class="text-gray-600">Tokens generated per hour</li>
<li class="text-gray-600">GPU utilization and memory usage</li>
<li class="text-gray-600">Model load and unload events</li>
</ul>

Security events—alert immediately:

Authentication failures (especially repeated failures from same source)
Requests to model management endpoints from unauthorized sources
New IP addresses or user agents not seen before
Very long input context windows (potential injection attack)
Inference requests during unusual hours (for business-hours workloads)

Structured Logging for Security Visibility

# Nginx JSON access log for AI inference proxy
log_format ai_access escape=json
    '{"timestamp":"$time_iso8601",'
    '"client_ip":"$remote_addr",'
    '"authenticated_user":"$remote_user",'
    '"method":"$request_method",'
    '"endpoint":"$request_uri",'
    '"http_status":"$status",'
    '"response_bytes":"$bytes_sent",'
    '"request_duration_ms":"$request_time"}';

access_log /var/log/nginx/ai-access.log ai_access;

Ship these logs to your SIEM or log management platform. Build alerts on authentication failures, management endpoint access, and usage volume spikes.

Production Readiness Checklist

Infrastructure:
✓ Inference server bound to localhost, proxied through nginx/caddy
✓ TLS with valid certificate on the proxy
✓ Authentication required for all endpoints
✓ Management endpoints restricted to authorized hosts
✓ Inference service runs as non-root service account
✓ Model files owned by service account, restricted permissions
✓ GPU server on isolated network segment with explicit firewall rules

Supply Chain: ✓ Model hashes verified against official release notes ✓ SafeTensors format used where available ✓ Model approval process documented and enforced ✓ No automatic model downloads in production environments

Monitoring:
✓ Structured access logging enabled
✓ Authentication events captured
✓ GPU utilization monitored
✓ Usage baseline established, anomaly alerts configured

Operations: ✓ Patch management process for inference software ✓ Incident response playbook for AI infrastructure incidents ✓ Access reviews scheduled quarterly

On-premises AI gives you control over your data and your infrastructure. That control is worth having. It also means security is entirely your responsibility. The organizations that do this well treat their AI infrastructure with the same discipline as their production databases—because that is exactly what it is.

Why On-Premises AI Is Having a Moment

The On-Premises AI Threat Model

When you run AI on your own infrastructure, you take on security responsibilities that did not exist before. Let us map out what you are now accountable for: New attack surface:

GPU infrastructure security (specialized hardware, often unfamiliar)
Inference server software (Ollama, vLLM, LocalAI each have their own vulnerabilities)
Model supply chain (you are downloading model weights from the internet)
Model weight storage (large files representing significant intellectual property)

Familiar problems in new contexts:

Network security (same principles, new assets you did not have before)
Access control (who can reach the inference endpoint, who can manage models)
Secrets management (inference APIs need auth tokens, managed identity, or API keys)
Monitoring (what does "normal" look like for GPU inference workloads?)
Patch management (AI software moves extremely fast, vulnerabilities emerge regularly)

Securing Common Self-Hosted Inference Servers

Ollama

# /etc/nginx/sites-available/ollama-secure
server {
    listen 443 ssl;
    server_name ai.internal.yourcompany.com;

location / {
        auth_basic "AI Inference Server";
        auth_basic_user_file /etc/nginx/.htpasswd;

proxy_pass http://127.0.0.1:11434; proxy_set_header Host $host; proxy_read_timeout 300s; }

# Block model management endpoints from regular users
    location /api/pull {
        allow 10.0.1.5;   # Only the model management host
        deny all;
    }
    location /api/delete {
        allow 10.0.1.5;
        deny all;
    }
    location /api/push {
        deny all;
    }
}

For enterprise environments, replace basic auth with oauth2-proxy integrated with your existing SSO/identity provider.

vLLM

# Secure vLLM startup configuration
python -m vllm.entrypoints.openai.api_server         --model /models/llama-3.1-8b-instruct         --host 127.0.0.1                # Bind to localhost only—always
    --port 8000         --api-key "${INFERENCE_API_KEY}"   # Require authentication
    --max-model-len 4096               # Limit context (prevents memory exhaustion)
    --max-num-seqs 32                  # Limit concurrent requests
    --disable-log-requests              # Do not log raw request content

Always run vLLM behind a reverse proxy that handles TLS termination. The vLLM process itself should never be directly exposed to the network.

LocalAI and Llama.cpp Server

GPU Infrastructure Hardening

GPU servers for AI inference are specialized infrastructure with their own hardening requirements. Most server hardening guides do not cover them specifically.

Operating System Baseline

Start with a minimal, purpose-built OS installation:

# GPU inference server hardening - essential steps

# 1. Disable unnecessary services systemctl disable avahi-daemon cups bluetooth systemctl mask rpcbind nfs-server

# 2. Set correct permissions for GPU device files
# Inference service user should be in 'render' and 'video' groups, not root
usermod -aG render,video ai_inference

# 4. Audit model file access
auditctl -w /models/ -p rwxa -k model_access

Run Inference as a Dedicated Service Account

Never run AI inference as root. Create a non-privileged service account:

# Create dedicated service user
useradd --system --shell /bin/false --no-create-home ai_inference

# Restrict model directory access chown -R ai_inference:ai_inference /models/ chmod -R 750 /models/

# Systemd service hardening
[Service]
User=ai_inference
Group=ai_inference
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/models /var/log/ai-inference

GPU Resource Controls

# Restrict which GPUs a process can access
CUDA_VISIBLE_DEVICES=0,1 python vllm_server.py

# Prevent memory exhaustion - leave headroom # In vLLM: --gpu-memory-utilization 0.85

# Monitor GPU utilization - baseline and alert on deviations
nvidia-smi dmon -s u -d 30

Network Isolation

The goal is to make inference infrastructure accessible only to authorized applications—not to the broader network or to end users directly.

Dedicated Network Segment

Put AI infrastructure in its own segment with explicit firewall rules:

Recommended network topology:

[Users] → [Application Servers (App Subnet)] → [AI Inference Subnet] → [Model Storage]

AI Inference Subnet rules:
✓ Inbound: Only from Application Subnet, on inference port
✗ Inbound: Block everything else, including direct user traffic
✓ Outbound: Model storage subnet, approved package repositories
✗ Outbound: Deny all other outbound

Management access: ✓ SSH from dedicated jump host only ✓ Separate management VLAN

Air-Gapped Deployments

This is operationally demanding but provides the strongest possible isolation. Worth the overhead for environments where data exfiltration is a catastrophic risk.

Model Integrity and Supply Chain

Verify Sources and Hashes

Only download models from verified publishers: Meta for Llama models, Mistral AI for Mistral, official quantization authors with established track records.

# Always verify model hash against the official source
# Download from official source first
wget https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/model.safetensors

# Verify hash - get expected value from official release notes, NOT from the download page sha256sum model.safetensors

# For Ollama - check model manifests
cat ~/.ollama/models/manifests/registry.ollama.ai/library/llama3.2/latest

Prefer SafeTensors Format

# When available, always prefer the .safetensors version
# model.safetensors    <-- safe
# model.bin            <-- can execute code on load
# model.pt             <-- can execute code on load

# Scan pickle-format files before loading if safetensors unavailable # pip install picklescan picklescan --path /models/

Internal Model Registry

Do not let teams freely download models to production inference servers. Establish a controlled process:

Submit request to add model (source, purpose, expected hash)
Security review of model source and format
Test in isolated environment
Approve and add to internal registry
Deploy to production from registry only

This adds friction that feels bureaucratic until the day it stops a malicious model from reaching production.

Access Control

Define Access Tiers Explicitly

AI Access Control Tiers:

AI Users:

Can send inference requests to approved models
Cannot manage models, view configurations, or access logs

AI Operators:
<ul class="list-disc pl-6 mb-4 space-y-2">
<li class="text-gray-600">Can start and stop inference services</li>
<li class="text-gray-600">Can view inference logs (without raw content)</li>
<li class="text-gray-600">Cannot pull new models or modify access controls</li>
</ul>

AI Administrators:

Can add and remove models
Can modify access controls and configurations
Requires MFA, all actions audited

Integrate with Your Identity Provider

For single-user or small team deployments, basic auth or API tokens may be acceptable. For enterprise, integrate with your existing identity provider through oauth2-proxy or similar:

# oauth2-proxy in front of inference server
oauth2-proxy         --provider=oidc         --oidc-issuer-url=https://login.yourcompany.com/oauth2         --client-id=ai-inference-server         --upstream=http://localhost:11434         --email-domain=yourcompany.com         --cookie-secret="${COOKIE_SECRET}"         --pass-authorization-header=true

Monitoring and Anomaly Detection

On-premises AI requires you to build your own monitoring stack. There is no built-in dashboard from your AI provider.

What to Monitor

Operational baseline—alert on significant deviation:
<ul class="list-disc pl-6 mb-4 space-y-2">
<li class="text-gray-600">Requests per second (per model, per authenticated user)</li>
<li class="text-gray-600">Tokens generated per hour</li>
<li class="text-gray-600">GPU utilization and memory usage</li>
<li class="text-gray-600">Model load and unload events</li>
</ul>

Security events—alert immediately:

Authentication failures (especially repeated failures from same source)
Requests to model management endpoints from unauthorized sources
New IP addresses or user agents not seen before
Very long input context windows (potential injection attack)
Inference requests during unusual hours (for business-hours workloads)

Structured Logging for Security Visibility

# Nginx JSON access log for AI inference proxy
log_format ai_access escape=json
    '{"timestamp":"$time_iso8601",'
    '"client_ip":"$remote_addr",'
    '"authenticated_user":"$remote_user",'
    '"method":"$request_method",'
    '"endpoint":"$request_uri",'
    '"http_status":"$status",'
    '"response_bytes":"$bytes_sent",'
    '"request_duration_ms":"$request_time"}';

access_log /var/log/nginx/ai-access.log ai_access;

Ship these logs to your SIEM or log management platform. Build alerts on authentication failures, management endpoint access, and usage volume spikes.

Production Readiness Checklist

Infrastructure:
✓ Inference server bound to localhost, proxied through nginx/caddy
✓ TLS with valid certificate on the proxy
✓ Authentication required for all endpoints
✓ Management endpoints restricted to authorized hosts
✓ Inference service runs as non-root service account
✓ Model files owned by service account, restricted permissions
✓ GPU server on isolated network segment with explicit firewall rules

Monitoring:
✓ Structured access logging enabled
✓ Authentication events captured
✓ GPU utilization monitored
✓ Usage baseline established, anomaly alerts configured

Operations: ✓ Patch management process for inference software ✓ Incident response playbook for AI infrastructure incidents ✓ Access reviews scheduled quarterly

Why On-Premises AI Is Having a Moment

The On-Premises AI Threat Model

Securing Common Self-Hosted Inference Servers

Ollama

vLLM

LocalAI and Llama.cpp Server

GPU Infrastructure Hardening

Operating System Baseline

Run Inference as a Dedicated Service Account

GPU Resource Controls

Network Isolation

Dedicated Network Segment

Air-Gapped Deployments

Model Integrity and Supply Chain

Verify Sources and Hashes

Prefer SafeTensors Format

Internal Model Registry

Access Control

Define Access Tiers Explicitly

Integrate with Your Identity Provider

Monitoring and Anomaly Detection

What to Monitor

Structured Logging for Security Visibility

Production Readiness Checklist

Idan Ohayon

Share this article

Questions & Answers

Need Help with Your Security?

Why On-Premises AI Is Having a Moment

The On-Premises AI Threat Model

Securing Common Self-Hosted Inference Servers

Ollama

vLLM

LocalAI and Llama.cpp Server

GPU Infrastructure Hardening

Operating System Baseline

Run Inference as a Dedicated Service Account

GPU Resource Controls

Network Isolation

Dedicated Network Segment

Air-Gapped Deployments

Model Integrity and Supply Chain

Verify Sources and Hashes

Prefer SafeTensors Format

Internal Model Registry

Access Control

Define Access Tiers Explicitly

Integrate with Your Identity Provider

Monitoring and Anomaly Detection

What to Monitor

Structured Logging for Security Visibility

Production Readiness Checklist

Idan Ohayon

Share this article

Questions & Answers

Need Help with Your Security?