Cyber Intelligence
AI Security17 min read

On-Premises AI Security: Protecting Self-Hosted LLMs and GPU Infrastructure

Running AI on your own infrastructure gives you control over your data. It also means you own the security. Here is how to secure Ollama, vLLM, and other self-hosted AI deployments properly.

I
Microsoft Cloud Solution Architect
On-Premises AI Security: Protecting Self-Hosted LLMs and GPU Infrastructure infographic showing key AI Security concepts and controls
On-Premises AIOllamavLLMSelf-Hosted LLMGPU SecurityAI InfrastructureLLM Security
Video transcript

You've just deployed Ollama on your own servers. Your data never touches the cloud. So you're secure, right. Wrong. Running self-hosted L L M s means you own every layer of defense. When people skip security on on-premises A I, attackers don't need cloud credentials. They just pivot through your unprotected G P U infrastructure and steal training data, model weights, or worse. Ninety-two percent of breaches involve inadequate access controls on internal systems. Start with network isolation. Think of it like this: your A I server is a vault. Don't leave the door open to every system in your data center. Use a Z T N A approach where only verified users and services can even see it exists. Next, lock down your A P I endpoints. Tools like Ollama and vL L M expose ports by default. That's like advertising your vault location. Require M F A, rate limiting, and I A M policies so only your applications can call your models. Finally, monitor everything with S I E M and S O A R solutions. Log every inference request, every weight transfer, every login. Patterns reveal attacks faster than signatures ever could. Start today: audit your current Ollama or vL L M deployment for open ports and missing authentication. Read the complete guide at protego dot me.

Why On-Premises AI Is Having a Moment

The conversation around enterprise AI has shifted. Eighteen months ago, the default answer to "how do we use AI?" was almost universally "use the OpenAI API." Now organizations are asking harder questions and choosing self-hosted AI more frequently, for good reasons.

Data sensitivity: Healthcare providers, law firms, defense contractors, and financial institutions often cannot send data to third-party AI providers. Regulatory requirements, contractual obligations, or well-founded risk aversion.

Cost at scale: At high inference volumes, running your own models can be significantly cheaper than API fees. Those per-token charges add up fast when processing millions of documents.

Customization: Fine-tuned models running on your infrastructure, trained on your specific data, with no concerns about data leaving your environment.

Control: You define the model version, the update cadence, the availability guarantees, and what data the model can see.

But here is what rarely gets discussed when teams make this decision: on-premises AI comes with security responsibilities that cloud providers would otherwise absorb. Those responsibilities are real, and most security teams are not prepared for them.

The On-Premises AI Threat Model

When you run AI on your own infrastructure, you take on security responsibilities that did not exist before. Let us map out what you are now accountable for:

New attack surface:

  • GPU infrastructure security (specialized hardware, often unfamiliar)
  • Inference server software (Ollama, vLLM, LocalAI each have their own vulnerabilities)
  • Model supply chain (you are downloading model weights from the internet)
  • Model weight storage (large files representing significant intellectual property)

Familiar problems in new contexts:

  • Network security (same principles, new assets you did not have before)
  • Access control (who can reach the inference endpoint, who can manage models)
  • Secrets management (inference APIs need auth tokens, managed identity, or API keys)
  • Monitoring (what does "normal" look like for GPU inference workloads?)
  • Patch management (AI software moves extremely fast, vulnerabilities emerge regularly)

Securing Common Self-Hosted Inference Servers

Ollama

Ollama is the most popular self-hosted LLM runtime for a reason: it works well and has excellent model support. But its defaults are not production-ready, and many tutorials make this worse.

The core problem: Ollama's REST API has no authentication by default. If the process listens on anything other than localhost, anyone who can reach it on the network can send requests, list models, pull new models, and in some configurations delete them. Many tutorials instruct you to set OLLAMA_HOST=0.0.0.0 for remote access without adding any authentication.

The right approach: Keep Ollama bound to localhost, and put a reverse proxy in front of it that handles TLS and authentication:

# /etc/nginx/sites-available/ollama-secure
server {
    listen 443 ssl;
    server_name ai.internal.yourcompany.com;

    ssl_certificate /etc/ssl/certs/internal-ca.crt;
    ssl_certificate_key /etc/ssl/private/internal-ca.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;

    location / {
        auth_basic "AI Inference Server";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }

    # Block model management endpoints from regular users
    location /api/pull {
        allow 10.0.1.5;   # Only the model management host
        deny all;
    }
    location /api/delete {
        allow 10.0.1.5;
        deny all;
    }
    location /api/push {
        deny all;
    }
}

For enterprise environments, replace basic auth with oauth2-proxy integrated with your existing SSO/identity provider.

vLLM

vLLM delivers high-performance production inference and is increasingly common in enterprise AI deployments. It exposes an OpenAI-compatible API, which is great for compatibility, and means you need to apply the same security rigor.

# Secure vLLM startup configuration
python -m vllm.entrypoints.openai.api_server         --model /models/llama-3.1-8b-instruct         --host 127.0.0.1                # Bind to localhost only - always
    --port 8000         --api-key "${INFERENCE_API_KEY}"   # Require authentication
    --max-model-len 4096               # Limit context (prevents memory exhaustion)
    --max-num-seqs 32                  # Limit concurrent requests
    --disable-log-requests              # Do not log raw request content

Always run vLLM behind a reverse proxy that handles TLS termination. The vLLM process itself should never be directly exposed to the network.

LocalAI and Llama.cpp Server

The same principles apply across all inference servers. Any process binding to 0.0.0.0 without authentication is a security problem, regardless of whether it is on an "internal" network. Internal networks get breached. The inference server should be treated as internal infrastructure, not as a trusted zone.

GPU Infrastructure Hardening

GPU servers for AI inference are specialized infrastructure with their own hardening requirements. Most server hardening guides do not cover them specifically.

Operating System Baseline

Start with a minimal, purpose-built OS installation:

# GPU inference server hardening - essential steps

# 1. Disable unnecessary services
systemctl disable avahi-daemon cups bluetooth
systemctl mask rpcbind nfs-server

# 2. Set correct permissions for GPU device files
# Inference service user should be in 'render' and 'video' groups, not root
usermod -aG render,video ai_inference

# 3. Configure firewall with explicit rules
ufw default deny incoming
ufw default deny outgoing
ufw allow from 10.0.1.0/24 to any port 8000  # Only from app servers
ufw allow out to any port 443                  # HTTPS for updates only
ufw allow out to any port 53                   # DNS
ufw enable

# 4. Audit model file access
auditctl -w /models/ -p rwxa -k model_access

Run Inference as a Dedicated Service Account

Never run AI inference as root. Create a non-privileged service account:

# Create dedicated service user
useradd --system --shell /bin/false --no-create-home ai_inference

# Restrict model directory access
chown -R ai_inference:ai_inference /models/
chmod -R 750 /models/

# Systemd service hardening
[Service]
User=ai_inference
Group=ai_inference
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/models /var/log/ai-inference

GPU Resource Controls

# Restrict which GPUs a process can access
CUDA_VISIBLE_DEVICES=0,1 python vllm_server.py

# Prevent memory exhaustion - leave headroom
# In vLLM:
--gpu-memory-utilization 0.85

# Monitor GPU utilization - baseline and alert on deviations
nvidia-smi dmon -s u -d 30

Network Isolation

The goal is to make inference infrastructure accessible only to authorized applications, not to the broader network or to end users directly.

Dedicated Network Segment

Put AI infrastructure in its own segment with explicit firewall rules:

Recommended network topology:

[Users] → [Application Servers (App Subnet)] → [AI Inference Subnet] → [Model Storage]

AI Inference Subnet rules:
✓ Inbound: Only from Application Subnet, on inference port
✗ Inbound: Block everything else, including direct user traffic
✓ Outbound: Model storage subnet, approved package repositories
✗ Outbound: Deny all other outbound

Management access:
✓ SSH from dedicated jump host only
✓ Separate management VLAN

Air-Gapped Deployments

For the most sensitive environments (defense, classified data, highly regulated healthcare), consider true air-gapping. The inference server has no internet connectivity. Models are transferred through an approved, audited process. All software updates go through formal change management.

This is operationally demanding but provides the strongest possible isolation. Worth the overhead for environments where data exfiltration is a catastrophic risk.

Model Integrity and Supply Chain

Every model you download from the internet is a supply chain risk. This is not theoretical: malicious models have been identified on Hugging Face and similar platforms. Model weights can contain embedded malware in pickle-serialized formats.

Verify Sources and Hashes

Only download models from verified publishers: Meta for Llama models, Mistral AI for Mistral, official quantization authors with established track records.

# Always verify model hash against the official source
# Download from official source first
wget https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/model.safetensors

# Verify hash - get expected value from official release notes, NOT from the download page
sha256sum model.safetensors

# For Ollama - check model manifests
cat ~/.ollama/models/manifests/registry.ollama.ai/library/llama3.2/latest

Prefer SafeTensors Format

SafeTensors is a serialization format specifically designed to prevent arbitrary code execution vulnerabilities inherent in pickle-based formats (PyTorch's .pt and .bin files). When a .bin file is loaded, arbitrary Python code in the file can execute. SafeTensors cannot.

# When available, always prefer the .safetensors version
# model.safetensors    <-- safe
# model.bin            <-- can execute code on load
# model.pt             <-- can execute code on load

# Scan pickle-format files before loading if safetensors unavailable
# pip install picklescan
picklescan --path /models/

Internal Model Registry

Do not let teams freely download models to production inference servers. Establish a controlled process:

  1. Submit request to add model (source, purpose, expected hash)
  2. Security review of model source and format
  3. Test in isolated environment
  4. Approve and add to internal registry
  5. Deploy to production from registry only

This adds friction that feels bureaucratic until the day it stops a malicious model from reaching production.

Access Control

Define Access Tiers Explicitly

AI Access Control Tiers:

AI Users:
  • Can send inference requests to approved models
  • Cannot manage models, view configurations, or access logs
AI Operators:
  • Can start and stop inference services
  • Can view inference logs (without raw content)
  • Cannot pull new models or modify access controls
AI Administrators:
  • Can add and remove models
  • Can modify access controls and configurations
  • Requires MFA, all actions audited

Integrate with Your Identity Provider

For single-user or small team deployments, basic auth or API tokens may be acceptable. For enterprise, integrate with your existing identity provider through oauth2-proxy or similar:

# oauth2-proxy in front of inference server
oauth2-proxy         --provider=oidc         --oidc-issuer-url=https://login.yourcompany.com/oauth2         --client-id=ai-inference-server         --upstream=http://localhost:11434         --email-domain=yourcompany.com         --cookie-secret="${COOKIE_SECRET}"         --pass-authorization-header=true

Monitoring and Anomaly Detection

On-premises AI requires you to build your own monitoring stack. There is no built-in dashboard from your AI provider.

What to Monitor

Operational baseline (alert on significant deviation):
  • Requests per second (per model, per authenticated user)
  • Tokens generated per hour
  • GPU utilization and memory usage
  • Model load and unload events
Security events (alert immediately):
  • Authentication failures (especially repeated failures from same source)
  • Requests to model management endpoints from unauthorized sources
  • New IP addresses or user agents not seen before
  • Very long input context windows (potential injection attack)
  • Inference requests during unusual hours (for business-hours workloads)

Structured Logging for Security Visibility

# Nginx JSON access log for AI inference proxy
log_format ai_access escape=json
    '{"timestamp":"$time_iso8601",'
    '"client_ip":"$remote_addr",'
    '"authenticated_user":"$remote_user",'
    '"method":"$request_method",'
    '"endpoint":"$request_uri",'
    '"http_status":"$status",'
    '"response_bytes":"$bytes_sent",'
    '"request_duration_ms":"$request_time"}';

access_log /var/log/nginx/ai-access.log ai_access;

Ship these logs to your SIEM or log management platform. Build alerts on authentication failures, management endpoint access, and usage volume spikes.

Production Readiness Checklist

Infrastructure:
✓ Inference server bound to localhost, proxied through nginx/caddy
✓ TLS with valid certificate on the proxy
✓ Authentication required for all endpoints
✓ Management endpoints restricted to authorized hosts
✓ Inference service runs as non-root service account
✓ Model files owned by service account, restricted permissions
✓ GPU server on isolated network segment with explicit firewall rules

Supply Chain:
✓ Model hashes verified against official release notes
✓ SafeTensors format used where available
✓ Model approval process documented and enforced
✓ No automatic model downloads in production environments

Monitoring:
✓ Structured access logging enabled
✓ Authentication events captured
✓ GPU utilization monitored
✓ Usage baseline established, anomaly alerts configured

Operations:
✓ Patch management process for inference software
✓ Incident response playbook for AI infrastructure incidents
✓ Access reviews scheduled quarterly

On-premises AI gives you control over your data and your infrastructure. That control is worth having. It also means security is entirely your responsibility. The organizations that do this well treat their AI infrastructure with the same discipline as their production databases, because that is exactly what it is.

Frequently Asked Questions

Why is it dangerous to expose Ollama or vLLM directly on a public network interface?

Ollama's default configuration binds to localhost and has no authentication. If you change the bind address to 0.0.0.0 to make it accessible across a network without adding authentication, anyone who can reach that port can make unlimited inference requests at no cost to themselves and at full GPU cost to you. Worse, management endpoints on inference servers often expose model loading and configuration operations that could be used to replace a production model with a malicious one. Every self-hosted inference server must sit behind an authenticated reverse proxy, never exposed directly.

What is the model supply chain risk and how do you verify a model from Hugging Face is safe?

Models downloaded from Hugging Face or similar platforms may contain backdoors, biases, or malware, particularly in pickle-serialized formats (.pt, .bin) that can execute arbitrary Python code when loaded. To verify safety: check that the publisher has an established track record with many downloads and community reviews, verify the model's SHA-256 hash against the hash published in official release notes (not the same page you downloaded from), prefer SafeTensors format which cannot execute code on load, and run new models in an isolated environment for initial testing before production deployment. An internal model approval process with a named reviewer is the organizational control that ensures this happens consistently.

How should GPU servers be isolated on the network for self-hosted AI deployments?

GPU inference servers should sit on a dedicated network segment with a restrictive firewall policy. The inference server's management interface should only be reachable from a jump host or bastion used by authorized administrators. The inference API should only be reachable from the application servers that call it, not from the broader corporate network. Outbound internet access from the GPU server should be blocked after model deployment except for specific update and telemetry endpoints. This isolation limits the blast radius if the inference software has a vulnerability and prevents unauthorized access even if an attacker gains a foothold elsewhere in the network.

What should be monitored on a self-hosted LLM deployment to detect security incidents?

Four categories of events require monitoring: authentication failures (especially repeated failures from the same source, indicating brute-force attempts), management endpoint access from any source not on the explicit authorized list, unusual usage patterns (requests per second significantly above established baseline, very long input context windows which may indicate injection attempts, requests at unusual hours for the application's normal usage pattern), and model management events (any model load or unload operation that was not initiated through the approved deployment process). These should generate immediate alerts rather than daily digest reports, because most attacks move quickly once an initial foothold is established.

What are the credential security requirements for service accounts running AI inference services?

The service account running the inference server should have read-only access to model files and no other filesystem permissions. It should have no sudo privileges and should not be a member of any privileged groups. Its credentials should not be the same as any human user account. Application service accounts that call the inference API should use short-lived tokens (OAuth2/OIDC) rather than static API keys where the inference server supports it, or rotate static keys on a 90-day schedule at most. Log all authentication events so that any credential misuse is detectable. Service account passwords should never be shared or documented in wikis, tickets, or chat platforms.

N

Recommended tool: Nordpass

Up to 40% commission

Get weekly security insights

Cloud security, zero trust, and identity guides — straight to your inbox.

I

Microsoft Cloud Solution Architect

Cloud Solution Architect with deep expertise in Microsoft Azure and a strong background in systems and IT infrastructure. Passionate about cloud technologies, security best practices, and helping organizations modernize their infrastructure.

Share this article

Questions & Answers

Related Articles

Need Help with Your Security?

Our team of security experts can help you implement the strategies discussed in this article.

Contact Us