I can’t tell you how many sleepless nights I’ve had because of broken VPN connections. After 7+ years dealing with Azure networking (and plenty of mistakes along the way!), I wanted to share some real-world fixes that have saved my bacon more times than I can count.
The Basics: Common S2S VPN Issues
So what usually breaks these connections? In my experience:
- Mismatched config parameters – One tiny setting and boom, nothing works
- Firewall/traffic filtering – Where do packets go? Nobody knows!
- Certificate/PSK problems – Authentication is a pain in the…
- Routing config issues – Traffic goes in, but doesn’t come out
- Azure platform quirks – Yeah, sometimes it’s just Azure being Azure
Configuration Verification: First Things First
Look, before you go down a 3-hour rabbit hole (been there!), just double-check these settings on both sides:
- Connection type (route-based vs policy-based)
- Encryption algorithms (AES-256, AES-128, etc.)
- Integrity/hash algorithms (SHA-1, SHA-256)
- DH Groups/PFS settings
- IKE version (v1 or v2)
- SA lifetime values
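If you'd rather not click through two portals to compare these, here's a quick way to dump the Azure side with the Az PowerShell module. Treat it as a sketch – the resource group and connection names are placeholders for your own:

```powershell
# Dump the Azure side of the IKE/IPsec parameters so you can diff them
# against the on-prem device config. Names below are placeholders.
$rg   = "rg-vpn-lab"
$conn = Get-AzVirtualNetworkGatewayConnection -ResourceGroupName $rg -Name "cn-onprem-to-azure"

$conn.ConnectionType                  # e.g. IPsec (S2S)
$conn.ConnectionProtocol              # IKEv1 or IKEv2
$conn.UsePolicyBasedTrafficSelectors  # relevant when the on-prem device is policy-based
$conn.IpsecPolicies                   # empty = Azure default IPsec/IKE policy

# And the PSK, to rule out a copy/paste mismatch
Get-AzVirtualNetworkGatewayConnectionSharedKey -ResourceGroupName $rg -Name "cn-onprem-to-azure"
```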
True story: Last month I spent an entire afternoon troubleshooting only to find I’d fat-fingered the subnet mask on the Azure side. Ugh.
Troubleshooting Through the Azure Portal (The Easy Way)
Before diving into logs and packet captures, the Azure Portal actually has some built-in tools that can save you hours. Let me share a couple lifesavers:
Connection Troubleshooter
This thing has bailed me out more times than I’d like to admit:
- Head to your Virtual Network Gateway in the Azure portal
- Click on “Connections” in the left menu
- Select the problematic connection
- Hit the “Troubleshoot” button at the top
What happens next is pretty cool – Azure runs diagnostics on both the connection and gateway, checking for common issues like:
- Mismatched shared keys
- Configuration problems
- Certificate issues
- Connection status
It’ll give you a report with recommended actions. Last month, it instantly spotted that the pre-shared key on my on-prem device no longer matched the Azure side (which would’ve taken me forever to figure out manually).
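By the way, you can kick off the same troubleshoot operation from PowerShell through Network Watcher, which is handy if you want to run it repeatedly. A minimal sketch – all resource names are placeholders, and it assumes a blob container called "troubleshooting" already exists for the results:

```powershell
# Run the VPN troubleshoot operation via Network Watcher; the detailed
# report lands in the storage container. Names are placeholders.
$rg   = "rg-vpn-lab"
$nw   = Get-AzNetworkWatcher -Location "westeurope"
$conn = Get-AzVirtualNetworkGatewayConnection -ResourceGroupName $rg -Name "cn-onprem-to-azure"
$sa   = Get-AzStorageAccount -ResourceGroupName $rg -Name "stvpndiag"

Start-AzNetworkWatcherResourceTroubleshooting -NetworkWatcher $nw `
    -TargetResourceId $conn.Id `
    -StorageId $sa.Id `
    -StoragePath "$($sa.PrimaryEndpoints.Blob)troubleshooting"
```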
Reset Connection (The Nuclear Option)
Sometimes connections get stuck in a weird state, and no amount of config tweaking helps. That’s when I use the reset feature:
- Navigate to your Virtual Network Gateway
- Click on “Connections”
- Select the problem connection
- Hit the “Reset” button at the top
Be careful though – this is basically turning it off and on again. It’ll disrupt any current traffic, but I’ve seen it fix mysterious issues when nothing else worked. Had a customer last year with an intermittent connection that would drop randomly – we tried everything for days until a simple reset fixed it permanently.
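If you prefer a terminal over the portal, the Az module has an equivalent. Again, just a sketch with placeholder names – and remember it will bounce the tunnel:

```powershell
# Scripted equivalent of the portal's "Reset" button on a connection.
# This briefly drops the tunnel, so expect a short outage.
Reset-AzVirtualNetworkGatewayConnection -ResourceGroupName "rg-vpn-lab" -Name "cn-onprem-to-azure"

# If resetting the connection doesn't help, the gateway itself can be reset
# too (more disruptive - it reboots the active gateway instance):
# $gw = Get-AzVirtualNetworkGateway -ResourceGroupName "rg-vpn-lab" -Name "vgw-hub"
# Reset-AzVirtualNetworkGateway -VirtualNetworkGateway $gw
```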
Using Azure Diagnostic Logs (That Most People Skip)
Azure actually has pretty decent logs that nobody seems to use. Enable them by:
- Going to your Virtual Network Gateway
- Clicking “Diagnostic settings” under Monitoring
- Setting up GatewayDiagnosticLog and TunnelDiagnosticLog
- Sending them somewhere useful (Log Analytics or storage)
Pro tip: Filter these logs with keywords like “connection,” “IKE,” or “tunnel” – saves tons of time.
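If you'd rather script the diagnostic settings than click through them, here's a rough sketch using the Az.Monitor cmdlets – the workspace resource ID and resource names are placeholders, so adjust for your environment:

```powershell
# Enable GatewayDiagnosticLog and TunnelDiagnosticLog on the VPN gateway and
# ship them to a Log Analytics workspace. Names/IDs below are placeholders.
$gw = Get-AzVirtualNetworkGateway -ResourceGroupName "rg-vpn-lab" -Name "vgw-hub"
$workspaceId = "/subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-network"

$logs = "GatewayDiagnosticLog", "TunnelDiagnosticLog" |
    ForEach-Object { New-AzDiagnosticSettingLogSettingsObject -Enabled $true -Category $_ }

New-AzDiagnosticSetting -Name "vpn-gw-diagnostics" `
    -ResourceId $gw.Id -WorkspaceId $workspaceId -Log $logs
```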
Connection Validation That Actually Works
When I’m stuck, I run these basic tests:
From on-prem:
ping [an IP inside the Azure Virtual Network address range]
traceroute [an IP inside the Azure Virtual Network address range]
From an Azure VM, traceroute won’t get you far across the gateway, so use Network Watcher’s Connection troubleshoot instead.
Don’t freak out if ping fails – lots of places block ICMP. Try TCP tests instead:
Test-NetConnection -ComputerName [target IP] -Port [target port]
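When I need to check more than one port, I wrap it in a quick loop. Nothing fancy – the target IP and port list below are just examples:

```powershell
# TCP reachability from an on-prem Windows box to an Azure VM over the tunnel.
# The target IP and port list are examples; swap in your own.
$azureVm = "10.1.1.4"
foreach ($port in 443, 3389, 1433) {
    $r = Test-NetConnection -ComputerName $azureVm -Port $port -WarningAction SilentlyContinue
    "{0}:{1} -> {2}" -f $azureVm, $port, $(if ($r.TcpTestSucceeded) { "open" } else { "blocked/filtered" })
}
```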
Packet Captures: The Last Resort That’s Actually the Best Resort
Whenever I’m totally stumped (which happens more than I’d like to admit), packet captures save me. On the on-prem side, use your VPN vendor’s capture tools and watch UDP 500/4500 to see how the IKE negotiation is going. On the Azure side, Network Watcher can capture what actually reaches a VM behind the gateway:
- Hit up Network Watcher
- Click “Packet capture”
- Target an Azure VM in the affected VNet
- Filter for the traffic you expect to arrive over the tunnel (IKE on UDP 500/4500 terminates at the gateway itself, so you won’t see it in a VM capture)
When reviewing captures, I’m mainly looking for:
- Failed IKE negotiations
- Repeated connection attempts that go nowhere
- Mysterious timeouts
- TCP resets out of nowhere
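Here's roughly how I script the Azure-side capture when I don't want to click through the portal. A sketch with placeholder names – and note the target VM needs the Network Watcher agent extension installed:

```powershell
# Start a Network Watcher packet capture on an Azure VM and write the capture
# file to a storage account. Resource names and the on-prem range are placeholders.
$rg = "rg-vpn-lab"
$nw = Get-AzNetworkWatcher -Location "westeurope"
$vm = Get-AzVM -ResourceGroupName $rg -Name "vm-app01"
$sa = Get-AzStorageAccount -ResourceGroupName $rg -Name "stvpncaptures"

# Only keep traffic to/from the on-prem range so the capture stays manageable
$filter = New-AzPacketCaptureFilterConfig -Protocol TCP -RemoteIPAddress "10.10.0.0-10.10.255.255"

New-AzNetworkWatcherPacketCapture -NetworkWatcher $nw `
    -PacketCaptureName "vpn-troubleshoot" `
    -TargetVirtualMachineId $vm.Id `
    -StorageAccountId $sa.Id `
    -Filter $filter
```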
A Story About Routing That Still Haunts Me
So this one time (about 6 months ago), I had this weird issue – traffic flowed from on-prem to Azure just fine, but nothing came back. The tunnel showed connected, everything looked perfect, but nada.
After wasting a day, I finally checked Azure’s effective routes and found the problem. Had to:
- Create a route table in Azure
- Add a route for my on-prem network pointing to the VPN gateway
- Associate it with my subnets
Lesson learned: Always check effective routes in Azure when traffic only flows one way. Network Watcher is your friend.
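For the record, here's the PowerShell version of that fix, plus the effective-routes check that finally exposed it. Treat it as a sketch – the names, prefixes, and region are all placeholders:

```powershell
# 1. What routes does the VM's NIC actually have? (This is what exposed the gap.)
$rg = "rg-vpn-lab"
Get-AzEffectiveRouteTable -ResourceGroupName $rg -NetworkInterfaceName "vm-app01-nic" |
    Format-Table AddressPrefix, NextHopType, NextHopIpAddress

# 2. Route table with a route for the on-prem range via the VPN gateway
$rt = New-AzRouteTable -ResourceGroupName $rg -Name "rt-to-onprem" -Location "westeurope"
Add-AzRouteConfig -RouteTable $rt -Name "onprem" `
    -AddressPrefix "10.10.0.0/16" -NextHopType VirtualNetworkGateway | Set-AzRouteTable

# 3. Associate the route table with the workload subnet
$vnet = Get-AzVirtualNetwork -ResourceGroupName $rg -Name "vnet-hub"
Set-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "snet-app" `
    -AddressPrefix "10.1.1.0/24" -RouteTable $rt | Set-AzVirtualNetwork
```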
Platform Limitations Nobody Tells You About
Some stuff I’ve learned the hard way:
- Gen1 SKUs top out at 30 S2S VPN tunnels per gateway (10 on Basic), and only the biggest Gen2 SKUs (VpnGw4/VpnGw5) go up to 100
- There are bandwidth caps depending on your gateway SKU (read the fine print!)
- Some on-prem devices just hate Azure VPN (especially with old firmware)
- Policy-based VPNs only support one tunnel (why, Microsoft?)
Check out this table of VPN Gateway limits that’s saved me from making capacity planning mistakes:
| VPN Gateway Generation | SKU | S2S/VNet-to-VNet Tunnels | P2S SSTP Connections | P2S IKEv2/OpenVPN Connections | Aggregate Throughput Benchmark | BGP | Zone-redundant | Supported Number of VMs in the Virtual Network |
|---|---|---|---|---|---|---|---|---|
| Generation1 | Basic | Max. 10 | Max. 128 | Not Supported | 100 Mbps | Not Supported | No | 200 |
| Generation1 | VpnGw1 | Max. 30 | Max. 128 | Max. 250 | 650 Mbps | Supported | No | 450 |
| Generation1 | VpnGw2 | Max. 30 | Max. 128 | Max. 500 | 1 Gbps | Supported | No | 1300 |
| Generation1 | VpnGw3 | Max. 30 | Max. 128 | Max. 1000 | 1.25 Gbps | Supported | No | 4000 |
| Generation1 | VpnGw1AZ | Max. 30 | Max. 128 | Max. 250 | 650 Mbps | Supported | Yes | 1000 |
| Generation1 | VpnGw2AZ | Max. 30 | Max. 128 | Max. 500 | 1 Gbps | Supported | Yes | 2000 |
| Generation1 | VpnGw3AZ | Max. 30 | Max. 128 | Max. 1000 | 1.25 Gbps | Supported | Yes | 5000 |
| Generation2 | VpnGw2 | Max. 30 | Max. 128 | Max. 500 | 1.25 Gbps | Supported | No | 685 |
| Generation2 | VpnGw3 | Max. 30 | Max. 128 | Max. 1000 | 2.5 Gbps | Supported | No | 2240 |
| Generation2 | VpnGw4 | Max. 100* | Max. 128 | Max. 5000 | 5 Gbps | Supported | No | 5300 |
| Generation2 | VpnGw5 | Max. 100* | Max. 128 | Max. 10000 | 10 Gbps | Supported | No | 6700 |
| Generation2 | VpnGw2AZ | Max. 30 | Max. 128 | Max. 500 | 1.25 Gbps | Supported | Yes | 2000 |
| Generation2 | VpnGw3AZ | Max. 30 | Max. 128 | Max. 1000 | 2.5 Gbps | Supported | Yes | 3300 |
| Generation2 | VpnGw4AZ | Max. 100* | Max. 128 | Max. 5000 | 5 Gbps | Supported | Yes | 4400 |
| Generation2 | VpnGw5AZ | Max. 100* | Max. 128 | Max. 10000 | 10 Gbps | Supported | Yes | 9000 |
*I found all these details in the official Microsoft documentation. Worth bookmarking that page – I refer to it constantly when planning deployments.
Set It and Forget It? Nope, Monitor This Stuff
After fixing a connection, I always set up alerts for:
- Tunnel ingress/egress bytes (drops to zero = bad news)
- Tunnel connection status (duh)
- Gateway P2S connection count (surprising how informative this is)
This way I usually catch problems before users start blowing up my phone.
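If you want a starting point for the “tunnel went quiet” alert, here's a rough sketch with the metric alert cmdlets – names are placeholders, and you’d normally attach an action group so it actually notifies someone:

```powershell
# Alert when the tunnel's egress bytes drop to zero over a 15-minute window.
# Gateway/resource names are placeholders; add -ActionGroupId to get notified.
$rg = "rg-vpn-lab"
$gw = Get-AzVirtualNetworkGateway -ResourceGroupName $rg -Name "vgw-hub"

$criteria = New-AzMetricAlertRuleV2Criteria -MetricName "TunnelEgressBytes" `
    -TimeAggregation Total -Operator LessThanOrEqual -Threshold 0

Add-AzMetricAlertRuleV2 -Name "vpn-tunnel-idle" -ResourceGroupName $rg `
    -TargetResourceId $gw.Id -Condition $criteria `
    -WindowSize (New-TimeSpan -Minutes 15) -Frequency (New-TimeSpan -Minutes 5) -Severity 2
```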
The Human Side of VPNs (Yes, There Is One)
Tech stuff aside, here’s what’s saved me more than once: good documentation. Keep a shared doc with:
- All config details (both sides)
- Change history (WHO touched WHAT and WHEN)
- Contact info for Azure and on-prem teams
You’d be surprised how often the issue is “Oh, Bob made a change last Friday but didn’t tell anyone.” Classic Bob.
Wrapping Up
Look, Azure VPN troubleshooting is partly science, partly dark art. Start with the basics, work methodically, and don’t forget that sometimes the simplest explanation is the right one.
My final piece of advice? Don’t be afraid to tear it all down and start over. Sometimes that’s genuinely faster than trying to debug a mysterious issue for days on end.
May your connections stay up and your weekend alerts stay quiet!