Troubleshooting Azure VPN Site-to-Site Connections: A Practical Guide

I can’t tell you how many sleepless nights I’ve had because of broken VPN connections. After 7+ years dealing with Azure networking (and plenty of mistakes along the way!), I wanted to share some real-world fixes that have saved my bacon more times than I can count.

The Basics: Common S2S VPN Issues

So what usually breaks these connections? In my experience:

  1. Mismatched config parameters – One tiny setting and boom, nothing works
  2. Firewall/traffic filtering – Where do packets go? Nobody knows!
  3. Certificate/PSK problems – Authentication is a pain in the…
  4. Routing config issues – Traffic goes in, but doesn’t come out
  5. Azure platform quirks – Yeah, sometimes it’s just Azure being Azure

Configuration Verification: First Things First

Look, before you go down a 3-hour rabbit hole (been there!), just double-check these settings on both sides (and if you’d rather script the check, there’s a PowerShell sketch right after this list):

  • Connection type (route-based vs policy-based)
  • Encryption algorithms (AES-256, AES-128, etc.)
  • Authentication methods (SHA-1, SHA-256)
  • DH Groups/PFS settings
  • IKE version (v1 or v2)
  • SA lifetime values
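
If you’d rather script this than eyeball two admin consoles, here’s a rough Azure PowerShell sketch for inspecting the connection and pinning an explicit IPsec/IKE policy so there’s nothing left to negotiate by accident. The resource names are placeholders, and the algorithm values are just examples (match them to what your on-prem device actually supports):

# Inspect the current connection, including any custom IPsec/IKE policy
$conn = Get-AzVirtualNetworkGatewayConnection -Name "onprem-to-azure" -ResourceGroupName "rg-network"
$conn.ConnectionStatus
$conn.IpsecPolicies

# Pin an explicit policy so both sides are forced to agree
$policy = New-AzIpsecPolicy -IkeEncryption AES256 -IkeIntegrity SHA256 -DhGroup DHGroup14 `
    -IpsecEncryption AES256 -IpsecIntegrity SHA256 -PfsGroup PFS2048 `
    -SALifeTimeSeconds 3600 -SADataSizeKilobytes 102400000

Set-AzVirtualNetworkGatewayConnection -VirtualNetworkGatewayConnection $conn -IpsecPolicies $policy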

True story: Last month I spent an entire afternoon troubleshooting only to find I’d fat-fingered the subnet mask on the Azure side. Ugh.

Troubleshooting Through the Azure Portal (The Easy Way)

Before diving into logs and packet captures, the Azure Portal actually has some built-in tools that can save you hours. Let me share a couple of lifesavers:

Connection Troubleshooter

This thing has bailed me out more times than I’d like to admit:

  1. Head to your Virtual Network Gateway in the Azure portal
  2. Click on “Connections” in the left menu
  3. Select the problematic connection
  4. Hit the “Troubleshoot” button at the top

What happens next is pretty cool – Azure runs diagnostics on both the connection and gateway, checking for common issues like:

  • Mismatched shared keys
  • Configuration problems
  • Certificate issues
  • Connection status

It’ll give you a report with recommended actions. Last month, it instantly spotted that my on-prem device’s pre-shared key had expired (which would’ve taken me forever to figure out manually).
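
You can pull the same status from Azure PowerShell, which is handy when you’re checking several connections at once or want to compare the PSK Azure holds against what’s configured on the on-prem box. Names here are placeholders:

# Connection state: Connected, Connecting, or NotConnected
Get-AzVirtualNetworkGatewayConnection -Name "onprem-to-azure" -ResourceGroupName "rg-network" |
    Select-Object Name, ConnectionStatus, IngressBytesTransferred, EgressBytesTransferred

# The pre-shared key Azure has on record for this connection
Get-AzVirtualNetworkGatewayConnectionSharedKey -Name "onprem-to-azure" -ResourceGroupName "rg-network"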

Reset Connection (The Nuclear Option)

Sometimes connections get stuck in a weird state, and no amount of config tweaking helps. That’s when I use the reset feature:

  1. Navigate to your Virtual Network Gateway
  2. Click on “Connections”
  3. Select the problem connection
  4. Hit the “Reset” button at the top

Be careful though – this is basically turning it off and on again. It’ll disrupt any current traffic, but I’ve seen it fix mysterious issues when nothing else worked. Had a customer last year with an intermittent connection that would drop randomly – we tried everything for days until a simple reset fixed it permanently.
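
The same nuclear option exists in PowerShell, which helps when the portal is being slow or you want to reset a few connections in a row. A minimal sketch, assuming a reasonably recent Az.Network module (names are placeholders):

# Tear down and re-establish just this one tunnel
Reset-AzVirtualNetworkGatewayConnection -Name "onprem-to-azure" -ResourceGroupName "rg-network" -Force

# If the whole gateway is misbehaving, reboot it instead;
# expect several minutes of downtime while the instance fails over
$gw = Get-AzVirtualNetworkGateway -Name "vnet-gw" -ResourceGroupName "rg-network"
Reset-AzVirtualNetworkGateway -VirtualNetworkGateway $gw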

Using Azure Diagnostic Logs (That Most People Skip)

Azure actually has pretty decent logs that nobody seems to use. Enable them by:

  1. Going to your Virtual Network Gateway
  2. Clicking “Diagnostic settings” under Monitoring
  3. Setting up GatewayDiagnosticLog and TunnelDiagnosticLog
  4. Sending them somewhere useful (Log Analytics or storage)

Pro tip: Filter these logs with keywords like “connection,” “IKE,” or “tunnel” – saves tons of time.
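
Once the logs land in Log Analytics, you can do that filtering without leaving PowerShell. A rough sketch using the Az.OperationalInsights module; the workspace ID is a placeholder, and I’m assuming the default AzureDiagnostics table that resource logs flow into:

$query = @"
AzureDiagnostics
| where ResourceType == "VIRTUALNETWORKGATEWAYS"
| where Category in ("GatewayDiagnosticLog", "TunnelDiagnosticLog")
// add e.g. | search "IKE" to keyword-filter across all columns
| order by TimeGenerated desc
| take 50
"@

$result = Invoke-AzOperationalInsightsQuery -WorkspaceId "<workspace-guid>" -Query $query
$result.Results | Format-Table -AutoSize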

Connection Validation That Actually Works

When I’m stuck, I run these basic tests:

From on-prem:

ping [Azure Virtual Network IP range]
traceroute [Azure Virtual Network IP range]

From an Azure VM, traceroute won’t tell you much (the Azure fabric hides the intermediate hops), so use Network Watcher’s connectivity check instead.
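
Here’s what that looks like from PowerShell; the VM, destination IP, and port are placeholders for whatever actually sits on your on-prem side:

# Network Watcher connectivity check from an Azure VM toward on-prem
$nw = Get-AzNetworkWatcher -Location "westeurope"
$vm = Get-AzVM -Name "test-vm" -ResourceGroupName "rg-network"

Test-AzNetworkWatcherConnectivity -NetworkWatcher $nw -SourceId $vm.Id `
    -DestinationAddress "10.10.0.4" -DestinationPort 443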

Don’t freak out if ping fails – lots of places block ICMP. Try TCP tests instead:

Test-NetConnection -ComputerName [target IP] -Port [target port]
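
For example, checking whether RDP on a VM across the tunnel answers (IP and port are just illustrative):

Test-NetConnection -ComputerName 10.1.0.4 -Port 3389
# TcpTestSucceeded : True means the path works end to end;
# False while the tunnel shows Connected usually points at an NSG or firewall rule.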

Packet Captures: The Last Resort That’s Actually the Best Resort

Whenever I’m totally stumped (which happens more than I’d like to admit), packet captures save me. Use your on-prem vendor’s tools, and for Azure (there’s a scripted version right after these steps):

  1. Hit up Network Watcher
  2. Click “Packet capture”
  3. Target your Azure VM
  4. Filter for UDP 500/4500 (that’s IKE and NAT-T traffic)
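
And the scripted version, if you’d rather not click through the portal. Everything here is a placeholder (names, region, storage account); the filter mirrors step 4:

$nw = Get-AzNetworkWatcher -Location "westeurope"
$vm = Get-AzVM -Name "test-vm" -ResourceGroupName "rg-network"
$sa = Get-AzStorageAccount -Name "capturestore" -ResourceGroupName "rg-network"

# Only capture IKE/NAT-T traffic (UDP 500 and 4500)
$filter = New-AzPacketCaptureFilterConfig -Protocol UDP -RemotePort "500;4500"

New-AzNetworkWatcherPacketCapture -NetworkWatcher $nw -PacketCaptureName "ike-debug" `
    -TargetVirtualMachineId $vm.Id -StorageAccountId $sa.Id `
    -TimeLimitInSeconds 300 -Filter $filter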

When reviewing captures, I’m mainly looking for:

  • Failed IKE negotiations
  • Repeated connection attempts that go nowhere
  • Mysterious timeouts
  • TCP resets out of nowhere

A Story About Routing That Still Haunts Me

So this one time (about 6 months ago), I had this weird issue – traffic flowed from on-prem to Azure just fine, but nothing came back. The tunnel showed connected, everything looked perfect, but nada.

After wasting a day, I finally checked Azure’s effective routes and found the problem. Had to:

  1. Create a route table in Azure
  2. Add a route for my on-prem network pointing to the VPN gateway
  3. Associate it with my subnets

Lesson learned: Always check effective routes in Azure when traffic only flows one way. Network Watcher is your friend.
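
In PowerShell terms, the fix looked roughly like this (names, region, and prefixes are placeholders for my actual environment), plus the effective-routes check I should have run on day one:

# 1. Create a route table
$rt = New-AzRouteTable -Name "rt-to-onprem" -ResourceGroupName "rg-network" -Location "westeurope"

# 2. Send the on-prem prefix via the VPN gateway
Add-AzRouteConfig -RouteTable $rt -Name "to-onprem" `
    -AddressPrefix "10.10.0.0/16" -NextHopType VirtualNetworkGateway | Set-AzRouteTable

# 3. Associate the table with the workload subnet
$vnet = Get-AzVirtualNetwork -Name "vnet-prod" -ResourceGroupName "rg-network"
Set-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "snet-app" `
    -AddressPrefix "10.1.0.0/24" -RouteTable $rt | Set-AzVirtualNetwork

# The day-one check: what routes does the NIC actually have?
Get-AzEffectiveRouteTable -NetworkInterfaceName "app-vm-nic" -ResourceGroupName "rg-network" |
    Format-Table AddressPrefix, NextHopType, NextHopIpAddress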

Platform Limitations Nobody Tells You About

Some stuff I’ve learned the hard way:

  • Generation1 SKUs top out at 30 S2S tunnels per gateway, and only the biggest Generation2 SKUs (VpnGw4/VpnGw5) go up to 100
  • There are bandwidth caps depending on your gateway SKU (read the fine print!)
  • Some on-prem devices just hate Azure VPN (especially with old firmware)
  • Policy-based VPNs only support one tunnel (why, Microsoft?)

Check out this table of VPN Gateway limits that’s saved me from making capacity planning mistakes:

VPN Gateway Generation | SKU | S2S/VNet-to-VNet Tunnels | P2S SSTP Connections | P2S IKEv2/OpenVPN Connections | Aggregate Throughput Benchmark | BGP | Zone-redundant | Supported VMs in the Virtual Network
---------------------- | -------- | ------------------------ | -------------------- | ----------------------------- | ------------------------------ | ------------- | -------------- | ------------------------------------
Generation1 | Basic | Max. 10 | Max. 128 | Not Supported | 100 Mbps | Not Supported | No | 200
Generation1 | VpnGw1 | Max. 30 | Max. 128 | Max. 250 | 650 Mbps | Supported | No | 450
Generation1 | VpnGw2 | Max. 30 | Max. 128 | Max. 500 | 1 Gbps | Supported | No | 1300
Generation1 | VpnGw3 | Max. 30 | Max. 128 | Max. 1000 | 1.25 Gbps | Supported | No | 4000
Generation1 | VpnGw1AZ | Max. 30 | Max. 128 | Max. 250 | 650 Mbps | Supported | Yes | 1000
Generation1 | VpnGw2AZ | Max. 30 | Max. 128 | Max. 500 | 1 Gbps | Supported | Yes | 2000
Generation1 | VpnGw3AZ | Max. 30 | Max. 128 | Max. 1000 | 1.25 Gbps | Supported | Yes | 5000
Generation2 | VpnGw2 | Max. 30 | Max. 128 | Max. 500 | 1.25 Gbps | Supported | No | 685
Generation2 | VpnGw3 | Max. 30 | Max. 128 | Max. 1000 | 2.5 Gbps | Supported | No | 2240
Generation2 | VpnGw4 | Max. 100* | Max. 128 | Max. 5000 | 5 Gbps | Supported | No | 5300
Generation2 | VpnGw5 | Max. 100* | Max. 128 | Max. 10000 | 10 Gbps | Supported | No | 6700
Generation2 | VpnGw2AZ | Max. 30 | Max. 128 | Max. 500 | 1.25 Gbps | Supported | Yes | 2000
Generation2 | VpnGw3AZ | Max. 30 | Max. 128 | Max. 1000 | 2.5 Gbps | Supported | Yes | 3300
Generation2 | VpnGw4AZ | Max. 100* | Max. 128 | Max. 5000 | 5 Gbps | Supported | Yes | 4400
Generation2 | VpnGw5AZ | Max. 100* | Max. 128 | Max. 10000 | 10 Gbps | Supported | Yes | 9000

*I found all these details in the official Microsoft documentation. Worth bookmarking that page – I refer to it constantly when planning deployments.

Set It and Forget It? Nope, Monitor This Stuff

After fixing a connection, I always set up alerts for (there’s a scripted example after the list):

  • Tunnel ingress/egress bytes (drops to zero = bad news)
  • Tunnel connection status (duh)
  • Gateway P2S connection count (surprising how informative this is)
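
Here’s a sketch of wiring up the first one with Azure PowerShell’s Az.Monitor module. TunnelIngressBytes is one of the gateway’s built-in metrics; the action group ID and all the names are placeholders:

# Alert when tunnel ingress drops to zero over a 15-minute window
$gw = Get-AzVirtualNetworkGateway -Name "vnet-gw" -ResourceGroupName "rg-network"

$criteria = New-AzMetricAlertRuleV2Criteria -MetricName "TunnelIngressBytes" `
    -TimeAggregation Total -Operator LessThanOrEqual -Threshold 0

Add-AzMetricAlertRuleV2 -Name "vpn-tunnel-dead" -ResourceGroupName "rg-network" `
    -TargetResourceId $gw.Id -Condition $criteria `
    -WindowSize 00:15:00 -Frequency 00:05:00 -Severity 1 `
    -ActionGroupId "/subscriptions/<sub-id>/resourceGroups/rg-network/providers/microsoft.insights/actionGroups/<action-group>"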

This way I usually catch problems before users start blowing up my phone.

The Human Side of VPNs (Yes, There Is One)

Tech stuff aside, here’s what’s saved me more than once: good documentation. Keep a shared doc with:

  • All config details (both sides)
  • Change history (WHO touched WHAT and WHEN)
  • Contact info for Azure and on-prem teams

You’d be surprised how often the issue is “Oh, Bob made a change last Friday but didn’t tell anyone.” Classic Bob.

Wrapping Up

Look, Azure VPN troubleshooting is partly science, partly dark art. Start with the basics, work methodically, and don’t forget that sometimes the simplest explanation is the right one.

My final piece of advice? Don’t be afraid to tear it all down and start over. Sometimes that’s genuinely faster than trying to debug a mysterious issue for days on end.

May your connections stay up and your weekend alerts stay quiet!
