Blog: WAN

 

One of our customers uses an MPLS network for their WAN connections, but they have installed backup Internet connections at their branch locations to maintain connectivity to the corporate office if one of the MPLS connections fails. See the diagram below for reference:

In order to configure the backup VPN, a VPN tunnel must be built between the main location and the branch location, in this case Router2 and Router3, respectively. GRE tunnels were also built between these routers so that routes could be exchanged dynamically using EIGRP; this requires static routes for the GRE tunnel endpoints so that the tunnel traffic is routed out the Internet connections instead of over the MPLS connection. All routers participate in the same EIGRP process, and BGP and EIGRP are redistributed into each other on Router1. On the branch router (Router3), the EIGRP routes will not be preferred because eBGP has a lower administrative distance (20) than internal EIGRP (90), so the BGP routes learned over the MPLS connection will be used. On the Internet VPN router (Router2), however, the internal EIGRP routes learned over the GRE tunnel will be preferred, because their administrative distance (90) is lower than that of the external EIGRP routes (170) redistributed from BGP. This behavior can be changed by adding the “distance eigrp DesiredInternalAdministrativeDistance DesiredExternalAdministrativeDistance” command (“distance eigrp 90 80” in this case) to the EIGRP configuration.
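
As a rough sketch, the tunnel-related pieces on Router2 might look something like the following. The interface names, addresses, and EIGRP AS number are made up for illustration; only the distance command is the specific change discussed above.

    ! Router2 (main-site Internet VPN router), illustrative values only
    interface Tunnel0
     ip address 172.16.0.1 255.255.255.252
     tunnel source GigabitEthernet0/0
     tunnel destination 203.0.113.10
    !
    ! Static route so the GRE tunnel endpoint is reached over the Internet link, not MPLS
    ip route 203.0.113.10 255.255.255.255 198.51.100.1
    !
    router eigrp 100
     network 172.16.0.0 0.0.0.3
     ! Prefer external EIGRP (the routes redistributed from BGP/MPLS) over internal EIGRP
     distance eigrp 90 80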

This setup works for failover, but additional configuration is needed for the routers to automatically fail back to the MPLS connection. If the failure is at Router3, Router3 will return to using the BGP routes from MPLS once the connection is restored. By default, however, Router1 will continue to use the routes obtained via EIGRP, because routes injected into BGP locally (through redistribution) have a weight of 32768, which is higher than the default weight of 0 for routes learned from a neighbor. This can be resolved by setting the routes learned from the BGP neighbor (the ISP) to a higher weight than the injected routes. In this case, “neighbor NeighborIPAddress weight 35000” was added to the BGP configuration on Router1, so the higher-weight learned routes take precedence over the injected routes.
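
A minimal sketch of that change on Router1, with placeholder AS numbers and a placeholder ISP peer address:

    ! Router1, placeholder values
    router bgp 65000
     redistribute eigrp 100
     neighbor 192.0.2.1 remote-as 65100
     ! Give routes learned from the ISP a higher weight than locally injected routes (32768)
     neighbor 192.0.2.1 weight 35000

Note that weight is applied to routes as they are received, so the existing routes from the neighbor typically need to be refreshed (for example, with an inbound soft clear) before the new weight takes effect.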

The combination of adjusting the EIGRP administrative distances and the BGP neighbor weight allows traffic to fail over to the Internet VPN and return to the MPLS connection dynamically.


 

After converting a site to a new MPLS provider, I began to experience about 20% packet loss to that site. A lot of things changed during the migration:

  • Added GRE Tunnels
  • Implemented EIGRP to handle routing all of the LAN subnets
  • Restricted BGP to only handle the WAN, or MPLS, interfaces

These are the troubleshooting steps I took to narrow down the problem:

  1. Ping from the tunnel interface at the main site to the tunnel interface at the branch site.  0% Packet Loss
  2. Ping from the LAN port on the router at the main site to the tunnel interface at the branch site. 0% Packet Loss
  3. Ping from the LAN port on the router at the main site to the LAN port at the branch site. 0% Packet Loss
  4. Ping from a client at the main site to the tunnel interface at the branch site. 0% Packet Loss
  5. Ping from a client at the main site to a client at the branch site. ~20% Packet Loss
  6. Ping from the LAN port on the router at the main site to a client at the branch site. ~20% Packet Loss
  7. Ping from the tunnel interface at the main site to a client at the branch site. ~20% Packet Loss

This process seemingly narrowed the problem down to something at the branch site. I checked for negotiation errors in the logs of the switch and the routers. BGP appeared to be working fine because the peer was up and I was receiving all the routes I expected. The ping loss seemed to be very random. I then decided to enable debugging on the router and start a continuous ping from a client at the main site to a client at the branch site. I quickly noticed that every time I saw packet loss, I also saw a BGP error message being logged. There were a few different error messages appearing, and each caused a different amount of ping loss.
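
Something along these lines reproduces that test setup; the debug command shown is one option for logging BGP neighbor events, and the hostnames and addresses are placeholders:

    On the branch router, watch BGP events while the ping runs:
        Router3# terminal monitor
        Router3# debug ip bgp

    From a Windows client at the main site, run a continuous ping to a client at the branch site:
        C:\> ping -t 10.20.30.40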

Apparently, the ping loss wasn’t as random as I thought! After speaking with a coworker about a BGP turn-up he was doing for another customer, he suggested that I add a static route on the branch router for the BGP peer. Everything began working! So, to make a long story short, it is best to add a specific static route for a BGP peer if that peer isn’t directly connected, even if that static route has the same next hop as the default route.
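
On the branch router, the fix amounts to a single host route for the peer. A minimal sketch with placeholder addresses, showing that the /32 can use the very same next hop as the default route:

    ! Router3, placeholder addresses
    ip route 0.0.0.0 0.0.0.0 198.51.100.1
    ! Specific host route for the non-directly-connected BGP peer
    ip route 203.0.113.25 255.255.255.255 198.51.100.1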


 

We use PlateSpin to do scheduled P2V migrations to provide DR for some of the physical servers at a customer site. I have been troubleshooting some issues with the scheduled protection jobs over the last week or so. The jobs had been running fine for the last couple of months. I have the jobs scheduled to do full synchronizations once per month (on the first of the month), and all but two failed this month. The problem was really strange: I could kick off the full sync and it would run fine for a long time, then all of a sudden it would just stall out with a “recoverable error”. I tried all my usual steps to recover from this error, and nothing worked.

I used to see this all the time after the new Barracuda was installed. For that issue, I would just add a setting to the ofxcontroller.config file on the source side to bypass the proxy. So, I started searching for another config file that might need to be changed. After tracing the traffic with Wireshark, I finally decided there was no interference from the proxy. I submitted a support ticket with PlateSpin, and the tech working my case asked whether I was using the “WAN optimizations”. WAN optimizations? That must be a config setting I had never seen. He explained that the problem was that I was running into the 24-hour job termination window.

Any PlateSpin job MUST complete within 24 hours or it will fail with this “Recoverable Error” message. Actually, the error is not recoverable at all; you have to abort the job and start over. PlateSpin uses WinPE for the target-side pre-execution environment when doing migration/protection jobs. WinPE requires a license if it runs for more than 24 hours, and PlateSpin doesn’t have that license, so the target VM will REBOOT ITSELF after 24 hours. Hence the recoverable error that isn’t recoverable.

So, back to the WAN optimizations. To help the job finish in time, there are config values you can change in the product’s productinternal.config configuration file (for v8.0; powerconvert.config for PowerConvert Server 7.0), located on your Portability Suite Server host in the following directory: \Program Files\PlateSpin Portability Suite Server\Web\

Setting                                Default            For WANs
fileTransferThreadCount                2                  4 to 6
fileTransferMinCompressionLimit        0 (disabled)       65536 (64 KB, the maximum)
fileTransferCompressionThreadsCount    2                  n/a (ignored if compression is disabled)
fileTransferSendReceiveBufferSize      0 (8192 bytes)     5242880 (5 MB, the maximum)

For fileTransferSendReceiveBufferSize, use the formula (LINK_SPEED(Mbps)/8)*DELAY(sec)*1024*1024 to figure out what your setting should be.
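
As a worked example of the buffer-size formula, a 10 Mbps link with 100 ms (0.1 sec) of delay works out to (10/8)*0.1*1024*1024 = 131072 bytes. The changes themselves are just key/value edits in the config file; the sketch below assumes the file uses standard .NET appSettings-style entries, so check it against the existing entries in your copy before editing:

    <appSettings>
      <!-- Values chosen for a WAN link; defaults are listed in the table above -->
      <add key="fileTransferThreadCount" value="6" />
      <add key="fileTransferMinCompressionLimit" value="65536" />
      <add key="fileTransferSendReceiveBufferSize" value="131072" />
    </appSettings>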

After implementing these settings, full sync jobs were completing in 25% of the time they had been taking. It’s a huge improvement.

You might also want to check out a previous post on Moving a PlateSpin Image Between Image Servers to Setup a DR Sync that discusses using local image servers at both ends to seed a server image across a WAN.