After converting a site to a new MPLS provider, I began to see about 20% packet loss to that site.  Several things changed during the migration:

  • Added GRE tunnels
  • Implemented EIGRP to handle routing all of the LAN subnets
  • Restricted BGP to only handle the WAN, or MPLS, interfaces

These are the troubleshooting steps I took to narrow down the problem:

  1. Ping from the tunnel interface at the main site to the tunnel interface at the branch site. 0% packet loss
  2. Ping from the LAN port on the router at the main site to the tunnel interface at the branch site. 0% packet loss
  3. Ping from the LAN port on the router at the main site to the LAN port on the router at the branch site. 0% packet loss
  4. Ping from a client at the main site to the tunnel interface at the branch site. 0% packet loss
  5. Ping from a client at the main site to a client at the branch site. ~20% packet loss
  6. Ping from the LAN port on the router at the main site to a client at the branch site. ~20% packet loss
  7. Ping from the tunnel interface at the main site to a client at the branch site. ~20% packet loss
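
The reasoning behind those seven tests can be written down as a small Python sketch. It records each ping as a (source, destination, loss) triple and looks for endpoints that appear only in failing tests. The endpoint labels are shorthand invented for this sketch, and "loss" simply means the ~20% results from the list.

```python
# Each test from the list above: (source, destination, saw_loss).
# The labels are shorthand made up for this sketch.
tests = [
    ("main tunnel", "branch tunnel", False),
    ("main router LAN", "branch tunnel", False),
    ("main router LAN", "branch router LAN", False),
    ("main client", "branch tunnel", False),
    ("main client", "branch client", True),
    ("main router LAN", "branch client", True),
    ("main tunnel", "branch client", True),
]


def suspects(results, index):
    """Return endpoints that appear in failing tests but in no passing test.

    index 0 checks sources, index 1 checks destinations.
    """
    failing = {t[index] for t in results if t[2]}
    passing = {t[index] for t in results if not t[2]}
    return failing - passing


suspect_sources = suspects(tests, 0)
suspect_destinations = suspects(tests, 1)
print("sources seen only in failing tests:", suspect_sources)
print("destinations seen only in failing tests:", suspect_destinations)
```

Every source shows up in at least one clean test, so no source is singled out. The branch client is the only destination that never had a clean result, which is why the tests point at the branch side.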

This process seemed to narrow the problem down to the branch site, since loss only appeared when the destination was a client behind the branch router.  I checked for negotiation errors in the logs of the switch and the routers.  BGP appeared to be working fine because the peer was up and I was receiving all the routes that I expected.  The ping loss seemed to be very random.  I then decided to enable debugging on the router and start a continuous ping from a client at the main site to a client at the branch site.  I quickly noticed that every time I saw packet loss, I also saw a BGP error message being logged.  Several different error messages were being logged, and each corresponded to a different amount of ping loss.

Apparently, the ping loss wasn’t as random as I thought!  When I spoke with a coworker about a BGP turn-up he was doing for another customer, he suggested that I add a static route for the BGP peer on the branch router.  Once I did, everything began working!  So, to make a long story short, it is best to add a specific static route for a BGP peer if that peer isn’t directly connected, even if that static route has the same next hop as the default route.
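
To show what that static route changes in the routing table, here is a minimal Python sketch of longest-prefix matching. The addresses are hypothetical values from the documentation ranges, not the ones from this network. Assuming a Cisco IOS router (the platform is my assumption for this example), the equivalent entry would look something like `ip route 203.0.113.2 255.255.255.255 192.0.2.1`. The sketch only illustrates route selection; it does not explain why pinning the peer this way stopped the BGP errors.

```python
import ipaddress

# Hypothetical addresses from the documentation ranges (RFC 5737).
PEER = ipaddress.ip_address("203.0.113.2")  # BGP peer that is not directly connected
NEXT_HOP = "192.0.2.1"                      # same next hop for both routes

routes = [
    (ipaddress.ip_network("0.0.0.0/0"), NEXT_HOP, "default route"),
    (ipaddress.ip_network("203.0.113.2/32"), NEXT_HOP, "static host route for peer"),
]


def lookup(destination, table):
    """Return the matching route with the longest prefix, or None."""
    matches = [route for route in table if destination in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen, default=None)


best = lookup(PEER, routes)
other = lookup(ipaddress.ip_address("198.51.100.7"), routes)
print("peer traffic uses:", best[2], "via", best[1])
print("other traffic uses:", other[2], "via", other[1])
```

Both routes share a next hop, but the peer is now reached through its own /32 entry rather than through the default route, so the path to the peer no longer depends on whatever happens to the default route.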