After converting a site to a new MPLS provider, I began to experience about 20% packet loss to that site. A lot of things changed during the migration:
- Added GRE Tunnels
- Implemented EIGRP to handle routing all of the LAN subnets
- Restricted BGP to only handle the WAN, or MPLS, interfaces
These are the troubleshooting steps I took to narrow down the problem:
- Ping from the tunnel interface at the main site to the tunnel interface at the branch site. 0% Packet Loss
- Ping from the LAN port on router at the main site to the tunnel interface at the branch site. 0% Packet Loss
- Ping from the LAN port on router at the main site to the LAN port at the branch site. 0% Packet Loss
- Ping from a client at the main site to the tunnel interface at the branch site. 0% Packet Loss
- Ping from a client at the main site to a client at the branch site. ~20% Packet Loss
- Ping from the LAN port on router at the main site to a client at the branch site. ~20% Packet Loss
- Ping from the tunnel interface at the main site to a client at the branch site. ~20% Packet Loss
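For reference, the router-sourced tests above can be run with a ping that sets the source interface. This is a minimal sketch assuming Cisco IOS routers (the platform isn't named here, though EIGRP suggests it). The interface names and the 10.x addresses are placeholders, not values from the actual network.

```
! Run from the main-site router in privileged EXEC mode.
! All interface names and addresses below are hypothetical.

! Tunnel interface at the main site -> tunnel interface at the branch site
ping 10.0.0.2 source Tunnel0 repeat 100

! LAN port at the main site -> LAN port on the branch router
ping 10.20.0.1 source GigabitEthernet0/1 repeat 100

! LAN port at the main site -> client at the branch site
ping 10.20.0.50 source GigabitEthernet0/1 repeat 100
```

Changing only the source or only the destination between runs is what shows which leg of the path the loss follows.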
This process seemed to narrow the problem down to the branch site: every ping that ended at a branch client lost packets, while pings that ended at the branch router's own interfaces did not. I checked the logs of the switch and the routers for negotiation errors. BGP appeared to be working fine because the peer was up and I was receiving all the routes that I expected. The ping loss seemed to be very random. I then decided to enable debugging on the router and start a continuous ping from a client at the main site to a client at the branch site. I quickly noticed that every time I saw packet loss, I also saw a BGP error message being logged. A few different error messages were being logged, and each coincided with a different amount of ping loss.
Apparently, the ping loss wasn’t as random as I thought! When I spoke with a coworker about a BGP turn-up he was doing for another customer, he suggested that I add a static route for the BGP peer on the branch router. Once I did, everything began working! So, to make a long story short, it is best to add a specific static route for a BGP peer if that peer isn’t directly connected, even if that static route has the same next hop as the default route.
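As a rough illustration of that fix, here is what such a host route might look like. This sketch assumes Cisco IOS syntax, and both addresses are made-up documentation addresses rather than the ones from this network.

```
! Branch router, global configuration mode. Addresses are hypothetical.
! 192.0.2.1   = BGP peer address (not directly connected)
! 203.0.113.1 = next hop toward the peer (may match the default route's next hop)
ip route 192.0.2.1 255.255.255.255 203.0.113.1
```

Afterwards, `show ip route 192.0.2.1` should list the /32 static entry for the peer, and `show ip bgp summary` should show the session staying up.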