Troubleshooting SPF Calculation and Route Flapping

This section explains the most common causes of repeated SPF calculation and route flapping in OSPF. Whenever the topology changes, OSPF runs the SPF algorithm to recompute the shortest-path tree. Unstable links within the OSPF network can therefore cause SPF to run constantly.

This section covers the following causes of SPF running constantly in the network:

  • Interface flap within the network
  • Neighbor flap within the network
  • Duplicate router ID

SPF Running Constantly—Cause: Interface Flap Within the Network

This is a common problem in OSPF. Whenever a link flaps in an area, OSPF runs SPF, so a network with unstable links can see SPF run constantly. SPF itself is not the problem; OSPF is simply recalculating to reflect the change in its database. The real problem arises on low-end routers with limited CPU, where constant SPF runs can cause CPU spikes. A link flap is shown in Figure 9-99. Because R1 is also in area 0, any link flap in area 0 causes all routers in area 0 to run SPF.

[Figure 9-99]

Figure 9-100 shows the flowchart to follow to solve this problem.

[Figure 9-100]

Debugs and Verification
A link flap in an area causes SPF to run. If a link flaps constantly, the number of SPF calculations in the area keeps rising. An SPF counter that stays the same between checks is not a problem, but a counter that keeps incrementing indicates a problem.

Example 9-264 shows the output of show ip ospf, which reveals a very large SPF counter for area 0.
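As a rough illustration of that check, the following Python sketch compares two captures of show ip ospf and reports how fast the SPF counter is growing. It assumes each capture contains a line of the form "SPF algorithm executed N times"; the function names and the sample strings are hypothetical and are not output from a real router.

```python
import re

# Matches the per-area SPF counter line. The exact wording is an assumption
# about the output format and may need adjusting for your software version.
SPF_LINE = re.compile(r"SPF algorithm executed (\d+) times")


def spf_count(show_ip_ospf_text):
    """Return the first SPF execution counter found in the captured text."""
    match = SPF_LINE.search(show_ip_ospf_text)
    if match is None:
        raise ValueError("no SPF counter found in the captured output")
    return int(match.group(1))


def spf_runs_per_minute(earlier_text, later_text, seconds_between):
    """Rate at which the SPF counter grew between two captures."""
    if seconds_between <= 0:
        raise ValueError("seconds_between must be positive")
    delta = spf_count(later_text) - spf_count(earlier_text)
    return delta * 60.0 / seconds_between


# Illustrative captures (made up for this sketch), taken 120 seconds apart.
first = "Area BACKBONE(0)\n    SPF algorithm executed 2400 times\n"
second = "Area BACKBONE(0)\n    SPF algorithm executed 2446 times\n"
rate = spf_runs_per_minute(first, second, 120)
print(f"{rate:.1f} SPF runs per minute")
```

A rate near zero over several samples suggests a stable area; what counts as too high depends on the size of the area and the CPU of the routers in it.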

Example 9-264 Determining How Often SPF Is Running

The easiest way to find out which particular LSA is flapping is to turn on debug ip ospf monitor. This debug shows exactly which LSA is flapping. Example 9-265 shows the output of debug ip ospf monitor and reveals that a router LSA is flapping in area 0.

Example 9-265 debug ip ospf monitor Output Pinpoints Route Flap

The next step is to log in to the router whose router LSA is flapping and check its log for interface flaps. Example 9-266 shows the log of the router with router ID 192.168.1.129. The log shows that a serial link keeps going up and down. Every interface flap causes SPF to run.

Example 9-266 Router Log Pinpoints the Interface Causing Route Flap
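When the log is long, counting state changes per interface can make the flapping link easier to spot. The following Python sketch assumes the log lines contain the phrase "Interface NAME, changed state to up" or "down"; the threshold, function names, and sample lines are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumed message shape: "... Interface <name>, changed state to <up|down>".
STATE_CHANGE = re.compile(r"Interface (\S+), changed state to (up|down)")


def count_state_changes(log_lines):
    """Count up/down transitions per interface across the given log lines."""
    changes = Counter()
    for line in log_lines:
        match = STATE_CHANGE.search(line)
        if match:
            changes[match.group(1)] += 1
    return changes


def flapping_interfaces(log_lines, threshold=4):
    """Interfaces whose transition count meets or exceeds the threshold."""
    counts = count_state_changes(log_lines)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Illustrative log lines (made up for this sketch, not real router output).
sample_log = [
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0, changed state to down",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0, changed state to up",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0, changed state to down",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0, changed state to up",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet0, changed state to up",
]
print(flapping_interfaces(sample_log))
```

The threshold of four transitions is arbitrary; in practice it would be chosen relative to the time window the log covers.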

Solution
Two solutions exist in this case:

  • Fix the link flap.
  • Redefine the area boundaries.

Sometimes, the first solution might not be manageable because the link is flapping as the result of some telco outage beyond your control. One way to fix this temporarily is to manually shut down that interface.

The second solution requires some redesigning. If the link flap is happening too often, it might be possible to redefine the area, exclude this router from the area, and make it a member of a totally stubby area. Sometimes, this is also difficult to implement.

In short, link flaps are a fact of life; if an area suffers too many of them, reduce the number of routers in that area so that fewer routers are affected.

SPF Running Constantly—Cause: Neighbor Flap Within the Network

A neighbor flap also causes SPF to run. A neighbor can flap for several reasons already discussed in this chapter. When a link goes down, the neighbor goes down as well.

When a neighbor goes down, it causes a change in topology, so SPF runs. In Figure 9-101, R3 is suffering from a neighbor flap, and all the routers in area 0 are running SPF because of this.

[Figure 9-101]

Figure 9-102 shows the flowchart to follow to solve this problem.

[Figure 9-102]

Debugs and Verification
Example 9-267 shows that SPF is being run constantly in area 0.

Example 9-267 Determining How Often SPF Is Running

The next step is to go to R3 and check the logs, as in the previous example. OSPF also provides a way to track neighbor changes: configure ospf log-adjacency-changes under router ospf to log every neighbor state change. Example 9-268 shows how to configure ospf log-adjacency-changes.

Example 9-268 Configuring ospf log-adjacency-changes on R3

When this command is configured, the router saves all neighbor state changes in its syslog. Example 9-269 shows a syslog message on R3 that records a neighbor state change. The output shows one instance, but the log contains many such neighbor changes.

Example 9-269 Syslog Messages on R3 Show OSPF State Changes
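To see which neighbor is flapping most often, the adjacency messages can be tallied per neighbor. The following Python sketch assumes each message contains text of the form "Nbr ROUTER-ID on INTERFACE from STATE to STATE"; the neighbor ID 10.0.0.4, the function name, and the sample lines are hypothetical and are not taken from R3's actual log.

```python
import re
from collections import Counter

# Assumed message shape: "Nbr <router-id> on <interface> from <old> to <new>".
ADJ_CHANGE = re.compile(r"Nbr (\S+) on (\S+) from (\S+) to ([^,\s]+)")


def full_state_losses(log_lines):
    """Count how often each (neighbor, interface) pair dropped out of FULL."""
    losses = Counter()
    for line in log_lines:
        match = ADJ_CHANGE.search(line)
        if match and match.group(3) == "FULL":
            losses[(match.group(1), match.group(2))] += 1
    return losses


# Illustrative messages (made up for this sketch, not real router output).
sample_log = [
    "%OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.4 on Serial0 from FULL to DOWN, Neighbor Down",
    "%OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.4 on Serial0 from LOADING to FULL, Loading Done",
    "%OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.4 on Serial0 from FULL to INIT, 1-Way",
]
for (neighbor, interface), count in full_state_losses(sample_log).items():
    print(f"{neighbor} on {interface} left FULL {count} times")
```

Counting only transitions out of FULL keeps the tally focused on adjacency losses and ignores the normal steps a neighbor goes through while coming back up.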

In some older versions of Cisco IOS Software, the ospf log-adjacency-changes command is not available or might not be configured on the router. In this case, the show ip ospf neighbor command helps. Example 9-270 shows that R3 sees R4 going from FULL to INIT and then back to FULL through the show ip ospf neighbor command. This process keeps repeating.

Example 9-270 Determining Neighbor State

Solution
This problem is common in Frame Relay hub-and-spoke environments. With many neighbors on a Frame Relay interface, there is a high chance that Hellos start getting dropped. The solution in this case is to tune the broadcast queue so that it doesn't drop the OSPF Hello packets. The neighbor goes into INIT after FULL because the neighbor missed enough consecutive Hellos for its dead interval to expire and declared the local router dead. This can be confirmed by looking at the show interface statistics, which indicate that the serial interface broadcast queue is dropping many packets. Example 9-271 shows the output of show interface for the Serial 1 interface, which shows a significant number of drops in the Frame Relay broadcast queue.

Example 9-271 Displaying Broadcast Queue Status

The output in Example 9-271 confirms a problem at the interface level: too many drops are occurring, and this is causing the route to flap. To correct the problem, tune the Frame Relay broadcast queue. Tuning the broadcast queue is beyond the scope of this book, but several papers on Cisco's web site discuss it. For further research, consult the following URLs:

www.cisco.com/warp/partner/synchronicd/cc/techno/media/wan/frame/prodlit/256_pb.htm

www.cisco.com/warp/public/125/20.html

Example 9-272 shows that after fixing the interface drop problem, route flapping disappears. The broadcast queue size is changed from 64 to 256. The correct number can be determined after reading the URLs mentioned earlier for tuning the broadcast queue.

Example 9-272 Verifying That the Broadcast Queue Has Been Fixed

SPF Running Constantly—Cause: Duplicate Router ID

This is also a common problem in OSPF. When two routers have identical router IDs, confusion results in the OSPF topology database, and the route keeps getting added and deleted. The most common symptom of this problem is that the LS Age field always has a small value.

This problem usually is generated by a cut and paste of a router configuration into another router. This results in two routers with identical router IDs. Figure 9-103 shows a network setup in which R2 and R3 have duplicate router IDs of 192.168.1.129.

[Figure 9-103]

Figure 9-104 shows the flowchart to follow to solve this problem.

[Figure 9-104]

Debugs and Verification
A duplicate router ID causes SPF to run frequently, and the SPF counter keeps incrementing until the problem is fixed. Example 9-273 shows that SPF in area 0 ran 2446 times, which is a large number.

Example 9-273 Determining How Often SPF Is Running

The next step is to turn on debug ip ospf monitor. This debug shows exactly which LSA to chase. Example 9-274 shows the output of debug ip ospf monitor, which shows that a router with a router ID of 192.168.1.129 is the problem. The output also shows that it’s a router LSA.

Example 9-274 debug ip ospf monitor Output Pinpoints the Router Causing This Problem

Example 9-275 shows the output of the router LSA in question. There are two instances of this output taken 15 seconds apart. The first output shows that the number of links in this router is one; the second output shows that the number of links on this router is three. This is a discrepancy because of a duplicate router ID. This means that there must be another router with the same router ID causing the number of links to change every 15 seconds. Also, the LS Age field is always less than 10 seconds.

The first output in this example is the router LSA of R2; the second output is the router LSA of R3.
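The comparison described above, a link count that keeps changing while the LS age never grows, can be expressed as a simple check. The following Python sketch assumes you have already extracted the LS age and number of links from repeated views of the same router LSA; the function name, the 10-second age limit as a parameter, and the sample values other than the link counts of one and three are hypothetical.

```python
def looks_like_duplicate_router_id(samples, max_age=10):
    """Heuristic check on repeated observations of one router LSA.

    samples: list of (ls_age_seconds, number_of_links) tuples in time order.
    Returns True when the LS age stays below max_age in every sample and
    the number of links differs between at least two consecutive samples.
    """
    if len(samples) < 2:
        return False
    ages_stay_low = all(age < max_age for age, _ in samples)
    link_counts_change = any(
        earlier[1] != later[1] for earlier, later in zip(samples, samples[1:])
    )
    return ages_stay_low and link_counts_change


# Illustrative observations (ages are made up for this sketch).
suspect = [(4, 1), (7, 3), (2, 1)]        # link count alternates, age stays low
healthy = [(30, 3), (45, 3), (60, 3)]     # same LSA aging normally

print(looks_like_duplicate_router_id(suspect))
print(looks_like_duplicate_router_id(healthy))
```

This is only a heuristic: a single router whose links are genuinely flapping can produce a similar pattern, so the result should be confirmed by checking the router IDs directly.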

Example 9-275 Determining the Discrepancy in the Router LSA

Example 9-276 shows that R2 and R3 have identical router IDs.

Example 9-276 Detecting Duplicate Router IDs
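If the router IDs of all routers are collected in one place, duplicates can be found mechanically. The following Python sketch assumes a hand-built mapping of hostname to router ID; the function name and the router ID shown for R1 are hypothetical, while the IDs for R2 and R3 follow the scenario in this section.

```python
from collections import defaultdict


def duplicate_router_ids(router_ids):
    """Return {router_id: [hostnames]} for every ID used by more than one router."""
    hosts_by_id = defaultdict(list)
    for hostname, router_id in router_ids.items():
        hosts_by_id[router_id].append(hostname)
    return {
        router_id: sorted(hosts)
        for router_id, hosts in hosts_by_id.items()
        if len(hosts) > 1
    }


# Hostname-to-router-ID inventory; R1's ID is a made-up value for illustration.
inventory = {
    "R1": "192.168.1.1",
    "R2": "192.168.1.129",
    "R3": "192.168.1.129",
}
print(duplicate_router_ids(inventory))
```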

Solution
To correct this problem, either change the router ID of R3 or change the router ID of R2. Example 9-277 shows how to change the router ID of R3 and gives the output of the show ip ospf command to verify that the router ID has been changed.

Example 9-277 Changing the Router ID of R3

Example 9-278 shows that after changing the router ID of R3, the LS age for 192.168.1.129 becomes stable in the OSPF database. The LS age has reached 90 seconds, so the entry is now stable.

Example 9-278 The LS Age for the Problem LSA Is Now Stable

About the author

Prasanna
