Troubleshooting SPF Calculation and Route Flapping
This section explains the most common causes of route flapping and constant SPF calculation in OSPF. Whenever the topology changes, OSPF runs the SPF algorithm to recompute the shortest-path tree. Unstable links within the OSPF network can therefore cause SPF to run constantly.
This section discusses the problem of SPF running constantly in the network for the following reasons:
- Interface flap within the network
- Neighbor flap within the network
- Duplicate router ID
SPF Running Constantly—Cause: Interface Flap Within the Network
This is a common problem in OSPF. Whenever a link flaps in an area, OSPF runs SPF, so a network with unstable links can see SPF running constantly. SPF itself is not the problem; OSPF is simply recalculating to reflect the change in its database. The real problem arises when the network contains low-end routers with limited CPU capacity, where constant SPF runs can cause CPU spikes. A link flap is shown in Figure 9-99. Because R1 is also in area 0, a link flap anywhere in area 0 causes R1, like every other router in area 0, to run SPF.
Figure 9-100 shows the flowchart to follow to solve this problem.
Debugs and Verification
A link flap in an area causes SPF to run, so a link that flaps constantly drives up the number of SPF calculations in that area. An SPF counter that holds steady is not a problem, but a counter that keeps incrementing indicates a problem.
Example 9-264 shows the output of show ip ospf, which reveals a very high SPF counter in area 0.
Example 9-264 Determining How Often SPF Is Running
R1#show ip ospf
 Routing Process "ospf 1" with ID 192.168.254.13
 Supports only single TOS(TOS0) routes
 It is an area border
 SPF schedule delay 5 secs, Hold time between two SPFs 10 secs
 Minimum LSA interval 5 secs. Minimum LSA arrival 1 secs
 Number of external LSA 8. Checksum Sum 0x48C3E
 Number of DCbitless external LSA 0
 Number of DoNotAge external LSA
 Number of areas in this router is 3. 2 normal 1 stub 0 nssa
    Area BACKBONE(0)
        Number of interfaces in this area is
        Area has no authentication
        SPF algorithm executed 2668 times
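One way to tell a steady counter from an incrementing one is to capture show ip ospf twice and compare the values. The following Python sketch is illustrative only: the helper names spf_run_count and spf_runs_between are hypothetical, the second capture is an invented reading, and the pattern assumes the counter line is worded as in Example 9-264. It reads only the first counter in the output, so a multiarea router would need per-area handling.

```python
import re

# Matches the per-area SPF counter line from "show ip ospf" output.
SPF_COUNT_RE = re.compile(r"SPF algorithm executed (\d+) times")


def spf_run_count(show_ip_ospf_text: str) -> int:
    """Return the first SPF execution counter found in the output."""
    match = SPF_COUNT_RE.search(show_ip_ospf_text)
    if match is None:
        raise ValueError("no SPF counter found in output")
    return int(match.group(1))


def spf_runs_between(earlier: str, later: str) -> int:
    """Return how many SPF runs occurred between two captures."""
    return spf_run_count(later) - spf_run_count(earlier)


# The first counter comes from Example 9-264. The second is a
# hypothetical later reading, invented purely for illustration.
first = "Area BACKBONE(0)\n    SPF algorithm executed 2668 times"
second = "Area BACKBONE(0)\n    SPF algorithm executed 2680 times"
print(spf_runs_between(first, second))
```

A difference that keeps growing between captures taken a few minutes apart suggests that something in the area is still flapping.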
The easiest way to find out which particular LSA is flapping is to turn on debug ip ospf monitor. This debug shows exactly which LSA is flapping. Example 9-265 shows the output of debug ip ospf monitor and reveals that a router LSA is flapping in area 0.
Example 9-265 debug ip ospf monitor Output Pinpoints Route Flap
R1#debug ip ospf monitor
OSPF: Schedule SPF in area 0.0.0.0
      Change in LS ID 192.168.1.129, LSA type R,
OSPF: schedule SPF: spf_time 1620348064ms wait_interval 10s
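When the debug output is long, tallying the changes per LSA makes the flapping LSA stand out. The sketch below is a minimal illustration, assuming the message wording shown in Example 9-265; count_lsa_changes is a hypothetical helper, and repeating the sample three times merely simulates a recurring flap.

```python
import re
from collections import Counter

# Captures the LS ID and the LSA type letter from the debug message.
LSA_CHANGE_RE = re.compile(
    r"Change in LS ID (\d+\.\d+\.\d+\.\d+), LSA type (\w+)"
)


def count_lsa_changes(debug_text: str) -> Counter:
    """Count SPF-triggering changes per (LS ID, LSA type) pair."""
    return Counter(LSA_CHANGE_RE.findall(debug_text))


# One message block copied from Example 9-265.
sample = (
    "OSPF: Schedule SPF in area 0.0.0.0\n"
    "      Change in LS ID 192.168.1.129, LSA type R,\n"
    "OSPF: schedule SPF: spf_time 1620348064ms wait_interval 10s\n"
)

# Hypothetical: the same change seen three times in a capture.
changes = count_lsa_changes(sample * 3)
for (ls_id, lsa_type), count in changes.most_common():
    print(ls_id, lsa_type, count)
```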
The next step is to go to the router whose router LSA is flapping and check its log for interface flaps. Example 9-266 shows the log of the router with router ID 192.168.1.129. The log shows that a serial link keeps going up and down. Every interface flap causes SPF to run.
Example 9-266 Router Log Pinpoints the Interface Causing Route Flap
R3#show log
*Mar 29 01:59:07: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to down
*Mar 29 01:59:09: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to up
*Mar 29 01:59:30: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to down
*Mar 29 02:00:03: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to up
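On a router with many interfaces, counting the down transitions per interface shows which link is responsible. This Python sketch is an illustration under the assumption that the log uses the %LINEPROTO-5-UPDOWN wording of Example 9-266; count_down_events is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the interface name and the new line-protocol state.
UPDOWN_RE = re.compile(
    r"%LINEPROTO-5-UPDOWN: Line protocol on Interface ([^,\s]+), "
    r"changed state to (up|down)"
)


def count_down_events(log_text: str) -> Counter:
    """Count line-protocol 'down' transitions per interface."""
    downs = Counter()
    for interface, state in UPDOWN_RE.findall(log_text):
        if state == "down":
            downs[interface] += 1
    return downs


# Log lines copied from Example 9-266.
log = """\
*Mar 29 01:59:07: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to down
*Mar 29 01:59:09: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to up
*Mar 29 01:59:30: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to down
*Mar 29 02:00:03: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1, changed state to up
"""

flaps = count_down_events(log)
print(dict(flaps))
```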
Solution
Two solutions exist in this case:
- Fix the link flap.
- Redefine the area boundaries.
Sometimes, the first solution is not feasible because the link is flapping as the result of a telco outage beyond your control. One temporary fix is to manually shut down that interface.
The second solution requires some redesign. If the link flaps too often, it might be possible to redefine the area boundaries, exclude this router from the area, and make it a member of a totally stubby area. Sometimes, this is also difficult to implement.
In short, link flaps are a fact of life; if there are too many of them, reduce the number of routers in the area so that fewer routers are affected.
SPF Running Constantly—Cause: Neighbor Flap Within the Network
A neighbor flap also causes SPF to run. A neighbor flap can happen for several reasons, which were discussed earlier in this chapter. When a link goes down, the neighbor goes down as well.
When a neighbor goes down, it causes a change in topology, so SPF runs. In Figure 9-101, R3 is suffering from a neighbor flap, and all the routers in area 0 are running SPF because of this.
Figure 9-102 shows the flowchart to follow to solve this problem.
Debugs and Verification
Example 9-267 shows that SPF is being run constantly in area 0.
Example 9-267 Determining How Often SPF Is Running
R1#show ip ospf
 Routing Process "ospf 1" with ID 192.168.254.13
 Supports only single TOS(TOS0) routes
 It is an area border
 SPF schedule delay 5 secs, Hold time between two SPFs 10 secs
 Minimum LSA interval 5 secs. Minimum LSA arrival 1 secs
 Number of external LSA 8. Checksum Sum 0x48C3E
 Number of DCbitless external LSA 0
 Number of DoNotAge external LSA
 Number of areas in this router is 3. 2 normal 1 stub 0 nssa
    Area BACKBONE(0)
        Number of interfaces in this area is 1
        Area has no authentication
        SPF algorithm executed 2458 times
The next step is to go to R3 and check the logs, as in the previous example. OSPF can also track neighbor changes: configure ospf log-adjacency-changes under router ospf to log every neighbor state change. Example 9-268 shows how to configure ospf log-adjacency-changes.
Example 9-268 Configuring ospf log-adjacency-changes on R3
R3#
router ospf 1
 ospf log-adjacency-changes
When this command is configured, the router saves all neighbor state changes in its syslog. Example 9-269 shows syslog messages on R3 that record neighbor state changes. The output shows one instance, but the log contains many instances of the same neighbor change.
Example 9-269 Syslog Messages on R3 Show OSPF State Changes
R3#show log
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from FULL to DOWN, Neighbor Down
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from FULL to INIT, 1-Way
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from DOWN to INIT, Received Hello
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from INIT to 2WAY, 2-Way Received
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from 2WAY to EXSTART, AdjOK?
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from EXSTART to EXCHANGE, Negotiation Done
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from EXCHANGE to LOADING, Exchange Done
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from LOADING to FULL, Loading Done
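Because the log can hold many such messages, it can help to count how often each neighbor drops out of the FULL state. The following Python sketch is illustrative and assumes the %OSPF-5-ADJCHG wording of Example 9-269; count_full_losses is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures neighbor ID, interface, old state, and new state.
ADJCHG_RE = re.compile(
    r"%OSPF-5-ADJCHG: Process \d+, Nbr (\S+) on (\S+) "
    r"from (\w+) to (\w+)"
)


def count_full_losses(log_text: str) -> Counter:
    """Count transitions out of FULL per (neighbor, interface)."""
    losses = Counter()
    for nbr, interface, old, new in ADJCHG_RE.findall(log_text):
        if old == "FULL" and new != "FULL":
            losses[(nbr, interface)] += 1
    return losses


# A subset of the messages in Example 9-269.
log = """\
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from FULL to DOWN, Neighbor Down
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from FULL to INIT , 1-Way
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from DOWN to INIT, Received Hello
%OSPF-5-ADJCHG: Process 1, Nbr 192.168.4.4 on Serial1 from LOADING to FULL , Loading Done
"""

losses = count_full_losses(log)
print(dict(losses))
```

A neighbor whose count keeps climbing is the one to investigate first.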
Some older versions of Cisco IOS Software do not support the ospf log-adjacency-changes command, and on other routers it might simply not be configured. In either case, the show ip ospf neighbor command helps. Example 9-270 uses repeated show ip ospf neighbor output to show that R3 sees R4 going from FULL to INIT and then back to FULL. This process keeps repeating.
Example 9-270 Determining Neighbor State
R3#show ip ospf neighbor
Neighbor ID     Pri   State      Dead Time   Address         Interface
192.168.4.4       1   FULL/-     00:00:34    131.108.1.1     Serial1.1

R3#show ip ospf neighbor
Neighbor ID     Pri   State      Dead Time   Address         Interface
192.168.4.4       1   INIT/-     00:00:33    131.108.1.1     Serial1.1

R3#show ip ospf neighbor
Neighbor ID     Pri   State      Dead Time   Address         Interface
192.168.4.4       1   FULL/-     00:00:37    131.108.1.1     Serial1.1
Solution
This problem is common in Frame Relay hub-and-spoke environments. When a Frame Relay interface has many neighbors, there is a high chance that Hellos will be dropped. R3 sees the neighbor go from FULL to INIT because the neighbor missed enough consecutive Hellos from R3 for its dead interval to expire, declared R3 dead, and stopped listing R3 in its own Hellos. The solution in this case is to tune the broadcast queue so that it doesn’t drop the OSPF Hello packets. The cause can be confirmed from the show interface statistics, which indicate that the serial interface broadcast queue is dropping many packets. Example 9-271 shows the output of show interface for the Serial 1 interface, with a significant number of drops in the Frame Relay broadcast queue.
Example 9-271 Displaying Broadcast Queue Status
R3#show interface Serial1
Serial1 is up, line protocol is up
  Hardware is MK5025
  Description: Charlotte Frame Relay Port DLCI 100
  MTU 1500 bytes, BW 1024 Kbit, DLY 20000 usec, rely 255/255, load 44/25
  Encapsulation FRAME-RELAY, loopback not set, keepalive set (10 sec)
  LMI enq sent 7940, LMI stat recvd 7937, LMI upd recvd 0, DTE LMI up
  LMI enq recvd 0, LMI stat sent 0, LMI upd sent 0
  LMI DLCI 1023 LMI type is CISCO frame relay DTE
  Broadcast queue 64/64, broadcasts sent/dropped 1769202/1849660, interface broadcasts 3579215
The output in Example 9-271 further proves that there is some problem at the interface level. Too many drops are occurring at the interface level. This is causing the route to flap. To correct this problem, you must tune the Frame Relay broadcast queue accordingly. Tuning the Frame Relay broadcast queue is beyond the scope of this book, but several papers on Cisco’s web site discuss how to tune the Frame Relay broadcast queue. For further research, you can consult them at the following URLs:
www.cisco.com/warp/partner/synchronicd/cc/techno/media/wan/frame/prodlit/256_pb.htm
www.cisco.com/warp/public/125/20.html
Example 9-272 shows that after fixing the interface drop problem, route flapping disappears. The broadcast queue size is changed from 64 to 256. The correct number can be determined after reading the URLs mentioned earlier for tuning the broadcast queue.
Example 9-272 Verifying That the Broadcast Queue Has Been Fixed
R3#show interface Serial1
Serial1 is up, line protocol is up
  Hardware is MK5025
  Description: Charlotte Frame Relay Port DLCI 100
  MTU 1500 bytes, BW 1024 Kbit, DLY 20000 usec, rely 255/255, load 44/255
  Encapsulation FRAME-RELAY, loopback not set, keepalive set (10 sec)
  LMI enq sent 7940, LMI stat recvd 7937, LMI upd recvd 0, DTE LMI u
  LMI enq recvd 0, LMI stat sent 0, LMI upd sent 0
  LMI DLCI 1023 LMI type is CISCO frame relay DTE
  Broadcast queue 0/256, broadcasts sent/dropped 1769202/0, interface broadcasts 3579215
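To put a number on the drops, the broadcast queue line from show interface can be reduced to a drop fraction. The Python sketch below is only an illustration: parse_broadcast_queue and the BroadcastQueue fields are hypothetical names, the two figures after "Broadcast queue" are assumed to be the current depth and the configured size, and the fraction is computed as dropped / (sent + dropped) on the assumption that the sent counter excludes dropped packets.

```python
import re
from typing import NamedTuple

# Matches the Frame Relay broadcast queue line of "show interface".
QUEUE_RE = re.compile(
    r"Broadcast queue (\d+)/(\d+), "
    r"broadcasts sent/dropped (\d+)/(\d+)"
)


class BroadcastQueue(NamedTuple):
    queued: int   # assumed: packets currently in the queue
    limit: int    # assumed: configured queue size
    sent: int
    dropped: int

    @property
    def drop_fraction(self) -> float:
        """Share of broadcasts dropped, assuming sent excludes drops."""
        total = self.sent + self.dropped
        return self.dropped / total if total else 0.0


def parse_broadcast_queue(text: str) -> BroadcastQueue:
    """Extract the broadcast queue counters from show interface output."""
    match = QUEUE_RE.search(text)
    if match is None:
        raise ValueError("no broadcast queue line found")
    return BroadcastQueue(*(int(value) for value in match.groups()))


# Queue lines taken from Examples 9-271 and 9-272.
before = parse_broadcast_queue(
    "Broadcast queue 64/64, broadcasts sent/dropped 1769202/1849660, "
    "interface broadcasts 3579215"
)
after = parse_broadcast_queue(
    "Broadcast queue 0/256, broadcasts sent/dropped 1769202/0 , "
    "interface broadcasts 3579215"
)
print(f"before: {before.drop_fraction:.1%}, after: {after.drop_fraction:.1%}")
```

With the figures from Example 9-271, roughly half of all broadcasts were being dropped before the queue was resized.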
SPF Running Constantly—Cause: Duplicate Router ID
This is also a common problem in OSPF. When two routers have identical router IDs, confusion results in the OSPF topology database, and the route keeps getting added and deleted. The most common symptom of this problem is that the LS Age field always has a small value.
This problem usually is generated by a cut and paste of a router configuration into another router. This results in two routers with identical router IDs. Figure 9-103 shows a network setup in which R2 and R3 have duplicate router IDs of 192.168.1.129.
Figure 9-104 shows the flowchart to follow to solve this problem.
Debugs and Verification
A duplicate router ID causes SPF to run frequently, and the SPF counter keeps incrementing until the problem is fixed. Example 9-273 shows that SPF in area 0 ran 2446 times, which is a large number.
Example 9-273 Determining How Often SPF Is Running
R1#show ip ospf
 Routing Process "ospf 1" with ID 192.168.2.129
 Supports only single TOS(TOS0) routes
 It is an area border
 SPF schedule delay 5 secs, Hold time between two SPFs 10 secs
 Minimum LSA interval 5 secs. Minimum LSA arrival 1 secs
 Number of external LSA 8. Checksum Sum 0x48C3E
 Number of DCbitless external LSA 0
 Number of DoNotAge external LSA 0
 Number of areas in this router is 4. 1 normal 0 stub 0 nssa
    Area BACKBONE(0)
        Number of interfaces in this area is 1
        Area has no authentication
        SPF algorithm executed 2446 times
The next step is to turn on debug ip ospf monitor. This debug shows exactly which LSA to chase. Example 9-274 shows the output of debug ip ospf monitor, which shows that a router with a router ID of 192.168.1.129 is the problem. The output also shows that it’s a router LSA.
Example 9-274 debug ip ospf monitor Output Pinpoints the Router Causing This Problem
R1#debug ip ospf monitor
OSPF: Schedule SPF in area 0.0.0.0
      Change in LS ID 192.168.1.129, LSA type R,
OSPF: schedule SPF: spf_time 1620348064ms wait_interval 10s
Example 9-275 shows the output of the router LSA in question. The two instances of this output were taken 15 seconds apart. The first output shows that the number of links in this router is one; the second shows that the number of links is three. A single router would not change its link count this way every 15 seconds, so there must be another router with the same router ID originating its own version of the LSA. Also, the LS Age field is always less than 10 seconds.
The first output in this example is the router LSA of R2; the second output is the router LSA of R3.
Example 9-275 Determining the Discrepancy in the Router LSA
R1#show ip ospf database router 192.168.1.129
       OSPF Router with ID (192.168.2.129) (Process ID 1)
                Router Link States (Area 0.0.0.0)
  LS age: 9
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 192.168.1.129
  Advertising Router: 192.168.1.129
  LS Seq Number: 80067682
  Checksum: 0xC456
  Length: 36
   Number of Links: 1
    Link connected to: a Transit Network
     (Link ID) Designated Router address: 192.168.254.14
     (Link Data) Router Interface address: 192.168.254.14
      Number of TOS metrics: 0
       TOS 0 Metrics: 10

R1#show ip ospf database router 192.168.1.129
       OSPF Router with ID (192.168.2.129) (Process ID 1)
                Router Link States (Area 0.0.0.0)
  LS age: 7
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 192.168.1.129
  Advertising Router: 192.168.1.129
  LS Seq Number: 8006768
  Checksum: 0xA7D8
  Length: 6
   Number of Links: 3
    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.2.129
     (Link Data) Router Interface address: 192.168.252.13
      Number of TOS metrics: 0
       TOS 0 Metrics: 66
    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 192.168.252.12
     (Link Data) Network Mask: 255.255.255.252
      Number of TOS metrics: 0
       TOS 0 Metrics: 66
    Link connected to: a Transit Network
     (Link ID) Designated Router address: 192.168.253.14
     (Link Data) Router Interface address: 192.168.253.14
      Number of TOS metrics: 0
       TOS 0 Metrics: 1
R1
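The comparison made between the two captures can be expressed as a simple heuristic: same advertising router, different link counts, and a low LS age in both. The Python sketch below is an illustration, not a definitive test. The names summarize_lsa and suggests_duplicate_router_id are hypothetical, the 10-second threshold is taken from the observation above, and the captures are abbreviated to the three fields that the check uses.

```python
import re
from typing import NamedTuple


class LsaSummary(NamedTuple):
    advertising_router: str
    age: int
    links: int


def _field(pattern: str, text: str) -> str:
    """Return the first capture group of pattern, or raise if absent."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    return match.group(1)


def summarize_lsa(text: str) -> LsaSummary:
    """Pull the fields of interest out of one router LSA display."""
    return LsaSummary(
        advertising_router=_field(r"Advertising Router: (\S+)", text),
        age=int(_field(r"LS age: (\d+)", text)),
        links=int(_field(r"Number of Links: (\d+)", text)),
    )


def suggests_duplicate_router_id(a: str, b: str, max_age: int = 10) -> bool:
    """Heuristic: same originator, differing link counts, both LSAs young."""
    first, second = summarize_lsa(a), summarize_lsa(b)
    return (
        first.advertising_router == second.advertising_router
        and first.links != second.links
        and first.age < max_age
        and second.age < max_age
    )


# Abbreviated versions of the two captures in Example 9-275.
capture_1 = "LS age: 9\nAdvertising Router: 192.168.1.129\nNumber of Links: 1\n"
capture_2 = "LS age: 7\nAdvertising Router: 192.168.1.129\nNumber of Links: 3\n"
print(suggests_duplicate_router_id(capture_1, capture_2))
```

A positive result only narrows the search; a router whose own links are flapping could produce similar symptoms, so the router IDs still need to be checked directly.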
Example 9-276 shows that R2 and R3 have identical router IDs.
Example 9-276 Detecting Duplicate Router IDs
R2#show ip ospf
 Routing Process "ospf 1" with ID 192.168.1.129

R3#show ip ospf
 Routing Process "ospf 1" with ID 192.168.1.129
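In a larger network, the same check can be run across the show ip ospf output collected from every router. The following Python sketch is illustrative: find_duplicate_router_ids is a hypothetical helper, the outputs are assumed to have been gathered already and keyed by hostname, and only the first OSPF process in each output is examined.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the first line of "show ip ospf" and captures the router ID.
ROUTER_ID_RE = re.compile(r'Routing Process "ospf \d+" with ID (\S+)')


def find_duplicate_router_ids(outputs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each router ID used by more than one host to those hosts."""
    hosts_by_id = defaultdict(list)
    for hostname, text in outputs.items():
        match = ROUTER_ID_RE.search(text)
        if match is not None:
            hosts_by_id[match.group(1)].append(hostname)
    return {
        router_id: sorted(hosts)
        for router_id, hosts in hosts_by_id.items()
        if len(hosts) > 1
    }


# First lines of show ip ospf from Examples 9-273 and 9-276.
outputs = {
    "R1": 'Routing Process "ospf 1" with ID 192.168.2.129',
    "R2": 'Routing Process "ospf 1" with ID 192.168.1.129',
    "R3": 'Routing Process "ospf 1" with ID 192.168.1.129',
}
duplicates = find_duplicate_router_ids(outputs)
print(duplicates)
```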
Solution
To correct this problem, either change the router ID of R3 or change the router ID of R2. Example 9-277 shows how to change the router ID of R3 and gives the output of the show ip ospf command to verify that the router ID has been changed.
Example 9-277 Changing the Router ID of R3
R3(config)#interface loopback 0
R3(config-if)#ip address 192.168.3.129 255.255.255.255
R3(config-if)#end
R3#show ip ospf
 Routing Process "ospf 1" with ID 192.168.3.129
Example 9-278 shows that after changing the router ID of R3, the LS age for 192.168.1.129 becomes stable in the OSPF database. The LS age has reached 90 seconds, so the entry is now stable.
Example 9-278 The LS Age for the Problem LSA Is Now Stable
R1#show ip ospf database router 192.168.1.129
       OSPF Router with ID (192.168.2.129) (Process ID 1)
                Router Link States (Area 0.0.0.0)
  LS age: 90
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 192.168.1.129
  Advertising Router: 192.168.1.129
  LS Seq Number: 80067686
  Checksum: 0xC456
  Length: 36
   Number of Links: 1
    Link connected to: a Transit Network
     (Link ID) Designated Router address: 192.168.254.14
     (Link Data) Router Interface address: 192.168.254.14
      Number of TOS metrics: 0
       TOS 0 Metrics: 10