Fast STP Convergence
When discussing the five STP states, Chapter 6 mentioned that Spanning Tree starts ports in a Blocking state. Over a period of 30 to 50 seconds, ports work their way through the Listening and Learning states to finally reach the Forwarding state where user data is actually passed. Spanning Tree was intentionally designed with this conservative logic. Consider the alternative: if STP immediately brought ports into the Forwarding state, loops could form and prevent STP from ever getting any BPDUs through! Instead, STP keeps all ports from forwarding anything but BPDUs for at least 30 seconds by default. This is designed to give the bridges in almost all networks plenty of time to learn the physical topology and then create a loop-free logical topology.
However, the downside to this conservative behavior is slow convergence time. In a world that has grown accustomed to sub-ten-second-failover protocols such as Open Shortest Path First (OSPF), Enhanced IGRP (EIGRP), and Hot Standby Router Protocol (HSRP), Spanning Tree’s default behavior can be intolerably sluggish. In response to this need for speed, there are a variety of techniques available on Catalyst switches to improve on Spanning Tree’s default performance. Some of these techniques merely implement features originally designed into the 802.1D protocol. Others take advantage of new and patented features researched and implemented by Cisco. These capabilities can play an important role in building large and stable switched backbones.
In total, Catalysts offer seven techniques to improve Spanning Tree’s convergence time:
- Tuning Max Age
- Tuning Forward Delay
- Lowering Hello Time
- PortFast
- UplinkFast
- BackboneFast
- Disabling PAgP on EtherChannel-capable ports
Tuning Max Age
Recall that Spanning Tree centers around a process of each port saving a copy of the best BPDU it has heard. However, there must be some process to age out this stored information; otherwise, the network would never repair itself after the failure of a bridge or link. The Max Age timer controls the length of this age-out period.
The Root Bridge generates Configuration BPDUs on all of its ports every 2 seconds (the default Hello Time interval). This triggers a cascade of Configuration BPDUs that, under normal conditions, should reach every segment in the network. As long as a non-Designated Port continues to receive Configuration BPDUs from the segment’s Designated Port, the port’s saved information never ages out. In other words, this continual refresh of more attractive BPDU information keeps the non-Designated Port in a Blocking state.
However, if an upstream bridge or link fails, the flow of Configuration BPDUs is interrupted. After Max Age seconds, the port starts sending its own Configuration BPDUs as it transitions into the Listening state. As long as no more attractive BPDUs are heard from other devices, the port eventually transitions through the Learning state and ends up in the Forwarding state. At this point, the port has restored connectivity to the local segment.
In practice, Max Age is used to detect indirect failures. For example, the directly connected failure illustrated in Figure 7-20 does not require Cat-C to use the Max Age timer.
Figure 7-20. Direct Failures Take Approximately 30 Seconds to Reconverge
In this network, Cat-A is the Root Bridge, and Cat-C (an IDF switch) has selected Port 1/2 as its Root Port because it has a lower Root Path Cost (23 versus 1000). Assume that the cable connecting Cat-C and Cat-B fails. This produces an immediate physical layer loss of link on Cat-C:Port-1/2 and causes that port to be placed in the not-connected state. Port 1/2 is then immediately excluded from STP processing and causes Cat-C to start searching for a new Root Port. 30 seconds later (twice the Forward Delay), Port 1/1 enters the Forwarding state and connectivity resumes.
- Tip
Max Age is used to detect and recover from indirect failures.
On the other hand, Figure 7-21 depicts an indirect failure.
Figure 7-21. Indirect Failures Take Approximately 50 Seconds to Reconverge
In this case, the link between Cat-A and Cat-B fails. Cat-C:Port-1/2 receives no direct notification that anything has changed. All Cat-C notices is that Configuration BPDUs stop arriving on Port 1/2. After waiting for the number of seconds specified by the Max Age timer, Cat-C:Port-1/1 starts to take over as the Root Port. This reconvergence takes considerably longer: 50 seconds as opposed to 30 seconds.
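The 30-second and 50-second figures follow directly from the default timers. The following minimal Python sketch shows the arithmetic. It assumes the default timer values and ignores link-detection and BPDU propagation latency; the function name is illustrative only and is not part of any Cisco tool.

```python
def reconvergence_seconds(max_age=20, forward_delay=15, direct=True):
    """Approximate STP reconvergence time after losing the path to the Root Bridge."""
    # Every failure waits through Listening and Learning (2 x Forward Delay).
    pre_forwarding = 2 * forward_delay
    # Only an indirect failure must first wait for the stored BPDU to age out.
    aging = 0 if direct else max_age
    return aging + pre_forwarding


print(reconvergence_seconds(direct=True))   # direct failure, as in Figure 7-20
print(reconvergence_seconds(direct=False))  # indirect failure, as in Figure 7-21
```

With the defaults, the two calls print 30 and 50, which makes it clear that tuning Max Age can only shorten the indirect case.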
The default Max Age value of 20 seconds is designed to take two factors into account:
- End-to-End BPDU Propagation Delay
- Message Age Overestimate
Calculating End-to-End BPDU Propagation Delay for Max Age
End-to-End BPDU Propagation Delay is the amount of time that it takes for a BPDU to travel from one edge of the network to the other edge of the network. The 802.1D specification assumes that up to three BPDUs can get lost along the way, the maximum distance between any two nodes in the network is seven bridge hops, and each bridge can take up to one second to propagate a BPDU after receiving it. Also, the default Hello Time interval of two seconds is assumed. Table 7-2 documents these assumptions.
Table 7-2. Assumptions Used to Calculate the Default Max Age
Parameter | Value | Tunable | Description |
lost_msgs | 3 | N | Lost Messages: Number of BPDUs that can be lost as a message moves from one end of the bridged network to the other. Reasons for the potential loss of BPDUs include issues such as congestion, CRC errors, physical layer issues, and software deficiencies (that is, bugs). |
dia | 7 hops | Y | Diameter: The maximum number of bridge hops between any two end-station nodes in the network. |
bpdu_delay | 1 sec | N | BPDU Transmission Delay: The maximum time a single bridge takes to propagate a BPDU after receiving it on another port. |
hello_t | 2 secs | Y | Hello Time: The time between Configuration BPDUs that are originated on the Root Bridge. |
These values can be used to calculate the End-to-End BPDU Propagation Delay using the following formula:
end-to-end_bpdu_propagation_delay
= ((lost_msgs + 1) x hello_t) + (bpdu_delay x (dia – 1))
= ((3 + 1) x 2) + (1 x (7 – 1))
= 8 + 6
= 14 seconds
Calculating Message Age Overestimate for Max Age
The other component of Max Age is something referred to as Message Age Overestimate. This is the amount of time that the Message Age field carried in BPDUs can be overstated. Recall from the earlier section “STP Timers” that Message Age is the amount of time that has passed since the Root Bridge first originated the information the current BPDU is based on.
Although the 802.1D specification forbids a bridge from underestimating the Message Age field, it can become overstated because many bridges have limited timer resolution and just add a second at every hop. As a result, each bridge hop generally increases the amount of this overstatement, causing downstream bridges to potentially expire the Max Age counter too soon. 802.1D assumes that each bridge can contribute one second of Message Age Overstatement. Therefore, the total Message Age Overestimate can be calculated as follows:
message_age_overestimate = (dia – 1) x overestimate_per_bridge = (7 – 1) x 1 = 6 seconds
- Tip
Note that this process of simply incrementing the Message Age field in each bridge causes bridges farther from the Root Bridge to age out their Max Age counters first. Therefore, this effect is more pronounced in flat-earth networks that consist of many Layer 2 switches connected without any intervening routers or Layer 3 switches. This is another advantage to creating hierarchy with Layer 3 switching as prescribed by the multilayer model in Chapter 14.
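The effect described in this tip can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the Root Bridge originates BPDUs with a Message Age of zero, every bridge that relays the BPDU adds exactly one second, and the stored information expires when its age reaches Max Age. The function name is hypothetical.

```python
def seconds_until_age_out(relaying_bridges, max_age=20, increment_per_bridge=1):
    """Time a stored BPDU survives after refreshes stop arriving."""
    # Assumption: Message Age starts at 0 on the Root Bridge and grows by
    # one second at each bridge that relays the BPDU.
    message_age = relaying_bridges * increment_per_bridge
    return max(max_age - message_age, 0)


for relays in (0, 3, 6):
    print(relays, seconds_until_age_out(relays))
```

Under these assumptions, a bridge that receives BPDUs relayed through six other bridges keeps its stored information for only 14 seconds, whereas a bridge attached directly to the Root Bridge keeps it for the full 20.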
Calculating and Using Max Age
Max Age is simply the sum of these two previously calculated values:
Max Age = end-to-end_bpdu_propagation_delay + message_age_overestimate = 14 + 6 = 20 seconds
Of the values assumed in this calculation, two are something you can consider tunable:
- Diameter
- Hello Time
If your bridged network diameter is considerably smaller than seven hops or your Hello Time has been set to one second, you might want to recalculate a new Max Age value. The result can then be entered at your Root Bridges using the following command:
set spantree maxage agingtime [vlan]
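To see how diameter and Hello Time change the result, the Max Age calculation can be expressed as a short Python function. This is a sketch of the arithmetic from Table 7-2 only. The function and parameter names mirror the table but are otherwise illustrative, and the non-tunable values are left at the 802.1D assumptions.

```python
def max_age_seconds(dia=7, hello_t=2, lost_msgs=3, bpdu_delay=1,
                    overestimate_per_bridge=1):
    """Max Age = End-to-End BPDU Propagation Delay + Message Age Overestimate."""
    propagation_delay = (lost_msgs + 1) * hello_t + bpdu_delay * (dia - 1)
    message_age_overestimate = (dia - 1) * overestimate_per_bridge
    return propagation_delay + message_age_overestimate


print(max_age_seconds())                  # default: diameter 7, Hello Time 2
print(max_age_seconds(dia=3))             # smaller diameter
print(max_age_seconds(dia=7, hello_t=1))  # Hello Time lowered to one second
```

The three calls yield 20, 12, and 16 seconds respectively. For a three-hop network, for example, the value 12 would be the agingtime argument to set spantree maxage.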
- Tip
When using Layer 3 switching to limit the size of your Spanning Tree domains, the Max Age timer can be safely tuned. The extent of tuning that is possible is based on the style of Layer 3 switching in use and the overall campus network design. See Chapter 15 for specific details and recommendations.
You can only modify the timer values on Root Bridges. Don’t forget to also change the values on any backup Root Bridges.
If you do lower the Hello Time interval, carefully consider the impact that it has on your CPU. STP can easily be the single most intensive CPU process running on a modern bridge (which does all of the frame forwarding in hardware). Cutting the Hello Time interval in half doubles the load that STP places on your CPU. See the sections “Lowering Hello Time to One Second” and “Tips and Tricks: Mastering STP” later in this chapter for more guidance.
- Tip
Decreasing Hello Time to one second doubles the STP load placed on your CPU. Use the formula presented at the end of the chapter to be certain that this does not overload your CPU. However, in networks that contain a limited number of VLANs, lowering Hello Time to one second can be an excellent way to improve convergence times. See the section “Lowering Hello Time to One Second” for more information.
Precautions for Tuning Max Age
If you assume a lower network diameter, be careful to look for unexpected bridge hops that can crop up during failure conditions. Also, diameter is defined as the largest number of bridge hops between any two end-station nodes in your network. It is not just the number of hops between your primary Root Bridge and the farthest leaf node.
- Tip
Be careful when calculating bridge diameter—unexpected hops can creep in when other links or devices fail.
Some of the 802.1D values might appear overly conservative. For instance, most users would argue that their networks would never drop three BPDUs while transferring information a mere seven bridge hops. Likewise, the assumption that each bridge takes one second to propagate a BPDU seems strange in a world of high-horsepower switching.
Although it might be tempting to recalculate the formula with more “real-world” values, I strongly recommend against this. Keep in mind that these values were chosen to provide adequate margin in networks experiencing failure conditions, not just networks happily humming along while everything is operating at peak efficiency. When a failure does occur, your bandwidth and CPU capacity can be depleted as the network tries to recover. Be sure to leave some reserves to handle these situations.
- Tip
Only modify the diameter and Hello Time variables in the Max Age calculation. Modifying the other values can surprise you some day (when you least expect it!).
Although any form of STP timer tuning can be dangerous, reducing Max Age can be less risky than other forms. If you set Max Age too low, a brief interruption in the flow of Configuration BPDUs in the network can cause Blocking ports to age out their BPDU information. When this happens, this rogue bridge starts sending Configuration BPDUs in an attempt to move into the Forwarding state.
If there is a functioning Designated Port available for that segment, it refutes the BPDU with a Configuration BPDU of its own (this is the exception processing discussed earlier in the “Configuration BPDU Processing” section). However, if the Designated Port has failed, no device will refute the BPDU, and the rogue bridge might form a bridging loop before it hears from any neighboring bridges.
- Tip
Modifying Max Age is less dangerous than changing the other timer values. Unfortunately, it only improves convergence in the case of an indirect failure.
Tuning Forward Delay
Forward Delay can add an agonizing amount of delay to the recovery time experienced in your network. Unlike Max Age, where direct failures cause that stage to be bypassed, all failures must wait through the Forward Delay stage twice (Listening and then Learning).
Given this sluggishness, it can be very tempting to try improving STP’s performance by lowering the Forward Delay timer. However, unless this is done with considerable planning and care, it can have a devastating impact on your network. This section explores the thought process behind Forward Delay’s default value and explores reasonable guidelines for tuning this parameter.
Forward Delay is used to control the migration of a port from the Blocking state to the Forwarding state. If this was an instantaneous transition, a loop could easily form and the network would collapse under the resulting load. Instead, bridges wait for twice the Forward Delay to allow BPDUs to propagate across the network and all traffic on the old topology to die. In addition, it must allow time for the same overestimate error in Message Age seen in the previous section.
To accommodate these three items, the default Forward Delay value recommended in the 802.1D specification accounts for the following four components that affect STP reconvergence:
- End-to-End BPDU Propagation Delay
- Message Age Overestimate
- Maximum Transmission Halt Delay
- Maximum Frame Lifetime
You might recognize the first two factors affecting convergence as being the same two used to calculate Max Age. These values represent the same two factors here: the time that it takes for BPDUs to move across the network and nodes expiring information too early because of an overstatement in the Message Age field. Conversely, the combination of the last two items allows time for information to age out of the old topology before the new topology takes effect.
The Forward Delay calculation uses the values shown in Table 7-3.
Table 7-3. Assumptions Used to Calculate the Default Forward Delay
Parameter | Value | Tunable | Description |
lost_msgs | 3 | N | Lost Messages: Number of BPDUs that could get lost as a message moves from one end of the bridged network to the other. |
dia | 7 hops | Y | Diameter: The maximum number of bridge hops between any two nodes in the network. |
bpdu_delay | 1 sec | N | BPDU Transmission Delay: The maximum time a single bridge takes to propagate a BPDU after receiving it on another port. |
hello_t | 2 secs | Y | Hello Time: The time between Configuration BPDUs that are originated on the Root Bridge. |
tx_halt_delay | 1 sec | N | Transmit Halt Delay: The time it takes for a bridge to stop sending traffic on a port after it enters the Blocking state. |
transit_delay | 1 sec | N | Bridge Transit Delay: The time it takes for a bridge to propagate a data frame. |
med_access_delay | .5 sec | N | Maximum Medium Access Delay: The time it takes for a device to gain access to the wire to initially transmit a frame. |
Calculating End-to-End BPDU Propagation Delay and Message Age Overestimate for Forward Delay
These components are used to calculate Forward Delay as follows:
end-to-end_bpdu_propagation_delay
= ((lost_msgs + 1) x hello_t) + (bpdu_delay x (dia – 1))
= ((3 + 1) x 2) + (1 x (7 – 1))
= 8 + 6
= 14 seconds
message_age_overestimate = (dia – 1) x overestimate_per_bridge = (7 – 1) x 1 = 6 seconds
These two calculations are the same two used to derive Max Age. With Forward Delay, just as in Max Age, they account for the time it takes to propagate BPDUs across the network and for the error present in the Message Age field of Configuration BPDUs.
Calculating Maximum Transmission Halt Delay
Maximum Transmission Halt Delay is designed to account for the lag that can occur before a bridge puts a port into the Blocking state. In other words, the Spanning-Tree Protocol might determine that a port should be placed in the Blocking state. However, it might take the bridge or switch some time to follow through on this decision. As Table 7-3 indicates, 802.1D allows one second for this event.
Calculating Maximum Frame Lifetime
In addition to the delay that it takes for a bridge to “close the door” by blocking the port, the algorithm needs to account for frames that have already “gone out the door” and are running around the network looking for their final destination. This is done with the Maximum Frame Lifetime. Maximum Frame Lifetime can be calculated with the following formula:
max_frame_lifetime = (dia x transit_delay) + med_access_delay = (7 x 1) + .5 = 8 seconds (rounded)
The dia x transit_delay part allows time for frames to die out, whereas the med_access_delay accounts for the time that it takes for a frame to initially gain access to the wire.
Calculating and Using Forward Delay
The entire pre-forwarding period can be calculated by combining these four parts:
pre-forwarding_period
= end-to-end_bpdu_propagation_delay + message_age_overestimate + tx_halt_delay + max_frame_lifetime
= 14 + 6 + 1 + 8
= 29 seconds
Because the pre-forwarding period is divided into two halves, this number can be divided by two to calculate Forward Delay:
forward_delay = 29 / 2 = 15 seconds (rounded)
As with Max Age, you can substitute the appropriate values for your network to potentially calculate a smaller Forward Delay value (only modify diameter and Hello Time). This value can then be set on Root Bridges using the following command:
set spantree fwddelay delay [vlan]
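The full Forward Delay derivation can be sketched in Python using the values from Table 7-3. Only diameter and Hello Time are exposed as arguments, in keeping with the advice to leave the other values alone. The function name is illustrative, and fractional results are rounded up with math.ceil to match the rounding used in the text (Python's built-in round would turn 14.5 into 14).

```python
import math


def forward_delay_seconds(dia=7, hello_t=2):
    """Forward Delay = half of the pre-forwarding period, rounded up."""
    # Values that 802.1D treats as fixed (Table 7-3).
    lost_msgs, bpdu_delay, overestimate_per_bridge = 3, 1, 1
    tx_halt_delay, transit_delay, med_access_delay = 1, 1, 0.5

    propagation_delay = (lost_msgs + 1) * hello_t + bpdu_delay * (dia - 1)
    message_age_overestimate = (dia - 1) * overestimate_per_bridge
    max_frame_lifetime = math.ceil(dia * transit_delay + med_access_delay)

    pre_forwarding_period = (propagation_delay + message_age_overestimate
                             + tx_halt_delay + max_frame_lifetime)
    return math.ceil(pre_forwarding_period / 2)


print(forward_delay_seconds())                  # default: diameter 7, Hello Time 2
print(forward_delay_seconds(dia=3))             # smaller diameter
print(forward_delay_seconds(dia=7, hello_t=1))  # Hello Time lowered to one second
```

The three calls yield 15, 9, and 13 seconds. Treat any value produced this way as a starting point to be checked against the precautions that follow, not as a guarantee of safe operation.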
- Tip
You can only modify the timer values on Root Bridges. Don’t forget to also change the values on any backup Root Bridges so as to be consistent during primary Root Bridge failure. As with other Spanning Tree commands, it is best to get into the habit of always specifying the VLAN parameter.
Precautions for Tuning Forward Delay
You should be very careful when adjusting Forward Delay. If you are too aggressive, you can disable the entire network for extended periods of time. Recall that one of the goals of Forward Delay is to let the entire network learn the active topology before any of the ports start passing traffic. If ports are brought into the Forwarding state without giving ample time for information to propagate, loops can be the result. The other goal of Forward Delay is to let frames in the old topology die out. If this does not occur, some frames might be delivered in duplicate, something the 802.1D specification prohibits. Although this can corrupt certain applications, it is generally far less dangerous than the bridging loops created by the first issue (bridge loops can take down every application and device on the network).
Note that if a bridging loop forms in even a small area, it can quickly spread throughout the entire network. As the loop grows in size, the number of ports replicating the data grows. As the volume of data grows, output buffers and available link bandwidth are exhausted, making it harder to send the BPDUs that might fix the problem. Bridge CPUs then become overburdened with broadcast and multicast traffic, making it harder to generate the BPDUs needed to fix the loop. The resulting downward spiral can quickly lead to a network-wide meltdown.
- Tip
Be very careful and conservative when you adjust Forward Delay. If you set Forward Delay too low, it can create network-wide outages.
Lowering Hello Time to One Second
The previous two sections mention the impact that a lower Hello Time can have on the network. By causing the Root Bridge to generate Configuration BPDUs twice as often, information, in general, propagates twice as quickly through the network.
However, notice that merely lowering Hello Time from the default of two seconds to one second does not improve convergence times. It causes the network to learn information more quickly, but only by reducing Max Age or Forward Delay does this actually lead to a faster convergence time.
- Tip
Unlike Forward Delay and Max Age, lowering the Hello Time value does not improve convergence. On the contrary, you lower the Hello Time to make it possible for you to also lower the Forward Delay and/or Max Age timers. In general, it is simplest to use the set spantree root macro discussed in Chapter 6 (it automatically makes all necessary adjustments based on the suggested formulas in 802.1D).
The Hello Time can be adjusted with the set spantree hello command. For instance, the following command lowers the Hello Time for VLAN 3 to one second:
set spantree hello 1 3
If you do lower the Hello Time value, carefully consider the CPU overload warning mentioned in the “Tuning Max Age” section. For more information, see the formula presented in the “Tips and Tricks: Mastering STP” section.
PortFast
PortFast is a feature that is primarily designed to optimize switch ports that are connected to end-station devices. By using PortFast, these devices can be granted instant access to the Layer 2 network.
Think for a moment about what happens when you boot your PC every morning. You flip the big red switch, the monitor flickers, it beeps and buzzes. Somewhere during that process your network interface card (NIC) asserts Ethernet link, causing a Catalyst port to jump from not connected to the STP Listening state. Thirty seconds later, after passing through Listening and Learning, the Catalyst puts your port into Forwarding mode, and you are able to play Doom to your heart’s content.
Normally, this sequence never even gets noticed because it takes your PC at least 30 seconds to boot. However, there are two cases where this might not be true.
First, some NICs do not enable link until the MAC-layer software driver is actually loaded. Because most operating systems try to use the network almost immediately after loading the driver, this can create an obvious problem. Several years ago, this problem was fairly common with certain Novell ODI NIC drivers. With more modern NICs, this problem is fairly common with PC Card (PCMCIA) NICs used in laptop computers.
Second, there is a race: a race between Microsoft and Intel. Intel keeps making the CPUs faster and Microsoft keeps making the operating systems slower… and so far Intel is winning. In other words, PCs are booting faster than ever. In fact, some modern machines are done booting (or at least far enough along in the process) and need to use the network before STP’s 30-second countdown has finished. Dynamic Host Configuration Protocol (DHCP) and NT Domain Controller authentication are two common activities that occur late in the initialization process.
In both cases, STP’s default settings can create a problem. How do you know if you have this problem? Probably the easiest test is to plug both the PC and the Catalyst port into a hub. This provides a constant link to the Catalyst and keeps the port in Forwarding mode regardless of whether the PC is booted or not. Another classic symptom is if your PC always has problems when you first cold boot it in the morning, but it never has problems when you warm boot it during the day or try to manually complete login or DHCP sequences after booting.
This problem motivates some network administrators to disable STP altogether. This certainly fixes any STP booting problems, but it can easily create other problems. If you employ this strategy, it requires that you eliminate all physical loops (a bad idea from a resiliency standpoint) and carefully avoid all physical layer loops (something that can be difficult to do in the real world). Also, keep in mind that you can’t disable STP for a single port. set spantree disable [vlan] is a per-VLAN global command that disables STP for every port that participates in the specified VLAN (and, as you would expect, VLAN 1 is the default if you do not specify the VLAN parameter). Moreover, some of the Layer 3 switching technologies, such as the Catalyst 5000 Supervisor Module III NetFlow Feature Card (NFFC), require that Spanning Tree be disabled on the entire box (all VLANs)!
In short, rather than disabling STP, you should consider using Cisco’s PortFast feature. This feature gives you the best of both worlds—immediate end-station access and the safety net of STP.
PortFast works by making a fairly simple change in the STP process. Rather than starting out at the bottom of the Blocking-to-Listening-to-Learning-to-Forwarding hierarchy of states as with normal STP, PortFast starts at the top. As soon as your switch sees the link, the port is placed in the Forwarding state (Catalyst 3000s actually spend one second in both Listening and Learning, but who’s counting?). If STP later detects that you have a loop, it does all of the Root and Designated Port calculations discussed earlier. If a loop is detected, the port is put in the Blocking state.
This magic only occurs when the port first initializes. If the port is forced into the Blocking state for some reason and later needs to return to the Forwarding state, the usual Listening and Learning processing is done.
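The behavior described above can be summarized as a small Python model. This is an illustrative sketch of the state sequence only, not Catalyst code. The function name is hypothetical, and the model ignores the one-second Listening and Learning periods mentioned for Catalyst 3000s.

```python
def states_after_link_event(portfast_enabled, first_initialization):
    """Return the STP states a port passes through on its way to Forwarding."""
    if portfast_enabled and first_initialization:
        # PortFast skips straight to the top when the port first initializes.
        return ["Forwarding"]
    # Normal STP, or a PortFast port recovering from Blocking, takes the long road.
    return ["Listening", "Learning", "Forwarding"]


print(states_after_link_event(portfast_enabled=True, first_initialization=True))
print(states_after_link_event(portfast_enabled=True, first_initialization=False))
print(states_after_link_event(portfast_enabled=False, first_initialization=True))
```

Only the first call returns the single Forwarding state; the other two return the full Listening, Learning, Forwarding sequence.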
Precautions for Using PortFast
You will probably run into people who recommend that you only enable PortFast if it is absolutely necessary. The stern warning that Catalysts issue when you enable this feature backs up this opinion:
Warning: Spantree port fast start should only be enabled on ports connected to a single host. Connecting hubs, concentrators, switches, bridges, etc. to a fast start port can cause temporary Spanning Tree loops. Use with caution.
However, to the contrary, PortFast can actually improve the stability of large networks! Recall the discussion of TCN BPDUs. TCN BPDUs are sent every time a bridge detects a change in the active topology to shorten the bridge table age-out time to the Forward Delay interval. Do you really want to potentially flush large sections of your bridging tables every time a user boots? Probably not.
- Tip
Use PortFast on your end-station ports. Not only does it avoid problems when these devices boot, it reduces the number of Topology Change Notifications in your network.
Despite all of PortFast’s benefits, you should not carelessly enable it on every port. Only enable it on ports that connect to workstations. Because servers rarely reboot (you hope), don’t enable it here.
- Tip
One exception to the rule of not using PortFast on server ports involves the use of fault-tolerant NICs. If you are using one of these NICs that toggles link-state during failover (most don’t), you should enable PortFast on these server ports.
Finally, you cannot use PortFast on trunk ports. Although Catalysts allow the command to be entered on trunk links, it is ignored. In short, PortFast is like any other power tool: it is extremely useful, but only if used correctly.
- Tip
Do not enable PortFast on looped ports.
Using PortFast
Enabling PortFast is simple. Use the set spantree portfast command:
set spantree portfast mod_num/port_num {enable | disable}
For example, to enable PortFast on every port of a 24-port module in slot 3, issue the following command:
set spantree portfast 3/1-24 enable
You can later disable PortFast with the following command:
set spantree portfast 3/1-24 disable
If you want to check to see where you have PortFast enabled, you can use the show spantree command as in Example 7-14.
Example 7-14 Showing Where PortFast Is Enabled
Cat-A (enable) show spantree 1
VLAN 1
Spanning tree enabled
Spanning tree type          ieee

Designated Root             00-90-92-16-28-00
Designated Root Priority    100
Designated Root Cost        19
Designated Root Port        1/1
Root Max Age 20 sec   Hello Time 2 sec   Forward Delay 15 sec

Bridge ID MAC ADDR          00-90-92-bf-70-00
Bridge ID Priority          32768
Bridge Max Age 20 sec   Hello Time 2 sec   Forward Delay 15 sec

Port      Vlan Port-State    Cost  Priority Fast-Start Group-method
--------- ---- ------------- ----- -------- ---------- ------------
1/1       1    forwarding    19    32       disabled
1/2       1    blocking      1000  32       disabled
3/1       1    forwarding    100   32       enabled
3/2       1    forwarding    100   32       enabled
3/3       1    forwarding    100   32       enabled
Look under the Fast-Start column. Notice how the end-station ports on module three have PortFast enabled, whereas the uplink ports on the Supervisor do not.
- Tip
In many cases, you might experience a 17–20 second delay even after you have enabled PortFast. This is almost always caused by a side effect of the Port Aggregation Protocol (PAgP) used to handle EtherChannel negotiations. As discussed in the “Disabling Port Aggregation Protocol” section later in this chapter, PAgP hides port initialization changes for approximately 17–18 seconds. In other words, although PortFast might enable the link as soon as it is aware that the port has transitioned, PAgP delays this notification. In a future software release, Cisco is considering disabling PAgP on ports where PortFast is enabled, a change that would avoid this problem.
UplinkFast
UplinkFast is an exciting feature that Cisco rolled out in the 3.1 NMP release. This exclusive feature (it is patented) allows wiring closet switches to converge in two to three seconds!
The syntax for UplinkFast is even simpler than PortFast:
set spantree uplinkfast {enable | disable} [rate station_update_rate]
You should only enable UplinkFast on IDF-like wiring closet switches in correctly designed networks. UplinkFast is designed to only operate on switches that are leaves (end nodes) in your Spanning Tree. If you enable it in the core of your network, it generally leads to unexpected traffic flows.
For example, consider Figure 7-22, the typical campus introduced earlier.
Figure 7-22. A Typical Campus Network Using UplinkFast
Cat-D is an IDF switch that is connected to two MDF switches (Cat-B and Cat-C). Although set spantree uplinkfast is a global command that applies to all VLANs, this section only analyzes a single VLAN: VLAN 2. Cat-A, the server farm switch, is the Root Bridge for VLAN 2. Cat-D has two uplink ports that are potential Root Port candidates. Utilizing the load balancing techniques discussed earlier, the cost on Port 1/2 has been increased to 1000 to force VLAN 2’s traffic across the 1/1 link. Notice that Port 1/1 becomes the Root Port. UplinkFast is then enabled on Cat-D with the following command:
Cat-D> (enable) set spantree uplinkfast enable
This causes Cat-D to notice that Port 1/2 is Blocking and therefore constitutes a redundant connection to the Root Bridge. By making a note of this backup uplink port, Cat-D can set itself up for a quick rollover in the event that Port 1/1 fails. The list of potential uplink ports can be viewed with the show spantree uplinkfast command as in Example 7-15.
Example 7-15 Showing Forwarding and Backup Ports with UplinkFast
Cat-D> (enable) show spantree uplinkfast
Station update rate set to 15 packets/100ms
uplinkfast all-protocols field set to off
VLAN          port list
-----------------------------------------------
1             1/1(fwd),1/2
Port 1/1 is shown as the primary port (it is in the Forwarding state) and Port 1/2 is the backup. If three uplink ports exist, all three appear in the output.
It is important to recognize that UplinkFast is a Root Port optimization. It allows wiring closet switches to quickly bring up another Root Port in the event that the primary port fails.
- Tip
UplinkFast is a Root Port optimization.
Therefore, it is futile to enable UplinkFast on a Root Bridge: because Root Bridges contain no physical Root Ports, there is nothing for UplinkFast to optimize. In other words, only implement this feature on leaf-node switches sitting at the ends of the branches of your Spanning Tree. These leaf-node switches should not be used as transit switches to reach the Root Bridge. So, as a general rule, only enable UplinkFast on your IDF wiring closet switches.
- Tip
Do not enable UplinkFast on every switch in your network! Only enable UplinkFast on leaf-node Catalysts such as your IDF switches.
To enforce the requirement of leaf-node status, Cisco modifies several STP parameters when UplinkFast is enabled. Take a look at the output of the set spantree uplinkfast command in Example 7-16.
Example 7-16 set spantree uplinkfast Command Output
Cat-D> (enable) set spantree uplinkfast enable
VLANs 1-1005 bridge priority set to 49152
The port cost and portvlancost of all ports set to above 3000
Station update rate set to 15 packets/100ms.
uplinkfast all-protocols field set to off
uplinkfast enabled for bridge
First, the Bridge Priority is modified to an unusually high value of 49,152. This causes the current switch to effectively take itself out of the election to become the Root Bridge. Second, it adds 3000 to the cost of all links. This is done to discourage other switches from using the current switch as a transit switch to the Root Bridge. Notice that neither of these actions limits STP failover in your network. The Bridge Priority modification only discourages other switches from electing this switch as the Root Bridge. If the other switches fail, this switch happily becomes the Root Bridge.
Also, the increase to Path Cost only discourages other switches from using the current switch as a transit path to the Root Bridge. However, if no alternate paths are available, the current switch gleefully transfers traffic to and from the Root Bridge.
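The effect of the Bridge Priority change can be illustrated with a short Python sketch. It assumes the standard 802.1D rule that the lowest Bridge ID (priority first, then MAC address) wins the Root Bridge election. The MAC addresses and the elect_root helper are invented for illustration and do not come from the example network.

```python
# Sketch: how UplinkFast's priority bump affects the Root Bridge election.
# MAC addresses are hypothetical; 32768 is the 802.1D default priority.
UPLINKFAST_PRIORITY = 49152  # value set by "set spantree uplinkfast enable"


def elect_root(bridges):
    """Return the name of the bridge with the lowest (priority, MAC) Bridge ID."""
    return min(bridges, key=lambda name: bridges[name])


bridges = {
    "Cat-B": (32768, "00-90-92-00-00-0b"),
    "Cat-C": (32768, "00-90-92-00-00-0c"),
    # Cat-D has the lowest MAC address, but UplinkFast raised its priority.
    "Cat-D": (UPLINKFAST_PRIORITY, "00-90-92-00-00-01"),
}

root_normal = elect_root(bridges)          # Cat-D is discouraged from winning
survivors = {"Cat-D": bridges["Cat-D"]}    # assume every other switch has failed
root_fallback = elect_root(survivors)      # Cat-D can still become Root Bridge

print(root_normal, root_fallback)
```

The higher priority only changes the ordering of candidates; it never removes the switch from the election, which is why failover is not limited.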
Notice the third line of the output in Example 7-16, which reports the station update rate. This is evidence of a subtle trick that is the crux of what UplinkFast is all about. It should be fairly obvious by now that a failure on Cat-D:Port-1/1 forces Cat-D to take all MAC addresses associated with Port 1/1 in the Bridging Table and point them to Port 1/2. However, a more subtle process must take place to convert the bridging tables in other switches. Why is this extra step necessary? Figure 7-23 shows the network with the left-hand link broken.
Figure 7-23. UplinkFast Behavior When the Primary Uplink Is Lost
Cat-D changes MAC address 00-AA-00-12-34-56 (Host-A) to Port 1/2 so that it has a correct view of the network. However, notice that Cat-A, Cat-B, and Cat-C are still trying to send traffic for 00-00-1D-2B-DE-AD (Host-B) to the broken link! This is where the real ingenuity of UplinkFast comes in: Cat-D sends out a dummy multicast frame for the addresses in its local Bridging Table. One frame is sent for each MAC address that is not associated with one of the uplink ports.
These packets are sent to the multicast destination address 01-00-0C-CD-CD-CD to ensure that they are flooded throughout the bridged network. Recall from Chapter 3 that multicast frames are flooded in the same way as broadcast frames. However, note that Cisco does not use its traditional multicast address of 01-00-0C-CC-CC-CC. Because that address is reserved for single-hop protocols such as Cisco Discovery Protocol (CDP), VLAN Trunk Protocol (VTP), Dynamic ISL (DISL), and Dynamic Trunk Protocol (DTP), Cisco devices have been programmed not to flood frames sent to 01-00-0C-CC-CC-CC. To avoid this behavior, a new multicast address needed to be introduced.
Each frame contains the source address of a different entry in the local Bridging Table. As these packets are flooded through the network, all of the switches and bridges make a note of the new interface the frame arrived on and, if necessary, adjust their bridging tables. By default, the Catalyst sends 15 of these dummy frames every 100 milliseconds, but this rate can be adjusted with the [rate station_update_rate] parameter (the number represents how many dummy updates to send every 100 milliseconds).
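The selection rule for these dummy frames can be sketched in a few lines of Python. This is only an illustration of the logic described above, not Catalyst code: the dummy_frames function, the table layout, and the assumption that Host-B sits on Port 3/1 are all hypothetical.

```python
# Sketch of which dummy frames UplinkFast would generate after a failover.
# Table contents and port assignments are illustrative assumptions.
UPLINKFAST_DST = "01-00-0C-CD-CD-CD"  # multicast destination named in the text


def dummy_frames(bridging_table, uplink_ports):
    """Build one (destination, source) pair per MAC not learned on an uplink."""
    return [
        (UPLINKFAST_DST, mac)
        for mac, port in bridging_table.items()
        if port not in uplink_ports
    ]


table = {
    "00-AA-00-12-34-56": "1/1",  # Host-A, reached through an uplink: skipped
    "00-00-1D-2B-DE-AD": "3/1",  # Host-B, assumed to be locally attached
}
frames = dummy_frames(table, uplink_ports={"1/1", "1/2"})
print(frames)
```

Only Host-B's address appears as a source, because it is the only entry the upstream switches need to relearn on a new port.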
However, adjusting the rate parameter usually does not improve failover performance. Notice that only MAC addresses not learned over the uplinks are flooded. Because UplinkFast only runs on leaf-node switches where the vast majority of the MAC addresses in the bridging table are associated with the uplink ports, usually only a few hundred addresses require flooding. The default rate floods 450 to 600 addresses in the 3–4 second UplinkFast convergence period. Therefore, it only makes sense to increase the rate if you have more than about 500 devices connected to your wiring closet switch.
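The arithmetic behind these figures is easy to check. The following Python sketch uses hypothetical helper names and simply multiplies the station update rate by the length of the convergence window.

```python
# Arithmetic behind the station update rate (15 frames per 100 ms by default).
def addresses_flooded(rate_per_100ms, seconds):
    """Number of dummy frames sent during a window of the given length."""
    return rate_per_100ms * 10 * seconds


def seconds_to_flood(address_count, rate_per_100ms):
    """Time needed to announce address_count local MAC addresses."""
    return address_count / (rate_per_100ms * 10)


low = addresses_flooded(15, 3)    # start of the 3-4 second window
high = addresses_flooded(15, 4)   # end of the 3-4 second window
slow_case = seconds_to_flood(1500, 15)

print(low, high, slow_case)
```

With the default rate, a switch with 1500 locally attached addresses would need about 10 seconds to announce them all, which is the kind of case where raising the rate starts to make sense.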
UplinkFast is an extremely effective and useful feature. It provides much faster convergence than any of the timer tuning techniques discussed earlier and is much safer. As long as you only deploy it in leaf-node switches, it can be a wonderful way to maintain the safety of STP while dramatically improving failover times in most situations.
BackboneFast
BackboneFast is a complementary (and patented) technology to UplinkFast. Whereas UplinkFast is designed to quickly respond to failures on links directly connected to leaf-node switches, it does not help in the case of indirect failures in the core of the backbone. This is where BackboneFast comes in.
Don’t expect BackboneFast to provide the two to three second rollover performance of UplinkFast. As a Max Age optimization, BackboneFast can reduce the indirect failover time from 50 to 30 seconds with default parameters, and from 14 to 8 seconds with the tunable values set to their minimums. However, it never eliminates Forward Delay and provides no assistance in the case of a direct failure (recall from the “Tuning Max Age” section that direct failures do not use Max Age).
- Tip
BackboneFast is a Max Age optimization. It allows the default convergence time for indirect failures to be reduced from 50 seconds to 30 seconds.
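These numbers follow directly from the STP timers. The Python sketch below is a simplified model that assumes an indirect failure costs one Max Age period plus two Forward Delay periods (Listening and Learning), and that BackboneFast removes only the Max Age term. The function name is hypothetical; 6 and 4 seconds are the 802.1D minimums for Max Age and Forward Delay.

```python
# Convergence arithmetic for an indirect failure (all values in seconds).
def indirect_failover(max_age, forward_delay, backbonefast=False):
    """Max Age (skipped with BackboneFast) plus Listening and Learning."""
    aging = 0 if backbonefast else max_age
    return aging + 2 * forward_delay


default_plain = indirect_failover(20, 15)
default_bbf = indirect_failover(20, 15, backbonefast=True)
minimum_plain = indirect_failover(6, 4)
minimum_bbf = indirect_failover(6, 4, backbonefast=True)

print(default_plain, default_bbf, minimum_plain, minimum_bbf)
```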
As discussed in the previous section, UplinkFast should only be enabled on a subset of all switches in your network (leaf-node, wiring closet switches). On the other hand, BackboneFast should be enabled on every switch in your network. This allows all of the switches to propagate information about link failures throughout the network.
- Tip
BackboneFast should be enabled on every switch in your network.
When a device detects a failure on the link directly connected to its Root Port, the normal rules of STP dictate that it begin sending Configuration BPDUs in an attempt to become the Root Bridge. What other devices do with these Configuration BPDUs depends on where the Designated Ports are located. If a Designated Port hears these inferior BPDUs, it immediately refutes them with a Configuration BPDU as discussed in the “Configuration BPDU” section earlier. If a non-Designated Port receives the inferior BPDU, it is ignored. However, in either case, the 802.1D standard does not provide a mechanism that allows switches receiving inferior BPDUs to make any judgments about the state of the network.
How does BackboneFast magically eliminate Max Age from the STP convergence delay? By taking advantage of the following two mechanisms:
- The first allows switches to detect a possible indirect failure.
- The second allows them to verify the failure.
The BackboneFast detection mechanism is built around the concept that inferior BPDUs are a signal that another bridge might have lost its path to the Root Bridge. BackboneFast’s verification mechanism employs a request and response protocol that queries other switches to determine if the path to the Root Bridge has actually been lost. If this is the case, the switch can expire its Max Age timer immediately, reducing the convergence time by 20 seconds.
To detect the possible failure of the Root Bridge path, BackboneFast checks the source of the inferior BPDU. If the BPDU is from the local segment’s Designated Bridge, this is viewed as a signal of an indirect failure. If the inferior BPDU came from another switch, it is discarded and ignored.
The verification process is more complex than the detection process. First, BackboneFast considers if there are alternate paths to the Root Bridge. If the switch receiving an inferior BPDU has no ports in the Blocking state (ports looped to itself are excluded), it knows that it has no alternate paths to the Root Bridge. Because it just received an inferior BPDU from its Designated Bridge, the local switch can recognize that it has lost connectivity to the Root Bridge and immediately expire the Max Age timer.
If the switch does have blocked ports, it must utilize a second verification mechanism to determine if those alternate paths have lost connectivity to the Root Bridge. To do this, the Catalysts utilize a Root Link Query (RLQ) protocol. The RLQ protocol employs two types of packets—RLQ Requests and RLQ Responses.
RLQ Requests are sent to query upstream bridges if their connection to the Root Bridge is stable. RLQ Responses are used to reply to RLQ Requests. The switch that originates the RLQ Request sends RLQ frames out all non-Designated Ports except the port that received the inferior BPDU. A switch that receives an RLQ Request replies with an RLQ Response if it is the Root Bridge or it knows that it has lost its connection to the Root Bridge. If neither of these conditions is true, the switches propagate the RLQ Requests out their Root Ports until the stability of the Root Bridge is known and RLQ Responses can be sent.
If the RLQ Response is received on an existing Root Port, the switch knows that its path to the Root Bridge is stable. On the other hand, if the RLQ Response is received on some port other than the current Root Port, it knows that it has lost its connection to the Root Bridge and can immediately expire the Max Age timer. A switch propagates RLQ Responses out all Designated Ports until the switch that originated the RLQ Request is reached.
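The detection and verification rules can be summarized as three small decision functions. The Python sketch below only restates the logic described in the preceding paragraphs. The function names and the returned labels are hypothetical and do not correspond to any Catalyst command or internal API.

```python
# Sketch of BackboneFast's detection and verification decisions.
# Names and return labels are hypothetical, chosen for readability.

def on_inferior_bpdu(from_designated_bridge, blocked_ports):
    """Detection: decide what to do when an inferior BPDU arrives."""
    if not from_designated_bridge:
        return "ignore"              # not treated as a sign of indirect failure
    if not blocked_ports:
        return "expire-max-age"      # no alternate path to the Root Bridge
    return "send-rlq-request"        # alternate paths must be verified first


def on_rlq_request(is_root_bridge, lost_root_path):
    """An upstream switch either answers or passes the query toward the Root."""
    if is_root_bridge or lost_root_path:
        return "send-rlq-response"
    return "forward-out-root-port"


def on_rlq_response(arrival_port, root_port):
    """Verification: interpret the port on which the RLQ Response arrived."""
    if arrival_port == root_port:
        return "root-path-stable"
    return "root-path-lost"


# A switch with one blocked port hears an inferior BPDU from its Designated Bridge.
step1 = on_inferior_bpdu(from_designated_bridge=True, blocked_ports=["1/2"])
step2 = on_rlq_request(is_root_bridge=True, lost_root_path=False)
step3 = on_rlq_response(arrival_port="1/1", root_port="1/1")
print(step1, step2, step3)
```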
To illustrate this process, consider the simplified campus network shown in Figure 7-24.
Figure 7-24. BackboneFast Operation
As discussed earlier, BackboneFast must be enabled on all three switches in this network. Assume that Cat-A is the Root Bridge. This results in Cat-B:Port-1/2 and Cat-C:Port-1/1 becoming Root Ports. Because Cat-B has the lower BID, it becomes the Designated Bridge for Segment 3, resulting in Cat-C:Port-1/2 remaining in the Blocking state.
Next, assume that Segment 1 fails. Cat-A and Cat-B, the switches directly connected to this segment, instantly know that the link is down. To repair the network, it is necessary that Cat-C:Port-1/2 enter the Forwarding state. However, because Segment 1 is not directly connected to Cat-C, Cat-C does not start sending any BPDUs on Segment 3 under the normal rules of STP until the Max Age timer has expired.
BackboneFast can be used to eliminate this 20-second delay with the following eight-step process (illustrated in Figure 7-25):
- Segment 1 breaks.
- Cat-B immediately withdraws Port 1/2 as its Root Port and begins sending Configuration BPDUs announcing itself as the new Root Bridge on Port 1/1. This is a part of the normal STP behavior (Steps 3–7 are specific to BackboneFast).
- Cat-C:Port-1/2 receives the first Configuration BPDU from Cat-B and recognizes it as an inferior BPDU.
- Cat-C then sends an RLQ Request out Port 1/1.
- Cat-A:Port-1/1 receives the RLQ Request. Because Cat-A is the Root Bridge, it replies with an RLQ Response listing itself as the Root Bridge.
- When Cat-C receives the RLQ Response on its existing Root Port, it knows that it still has a stable connection to the Root Bridge. Because Cat-C originated the RLQ Request, it does not need to forward the RLQ Response on to other switches.
- Because Cat-C has a stable connection to the Root Bridge, it can immediately expire the Max Age timer on Port-1/2.
- As soon as the Max Age timer expires in Step 7, the normal rules of STP require Cat-C:Port-1/2 to start sending Configuration BPDUs. Because these BPDUs list Cat-A as the Root Bridge, Cat-B quickly learns that it is not the Root Bridge and that it has an alternate path to Cat-A.
Figure 7-25. BackboneFast Steps to Eliminating the 20-Second Delay
Although this allows Cat-B to learn about the alternate path to the Root Bridge within several seconds, it still requires that Cat-C:Port-1/2 go through the normal Listening and Learning states (adding 30 seconds of delay to the convergence with the default values and 8 seconds with the minimum value for Forward Delay).
- Tip
BackboneFast requires 4.1 or later code on the Catalyst 5000. All Catalyst 4000s and 6000s support BackboneFast.
Disabling Port Aggregation Protocol
In certain situations, the Port Aggregation Protocol (PAgP) can create “unexplainable STP delays” after link initialization. By default, current implementations of EtherChannel-capable ports reserve the first 15–20 seconds after link initialization for PAgP negotiations. As is discussed in Chapter 8, “Trunking Technologies and Applications,” PAgP is a protocol that assists in correctly configuring bundles of Fast and Gigabit Ethernet links that act as one large EtherChannel pipe.
PAgP defaults to a mode called the auto state where it looks for other EtherChannel-capable ports. While this process is occurring, STP is not aware that the link is even active. This condition can be observed with show commands. For example, show port displays a connected status for Port 1/1 immediately after it has been connected as a trunk link (see Example 7-17).
Example 7-17 show port Output Immediately After Port 1/1 Is Connected
Cat-D (enable) show port
Port  Name               Status     Vlan       Level  Duplex Speed Type
----- ------------------ ---------- ---------- ------ ------ ----- ------------
1/1                      connected  trunk      normal a-half a-100 10/100BaseTX
1/2                      notconnect trunk      normal a-half a-100 10/100BaseTX
3/1                      notconnect 1          normal half   10    10BaseT
3/2                      notconnect 1          normal half   10    10BaseT
However, if show spantree is issued at the same time, it still displays the port as not-connected as demonstrated in Example 7-18.
Example 7-18 show spantree Output Immediately After Port 1/1 is Connected
Cat-D (enable) show spantree 1
VLAN 1
Spanning tree enabled
Spanning tree type          ieee
Designated Root             00-90-92-16-28-00
Designated Root Priority    100
Designated Root Cost        19
Designated Root Port        1/1
Root Max Age 20 sec    Hello Time 2 sec    Forward Delay 15 sec
Bridge ID MAC ADDR          00-90-92-bf-70-00
Bridge ID Priority          32768
Bridge Max Age 20 sec    Hello Time 2 sec    Forward Delay 15 sec
Port      Vlan Port-State    Cost  Priority Fast-Start Group-method
--------- ---- ------------- ----- -------- ---------- ------------
1/1       1    not-connected 19    32       disabled
1/2       1    forwarding    19    32       disabled
3/1       1    not-connected 100   32       disabled
3/2       1    not-connected 100   32       disabled
After approximately 15–20 seconds, PAgP releases the port for use by the rest of the box. At this point, the port enters Listening, Learning, and then Forwarding. In short, because of PAgP, the port took 50 seconds instead of 30 seconds to become active.
Therefore, you should carefully consider the impact of PAgP in your campus implementations. First, it is advisable to use the desirable channeling state for links where an EtherChannel bundle is desired. Specifically, you should avoid using the on state because it hard-codes the links into a bundle and disables PAgP’s capability to intelligently monitor the bundle. For example, all STP BPDUs are sent over a single link of the EtherChannel. If this one link fails, the entire bundle can be declared down if PAgP is not running in the auto or desirable states.
- Tip
When using EtherChannel, code the ports to the desirable channeling state. Do not use the on state because it disables PAgP’s capability to handle Spanning Tree failover situations.
However, in cases where EtherChannel is not in use, disabling PAgP can improve Spanning Tree performance dramatically. In general, campus networks benefit from disabling PAgP in three situations:
- End-Station Ports
- Servers using fault-tolerant NICs that toggle link state during failover
- Testing
End-stations can benefit from disabling PAgP on their switching ports. This can be especially noticeable when used in conjunction with PortFast. Even with PortFast enabled, EtherChannel-capable ports still require almost 20 seconds for activation because PAgP hides the port activation from STP. By disabling PAgP with the set port channel mod_num/port_num off command, this 20-second delay can be almost eliminated, allowing PortFast to function as expected.
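The combined effect of PAgP and PortFast on port activation can be approximated with simple addition. The Python sketch below is an idealized model: it assumes the PAgP hold time is the upper end of the 15 to 20 second range, that PortFast removes the Listening and Learning delay entirely, and it ignores the small residual delays seen in practice. The function name is hypothetical.

```python
# Idealized port activation delay, in seconds, after link initialization.
PAGP_HOLD = 20  # assumed: upper end of the 15-20 second PAgP negotiation window


def activation_delay(pagp_enabled, portfast, forward_delay=15):
    """PAgP hold time plus Listening and Learning (skipped with PortFast)."""
    pagp = PAGP_HOLD if pagp_enabled else 0
    stp = 0 if portfast else 2 * forward_delay
    return pagp + stp


default_port = activation_delay(pagp_enabled=True, portfast=False)
no_pagp = activation_delay(pagp_enabled=False, portfast=False)
portfast_with_pagp = activation_delay(pagp_enabled=True, portfast=True)
portfast_no_pagp = activation_delay(pagp_enabled=False, portfast=True)

print(default_port, no_pagp, portfast_with_pagp, portfast_no_pagp)
```

In this model, PortFast alone still leaves the PAgP hold time in place, which matches the behavior described above for EtherChannel-capable end-station ports.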
Fault-tolerant server NICs that toggle link state during failover can also benefit from a similar performance improvement (however, most fault-tolerant NICs do not toggle link state). Otherwise, the PAgP delay needlessly interrupts server traffic for almost 20 seconds.
Finally, you should consider disabling PAgP on non-channel ports when testing STP performance. Otherwise, the 20-second PAgP delay can skew your results.
- Tip
You might want to disable PAgP on EtherChannel-capable end-station and fault-tolerant server ports. This can also be useful when testing STP performance.
The good news is that this should not affect trunk link failover performance in production in most situations. For example, assume that Cat-D is using Port 1/1 and Port 1/2 as uplinks. If Port 1/1 fails, failover can start immediately because both links have been active for some time and are therefore past the initial PAgP lockout period. On the other hand, if Port 1/2 was acting as a cold standby and not connected when Port 1/1 failed, that is a different matter. In this case, you need to walk up and physically plug in Port 1/2 and PAgP does add to the STP failover time.