CCNP Switch Notes: Implementing High Availability
A highly available network is the goal of every network engineer. Having a highly available network makes the job easier because it helps to prevent network outages and minimize downtime.
Components of High Availability
There are five components to high availability: redundancy, technology, people, processes, and tools. The first two can be obtained through network design; the last three are more difficult to implement and control.
Redundancy
Redundancy attempts to eliminate single points of failure by providing duplicate devices and links. This costs more, so the added cost must be balanced against the added benefit. Add redundancy where it will have the most impact on availability, in the core of your network, data center, or e-commerce module. Critical WAN or ISP connections are another possible location.
A redundant network has path diversity, with multiple links between multiple devices. It can have geographic diversity, with data centers in multiple sites. Networks frequently have dual core and distribution switches, with dual uplinks to each. Dual WAN providers, with dual WAN edge routers, are commonly used. Companies can design their networks with connections to dual Telco central offices and power substations to achieve additional redundancy.
Technology
Some of the technologies found in Cisco routers and Layer 3 switches enhance availability by providing routing continuity, fast failure detection to trigger a failover, and fast routing convergence. These include:
- Cisco Nonstop Forwarding (NSF)
- Stateful Switchover (SSO)
- Stackwise technology on 3750 switches
- Virtual Switch System (VSS)
- Monitoring tools such as SNMP and Syslog
- IP Service Level Agreement (SLA)
Each of these technologies is discussed in later sections of this chapter.
Some other technological features that enhance availability include server load balancing, firewall stateful failover, and fast routing convergence.
People
Although the “people” part of high availability is not usually under the control of the network engineer, it is an important part of the equation. The following items should be considered:
- Staff work habits: Staff should pay attention to detail, and their work should be reliable and consistent to make troubleshooting easier.
- Staff skills and technical training: A knowledgeable staff understands the network technologies and configures devices correctly. A company lab enables failover scenarios to be tested before incorporating them into the network and allows network engineers to practice their skills.
- Communication and documentation: There should be good communication between teams responsible for the network, security, servers, and applications. There should also be communication with users. Good documentation, readily available, is critical to understanding how the network is designed and how it should behave during a failure.
- Sufficient time to accomplish a task: Not having enough time to accomplish a network-related task leads to important components, such as testing and documentation, being left out. The design target should be a network that is better than just “adequate.”
- Align staff with the services they support: This helps ensure clear lines of responsibility for the different segments of the network. Be sure to include the people responsible for a segment in the planning for its high availability.
Processes
Companies that build repeatable processes and design templates have more cohesive networks and save time in troubleshooting problems. Process documentation should include configuration change procedures, failover and lab testing procedures, and network implementation procedures. These should be regularly reviewed and improved as part of the PPDIOO process.
A lab that reflects the current production network enables thorough testing and validation of such changes as new configurations and IOS versions and ensures that the staff thoroughly understands network failover processes.
Having a meaningful change control process includes the complete testing of all changes and how they affect failover within the entire network before they are implemented. Changes must be well planned with a roll-back strategy in place. A risk analysis can also help determine if the change is worthwhile.
Network management processes are often overlooked. These should include:
- Capacity audits
- IOS version management
- Corporate best-practice design compliance
- Disaster recovery and business continuity plans
- Evaluating the security impact of a proposed change
Tools
A well-designed, highly available network can have a failure without it being noticed by users. It is important to have tools in place to monitor the network and send alerts when a failover occurs. Monitoring can also help spot problems as they begin to occur, enabling you to be proactive in your network management. There are many third-party tools available for this; some IOS tools are discussed in later sections of this chapter.
Good documentation is a critical tool to have. Good documentation includes up-to-date network diagrams with network addresses, VLAN information, and interface information. Important servers, applications, and services should be noted. Document not only HOW the network is designed, but also WHY it is designed that way.
Resiliency and High Availability
A highly available network is a resilient network. A resilient network employs various methods to allow it to recover and continue operating in the event of a failure. Resiliency leads to high availability through the following components:
- Network-level resiliency (the focus of this book): This includes redundant links and redundant devices, but it doesn’t stop there. Those devices must be configured so that they fail over between devices, or links, quickly.
- System-level resiliency: This includes redundancy within the hardware, such as dual power supplies, and cold-standby parts, such as extra stackable switches or switch modules. It also includes features within the hardware that enable fast failover.
- Network management and monitoring: You need to detect a failure immediately and be informed of the actions taken automatically to remediate it.
Network Level Resiliency
Redundant links were discussed in Chapter 2. By default, STP blocks a redundant link, so the links operate in an active/backup configuration. EtherChannels enable multiple links to be active at the same time. If a member link fails, the channel distributes traffic across the remaining links.
Configure your devices for fast convergence to avoid traffic drops when a link fails. RSTP is preferred over 802.1D STP because it provides faster failover. Use routing protocols such as EIGRP that have fast convergence times. You might need to tune the Layer 2 and Layer 3 protocol timers.
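As a rough illustration of the tuning described above, the following sketch enables Rapid PVST+ and shortens the EIGRP hello and hold timers on a routed uplink. The interface name, the EIGRP autonomous system number 100, and the 1-second and 3-second timer values are assumptions chosen for illustration; they are not values given in these notes.

```
! Assumed for illustration: interface name, EIGRP AS 100, timer values
sw1(config)#spanning-tree mode rapid-pvst
sw1(config)#interface GigabitEthernet1/0/1
sw1(config-if)#no switchport
sw1(config-if)#ip hello-interval eigrp 100 1
sw1(config-if)#ip hold-time eigrp 100 3
```

Very short timers trade stability for speed, so it is worth testing them in a lab before deploying them.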
For accurate monitoring statistics, it is important that network clocks are synchronized. Use NTP for this. Syslog, SNMP, and IP SLA are some tools that help you monitor and track your network’s resiliency. They are discussed in more detail in a future section.
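A minimal sketch of the clock synchronization mentioned above might look like the following. The NTP server address 10.1.1.2 is hypothetical, and the timestamp commands are included only because synchronized clocks are most useful when log messages carry a date and time.

```
! 10.1.1.2 is a hypothetical NTP server address
sw1(config)#ntp server 10.1.1.2
! Stamp log and debug messages with date and time, to the millisecond
sw1(config)#service timestamps log datetime msec
sw1(config)#service timestamps debug datetime msec
```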
Fast Failover
When measuring network resiliency, you must consider how long it takes for failover and convergence at all layers of the OSI stack, not just Layers 1–3. Table 5-1 outlines some of the typical convergence times.
Table 5-1 Convergence Times for Network Components
| Network Component | Convergence Time |
| --- | --- |
| Rapid Spanning Tree | Subsecond for minor failures, 1–2 seconds for major failures. |
| Etherchannel | Approximately 1 second to redirect traffic to a different link in the channel. |
| First Hop Redundancy Protocols such as HSRP, VRRP, or GLBP | Default of 10 seconds. Recommended tuning of hello time to 1 second and hold time to 3 seconds yields a 3-second convergence time. |
| Routing Protocols | Subsecond for OSPF and EIGRP with recommended tuning of timers. |
| Switch Service Modules | Typically 3–5 seconds. Exception is Cisco Application Control Engine (ACE) with 1-second failover in active/active configuration. |
| Computer/Server TCP Stacks | 9-second session teardown for Windows, longer for other OSs. |
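The FHRP tuning listed in Table 5-1 can be sketched with HSRP as follows. VLAN 10, group number 10, and the IP addresses are assumptions made for this example; only the 1-second hello and 3-second hold values come from the table.

```
! Assumed for illustration: VLAN 10, HSRP group 10, 10.1.10.0/24 addressing
sw1(config)#interface Vlan10
sw1(config-if)#ip address 10.1.10.2 255.255.255.0
sw1(config-if)#standby 10 ip 10.1.10.1
! Hello every 1 second, declare the peer down after 3 seconds
sw1(config-if)#standby 10 timers 1 3
```

The same timers should normally be configured on every router in the HSRP group so that behavior is predictable after a failover.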
Optimizing Redundancy
You should be aware that redundancy does not always equal resiliency. Too much redundancy can increase network complexity to the point that the network becomes harder to troubleshoot and is actually less available. With too many paths for the data to follow, traffic flow becomes less deterministic. The cost is also much higher.
NSF with SSO
Layers 2–4 convergence time is enhanced in Cisco 4500 and 6500 series switches with redundant route processors (RP) by using NSF with SSO. When using this, only one RP is active. The standby RP synchronizes its configuration and dynamic state information (such as CEF, MAC, and FIB tables) with the active RP. When the active RP fails, SSO enables the standby RP to take over immediately. NSF keeps the switch forwarding traffic during the switchover, using the existing route and CEF tables. The goal of NSF with SSO is to prevent routing adjacencies from resetting, which prevents a routing flap. The switchover to the new RP must be completed before routing timers expire, or the router’s neighbors will tear down their adjacency and routing will be disrupted.
When the new RP is up, the old routes are marked as stale, and the RP asks its routing peers to refresh them. When routing is converged, it updates the routing and CEF tables on the switch and the linecards.
NSF is supported with EIGRP, OSPF, IS-IS, and BGP. An NSF-capable router supports NSF; an NSF-aware router does not support NSF but understands it and continues forwarding traffic during SSO.
Use NSF with SSO in locations where you do not have a duplicate switch for failover, such as at the user access layer or the enterprise network edge. Where a duplicate switch is available, NSF with SSO can actually cause longer convergence: routing protocol timers can be tuned very short to provide fast convergence, and with SSO the switchover to the standby RP might not complete before the tuned routing dead timer expires, in which case the adjacency is reset.
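On a chassis with redundant supervisors, enabling NSF with SSO might look roughly like the sketch below. The OSPF process ID 1 and EIGRP autonomous system 100 are assumed values, and exact command availability depends on the platform and software release.

```
! Assumed for illustration: OSPF process 1 and EIGRP AS 100
sw1(config)#redundancy
sw1(config-red)#mode sso
sw1(config-red)#exit
! Enable NSF under each routing protocol that should survive a switchover
sw1(config)#router ospf 1
sw1(config-router)#nsf
sw1(config-router)#exit
sw1(config)#router eigrp 100
sw1(config-router)#nsf
```

The redundancy state can then be checked with show redundancy states.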
Designing for Redundancy
Figure 5-1 shows where you would typically use redundancy within a campus network. Access switches are either chassis-based with dual Supervisor engines and dual power supplies or are stackable switches. They have redundant, fully meshed links to redundant distribution switches, which, in turn, have redundant links to redundant core switches. Distribution and core switch pairs are connected via a Layer 2 or Layer 3 link. This design minimizes single points of failure and enables the network to recover from a link or switch failure.
Layer 2 Versus Layer 3 Access Design
You can use a Layer 2 or a Layer 3 access layer. When using L2, VLANs can either be distributed across multiple switches or local to each switch. Figure 5-2 shows L2 access switches with VLAN 10 on both of them. This design is not recommended. The FHRP Active switch and the STP Root must be statically configured as the same switch. STP blocks one uplink per access switch. RSTP helps speed convergence.
There must be a physical link between distribution switches, and it should be an L2 trunk. Without that link, any traffic between the distribution switches must go through an access switch. Additionally, failure of one of the access-to-distribution uplinks causes packets to be dropped until the FHRP dead timer expires.
Figure 5-3 shows the recommended design when using L2 access switches. Each VLAN is local to one switch. The FHRP Active and STP Root must still be the same switch. They are still statically configured per VLAN so that traffic flow is deterministic. Because the link between distribution switches is L3, there are no L2 loops, and thus no links are blocked by STP. However, traffic does not load balance between links because each switch forwards traffic only over the link to its HSRP Active and STP Root switch. RSTP is still used for faster convergence.
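Aligning the STP Root and the FHRP Active role on the same distribution switch, as described above, could be sketched as follows. The hostnames dist1 and dist2, VLAN 10, the HSRP group number, the virtual IP address, and the priority value are all assumptions made for this example.

```
! Assumed for illustration: hostnames, VLAN 10, group 10, 10.1.10.1, priority 110
! Primary distribution switch: STP Root and HSRP Active for VLAN 10
dist1(config)#spanning-tree vlan 10 root primary
dist1(config)#interface Vlan10
dist1(config-if)#standby 10 ip 10.1.10.1
dist1(config-if)#standby 10 priority 110
dist1(config-if)#standby 10 preempt
! Peer distribution switch: backup root, default HSRP priority of 100
dist2(config)#spanning-tree vlan 10 root secondary
dist2(config)#interface Vlan10
dist2(config-if)#standby 10 ip 10.1.10.1
```

With several VLANs, the primary role can be alternated between the two distribution switches per VLAN, as long as the Root and the Active router for any given VLAN stay on the same switch.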
In Figure 5-4 the access switches are L3. This design gives the fastest convergence and is the easiest to implement. All links between switches are L3. There is no need for HSRP, although STP should still be enabled in case of a misconfiguration. Access switches can load balance traffic across both uplinks. The access switches either run a routing protocol or use static routes. The distribution switches summarize routes for the access VLANs.
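A rough sketch of the L3 access design follows, assuming EIGRP as the routing protocol. The hostnames, interface names, addressing, autonomous system number, and summary range are hypothetical and would need to match the real addressing plan.

```
! Assumed for illustration: hostnames, interfaces, 10.1.0.0/16 plan, EIGRP AS 100
! Access switch: routed uplink toward a distribution switch
acc1(config)#ip routing
acc1(config)#interface GigabitEthernet1/0/49
acc1(config-if)#no switchport
acc1(config-if)#ip address 10.1.255.2 255.255.255.252
acc1(config-if)#exit
acc1(config)#router eigrp 100
acc1(config-router)#network 10.0.0.0
! Distribution switch: summarize the access subnets on the routed link toward the core
dist1(config)#interface TenGigabitEthernet1/1
dist1(config-if)#ip summary-address eigrp 100 10.1.0.0 255.255.0.0
```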
Using Nonchassis Based Access Switches
Using more than one stand-alone switch, such as the Cisco 3560 or 3750, in an access closet requires special design consideration. You can either daisy-chain the switches or use the Cisco Stackwise technology. When you daisy-chain switches, the top and bottom members of the chain typically uplink to one distribution switch each. You must add a link (or loopback cable) between the top and bottom switch. Otherwise, a failure in the link between two access switches might cause return traffic to be blackholed. Alternatively, you can configure the link between the distribution switches as an L2 trunk.
Stackwise switches enable you to manage each group of access switches as one. Two stack member switches uplink to the distribution switches. Special cables connect the switches, and you should still connect the top and bottom members of the stack using a Stackwise cable. The link between distribution switches can then be an L3 link without worry of blackholing return traffic.
Network Management for High Availability
Network administrators use network management tools:
- To verify network performance
- To characterize, or baseline, network performance
- To understand amount and direction of traffic flow within the network
- To troubleshoot network problems
Syslog
Cisco devices produce system logging (or syslog) messages that can be output to the device console, VTY connection, system buffer, or remote syslog server. If sent to a syslog server, messages are sent on UDP port 514. You are probably familiar with the syslog message %SYS-5-CONFIG_I: Configured from console by console, for instance. A syslog message always starts with the percent sign and has the following format:
%FACILITY-SUBFACILITY-SEVERITY-MNEMONIC: message text
Each portion of a syslog message has a specific meaning:
- FACILITY-SUBFACILITY: This tells the protocol, module, or process that generated the message. Some examples are SYS for the operating system, OSPF, IF for an interface, and IP.
- SEVERITY: A number from 0 to 7 designating the importance of the action reported. The levels are:
  - Emergency: 0
  - Alert: 1
  - Critical: 2
  - Error: 3
  - Warning: 4
  - Notice: 5
  - Informational: 6
  - Debugging: 7
- MNEMONIC: A code that identifies the action reported.
- Message text: A plain-text description of the event that triggered the syslog message.
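Sending these messages to a remote server, as mentioned at the start of this section, might be configured along the following lines. The server address 10.1.1.2, the buffer size, and the chosen severity levels are assumptions for illustration.

```
! Assumed for illustration: server 10.1.1.2, buffer size, severity levels
sw1(config)#logging host 10.1.1.2
! Send severity 5 (Notice) and more severe messages (0 through 5) to the server
sw1(config)#logging trap notifications
! Keep severity 6 (Informational) and more severe messages in the local buffer
sw1(config)#logging buffered 16384 informational
```

Some older software releases accept the server address directly after the logging keyword instead of using the host keyword.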
SNMP
An SNMP manager collects information from SNMP agents residing on network devices, either through regular polling or by event-generated traps. The information is stored on the local device in a Management Information Base (MIB). Access to the MIB is controlled by SNMP community strings. Access can be read-only (RO) or read-write (RW).
There are three versions of SNMP. Versions 1 and 2 send the community strings in clear text. They cannot authenticate the source of a message or encrypt a message. Therefore they should be used only for read-only access. SNMPv3 adds three security levels:
- noAuthNoPriv: Neither authenticates nor encrypts
- authNoPriv: Authenticates the sender but does not encrypt the message
- authPriv: Both authenticates the sender and encrypts the message
The following configuration creates a standard access list that allows only traffic sourced from the host at 10.1.1.1. Two community strings are created, “ccnp” for read-only access and “c1sc0” for read-write access. Read-write access is permitted only from the host specified in access list 1. Next, the SNMP server address is given, along with the command to send trap messages to that server. Because SNMP version 3 is intended, the username “admin” is supplied as the final argument. Note that the host command as written omits the version 3 keyword; without it, IOS normally sends version 1 traps and treats the final argument as a community string.
sw1(config)#access-list 1 permit 10.1.1.1
sw1(config)#snmp-server community ccnp ro
sw1(config)#snmp-server community c1sc0 rw 1
sw1(config)#snmp-server host 10.1.1.2 traps admin
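For comparison, an SNMPv3 setup at the authPriv security level could be sketched as follows. The group name NETOPS and both passwords are placeholders invented for this example, and the SHA and AES options assume a software image that includes the cryptographic feature set.

```
! Assumed for illustration: group name NETOPS, both passwords, SHA and AES-128
sw1(config)#snmp-server group NETOPS v3 priv
sw1(config)#snmp-server user admin NETOPS v3 auth sha AuthPass123 priv aes 128 PrivPass123
! Send version 3 traps, authenticated and encrypted, as user admin
sw1(config)#snmp-server host 10.1.1.2 traps version 3 priv admin
sw1(config)#snmp-server enable traps
```

Here the version 3 keyword is given explicitly on the host command, so “admin” is interpreted as an SNMPv3 username.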
IP SLA
IP SLA is a feature that enables a Cisco router or switch to simulate specific types of traffic and send it to a receiver, called a responder. IP SLA probes can simulate various types of traffic, such as HTTP, FTP, DHCP, UDP jitter, UDP echo, TCP connect, ICMP echo, ICMP path echo, ICMP path jitter, and DNS, and can report statistics such as path jitter. It has highly granular application configuration options such as TCP/UDP port numbers, TOS byte, and IP prefix bits. This is useful for measuring application performance end-to-end across your network. It can also be used to track reachability and then decrement HSRP priority values or bring up secondary links. IP SLA can additionally serve as a measure of reliability and continuous availability. SNMP traps can be generated from events such as connection loss, timeout, round-trip time threshold, average jitter threshold, one-way packet loss, one-way jitter, and one-way latency.
To enable IP SLA, configure the source to send the required type of data probes. The receiver can be a computer, or it can be another Cisco device. The configuration of a Cisco responder is simple: use the global ip sla responder command. One benefit of using a Cisco device as the responder is that it can add time stamps to help measure latency and jitter. These time stamps take into account the device processing time so that the measurement reflects only network latency.
The configuration of the IP SLA source is more complex. You must create a monitor session and then list the traffic type, responder IP address, and any other desired variables such as DSCP value. Then you schedule the probes. Optionally, configure tracking using the IP SLA session. The following commands set up an IP SLA session that measures UDP jitter for a voice port. Probes are sent every 120 seconds, starting when the last command is entered and continuing until the session is manually stopped.
sw1(config)#ip sla 1
sw1(config-ip-sla)#udp-jitter 10.1.1.3 65422 codec g729a
sw1(config-ip-sla-jitter)#frequency 120
sw1(config-ip-sla-jitter)#exit
sw1(config)#ip sla schedule 1 life forever start-time now
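The optional tracking step mentioned earlier, in which IP SLA reachability lowers an HSRP priority, might be sketched as follows. The second probe (session 2, an ICMP echo to 10.1.1.3), the track object number, VLAN 10, HSRP group 10, and the decrement of 20 are all assumptions made for this example.

```
! Assumed for illustration: session 2, track 1, VLAN 10, group 10, decrement 20
sw1(config)#ip sla 2
sw1(config-ip-sla)#icmp-echo 10.1.1.3
sw1(config-ip-sla-echo)#frequency 10
sw1(config-ip-sla-echo)#exit
sw1(config)#ip sla schedule 2 life forever start-time now
! Track object 1 follows the reachability result of IP SLA session 2
sw1(config)#track 1 ip sla 2 reachability
sw1(config-track)#exit
! Lower the HSRP priority by 20 while the tracked target is unreachable
sw1(config)#interface Vlan10
sw1(config-if)#standby 10 track 1 decrement 20
```

For the standby router to take over when the priority drops, it needs preemption enabled. Some older software releases use the rtr keyword in place of ip sla in the track command.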