The em0 interface is supposed to be ‘UP’ at all times because it provides the communication channel between the Routing Engine (RE) and the Packet Forwarding Engines (PFEs); however, under certain conditions its link status can change to ‘DOWN’.
When em0 is marked as ‘DOWN’, it can cause the following issues:
- Chassis Cluster failover
- Chassis Cluster failover of data RGs followed by traffic loss
- Chassis Cluster split brain
- Severe service impact including restarting of FPCs (Chassis Cluster or standalone)
This article describes how to distinguish between the Tx lockup and Rx lockup conditions on the em0 interface, either of which can trigger the em0 DOWN state.
NOTE: This issue is primarily seen in Chassis Clusters but can be seen in standalone devices as well. Without the software fix, a manual system reboot may be required to recover.
The ‘Tx lockup’ issue is more common than the ‘Rx lockup’ issue. The ‘Rx lockup’ issue happens very rarely, but it causes more severe problems because both nodes stop working (suicide condition). For more details, see the ‘Rx lockup condition’ section.
Tx lockup condition
In the output of ‘show log messages’, you can see the following message:
tsec1: watchdog timeout outstanding desc <number>, ifd_snd <number>
... 22:37:56 SRX3K eventd: %SYSLOG-3: sendto: Network is down
... 22:37:56 SRX3K /kernel: %KERN-3: rdp keepalive expired, connection dropped - src 0x0f100001:1011 dest 0x0110011c:1032
... 22:37:56 SRX3K /kernel: %KERN-2: tsec1: watchdog timeout outstanding desc 27, ifd_snd 1959
... 16:56:48 SRX3K /kernel: tsec1: watchdog timeout outstanding desc 94, ifd_snd 1444    <-- Indication of Tx stuck
... 16:56:48 SRX3K l2ald[12088]: L2ALD_DUPLICATE_CONNECTIONS: Duplicate connection for peer DPC-0 (Ident: 63). Disconnecting...
... 16:56:48 SRX3K mib2d[12111]: SNMP_TRAP_LINK_DOWN: ifIndex 17, ifAdminStatus up(1), ifOperStatus down(2), ifName em0
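The log signature above can be searched for automatically. The following is a minimal sketch, assuming the syslog lines have already been collected as text; the function name find_tx_lockup and the two regular expressions are illustrative assumptions derived only from the excerpt above, not part of any Junos tooling.

```python
import re

# Patterns derived from the log excerpt in this article (assumed to be stable
# across releases; adjust them if your messages are formatted differently).
TX_WATCHDOG = re.compile(
    r"tsec1: watchdog timeout outstanding desc (\d+), ifd_snd (\d+)"
)
EM0_LINK_DOWN = re.compile(r"SNMP_TRAP_LINK_DOWN:.*\bifName em0\b")


def find_tx_lockup(lines):
    """Scan syslog lines and return (watchdog_hits, em0_down).

    watchdog_hits is a list of (outstanding_desc, ifd_snd) integer pairs,
    one per tsec1 watchdog timeout message; em0_down is True if a link-down
    trap for em0 was logged.
    """
    hits = []
    em0_down = False
    for line in lines:
        match = TX_WATCHDOG.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
        if EM0_LINK_DOWN.search(line):
            em0_down = True
    return hits, em0_down


# Sample lines copied from the excerpt above.
sample_log = [
    "... 22:37:56 SRX3K eventd: %SYSLOG-3: sendto: Network is down",
    "... 22:37:56 SRX3K /kernel: %KERN-2: tsec1: watchdog timeout outstanding desc 27, ifd_snd 1959",
    "... 16:56:48 SRX3K /kernel: tsec1: watchdog timeout outstanding desc 94, ifd_snd 1444",
    "... 16:56:48 SRX3K mib2d[12111]: SNMP_TRAP_LINK_DOWN: ifIndex 17, ifAdminStatus up(1), ifOperStatus down(2), ifName em0",
]
watchdog_hits, em0_down = find_tx_lockup(sample_log)
```

A non-empty watchdog_hits list together with em0_down being True matches the Tx lockup signature described in this section.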
In the internal switch port statistics (port 3), you can also see ‘RX FCS Errors’. Under normal conditions, this counter should always be zero (0).
root@SRX3K> show chassis ethernet-switch statistics
--------------------------------------------------------------------------
Displaying port statistics for switch 0
Error reading status for port 0 connected to device: Nothing
Statistics for port 3 connected to device RE-GigE:
  RX FCS Errors 12007
  RX Align Errors 0
  RX Fragments 0
  RX Symbol errors 12007
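The counter check can be scripted against saved command output. The sketch below assumes the output of ‘show chassis ethernet-switch statistics’ has been captured as a single string in the layout shown above; the function name rx_fcs_errors and the regular expressions are hypothetical helpers written for this illustration.

```python
import re

# One block per "Statistics for port N connected to device NAME:" header,
# running up to the next such header or the end of the text.
PORT_BLOCK = re.compile(
    r"Statistics for port (\d+) connected to device ([^:\n]+):"
    r"(.*?)(?=Statistics for port|\Z)",
    re.S,
)
FCS_COUNTER = re.compile(r"RX FCS Errors\s+(\d+)")


def rx_fcs_errors(output):
    """Return {port: (device, rx_fcs_errors)} for every port block found."""
    result = {}
    for port, device, body in PORT_BLOCK.findall(output):
        match = FCS_COUNTER.search(body)
        if match:
            result[int(port)] = (device.strip(), int(match.group(1)))
    return result


# Sample text copied from the output above.
sample_output = """Displaying port statistics for switch 0
Error reading status for port 0 connected to device: Nothing
Statistics for port 3 connected to device RE-GigE:
  RX FCS Errors 12007
  RX Align Errors 0
  RX Fragments 0
  RX Symbol errors 12007
"""
counters = rx_fcs_errors(sample_output)
# Ports whose counter is not zero deserve a closer look.
suspect_ports = [port for port, (_, errors) in counters.items() if errors != 0]
```

Because the expected value is zero, any port listed in suspect_ports (port 3 in this sample) is worth correlating with the tsec1 watchdog messages.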
Rx lockup condition
When Rx is stuck on node0’s RE (assume node0 is active for RG0 and RG1+), node0’s RE receives no messages to process, so the following occurs:
- node0 assumes that its own CPP is not alive, so it attempts an RG1 failover. The RG0 (Routing Engine) failover does not happen, because node0 can still transmit HA control messages and therefore considers itself alive.
- After the RG1 failover, the CPP on node1 tries to connect to the RE on node0.
- Because Rx on node0 is stuck, the CPP on node1 cannot connect to the RE on node0.
- Finally, the CPP commits suicide (shuts itself down).
In this condition, you will not see the messages or counters seen in the ‘Tx lockup’ condition, and you will not observe a link down on the em0 interface. Instead, after attempting to restart the FPC cards, both nodes log the message “committing suicide, Shutting down due to loss of communication with master RE” in nvram (this can be checked in the output of “request support information”).
> show log messages
23:32:44.241 srx_node0 /kernel: rdp retransmit error: No route to host (65) src 0x00000000:1067 dest 0x01100112:1009
23:32:44.744 srx_node0 chassisd[1234]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: CPP Power cycle
23:35:44.809 srx_node0 chassisd[1234]: fru_nmi_timer: Restart FPC 0 due to NMI timeout
23:35:44.863 srx_node0 chassisd[1234]: fru_nmi_timer: Restart FPC 2 due to NMI timeout

=== CPP0, node0.cpp0> show nvram
[... 23:37:57.941 LOG: Emergency] <160>CMLC: Master RE did not recover, forwarding stopped
[... 23:37:57.949 LOG: Emergency] <160>CMLC: committing suicide , Shutting down due to loss of communication with master RE
[... 23:37:58.540 LOG: Emergency] <160>PFEMAN: Shutting down , Master routing engine did not recover; forwarding stopped

=== CPP0, node1.cpp0> show nvram
[... 23:32:37.207 LOG: Emergency] <160>RDP: Remote side closed connection: rdp.(34603010:41984).(serverRouter:chassis)
[... 23:37:36.909 LOG: Emergency] <160>CMLC: Master RE did not recover, forwarding stopped
[... 23:37:36.928 LOG: Emergency] <160>CMLC: committing suicide , Shutting down due to loss of communication with master RE
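The distinction drawn in this article can be summarized as a small decision rule: the tsec1 watchdog message points to Tx lockup, while the nvram suicide message without that watchdog message is consistent with Rx lockup. The sketch below is an illustration only, assuming both logs are available as text; the function name classify_lockup and its return labels are hypothetical, and a suicide message alone does not prove Rx lockup, since other failures can also cut a CPP off from the RE.

```python
import re

# Signatures taken from the excerpts in this article.
TX_WATCHDOG = re.compile(r"tsec1: watchdog timeout")
# Whitespace is matched loosely because the nvram text prints "suicide ,".
CPP_SUICIDE = re.compile(
    r"committing suicide\s*,\s*Shutting down due to loss of "
    r"communication\s*with master RE"
)


def classify_lockup(messages_log, nvram_log):
    """Return 'tx', 'rx-suspected', or 'unknown' for the given log texts."""
    if TX_WATCHDOG.search(messages_log):
        return "tx"
    if CPP_SUICIDE.search(nvram_log):
        # No Tx watchdog message, but the CPP shut itself down.
        return "rx-suspected"
    return "unknown"


# Sample text copied from the Rx lockup excerpts above.
sample_messages = (
    "23:32:44.744 srx_node0 chassisd[1234]: CHASSISD_FRU_OFFLINE_NOTICE: "
    "Taking FPC 0 offline: CPP Power cycle\n"
    "23:35:44.809 srx_node0 chassisd[1234]: fru_nmi_timer: "
    "Restart FPC 0 due to NMI timeout\n"
)
sample_nvram = (
    "[... 23:37:57.949 LOG: Emergency] <160>CMLC: committing suicide , "
    "Shutting down due to loss of communication with master RE\n"
)
verdict = classify_lockup(sample_messages, sample_nvram)
```

Checking the Tx signature first mirrors the order of the sections above: the suicide message is only treated as an Rx lockup hint when the Tx indicators are absent.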
The Tx and Rx lockup issues are caused by eTSEC errata on the Routing Engine Processor (REP).
The following fix/enhancement is included in Junos 10.4S12, 10.4R13, 11.4R6-S2, 11.4R7, 12.1R5-S1, 12.1R6 and 12.1X44-D15.
Fix: The Tx and Rx lockup detection logic is enhanced. Once an error condition is detected, the system gracefully resets the Tx or Rx lockup status or resets the MAC of the em0 interface, and then reconfigures the MAC to bring the em0 interface back to a normal working condition.
Caution: Regardless of the tsec1 (em0) controller enhancements and fixes in the above Junos OS versions, running “tcpdump” (from the FreeBSD shell) or “monitor traffic interface” (from the Junos CLI) on em0, em1, or fxp0 is not recommended, because it could keep the em0 interface busy and interrupt the processing of important system messages between the Routing Engine and other modules (CPP, IOC, SPC, NPC).