How to distinguish Tx and Rx lockup status on the em0 interface

The em0 interface is expected to be ‘UP’ at all times because it provides the communication channel between the Routing Engine (RE) and the Packet Forwarding Engines (PFEs); however, under certain conditions its link status changes to ‘DOWN’.

When em0 is marked as ‘DOWN’, it can cause the following issues:

Chassis Cluster failover
Chassis Cluster failover of data RGs followed by traffic loss
Chassis Cluster split brain
Severe service impact including restarting of FPCs (Chassis Cluster or standalone)

This article describes how to distinguish the Tx lockup and Rx lockup conditions on the em0 interface that can trigger the em0 ‘DOWN’ state.

NOTE: This issue is primarily seen in Chassis Clusters but can be seen in standalone devices as well. Without the software fix, a manual system reboot may be required to recover.

The ‘Tx lockup’ issue is more common than the ‘Rx lockup’ issue. The ‘Rx lockup’ issue occurs very rarely, but it causes more severe problems because both nodes stop working (the ‘suicide’ condition). For details, refer to the ‘Rx lockup condition’ section.

Tx lockup condition

In the output of ‘show log messages’, you can see the following message:

tsec1: watchdog timeout outstanding desc <number>, ifd_snd <number>

... 22:37:56 SRX3K eventd: %SYSLOG-3: sendto: Network is down
... 22:37:56 SRX3K /kernel: %KERN-3: rdp keepalive expired, connection dropped - src 0x0f100001:1011 dest 0x0110011c:1032
... 22:37:56 SRX3K /kernel: %KERN-2: tsec1: watchdog timeout outstanding desc 27, ifd_snd 1959

... 16:56:48 SRX3K /kernel: tsec1: watchdog timeout outstanding desc 94, ifd_snd 1444 <-- Indication of Tx stuck
... 16:56:48 SRX3K l2ald[12088]: L2ALD_DUPLICATE_CONNECTIONS: Duplicate connection for peer DPC-0 (Ident: 63). Disconnecting...
... 16:56:48 SRX3K mib2d[12111]: SNMP_TRAP_LINK_DOWN: ifIndex 17, ifAdminStatus up(1), ifOperStatus down(2), ifName em0
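
As a rough illustration of how this signature could be picked out of a saved copy of the messages log, the following Python sketch searches each line for the watchdog message and extracts the two counters. The regular expression is derived only from the sample lines above, and the function name find_tx_lockups is a hypothetical helper, not part of Junos.

```python
import re

# Pattern derived from the sample log lines in this article; it is an
# assumption that other releases use the same wording.
TX_WATCHDOG = re.compile(
    r"(?P<controller>tsec\d+): watchdog timeout outstanding desc "
    r"(?P<desc>\d+), ifd_snd (?P<ifd_snd>\d+)"
)


def find_tx_lockups(lines):
    """Return one dict per log line that contains the Tx watchdog message."""
    hits = []
    for line in lines:
        match = TX_WATCHDOG.search(line)
        if match:
            hits.append(
                {
                    "controller": match.group("controller"),
                    "desc": int(match.group("desc")),
                    "ifd_snd": int(match.group("ifd_snd")),
                }
            )
    return hits


# Sample lines copied from the log excerpt above (timestamps elided as in the source).
sample = [
    "... 22:37:56 SRX3K eventd: %SYSLOG-3: sendto: Network is down",
    "... 22:37:56 SRX3K /kernel: %KERN-2: tsec1: watchdog timeout outstanding desc 27, ifd_snd 1959",
    "... 16:56:48 SRX3K /kernel: tsec1: watchdog timeout outstanding desc 94, ifd_snd 1444",
]

for hit in find_tx_lockups(sample):
    print(hit)
```

Any hit is only an indication of a possible Tx lockup; it should be confirmed against the em0 link-down trap and the switch port counters.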

The internal switch port statistics for port 3 (connected to the RE) also show a non-zero ‘RX FCS Errors’ count. Under normal conditions this counter should always be zero (0).

root@SRX3K> show chassis ethernet-switch statistics
--------------------------------------------------------------------------
Displaying port statistics for switch 0

Error reading status for port 0 connected to device: Nothing
Statistics for port 3 connected to device RE-GigE:
RX FCS Errors 12007 
RX Align Errors 0
RX Fragments 0
RX Symbol errors 12007
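
The counter check can be sketched in Python as follows, assuming the command output has been saved as text in the layout shown above. The helper names parse_port_counters and fcs_errors_present are hypothetical, and the parser only understands "name value" lines that follow a "Statistics for port N" header.

```python
import re


def parse_port_counters(output, port):
    """Return {counter name: value} for one switch port in the saved output."""
    counters = {}
    in_port = False
    for raw in output.splitlines():
        line = raw.strip()
        header = re.match(r"Statistics for port (\d+) connected to device", line)
        if header:
            # Only collect counters that follow the header of the wanted port.
            in_port = int(header.group(1)) == port
            continue
        if in_port:
            counter = re.match(r"(.+?)\s+(\d+)$", line)
            if counter:
                counters[counter.group(1)] = int(counter.group(2))
    return counters


def fcs_errors_present(counters):
    """True when 'RX FCS Errors' is non-zero, which is abnormal per this article."""
    return counters.get("RX FCS Errors", 0) > 0


# Sample text copied from the output above.
sample_output = """\
Displaying port statistics for switch 0

Error reading status for port 0 connected to device: Nothing
Statistics for port 3 connected to device RE-GigE:
RX FCS Errors 12007
RX Align Errors 0
RX Fragments 0
RX Symbol errors 12007
"""

port3 = parse_port_counters(sample_output, 3)
print(port3)
print("FCS errors present:", fcs_errors_present(port3))
```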

Rx lockup condition

When Rx is stuck on node0’s RE (assume node0 is active for RG0 and RG1+), the RE receives no messages to process, so the following sequence occurs:

node0 assumes that its own CPP is not alive, so it triggers an RG1 failover. The RG0 (Routing Engine) failover does not happen, because node0 can still transmit HA control messages and therefore considers itself alive.
After the RG1 failover, the CPP on node1 tries to connect to the RE on node0.
Because Rx on node0 is stuck, the CPP on node1 cannot connect to the RE on node0.
Finally, the CPP commits suicide (shuts itself down).

In this condition, you will not see the messages or counters seen in the ‘Tx lockup’ condition, and you will not observe a link down on the em0 interface. Instead, after attempting to restart the FPC cards, both nodes log the message “committing suicide, Shutting down due to loss of communication with master RE” in nvram (this can be checked in the “request support information” output).

> show log messages

   23:32:44.241 srx_node0 /kernel: rdp retransmit error: No route to host (65) src 0x00000000:1067 dest 0x01100112:1009
   23:32:44.744 srx_node0 chassisd[1234]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 0 offline: CPP Power cycle
   23:35:44.809 srx_node0 chassisd[1234]: fru_nmi_timer: Restart FPC 0 due to NMI timeout
   23:35:44.863 srx_node0 chassisd[1234]: fru_nmi_timer: Restart FPC 2 due to NMI timeout


=== CPP0, node0.cpp0> show nvram

   [... 23:37:57.941 LOG: Emergency] <160>CMLC: Master RE did not recover, forwarding stopped
   [... 23:37:57.949 LOG: Emergency] <160>CMLC: committing suicide , Shutting down due to loss of communication with master RE
   [... 23:37:58.540 LOG: Emergency] <160>PFEMAN: Shutting down , Master routing engine did not recover; forwarding stopped


=== CPP0, node1.cpp0> show nvram
   [... 23:32:37.207 LOG: Emergency] <160>RDP: Remote side closed connection: rdp.(34603010:41984).(serverRouter:chassis)
   [... 23:37:36.909 LOG: Emergency] <160>CMLC: Master RE did not recover, forwarding stopped
   [... 23:37:36.928 LOG: Emergency] <160>CMLC: committing suicide , Shutting down due to loss of communication with master RE
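
A minimal Python sketch of the nvram check, assuming the nvram sections of each node have been saved as text: it reports whether the ‘committing suicide’ message appears. The function name has_rx_lockup_signature is hypothetical, and the match relies only on the wording seen in the sample output above.

```python
# Phrase taken from the CMLC nvram messages shown in this article.
SUICIDE_PHRASE = "committing suicide"


def has_rx_lockup_signature(nvram_text):
    """True when any line of the nvram text contains the CMLC suicide message."""
    return any(
        "CMLC" in line and SUICIDE_PHRASE in line
        for line in nvram_text.splitlines()
    )


# Sample lines copied from the nvram excerpts above (timestamps elided as in the source).
node0_nvram = """\
[... 23:37:57.941 LOG: Emergency] <160>CMLC: Master RE did not recover, forwarding stopped
[... 23:37:57.949 LOG: Emergency] <160>CMLC: committing suicide , Shutting down due to loss of communication with master RE
"""
node1_nvram = """\
[... 23:32:37.207 LOG: Emergency] <160>RDP: Remote side closed connection: rdp.(34603010:41984).(serverRouter:chassis)
[... 23:37:36.909 LOG: Emergency] <160>CMLC: Master RE did not recover, forwarding stopped
"""

print("node0:", has_rx_lockup_signature(node0_nvram))
print("node1:", has_rx_lockup_signature(node1_nvram))
```

Because an Rx lockup shows this message on both nodes, the check is most meaningful when it is run against the nvram output of node0 and node1 together and no tsec1 watchdog message is present in the messages log.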

The Tx and Rx lockup issues are caused by eTSEC errata on the Routing Engine Processor (REP).

The following fix/enhancement is included in Junos 10.4S12, 10.4R13, 11.4R6-S2, 11.4R7, 12.1R5-S1, 12.1R6 and 12.1X44-D15.

Fix: The Tx and Rx lockup detection logic is enhanced. Once an error condition is detected, the software gracefully resets the Tx or Rx lockup status, or resets the MAC of the em0 interface and then reconfigures it, to bring the em0 interface back to a normal working condition.

Caution: Regardless of the tsec1 (em0) controller enhancements and fixes in the above Junos OS versions, running either “tcpdump” (from the FreeBSD shell) or “monitor traffic interface” (from the Junos CLI) on em0, em1, or fxp0 is not recommended, because it can keep the em0 interface busy and interrupt the processing of important system messages between the Routing Engine and other modules (CPP, IOC, SPC, NPC).
