How to distinguish Tx and Rx lockup status on em0 interface

The em0 interface is supposed to be ‘UP’ at all times because it provides the communication channel between the Routing Engine (RE) and the Packet Forwarding Engines (PFEs); however, under certain conditions the link status changes to ‘DOWN’.

When em0 is marked as ‘DOWN’, it can cause the following issues:

  • Chassis Cluster failover
  • Chassis Cluster failover of data RGs followed by traffic loss
  • Chassis Cluster split brain
  • Severe service impact including restarting of FPCs (Chassis Cluster or standalone)

This article explains how to distinguish the Tx lockup condition from the Rx lockup condition on the em0 interface, either of which can trigger the em0 DOWN state.

NOTE: This issue is primarily seen in Chassis Clusters but can be seen in standalone devices as well. Without the software fix, a manual system reboot may be required to recover.

The ‘Tx lockup’ issue is more common than the ‘Rx lockup’ issue. The ‘Rx lockup’ issue happens very rarely, but it causes more severe problems because both nodes stop working (suicide condition). For more details, refer to the ‘Rx lockup condition’ section.

Tx lockup condition

The ‘Tx lockup’ condition is reported in the output of ‘show log messages’.

In the internal switch port statistics (port 3), you can also see ‘RX FCS Errors’. Under normal conditions this counter should always be zero (0).

Rx lockup condition

When Rx is stuck on node0’s RE (assume node0 is active for RG0 and RG1+), node0 receives no messages to process, so the following occurs:

  1. node0 assumes that its own CPP is not alive and attempts an RG1 failover. The RG0 (Routing Engine) failover does not happen, however, because node0 can still transmit HA control messages and therefore considers itself alive.
  2. After the RG1 failover, the CPP on node1 tries to connect to the RE on node0.
  3. Since Rx on node0 is stuck, the CPP on node1 cannot connect to the RE on node0.
  4. Finally, the CPP commits suicide.

In this condition, you will not see the messages or counters seen in the ‘Tx lockup’ condition, and you will not observe a link down event on the em0 interface. Instead, after attempting to restart the FPC cards, both nodes generate the message “committing suicide, Shutting down due to loss of communication with master RE” in nvram (this can be checked in the output of “request support information”).
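The distinction between the two conditions can be sketched as a small triage helper. This is only an illustration under stated assumptions: the function name `classify_lockup`, its parameters, and the returned labels are hypothetical, and you are assumed to have already collected the em0 link state, the ‘RX FCS Errors’ counter for port 3, and the nvram text from “request support information” by hand. Only the suicide message string comes from this article.

```python
# Message that both nodes write to nvram in the Rx lockup condition (from this article).
SUICIDE_MARKER = (
    "committing suicide, Shutting down due to loss of communication with master RE"
)


def classify_lockup(em0_link_down: bool, rx_fcs_errors: int, nvram_text: str) -> str:
    """Return a rough label for the suspected lockup type.

    All inputs are values gathered manually by the operator; this helper
    does not talk to the device (hypothetical sketch, not a Junos tool).
    """
    # Tx lockup: em0 is reported DOWN and/or port 3 shows RX FCS errors,
    # a counter that should be zero under normal conditions.
    if em0_link_down or rx_fcs_errors > 0:
        return "tx-lockup-suspected"

    # Rx lockup: no link-down event and no FCS errors, but the suicide
    # message is present in nvram.
    if SUICIDE_MARKER.lower() in nvram_text.lower():
        return "rx-lockup-suspected"

    return "no-lockup-evidence"
```

For example, `classify_lockup(False, 0, nvram_text)` returns "rx-lockup-suspected" only when the suicide message is present in `nvram_text`. The labels are hints for further investigation, not a diagnosis.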

The Tx and Rx lockup issues are caused by eTSEC errata on the Routing Engine Processor (REP).

The following fix/enhancement is included in Junos 10.4S12, 10.4R13, 11.4R6-S2, 11.4R7, 12.1R5-S1, 12.1R6 and 12.1X44-D15.
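To check a running release against the list above, a comparison along the following lines may help. It is a minimal sketch, not an official tool: `has_fix` and `FIXED_MINIMUMS` are hypothetical names, and the version layout in the regular expression is an assumption about how the release strings are written. Release trains that are not named in this article return `None`, because the article does not say whether they contain the fix.

```python
import re
from typing import Optional

# Minimum fixed release per (train, release type), taken from the list above.
# Values are (release number, service-release number); 0 means no "-S" suffix.
FIXED_MINIMUMS = {
    ("10.4", "S"): (12, 0),     # 10.4S12
    ("10.4", "R"): (13, 0),     # 10.4R13
    ("11.4", "R"): (6, 2),      # 11.4R6-S2 (also covers 11.4R7)
    ("12.1", "R"): (5, 1),      # 12.1R5-S1 (also covers 12.1R6)
    ("12.1X44", "D"): (15, 0),  # 12.1X44-D15
}

# Assumed layout: <major.minor>[X<n>-]<R|S|D><n>[-S<n>][.<build>]
_VERSION_RE = re.compile(
    r"^(\d+\.\d+)(?:(X\d+)-)?([RSD])(\d+)(?:-S(\d+))?(?:\.\d+)?$"
)


def has_fix(version: str) -> Optional[bool]:
    """Return True/False for listed trains, None if the train is unknown."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        return None  # string does not fit the assumed layout
    base, special, rel_type, number, service = match.groups()
    train = base + (special or "")
    minimum = FIXED_MINIMUMS.get((train, rel_type))
    if minimum is None:
        return None  # train not mentioned in the article
    # Tuple comparison: release number first, then service-release number.
    return (int(number), int(service or 0)) >= minimum
```

For example, `has_fix("11.4R6")` is `False` while `has_fix("11.4R6-S2")` and `has_fix("11.4R7")` are `True`. A `None` result means the article gives no information for that release.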

Fix: Enhance the Tx and Rx lockup detection logic. Once an error condition is detected, gracefully reset the Tx or Rx lockup status or reset the MAC of the em0 interface, then reconfigure the MAC to bring the em0 interface back to a normal working condition.

Caution: Regardless of the tsec1 (em0) controller enhancements and fixes in the above Junos OS versions, running either “tcpdump” (from the FreeBSD shell) or “monitor traffic interface” (from the Junos CLI) on em0, em1 or fxp0 is not recommended. Doing so can put the em0 interface into a busy state and interrupt the processing of important system messages between the Routing Engine and other modules (CPP, IOC, SPC, NPC).

About the author

Prasanna
