Loss of communication with Backup RE

This article explains why the following event message might appear in the syslog:

Loss of communication with Backup RE

When communication between the master Routing Engine and the backup Routing Engine is lost, a message similar to the following appears in the syslog:

craftd[3253]: Minor alarm set, Loss of communication with Backup RE
alarmd[3252]: Alarm set: RE color=YELLOW, class=CHASSIS, reason=Loss of communication with Backup RE

Note that other messages are often associated with this message, including one or more of the following:

chassisd[3251]: No response from the other routing engine for the last 2 seconds.
chassisd[3251]: CHASSISD_SNMP_TRAP10: SNMP trap generated: Fru Offline (jnxFruContentsIndex 9, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName Routing Engine 1, jnxFruType 6, jnxFruSlot 2, jnxFruOfflineReason 2, jnxFruLastPowerOff 0, jnxFruLastPowerOn 0)
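
When reviewing a large syslog file, it can help to tag the lines that belong to this event. The following is a minimal sketch, not a supported tool: the patterns are derived only from the sample messages shown above, and the names PATTERNS, classify_line, and summarize are hypothetical helpers introduced for illustration.

```python
import re

# Patterns built from the sample messages in this article; real logs may
# carry a timestamp and hostname prefix, so search() is used, not match().
PATTERNS = {
    "alarm_set": re.compile(
        r"alarmd\[\d+\]: Alarm set: .*Loss of communication with Backup RE"
    ),
    "craft_alarm": re.compile(
        r"craftd\[\d+\]: .*Loss of communication with Backup RE"
    ),
    "no_response": re.compile(
        r"chassisd\[\d+\]: No response from the other routing engine"
    ),
    "fru_offline": re.compile(
        r"chassisd\[\d+\]: CHASSISD_SNMP_TRAP\d+: SNMP trap generated: Fru Offline"
    ),
}


def classify_line(line):
    """Return the label of the first pattern found in the line, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count how many lines fall under each label."""
    counts = {}
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Passing the lines of a saved 'show log messages' capture to summarize gives a quick count of the alarm and its associated chassisd messages.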

The event message Loss of communication with Backup RE appears in the syslog when the master Routing Engine detects that it is no longer receiving keepalive messages from the backup Routing Engine.

The loss of communication can usually be traced to one of the following causes:

Backup Routing Engine experienced an issue that prevented it from sending out keepalive messages, but the backup Routing Engine was otherwise operating as expected.
Backup Routing Engine rebooted due to either a software or a hardware issue (such as the loss of a hard drive).
Backup Routing Engine failed.
Master Routing Engine and backup Routing Engine experienced a communication problem whose cause could not be identified, but each Routing Engine was operating properly in all other respects.

Perform these steps to determine the cause and resolve the problem (if any):

1. Collect the show command output on the master and backup Routing Engines.

a. On the master Routing Engine, capture the output of the commands below.
Capture the output to a file (in case you have to open a technical support case). To do this, configure your SSH client or terminal emulator to log the session.

show log messages
show log chassisd
show version invoke-on all-routing-engines
show system core-dumps
show chassis alarms
show chassis hardware detail
show chassis routing-engine
show chassis environment
show system storage
show configuration chassis
show configuration routing-options

b. On the backup Routing Engine, capture the output of the commands below.
Capture the output to a file (in case you have to open a technical support case). To do this, configure your SSH client or terminal emulator to log the session.

Note: The log data from the backup Routing Engine might be available only by accessing the board directly via the serial console port. This can occur if the communication link between the Routing Engines is experiencing a problem, preventing access to the backup Routing Engine from the master. If this is the case, you must have physical access to the serial console port on the backup Routing Engine to retrieve the output of the commands.

show log messages
show log chassisd
show system core-dumps
show chassis routing-engine
show system storage
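
If you prefer to script the collection rather than rely on terminal logging, the command lists above can be looped over. The sketch below is an illustration under assumptions: run is a hypothetical callable that you supply (for example, a wrapper around your own SSH or console session) that takes a command string and returns its output as text. No particular SSH library or device API is implied.

```python
from typing import Callable, Iterable

# Command lists copied from Steps 1a and 1b of this article.
MASTER_COMMANDS = [
    "show log messages",
    "show log chassisd",
    "show version invoke-on all-routing-engines",
    "show system core-dumps",
    "show chassis alarms",
    "show chassis hardware detail",
    "show chassis routing-engine",
    "show chassis environment",
    "show system storage",
    "show configuration chassis",
    "show configuration routing-options",
]

BACKUP_COMMANDS = [
    "show log messages",
    "show log chassisd",
    "show system core-dumps",
    "show chassis routing-engine",
    "show system storage",
]


def collect(run: Callable[[str], str], commands: Iterable[str]) -> str:
    """Run each command through the caller-supplied `run` function and
    return one text report with a header line before each output."""
    sections = []
    for command in commands:
        output = run(command)
        sections.append(f"===== {command} =====\n{output.rstrip()}\n")
    return "\n".join(sections)
```

The returned text can be written to a file and attached to a support case. Keeping the transport behind run means the same loop works whether the backup Routing Engine is reached from the master or through the serial console.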

2. Analyze the show command output:

a. In the ‘show log messages’ output on both the master and backup Routing Engines, review the events that occurred at or just before the appearance of the “Loss of communication with Backup RE” message. Do these events help you identify the cause?

Yes – Address the cause that those events indicate.
No – Continue to Step 2b.

b. In the ‘show log messages’ output, did the event message appear shortly after a ‘commit’ command was executed?

Yes – A commit is resource intensive, so the cause might be a problem with the hard drive or the size of the configuration file. If the issue is reproducible, open a case with your technical support representative to investigate further.
No – Continue to Step 2c.
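
The check in Step 2b can be approximated in a script. This is a rough sketch under several assumptions that are not stated in this article: each log line starts with a standard syslog timestamp such as "Feb 10 12:00:00" (which carries no year, so the year is passed in), month names are in English, commit activity is logged with a tag beginning with UI_COMMIT, lines are in chronological order, and "shortly after" is taken to mean 60 seconds. Adjust all of these to match your own logs.

```python
import re
from datetime import datetime, timedelta

# Assumed syslog prefix, e.g. "Feb 10 12:00:00 ".
STAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2})\s")
ALARM_TEXT = "Loss of communication with Backup RE"
COMMIT_TAG = "UI_COMMIT"  # assumed tag for commit-related messages


def parse_time(line, year):
    """Return the line's timestamp as a datetime, or None if it has none."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def alarm_followed_commit(lines, year, window_seconds=60):
    """True if the alarm text appears within the window after a commit line."""
    window = timedelta(seconds=window_seconds)
    last_commit = None
    for line in lines:
        when = parse_time(line, year)
        if when is None:
            continue  # skip lines without a recognizable timestamp
        if COMMIT_TAG in line:
            last_commit = when
        elif ALARM_TEXT in line and last_commit is not None:
            if timedelta(0) <= when - last_commit <= window:
                return True
    return False
```

A True result only indicates proximity in time. It does not establish that the commit caused the alarm.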

c. In the ‘show version’ output, are the Routing Engines running different versions of the Junos OS?

Yes – The event message is expected in this situation and can be ignored. To stop it from appearing, upgrade one of the Routing Engines so that both run the same version of Junos OS.
No – Continue to Step 2d.
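
The version comparison in Step 2c can also be done on the saved output. The sketch below assumes a layout that is not shown in this article: each Routing Engine's section begins with a header line such as "re0:" and contains either a "Junos: <version>" line or a "JUNOS Base OS boot [<version>]" line. Treat both patterns as assumptions and adjust them to the output you actually captured.

```python
import re

RE_HEADER = re.compile(r"^(re\d+):\s*$")
# Assumed version line formats; only the first match per section is kept.
VERSION = re.compile(r"^(?:Junos:\s*(\S+)|JUNOS Base OS boot \[([^\]]+)\])")


def versions_by_re(output):
    """Map each Routing Engine header (re0, re1, ...) to its version string."""
    versions = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = RE_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        found = VERSION.match(line)
        if found and current is not None and current not in versions:
            versions[current] = found.group(1) or found.group(2)
    return versions


def versions_differ(versions):
    """True if the Routing Engines report more than one distinct version."""
    return len(set(versions.values())) > 1
```

If versions_by_re returns fewer than two entries, the output did not match the assumed layout and should be compared by eye instead.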

d. In the ‘show system core-dumps’ output, was a core-dump file generated by either Routing Engine with a timestamp corresponding to the moment that the event message appeared?

Yes – Open a case with your technical support representative to investigate the issue further. If you are not running Junos Space Service Now, please attach the information collected above to the case.
No – Continue to Step 2e.

e. In the ‘show log messages’ output, did the backup Routing Engine reboot because the compact flash or the hard drive was no longer in the bootlist?

Yes – Add the affected drive back into the bootlist as follows:

Log in to the shell as root.

Issue this command:

sysctl -w machdep.bootdevs=pcmcia-flash,compact-flash,disk,lan

Reboot the Routing Engine, but only during a maintenance window, as rebooting will impact transit traffic.

No – Continue to Step 2f.

f. In the ‘show log messages’ output from both Routing Engines, did both Routing Engines report the loss of keepalive messages from the other one?

Yes – There might be a problem with the link between the Routing Engines. To resolve this, perform the procedure below, but only during a maintenance window, as it will impact transit traffic:

Reseat both Routing Engines in their respective chassis slots.
Reseat or replace the control board component that handles the communications link. The component for each device is as follows:

M10i – HCM (High-Availability Chassis Manager)
M40e, M160 – MCS (Miscellaneous Control Subsystem)
M120, M320 – CB (Control Board)
MX240, MX480, MX960, SRX5600, SRX5800 – SCB (Switch Control Board)
T-series – CB (Control Board)

No – Continue to Step 3.
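
The question in Step 2f can be checked against the two saved logs. This is a minimal sketch: the search text comes from the chassisd sample message earlier in this article, and the assumption that the backup Routing Engine logs the same wording is untested here. The function names are hypothetical.

```python
# Text taken from the sample chassisd message in this article.
KEEPALIVE_LOSS = "No response from the other routing engine"


def reports_keepalive_loss(lines):
    """True if any line in one Routing Engine's log reports keepalive loss."""
    return any(KEEPALIVE_LOSS in line for line in lines)


def both_sides_report_loss(master_lines, backup_lines):
    """True only if the master and the backup logs both report the loss."""
    return reports_keepalive_loss(master_lines) and reports_keepalive_loss(backup_lines)
```

A True result corresponds to the "Yes" branch of Step 2f and points toward the link between the Routing Engines rather than a fault on a single Routing Engine.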

3. If these efforts do not resolve the problem, contact your technical support representative to investigate the issue further.
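
The order of the checks in Steps 2b through 2f, with Step 3 as the fallback, can be summarized as a small decision function. This is only a sketch of the flow described in this article; the Findings fields and the next_action name are hypothetical, and Step 2a (reviewing the surrounding log events) remains a manual judgment that is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Answers to the questions in Steps 2b-2f (True means 'Yes')."""
    after_commit: bool = False
    versions_differ: bool = False
    core_dump_at_event: bool = False
    drive_missing_from_bootlist: bool = False
    both_res_lost_keepalives: bool = False


def next_action(f):
    """Return the action for the first 'Yes' answer, in the article's order."""
    if f.after_commit:
        return "2b: if reproducible, open a support case"
    if f.versions_differ:
        return "2c: message is expected; align Junos OS versions"
    if f.core_dump_at_event:
        return "2d: open a support case and attach the collected output"
    if f.drive_missing_from_bootlist:
        return "2e: restore the bootlist, reboot in a maintenance window"
    if f.both_res_lost_keepalives:
        return "2f: reseat REs and control board in a maintenance window"
    return "3: contact your technical support representative"
```

For example, next_action(Findings(versions_differ=True)) returns the Step 2c action, and next_action(Findings()) falls through to Step 3.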
