What is the reaction of the Routing Engine (RE) system when the hard drive fails?
This article describes how the Routing Engine (RE) system reacts when the hard drive fails.
When a hard drive fails, the following usually occurs:
- A problem is detected (reading or writing) at the driver level.
- The hard drive is removed from the bootlist, unless it is the only device in the bootlist (for example, an M7i without compact flash), so that the router is never left unable to reboot.
- An emergency /var structure is created from MFS (Memory File System), a virtual hard disk in RAM, so that the system can continue to function.
- The on-disk-failure actions specified in the configuration are carried out.
Configuration Options for ‘on-disk-failure’ knob:
For a single RE, the following command causes the RE to reboot if the disk fails:
set chassis routing-engine on-disk-failure disk-failure-action reboot *
Note: You can also specify ‘halt’ as the disk-failure-action, which halts the RE instead of rebooting it.
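For readers who work in the hierarchical configuration view rather than with set commands, the single-RE statement above corresponds to a fragment along the following lines. This is a sketch of how the statement is expected to appear under the [edit chassis] hierarchy; indentation and surrounding statements will differ on a real router.

```
chassis {
    routing-engine {
        on-disk-failure {
            disk-failure-action reboot;
        }
    }
}
```

Replacing reboot with halt in the fragment gives the halt behavior described in the note. After committing, show configuration chassis | display set can be used to check that the statement is present in set form.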
For redundant REs, configuring the below results in an RE switchover if the Master RE experiences the HDD failure, and in a reboot of the Backup RE if the Backup RE experiences it; this is the current best practice.
There is also the option to trigger a failover when the Master RE detects an HDD failure, but it is considered less robust than the below:
set chassis redundancy failover on-disk-failure *
Note: In this case, graceful-switchover or failover on-loss-of-keepalives must also be configured.
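As an illustration of the note, the failover statement can be entered together with one of its companion statements. The sketch below pairs it with failover on-loss-of-keepalives; this pairing is an assumption chosen for the example, and exact syntax may vary between Junos releases.

```
set chassis redundancy failover on-disk-failure
set chassis redundancy failover on-loss-of-keepalives
```

If graceful switchover is preferred as the companion statement, it would be set chassis redundancy graceful-switchover in place of the second line.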
Note: If the on-disk-failure action is not configured, the system might hang in a partially functional state and remain inaccessible. In that case, the only way to recover it is to remove and reinsert the RE.
For more information about on-disk-failure action, refer to the following link: