SRX branch series platforms use flash media for internal storage. UFS file systems on flash media are prone to corruption if the system is not shut down gracefully. This document describes the successive improvements that make the system more resilient, and how to enable them.
Quick Summary of Best Practices:
- Enable dual root partitioning
- Enable partition alignment tweaks
- Enable autorecovery feature
- Avoid ungraceful shutdowns as far as possible
Resiliency Improvements
1. Dual Root Partitioning (available 10.0R1 onwards): Prior to 10.0R1, the system would become completely non-bootable after a file system corruption (see <Figure 1>). Dual root partitioning ensures that the system can boot even after a corruption has occurred. This improves the situation considerably, but damage to the var and config areas can still cause the system to become amnesiac, or to malfunction in other ways.
<Figure 1 – SRX Branch Single root and dual root partition scheme>
2. Partition Alignment Tweaks (available 10.2R4, 10.3R2, 10.4R2, 11.1R1 onwards): Analysis of several customer-reported cases of configuration loss after power loss showed that information related to secondary partitions requires physical isolation on the disk to prevent corruption. PR 561700 adds the required physical isolation. Feedback from the affected customers indicates that this fix is VERY EFFECTIVE in preventing configuration loss due to file system corruption, and it is HIGHLY RECOMMENDED to enable this fix.
NOTE: This fix is not applied simply by upgrading to a Junos version that contains it. The disk must be repartitioned so that the physical isolation can take place. Hence, to enable this fix on the system, the first installation must be done using a method that repartitions the media:
- Loader based installation methods: USB or TFTP
- Installation from CLI using the “partition” option
e.g., request system software add junos-srxsme-<Junos version>-domestic.tgz no-copy no-validate partition reboot
Note: The partition option causes the media to be formatted and repartitioned before the software is installed. Only the current configuration and license keys are preserved, so any other important data should be backed up before starting the process. The format and install process is scheduled to run on the next reboot; it is therefore recommended that the partition option be used together with the reboot option.
SRX devices that ship with JUNOS Release 10.0 onwards are formatted with dual-root partitions from the factory.
3. Integrity Check and Autorecovery of Configuration, Licenses and Disk Information (available 11.2R2, 11.4R1 onwards): The autorecovery feature goes a step further. It verifies the integrity of the secondary partition tables (bsdlabels), the Junos configuration and the licenses, and recovers them automatically after a corruption. Upgrading to a Junos version with this feature is HIGHLY RECOMMENDED, and will provide the greatest protection against system malfunction after an ungraceful shutdown.
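The availability notes for the three improvements above can be turned into a small lookup. The following is only a minimal sketch in Python, not a Juniper tool: the function and table names are hypothetical, it handles only R-style release strings such as 10.4R2 or 11.4R1.6, and it assumes that "onwards" means the last listed release and everything after it, while a branch that is not listed and falls before that last release (for example 11.3 for autorecovery) is treated as not having the feature.

```python
import re

# Minimum release per branch, copied from the availability notes above.
# ASSUMPTION: the last entry in each list means "this release and all later ones".
FEATURE_MINIMUMS = {
    "dual-root partitioning": [(10, 0, 1)],
    "partition alignment": [(10, 2, 4), (10, 3, 2), (10, 4, 2), (11, 1, 1)],
    "autorecovery": [(11, 2, 2), (11, 4, 1)],
}


def parse_release(text):
    """Parse an R-style release such as '11.2R2' or '10.4R2.6' into (major, minor, revision)."""
    match = re.match(r"^(\d+)\.(\d+)R(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised release string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_feature(release, feature):
    """Return True if the release is expected to contain the named feature."""
    major, minor, revision = parse_release(release)
    minimums = FEATURE_MINIMUMS[feature]
    # At or beyond the final listed release: the feature is present.
    if (major, minor, revision) >= minimums[-1]:
        return True
    # Otherwise the release must be in a listed branch, at or above that branch's minimum.
    for branch_major, branch_minor, branch_revision in minimums[:-1]:
        if (major, minor) == (branch_major, branch_minor):
            return revision >= branch_revision
    # ASSUMPTION: unlisted earlier branches do not carry the feature.
    return False


def available_features(release):
    """List the resiliency features expected in a release, in the order introduced."""
    return [name for name in FEATURE_MINIMUMS if has_feature(release, name)]


if __name__ == "__main__":
    for release in ("10.4R1", "10.4R2", "11.2R1", "11.4R1"):
        print(release, available_features(release))
```

Note that the result only says whether the running software contains a feature. As described in the NOTE under item 2, the partition alignment fix additionally requires that the media was repartitioned during installation.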
4. Graceful and Ungraceful Shutdown
A Graceful Shutdown is said to occur when the Operating System itself powers the box down. In the process it makes sure that all internal buffers are flushed and all storage media have consistent data.
Only the following methods assure a graceful shutdown:
- request system <reboot|power-off|halt>
- Press and immediately release the power button
In all these cases the system takes some time (about half a minute) to reboot or power off. Main power must be available for this entire duration.
An Ungraceful Shutdown is said to occur if the system stops functioning without allowing the Operating System to flush necessary buffers. All methods of restarting or powering off the system other than those mentioned above are ungraceful. For example,
- Pulling out the power plug
- Pressing and holding the power button till the LEDs turn off
- Panics or ‘db prompt’ mode
- Power cuts or outage or any other low power anomalies
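When recording how a device was restarted, the rule above reduces to a simple allow-list: only the listed CLI commands and a quick press of the power button count as graceful, and everything else is ungraceful. A minimal Python sketch of that rule follows. The function name and the label used for the power button are hypothetical and chosen only for illustration.

```python
# Methods that the text above lists as graceful.
# ASSUMPTION: "power button short press" is an illustrative label, not a Junos term.
GRACEFUL_METHODS = {
    "request system reboot",
    "request system power-off",
    "request system halt",
    "power button short press",
}


def classify_shutdown(method):
    """Return 'graceful' only for an allow-listed method; anything else is 'ungraceful'."""
    normalized = " ".join(method.lower().split())
    return "graceful" if normalized in GRACEFUL_METHODS else "ungraceful"


if __name__ == "__main__":
    for method in ("request system halt", "pulled power plug", "power outage"):
        print(method, "->", classify_shutdown(method))
```

Defaulting to "ungraceful" for anything not explicitly listed mirrors the statement that all other methods of restarting or powering off the system are ungraceful.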
There have been some reports of configuration loss after a graceful shutdown. In our analysis, either the shutdown was conclusively found to be ungraceful, or the cause could not be determined because the necessary data was not available or the system had encountered physical media problems. In any case of file system corruption, we recommend the following actions:
- Most importantly, note the exact method used to restart or shut down the system. If anything other than a CLI command or a quick press of the power button was used, the method is ungraceful.
- If the problem was caused by an ungraceful shutdown, refer to the enhancements above and upgrade to 11.2R2, 11.4R1, or later.
- If the method was indeed graceful, and configuration or licenses have still disappeared, do not attempt to recover the system by committing a new configuration, reinstalling software, or similar. File loss on graceful restart is very rare, and has not been reproduced internally. The few cases where relevant data was available indicated flash media hardware problems. Preserve the state of the system and, if possible, provide JTAC with access to it. Capture as much of the bootup console log as is available, preferably starting from before the fatal restart.