The PFE: SDRAM ECC Error message is reported in the system message file whenever an SDRAM ECC error is detected by a system component (for example, an FPC). This article provides a method to troubleshoot this issue.
The PFE: SDRAM ECC error message is logged whenever an SDRAM ECC error is detected by a PFE. This article describes the symptoms, possible causes, and confirmation steps to diagnose this issue.
The following excerpt is an example of the message that is displayed in the system message file or an FPC/FEB syslog:
fpc0 GFPC: MPC107 detected 1 new correctable SDRAM ECC error (total 22)
cfeb CFEB: MPC107 detected 1 new correctable SDRAM ECC error (total 19)
cfeb CFEB: MPC107 detected 1 new correctable SDRAM ECC error (total 20)
fpc5 GFPC: MPC107 detected 1 new correctable SDRAM ECC error (total 47)
fpc5 GFPC: MPC107 detected 1 new correctable SDRAM ECC error (total 48)
fpc5 GFPC: MPC107 detected 1 new correctable SDRAM ECC error (total 49)
fpc10 Last Detected DDR SDRAM Error: at addr 0x7c92fde0, data 0x0044000100002000
fpc10 ADPC: detected 1 DDR SDRAM Single-bit ECC errors in the last 30 seconds
fpc10 ADPC: addr=7c92fde0 data=0044000100002000 ecc=cbcb attr=30112001
fpc6 Last Detected DDR SDRAM Error: %PFE-0: at addr 0x46f2a160, data 0x0004000500400000
fpc6 STFPC: %PFE-4: detected 1 DDR SDRAM Single-bit ECC errors in the last 30 seconds
fpc0 STFPC: %PFE-4: detected 1 DDR SDRAM Single-bit ECC errors in the last 30 seconds
fpc0 STFPC: %PFE-4: addr=00000000 data=0000000000000000 ecc=00 attr=00000000
fpc0 STFPC: %PFE-4: ECC scrubbing at 0x0 failed
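When many of these messages have accumulated, it can help to extract the reporting component and the counters programmatically. The following Python sketch is one way to do that. It assumes the log lines follow the two message shapes shown in the excerpt above; the function name `parse_ecc_line` and the dictionary keys are illustrative choices, not part of any vendor tool.

```python
import re

# Patterns modeled on the excerpt above; real logs may carry a timestamp and
# hostname prefix, so search() is used instead of anchoring at line start.
MPC107_RE = re.compile(
    r"(?P<source>fpc\d+|cfeb)\s+\w+: MPC107 detected (?P<new>\d+) "
    r"new correctable SDRAM ECC error \(total (?P<total>\d+)\)"
)
DDR_RE = re.compile(
    r"(?P<source>fpc\d+|cfeb)\s+\w+: (?:%PFE-\d+: )?detected (?P<new>\d+) "
    r"DDR SDRAM Single-bit ECC errors in the last (?P<window>\d+) seconds"
)


def parse_ecc_line(line):
    """Return a dict describing an ECC error line, or None if it does not match."""
    m = MPC107_RE.search(line)
    if m:
        # This message shape carries a running total.
        return {"source": m["source"], "new": int(m["new"]), "total": int(m["total"])}
    m = DDR_RE.search(line)
    if m:
        # This message shape reports errors per time window, with no running total.
        return {"source": m["source"], "new": int(m["new"]), "total": None}
    return None


if __name__ == "__main__":
    sample = "fpc5 GFPC: MPC107 detected 1 new correctable SDRAM ECC error (total 49)"
    print(parse_ecc_line(sample))
```

Lines that only carry supporting detail (for example, the `addr=... data=...` lines) return `None`, so each reported error is counted once.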
The PFE components include error checking and correction (ECC) to protect their SDRAM. ECC detects single-bit and double-bit errors in the SDRAM and corrects single-bit errors. Note the following:
- If ECC detects a single-bit error, it logs the error, automatically corrects it, and operation continues.
- ECC errors have two main causes: hardware failure (bad memory, failed board components) or a transient anomaly (transient noise, external radiation, and so on).
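To illustrate what "correct single-bit errors, detect double-bit errors" means in practice, here is a toy Python sketch of an extended Hamming code protecting a 4-bit value. It is only an illustration of the general technique: the code word size, the bit layout, and the function names `encode` and `decode` are assumptions made for this example, and the ECC logic in the actual PFE memory controllers is not described in this article.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming code word (list of bits)."""
    w = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4; overall parity at 0.
    w[3], w[5], w[6], w[7] = [(nibble >> i) & 1 for i in range(4)]
    w[1] = w[3] ^ w[5] ^ w[7]
    w[2] = w[3] ^ w[6] ^ w[7]
    w[4] = w[5] ^ w[6] ^ w[7]
    w[0] = sum(w[1:]) % 2  # overall parity makes the total number of 1 bits even
    return w


def decode(word):
    """Return (value, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos  # XOR of the positions holding a 1 bit
    overall = sum(w) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped. The syndrome names its
        # position (0 means the overall parity bit itself), so flip it back.
        w[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity but nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    value = w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)
    return value, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # simulate a single-bit memory error
    print(decode(word))
    word[6] ^= 1                 # a second flipped bit in the same word
    print(decode(word))
```

A single flipped bit is repaired transparently, which corresponds to the "correctable" messages above; two flipped bits in the same word can only be reported, not repaired.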
The following log data will help to confirm the cause of this system message:
request support information | no-more
show log messages | no-more
show log chassisd | no-more
Check the output of show log messages and verify that it includes the time of the initial ECC error message. If the current messages file does not contain this message because its entries begin after the event, check the archived messages files. For example, the previous messages file can be accessed with this command:
show log messages.0.gz | no-more
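If the log files have been copied off the device for offline analysis, the current file and the compressed archive can be read in one pass. The following Python sketch assumes local copies named `messages.0.gz` and `messages`; the helper names `iter_log_lines` and `first_ecc_line` are hypothetical, and the match on the words "SDRAM" and "ECC" is a simplification based on the sample messages.

```python
import gzip
from pathlib import Path


def iter_log_lines(paths):
    """Yield text lines from plain or gzip-compressed log files, in the order given."""
    for path in map(Path, paths):
        # Archived files end in .gz; the current file is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")


def first_ecc_line(paths):
    """Return the first line that mentions an SDRAM ECC error, or None."""
    for line in iter_log_lines(paths):
        if "SDRAM" in line and "ECC" in line:
            return line
    return None


if __name__ == "__main__":
    # List the older archive first so the earliest occurrence is found first.
    candidates = [p for p in ("messages.0.gz", "messages") if Path(p).exists()]
    print(first_ecc_line(candidates))
```

Passing the files from oldest to newest means the returned line is the earliest ECC error still present in the available logs.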
The output of show log messages shows whether the message has occurred only a few times or is being reported a large number of times. If the single-bit error has occurred only once or a few times, it is most likely a one-time event.
- If the single-bit error has been reported only a few times, it is most likely due to a transient anomaly. Monitor the logs for a week to see if the message is repeated.
- If the messages are frequently repeated or the ECC error count is increasing, reseat the PFE that is the source of the messages.
- If the messages continue to be generated frequently after the reseat, open a case with your technical support representative.
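The decision steps above can be sketched as a small Python script that counts ECC error messages per component and suggests the next step. This article does not define "a few times" or "frequently" numerically, so the `frequent_threshold` value below is an arbitrary assumption to be tuned for your environment; the function name `triage` and the returned strings are likewise illustrative.

```python
import re
from collections import Counter

# Matches the two error-report shapes from the sample messages; detail lines
# ("Last Detected ...", "addr=...") are deliberately not counted.
ECC_RE = re.compile(r"\b(fpc\d+|cfeb)\b.*SDRAM.*ECC error")

MONITOR = "monitor the logs for a week"
RESEAT = "reseat the component; open a support case if errors continue"


def triage(lines, frequent_threshold=10):
    """Map each reporting component to a suggested next step.

    frequent_threshold is a hypothetical cutoff, not a vendor-defined value.
    """
    counts = Counter()
    for line in lines:
        m = ECC_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return {
        source: (RESEAT if count >= frequent_threshold else MONITOR)
        for source, count in counts.items()
    }


if __name__ == "__main__":
    sample = [
        "fpc5 GFPC: MPC107 detected 1 new correctable SDRAM ECC error (total 47)",
        "fpc5 GFPC: MPC107 detected 1 new correctable SDRAM ECC error (total 48)",
        "fpc10 ADPC: detected 1 DDR SDRAM Single-bit ECC errors in the last 30 seconds",
    ]
    for source, action in triage(sample, frequent_threshold=2).items():
        print(source, "->", action)
```

A raw message count is only a rough proxy. Where the message carries a running total, as in the MPC107 examples, a total that keeps increasing over time is the stronger signal that a reseat is warranted.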