PFE:.* LCHIP.*new errors in LSIF*

The PFE:.* LCHIP.*new errors in LSIF* message indicates that the FPC is reporting LCHIP interface errors.

The LCHIP.*new errors in LSIF message is logged each time an LCHIP ASIC detects an error on the L-to-S interface between the LCHIP and the Physical Interface Card (PIC) on the egress Flexible PIC Concentrator (FPC). This does not mean that the LCHIP itself has a problem, nor does it mean that the error originated there.

When a new errors in LSIF event occurs, a message similar to the following is reported:

fpc2 LCHIP(2): 1 new errors in LSIF
fpc3 LCHIP(0): %PFE-3: 8 new errors in LSIF
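
When scanning a large log capture, it can help to pull the FPC, LCHIP number, and error count out of each LSIF line. The following is a minimal sketch, assuming the messages follow the two sample shapes shown above; the regular expression and the function name parse_lsif are illustrative, not part of Junos.

```python
import re

# Matches lines such as "fpc2 LCHIP(2): 1 new errors in LSIF", with or without
# the optional "%PFE-3:" tag seen in the sample messages. An optional chassis
# prefix such as "lcc1-" is also accepted (assumption based on other samples).
LSIF_RE = re.compile(
    r"(?P<fpc>(?:\S+-)?fpc\d+)\s+LCHIP\((?P<lchip>\d+)\):\s+"
    r"(?:%PFE-\d+:\s+)?(?P<count>\d+) new errors in LSIF"
)

def parse_lsif(line):
    """Return (fpc, lchip, count) for an LSIF message, or None otherwise."""
    m = LSIF_RE.search(line)
    if m is None:
        return None
    return m.group("fpc"), int(m.group("lchip")), int(m.group("count"))

samples = [
    "fpc2 LCHIP(2): 1 new errors in LSIF",
    "fpc3 LCHIP(0): %PFE-3: 8 new errors in LSIF",
    "fpc2 LCHIP(2): %PFE-3: 2 new crcerror errors in NLIF",  # not an LSIF line
]
for s in samples:
    print(parse_lsif(s))
```

The NLIF line is deliberately not matched, so LSIF counts can be totalled per FPC without mixing in the associated messages.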

Other messages can also accompany the LCHIP message. It is common to see one or more of the following:

fpc2 LCHIP(2): %PFE-3: 2 new crcerror errors in NLIF 
fpc1 LCHIP(0): 1 new crcdrop errors in NLIF
fpc2 CMALARM: %PFE-3: Error (code: 18, type:Minor) encountered, cmalarm_passive_alarm_signal
alarmd[1224]: %DAEMON-4: Alarm set: FPC color=YELLOW, class=CHASSIS, reason=FPC 2 Minor Errors - Error code: 18
craftd[1225]: %DAEMON-4: Minor alarm set, FPC 2 Minor Errors - Error code: 18
fpc3 NH(nh_ifl_reroute): No NHs on ifl(68) list
fpc5 SLout(0): 1 (total 334) packets forwarded with errors at ip
fpc5 SLout(0): 1 (total 322) packets with errors at ip/op interaface
fpc5 SLout(0): 1 (total 334) packet with errors sent to desrd at ip
fpc5 SLCHIP(0): 1 new errors (packet error) in HDRF, lout_hdrf_poll_stats
fpc5 CMALARM: Error (code: 13, type:Major) encountered, cmalarm_passive_alarm_signal
fpc5 .pm3393.5.2. RXXG: %PFE-3: Packet exceeds the maximum frame size 4496
fpc5 .pm3393.5.2. RXOAM: %PFE-3: Remote Fault received
fpc5 HCHIP(5/2): %PFE-3: HSTX overflow error for stream 0
fpc5 .pm3393.5.2. RXXG: %PFE-3: A line interface error is detected
fpc5 .pm3393.5.2. TXXG: %PFE-3: FIFO has errors
fpc2 NFAB(2/1): PKTR FTTL error counter incremented
fpc2 CMALARM: Error (code: 540, type:Minor) encountered, cmalarm_passive_alarm_signal
fpc2 NFAB(2/1): PKTR ICELL signature error counter incremented
fpc2 CMALARM: Error (code: 538, type:Minor) encountered, cmalarm_passive_alarm_signal
fpc0 NWAN(0/1): %PFE-3: PKTR ICELL signature error counter incremented
lcc1-fpc0 NFAB(0/0): RODR offset overflow count incremented (1)
lcc1-fpc0 CMALARM: Error (code: 542, type:Minor) encountered, cmalarm_passive_alarm_signal
fpc6 LCHIP(2): Error in Lout L3 uCode, mem bank 0, corrupted instruction addr 0x513942d0, actual 0x4c00007f, read from lchip 0x0c00007f
fpc1 DCHIP(1/2): H/S link Tx48 overflow

This notification can occur for different reasons. It could be due to an interface flap or to a problem with the hardware state of an FPC, PIC, or Switch Interface Board (SIB) component. The responsible component is also not necessarily the one named in the log message, because the message is generated only on traffic egress, not on ingress.

Perform these steps to determine the cause and resolve the problem (if any). Continue through each step until the problem is resolved.

1. Collect the show command output.

Capture the output to a file (in case you have to open a technical support case). To do this, configure your SSH client or terminal emulator to log the session.

show log messages
show chassis hardware detail
show pfe statistics traffic 
start shell pfe network [fpc#]
show syslog messages
show nvram
show lchip ifd 
show ifl brief

Repeat the following log commands after a delay of a minute or so:

show lchip [x] error 
show lchip [x] lout stat 
show lchip [x] lout sw lsif 
show lchip [x] lout sw desrd 
show lchip [x] lout sw hdrf 
show lchip [x] lout sw nlif 
show lchip [x] lout hw lsif 
show lchip [x] lout hw nlif 
show lchip [x] lout hw hdrf 
show lchip [x] lout hw nlif 
show lchip [x] lout registers nlif bdispmon
show lchip [x] lout registers nlif dbufpart
show lchip [x] lout registers nlif nlif

(Repeat the above log outputs after a minute or so.)

show nchip [x] all 
show mq [x] wan stat 
show mq [x] wan stream active stat

Note: Be sure to replace the [fpc#] entry above with the reference to the FPC component that is reporting the log message. Similarly, replace the [x] with the LCHIP referenced in the log message. The lchip outputs on the FPC should be repeated after about a minute. The reason for this is to detect if any of the counters being reported are increasing in value over time.
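
Comparing the two captures by eye is error-prone when many counters are listed. The sketch below shows one way to flag counters that grew between the first and second capture. It assumes each capture has already been reduced to a mapping of counter name to integer value; the counter names and values used here are hypothetical, since the exact layout of the lchip output is not shown in this article.

```python
def increasing_counters(first, second):
    """Return {name: delta} for counters that grew between two snapshots.

    `first` and `second` are hypothetical dicts mapping a counter name to its
    integer value, built by hand or by a parser from the captured output.
    """
    deltas = {}
    for name, later in second.items():
        # A counter missing from the first snapshot is treated as 0.
        earlier = first.get(name, 0)
        if later > earlier:
            deltas[name] = later - earlier
    return deltas

# Illustrative values only; these are not real router output.
snap_t0 = {"lsif_errors": 334, "nlif_crcdrop": 12, "hdrf_errors": 0}
snap_t1 = {"lsif_errors": 341, "nlif_crcdrop": 12, "hdrf_errors": 0}
print(increasing_counters(snap_t0, snap_t1))
```

With these example values only lsif_errors is reported, with a delta of 7. Counters that stay flat between captures point to a past event rather than an ongoing one.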

2. Analyze the show command output. Continue with the following steps.

3. Check the show log messages output to see if there was a link flap, a commit, or a system component (for example, an FPC or SIB) coming online.

Yes – The LCHIP message is expected and is not reporting any performance issue in the device. The messages should go away shortly after this event and can be ignored.
No – Continue to Step 4.

4. Check the show log messages output to see if ISSU was being done.

Yes – Some PICs, such as the 4xCOC12 PIC, must be taken offline before an ISSU is performed. Otherwise, this message might be generated as a result.
No – Continue to Step 5.

5. Check to see if some link has suddenly been disabled and traffic cannot pass through it anymore.

Yes – These messages are generated while the link is disabled. The errors will stop incrementing when the link is enabled again.
No – Continue to Step 6.

6. Check to see if there are any associated log messages with pm3393. This is the MAC chip implemented on the XENPAK 10GE PIC.

a. If “RXXG: packet exceed error” is observed, then a received packet (line side) is larger than the MTU value programmed into the pm3393 MAC. Check for an MTU mismatch between the link ends. If there is a mismatch, configure the proper MTU value.

b. If “RXXG: A line interface error is detected” is observed, then improper packets are sent to the interface, such as those resulting from lowering the IFG to an invalid size.

c. If “TXXG: FIFO has errors” is seen, be aware that the IEEE 802.3ae standard includes a concept called pause frames. Pause frames are used to slow down traffic coming from the line-side device. For example, if the ingress FIFO is filling up, the PM3393 MAC line-side transmitter can send pause frames to the device connected on the line side. That device can then stop sending Ethernet packets to allow the ingress FIFO to drain to the link layer device, which prevents FIFO overflows. If pause frames are disabled, a FIFO overflow can occur, generating the FIFO errors and other errors. Therefore, verify that flow control is enabled by using the show interfaces extensive <interface-name> command.

d. If RXOAM errors are observed, these are transient and experienced when an interface transitions UP or DOWN. These are expected and can be ignored.

e. In the IEEE 802.3ae standard, the required bit error rate (BER) of 10GBASE-L is 10^-12. If the error rate is within this limit, the error messages can be ignored. If it exceeds this value, then inspect the cabling. If this does not resolve the issue, skip to Step 9.
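
To put the BER limit in perspective, the arithmetic can be worked through as follows. This is a rough sketch only: it assumes a nominal rate of 10 Gb/s with line coding overhead ignored, and it treats each logged error as a single bit error, which is a simplification. The function names are illustrative.

```python
LINE_RATE_BPS = 10e9   # nominal 10 Gb/s (assumption; ignores line coding overhead)
TARGET_BER = 1e-12     # BER limit of 10^-12 cited in the step above

def observed_ber(bit_errors, seconds, rate_bps=LINE_RATE_BPS):
    """Approximate BER: bit errors divided by bits carried in the interval."""
    return bit_errors / (rate_bps * seconds)

def within_target(bit_errors, seconds, rate_bps=LINE_RATE_BPS):
    """True when the observed BER does not exceed the target BER."""
    return observed_ber(bit_errors, seconds, rate_bps) <= TARGET_BER

# Average seconds between bit errors at exactly the target BER (about 100 s).
print(1 / (LINE_RATE_BPS * TARGET_BER))

# 30 errors in an hour is within the limit; 8 errors in a minute is not.
print(within_target(bit_errors=30, seconds=3600))
print(within_target(bit_errors=8, seconds=60))
```

In other words, on a 10 Gb/s link a handful of errors per hour is consistent with the limit, while several errors per minute is not.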

7. Check for the PKTR ICELL error. If this error appears, you cannot identify the source of the errors from the messages logged because the CRC is system-wide and is only checked on the egress FPC. In some cases, additional crcdrop errors in NLIF messages are seen.

a. To identify the faulty hardware, first test the SIBs to ensure that they are not contributing to the NLIF errors. (Check the FPCs using the instructions below by offlining and onlining them one by one). The SIBs can be cycled manually without packet loss (assuming a spare SIB is installed). A T320 might experience some packet loss if 10G interfaces are installed and running at high rates since the backup SIB only has a single 10G link to the FPCs compared with two for the active SIBs. Take an active SIB offline with the following command, and the backup will take its place:

user@router> request chassis sib slot <slotno> offline

b. Repeat this for each active SIB, checking for errors after each change. Use the following command to check the status of the SIBs between each change:

user@router> show chassis sibs
Slot State Uptime
0 Spare
1 Online 1 day, 23 hours, 9 minutes, 48 seconds
2 Online 1 day, 23 hours, 9 minutes, 48 seconds

c. If the log messages do not stop, then take the FPCs offline to identify the faulty item (see Step e for the command). If all but one FPC is reporting errors and that FPC has an active interface, then skip to Step 9.

d. If more than one FPC does not show errors, then take this subset of FPCs offline (see Step e for the command) one at a time until the errors stop, then skip to Step 9.

e. To take the FPCs offline and bring them back online, use the following command:

user@router> request chassis fpc slot 0 ?
Possible completions:
offline Take FPC offline
online Bring FPC online

f. Use the following command to check for NLIF errors:

user@router> show pfe statistics error fpc 0

8. Check if the LCHIP messages are seen with crcerror errors in NLIF messages but not with any of the other messages listed above. If they are, then check to see if there have been any changes to the FPC or the interface (for example, the PIC). A change in either software configuration or hardware would explain the appearance of the LCHIP messages. If reverting the configuration or reseating the hardware resolves the errors, then continue monitoring the router. If the CRC errors persist, then replace the hardware.
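
The checks in Steps 6 through 8 each key off specific strings in the log, so a first pass over a capture can be sketched as below. The marker strings come from the sample messages in this article; the mapping to steps and the function name suggest_steps are illustrative assumptions, and Step 8 is suggested only when neither of the other two markers is present, which simplifies the "no other messages" condition.

```python
def suggest_steps(log_lines):
    """Return the troubleshooting steps hinted at by marker strings in the log."""
    text = "\n".join(log_lines)
    steps = []
    if "pm3393" in text:              # MAC chip messages (Step 6)
        steps.append("Step 6")
    if "PKTR ICELL" in text:          # fabric cell signature errors (Step 7)
        steps.append("Step 7")
    # Simplification: crcerror alone, without the markers above (Step 8).
    if "crcerror errors in NLIF" in text and not steps:
        steps.append("Step 8")
    return steps

capture = [
    "fpc5 .pm3393.5.2. RXOAM: %PFE-3: Remote Fault received",
    "fpc2 NFAB(2/1): PKTR ICELL signature error counter incremented",
    "fpc2 LCHIP(2): %PFE-3: 2 new crcerror errors in NLIF",
]
print(suggest_steps(capture))
print(suggest_steps(capture[2:]))
```

This only narrows down where to start reading; the earlier steps (link flap, commit, ISSU, disabled link) still need to be ruled out first.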
