How to debug output drops on Services PIC carrying CGNAT flows in AT&T SDG Consumer Neo Zones

This article explains what to look for when output drops are seen incrementing on the sp- interface on the MS-DPC cards carrying CGNAT flows in the SDG Neo Zones.

Output drops were seen incrementing on the sp- interface serving CGNAT flows in the AT&T SDG Neo Zones.

The sp- output interface counters (sp-3/1/0 in this case) represent the traffic that the PFE ASIC sends towards the services PIC for services processing. Output drops therefore mean that the PFE ASIC had packets that it could not send towards the services PIC.
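To make "incrementing" concrete, the drop counter can be sampled twice and the difference examined. The following is a minimal Python sketch, not a Juniper tool. The `CounterSample` fields and the `drop_stats` helper are hypothetical names, and the sketch assumes the output packet counter excludes dropped packets, so that offered traffic is sent plus dropped.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One manual reading of the sp- interface output counters (hypothetical layout)."""
    timestamp_s: float   # time the reading was taken, in seconds
    output_packets: int  # packets the PFE sent towards the services PIC
    output_drops: int    # packets the PFE could not send


def drop_stats(earlier: CounterSample, later: CounterSample):
    """Return (new_drops, drop_ratio, drops_per_second) between two readings."""
    interval = later.timestamp_s - earlier.timestamp_s
    if interval <= 0:
        raise ValueError("samples must be in chronological order")

    new_drops = later.output_drops - earlier.output_drops
    new_packets = later.output_packets - earlier.output_packets
    if new_drops < 0 or new_packets < 0:
        # A negative delta usually means the counters were cleared in between.
        raise ValueError("counters went backwards; were they cleared?")

    # Assumption: offered traffic = packets sent + packets dropped.
    offered = new_packets + new_drops
    ratio = new_drops / offered if offered else 0.0
    return new_drops, ratio, new_drops / interval


if __name__ == "__main__":
    before = CounterSample(timestamp_s=0.0, output_packets=1000, output_drops=10)
    after = CounterSample(timestamp_s=60.0, output_packets=1900, output_drops=110)
    print(drop_stats(before, after))
```

A small ratio over a long interval points to short bursts rather than sustained overload, which matches the infrequent pattern described in this article.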

The services PIC outputs for sp-3/1 (provided by the customer) were checked, and no specific errors were seen. These outputs contain many counters showing how many packets, bytes, flows, and so on the services PIC processed. They are mostly useful for building a baseline of the services PIC utilization and behavior, which can then be compared with the values seen while the problem is happening.
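The baseline comparison can be done by hand, but a short script makes large counter sets easier to scan. The sketch below is illustrative only: the counter names and the 50% threshold are assumptions, and the values would have to be copied from the services PIC outputs manually.

```python
def deviations(baseline: dict, current: dict, threshold: float = 0.5) -> dict:
    """Return counters whose value differs from the baseline by more than
    `threshold`, expressed as a fraction of the baseline value."""
    flagged = {}
    for name, base in baseline.items():
        if name not in current:
            continue  # counter missing from the problem-time snapshot
        cur = current[name]
        if base == 0:
            # No meaningful ratio; flag any non-zero value as a change.
            if cur != 0:
                flagged[name] = float("inf")
            continue
        change = (cur - base) / base
        if abs(change) > threshold:
            flagged[name] = change
    return flagged


if __name__ == "__main__":
    # Hypothetical counter names and values, for illustration only.
    baseline = {"packets_per_second": 1000, "flows_per_second": 200, "active_flows": 5000}
    problem = {"packets_per_second": 1100, "flows_per_second": 500, "active_flows": 5200}
    print(deviations(baseline, problem))
```

Only counters that moved by more than the threshold are returned, so a burst in one dimension (for example flows per second) stands out from normal variation in the others.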

The logs on the router were checked for the sp- interface with the following results:

So the services PIC is sending flow control messages back to the PFE (basically saying “don’t send me more packets because I am too busy”). These messages occur very infrequently, most likely at the times when the output drop counter increments on the sp-3/1/0 interface.
The question then becomes: why is the services PIC so busy at those times? Since the only service set configured on this services PIC is NAT, most likely a burst in traffic (in terms of either packets per second or flows per second) pushes the services PIC CPU or memory utilization up for a brief period of time (high enough to cause the flow control messages):

Check if the maximum number of flows exceeded the configured limits. Use the following commands to find out if it did.

This is a legitimate drop as the CGNAT flows per second value exceeded the maximum configured limits.
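The flows-per-second check can also be approximated offline from periodic readings of a cumulative flow counter. This is a rough sketch under stated assumptions: the `(timestamp, cumulative_flows_created)` sample format and the `intervals_over_limit` helper are hypothetical, and the limit must be taken from the actual configuration.

```python
def intervals_over_limit(samples, max_flows_per_second):
    """Find intervals where the average flow setup rate exceeded the limit.

    samples: chronological list of (timestamp_s, cumulative_flows_created).
    Returns a list of (start_s, end_s, flows_per_second) tuples.
    """
    over = []
    for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order readings
        rate = (f1 - f0) / (t1 - t0)
        if rate > max_flows_per_second:
            over.append((t0, t1, rate))
    return over


if __name__ == "__main__":
    # Hypothetical readings taken every 10 seconds.
    readings = [(0, 0), (10, 1000), (20, 6000), (30, 6500)]
    print(intervals_over_limit(readings, max_flows_per_second=200))
```

Because each rate is an average over the sampling interval, a burst shorter than the interval can exceed the limit without being flagged, so shorter intervals give a more reliable answer.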

Check if the CPU usage on the sp- interfaces is high. If the CPU usage on the sp- interfaces is high, refer to PR960619.

Incrementing dirty flow counters are a major cause of high CPU usage.

This can be verified with the following command:

The following are commands that could be used on the services PIC console:

The services PIC could receive unidentified traffic (amplification attacks) from the Internet. This traffic could arrive through any of the ports placed inside the GI-PUB routing instance. One such example is shown below, where the S-PIC receives continuous TCP SYNs and ICMP requests with duplicate sequence numbers. The command below could be used to view them.

Collect data from the last run of the sdg-oid event script to understand when the output drops were reported.
Check Vitalsuite data to understand if the CPU usage on the services PICs was high during the output drops.
Verify if any of the causes mentioned in the Cause section are true. Use the commands as mentioned.
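The steps above amount to lining up the times of the output drops against other events, such as flow control messages in the router logs or high-CPU samples. A minimal Python sketch of that correlation follows. The timestamp lists are assumed to have been extracted by hand (no particular log or export format is implied), and the function name and 60-second window are hypothetical.

```python
from bisect import bisect_left


def drops_near_events(drop_times, event_times, window_s=60.0):
    """Return the drop timestamps that have at least one event
    (for example a flow control message or a high-CPU sample)
    within +/- window_s seconds."""
    events = sorted(event_times)
    matched = []
    for t in drop_times:
        # Index of the first event at or after the start of the window.
        i = bisect_left(events, t - window_s)
        if i < len(events) and events[i] <= t + window_s:
            matched.append(t)
    return matched


if __name__ == "__main__":
    # Hypothetical timestamps in seconds since an arbitrary reference.
    drops = [100, 500, 900]
    flow_control = [110, 880, 2000]
    print(drops_near_events(drops, flow_control, window_s=30))
```

If most drop timestamps match, the drops are likely explained by the same busy periods. Unmatched drops suggest looking at a different cause from the Cause section.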

If none of the above-mentioned causes apply, use the following CLI commands to collect logs and reach out to sdg-eng:

CLI:

S-PIC:

 

About the author

Prasanna
