This article explains what to look for when output drops are seen incrementing on the sp- interface on the MS-DPC cards carrying CGNAT flows in the SDG Neo Zones.
Output drops were seen incrementing on the sp- interface serving CGNAT flows in the AT&T SDG Neo Zones.
The sp- output interface counters (sp-3/1/0 in this case) represent the traffic that is sent by the PFE ASIC towards the services PIC for services processing. Therefore output drops mean that the PFE ASIC had some packets that it could not send towards the services PIC.
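To quantify the problem, the output drop counter can be sampled twice and the delta compared against traffic levels. A minimal sketch (Python, run offline against captured `show interfaces sp-3/1/0 extensive` text; the exact output layout varies by Junos release, so the regex here is illustrative only):

```python
import re

def output_drop_delta(before_text, after_text):
    """Return the increase in the output 'Drops' counter between two
    snapshots of 'show interfaces <sp-x/y/z> extensive' output.
    Assumes a 'Drops: N' field under the 'Output errors:' heading."""
    def drops(text):
        m = re.search(r"Output errors:.*?Drops:\s*(\d+)", text, re.S)
        return int(m.group(1)) if m else 0
    return drops(after_text) - drops(before_text)
```

Sampling at a fixed interval (for example, every 10 seconds) shows whether drops are continuous or occur in short bursts, which matters when correlating with flow-control events.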
The services PIC outputs for sp-3/1 (provided by the customer) were checked, and no specific errors were seen there. The outputs contain many counters (packets, bytes, flows, and so on) processed by the services PIC; these are mostly useful for establishing a baseline of services PIC utilization and behavior, which can then be compared against the values observed while the problem is occurring.
The logs on the router were checked for the sp- interface with the following results:
Oct 17 08:01:48  test-RE0 fpc3 Transient flow-control asserted by MAC on sp-3/1 for 1 seconds
Oct 17 08:01:49  test-RE0 fpc3 Transient flow-control asserted by MAC on sp-3/1 for 2 seconds
Oct 17 08:01:51  test-RE0 fpc3 MAC flow-control cleared on sp-3/1
So the services PIC is sending flow-control messages back to the PFE (essentially saying "don't send me more packets, I am too busy"). These messages appear infrequently, and most likely coincide with increments of the output drop counter on the sp-3/1/0 interface.
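To correlate the flow-control events with the drop counter, the syslog messages can be parsed offline. A small sketch (Python, against captured `show log messages` text; message format as in the log excerpt above):

```python
import re

# Matches the 'Transient flow-control asserted' messages logged by the FPC.
# '\s?' tolerates the occasional missing space before the interface name.
FLOW_CTRL = re.compile(
    r"fpc\d+ Transient flow-control asserted by MAC on\s?(sp-\S+) for (\d+) seconds"
)

def flow_control_events(log_text):
    """Return (interface, seconds_asserted) for each flow-control message."""
    return [(m.group(1), int(m.group(2))) for m in FLOW_CTRL.finditer(log_text)]
```

The resulting timestamps and durations can then be lined up against the output drop deltas on the same sp- interface.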
The question then becomes: why is the services PIC so busy at those times? Since the only service set configured on this services PIC is NAT, most likely a burst in traffic (in packets per second or flows per second) briefly drives up the services PIC CPU or memory utilization, high enough to trigger the flow-control messages:
{MASTER}
test-RE0> show services service-sets cpu-usage interface sp-3/1/0
                                                   CPU
Interface  Service Set (or system category)        Utilization
sp-3/1/0   CGN4_SP-3-1-0                           8.38 %

{MASTER}
test-RE0> show services service-sets memory-usage interface sp-3/1/0
Interface  Service Set      Bytes Used
sp-3/1/0   CGN4_SP-3-1-0    513002902

{MASTER}
test-RE0> show services stateful-firewall flows count service-set CGN4_SP-3-1-0
Interface  Service set      Flow count
sp-3/1/0   CGN4_SP-3-1-0    1879794
Check whether the number of CGNAT flows created per second exceeded the configured limit, using the following commands. If it did, the drops are legitimate: the flows-per-second rate hit the configured maximum, as in the example below.
test-RE0> show configuration interfaces | match cgn-max-flows | display inheritance | display set
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000
set groups NPU-LIMIT-CPS interfaces <sp-*> services-options cgn-max-flows-per-second 40000

test-RE0> show services stateful-firewall flow-analysis | grep "Peak Created Flows per Second"
Peak Created Flows per Second    : 40029
Peak Created Flows per Second    : 40027
Peak Created Flows per Second    : 40033
Peak Created Flows per Second    : 40024
Peak Created Flows per Second    : 40039
Peak Created Flows per Second    : 40022
Peak Created Flows per Second    : 40036
Peak Created Flows per Second    : 23755
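Checking the peaks against the limit by hand is error-prone across many sp- interfaces; a short offline helper (Python, parsing captured flow-analysis output) does the comparison:

```python
import re

def peak_flow_violations(flow_analysis_text, configured_limit):
    """Return the 'Peak Created Flows per Second' values that reach or
    exceed the configured cgn-max-flows-per-second limit."""
    peaks = [int(v) for v in re.findall(
        r"Peak Created Flows per Second\s*:\s*(\d+)", flow_analysis_text)]
    return [p for p in peaks if p >= configured_limit]
```

A non-empty result (for example, the 40029/40039 peaks above against a 40000 limit) confirms the drops are policer-driven rather than a fault.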
Check whether the CPU usage on the sp- interfaces is high. If it is, refer to PR960619.
{MASTER}
test-RE0> show services service-sets summary
           Service sets                                            CPU
Interface  configured  Bytes used            Policy bytes used     utilization
sp-2/0/0   2           907103310 (31.49 %)   65776 ( 0.08 %)       55.06 %  <<< too high due to PR960619; makes it ~20% easier for a traffic spike to cause an output drop
sp-2/1/0   2           916766554 (31.82 %)   65776 ( 0.08 %)       55.55 %  <<< too high (PR960619)
sp-3/0/0   2           916906150 (31.83 %)   65776 ( 0.08 %)       55.26 %  <<< too high (PR960619)
sp-3/1/0   2           915253198 (31.77 %)   65776 ( 0.08 %)       55.88 %  <<< too high (PR960619)
sp-7/0/0   2           919270250 (31.91 %)   65776 ( 0.08 %)       55.77 %  <<< too high (PR960619)
sp-7/1/0   2           919079294 (31.91 %)   65776 ( 0.08 %)       55.61 %  <<< too high (PR960619)
sp-9/0/0   2           918648814 (31.89 %)   65776 ( 0.08 %)       36.54 %  <<< normal
sp-9/1/0   4           25355995 ( 0.87 %)    30232 ( 0.03 %)       2.37 %   <<< normal
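When reviewing this table across many routers, a small filter helps flag the affected interfaces. An illustrative Python sketch over captured `show services service-sets summary` text (column layout assumed from the sample above; verify against your Junos release):

```python
import re

# One row per sp- interface: name, set count, bytes (+pct), policy bytes (+pct), CPU %.
ROW = re.compile(
    r"(sp-\d+/\d+/\d+)\s+\d+\s+\d+\s+\(\s*[\d.]+\s*%\)"
    r"\s+\d+\s+\(\s*[\d.]+\s*%\)\s+([\d.]+)\s*%"
)

def high_cpu_sps(summary_text, threshold=50.0):
    """Return (interface, cpu_percent) for rows above the CPU threshold."""
    return [(iface, float(cpu)) for iface, cpu in ROW.findall(summary_text)
            if float(cpu) > threshold]
```

The 50% default threshold is an assumption for illustration; pick a baseline appropriate to your deployment.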
Incrementing dirty flow counters are a major cause of high CPU usage. This can be verified with the following command on the services PIC console:
sp90(vty)# show services ha workers all
12-25-26, CPU  Bkt Range         Dirty       DelayDel  DelayDelPeak  Iter    State  Synched
12-25-26,  4   0-      74897     4294966943  19        0             46538   A      1
12-25-26,  5   74898-  149795    4294966943  17        0             88605   A      1
12-25-26,  6   149796- 224693    4294966949  25        0             174181  A      1
12-25-26,  7   224694- 299591    4294966926  26        0             271147  A      1
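Note that the Dirty values just under 2^32 (for example, 4294966943 = 2^32 - 353) look like wrapped 32-bit counters rather than billions of dirty flows. Under that assumption (not confirmed by the source output), a small helper normalizes the column before trending it:

```python
def signed32(counter, wrap=2**32):
    """Reinterpret a near-2^32 unsigned value as a signed 32-bit counter.
    Assumption (unverified): the 'Dirty' column is an unsigned 32-bit
    counter that can underflow past zero."""
    return counter - wrap if counter > wrap // 2 else counter
```

With this reading, 4294966943 becomes -353, a small negative residue; a genuinely large and growing Dirty count is what would point at the high-CPU cause above.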
The following commands can be used on the services PIC console:
show services service-sets summary
show services service-sets cpu-usage
show services cpu-usage comp
show services cpu-usage
show services ha workers all
The services PIC can also receive unsolicited traffic from the Internet (for example, amplification attacks) through any of the ports placed inside the GI-PUB routing instance. One such example is shown below, where the S-PIC receives continuous TCP SYNs and ICMP requests with duplicate sequence numbers. The following command displays the relevant counters.
{MASTER}
test-RE0> show services stateful-firewall statistics extensive interface sp-9/0/0
Interface: sp-9/0/0
  Service set: CGN4_SP-9-0-0
  New flows:
    Rule Accepts: 0, Rule Discards: 0, Rule Rejects: 0
  Existing flow types packet counters:
    Accepts: 2561806252, Drop: 8470280, Rejects: 0
  Hairpinning counters:
    Slow Path Hairpinned Packets: 0, Fast Path Hairpinned Packets: 0
  Drops:
    IP option: 0, TCP SYN defense: 328366        <<<<<<<<<
    NAT ports exhausted: 7388                    <<<<<<<<<
  Errors:
    IP: 357119, TCP: 292444                      <<<<<<<<<
    UDP: 1836895, ICMP: 46272                    <<<<<<<<<
  <SNIP>
  ICMP errors:
    IP data length less than minimum ICMP header length (8 bytes): 0
    ICMP error length inconsistencies: 46
    Duplicate ping sequence number: 46192        <<<<<<<<<
    Mismatched ping sequence number: 34
    No matching flow: 79959
  <SNIP>
  UDP errors:
    IP data length less than minimum UDP header length (8 bytes): 0
    Source or destination port number is zero: 0   << Check if the src or dst port is 0
    UDP port scan (ICMP error seen for UDP flow): 0
  <SNIP>
  IP errors:
    IP packet length inconsistencies: 0
    Minimum IP header length check failures: 0
    Reassembled packet exceeds maximum IP length: 0
    Illegal source address: 0
    Illegal destination address: 0
    TTL zero errors: 0, Illegal IP protocol number (0 or 255): 0
    Land attack: 0
    Non-IPv4 packets: 0, Bad checksum: 0
    Illegal IP fragment length: 4230161802       << Check this counter for non-initial fragments
    IP fragment overlap: 2
    IP fragment reassembly timeout: 1935869131   << Check this counter for non-initial fragments
    IP fragment limit exceeded: 0
    Unknown: 0
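The counters flagged above can be pulled out of the statistics output programmatically when checking many sp- interfaces. A hedged Python sketch (counter names taken from the sample output above; extend the list as needed):

```python
import re

# Counters in the sample output that suggest unsolicited/attack traffic.
INDICATORS = (
    "TCP SYN defense",
    "NAT ports exhausted",
    "Duplicate ping sequence number",
    "IP fragment reassembly timeout",
)

def attack_indicators(stats_text):
    """Return {counter_name: value} for each indicator present in the text."""
    found = {}
    for name in INDICATORS:
        m = re.search(re.escape(name) + r"\s*:\s*(\d+)", stats_text)
        if m:
            found[name] = int(m.group(1))
    return found
```

Non-zero values for these counters, trending upward between two collections, support the unidentified-traffic cause.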
Collect data from the last run of the sdg-oid event script to understand when the output drops were reported.
Check Vitalsuite data to understand if the CPU usage on the services PICs was high during the output drops.
Verify whether any of the causes described in the Cause section apply, using the commands shown above.
If none of the above causes apply, collect logs with the following CLI commands and reach out to sdg-eng:
CLI:
show interfaces sp-* extensive | no-more
show services stateful-firewall flow-analysis | no-more
show services nat pool detail | no-more
show services service-sets summary | no-more
show services service-sets statistics packet-drops | no-more
show services service-sets statistics syslog | no-more
show services stateful-firewall flows count | no-more
show services stateful-firewall statistics extensive | no-more
show services stateful-firewall statistics | no-more
show services nat mappings summary | no-more
show log messages | match "flow-control" | no-more
show configuration interfaces | match cgn-max-flows | display inheritance | display set
show services stateful-firewall flow-analysis | grep "Peak Created Flows per Second"
show services stateful-firewall subscriber-analysis | match "Total Subscribers Active"
show log messages | grep "exceeded threshold" | no-more
show services service-sets cpu-usage
S-PIC:
start shell
telnet -Ji fpc<x>.pic<y>
set services service-set active-service-set 3   (to filter stats for CGNAT flows)
show sg mac
show services data statistics
show services cpu-usage comp
show services cpu-usage pot
show services cpu-usage
show services cpumap
show services mem
show arena
show services ip frag-statistics
show services tcp-recon-stats
show services fwnat counters
show services fwnat counters nat
show services fwnat counters alg
show services ha
show services ha workers all
show services ha stats
show services ha seqnum
show services fwnat natpool