moderate traffic on one vlan = 100% CPU usage on RB1000

cololine · Thu Feb 18, 2010 3:49 am

Hello All -

Running ROS 3.24 on RB1000U, doing VLAN routing of IP traffic, most other features disabled, have a few simple queues, that's about it.

Today I saw something that is a first in nearly a year (since I put this configuration into production): CPU usage briefly spiked to nearly 100%, and latency became severe, with something like 50% packet loss. You can see the very short spikes on the graph below - it happened twice:

daily.gif

Unfortunately, by the time I determined that this was the cause of the latency vs some other network issue and was able to log in to Winbox, the incident had passed and CPU usage was already settling down to pre-spike levels. The second time it happened, I was away from the office, so I never had the chance.

There are about 5 dozen client VLANs being routed by the 1000U via a connected L2 switch. I figured this spike was due to some near-Gbps burst of traffic, but in fact there was only a moderate spike on just one client graph which corresponds exactly with each incident:

r1_0_prgmr_com-day.png

Now, this router was stress-tested a few months ago with a client who was doing traffic well in excess of 500Mbps, and it handled it fine, no latency. The config of the router has changed very little - a few client VLANs have been dropped, and a few new ones have been added. The overall VLAN and IP address count is down, and the traffic is down.

So here's the question: why did each of these 25Mbps - 30Mbps spikes on this one client VLAN bring the RB1000 to its knees? It's routinely routing about 50Mbps - 80Mbps in each direction and CPU loading is steady at around 25%. Unfortunately I don't have any data on the traffic from the incidents. I've emailed the client and asked him to check his logs; if he produces anything, I'll post it here. But maybe someone out there has been down this road before, or has an idea?

On a related note, can someone suggest a good way to capture and log traffic data from the RB1000 in a way that will offer good forensics if this happens again? It seems from reading this forum that trafr is based on a very old linux kernel and is essentially broken on anything of recent vintage. Also, the sniffer itself seems kind of broken, at least from Winbox - I tried running it with the interface set to just one vlan, but even after applying that change, when I run it, it captures data from every interface anyway!

Thanks in advance!

Ed

rickhodger · Thu Feb 18, 2010 2:43 pm

There are two possible answers:

1. It might have been a very small burst of bandwidth, but it may have been made up of very small packets. Unless you have a graph of the packet rates as well, you will not be able to determine this. I have seen commercial, big-name firewalls brought to their knees by just 10Mb of traffic which, when examined, turned out to be in excess of 30,000 packets per second.

2. Your RRD/MRTG graph has averaged out the peak of the bandwidth burst, and it was actually much higher than you can see. RRD normally uses a 5 minute averaging, so very quick spikes in traffic will not show up well. For example:

1 minute: 1Mb
2 minute: 1.5Mb
3 minute: 2Mb
4 minute: 1.5Mb
5 minute: 1Mb

On an RRD graph, that will show an average for that block of 1.4Mb. Now take:

1 minute: 1Mb
2 minute: 1.5Mb
3 minute: 20Mb
4 minute: 1.5Mb
5 minute: 1Mb

An RRD graph will record an average of 5Mb for that 5 minute window, even though it spiked to 20Mb. There's also a chance that it did record a higher spike, but can't display it due to the small pixel dimensions of your graph.
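The arithmetic in the two examples can be checked with a few lines of Python. This is only a sketch: the per-minute readings are the illustrative figures from this post, and `five_minute_average` is a made-up helper name, not anything RRD itself provides.

```python
# Per-minute readings in Mb, copied from the two examples in this post.
steady = [1.0, 1.5, 2.0, 1.5, 1.0]
bursty = [1.0, 1.5, 20.0, 1.5, 1.0]


def five_minute_average(samples):
    """Mean of the samples: the single value an averaged 5-minute slot keeps."""
    return sum(samples) / len(samples)


for name, samples in (("steady", steady), ("bursty", bursty)):
    average = five_minute_average(samples)
    peak = max(samples)
    print(f"{name}: average {average:.1f}Mb, true peak {peak:.1f}Mb")
```

The bursty case averages out at a quarter of its real peak, which is why a short burst can be nearly invisible on an averaged graph.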

You can see this in the two graphs below. The first is from our cacti system, which uses RRD. When you cut it down to a one hour window, you can clearly see the 5 minute slots and a peak of 9Mb.

graph_image.png

The graph for the same port from The Dude, however, tells a different story. As it polls every few seconds, it saw the true peak in bandwidth: a spike of almost 60Mb.

chart.png

In summary: start graphing your packet rates in addition to bandwidth rates, and if possible set up a copy of The Dude someplace and monitor your RB1000 with it for a while. This will tell you whether it's packet rates or bandwidth rates that are causing the issue.
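Once both rates are being graphed, dividing one by the other gives a rough average packet size, which is the number that separates the two explanations. A minimal sketch, assuming you already have a bits-per-second and a packets-per-second reading for the same interval; the function name is hypothetical, the first pair of numbers is the firewall anecdote from this post, and the second pair is invented for contrast.

```python
def average_packet_bytes(bits_per_second, packets_per_second):
    """Rough average packet size in bytes over the measured interval."""
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    return bits_per_second / 8 / packets_per_second


# Firewall anecdote from this post: about 10Mb of traffic at 30,000 packets/s.
small = average_packet_bytes(10_000_000, 30_000)

# Invented contrast case: 60Mb of traffic at 10,000 packets/s.
typical = average_packet_bytes(60_000_000, 10_000)

print(f"anecdote: ~{small:.1f} bytes per packet")
print(f"contrast: ~{typical:.1f} bytes per packet")
```

A result in the tens of bytes points at a flood of tiny packets; a result in the hundreds of bytes points back at raw bandwidth.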

rickhodger · Thu Feb 18, 2010 2:56 pm

It occurs to me after writing this that you're probably using the built-in graphing, which in turn uses RRD, so you may have no idea what I'm talking about XD

cololine · Thu Feb 18, 2010 5:00 pm

I had suspected a high packet rate of very small packets. I understand what you are saying about the sampling frequency and the possibility of missing small extreme bursts in the graph. However, the client who was implicated in this is on a 100Mbps uplink, so that would be the limit of his burst, and as pointed out before the router has proven itself capable of handling hundreds of Mbps... with more typical packet sizes, that is.

I am using ROS's built-in graph for the CPU usage, and mrtg for the client VLANs. I don't think either one has the ability to graph packet size - is this something that The Dude can do? Also, I've noted that The Dude can run on an RB1000. Has anyone done this? Does it run alongside ROS, and how does one access the various Dude displays and screens?
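To put a number on what a 100Mbps uplink can deliver in packets rather than bits, here is a back-of-the-envelope sketch. It assumes standard Ethernet framing, where each frame also costs 20 bytes on the wire (8 bytes of preamble plus a 12-byte inter-frame gap), and it ignores the 4-byte VLAN tag; the function name is made up for illustration.

```python
# Per-frame overhead on the wire: 8 bytes preamble/SFD + 12 bytes inter-frame gap.
PREAMBLE_AND_GAP_BYTES = 20


def max_frames_per_second(link_bps, frame_bytes):
    """Theoretical frame rate of a link saturated with frames of one size."""
    bits_on_wire = (frame_bytes + PREAMBLE_AND_GAP_BYTES) * 8
    return link_bps / bits_on_wire


# 64 bytes is the minimum Ethernet frame, 1518 the usual untagged maximum.
for size in (64, 512, 1518):
    rate = max_frames_per_second(100_000_000, size)
    print(f"{size:>4}-byte frames: about {round(rate):,} frames/s")
```

So the same 100Mbps link can carry anywhere from roughly eight thousand to roughly 149 thousand frames per second depending on frame size, and a burst well below line rate can still be a very high packet rate.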

Thanks,
Ed

rickhodger · Fri Feb 19, 2010 12:59 pm

You can run it on the RB1000, but I found it made an unacceptable number of writes to the RB's internal flash memory, so it really needs to be run on an external CF card. The Dude cannot graph packet rates, though. MRTG can, but I am unsure of the exact configuration. According to the MRTG docs, the following is an example of a packet-rate graph config:

#
# Interface Packets (SNMP)
#

YLegend[localhost.packets]: Packets/m
Options[localhost.packets]: growright
Target[localhost.packets]: .1.3.6.1.2.1.2.2.1.11.11&.1.3.6.1.2.1.2.2.1.17.11:public@localhost * 60
SetEnv[localhost.packets]: MRTG_INT_IP="157.55.12.215" MRTG_INT_DESCR="Broadcom-NetXtreme-57xx-Gigabit-Controller"
MaxBytes[localhost.packets]: 12500000
Title[localhost.packets]: Packet Analysis for corpnet -- wsman.msft.net
PageTop[localhost.packets]: <H1>Packet Analysis for -- wsman.msft.net</H1>
 <TABLE>
   <TR><TD>System:</TD>     <TD>wsman.msft.net in Redmond</TD></TR>
   <TR><TD>Description:</TD><TD>Microsoft Virtual Machine Bus Network Adapter</TD></TR>
   <TR><TD>ifType:</TD>     <TD>ethernetCsmacd (6)</TD></TR>
   <TR><TD>ifName:</TD>     <TD></TD></TR>
   <TR><TD>Max Speed:</TD>  <TD>12.5 MBytes/s</TD></TR>
   <TR><TD>Ip:</TD>         <TD>131.107.83.32 (wsman.msft.net)</TD></TR>
 </TABLE>
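For anyone adapting that template: the two OIDs in the Target line are the IF-MIB ifTable columns ifInUcastPkts (column 11) and ifOutUcastPkts (column 17), each followed by the interface index (11 in the sample). The sketch below just rebuilds the Target line for a given index; `packet_target` and its parameters are hypothetical names for illustration, not part of MRTG.

```python
# IF-MIB ifEntry prefix and the two unicast packet counter columns.
IF_ENTRY = "1.3.6.1.2.1.2.2.1"
IF_IN_UCAST_PKTS = 11
IF_OUT_UCAST_PKTS = 17


def packet_target(if_index, community, host, label):
    """Build an MRTG Target line for in/out unicast packets on one interface.

    The trailing "* 60" turns the per-second rate into packets per minute,
    matching the "Packets/m" YLegend in the sample config.
    """
    in_oid = f".{IF_ENTRY}.{IF_IN_UCAST_PKTS}.{if_index}"
    out_oid = f".{IF_ENTRY}.{IF_OUT_UCAST_PKTS}.{if_index}"
    return f"Target[{label}]: {in_oid}&{out_oid}:{community}@{host} * 60"


# Reproduces the Target line from the sample config (interface index 11).
print(packet_target(11, "public", "localhost", "localhost.packets"))
```

The last number is the part that has to change per device, since the index for a given interface differs from one box to the next.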

cololine · Fri Feb 19, 2010 5:38 pm

Thanks for looking into that. I adapted that template a bit for my device, then pasted it into the mrtg config file for the router and gave it a run. I get this error:

SNMP Error:
Received SNMP response with error code
  error status: noSuchName
  index 1 (OID: 1.3.6.1.2.1.2.2.1.11.11)
SNMPv1_Session (remote host: "r1.0" [XXX.XXX.XXX.XXX].161)
                  community: "public"
                 request ID: 1868700253
                PDU bufsize: 8000 bytes
                    timeout: 2s
                    retries: 5
                    backoff: 1)
 at /usr/local/mrtg-2/bin/../lib/mrtg2/SNMP_util.pm line 490
SNMPGET Problem for .1.3.6.1.2.1.2.2.1.11.11 .1.3.6.1.2.1.2.2.1.17.11 sysUptime sysName on public@r1.0::::::v4only
 at /usr/local/mrtg-2/bin/mrtg line 2150

So I guess the above OIDs are not quite right for the RB1000. I ran snmpwalk against the router, then grep'd for strings that contain 'pkts' and came up with thirteen counters each for unicast packets in and out - which of these would be the right one to query to get the *total* pps values for the RB1000 as a whole? I've set it up to use the .1 counter for each direction, simply because those had the largest value stored in their registers, and it's graphing. But it would be great to get confirmation that these are the right ones.

IF-MIB::ifInUcastPkts.1 = Counter32: 606099085
IF-MIB::ifInUcastPkts.2 = Counter32: 561426671
IF-MIB::ifInUcastPkts.4 = Counter32: 63674008
IF-MIB::ifInUcastPkts.5 = Counter32: 16228311
IF-MIB::ifInUcastPkts.6 = Counter32: 137079344
IF-MIB::ifInUcastPkts.7 = Counter32: 2708223
IF-MIB::ifInUcastPkts.8 = Counter32: 207841135
IF-MIB::ifInUcastPkts.10 = Counter32: 129699186
IF-MIB::ifInUcastPkts.11 = Counter32: 0
IF-MIB::ifInUcastPkts.12 = Counter32: 0
IF-MIB::ifInUcastPkts.13 = Counter32: 275559
IF-MIB::ifOutUcastPkts.1 = Counter32: 630414800
IF-MIB::ifOutUcastPkts.2 = Counter32: 521516275
IF-MIB::ifOutUcastPkts.4 = Counter32: 83624981
IF-MIB::ifOutUcastPkts.5 = Counter32: 17136344
IF-MIB::ifOutUcastPkts.6 = Counter32: 142978944
IF-MIB::ifOutUcastPkts.7 = Counter32: 2971107
IF-MIB::ifOutUcastPkts.8 = Counter32: 216032813
IF-MIB::ifOutUcastPkts.10 = Counter32: 58262481
IF-MIB::ifOutUcastPkts.11 = Counter32: 0
IF-MIB::ifOutUcastPkts.12 = Counter32: 0
IF-MIB::ifOutUcastPkts.13 = Counter32: 248263

Thanks!
Ed

rickhodger · Mon Feb 22, 2010 4:32 pm

Each of those numbers (or indexes) will correspond to a different interface. Check your snmpwalk output for "ifDescr", which should give you an indicator of which index matches up with which interface. For example, from a Cisco 24-port switch:

IF-MIB::ifDescr.1 = STRING: FastEthernet0/1
IF-MIB::ifDescr.2 = STRING: FastEthernet0/2
IF-MIB::ifDescr.3 = STRING: FastEthernet0/3
IF-MIB::ifDescr.4 = STRING: FastEthernet0/4
IF-MIB::ifDescr.5 = STRING: FastEthernet0/5
IF-MIB::ifDescr.6 = STRING: FastEthernet0/6
IF-MIB::ifDescr.7 = STRING: FastEthernet0/7
IF-MIB::ifDescr.8 = STRING: FastEthernet0/8
IF-MIB::ifDescr.9 = STRING: FastEthernet0/9
IF-MIB::ifDescr.10 = STRING: FastEthernet0/10
IF-MIB::ifDescr.11 = STRING: FastEthernet0/11
IF-MIB::ifDescr.12 = STRING: FastEthernet0/12
IF-MIB::ifDescr.13 = STRING: FastEthernet0/13
IF-MIB::ifDescr.14 = STRING: FastEthernet0/14
IF-MIB::ifDescr.15 = STRING: FastEthernet0/15
IF-MIB::ifDescr.16 = STRING: FastEthernet0/16
IF-MIB::ifDescr.17 = STRING: FastEthernet0/17
IF-MIB::ifDescr.18 = STRING: FastEthernet0/18
IF-MIB::ifDescr.19 = STRING: FastEthernet0/19
IF-MIB::ifDescr.20 = STRING: FastEthernet0/20
IF-MIB::ifDescr.21 = STRING: FastEthernet0/21
IF-MIB::ifDescr.22 = STRING: FastEthernet0/22
IF-MIB::ifDescr.23 = STRING: FastEthernet0/23
IF-MIB::ifDescr.24 = STRING: FastEthernet0/24
IF-MIB::ifDescr.25 = STRING: GigabitEthernet0/1
IF-MIB::ifDescr.26 = STRING: GigabitEthernet0/2

Your 1-4 probably correspond to the RB's built-in gigabit ports. The rest are probably your VLANs, bridges or whatever other virtual interfaces you have created.
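If it helps, here is a rough sketch of matching indexes to names and turning two counter readings into a packets-per-second figure. It assumes snmpwalk output in the `IF-MIB::name.index = TYPE: value` form shown in this thread; the interface names and the second counter reading are invented, and both function names are hypothetical.

```python
import re

COUNTER32_MODULUS = 2 ** 32
WALK_LINE = re.compile(r"^IF-MIB::(\w+)\.(\d+) = \w+: (.*)$")


def parse_walk(text):
    """Group snmpwalk lines as {column: {if_index: value_string}}."""
    table = {}
    for line in text.splitlines():
        match = WALK_LINE.match(line.strip())
        if match:
            column, index, value = match.groups()
            table.setdefault(column, {})[int(index)] = value
    return table


def packets_per_second(old, new, seconds):
    """Rate between two Counter32 readings, allowing for one wrap past 2**32."""
    return ((new - old) % COUNTER32_MODULUS) / seconds


# Invented ifDescr names; the first counter value is the one posted above.
first = parse_walk("""
IF-MIB::ifDescr.1 = STRING: ether1
IF-MIB::ifDescr.5 = STRING: vlan100
IF-MIB::ifInUcastPkts.1 = Counter32: 606099085
""")
# Invented second reading, taken 10 seconds later.
second = parse_walk("IF-MIB::ifInUcastPkts.1 = Counter32: 606399085")

for index, name in sorted(first["ifDescr"].items()):
    if index in second.get("ifInUcastPkts", {}):
        old = int(first["ifInUcastPkts"][index])
        new = int(second["ifInUcastPkts"][index])
        print(f"{name} (index {index}): {packets_per_second(old, new, 10):.0f} pps in")
```

The modulo matters because these are 32-bit counters: at tens of thousands of packets per second they roll over within a day or two, and a naive subtraction would produce a negative rate.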
