Hello All -
Running ROS 3.24 on RB1000U, doing VLAN routing of IP traffic, most other features disabled, have a few simple queues, that's about it.
Today I saw something for the first time in nearly a year (since I put this configuration into production): CPU usage briefly spiked to nearly 100%, latency became severe, and packet loss hit something like 50%. You can see the very short spikes on the graph below - it happened twice:
Unfortunately, by the time I determined that this was the cause of the latency vs some other network issue and was able to log in to Winbox, the incident had passed and CPU usage was already settling down to pre-spike levels. The second time it happened, I was away from the office, so I never had the chance.
There are about five dozen client VLANs being routed by the 1000U via a connected L2 switch. I figured this spike was due to some near-Gbps burst of traffic, but in fact there was only a moderate spike on just one client graph, which corresponds exactly with each incident:
Now, this router was stress-tested a few months ago with a client who was doing traffic well in excess of 500Mbps, and it handled it fine, no latency. The config of the router has changed very little - a few client VLANs have been dropped, and a few new ones have been added. The overall VLAN and IP address count is down, and the traffic is down.
So here's the question: why did each of these 25Mbps - 30Mbps spikes on this one client VLAN bring the RB1000 to its knees? It's routinely routing about 50Mbps - 80Mbps in each direction and CPU loading is steady at around 25%. Unfortunately I don't have any data on the traffic from the incidents. I've emailed the client and asked him to check his logs; if he produces anything, I'll post it here. But maybe someone out there has been down this road before, or has an idea?
On a related note, can someone suggest a good way to capture and log traffic data from the RB1000 in a way that will offer good forensics if this happens again? It seems from reading this forum that trafr is based on a very old Linux kernel and is essentially broken on anything of recent vintage. Also, the sniffer itself seems kind of broken, at least from Winbox - I tried running it with the interface set to just one VLAN, but even after applying that change, when I run it, it captures data from every interface anyway!
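For what it's worth, here is a rough, untested-in-production sketch of the kind of trafr replacement I have in mind. It assumes the sniffer's streaming option sends each captured packet as a TZSP datagram over UDP, that port 37008 is the port in use, and that the payload is an Ethernet frame; I have not verified any of that against 3.24. The function names and the output file name are made up for illustration. It strips the TZSP header and appends each frame to a pcap file that Wireshark should be able to open.

```python
import socket
import struct
import time

# Assumption: the router streams TZSP to this UDP port; adjust to your setup.
TZSP_PORT = 37008
TAG_PADDING, TAG_END = 0, 1  # the two TZSP tags that carry no length byte


def strip_tzsp(datagram: bytes) -> bytes:
    """Return the encapsulated frame from one TZSP datagram."""
    if len(datagram) < 4:
        raise ValueError("datagram too short for a TZSP header")
    pos = 4  # skip version (1 byte), type (1 byte), encapsulated protocol (2 bytes)
    while pos < len(datagram):
        tag = datagram[pos]
        pos += 1
        if tag == TAG_END:
            return datagram[pos:]  # everything after the end tag is the frame
        if tag == TAG_PADDING:
            continue
        if pos >= len(datagram):
            break
        pos += 1 + datagram[pos]  # skip the length byte plus the tag data
    raise ValueError("no TZSP end tag found")


def pcap_global_header() -> bytes:
    """Classic pcap file header: little-endian, v2.4, Ethernet link type."""
    return struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)


def pcap_record(frame: bytes, timestamp: float) -> bytes:
    """One pcap record: seconds, microseconds, captured and original length."""
    sec = int(timestamp)
    usec = int((timestamp - sec) * 1_000_000)
    return struct.pack("<IIII", sec, usec, len(frame), len(frame)) + frame


def capture(path: str, bind_addr: str = "0.0.0.0") -> None:
    """Listen for TZSP datagrams forever and append the frames to a pcap file."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, TZSP_PORT))
    with open(path, "wb") as out:
        out.write(pcap_global_header())
        while True:
            datagram, _ = sock.recvfrom(65535)
            try:
                frame = strip_tzsp(datagram)
            except ValueError:
                continue  # ignore anything that does not parse as TZSP
            out.write(pcap_record(frame, time.time()))
            out.flush()  # keep the file usable if the script is killed
```

Calling capture("rb1000.pcap") on a box the router can reach would leave it running until interrupted. Timestamps are taken when the datagram arrives at the receiver, not on the router, so they are only approximate. If anyone knows this is wrong for the RB1000, please correct me.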
Thanks in advance!
Ed