Hi, we have a CCR1072 at our network core and it has become unstable.
The configuration of this device is:
SFP1 - NIX Connection
SFP2 - Cogent Connection
SFP5 - Bonding
SFP6 - Bonding
SFP7 - Bonding
SFP8 - Bonding
On NIX we have 72 peerings with different entities, none of them full tables.
On Cogent we have a full BGP table.
Over the bonding we have 3 iBGP sessions carrying our routes and those of other connected providers.
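To give an idea of the topology, here is a simplified sketch of this kind of setup in RouterOS v6 syntax (interface names, AS numbers and addresses are placeholders, not our real config):

/interface bonding add name=bond1 mode=802.3ad slaves=sfp5,sfp6,sfp7,sfp8
/routing bgp instance set default as=64500
/routing bgp peer add name=nix-peer1 remote-address=193.0.2.1 remote-as=64501 comment="one of the 72 NIX peerings"
/routing bgp peer add name=cogent remote-address=198.51.100.1 remote-as=174 comment="full table"
/routing bgp peer add name=ibgp1 remote-address=10.0.0.2 remote-as=64500 comment="iBGP over the bonding"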
A week ago it became unstable: there was no attack, yet all CPUs were at 100% load and 100% IRQ:
/system resource> monitor
cpu-used: 100%
cpu-used-per-cpu:
100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,
100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,
100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%
free-memory: 11673536KiB
/system resource cpu> print
# CPU LOAD IRQ DISK
0 cpu0 100% 100% 0%
1 cpu1 100% 100% 0%
2 cpu2 100% 100% 0%
3 cpu3 100% 100% 0%
4 cpu4 100% 100% 0%
5 cpu5 100% 100% 0%
6 cpu6 100% 100% 0%
7 cpu7 100% 100% 0%
8 cpu8 100% 100% 0%
9 cpu9 100% 100% 0%
10 cpu10 100% 100% 0%
11 cpu11 100% 100% 0%
12 cpu12 100% 100% 0%
13 cpu13 100% 100% 0%
14 cpu14 100% 100% 0%
15 cpu15 100% 41% 0%
16 cpu16 100% 100% 0%
17 cpu17 100% 100% 0%
18 cpu18 100% 100% 0%
...
It was not an attack: traffic did not rise, it actually fell to 0 because the device itself could not do anything. Even typing in the console showed only 1 character every 5 seconds.
Even refreshing the BGP sessions was impossible:
/routing bgp advertisements> /routing bgp peer refresh-all
action timed out - try again, if error continues contact MikroTik support and send a supout file (13)
The only way back to normal was to shut down all BGP sessions and cut the traffic. Even so, the CPU load stayed high for some minutes.
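In case it helps anyone in the same situation, all BGP sessions can be disabled at once with something like this (RouterOS v6 syntax; re-enable with the matching enable command afterwards):

/routing bgp peer disable [find]
/routing bgp peer enable [find]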
Finally, when everything returned to normal, I was able to generate the supout file and send it to support so they could review it. But by then everything had already happened, so it would be difficult for them to see anything.
The support team tells me it seems to be related to Traffic Flow, which saturated and blocked the router.
We use Traffic Flow to detect DDoS attacks, so for us, with 16/20 Gbps of traffic and 25/30 Gbps of throughput, it is fundamental.
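For reference, a typical Traffic Flow setup of this kind looks like the following (collector address, cache size and timeout are example values only, not our exact config):

/ip traffic-flow set enabled=yes interfaces=all cache-entries=512k active-flow-timeout=1m
/ip traffic-flow target add dst-address=192.0.2.10 port=2055 version=9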
I understand this is a difficult problem, but it has to be solved, and not only for my case. Talking with other colleagues, they agree on the same thing: Traffic Flow does not work well once you start to have high traffic. And we are talking about peers running small ISPs that move 1 Gbps of traffic; we have 16/20 Gbps of traffic and a throughput of 25 Gbps.
The workaround several of them have adopted is to assume the MikroTik cannot do it: they placed a switch in front of the MikroTik so that the switch exports the NetFlow data instead, avoiding the MikroTik entirely. Since then they have had no more problems, so I understand there must be a bug in the way MikroTik handles this part.
As I understand it, the Traffic Flow cache should behave as a FIFO (First In, First Out) queue: if MikroTik cannot flush the cache, it should evict the oldest entries to make room for new information. I think the problem appears when the cache is completely full and the router does not know where to store the next flow, so it keeps waiting (and burning CPU).
Would installing an M.2 disk solve this problem? Could we use the whole M.2 drive to store Traffic Flow data and never saturate the cache?
Best regards, and sorry for my bad English and the long post.