Hi, we have a CCR1072 at our network core and it has become unstable.
The configuration of this device is:
SFP1 - NIX Connection
SFP2 - Cogent Connection
SFP5 - Bonding
SFP6 - Bonding
SFP7 - Bonding
SFP8 - Bonding
On NIX we have 72 peerings with different entities, none of them full tables.
On Cogent we have a full BGP table.
Over the bonding we have 3 iBGP sessions carrying our routes and those of other connected providers.
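To give an idea of the topology, here is a simplified sketch of this kind of setup in RouterOS v6 syntax (interface names, AS numbers and addresses are placeholders, not our real config):

/interface bonding add name=bond1 mode=802.3ad slaves=sfp5,sfp6,sfp7,sfp8
/routing bgp instance set default as=64500
/routing bgp peer add name=nix-peer1 remote-address=193.0.2.1 remote-as=64501 comment="one of the 72 NIX peerings"
/routing bgp peer add name=cogent remote-address=198.51.100.1 remote-as=174 comment="full table"
/routing bgp peer add name=ibgp1 remote-address=10.0.0.2 remote-as=64500 comment="iBGP over the bonding"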
A week ago it became unstable: there was no attack, yet all CPUs were at 100% load and 100% IRQ:
/system resource> monitor
cpu-used: 100%
cpu-used-per-cpu:
100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,
100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,
100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%,100%
free-memory: 11673536KiB
/system resource cpu> print
# CPU LOAD IRQ DISK
0 cpu0 100% 100% 0%
1 cpu1 100% 100% 0%
2 cpu2 100% 100% 0%
3 cpu3 100% 100% 0%
4 cpu4 100% 100% 0%
5 cpu5 100% 100% 0%
6 cpu6 100% 100% 0%
7 cpu7 100% 100% 0%
8 cpu8 100% 100% 0%
9 cpu9 100% 100% 0%
10 cpu10 100% 100% 0%
11 cpu11 100% 100% 0%
12 cpu12 100% 100% 0%
13 cpu13 100% 100% 0%
14 cpu14 100% 100% 0%
15 cpu15 100% 41% 0%
16 cpu16 100% 100% 0%
17 cpu17 100% 100% 0%
18 cpu18 100% 100% 0%
...
It was not an attack: traffic did not rise, it actually fell to 0 because the device itself could not do anything. Even typing in the console showed only 1 character every 5 seconds.
Even refreshing the BGP sessions was impossible:
/routing bgp advertisements> /routing bgp peer refresh-all
action timed out - try again, if error continues contact MikroTik support and send a supout file (13)
The only way back to normal was to shut down all BGP sessions and cut the traffic. Even so, the CPU load stayed high for some minutes.
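In case it helps anyone in the same situation, all BGP sessions can be disabled at once with something like this (RouterOS v6 syntax; re-enable with the matching enable command afterwards):

/routing bgp peer disable [find]
/routing bgp peer enable [find]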
Finally, when everything returned to normal, I was able to generate the supout file and send it to support so they could review it. But by then everything had already happened, so it would be difficult for them to see anything.
The support team tells me it seems to be related to Traffic Flow, which saturated and blocked the router.
We use Traffic Flow to detect DDoS attacks, so for us, with 16/20 Gbps of traffic and 25/30 Gbps of throughput, it is fundamental.
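For reference, a typical Traffic Flow setup of this kind looks like the following (collector address, cache size and timeout are example values only, not our exact config):

/ip traffic-flow set enabled=yes interfaces=all cache-entries=512k active-flow-timeout=1m
/ip traffic-flow target add dst-address=192.0.2.10 port=2055 version=9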
I understand this is a difficult problem, but it has to be solved, and not only for my case. Talking with other colleagues, they agree on the same thing: Traffic Flow does not work well once you start to have high traffic. And we are talking about peers running small ISPs that move 1 Gbps of traffic; we have 16/20 Gbps of traffic and a throughput of 25 Gbps.
The workaround several of them have adopted is to assume the MikroTik cannot do it: they placed a switch in front of the MikroTik so that the switch exports the NetFlow data instead, avoiding the MikroTik entirely. Since then they have had no more problems, so I understand there must be a bug in the way MikroTik handles this part.
As I understand it, the Traffic Flow cache should behave as a FIFO (First In, First Out) queue: if MikroTik cannot flush the cache, it should evict the oldest entries to make room for new information. I think the problem appears when the cache is completely full and the router does not know where to store the next flow, so it keeps waiting (and burning CPU).
Would installing an M.2 disk solve this problem? Could we use the whole M.2 drive to store Traffic Flow data and never saturate the cache?
Best regards, and sorry for my bad English and the long post.