Hello guys,
we have an important and urgent problem to solve. You can download the attachments here:
https://www.mediafire.com/folder/wjktkk ... Datacenter
The affected device (FIRT02) has 4 BGP peers:
1. SNAP: upstream peer from which we receive the full internet routing table
2. FIRT01: our other full internet routing table router; it does the same job as the currently affected device
3. MID01: our internal router; it receives the announcements of our customers' public subnets, which are then announced to the world
4. MID02: same as MID01
On the 5th of July, at about 16:00-17:00 GMT+2, the affected router started removing all the routes from the configured aggregates, which completely removed the announcements to the upstream.
The RouterOS version is 6.48.5 (long-term).
As a workaround we added static announcements, which you will find in the configuration.
Checking the logs, we found what you can see in the attachment (zip file): cyclically (about every 30 seconds) the networks in the aggregates are completely removed and then added back.
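For anyone who wants to check the roughly 30-second cycle on their own logs, here is a minimal Python sketch. It assumes you have already extracted the timestamps of the "route removed" events from the log into a list; the function names, the 20% tolerance and the sample timestamps are made up for illustration and are not taken from our actual logs.

```python
from datetime import datetime
from statistics import median


def removal_intervals(timestamps):
    """Return the gaps (in seconds) between consecutive events."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def looks_periodic(timestamps, tolerance=0.2):
    """Return (is_periodic, median_gap_in_seconds).

    The events count as periodic when every gap is within
    `tolerance` (a fraction) of the median gap.
    """
    gaps = removal_intervals(timestamps)
    if len(gaps) < 2:
        return False, None
    mid = median(gaps)
    if mid <= 0:
        return False, mid
    periodic = all(abs(g - mid) <= tolerance * mid for g in gaps)
    return periodic, mid


# Made-up timestamps, only to show the call (hypothetical data).
sample = [
    datetime(2000, 1, 1, 12, 0, 0),
    datetime(2000, 1, 1, 12, 0, 31),
    datetime(2000, 1, 1, 12, 1, 0),
    datetime(2000, 1, 1, 12, 1, 30),
]
print(looks_periodic(sample))
```

A stable median gap would suggest a timer-driven cause (for example something tied to a hold or keepalive interval) rather than random flapping, but that is only a hint and not a diagnosis.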
Checking the state of the peers, no withdrawals or updates have been received from the 2 MID peers (which should be the ones responsible for the removal of the routes reported in the attachment). In practice, the counters have been stable since the first power-on.
Taking a pcap on port 179 on our 4 routers, we see a huge number of DUP ACKs and retransmissions.
Looking deeper, we see, for example, a keepalive message that is sent and then immediately re-sent within a few ten-thousandths of a second.
The router on the other side responds to the first one and then to the 2 retransmissions as well.
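To make the retransmission pattern easier to quantify, here is a rough Python sketch of how the duplicates could be counted. It assumes the TCP segments have already been exported from the capture into simple records (time, addresses, ports, sequence number, payload length); the `Segment` type, the `find_retransmissions` helper, the 1 ms window and the sample addresses are all hypothetical. The only protocol facts used are that BGP runs on TCP port 179 and that a KEEPALIVE is just the 19-byte BGP header.

```python
from collections import defaultdict
from typing import NamedTuple

BGP_PORT = 179
BGP_KEEPALIVE_LEN = 19  # a KEEPALIVE consists only of the 19-byte BGP header


class Segment(NamedTuple):
    ts: float         # capture time in seconds
    src: str
    dst: str
    sport: int
    dport: int
    seq: int          # TCP sequence number
    payload_len: int  # TCP payload length in bytes


def find_retransmissions(segments, window=0.001):
    """Return [(flow_key, [delay, ...])] for data segments that reappear
    with the same sequence number within `window` seconds of the first copy."""
    seen = defaultdict(list)
    for seg in segments:
        if seg.payload_len == 0:
            continue  # pure ACKs carry no data, skip them
        if BGP_PORT not in (seg.sport, seg.dport):
            continue  # not BGP traffic
        key = (seg.src, seg.dst, seg.sport, seg.dport, seg.seq, seg.payload_len)
        seen[key].append(seg.ts)

    result = []
    for key, times in seen.items():
        times.sort()
        delays = [t - times[0] for t in times[1:] if t - times[0] <= window]
        if delays:
            result.append((key, delays))
    return result


# Made-up segments (documentation addresses), only to show the call.
a, b = "192.0.2.1", "192.0.2.2"
capture = [
    Segment(0.0, a, b, 40000, BGP_PORT, 1000, BGP_KEEPALIVE_LEN),
    Segment(0.0002, a, b, 40000, BGP_PORT, 1000, BGP_KEEPALIVE_LEN),
    Segment(0.0004, a, b, 40000, BGP_PORT, 1000, BGP_KEEPALIVE_LEN),
    Segment(30.0, a, b, 40000, BGP_PORT, 1019, BGP_KEEPALIVE_LEN),
]
for key, delays in find_retransmissions(capture):
    print(key, len(delays), "retransmission(s)")
```

Copies that arrive this quickly are far too fast to be normal TCP timeout retransmissions, so a count like this could help show whether frames are being duplicated somewhere on the path (for instance across the bond members) instead of being re-sent by the TCP stack. That is a guess to verify, not a conclusion.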
The routers are linked through an LACP bond and there are no errors on the ports.
The configuration has been like this since January 2022.
The exact same problem also happened around May 2024 on the twin router (FIRT01).
You can find the supout files of the 4 routers; please note that generating them on FIRT01 and FIRT02 took a lot more time than on the other 2 routers.
We also noticed that in RouterOS v6 the BGP process is handled by only 1 of the 72 cores...
Does anybody have any idea, or has anyone experienced this already?