Thu Dec 12, 2024 7:33 pm
More specifically:
I am running L3HW offload on several CRS300's, using them as site or edge (customer-facing) routers. They work great, unless they have diverse routes with equal cost. In that case, they will eventually get confused and routes will get "stuck" going out the wrong port, despite changes in the routing table. The only fix is disabling and re-enabling L3HW offload to force the ASIC to reload the current FIB. Ensuring that no two paths to the same prefix have equal cost keeps this from happening. These internal-only routers only see customer prefixes and internal transit prefixes, so the entire FIB fits in the ASIC just fine.
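For anyone who wants to audit a route table for that equal-cost condition, here is a rough Python sketch. It is not RouterOS tooling: the `Route` class, its `cost` field (a stand-in for whatever metric breaks ties on your routers) and `equal_cost_ties` are all hypothetical names, and it assumes you have already exported the routes into a list by some means of your own.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str   # destination, e.g. "192.0.2.0/24"
    gateway: str  # next hop
    cost: int     # assumed tie-breaking metric (distance, IGP cost, ...)


def equal_cost_ties(routes):
    """Return {prefix: [gateways]} where 2+ gateways share the lowest cost."""
    by_prefix = defaultdict(list)
    for route in routes:
        by_prefix[route.prefix].append(route)

    ties = {}
    for prefix, candidates in by_prefix.items():
        best = min(c.cost for c in candidates)
        gateways = sorted(c.gateway for c in candidates if c.cost == best)
        if len(gateways) > 1:
            ties[prefix] = gateways
    return ties
```

Any prefix this reports is a candidate for the "stuck" behaviour described above; nudging one path's cost so the tie disappears is the workaround.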
On my 2116's (same CPU, RAM as 2216's, just a different switch chip), I am receiving full routes from multiple peers. I have three world-facing routers and two internal routers on the public side of my CGNAT stack, all fully meshed. The border routers all receive full routes from their peers, but I then filter down to prefixes that are one or two AS's away. /ip/route/print shows a couple million routes, but most are filtered and don't make it into the FIB.
Early on, when I had just two border 2116's and three upstream peers, I ran a number of experiments with L3HW offload enabled and disabled. I found that for L3HW to have any effect, I needed to limit the number of prefixes I inserted, so I settled on two AS's away. That worked great for a while and the CPU would drop to < 5%, but inevitably routes would get stuck and I'd have to do the same disable/enable trick. After adding three more peers, the hardware tables got full pretty quickly, and the CPU remained just as busy with offload on as with it off, so I have had L3HW offload disabled for some time now on my 2116's. CPU load is at 10-20% with 3-4Gbps at peak. I still limit the inserted prefixes to those 2 AS's away and allow the preferred default route to take the bulk of the outbound traffic.
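To make the "two AS's away" idea concrete, here is a small Python model of the selection. It is only an illustration, not the actual RouterOS filter: `BgpRoute` and `split_for_fib` are hypothetical names, the ASNs and prefixes are documentation values, and it simply counts entries in the AS path (so a prepended path counts as longer).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BgpRoute:
    prefix: str
    as_path: Tuple[int, ...]  # ASNs as received, nearest AS first


def split_for_fib(routes, max_hops=2):
    """Split routes into (installed, filtered) by AS-path length.

    Assumption: "N AS's away" means the AS path has at most N entries.
    """
    installed, filtered = [], []
    for route in routes:
        if len(route.as_path) <= max_hops:
            installed.append(route)
        else:
            filtered.append(route)
    return installed, filtered


received = [
    BgpRoute("192.0.2.0/24", (64496,)),
    BgpRoute("198.51.100.0/24", (64496, 64497)),
    BgpRoute("203.0.113.0/24", (64496, 64497, 64498)),
]
installed, filtered = split_for_fib(received)
print(len(installed), "installed,", len(filtered), "filtered")
```

Everything in `filtered` would still show up as received but never reach the FIB; traffic to those destinations follows the default route instead.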
With five fully-meshed routers, all accepted (i.e. not filtered out) routes are shared with the other four. The internal ones therefore get a much smaller subset of routes. L3HW can insert all those routes just fine, but the CPU load is already pretty low on these routers (only pushing 3-4Gbps), and the risk of them "sticking" isn't worth the imperceptible improvements.
With 7.16 I'm seeing BGP routes get stuck on the busiest of the five routers. L3HW offload isn't even a factor here because it's disabled. I don't recall having had this problem on previous releases, at least not to this extent (I've had to restart that router four times now), so I backed this one off to 7.15. (I remember it being pretty solid on 7.14 and/or 7.15.)