Help diagnosing daily network outage at approximately the same time

Posted: Wed Nov 20, 2024 8:58 pm
by Dunderhams
I'm looking after a network that went all-in on MikroTik (why wouldn't you).

The core switches are CRS518-16XS-2XQs and my client access switches are CRS354-48P-4S+2Q+. Here's a diagram of how everything is linked together:
topology.png
I have a Dude server and I'm using it to collect syslogs. I collect info, warning, error and stp logs. At the same time every day, 4:18 in the afternoon, the 802.3ad connection between CORE-SW02 and CORE-SW04 seems to start learning and forwarding. Then CORE-SW01 starts discarding on the qsfp28 connection to CORE-SW02.

Basically, I get a whole host of TCHANGE start and TCHANGE over events until 4:25 in the afternoon when it settles down.

I'm not much of a layer 2 guy, but I thought it might be related to my RSTP setup. As a result, I finally added priorities to the core switches, as I've added to my diagram. This doesn't seem to have had any impact, although the network does seem to behave better in other ways now.

I've attached my syslog so you can see what I mean.

I'm looking for ideas on what's happening, and on which additional logs I should be collecting to narrow it down. I'd also appreciate some guidance on the current topology. I'm not a big fan of it because it's a square and I'd prefer a triangle. However, these are two sites separated by quite some distance. There are 6 fibre cables running between the sites, so we set up 3 fibres on each core switch and bonded them with 802.3ad.

All routers are running 7.15.3.

Re: Help diagnosing daily network outage at approximately the same time

Posted: Thu Nov 21, 2024 9:09 am
by mkx
It could be some rogue device somewhere on the edge of your network which initiates STP topology changes. And there are plenty of devices which can do it, e.g. any server running VMs can do it (they tend to run bridges for connecting VMs to network) or servers running any containers, etc.

I'd start by going through the logs, focusing primarily on the access switches (CAS-xx), and see whether any of the STP logs relate to ports which are not used as uplink/backhaul, e.g. CAS-07 port ether23. If you find some, set them to edge=yes and observe whether that fixes the problem. If yes, go after the administrator/owner of the device connected to that port and teach him/her some lessons ;-)
I wouldn't go and disable STP on all edge ports because this might reduce the ability to detect loops.
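
The log triage described above could be sketched roughly as follows. This is only an illustration in Python: it assumes the topology-change events have already been extracted from the syslog as (switch, port) pairs, since the exact log format isn't shown in this thread, and the switch names, port names and uplink list below are made-up placeholders.

```python
from collections import Counter

def suspect_edge_ports(events, uplinks):
    """Count topology-change events per (switch, port), skipping known uplinks.

    events  -- iterable of (switch, port) pairs extracted from the syslog
    uplinks -- set of (switch, port) pairs that are legitimate uplink/backhaul ports
    """
    counts = Counter(ev for ev in events if ev not in uplinks)
    # Most frequent offenders first
    return counts.most_common()

# Placeholder data, not taken from the real logs
events = [
    ("CAS-07", "ether23"),
    ("CAS-07", "sfp-sfpplus1"),
    ("CAS-07", "ether23"),
    ("CAS-03", "ether5"),
]
uplinks = {("CAS-07", "sfp-sfpplus1")}

for (switch, port), n in suspect_edge_ports(events, uplinks):
    print(f"{switch} {port}: {n} topology-change events")
```

Any port that shows up here is a candidate for a closer look at whatever is plugged into it.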

Another thing: most vendors set the default bridge priority to 0x4000. The root bridge election picks the bridge with the lowest priority value as root (if multiple devices share the same priority, the "native VLAN ID" becomes the next selection criterion, and MAC addresses are the last value to decide the winner). Beware that if core switches have VLANs enabled, the native VLAN is most often VLAN ID 1 ... and bridges without VLAN awareness will have that VLAN ID set to 0 and will thus win if priorities are all equal (been there, bit me hard ... with Cisco switches, no less).
So when setting priority on core switches, go for values lower than 0x4000 ... e.g. 0x1000 for preferred root device(s), 0x2000 and 0x3000 for other core switches and leave 0x4000 "to the crowds" (edge switches, etc.).
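
To make the election rule concrete, here is a minimal Python sketch. It models only the two parts of the comparison that every STP variant shares, priority first and then MAC address, and leaves out the VLAN ID tie-breaker mentioned above. The priorities and MAC addresses are hypothetical, chosen just for illustration.

```python
def bridge_id(priority, mac):
    """Return a comparable bridge ID: priority first, then MAC address."""
    mac_value = int(mac.replace(":", ""), 16)
    return (priority, mac_value)

def elect_root(bridges):
    """bridges maps a name to (priority, mac); the lowest bridge ID wins."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))

# Hypothetical values: every bridge left at the same priority
all_equal = {
    "CORE-SW01": (0x4000, "00:00:5E:00:53:10"),
    "CORE-SW02": (0x4000, "00:00:5E:00:53:20"),
    "rogue":     (0x4000, "00:00:5E:00:53:01"),
}

# Same bridges, but with explicit priorities on the core switches
tuned = dict(all_equal)
tuned["CORE-SW01"] = (0x1000, "00:00:5E:00:53:10")
tuned["CORE-SW02"] = (0x2000, "00:00:5E:00:53:20")

print(elect_root(all_equal))  # the bridge with the lowest MAC wins the tie
print(elect_root(tuned))      # the bridge with the lowest priority wins
```

With equal priorities the rogue bridge takes over simply because its MAC address happens to be lowest; once the core switches have lower priorities, the MAC no longer matters.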

Re: Help diagnosing daily network outage at approximately the same time

Posted: Mon Nov 25, 2024 4:50 am
by Dunderhams
Thanks mkx!
I was thinking the same thing. I started adding BPDU Guard on all ports. I hadn't noticed the edge port setting. Do you know if there's much of a difference? If so, what would you recommend and why?

I set my root bridge to 4000 before I saw your response and go up to 5000, 6000 and 7000 for my other core switches. I wanted to have some wiggle-room in case we add other switches in future. Does that make sense, or do you recommend I go lower?

Re: Help diagnosing daily network outage at approximately the same time

Posted: Mon Nov 25, 2024 9:00 am
by mkx
Here's an article that more or less explains the different STP options: https://help.mikrotik.com/docs/spaces/R ... er-portSTP

According to my understanding, BPDU-guard is almost the exact opposite of setting a port as edge: BPDU-guard disables the port if it detects an ingress BPDU packet ... while an edge port ignores/discards those packets. I don't have a universal answer as to which one to use; I guess it depends on the actual use. Setting a port to edge hinders the ability to detect loops ... OTOH BPDU-guard (disabling such a port) can help to fix the actual issue, because the admin of the device connected to that port would probably complain.

Re bridge priority: as I already wrote, a priority value of 0x4000 seems to be the default with most vendors. So setting your preferred root switch to a lower value (e.g. 0x3000) will make sure it wins root bridge selection if some (rogue) bridge triggers a root bridge election. The idea is that an STP topology change can disturb traffic and you want to avoid that.
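
As a quick worked check of that argument, the Python sketch below compares the two priority plans from this thread against a bridge left at 0x4000. It assumes the values 4000 to 7000 quoted earlier were entered as hexadecimal, and it takes the 0x4000 default from the post above rather than from any particular device, so check what your own switches actually use.

```python
ASSUMED_DEFAULT = 0x4000  # default as stated in this thread; verify on your devices

def beats_default(priority, default=ASSUMED_DEFAULT):
    """True only if this priority strictly wins against a bridge at `default`."""
    return priority < default

current_plan = [0x4000, 0x5000, 0x6000, 0x7000]  # assumed hex reading of 4000..7000
suggested_plan = [0x1000, 0x2000, 0x3000]

for label, plan in (("current", current_plan), ("suggested", suggested_plan)):
    for prio in plan:
        verdict = "wins" if beats_default(prio) else "does not win outright"
        print(f"{label}: {prio:#06x} ({prio}) {verdict} against {ASSUMED_DEFAULT:#06x}")
```

Under these assumptions none of the current values strictly beats a bridge at 0x4000: the first one ties, so the election falls through to the tie-breakers, and the rest lose. All three suggested values win outright.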