I've had this happen too on a couple of these but it's always been in a production/hurry situation where I haven't been able to troubleshoot without rebooting. Super pain and sounds similar. Lights are on but no ssh, no GUI, doesn't pass traffic.
I have two of these in production, each with one 10Gb fiber link, one VLAN trunk port and about 4 VLANs. Rock solid, with uptimes of 56 and 62 days. True, the traffic is light - but even when stress testing I didn't get problems.
Maybe some specific configuration, triggering a bug?
Or could it be the power? Some transient, some spike? Noise on the line, maybe? They run quite cool, so I don't think it would be temperature related.
I'm using firmware 2.8.
What firmware version are you using?
Are there any special configurations?
RSTP? ...
[edit] oops my mistake ... the unit I purchased is the CRS326 and not the CSS326.
I just got one of these and under SWOS it states upgrade 2.9 available, but when I upgrade and it reboots automatically, the upgrade does NOT take. Tried this a few times with the exact same behavior. Not a good experience.
Yes, I see this too. I don't know where it comes from, but there is only 2.8.
It could be a defective device. I have two CSS326. Only one of them has had these hangs so far. The well-functioning switch has an uptime of ~38 days now. It could also be because it has much less traffic than the "defective" one.
Then it's either wrong configurations or faulty units. I thought there was something wrong with the firmware. I was on 2.8, now downgraded to 2.7 just in case. I will test it and see but probably will have to wait 6-10 days.
Did you receive 2.9 RC7?
Hi Everyone,
I just purchased a CSS326-24G-2S+RM to evaluate. I went ahead and upgraded to the 2.10 firmware. I'm going to test carefully with many GigE connections and both SFP+ ports active at 10GigE. One thing I've noticed right away: under the "System" tab, "Health" section, the temperature shows about 60C. I took the top cover off and confirmed that the switch chip (with the smaller heatsink) is HOT! What temperature are you all seeing on your switches?
That's about right. I have two of them - both running about 61 C.
This just happened to my CSS326-24G-2S+ running 2.10. It started balking after 17 days of uptime. Pings were fine, but any serious traffic would hang after a packet or two.
Wow, it seems I'm not alone. My problem though is a little bit specific. There is no problem with wired clients, but if I connect an access point (RB951) to it, the symptoms are like yours. Ping works, web browsing in general works, but big file transfers get stuck, sometimes after a few minutes, sometimes right at the beginning. At the same time, wired clients attached to the AP's ethernet ports work fine. Quite mysterious. The switch is configured as a plain dumb switch, with only a password, static IP and identity assigned. The same AP connected to another switch, or directly to the router, is OK.
We are working on the issue, but because of its rare appearance, it is extremely difficult to reproduce the problem. It might be related to Flow Control or switch congestion controls. For now, try to disable Flow Control for all interfaces under the "Link" menu in SwOS. Also, try to verify that other devices connected to the switch are not using any Flow Control settings. Keep an eye on any counters in the "Errors" menu. Let us know whether the switch still fails after this.
Disabled flow control for all ports, nothing changes. No errors.
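Worth adding for anyone following that advice: the switch ports are only half of the flow-control picture, since the attached devices can keep negotiating pause frames on their side. On a Linux host this can be checked and cleared with ethtool; a small sketch only, where eth0 is just an example interface name:

$ ethtool -a eth0                                  # show current pause (flow control) settings
$ sudo ethtool -A eth0 autoneg off rx off tx off   # stop this NIC from using pause frames

The -A change is not persistent across reboots on most distributions, so it would need to go into the host's network configuration if it turns out to matter.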
My CRS328 is still passing packets. Powering devices. But it is gone from the DHCP server. Doesn't show up in a network scan. But Winbox finds it at 192.168.88.1. The web page is not reachable at 192.168.88.1.
Anything I can do to get diagnostic info to Mikrotik?
I have the very same problem.
Hello,
Just to say that another CRS328-24P-4S+ got the illness too...
In your case, you can switch to RouterOS and forget about the problem like a bad dream. :)
I installed RouterOS 6.44.6 and since then, 0 issues.
Does anyone that is having this issue have Flow Control turned off on all ports?
I tried this, no effect, the switch hangs anyway. Every switch that hung on my network had a port with a 10Mbps connection, or 100Mbps at half duplex; maybe that is what triggers the trouble.
#!/bin/sh
# Watchdog for the SwOS lockup: when a switch hits the bug it still answers
# small pings but drops larger packets, so compare a 1300-byte ping with a
# 50-byte ping and reboot the switch through its web interface when only
# the small one gets through.

HOSTS="10.29.2.12 10.29.2.13 10.29.2.14 10.29.2.21 10.29.2.22 10.29.2.23"

log() {
    # Debug logging is enabled by creating /tmp/wdebug
    if [ -e /tmp/wdebug ]; then
        echo "$1" >> /tmp/wdebug.txt
    fi
}

log "watch service started..."
while true; do
    for h1 in $HOSTS; do
        log "ping $h1 ..."
        if ping -c1 -s 1300 "$h1" > /dev/null; then
            log "$h1 is alive..."
        elif ping -c1 -s 50 "$h1" > /dev/null; then
            # Large ping failed but the small one worked: the switch has hit the bug
            log "$h1 has BUG, needs reboot..."
            wget --post-data="" --http-user=admin --http-password="secret" "http://$h1/reboot"
        else
            log "switch $h1 unavailable..."
        fi
    done
    sleep 10
done
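In case it is useful to anyone: one way to keep the script above running in the background on a Linux box. The file name swos-watchdog.sh is just an example, use whatever you saved it as:

$ chmod +x swos-watchdog.sh
$ touch /tmp/wdebug                        # optional, enables logging to /tmp/wdebug.txt
$ nohup ./swos-watchdog.sh >/dev/null 2>&1 &

A cron @reboot entry or a small systemd unit would do the same job if you want it to survive reboots of the monitoring host.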
$ wget --post-data="" --http-user=admin --http-password="secret" http://192.168.88.1/reboot
--2020-05-08 18:26:14-- http://192.168.88.1/reboot
Connecting to 192.168.88.1:80... connected.
HTTP request sent, awaiting response... 501 Not Implemented
2020-05-08 18:26:14 ERROR 501: Not Implemented.
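For what it's worth, the same request can be replayed with verbose output to see exactly how the device answers; a quick sketch using curl, with the address and credentials from the transcript above:

$ curl -v -u admin:secret --data "" http://192.168.88.1/reboot

The -v output shows the full request and response headers, which might help pin down whether that particular firmware simply does not implement the reboot URL.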
Yes, running RouterOS. Initial post amended to include versions of firmware and rOS. The device is in a T3 datacenter with power and cooling protection. It's operating at a constant 24.6V and 29C. CPU fluctuates between 15% and 35% but never any higher.
Some notes on the first failure:
- The switch was receiving its management IP from a DHCP server (also rOS) on the network. The first sign of trouble was that this stopped working and the switch lost its IP. (I know, awful practice, fixed now!)
- Could not log in to the router using either normal or MAC telnet, yet it responded to pings
- Switch failed to pass any traffic through it, but still responded to pings and returned "Login:" via MAC telnet, so was partially responsive
- Could log in to the router using a console cable, but could not see any faults in the log. Statically assigned an IP, still wouldn't allow winbox access. Tried a few other things and while it accepted the commands, nothing changed its failed state
- Power cycle via console command fixed the issue
A week later, the same thing happened again in that no traffic could pass through the device. However this time I was able to log in via winbox and reboot the switch. This fixed the problem immediately.
I now have the switch under strict monitoring with debug logging to file. If it happens again I will post my findings here.
SW01-2020may10.rsc
What has worked for me is downgrading the switch to version 2.2. Running stable for a week and no issues so far...
https://download2.mikrotik.com/swos2/cs ... 26-2.2.bin
I have had issues with the switch before, but I still attempted to put it in service again, since I suspected the previous problems were related to high temperature. But now, even in a cool environment, it starts to drop sessions randomly... Running 2.11, will try to downgrade to 2.7...
But, is this a crappy product or crappy SW?
2.11 was problematic for me. Switches would lock up daily, but as mentioned earlier in this thread they'd still pass some traffic, which meant that pings would go through, and if you have a ping monitor it won't always detect an issue. Downgraded to 2.2 and no issues yet. As I mentioned previously, I ran these switches for over a year without any issues, and the issues started when I did a firmware update.
Yup. I reverted to 2.7, which was the oldest version I had available. netpro25 reverted to 2.2, for some reason. Basically, I don't know which SwOS version is stable on the CSS326. Maybe even 2.10 would be OK? For now, 2.7 is stable for me...
There are reports of this problem with V2.10, mine included. V2.11 was our hope for a solution, but...
So, would it be a better and more stable solution, when buying a new switch, to go for a CRS326 instead?
I'm wanting to know this too, as I can return the CSS and pay the extra for CRS features I don't need if it means it is reliable...
I meant: would a CRS326 running RouterOS be a more stable solution?
Changing vendors would be "more stable".
I also have a CSS326-24G-2S+ and ran into this issue tonight. I was running 2.10 when I encountered the issue but have upgraded to 2.11 now. The switch was passing DHCP traffic sometimes (my computer would get an IP address, but not reliably). ICMP traffic seemed flawless. Nothing TCP-based that I tried worked (HTTPS, SSH, etc.).
Okay, exact same thing happened on 2.11 today. It's a real pain because, since ICMP traffic passes okay, I keep assuming it's a firewall issue when in fact the switch "just" needs a reboot. Is there a known safe firmware version that is newer than 2.2 and/or any possibility for Mikrotik to fix this issue?
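For anyone else hitting this and second-guessing their firewall: a quick client-side check that matches the symptom pattern reported in this thread (small pings keep working while larger packets do not). The target address below is just a placeholder for any host that should be reachable through the switch:

$ ping -c 3 192.168.88.10              # default 56-byte payload - usually still answered
$ ping -c 3 -s 1400 192.168.88.10      # large payload - times out when the switch is wedged

If the small ping succeeds but the large one gets no replies, it is almost certainly the switch lockup rather than a firewall rule, and a reboot of the switch should clear it.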
Did the 2.2 downgrade resolve this issue? I have two of these switches (CSS326-24G-2S+) in a data center and they are having this exact issue. The funny thing is they were running for about 1 year without an issue and just started having issues recently.
I've kept downgrading our CSS326 from 2.11 to 2.10 to 2.9 and so on; the issues stopped appearing on 2.2. It has been running for over two months without a problem.
For some strange reason, I can easily reproduce the problem by adopting and upgrading Unifi APs. Yesterday I had to adopt 6 to my controller and the switch hung on almost all of them.
I can also knock my CSS326-24G-2S+RM offline by playing around with my Unifi gear!
This is quite interesting. I've never used Ubiquiti, so I have no idea what "adopt a device" is. Some kind of discovery, for management purposes? That would explain why my two CSS326 didn't get this problem.
@Nom
Thanks for sharing.
It's not just updating the Unifi gear that knocks the switch, but adopting; I can reproduce this nearly every time I adopt a Unifi device. It's a shame if this problem does not get fixed, as I bought this switch to try out and really hoped to start using Mikrotik switches.
I have two of them (SwOS 2.11) connected to a CRS328 (RouterOS) through 10Gb fiber. Both switches are CSS326-24G-2S+RM with 2.11 firmware. A few months ago I enabled IPv6 in one of the routers and almost immediately, both switches locked up. This was documented in detail in my post on 6 April 2020.
Anyone running the new firmware notice any issues? How's it working?
My two CSS326-24G-2S+RM are working fine on 2.12 - so far.
P.S. As far as I know, you are the first person to report this lockup under ROS. This entire thread has been about SwOS.
I ran into that issue twice.
CRS326-24G-2S+
Current Firmware: ROS v6.44.5
No, he is not the first person... use search and you will find my posts in this thread about the same problem. Our problem was solved with a newer ROS.
Please start a new thread guys - this one is about the CSS326-24G-2S+RM, which is a SwOS-only device.
Had this problem several times with 2.11, I think. But now again on 2.13. The switch is not pingable, but it is seen in IP Neighbors, and there is a little traffic going through it. If the switch could somehow be auto power-cycled on malfunction, that would be wonderful. 24 students without internet is not good for our reputation and brings a lot of dissatisfaction. MIKROTIK, you can make it better! Waiting for a BUGFREE 2.14! We believe in you guys! :)
The bug that existed for several versions would allow pings and other small packets through, but anything over a couple hundred bytes (as I recall) would not work. That was solved with either 2.12 or 2.13 (I don't remember which offhand). One of my CSS326 switches that exhibited the bug now has an uptime of 405 days...
@jfreak53: if you can pinpoint the problem to certain packet contents, then you'd do MT (and humanity) a favour if you could sniff those frames and send MT the capture file. IMO this is the only way to allow MT to actually fix it. Unless they see those packets and analyze which combination of bits actually upsets the switch chip, they cannot fix the bug.
How do I do that on SwOS? I know the ports those units are on, so it would be easy - they also have their own VLAN - but how do I do that?
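SwOS has no packet sniffer of its own, so the capture has to happen off the switch. If your SwOS version offers port mirroring (or if you can simply capture on the Unifi controller host while re-adopting an AP), a Linux machine on the mirror or controller side can record the full frames with tcpdump. A rough sketch only - the interface name, output file and MAC address are placeholders for your own values:

$ sudo tcpdump -i eth0 -s 0 -w unifi-adopt.pcap ether host aa:bb:cc:dd:ee:ff

Run it while reproducing the hang (e.g. during the adoption), stop it with Ctrl-C, and attach the resulting .pcap to the support ticket so Mikrotik can see the exact frames involved.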