Uptime rollover bug/SNMP
Posted: Mon Nov 16, 2020 10:31 pm
by sathackr
Hello,
About 497 days ago we deployed our first Mikrotik CRS326 switches running RouterOS 6.44.3 into production.
Today they are becoming unreachable via SNMP one by one, and the system uptime shown in the Web UI makes it clear that the uptime counter is stored in 32 bits and has rolled over.
We suspect this is causing SNMP to fail.
Has there been any update in versions >6.44.3 to address this issue? We have over 400 of these switches deployed and do not want to have to track rebooting them every 497 days.
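For reference, the 497-day figure is what you get from a 32-bit counter of hundredths of a second, which is the unit SNMP TimeTicks use. The sketch below only reproduces that arithmetic; the function names are made up for illustration, and it assumes the device counter behaves like a plain unsigned 32-bit TimeTicks value.

```python
# Assumption: the uptime counter is an unsigned 32-bit count of
# hundredths of a second (the SNMP TimeTicks unit).
TICKS_PER_SECOND = 100
COUNTER_MODULUS = 2 ** 32
SECONDS_PER_DAY = 86400


def rollover_days(ticks_per_second=TICKS_PER_SECOND, modulus=COUNTER_MODULUS):
    """Days until a counter of this width and tick rate wraps to zero."""
    return modulus / ticks_per_second / SECONDS_PER_DAY


def reported_ticks(true_uptime_seconds):
    """Value a wrapping 32-bit tick counter would show for a given true uptime."""
    return (true_uptime_seconds * TICKS_PER_SECOND) % COUNTER_MODULUS


if __name__ == "__main__":
    print(f"Counter wraps after about {rollover_days():.1f} days")
```

After the wrap the counter starts again from zero, so a poller sees what looks like a recent reboot.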
Re: Uptime rollover bug/SNMP
Posted: Mon Nov 16, 2020 11:20 pm
by joegoldman
497 days is a long time to go without security upgrades etc.
Perhaps set up a yearly maintenance and upgrade cycle.
Or at the least - have SNMP monitoring start warning at day 450, and become critical at day 480.
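A minimal sketch of that threshold check, assuming the monitoring system has already polled sysUpTime and holds the raw TimeTicks value (hundredths of a second) as an integer. The function name and the returned status strings are hypothetical.

```python
# Thresholds from the suggestion above: warn at day 450, critical at day 480.
WARN_DAYS = 450
CRIT_DAYS = 480
TICKS_PER_DAY = 100 * 86400  # TimeTicks are hundredths of a second


def uptime_status(timeticks):
    """Classify a polled sysUpTime value against the rollover thresholds."""
    days = timeticks / TICKS_PER_DAY
    if days >= CRIT_DAYS:
        return "critical"
    if days >= WARN_DAYS:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for days in (10, 455, 490):
        print(days, uptime_status(days * TICKS_PER_DAY))
```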
Who knows - maybe uptime is 64bit int in newer version of RouterOS - a lot of new versions since your current one.
Re: Uptime rollover bug/SNMP
Posted: Mon Nov 16, 2020 11:50 pm
by mkx
The Linux kernel has had a 64-bit uptime counter (regardless of the hardware platform's "bitness") since version 2.6, which was released in mid-December 2003.
ROSv7 is built around a much newer Linux kernel, so the issue will be gone. Not with ROSv6 though: MT is not going to upgrade the kernel inside it (it's not a trivial task, and they have stuck with the same kernel for too long).
While I tend to agree that some minimum maintenance is the right thing to do, I don't see it as pressing for a switch where (almost) everything happens inside the ASIC / switch chip.
Re: Uptime rollover bug/SNMP
Posted: Wed Nov 18, 2020 11:54 pm
by sathackr
Yep -- also we are always hesitant to upgrade firmware unless there is a specific issue to address. The risk of firmware upgrade and even just a reboot is not zero. We know that 6.44.3 & 6.44.5 work very well on hundreds of switches and thousands of customers. We're not in a hurry to change it every month when there is a new firmware upgrade and/or potential new firmware regression.
More than a couple of times I've had an MT device fail after a firmware upgrade or a simple reboot (corrupt routerboot or corrupt flash, then self-recovery fails, causes an outage, and requires a subsequent truck roll).
We protect the devices with a robust firewall rule set, and while not perfectly secure, it serves our purposes.
The rollover bug itself isn't necessarily a problem, but SNMP dies somehow in connection with it and makes the devices unmonitorable.
Re: Uptime rollover bug/SNMP
Posted: Fri Jul 22, 2022 9:11 pm
by troylb
This whole issue with the uptime and 32-bit counters can be resolved without using 64-bit counters. The issue is that timeticks are used, which roll the counter over at 497 days. You can resolve this if you set up a counter that is 32-bit but counts seconds instead of timeticks. Most hardware manufacturers already have this through the OID SNMP-FRAMEWORK-MIB::snmpEngineTime.0, which can be used as an alternative to the timeticks counter. This can't be used on a Linux/Unix machine, because there the SNMP daemon can restart and the time would change, but it does not restart on router/switch gear, printers and other hardware.
Currently FSCOM, Cisco, Fortinet, peplink, Axis, UBNT Edge switches and Zyxel switches are all known to support this option. I don't believe it would be difficult to implement this on the Mikrotik, as it would just be another 32-bit INT value.
This would give you uptimes that would far exceed the life of the hardware.
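As a rough check of that claim: snmpEngineTime is defined in SNMP-FRAMEWORK-MIB as an integer number of seconds with a maximum of 2147483647. The sketch below compares that range with the TimeTicks range; the function name and the 365.25-day year are assumptions chosen for illustration.

```python
# snmpEngineTime: seconds, maximum value 2147483647 (2**31 - 1).
SNMP_ENGINE_TIME_MAX_SECONDS = 2 ** 31 - 1
# TimeTicks: hundredths of a second in an unsigned 32-bit value.
TIMETICKS_MAX_SECONDS = (2 ** 32) / 100

SECONDS_PER_YEAR = 365.25 * 86400  # assumption: average year length


def seconds_to_years(seconds):
    """Convert a duration in seconds to (average) years."""
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"TimeTicks range:      {seconds_to_years(TIMETICKS_MAX_SECONDS):.2f} years")
    print(f"snmpEngineTime range: {seconds_to_years(SNMP_ENGINE_TIME_MAX_SECONDS):.2f} years")
```

Counting whole seconds instead of hundredths stretches the range from about 1.4 years to roughly 68 years.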
Is there anyone at Mikrotik who might be able to comment on this? The issue of security updates for hardware that sits on private network space is a completely moot point, as those devices are not reachable from the outside and can run stable for years without issue.
Anyway, that is my two cents on this.
Best,
-Troy
Re: Uptime rollover bug/SNMP
Posted: Fri Jul 22, 2022 11:23 pm
by mkx
The issue was resolved in the 32-bit Linux kernel a long time ago (and was never a real issue in the 64-bit Linux kernel). Since v7 uses a fairly recent Linux kernel, uptime rollover won't be an issue for much longer. It'll only affect devices whose administrators don't care to upgrade the running software, but I guess most of those admins don't care about uptime too much either.
BTW, I suspect that the workaround for SNMP actually relies on the kernel having correct uptime info and only maps the value into a sustainable range. That makes it pretty impractical to implement in ROS v6 (even if the devs cared about such a minor nuisance) but much easier to implement in v7.
Re: Uptime rollover bug/SNMP
Posted: Fri Apr 28, 2023 6:39 pm
by troylb
Hello,
We created a solution that works for us as a workaround for the counters being 32-bit. It uses a script, and while I would not recommend making this available on devices where SNMP has not been restricted, the script is below. It produces an uptime that is accurate to 5 seconds. The number it generates is in seconds, so a basic SNMP script that converts it to years, months, weeks, days, hours, minutes and seconds can easily be written to report and/or alert on uptimes. The example is from one of our RB3011 units, but this works on all units that are running at least 6.34 RouterOS and switchOS. We have this running on CRS units, CCR units and RB units.
Hope this is useful. It certainly has been for us.
The SNMP OID is 1.3.6.1.4.1.14988.1.1.18.1.1.2.2
# model = RouterBOARD 3011UiAS
# serial number = 8EEB08F6B003
/system script
add dont-require-permissions=no name=seconds owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon source=\
":global a (\$a+5); :put \$a;"
add dont-require-permissions=no name=uptime_data owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon source=\
":global a; :put \$a;"
/system scheduler
add comment="Runs the seconds script that updates the variable \"a\" by 5 seconds" interval=5s name=Uptime_run on-event=seconds policy=\
ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon start-time=startup
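The conversion mentioned above could look something like the following sketch. It assumes the seconds value has already been read from the OID and parsed to an integer. The function names are hypothetical, and a year is taken as 365 days and a month as 30 days purely for display.

```python
# Unit sizes in seconds. Assumption: 365-day years and 30-day months,
# which is good enough for a human-readable uptime string.
UNITS = [
    ("years", 365 * 86400),
    ("months", 30 * 86400),
    ("weeks", 7 * 86400),
    ("days", 86400),
    ("hours", 3600),
    ("minutes", 60),
    ("seconds", 1),
]


def split_uptime(total_seconds):
    """Break a seconds counter into years, months, weeks, days, and so on."""
    parts = {}
    remaining = int(total_seconds)
    for name, size in UNITS:
        parts[name], remaining = divmod(remaining, size)
    return parts


def format_uptime(total_seconds):
    """Render the non-zero components as a comma-separated string."""
    parts = split_uptime(total_seconds)
    shown = [f"{value} {name}" for name, value in parts.items() if value]
    return ", ".join(shown) or "0 seconds"


if __name__ == "__main__":
    print(format_uptime(497 * 86400))
```

Because the device-side counter advances in 5-second steps, the seconds component will always be a multiple of 5.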