Leap second bug present on TILE devices?

Clbh · Wed Jul 01, 2015 3:50 am

Did anyone else see any TILE-based RouterOS devices go unresponsive at leap second insertion today? (00:00 UTC)

Three of my CCRs (one running 6.27 and two running 6.29.1) became unresponsive at 00:00 UTC on the dot. LCD screens were also unresponsive, and I couldn't get any output via serial console either. The only fix was a hard power cycle.

coylh · Wed Jul 01, 2015 3:53 am

Yes, they all crashed: http://forum.mikrotik.com/viewtopic.php?f=3&t=95455

drb · Wed Jul 01, 2015 6:47 am

All of our CCRs were impacted. Those running 6.28 required a hard reset. We had some on 6.17 that seemed to just reboot.

SwissWISP · Wed Jul 01, 2015 9:49 am

+1 here on two CCR 1016

Wed Jul 01, 2015 9:55 am

I can confirm that some CCR units experienced a crash due to the introduction of the leap second.

Only CCR units that use the client from the NTP npk package were affected. It currently seems the issue was in the Linux kernel; the bug has been fixed there, but RouterOS did not have this kernel fix yet.

If the CCR uses the default SNTP client (i.e. NTP.npk is not installed) then nothing happened.

paoloaga · Wed Jul 01, 2015 9:56 am

I can confirm that some CCR units experienced a crash due to the introduction of the leap second.

Only CCR units that use the client from the NTP npk package were affected. It currently seems the issue was in the Linux kernel; the bug has been fixed there, but RouterOS did not have this kernel fix yet.

If the CCR uses the default SNTP client (i.e. NTP.npk is not installed) then nothing happened.

False. 95% of my routers don't have the NTP package installed and they all crashed badly.

bmv · Wed Jul 01, 2015 9:58 am

I can confirm we suffered from this problem at the stroke of 00:00 GMT.

http://forum.mikrotik.com/viewtopic.php ... 59#p488659

2 of our 3 CCRs failed and required a hard power cycle to function again.

All CCRs are on 6.27 and have the NTP package installed, including the 1 that stayed online.

Not good that we find out about this problem before the leap second event.

MartijnVdS · Wed Jul 01, 2015 10:08 am

My CCR1009 (v6.29.1), using "SNTP Client" (NTP package not installed) and no BGP, didn't hang or reboot.

madman2233 · Wed Jul 01, 2015 10:14 am

All of my border routers (ALL CCR) that were synced with an ntp.org pool crashed. All of my edge routers were synced to the border routers, so I didn't have to power cycle those. It still caused a significant outage. Unfortunately, this is the last straw and I will be replacing all these devices, border or not, with more reliable Cisco equipment.

If MikroTik worked more closely with the open source Linux community, I'm sure this wouldn't have happened.

madman2233 · Wed Jul 01, 2015 10:33 am

We have found how to fix the issue in the kernel, fix is coming soon.

Little too late, don't you think? When is the next leap second? I won't have any mikrotik devices on my networks when it happens.

Wed Jul 01, 2015 10:49 am

Little too late, don't you think?

For this one, yes, but next leap second will be added in around 2 years.
Could you please tell me if you had NTP package on all the servers, or you used SNTP?

antondollmaier · Wed Jul 01, 2015 11:03 am

If I may respond as well ...

Could you please tell me if you had NTP package on all the servers, or you used SNTP?

NTP, no SNTP. 6.29, build time May/27/2015 11:19:36.

paoloaga · Wed Jul 01, 2015 11:07 am

next leap second will be added in around 2 years.

Unless a bug in the hardware driver of some NTP server triggers an unexpected leap second (like it happened to me on 1st April, http://forum.mikrotik.com/viewtopic.php?f=3&t=95455 ), or unless a malicious user wants to bring down an entire ISP network by hacking one public NTP server.

viacomkft · Wed Jul 01, 2015 11:31 am

Dear Normis!

Besides the probable driver/NTPd bug in the kernel, I don't understand why the routers hung and why they were not restarted by the hardware watchdog. As I remember, Tilera processors have a hardware watchdog, and it seems it doesn't function properly!

Should we trust in the watchdog in these cases? The main problem is that operators had to restart the routers on-site.

We have found how to fix the issue in the kernel, fix is coming soon.

wenasong · Wed Jul 01, 2015 11:34 am

Happened to all our CCRs today.

Multinational meltdown on our BGP networks; right now network engineers are running around to physically reboot them.

mstead · Wed Jul 01, 2015 11:52 am

Is it just me or is Normis incapable of saying "sorry - we screwed up"?? I have read through all his replies and I don't see the apology anywhere - but frankly I am not in the least bit surprised....

paoloaga · Wed Jul 01, 2015 12:00 pm

Let me explain a bit more what happened tonight in my situation:

Being aware of the problems that the leap second caused accidentally to my routers on 1st April, I disabled NTP ( /system ntp client set enabled=no ) on every router, except those I could reach easily. The result is that the routers with NTP disabled didn't crash.

The ones where NTP hadn't been disabled all crashed, including those that didn't have the NTP package installed. This means that if the NTP package causes the problem, there is a chance that it also causes something else to fail. For example, a BGP routing update could trigger the bug/crash and somehow propagate it to other routers (this is just a guess).

The point of this post is to warn and emphasize that I found MANY routers in an unresponsive state, and only a couple of them had the NTP package installed.
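
For anyone who wants to script the workaround described above, here is a minimal Python sketch that decides which NTP client command to push based on how close the current time is to the leap second. The `enabled=no` command is the one quoted in this post; the 24-hour guard window and the function name `ntp_client_command` are assumptions made up for illustration, and nothing here has been tested against a real router.

```python
from datetime import datetime, timedelta, timezone

# First instant after the 30 June 2015 leap second (00:00:00 UTC on 1 July).
LEAP_BOUNDARY = datetime(2015, 7, 1, 0, 0, 0, tzinfo=timezone.utc)

# Hypothetical safety margin on either side of the boundary (an assumption).
GUARD = timedelta(hours=24)


def ntp_client_command(now, boundary=LEAP_BOUNDARY, guard=GUARD):
    """Return the RouterOS command to apply at time `now`.

    Inside the guard window the NTP client is switched off;
    outside it the client is switched back on.
    """
    inside_window = abs(now - boundary) <= guard
    enabled = "no" if inside_window else "yes"
    return f"/system ntp client set enabled={enabled}"


if __name__ == "__main__":
    for when in (
        datetime(2015, 6, 29, 12, 0, tzinfo=timezone.utc),
        datetime(2015, 6, 30, 23, 0, tzinfo=timezone.utc),
        datetime(2015, 7, 2, 12, 0, tzinfo=timezone.utc),
    ):
        print(when.isoformat(), "->", ntp_client_command(when))
```

The sketch only produces the command string; how it is delivered to each router (SSH, API, scheduler) is left open on purpose.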

pukkita · Wed Jul 01, 2015 12:09 pm

CCR1009, CCR1016 and CCR1036, 6.19 and 6.27, sntp client. None crashed.

alexcherry · Wed Jul 01, 2015 12:12 pm

Hi, we have Mikrotiks everywhere + around 20 CCRs. NTP is configured on all devices in our network.
Conclusion is below :
1. RBs with NTP client or SNTP client were not affected. versions from 6.13 to 6.28
2. Affected were only CCRs with version after 6.20 and NTP client running on it.
For example, a CCR with 6.18 was not affected, even though it had NTP running.

Wed Jul 01, 2015 12:14 pm

Hi, we have Mikrotiks everywhere + around 20 CCRs. NTP is configured on all devices in our network.
Conclusion is below :
1. RBs with NTP client or SNTP client were not affected. versions from 6.13 to 6.28
2. Affected were only CCRs with version after 6.20 and NTP client running on it.
For example, a CCR with 6.18 was not affected, even though it had NTP running.

That is very interesting; maybe those units used a different NTP server? The NTP package and kernel were not changed in 6.18, or in any other v6 version.

eehan · Wed Jul 01, 2015 12:23 pm

Hi,

I can report the same issue here. I have two Routerboard CCR1036-12G-4S in operation. Only one was affected.

Both routerboards synchronise with external Australian official NTP pool servers.

The unit that froze up was running v6.28, whereas the unit that was not affected was running v6.15.

The unit that froze up was a lab router. I did not realise there was an issue until I was in the lab 2 hours later and I could not log in to anything. The CCR required a power cycle to restore operation.
It wasn't until I checked the logs a few hours later that I saw the last entry before it froze was 09:59:57 (the leap second occurred at 10am local time) and so here we are...
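
To line up log timestamps like this one with the leap second, a small Python sketch can convert the UTC boundary into local wall-clock time. The fixed UTC offsets below are assumptions chosen for illustration (UTC+10 matches the "10am local time" mentioned here); Python's `datetime` cannot represent 23:59:60 itself, so the sketch uses the first instant after the inserted second.

```python
from datetime import datetime, timedelta, timezone

# First instant after the inserted leap second (23:59:60 UTC on 30 June 2015).
AFTER_LEAP_UTC = datetime(2015, 7, 1, 0, 0, 0, tzinfo=timezone.utc)

# Fixed offsets chosen for illustration (assumptions, not taken from the posts).
OFFSETS = {
    "UTC+10 (eastern Australia)": timedelta(hours=10),
    "UTC+2 (central Europe, summer time)": timedelta(hours=2),
    "UTC-7 (US Pacific, daylight time)": timedelta(hours=-7),
}


def local_time(offset):
    """Convert the post-leap-second instant to a fixed-offset local time."""
    return AFTER_LEAP_UTC.astimezone(timezone(offset))


if __name__ == "__main__":
    for label, offset in OFFSETS.items():
        print(f"{label}: {local_time(offset):%Y-%m-%d %H:%M:%S}")
```

With these offsets the boundary falls at 10:00 on 1 July, 02:00 on 1 July and 17:00 on 30 June respectively, so a last log entry a few seconds before one of those times points at the leap second.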

maznu · Wed Jul 01, 2015 12:29 pm

Two CCR1009-8G-1S-1S+

Both v6.27 with v3.22 firmware.
Both with "ntp" package installed and enabled.
Both with "NTP Server" enabled.
Both with "NTP Client" enabled and syncing to (different) NTP servers.

One crashed at 23:59:60.
One running ok.

rudihnio · Wed Jul 01, 2015 12:31 pm

Hi,

We have about 65 CCR1036 in service, with NTP-package enabled and NTP used to 2 MT1100AHx2 as NTP-Server.
About 90%-95% crashed at exactly 2h00 CET.

But we experienced 2 versions of the problem:
Release 6.24, all those rebooted and came back after watchdog-timer
Release 6.27 & 6.28 stuck, until Power-cycle on site.

Perhaps this info helps.

Steve

Clbh · Wed Jul 01, 2015 12:40 pm

2x CCR1036-8G-2S+ running 6.29.1 w/ NTP package installed (client enabled, server disabled)
1x CCR1016-12G-1S+ running 6.27 w/ NTP package installed (client enabled, server disabled)

All those configurations crashed for me and were not rebooted by the watchdog (required hard power cycle; thank god for out of band access & switched PDU outlets).

czolo · Wed Jul 01, 2015 1:05 pm

Hi Normis
In our network we are using the NTP package on all routers. All our CCRs crashed today. I couldn't connect, and only a hard power reboot helped. I tried touching the LCD, but nothing happened. It seems that the devices were in a hung state.

When can we expect the fix?

Wed Jul 01, 2015 1:09 pm

Status update!

We have found that this issue is related to a Linux Kernel issue that was patched in Linux Kernel v3.4 (RouterOS v6 uses Linux Kernel v3.3.5).

The problem happens only if the following criteria are met:
1) 64bit RouterOS (only tile)
2) any RouterOS v6.x
3) installed and synchronized NTP client from NTP package (NOT the default SNTP client)
4) synchronization to server that have proper Leap Second implementation, not just time adjustment on next synchronization

We have currently not been able to reproduce the issue on the default SNTP client (non NTP package) or confirm that problem doesn't happen on older RouterOS versions - we are still working on this issue, so we might confirm those later.
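
The four criteria in this status update can be written down as a simple check, for example to filter an inventory. This is only a sketch of the list as stated here: the `Router` record, its field names and `matches_criteria` are hypothetical, and it does not account for the reports elsewhere in the thread of crashes outside these criteria.

```python
from dataclasses import dataclass


@dataclass
class Router:
    # All field names are hypothetical; fill them from your own inventory.
    architecture: str            # e.g. "tile", as shown by "system resource print"
    version: str                 # RouterOS version string, e.g. "6.28"
    ntp_package_synced: bool     # NTP package client installed and synchronized
    server_sets_leap_flag: bool  # upstream server announces the leap second


def matches_criteria(r: Router) -> bool:
    """True only if all four criteria from the status update hold."""
    return (
        r.architecture == "tile"                # 1) 64bit RouterOS (tile only)
        and r.version.split(".")[0] == "6"      # 2) any RouterOS v6.x
        and r.ntp_package_synced                # 3) NTP package client, not SNTP
        and r.server_sets_leap_flag             # 4) server with leap second flag
    )


if __name__ == "__main__":
    ccr = Router("tile", "6.28", True, True)
    sntp_only = Router("tile", "6.28", False, True)
    print(matches_criteria(ccr), matches_criteria(sntp_only))
```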

maznu · Wed Jul 01, 2015 1:23 pm

[quote="normis"]4) synchronization to server that have proper Leap Second implementation, not just time adjustment on next synchronization
[/quote]

This is interesting.

Of the two identical CCRs we have, the one that crashed was synchronised to our stratum 1 DCF time server. Our DCF time server was running at stratum 2 at the time, because of signal quality problems. During the night, our "DCF" server was actually synchronised via NTP to our stratum 1 GPS+PPS time server instead.

The CCR that kept running was synchronised to our stratum 1 GPS+PPS time server.

Both our DCF and GPS+PPS time servers run the same version of ntpd, on Debian Linux.

marrold · Wed Jul 01, 2015 1:33 pm

Yes, they all crashed: http://forum.mikrotik.com/viewtopic.php?f=3&t=95455

that post is from April?

Wed Jul 01, 2015 1:41 pm

Same leap second problem, only caused by misconfiguration in those particular NTP servers.
It was an early warning that was missed by everyone.

But you must admit - the choice of date was just too good

marting · Wed Jul 01, 2015 1:44 pm

Yes, they all crashed: http://forum.mikrotik.com/viewtopic.php?f=3&t=95455
that post is from April?

When you scroll down, you will find more recent posts from today. The reason is the same. On 1st of April there was a leap second insertion on some Italian nameservers and today it was worldwide: http://forum.mikrotik.com/viewtopic.php ... 99#p488599
My CCR1036-12G-4S also crashed completely (NTP package enabled).

eehan · Wed Jul 01, 2015 2:15 pm

The problem happens only if the following criteria are met:
1) 64bit RouterOS (only tile)
2) any RouterOS v6.x
3) installed and synchronized NTP client from NTP package (NOT the default SNTP client)
4) synchronization to server that have proper Leap Second implementation, not just time adjustment on next synchronization

Can you please explain what you mean by "server that have proper Leap Second implementation" ? That is, how does the "proper" implementation differ from "time adjustment on next synchronization" ?

rudihnio · Wed Jul 01, 2015 2:36 pm

Status update!

4) synchronization to server that have proper Leap Second implementation, not just time adjustment on next synchronization

Does the MT1100AHx2 on release 5.26 have this proper Leap Second implementation?

What changed between 6.24 where we had 50 CCR's rebooted by watchdog timer and newer release, as those needed power cycle on site?

Thanks,
Steve

Wed Jul 01, 2015 3:07 pm

Can you please explain what you mean by "server that have proper Leap Second implementation" ? That is, how does the "proper" implementation differ from "time adjustment on next synchronization" ?

This:

The NTP packet includes a leap second flag, which informs the user that a leap second is imminent. This, among other things, allows the user to distinguish between a bad measurement that should be ignored and a genuine leap second that should be followed.
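
For reference, the leap second flag mentioned above is the 2-bit Leap Indicator in the first byte of the NTP header (RFC 5905), followed by a 3-bit version and a 3-bit mode. A minimal Python sketch of reading it is below; the function name `parse_first_byte` and the hand-built sample packet are made up for illustration and do not come from any real capture.

```python
# Leap Indicator values as defined in RFC 5905.
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock unsynchronized)",
}


def parse_first_byte(packet: bytes):
    """Return (leap_indicator, version, mode) from an NTP packet."""
    if len(packet) < 1:
        raise ValueError("empty packet")
    first = packet[0]
    leap = first >> 6             # top 2 bits
    version = (first >> 3) & 0x7  # next 3 bits
    mode = first & 0x7            # low 3 bits
    return leap, version, mode


if __name__ == "__main__":
    # Hand-built 48-byte header: LI=1, version 4, mode 4 (server).
    sample = bytes([0b01_100_100]) + bytes(47)
    leap, version, mode = parse_first_byte(sample)
    print(f"LI={leap} ({LEAP_MEANINGS[leap]}), version={version}, mode={mode}")
```

A server that sets LI to 1 ahead of midnight UTC is announcing a leap second insertion; a server that leaves LI at 0 and simply steps its clock afterwards is not.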

Wed Jul 01, 2015 3:07 pm

Does the MT1100AHx2 on release 5.26 have this proper Leap Second implementation.

only CCR was affected by this. RB1100 and all other devices worked fine

rudihnio · Wed Jul 01, 2015 3:10 pm

Does the MT1100AHx2 on release 5.26 have this proper Leap Second implementation.
only CCR was affected by this. RB1100 and all other devices worked fine

Normis, what I meant was: does the NTP server running release 5.26 on the MT1100AHx2 have this proper SERVER implementation?

japaeye4u · Wed Jul 01, 2015 3:26 pm

We have about 150 CCRs (1009/1016/1036), and 24 CCRs with versions above 6.24 crashed.
All of them (about 150) have the NTP client package installed and enabled.
I am now checking whether there are CCRs with the same versions that did not crash. I will return soon with new information.

jakubj · Wed Jul 01, 2015 3:31 pm

We have two CCR1016-12G running RouterOS v6.27 with firmware v3.22. One crashed and the other did not, so I'm not sure why one but not the other... odd.

petrisimo · Wed Jul 01, 2015 3:59 pm

Hello Mikrotik,

Which ROS and firmware versions are safe to use on CCR1036-8G-2S+EM?

+++

> system health print
fan-mode: auto
use-fan: main
active-fan: main
use-fan2: main
active-fan2: main
cpu-overtemp-check: yes
cpu-overtemp-threshold: 100C
cpu-overtemp-startup-delay: 1m
voltage: 23.6V
current: 1906mA
temperature: 40C
cpu-temperature: 54C
power-consumption: 44.9W
fan1-speed: 10155RPM
fan2-speed: 9953RPM

> system resource print
uptime: 30m7s
version: 6.28
build-time: Apr/15/2015 15:18:31
free-memory: 3745.2MiB
total-memory: 3966.7MiB
cpu: tilegx
cpu-count: 36
cpu-frequency: 1200MHz
cpu-load: 0%
free-hdd-space: 703.6MiB
total-hdd-space: 1024.0MiB
architecture-name: tile
board-name: CCR1036-8G-2S+
platform: MikroTik

> system routerboard print
routerboard: yes
model: CCR1036-8G-2S+
serial-number: 52A002DC3D1F
current-firmware: 3.18
upgrade-firmware: 3.22

> system package print
Flags: X - disabled
 #   NAME            VERSION  SCHEDULED
 0   ntp             6.28
 1   routeros-tile   6.28
 2   system          6.28
 3 X wireless-fp     6.28
 4 X ipv6            6.28
 5 X wireless        6.28
 6 X hotspot         6.28
 7   dhcp            6.28
 8   mpls            6.28
 9   routing         6.28
10   ppp             6.28
11   security        6.28
12   advanced-tools  6.28
13 X openflow        6.28
14   multicast       6.28
denke · Wed Jul 01, 2015 4:49 pm

Hello Normis!

I have another question for you:

We have 2 CCR1036 which were connected to the internet, with NTPD enabled and synced at the time of the incident. Both were affected, and both had WatchDog enabled.

As far as I know, Tile CPUs have a hardware watchdog integrated, and it is supported in the Linux kernel, so the watchdog feature should have been hardware based.

Why didn't the watchdog reset the routers when they were both frozen solid?

royalpublishing · Wed Jul 01, 2015 4:54 pm

All 6 of my CCR's crashed and burned this morning because of the leap second issue so I came into an office of nothing working and all of my branch offices being down and all the routers required a cold boot. Fun times.

Beeski · Wed Jul 01, 2015 5:11 pm

We have over 20 CCR's in production.
The only one that locked up was acting as an NTP Server.
All other CCR's are NTP clients that sync with our Cisco Core/Edge router.

StubArea51 · Wed Jul 01, 2015 5:37 pm

It would have been helpful to have a patch from MikroTik, but for most of the customer networks we manage, we began leap second planning a while ago and removed any equipment from an NTP server that was suspect until the leap second passed and then re-enabled it. That proved to be a very simple, yet effective mitigation technique to script even on some of the larger networks we work on (50,000+ network devices)

paoloaga · Wed Jul 01, 2015 6:07 pm

It would have been helpful to have a patch from MikroTik, but for most of the customer networks we manage, we began leap second planning a while ago and removed any equipment from an NTP server that was suspect until the leap second passed and then re-enabled it. That proved to be a very simple, yet effective mitigation technique to script even on some of the larger networks we work on (50,000+ network devices)

This won't protect you from unexpected (wrong) leap seconds. A few public NTP servers here in Italy were affected during March and applied the leap second on 1st of April. A deeper investigation traced the cause to a bug in a hardware clock driver...

This also won't protect you from a malicious hacker who could break into a public NTP server and crash the whole network.

rkj · Wed Jul 01, 2015 6:12 pm

Leap Second was a one time only event. It has passed. You can use any release now.

We will make a fix today that will make sure you don't see this issue again in 2-3 years, when next leap second happens

One CCR crashed just 10 minutes ago, so it might not be a one time event.

Also, some users reported the same issue with SXT using RouterOS 6.27 starting 0000 UTC, and some reported while not using NTP.

So, using NTP on CCR is sure to be the largest contributing factor, but it's not 100% limited to that scope.

lele · Thu Jul 02, 2015 12:38 am

So, using NTP on CCR is sure to be the largest contributing factor, but it's not 100% limited to that scope.

While unrelated bugs can't be ruled out, this specific issue is tied to the processing of a leap second event from the NTP subsystem to the linux kernel.

So it can not happen if

you are not using some kind of NTP client
there are no leap seconds (real or spurious) propagated through NTP

Also, the specific occurrence of April 1 cannot repeat now, because that ntpd bug (there was an ntpd bug behind the spurious leap-second propagation) required a real leap second to happen within 3 months.

scampbell · Thu Jul 02, 2015 2:38 am

Little too late, don't you think?
For this one, yes, but next leap second will be added in around 2 years.
Could you please tell me if you had NTP package on all the servers, or you used SNTP?

I can confirm CCRs with SNTP were OK and CCRs with NTP crashed and became unresponsive.

StubArea51 · Thu Jul 02, 2015 3:42 am

It would have been helpful to have a patch from MikroTik, but for most of the customer networks we manage, we began leap second planning a while ago and removed any equipment from an NTP server that was suspect until the leap second passed and then re-enabled it. That proved to be a very simple, yet effective mitigation technique to script even on some of the larger networks we work on (50,000+ network devices)
This won't protect you from unexpected (wrong) leap seconds. A few public NTP servers here in Italy were affected during March and applied the leap second on 1st of April. A deeper investigation traced the cause to a bug in a hardware clock driver...

This also won't protect you from a malicious hacker who could break into a public NTP server and crash the whole network.

Certainly getting the code patched is the ideal, but planning for a known network issue that will happen at a specific date and time and defending against daily attacks are two different animals.

darkorigins · Thu Jul 02, 2015 9:34 am

Having also suffered from all our CCRs locking solid (no network, serial or LCD), what I would now like to know is:

What happened to the watchdog? This was enabled on all devices but failed to save all but a couple of them.

Mark

wildbill442 · Thu Jul 02, 2015 7:48 pm

I can confirm that my CCRs running NTP-server package all Crashed @ 5:00PM PST Jun/29/15 (00:00 GMT).

Only fix was physical reboot of routers.

CCRs using the NTP Client to set the time were not affected.

All CCRs were CCR1036-12G-4S

Fri Jul 03, 2015 8:40 am

Official thread: http://forum.mikrotik.com/viewtopic.php?f=21&t=98224
