Bug? Custom MetaROUTER kernels no longer work on PPC w/ ROS6

NathanA · Fri Oct 04, 2013 9:56 am

As the subject says, it would appear that something changed in RouterOS 6 for PowerPC that broke the ability to run custom Linux kernels built with MikroTik's MetaROUTER patches (v1.2).

RouterOS 6 for MIPS still works fine: I can continue to use my custom images on MIPSBE boards after upgrading from RouterOS 5 to RouterOS 6 without a problem. But on PPC, a custom image will not boot. Nothing ever shows up on the console. And it is not just that the console is broken: this image automatically runs a DHCP client on eth0, but after waiting over 10 minutes, no network traffic has been seen coming from the VIF attached to the guest. So I know that it is not even running.

When I downgrade back to 5.26, the exact same images start working again.

I have no idea what on earth might have changed that would break just custom PowerPC kernels and not the kernel that comes with RouterOS. I examined the PPC MetaROUTER guest code in MikroTik's patches to Linux kernel 3.3.5 (the version that RouterOS 6 is built on top of), and was surprised to see that there are no significant differences between that code and the v1.2 MetaROUTER patches for 2.6.31, and certainly none that would explain why these kernels won't boot under RouterOS 6's MetaROUTER host. (Specifically, I looked at arch/powerpc/platforms/85xx/metarouter.c and arch/powerpc/include/asm/vm.h)

Furthermore, I have successfully compiled Linux 2.6.35 PPC MetaROUTER guest kernels from MikroTik's own 2.6.35 patches, and these kernels boot just fine under RouterOS 5, but these same kernels also won't boot up under RouterOS 6. I compared the MetaROUTER guest code between these 2.6.35 patches and the 3.3.5 patches, and there are NO differences at all between them! The code is exactly the same.

Is there anybody else out there who is familiar with Linux on PowerPC and who has sharper eyes than I who might be able to tell me either what I'm doing wrong or how the issue can be worked around? The fact that nothing even shows up in the console makes it incredibly hard to know where to start...

I ran my tests on an RB1100 (800MHz, single-core). I can supply test MetaROUTER guest images upon request.

-- Nathan

xrajox · Mon Oct 21, 2013 10:15 am

Hello,
I have the same problem on my RB800 running ROS 6.5. I tried four MetaROUTER builds and none of them worked.
On an RB951G-2HnD running ROS 6.4 there is no problem with MetaROUTER...

Rajo

janisk · Mon Oct 21, 2013 11:46 am

Please look in the forum, where users report up to which OpenWRT build number the patch from the wiki applies cleanly.

NathanA · Tue Oct 22, 2013 1:15 pm

I have the same problem on my RB800 running ROS 6.5. I tried four MetaROUTER builds and none of them worked. On an RB951G-2HnD running ROS 6.4 there is no problem with MetaROUTER...

Right. RB951G is MIPSBE, and non-RouterOS MetaROUTER guests still work in RouterOS 6.x. RB800 is PowerPC, and non-RouterOS MetaROUTER guests do NOT work in RouterOS 6.x.

Please look in the forum, where users report up to which OpenWRT build number the patch from the wiki applies cleanly.

janisk, you are missing the point. This has nothing to do with the OpenWRT patch and build process. I am not having problems building OpenWRT with official MetaROUTER patches. I am checking out a version of OpenWRT from their SVN that your official patch applies to cleanly. The problem is that the same OpenWRT TGZ that boots just fine on PPC RouterOS 5.x does not even boot on PPC RouterOS 6.x. Something changed in the PowerPC MetaROUTER host code, and MikroTik has not told us what that change is (the necessary guest changes are not included in the Linux 3.3.5 patches that MikroTik published).

MikroTik either needs to reverse the incompatible change in RouterOS 6.6 so that older OpenWRT images continue to work, or publish the guest changes so that we can rebuild our PowerPC OpenWRT images to work under RouterOS 6.x (and hopefully the same images can work on both 5.x and 6.x, like they already do on MIPSBE).

-- Nathan

xrajox · Fri Nov 08, 2013 10:52 am

I tried two PPC MetaROUTER images on RouterOS 6.6, but both images still freeze at boot time (without any message).
(Both images work under 5.x.)

To janisk: could you please try running a MetaROUTER guest on a PPC RouterBoard?

NathanA · Wed Jan 08, 2014 3:22 am

Update: Sergejs responded to my ticket (2013100466000193) on December 9: "Thank you very much for pointing that out. It will be difficult for us to look for the solution, however we will see what we can do." So now we wait.

(I'm not sure why it will be difficult...MikroTik's own kernel runs just fine in RouterOS 6 MetaROUTER, so it seems like they obviously know how to make it work under some circumstances, and there must be something different about their kernel vs. ours.)

-- Nathan

BeepDog · Tue Jan 21, 2014 3:44 am

Crap. I just bought an RB800, and was going to use their MetaROUTER tech to be able to run a more powerful version of OpenVPN, as well as AICCU. Hopefully they find the solution soon.

BeepDog · Wed Jan 22, 2014 1:36 am

@NathanA, how do I get added to that ticket as an interested party?

janisk · Thu Jun 12, 2014 4:02 pm

For OpenWRT use RouterOS 6.15 where this problem is resolved.

BeepDog · Thu Jun 12, 2014 6:36 pm

Today is best day! I'll give it a shot promptly!

NathanA · Thu Nov 13, 2014 1:58 pm

MetaROUTER on PowerPC is broken again on 6.21, 6.21.1, and 6.22, and this time for all guests, including RouterOS. If you use MetaROUTER on PowerPC, do not upgrade to these versions. I opened a ticket (2014111366000271).

-- Nathan

BeepDog · Thu Nov 13, 2014 8:51 pm

MetaROUTER on PowerPC is broken again on 6.21, 6.21.1, and 6.22, and this time for all guests, including RouterOS. If you use MetaROUTER on PowerPC, do not upgrade to these versions. I opened a ticket (2014111366000271).

-- Nathan

Thanks for posting here to let us know, I appreciate it.

NathanA · Wed Nov 26, 2014 6:52 am

FYI, RouterOS 6.23rc7 fixes this bug. Thanks, MikroTik devs, for getting this one turned around so quickly.

-- Nathan

janisk · Wed Nov 26, 2014 12:45 pm

not-booting down, console-reconnection-hang to go.

NathanA · Wed Nov 26, 2014 2:54 pm

not-booting down, console-reconnection-hang to go.

Actually, RB1xxx reboots are probably more serious than the console bug, which is merely inconvenient because it means I have to make sure I stay away from the console and only talk to MetaROUTERs over the network.

-- Nathan

janisk · Wed Nov 26, 2014 4:08 pm

It was hard to test the console bug without MetaROUTER guests booting.

Regarding RB1100 crashes - most probably we will wait for PPC SMP virtualization to arrive.

NathanA · Wed Nov 26, 2014 4:49 pm

Regarding RB1100 crashes - most probably we will wait for PPC SMP virtualization to arrive.

Well, since there is no way of knowing when that will arrive, I will still continue to run a few tests. If I make the bug easy to reproduce, perhaps you will discover that the fix is easy to code.

One thought: maybe it is related to PCI bus issues? On RB1xxx, some ethernet interfaces are PCI-based. On RB850Gx2, SoC ethernet is exposed via a custom bus/interface/API.

-- Nathan

---

EDIT:

It was hard to test the console bug without MetaROUTER guests booting.

I'm not sure I understand your meaning here. Are you saying that you had trouble testing the console bug because your RouterBoard kept crashing and rebooting? If so, then actually, that is really funny...

I suppose you could just use a hacked RB850Gx2 to test for other bugs without constant crashing.

janisk · Thu Nov 27, 2014 10:45 am

RB800 is fine for MetaROUTER tests.

And I would not call not booting at all "constant crashing".

NathanA · Thu Nov 27, 2014 2:14 pm

And I would not call not booting at all "constant crashing".

Oh, you are talking about the fix that came in 6.23rc7. Got it. I wasn't sure what you were referring to.

-- Nathan

BeepDog · Fri Nov 28, 2014 6:53 pm

RB800 is fine for MetaROUTER tests.

Does this mean the RB800s are good to go for MetaROUTER stuff? (Well, for rc7 at least.)

Thanks,
David

NathanA · Sat Nov 29, 2014 10:15 am

I just sent the following in an e-mail to MikroTik Support (ticket #2014112966000099):

I have stumbled upon a really simple way to very reliably reproduce a crash and reboot scenario on RB1000/1100/1100AH products when using MetaROUTER. This particular scenario only appears to occur on RouterOS 6 (I tested as far back as 6.7); if I try the same thing on 5.26, it doesn't reboot. I know that there were random occurrences of crashes and reboots with MetaROUTER on PowerPC boards in RouterOS 5 that were seemingly never resolved, and it is possible that this particular crash is unrelated and that the underlying reason for this crash is not the only cause of MetaROUTER-related crashes on these boards. Still, I am hopeful that most or all of the reasons for crashes are related somehow, and that by investigating this particular crash and finding a fix for it, you might end up actually fixing the majority of other MetaROUTER-related crashes.

It turns out that triggering a crash on RB1000/1100/1100AH with MetaROUTER on RouterOS 6 is so simple that I'm surprised I haven't run across it earlier. It's also possible that you are already aware of this method. Basically, you just have to be running a MetaROUTER within the first 5 minutes of uptime after booting RouterOS. That's it. As long as you start a MetaROUTER sometime between 0 and 5 minutes after bootup, the router will crash at almost exactly the 5 minute mark, regardless of how long the MetaROUTER has been running: it will crash at around 5m of uptime whether you start the MetaROUTER at boot or at 4m45s of uptime. The CPU doesn't even have to be busy, and the MetaROUTER doesn't even need to have any interfaces added to it. I have tried this on an RB1000, an RB1100, and 2 RB1100AH boards. They all behave *exactly* the same way.

It appears that something is happening just before the 5 minute uptime threshold is crossed, even when a MetaROUTER is not running. If you boot up an RB1xxx, and then connect to it and start running "/system resource print interval=1" on the console or a Winbox terminal, you will see that at "4m55s" of uptime, the value of "Uptime" suddenly skips ahead by 10 seconds to "5m5s", and then it will stay at "5m5s" for 10 seconds, as if it is waiting for the real, internal uptime clock to catch up. (Winbox gets confused by this, and it starts counting down *backwards* to 5m0s, and then back up again.) At "5m6s" of uptime, after the "Uptime" clock matches reality again, if you are running a MetaROUTER, the RB1xxx will crash and reboot itself. If you are not running a MetaROUTER, it will not crash and reboot.
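A check like the one described above could be automated rather than watched by eye. The following is only a rough sketch in Python: it assumes the uptime values have already been collected as strings in the "4m55s" style shown by "/system resource print", one per second, and the helper names (parse_uptime, find_skips) and the sample readings are made up for illustration. It does not talk to a router.

```python
import re

# Assumed format: RouterOS-style uptime strings such as "4m55s" or "1d2h3m4s".
_UNIT_SECONDS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def parse_uptime(text):
    """Convert an uptime string like '4m55s' into a number of seconds."""
    tokens = _TOKEN.findall(text)
    # Reject strings that contain anything other than number+unit pairs.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"unrecognised uptime string: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens)


def find_skips(samples, interval=1, tolerance=1):
    """Return (previous, current, delta) for each consecutive pair whose
    difference deviates from the polling interval by more than tolerance."""
    skips = []
    for prev, cur in zip(samples, samples[1:]):
        delta = parse_uptime(cur) - parse_uptime(prev)
        if abs(delta - interval) > tolerance:
            skips.append((prev, cur, delta))
    return skips


# Hypothetical readings, one per second, shaped like the behaviour described
# above: the counter jumps to 5m5s and then sits there for a while.
samples = ["4m53s", "4m54s", "5m5s", "5m5s", "5m5s", "5m6s"]
print(find_skips(samples))
```

With the default tolerance of one second, only the forward jump is reported; the repeated "5m5s" readings are within tolerance and are not flagged. Lowering the tolerance to 0 would also flag the frozen readings.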

This strange "uptime skips ahead 10 seconds when it reaches 4m55s" bug does not occur in RouterOS 5 for PowerPC. It *does* occur on virtually every version of PPC RouterOS 6 (including 6.23rcX), and it even happens on other Freescale-based RouterBoards that use the MPC85xx kernel, like the multicore RB1100AHx2 or the RB850Gx2. Of course, you can't run MetaROUTER on these, so they don't reboot themselves, but the uptime clock does the same bizarre skip-10-seconds trick in RouterOS 6 on all of those routers. (Another interesting fact: if I "hack" an RB850Gx2 to run the uniprocessor kernel, and boot a MetaROUTER up on it, even though the uptime clock does the funny skip-ahead-10-seconds thing, the router does not crash and reboot after the uptime clock catches back up to reality! Only RB1xxx boards running MetaROUTERs crash at that point!)

Another interesting thing that may or may not be related: if I run a MetaROUTER on a PPC board, even one that doesn't crash and reboot, I see the following show up in the kernel ring buffer every few minutes:

INFO: task fs-server-1:361 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fs-server-1     D 00000000     0   361      2 0x00000000
Call Trace:
[dd22ddb0] [80164714] blk_finish_plug+0x1c/0x58 (unreliable)
[dd22de70] [80006350] __switch_to+0x7c/0x94
[dd22de80] [8028ce64] __schedule+0x18c/0x374
[dd22dec0] [e1863960] wait_for_data+0x84/0x12b8 [fs-back@0xe1863000]
[dd22df00] [e1863b54] wait_for_data+0x278/0x12b8 [fs-back@0xe1863000]
[dd22dfb0] [80047650] kthread+0x84/0x88
[dd22dff0] [8000b9cc] kernel_thread+0x4c/0x68

The crashes do not coincide with these messages, so they may be completely innocent and normal, but because it includes a stack trace, I thought it would be worth mentioning.

So, to sum up, here is how you reproduce a MetaROUTER crash on PowerPC RB1xxx boards:

1. Netinstall 6.23rc7 onto an RB1000, RB1100, or RB1100AH, and boot it up.
2. Connect with Winbox, bring up System -> Resources, and bring up a Terminal running "/system resource print interval=1"
3. Watch the Uptime counter both in the Terminal and on the Resources window. When it gets to 4m54s, it will skip to 5m5s. The Terminal Uptime number will stay frozen for 10 seconds, while the Winbox/Resources window uptime will count backwards to 5m0s, and then back forwards.
4. Create a single MetaROUTER with "/metarouter add"
5. Reboot the RB1xxx to reset the system uptime back to 0.
6. Connect again with Winbox and bring up System -> Resources.
7. Wait for 4m54s. At 4m55s, just like before, the uptime will skip ahead 10 seconds to 5m5s, count backwards to 5m0s, and then count back forwards.
8. Because a MetaROUTER is now running, at 5m6s, the router will reboot itself.

Thanks for looking into this.

-- Nathan

NathanA · Sat Nov 29, 2014 10:47 am

After doing some brief research, this sounds potentially like a bug related to jiffies handling somewhere in the RouterOS kernel, perhaps even in MetaROUTER code that is included in the kernel. In the lead up to Linux 2.6, the kernel developers changed from initializing 'jiffies' to 0 at bootup to instead intentionally initializing 'jiffies' at bootup to a large value, such that it wraps around to 0 after exactly 5 minutes. This is to force bugs in kernel drivers related to their use of jiffies (for timing/scheduling?) to be discovered more quickly.

Here is a relevant post to the Linux kernel mailing list on the subject: http://lkml.iu.edu/hypermail/linux/kern ... /0421.html
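To illustrate the class of bug that this early wraparound is meant to flush out, here is a small Python model of a 32-bit jiffies counter. It is only a sketch: HZ = 100 is an assumed tick rate, the function names are illustrative (time_after is modelled on the kernel macro of the same name), and nothing here is taken from RouterOS or MetaROUTER code. It shows that a plain comparison of tick values gives the wrong answer for a timeout that straddles the 5 minute mark, while a comparison on the signed difference does not.

```python
HZ = 100            # assumed tick rate; real kernels are built with various values
MASK = 0xFFFFFFFF   # model jiffies as a 32-bit unsigned counter

# Same idea as the kernel's INITIAL_JIFFIES: start 5 minutes before the wrap.
INITIAL_JIFFIES = (-300 * HZ) & MASK


def jiffies_at(uptime_seconds):
    """Counter value after the given number of seconds of uptime."""
    return (INITIAL_JIFFIES + uptime_seconds * HZ) & MASK


def naive_after(a, b):
    """Buggy check: plain unsigned comparison, wrong across the wrap."""
    return a > b


def time_after(a, b):
    """Wrap-safe check: true if a is later than b, judged by treating the
    32-bit difference (b - a) as a signed number and testing for negative."""
    return ((b - a) & MASK) >= 0x80000000


# A 10 second timeout armed at 4m55s of uptime expires at 5m5s.
now = jiffies_at(295)
deadline = jiffies_at(305)

print(jiffies_at(300))             # counter value at exactly 5 minutes
print(naive_after(now, deadline))  # naive check: has the deadline passed?
print(time_after(now, deadline))   # wrap-safe check at 4m55s
print(time_after(jiffies_at(306), deadline))  # wrap-safe check at 5m6s
```

In this model the naive check reports the timeout as already expired at 4m55s, because the deadline has wrapped to a small number while the current value is still close to the 32-bit maximum. Whether anything like this is actually behind the RB1xxx crash is speculation on my part.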

-- Nathan
