Bad 2.8.x AND 2.9.x OSPF bug? (was: IP Pool bug)

NathanA · Fri Jul 08, 2005 1:20 am

Hey, everybody,

For quite a while now, we've been having problems with random crashes of MikroTik routers which are doing PPPoE termination for us. We've tried all sorts of troubleshooting to try and narrow down the cause of the problem, but up until recently we were not able to discern any clear patterns.

However, it now appears that it has something to do with the IP Pool feature of RouterOS. This also seems to be reproducible across all versions of 2.8.x (we've upgraded and downgraded several routers with no change).

We find that, with the following setup, we can recreate the crash nearly 100% of the time. Could someone else out there (preferably a MikroTik employee) try these steps to see if you get the same result?

By the way, we're using 1U rackmount Supermicro machines (2.8GHz P4/Celeron, 1GB RAM, RouterOS loaded on IDE flash module).

1. Configure a RouterOS machine (we'll call this router #1) on one end to have an IP pool with a range of, say, 192.168.1.0-192.168.1.255.

2. Create a PPPoE Profile on router #1 and set the Remote Address to the pool you created in step 1.

3. Create and start a PPPoE Server instance on one of the ethernet interfaces on router #1 using the PPPoE profile you created in step 2 as the Default Profile for this server.

4. Create a single PPP Secret on router #1 with any username and password of your choice. Set its profile to the one you created in step 2 and the Service type to 'pppoe'.

5. On a second RouterOS machine (router #2) which you should wire up to router #1, create several hundred PPPoE client interfaces (just make one and copy it several times; easiest with a script), all with the username and password you came up with for step 4, but leave them all disabled for now. You may set the Service name to match router #1's PPPoE service name if you need/desire to.

/interface pppoe-client
add interface=ether1 user=username password=password add-default-route=no disabled=yes name=0
print
:for x from 1 to 511 do=[add disabled=yes copy-from=0 name=($x)]

6. Now -- again, best done with a script -- enable each PPPoE client on router #2 one-by-one with a second or two of delay in between each. Watch to make sure the tunnels are coming up on router #1, and watch the system resources on router #1 carefully during this.

:for x from 0 to 511 do=[:delay 1;enable $x]
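
For reference, steps 1 through 4 on router #1 might look roughly like the following from the console. This is only a sketch: the names pppoe-pool, pppoe-test and test-service and the choice of ether1 are placeholders, not part of the original setup, and exact parameter names may differ between RouterOS versions. Local Address is left out because the steps above do not mention it.

```
# Step 1: address pool to hand out to incoming PPPoE tunnels
/ip pool add name=pppoe-pool ranges=192.168.1.0-192.168.1.255
# Step 2: PPP profile whose Remote Address points at the pool
/ppp profile add name=pppoe-test remote-address=pppoe-pool
# Step 3: PPPoE server on an ethernet interface, using that profile by default
/interface pppoe-server server add service-name=test-service interface=ether1 default-profile=pppoe-test disabled=no
# Step 4: a single secret shared by all of the test clients
/ppp secret add name=username password=password service=pppoe profile=pppoe-test
```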

Results: The tunnels will successfully establish for a while, and IPs from the pool will be handed out to the incoming tunnels in sequential fashion, as we should expect. However, once we get somewhere around 120 tunnels (at least on our test machines), router #1's system resource usage numbers will spike dramatically: CPU load goes to 100% and nearly all of the 1GB of physical RAM gets used up. At this point, what happens next seems left up to chance: you might see things normalize and continue after a minute or so; you might see resource usage drop back to normal levels but router #1 stop accepting new PPPoE tunnels; you might see CPU load remain at 100%; you might see existing PPPoE tunnels that were established successfully before the resource spike stop working completely even though they are still listed in the Interfaces list; or you might see router #1 drop off the network and never return (in that last case, accessing router #1 via the console might show a kernel panic, or a login prompt that is non-responsive, or several other possibilities).

After you have confirmed the above behavior, try this:

1. Go to router #2 and disable all of the PPPoE interfaces.

/interface pppoe-client
disable [find]

2. Go back to router #1, reboot it, then go to the properties of the PPPoE Profile you created earlier and set its Remote Address to a static value (like 192.168.1.1 or something) rather than pointing it to an IP Pool.

3. Go back to router #2, and re-enable the PPPoE client interfaces again one-by-one, like last time.
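
The profile change in step 2 can be sketched as a single console command on router #1. It assumes the profile is named pppoe-test, which is a placeholder name rather than anything from the original setup, and the exact syntax may vary by RouterOS version.

```
# Router #1: replace the pool reference with a fixed Remote Address
/ppp profile set pppoe-test remote-address=192.168.1.1
```

Re-enabling the clients in step 3 uses the same loop as step 6 of the first test.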

Results: All 512 PPPoE tunnels come up successfully on router #1 (albeit they all have the same IP assigned to them) with no resource spikes and no crashes.

This would seem to indicate that something in the IP Pool code is causing a memory leak or is getting stuck in a loop or something. The kernel then probably starts killing off processes after things start spiraling out of control. It may end up being successful or not, which is probably why we see the diverse array of unpredictable symptoms that we do at the end.

If it's an IP Pool problem, I suppose it is possible that DHCP could also be affected by this, though I've never tried to reproduce this issue using DHCP in place of PPPoE.

-- Nathan

hitek146 · Fri Jul 08, 2005 2:28 am

I can see how this would be a large issue.... I'm sure that you don't want to use version 2.9, since it is not in full release yet, but I am curious if this behaviour occurs on version 2.9, since it is very different code from 2.8.x.... Have you tried 2.9, by any chance?

Hitek

PS: Sorry, I've no solution to your problem, although MT did just correct a large PPPoE issue with version 2.9 for many of us, and the fix will be fully available in 2.9rc7...

changeip · Fri Jul 08, 2005 4:41 am

Maybe this is offtopic, but why would you need that many tunnels between the same 2 machines? Is that example above just to prove the bug, or is that how you are using it?

NathanA · Fri Jul 08, 2005 5:00 am

changeip wrote: "Maybe this is offtopic, but why would you need that many tunnels between the same 2 machines? Is that example above just to prove the bug, or is that how you are using it?"

Sorry if my post was not clear. Yes, this is only how I'm trying to show people how to reproduce the bug. The bug seems to manifest itself on the PPPoE server once it has reached a certain threshold of tunnels that have had dynamic IPs assigned to them, regardless of where these tunnels originate. Setting up two machines -- a server and a client -- and then artificially simulating load on the PPPoE server using this tactic is just a way to accelerate the decline of the PPPoE server for demonstration purposes.

-- Nathan

hitek146 · Fri Jul 08, 2005 6:14 am

It would also be an issue if the Access Concentrator remained running, but the clients lost network connectivity and logged off. If the network connection were to be restored, and all the clients tried to log back on again immediately, I could see this being a problem.....

Also, NathanA, you didn't say if you had tried 2.9 or not.....

Hitek

NathanA · Fri Jul 08, 2005 11:06 pm

hitek146 wrote: "Also, NathanA, you didn't say if you had tried 2.9 or not....."

hitek146,

No, I had not tried 2.9, but I did so this afternoon. I was really, really hoping it would work correctly, but it did the same darn thing. Around tunnel #118, all available memory was used up, the CPU spiked to 100%, and I lost contact with the box. I went to the console and found I was still able to log in there. I checked system resources again, and it looks like it actually managed to kill the process (CPU and memory usage were back to normal). It seems to have recovered for a bit, as I then counted 170 tunnels. I tried to ping the other endpoint of one of the tunnels, but got no response (so they showed as 'up' but weren't working). While I was investigating on the console, the entire box hardlocked (no errors printed on-screen).

So, yes, the problem still exists, even in the latest RC.

-- Nathan

hitek146 · Sat Jul 09, 2005 1:29 am

I am using a PPPoE patch that MT sent me for 2.9rc6 which corrected my problem, and I wonder if it would also correct the problem that you are having. They told me the patch will be part of the next release, which should be available at any time. Having exhausted other options (besides hopefully getting help here in the forum), you should give it one more try when the next candidate is released....

Hitek

NathanA · Sat Jul 09, 2005 1:32 pm

hitek146 wrote: "I am using a PPPoE patch that MT sent me for 2.9rc6 which corrected my problem, and I wonder if it would also correct the problem that you are having."

What problem was the patch supposed to fix, out of curiosity?

Thanks,

-- Nathan

john2 · Sun Jul 10, 2005 12:17 pm

The forum is for community support. If you find what you think is a bug, you should email support@mikrotik.com with the supout file and description.

People continually post here and think that something will be done about possible bugs posted here, but that is not the case. You must write to support if Mikrotik is to help resolve such issues. I guess we will put some better notices to explain this.

John

NathanA · Mon Jul 11, 2005 9:30 pm

john2 wrote: "The forum is for community support. If you find what you think is a bug, you should email support@mikrotik.com with the supout file and description.

People continually post here and think that something will be done about possible bugs posted here, but that is not the case. You must write to support if Mikrotik is to help resolve such issues. I guess we will put some better notices to explain this."

John,

Thanks for your reply. I posted this here to try and give this problem more exposure, to get the attention of the appropriate people, AND to get feedback from fellow users in the RouterOS community to see if anybody else has ever experienced the same thing or can reproduce my problem (so this post wasn't merely targeted at MikroTik themselves). I've had problems with the responsiveness and helpfulness of support in the past, though I understand the importance of going through proper channels, and so I will do as you say and give MikroTik support a second shot if you think it will help get this problem resolved. I'm not going to be able to supply a supout file at this time since, the majority of the time, I cannot get one generated before the machine becomes unresponsive. I was hoping that having step-by-step instructions for reproducing the bug with a near-100% success rate would be helpful to you (MikroTik).

-- Nathan

tully · Tue Jul 12, 2005 8:08 pm

The support department is doing their best. You should continue to work with them. If you want the issue solved quickly, then you should contact support first -- I hate to see people wasting their time and getting frustrated when they could simply email support and have the best chance of getting the problem solved.

You can provide a supout file showing your configuration before you have issues. Or disable the PPPoE/IP pool that is making the issue and then make a supout file -- and explain this in your email.
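
As a rough sketch, the supout file can be generated from the console on the affected router before starting the test. The command below is given from memory of the RouterOS console and may differ between versions. The resulting file (supout.rif) should then appear in the router's file list, ready to download and attach to the email.

```
# Generate a support output file (supout.rif) while the router is still responsive
/system sup-output
```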

No support will be given without a supout file.

John

NathanA · Thu Jul 28, 2005 10:17 pm

Well, it appears that this most likely is not an IP Pool-related bug after all. Support informed me it was not related, and I remained doubtful, but I have since discovered that I was wrong about one thing: the server I was conducting these tests on was still set up to communicate on our OSPF network (I thought I remembered turning it off when we took the box out of service, but apparently I didn't).

I have since turned off OSPF and run the tests again. The box failed to crash. So MikroTik support was right...it is related to "routing program" (by which I'm guessing he meant Quagga) and not to the IP Pool function.

However, we still have a problem. It has been over a week now since I last heard from support@mikrotik.com, and the last fix they sent me didn't make one bit of difference (still crashes). I've prodded them twice already, trying to at least get some response, but I've heard nothing since a week ago Wednesday. Even a "we got your message, we don't have a fix yet but we're working on it, and we'll let you know when that happens" would suffice...just something to let me know that you're getting my e-mails and are working on the problem.

And Tully wonders why I post our bug reports to this forum!

We have a bunch of angry customers beating down our door because of the instability of your routing software. We're taking measures to spread our PPPoE terminations amongst more MikroTiks to reduce the load on all of them, and it does seem to be helping, but we need a true fix, not a patchwork workaround.

As a side note, I was also disturbed to hear (via one of the support people) that MikroTik would not consider releasing a fix for this issue in 2.8.29, and that we can only expect to see this fixed in 2.9. This, as I tried to express to the support person we were working with, is something that we feel is unacceptable. 2.9 may be close to release and may be just on the horizon, but forcing a major upgrade down someone's throat before they're ready just to fix a bug is asking for trouble. And besides, we need to get this fix rolled out ASAP, before 2.9 goes gold, and I would *REALLY FEEL UNCOMFORTABLE* running beta or release candidates in a PRODUCTION ENVIRONMENT. I don't see how we could call ourselves professionals if we found this suggestion completely acceptable. Our customers deserve better from us. As a customer of *yours* who has bought many, many RouterOS licenses and continues to do so, we would *really* appreciate having a fix in 2.8 if at all possible.

Thanks again for your time and for listening,

-- Nathan