For quite a while now, we've been having problems with random crashes of MikroTik routers that do PPPoE termination for us. We've tried all sorts of troubleshooting to narrow down the cause, but until recently we were not able to discern any clear patterns.
However, it now appears that it has something to do with the IP Pool feature of RouterOS. This also seems to be reproducible across all versions of 2.8.x (we've upgraded and downgraded several routers with no change).
We find that, with the following setup, we can recreate the crash nearly 100% of the time. Could someone else out there (preferably a MikroTik employee) try these steps to see if you get the same result?
By the way, we're using 1U rackmount Supermicro machines (2.8GHz P4/Celeron, 1GB RAM, RouterOS loaded on IDE flash module).
1. Configure a RouterOS machine (we'll call this router #1) on one end to have an IP pool with a range of, say, 192.168.1.0-192.168.1.255.
2. Create a PPPoE Profile on router #1 and set the Remote Address to the pool you created in step 1.
3. Create and start a PPPoE Server instance on one of the ethernet interfaces on router #1 using the PPPoE profile you created in step 2 as the Default Profile for this server.
4. Create a single PPP Secret on router #1 with any username and password of your choice. Set its profile to the one you created in step 2 and the Service type to 'pppoe'.
5. On a second RouterOS machine (router #2), wired up to router #1, create several hundred PPPoE client interfaces (just make one and copy it; a script is easiest), all using the username and password you chose in step 4, but leave them all disabled for now. You may set the Service name to match router #1's PPPoE service name if needed.
Code:
/interface pppoe-client
add interface=ether1 user=username password=password add-default-route=no disabled=yes name=0
print
:for x from 1 to 511 do=[add disabled=yes copy-from=0 name=($x)]
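For anyone trying to reproduce this, steps 1-4 on router #1 might look roughly like the sketch below. This is only an approximation written from memory of the console syntax, and parameter names may differ slightly on 2.8.x. The names pppoe-pool, pppoe-profile, and pppoe-test, as well as the choice of ether1, are placeholders and not taken from our actual setup.

```
# Router #1, steps 1-4 (sketch; all names here are placeholders)
# Step 1: IP pool covering 192.168.1.0-192.168.1.255
/ip pool add name=pppoe-pool ranges=192.168.1.0-192.168.1.255
# Step 2: PPP profile whose Remote Address is taken from the pool
/ppp profile add name=pppoe-profile remote-address=pppoe-pool
# Step 3: PPPoE server on an ethernet interface, using that profile as default
/interface pppoe-server server add service-name=pppoe-test interface=ether1 default-profile=pppoe-profile disabled=no
# Step 4: a single secret shared by every client interface on router #2
/ppp secret add name=username password=password profile=pppoe-profile service=pppoe
```

The steps above do not mention a Local Address for the profile, so none is set here.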
Then, on router #2, enable the PPPoE client interfaces one by one, one second apart:
Code:
:for x from 0 to 511 do=[:delay 1;enable $x]
After you have confirmed the above behavior, try this:
1. Go to router #2 and disable all of the PPPoE interfaces.
Code:
/interface pppoe-client
disable [find]
3. Go back to router #2, and re-enable the PPPoE client interfaces again one-by-one, like last time.
Results: All 512 PPPoE tunnels come up successfully on router #1 (albeit all with the same IP assigned) with no resource spikes and no crashes.
This would seem to indicate that something in the IP Pool code is leaking memory or getting stuck in a loop. The kernel then probably starts killing off processes as things spiral out of control. That may or may not succeed, which is probably why we see such a diverse array of unpredictable symptoms at the end.
If it's an IP Pool problem, I suppose it is possible that DHCP could also be affected by this, though I've never tried to reproduce this issue using DHCP in place of PPPoE.
-- Nathan