RB1000 closing tens of pppoe connections at once

jkohan · Wed Dec 02, 2009 12:25 am

I have a problem with an RB1000 managing some 300+ PPPoE connections.
At times it suddenly closes tens of connections. In the logs (and the FreeRADIUS logs) the disconnection cause is logged as "User request" (unlike the usual "Peer is not responding" when a modem or line is faulty). The connections come from several DSLAMs on different VLANs and physical interfaces, and some from a wireless distribution built with another MikroTik.
Each VLAN has its own PPPoE server, and we observed that when the drops occur, they all occur on the same PPPoE server, although every server closes sessions this way from time to time.

We are not completely sure, but this seems to happen when there is a burst of failed authentications (maybe that is a coincidence, as we couldn't reproduce the problem by forcing a client to fail authentication).

We tried different versions of ROS (3.22, 3.30 and 4.3) and 2 different RB1000s, and the problem is consistent across all of them.

Has anybody seen a problem like this? Any suggestions?

Thanks

Javier

Tue Dec 08, 2009 3:50 pm

Javier, it is very weird, that only 10 connections are closed.
Perhaps you can enable pppoe,debug logs and get the reason, why all the 10 clients were disconnected simultaneously.
Otherwise it is very hard to guess, what could be the reason.
Just curious, do you have RADIUS server for these 300 clients?
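
A minimal sketch of the logging rule suggested above, as it could be typed in a RouterOS terminal. It assumes the default `memory` logging action still exists on the router; the topics `pppoe` and `debug` are the ones named in the post. Debug logging on a busy concentrator is verbose, so it is probably best left enabled only while waiting for the next mass disconnect.

```
# Send PPPoE debug messages to the router's in-memory log buffer.
/system logging add topics=pppoe,debug action=memory

# After the next mass disconnect, review what was captured.
/log print
```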

jkohan · Tue Dec 08, 2009 6:17 pm

Well, maybe because English is not my native language I was not clear. When I say "tens", I mean 23 one time, 45 the next, 30 the next and so on. (In Spanish the word is "decenas", and I always thought the English word "tens" meant the same.)

With some hundreds of users, and given that these disconnections happen sparsely and randomly, it is somewhat difficult to enable full PPPoE logs without affecting performance, and the log file on my server (I log to a server with syslog) gets huge. Anyway, in both the RB's and the server's logs, the disconnection cause is "User Request". The same in my RADIUS logs.
And yes, I use 3 RADIUS servers that get user data from replicated LDAP databases.

Some data that I think could be relevant to this issue.
1) Sample Access-Reply AV Pairs.

Tue Dec  8 00:10:31 2009
        Packet-Type = Access-Accept
        Mikrotik-Rate-Limit = "128k/256k 130k/258k 129k/257k 3/3 8 32k/32k"
        Framed-Routing = None
        Framed-Protocol = PPP
        Service-Type = Framed-User

2) PPPOE Server Configuration

add authentication=pap default-profile=PPPoE disabled=no interface=\
    PPPoEVLAN20 keepalive-timeout=10 max-mru=1480 max-mtu=1480 max-sessions=0 \
    mrru=disabled one-session-per-host=yes service-name=XXXX
add authentication=pap default-profile=PPPoE disabled=no interface=VLAN10 \
    keepalive-timeout=10 max-mru=1480 max-mtu=1480 max-sessions=0 mrru=\
    disabled one-session-per-host=yes service-name=XXXX

set default change-tcp-mss=yes comment="" name=default only-one=default use-compression=default \
    use-encryption=default use-vj-compression=default
add change-tcp-mss=yes comment="" dns-server=10.120.0.2,10.120.0.3,10.120.128.2 local-address=\
    200.xxx.xxx.xxx name=PPPoE only-one=default remote-address=POOL-PPPoE use-compression=default \
    use-encryption=default use-vj-compression=default wins-server=127.0.0.1

If anyone would like to see more information, please ask.

Thanks

Javier
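
For readers unfamiliar with the `Mikrotik-Rate-Limit` string in the sample Access-Accept above, here is a field-by-field reading, followed by the same value set statically on the `PPPoE` profile from the posted configuration. This is only an illustration of the format: in the setup described, the value arrives per user from RADIUS, and setting it on the profile would apply one limit to every user of that profile.

```
# rx/tx are from the router's point of view:
# rx = customer upload, tx = customer download.
#
#   128k/256k   rx-rate/tx-rate
#   130k/258k   rx-burst-rate/tx-burst-rate
#   129k/257k   rx-burst-threshold/tx-burst-threshold
#   3/3         rx-burst-time/tx-burst-time (seconds)
#   8           priority
#   32k/32k     rx-rate-min/tx-rate-min
#
# The same string as a static profile setting (illustration only):
/ppp profile set [find name=PPPoE] rate-limit="128k/256k 130k/258k 129k/257k 3/3 8 32k/32k"
```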

pbel88 · Tue Dec 29, 2009 6:34 pm

"Peer is not responding" on tens of PPPoE sessions, and sometimes all sessions, disconnecting intermittently is still an issue for me. I saw it live once: about 100 users were connected and suddenly everyone got disconnected, with "Terminating... disconnected" in the log. The PPPoE service went crazy and collapsed, but the RB was still alive. It then took about 30 seconds before everyone logged back in. I can have this issue 3 times a day or once a week; it depends on something I don't know, and it's very frustrating. Clients complain; I have about 250 users on an RB1000. I first thought it was a bridge problem, so I moved the PPPoE service onto a plain ether interface, but the issue is the same. I have looked around the forum but found nothing. Has someone found a fix, or do I have to switch to a solution other than MikroTik?

Regards

jkohan · Tue Dec 29, 2009 7:00 pm

So, I'm not the only one experiencing this problem, and it is a real problem when one has 300+ angry users.
Can someone @mikrotik look into this issue?

Wed Dec 30, 2009 2:36 pm

The best way to get that is to contact support (support@mikrotik.com) with a detailed problem description.

promind · Mon Jan 11, 2010 1:31 pm

Same problem here... I have a dozen RB1000s working as PPPoE servers... CPU goes to 100% for about 20 secs and all users get "terminated".
Please find a solution, because otherwise I'll have to stop working with you MikroTik guys...

pbel88 · Fri Jan 15, 2010 7:06 pm

Found this (http://forum.mikrotik.com/viewtopic.php ... =pppoe+bug). I've removed everything related to PCQ in my queues and all seems to be OK for the past 24 hours. I hope this will continue. But PCQ mixed with the PPPoE service will need to be reviewed by MK.

pbel88 · Tue Jan 19, 2010 3:39 pm

It still crashes, but less often.

rborz · Mon Apr 26, 2010 8:23 pm

The same problem occurs here, too. With about 100 users on RB1000 PPPoE server. All sessions disconnect simultaneously.
The RB1000 has two uplinks to two different providers using PPPoE for their connection, too. Also the uplink PPPoE sessions disconnect from time to time.

For two weeks there was no problem, no disconnects, and today all PPPoE sessions (incoming & outgoing) disappeared within 3 hours. This is horrible...

The 100 users are on one physical interface and shared across two VLANs. Each of the two uplink PPPoE sessions goes over a separate physical interface.

We tried every firmware from 3.22 up to 4.6. The same problem... we also completely exchanged the RB1000 without any success.

Tue May 11, 2010 1:37 pm

rborz, it would be great if you could contact MikroTik support (support@mikrotik.com) with a detailed problem description and a supout file generated while the problem with PPPoE users is present.

jkohan · Tue May 11, 2010 2:29 pm

As SergeJS advised, I opened a ticket @mikrotik. After seeing my supout file, their suggestion was:

1) Replace all your dynamic change-mss rules with one global change-mss rule.

2) Check that you use the latest winbox loader (clear the cache after upgrading).

3) Think about switching from dynamic simple queues to a dynamic address-list and a queue tree with PCQ.

#1 was relatively easy to do.
#2 I did as asked.

So far the problem persists. I even tried splitting the users between 2 RBs: some 100+ on a 600, while some 180+ stayed on the RB1000. It happens less frequently, but both had massive disconnections.

#3 I don't understand what I'm being asked to do. I have hundreds of users and rely on RADIUS to pass each customer's bandwidth parameters to the PPPoE concentrators. Can I pass an address list instead of "Mikrotik-Rate-Limit" attributes? How?
There is a "Mikrotik-Mark-Id". Is it meant for that? In that case, how do I use it?

Thanks.

Javier
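
The first suggestion in the post above (one global change-mss rule instead of per-session dynamic ones) could look roughly like the sketch below. The thread does not show the rule support actually had in mind, so this is an assumption: it reuses the `PPPoE` profile name and the 1480-byte MTU from the configuration posted earlier, and the comment text is made up.

```
# Stop the profile from adding a dynamic change-mss rule for every session.
/ppp profile set [find name=PPPoE] change-tcp-mss=no

# One static rule instead: rewrite the MSS of TCP SYN packets that exceed 1440.
# 1440 = 1480 (max-mtu of the PPPoE servers) - 20 (IPv4 header) - 20 (TCP header).
/ip firewall mangle add chain=forward protocol=tcp tcp-flags=syn \
    tcp-mss=1441-65535 action=change-mss new-mss=1440 passthrough=yes \
    comment="global MSS clamp for PPPoE"
```

With a single static rule the mangle table no longer grows and shrinks as sessions come and go, which seems to be the point of the suggestion.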

chaym · Sat May 15, 2010 10:39 pm

I am not using an RB1000, but x86-based RouterOS on a PowerRouter 2242. Same issue here. We terminate 600+ PPPoE customers, and the queues will fail, disconnecting all users until the unit is rebooted. We have tried several versions of RouterOS, including 4.9, and the problem persists.

The previous suggestions from Mikrotik staff either do not work, or are not usable in our environment. (We need simple queues to assign specific bandwidth profiles a customer is paying for which is passed from our RADIUS server)

This can happen once a week or a few times a day, it is very random. It does seem to happen more often if we have multi cpu support enabled. We cannot contact Mikrotik support since we purchased our license through a 3rd party. This is becoming very frustrating for us, but even more so for our customers.

rodolfo · Sun May 16, 2010 8:34 pm

I have the same problem using x86, partially resolved by disabling simple queues.

You can contact MikroTik support even if you purchased your licenses from third parties.

Muqatil · Sun May 16, 2010 10:13 pm

I am using RB1000, x86 and RB433AH as PPPoE servers with centralized RADIUS accounting and I don't encounter similar problems. I use simple queues, so your issues might not be related to simple queues.
I had a similar problem a while ago, but I traced it to a flapping wireless link. Fixed the link, fixed the issue. Did you try looking for packet loss on the path to the clients?

edit. My PPPoEs ask for interim updates, might this help?

jkohan · Mon May 17, 2010 2:28 am

What is that? Where do you set it up?

Thanks

Javier

Muqatil · Mon May 17, 2010 9:31 am

From Documentation:

interim-update - defines time interval between communications with the router. If this time will exceed, RADIUS server will assume that this connection is down. This value is suggested to be not less than 3 minutes
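
As a rough illustration of where this setting lives, the sketch below enables RADIUS accounting for PPP and sets an interim update interval. The 5-minute value is only an example chosen to respect the "not less than 3 minutes" advice in the documentation excerpt above.

```
# Use RADIUS for PPP authentication and accounting, and send
# interim accounting updates every 5 minutes (example value).
/ppp aaa set use-radius=yes accounting=yes interim-update=5m

# Check the resulting settings.
/ppp aaa print
```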

rborz · Mon May 17, 2010 12:46 pm

sergejs, I already contacted support multiple times... the last hint was to upgrade to 5.0 beta. But I'm afraid of doing that, as the RouterBOARD is on a production network serving about 120 PPPoE clients at the moment.

After a lot of thinking, my latest thought yesterday was: is everybody who has this issue maybe using an external RADIUS server? If so, I think most of them will be using FreeRADIUS (as we do). The FreeRADIUS default configuration states this:

#  max_requests: The maximum number of requests which the server keeps
#  track of.  This should be 256 multiplied by the number of clients.
#  e.g. With 4 clients, this number should be 1024.
#
#  If this number is too low, then when the server becomes busy,
#  it will not respond to any new requests, until the 'cleanup_delay'
#  time has passed, and it has removed the old requests.
#
#  If this number is set too high, then the server will use a bit more
#  memory for no real benefit.
#
#  If you aren't sure what it should be set to, it's better to set it
#  too high than too low.  Setting it to 1000 per client is probably
#  the highest it should be.
#
#  Useful range of values: 256 to infinity
#
max_requests = 1024

Maybe, together with interim updates, this value is too low... maybe this has something to do with the issue. In my case, about 120 PPPoE clients with approximately 4 SIP accounts per client may lead to 600 simultaneous requests (in the worst case). Then again, maybe this has nothing to do with the connection drops... just my two cents...

EDIT: Sometimes we have brute-force attacks with about 500 requests per second against our SIP gateways... and each register/login attempt also leads to a RADIUS request. Now the above becomes more plausible... So the main question is:
Does PPPoE server on MikroTik drop connections if there are timeouts on interim-updates...?

EDIT: OK, a few minutes ago all my PPPoE sessions were gone again... so this time I checked all the logs: no brute force or anything like that leading to a DoS on the RADIUS server. So this must be another issue...

jkohan · Mon May 17, 2010 7:23 pm

"So the main question is:
Does PPPoE server on MikroTik drop connections if there are timeouts on interim-updates...?"

I have to mention: We do NOT use interim-updates and suffer the same problem.

cuz2000m · Fri May 21, 2010 1:54 am

I have the same problem with 2 RB1000 units. However, none of my PCQ options are in use as I have disabled all the Queues. I have 600+ angry customers and would really like a fix. Has anyone found anything that actually works, or have any information from Mikrotik about what the problem could be? The only thing that I have noticed thus far is that in the PPPoE Servers tab, some of the interfaces display "unknown" when the sessions are dropped.

cuz2000m · Fri May 21, 2010 12:06 pm

An update to what I mentioned earlier. I believe this problem is associated with the 4.x firmware. I had noticed a similar problem on a previous RB1000 (it has firmware 4.6), which has now been redeployed in another area of our network. I had pulled all the configs and uploaded them to a new RB1000 with 3.x firmware and had no PPPoE connection drops during that time. However, I had fairly high CPU usage, which prompted me to upgrade to the latest firmware, 4.9. A couple of days afterwards I started to experience the same problem. I have reverted to firmware 3.30 and will test it for a few days. I will post the results of my little test.

Corey.

rborz · Tue May 25, 2010 9:01 am

We have had this problem since the day we bought our first RouterBOARD from MikroTik (an RB1000) 1.5 years ago. If I remember correctly, the board had RouterOS v3.22 installed. Since that day we have always looked forward to the next firmware update, hoping they had fixed it.

cuz2000m: Support told me to upgrade to v5.x beta, as they may have changed something in the PPPoE code. But as my RB1000 is about 450 kilometers away, I can't do the upgrade and just hope nothing bad happens. Maybe you can check whether v5.x fixes it, as you are next to the boards? We'll have to buy another solution if MikroTik doesn't get this fixed soon!

820 · Tue May 25, 2010 10:26 pm

We have 2 PowerRouter 732's and an RB1000 that keep having "spontaneous rebooting". The PowerRouters keep crashing and we have had to - disable 1 of the CPU's, disable "multi threading" and try v5.x beta and v4.9 for each unit - without success and we continue to crash every 1-2 days, which has been going on for several weeks now. The RB1000 is stable for us though.

MikroTik support has been given all the support logs and we are urgently waiting for a quick reply. MikroTik has a great product (we are very happy), but I've been reading many threads about this and it is a big issue that needs fixing.

cuz2000m · Mon May 31, 2010 4:24 pm

Hi All,
As promised, I just wanted to update you on the progress so far and that is to say that I have not yet experienced the same problem since I downgraded to 3.30.

I'm sorry rborz but I can't upgrade to the 5.x beta at this time. It would be really irresponsible for me to do this and needless to say, I have already lost a few customers because of these outages.

Maybe someone else who had the problem with firmware version 4.x can upgrade, or downgrade to version 3.30 like I did, and maybe confirm a possible problem with the 4.x line of firmware.

Corey.

user47 · Wed Jun 09, 2010 4:12 pm

Hi all,
I am having very similar problems: I run an RB1000 with 500+ PPPoE sessions and all the sessions drop at once. When this happens, all sessions immediately start to reconnect without problems.

A little background on my setup: I use a RADIUS server for the AAA functions, with the RB1000 as the PPPoE server. When a client connects, a simple queue is created to apply a bandwidth limit on a per-session basis.

cuz2000m, I have also downgraded to 3.30 with no change in the problem. I have gone through versions 3.3, 4.5, 4.6 recently and some others along the way. All give the same problem. I cannot find any pattern in when this happens.

I have also replaced the RB1000 with a new one (and all other common hardware between the RB1000 and the clients) and the problem persists. I have 4 other RouterBOARDs also using the same RADIUS server with no problems; however, they are running a max of 150 clients. Can this be load related?

I'm willing to try almost any other idea at this point; I have lost a few dozen clients already....

cuz2000m · Thu Jun 10, 2010 3:17 pm

Hi All,
Another update. I was busy for the past few days. Well, the problem did subside with v3.30, but only for a week, after which it started all over again. User47, I have been wondering the same thing. I have a very similar setup to yours, in that an external RADIUS server does our authentication, but there are no queues set up on that MikroTik. I also have a few customers (around 50) on another MikroTik running 4.6 without these drops. So I was also wondering if it was a load problem. According to the specs, the RB1000 is supposed to handle around 5000 PPPoE sessions. CPU load is between 30-50% and there is not much memory load at all.
I have updated to version 5.0 beta 2, with which the problem seems to correct itself somewhat, but I can't allow the disconnections to continue in this form. It makes us look very unprofessional and unreliable.
Does anyone have any ideas?

omidh · Thu Jun 10, 2010 11:41 pm

Hi,
I have the same problem.
When clients try to connect via PPPoE they get error 691, the MikroTik log says "peer is not responding", and after two or more retries the clients can connect.
My OS version is 4.9.

Nando_lavras · Tue Jul 13, 2010 12:49 pm

Same problem here.... After some time with all clients connected, all connections are terminated... after 2~3 seconds all clients reconnect normally..... I have sent the supout.rif to MikroTik and they found no sign of a software crash... but the clients continue to disconnect.

tlcscousin · Wed Aug 18, 2010 5:56 am

We are seeing the same issues, except on two RB450G routers and with a lot fewer clients. With 70 clients on the MikroTik, 69 will drop their session and within seconds re-enter the session. We have tried all the suggestions given here, short of going to the beta firmware.
We have
wifi link-->switch-->RB450G-->Engenius 2611AP-->customers
When the sessions drop, all customers are still in the AP. They all authenticate to a central RADIUS server.
We have 15 MikroTiks in use; none has quite the client load of these two, and none ever does the drop-all-customers thing. The two that drop are both using 4.10 firmware. We added a 5.8 link to one of the MikroTiks (all are 2.4), so one has 2.4 and 5.8 customers, and all drop the same, which to me points to the MikroTik as the common point of the issue. We have replaced pretty well everything on the tower: the AP with 3 different units, and we even went with a Ubiquiti radio, but that was the original radio and had a few issues before we converted to PPPoE and MikroTik (errors on RX and TX, which we attributed to too many customers). We are going to try downgrading to 4.5 but, reading this thread, are not holding much hope of it being the cure.

lavv17 · Thu Aug 19, 2010 9:53 am

This may be related.
http://forum.mikrotik.com/viewtopic.php?p=211442

tlcscousin · Sat Aug 21, 2010 11:40 pm

It almost appears that bandwidth usage is the root cause of our disconnects. We set every one of the customers down to 384/128 and so far they have stayed connected 4 hours longer than they used to (we will see if any hit the 1 day mark).
OK, definitely not working: it dropped everybody at about 20 hours.

cuz2000m · Tue Aug 24, 2010 3:09 pm

Hi,
Has anyone been able to check their logs to see if they recognize any similarities before the "crash" of the PPPoE servers. In mine, I notice that my VLAN interfaces all switch to the UP state. I don't see anything going to the DOWN state prior to this though. So I have no idea why the state changed to UP. Is anyone else logging to a syslog server that can confirm this?

Corey.

roneyeduardo · Fri Sep 03, 2010 1:09 am

Hi all.

I'd like to ask everyone who reported this problem whether a solution has been found to date.

asterisco · Tue Sep 07, 2010 4:18 pm

"Hi,
Has anyone been able to check their logs to see if they recognize any similarities before the "crash" of the PPPoE servers. In mine, I notice that my VLAN interfaces all switch to the UP state. I don't see anything going to the DOWN state prior to this though. So I have no idea why the state changed to UP. Is anyone else logging to a syslog server that can confirm this?
Corey."

Hi,

I'm having the same issue with an RB1100 as PPPoE concentrator and a RocketM5 directly connected to a port of the RB1100 to terminate the wireless connections made to that AP.

Just *BEFORE* pppoe bulk disconnections I see in the RocketM5 logs:

[1404166.824000] AG7240: unit 0: phy 4 not up carrier 1
[1404166.825000] br0: port 1(eth0_real) entering disabled state
[1404168.635000] AG7240: enet unit:0 phy:4 is up...RGMii 100Mbps full duplex
[1404168.636000] AG7240: done cfg2 0x7135 ifctl 0x10000 miictrl
[1404168.636000] br0: port 1(eth0_real) entering learning state
[1404169.636000] br0: topology change detected, propagating
[1404169.636000] br0: port 1(eth0_real) entering forwarding state

This happens from time to time; not very often... Now the question is: which device is going down/up? MikroTik? Ubiquiti?

I have asked in the Ubiquiti forums too: http://www.ubnt.com/forum/showthread.php?t=23032

The simplest workaround I can think of is to put a switch between the PPPoE concentrator and the AP, so that both ends always see an Ethernet link up independently of the other end.

Regards,
Antonio

rborz · Tue Oct 05, 2010 1:19 pm

RB1000 discontinued?!
http://www.routerboard.com/pricelist.php?showProduct=57

Tue Oct 05, 2010 1:36 pm

It has been for a long time already. RB1100.

formico · Sat Oct 09, 2010 12:07 am

I have noticed that when the CPU reaches 100% usage and stays there for more than a couple of minutes, all the connections go down. I hope this is helpful to someone. Now I am trying to install RouterOS on an HP DL380 server (dual quad-core 2.8 GHz CPUs, 4 GB RAM and SAS disks), running it under ESX since RouterOS doesn't support SAS disks. It seems to work fine, but I am not sure that hardware performance is the problem.
I'll keep you all posted on the results of the trial.

fatty · Wed Oct 13, 2010 10:07 pm

Same problem. Replacing the RB1000 with an x86 machine solved the problem.

Thu Oct 14, 2010 9:39 am

Did you all submit tickets to support about this issue with RB1000 as suggested above? Please tell me the ticket numbers and I will check the status.

DSP · Mon Oct 18, 2010 11:30 am

Same problem. I first noticed the problem at about 20 sessions and it persists until now, at 215 sessions. The problem does not affect PPTP sessions connected through the WAN port. What are the "ticket numbers"?

Tue Oct 19, 2010 9:32 am

Email support. When they answer, you will see a ticket number in the subject of the email, like 2010101966000161.

hajde · Thu Oct 27, 2011 7:30 pm

Same problem here, but with x86. I tried changing hardware (3x HP servers: 1. ML110 G5, 2. ML110 G6 and 3. DL380-G7) and the problem still persists.

I contacted support with a supout file a long time ago, but they said everything is OK.

Edit: Same problem on every ROS version I tried.

riyadiari · Fri Mar 23, 2012 5:22 pm

Is there any update on this problem (3 years already)? Any "new" solution?

I have this same problem with RB750, many x86, RB1100AH, with ROS v3.2, v3.3, 4.1 and 5.1.
It always disconnects 40-100 PPPoE connections at a time, 3-5 times daily.
My "temporary" solution with the most stable PPPoE connection was using ROS v<2.9 on x86 as PPPoE server only; so far the problem has never happened again. But really, 2.9...

ferdinandbabst · Mon Mar 04, 2013 11:36 am

Is anyone still having this issue?

We had a similar fault and I found that the issue was actually caused by spanning tree on the switches. We have an RB1000 connected to a set of switches running spanning tree. The issue was that the customer edge ports on the switches were not configured as edge ports, so when a single user disconnects PPPoE, the switches do a complete spanning tree re-election, and then other connected users' sessions drop because of the re-election process between the switches. Not at all a MikroTik fault.
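
The post does not say which switches were involved, so the exact fix is vendor specific. Purely as an illustration, if the customer-facing ports sat on a RouterOS bridge running (R)STP, marking a port as an edge port might look like the sketch below. The bridge name `bridge-access` and the port `ether5` are hypothetical.

```
# Treat the customer-facing port as an edge port so that a link flap
# on it does not trigger a topology change across the bridge.
/interface bridge port set [find bridge=bridge-access interface=ether5] edge=yes

# Confirm the setting.
/interface bridge port print detail where interface=ether5
```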

Redrik · Thu Jan 29, 2015 3:35 pm

I have a very similar issue. My setup is:
Mikrotik Cloud core router runs a pppoe server, which has only around 70 clients so far. I've set up another Mikrotik (RB1100) as a radius server to control user access and bandwidth and to get statistics. The interface with pppoe server is connected to an unmanaged switch with 2 dslams. Both Mikrotiks run on 6.24 version.

Multiple PPPoE sessions drop simultaneously and immediately re-initiate, all the time. The number of dropped sessions and their uptime vary with no obvious pattern at all. The customers complain that they can't access the Internet until they reboot their modems. But not all of them: some can still use the Internet even after the session drops and is re-created. And again, there is no pattern here either. The log shows that termination is initiated by the client.

I increased the keepalive timeout from 10 to 120 seconds. Drops became less frequent but were still there. Setting the keepalive timeout to 0 didn't make much difference.

I managed to get hold of a couple of customers before they rebooted their modems. The dynamic PPPoE interface created by the server shows as up and running, but the customer can't get on the Internet. When I delete that interface manually, it is re-created within 4-5 seconds and the customer confirms the Internet is OK after that, without rebooting the modem.

Please!!! Can anyone help me with this issue?
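
The two manual steps described in this post (raising the keepalive timeout and kicking a stuck session) can be sketched on the command line as follows. The user name `customer01` is a placeholder, and removing the entry from the active list is assumed here to be equivalent to deleting the dynamic interface as the poster did.

```
# Raise the keepalive timeout on all PPPoE servers (the post tried 120 seconds).
/interface pppoe-server server set [find] keepalive-timeout=120

# List the active PPPoE sessions, then drop one stuck session by user name
# so that the modem re-establishes it.
/ppp active print where service=pppoe
/ppp active remove [find name="customer01"]
```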

chrisw · Thu Aug 20, 2015 8:45 pm

It's rather discouraging seeing how old this thread is when I'm having the same problem here, five years later on a CCR1036-12G-4S running RouterOS v6.31.

2,000 PPPoE clients. Sometimes if a node drops during maintenance, we expect to lose about 200 sessions. Those sessions do drop, and then the REST OF THE ENTIRE NETWORK drops along with them. Once PPPoE is finished dropping ALL its connections, then it starts reauthenticating people. CPU does not appear to be taxed at all when the sessions drop, and it drops sessions very slowly (2-3 per second.) When this happens, it's simply faster to yank the power cord and have it boot back up, reacquire a full route table via BGP, and reauthenticate all 2,000 PPPoE users. Otherwise, it'll take it at least 10 minutes just to drop all the sessions.

hashbang · Mon Aug 24, 2015 9:41 pm

Right, after so many years the problem is still there, but what is the cause? I'm experiencing the same thing. One of my networks runs PPPoE like this: MT.....switch.....nanobeam.........nanobeam....switch....subscribers. The number of subscribers is low, around 20. They all experience disconnection problems every now and then. I've tested the link and it gives 100 Mbps aggregate throughput. Still groping in the dark. My hardware is an RB2011 on version 6.18.

hugleo · Wed Nov 18, 2015 2:46 am

chrisw,

"2,000 PPPoE clients. Sometimes if a node drops during maintenance, we expect lose about 200 sessions. Those sessions do drop, and then the REST OF THE ENTIRE NETWORK drops along with them. Once PPPoE is finished dropping ALL its connections, then it starts reauthenticating people."

We have exactly the same problem; we are using a CCR1036-12G-4S.

Can the MikroTik team do something about it?

hugleo · Wed Jan 20, 2016 8:29 pm

And it continues happening here...

No solution so far.
MikroTik support, can you say whether there is anything new about it?

genie · Fri Feb 05, 2016 4:49 am

PPPoE clients get disconnected even though they are logged in. This happens when you use the PPP--->Active connection tab to check the status of PPPoE users. This is an old known bug which MikroTik has yet to resolve. Don't use this tab; instead use the PPP--->Interface tab to get active connection details. From there, select a user and click on it to get the uptime and IP address as well, or as a workaround add additional columns to the Interface tab to display uptime and IP address.

Well, this is one of the causes of the unexplained disconnections. Hope this helps.

Genie.

hugleo · Fri Feb 05, 2016 11:48 am

It may be another bug.
The bug I am talking about is that, even when I use multiple PPPoE servers on different interfaces, if I just disconnect a cable, CPU load grows while the clients are disconnected and reconnected. Because MikroTik does not parallelize this work across CPUs, all the other PPPoE clients start disconnecting by timeout, since MikroTik does not send the PPPoE echo messages. The whole process lasts 8 minutes in that state and stabilizes again after that.
MikroTik says they will solve this problem in RouterOS v7. For now I will try to change the echo tolerance of all PPPoE clients to something like 3 minutes, to see if that can minimize the damage.
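
If the customer devices happen to run RouterOS, the 3-minute tolerance mentioned above might be set on each CPE as sketched below. This is an assumption about the client side: modems from other vendors expose the same idea under different names (typically an LCP echo interval and failure count).

```
# On a RouterOS CPE: wait up to 180 seconds for keepalive replies
# before declaring the PPPoE session dead.
/interface pppoe-client set [find] keepalive-timeout=180
```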

hugleo · Tue Mar 22, 2016 4:02 am

Maybe will be solved in 6.35?

6.35rc Changelog:

*) ppp - fixed crash when ppp interface gets disconnected and user gets authenticated at the same time (most probable with slow RADIUS server);

*) ppp - fixed memory leak high number of pppoe clients to the same server;

*) ppp - do not crash when received multiple CBCP packets;

*) ppp - close connection if peer wants to re-authenticate;

flameproof · Fri Aug 14, 2020 9:06 am

I hate reviving old threads from years past, but this one IMHO is worth keeping alive. We have the same issue with 1300 PPPoE sessions on a CCR1702. We are able to reliably reproduce this:

1. Drop a number of customers by:
a) Rebooting a downstream switch
OR
b) Rebooting a PtP AirFiber serving a downstream switch
OR
c) Pull one of the ports on the bridge serving PPPoE on the CCR

2. We will see traffic drop according to the segment lost.
3. When the disconnect completes, traffic resumes.
4. About 2 minutes after traffic resumes, ALL traffic stops at the CCR, and PPPoE sessions start dropping - sometimes it's all of them, sometimes only a portion.

CCR remains accessible during these events, but no amount of CPU profiling has pointed to anything specific. Mikrotik support ended up shrugging and said "our hardware won't support your current configuration" without further details. The interface is a 10Gbps fiber, so this is not a "you're choking your 1G link".

I think this problem is embedded deeply in the core of the operating system, and thus has not been fixed during years of development, upgrades and fixes.

At this point, we are looking at alternative vendors, at a loss of thousands of dollars to Mikrotik (we are a credible ISP in Eastern Africa with some 15.000 customers... and plans for growth to 200.000 customers).

glueck05 · Fri Aug 14, 2020 11:01 am

Hello,
do not use a CCR for more than 1000 PPPoE customers per device. And under all circumstances disable connection tracking on the CCR. We use x86 devices, which can handle >= 4000 PPPoE customers, but also with connection tracking disabled.
We use these devices:

1) X86_64: http://www.lannerinc.com/products/netwo ... s/nca-5510
2) 8 Port-Copper Port: http://www.lannerinc.com/products/x86-n ... cm-igm801a
3) 4 Port SFP+: https://www.landitec.com/products/x86-n ... 05a-detail

regards,
glueck

rodolfo · Fri Aug 14, 2020 1:18 pm

@glueck

The traffic drops are caused by the CPU sitting at 100%, busy removing the dropped PPPoE users' connections from the connection table.
This can last for some minutes, during which the router may be unreachable.
We now have one CCR1036 with 4000 PPPoE users (distributed across 200 PPPoE servers).
We resolved it as follows:
1. Remove all firewall and eBGP functions from the PPPoE server.
2. Disable connection tracking on the PPPoE server.
We no longer use x86 as PPPoE server.
HTH
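
The second step of this recipe can be sketched as below, using RouterOS v6 syntax. Note the side effect, which is why the post also moves firewall functions off the box: with connection tracking off, NAT and any rule that matches on connection state or connection marks stop working on this router.

```
# Turn off connection tracking on the dedicated PPPoE concentrator.
/ip firewall connection tracking set enabled=no

# Verify the setting.
/ip firewall connection tracking print
```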

flameproof · Fri Aug 14, 2020 1:48 pm

@glueck @rodolfo

Thanks for your input and suggestions - we are definitely contemplating the x86 metal + dedicated PPPoE stack as an option.

On the connection tracking disabled - how would you handle dynamic rate limiting without it? We use a simple queue for each CPE session, assigned based on RADIUS response (and the service level set on the customer DB). We also (in some cases) use mangle rules to direct traffic where we have more than one upstream link (e.g. two parallel 1Gbps fibers).

rodolfo · Fri Aug 14, 2020 4:53 pm

For dynamic rate limiting, we use a simple queue for each user session, assigned based on the RADIUS response.
For mangle, be sure to mangle in raw queues, although we prefer to delegate mangle/route/BGP/firewall to a border RouterBOARD separate from the PPPoE server (also because it is useful to have at least two redundant PPPoE servers).
