Hi everyone,
We currently have a topology where our clients connect to our CCR, authenticate against RADIUS, and then get access to certain services with very low data consumption.
Before putting it into production we tested with more than 5,000 connections and found no problems.
Today in production we have around 200 connections per CCR. Most of our clients are on high-latency internet links with frequent abrupt disconnections.
That said, we are experiencing a problem where at some point connections get stuck under PPP > Active Connections and cannot be deleted.
We delete the connection, but after changing tabs (or closing and reopening Winbox) the connection is back.
All connections (150)
We delete some of the connections (4).
So we just change tabs, and the connection is back.
At first glance this might look like a "visual only" problem. However, due to the service we provide, we need to ensure that only one connection per user is active at a time.
So if a connection gets stuck, the "real client" trying to connect cannot succeed.
The only way we have found so far to "solve" the problem is to reboot the CCR.
Below we would like to share some notes:
1) OVPN Server keepalive;
If a connection is no longer responding, the session should be terminated. That would prevent this problem.
2) Probably because of our customers' link quality, some messages between the CCR and RADIUS go missing: RADIUS should receive a STOP message but never does.
A session starts with the acct_status_type field set to "START";
During the session, information is exchanged between the CCR and RADIUS (INTERIM_UPDATE);
When the connection is closed, the CCR should send the "STOP" message to RADIUS.
In all problematic cases we have, the "STOP" value was not passed to RADIUS.
3) Maybe if the OVPN keepalive were working, situation #2 would not occur.
We do not understand how a connection could get stuck in the CCR. Say there was an abrupt disconnection: we think this is exactly the role of keepalive, to check whether those connections actually remain "usable".
4) Based on reports in several forums, this rare situation seems to happen only on the Tile architecture.
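For anyone who wants to check note 2 on their own RADIUS side, here is a minimal Python sketch of how sessions with a START but no STOP could be flagged. The record layout, the field names (session_id, username, status, timestamp), the sample data and the 15-minute silence threshold are all assumptions for illustration; they are not taken from our RADIUS server or from RouterOS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AcctRecord:
    # Hypothetical, simplified view of one RADIUS accounting record.
    session_id: str
    username: str
    status: str          # "START", "INTERIM_UPDATE" or "STOP"
    timestamp: datetime


def find_stale_sessions(records, now, max_silence):
    """Return (username, session_id, last_seen) for sessions that never
    received a STOP and have been silent for longer than max_silence."""
    last_seen = {}   # session_id -> most recent record of that session
    closed = set()   # session_ids for which a STOP was recorded
    for rec in records:
        if rec.status == "STOP":
            closed.add(rec.session_id)
        prev = last_seen.get(rec.session_id)
        if prev is None or rec.timestamp > prev.timestamp:
            last_seen[rec.session_id] = rec

    stale = []
    for sid, rec in last_seen.items():
        if sid in closed:
            continue  # session ended cleanly
        if now - rec.timestamp > max_silence:
            stale.append((rec.username, sid, rec.timestamp))
    return sorted(stale)


# Made-up sample data: alice is still active, bob went silent without a
# STOP, carol disconnected cleanly.
day = datetime(2024, 1, 1)
records = [
    AcctRecord("s1", "alice", "START", day.replace(hour=11, minute=0)),
    AcctRecord("s1", "alice", "INTERIM_UPDATE", day.replace(hour=11, minute=55)),
    AcctRecord("s2", "bob", "START", day.replace(hour=10, minute=0)),
    AcctRecord("s2", "bob", "INTERIM_UPDATE", day.replace(hour=10, minute=30)),
    AcctRecord("s3", "carol", "START", day.replace(hour=9, minute=0)),
    AcctRecord("s3", "carol", "STOP", day.replace(hour=9, minute=30)),
]
now = day.replace(hour=12, minute=0)
stale = find_stale_sessions(records, now, timedelta(minutes=15))
for username, session_id, seen in stale:
    print(username, session_id, seen.isoformat())
```

A list like this only shows which sessions RADIUS still believes are open; it does not by itself clear the stuck entries on the CCR.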
We have had a ticket open with Support since August 14. At that point we could only describe the problem, but couldn't "show" it.
On September 27 the problem occurred again, leaving 150 connections stuck.
We believed we could help Support by "freezing" this unit, so we did. Since 09/27 we have taken one of our CCRs out of production and given MikroTik Support access to it.
So we would like to know if anybody has experienced something like this, or if you have any suggestions for us.
Thank you!