I wasn't able to reproduce it in the lab either (which made me think I could have an issue with some firewall rules at the HQ?), but I have a production environment where I can reproduce it every time I want.

I was never able to collect enough evidence to open a support case with Mikrotik because I could never reproduce it on my set of lab machines, and I'm not going to send supout.rif from production machines with PSK authentication on IPsec, even though Mikrotik states the PSKs are not saved into the supout.
So to contribute - if I remember right, I had this problem when a CHR was at one end and an RB1000AHx4 at the other.

Meanwhile I opened a case with Mikrotik and I sent this thread... let's see what happens.
I tried on one site but I have the same issue! Could it mean that I have some issue on the HQ firewall?

Would it be too painful for you to change the tunnels from GRE to IPIP?
I have a CCR1036 in the HQ and an RB3011 in the BO...
Hopefully one of these is the reason; otherwise it would mean that the IPIP handling is flawed too.
...
The only difference is that in the other environments the GRE/IPIP tunnel is not the default gateway in the BO: is it possible that this causes the issue?
When it is the same issue as what I see, it is not a GRE or other tunnel issue but an IPsec issue.

What it seems to me is that the invalid GRE session somehow breaks the IPsec.
But still, should I say it's maybe a firewall issue on the HQ?
# /system script: name=ipsecerrorhandler owner=admin policy=ftp,read,write,policy,test
# scan log buffer for ipsec error messages
# when error is "phase1 negotiation failed due to time up", temporarily block the sender
# this is done to work around problems with some NAT routers

:global lastTime;

:local currentBuf [ :toarray [ /log find topics=ipsec,error and message~"phase1 negotiation failed due to" ] ];
:local currentLineCount [ :len $currentBuf ];

:if ($currentLineCount > 0) do={
    :local currentTime [ /log get [ :pick $currentBuf ($currentLineCount - 1) ] time ];
    :if ($currentTime != $lastTime) do={
        :set lastTime $currentTime;
        :local currentMessage [ /log get [ :pick $currentBuf ($currentLineCount - 1) ] message ];
        :if ($currentMessage~"due to time up") do={
            :local ipaddress [ :pick $currentMessage ([ :find $currentMessage "<=>" ] + 3) 99 ];
            :set ipaddress [ :pick $ipaddress 0 [ :find $ipaddress "[" ] ];
            :local activepeers [ /ip ipsec active-peers find where remote-address=$ipaddress and state=established ];
            :if ([ :len $activepeers ] = 0) do={
                :log info "Temporarily blocking $ipaddress due to errors";
                /ip firewall address-list add list=blocked address=$ipaddress timeout="00:05:00";
            }
        }
    }
}
I really appreciate the help, but at the moment the HQ configuration is a disaster; we are fixing it, but I think we need some weeks before having a clean configuration. If the issue still happens after the configuration has been corrected, I will send you the configuration.
@sindy: after further tests, the behaviour is exactly the one described in this topic: viewtopic.php?t=122014
So please post the anonymized configuration of both devices (the BO one currently running IPIP and the HQ one), see my automatic signature below for a hint.
Hello Sindy,

As @pe1chl wrote, there may be firewall/NAT issues associated with the PPPoE flap. If there is NAT somewhere between the peers, both IKE (or IKEv2) and the transport packets use the same UDP stream, and either Mikrotik's own NATs or those on the ISP's devices may behave in an unexpected way when the address and port at one end change whilst the other end keeps remembering the old ones.
With multiple WANs, there's one more complication - I had cases where the SAs chose a wrong peer locally after restart or path glitch, so the remote peer was ignoring the transport packets as they came in via a wrong SA.
To suggest some analysis steps, the anonymized configuration is necessary. Maybe start from posting the one from the BO side running the IPIP tunnel (which is not a "disaster" like the HQ one), and stating whether the WANs get the same IP each time the PPPoE connects and whether that IP is a public one or not.
I'll wait for some news, I am quite desperate at the moment.

If I make any progress I'll share it here.
And if you try to disable and enable the phase 2 policy on the Mikrotik, what happens?

The installed SAs clear; following this, IPsec is broken until a reboot, or until you hold the PPPoE down for like 3-5 min.
Did you check in /ip firewall connection whether the ipsec-esp (50) connection has a dst-nat flag? If yes, you should avoid that.

The only sure-fire ways that work for me are holding the PPPoE down for 3-5 min once the issue presents, or just rebooting the device.
Both devices are connected to the internet the same way, with a real and static external IP.

Are these devices all connected to the internet the same way? I.e. is it always a plain PPPoE connection to an ISP that offers a real and static external IP, not using CG-NAT?
Omg, I am so dumb, sorry, I posted the wrong file. I re-posted it anonymized.

You forgot to obfuscate HQ.rsc, maybe you want to withdraw it and post it again once anonymized?
In the test case we have both IPs from the same provider, but for the real tunnels where I noticed the issue, other providers are involved, so other prefixes.

The fact that both public IPs involved in this test setup come from the same prefix/range may or may not be relevant - is it the same case for the "real" tunnels?
Ok, so here I am with the results of the test.

Oops. I haven't encountered an ISP that uses compression on PPPoE yet, so I've never noticed that Wireshark doesn't inflate the payload. The only information I could find is that this was already the case 8 years ago.
Try to disable compression on the /ppp profile entry used by the PPPoE client (by setting use-compression to no) - if the ISP accepts that, Wireshark will show you the payload. If it doesn't, you'll have to test using bandwidth-test with a manually specified test packet size and random data as payload instead of ping, so that the ESP packets can be easily identified by size, as it should be impossible to actually compress the random data. But it will be a trial and error process.
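For reference, assuming the PPPoE client uses a profile called "pppoe-wan" and the test target is 203.0.113.1 (both placeholders), the two options would look roughly like this:

# option 1: ask the ISP side not to compress, so Wireshark can see the payload
/ppp profile set pppoe-wan use-compression=no
# option 2: generate fixed-size, incompressible test traffic instead of ping
/tool bandwidth-test address=203.0.113.1 protocol=udp direction=transmit local-udp-tx-size=1400 random-data=yes user=admin password=""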
At both ends, use /ip firewall connection print detail where protocol~"esp" before and after doing the PPPoE cycle, and look for differences between the "before" and "after" state.

Any idea?
Well... my personal preference is analysis first, workarounds next. So far we only suspect that the issue has something to do with connection tracking although no NAT is involved, so if there is no difference between "before" and "after", it may be an internal issue of the firewall and therefore removal of the tracked ESP connection may help; but it may as well be a coincidence that the SA has rekeyed after 10 minutes and the actual issue may have been a too large gap in ESP sequence numbers. /ip ipsec statistics print should help here - if, after the glitch, some counter (in-state-sequence-errors?) keeps growing there, it's an IPsec behaviour (not necessarily a wrong one), otherwise it's more likely a firewall issue. I incline towards the firewall explanation, as a few lost ESP packets should be silently ignored, but who knows.
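To put the two checks next to each other (run on both routers, once before and once after the PPPoE cycle, and compare the outputs):

/ip firewall connection print detail where protocol~"esp"
/ip ipsec statistics print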
Here are the screenshots. Apparently no differences.
I'm used to that approach too: I want to keep it as my last resort, because sometimes scripts don't work as expected.

Also it is possible to set an "on up" script in the PPP profile used for the PPPoE connection (probably best to copy it from the default profile and assign an appropriate name, then set that profile in the PPPoE connection).
In this script you can do things like removing all tracking entries related to the connection.
I have done that in the past to remove entries related to my SIP phone, which also caused trouble in some cases.
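A rough sketch of that idea, assuming a copied profile called "pppoe-wan" and a client named "pppoe-out1" (both placeholders), and assuming a RouterOS version that allows removing tracked connections from the console:

# copy the default profile and attach a cleanup script that drops tracked ESP connections on link-up
/ppp profile add name=pppoe-wan copy-from=default on-up="/ip firewall connection remove [ find protocol~\"esp\" ]"
/interface pppoe-client set pppoe-out1 profile=pppoe-wan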
After the cycle, the IPsec counters don't increase at all, they remain the same.
Hello @sindy,

The only thing that comes to my mind is to avoid tracking of ESP connections, in the hope that the behaviour is caused by a bug in connection tracking. In particular:
/ip firewall raw
add chain=prerouting protocol=ipsec-esp src-address-list=allowed-ipsec-peers action=notrack
add chain=output protocol=ipsec-esp action=notrack
Add untracked to the connection-state list in the first rule of the input chain in filter (so that it says action=accept connection-state=established,related,untracked), and either add the addresses of the peers to the address list allowed-ipsec-peers, or omit the match on that src-address-list in the raw rule completely.
If that helps, it may be possible to slightly improve security by populating that address-list dynamically.
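For completeness, this is roughly what the end state would look like; the rule below only illustrates how the first input rule should read after the edit, and the peer address in the address-list entry is a placeholder:

# the first rule in chain=input should end up looking like this
/ip firewall filter
add chain=input action=accept connection-state=established,related,untracked place-before=0
# if you keep the src-address-list match in raw, populate the list (placeholder address)
/ip firewall address-list
add list=allowed-ipsec-peers address=203.0.113.10 comment="remote IPsec peer"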
Although the Mikrotik configuration is the same, is the ISP configuration the same as well? In particular, do both these Mikrotiks get a public IP from the PPPoE server? Or does the xDSL one get a public one whilst the wireless one gets a private/CGNAT one?

I have one PPPoE interface on each RB. The only difference between the two connections is the link type.
The first is connected directly to a radio on the roof, via ethernet cable.
The second is connected to a Zyxel xDSL modem, configured in bridge mode.
When I disable and re-enable the "wireless" PPPoE, the IPSEC doesn't stop working. Never.
When I disable and re-enable the "xDSL" PPPoE, the IPSEC always stops working.
Could this be related somehow to the issue? I don't think it would influence how the Mikrotik handles the PPPoE.
I've definitely seen endianness bugs (e.g. vlan-protocol in /interface bridge filter), but I cannot imagine how an endianness bug could be related here. @loca995, what are the exact models of the "pppoe via wireless" and "pppoe via xDSL" devices you've mentioned above? Or simpler, are they different or the same?

It could be an endianness bug?
I have some funny news.

It could be a bug in the ZyXEL, e.g. when it does connection tracking.
At the moment I am researching another strange bug that also involves a ZyXEL DSL modem.
If this changes anything, it could point to a problem with the ZyXEL modem (for both of us).
These gave me an idea. I had another xDSL D-Link modem. I put the factory config on it, then set up the same bridge mode as I did with the Zyxel.

Highly theoretically their contents may be malformed, but that would be another issue than the one you experience with your Zyxel.
@loca995, what does the /ip ipsec installed-sa print show after the PPPoE glitch?
The ISP is the same and assigns public IP addresses on both PPPoE interfaces. There's no CGNAT.
The wireless PPPoE is installed on the 2011, so mipsbe.
Neither do I at the moment, but it seems that there is some endianness or alignment problem when adjusting the DSCP, and it changes when I change the packet structure.
This finally solves the problems with the Zyxel modem! Very nice! I just moved the VLAN tag from the Zyxel to the Mikrotik and now the IPsec doesn't stop working anymore!

I.e. when I have PPPoE over VLAN (the ISP requires the PPPoE over VLAN 6 and I apply a VLAN interface to ether1 and stack the PPPoE on top of that) it behaves differently from when I stack the PPPoE directly on ether1 and then configure the modem to add the VLAN tag.
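In case it helps anyone else, the change boils down to stacking the PPPoE client on a VLAN interface on the Mikrotik instead of letting the modem add the tag; the interface names and the VLAN ID below are placeholders (the ID depends on the ISP):

/interface vlan
add name=vlan-vdsl interface=ether1 vlan-id=835
/interface pppoe-client
set pppoe-out1 interface=vlan-vdsl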
Do you mean the "GRE over IPsec stops working when the PPPoE interface flaps" issue, or an issue with DSCP setting/usage on ARM routers?

I have SUP-52523 open for the same issue.
Same here. 3x RB3011, 2 of them connected via PPPoE. IPsec regularly stops working; only a reboot helps. So I need to use SSTP instead of L2TP/IPsec. 6.48.5.

It's for the "IPsec breaks after PPPoE interface flaps" issue.
It's on an RB1100AH4 (ARM) also.
I've simplified my configs back to just an IPsec tunnel, no GRE, and I still have issues after a PPPoE bounce.
Last night I upgraded to 6.49 and tried the same thing again.
I've captured logs and supout files and added them to the ticket.
The tunnel is between an RB1100 <-> hEX gr3. If the hEX's PPPoE bounces, no issue. If the RB1100's PPPoE bounces, then IPsec fails.
Hello @pe1chl,

I am back after some tests in the production environment.
A little recap of the prod. environment:
HQ: CCR1036 - firmware version 6.48.3, tile
BO: RB3011 - firmware version 6.48.3, arm
BO has 3 different providers. Each of them has a different PPPoE interface. The provider assigns the public IP on the PPPoE interface. No CGNAT.
- Wireless link, Mikrotik connected via ethernet to the radio on the roof. No VLAN tag is involved.
- VDSL-1, Mikrotik connected via ethernet to a Zyxel modem, which is configured in bridge mode. VLAN tagging is done by the Mikrotik (thanks to @pe1chl's suggestion about VLANs, this link now works correctly).
- VDSL-2, Mikrotik connected via ethernet to a D-Link modem, which is configured in bridge mode. VLAN tagging is done by the D-Link, because it seems impossible to move the VLAN tagging to the Mikrotik. I guess the D-Link WAN bridge mode is stricter than the Zyxel one and doesn't let the VLAN tag pass.
Each of these links has 2 different IPIP over IPsec tunnels towards the 2 internet links in the HQ.
The default route on the BO side is via one of the IPIP over IPsec tunnels built on VDSL-2.
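For context, one of those BO-side tunnels looks roughly like this; every name and address is a placeholder, and the IPsec part may be done either with the ipsec-secret shortcut shown here or with explicitly configured peers and policies:

/interface ipip
add name=ipip-vdsl2-hq1 local-address=203.0.113.10 remote-address=198.51.100.1 ipsec-secret=ChangeMe keepalive=10s,10
/ip address
add address=10.255.0.2/30 interface=ipip-vdsl2-hq1
/ip route
add dst-address=0.0.0.0/0 gateway=10.255.0.1 comment="default route via the VDSL-2 IPIP tunnel"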
When I disable / enable the VDSL-2 PPPoE interface, the default route changes.
Then the PPPoE interface comes back up, but the ESP session between VDSL-2 and the HQ (the one used to bring up the IPIP interface that served as the default route) hangs.
The only way to bring the ESP session back is to stop all traffic that goes through this ESP policy (in my case I disable the IPIP interfaces on both sides, so that the IPIP keepalive packets are not routed).
Once I am sure that there is no traffic in the ESP session, it starts working again. In my opinion, connection tracking is involved.
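In command form, the recovery workaround described above is roughly the following (the interface name is a placeholder, and the SA flush is an optional extra step not mentioned above):

# stop the traffic matching the ESP policy (done on both sides)
/interface ipip disable ipip-vdsl2-hq1
# optionally clear the installed SAs, then bring the tunnel back
/ip ipsec installed-sa flush
/interface ipip enable ipip-vdsl2-hq1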
So I decided to change the default route to the IPIP tunnel made with the VDSL-1 PPPoE interface.
As I expected, when I now disable/enable the VDSL-2 PPPoE interface, the default route doesn't change: the PPPoE comes back up and the ESP session over VDSL-2 doesn't stop working.
In the next days I will analyze the ESP flows in more detail with the packet sniffer.
I am very curious to see whether the behaviour changes if I replace the VDSL-2 D-Link modem with one from another vendor (Zyxel, TP-Link).
Any suggestions to avoid this behaviour of the ESP session?
I had another BO with the same provider situation as the example described above, but the Routerboard is a 2011 - mipsbe.

I have confirmed that in my case the bug occurs only on my 4011 (ARM); the 2011 (MIPSBE) works fine with exactly the same config.
@pe1chl, I am back after more tests. I am lucky (or unlucky, depending on the point of view) that the network scenario where I am experiencing the problem is made of 40 branch offices; altogether there are 3-4 different types of Mikrotik products.

Yes, it seems like there still are some bugs in the ARM version of RouterOS.
Correct, DPD is still enabled on both the BO and the HQ. Unfortunately 80% of the PPPoE cycles are caused by the "session-timeout" attribute sent by the RADIUS server, so the cycle is super fast, like 1 s or less. In such a situation DPD doesn't help, because it doesn't lose any keepalive packet.

I presume you have not fiddled with settings in IPsec like DPD, send-initial-contact etc. Keep them at their default values. Without DPD you certainly will have trouble at link resets.
But that does not mean there are no other scenarios where these issues can occur, and maybe they are caused by RouterOS bugs, maybe for all architectures, maybe for a single one.
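For reference, on recent 6.x versions the DPD settings live in the IPsec profile; a quick way to confirm they are still at the defaults on both ends is simply to print them:

# look at dpd-interval and dpd-maximum-failures in the profile used by the peer
/ip ipsec profile print detail
/ip ipsec peer print detail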
Yes, you assumed correctly. This is exactly my setup: a /32 route towards the IPsec peer via the PPPoE interface.

As you mention the default route and the PPPoE - I assume you've got a specific route with the remote peer as dst-address via the PPPoE, and a default route via GRE as you've mentioned. So as long as the PPPoE is up, that specific route overrides the default one; once the PPPoE goes down, the specific route becomes inactive, so the IPsec transport packets (ESP) start getting routed via the GRE, encapsulated, encrypted, and routed to the GRE again.
This is exactly my setup. We already used the blackhole; in some cases it helped, in others it seems not to. I am glad to know that the reasoning about the ESP traffic being routed into the default GRE tunnel was correct.
So just a suggestion - create another route, with distance=10, with dst-address set to the address of the remote peer, no gateway, and type=blackhole. This route will prevent the transport packets from being sent down the default route while the PPPoE tunnel is down. See whether it helps anything.
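In command form, assuming the remote peer's public address is 198.51.100.1 (placeholder), that would be something like:

/ip route
add dst-address=198.51.100.1/32 type=blackhole distance=10 comment="keep ESP off the default route while the PPPoE is down"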
Unfortunately I have not heard anything either on my ARM-related bug; initially there was a reply within a day or so, but after I supplied more information it went silent.

Sadly Mikrotik support are being very silent on the ticket I have opened.
Well, in my case the first reply unfortunately was not to the point (it questioned the use of a VLAN-filtering bridge on the LAN side while my problem occurs on a VLAN on the WAN side that is not part of that bridge), so I explained what exactly is required to reproduce the problem and that the bridge is not required for that. Then I heard nothing back anymore.

I wouldn't hold my breath. My ticket was opened back in June '21 and to date I've only got one reply on it, which was in Oct '21.
Well, I don't see it that way. I can understand that support gets loads of tickets from people who have misunderstood something, made some config mistake they overlooked, etc. I have opened such tickets myself in the past.

It's sad to see that support seems to have dropped off. Like I said, I don't expect ci$co levels of support (not that those levels are great these days for the price you pay).