I wasn't able to reproduce it in the lab either (which made me think I could have an issue with some firewall rules at the HQ?), but I have a production environment where I can reproduce it every time I want.

I was never able to collect enough evidence to open a support case with Mikrotik because I could never reproduce it on my set of lab machines, and I'm not going to send supout.rif from production machines with PSK authentication on IPsec, even though Mikrotik states the PSKs are not saved into the supout.
So to contribute - if I remember right, I had this problem when a CHR was at one end and an RB1000AHx4 at the other.

Meanwhile I opened a case with Mikrotik and I sent this thread... let's see what happens.
I tried on one site but I have the same issue! Could it mean that I have some issue on the HQ firewall?

Would it be too painful for you to change the tunnels from GRE to IPIP?
I have a CCR1036 in the HQ and an RB3011 in the BO...
Hopefully one of these is the reason; otherwise it would mean that the IPIP handling is flawed too.
...
The only difference is that in the other environments the GRE/IPIP tunnel is not the default gateway in the BO: is it possible that this causes the issue?
When it is the same issue as what I see, it is not a GRE or other tunnel issue but an IPsec issue.

What it seems to me is that the invalid GRE session somehow breaks the IPsec.
But still, should I say it's maybe a firewall issue on the HQ?
# /system script: name=ipsecerrorhandler owner=admin policy=ftp,read,write,policy,test
# scan log buffer for ipsec error messages
# when error is "phase1 negotiation failed due to time up", temporarily block the sender
# this is done to work around problems with some NAT routers

:global lastTime;

:local currentBuf [ :toarray [ /log find topics=ipsec,error and message~"phase1 negotiation failed due to" ] ];
:local currentLineCount [ :len $currentBuf ];

:if ($currentLineCount > 0) do={
    :local currentTime [ /log get [ :pick $currentBuf ($currentLineCount - 1) ] time ];
    :if ($currentTime != $lastTime) do={
        :set lastTime $currentTime;
        :local currentMessage [ /log get [ :pick $currentBuf ($currentLineCount - 1) ] message ];
        :if ($currentMessage~"due to time up") do={
            :local ipaddress [ :pick $currentMessage ([ :find $currentMessage "<=>" ] + 3) 99 ];
            :set ipaddress [ :pick $ipaddress 0 [ :find $ipaddress "[" ] ];
            :local activepeers [ /ip ipsec active-peers find where remote-address=$ipaddress and state=established ];
            :if ([ :len $activepeers ] = 0) do={
                :log info "Temporarily blocking $ipaddress due to errors";
                /ip firewall address-list add list=blocked address=$ipaddress timeout="00:05:00";
            }
        }
    }
}
I really appreciate the help, but at the moment the HQ configuration is a disaster; we are fixing it, but I think we need some weeks before having a clean configuration. If the issue still happens after the configuration has been corrected, I will send you the configuration.
@sindy: after further tests, the behaviour is exactly the one described in this topic: viewtopic.php?t=122014
So please post the anonymized configuration of both devices (the BO one currently running IPIP and the HQ one), see my automatic signature below for a hint.
Hello Sindy,

As @pe1chl wrote, there may be firewall/NAT issues associated with the PPPoE flap. If there is NAT somewhere between the peers, both IKE (or IKEv2) and the transport packets use the same UDP stream, and either Mikrotik's own NATs or those on the ISP's devices may behave in an unexpected way when the address and port at one end change whilst the other end keeps remembering the old ones.
With multiple WANs, there's one more complication - I had cases where the SAs chose a wrong peer locally after restart or path glitch, so the remote peer was ignoring the transport packets as they came in via a wrong SA.
To suggest some analysis steps, the anonymized configuration is necessary. Maybe start from posting the one from the BO side running the IPIP tunnel (which is not a "disaster" like the HQ one), and stating whether the WANs get the same IP each time the PPPoE connects and whether that IP is a public one or not.
I'll wait for some news, I am quite desperate at the moment.

If I make any progress I'll share it here.
And if you try to disable and enable the phase 2 policy on the Mikrotik, what happens?

The installed SAs clear; following this, IPsec is broken until a reboot, or until you hold the PPPoE down for like 3-5 min.
Did you check in /ip firewall connection whether the ipsec-esp (50) connection has a dst-nat flag? If yes, you should avoid that.

The only sure-fire ways that work for me are holding the PPPoE down for 3-5 min once the issue presents, or just rebooting the device.
Both devices are connected to the internet the same way, with a real and static external IP.

Are these devices all connected to the internet the same way? I.e. is it always a plain PPPoE connection to an ISP that offers a real and static external IP, not using CG-NAT?
Omg, I am so dumb, sorry, I posted the wrong file. I re-posted it anonymized.

You forgot to obfuscate HQ.rsc, maybe you want to withdraw it and post it again once anonymized?
In the test case we have both IPs from the same provider, but for the real tunnels where I noticed the issue, other providers are involved, so other prefixes.

The fact that both public IPs involved in this test setup come from the same prefix/range may or may not be relevant - is it the same case for the "real" tunnels?
Ok, so here I am with the results of the test.

Oops. I haven't encountered an ISP that uses compression on PPPoE yet, so I've never noticed that Wireshark doesn't inflate the payload. The only information I could find is that this was already the case 8 years ago.
Try to disable compression on the /ppp profile entry used by the PPPoE client (by setting use-compression to no) - if the ISP accepts that, Wireshark will show you the payload. If it doesn't, you'll have to test using bandwidth-test with a manually specified test packet size and random data as payload instead of ping, so that the ESP packets can be easily identified by size, as it should be impossible to actually compress the random data. But it will be a trial and error process.
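For reference, assuming the PPPoE client uses a profile called "pppoe-wan" and the test target is 203.0.113.1 (both placeholders), the two options would look roughly like this:

# option 1: ask the ISP side not to compress, so Wireshark can see the payload
/ppp profile set pppoe-wan use-compression=no
# option 2: generate fixed-size, incompressible test traffic instead of ping
/tool bandwidth-test address=203.0.113.1 protocol=udp direction=transmit local-udp-tx-size=1400 random-data=yes user=admin password=""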
At both ends, use /ip firewall connection print detail where protocol~"esp" before and after doing the PPPoE cycle, and look for differences between the "before" and "after" state.

Any idea?
Well... my personal preference is analysis first, workarounds next. So far we only suspect that the issue has something to do with connection tracking although no NAT is involved, so if there is no difference between "before" and "after", it may be an internal issue of the firewall and therefore removal of the tracked ESP connection may help; but it may as well be a coincidence that the SA has rekeyed after 10 minutes and the actual issue may have been a too large gap in ESP sequence numbers. /ip ipsec statistics print should help here - if, after the glitch, some counter (in-state-sequence-errors?) keeps growing there, it's an IPsec behaviour (not necessarily a wrong one), otherwise it's more likely a firewall issue. I incline towards the firewall explanation, as a few lost ESP packets should be silently ignored, but who knows.
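To put the two checks next to each other (run on both routers, once before and once after the PPPoE cycle, and compare the outputs):

/ip firewall connection print detail where protocol~"esp"
/ip ipsec statistics print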
Here are the screenshots. Apparently no differences.
I'm used to that approach too: I want to keep it as my last resort, because sometimes scripts don't work as expected.

Also it is possible to set an "on up" script in the PPP profile used for the PPPoE connection (probably best to copy it from the default profile and assign an appropriate name, then set that profile in the PPPoE connection).
In this script you can do things like removing all tracking entries related to the connection.
I have done that in the past to remove entries related to my SIP phone, which also caused trouble in some cases.
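A rough sketch of that idea, assuming a copied profile called "pppoe-wan" and a client named "pppoe-out1" (both placeholders), and assuming a RouterOS version that allows removing tracked connections from the console:

# copy the default profile and attach a cleanup script that drops tracked ESP connections on link-up
/ppp profile add name=pppoe-wan copy-from=default on-up="/ip firewall connection remove [ find protocol~\"esp\" ]"
/interface pppoe-client set pppoe-out1 profile=pppoe-wan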
After the cycle, the IPsec counters don't increase at all, they remain the same.
Hello @sindy,

The only thing that comes to my mind is to avoid tracking of ESP connections, in the hope that the behaviour is caused by a bug in connection tracking. In particular:
/ip firewall raw
add chain=prerouting protocol=ipsec-esp src-address-list=allowed-ipsec-peers action=notrack
add chain=output protocol=ipsec-esp action=notrack
Add untracked to the connection-state list in the first rule of the input chain in filter (so that it says action=accept connection-state=established,related,untracked), and either add the addresses of the peers to the address list allowed-ipsec-peers, or omit the match on that src-address-list in the raw rule completely.
If that helps, it may be possible to slightly improve security by populating that address-list dynamically.
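For completeness, this is roughly what the end state would look like; the rule below only illustrates how the first input rule should read after the edit, and the peer address in the address-list entry is a placeholder:

# the first rule in chain=input should end up looking like this
/ip firewall filter
add chain=input action=accept connection-state=established,related,untracked place-before=0
# if you keep the src-address-list match in raw, populate the list (placeholder address)
/ip firewall address-list
add list=allowed-ipsec-peers address=203.0.113.10 comment="remote IPsec peer"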
Although the Mikrotik configuration is the same, is the ISP configuration the same as well? In particular, do both these Mikrotiks get a public IP from the PPPoE server? Or does the xDSL one get a public one whilst the wireless one gets a private/CGNAT one?

I have one PPPoE interface on each RB. The only difference between the two connections is the link type.
The first is connected directly to a radio on the roof, via ethernet cable.
The second is connected to a Zyxel xDSL modem, configured in bridge mode.
When I disable and re-enable the "wireless" PPPoE, the IPSEC doesn't stop working. Never.
When I disable and re-enable the "xDSL" PPPoE, the IPSEC always stops working.
Could this be related somehow to the issue? I don't think it would influence how the Mikrotik handles the PPPoE.
I've definitely seen endianness bugs (e.g. vlan-protocol in /interface bridge filter), but I cannot imagine how an endianness bug could be related here. @loca995, what are the exact models of the "pppoe via wireless" and "pppoe via xDSL" devices you've mentioned above? Or simpler, are they different or the same?

It could be an endianness bug?
I have some funny news.

It could be a bug in the ZyXEL, e.g. when it does connection tracking.
At the moment I am researching another strange bug that also involves a ZyXEL DSL modem.
If this changes anything, it could point to a problem with the ZyXEL modem (for both of us).
These gave me an idea. I had another xDSL D-Link modem. I put the factory config on it, then set up the same bridge mode as I did with the Zyxel.

Highly theoretically their contents may be malformed, but that would be another issue than the one you experience with your Zyxel.
@loca995, what does the /ip ipsec installed-sa print show after the PPPoE glitch?
The ISP is the same and assigns public IP addresses on both PPPoE interfaces. There's no CGNAT.
The wireless PPPoE is installed on the 2011, so mipsbe.
Neither do I at the moment, but it seems that there is some endianness or alignment problem when adjusting the DSCP, and it changes when I change the packet structure.
This finally solves the problems with the Zyxel modem! Very nice! I just moved the VLAN tag from the Zyxel to the Mikrotik and now the IPsec doesn't stop working anymore!

I.e. when I have PPPoE over VLAN (the ISP requires the PPPoE over VLAN 6 and I apply a VLAN interface to ether1 and stack the PPPoE on top of that) it behaves differently from when I stack the PPPoE directly on ether1 and then configure the modem to add the VLAN tag.
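In case it helps anyone else, the change boils down to stacking the PPPoE client on a VLAN interface on the Mikrotik instead of letting the modem add the tag; the interface names and the VLAN ID below are placeholders (the ID depends on the ISP):

/interface vlan
add name=vlan-vdsl interface=ether1 vlan-id=835
/interface pppoe-client
set pppoe-out1 interface=vlan-vdsl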
Do you mean the "GRE over IPsec stops working when the PPPoE interface flaps" issue, or an issue with DSCP setting/usage on ARM routers?

I have SUP-52523 open for the same issue.
Same here. 3x RB3011, 2 of them connected via PPPoE. IPsec regularly stops working; only a reboot helps. So I need to use SSTP instead of L2TP/IPsec. 6.48.5.

It's for the "IPsec breaks after PPPoE interface flaps" issue.
It's on an RB1100AH4 (ARM) also.
I've simplified my configs back to just an IPsec tunnel, no GRE, and I still have issues after a PPPoE bounce.
Last night I upgraded to 6.49 and tried the same thing again.
I've captured logs and supout files and added them to the ticket.
The tunnel is between an RB1100 <-> hEX gr3. If the hEX's PPPoE bounces, no issue. If the RB1100's PPPoE bounces, then IPsec fails.
Hello @pe1chl,

I am back after some tests in the production environment.
A little recap of the prod. environment:
HQ: CCR1036 - firmware version 6.48.3, tile
BO: RB3011 - firmware version 6.48.3, arm
BO has 3 different providers. Each of them has a different PPPoE interface. The provider assigns the public IP on the PPPoE interface. No CGNAT.
- Wireless link, Mikrotik connected via ethernet to the radio on the roof. No VLAN tag is involved.
- VDSL-1, Mikrotik connected via ethernet to a Zyxel modem, which is configured in bridge mode. VLAN tagging is done by the Mikrotik (thanks to @pe1chl's suggestion about VLANs, this link now works correctly).
- VDSL-2, Mikrotik connected via ethernet to a D-Link modem, which is configured in bridge mode. VLAN tagging is done by the D-Link, because it seems impossible to move the VLAN tagging to the Mikrotik. I guess the D-Link WAN bridge mode is stricter than the Zyxel one and doesn't let the VLAN tag pass.
Each of these links has 2 different IPIP over IPsec tunnels towards the 2 internet links in the HQ.
The default route on the BO side is via one of the IPIP over IPsec tunnels built on VDSL-2.
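For context, one of those BO-side tunnels looks roughly like this; every name and address is a placeholder, and the IPsec part may be done either with the ipsec-secret shortcut shown here or with explicitly configured peers and policies:

/interface ipip
add name=ipip-vdsl2-hq1 local-address=203.0.113.10 remote-address=198.51.100.1 ipsec-secret=ChangeMe keepalive=10s,10
/ip address
add address=10.255.0.2/30 interface=ipip-vdsl2-hq1
/ip route
add dst-address=0.0.0.0/0 gateway=10.255.0.1 comment="default route via the VDSL-2 IPIP tunnel"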
When I disable / enable the VDSL-2 PPPoE interface, the default route changes.
Then the PPPoE interface comes back up, but the ESP session between VDSL-2 and the HQ (the one used to bring up the IPIP interface that served as the default route) hangs.
The only way to bring the ESP session back is to stop all traffic that goes through this ESP policy (in my case I disable the IPIP interfaces on both sides, so that the IPIP keepalive packets are not routed).
Once I am sure that there is no traffic in the ESP session, it starts working again. In my opinion, connection tracking is involved.
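In command form, the recovery workaround described above is roughly the following (the interface name is a placeholder, and the SA flush is an optional extra step not mentioned above):

# stop the traffic matching the ESP policy (done on both sides)
/interface ipip disable ipip-vdsl2-hq1
# optionally clear the installed SAs, then bring the tunnel back
/ip ipsec installed-sa flush
/interface ipip enable ipip-vdsl2-hq1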
So I decided to change the default route to the IPIP tunnel made with the VDSL-1 PPPoE interface.
As I expected, when I now disable/enable the VDSL-2 PPPoE interface, the default route doesn't change: the PPPoE comes back up and the ESP session over VDSL-2 doesn't stop working.
In the next days I will analyze the ESP flows in more detail with the packet sniffer.
I am very curious to see whether the behaviour changes if I replace the VDSL-2 D-Link modem with one from another vendor (Zyxel, TP-Link).
Any suggestions to avoid this behaviour of the ESP session?
I had another BO with the same provider situation as the example described above, but the Routerboard is a 2011 - mipsbe.

I have confirmed that in my case the bug occurs only on my 4011 (ARM); the 2011 (MIPSBE) works fine with exactly the same config.
@pe1chl, I am back after more tests. I am lucky (or unlucky, depending on the point of view) that the network scenario where I am experiencing the problem is made of 40 branch offices; altogether there are 3-4 different types of Mikrotik products.

Yes, it seems like there still are some bugs in the ARM version of RouterOS.
Correct, DPD is still enabled on both the BO and the HQ. Unfortunately 80% of the PPPoE cycles are caused by the "session-timeout" attribute sent by the RADIUS server, so the cycle is super fast, like 1 s or less. In such a situation DPD doesn't help, because it doesn't lose any keepalive packet.

I presume you have not fiddled with settings in IPsec like DPD, send-initial-contact etc. Keep them at their default values. Without DPD you certainly will have trouble at link resets.
But that does not mean there are no other scenarios where these issues can occur, and maybe they are caused by RouterOS bugs, maybe for all architectures, maybe for a single one.
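For reference, on recent 6.x versions the DPD settings live in the IPsec profile; a quick way to confirm they are still at the defaults on both ends is simply to print them:

# look at dpd-interval and dpd-maximum-failures in the profile used by the peer
/ip ipsec profile print detail
/ip ipsec peer print detail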
Yes, you assumed correctly. This is exactly my setup: a /32 route towards the IPsec peer via the PPPoE interface.

As you mention the default route and the PPPoE - I assume you've got a specific route with the remote peer as dst-address via the PPPoE, and a default route via GRE as you've mentioned. So as long as the PPPoE is up, that specific route overrides the default one; once the PPPoE goes down, the specific route becomes inactive, so the IPsec transport packets (ESP) start getting routed via the GRE, encapsulated, encrypted, and routed to the GRE again.
This is exactly my setup. We already used the blackhole; in some cases it helped, in others it seems not to. I am glad to know that the reasoning about the ESP traffic being routed into the default GRE tunnel was correct.
So just a suggestion - create another route, with distance=10, with dst-address set to the address of the remote peer, no gateway, and type=blackhole. This route will prevent the transport packets from being sent down the default route while the PPPoE tunnel is down. See whether it helps anything.
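In command form, assuming the remote peer's public address is 198.51.100.1 (placeholder), that would be something like:

/ip route
add dst-address=198.51.100.1/32 type=blackhole distance=10 comment="keep ESP off the default route while the PPPoE is down"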
Unfortunately I have not heard anything either on my ARM-related bug; initially there was a reply within a day or so, but after I supplied more information it went silent.

Sadly Mikrotik support are being very silent on the ticket I have opened.
Well, in my case the first reply unfortunately was not to the point (it questioned the use of a VLAN-filtering bridge on the LAN side while my problem occurs on a VLAN on the WAN side that is not part of that bridge), so I explained what exactly is required to reproduce the problem and that the bridge is not required for that. Then I heard nothing back anymore.

I wouldn't hold my breath. My ticket was opened back in June '21 and to date I've only got one reply on it, which was in Oct '21.
Well, I don't see it that way. I can understand that support gets loads of tickets from people who have misunderstood something, made some config mistake they overlooked, etc. I have opened such tickets myself in the past.

It's sad to see that support seems to have dropped off. Like I said, I don't expect ci$co levels of support (not that those levels are great these days for the price you pay).