I am facing an annoying problem with my IPsec tunnels. Hope you guys can help me find the solution.
Scenario:
I have a router (RouterBoard CCR1036-8G-2S+) working in a High Availability configuration using VRRP. My network is xxx.xxx.xxx.8/29. My Master router has xxx.xxx.xxx.14/29 on its LAN interface and xxx.xxx.xxx.13/32 on its VRRP interface (I am using a /32 netmask because the wiki says so, but I don't really understand why. I guess it has something to do with broadcast. If someone can explain it to me later I'll be very thankful). My Backup router has xxx.xxx.xxx.12/29 on its LAN interface and again xxx.xxx.xxx.13/32 on its VRRP interface, which remains disabled until something goes really wrong.
On my Master router I have multiple IPsec tunnels to different peers: SonicWall, pfSense and MikroTik. In all of the transform sets I use the local address xxx.xxx.xxx.13 to force the tunnel to use my VRRP IP for HA reasons (my Backup has the same configuration, so if one router goes down, the tunnels come back online on the other router). All of the tunnels show the same problem at one time or another...
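To make the addressing easier to follow, here is a minimal sketch of the layout described above. It is not a copy of my config: the interface name ether1, the VRRP interface name vrrp1, vrid=1 and the priority values are assumptions for illustration only.

```
# Sketch only: ether1, vrrp1, vrid=1 and the priorities are assumed values
# Master router
/interface vrrp add name=vrrp1 interface=ether1 vrid=1 priority=254
/ip address add address=xxx.xxx.xxx.14/29 interface=ether1
/ip address add address=xxx.xxx.xxx.13/32 interface=vrrp1

# Backup router (same VRID, lower priority)
/interface vrrp add name=vrrp1 interface=ether1 vrid=1 priority=100
/ip address add address=xxx.xxx.xxx.12/29 interface=ether1
/ip address add address=xxx.xxx.xxx.13/32 interface=vrrp1
```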
The problem:
For no apparent reason, the traffic between some of the networks (e.g.: MyRouter -> SonicWall; MyRouter -> pfSense; MyRouter -> other MikroTik) suddenly stops. OUT OF NOWHERE. This does not happen often and follows no pattern at all, it just happens. Sometimes with one of the tunnels, sometimes with another one. I use a tunnel without worries for a long time (like 1 month) and out of nowhere the traffic just stops. Some time later everything goes back to normal without me doing anything, just waiting or working on something else.
Today I was looking into this problem in depth and I found something really interesting and odd at the same time: when the traffic stops, if I generate some logs I can see that the traffic is being translated to another IP address, and when the traffic comes back, there is no translation.
Example: When the traffic stopped, I created 4 rules on the filter tab:
/ip firewall filter
add action=accept chain=input dst-address=xxx.xxx.xxx.xxx src-address=MyPeerIP
add action=accept chain=output dst-address=MyPeerIP src-address=xxx.xxx.xxx.xxx
I created the two rules above just to see whether the traffic is reaching my router and leaving it. (According to the packet flow, the packet enters my firewall on the input chain after the routing decision, because of the destination IP the peer uses, and leaves on the output chain, because the IPsec policies re-encapsulate the packets with the outgoing IP so that the other peer accepts them.) So far no problems: I can see the packets coming in and going out.
Then I created two more rules to see whether the payload itself gets through the firewall after the decryption happens on the router. In these rules I used the RFC1918 IPs, because at that point the packets have already passed the IPsec policies, have been decrypted and are going through the forward chain. 192.168.110.0/24 is one of the CCR's networks and 192.168.100.0/24 is one of the remote networks:
/ip firewall filter
add action=accept chain=forward dst-address=192.168.110.10 src-address=192.168.100.23
add action=accept chain=forward dst-address=192.168.100.23 src-address=192.168.110.10
So, here is the problem: when the traffic stops, I see no traffic on these last two rules, meaning that something is going wrong in the decryption step of my transform set. I realized this because I tested the same rules on the other tunnels that were working at the time, and there I can see normal traffic on all of the rules (public IPs coming in and going out wrapped in ESP headers, and RFC1918 IPs coming in and going out wrapped in TCP headers... everything as expected). But on the tunnels where the problem occurs, I see no traffic on the last two rules.
So I logged the first two rules (the ones with the public IPs) and I saw something REALLY strange:
In the input rule (the one that shows me the traffic coming from the peer), I noticed that the destination IP (the one that should be .13, my VRRP IP address) was being translated to the LAN IP of my router (.14). The log shows:
<srcaddress> -> <xxx.xxx.xxx.14>, NAT <srcaddress> -> (<xxx.xxx.xxx.13> -> <xxx.xxx.xxx.14>)
In short: the destination address is not .13 (the IP on which the tunnel was established); it is being NATed to the LAN IP of the router, and I have no idea why. I have already created every accept rule you can imagine to bypass this, but it didn't work.
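For anyone who wants to reproduce the logging, here is a sketch of how the two public-IP rules could be set to log. It uses the same placeholders as the rules above, and the log-prefix names are made up for this post.

```
/ip firewall filter
# Same input/output rules as before, with logging turned on.
# "ipsec-in" and "ipsec-out" are hypothetical prefixes, pick any name.
add action=accept chain=input dst-address=xxx.xxx.xxx.xxx src-address=MyPeerIP log=yes log-prefix="ipsec-in"
add action=accept chain=output dst-address=MyPeerIP src-address=xxx.xxx.xxx.xxx log=yes log-prefix="ipsec-out"
```

The prefix makes it easy to pick the entries for each direction out of the log.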
So, thinking about the traffic flow, I looked at what was going on before the filter chains, e.g. prerouting (mangle, dstnat...). I created a rule in dstnat accepting the src address of my peer to try to bypass this, but without success. Then, looking at mangle, I also created a rule to log the traffic and this is what I'm getting:
So, the tunnel is established over the .13 IP and the traffic in mangle shows the destination as .13, but for some reason it is then translated to .14. While this happens, the traffic that should be destined to .13 is changed to .14, and because of that the IPsec transform set does not work: by the time the traffic reaches the input chain the destination IP is already .14 and not .13, so it bypasses my IPsec rules. At that point I see no traffic on the last two rules (the RFC1918 IPs) and no traffic passes at all.
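A sketch of the kind of mangle logging described above, in case someone wants to compare on their own setup. The log-prefix is a made-up name, and the connection-tracking lookup on the last line is only an assumption about where the .13 -> .14 mapping might be visible, not something I have confirmed.

```
/ip firewall mangle
# Log packets from the peer in prerouting, before dstnat and routing.
# "pre-mangle" is a hypothetical prefix.
add action=log chain=prerouting src-address=MyPeerIP log-prefix="pre-mangle"

# Assumption: the connection table may show the original and reply
# addresses for the peer while the problem is happening.
/ip firewall connection print detail where src-address~"MyPeerIP"
```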
The other odd thing is that out of nowhere, after a while, like 20 minutes or more, everything comes back to normal without me doing anything. After the traffic is back to normal, I remove all the rules I created and everything keeps working.
Has anybody else faced something like this? Or can anyone help me?
I have no problems with MikroTik IPsec in other scenarios, only in this one with VRRP.
Note: I use v6.43.16 on my CCR.