WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Wed Jul 03, 2024 11:07 am
by 611
WireGuard over intersite links for management purposes, no recent configuration changes, no problems until, I believe, 7.15.
Various MT devices, CHR, all running 7.15.1.

Several sites: site1, site2, site3. All-to-all configuration - each site has a single WireGuard instance with two peers configured, one for each of the other sites.
10.0.<site>.0/24 are internal networks, 10.1.<site>.0/24 are management networks.

site1:
/interface wireguard add listen-port=13231 mtu=1420 name=wireguard1
/ip address add address=10.1.1.1/16 interface=wireguard1 network=10.1.0.0
/interface wireguard peers add allowed-address=10.1.2.0/24 endpoint-address=10.0.2.1 interface=wireguard1 
/interface wireguard peers add allowed-address=10.1.3.0/24 endpoint-address=10.0.3.1 interface=wireguard1
site2:
/interface wireguard add listen-port=13231 mtu=1420 name=wireguard1
/ip address add address=10.1.2.1/16 interface=wireguard1 network=10.1.0.0
/interface wireguard peers add allowed-address=10.1.1.0/24 endpoint-address=10.0.1.1 interface=wireguard1 
/interface wireguard peers add allowed-address=10.1.3.0/24 endpoint-address=10.0.3.1 interface=wireguard1
site3:
/interface wireguard add listen-port=13231 mtu=1420 name=wireguard1
/ip address add address=10.1.3.1/16 interface=wireguard1 network=10.1.0.0
/interface wireguard peers add allowed-address=10.1.2.0/24 endpoint-address=10.0.2.1 interface=wireguard1 
/interface wireguard peers add allowed-address=10.1.1.0/24 endpoint-address=10.0.1.1 interface=wireguard1
After site3 has been down for some hours (23 and 13 hours in the two observed cases) and comes back up, no packets from site3 can reach site1 or site2 over WireGuard. The underlying site-to-site links are OK.
The "Last handshake" property for the site3 peer on site1 and site2 shows the time since the last handshake before site3 went down.

Cycling (disabling/enabling) the WireGuard peer entries on site3 doesn't help.

Cycling the WireGuard peer entry for site3 on site1, or a ping (any packet, I believe) from site1 to site3, brings the WireGuard link between site1 and site3 back to life. Same for site2 and site3.

All sites are normally online (but outages happen), so marking any peers as "responder" doesn't look reasonable.

Does anyone have the same or similar problems with WireGuard on 7.15?

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Wed Jul 03, 2024 3:11 pm
by jaclaz
Maybe related, maybe not:
viewtopic.php?p=1083628

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Thu Jul 04, 2024 4:53 pm
by 611
Maybe related, maybe not:
viewtopic.php?p=1083628
I believe it's unrelated, but a reply in the thread gave me a good idea:
Another slight possibility
Router-B is not happy, because router-A still on same port, but wg timestamps from router-A went backwards,
Perhaps router-A wireguard needs restart after Router-A has got correct time.
Indeed, site3 is in a rural area, and the outages were power outages due to summer thunderstorms. The MT device on the site has no RTC, so it wakes up with its clock set in the past, which precludes a WG handshake per section 5.1 of the WireGuard whitepaper.

After the last outage the router at site3 woke up with its clock set approx. 16 hours behind, and was able to connect to site1 (without any manual intervention) after approx. 4:20. At this point the time since the last handshake with site3 in site1's WG was approx. 6 hours. Earlier, though, I had observed 23 and 13 hours since the last handshake, and restoring the link required a ping from site1 (so it would initiate a handshake) or cycling the peer at site1.

TBH, I was not expecting this dependency, and, as I distribute time for all my network devices from a master device on the management network, this dependency becomes critical - the site can't connect to the management network unless it has its time fixed, and the site requires connection to management network to fix the time.

It looks like I will need to change the way time is synced on my network - OFC there are many workarounds possible, but I'm a bit tired of working around all the small things Mikrotik.

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Thu Jul 04, 2024 5:56 pm
by jaclaz
Very interesting explanation.
TBH, I was not expecting this dependency, and, as I distribute time for all my network devices from a master device on the management network, this dependency becomes critical - the site can't connect to the management network unless it has its time fixed, and the site requires connection to management network to fix the time.
A good example of Catch22. :-D

I am not sure why Mikrotik does not provide a (optional, paid for) way to add a RTC (Real Time Clock) to their devices, capable to keep the date/time for a few days, I know that most if not all professional setups have an UPS so there is less need for such a device (but UPS can also fail from time to time), and for remote/not easily accessible sites the professional GPS based NTP server devices are in the hundreds of dollars, though I have seen el-cheapo ones for less than 100 $, no idea how good they are.

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Fri Jul 05, 2024 12:20 am
by 611
I am not sure why Mikrotik does not provide a (optional, paid for) way to add a RTC (Real Time Clock) to their devices, capable to keep the date/time for a few days, I know that most if not all professional setups have an UPS so there is less need for such a device (but UPS can also fail from time to time), and for remote/not easily accessible sites the professional GPS based NTP server devices are in the hundreds of dollars, though I have seen el-cheapo ones for less than 100 $, no idea how good they are.
I'd guess the rationale is "It's a network device, it'll get the time from the network when it boots up". True, but there are caveats.
And developing an add-on module that 0.1% of users would want is a non-starter.

UPSes don't just fail (in my experience, PSU/UPS and other power-circuit faults are the most frequent failures); they normally won't keep the load alive for 24h unless you have a very peculiar setup (and I've just had a 23h, then a 13h power outage).
GPS-based NTP servers are solid, and honestly they don't have to cost a lot - any GPS receiver module (including the cheapest ones) has time precision far exceeding most IT needs, and implementing an NTP server on a microcontroller is not rocket science. But introducing a separate NTP server on a site breaks the internal NTP hierarchy, and this bothers me more than the need for a separate device.

In my particular case the site has an LTE link, thus the modem knows the time with microsecond precision (it gets its time from the base station that has a GPS receiver), and :local modemtimeoutput [/interface lte at-chat lte1 input="AT+CCLK?" as-value] will get the time from the modem. Though as I have an LTE MS separate from the site's main router, I will need to /system ssh-exec address=10.1.3.1 user=*** command="/system clock set date=$modemdate time=$modemtime" to set the time on the router. And from this point on the router will be able to connect to the management network and get its NTP time. Yet another workaround script :(.

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it  [SOLVED]

Posted: Fri Jul 05, 2024 5:14 pm
by 611
The problem seems to be caused by a combination of two peculiarities:
* time incorrectly set on site3's router after each power outage - a "feature" of most Mikrotik devices;
* the replay protection mechanism of the WireGuard protocol - if the peer was previously connected and its clock is then set back for any reason, WireGuard won't accept handshakes initiated by that peer until its clock passes the time of its last successful handshake.
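
The replay check described above can be modelled roughly as follows. This is a simplified sketch in Python, not WireGuard's actual code: the class name Responder, the peer label, and the use of plain integer seconds in place of the protocol's TAI64N timestamps are all assumptions made for illustration.

```python
# Simplified model of the handshake-initiation timestamp check
# (WireGuard whitepaper, section 5.1). Integer seconds stand in for the
# protocol's TAI64N timestamps; all names here are illustrative.

class Responder:
    def __init__(self):
        # Greatest initiation timestamp accepted so far, per peer.
        self.greatest = {}

    def accept_initiation(self, peer, timestamp):
        """Accept only if the timestamp is strictly newer than any seen before."""
        last = self.greatest.get(peer)
        if last is not None and timestamp <= last:
            return False  # dropped: indistinguishable from a replay
        self.greatest[peer] = timestamp
        return True


HOUR = 3600
site1 = Responder()

# site3 initiates a handshake shortly before the outage; its clock reads 100 h.
before_outage = site1.accept_initiation("site3", 100 * HOUR)

# site3 reboots without an RTC and its clock now reads 97 h, which is in the
# past relative to the stored timestamp, so its initiations are rejected.
after_reboot = site1.accept_initiation("site3", 97 * HOUR)

# Once site3's clock passes the stored value, initiations are accepted again.
after_catch_up = site1.accept_initiation("site3", 100 * HOUR + 1)

print(before_outage, after_reboot, after_catch_up)
```

In this model a handshake initiated from site1 is judged by site3 against site1's timestamps, and site1's clock never went backwards, which would explain why a ping from site1 revives the link.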

Most probably the problem is not related to 7.15 - it's just a coincidence that 7.15 was released in late May, just before the summer with its thunderstorms and thus power outages.

As I've built my NTP hierarchy with a single master device syncing to the network source, and all the other devices syncing with the master, putting in some network NTP is not an option. And to add insult to injury, my master device is on the management network, one that is inaccessible until the WireGuard connection is established.

So, yet another workaround - a script that runs on startup of LTE device, waits for LTE interface to go up, grabs the current time from the modem (the time is supplied by the LTE network), and applies the time to both LTE device and the router, once:
# Wait for the LTE interface to go up
:while ([/interface lte get 0 running] = false) do={ :delay 5 }

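# Declare the single-run flag (a global must be declared in the script before it can be read), proceed if it is not set
:global "scr-lte-time-as-synced"
:if ([:typeof $"scr-lte-time-as-synced"] != "str") do={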

  # Get current time from the LTE modem 
  :local modemclockoutput ([/interface lte at-chat 0  input="AT+CCLK?" as-value]->"output")

  # Extract date and time, starting at the first double quote symbol (tested with R11e-LTE6, other modems may use another date/time format)
  :local dtstart [:find $modemclockoutput "\""]
  :local modemdate ("20" . [:pick $modemclockoutput ($dtstart+1) ($dtstart+3)] . "-" . \
      [:pick $modemclockoutput ($dtstart+4) ($dtstart+6)] . "-" . [:pick $modemclockoutput ($dtstart+7) ($dtstart+9)])
  :local modemtime [:pick $modemclockoutput ($dtstart+10) ($dtstart+18)]

  # Set local time
  /system clock set date=$modemdate time=$modemtime

  # Set time on the router (prereqs: SSH PKI access to the router)
  /system ssh-exec address=[/ip dhcp-client get 0 dhcp-server] user=*** command="/system clock set date=$modemdate time=$modemtime"

  # Set the single-run flag
  :global "scr-lte-time-as-synced" ($modemdate . " " . $modemtime)
}
This way the router will get the correct time (give or take a second) within moments of establishing an LTE connection, and will be able to connect to the WireGuard hosts it was previously connected to.
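
To make the offsets used in the script easier to follow, here is the same extraction as a small Python sketch. The sample response string is hypothetical (not captured from a real modem) and assumes the "yy/MM/dd,hh:mm:ss" layout right after the first double quote, as the script does; the function name parse_cclk is made up for this example.

```python
def parse_cclk(output):
    """Extract (date, time) strings from an AT+CCLK? response.

    Assumes the 'yy/MM/dd,hh:mm:ss' layout directly after the first double
    quote, as in the script above; other modems may format it differently.
    """
    q = output.index('"')
    date = ("20" + output[q + 1:q + 3] + "-"
            + output[q + 4:q + 6] + "-" + output[q + 7:q + 9])
    time = output[q + 10:q + 18]
    return date, time


# Hypothetical modem response, for illustration only.
sample = '+CCLK: "24/07/05,17:14:03+12"'
modem_date, modem_time = parse_cclk(sample)
print(modem_date, modem_time)
```

As in the script, anything after the seconds (here the trailing "+12", which as far as I know is a timezone offset in quarter-hours) is ignored.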

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Fri Jul 05, 2024 5:23 pm
by jaclaz
Nice script, and I love it when scripts are well commented like this one: besides solving a specific problem, they are useful for understanding ROS syntax. :)

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Fri Jul 05, 2024 5:42 pm
by anav
The question is, why is such a script needed in this day and age.

a. why does the MT not have the same time as LTE
--> one can use NTP with MT
--> can one use NTP with LTE? If so why not use same NTP server as MT?

The LTE is getting its time from somewhere, as the provider!

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Fri Jul 05, 2024 5:59 pm
by 611
Thanks!

The script syntax documentation is another thing MT should definitely work on: if you look up the :pick command (used in the script) in https://help.mikrotik.com/docs/display/ROS/Scripting, you'll find the syntax is :pick <var> <start> [<count>], but if you try to use the command, you'll realize (by trial and error, or by looking at other scripts) that the syntax is actually :pick <var> <start> [<end+1>] - the third parameter is NOT the character count, it's the index of the last character to be extracted PLUS ONE. WTF?!
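
For comparison, the behaviour described above is a half-open range, the same convention Python slices use. A minimal sketch; the helper name pick is hypothetical and only mimics the behaviour observed for :pick, it is not a RouterOS API.

```python
def pick(value, start, end):
    """Mimic the observed :pick behaviour: `end` is one past the last index."""
    return value[start:end]


text = "abcdefgh"
picked = pick(text, 2, 5)

# Indices 2, 3 and 4 are returned: three characters, not five.
print(picked, len(picked))
```

Read as a count, the third argument would have asked for five characters starting at index 2; read as an end index it yields end minus start, i.e. three.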

But, TBH, problems like this are common for grassroots domain-specific scripting languages - some language constructs are odd (I've seen a language where character indices in string functions start from either 0 or 1, depending on the function - a thing just like this <end+1>) and ill-documented (<count>? keep in mind that's not true, or go debug your script to find it out once again), and the documentation as a whole frequently lacks the information you need (the as-value parameter for at-chat is not documented).

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Fri Jul 05, 2024 6:12 pm
by jaclaz
Well, programmers are renowned for starting their counting from 0. The issue is that they (the programmers) should not be allowed to write manuals, help or syntax explanations; there are technical writers to do that. What I find missing the most in Mikrotik's (scarce) documentation are (not entirely unlike Linux man pages) practical examples.

As a side note, in the words of Stan Kelly-Bootle:
https://en.wikipedia.org/wiki/Stan_Kelly-Bootle

"Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration."

And of course:
https://xkcd.com/163/

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Fri Jul 05, 2024 6:16 pm
by 611
The question is, why is such a script needed in this day and age.
That's the state of the art :) :
* (consumer) MT devices have no RTC;
* the latest and greatest VPN protocol has a (pretty well hidden) requirement that the peer's clock be monotonic.

OK, this particular problem is in part caused by my aforementioned paranoia - strictly hierarchical NTP structure, plus NTP on management network, plus management network spans sites over WireGuard.

Regarding syncing the device time with the LTE modem by default, I believe it could cause more problems than it solves in certain edge cases.

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Sat Jul 06, 2024 2:05 am
by Paternot
I don't get it. You have internet connection. Use a public NTP server. True, it may not sync to the microsecond - but do you need this kind of precision? Isn't syncing to the millisecond enough?

Just boot, get internet, sync time from NTP and connect wireguard. Easy.

Yes, yes. I know. Private NTP server and all.
How about this? Boot, get internet, sync with public NTP server, connect to the VPN, detect VPN connected, change from public NTP to the private one.

I still think it's overkill, but...

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Sat Jul 06, 2024 10:32 am
by 611
I don't think a private server over a VPN will ever give more precision than a nearby public one. It's all about the hierarchy, not raw precision.

Bootstrapping the clock from a public NTP is fine, but LTE looks easier to me - no config changes at the router, no netwatch to detect if WG is running, just a startup script on the LTE device.

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Sat Jul 06, 2024 5:06 pm
by Paternot
I was thinking NTP versus modem, for timekeeping.

Either way, probably way more precise than using NTP through a VPN.

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Sat Jul 06, 2024 10:39 pm
by anav
Well if one was looking for precision...........

https://help.mikrotik.com/docs/display/ ... e+Protocol

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Wed Oct 30, 2024 2:36 am
by firsak
Similar issue with Chateau LTE6 ax that’s been set up with wireguard. It is a portable router that is deployed for a user and frequently gets powered off and on, leading to wireguard connectivity issues, presumably because of time discrepancy.

The NTP client is set up, but it takes approximately 10 minutes after rebooting for the correct time to sync through NTP. That's too long. The router itself takes a couple of minutes to boot. Then there are 10 minutes of waiting until the time is synced. Almost a quarter of an hour!

Is there a way to force an immediate NTP time sync upon boot, after WAN interface is up so that the time is correct before wireguard attempts to connect?

LTE time is unreliable in my case. Returns a completely bs date/time with command /interface lte at-chat LTE-WAN input="AT+CCLK?" as-value.
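
One way to guard a modem-based script against readings like that is a plausibility check before the clock is set. A minimal sketch in Python, assuming the date and time have already been extracted as "YYYY-MM-DD" and "HH:MM:SS" strings; the function name and the floor date are hypothetical and would need to be chosen per deployment.

```python
from datetime import datetime


def modem_time_is_plausible(date, time, floor):
    """Return True if the date/time strings parse and are not before `floor`."""
    try:
        parsed = datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False  # not a real calendar date or time
    return parsed >= floor


# Hypothetical "cannot be earlier than this" date, e.g. the day of deployment.
FLOOR = datetime(2024, 10, 1)

good = modem_time_is_plausible("2024-10-30", "02:36:00", FLOOR)
stale = modem_time_is_plausible("1980-01-06", "00:03:12", FLOOR)
garbage = modem_time_is_plausible("2080-99-99", "00:00:00", FLOOR)
print(good, stale, garbage)
```

A reading that fails the check would simply be skipped, leaving NTP (or a later retry) to set the clock.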

Re: WireGuard link on 7.15 gets stuck after peer was down, a ping or cycling the peer will unstuck it

Posted: Wed Oct 30, 2024 11:50 am
by jaclaz
You could experiment with the Mikrotik Cloud service:
https://help.mikrotik.com/docs/spaces/R ... Updatetime
no idea how fast it is (and whether its uptime is reliable enough).

From the little experience I have with NTP, the time it takes to sync is "random", sometimes it takes seconds, sometimes several minutes.

You can also try with a script at boot time unsetting/setting the NTP server:
viewtopic.php?t=151533
but cannot say if it will be any faster. :?