On what in particular? I assume, as you talk about bonding, that you want your radio links to be L2 transparent, so no routing and associated failover mechanisms can be used. On bonding, that post says the same as I do, doesn't it?

> What is your opinion?
"Factory defaults -> home CPE" is a good start. "No defaults at all" should be manageable using Winbox connecting to MAC address or mac-telnet from another Mikrotik, and also recoverable by resetting to factory defaults using the reset button during power-on, but that's off-topic here.So, some questions:
> What baseline should I be starting from? Factory defaults Router? Factory defaults Bridge? Or no defaults at all (which I have not tried due to many nightmare stories of folks who had to use a serial cable to recover from that)?
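For reference only (not a recommendation of either), these are the two CLI forms of the reset being talked about; the reset button during power-on gets you to the first one as well:

/system reset-configuration                  # back to the factory default configuration
/system reset-configuration no-defaults=yes  # wipe to a completely blank configuration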
The part about NAT is important. The 750 gets its own WAN IP address and default route from the "backhaul IP", whatever it may be. But unless that "backhaul IP" device knows that packets for 192.168.1.0/24 and/or 192.168.88.0/24 have to be sent to the 750's WAN IP, the action=masquerade rule on the 750 is necessary; otherwise the packets from its LANs are routed via the WAN but with their original source addresses. In the better case, the "backhaul IP" drops them already on their way to the server somewhere in the internet; in the worse case, they reach the server but the server sends its response to 192.168.1.x (or 192.168.88.x), which is a completely unrelated device in the server's network.

> Leave the default IP's as is (88.1). Turn off DHCP and NAT from QS.
...
> All is well after reset. Repeating my steps above, I lose Internet when I turn off the DHCP server. DHCP client still sees the ISP and can release/renew OK. Pings from connected machine don't work (obviously).
> The connected machine still has an IP from when DHCP was turned on, and it is configured correctly per Connection Information. Assigning an IP manually with the same parameters but using a different 88.x IP and using that does not make a difference.
If the action=masquerade rule is set to match on the WAN port and on a list of LAN bridge ports at the same time, it cannot work at all. But matching on a bridge port is only possible if use-ip-firewall is set to yes under /interface bridge settings, so the rule may be auto-disabled.

> I believe it was the same with the addition of an entry in the "Out Bridge Port List", which I believe was set to all LAN. I will need to check that next time I am down there.
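To illustrate the point about the masquerade rule above: a working rule matches only on the outgoing WAN interface, with no bridge port conditions. A minimal sketch, assuming the WAN port is ether1 (adjust to the actual interface):

/ip firewall nat add chain=srcnat out-interface=ether1 action=masquerade comment="hide LAN source addresses behind the WAN IP"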
Exactly. The DHCP is used to set up the IP address and network mask at least, but there are many additional parameters usually provided as well - default gateway, DNS server list, NTP server list, configuration server FQDN, configuration server name, and so on.

> "give the 750's IP address in the respective subnet as the default gateway to the client device(s) in 192.168.1.0/24 and/or 192.168.88.0/24, set those devices to use 8.8.8.8 as DNS, and try again."
> So, this needs to be done on end-user devices such as laptop, tablet, etc.? Or am I misunderstanding? In other words, do MANUALLY what a DHCP server would do AUTOMATICALLY?
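For illustration, this is the DHCP-server side of the same thing - handing out the 750's address as the gateway and 8.8.8.8 as DNS automatically instead of typing them into each device. A sketch assuming the default 192.168.88.0/24 LAN and the 750 at 192.168.88.1 (if the default configuration already created this network entry, set it rather than add a new one):

/ip dhcp-server network add address=192.168.88.0/24 gateway=192.168.88.1 dns-server=8.8.8.8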
I'm not sure why we are doing this - it was your complaint that the devices behind the bond cannot reach internet after the changes you've done. My initial understanding was that the bonding setup was the only thing which was not clear to you, but now it seems to me that there is much more. That's why I've initially concentrated on the migration from a single cable connection to the bonded one.

> Are we doing this only to test the ability of a device to get Internet access via the 750 when everything is properly configured?
Even that way, the approach with two interconnection subnets and redundant routing seems better to me than any L2 solution (bonding or mesh), as you can try several strategies for distributing the load across the two links and stick with the optimal one, and none of them will be worse than bonding.

> a properly configured 750UP will take the place of the unmanaged switch at the actual site.
It seems my mind reading skills are severely impaired these days. As I haven't seen the 750 anywhere on the drawing, and as you stated before that the 750 will have the public IP on itself, I thought the first box named ISP was the 750 ☹

> So, you are saying that "two interconnection subnets and redundant routing" can be implemented with the network topology as-is, without needing to change any hardware at the "source" site? Or am I misunderstanding the meaning of "it is still possible to implement the above in steps from one end"? Perhaps that means that AFTER a 750 is installed at the source end it can all be configured from a single point.
Basically:

> It would be cool if I didn't need to go to the source end. Setting up a site access visit there requires getting in contact with a fairly non-responsive and grumpy County IT employee to get into the building and driving through some really nasty deep sand. I wasn't going to get around to that until next year.
I've assumed that yesterday's diagram was describing the actual current deployment setup. The suggestion on the drawing reflects that assumption - it can be implemented on the equipment in production, not affecting its operation until you are ready to start using the new approach, and keeping the management via 192.168.1.x like it is now, so you always keep the possibility to revert the steps without needing to rely on safe mode.

> When I said "where I am with things currently" in regards to the network diagram, that refers to the situation at the actual site NOT the "bench test" collection.
Now I'm a little bit uncertain - is it 350 miles north of your current location or of Phoenix? I mean, it is easy to migrate the configuration safely if you are physically present at the 960 end, but it requires a bit more planning and a scheduled rollback script for the final switchover step (which I hadn't anticipated initially - see below) if done from the internet end.

> The network deployment is about 350 miles north
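One possible shape for that "scheduled rollback script", just as a sketch of the idea and not a worked-out procedure. It assumes the old and new default routes carry the comments old-default and new-default; until the scheduler is removed, it keeps flipping back to the old route every 15 minutes, so a lockout caused by the switchover heals itself:

/system script add name=rollback-switchover source="/ip route enable [find comment=\"old-default\"]; /ip route disable [find comment=\"new-default\"]"
/system scheduler add name=rollback-timer interval=15m on-event=rollback-switchover
# once access over the new path is confirmed working:
# /system scheduler remove rollback-timer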
That's not a big deal, as you can set any MAC address on Mikrotik devices' Ethernet interfaces. The only complication is when you do that from the internet side, as you need to change it at both the 960 and the NetBox almost simultaneously. I didn't expect this, as your second drawing was showing the public IP on the ISP box, but it doesn't change much about the concept. The red subnet becomes a public one (216.169.x.x), so the NAT handling moves from the 960 to the NetBox, whilst the firewall may stay at the 960.

> My IP is locked to the MAC address of the RB960 at the remote network end. I have had some discussions with my provider on this - it can be changed, but I can have only one instance of MAC address at any given time.
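For completeness, setting the MAC is a one-liner; the interface name and address below are placeholders, not your real ones:

/interface ethernet set ether1 mac-address=00:0C:42:11:22:33
# and to return to the factory-burned address later, if ever needed:
# /interface ethernet reset-mac-address ether1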
As said above, bonding can be used, but due to its design, a lot of extra stuff would have to be configured to maintain management access to the long-distance radios, and the overall performance would be inferior to that of my proposed solution with routing. That is leaving aside the fact that you don't need to physically add the 750 to the network if you take the routing-based way. The price to pay is that you have to learn a bit.

> I do like the idea of leveraging what I already have in place.
...
> What I was hoping for was a way to provide redundancy on the long-distance piece as well as keep WDS transparency on the entire link so the 960 could keep plugging along pretty much as-is. That seemed to be the simplest method to this admitted novice, and one that I could administer with my current level of knowledge.
Yes, for the routing-based failover to work, the public IP has to be at the NetBox. And as said above, you don't need to talk to the ISP; it is enough to configure the 960's MAC address on the interface of the NetBox.

> I will need to call my contact at the ISP to inform them of the new MAC that my IP will bind to, as it appears that could now be the NetBox and not the RB960. (Maybe. Still examining your diagram and figuring things out. If that is incorrect, let me know.)
It doesn't matter what equipment in particular is used for the radio links. You can use the testbench 960 instead of the production 960, and the testbench 750 instead of the production NetBox. The radio links on the testbench will be just as L2 transparent as the real ones. The NetBox will have one Ethernet port (with the dumb switch on it) and one wireless port bridged together, whereas the 750 will have three wired ports bridged together, with an external radio connected to two of them - that's the whole difference. At L3 (IP addresses and routing), the settings will be the same.

> UBNT gear does have a Router mode, although I have never used it, preferring to leave the routing to Mikrotik.
> So, given the above, can I still create a bench setup with the hardware I have? I suppose I could buy a couple of cheap hAP's so we have Tik devices to work with. I always have a use for those anyway.
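As a rough sketch of the bench layout just described (bridge and port names are assumptions, not a prescription): on the device standing in for the NetBox, one Ethernet port and the wireless port go into one bridge; on the device standing in for the 750, the three wired ports do the same, two of them going to the external radios.

# stand-in for the NetBox
/interface bridge add name=bridge-lan
/interface bridge port add bridge=bridge-lan interface=ether2
/interface bridge port add bridge=bridge-lan interface=wlan1

# stand-in for the 750 (ether4 and ether5 go to the external radios)
/interface bridge add name=bridge-lan
/interface bridge port add bridge=bridge-lan interface=ether3
/interface bridge port add bridge=bridge-lan interface=ether4
/interface bridge port add bridge=bridge-lan interface=ether5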
Again, partitioning the network using VLANs is not mandatory - you currently run 192.168.1.0/24 and the public subnet in the same LAN, and it causes no trouble because the ISP doesn't have their own 192.168.1.0/24 in the same LAN. So the replacement of the Bullet2s can easily be postponed to never, or the dumb switch may just be replaced later on by the 750 you already have, to provide the tagging/untagging capability for the Bullet2 path externally.

> Sadly, the old non-M Bullets don't support running VLANs. So I will need to buy stuff in order to complete the bench test setup.
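If the 750 ever does take over the tagging/untagging for the Bullet2 path, a sketch could look like this (RouterOS 6.41+ bridge VLAN filtering; the VLAN id, bridge name and port roles are all assumptions):

/interface bridge set bridge-lan vlan-filtering=yes
/interface bridge port set [find interface=ether3] pvid=20
/interface bridge vlan add bridge=bridge-lan tagged=ether2 untagged=ether3 vlan-ids=20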
Assuming that the weather conditions deteriorate and improve gradually, yes, it is a very good idea to control the failover based on the receive level of the link. However, I'd prefer hysteresis to a time-based switchover. Say, a fall of the Rx level below -70 dBm means "stop using this link", but a rise above -60 dBm is required to start using it again. You'll need to collect some data to find out how much the Rx level fluctuates during normal weather conditions.

> Is it possible to have a script run periodically on the destination end 960 that would monitor the signal level or the data rate of the destination end NetBox (via Wireless>>Registration) and switch to the secondary link if it falls below a certain number?
...
> Just wondering if it's even possible.
I'm not sure whether 15 min intervals are sufficient to notice short-time fluctuations caused by the packet nature of the link. With a traditional full duplex link where the carrier is always on I would have no doubt, but with this packet thing, I can see the Rx level drift in a range from -80 to -77 dBm within seconds on a link of about 50 feet with omnidirectional antennas. So some averaging may be required, which would make the script a little bit more complex (and its response a little bit slower, but that's not critical).

> a scheduled script to record signal level every 15 min
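Thinking out loud, that slightly more complex script could look roughly like the sketch below - names and thresholds are made up. It averages a handful of signal-strength samples from wlan1 and then applies the -70/-60 dBm hysteresis by enabling or disabling a firewall rule (tagged "fail-primary" here) which, while enabled, makes the check pings over the primary link fail. Which device it runs on and what exactly it toggles are still open; the string handling is there because some RouterOS versions return signal-strength as text like "-58dBm@..." rather than a plain number, and it assumes wlan1 currently has a peer registered.

# collect a few samples from wlan1 and average them
:local sum 0
:local count 5
:for i from=1 to=$count do={
    :local mon [/interface wireless monitor wlan1 once as-value]
    :local raw ("" . ($mon->"signal-strength"))
    :local cut [:find $raw "dBm"]
    :if ([:typeof $cut] != "nil") do={ :set raw [:pick $raw 0 $cut] }
    :set sum ($sum + [:tonum $raw])
    :delay 2s
}
:local avg ($sum / $count)

# hysteresis: below -70 dBm stop using the link, above -60 dBm allow it again
:if ($avg < -70) do={ /ip firewall filter enable [find comment="fail-primary"] }
:if ($avg > -60) do={ /ip firewall filter disable [find comment="fail-primary"] }

It would be run from /system scheduler at some short interval; the "fail-primary" rule itself would just be a normally-disabled drop rule matching those check pings.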
Initially I did talk about the same thing, monitoring the transparency of the path all the way to the internet. But then I realized that this only makes sense when the two WAN interfaces are connected to paths which reunite many hops away from the device (i.e. each goes through a different ISP, and you want to know that the path through that ISP to the "big internet" is transparent as a whole).

> A quick search found this:
> https://itimagination.com/mikrotik-wan- ... -reliable/
> This sentence there has me thinking you are describing the same thing:
> "With these route-based rules, failover times are about 15 seconds. From the time internet connectivity stops, to failing over, to workstations regaining internet access, is about 5-15 seconds. From testing, failing back to primary is a little quicker, maybe 5 seconds."
As said above, there will be no code "running" on the 960. There will just be two routes towards 0.0.0.0/0 configured, the preferred one with distance=1 and check-gateway=ping, and the backup one with distance=2, each with a different IP address as its gateway. The same configuration, except that the destination of both these routes will be just 192.168.0.0/24, goes on the NB5 at the source side (actually, to provide management access to all the equipment at the 960 end via both links, there will be some more routes, but that's not important for the principle).

> From your description near the end of your post, it *sounds* like there will be code running on the destination end NB as well as the RB960. Code on the NB5 monitoring signal conditions and blocking pings (which originate at the 960) when the signal falls below the preset value, with more on the 960 that sends the pings, detects the loss of them, and when they are lost, invokes a different route to the secondary link. And of course the reverse operation when sufficient signal is again available. If we get the Blizzard of the Century and both links stop passing data, it sounds like traffic will switch to the secondary link and just stay there until sufficient signal returns to the primary. Very good. If I am understanding this correctly, this approach will also be "device-agnostic" regarding the secondary link in that I can swap different devices in for the old B2HP's at any time with no re-config necessary.
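Put into commands, the two-route arrangement described above looks roughly like this; the gateway addresses are placeholders for the two interconnection subnets, not real values:

# on the 960: default route preferred via the primary link, backup via the secondary one
/ip route add dst-address=0.0.0.0/0 gateway=192.168.10.1 distance=1 check-gateway=ping comment="primary link"
/ip route add dst-address=0.0.0.0/0 gateway=192.168.20.1 distance=2 comment="backup link"

# on the NB5 at the source side: the same pair, but towards 192.168.0.0/24 only
/ip route add dst-address=192.168.0.0/24 gateway=192.168.10.2 distance=1 check-gateway=ping comment="primary link"
/ip route add dst-address=192.168.0.0/24 gateway=192.168.20.2 distance=2 comment="backup link"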
Re-stating what you've heard/read in your own words is a great way to confirm that both parties understand things the same way.

> I am probably only re-stating what you've already said in a different way - with included mistakes due to my lack of (but increasing) knowledge. But I really want to understand this.
Yes, scroll a few posts back :) That's what I stated as the advantage already back then: you can do all the changes while connected to the 960 alone (provided that you can reach the NB5 at the remote end from there, of course).

> It also sounds like this can all happen with no need for me to make a visit to the source end. True or false? Another benefit, if true.
There may still be a point in having the LTE there as a backup for management access.

> I have pretty much had to abandon LTE as backup as the network usage has increased such that I can't afford $10/GB (!!) any longer.
I remember this setup; that's why I keep repeating that you currently use two subnets (the management one, 192.168.1.0/24, and the public one, 216.x.x.x) in the same L2 segment (or bridged domain, if you prefer to call it that). And I rely on the existence of this management access via addresses in 192.168.1.x/24 to all the devices whose configuration needs to be augmented to implement the routed failover.

> Speaking of management access, remember that you coached me through getting that to work last year, and it required this /IP Address entry:
...
> as previously mentioned in post #10. Will this affect or be affected by what we are discussing? Just didn't want this to swoop in at the last minute and hose things.
# reads the current wlan1 metrics once and prints signal-to-noise and signal-strength:
:local monitor [/interface wireless monitor wlan1 once as-value] ; :put (($monitor->"signal-to-noise") . " " . ($monitor->"signal-strength"))