Strange bonding behavior with EOIP slaves

horza · Mon May 24, 2021 11:31 am

I'm not sure if my expectations are incorrect, or if this is a bug that got introduced at some point over the years, so I'd like some help understanding the behavior.

Basically, my bonding setup works as expected when all slave interfaces (EOIPs) on each side have the same MAC address, and doesn't work as well when those EOIPs have unique MACs.
It does somewhat work even when all of them have unique MACs, or when just some of them share a MAC, but in those cases there's noticeable packet loss.
When all slaves on each side have the same MAC, there's literally no packet loss, and traffic is distributed as expected (XOR L3/L4).

This setup has been working since 2015, and back when I set it up, I'm sure that EOIP slaves had unique MACs.
But over the years I guess I used Winbox to copy those EOIPs and *some* of them got the same MACs because I missed the fact that it gets copied too.
Last week, when I added another WAN link, I noticed that some of my EOIPs had the same MAC. I then tested making them all unique versus all the same, and that's how I found out this was happening at all.

More details of my setup can be found in this obsolete blog, but here are the highlights:
- 6 LTE modems hooked up to my home 4011 device, isolated eth interfaces, subnets, routes.
- 6 public IPs on the datacenter CHR virtual machine, l2tp server is there.
- home 4011 establishes 6 l2tp connections to the CHR. I use static routes to ensure that l2tp connections to those 6 public IPs are established over their assigned WAN modem.
- Once I've got 6 l2tp connections, I make 6 EOIP interfaces on 4011 and 6 EOIPs on CHR, targeting their pairs using local/remote IPs that are assigned to l2tp interfaces
- 1 bond interface on each side, slaving all 6 EOIPs and automatically taking its MAC address from whichever slave it picks

My packet loss is 0 when all slave EOIPs on the same side have the same MAC address (but each side uses a different MAC address).
Meaning all 6 EOIPs on the home side have the same MAC, and a different MAC is shared by all 6 EOIPs on the datacenter side.
Note that there are no VLANs, bridges or switches in the mix.

I'm hoping you could help me understand:
- why on earth does bonding even work when all slaves have the same MAC? Wouldn't it totally kill ARP, which is the only way to monitor these slaves?
- why is there packet loss when the slaves all have different MACs?
- shouldn't bonding run just fine when all EOIPs have unique MACs? My expectation is that unique slave MACs are actually necessary, but that expectation might be incorrect.

sindy · Tue May 25, 2021 10:53 am

Look at that from a wider perspective.

- Each end of the bond uses its own strategy to choose a particular link for a particular frame, independently of the strategy used by the other end.
- Consequently, each end is only interested in the availability (transparency) of the links in its own sending direction.
- The bond as a whole is in many cases a member port of a bridge/switch; in these cases, its own MAC address(es) is (are) used for nothing but this transparency checking.
- The forwarding tables of physical switches are quite limited in size, so using an individual MAC address per member port of a bond would be a waste of this valuable resource.
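
To illustrate the first point, here is a minimal Python sketch of an XOR L3/L4 link selection. The formula follows the simplified layer3+4 hash described in the Linux kernel bonding documentation for older kernels; whether RouterOS uses exactly this formula is an assumption, and the function name `l34_link_index` and the sample addresses are made up for the example. Note that the member ports' MAC addresses are not an input at all.

```python
import ipaddress


def l34_link_index(src_ip: str, dst_ip: str, src_port: int, dst_port: int, n_links: int) -> int:
    """Pick a member link from the L3/L4 header fields only (simplified hash)."""
    ip_xor = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    # XOR the ports with the low 16 bits of the XORed addresses, then map to a link.
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % n_links


# One flow as seen by the sending end; 6 member links as in the setup above.
idx = l34_link_index("10.0.0.2", "10.0.0.1", 40000, 443, 6)
print(idx)
```

Each end runs such a calculation on its own for every frame it sends, so the link a reply comes back on is decided by the remote end alone.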

When the local bond sends a transparency check ARP request, it sends it directly from the member port it is checking, i.e. it doesn't need the port's MAC address to choose it. At the remote end, the ARP response to the transparency check request, sent by a host somewhere further in the network, arrives at the silicon side of the bond interface just like any other frame to be sent to the wire side. So if the remote end were to use the individual MAC address of the local port to send this ARP response back via the same link, it would need special treatment bypassing the normal link selection strategy, and it would have to learn the far-end MAC for each link. Plus you would still need an individual address for the bond as a whole, different from those of all member ports, for cases where the local bond is an IP interface. But all these complications would actually be useless, because to obtain the information about the transparency of a link in the sending direction, it does not matter via which link the ARP response arrives at the local end.
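
The argument can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not how RouterOS implements ARP monitoring: the `Link` class, the `monitor` function and the per-direction `tx_ok`/`rx_ok` flags are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Link:
    tx_ok: bool  # frames from the local end reach the remote end
    rx_ok: bool  # frames from the remote end reach the local end


def monitor(links: List[Link], reply_link_for: Callable[[int], int]) -> List[bool]:
    """Return, per link, whether a probe sent on that link was answered."""
    results = []
    for probed, link in enumerate(links):
        request_arrived = link.tx_ok        # the probe leaves via the probed link itself
        reply_via = reply_link_for(probed)  # the remote end picks the return link on its own
        results.append(request_arrived and links[reply_via].rx_ok)
    return results


# Link 2 is broken in the sending direction; all return directions work.
links = [Link(True, True), Link(True, True), Link(False, True), Link(True, True)]

same_link = monitor(links, lambda i: i)   # reply returns on the probed link
fixed_link = monitor(links, lambda i: 0)  # reply always returns on link 0

print(same_link)
print(fixed_link)
```

In this model both reply strategies yield the same per-link status, which is the point made above: the answer tells the local end about the probed link's sending direction regardless of the return path. The model does assume the chosen return link works; a dead return link would leave the probe unanswered either way.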

I can't explain why there is packet loss when the individual ports have individual MAC addresses, but there is no point in forcing individual MAC addresses to the member ports.
