The well-known problem
L2TP/IPsec clients reaching the server through a NAT do work, but only one at a time per public address. A new client connection from behind the same public address breaks the pre-existing client session.
The root cause
To understand what is wrong, one must stop looking at L2TP/IPsec as a magic black box and closely inspect its strictly layered structure. L2TP/IPsec is literally L2TP transported using IPsec, both protocol-wise and (at least in the case of RouterOS) application-wise: when you tick the "use IPsec" check-box in the L2TP settings, RouterOS automatically creates an IPsec peer and, at the client side, an appropriate policy necessary to transport the L2TP connection. The same IPsec setup can be done manually instead of ticking the check-box in the L2TP configuration, and the resulting processing of the packets is exactly the same in both cases.
According to the standard, L2TP is transported using UDP, and both the server and the client use port 1701 to both send and receive. So all standards-compliant client implementations also use port 1701 locally.
The L2TP/IPsec standard requires that ESP transport mode is used. ESP transport mode is designed for efficient encapsulation of encrypted traffic directly between two machines, so it does not carry any information about source and destination IP address in the encapsulated payload as it would be redundant. It is assumed that these addresses are the same for the original transported packet and the ESP transport packet. If one of the addresses eventually changes on the way from the sender to the recipient, the decrypted and de-encapsulated packet inherits this changed address. The information about ports is part of the transported payload; ESP itself has no notion of ports at all.
The fact that ESP uses no ports means that it cannot be handled by a NAT; to work around this, the NAT-T extension to IPsec introduced encapsulation of ESP into UDP. However, this only allows the UDP-encapsulated ESP to be forwarded through the NAT properly. When the ESP payload is de-encapsulated from the UDP transport, the information about the source port of the UDP packet (which the NAT has changed to distinguish one flow from another) is lost. The source address of the ESP packet is inherited from the UDP packet in which the ESP arrived encapsulated.
When, in the next step, the UDP packet carrying the L2TP is decrypted and de-encapsulated from the ESP, its source address is inherited from the ESP packet, so it is again the public IP of the NAT, and its source port is the one from which the client sent the UDP packet before encapsulating it into ESP. So if both clients behind the same NAT send from the same port (1701), the L2TP server sees both as coming from the same socket address, consisting of the public address of the NAT and the same port provided by each client. Thus not only are the flows indistinguishable from each other when received, but there is also no way to address the response packets selectively to one of them: the IPsec policy matches only on address and port, not on anything in the payload.
Some client implementations are aware of this and use random ports. The RouterOS server implementation is not strict about this and accepts connections from such clients, so this solves the problem for those implementations but not for others, such as the Microsoft Windows client. Nor does it help the Android embedded client, which uses a random port for L2TP but doesn't restrict its IPsec policy to this port.
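The loss of the distinguishing port across the two de-encapsulation steps can be modelled in a few lines. This is only an illustrative Python sketch, not RouterOS code: the class and function names, the private addresses and the public address 203.0.113.7 are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Esp:
    # ESP transport mode carries no IP addresses, only the inner
    # transport header (here reduced to the L2TP source port) and data.
    inner_src_port: int
    data: str


@dataclass(frozen=True)
class Udp:
    # Outer NAT-T packet: UDP carrying ESP.
    src_ip: str
    src_port: int
    payload: Esp


def client_send(private_ip, l2tp_port, data):
    """Client wraps L2TP/UDP into ESP and ESP into NAT-T UDP from port 4500."""
    return Udp(private_ip, 4500, Esp(l2tp_port, data))


def nat(packet, public_ip, public_port):
    """The NAT rewrites only the outer UDP source address and port."""
    return Udp(public_ip, public_port, packet.payload)


def server_receive(packet):
    """Return the socket address the L2TP server ends up seeing."""
    esp_src_ip = packet.src_ip      # the outer source port is dropped here
    return (esp_src_ip, packet.payload.inner_src_port)


# Two standards-compliant clients behind the same NAT; the NAT gives the
# second one a different outer port, but that does not survive.
a = server_receive(nat(client_send("192.168.1.10", 1701, "A"), "203.0.113.7", 4500))
b = server_receive(nat(client_send("192.168.1.11", 1701, "B"), "203.0.113.7", 1024))
# A client that picks a random local L2TP port stays distinguishable.
c = server_receive(nat(client_send("192.168.1.12", 40000, "C"), "203.0.113.7", 1025))

print(a, b, c)
```

Clients A and B end up with the same (address, port) pair even though the NAT assigned them different outer ports, while client C differs only because it chose a non-standard inner port.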
The solution on the RouterOS server side
As there is no out-of-the-box way to link the source port of the "inner" UDP packet (the one which carries the L2TP) to the source port of the "outer" UDP packet (the one which carries the ESP), we need to convey the information distinguishing the two clients' flows from one another across two encapsulating layers by some other means. Without coding, the only way is to change the transport packets' source IP addresses to unique ones depending on their source ports. So what we need is to use source-NAT on received packets. Problem solved - from the point of view of the L2TP server and IPsec policy matching, there is no ambiguity or conflict, as each flow comes from a unique IP address.
The implementation details
The above sounds easy, but the first disappointment comes when you realize that src-nat does not work for received packets, as it is deemed useless for any sane application scenario. We could use two routers in a chain to do that, but that would mean extra hardware costs and also some inconvenience. So the first task is to make RouterOS src-nat a received packet and then continue handling it as a received one. The solution is to take any two local addresses and establish an IP-IP tunnel between them, so that we can run the same packet through the whole firewall machinery twice on a single machine. We receive the packet from a physical interface, route it "out" using one of the IP-IP tunnel's ends as the output interface, and "receive" it a second time from the other end of that tunnel.
Another task is the assignment of a unique source address depending on the port. While we could theoretically map the port number 1:1 into the lower two bytes of the IP address, the issue here is that flows from two clients behind different NAT boxes very often come from the same port (like 1024), so that method would not be 100% safe. So this is not the way, even if we neglect some other associated complications. The solution is to use a stack of addresses and give each new connection its own address regardless of the actual source address and port.
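To make the difference between the two assignment schemes concrete, here is a small Python sketch. It is a model only: the 10.0.0.0 base, the example public addresses and the `AddressStack` class are assumptions made for illustration and do not correspond to any RouterOS feature.

```python
import ipaddress

BASE = ipaddress.IPv4Address("10.0.0.0")  # hypothetical base of a /16


def port_mapped(src_port):
    """Map the port 1:1 into the lower two bytes of the address."""
    return BASE + src_port


class AddressStack:
    """Hand out the next free address to every new (address, port) flow."""

    def __init__(self, first, last):
        lo = int(ipaddress.IPv4Address(first))
        hi = int(ipaddress.IPv4Address(last))
        # Stored in descending order so that pop() yields the lowest one.
        self._free = [ipaddress.IPv4Address(n) for n in range(hi, lo - 1, -1)]
        self._by_flow = {}

    def assign(self, src_ip, src_port):
        key = (src_ip, src_port)
        if key not in self._by_flow:
            self._by_flow[key] = self._free.pop()
        return self._by_flow[key]


# Two clients behind DIFFERENT NAT boxes, both coming from port 1024.
flow_1 = ("198.51.100.1", 1024)
flow_2 = ("203.0.113.7", 1024)

# Port mapping ignores the source address, so both get the same address.
mapped_1 = port_mapped(flow_1[1])
mapped_2 = port_mapped(flow_2[1])

# The stack gives each flow its own address.
stack = AddressStack("10.0.0.1", "10.0.0.4")
stacked_1 = stack.assign(*flow_1)
stacked_2 = stack.assign(*flow_2)

print(mapped_1, mapped_2, stacked_1, stacked_2)
```

The sketch deliberately leaves out releasing addresses; it only shows why a per-flow allocation avoids the collision that the port-based mapping cannot.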
The address assignment policy of the src-nat action of the RouterOS firewall is not helpful here, as it does not prefer diversity: if two connections come from different ports on the same IP address, the src-nat rule is likely to assign them the same address no matter how large a pool it has been given as to-addresses. A per-connection-classifier can use a hash of the source address and port, but even with an annoying number of rules the uniqueness of the address assignment would not be guaranteed. The solution is to use a single address in the src-nat rule and increment it each time it gets assigned to a new connection.
Because the available means cannot do this fast enough, there is a considerable chance that two connections get src-nat'ed to the same source address, so we need to resolve that systematically. To do so, the firewall filter in the second pass checks a src-address-list for the first packet of each new connection; if the source address is already there, the firewall drops the packet, thus preventing the connection from being established; otherwise it adds the source address to the address list with a 1 minute lifetime and accepts the packet. So the first connection with a particular source address always succeeds and all the following ones always fail.
Every few seconds, a cleaner script checks whether the source address currently configured in the src-nat rule is used by a properly established connection; if it is, the script updates the rule with an unused address and removes any connection attempts still waiting for a response because their establishing packets were dropped in the second pass through the firewall. This makes the firewall treat a retransmission of the same packet as a brand new one and thus apply the src-nat rule to it.
The client retransmits the IPsec connection establishing packet for tens of seconds before it gives up, so the unluckiest clients would fail only if many clients attempted to connect within the same window of several seconds. This can happen e.g. after a network outage if the clients are autonomous devices which keep on trying, so effectively the outage will seem to last longer for some of them than for others.
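The interplay between the "first one wins" filter and the periodic cleaner can be modelled as follows. This is a simplified Python sketch under stated assumptions: the `Router` class and its method names are invented for the illustration, the 1 minute timeout of the address list is omitted, and a connection counts as established as soon as its first packet is accepted.

```python
import ipaddress

FIRST = ipaddress.IPv4Address("10.0.0.1")
LAST = ipaddress.IPv4Address("10.0.15.253")


class Router:
    def __init__(self):
        self.to_address = FIRST   # the single address in the src-nat rule
        self.in_use = set()       # models the src-addresses-in-use list
        self.established = {}     # src-nat address -> client
        self.pending = []         # clients whose first packet was dropped

    def new_connection(self, client):
        """Second-pass filter: the first user of an address wins."""
        addr = self.to_address
        if addr in self.in_use:
            self.pending.append(client)   # dropped, client will retransmit
            return None
        self.in_use.add(addr)
        self.established[addr] = client
        return addr

    def cleaner(self):
        """Periodic script: move the src-nat rule to an unused address."""
        moved = False
        while self.to_address in self.established:
            nxt = self.to_address + 1
            self.to_address = FIRST if nxt > LAST else nxt
            moved = True
        if not moved:
            return []
        # Forget the half-open attempts so that their retransmissions are
        # treated as new connections and hit the updated src-nat rule.
        retry, self.pending = self.pending, []
        return retry


r = Router()
addr_a = r.new_connection("A")      # gets the current address
addr_b = r.new_connection("B")      # same address, dropped
retry = r.cleaner()                 # rule moves on, B may retry
addr_b_retry = r.new_connection(retry[0])

print(addr_a, addr_b, retry, addr_b_retry)
```

In the model, client B fails only until the next cleaner run, which mirrors the retransmission-based recovery described above.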
It is also worth noting that we need to prevent the incoming UDP transport packets from being delivered to the IPsec stack already in the first pass; to do that, we must dst-nat them to an address which is not local to the router, using a dst-nat rule which matches in the first pass, and dst-nat them back to the original address using a dst-nat rule which matches in the second pass. Routes must be added to send the internal traffic through the tunnel.
What makes all this possible at all is that it is enough to give this special treatment to packets coming to UDP port 4500. The NAT-T mechanism of IPsec does not require the UDP-encapsulated communication coming to port 4500 to come from the same IP address as the initial IKE communication to port 500. This matters because there would be no way to pair connections to port 500 with connections to port 4500 coming from the same client.
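The header rewrites of the two passes can be traced with a small Python model. It is a sketch, not a description of RouterOS internals: the `Packet` class, the `ether1` interface name and the client address 203.0.113.7 are assumptions, while 1.2.3.4, 10.0.15.254 and the tunnel interface names are the placeholders used in the configuration in this post. The model also assumes that src-nat keeps the source port.

```python
from dataclasses import dataclass, replace

PUBLIC = "1.2.3.4"         # the router's public address
AUX_DST = "10.0.15.254"    # auxiliary, non-local destination address


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int
    in_if: str


def first_pass(pkt, srcnat_to):
    """Packet arrives from the WAN and is pushed into the local tunnel."""
    special = pkt.dst == PUBLIC and pkt.dport == 4500 and pkt.sport != 4500
    if not special:
        return pkt  # delivered locally as usual
    # dst-nat to a non-local address so the packet is forwarded
    # instead of being handed to the IPsec stack right away.
    pkt = replace(pkt, dst=AUX_DST)
    # The route for 10.0.15.254/32 points to ipip-outer;
    # src-nat is applied on the way out.
    pkt = replace(pkt, src=srcnat_to)
    # The packet re-appears at the other end of the tunnel.
    return replace(pkt, in_if="ipip-inner")


def second_pass(pkt):
    """Packet is received again, now from the tunnel."""
    if pkt.in_if == "ipip-inner" and pkt.dst == AUX_DST:
        pkt = replace(pkt, dst=PUBLIC)  # restore the public address
    return pkt


wan = Packet("203.0.113.7", 1024, PUBLIC, 4500, "ether1")
delivered = second_pass(first_pass(wan, "10.0.0.1"))
untouched = second_pass(first_pass(replace(wan, sport=4500), "10.0.0.1"))

print(delivered)
print(untouched)
```

After both passes the IPsec stack sees the original destination but a unique source address, whereas a packet coming from source port 4500 bypasses the detour entirely.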
The limitations
The pool of IP addresses used for src-nat must be significantly larger than the maximum expected number of clients connected simultaneously, because if an established connection breaks, the connection tracker remembers it for minutes so its src-nat address cannot be reused during that time. So if some of the clients have unstable connections and reconnect quite frequently, they would exhaust the pool. If your network is so large and complex that you cannot find a free pool of thousands of addresses, the way out is to mark the response IPsec packets from the server and policy-route them back to the tunnel, while the traffic to actual owners of these addresses is routed normally.
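A rough feel for the required pool size can be obtained with simple arithmetic. The Python sketch below is only an estimate: the formula (simultaneous clients plus addresses still held by broken connections) and the example figures of 500 clients, 20 reconnects per minute and 3 minutes of retention are assumptions, not measured values; the address range is the one used in the configuration in this post.

```python
import ipaddress


def pool_size(first, last):
    """Number of addresses in the inclusive range first..last."""
    return int(ipaddress.IPv4Address(last)) - int(ipaddress.IPv4Address(first)) + 1


def addresses_needed(simultaneous_clients, reconnects_per_minute, retention_minutes):
    """Live clients plus addresses still blocked by remembered connections."""
    return simultaneous_clients + reconnects_per_minute * retention_minutes


size = pool_size("10.0.0.1", "10.0.15.253")
need = addresses_needed(simultaneous_clients=500,
                        reconnects_per_minute=20,
                        retention_minutes=3)

print(size, need, need <= size)
```

With these assumed figures the range leaves a comfortable margin; frequent reconnects or a longer retention time eat into it quickly.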
The risks
Each modification of the src-nat rule causes a configuration save. This means extra wear for the flash chip. I have no idea what this may cause in real-life deployments on Routerboards. To reduce the impact and also to save some resources, I recommend excluding packets with source port 4500 from the special treatment. The reasoning is that NATs usually only change the source port if the original one is already used for another connection to the same remote socket. So in many cases the connection from the first client behind each NAT comes from port 4500 and thus does not cause a modification of the src-nat rule, and in many of these cases that client will be the only one behind that NAT.
The configuration
# Create a bridge without any member ports so that we'd have something to attach the additional local IP address to.
# Actually the address could be added to an existing interface, but a member-less bridge never fails.
/interface bridge
add name=aux-lo protocol-mode=none
# Add another local address - just to have this part independent from the rest of the configuration.
/ip address
add address=127.0.1.1 interface=aux-lo network=127.0.1.1
# Add a firewall rule permitting local traffic - currently, default firewall rules drop traffic from in-interface-list=!LAN which
# includes local traffic
/ip firewall filter
# Placeholder: replace <rule-number> with the number of the rule which currently follows the "accept established,related" rule
add chain=input src-address=127.0.0.0/8 dst-address=127.0.0.0/8 action=accept place-before=<rule-number>
# Create the two ends of the local tunnel
/interface ipip
add local-address=127.0.0.1 mtu=1500 name=ipip-inner remote-address=127.0.1.1
add local-address=127.0.1.1 mtu=1500 name=ipip-outer remote-address=127.0.0.1
# Add routes for the addresses used for the solution
/ip route
add distance=1 dst-address=10.0.0.0/20 gateway=ipip-inner
add distance=1 dst-address=10.0.15.254/32 gateway=ipip-outer
# Add the chain of firewall rules preventing newer connections from killing an older one before the cleaner script changes the src-nat address
/ip firewall filter
add chain=udp-4500-in src-address-list=src-addresses-in-use action=drop
add chain=udp-4500-in action=add-src-to-address-list address-list=src-addresses-in-use address-list-timeout=1m
add chain=udp-4500-in action=accept
# Add the firewall rule sending new packets to UDP 4500 coming from the tunnel to the chain above
/ip firewall filter
# Placeholder: replace <rule-number> with the number of the rule which currently follows the "accept established,related" rule
add action=jump chain=input connection-state=new dst-port=4500 in-interface=ipip-inner jump-target=udp-4500-in protocol=udp place-before=<rule-number>
# The usual IPsec- and L2TP-related firewall rules must be there as well, usually they already exist
add action=accept chain=input connection-state=new dst-port=500,4500 protocol=udp
add action=accept chain=input connection-state=new dst-port=1701 ipsec-policy=in,ipsec protocol=udp
# Add the firewall rule permitting forwarding of dst-nated packets in the first pass
/ip firewall filter
add action=accept chain=forward connection-state=new dst-address=10.0.15.254
# Add the NAT rules
/ip firewall nat
# Restore our public IP address on packets after they've passed through the tunnel
add action=dst-nat chain=dstnat dst-address=10.0.15.254 in-interface=ipip-inner to-addresses=1.2.3.4
# src-nat the packets before sending them to the tunnel
add action=src-nat chain=srcnat out-interface=ipip-outer protocol=udp to-addresses=10.0.0.1
# Redirect packets to port 4500 to the auxiliary destination address to give them the special treatment;
# for testing that it works with only two client devices, remove the "src-port=!4500"
add action=dst-nat chain=dstnat dst-port=4500 src-port=!4500 dst-address=1.2.3.4 protocol=udp to-addresses=10.0.15.254
# Add the cleaner script
/system script
add name=l2tp-helper owner=admin policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon source=\
":local cntr 0; \\\
\n:local auxip [/ip firewall nat get [find chain=\"srcnat\" && out-interface=\"ipip-outer\"] to-addresses]; \\\
\n:while ([/ip firewall connection print count-only where src-address~\"^\$auxip:\" && dst-address~\":4500\" && seen-reply]>0) \
do={\
\n :set auxip (\$auxip+1); \\\
\n :if (\$auxip>10.0.15.253) do={:set auxip 10.0.0.1};:set cntr (\$cntr+1)\
\n}\
\n:if (\$cntr>0) do={\
\n /ip firewall nat set [find chain=\"srcnat\" && out-interface=\"ipip-outer\"] to-addresses=\"\$auxip\"; \\\
\n /ip firewall connection remove [find dst-address~\":4500\" && !seen-reply]\
\n}\
\n"
# Schedule the cleaner script to run every 3 seconds, starting at boot
/system scheduler
add interval=3s name=l2tp-scheduler on-event=l2tp-helper policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon \
start-time=startup