In summary: if you run VPLS on your network, rely on VPLS fragmentation, and use any CCR1xxx (Tilera SoC) series routers as PEs, you are probably experiencing this bug even if you don't know it. It has existed for years. Perhaps it has escaped attention for so long because most people running VPLS on MikroTik are not using fragmentation. But if parts of your network are not yet "jumbo-clean" and you are forced to use VPLS fragmentation as a result, then unless/until this gets fixed, your options are to force a lower MTU for end-users and stop allowing VPLS to fragment your frames, to finish making your network jumbo-clean from end to end, or to rip & replace any TILE-based CCRs acting as a PE.
- - - -
We have been chasing a problem for some months now where some network users will mysteriously and randomly experience transmission stalls that eventually lead to premature termination of their TCP data transfer, resulting in incomplete file download or upload, or interruption in streaming.
I think I have finally nailed down the problem, and found it in a place that we were not expecting: a bug in RouterOS related to VPLS pseudowire defragmentation. And bizarrely, it only affects CCR1xxx (Tilera) products. This is actually a strange and complex bug, both to describe and to reproduce, so I will attempt to be very exact as I work to lay out the details below. I have also put together a minimum-viable config and test scenario to reliably reproduce the bug, which I will outline here.
As far as I can tell, this bug has existed throughout every RouterOS 6.x release and also into all of RouterOS 7.x even up to the current beta. Because it only affects TILE-based MikroTik routers, my theory is that there is some MPLS/VPLS processing assist or acceleration support (maybe Fast Path related?) in the Linux driver for Tilera SoC Ethernet interfaces, and the bug exists in there somewhere. This is of course only a guess.
Brief description: when a VPLS payload gets fragmented and is then delivered to a CCR1xxx router to be reassembled, if particular bits at particular offsets within one of the fragments are set to particular values, the CCR will reject the fragment even though it is not corrupt and its contents are perfectly valid. The entire payload is then lost (the Ethernet frame never finishes being reassembled and is never forwarded to the end-user).
If this happens to a TCP packet encapsulated within fragmented VPLS, TCP on the sender side will of course attempt to retransmit when no ACK comes back from the recipient. But since the contents of the retransmitted packet will be 100% identical to the one that was never delivered, the same bits at the same offsets will still have the same values, and the CCR will reject the same packet every single time. Since the recipient never gets the packet no matter how many times it is retransmitted, and therefore never sends back an ACK to the sender, eventually one party or the other will give up and send a TCP RST, aborting the entire transfer.
When encryption is employed (e.g., HTTPS), these failures occur seemingly at random even when downloading the same content (it will work one time but not another, the failure will occur at different places in the transfer, etc.), since the keys for each session will be different, and thus the actual bits sent across the wire will be different every time. But unencrypted streams that are affected will always hang and then abort at the exact same place in the transfer. So once you find a sequence of bytes that can cause the problem every time, the issue is infinitely reproducible with that sequence.
If we dig deeper and look at affected payloads, it appears that what is happening is that if it is possible for the contents of the fragment to be (incorrectly) interpreted as another MPLS-tagged Ethernet frame -- specifically, one with two labels but without a control word -- then the CCR that is tasked with reassembling the fragments will choose to interpret it that way even though it isn't, and throw out the fragment.
Here are some example screenshots from Wireshark of a "good" fragment (one that the CCR does not reject), and a "bad" one (that the CCR erroneously rejects).
"Good": [Wireshark screenshot]
"Bad": [Wireshark screenshot]
Notice that in the "bad" example, even Wireshark tries to interpret the contents of the encapsulated fragment as if it is another MPLS-tagged Ethernet frame, with two MPLS labels in the stack. I believe this is actually exactly what RouterOS running on TILE is doing. You can see here that Wireshark even tries to interpret the bytes that would normally be where the source and destination would be located as MAC addresses, even though they are just part of a TCP data stream and do not represent host MAC addresses.
Based on this, we can conclude that the following conditions within the packet fragment are necessary to trigger the bug:
For any fragment past the first one, the contents need to be ambiguously (and incorrectly) interpretable as a 2-label MPLS stack with no PW Ethernet Control Word. So...
- The 16-bit word at a relative offset of 0x0C past the end of the PW control word (so, an absolute offset of 0x26) needs to contain 0x8847 (the Ethertype for an MPLS-labeled frame)
- The following 8 bytes need to be interpretable as an MPLS label stack with 2 labels; so, the 3rd of those 8 bytes needs to have its last bit (the bottom-of-stack bit) set to 0, and the byte 4 bytes after that one (the 7th byte) needs to have its last bit set to 1
- The byte immediately following all of that needs to not have its first 4 bits all set to 0 (the first 4 bits being 0 after the last MPLS label indicates a PW control word, so turning on any bit within those 4 bits satisfies this criterion)
And in fact, we can see in the "bad" screenshot that all of these are true: octets 0x26 and 0x27 happen to be "0x88" and "0x47" (0x8847) respectively [criterion #1]; the following 64 bits of hex "c0 ae 2c f5 1b 4a 47 b2" are interpreted as two MPLS labels, where the last bit of the even number "0x2c" (octet 3) is 0 while the last bit of the odd number "0x47" (octet 7) is 1 [criterion #2]; and the first 4 bits of the immediately following octet "0xd6" are not all 0 [criterion #3].
Further validating this theory, if you take the exact same payload, and you replay it while changing ANY one of these factors by simply flipping a bit or two (for example: change "0x8847" to "0x8848", or change "0x2c" to "0x2d", or change "0xd6" to "0x06"), then the CCR will reassemble the PW fragment just fine and forward the whole frame to the end-user.
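To make the three criteria concrete, here is a rough Python sketch of the check I believe the TILE code path is effectively (and wrongly) performing. This is only my model of the observed behavior, not anything taken from RouterOS. The function name `looks_like_mpls_frame` and the helper `patched` are made up for illustration, and the 12 leading zero bytes in `BAD` are placeholders for the would-be MAC addresses; only the last 11 bytes are the values quoted from the "bad" capture.

```python
MPLS_ETHERTYPE = 0x8847  # Ethertype for an MPLS-labeled (unicast) frame


def looks_like_mpls_frame(body: bytes) -> bool:
    """Check a non-first PW fragment against the three trigger criteria.

    `body` is the fragment content starting right after the PW control word.
    """
    if len(body) < 0x0C + 2 + 8 + 1:
        return False  # too short to hold Ethertype + 2 labels + 1 more byte
    ethertype = int.from_bytes(body[0x0C:0x0E], "big")
    labels = body[0x0E:0x16]          # 8 bytes: two 4-byte label entries
    following = body[0x16]
    return (
        ethertype == MPLS_ETHERTYPE   # criterion 1
        and (labels[2] & 0x01) == 0   # criterion 2: first label, S bit clear
        and (labels[6] & 0x01) == 1   # criterion 2: second label, S bit set
        and (following >> 4) != 0     # criterion 3: not a PW control word
    )


def patched(body: bytes, offset: int, value: int) -> bytes:
    """Return a copy of `body` with one byte replaced."""
    out = bytearray(body)
    out[offset] = value
    return bytes(out)


# 12 placeholder bytes, then the bytes quoted from the "bad" capture.
BAD = bytes(12) + bytes.fromhex("8847 c0ae2cf5 1b4a47b2 d6")

if __name__ == "__main__":
    print(looks_like_mpls_frame(BAD))                       # matches all three
    print(looks_like_mpls_frame(patched(BAD, 0x0D, 0x48)))  # 0x8847 -> 0x8848
    print(looks_like_mpls_frame(patched(BAD, 0x10, 0x2D)))  # 0x2c -> 0x2d
    print(looks_like_mpls_frame(patched(BAD, 0x16, 0x06)))  # 0xd6 -> 0x06
```

Each of the three patched variants corresponds to one of the single-bit changes described above that lets the fragment through.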
Thus to reproduce, it should simply be necessary to take a sequence of bytes that would normally pass, and then to change the right bits in the right places in order for the conditions necessary to trigger the bug to be satisfied, which you can do once you know where the fragmentation will occur.
For the minimum-viable config, this is precisely what I have done: set up a lab network with end-to-end path MTU of 1500 and MPLS MTU of 1492, calculated where a 1500-byte IP packet (== 1514-byte Ethernet frame) would get split in this scenario by the ingressing PE router, send a 1500-byte IP packet from one end to the other and confirm that it was received okay, modify the same payload to meet the above-described necessary conditions, and then transmit the modified payload and confirm that it was NOT received (bug triggered on egressing PE).
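The split-point calculation can be sketched in a few lines of Python. The assumptions here are mine: that the MPLS MTU of 1492 counts the label stack and the control word along with the payload, that two 4-byte labels are on the wire (which is what the absolute offset of 0x26 mentioned earlier implies: 14 + 8 + 4 + 0x0C), and that the frame splits into exactly two fragments. The function names are made up for this example.

```python
ETH_HEADER = 14        # bytes of Ethernet header in the customer frame
IP_HEADER = 20         # IPv4 header without options
ICMP_HEADER = 8        # ICMP echo header
LABEL_STACK = 2 * 4    # assumption: two MPLS labels on the wire
CONTROL_WORD = 4       # PW control word
ETHERTYPE_OFFSET = 0x0C  # where the bogus Ethertype must sit in fragment 2


def split_point(frame_len: int, mpls_mtu: int) -> tuple[int, int]:
    """Return (bytes of the frame in fragment 1, bytes left for fragment 2)."""
    first = mpls_mtu - LABEL_STACK - CONTROL_WORD
    return first, frame_len - first


def trigger_offset_in_icmp_payload(frame_len: int = 1514, mpls_mtu: int = 1492) -> int:
    """Offset within the ICMP data that lands at 0x0C of the second fragment."""
    first, _ = split_point(frame_len, mpls_mtu)
    offset_in_frame = first + ETHERTYPE_OFFSET
    return offset_in_frame - (ETH_HEADER + IP_HEADER + ICMP_HEADER)


if __name__ == "__main__":
    print(split_point(1514, 1492))
    print(hex(trigger_offset_in_icmp_payload()))
```

Under those assumptions the second fragment carries the last 34 bytes of the frame, and the position that must hold 0x8847 works out to the same payload offset I patch in the "fail" test file.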
For this lab, we will set up the following:
- 2x P routers (named P-1 and P-2; MPLS switchers/forwarders)
- 2x PE routers (named PE-1 and PE-2; MPLS label push/pop & VPLS encapsulate/deencapsulate and fragment/reassemble)
- 2x PC hosts of some sort (I call the sending one "server" and receiving one "client")
This is very basic / "MPLS 101": P-1 and P-2 are connected to each other, PE-1 is connected to P-1 and PE-2 is connected to P-2, a VPLS tunnel is established between PE-1 and PE-2, and the server is connected to PE-1 while the client is connected to PE-2.
+--------+      +------+      +-----+      +-----+      +------+      +--------+
| Server | <==> | PE-1 | <==> | P-1 | <==> | P-2 | <==> | PE-2 | <==> | Client |
+--------+      +------+      +-----+      +-----+      +------+      +--------+
                                                          TILE
For this lab, PE-1, P-1, and P-2 can be any hardware arch; in my own tests I have just been using x86, but it does not matter. But PE-2 (the receiving/deencapsulating PE that has the "client" attached to it) MUST be a TILE-arch CCR1xxx model. Once you reproduce the problem with that CCR, you can then replace PE-2 with a non-TILE device with 100% identical configuration, retry the test, and see that it passes successfully. We will also start out by running 6.49.7 on everything, then upgrade the TILE-based PE-2 from 6.49.7 to 7.7rc1, re-run the test, and see that it still fails even on RouterOS 7. All of this will establish that the problem only happens on Tilera SoCs, and that it affects every single version of RouterOS to date. (I have also tested back to early versions of RouterOS 6 and the problem has been there since the beginning; it was not introduced as a regression in some later version of 6.x. Yes, this bug has existed for YEARS and has somehow flown under the radar this whole time.)
For the "server" and the "client", I am simply attaching two Linux hosts, and running Nping from Nmap on the server to send a custom-crafted 1500-byte ICMP payload to the client. I am attaching two versions of the payload: test-payload-pass.txt and test-payload-fail.txt. The "pass" one is simply a repeating sequence of bytes (0 through 9, a space, roman alphabet a-z lowercase then A-Z uppercase, a space). The "fail" one takes this same payload, changes bytes at offsets 0x05aa and 0x05ab to "0x8847", and the byte at offset 0x05b2 from ASCII "N" to "O" (0x4e to 0x4f). This makes the "fail" one satisfy all 3 requirements previously described in order to reproduce the bug, so if you use Nping on the server to send an ICMP ping with the "pass" payload, you will get a response from the client, but if you use the "fail" payload, you will never get a response from the client, because the client never receives the reassembled packet from PE-2.
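If you would rather regenerate the two payloads than download the attachments, something like the following Python should produce equivalent data. It is a reconstruction from the description above, not the attached files themselves: I am assuming the payload is 1472 bytes (1500 minus the 20-byte IP header and 8-byte ICMP header) and that the repeating pattern starts at offset 0. The function names are made up for this example.

```python
import string

# One 64-byte repetition: digits, space, a-z, A-Z, space.
PATTERN = (
    string.digits + " " + string.ascii_lowercase + string.ascii_uppercase + " "
).encode("ascii")

PAYLOAD_LEN = 1500 - 20 - 8  # assumption: ICMP data that fills a 1500-byte IP packet


def build_pass_payload(length: int = PAYLOAD_LEN) -> bytes:
    """Repeat the pattern and cut it to the requested length."""
    reps = -(-length // len(PATTERN))  # ceiling division
    return (PATTERN * reps)[:length]


def build_fail_payload() -> bytes:
    """Apply the byte changes described in the post to the 'pass' payload."""
    data = bytearray(build_pass_payload())
    data[0x05AA:0x05AC] = b"\x88\x47"   # bogus MPLS Ethertype
    assert data[0x05B2] == ord("N")     # sanity check against the description
    data[0x05B2] = ord("O")             # 0x4e -> 0x4f sets the low bit
    return bytes(data)


if __name__ == "__main__":
    good, bad = build_pass_payload(), build_fail_payload()
    changed = [hex(i) for i in range(len(good)) if good[i] != bad[i]]
    print(len(good), len(bad), changed)
```

One thing this makes visible: in the unmodified pattern the byte at 0x05ab is already "G" (0x47), so only the byte at 0x05aa and the byte at 0x05b2 actually differ between the two payloads.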
(For the lab "server" and "client", I like to use Alpine Linux since it is lightweight -- a default installation from the "extended" ISO can easily fit within 2GiB -- and is very quick to install. But you can of course use whatever is most convenient or desirable for you. Nping is not included by default, so if using Alpine, you can install it with "apk add nmap-nping". This means you will need to attach your "server" host to the internet briefly to install Nping before attaching it to PE-1 in the lab. On Debian or Ubuntu, Nping is included in the full Nmap package, so "apt install nmap". If using Windows then naturally disable Windows Firewall on both sides.)
In my case, I have assigned the server 192.168.10.1/24, and the client 192.168.10.2/24. On Linux, the nping commands I am using are as follows (while in the directory where I have copies of the test payload files, running as superuser/root, with Bash as my shell):
# nping --icmp --data-string "$(cat ./test-payload-pass.txt)" 192.168.10.2
# nping --icmp --data-string "$(cat ./test-payload-fail.txt)" 192.168.10.2
(note that Nping does not seem to have a way to ingest a custom payload directly from a file, and on Windows cmd.exe or PowerShell I am not sure how to read in a file and incorporate its contents into a parameter)
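One possible workaround, which I have not verified on every platform: as far as I know, Nping also accepts a hex-encoded payload via its `--data` option, which sidesteps shell quoting and non-ASCII bytes such as 0x88. A small Python helper can turn a payload file into that hex string. The script name `payload_hex.py` and the function name are made up for this example, and stripping trailing newlines is my choice, to mimic what `$(cat ...)` does in Bash.

```python
import sys


def payload_to_hex(data: bytes) -> str:
    """Hex-encode a payload, dropping trailing newlines like $(cat file) would."""
    return data.rstrip(b"\n").hex()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 payload_hex.py ./test-payload-fail.txt
    with open(sys.argv[1], "rb") as fh:
        print(payload_to_hex(fh.read()))
```

The output could then be passed along the lines of `nping --icmp --data "$(python3 payload_hex.py ./test-payload-fail.txt)" 192.168.10.2`; treat that invocation as a sketch and check it against the Nping documentation for your version.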
Here are the example outputs demonstrating the problem:
# nping --icmp --data-string "$(cat ./test-payload-pass.txt)" 192.168.10.2
WARNING: Payload exceeds maximum recommended payload (1400)
Starting Nping 0.7.70 ( https://nmap.org/nping ) at 2022-12-17 18:15 PST
SENT (0.0054s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=1] IP [ttl=64 id=649 iplen=1500 ]
RCVD (0.2044s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=1] IP [ttl=64 id=23373 iplen=1500 ]
SENT (1.0057s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=3] IP [ttl=64 id=649 iplen=1500 ]
RCVD (1.0178s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=3] IP [ttl=64 id=23393 iplen=1500 ]
SENT (2.0070s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=3] IP [ttl=64 id=649 iplen=1500 ]
RCVD (2.0344s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=3] IP [ttl=64 id=23496 iplen=1500 ]
SENT (3.0078s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=4] IP [ttl=64 id=649 iplen=1500 ]
RCVD (3.0511s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=4] IP [ttl=64 id=23562 iplen=1500 ]
SENT (4.0094s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=15922 seq=5] IP [ttl=64 id=649 iplen=1500 ]
RCVD (4.0678s) ICMP [192.168.10.2 > 192.168.10.1 Echo reply (type=0/code=0) id=15922 seq=5] IP [ttl=64 id=23608 iplen=1500 ]
Max rtt: 198.913ms | Min rtt: 12.002ms | Avg rtt: 67.988ms
Raw packets sent: 5 (7.500KB) | Rcvd: 5 (7.500KB) | Lost: 0 (0.00%)
Nping done: 1 IP address pinged in 4.07 seconds
# nping --icmp --data-string "$(cat ./test-payload-fail.txt)" 192.168.10.2
WARNING: Payload exceeds maximum recommended payload (1400)
Starting Nping 0.7.70 ( https://nmap.org/nping ) at 2022-12-17 18:16 PST
SENT (0.0027s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=1] IP [ttl=64 id=20226 iplen=1500 ]
SENT (1.0031s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=2] IP [ttl=64 id=20226 iplen=1500 ]
SENT (2.0044s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=3] IP [ttl=64 id=20226 iplen=1500 ]
SENT (3.0050s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=4] IP [ttl=64 id=20226 iplen=1500 ]
SENT (4.0063s) ICMP [192.168.10.1 > 192.168.10.2 Echo request (type=8/code=0) id=64955 seq=5] IP [ttl=64 id=20226 iplen=1500 ]
Max rtt: N/A | Min rtt: N/A | Avg rtt: N/A
Raw packets sent: 5 (7.500KB) | Rcvd: 0 (0B) | Lost: 5 (100.00%)
Nping done: 1 IP address pinged in 5.01 seconds
NOTE: strangely, once a TILE-based PE router receives a VPLS fragment that meets all the conditions necessary to trigger the bug, it will also stop forwarding non-triggering VPLS frames for a few seconds to the "client" host. So if you send a ping with the "fail" payload, and then immediately try to send the "pass" payload afterward, you will also likely drop one or more of the packets with the "pass" payload.
Below you will find the config exports for P-1, P-2, PE-1, and PE-2. In this ticket I am also attaching a supout from the CCR1009 I am using in my lab while it is running 7.7rc1 (first configured while running 6.49.7, then upgraded in place straight to 7.7rc1; note however that after the upgrade I had to manually run "/mpls interface add interface=all mpls-mtu=1492" because the upgrade from 6.x to 7.x "lost" the 'all' MPLS interface that was already configured...so, another bug):
P-1:
/system identity set name=mpls-lab-p-1
/interface bridge add name=loopback
/interface ethernet set [ find default-name=ether1 ] name=ether1-to-pe1
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-p2
/ip address add address=192.168.1.1 interface=loopback
/ip address add address=192.168.0.1/30 interface=ether2-to-p2
/ip address add address=192.168.0.5/30 interface=ether1-to-pe1
/routing ospf instance set [ find default=yes ] router-id=192.168.1.1
/routing ospf interface add interface=ether1-to-pe1 network-type=broadcast
/routing ospf interface add interface=ether2-to-p2 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/23
/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.1.1 transport-address=192.168.1.1 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-pe1
/mpls ldp interface add interface=ether2-to-p2
P-2:
/system identity set name=mpls-lab-p-2
/interface bridge add name=loopback
/interface ethernet set [ find default-name=ether1 ] name=ether1-to-pe2
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-p1
/ip address add address=192.168.1.2 interface=loopback
/ip address add address=192.168.0.2/30 interface=ether2-to-p1
/ip address add address=192.168.0.9/30 interface=ether1-to-pe2
/routing ospf instance set [ find default=yes ] router-id=192.168.1.2
/routing ospf interface add interface=ether1-to-pe2 network-type=broadcast
/routing ospf interface add interface=ether2-to-p1 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/23
/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.1.2 transport-address=192.168.1.2 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-pe2
/mpls ldp interface add interface=ether2-to-p1
PE-1:
/system identity set name=mpls-lab-pe-1
/interface bridge add name=loopback
/interface ethernet set [ find default-name=ether1 ] name=ether1-to-p1
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-server
/ip address add address=192.168.2.1 interface=loopback
/ip address add address=192.168.0.6/30 interface=ether1-to-p1
/routing ospf instance set [ find default=yes ] router-id=192.168.2.1
/routing ospf interface add interface=ether1-to-p1 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/22
/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.2.1 transport-address=192.168.2.1 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-p1
/interface vpls add disabled=no l2mtu=1500 name=vpls1 remote-peer=192.168.2.2 use-control-word=yes vpls-id=1:1
/interface bridge add name=lan protocol-mode=none
/interface bridge port add bridge=lan interface=ether2-to-server
/interface bridge port add bridge=lan interface=vpls1
PE-2:
/system identity set name=mpls-lab-pe-2
/interface bridge add name=loopback
/interface ethernet set [ find default-name=ether1 ] name=ether1-to-p2
/interface ethernet set [ find default-name=ether2 ] name=ether2-to-client
/ip address add address=192.168.2.2 interface=loopback
/ip address add address=192.168.0.10/30 interface=ether1-to-p2
/routing ospf instance set [ find default=yes ] router-id=192.168.2.2
/routing ospf interface add interface=ether1-to-p2 network-type=broadcast
/routing ospf network add area=backbone network=192.168.0.0/22
/mpls interface set [ find default=yes ] mpls-mtu=1492
/mpls ldp set enabled=yes lsr-id=192.168.2.2 transport-address=192.168.2.2 use-explicit-null=yes
/mpls ldp interface add interface=ether1-to-p2
/interface vpls add disabled=no l2mtu=1500 name=vpls1 remote-peer=192.168.2.1 use-control-word=yes vpls-id=1:1
/interface bridge add name=lan protocol-mode=none
/interface bridge port add bridge=lan interface=ether2-to-client
/interface bridge port add bridge=lan interface=vpls1