CCR1036 freezes with over 1Gbps of packets exhausting TTL

macpacheco · Thu Mar 02, 2017 12:25 pm

This problem happened with 2 CCR1036-12g-4s and 2 CCR1036-8g-2s+.
This problem has since been worked around after the customer stopped sending traffic meant for its own CIDR.
All CCRs are running RouterOS 6.38.3. We had to downgrade 3 months ago due to an issue with ix.br peering.
A single BGP customer didn't have the usual blackhole route for its own CIDR on its router, and it uses a default route pointed at us. So all traffic to its unused IPs came back to us and, because of the BGP routes, looped back and forth until TTL exhaustion. At some point our router stops responding to all Ethernet traffic (even ARP/ping). The router still appears to be online and the interfaces are physically up, but even unrelated interfaces stop responding.
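To put a rough number on how much a loop like this multiplies traffic, here is a minimal Python sketch. It assumes a simple two-router loop in which each router decrements the TTL by one and drops the packet when it reaches zero. The function names and the figures in the example are hypothetical, not measurements from this incident.

```python
def link_crossings(arriving_ttl: int) -> int:
    """Times one packet crosses the ISP<->customer link before it is dropped.

    Assumes a two-router loop where each router decrements the TTL by one
    and discards the packet once the TTL reaches zero.
    """
    ttl = arriving_ttl
    crossings = 0
    while True:
        ttl -= 1              # router decrements TTL before forwarding
        if ttl <= 0:
            return crossings  # TTL exhausted: packet is dropped here
        crossings += 1        # packet is sent across the link to the other router


def looped_load_mbps(stray_mbps: float, arriving_ttl: int) -> float:
    """Total load the loop puts on the link, both directions combined."""
    return stray_mbps * link_crossings(arriving_ttl)


if __name__ == "__main__":
    # Hypothetical figures, not measurements from this incident.
    print(link_crossings(50))
    print(looped_load_mbps(20.0, 50))
```

With these made-up numbers, 20 Mbps of stray traffic arriving with a TTL of 50 becomes roughly 980 Mbps on the link, split about evenly between the two directions.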
We saw a rise of up to 2 Gbps in up/down traffic before the router crashed.
After the customer inserted a blackhole route on his router for his CIDRs, the problem stopped completely.
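The reason a blackhole route fixes this comes down to longest-prefix matching on the customer's router. The Python sketch below illustrates the idea with a toy routing table. The prefixes are documentation addresses (198.51.100.0/24), and the table layout and `lookup` helper are assumptions for illustration, not the customer's actual configuration or anything RouterOS-specific.

```python
import ipaddress


def lookup(routes, dst):
    """Return the action of the most specific route that contains dst.

    Assumes the table always holds a default route, so there is
    at least one match for every destination.
    """
    addr = ipaddress.ip_address(dst)
    matches = [(net, action) for net, action in routes if addr in net]
    net, action = max(matches, key=lambda m: m[0].prefixlen)
    return action


# Hypothetical customer table: default route to the ISP, one subnet in use.
without_blackhole = [
    (ipaddress.ip_network("0.0.0.0/0"), "forward to ISP"),
    (ipaddress.ip_network("198.51.100.0/26"), "deliver locally"),
]

# Same table plus a blackhole route covering the whole announced prefix.
with_blackhole = without_blackhole + [
    (ipaddress.ip_network("198.51.100.0/24"), "blackhole"),
]

if __name__ == "__main__":
    unused_ip = "198.51.100.200"  # inside the announced /24, not in use
    print(lookup(without_blackhole, unused_ip))  # sent back to the ISP: loop
    print(lookup(with_blackhole, unused_ip))     # dropped on the customer router
```

Addresses that are actually in use still match the more specific /26, so only traffic to the unused part of the prefix is dropped.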
Such fragility on a "cloud core router" is extremely problematic.
I'm posting this issue in hopes somebody else sees a similar problem and helps Mikrotik identify the problem.
This isn't a case of 100% interface usage. The customer has 1.5 Gbps of bandwidth and we run 2x SFP+ interfaces; the bandwidth limit is enforced on the distribution Ethernet switches. So from the CCR's point of view, traffic stays capped at 1.5 Gbps, with upload and download levels roughly equal because the looped traffic overwhelms the normal traffic.
We have 2 core CCR1036-8g-2s+. Normally we run wholesale customers sharing both routers, with common IPv4 gateway and 2 separate IPv6 gateway IPs.
After the customer crashed both core routers, we switched him to use only core router 1. Core router 2 continued normally, and router 1 crashed. Then we moved him to core router 2 and disabled him on core router 1. The reverse happened. Next we moved him to a distribution router (CCR1036-12g-4s), and that router crashed just the same. Finally we set up a dedicated CCR1036-12g-4s to serve only this customer, and the problem followed the customer. So this doesn't seem like a hardware problem.
We had to power cycle the routers to bring them back online.
We're very concerned about the risk of other customers making the same mistake in the future and crashing our network. But for now the problem is gone.