The first time it happened, a reboot of the CCR fixed the glitch. Recently it happened to another one of our CCRs. This time we had to revert to a backup on our FTP system that was made a few days before the glitch. If you don't have automated backups, you need them. They are a lifesaver: https://wiki.mikrotik.com/wiki/Automated_Backups
Anyway, while troubleshooting we tried:
- disabling everything under IP firewall
- sending the traffic through a different BGP peer (we have a single-multihomed setup there)
This looked exactly like some kind of BGP error. We thought that our IP blocks were just not making it to the entire internet, which would explain why we could reach some sites and not others. Talking to our BGP peer, we found that the smaller packets needed to open the netflix.com website were flowing both ways, but the larger packets carrying the website HTML were never being sent to our clients. There was 2-way communication between the user PC and netflix.com, but the session would never complete.
We solved the issue by changing the path of the traffic so that it did not flow through the CCR that was blocking Netflix. This is what clued us in that the issue was the CCR itself. If the CCR that had previously lost power routed the traffic, we could not reach netflix.com. If we sent the traffic through another CCR and on to the internet, we could reach netflix.com.
I know this is vague and does not have enough detail to diagnose anything. I expect to encounter the issue again, so what kind of info should I gather next time to better demonstrate the issue and help pin down what is happening?
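Since small packets got through and large ones did not, one thing I could record next time is the largest packet size that still passes through the affected CCR. Below is a rough sketch of how that might be measured, not a tested diagnostic. It assumes a Linux host behind the CCR with the iputils `ping` (`-M do` sets Don't Fragment), that the target answers ICMP echo at all, and that every size below the limit passes. The function names and the 1472-byte ceiling (1500 minus 28 bytes of IP and ICMP headers) are my own choices for illustration.

```python
import subprocess


def ping_df(host, payload, timeout_s=2):
    """Send one ICMP echo with Don't Fragment set; True if a reply came back.

    Assumes Linux iputils ping: -M do (set DF), -s (payload bytes), -W (timeout in s).
    """
    cmd = ["ping", "-c", "1", "-M", "do",
           "-s", str(payload), "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_passing_payload(probe, low=0, high=1472):
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Returns None if even the smallest size fails. Assumes that every size
    below the limit passes and every size above it fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    host = "netflix.com"  # the site from the report; it may not answer ICMP
    size = largest_passing_payload(lambda n: ping_df(host, n))
    if size is None:
        print("no ICMP reply at any size; this test is inconclusive")
    else:
        print(f"largest ICMP payload that passed with DF set: {size} bytes")
```

Running it once through the affected CCR and once through a working one, and comparing the two numbers, would show whether the failure depends on packet size.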