Layer7 and WEB spider

KDmitrii · Sun Apr 02, 2017 6:32 pm

Hello gentlemen!
A MikroTik router is used as a transparent firewall protecting several web servers.
The problem: I need to block requests from search engine crawlers (web spiders), except the Google and Yandex bots.
Below are requests from unknown robots that I'd like to block:

46.229.168.68 - - [01/Apr/2017:21:37:07 +0600] "GET /eng/business_news/3053 HTTP/1.1" 301 562 "-" "Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)"
46.229.168.68 - - [01/Apr/2017:21:37:31 +0600] "GET /eng/business_news/3053 HTTP/1.1" 301 562 "-" "Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)"

216.244.66.247 - - [01/Apr/2017:22:18:19 +0600] "GET /fr/business_news/183 HTTP/1.1" 200 424 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)"
216.244.66.247 - - [01/Apr/2017:22:18:25 +0600] "GET /fr/business_news/435 HTTP/1.1" 200 424 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)"

How can I do this with Layer7? I can't write regex.

Sob · Sun Apr 02, 2017 7:59 pm

Both of those claim to obey robots.txt rules, so you can use it to tell them to leave you alone.
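As a rough sketch of what such a robots.txt could look like, the Python snippet below feeds a candidate file to the standard library's `urllib.robotparser` and checks which crawlers it admits. The `Googlebot` and `Yandex` tokens and the helper name `can_crawl` are assumptions for illustration; check each search engine's documentation for the exact token it honours.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt: allow Google and Yandex, turn everyone else away.
# An empty "Disallow:" means "nothing is disallowed" for that crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Yandex
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(bot_name: str, path: str = "/eng/business_news/3053") -> bool:
    """Return True if a crawler identifying as bot_name may fetch path.

    bot_name is the crawler's name token (e.g. "SemrushBot"), not the
    full User-Agent header string.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(bot_name, path)


if __name__ == "__main__":
    for bot in ("Googlebot", "YandexBot", "SemrushBot", "DotBot"):
        print(bot, can_crawl(bot))
```

This only helps with bots that actually read and obey robots.txt; it does nothing against the ones that ignore it.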

Some other bots may be less well behaved, so you may want to block them by other means, but I don't think the L7 filter is the right way. The L7 filter is stupid: it doesn't know about protocol internals, it just looks for a pattern somewhere inside the packet. You can have either a simple regexp and more false positives, or fewer false positives but a complex regexp. Neither is good. False positives are simply bad, because you don't want to block innocent traffic. And a complex regexp is bad for performance.

KDmitrii · Mon Apr 03, 2017 6:37 am

Some other bots may be less well behaved, so you may want to block them by other means, but I don't think the L7 filter is the right way.

Such as what? The usual way, blocking by IP address?

Sob · Mon Apr 03, 2017 9:00 pm

Collecting IP addresses is possible, but it might be an endless battle. The other option is to block them by user agent (as you wanted to), but on the server. The difference is that the web server understands the HTTP protocol, so it will look only at the actual User-Agent header and nowhere else.
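To illustrate the server-side idea, here is a minimal sketch written as Python WSGI middleware. It assumes the site is served by a WSGI application, which nothing in this thread confirms; on other servers the same check would go into the server's own configuration. The names `block_bots`, `demo_app` and `status_for` are made up for the example.

```python
BLOCKED_AGENTS = ("SemrushBot", "DotBot")


def block_bots(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app and answer 403 when User-Agent names a blocked bot."""
    def middleware(environ, start_response):
        # The server has already parsed the request, so this is exactly
        # the User-Agent header and nothing else.
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name.lower() in agent for name in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def status_for(user_agent: str) -> str:
    """Run one fake request through the wrapped app and return its status."""
    seen = []
    wrapped = block_bots(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent},
            lambda status, headers: seen.append(status))
    return seen[0]
```

Because the check runs after the server has decoded the request, it behaves the same for HTTP and HTTPS.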

Ok, I'd be lying if I claimed that it was completely impossible using L7, you can get pretty close with e.g.:

/ip firewall layer7-protocol
add name=botblock regexp="\\x0aUser-Agent:[^\\x0a]+(DotBot|SemrushBot)"
/ip firewall filter
add action=reject chain=forward dst-address=<server> dst-port=80 layer7-protocol=botblock \
    protocol=tcp reject-with=tcp-reset

The [^\\x0a]+ part means "one or more characters other than LF", i.e. the match cannot cross an end of line, so it checks only the right header. It will work as long as the same string doesn't appear anywhere else, which I agree is unlikely. But I still don't think it's a good idea, because it will slow everything down and you want your packets to get through the router as fast as possible. Plus it's limited to plain HTTP; it won't work with HTTPS. If you do it on the server, it can work for both.
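The logic of that pattern can be tried out in Python. This is only a sketch: it uses Python's `re` module, not the RouterOS matcher, so details such as case sensitivity or how much of the connection is inspected may differ on the router. The sample requests and the name `is_blocked` are invented for the example.

```python
import re

# Python spelling of the layer7 regexp above (\x0a is LF, written \n here).
BOT_PATTERN = re.compile(r"\nUser-Agent:[^\n]+(DotBot|SemrushBot)")


def is_blocked(payload: str) -> bool:
    """Return True if the raw payload would match the bot pattern."""
    return BOT_PATTERN.search(payload) is not None


# A request from one of the bots in the log excerpt.
BOT_REQUEST = (
    "GET /eng/business_news/3053 HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: Mozilla/5.0 (compatible; SemrushBot/1.2~bl; "
    "+http://www.semrush.com/bot.html)\r\n"
    "\r\n"
)

# A browser request that mentions the bot names only in the URL and in
# another header; [^\n]+ cannot cross a line, so it is not matched.
BROWSER_REQUEST = (
    "GET /articles/DotBot-review HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64)\r\n"
    "Referer: http://example.com/SemrushBot\r\n"
    "\r\n"
)

# The unlikely false positive: the same header-like text inside a body.
POST_WITH_LOG_BODY = (
    "POST /comment HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64)\r\n"
    "\r\n"
    "log excerpt:\nUser-Agent: DotBot/1.1\n"
)

if __name__ == "__main__":
    print(is_blocked(BOT_REQUEST))
    print(is_blocked(BROWSER_REQUEST))
    print(is_blocked(POST_WITH_LOG_BODY))
```

The third sample is the kind of innocent traffic that would still be caught, because the filter sees only bytes and has no idea where the headers end.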
