Forum Topic: Monitoring 6G
So I’ve had 6G up for a few days now. I started by excluding certain directories, such as my newsletter and cart programs, because they use ultra-long query strings and a few other things that 6G wouldn’t allow, and I wasn’t able to track them all down. I also installed the blackhole, so now I get an email whenever someone gets 403’ed, blackholed, etc., and then I scan through the logs to figure out why exactly they got in trouble.
Alright, so the first big question: this one person or bot has a user-agent that is wrapped in ‘this’ (little quotes or whatever).
"'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'"
Why do they wrap it in 'this' (and why did you decide to exclude
Also, what strikes me as off-putting is that the classier bots tend to include a URL that points to a page explaining what their bot does, how you can influence its behavior on your site, and so on.
The next major thing I’ve noticed is that many of the questionable bots are using HEAD requests. I’m tempted to ban HEAD requests.
I think that’s all for 6G. I thought it would take me *forever* to decode its meaning, but I’m surprised to see I understood most of it fairly quickly. You print out the character def’s and stick them on your wall, and next thing you know, everything starts to make perfect sense.
Actually, there are two more questions:
The third redirectmatch line seems to want to redirect the letter ‘s’?
RedirectMatch 403 (?i)(<|>|:|;|\'|\s)
And the fifth redirectmatch line doesn’t like
RedirectMatch 403 (?i)(\"|\.|\_|\&|\&)$
(My cart’s ‘checkout with paypal’ button uses
3 Replies to “Monitoring 6G”
“Why do they wrap it in ‘this’ (and why did you decide to exclude
I could only guess at “why” skiddies do some of the things they do.. perhaps it’s their way of telling their victims that they are spoofing the UA, like when people do that two-finger quote thing with their hands.. When included with code, quotes play an important role in syntax, etc., so it could be related to that or perhaps a relic of frantic copy/pasting.. only wild stabs here. And for the this that and hrefs, you’ll have to let me know which patterns/lines you’re referring to — I’m working remotely and unable to reference any of my codes at the moment.
Blocking HEAD requests may seem counter-intuitive, but it’s totally fine in most cases; only bots and scripts should be affected.
“The third redirectmatch line seems to want to redirect the letter ‘s’?”
Nope, that’s \s, the regex token for a whitespace character such as a blank space :)
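A quick way to check this is to run the pattern outside Apache. The sketch below uses Python's re module as a stand-in for Apache's regex engine, which is an assumption (the two should agree on these simple tokens), and it also assumes the path being tested has already been percent-decoded. The helper name first_blocked_char is made up for illustration.

```python
import re

# Third RedirectMatch pattern from 6G, copied as quoted in this thread.
PATTERN = re.compile(r"(?i)(<|>|:|;|\'|\s)")

def first_blocked_char(path):
    """Return the first character in `path` that the pattern matches, or None."""
    match = PATTERN.search(path)
    return match.group(0) if match else None

# The letter "s" on its own does not trigger the rule ...
print(repr(first_blocked_char("/products/shoes")))
# ... but a blank space or a tab does.
print(repr(first_blocked_char("/my page.html")))
print(repr(first_blocked_char("/tab\there")))
```

Only the second and third paths return a match, because \s stands for whitespace and not for the letter.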
“And the fifth redirectmatch line doesn’t like
Correct, but only when it is appended as the last character on a requested URL. It won’t be blocked anywhere else (note the $, which denotes the end of a line).
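To see the effect of the trailing $ in isolation, here is a rough sketch that tests the pattern exactly as quoted in the post. It again uses Python's re module as an approximation of Apache's regex engine (an assumption), and the helper name blocked_at_end and the sample paths are hypothetical.

```python
import re

# Fifth RedirectMatch pattern from 6G, copied as quoted in this thread.
PATTERN = re.compile(r'(?i)(\"|\.|\_|\&|\&)$')

def blocked_at_end(path):
    """True if the last character of `path` is one the pattern lists."""
    return PATTERN.search(path) is not None

# Hypothetical request paths: the same characters are fine mid-path
# and only trigger the rule when they come last.
for path in ["/cart/checkout&", "/cart/a&b/checkout", "/page.html", "/page."]:
    print(path, "->", "403" if blocked_at_end(path) else "allowed")
```

An ampersand or dot in the middle of the path passes; the same character in the final position is what gets the 403.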
OK, I’m referring to these lines:
RedirectMatch 403 (?i)(<|>|:|;|\'|\s)
. . .
SetEnvIfNoCase User-Agent (<|>|'|<|%0A|%0D|%27|%3C|%3E|%00|href\s) keep_out
One thing I’m noticing is that Googlebot is getting nailed by a few 403s; I’ll try to figure it out tonight.
href\s would be equal to href and one blank space?
If you have another ebook coming out about this stuff put me on the waiting list. :)
href\s matches the string “href” followed by a single whitespace character, such as a blank space.
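As a rough illustration, the User-Agent pattern can be tried against a few sample strings. This is only a sketch: it uses Python's re module with IGNORECASE to approximate SetEnvIfNoCase (an assumption), the pattern is copied as quoted in this thread, and the helper name flagged_by and the SomeBot strings are invented for the example.

```python
import re

# User-Agent pattern copied as quoted in this thread; SetEnvIfNoCase is
# case-insensitive, which re.IGNORECASE approximates here.
UA_PATTERN = re.compile(r"(<|>|'|<|%0A|%0D|%27|%3C|%3E|%00|href\s)", re.IGNORECASE)

def flagged_by(user_agent):
    """Return the text that would set keep_out, or None if nothing matches."""
    match = UA_PATTERN.search(user_agent)
    return match.group(0) if match else None

samples = [
    "SomeBot/1.0 href http://example.com/",        # "href" plus a space
    "SomeBot/1.0 (+http://example.com/bot.html)",  # plain URL, no "href "
    "'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'",
]
for ua in samples:
    print(repr(flagged_by(ua)), "<-", ua)
```

In this sketch a bot that simply lists an info URL is not flagged by the pattern, while the quoted user-agent from the first post is caught by the single-quote alternative and not by href\s.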
For the other character patterns (angle brackets, single quotes, etc.), those are matching the unencoded characters only. As explained here (and elsewhere), certain characters must always be encoded in URLs. So theoretically it’s fine to block such characters, but in reality, even some legit URLs are not encoded properly, resulting in false positives. More info: