Whitelist Good Bots
Instead of blacklisting a million billion different bad bots, a perhaps more effective strategy is to whitelist the good bots. In this post, I discuss the pros and cons of a general strategy for whitelisting good bots, and then examine an experimental set of .htaccess rules that you can customize and use to develop your own whitelist solution.
Counterfeits
Before allowing only a specific set of bots access to your site, it’s important to understand the mechanism by which we determine the identity of a “bot”. In general, a bot is referred to and identified by its reported user agent. Notice that word, “reported” — we say that because it’s up to each bot to accurately identify itself. Unfortunately, many bad bots disguise their true identity (and intention) by spoofing or mimicking known good bots. Faking the user agent is trivial, which is why it’s important to understand that we’re not dealing in absolutes when discussing “good” vs. “bad” bots.
Imposters
So let’s say that we’ve implemented a robust whitelist that aims to allow only good bots into our site. Seems legit, right? Sure enough, all of the good bots that we’ve specified on our list are enjoying unlimited access to the entire site. What we’re not noticing, however, are all of the visits from bad bots disguising themselves as Google, Bing, Firefox, and other well-known and trusted user agents. The whitelist, although theoretically a solid idea, is only as good as bots’ willingness to truthfully identify themselves. That leaves a bit of a gaping hole in our otherwise ideal whitelist strategy.
Evolution
Another downside that needs to be addressed is the Web’s perpetual state of change. Everything is changing all of the time, and that includes user agents. Browser vendors, search engines, software, and so forth are constantly changing the names of their bots, spiders, and scripts. For example, if you have whitelisted all agents that include the phrase “googlebot” in their reported name, what happens when Google unleashes a new crawler that reports as something completely different? The whitelist would not recognize the bot, and it would be blocked. Not good, you know, cuz it’s Google.
Likewise, if a good agent simply changes its name so that it no longer matches any regular expression in the whitelist: access denied for a legit bot. This means that you would have to routinely check all of the entries in your whitelist and compare their names with current data. And believe me, that requires a lot more work than you may think.
Unknowns
Say we’ve set up a super effective whitelist that allows open access to all of the currently known good bots. All may be well, but what if a new or otherwise unknown bot tries to access our site? With a blacklist, we give the bot the benefit of the doubt; but with a whitelist, we are saying up front, “sorry bud, you may be legit, but you’re not on the list so we’re blocking you.” This sort of false positive, er, negative, may result in missed traffic and other opportunities.
Further, these days there are hundreds of apps and browsers by which users may try to access your site. It’s no longer the case that there are only 10 major browsers and four or five major search engines. There are new apps and bots popping up all over the place, so maintaining a whitelist on today’s Web would be much more difficult than it would have been even five years ago.
Benefits
With those downsides to reckon with, are there any benefits to running a whitelist? Well, the first argument for whitelisting (as opposed to blacklisting) is that the list is gonna be much, much shorter. There are thousands upon thousands of bad bots crawling around the Web, but only, say, several hundred really good bots. So apart from having to actively maintain the list and keep it current, the list itself would be much shorter than any sort of comprehensive blacklist.
And that gets into site performance as well. It takes much less processing power and server resources to compare the user-agent string against, say, 300 whitelist terms than it does to compare against thousands of blacklist terms.
The other big benefit would be the effectiveness of the list. With a blacklist, your site is like an open door to any bad bot who is not on the list (or who fakes their user agent). Whereas with a whitelist, your site is locked down and protected against literally every bot that’s not on the list (except those who fake their user agent). Better security is the primary appeal and benefit of running an effective whitelist.
Whitelist
Now that we’ve discussed some of the major pros and cons inherent in protecting your site with a whitelist for good bots, let’s take a look at what such a whitelist might look like using Apache and .htaccess. Here is an experimental set of rules that I’ve played with on and off for many years, never having actually implemented it on a live production site. Drum roll, please…
# whitelist: good bots
# make sure the rewrite engine is enabled
RewriteEngine On
# allow major browsers
RewriteCond %{HTTP_USER_AGENT} !AOL [NC]
RewriteCond %{HTTP_USER_AGENT} !Mozilla [NC]
RewriteCond %{HTTP_USER_AGENT} !Opera [NC]
RewriteCond %{HTTP_USER_AGENT} !MSIE [NC]
RewriteCond %{HTTP_USER_AGENT} !Firefox [NC]
RewriteCond %{HTTP_USER_AGENT} !Netscape [NC]
RewriteCond %{HTTP_USER_AGENT} !Safari [NC]
# allow major search engines
RewriteCond %{HTTP_USER_AGENT} !Google [NC]
RewriteCond %{HTTP_USER_AGENT} !Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !Yahoo [NC]
RewriteCond %{HTTP_USER_AGENT} !MMCrawler [NC]
RewriteCond %{HTTP_USER_AGENT} !msnbot [NC]
RewriteCond %{HTTP_USER_AGENT} !SandCrawl [NC]
RewriteCond %{HTTP_USER_AGENT} !MSRBOT [NC]
RewriteCond %{HTTP_USER_AGENT} !Teoma [NC]
RewriteCond %{HTTP_USER_AGENT} !Jeeves [NC]
RewriteCond %{HTTP_USER_AGENT} !inktomi [NC]
# allow misc user agents
RewriteCond %{HTTP_USER_AGENT} !libwww [NC]
# deny everything that matched none of the above
RewriteRule .* - [F]
As you can see, the whitelist is very straightforward. There are three sections:
- Browsers
- Search Engines
- Miscellaneous
Each of the rewrite conditions targets a specific user agent, which helps with maintainability going forward. The rules could be condensed into a single directive, but that is left as an exercise for the reader to enjoy.
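If you’d rather skip the homework, here is a sketch of one way the conditions could be folded into a single directive. It’s untested on a production site (same caveat as the main list) and simply reuses the same agent names:
# whitelist: good bots (condensed sketch)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(AOL|Mozilla|Opera|MSIE|Firefox|Netscape|Safari|Google|Slurp|Yahoo|MMCrawler|msnbot|SandCrawl|MSRBOT|Teoma|Jeeves|inktomi|libwww) [NC]
RewriteRule .* - [F]
Functionally it should behave the same as the longer version; all of the negated conditions are combined into one alternation, trading per-agent readability for brevity.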
Notice there are only a handful of user agents specified in the list. This is because the list is meant as a starting point. As discussed previously, there are many legitimate user agents crawling around these days, so more work would be required before this list could be used on a live site.
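To give an idea of the extra work involved, here are a few more conditions covering other well-known crawlers that you might consider adding. Treat these names as illustrative examples rather than a vetted set, and verify the current user-agent strings before relying on them:
# allow additional well-known crawlers (verify current agent names first)
RewriteCond %{HTTP_USER_AGENT} !bingbot [NC]
RewriteCond %{HTTP_USER_AGENT} !DuckDuckBot [NC]
RewriteCond %{HTTP_USER_AGENT} !Baiduspider [NC]
RewriteCond %{HTTP_USER_AGENT} !YandexBot [NC]
RewriteCond %{HTTP_USER_AGENT} !Applebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Twitterbot [NC]
RewriteCond %{HTTP_USER_AGENT} !facebookexternalhit [NC]
These would slot in alongside the existing conditions, anywhere above the final RewriteRule.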
The other notable aspect of this experimental whitelist is the RewriteRule, which simply denies access via a 403 “Forbidden” response. That of course could be modified for more intelligent handling, traffic reports, and so forth.
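As one example of friendlier handling, you could keep the 403 but serve a custom page along with it by pointing Apache’s ErrorDocument directive at a file of your choosing. The path /blocked.html below is just a hypothetical placeholder:
# serve a custom page with the 403 response (path is a placeholder)
ErrorDocument 403 /blocked.html
ErrorDocument works in .htaccess as long as your host allows the FileInfo override, so it can sit right next to the rewrite rules.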
Experimental
I want to emphasize that the above list is experimental. I offer it as a starting point for those of you who want to develop and implement your own great user-agent whitelist. So it’s a starting point, a template, a bit of inspiration that hopefully will serve you well as you venture forth. Some things to keep in mind:
- Do your research to get the most current set of good bots possible
- Test the list thoroughly using a free online user-agent spoofer/tester (one safe way to do this is sketched after this list)
- Implement on a small, side-project site before moving to anything serious
- Keep a close eye on your site’s access and error logs until you’re confident
- Remember to stay current and update the list as needed with any changes, etc.
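On the testing front, one low-risk way to trial the rules on a real site is to apply them only to requests coming from your own IP address, so nobody else is affected while you experiment with a user-agent spoofer. This is a sketch only; 203.0.113.1 is a placeholder address, so swap in your own:
# testing only: apply the whitelist to your own IP (203.0.113.1 is a placeholder)
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.1$
RewriteCond %{HTTP_USER_AGENT} !(AOL|Mozilla|Opera|MSIE|Firefox|Netscape|Safari|Google|Slurp|Yahoo|MMCrawler|msnbot|SandCrawl|MSRBOT|Teoma|Jeeves|inktomi|libwww) [NC]
RewriteRule .* - [F]
Because RewriteCond directives are ANDed by default, the rule only ever fires for your address; once you’re confident everything works, remove the REMOTE_ADDR line to enforce the whitelist for everyone.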
Above all else, you accept full responsibility for the code once you take it from this site. I cannot be held responsible for anything that you end up doing with the experimental whitelist. It’s 100% free and open source, so do with it as you will; it’s entirely up to you.
Feedback
I hope this post helps anyone looking to improve security by implementing their own whitelist solution. If you have any feedback or questions, feel free to drop me a line anytime. Thanks for reading :)