Forum Topic: Web crawlers

Forum: .htaccess Forum : Security • Posted by Mitch

I am wondering how people handle all the web crawlers out there. A year and a half ago we blocked most of the web crawlers to reduce traffic on our shared host. I both listed the user agents in robots.txt and denied them by IP in the .htaccess file. After combing the visitor log for a few months and building this list, I reduced server load and haven’t really thought about it much since. I am curious to know other people's views on these crawlers.

The following user agents are blocked:

  • AhrefsBot
  • Baiduspider
  • Ezooms
  • MJ12bot
  • Sosospider
  • Yandex
  • 360spider
  • sogou web spider
  • SemrushBot
  • JikeSpider
  • adbeat_bot
  • careerbot
  • sistrix
  • AcoonBot
  • Abonti
  • UnwindFetchor
  • SiteExplorer
  • SeznamBot
  • EasouSpider
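
For anyone who wants to maintain a list like this in one place, here is a rough sketch of a script that turns it into both a robots.txt section and an .htaccess rule. The function names are made up for illustration, only a few agents from the list are included, and the .htaccess part matches on the User-Agent header with mod_rewrite, which is an assumption and not the IP-based deny described in the post. Treat it as a starting point to try on a test copy of the site first.

```python
import re

# A few of the user agents from the list above; extend as needed.
BLOCKED_AGENTS = ["AhrefsBot", "Baiduspider", "MJ12bot", "sogou web spider"]


def build_robots_txt(agents):
    """Return robots.txt text asking each listed crawler to stay out entirely."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(blocks) + "\n"


def build_htaccess_rules(agents):
    """Return mod_rewrite rules that send 403 Forbidden to matching user agents.

    Assumes mod_rewrite is available on the host (hypothetical setup).
    """
    # re.escape backslash-escapes spaces and regex metacharacters,
    # so each name is matched literally inside the alternation.
    pattern = "|".join(re.escape(agent) for agent in agents)
    return (
        "<IfModule mod_rewrite.c>\n"
        "RewriteEngine On\n"
        f"RewriteCond %{{HTTP_USER_AGENT}} ({pattern}) [NC]\n"
        "RewriteRule ^ - [F]\n"
        "</IfModule>\n"
    )


if __name__ == "__main__":
    print(build_robots_txt(BLOCKED_AGENTS))
    print(build_htaccess_rules(BLOCKED_AGENTS))
```

Keep in mind that robots.txt is only a request, so well-behaved crawlers honor it and others ignore it. The .htaccess rule is what actually refuses the request.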

2 Replies to “Web crawlers”

Posted by Jeff Starr

In general there are two approaches to blocking/controlling access with .htaccess: blacklist or whitelist. I have many extensive blacklists at Perishable Press:

https://perishablepress.com/search/user+agent+blacklist/

...and a whitelist is available here:

https://perishablepress.com/invite-only-visitor-exclusivity-via-the-opt-in-method/

Note that the lists may need updating, especially the whitelist.
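
To make the difference between the two approaches concrete, here is a small sketch of the decision logic in Python. The patterns and function names are hypothetical examples, not the contents of the lists linked above, and real lists would be much longer.

```python
import re

# Hypothetical example patterns, not taken from the linked lists.
BLACKLIST = re.compile(r"AhrefsBot|MJ12bot|SemrushBot", re.IGNORECASE)
WHITELIST = re.compile(r"Googlebot|bingbot|Firefox|Chrome", re.IGNORECASE)


def allowed_by_blacklist(user_agent):
    """Blacklist: everyone gets in unless the user agent matches a known-bad pattern."""
    return BLACKLIST.search(user_agent) is None


def allowed_by_whitelist(user_agent):
    """Whitelist: nobody gets in unless the user agent matches an approved pattern."""
    return WHITELIST.search(user_agent) is not None


if __name__ == "__main__":
    for ua in ("Mozilla/5.0 (compatible; AhrefsBot/7.0)", "NewCrawler/1.0"):
        print(ua, allowed_by_blacklist(ua), allowed_by_whitelist(ua))
```

A crawler that appears tomorrow under a new name passes the blacklist but fails the whitelist. That is why a blacklist needs updating to keep bad bots out, while a whitelist needs updating to avoid locking out legitimate new browsers and indexers.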

Posted by Mitch

Jeff,
Thank you for your reply. Your lists, web sites and books have been invaluable in helping me as I learn about managing our WordPress site.

I think I know better now what I really wanted to ask. Is there a place where people discuss what bots/crawlers/user agents/spiders they consider good/okay and what bots they consider bad? For those in the U.S. we could probably get agreement that there are some indexers generally thought of as okay (e.g. Google, Microsoft, Yahoo). Is there a place where people discuss Yandex, Baidu, Soso, etc.? Where can I fin

Your whitelists and blacklists are a great start and reveal your opinion!

Mitch