johnpfeiffer
  • Home
  • Engineering (People) Managers
  • John Likes
  • Software Engineer Favorites
  • Categories
  • Tags
  • Archives

Attack of the Spiders, Bots, and Crawlers

It is well known that search engines need to index what the contents of webpages are in order to return accurate results to Users. It may not be well known how much traffic that generates.

In this sampling from my logs (not representative of anything), 57% of the traffic in the log is from bots (I'm not turning them away, skip to the end on how to do that).

Drupal (CMS) is remarkably good at making content for bots to read so it's not surprising they're all slurping, and of course if you want to be popular you need all of those indexes to know about you...

How many hits from each kind of bot

cat john-pfeiffer.access | wc -l

1108 hits

cat john-pfeiffer.access | grep -v "Baiduspider" | grep -v "bingbot" | grep -v "YandexBot" | grep -v "Sogou" | grep -v "Mail.RU\_Bot" | grep -v "Googlebot" | grep -v "SISTRIX Crawler" | grep -v "MJ12bot" | wc -l  
637

cat john-pfeiffer.access | grep "Googlebot" | wc -l

105 hits from https://en.wikipedia.org/wiki/Googlebot

cat john-pfeiffer.access | grep "bingbot" | wc -l

96 hits from http://en.wikipedia.org/wiki/Bingbot

cat john-pfeiffer.access | grep "Baiduspider" | wc -l

93 hits from http://baidu.com/search/spider_english.html

cat john-pfeiffer.access | grep "SISTRIX Crawler" | wc -l

76 hits from http://crawler.sistrix.net

cat john-pfeiffer.access | grep "YandexBot" | wc -l

64 hits from http://www.botopedia.org/user-agent-list/search-bots/yandex-bot

cat john-pfeiffer.access | grep "Mail.RU\_Bot" | wc -l

23 hits from http://www.webmasterworld.com/search_engine_spiders/4520951.htm , http://www.botopedia.org/user-agent-list/search-bots/mailru-bot

cat john-pfeiffer.access | grep "MJ12bot" | wc -l

12 hits from http://www.majestic12.co.uk/projects/dsearch/mj12bot.php

cat john-pfeiffer.access | grep "Sogou" | wc -l

7 hits from http://www.botopedia.org/user-agent-list/search-bots/sogou-spider

More Info on Bots and Crawlers

http://www.incapsula.com/the-incapsula-blog/item/393-know-your-top-10-bots

http://searchenginewatch.com/article/2067357/Bye-bye-Crawler-Blocking-the-Parasites

(Besides robots.txt you are pretty much left with ban by User-Agent or IP Address Range.)


  • « Git Basics and Hard to Scratch Itches
  • Selenium headless browser automated testing with PhantomJS and Python »

Published

Jul 20, 2013

Category

linux

~256 words