My site is crawling with bots
“Robots” or “spiders” are programs that index online content for search engines. Several have tried to download my free color files many times. Bot or not, every visitor that hits my site leaves its tech-ish signature (the user-agent string), which makes my site log read like a guest book. Last week I found my log full of bot visits:
| Bot identification | Number of visits |
|---|---|
| Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp) | 341 |
| Yandex/1.01.001 (compatible; Win16; I) | 190 |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 141 |
| Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/) | 76 |
| msnbot/2.0b (+http://search.msn.com/msnbot.htm)._ | 54 |
| Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org) | 32 |
| ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com) | 27 |
| panscient.com | 27 |
| Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98) | 27 |
| Linguee Bot (http://www.linguee.com/bot; bot@linguee.com) | 26 |
| Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html) | 26 |
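A tally like the one above can be pulled out of a raw access log with a few lines of code. What follows is only a minimal sketch, assuming the server writes the common "combined" log format, where the user-agent string is the last double-quoted field on each line. The function name `count_user_agents` and the sample log lines are made up for illustration; they are not taken from my real log.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the
# last double-quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its number of hits."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:  # skip lines that do not end in a quoted field
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2010:00:00:01 +0000] "GET /colors.zip HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [01/Jan/2010:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" '
    '"panscient.com"',
    '192.0.2.1 - - [01/Jan/2010:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    # Print the tally as table rows, most frequent visitor first.
    for agent, visits in count_user_agents(SAMPLE_LOG).most_common():
        print(f"| {agent} | {visits} |")
```

To run it against a real log, pass an open file instead of `SAMPLE_LOG`. A server that logs in a different format would need a different pattern.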
What are all these bots? I did some research.
| Bot description | Notes | Rating |
|---|---|---|
| Alexa | Makes sense, and they provide tools to customize how Alexa crawls a site. The page I saw included detailed instructions for preventing (or encouraging) bots with robots.txt. | Usefulness: 4, Annoyance: 1 |
| Googlebot | Apparently they have more questions about bots pretending to be Google than questions about Googlebot itself. | Usefulness: 2, Annoyance: 2 |
| HTTrack | Sounds like someone’s stealing my content. Their license states, “We hereby ask people using this source NOT to use it in purpose of grabbing emails addresses, or collecting any other private informations on persons. This would disgrace our work, and spoil the many hours we spent on it.” But how do I know who’s using it for what purpose? | Usefulness: 2, Annoyance: 5 |
| Linguee | They win points for honesty, but I don’t have any multilingual text on my site. | Usefulness: 3, Annoyance: 1 |
| msnbot | This explanation was part of a larger collection of Bing-centric SEO guidelines. They lose points for back-button-beating frames, but their advice got right to the point. For example: “The best way to attract people to your website, and keep them coming back, is to fill your webpages with valuable content in which your target audience is interested.” Common sense spelled out in plain English. Oddly, the URL at MSN automatically forwarded me to Bing, which forwarded me to help.live.com. | Usefulness: 4, Annoyance: 2 |
| Panscient | A straightforward answer, but it dissolved into a sales pitch. | Usefulness: 3, Annoyance: 3 |
| Purebot | Tells me nothing. In incomplete sentences. | Usefulness: 0, Annoyance: 4 |
| Yahoo! | Their help section includes 13 pages of detailed information including exactly what I wanted: “How to Mark Web Page Content Which is Extraneous to Your Main Page Content” | Usefulness: 4, Annoyance: 1 |
| Yandex | Interesting but unhelpful. Most of yandex.com felt like a simplified Yahoo! portal. They had plenty of information about their company, but not much about their bot. | Usefulness: 4, Annoyance: 2 |
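Several of these help pages point to robots.txt as the way to tell bots what they may fetch. Here is a rough sketch of what such a file might look like, checked with Python's standard `urllib.robotparser` module. The rules themselves are my own assumptions for illustration: the `/downloads/` path, the file names, and the `HTTrack` user-agent token are hypothetical, and a bot is free to ignore robots.txt altogether.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out HTTrack entirely and keep every
# other bot away from an assumed /downloads/ directory.
ROBOTS_TXT = """\
User-agent: HTTrack
Disallow: /

User-agent: *
Disallow: /downloads/
"""


def build_parser(robots_text):
    """Parse robots.txt text and return a ready-to-query RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    # can_fetch() is given the bot's short name, not its full log signature.
    for bot, path in [
        ("HTTrack", "/index.html"),
        ("Googlebot", "/index.html"),
        ("Googlebot", "/downloads/colors.zip"),
    ]:
        verdict = "allowed" if rules.can_fetch(bot, path) else "blocked"
        print(f"{bot} -> {path}: {verdict}")
```

This only checks what a well-behaved bot would conclude from the file. It does nothing to stop a bot that never reads it.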
Thoughts
Those that provided information about their bots up front struck me as legitimate search engines. The others were either knock-offs or suspicious. I can’t account for Google’s roundabout explanation. Maybe their “don’t be evil” motto doesn’t necessarily mean “be clear and helpful.”