“Robots” or “spiders” are programs that index online content for search engines. Several have tried to download my free color files many times. Bot or not, every browser that hits my site leaves its tech-ish signature (its user-agent string), which makes my site log read like a guest book. Last week I found my log full of bot visits:
| Bot identification | Number of visits |
|---|---|
| Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp) | 341 |
| Yandex/1.01.001 (compatible; Win16; I) | 190 |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 141 |
| Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/) | 76 |
| msnbot/2.0b (+http://search.msn.com/msnbot.htm)._ | 54 |
| Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org) | 32 |
| ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com) | 27 |
| panscient.com | 27 |
| Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98) | 27 |
| Linguee Bot (http://www.linguee.com/bot; bot@linguee.com) | 26 |
| Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html) | 26 |
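
A tally like the one above can be produced with a short script. This is only a minimal sketch, assuming the log uses the common Apache/Nginx “combined” format, where the user agent is the last double-quoted field on each line. The sample lines, IP addresses, dates, file name, and the function name `count_user_agents` are made up for illustration and are not taken from my real log.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the
# last double-quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines, invented for this example.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/May/2010:06:25:01 +0000] "GET /colors.zip HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [10/May/2010:06:25:09 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/May/2010:07:02:44 +0000] "GET /colors.zip HTTP/1.1" 200 5120 "-" "panscient.com"',
]

if __name__ == "__main__":
    # Print the tally as table rows, most frequent visitor first.
    for agent, visits in count_user_agents(SAMPLE_LOG).most_common():
        print(f"| {agent} | {visits} |")
```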

| Bot description | Notes | Rating |
|---|---|---|
| Alexa: “Basically, [the Alexa crawler] starts with a list of known URLs from across the entire Internet, then it fetches local links found as it goes. There are several advantages to this approach, most importantly that it creates the least possible disruption to the sites being crawled. We will not index anything you would like to remain private. All you have to do is tell us.” | Makes sense, and they provide tools to customize how Alexa crawls a site. The page I saw included detailed instructions for preventing (or encouraging) bots with robots.txt. | Usefulness: 4, Annoyance: 1 |
| Googlebot: “You can verify that a bot accessing your server really is Googlebot by using a reverse DNS lookup…” | Apparently they have more questions about bots pretending to be Google than questions about Googlebot itself. | Usefulness: 2, Annoyance: 2 |
| HTTrack: “HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer…. Simply open a page of the ‘mirrored’ website in your browser, and you can browse the site from link to link, as if you were viewing it online.” | Sounds like someone’s stealing my content. Their license states, “We hereby ask people using this source NOT to use it in purpose of grabbing emails addresses, or collecting any other private informations on persons. This would disgrace our work, and spoil the many hours we spent on it.” But how do I know who’s using it for what purpose? | Usefulness: 2, Annoyance: 5 |
| Linguee: “Since Linguee is a search engine, our web crawler is a fundamental piece of our technology. Most of the multilingual text content you find on Linguee is gathered by an automated indexing process involving the web crawler. The Linguee bot will scan the content of any website it encounters to search for multilingual text. It does not harvest e-mail addresses, and it won’t index content that isn’t multilingual. We want our crawler to be as polite as possible. In case it causes you any inconvenience, please let us know [bot@linguee.com] and make sure you provide all necessary information.” | They win points for honesty, but I don’t have any multilingual text on my site. | Usefulness: 3, Annoyance: 1 |
| msnbot: “This website gives you access to all the information webmasters need about using Bing, including how MSNBot (The Bing web crawler, a program that scans websites and indexes their content, such as text, documents, images, and links, for searching.) works, guidelines for getting your website indexed successfully by Bing, and usage information on Bing Webmaster Center tools.” | This explanation was part of a larger collection of Bing-centric SEO guidelines. They lose points for back-button-beating frames, but their advice got right to the point. For example: “The best way to attract people to your website, and keep them coming back, is to fill your webpages with valuable content in which your target audience is interested.” Common sense spelled out in plain English. Oddly, the URL at MSN automatically forwarded me to Bing, which forwarded me to help.live.com. | Usefulness: 4, Annoyance: 2 |
| Panscient: “Panscient crawls the web and collects information on people and companies for vertical search applications. Our databases can be used to augment search engines for corporate information, sale leads, business intelligence and genealogy. We help organizations provide complete and comprehensive web information to their customers.” | A straightforward answer, but it dissolved into a sales pitch. | Usefulness: 3, Annoyance: 3 |
| Purebot: “Pure search. Pure results.” | Tells me nothing. In incomplete sentences. | Usefulness: 0, Annoyance: 4 |
| Yahoo!: “Yahoo! Slurp is the Yahoo! web indexing robot. The Yahoo! Slurp web crawler collects documents from the Web to build a searchable index for search services using the Yahoo! Search engine. These documents are discovered and crawled because other webpages contain links to these documents.” | Their help section includes 13 pages of detailed information including exactly what I wanted: “How to Mark Web Page Content Which is Extraneous to Your Main Page Content” | Usefulness: 4, Annoyance: 1 |
| Yandex: “Yandex is Russia’s largest internet company, whose websites attract a workday audience of more than 19 million users (unique visitors as of March 2010) from Russia, Ukraine and other countries. Our major goal is to give answers to users’ questions… Yandex search is the leader in Russia. About 80% of the Runet audience use the Yandex search engine, according to TNS Gallup and comScore.” | Interesting but unhelpful. Most of yandex.com felt like a simplified Yahoo! portal. They had plenty of information about their company, but not much about their bot. | Usefulness: 4, Annoyance: 2 |
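
Several of these bots say they follow robots.txt, so a quick way to see what a given set of rules would allow is Python's standard `urllib.robotparser`. This is a rough sketch, not a recommendation: the rules, the `/downloads/` path, and the helper name `build_parser` are hypothetical, and robots.txt is only a request that a bot is free to ignore.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one site-copying tool entirely and
# keep every other bot away from a made-up /downloads/ directory.
ROBOTS_TXT = """\
User-agent: HTTrack
Disallow: /

User-agent: *
Disallow: /downloads/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Pass the short bot token, not the full "Mozilla/5.0 (...)" string:
    # the parser only looks at the text before the first "/".
    print(parser.can_fetch("HTTrack", "/index.html"))            # False
    print(parser.can_fetch("Googlebot", "/index.html"))          # True
    print(parser.can_fetch("Googlebot", "/downloads/colors.zip"))  # False
```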
Those that provided information about their bots up-front struck me as being legitimate search engines. The others were either knock-offs or suspicious. I can’t account for Google’s roundabout explanation. Maybe their “don’t be evil” motto doesn’t necessarily mean “be clear and helpful.”
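
For what it's worth, the reverse DNS check that Google's page mentions can be sketched in a few lines. This is my reading of the idea, not Google's code: look up the host name for the visiting IP address, check that it ends in a Google domain, then look that name up again and confirm it points back to the same address. The domain suffixes, the function name `is_verified_googlebot`, and the stub lookups in the demo are assumptions for illustration. The demo uses fake lookups so it runs without touching the network.

```python
import socket

# Assumption: genuine Googlebot host names end in one of these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-resolve to confirm."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # gethostbyname_ex handles IPv4 only; its third item is the IP list.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses


if __name__ == "__main__":
    # Hypothetical stand-ins for real DNS answers.
    def fake_reverse(ip):
        return ("crawler.googlebot.com", [], [ip])

    def fake_forward(hostname):
        return (hostname, [], ["192.0.2.10"])

    print(is_verified_googlebot("192.0.2.10", fake_reverse, fake_forward))
    print(is_verified_googlebot("192.0.2.99", fake_reverse, fake_forward))
```

Called with only an IP address, the function falls back to the real `socket` lookups, so a visitor that merely puts “Googlebot” in its user-agent string should fail the check.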