Sunday, July 28, 2013

Fun with logs: crawling robots


After studying the "human" visits in my logs, I checked the visits made by bots that can be identified by their User-Agent string. And we can say they are legion.
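
A minimal sketch of how such a tally could be made, in Python. It assumes the log uses the Apache/Nginx "combined" format, where the user agent is the last double-quoted field of each line; the keyword list `BOT_HINTS` and the function names are hypothetical and only for illustration. The keyword check is deliberately crude: a string such as "panscient.com" would slip through it.

```python
import re
from collections import Counter

# Assumption: each log line ends with the user agent in double quotes,
# as in the Apache/Nginx "combined" log format.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')

# Hypothetical hint list; extend it with whatever shows up in your own logs.
BOT_HINTS = ("bot", "spider", "crawler", "slurp", "ia_archiver")


def user_agent(line):
    """Return the last double-quoted field of a log line, or None."""
    match = LAST_QUOTED.search(line)
    return match.group(1) if match else None


def looks_like_bot(agent):
    """Crude check: does the user agent contain a typical bot keyword?"""
    lowered = agent.lower()
    return any(hint in lowered for hint in BOT_HINTS)


def count_bots(lines):
    """Count hits per user agent, keeping only agents that look like bots."""
    counts = Counter()
    for line in lines:
        agent = user_agent(line)
        if agent and looks_like_bot(agent):
            counts[agent] += 1
    return counts
```

Feeding it an open log file, for example `count_bots(open("access.log"))` with a file name of your choosing, gives a `Counter` whose `most_common()` method lists the busiest bots first.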

The major search engines are of course present.

Google
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Bing (Microsoft)
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

and another Microsoft bot, for MSN
"msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"

Yahoo! (I thought they used the Bing search engine?)
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
"Yahoo! Slurp China"

But also some lesser-known engines:

Exalead (French)
"Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)"

Voilà (French)
"Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 (support.voilabot@orange-ftgroup.com)"

YaCy (a free, decentralized search engine that I discovered thanks to my logs)
"yacybot (webportal-global; amd64 Linux 3.6.10-nrj-desktop-1rosa; java 1.7.0_b147-icedtea; Europe/fr) http://yacy.net/bot.html"


Baidu (Chinese)
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

Jike (Chinese)

"Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)"

Yandex (Russian)
"Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
"Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)"

Blekko

"Mozilla/5.0 (compatible; Blekkobot; ScoutJet; +http://blekko.com/about/blekkobot)"

gimme60

"gimme60 (Gimme60 Store ID Bot; gimme60.com)"


In addition to these search engines, there are also data-extraction bots whose activities are less visible on the Internet.

alexa.com (Ranking site)
"ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)"

Twitter bot
"Twitterbot/1.0"

A site using Twitter

"Twitmunin Crawler http://www.twitmunin.com"

And many companies that collect and cross-reference data in order to sell it to their customers:
80legs
http://www.80legs.com/webcrawler.html;) Gecko/2008032620"


panscient
"panscient.com"

Netcraft
"Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)"

ahrefs
"Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)"


gnip

"UnwindFetchor/1.0 (+http://www.gnip.com/)"

Topsy

"Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8"


And a few small bots whose role I was not able to determine:
"Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"
"Web front page analyser. robots.txt complaint (norw.acd.inst@gmail.com)"
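
For bots like these, the only lead is often the contact URL or e-mail address embedded in the string itself. Here is a rough sketch, in Python, of how those hints could be pulled out of a user agent. The patterns are simplified assumptions based on the strings listed in this post, not a complete URL or e-mail grammar, and `contact_hints` is a hypothetical helper name.

```python
import re

# Assumption: contact details appear as an http(s) URL or an e-mail address
# somewhere in the user agent, as in the strings listed in this post.
URL = re.compile(r'https?://[^\s;)"]+')
EMAIL = re.compile(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+')


def contact_hints(agent):
    """Return the URLs, then the e-mail addresses, found in a user agent."""
    return URL.findall(agent) + EMAIL.findall(agent)
```

Calling `contact_hints` on the Ezooms string above returns its Gmail address, which at least gives someone to write to.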

Next time I'll talk about traces left by geeks on the teapot log!
