Web Spider Traps
When an author does not want his site to be copied or indexed by search engines, he can use:
- A meta tag such as <meta name="robots" content="noindex,nofollow"> (well-behaved bots).
- A robots.txt file which indicates the parts of the site not to be explored (well-behaved bots).
- .htaccess to ban known or detected robots (any webbot).
- A Java applet, some HTML, a script written in PHP, JavaScript or any other language (any webbot).
- How to Defeat Bad Web Robots With Apache
- Improving Web Spider Trap Efficiency
- How to build a Bot Trap and keep bad bots away from a web site
- Stopping Spambots: A Spambot Trap
- How to keep bad robots, spiders and web crawlers away Apache only
- Blocking Bad User Agents with ASP
- E-Mail Protector Script (perl script sending 10,000 fake addresses to identified robots)
- Robots.txt Tutorial
- HTML Author's Guide to the Robots META tag
- Robotcop
- DIE (illustration)
and www.webmasterworld.com (search for "spider traps" / "Blocking Badly Behaved Bots" or have a look at www.webmasterworld.com/forum24/ or www.webmasterworld.com/forum88/ or http://www.webmasterworld.com/forum92/ ).
In French, you can find a few scripts here to stop robots:
- Arrêter les aspirateurs with PHP
- Arrêter les aspirateurs with robots.txt or htaccess
Trap?
All these traps are likely to prevent search engines from indexing the pages, make browsing more difficult and discourage the users.
Fighting "spam harvesters", "email grabbers", "email collectors" and "spambots" is easy to understand and fairly easy to do. But since not all spiders are used for bad purposes, why should they all be blocked, even if they consume bandwidth and sometimes block or overload some sites?
Captures can be made for good reasons and by good people: this site tries to help those who mirror sites for their students or those who cannot afford to stay online...
Mirror?
Often, after some time, protections are removed: visitors whose browsers do not have the plugins (Macromedia, Java -JRE 6.0-) or do not interpret JavaScript are lost readers or lost customers.
If you think that the site is interesting enough to be mirrored, ask the author for a copy that you could browse offline.
Indeed, if you activate the option "no robots.txt rules", your IP may be blocked from any access to the site, or you may copy hundreds of pages of no interest (error pages, images, documentation, etc.).
In all cases, locate the useful folders, use reasonable bandwidth limits and connections per second (Options - Limits - Max transfer rate and Options - Limits - Max connections / second), and limit the number of connections.
Examples 12 and 17 of website mirrors may help you.
Identify a robot
You can read about the different robots identifying themselves here:
- The Web Robots Database (site, details, UA and IP)
- List of Robot Agent Strings (site and UA)
- List of User-Agents/Browsers, Web Spiders, Robots (site and UA)
- User Agents by Category (site, comments and UA)
- List of Known Agents (UA, platform, Browser Capabilities...)
- List of User-Agents (Spiders, Robots, Crawler, Browser) A-L (site and UA)
- List of User-Agents (Spiders, Robots, Crawler, Browser) M-0 (site and UA)
- List of User-Agents (Spiders, Robots, Crawler, Browser) P-Z (site and UA)
- Search Engine Spiders List (site and UA)
- Search Engine IP Addresses (UA and IP)
- Robots Index (site, comments, UA and IP)
- Bot / Spider / Crawler Information (site, comments and UA)
- Search engine robots that visit your web site (site, UA and IP)
- Search Engine Spider Identification (UA and comments at webmasterworld)
- Identification des robots in French (site, UA and IP)
- E-Mail Collectors List (site and UA)
- For this site : Listed identities (771 User Agents Strings)
- None of these sites gives a complete list.
- Most robots and spiders give an MSIE User Agent such as Mozilla/4.0 (compatible; MSIE 6.0; Windows ...), do not read robots.txt and are not well-behaved...
- Robots that regularly request robots.txt (UA).
Robots and this site
List of the robots visiting this site (they index the site, test links to the site, run surveys or check for clients' names, plagiarism, spam...):
"1Noonbot search engine" - "50.nu" - "80legs crawler" - "ABACHOBot search engine" - "abcfr_robot search engine" - "Accoona-AI-Agent search engine" - "ActiveBookmark" - "Advanced URL Catalog bookmark manager" - "Advista search engine" - "aipbot search engine" - "alef" - "Aleksika search engine" - "amagit.com search engine" - "Amfibibot search engine" - "Anonymous / Skywalker" - "AnswerBus search engine" - "antibot crawler" - "appie 1.1 (www.walhello.com) search engine" - "Apple-PubSub RSS monitoring" - "archive.org_bot crawler" - "Argus bookmark managing crawler" - "Art-Online.com 0.9(Beta) crawler" - "Ask Jeeves crawler" - "Asterias crawler" - "atraxbot" - "Baiduspider search engine" - "Bazbot search engine" - "BecomeBot search engine" - "Big Fish log spam" - "Biglotron search engine" - "BlackMask.Net search engine" - "BlogCorpusCrawler" - "Bloglines RSS monitoring" - "Bluebot crawler" - "bogospider" - "boitho.com-robot search engine" - "Bookdog bookmark manager" - "bot/1.0" - "botmobi search engine" - "BruinBot crawler" - "BuzzRankingBot crawler" - "CacheBot" - "Caliperbot" - "CamontSpider crawler" - "capek crawler" - "CatchBot crawler" - "CazoodleBot crawler" - "ccubee search engine" - "CentiverseBot search engine" - "cfetch" - "Charlotte search engine" - "Cherchonsbot search engine" - "Combine crawler" - "comBot search engine" - "cometsystems crawler" - "Convera RetrievalWare" - "CorenSearchBot" - "Cosmix crawler" - "CosmixCrawler search engine" - "Crawl Annu" - "Crawllybot search engine" - "csci_b659 Data Mining" - "CSS/HTML/XTHML Validator" - "CSSCheck" - "cybercity.dk IE 5.5 Compatible Browser" - "CydralSpider search engine" - "darxi spam / email grabbing" - "DataFountains/DMOZ Downloader" - "DAUM Web Robot search engine" - "dcbspider search engine" - "DealGates" - "deepak-USC/ISI spider" - "del.icio.us-thumbnails" - "del.icio.us bookmark manager link checker" - "DepSpid crawler" - "Diamond search engine" - "Directcrawler" - "discobot crawler" - "DMOZ 
Experiment" - "DNSGroup crawler" - "DotBot crawler" - "DTAAgent search engine" - "Dumbot search engine" - "e-SocietyRobot crawler" - "eApolloBot search engine" - "EasyDL/3.04" - "ejupiter.com search engine" - "EnaBot crawler" - "envolk search engine" - "ETS translation bot" - "Exabot crawler" - "Exabot-Thumbnails" - "exactseek-crawler-2.63" - "Exalead NG" - "exooba crawler" - "Factbot search engine" - "FAST crawler" - "FAST Enterprise Crawler" - "FAST FirstPage retriever" - "fast-search-engine" - "FAST-WebCrawler" - "FAST MetaWeb Crawler" - "FavOrg Link checker" - "favorstarbot Advertising" - "FeedBurner" - "FeedFetcher-Google" - "Fetch API Request" - "Filangy bookmark managing crawler" - "Findexa crawler" - "findfiles.net search engine" - "findlinks" - "flatlandbot" - "fleck" - "Fluffy (searchhippo) search engine" - "flyindex search engine" - "FollowSite" - "Friend search engine" - "FurlBot search engine" - "Gaisbot/3.0 search engine" - "Galbot crawler" - "genevabot search engine" - "geniebot search engine" - "GeoBot" - "Gigabot crawler" - "Gigamega.bot search engine" - "GingerCrawler" - "Girafabot" - "Gnomit crawler" - "GOFORITBOT search engine" - "Google Desktop RSS/Page monitoring" - "Google-Sitemaps" - "Googlebot crawler" - "Googlebot-Image" - "Googlebot-Mobile" - "grub search engine" - "grub crawler" - "grub.org" - "GT::WWW/1." 
- "gURLChecker Link checker" - "GurujiBot search engine" - "GUSbot" - "Haste" - "hclsreport crawler" - "Helix crawler" - "HenriLeRobotMirago crawler" - "Heritrix crawler" - "Holmes search engine" - "HooWWWer crawler" - "htdig" - "ia_archiver crawler" - "ICC-Crawler crawler" - "ichiro search engine" - "icsbot-0.1" - "IlTrovatore search engine" - "INA dlweb crawler" - "Indy Library Internet Direct Library for Borland - often spambot" - "InelaBot crawler" - "inktomi Slurp crawler" - "InternetSeer Connectivity checker" - "Interseek" - "IntranooBot" - "IP*Works Link checker" - "IRLbot crawler" - "iSearch search engine" - "istarthere search engine" - "IXE Crawler" - "Jakarta Commons" - "Jetbot/1.0 crawler" - "JungleKeyBot search engine" - "Jyxobot search engine" - "KaloogaBot search engine" - "Killou.com search engine" - "Knowledge.com search engine" - "Lachesis" - "larbin crawler" - "ldspider" - "libwww-perl" - "LinguaBot search engine" - "linkaGoGo crawler" - "LinkChecker" - "Link Commander bookmark manager" - "Linkman Link checker" - "Links SQL" - "Link Valet Online Link checker" - "LiteFinder search engine" - "livemark.jp Link checker" - "lmspider crawler" - "Look.com search engine" - "Loopy.fr search engine" - "Loserbot" - "Lsearch/sondeur" - "lwp-request" - "lwp-trivial" - "LWP::Simple" - "MagpieRSS" - "MapoftheInternet search engine" - "Marvin search engine" - "Me.dium OneRiot crawler" - "Mediapartners-Google" - "Megaglobe search engine" - "Megite news aggregator" - "Metaspinner search engine" - "MileNSbot search engine" - "Mirago (HenriLeRobot) crawler" - "MJ12bot crawler" - "MLBot" - "MnogoSearch/3.2.11" - "Monrobot crawler" - "MOSBookmarks Link checker" - "mozDex crawler" - "Mozilla/4.0 (compatible; MSIE 6.0)" - "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;)" - "Mp3Bot search engine" - "MQbot crawler" - "ms research robot" - "MSIE 4.5 log spam" - "MSIE 6.0 (compatible; MSIE 6.0;... 
log spam" - "MSIE 7.01 log spam" - "MSMOBOT crawler" - "msnbot crawler" - "MSNPTC MSN search robot" - "MSR-ISRCCrawler" - "MSRBOT crawler" - "MultiCrawler search engine" - "MyFamilyBot crawler" - "Nambu" - "NaverBot search engine" - "NaverRobot search engine" - "Nelian Pty Ltd" - "Netcraft survey" - "netEstate crawler" - "NetID Bot Advertising" - "NetResearchServer search engine" - "NetSprint search engine" - "NetWhatCrawler search engine" - "newsg8 RSS monitoring" - "NEWT ActiveX spam / email grabbing" - "NG-Search search engine" - "NG/1.0" - "NG/2.0 crawler" - "NGBot crawler" - "nicebot" - "Nigma search engine" - "NimbleCrawler search engine" - "Norbert the Spider search engine" - "NoteworthyBot" - "NPBot NameProtect crawler" - "nrsbot search engine" - "NuSearch Spider search engine" - "Nutch crawler" - "Nutch (Princeton) crawler" - "ObjectsSearch search engine" - "octopodus search engine" - "Octora crawler" - "ODP::/0.01 Link checker" - "ODP entries" - "ODP links test" - "OmniExplorer_Bot search engine" - "onCHECK" - "OnetSzukaj search engine" - "OOZBOT search engine" - "Openbot search engine" - "OpenISearch search engine" - "OpenTaggerBot social bookmarks" - "OpenX Spider Advertising" - "OrangeBot-Mobile search engine" - "OutfoxBot" - "ozelot" - "page-store" - "Pagebull search engine" - "page_verifier" - "PanopeaBot/1.0 (UCLA CS Dpt.)" - "PEERbot search engine" - "PeerFactor crawler" - "Pete-Spider crawler" - "PHP/4." - "PHP version tracker web stats" - "PicSpider" - "PipeLine spider" - "Pita crawler" - "PollettSearch crawler" - "polybot crawler" - "Pompos - dir.com crawler" - "Popdexter crawler" - "PostFavorites" - "Powermarks Link checker" - "PrivacyFinder search engine" - "PROBE! 
search engine" - "Program Shareware" - "psbot crawler" - "Python-urllib" - "QEAVis" - "QihooBot search engine" - "RAMPyBot search engine" - "Rapid-Finder search engine" - "Reaper/2.06 search engine" - "RedBot crawler" - "RedCarpet" - "RixBot search engine" - "robotgenius malware detection?" - "Robozilla/1.0" - "RSSMicro search engine" - "RTGI Data Mining" - "RufusBot" - "sagool search engine" - "savvybot search engine" - "SBIder crawler" - "schibstedsokbot search engine" - "Scooter search engine" - "ScoutJet search engine" - "Scrubby search engine" - "search.updated.com search engine" - "SearchByUsa search engine" - "SearchIt.Bot search engine" - "Seekbot crawler" - "Semager search engine" - "Sensis search engine" - "SEOprofiler bot crawler" - "ShablastBot search engine" - "Shelob" - "sherlock search engine" - "Shim-Crawler" - "ShrinkTheWeb crawler" - "ShunixBot crawler" - "silk search engine" - "Skywalker / Anonymous" - "Slurpy Verifier" - "snap.com search engine" - "Snapbot search engine" - "SnapPreviewBot" - "socbay search engine" - "sogou spider" - "sohu-search search engine" - "sohu agent search engine" - "Sosospider search engine" - "SpeedySpider search engine" - "Spinn3r" - "sproose crawler" - "SpurlBot bookmark managing crawler" - "statbot" - "StatusCheckBot Link checker" - "Steeler crawler" - "SuperBot search engine" - "Susie bookmark manager link checker" - "sygol search engine" - "SynapticWalker spam / email grabbing" - "SynooBot search engine" - "Syntryx ANT Chassis crawler" - "Szukacz/1.5 search engine" - "T-H-U-N-D-E-R-S-T-O-N-E" - "TargetYourNews Link checker" - "Teemer" - "Teoma search engine" - "TerraSpider" - "test Link checker" - "TFC" - "Theophrastus" - "Thumbnail.CZ robot" - "thumbshots-de-bot" - "TinEye crawler" - "TranSGeniKBot" - "trexmod" - "Tubenowbot Link checker" - "TurnitinBot crawler" - "TutorGigBot crawler" - "Tutorial Crawler" - "TweetmemeBot" - "Twiceler crawler" - "Twitturl" - "TygoBot search engine" - "uberbot crawler" - 
"UnChaosBot search engine" - "updated search engine" - "UptimeAuditor Connectivity checker" - "UptimeBot" - "URLBase bookmark manager" - "Valizbot crawler" - "VDSX.nl search engine" - "versus crawler" - "Visbot search engine" - "VoilaBot crawler" - "Vortex crawler" - "voyager search engine" - "VSE/1.0 crawler" - "W3C-checklink" - "WebAlta crawler" - "WebarooBot crawler" - "WebCorp search engine" - "webcrawl search engine" - "WebFilter" - "WebIndexer search engine" - "WebRACE/1.1" - "Webscan" - "WebsiteWorth log spam" - "wikiwix search engine" - "Willow Internet Crawler" - "WinkBot search engine" - "Winsey search engine" - "WIRE" - "woriobot search engine" - "WorQmada Link checker" - "Wotbox search engine" - "wume_crawler" - "www.almaden.ibm.com/cs/crawler" - "www.IsMySiteUp.Net" - "www.pisoc.com search engine" - "Xenu Link checker" - "Xerka Data Mining" - "xirq search engine" - "XmarksFetch bookmark manager search engine" - "yacybot search engine" - "Yahoo! Slurp crawler" - "Yahoo! Mindset" - "Yahoo-MMCrawler" - "Yahoo-Test crawler" - "YahooSeeker search engine" - "YahooVideoSearch search engine" - "Yandex search engine" - "Yanga search engine" - "yellowJacket Link checker" - "YesupBot" - "Yeti search engine" - "Yooda" - "yoono search engine" - "YottaCars search engine" - "YottaShopping search engine" - "YoudaoBot search engine" - "ZeBot search engine" - "zerxbot search engine" - "Zeus search engine" - "Zion crawler" - "ZipppBot search engine" - "ZyBorg/1.0 search engine" -
As well as the following ones that do not identify themselves:
"Alexa crawler" - "Ask crawler" - "bloglines" - "Cox Communications" - "exabot thumb" - "IP 62.193.214.* spambot" - "IP 66.185.126.130 log spam" - "IP 67.15.68.85" - "IP 67.108.232.229" - "IP 82.99.30.[1-7]?\d$" - "IP 84.19.188.24[3-9] spam" - "IP 89.122.57.185 crawler" - "IP 149.5.168.19" - "IP 208.115.138.*" - "IP 209.11.247.146" - "IP 217.74.99.* crawler" - "IP 217.169.46.98" - "MSNBot crawler" - "netsweeper" - "NOOS crawler" - "ODP entries" - "Other Microsoft bot" - "UUNET" - "WebSense IP 66.194.6.*" - "WebSense IP 208.80.19\d.*" - "webtrends" - "Yahoo crawler" - "www.dir.com"
You can see their last visits or find their identity (771 User Agent strings) or download a list.
Some robots regularly request robots.txt, but link checkers (checking inbound links from other sites or search engines), validation tools and log spammers do not read robots.txt.
Among those exploring the site
Did not follow robots.txt rules:
- Advista AdBot, alef/0.0, Alexa, BIGLOTRON(Beta 2), Asterias, boitho.com, DTAAgent, fast-search-engine, Fetch API Request, Gigamega.bot, grub (looksmart & other users), Helix, ia_archiver (Alexa), IRLbot, INA dlweb, Jyxobot, libwww-perl, LiteFinder, Lsearch/sondeur, LWP (simple & trivial), msnbot/2.0b, MSR-ISRCCrawler, NetResearchServer, NOOS, OmniExplorer_Bot, Pompos (www.dir.com), Program Shareware, shunix (libwww-perl/5.803), TygoBot, wbdbot, WebCrawler, Yahoo! Slurp/3.0, ZyBorg
- recently:
- IP 89.122.57.185, PollettSearch
Did not limit bandwidth usage:
- appie, Ask Jeeves, Exalead or NG/1.0, Fetch API Request, msnbot/0.1, msnbot/0.11, NaverRobot, Pompos (www.dir.com), Program Shareware, shunix (Xun), TygoBot, WebCrawler
- recently:
- e-SocietyRobot, INA dlweb, LWP (simple & trivial), NG/2 (Exalead), OmniExplorer_Bot, Seekbot
Followed robots.txt rules except for exe, pdf, tar and zip files:
- recently:
- larbin, Sensis.com.au, sygol, ZyBorg
Recently for this site:
Explore home page only
- Anonymous
- Bazbot
- Big Fish
- BuzzRankingBot
- CentiverseBot
- Cherchonsbot
- comBot
- Cosmix
- Crawl Annu
- Crawllybot
- cybercity.dk
- DataFountains/DMOZ Downloader
- del.icio.us-thumbnails
- DMOZ Experiment
- DNSGroup
- ejupiter.com
- envolk
- exooba
- favorstarbot
- flatlandbot
- Fluffy
- flyindex
- FollowSite
- Gaisbot/3.0
- Galbot
- GeoBot
- Gnomit
- GOFORITBOT
- grub crawler
- GT::WWW/1.02
- Heritrix
- Holmes
- HooWWWer
- HouxouCrawler
- ICC-Crawler
- Indy Library
- InelaBot
- InternetSeer
- IP*Works
- IP 67.15.68.85
- IP 67.108.232.229
- IP 193.109.173.79
- IP 207.44.188.104
- iSearch
- JungleKeyBot
- KaloogaBot
- Knowledge.com
- linkaGoGo
- LinkPimpin
- Links SQL
- Look.com
- Loopy.fr
- Loserbot
- MapoftheInternet
- Marvin
- Metaspinner
- Monrobot
- mozDex
- MQBOT
- MSIE 4.5; Windows 98;
- MSIE 6.0 (compatible; MSIE 6.0;
- MSIE 7.01
- MSNPTC
- MultiCrawler
- Netcraft
- netEstate
- NetID Bot
- NetResearchServer
- NetSprint
- NetWhatCrawler
- NimbleCrawler
- nrsbot
- ObjectsSearch
- octopodus
- ODP::/0.01
- ODP links test
- onCHECK
- OnetSzukaj
- OpenX Spider
- PEERbot
- PHP/4.2.2
- PHP version tracker
- PicSpider
- PipeLiner
- polybot
- PrivacyFinder
- PROBE!
- RAMPyBot
- REBOL View
- Robotzilla
- savvybot
- Scrubby
- search.updated.com
- SearchByUsa
- SearchIt.Bot
- silk
- Skywalker
- Slurpy Verifier
- snap.com
- snipsearch
- sogou spider
- sohu-search
- SynooBot
- Syntryx ANT
- T-H-U-N-D-E-R-S-T-O-N-E
- Teoma
- test
- Thumbnail.CZ robot
- thumbshots-de-bot
- trexmod
- updated
- UUNET
- VDSX.nl
- WebAlta
- webcrawl
- WebRACE
- WebsiteWorth
- wectarbot
- wikiwix
- Willow Internet Crawler
- WinkBot
- Winsey
- WIRE
- WorQmada
- www.IsMySiteUp.Net
- xirq
- yacybot
- Yahoo-MMCrawler
- Yandex
- Yooda
- YottaCars
- YottaShopping
- YoudaoBot
- ZeBot
- zerxbot
- ZipppBot
Explore other pages too
- 1Noonbot
- 80legs
- ABACHOBot
- abcfr_robot
- Accoona-AI-Agent
- ActiveBookmark
- Advista AdBot
- aipbot
- alef
- Aleksika
- Alexa
- amagit
- Amfibibot
- AnswerBus
- antibot
- appie
- Apple-PubSub
- archive.org_bot
- Argus
- Ask Jeeves
- Asterias
- atraxbot
- Baiduspider
- BecomeBot
- Biglotron
- BlogCorpusCrawler
- Blogdimension
- Bloglines (RSS)
- Bluebot
- bogospider
- boitho
- Bookdog
- bot/1.0
- BruinBot
- CacheBot
- Caliperbot
- capek
- CatchBot
- CazoodleBot
- ccubee
- cfetch
- Combine
- cometsystems
- ConveraCrawler
- CorenSearchBot
- Cox Communications
- csci_b659/0.13
- CydralSpider
- Cyveillance
- darxi
- dcbspider
- DealGates
- deepak-USC/ISI
- del.icio.us
- DepSpid
- Diamond
- discobot
- DotBot
- DTAAgent
- Dumbot
- e-SocietyRobot
- eApolloBot
- EasyDL
- EnaBot
- ETS
- Exabot
- Exabot-Images
- Exabot-Thumbnails
- Factbot
- FAST-search-engine
- FAST-WebCrawler
- FAST Enterprise Crawler
- FAST MetaWeb Crawler
- FavOrg
- FeedBurner
- FeedFetcher-Google (RSS)
- Fetch API Request
- Filangy
- Findexa
- findfiles.net
- findlinks
- fleck
- Friend or Winsey
- FurlBot
- Gaisbot
- genevabot
- geniebot
- Gigabot/1.0
- Gigamega.bot
- GingerCrawler
- Girafabot
- Google-Sitemaps
- Googlebot
- Googlebot-Image
- Googlebot-Mobile
- Google Desktop
- grub
- grub.org
- gURLChecker
- GurujiBot
- GUSbot
- hclsreport
- Helix
- HenriLeRobotMirago
- htdig
- ia_archiver
- ichiro
- Iltrovatore-Setaccio
- INA dlweb
- interseek
- IntranooBot
- IP 63.247.72.42
- IP 89.122.57.185
- IP 217.74.99.100
- IRLbot
- istarthere
- Jakarta Commons-HttpClient
- Jetbot
- Jyxobot
- larbin
- ldspider
- libwww-perl
- LinguaBot
- Link Commander
- Linkman
- Link Valet Online
- LiteFinder
- livemark.jp
- lmspider
- Lsearch/sondeur
- LWP (simple & trivial)
- Me.dium
- Mediapartners-Google
- Megaglobe
- Megite
- MJ12bot
- MLBot
- MOSBookmarks
- Mozilla/4.0 (compatible; MSIE 6.0)
- Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.0;)
- Mp3Bot
- MQbot
- MSMOBOT
- msnbot
- MSR-ISRCCrawler
- MSRBOT
- MyFamilyBot
- Nambu
- NaverBot
- NaverRobot
- Nelian Pty Ltd
- netsweeper
- newsg8 (RSS)
- NEWT ActiveX
- NG-Search
- NG/2.0
- NGBot
- nicebot
- Nigma
- NOOS
- Norbert the Spider
- NoteworthyBot
- NPBot
- NuSearch Spider
- Nutch
- OmniExplorer
- OOZBOT
- OpenISearch
- OpenTaggerBot
- OrangeBot-Mobile
- OutfoxBot
- ozelot
- page-store
- Pagebull
- page_verifier
- PeerFactor crawler
- Pete-Spider
- PollettSearch
- PostFavorites
- Powermarks
- Program Shareware
- psbot
- Python-urllib
- QEAVis
- QihooBot
- Rapid-Finder
- RedBot
- RixBot
- RSSMicro
- RTGI
- RufusBot
- Sagool
- SBIder
- schibstedsokbot
- ScoutJet
- ScSpider
- Seekbot
- Semager
- Sensis
- SEOprofiler bot
- ShablastBot
- Shelob
- sherlock
- Shim-Crawler
- ShrinkTheWeb
- ShunixBot
- Snapbot
- SnapPreviewBot
- socbay
- sohu agent
- SpeedySpider
- sproose
- SpurlBot
- statbot
- StatusCheckBot
- Steeler
- SuperBot
- Susie
- sygol
- SynapticWalker
- Szukacz
- TargetYourNews
- Teemer
- TerraSpider
- TFC
- Theophrastus
- TinEye
- Tubenowbot
- TurnitinBot
- TutorGigBot
- Tutorial Crawler
- TweetmemeBot
- Twiceler
- Twitturl
- TygoBot
- uberbot
- UnChaosBot
- UptimeAuditor
- URLBase
- Valizbot
- versus crawler
- Visbot
- VoilaBot
- Vortex
- voyager
- wbdbot
- WebarooBot
- WebCorp
- WebFilter
- WebSense
- Winsey or Friend
- woriobot
- Wotbox
- wume_crawler
- www.almaden...
- www.pisoc.com
- Xenu
- Xerka
- XmarksFetch
- Yahoo! Mindset
- Yahoo! Slurp
- Yahoo-Test
- YahooSeeker
- YahooVideoSearch
- Yanga
- yellowJacket
- YesupBot
- Yeti
- yoono
- Zion
- ZyBorg
- curl
- Pompos
- shunix (Xun)
- DataCha0s
- libwww-perl
- LWP (simple & trivial)
- Mozilla/3.0 (compatible; Indy Library)
- Mozilla/5.0
Detecting a robot
Using its User Agent
Here is a PHP script (which is used by the site stats) allowing you to know if a robot or a search engine is requesting a page:
A script using the User Agent is now online here
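As an illustration of User Agent detection, here is a minimal sketch. It is not the script used by the site stats: the function name detect_robot_by_ua and the short pattern list are assumptions, and a real list would be much longer.

```php
<?php
// Hypothetical helper: returns the matched robot token (lowercased),
// or "" when the User Agent looks like an ordinary browser.
function detect_robot_by_ua(string $ua): string
{
    // Assumption: an empty User Agent is treated as a robot.
    if (trim($ua) === "") {
        return "empty-ua";
    }
    // Illustrative pattern only; extend it with the names seen in the logs.
    $pattern = "/googlebot|msnbot|slurp|baiduspider|crawler|spider|libwww-perl|httrack|wget/i";
    if (preg_match($pattern, $ua, $m)) {
        return strtolower($m[0]);
    }
    return "";
}
```

In a page, it could be called as $robot = detect_robot_by_ua((string) getenv("HTTP_USER_AGENT")); so that $robot stays empty for presumed browsers.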
It is more difficult to spot robots that do not identify themselves and that:
- give an MSIE 6 (such as UUNET or Websense), Mozilla 5 (net-sweeper), Konqueror (twtc / Websense - RegExp: 3\.[0-1](\-rc[1-6])?; i686 Linux; 2002[0-9]{4}-, exabot - Exalead User Preview) or Mozilla 4.01 (NOOS) identification,
- change IP address each time they load a page (a WHOIS search on Ripe or Whois Source or Openrbl may give you a clue),
- combine all the methods (qwest.net, .ev1servers.net as well).
Using its host
A good example seems to be the www.dir.com (search engine) robot which uses many IP addresses (from 212.27.33.164 to 212.27.33.173 in May 2003, 212.27.41.18 in November 2003). Its activity could be seen on the page logging servers, but is filtered now by the following PHP routine.
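/* $no_ip holds the visitor's IP address: $no_ip=getenv("REMOTE_ADDR"); */
if (!$robot)
{
    /* strstr() returns the host name from ".dir.com" onward, or false if it is absent */
    $robot=strstr(gethostbyaddr($no_ip),".dir.com");
}
//if it's the www.dir.com robot then $robot is set as .dir.com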
Using its IP
A robot requesting pages from a few IPs can be spotted likewise:
if (!$robot)
{
$robot=strchr($no_ip,"208.53.138.");
}
/*
if the IP starts with 208.53.138.
$robot is set to that IP address, otherwise to false
*/
In any case, maintaining a list of User Agents, hosts and IPs noticed as having a strange behaviour will be necessary.
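When a robot uses a contiguous block of addresses, such as the www.dir.com range quoted above, comparing the addresses as numbers may be simpler than matching strings. This is only a sketch: ip_in_range is a hypothetical helper, it handles IPv4 only, and it assumes a 64-bit PHP build where ip2long() returns non-negative integers.

```php
<?php
// Hypothetical helper: true when $ip lies between $first and $last (inclusive).
function ip_in_range(string $ip, string $first, string $last): bool
{
    $n  = ip2long($ip);
    $lo = ip2long($first);
    $hi = ip2long($last);
    // ip2long() returns false for anything that is not a valid IPv4 address
    if ($n === false || $lo === false || $hi === false) {
        return false;
    }
    return $n >= $lo && $n <= $hi;
}
```

For example, ip_in_range($no_ip, "212.27.33.164", "212.27.33.173") would cover the addresses noted for May 2003.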
Using the request method
It seems that, at the present time (June 2005), only robots and download utilities use a HEAD request (then a GET if the page exists or has been modified). Thus $_SERVER["REQUEST_METHOD"] can allow the identification of a robot using a browser User Agent. (Read RSS feed for tests in progress).
/*this method must come first*/
if ($_SERVER["REQUEST_METHOD"]=="HEAD") {$robot="robot";}
/*if HEAD is used, $robot will not be empty*/
All these methods seem to be rather accurate.
Blocking a robot with PHP
When some Apache modules are not available, when access to .htaccess files is restricted (my case), or when we want to cut down the size of the .htaccess file and let the server do only what is useful, PHP allows us to redirect or block a robot.
If we want to stop a robot (here Fetch API Request), we just have to begin all our pages (before any output to the browser) with the following script, so that the webbot is redirected to the page bye.html or any other page, or is sent a 403 Access Denied status message.
<?php
$UA=getenv("HTTP_USER_AGENT");
if (stristr($UA,"Fetch API Request")!==false)
{
    header("Location: http://mydomain/bye.html");
    die(); /*this line can be replaced by the HTML redirection*/
}
?>
This page not being linked, the spidering will immediately stop.
The same can be done with an IP by using getenv("REMOTE_ADDR");.
More sophisticated techniques are listed above.
About two thirds of the robots will follow the redirection if the domain name does not change, almost none if it changes.
A redirection in HTML will be necessary if we want to redirect all of them or let them know where the new page is:
<?php
/*single quotes around the string so that the double quotes of the HTML do not end it*/
echo '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Redirection</title>
<meta http-equiv="Refresh" content="0;URL=http://mydomain/bye.html">
</head>
<body>
<p>Redirection: <a href="http://mydomain/bye.html">http://mydomain/bye.html</a></p>
</body>
</html>';
die();
?>
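The 403 alternative mentioned earlier could look like the following sketch. The function names status_for_ua and deny_if_banned and the contents of the banned list are assumptions; http_response_code() requires PHP 5.4 or later.

```php
<?php
// Hypothetical helper: returns 403 if the User Agent contains a banned name, else 200.
// The names in $banned must not be empty strings.
function status_for_ua(string $ua, array $banned): int
{
    foreach ($banned as $name) {
        // stripos() is case-insensitive and returns false when the name is absent
        if (stripos($ua, $name) !== false) {
            return 403;
        }
    }
    return 200;
}

// Hypothetical wrapper: sends the 403 status and stops the script.
// It must be called before any output is sent to the browser.
function deny_if_banned(string $ua, array $banned): void
{
    if (status_for_ua($ua, $banned) === 403) {
        http_response_code(403);
        exit("Access denied");
    }
}
```

A page would then start with deny_if_banned((string) getenv("HTTP_USER_AGENT"), array("Fetch API Request"));. Keeping the decision in a separate function makes it possible to check the list without sending any header.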
Allowing some robots and blocking others
A function to include and call at the beginning of each page can allow us to manage robots.
/*start*/
function redirect_robots()
{
$requested_page=$_SERVER["REQUEST_URI"];
if (preg_match("/([enptux]|\b)(ftp|https?):\/\//i",$requested_page)){die();} /*blocks the majority of zombies*/
When we are unlucky and visited by zombies, or when we are using a CMS, the best is to block all these requests.
if ($_SERVER["REQUEST_METHOD"]=="HEAD") return;
Why should we block this type of request? The harm is already done: link checkers pointing to our site (Xenu, Powermarks, Link Commander, HTTrack, IRLbot...) and search engines (Speedy Spider, sygol...) will get a positive answer, and their case will be considered later if they come back with a GET or POST request.
There, we can store the IP in a MySQL table to block any comeback of the utility or webbot.
$UA=getenv("HTTP_USER_AGENT");
/*eregi() was removed in PHP 7: preg_match() with the i flag does the same job here*/
if (preg_match("/Googlebot|Yahoo|VoilaBot|Ask Jeeves|SpeedySpider/i",$UA)) return;
No problem for the robots we accept: those who identify themselves and are named in the regular expression above. The host can be checked to see if it matches the User Agent.
/*
Including bot in the expression will block aipbot, antibot, boitho, OmniExplorer...
As for this site, up to 408 robots!
*/
if (preg_match("/[^e]crawler|spider|bot|custo |web(cow|moni|capture)|wysigot|httrack|wget|xenu/i",$UA))
{
header("Location:http://mydomain/bye.html");die();
/*another option is to send a 403 Access Denied status message
handled by Apache .htaccess
header("Status: 403 Forbidden");die();*/
}
Even if I am not convinced that it is necessary to block the ones that do not abuse the site, all those in the regular expression will be redirected.
Many utilities like Wysigot leave their name in the User Agent even when they are not active.
$no_ip=getenv("REMOTE_ADDR");
$host=gethostbyaddr($no_ip);
if (preg_match("/(becquerel|66\-132|64\-225)\.noos\.(net|fr)/i",$host) && strstr($UA,"MSIE 4.01"))
{
header("Location:http://mydomain/bye.html");die();
}
if (preg_match("/exabot|lehigh/i",$host))
{
header("Location:http://mydomain/bye.html");die();
}
We can test the host and ban a few badly-behaved robots or the reading by a request from a search engine. Is it really useful?
//$no_ip=getenv("REMOTE_ADDR");
if (preg_match("/63\.247\.72\.42|208\.53\.138\.1/",$no_ip)) die();
We can ban an IP or a group of IPs, or get the IPs to ban from a MySQL database...
return;
}
/*end*/
Now, those who are still here can browse.
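The remark that the host can be checked to see if it matches the User Agent can be sketched as a reverse lookup followed by a forward lookup. Everything named here is an assumption: host_matches_robot is a hypothetical helper, the list of host name suffixes has to be taken from each search engine's own documentation, and the two optional callables exist only so that the DNS calls can be replaced during tests.

```php
<?php
// Hypothetical helper: true when the IP resolves to a host name ending with one
// of $suffixes AND that host name resolves back to the same IP.
function host_matches_robot(string $ip, array $suffixes, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbyname';
    $host = $reverse($ip);
    // gethostbyaddr() returns the IP unchanged on failure, false on malformed input
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    $suffix_ok = false;
    foreach ($suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffix_ok = true;
            break;
        }
    }
    // The forward lookup must point back to the visitor's IP
    return $suffix_ok && $forward($host) === $ip;
}
```

A visitor whose User Agent claims to be an accepted robot but whose host fails this check could then be treated like any unidentified robot. The price is two DNS lookups per request, so the result is worth caching.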
We can optimize the code, add a few rules for the referrer, the number of pages requested (stored with MySQL)... It will be easy to update or modify the code, but how many errors?
A few ideas...
As indexing activity shouldn't be blocked (even if no one can stop a web spider user from declaring a robot identifier), the site uses two bot traps in the French home page (and only one robot trap in the English home page) to find out whether a human being is viewing a page:
They consist of links without text, so that no one can see them.
- The first is in an allowed folder. Any access to the file allows me to update the list above.
- The second is in a folder marked as prohibited to robots in the file robots.txt ( Disallow: /interdit/). Even though indexing robots do not always respect the rules, if the page is hit it must be a web copier.
As the site is rarely copied and even if few users follow robots.txt rules, these two traps do not initiate an action.
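If a trap were to trigger an action, one possible approach is for the trap page to record the visitor's IP and for the other pages to check that record. This is a sketch under assumptions, not what the site does: the function names and the one-IP-per-line text file are hypothetical, and a MySQL table could replace the file.

```php
<?php
// Hypothetical helper called by the trap page: appends the IP to a text file.
function record_trap_hit(string $ip, string $file): void
{
    // One IP per line; LOCK_EX avoids interleaved writes from parallel requests
    file_put_contents($file, $ip . "\n", FILE_APPEND | LOCK_EX);
}

// Hypothetical helper called by the other pages: true if the IP hit the trap before.
function is_trapped(string $ip, string $file): bool
{
    if (!is_file($file)) {
        return false;
    }
    // file() returns false on failure, hence the fallback to an empty array
    $ips = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: array();
    return in_array($ip, $ips, true);
}
```

The trap page would call record_trap_hit(getenv("REMOTE_ADDR"), $file), and any other page could redirect or refuse the visitor when is_trapped() returns true.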
If some people find the site interesting enough to be mirrored, they can archive it. I could stop them with a script from the sites mentioned above, with the methods following the detection script, or with an anti-mirroring PHP script. I could also limit the number of pages per session or per IP (robots usually follow the same route), or slow them down by counting the number of pages that visitors or robots request per second and allowing less than one page per second, which would be a problem for web spiders and for people who do not read.
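The idea of allowing less than one page per second can be sketched as follows. The function name allow_request is hypothetical, and the array of last request times is passed in by reference only to keep the example self-contained; on a real site it would be stored per IP or per session, for instance in MySQL.

```php
<?php
// Hypothetical helper: $hits maps an IP to the time of its last accepted request.
// Returns true when at least $min_interval seconds have passed since that request.
function allow_request(array &$hits, string $ip, float $now, float $min_interval = 1.0): bool
{
    if (isset($hits[$ip]) && ($now - $hits[$ip]) < $min_interval) {
        return false; // too fast: a spider, or someone who does not read
    }
    // Only accepted requests move the reference time forward
    $hits[$ip] = $now;
    return true;
}
```

A page would call allow_request($hits, getenv("REMOTE_ADDR"), microtime(true)) and delay or refuse the answer when it returns false.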
Using the IP to do so works if the visitor's provider gives a unique IP address. This is not the case with AOL and many big companies.
Changing provider is one option: some filter web spiders (just as www.free.fr sometimes does!!!).
Therefore preventing or stopping website mirroring is difficult or risky.
If you prefer offline browsing, you can download the static part of the site (extension of compressed files : exe~597k or bz2~631k - December 2005 / use the site map).
