Webbot activity on Jun 26, 2017

name               activity       pages
IP 46.29.251.11    log spam       1 page
IP 198.27.91.223   log spam       1 page
Googlebot          crawler        1 page
DotBot             crawler        1 page
Googlebot          crawler        1 page
bingbot            crawler        1 page
BUbiNG             crawler        10 pages in 1' 13''
DotBot             crawler        1 page
Yahoo! Slurp       crawler        1 page
Yahoo! Slurp       crawler        1 page
IP 195.154.21.201  log spam       1 page
Baiduspider        search engine  1 page
DuckDuckGo         search engine  1 page
Yahoo! Slurp       crawler        1 page
IP 80.241.223.216  log spam       1 page
Yahoo! Slurp       crawler        1 page
Googlebot          crawler        1 page
Python-urllib      -              1 page
bingbot            crawler        1 page
CCBot              crawler        33 pages in 9' 19''
661,478 visits of identified bots
about 107 a day in 2014
38 today at 14:08 (+55 visitors)
zombies: 3 visits / 31 requests - spammers: 5 visits

What do these statistics mean?

They are extracted from $_SERVER["HTTP_USER_AGENT"] ($HTTP_USER_AGENT under PHP 3) and gethostbyaddr().
Because this site's host (free.fr) sometimes filters access, the figures are biased.
The site statistics do not count robots as visitors; the browser and country they report are ignored. It is therefore quite easy to log the activity of those that are not banned.
Even if a robot reads several pages, it is stored only once, unless it comes back after more than 10 minutes (more than 30 minutes for Google Desktop).
This list gives approximate figures, as it assumes a reliable connection to MySQL, which is not the case on this site. But web hosting here is free, so...

Robot Detection

This routine is commented in the page about webbot traps.
Logging the user agent of the other visitors is necessary to keep the lists of webbots visiting the site, and of their user agents, up to date.
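
The detection idea can be sketched as follows. This is only an illustration, not the routine commented on the webbot-traps page: the token list and the detect_robot() helper are assumptions, and the reverse-DNS fallback is just a hint, not proof.

```php
<?php
// Illustrative sketch: match the user agent against known bot
// tokens, then fall back to a reverse DNS lookup on the client IP.
function detect_robot($userAgent, $ip)
{
    // Partial list of substrings found in bot user agents.
    $botTokens = array('Googlebot', 'bingbot', 'Baiduspider',
                       'DuckDuckBot', 'Yahoo! Slurp', 'DotBot',
                       'BUbiNG', 'CCBot', 'Python-urllib');
    foreach ($botTokens as $token) {
        if (stripos($userAgent, $token) !== false) {
            return $token;
        }
    }
    // Unknown agent: a hostname like crawl-*.googlebot.com
    // suggests a crawler even without a known token.
    $host = gethostbyaddr($ip);
    if ($host !== false && preg_match('/(bot|crawl|spider)/i', $host)) {
        return $host;
    }
    return null; // treat as an ordinary visitor
}
```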

The Data Table

Here is the structure of the robots table I use:
#
# Structure of the table `robots`
#
CREATE TABLE `robots` (
  `timeoflastaccess` int(10) unsigned NOT NULL default '0',
  `timeofarrival` int(10) unsigned NOT NULL default '0',
  `nameofbot` varchar(64) NOT NULL default '',
  `lastpage` varchar(30) NOT NULL default '',
  `numberofpages` mediumint(8) unsigned NOT NULL default '0',
  KEY `timeoflastaccess` (`timeoflastaccess`),
  KEY `timeofarrival` (`timeofarrival`),
  KEY `nameofbot` (`nameofbot`),
  KEY `numberofpages` (`numberofpages`)
) ENGINE=MyISAM;

You can use double or datetime for the times, and double or int for numberofpages; if necessary, increase the number of characters for lastpage.

Table Update
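
A minimal sketch of the update step, assuming a mysqli connection in $db and the robots table defined above; the 600-second window matches the 10-minute rule mentioned earlier. The names ($db, $name, $page, log_robot) are illustrative, not the site's actual code.

```php
<?php
// Sketch: a visit within the last 10 minutes (600 s) extends the
// existing row; otherwise a new row is inserted.
function log_robot($db, $name, $page)
{
    $now   = time();
    $limit = $now - 600; // 10-minute window

    $stmt = $db->prepare(
        "UPDATE robots
            SET timeoflastaccess = ?, lastpage = ?,
                numberofpages = numberofpages + 1
          WHERE nameofbot = ? AND timeoflastaccess > ?");
    $stmt->bind_param('issi', $now, $page, $name, $limit);
    $stmt->execute();

    if ($stmt->affected_rows === 0) {
        // No recent visit by this bot: record a new one.
        $stmt = $db->prepare(
            "INSERT INTO robots (timeoflastaccess, timeofarrival,
                                 nameofbot, lastpage, numberofpages)
             VALUES (?, ?, ?, ?, 1)");
        $stmt->bind_param('iiss', $now, $now, $name, $page);
        $stmt->execute();
    }
}
```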

Data Display

We have the name of the robot, the time of its arrival, the time of the last page it loaded, and the total number of pages read.
If a single page shows different values for timeoflastaccess and timeofarrival, the page was reloaded.
I chose to display the number of pages loaded and the reading time.
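
The display can be sketched like this (again assuming a mysqli connection in $db; the query and formatting are illustrative). The reading time is simply timeoflastaccess minus timeofarrival.

```php
<?php
// Sketch: list today's robots with page count and reading time.
$midnight = mktime(0, 0, 0);
$result = $db->query(
    "SELECT nameofbot, timeofarrival, timeoflastaccess, numberofpages
       FROM robots
      WHERE timeofarrival >= $midnight
      ORDER BY timeofarrival");
while ($row = $result->fetch_assoc()) {
    $seconds = $row['timeoflastaccess'] - $row['timeofarrival'];
    printf("%s: %d page(s) in %d' %02d''\n",
           $row['nameofbot'], $row['numberofpages'],
           (int) floor($seconds / 60), $seconds % 60);
}
```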

A similar script is now online here.

