Copying a website with HTTrack

Tested with WinHTTrack Website Copier 3.30-RC-4 (+swf)

Canobie June 2003

project name: canobie
Web(URL) address: http://www.canobie.com
amount of time: 1/2 hour (56k modem)

problems:

Missing or corrupted files
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> tag
robots.txt file

Other examples with similar difficulties: Martin Luther King 2002 | Martin Luther King 2004 | Recycling | Extreme World | Herberton

solutions:

At the end of the capture, many images and PDF files are missing.
The HTTrack error log shows:
Info: Note: due to www.canobie.com remote robots.txt rules, links begining with these path will be forbidden: /cgi-bin/, /stats/, /applets/, /counterparts/, /webmaster/ (see in the options to disable this)
Warning: Link www.canobie.com/fr102.html not scanned (follow robots meta tag)
Warning: Link www.canobie.com/general.html not scanned (follow robots meta tag)
...
Warning: Link www.canobie.com/marriott.html not scanned (follow robots meta tag)
This site now tries to limit offline browsers, which was not the case last year. Some visitors probably abused it by not limiting the number of connections or the maximum transfer rate.
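To see what the robots.txt rules reported in the log mean for a given link, here is a minimal Python sketch using the standard `urllib.robotparser` module. The disallowed paths are the ones quoted in the log; the `User-agent: *` line and the sample URLs under those paths are assumptions made for illustration, since the real robots.txt file is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Disallowed paths copied from the HTTrack log message.
# ASSUMPTION: the rules apply to every user agent ("User-agent: *").
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /applets/
Disallow: /counterparts/
Disallow: /webmaster/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, agent: str = "*") -> bool:
    """Return True if the rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "http://www.canobie.com/general.html",
        "http://www.canobie.com/stats/index.html",  # hypothetical path
    ):
        print(url, "allowed" if is_allowed(rules, url) else "forbidden")
```

Note that robots.txt only explains the first log line. The "not scanned (follow robots meta tag)" warnings come from a tag inside the pages themselves.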

Open one of the files with a META tag limiting the capture. You will find:
```
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
```
This tag is in all the files that have not been scanned.
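If you want to check a saved page for this tag without opening it by hand, the following Python sketch may help. It uses the standard `html.parser` module; the class and function names are made up for this example and have nothing to do with HTTrack itself.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for item in attributes.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the robots directives found in `html`, lowercased."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    print(sorted(robots_directives(page)))
```

A page whose directives include `nofollow` is one whose links a rule-following spider will not scan.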
Read the page about bot traps, then in the Spider tab select "no robots.txt rules".
A quick look at the site shows that the folders disallowed in robots.txt, if they exist at all, contain nothing of interest.
To avoid being trapped, exclude these folders in the scan rules:
-*/cgi-bin/*
-*/stats/*
and so on.
This has several advantages: the capture stays small, the server is not overloaded by downloads of useless files, and you will not be blocked if one of these folders contains a script aimed at clients that do not respect the rules.
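To get a feel for which links the exclusion rules above would drop, here is a rough Python sketch. It approximates the `*` wildcard with the standard `fnmatch` module, which is an assumption: HTTrack's own filter matching has more features and may differ in detail. The link `www.canobie.com/cgi-bin/form.pl` is hypothetical.

```python
from fnmatch import fnmatchcase

# The two rules spelled out in the text; add one per disallowed folder.
EXCLUDE_RULES = ["-*/cgi-bin/*", "-*/stats/*"]


def is_excluded(link: str, rules=EXCLUDE_RULES) -> bool:
    """Return True if any '-' rule matches the link.

    Simplification: only exclusion rules are handled, and '*' is matched
    with fnmatch semantics rather than HTTrack's own filter engine.
    """
    for rule in rules:
        if rule.startswith("-") and fnmatchcase(link, rule[1:]):
            return True
    return False


if __name__ == "__main__":
    for link in (
        "www.canobie.com/general.html",
        "www.canobie.com/cgi-bin/form.pl",  # hypothetical link
    ):
        print(link, "excluded" if is_excluded(link) else "kept")
```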

Launch the mirroring operation.
Once it has finished, some files are still missing from the capture. The HTTrack error log shows:
Error: "Not Found" (404) at link www.canobie.com/Resources/davinci.html (from www.canobie.com/adonis.html)
...
Error: "Bad Request" (400) at link www.canobie.com/Images/logo104.jpg (from www.canobie.com/general.html)
...
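When the log is long, the failed links can be pulled out with a short script. This Python sketch assumes that every error entry follows the same pattern as the two quoted above; other HTTrack versions may word their messages differently.

```python
import re

# Pattern modelled on the two log entries quoted above (an assumption about
# the general log format, not a documented specification).
LOG_ERROR = re.compile(
    r'Error:\s+"(?P<reason>[^"]+)"\s+\((?P<status>\d{3})\)\s+'
    r'at link\s+(?P<link>\S+)\s+\(from\s+(?P<referrer>[^)\s]+)\)'
)


def parse_errors(log_text: str) -> list:
    """Return (status, link, referrer) for every error entry in the log."""
    return [
        (int(match["status"]), match["link"], match["referrer"])
        for match in LOG_ERROR.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        'Error: "Not Found" (404) at link www.canobie.com/Resources/davinci.html '
        '(from www.canobie.com/adonis.html)\n'
        'Error: "Bad Request" (400) at link www.canobie.com/Images/logo104.jpg '
        '(from www.canobie.com/general.html)\n'
    )
    for status, link, referrer in parse_errors(sample):
        print(status, link, "referenced by", referrer)
```

The referrer column tells you which page to visit online to see the missing file in context.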
With Windows Explorer, search for files containing the text "404 FILE NOT FOUND" to find the error pages saved in place of the missing files.

Then search for files containing the text "400 Bad Request" to find the remaining error files.
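The same two searches can be done with a script instead of Windows Explorer. The Python sketch below walks a mirror folder and reports the files that contain either error text. The folder name in the usage part is hypothetical, and the comparison ignores case, which is an assumption about how you want the search to behave.

```python
from pathlib import Path

# The two texts searched for in this walkthrough.
ERROR_MARKERS = ("404 FILE NOT FOUND", "400 Bad Request")


def find_error_files(mirror_dir, markers=ERROR_MARKERS) -> dict:
    """Map each file under `mirror_dir` to the error markers it contains."""
    hits = {}
    for path in Path(mirror_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            # latin-1 never fails to decode, so binary files are safe to scan.
            text = path.read_bytes().decode("latin-1").lower()
        except OSError:
            continue  # unreadable file: skip it
        found = [marker for marker in markers if marker.lower() in text]
        if found:
            hits[path] = found
    return hits


if __name__ == "__main__":
    # Hypothetical location of the project folder.
    for path, found in sorted(find_error_files("canobie").items()):
        print(path, found)
```

Each file is read whole, which is fine for a small mirror like this one but slow for very large captures.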

Now you have two solutions:
1. Visit the missing pages and the pages with missing images online, then copy the files from the Internet Explorer cache. Delete the bracketed number, such as [1], that Internet Explorer adds to each file name, and use the copies to replace the error files.
2. Visit the images, the pages and the PDF file in Internet Explorer and save them in the proper folder (right click / Save image as, right click / Save target link as, File / Save as).
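For the first solution, renaming many cached files by hand is tedious. This small Python sketch strips a bracketed number placed just before the extension. It assumes the cache names look like `logo104[1].jpg`, which is how the text describes them; the function name is made up for this example.

```python
import re

# A bracketed number right before the extension, or at the very end of the name.
_CACHE_SUFFIX = re.compile(r"\[\d+\](?=\.[^.]+$|$)")


def original_name(cached_name: str) -> str:
    """Remove the bracketed counter added to a cached file name."""
    return _CACHE_SUFFIX.sub("", cached_name, count=1)


if __name__ == "__main__":
    for name in ("logo104[1].jpg", "davinci[1].html", "general.html"):
        print(name, "->", original_name(name))
```

Names without a bracketed counter are returned unchanged, so the function can be applied to every file copied from the cache.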

Everything should work now.


With javascript