
A website copy with HTTrack


Tested with version WinHTTrack Website Copier 3.30-RC-4 (+swf)

Canobie June 2003

project name: canobie
Web(URL) address: http://www.canobie.com
amount of time: 1/2 hour (56k modem)

problems:

  1. Missing or corrupted files
  2. <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> tag
  3. robots.txt file

Other examples with similar difficulties: Martin Luther King 2002 | Martin Luther King 2004 | Recycling | Extreme World | Herberton

solutions:

  1. At the end of the capture, many images and PDF files are missing.
    The HTTrack error log file shows:

    This site wants to limit offline browsers, which was not the case last year. Some users probably abused the site by not limiting the number of connections or the maximum transfer rate.

    Open one of the files with a META tag limiting the capture. You will find:
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    
    This tag is present in every file that was not scanned.
    Read the page about bot traps, then, in the Spider tab, select
    no robots.txt rules
    A quick look at the site will show that the folders disallowed in the robots.txt file, if they exist, contain nothing of interest.
    To avoid being trapped, exclude these folders in the scan rules:
    -*/cgi-bin/*
    -*/stats/*
    and so on.
    This has several advantages: the capture stays limited, the server is not overloaded by downloads of useless files, and you will not be blocked if one of these folders contains a script aimed at visitors who do not respect the rules.

    Launch the mirroring operation

  2. Once the capture has finished, some files are still missing. The HTTrack error log file shows:

    With Windows Explorer, search for the files containing the text "404 FILE NOT FOUND"; the search result lists the affected files.

    Then search for the files containing the text "400 Bad Request"; the search result lists the remaining ones.

    You now have two options:
    1. Visit the missing pages and the pages with missing images online, then copy the files from the Internet Explorer cache. Delete the bracketed number (such as [1]) that Internet Explorer adds to the file names, and use the copies to replace the error files.
    2. From Internet Explorer, visit and save the images, the pages and the PDF file into the proper folder (right click / Save Picture As, right click / Save Target As, File / Save As).
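
The two Windows Explorer searches in solution 2 can also be done with a short script. The following is only a minimal sketch in Python, not part of HTTrack: the function name `find_error_files` and the 64 KB read limit are assumptions made for this example, and the two marker strings are simply the texts quoted above (other servers may word their error pages differently). Like the Explorer search, it ignores upper and lower case.

```python
from pathlib import Path

# Error texts quoted in solution 2; adjust them for other servers.
ERROR_MARKERS = ("404 FILE NOT FOUND", "400 Bad Request")


def find_error_files(mirror_dir, markers=ERROR_MARKERS, head_bytes=65536):
    """Return {marker: [paths]} for files under mirror_dir that contain a marker.

    Only the first head_bytes of each file are inspected (an assumption:
    error pages are small), and the comparison is case-insensitive.
    """
    hits = {marker: [] for marker in markers}
    for path in sorted(Path(mirror_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                # latin-1 maps every byte to a character, so decoding never fails
                head = handle.read(head_bytes).decode("latin-1").lower()
        except OSError:
            continue  # unreadable file: skip it
        for marker in markers:
            if marker.lower() in head:
                hits[marker].append(path)
    return hits
```

Calling `find_error_files` on the project folder (for example a hypothetical `C:/My Web Sites/canobie`) returns the files to replace by hand, grouped by error text.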

Everything should work now.
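
Going back to the scan rules in solution 1: the folders listed in a robots.txt file can be turned into exclusion filters of the form `-*/folder/*` mechanically. The sketch below is a hedged illustration in Python, not an HTTrack feature. The function name `robots_to_filters` is hypothetical, the sample robots.txt text is made up for the example (it is not the file of the site above), and the code assumes every Disallow entry names a folder and ignores User-agent groups.

```python
def robots_to_filters(robots_txt):
    """Turn the Disallow lines of a robots.txt text into exclusion filters.

    Assumption: each Disallow path is treated as a folder and becomes
    a rule of the form -*/folder/* as used in the scan rules above.
    """
    filters = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line.lower().startswith("disallow:"):
            continue
        path = line.split(":", 1)[1].strip().strip("/")
        if not path:
            # "Disallow:" (nothing blocked) or "Disallow: /" (whole site):
            # neither maps to a single folder, so skip it
            continue
        rule = f"-*/{path}/*"
        if rule not in filters:
            filters.append(rule)
    return filters


# Made-up robots.txt content, for illustration only
SAMPLE_ROBOTS = """User-agent: *
Disallow: /cgi-bin/
Disallow: /stats/   # server statistics
Disallow:
"""

if __name__ == "__main__":
    for rule in robots_to_filters(SAMPLE_ROBOTS):
        print(rule)
```

The resulting lines can be pasted into the scan rules. It is still worth checking each folder by hand, since an entry may point to a single file rather than a folder.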
