A website copy with HTTrack
Tested with version WinHTTrack Website Copier 3.30-beta-7b (+swf)
Marian High School May 2003
project name: marianWeb
(URL) address: http://marian.creighton.edu for the "Master"
amount of time: 1 hour + 1 hour + many hours +1 hour (56k modem)
problem:
Huge site
Other examples with similar difficulties: Areaparks | Kakadu
solution:
The site is under construction and is updated by students, so it contains many errors, as any school site does. Each student has her own folder, so there are more than 700 of them, and any folder can be completely changed on any day.
In these folders, the image files are not optimized and may have the wrong extension. Setting a limit on file size is necessary.
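To see why a size limit matters on a slow line, here is a rough estimate of transfer times. It is only a sketch: it assumes the nominal 56,000 bit/s rate with no protocol overhead, so real transfers will be slower.

```python
def transfer_seconds(size_bytes, line_bits_per_second=56_000):
    """Best-case time in seconds to move size_bytes over the line."""
    return size_bytes * 8 / line_bits_per_second


def bytes_per_hour(line_bits_per_second=56_000):
    """Best-case number of bytes the line can carry in one hour."""
    return line_bits_per_second // 8 * 3600
```

At this nominal rate the line carries at most about 25 MB per hour, and a single 5 MB file needs roughly 12 minutes, so a few unoptimized images or videos can easily dominate a capture.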
As the students may publish private information or email addresses, the capture must only be used as a way to understand how to mirror huge websites.
To capture all the information you think is interesting, you have to download step by step.
- First, in "Set options", select "store HTML files" as Primary Scan Rule in the tab "Experts Only" in order to get an HTML-only copy of the site.
- With spiders other than WinHTTrack, capture the HTML files and the home page images first.
That way you have a map of the site and you can choose or find the parts you want.
With a 56k modem, it will take about an hour.
WinHTTrack replaces ~ with _ in the folder names.
At the end of the capture you will have more than 1400 errors, and 5600 files in 716 folders.
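If you want to check such figures on your own mirror, a short script can count the folders and files. This is a minimal sketch: the function name summarize_mirror is made up for the illustration, and it counts whatever is on disk, including HTTrack's own housekeeping files.

```python
import os
from collections import Counter


def summarize_mirror(root):
    """Return (folder count, file count, files per extension) below root."""
    folders = 0
    files = 0
    by_ext = Counter()
    for _dirpath, dirnames, filenames in os.walk(root):
        folders += len(dirnames)   # sub-folders found at this level
        files += len(filenames)    # files found at this level
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(no extension)"
            by_ext[ext] += 1
    return folders, files, by_ext
```

Calling summarize_mirror on the project folder and printing by_ext.most_common(10) shows at a glance which file types make up most of the mirror.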
Only three or four folders are interesting; the rest are instructive but incomplete, so filtering is necessary.
- Now you can examine the different parts of the site and decide which ones to keep or remove.
Here we can remove the folder about Japan by adding:
-marian.creighton.edu/~marian-w/academics/english/japan/*
in "Set options" "Scan rules".
We can do the same with all the unnecessary folders.
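When there are many folders to exclude, the rule lines can be generated rather than typed. The sketch below is only an illustration: exclusion_rules is a hypothetical helper, and it assumes rules are written with URL-style forward slashes, the form used in HTTrack's filter documentation. The output is meant to be pasted into "Scan rules".

```python
def exclusion_rules(host, folders):
    """Build one '-host/folder/*' exclusion line per unwanted folder."""
    rules = []
    for folder in folders:
        path = folder.strip("/")  # tolerate leading or trailing slashes
        rules.append(f"-{host}/{path}/*")
    return rules


# The Japan folder is the one named in the text; add your own to the list.
unwanted = ["~marian-w/academics/english/japan"]
print("\n".join(exclusion_rules("marian.creighton.edu", unwanted)))
```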
We can also set the "Max size of any non-Html files" in the tab "Limits".
- Copy the whole folder marian three times.
Rename marian as marianMaster. This folder will serve as a master and final local site for the different mirrors.
Rename "Copy of marian" as marian1, "Copy (2) of marian" as marian2, and "Copy (3) of marian" as marian3. With many Web spiders, these copies save you from starting from scratch. With WinHTTrack the benefit is less obvious, but the copies allow relative links to be rebuilt at the end of the capture.
- Select the project marianMaster and, in "Set options", tick "do not purge old files" in the tab "Build".
When updating, existing files in the local site will be kept.
- Select the project marian1 and, in "Set options", select the option "store all files" in the tab "Experts Only"; deselect "do not purge old files" in the tab "Build" if necessary.
In Web Addresses, replace http://marian.creighton.edu with marian.creighton.edu/~marian-w/ to capture the general information.
Launch the capture.
It will take about one hour with a 56k modem.
- Copy the folder marian.creighton.edu into the folder marianMaster and overwrite existing files.
Copy the other folders except hts-cache.
- Select the project marian2 and, in "Set options", select the option "store all files" in the tab "Experts Only"; deselect "do not purge old files" in the tab "Build" if necessary.
In Web Addresses, replace http://marian.creighton.edu with
marian.creighton.edu/~crusader/
marian.creighton.edu/~mascu/ to capture the school magazine.
Launch the capture.
It will take about one hour with a 56k modem.
- Copy the folder marian.creighton.edu into the folder marianMaster and overwrite existing files.
Copy the other folders except hts-cache.
- Do the same with the other folders you are interested in by using marian3.
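The folder handling in the steps above (duplicating the project, then copying each partial mirror into marianMaster while leaving hts-cache alone) can also be scripted instead of done by hand in the file manager. This is a sketch under assumptions: both function names are invented for the illustration, the folders are assumed to sit side by side in one directory, and Python 3.8 or later is needed for dirs_exist_ok.

```python
import shutil
from pathlib import Path


def duplicate_project(project, names):
    """Copy the project folder once per name, next to the original."""
    project = Path(project)
    copies = []
    for name in names:
        target = project.with_name(name)  # sibling folder, e.g. marian1
        shutil.copytree(project, target)
        copies.append(target)
    return copies


def merge_into_master(partial, master, skip=("hts-cache",)):
    """Copy each top-level entry of partial into master, overwriting
    existing files and leaving out the names listed in skip."""
    partial, master = Path(partial), Path(master)
    for entry in partial.iterdir():
        if entry.name in skip:
            continue  # each project keeps its own cache
        target = master / entry.name
        if entry.is_dir():
            # dirs_exist_ok lets the copy merge into an existing folder
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, target)
```

A typical run would call duplicate_project("marian", ["marian1", "marian2", "marian3"]), rename marian to marianMaster, and then call merge_into_master("marian1", "marianMaster") after each partial capture. Try it on a spare copy first, since the merge overwrites files without asking.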
- If you use WinHTTrack, now you can select the project marianMaster and launch the capture.
It may last many hours as the students store thousands of files (sound, video, image, animation...).
After an hour with a 56k modem, the HTML files have been rewritten and the links between folders should work.
- Browse the mirror and add the missing folders using the copies.
If some images are missing, rerun the capture without file size limits, or download them separately and copy them into the local site.
If they are in the local site but do not display, it is usually because a student renamed the extension instead of converting the image or editing the HTML file. Use IrfanView to check.
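With thousands of files, the extension problem described above can also be looked for with a script before opening files one by one. The sketch below is a simple illustration, not a replacement for an image viewer: it only recognizes JPEG, GIF, PNG and BMP by their first bytes, and the function names are invented here.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"BM", ".bmp"),
]
# Alternative spellings that mean the same format.
EQUIVALENT = {".jpeg": ".jpg", ".jpe": ".jpg"}
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".jpe", ".gif", ".png", ".bmp")


def detect_image_type(path):
    """Return the extension matching the file's first bytes, or None."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    for magic, ext in SIGNATURES:
        if header.startswith(magic):
            return ext
    return None


def find_mismatched_images(root):
    """List (path, real extension) for images whose name and content disagree."""
    mismatches = []
    for path in sorted(Path(root).rglob("*")):
        ext = path.suffix.lower()
        if not path.is_file() or ext not in IMAGE_EXTENSIONS:
            continue
        actual = detect_image_type(path)
        declared = EQUIVALENT.get(ext, ext)
        if actual is not None and actual != declared:
            mismatches.append((path, actual))
    return mismatches
```

Files whose content is not recognized at all are left out of the list, so an empty result does not prove that every image is valid.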
Everything should work now.