IT Desktop: Download entire site with wget

$ wget -nc -E -r -k -p -D example.com,example.org -np
$ wget --no-clobber 
       --adjust-extension \
       --recursive  --convert-links --page-requisites \
       --domains=example.com,example.org --no-parent \
       www.example.com/mysite/

-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options,
including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated
download. In other cases it will be preserved.

When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will
result in the original copy of file being preserved and the second copy being named file.1. If that
file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the
behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed,
and Wget will refuse to download newer copies of file. Therefore, ""no-clobber"" is actually a misnomer
in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing
clobbering), but rather the multiple version saving that's prevented.

When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the
new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the
original version to be preserved and any newer copies on the server to be ignored.

When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a
newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be
specified at the same time as -N.

Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local
disk and parsed as if they had been retrieved from the Web.

-E
--adjust-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the
regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local
filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but
you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is
when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be
saved as article.cgi?25.html.

Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because
Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that
the URL produces output of type text/html or application/xhtml+xml.

As of version 1.12, Wget will also ensure that any downloaded files of type text/css end in the suffix
.css, and the option was renamed from --html-extension, to better reflect its new behavior. The old
option name is still acceptable, but should now be considered deprecated.

At some point in the future, this option may well be expanded to include suffixes for other types of
content, including content types that are not parsed by Wget.

-r
--recursive
Turn on recursive retrieving. The default maximum depth is 5.

-l depth
--level=depth
Specify recursion maximum depth level depth.

-k
--convert-links
After the download is complete, convert the links in the document to make them suitable for local
viewing. This affects not only the visible hyperlinks, but any part of the document that links to
external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

Each link will be changed in one of the two ways:

Â· The links to files that have been downloaded by Wget will be changed to refer to the file they point
to as a relative link.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link
in doc.html will be modified to point to ../bar/img.gif. This kind of transformation works reliably
for arbitrary combinations of directories.

Â· The links to files that have not been downloaded by Wget will be changed to include host name and
absolute path of the location they point to.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the
link in doc.html will be modified to point to http://hostname/bar/img.gif.

Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to
its local name; if it was not downloaded, the link will refer to its full Internet address rather than
presenting a broken link. The fact that the former links are converted to relative links ensures that
you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded. Because of
that, the work done by -k will be performed at the end of all the downloads.

-p
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML
page. This includes such things as inlined images, sounds, and referenced stylesheets.

Ordinarily, when downloading a single HTML page, any requisite documents that may be needed to display
it properly are not downloaded. Using -r together with -l can help, but since Wget does not ordinarily
distinguish between external and inlined documents, one is generally left with "leaf documents" that are
missing their requisites.

For instance, say document 1.html contains an "<IMG>" tag referencing 1.gif and an "<A>" tag pointing to
external document 2.html. Say that 2.html is similar but that its image is 2.gif and it links to
3.html. Say this continues up to some arbitrarily high number.

If one executes the command:

wget -r -l 2 http://<site>/1.html

then 1.html, 1.gif, 2.html, 2.gif, and 3.html will be downloaded. As you can see, 3.html is without its
requisite 3.gif because Wget is simply counting the number of hops (up to 2) away from 1.html in order
to determine where to stop the recursion. However, with this command:

wget -r -l 2 -p http://<site>/1.html

all the above files and 3.html's requisite 3.gif will be downloaded. Similarly,

wget -r -l 1 -p http://<site>/1.html

will cause 1.html, 1.gif, 2.html, and 2.gif to be downloaded. One might think that:

wget -r -l 0 -p http://<site>/1.html

would download just 1.html and 1.gif, but unfortunately this is not the case, because -l 0 is equivalent
to -l inf---that is, infinite recursion. To download a single HTML page (or a handful of them, all
specified on the command-line or in a -i URL input file) and its (or their) requisites, simply leave off
-r and -l:

wget -p http://<site>/1.html

Note that Wget will behave as if -r had been specified, but only that single page and its requisites
will be downloaded. Links from that page to external documents will not be followed. Actually, to
download a single page and all its requisites (even if they exist on separate websites), and make sure
the lot displays properly locally, this author likes to use a few options in addition to -p:

wget -E -H -k -K -p http://<site>/<document>

To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL
specified in an "<A>" tag, an "<AREA>" tag, or a "<LINK>" tag other than "<LINK REL="stylesheet">".

-D domain-list
--domains=domain-list
Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not
turn on -H.

--exclude-domains domain-list
Specify the domains that are not to be followed.

-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since
it guarantees that only the files below a certain hierarchy will be downloaded.

IT Desktop

Sunday, June 17, 2012

Download entire site with wget

No comments:

Post a Comment

Performance

Labels

Blog Archive