Wget: Difference between revisions
Revision as of 14:10, 25 June 2009
Wget is a program for downloading files from the Web. Its "recursive" features make it easy to download entire portions of a site, including linked images and other resources.
Examples
"Mirroring" a site
wget -Erk -nH 'http://www.yoursite.com/yourdirectory/yourfile.php'
Source: http://www.dreamincode.net/forums/showtopic8317.htm
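The options used here are -E, -r (recursive download), -k (convert the links in downloaded pages so they work locally) and -nH (do not create a directory named after the host). The wget manpage describes -E as follows: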
-E
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you’re mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you’re downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
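The renaming rule quoted above can be illustrated with a few lines of Python. This is only a sketch of the rule as the manpage words it, not wget's actual implementation; the function name local_name is made up for this example.

```python
import re

# Pattern quoted in the manpage text, anchored to the end of the name.
HTML_SUFFIX = re.compile(r"\.[Hh][Tt][Mm][Ll]?$")
HTML_TYPES = ("text/html", "application/xhtml+xml")


def local_name(url_name, content_type):
    """Hypothetical helper: the local filename the -E rule would produce."""
    if content_type in HTML_TYPES and not HTML_SUFFIX.search(url_name):
        return url_name + ".html"
    return url_name


print(local_name("article.cgi?25", "text/html"))  # article.cgi?25.html
print(local_name("index.html", "text/html"))      # index.html
print(local_name("page.HTM", "text/html"))        # page.HTM
print(local_name("photo.jpg", "image/jpeg"))      # photo.jpg
```

Only HTML-typed downloads whose name lacks an .htm/.html ending get the suffix; everything else keeps its name.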
Grabbing images
This simple two-line script uses wget to collect all the JPEGs from a website, then uses ImageMagick's montage tool to combine them into a single image.
wget -r -nd -np --follow-tags=img -A.jpg,.jpeg http://www.colourlovers.com
montage *.jpg ../public_html/montage.jpg
Doing the same thing in Python
import urllib2  # Python 2 only; in Python 3 this module became urllib.request

def wget(url):
    """
    Returns (pagefile, realurl).
    Sets the User-Agent header and follows any redirection;
    realurl may differ from url in the case of a redirect.
    """
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
    request.add_header("User-Agent", user_agent)
    pagefile = urllib2.urlopen(request)
    realurl = pagefile.geturl()
    return (pagefile, realurl)

if __name__ == "__main__":
    # Example use. "ics" stands for an iCalendar parsing module that is
    # not shown on this page; import it before running.
    theurl = 'http://pzwart2.wdka.hro.nl/ical/schedule.ics'
    (f, theurl) = wget(theurl)
    # Pick ONE of the two options below: once f has been read it is
    # exhausted, so running both would hand option 2 an empty string.
    # 1. pass the file object itself
    cal = ics.parseData(f) # accepts file objects?
    # or 2. read the "raw" data (string) and pass that
    thepage = f.read()
    cal = ics.parseData(thepage)
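To get closer to what the wget command in "Grabbing images" does, a page's img tags can be parsed and filtered by extension. The following is a minimal Python 3 sketch using only the standard library. The names ImgCollector, jpeg_links and fetch, and the example.com address, are assumptions made for this example. It lists image URLs from one page only and does not recurse or download the files the way wget -r does.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def jpeg_links(html, base_url):
    """Return absolute URLs of the .jpg/.jpeg images referenced in html."""
    collector = ImgCollector()
    collector.feed(html)
    links = [urljoin(base_url, src) for src in collector.sources]
    return [u for u in links if u.lower().endswith((".jpg", ".jpeg"))]


def fetch(url, user_agent="Mozilla/5.0"):
    """Loose Python 3 counterpart of wget() above: returns (text, realurl)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        text = response.read().decode(charset, errors="replace")
        return text, response.geturl()


if __name__ == "__main__":
    # Offline demonstration on a literal HTML string (no network needed).
    sample = '<p><img src="/a.jpg"><img src="b.PNG"><img src="c/d.jpeg"></p>'
    for link in jpeg_links(sample, "http://www.example.com/gallery/"):
        print(link)
```

With a live site, the two parts would be combined along the lines of text, realurl = fetch(someurl) followed by jpeg_links(text, realurl), so that relative image paths are resolved against the URL reached after any redirect.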