Simple scraping with wget

From XPUB & Lens-Based wiki

Latest revision as of 15:57, 19 May 2014

From Roel, a very nice one-liner:

 wget --random-wait -r -p -e robots=off -U mozilla www.somepage.com

The options:

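  • --random-wait: Varies the pause between requests (between 0.5 and 1.5 times the value set with --wait), so wget is less likely to be blocked by "bot" detection algorithms that look for regularity in access. It is normally paired with --wait, which supplies the base delay.
  • -r (--recursive): Follows links recursively (up to five levels deep by default).
  • -p (--page-requisites): Downloads the files needed to display each page, such as images and stylesheets.
  • -e (--execute): Runs a command as if it were a line in .wgetrc; here robots=off tells wget to ignore the site's robots.txt.
  • -U (--user-agent): Sets the User-Agent header sent with each request, here the literal string mozilla, so that wget does not announce itself as Wget.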
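
For readability, the same one-liner can be written with long option names. The sketch below is only an illustration: scrape_command is a hypothetical helper name, www.somepage.com is the placeholder from above, and --wait=2 is an added assumption that gives --random-wait a base delay to vary. The script only prints the command, so it can be checked before anything is downloaded.

```shell
#!/bin/sh
# Sketch: the one-liner above, spelled out with long option names.
# Assumption: --wait=2 is NOT part of the original command; it is added
# here so that --random-wait has a base delay to randomize.
scrape_command() {
    # $1: site to fetch (www.somepage.com is the placeholder used above)
    printf 'wget --wait=2 --random-wait --recursive --page-requisites --execute robots=off --user-agent=mozilla %s\n' "$1"
}

# Print the command for inspection; nothing is downloaded at this point.
scrape_command www.somepage.com
```

To start the download, copy the printed line into a terminal. A larger --wait value is gentler on the server being scraped.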