Simple scraping with wget
From Roel, a very nice one-liner:
wget --random-wait -r -p -e robots=off -U mozilla www.somepage.com
The options:
- --random-wait: Randomizes the delay between requests, making wget less likely to be blocked by "bot" detection that looks for regular access patterns
- -r (--recursive): Recurse, i.e. keep following links
- -p (--page-requisites): Download the files needed to display each page (images, stylesheets, etc.)
- -e (--execute): Executes the given command as if it appeared in .wgetrc; here robots=off tells wget to ignore robots.txt restrictions
- -U (--user-agent): Sets the user agent string to that of a "known" browser, in this case mozilla.
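The same idea can be tightened up so the crawl stays on one site and goes easier on the server. The sketch below is not part of Roel's original one-liner: the example.com domain, the depth of 3, the 2-second wait, and the 200k rate limit are placeholder assumptions to adjust for your own use.

# Stay on one site, cap the depth, and throttle the crawl (all values are placeholders)
wget --random-wait --wait=2 \
     --recursive --level=3 \
     --page-requisites \
     --convert-links \
     --no-parent \
     --domains=example.com \
     --limit-rate=200k \
     -e robots=off \
     -U mozilla \
     https://www.example.com/

Here --wait=2 sets the base delay that --random-wait then varies; --level=3 caps recursion depth; --no-parent and --domains keep wget from wandering off the target site; --limit-rate throttles bandwidth; and --convert-links rewrites links so the downloaded copy is browsable offline.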