See also [[Web Spider in Python]]
An attempt to write a spider which 1) prints all URLs for a desired web page, 2) excludes/includes certain URLs in a filter script, 3) picks one of the filtered URLs at random and 4) eventually sends it back to 1) in a continuous loop.
'''1)''' Spider
<source lang="python">
import sys, httplib2, os, time, urllib, lxml.html, re
from urlparse import urlparse, urljoin, urldefrag

def visit(url, depth=1):
    global visited
    # print the url and remember we visited it
    print url
    visited[url] = True
    if depth >= MAX_DEPTH: return

    # fetch the page and parse it with lxml
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())

    for xpath in ['//a/@href', '//img/@src']:
        # select the url in href for all a tags (links) and the src of all images
        for link in dom.xpath(xpath):
            link = link.strip()
            if link.lower().startswith("javascript"):
                continue
            # normalize url: make it absolute, drop the #fragment, strip trailing /
            link = urljoin(url, link)
            link = urldefrag(link)[0]
            link = link.rstrip('/')
            # if (link not in visited) and link.startswith(PREFIX) and depth < MAX_DEPTH:
            if (link not in visited) and depth < MAX_DEPTH:
                visit(link, depth + 1)

MAX_DEPTH = 2
visited = {}

# first argument: start url; optional second argument: prefix to restrict links to
starturl = sys.argv[1]
try:
    PREFIX = sys.argv[2]
except IndexError:
    PREFIX = starturl

visit(starturl)
</source>
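The spider takes the start URL as its first argument; an optional second argument sets PREFIX, which is only used if the commented-out line restricting links to that prefix is re-enabled. Assuming the script is saved as spider04.py (the name used in the piping example further down), it is started like this:

 python spider04.py http://tatteredcorners.tumblr.com/post/15141435895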
'''2)''' filter
<source lang="python">
import sys

# examples of what you want to exclude:
exclude = ['.jpg', '.png', '.gif']
# examples of what you want to include:
include = ['www.']

for line in sys.stdin:
    # keep a line only if it matches none of the exclude patterns and at least one include pattern
    if not any(pattern in line for pattern in exclude) and any(pattern in line for pattern in include):
        sys.stdout.write(line)
</source>
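The filter just reads URLs from standard input, so it can be tried on its own by piping the spider's output through it (tumblrfilter.py is the file name used in the piping example further down):

 python spider04.py http://tatteredcorners.tumblr.com/post/15141435895 | python tumblrfilter.py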
'''3)''' randomchoice
<source lang="python">
import sys, random

urls = []
for line in sys.stdin:
    urls.append(line)

# randint is inclusive on both ends, so the last valid index is len(urls) - 1
lengthurls = len(urls)
randPick = random.randint(0, lengthurls - 1)
sys.stdout.write(urls[randPick])
</source>
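The same pick can also be written more compactly with random.choice, which selects an element directly and avoids the index arithmetic; a minimal alternative sketch:

<source lang="python">
import sys, random

# read all urls from stdin and write one of them back out at random
urls = [line for line in sys.stdin]
if urls:
    sys.stdout.write(random.choice(urls))
</source>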
'''4)''' loop??
Currently I'm working on the loop; a rough sketch of one possible approach follows the piping example below.
Example of how to chain the three scripts together with [[Pipelines|piping]] (see that page for a nice article on pipes):
 python spider04.py http://tatteredcorners.tumblr.com/post/15141435895 | python tumblrfilter.py | python randomc.py > pickedurl.txt
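One way the loop could eventually work is a small driver script that runs the pipeline above, takes the picked URL and feeds it back in as the new start URL. This is only a sketch, not the finished loop: it assumes the three scripts are named spider04.py, tumblrfilter.py and randomc.py as in the example above, and it uses subprocess.check_output, which needs Python 2.7.

<source lang="python">
import subprocess, sys, time

# start URL for the first round, given on the command line
url = sys.argv[1]

while True:
    # run the existing spider | filter | randomchoice pipeline as one shell command
    pipeline = "python spider04.py %s | python tumblrfilter.py | python randomc.py" % url
    picked = subprocess.check_output(pipeline, shell=True).strip()
    if not picked:
        break                # no URL survived the filter, stop looping
    print picked             # keep a trace of where the loop goes
    url = picked             # send the picked URL back to step 1)
    time.sleep(1)            # be polite to the server between rounds
</source>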