XPUB & Lens-Based wiki

Web Spider in Python

Revision as of 19:30, 4 March 2014 by Michael Murtaugh (talk | contribs)

Using html5lib

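import html5lib
from urllib.parse import urljoin
from urllib.request import urlopen

url = "http://wikipedia.org/"
# Download the page and parse it into an ElementTree (tags without the XHTML namespace)
html = urlopen(url).read()
tree = html5lib.parse(html, namespaceHTMLElements=False)
# Print every link on the page as an absolute URL
for a in tree.findall(".//a"):
    href = a.attrib.get("href")
    if href:
        print(urljoin(url, href))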
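
The loop above lists the links of a single page. To turn that into a spider, the same step can be repeated on each link it finds. What follows is only a minimal sketch of one way to do that (a breadth-first crawl), not a finished crawler. The names crawl, get_links, max_pages and fake_site are made up for this example. get_links stands in for the html5lib link extraction above and is replaced here by a small in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, get_links, max_pages=50):
    """Breadth-first crawl: visit pages in discovery order, each URL once.

    get_links(url) is a hypothetical callable that returns the raw href
    values found on that page (for example, the html5lib loop above).
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop any "#fragment" part
            link = urldefrag(urljoin(url, href)).url
            # Skip mailto:, javascript: and other non-web links
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Stand-in for real pages: a made-up three-page site, no network access
fake_site = {
    "http://example.org/": ["a.html", "/b.html#top"],
    "http://example.org/a.html": ["/", "b.html"],
    "http://example.org/b.html": ["mailto:someone@example.org"],
}
pages = crawl("http://example.org/", lambda u: fake_site.get(u, []))
print(pages)
```

The seen set is what keeps the spider from looping when pages link back to each other, and max_pages puts a hard limit on how far it wanders. A real spider would probably also want to stay on one host, pause between requests and respect robots.txt.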

Retrieved from "https://pzwiki.wdka.nl/mw-mediadesign/index.php?title=Web_Spider_in_Python&oldid=58386"
