Scraping web pages with Python

Revision as of 18:15, 5 April 2011

html5lib is a parser that turns the source text of an HTML page into a structured tree object. With its lxml tree builder, you can then use CSS selectors or XPath expressions to select and extract portions of the page.

You can use xpath expressions:

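import html5lib, lxml

htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
# Build an lxml tree; namespaceHTMLElements=False keeps tag names un-namespaced,
# so plain paths such as /html/body/p work.
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
# xpath() returns a list of matches (empty if nothing matched)
matches = page.xpath("/html/body/p[2]")
if matches:
    p = matches[0]
    # itertext() yields every text fragment inside the element
    print("".join(p.itertext()))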

outputs: More stuff.
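
XPath expressions can return more than lists of elements. As a rough sketch, the snippet below tries a few other expression forms. To stay short it parses a well-formed sample string with lxml.etree.fromstring instead of html5lib (an assumption for illustration only; real pages usually need html5lib), and the sample markup and variable names are made up for this example.

```python
import lxml.etree

# Hypothetical sample markup, well-formed so lxml can parse it directly.
source = "<html><body><p>Example page.</p><p>More <b>stuff</b>.</p></body></html>"
doc = lxml.etree.fromstring(source)

# A path expression returns a list of elements.
all_paragraphs = doc.xpath("//p")

# string(...) returns a single string: all text inside the last <p>.
last_text = doc.xpath("string(/html/body/p[last()])")

# text() returns a list of text nodes, here the text directly inside <b>.
bold_words = doc.xpath("//p/b/text()")

# count(...) returns a float.
paragraph_count = doc.xpath("count(//p)")

print(len(all_paragraphs), last_text, bold_words, paragraph_count)
```

Only path expressions give back a list of elements, so the "if p:" check used above fits that case; string() and count() return a single value instead.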

CSS selectors work as well (recent lxml releases need the separate cssselect package installed for lxml.cssselect):

import html5lib, lxml, lxml.cssselect

htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
# Compile the CSS selector once; calling it on a tree returns the matching elements
selector = lxml.cssselect.CSSSelector("p")
for p in selector(page):
    print("-" * 20)
    print("".join(p.itertext()))

outputs:

--------------------
Example page.
--------------------
More stuff.

Function that takes a URL + xpath

NB: the function returns a list of matching fragments, since an XPath expression can match more than one node. If you expect only one result, use [0] to take the first (and only) item. lxml.etree.tostring is used to re-serialize the result.

import urllib.request
import html5lib, lxml, lxml.etree

def getXpath(url, xpath):
    """Fetch url, parse it with html5lib and return the list of nodes matching xpath."""
    htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
    request = urllib.request.Request(url)
    # Send a browser-like User-Agent with the request
    request.add_header("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5")
    f = urllib.request.urlopen(request)

    page = htmlparser.parse(f)
    return page.xpath(xpath)

if __name__ == "__main__":
    url = "http://www.jabberwocky.com/carroll/walrus.html"
    xpath = "/html/body/p[6]"
    # encoding="unicode" makes tostring return text rather than bytes
    print(lxml.etree.tostring(getXpath(url, xpath)[0], encoding="unicode"))

Function that takes a URL + CSS selector

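import urllib.request
import html5lib, lxml, lxml.cssselect, lxml.etree

def getCSS(url, selector):
    """Fetch url, parse it with html5lib and return the list of elements matching the CSS selector."""
    htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
    request = urllib.request.Request(url)
    # Send a browser-like User-Agent with the request
    request.add_header("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5")
    f = urllib.request.urlopen(request)

    page = htmlparser.parse(f)
    # Compile the selector string, then apply it to the parsed tree
    matcher = lxml.cssselect.CSSSelector(selector)
    return list(matcher(page))

# TEST
if __name__ == "__main__":
    url = "http://www.jabberwocky.com/carroll/walrus.html"
    # encoding="unicode" makes tostring return text rather than bytes
    print(lxml.etree.tostring(getCSS(url, "p")[0], encoding="unicode"))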