Scraping web pages with python

From XPUB & Lens-Based wiki

Revision as of 23:11, 22 March 2011

The html5lib parser turns the source text of an HTML page into a structured object (a tree of elements), so that you can use CSS selectors or xpath expressions to select and extract portions of the page.
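To make the "structured object" concrete, here is a minimal sketch that parses a fragment and lists the top-level elements of the resulting tree. It assumes html5lib and lxml are installed; the sample markup is made up for illustration. Note that the parser fills in the html, head and body elements even when the source leaves them out.

```python
import html5lib

# Build a parser that produces an lxml tree without XHTML namespaces.
htmlparser = html5lib.HTMLParser(
    tree=html5lib.treebuilders.getTreeBuilder("lxml"),
    namespaceHTMLElements=False,
)

# Illustrative input only: a bare paragraph with no <html> or <body>.
page = htmlparser.parse("<p>Example page.</p>")

# The result is an lxml tree; its root is the <html> element.
root = page.getroot()
top_level = [child.tag for child in root]

print(root.tag)
print(top_level)
```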

You can use xpath expressions:

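import html5lib, lxml

htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
# Build an lxml tree; namespaceHTMLElements=False keeps tag names plain ("p", not "{http://www.w3.org/1999/xhtml}p")
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
# xpath() returns a list of matches; p[2] is the second paragraph (xpath counts from 1)
p = page.xpath("/html/body/p[2]")
if p:
    p = p[0]
    # itertext() yields every piece of text inside the element
    print("".join(p.itertext()))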

outputs: More stuff.
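An xpath expression can also return attribute values or text directly, instead of elements. The following is a small sketch of that; the sample markup and the variable names hrefs and labels are made up for illustration.

```python
import html5lib

# Hypothetical sample markup, not taken from a real page.
htmlsource = '<html><body><p>See <a href="/one">one</a> and <a href="/two">two</a>.</p></body></html>'

htmlparser = html5lib.HTMLParser(
    tree=html5lib.treebuilders.getTreeBuilder("lxml"),
    namespaceHTMLElements=False,
)
page = htmlparser.parse(htmlsource)

# An expression ending in @href returns the attribute values as strings.
hrefs = page.xpath("//a/@href")
# An expression ending in text() returns the text nodes as strings.
labels = page.xpath("//a/text()")

for href, label in zip(hrefs, labels):
    print(label, "->", href)
```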

CSS selectors are also possible:

import html5lib, lxml, lxml.cssselect

htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
# Compile the CSS selector once; calling it on the tree returns the matching elements
selector = lxml.cssselect.CSSSelector("p")
for p in selector(page):
    print("-" * 20)
    print("".join(p.itertext()))

outputs:

--------------------
Example page.
--------------------
More stuff.