Revision as of 14:02, 16 March 2011

11-18 | Nicolas Maleve - Thematic Project

Cookbook Recipes for Goodiff Workshop

@@ Line 16: / Line 16: @@
 * [[Extracting parts of an HTML document]]
-=== Working with lxml ===
+* [[Extracting the text contents of a node]]
+* [[Turning part of a page back into code (aka serialization)]]
-==== Extracting the text contents of a node (lxml) ====
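-A node's itertext method yields every piece of text inside the node, including text nested in child elements, in document order.
-<source lang="python">
-for t in node.itertext():
-    print(t)
-</source>
-To collect the pieces into a single string, join them:
-<source lang="python">
-text = "".join(node.itertext())
-</source>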
-==== Turning part of a page back into code (aka serialization) (lxml) ====
-Imagine you want to print out the full code of part of a page.
-Use lxml.etree.tostring. This converts any node back into source code -- a process called serialization.
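-<source lang="python">
-import html5lib
-import lxml.etree
-import lxml.cssselect
-htmlsource = "<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
-# Parse with html5lib into an lxml tree, without XHTML namespaces on the elements
-htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
-page = htmlparser.parse(htmlsource)
-# Select the second <p> element on the page
-selector = lxml.cssselect.CSSSelector("p")
-p = selector(page)[1]
-# encoding="unicode" makes tostring return a str rather than bytes
-print(lxml.etree.tostring(p, encoding="unicode"))
-</source>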
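
For readers who want to try the two recipes moved out in this revision without installing anything, here is a minimal sketch. It is not the workshop's own code: it assumes the markup is well formed and swaps in the standard library's xml.etree.ElementTree for html5lib and lxml. ElementTree offers itertext and tostring under the same names, but it has no CSS selectors, so findall with a path expression stands in for CSSSelector.

```python
import xml.etree.ElementTree as ET

# Same sample markup as in the recipe; it is well formed, so an XML parser can read it.
htmlsource = "<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
page = ET.fromstring(htmlsource)

# Pick the second <p>, as the recipe does with its CSS selector.
p = page.findall(".//p")[1]

# Text extraction: join every text fragment found inside the node.
text = "".join(p.itertext())

# Serialization: turn the node back into source code.
# encoding="unicode" asks for a str instead of bytes.
code = ET.tostring(p, encoding="unicode")

print(text)
print(code)
```

Real-world pages are rarely well formed, which is why the recipes themselves parse with html5lib before handing the tree to lxml.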