Calendars:Networked Media Calendar/Networked Media Calendar/16-03-2011 -Event 1: Difference between revisions

* [[Extracting parts of an HTML document]]


=== Working with lxml ===
* [[Extracting the text contents of a node]]
* [[Turning part of a page back into code (aka serialization)]]

==== Extracting the text contents of a node (lxml) ====
 
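The <code>itertext()</code> method of a node yields every piece of text inside that node, including text within nested elements, in document order. You can loop over the pieces one by one, or join them into a single string.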
 
<source lang="python">
# node is any lxml element, for example one returned by a CSS selector
for t in node.itertext():
    print(t)
</source>
 
<source lang="python">
# join() accepts the iterator directly, so no intermediate list is needed
text = "".join(node.itertext())
</source>
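
As a minimal, self-contained sketch of the two snippets above: the fragment below is a made-up example, and it is parsed with <code>lxml.html.fromstring</code> instead of html5lib purely to keep the example short. That choice is an assumption for illustration, not part of the workflow described on this page.

<source lang="python">
import lxml.html

# Hypothetical fragment, used only for illustration
node = lxml.html.fromstring("<p>More stuff with <i>markup</i>.</p>")

# itertext() yields the text pieces in document order,
# including the text inside child elements and the text that follows them
pieces = list(node.itertext())

# Joining the pieces gives the plain text of the node with all tags removed
text = "".join(pieces)

print(pieces)
print(text)
</source>

Here <code>pieces</code> holds three strings (the text before the <code>i</code> element, the text inside it, and the full stop after it), and <code>text</code> is the sentence without any markup.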
 
==== Turning part of a page back into code (aka serialization) (lxml) ====
 
Imagine you want to print out the full source code of part of a page.
Use <code>lxml.etree.tostring</code>, which converts any node, together with everything inside it, back into source code. This process is called serialization.
 
<source lang="python">
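import html5lib
import lxml.cssselect
import lxml.etree

htmlsource = "<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
# Parse with html5lib, building an lxml tree without XHTML namespaces
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
p = selector(page)[1]  # the second paragraph
# encoding="unicode" returns a text string rather than bytes
print(lxml.etree.tostring(p, encoding="unicode"))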
</source>
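
One detail that can be surprising when serializing an inline element: by default <code>tostring</code> also outputs the text that directly follows the element (its "tail"). The sketch below illustrates this with the <code>with_tail</code> option. The fragment is a made-up example, and <code>lxml.html.fromstring</code> is used instead of html5lib only to keep it short.

<source lang="python">
import lxml.etree
import lxml.html

# Hypothetical fragment, used only for illustration
p = lxml.html.fromstring("<p>More stuff with <i>markup</i>.</p>")
i = p.find("i")

# Serializing the paragraph gives back its full source
full = lxml.etree.tostring(p, encoding="unicode")

# Serializing the <i> element includes its tail (the "." after it) by default
with_tail = lxml.etree.tostring(i, encoding="unicode")

# with_tail=False leaves the trailing text out
without_tail = lxml.etree.tostring(i, encoding="unicode", with_tail=False)

print(full)
print(with_tail)
print(without_tail)
</source>

If you only want the element itself, for example to paste it into another page, <code>with_tail=False</code> is probably what you want.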

Revision as of 14:02, 16 March 2011