* [[Extracting parts of an HTML document]]
* [[Extracting the text contents of a node]]
* [[Turning part of a page back into code (aka serialization)]]

=== Working with lxml ===

==== Extracting the text contents of a node (lxml) ====

The <code>itertext()</code> method of a node is useful here: it yields every text fragment inside the node, in document order, including text nested in child elements.

<source lang="python">
# Print each text fragment inside the node on its own line.
for t in node.itertext():
    print(t)
</source>

To join the fragments into a single string instead:

<source lang="python">
# join() accepts any iterable, so the list() wrapper is unnecessary.
text = "".join(node.itertext())
</source>

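The two snippets above assume a <code>node</code> already exists. As a minimal self-contained sketch, the following builds one from a made-up fragment (the markup string is an assumption for illustration, not taken from a real page). It falls back to the standard library's <code>xml.etree.ElementTree</code>, which offers the same <code>fromstring</code> and <code>itertext</code> calls, if lxml is not installed.

```python
try:
    from lxml import etree  # the library this page discusses
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical fragment, made up for illustration.
node = etree.fromstring("<p>More stuff with <i>markup</i>.</p>")

# itertext() walks the node in document order, descending into <i>.
fragments = list(node.itertext())

# Joining the fragments gives the node's text with all tags removed.
text = "".join(fragments)

print(fragments)
print(text)
```

Note that the text inside the nested <code>i</code> element appears as its own fragment, between the text before and after it.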
==== Turning part of a page back into code (aka serialization) (lxml) ====

Suppose you want to print the full source code of part of a page. Use <code>lxml.etree.tostring</code>, which converts any node back into source code, a process called serialization.

<source lang="python">
import html5lib
import lxml.cssselect
import lxml.etree

htmlsource = "<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
# Parse with html5lib, building an lxml tree without XHTML namespaces.
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
# Select the second <p> element and serialize it back to markup.
selector = lxml.cssselect.CSSSelector("p")
p = selector(page)[1]
print(lxml.etree.tostring(p, encoding="unicode"))
</source>