Revision as of 13:58, 16 March 2011

11-18 | Nicolas Maleve - Thematic Project

Cookbook Recipes for Goodiff Workshop

* Looking up synonym-sets for a word
* Splitting text into sentences
* Removing common words / stopwords
* Finding capitalized words
* Extracting parts of an HTML document

Working with lxml

Extracting the text contents of a node (lxml)

A node's itertext() method yields every piece of text inside the node, including text nested in child elements, in document order.

# Print each text fragment of the node in turn.
for t in node.itertext():
    print(t)

# Or join the fragments into a single string.
text = "".join(node.itertext())
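As a minimal, self-contained sketch of the difference between a node's .text attribute and itertext(): the sample markup below is hypothetical, and it is parsed with lxml.etree.fromstring rather than html5lib purely to keep the example short (Python 3 assumed).

```python
import lxml.etree

# Hypothetical sample markup; parsed with lxml's own parser for brevity.
node = lxml.etree.fromstring("<p>More stuff with <i>markup</i>.</p>")

# .text only holds the text before the first child element.
first_chunk = node.text

# itertext() walks the whole subtree, so nested and trailing text is included.
chunks = list(node.itertext())
text = "".join(chunks)

print(repr(first_chunk))  # 'More stuff with '
print(chunks)             # ['More stuff with ', 'markup', '.']
print(text)               # More stuff with markup.
```

Joining the fragments with "" keeps the original spacing; joining with " " would insert extra spaces around inline elements such as the i tag.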

Turning part of a page back into code (aka serialization) (lxml)

To print the full markup of part of a page, use lxml.etree.tostring. It converts any node back into source code, a process called serialization.

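import html5lib, lxml.cssselect, lxml.etree
htmlsource = "<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
# Parse with html5lib, building an lxml tree without XHTML namespaces.
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
p = selector(page)[1]  # the second <p> element
# encoding="unicode" returns a text string rather than bytes.
print(lxml.etree.tostring(p, encoding="unicode"))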

@@ Line 14: / Line 14: @@
 * [[Finding capitalized words]]
-=== Extracting parts of an HTML document ===
+* [[Extracting parts of an HTML document]]
-The html5lib parser is code that turns the source text of an HTML page
-into a structured object, allowing, for instance, to use CSS selectors
-or xpath expressions to select/extract portions of a page
-You can use xpath expressions:
-<source lang="python">
-import html5lib, lxml
-htmlsource="<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
-htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
-page = htmlparser.parse(htmlsource)
-p = page.xpath("/html/body/p[2]")
-if p:
-    p = p[0]
-    print "".join([t for t in p.itertext()])
-</source>
-outputs:
-More stuff.
-Also CSS selectors are possible:
-<source lang="python">
-import html5lib, lxml, lxml.cssselect
-htmlsource="<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
-htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
-page = htmlparser.parse(htmlsource)
-selector = lxml.cssselect.CSSSelector("p")
-for p in selector(page):
-    print "-"*20
-    print "".join([t for t in p.itertext()])
-</source>
- --------------------
- Example page.
- --------------------
- More stuff.
 === Working with lxml ===

Calendars:Networked Media Calendar/Networked Media Calendar/16-03-2011 -Event 1: Difference between revisions
