Calendars:Networked Media Calendar/Networked Media Calendar/16-03-2011 -Event 1: Difference between revisions
Line 10: | Line 10: | ||
* [[Splitting text into sentences]] | * [[Splitting text into sentences]] | ||
* [[Removing common words / stopwords]] | |||
=== Finding capitalized words (regex) === | === Finding capitalized words (regex) === |
Revision as of 13:53, 16 March 2011
11-18 | Nicolas Maleve - Thematic Project
Cookbook Recipes for Goodiff Workshop
Finding capitalized words (regex)
import re
pat = re.compile(r"\b[A-Z]+\b")
print pat.findall(text)
Extracting parts of an HTML document
The html5lib parser is code that turns the source text of an HTML page into a structured object, allowing, for instance, to use CSS selectors or xpath expressions to select/extract portions of a page
You can use xpath expressions:
import html5lib, lxml
htmlsource="<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
p = page.xpath("/html/body/p[2]")
if p:
p = p[0]
print "".join([t for t in p.itertext()])
outputs: More stuff.
Also CSS selectors are possible:
import html5lib, lxml, lxml.cssselect
htmlsource="<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
for p in selector(page):
print "-"*20
print "".join([t for t in p.itertext()])
-------------------- Example page. -------------------- More stuff.
Working with lxml
Extracting the text contents of a node (lxml)
The itertext method of a node can be useful.
for t in node.itertext():
print t
text = "".join(list(node.itertext()))
Turning part of a page back into code (aka serialization) (lxml)
Imagine you want to print out the full code of part of a page. Use lxml.etree.tostring. This converts any node back into source code -- a process called serialization.
htmlsource="<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
p = selector(page)[1]
print lxml.etree.tostring(p)